Infrastructure & DevOps · 6 January 2026 · 11 min read

Monitoring with Grafana and Prometheus: Complete Setup Guide

Prometheus metrics and PromQL, Grafana dashboards, alerting rules, monitoring Node.js and Next.js applications, and implementing custom business metrics.

Grafana, Prometheus, Monitoring, DevOps, Alerting, Node.js

Why Monitoring Matters

You cannot improve what you cannot measure. Without monitoring, you discover problems when customers complain — or when your application goes down entirely. With proper monitoring, you see degradation before it becomes an outage, identify bottlenecks before they affect users, and make capacity planning decisions based on data rather than guesses.

Prometheus and Grafana are the industry-standard open-source monitoring stack. Prometheus collects and stores metrics; Grafana visualizes them. Together they cover the metrics side of observability for virtually any application.

Prometheus: The Metrics Engine

Prometheus is a time-series database designed for monitoring. It scrapes metrics from your applications at configurable intervals (typically every 15 seconds), stores them efficiently, and provides a powerful query language for analysis.
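The scrape loop described above is configured in prometheus.yml. A minimal sketch, where the job name and target address are placeholders for your own setup:

```yaml
global:
  scrape_interval: 15s        # how often Prometheus pulls metrics from each target

scrape_configs:
  - job_name: "node-app"      # placeholder job name
    metrics_path: /metrics    # the endpoint your application exposes
    static_configs:
      - targets: ["localhost:3000"]  # placeholder host:port of your app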

Metric Types

Counter: A value that only goes up. Total HTTP requests served, total errors, total bytes processed. Use for tracking throughput and error rates.
Gauge: A value that can go up or down. Current CPU usage, memory usage, active connections, queue depth. Use for tracking current state.
Histogram: Samples observations into configurable buckets. Request duration distribution, response size distribution. Use for latency percentiles (p50, p95, p99).
Summary: Similar to histogram but calculates percentiles on the application side. Because pre-computed quantiles cannot be meaningfully aggregated across instances, histograms are usually the better default; reach for summaries only when you need exact percentiles for a single process.

How Prometheus Collects Metrics

Prometheus uses a pull model. Your application exposes a /metrics endpoint that returns metrics in Prometheus text format. Prometheus periodically scrapes this endpoint. This pull model means Prometheus controls the collection rate and can detect when targets are down (a failed scrape is itself a signal).
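For reference, here is roughly what a /metrics response looks like in Prometheus text format, with one example of each of the first three metric types (metric names and values are illustrative):

```text
# HELP http_requests_total Total HTTP requests served
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027

# HELP app_active_connections Current number of open connections
# TYPE app_active_connections gauge
app_active_connections 42

# HELP http_request_duration_seconds HTTP request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 940
http_request_duration_seconds_bucket{le="0.5"} 1020
http_request_duration_seconds_bucket{le="+Inf"} 1027
http_request_duration_seconds_sum 98.3
http_request_duration_seconds_count 1027
```

Note that a histogram is exposed as cumulative buckets plus a running sum and count; PromQL's histogram_quantile works on the bucket series.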

PromQL: Querying Your Metrics

PromQL is Prometheus's query language. Essential queries include:

Request rate: rate(http_requests_total[5m]) — average requests per second over the last 5 minutes
Error rate: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) — fraction of requests that fail (multiply by 100 for a percentage)
Latency percentile: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) — p95 latency
Resource usage: process_resident_memory_bytes / 1024 / 1024 — memory usage in MB

PromQL supports aggregation (sum, avg, max, min), filtering by labels, and mathematical operations. Master these four query types and you can monitor any application.
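For example, aggregation and label filtering combine naturally with rate; the route, job, and status label names below are assumptions about your instrumentation:

```promql
# p95 latency per route, aggregated across all instances
histogram_quantile(
  0.95,
  sum by (route, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# requests per second for one service, 5xx responses only
sum(rate(http_requests_total{job="node-app", status=~"5.."}[5m]))
```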

Grafana: Visualization

Grafana connects to Prometheus as a data source and provides dashboards for visualizing metrics:

Dashboard Design Principles

The Four Golden Signals: Every service dashboard should show latency, traffic, errors, and saturation. These four signals cover the vast majority of monitoring needs.
Top-down layout: Start with high-level service health at the top, drill into specific components below.
Time range consistency: Use dashboard-level time range controls so all panels show the same window.
Annotations: Mark deployments, incidents, and configuration changes on your graphs. Correlating metrics with events is essential for root cause analysis.
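One possible mapping of the four golden signals onto PromQL, assuming the metric and label names used earlier in this guide:

```promql
# Latency:    p95 request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Traffic:    requests per second
sum(rate(http_requests_total[5m]))
# Errors:     fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Saturation: resident memory in use
process_resident_memory_bytes
```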

Essential Panels

Request rate graph: Time series showing requests per second. Sudden drops indicate outages; spikes indicate traffic surges.
Error rate graph: Time series showing error percentage. Any sustained increase above your baseline triggers investigation.
Latency heatmap: Shows request duration distribution over time. Identify slow requests visually.
Resource gauges: CPU, memory, and disk usage as gauge panels with thresholds (green/yellow/red).

Alerting Rules

Prometheus Alertmanager handles alert routing and deduplication. Define alerting rules in Prometheus that fire when conditions are met:

High error rate: Alert when error rate exceeds 1% for 5 minutes
High latency: Alert when p95 latency exceeds 2 seconds for 5 minutes
Resource exhaustion: Alert when disk usage exceeds 80%
Target down: Alert when a scrape target is unreachable for 2 minutes
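The four thresholds above could be expressed as Prometheus alerting rules roughly like this; the metric names are assumptions (the disk rule assumes node_exporter is running), so adjust them to your instrumentation:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m                     # must hold for 5 minutes before firing
        labels:
          severity: critical

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: critical

      - alert: DiskAlmostFull
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.2
        labels:
          severity: warning

      - alert: TargetDown
        expr: up == 0               # "up" is set by Prometheus itself per scrape target
        for: 2m
        labels:
          severity: critical
```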

Route alerts to Slack, PagerDuty, email, or webhooks based on severity. Critical alerts page on-call engineers. Warning alerts go to a Slack channel.
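Severity-based routing lives in alertmanager.yml. A sketch, where the receiver names, routing key, and webhook URL are placeholders:

```yaml
route:
  receiver: slack-warnings            # default: everything goes to Slack
  routes:
    - matchers:
        - severity = "critical"       # critical alerts page on-call instead
      receiver: pagerduty-oncall

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<your-pagerduty-key>"
  - name: slack-warnings
    slack_configs:
      - api_url: "<your-slack-webhook-url>"
        channel: "#alerts"
```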

Monitoring Node.js and Next.js Applications

Use the prom-client npm package to instrument Node.js applications. Create a metrics registry, define custom metrics (request counters, duration histograms), and expose them on a /metrics endpoint.

For Next.js specifically, instrument API routes with middleware that measures request duration and status codes. Track Server Component render times. Monitor edge function execution duration if using Vercel Edge Runtime.

Custom Metrics

Beyond default metrics, track business-specific signals:

User signups per minute: Track conversion rates in real-time
Order processing time: Monitor your fulfillment pipeline
Cache hit ratio: Measure caching effectiveness
Queue depth: Track background job backlogs
External API latency: Monitor third-party dependency performance

Custom metrics turn your monitoring from "is it up?" into "is the business healthy?" — a far more valuable question.

The Beyond Horizon Team