Revolutionizing Financial Monitoring: Building a Team Dashboard With OpenObserve
Find out how building a unified OpenObserve dashboard transformed my financial monitoring journey—cutting downtime, lowering cost, and enabling proactive operations.
After a particularly grueling Thursday spent troubleshooting a publish API outage last year, I remember turning to my colleague and saying, "There has to be a better way." Four years into my software engineering career in fintech, and we were still piecing together information from disparate monitoring tools whenever something went wrong. That frustration kickstarted my two-month journey to build what's now become our team's most valuable asset: a comprehensive OpenObserve dashboard that's transformed how we monitor our services.
Finding the Right Tool
After spending a few weekends researching options, I narrowed down our choices to three observability platforms. OpenObserve won out because it offered:
- A unified approach to logs, metrics, and traces
- Better cost efficiency compared to competitors
- The flexibility we needed for our mixed Java and Node.js stack
The Implementation Journey
Building our dashboard was anything but straightforward. I hit several roadblocks along the way.
The first challenge was instrumenting our services. For our Java-based publish and calculate processing service, I implemented something like this:
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class PublishApiInstrumentation {

    private final LongCounter successCounter;
    private final LongCounter failureCounter;

    public PublishApiInstrumentation(OpenTelemetry openTelemetry) {
        Meter meter = openTelemetry.getMeter("publish-processing");
        successCounter = meter
            .counterBuilder("publish.api.success")
            .setDescription("Number of successful publish API calls")
            .build();
        failureCounter = meter
            .counterBuilder("publish.api.failure")
            .setDescription("Number of failed publish API calls")
            .build();
    }

    public void recordSuccess() {
        successCounter.add(1);
    }

    public void recordFailure() {
        failureCounter.add(1);
    }
}
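Wiring those counters into the publish path itself is the easy part. The wrapper below is a minimal sketch rather than our production handler (the class and method names are illustrative); it just shows where recordSuccess and recordFailure get called so the business logic stays free of metrics code:

import java.util.function.Supplier;

public class InstrumentedPublishHandler {

    private final PublishApiInstrumentation instrumentation;

    public InstrumentedPublishHandler(PublishApiInstrumentation instrumentation) {
        this.instrumentation = instrumentation;
    }

    // Wraps any publish operation and records the outcome on the counters above
    public <T> T withMetrics(Supplier<T> publishOperation) {
        try {
            T result = publishOperation.get();
            instrumentation.recordSuccess();
            return result;
        } catch (RuntimeException e) {
            instrumentation.recordFailure();
            throw e;
        }
    }
}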
Initially, I made the rookie mistake of over-instrumenting everything. Our first iteration was sending so much telemetry data that it was both expensive and overwhelming. I had to take a step back and ask, "What actually matters to us?" This led to a more focused approach.
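One technique that helped rein in the volume (sketched below with example attribute names, not our exact configuration) is registering views on the OpenTelemetry SDK meter provider so that only the attributes the dashboard actually uses get exported:

import java.util.Set;

import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;

public class TelemetryTrimming {

    public static SdkMeterProvider buildMeterProvider() {
        return SdkMeterProvider.builder()
            // For every instrument from the publish meter, drop all attributes
            // except the low-cardinality ones we actually chart
            .registerView(
                InstrumentSelector.builder()
                    .setMeterName("publish-processing")
                    .build(),
                View.builder()
                    .setAttributeFilter(Set.of("endpoint", "statusCode"))
                    .build())
            .build();
    }
}

A real setup also registers a metric reader and exporter on the same builder; this snippet only shows the attribute-filtering piece.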
For our Node.js services, including our account management microservice, I took a similar but slightly different approach:
const opentelemetry = require('@opentelemetry/api');

// Acquire a meter for the account service (the SDK and exporter are configured at startup)
const meter = opentelemetry.metrics.getMeter('account-service');

// Counters and a latency histogram for API metrics
const apiCallCounter = meter.createCounter('account.api.calls', {
  description: 'Count of API calls to account service'
});
const apiErrorCounter = meter.createCounter('account.api.errors', {
  description: 'Count of API errors in account service'
});
const apiDurationHistogram = meter.createHistogram('account.api.duration', {
  description: 'Duration of account service API calls',
  unit: 'ms'
});

// Express middleware to track API metrics
function metricsMiddleware(req, res, next) {
  const startTime = Date.now();

  // Record metrics once the response has been sent
  res.on('finish', () => {
    const duration = Date.now() - startTime;

    // req.route is only set when a route matched; fall back to the raw path
    const attributes = {
      route: req.route ? req.route.path : req.path,
      method: req.method,
      statusCode: res.statusCode
    };

    apiCallCounter.add(1, attributes);
    apiDurationHistogram.record(duration, attributes);

    // Record errors separately
    if (res.statusCode >= 400) {
      apiErrorCounter.add(1, attributes);
    }
  });

  next();
}

// Register in the Express app before the route handlers: app.use(metricsMiddleware)
module.exports = { metricsMiddleware };
The real magic happened when I started building the queries and visualizations. After several iterations, I came up with a set of OpenObserve Query Language (OQL) queries that gave us the insights we needed:
-- For monitoring API success rates
logs
| json_extract(body, '$.service', '$.statusCode', '$.endpoint', '$.duration')
| where service = 'publish-api'
| timechart span=1m,
    count(*) as total_requests,
    count(statusCode < 400) as successful_requests,
    count(statusCode >= 400) as failed_requests
| eval success_rate = (successful_requests * 100.0) / total_requests
For tracking CPU utilization, which had been a persistent blind spot for us, I used:
metrics
| filter name = 'system.cpu.utilization'
| filter service = 'publish-gateway'
| timechart avg(value) by host span=5m
This helped us identify a particularly CPU-hungry endpoint that was consuming resources disproportionately during peak hours. We optimized it, reducing CPU usage by 40% and saving us from having to provision additional instances.
Setting Up Alerts That Don't Drive Us Crazy
Anyone who's worked in operations knows the pain of alert fatigue. I'd been burned before by noisy alerts, so I was determined to get this right.
One of our most valuable alerts watches for sudden increases in API failure rates:
alert: PublishAPIFailureRateHigh
expr: |
  sum(rate(publish_api_failures[5m])) / sum(rate(publish_api_requests[5m])) > 0.05
for: 2m
labels:
  severity: critical
  team: finance-tech
annotations:
  summary: "Publish API failure rate exceeds 5%"
  description: "The failure rate for the publish API has exceeded 5% for more than 2 minutes."
The key was setting thresholds that were sensitive enough to catch real issues but not so sensitive that we'd get woken up for nothing. Finding that balance took several weeks of tuning.
The Real-World Impact
It's been two months since we fully implemented our monitoring solution, and the impact has been greater than I expected:
- Faster resolution: What used to take us 3+ hours to diagnose now typically takes less than an hour. That's not just good for our stress levels—it's directly improving our user experience.
- Real cost savings: By identifying resource-hungry operations and optimizing them, we've reduced our cloud infrastructure costs by 22%.
- Breaking down silos: Perhaps the most unexpected benefit has been how the dashboard has improved collaboration between engineering and finance teams. Our finance colleagues now have visibility into technical metrics that affect their business operations, leading to more informed business decisions.
What's Next
We're not stopping here. I'm already working on phase two, which includes:
- Implementing distributed tracing to better understand end-to-end finance user interaction flows (a rough sketch of where I'm starting is below)
- Adding anomaly detection using machine learning models
- Connecting technical metrics to business outcomes, like correlating API performance with successful publish rates
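For the tracing piece, the plan is to build on the same OpenTelemetry API we already use for metrics. The sketch below is just the direction I'm exploring; the span and class names are placeholders, and none of this is in production yet:

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class PublishTracing {

    private final Tracer tracer;

    public PublishTracing(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer("publish-processing");
    }

    // Wraps a publish flow in a span so downstream calls show up as children
    public void tracedPublish(Runnable publishOperation) {
        Span span = tracer.spanBuilder("publish.request").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            publishOperation.run();
        } catch (RuntimeException e) {
            span.setStatus(StatusCode.ERROR, "publish failed");
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}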
Building this monitoring solution has been one of the most satisfying projects of my career. It's transformed how our team works, moving us from reactive firefighting to proactive system management. In the high-stakes world of financial services, that's not just a technical improvement—it's a business transformation.