
Revolutionizing Financial Monitoring: Building a Team Dashboard With OpenObserve

Find out how building a unified OpenObserve dashboard transformed my financial monitoring journey—cutting downtime, lowering cost, and enabling proactive operations.

By Sushma Kukkadapu · May 09, 2025 · Analysis

After a particularly grueling Thursday spent troubleshooting a publish API outage last year, I remember turning to my colleague and saying, "There has to be a better way." Four years into my software engineering career in fintech, and we were still piecing together information from disparate monitoring tools whenever something went wrong. That frustration kickstarted my two-month journey to build what's now become our team's most valuable asset: a comprehensive OpenObserve dashboard that's transformed how we monitor our services.



Finding the Right Tool

After spending a few weekends researching options, I narrowed down our choices to three observability platforms. OpenObserve won out because it offered:

  1. A unified approach to logs, metrics, and traces
  2. Better cost efficiency compared to competitors
  3. The flexibility we needed for our mixed Java and Node.js stack

The Implementation Journey

Building our dashboard was anything but straightforward. I hit several roadblocks along the way.

The first challenge was instrumenting our services. For our Java-based publish and calculate processing service, I implemented something like this:

Java
 
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class PublishApiInstrumentation {
    private final LongCounter successCounter;
    private final LongCounter failureCounter;

    public PublishApiInstrumentation(OpenTelemetry openTelemetry) {
        // One meter per instrumentation scope; its name is attached to
        // every metric this class emits
        Meter meter = openTelemetry.getMeter("publish-processing");

        successCounter = meter
            .counterBuilder("publish.api.success")
            .setDescription("Number of successful publish API calls")
            .build();

        failureCounter = meter
            .counterBuilder("publish.api.failure")
            .setDescription("Number of failed publish API calls")
            .build();
    }

    public void recordSuccess() {
        successCounter.add(1);
    }

    public void recordFailure() {
        failureCounter.add(1);
    }
}
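
Wiring this into the request path is then a couple of lines on each side of the call. Here's a minimal sketch; the handler class and the doPublish step are hypothetical stand-ins for our real publish flow:

Java
 
import io.opentelemetry.api.OpenTelemetry;

public class PublishHandler {
    private final PublishApiInstrumentation metrics;

    public PublishHandler(OpenTelemetry openTelemetry) {
        this.metrics = new PublishApiInstrumentation(openTelemetry);
    }

    public void publish(String payload) {
        try {
            doPublish(payload); // stand-in for the real publish call
            metrics.recordSuccess();
        } catch (RuntimeException e) {
            metrics.recordFailure();
            throw e;
        }
    }

    private void doPublish(String payload) {
        // actual publish logic lives here
    }
}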


Initially, I made the rookie mistake of over-instrumenting everything. Our first iteration was sending so much telemetry data that it was both expensive and overwhelming. I had to take a step back and ask, "What actually matters to us?" This led to a more focused approach.
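
For context, here's a minimal sketch of what the trimmed-down SDK wiring can look like, not verbatim from our codebase. The OpenObserve endpoint path, organization, credentials, and the dropped instrument name are all placeholders; check your own instance's OTLP ingestion settings:

Java
 
import java.time.Duration;

import io.opentelemetry.exporter.otlp.http.metrics.OtlpHttpMetricExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.metrics.Aggregation;
import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;

public class TelemetryConfig {

    public static OpenTelemetrySdk create() {
        // OTLP/HTTP exporter; endpoint and auth header are placeholders
        OtlpHttpMetricExporter exporter = OtlpHttpMetricExporter.builder()
            .setEndpoint("https://openobserve.example.com/api/default/v1/metrics")
            .addHeader("Authorization", "Basic <base64-credentials>")
            .build();

        SdkMeterProvider meterProvider = SdkMeterProvider.builder()
            // Push metrics every 30 seconds
            .registerMetricReader(
                PeriodicMetricReader.builder(exporter)
                    .setInterval(Duration.ofSeconds(30))
                    .build())
            // Drop an instrument we decided we didn't need (hypothetical name)
            // instead of paying to ship and store it
            .registerView(
                InstrumentSelector.builder().setName("publish.api.debug.payload").build(),
                View.builder().setAggregation(Aggregation.drop()).build())
            .build();

        return OpenTelemetrySdk.builder()
            .setMeterProvider(meterProvider)
            .build();
    }
}

The drop view was the real lesson: cutting telemetry at the SDK, before it ever leaves the service, keeps both the bill and the dashboards manageable.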

For our Node.js services, including our account management microservice, I took a similar but slightly different approach:

JavaScript
 
const opentelemetry = require('@opentelemetry/api');

// Configure the meter
const meter = opentelemetry.metrics.getMeter('account-service');

// Create counters for API metrics
const apiCallCounter = meter.createCounter('account.api.calls', {
  description: 'Count of API calls to account service'
});

const apiErrorCounter = meter.createCounter('account.api.errors', {
  description: 'Count of API errors in account service'
});

// Histogram for request latency, so slow endpoints surface next to error counts
const apiDurationHistogram = meter.createHistogram('account.api.duration', {
  description: 'Duration of account service API calls',
  unit: 'ms'
});

// Express middleware to track API metrics; register it before the routes
// with app.use(metricsMiddleware)
function metricsMiddleware(req, res, next) {
  const startTime = Date.now();

  // Record metrics once the response has finished writing
  res.on('finish', () => {
    const duration = Date.now() - startTime;

    // req.route is only set when a route matched, so fall back to the raw path
    const attributes = {
      route: req.route ? req.route.path : req.path,
      method: req.method,
      statusCode: res.statusCode
    };

    apiCallCounter.add(1, attributes);
    apiDurationHistogram.record(duration, attributes);

    // Record errors separately
    if (res.statusCode >= 400) {
      apiErrorCounter.add(1, attributes);
    }
  });

  next();
}


The real magic happened when I started building the queries and visualizations. After several iterations, I came up with a set of OpenObserve Query Language (OQL) queries that gave us the insights we needed:

SQL
 
-- For monitoring API success rates
logs 
| json_extract(body, '$.service', '$.statusCode', '$.endpoint', '$.duration') 
| where service='publish-api' 
| timechart span=1m, 
  count(*) as total_requests,
  count(statusCode < 400) as successful_requests,
  count(statusCode >= 400) as failed_requests 
| eval success_rate = (successful_requests * 100.0) / total_requests


For tracking CPU utilization, which had been a persistent blind spot for us, I used:

SQL
 
metrics 
| filter name = 'system.cpu.utilization' 
| filter service = 'publish-gateway' 
| timechart avg(value) by host span=5m


This helped us identify a particularly CPU-hungry endpoint that was consuming resources disproportionately during peak hours. We optimized it, reducing CPU usage by 40% and saving us from having to provision additional instances.

Setting Up Alerts That Don't Drive Us Crazy

Anyone who's worked in operations knows the pain of alert fatigue. I'd been burned before by noisy alerts, so I was determined to get this right.

One of our most valuable alerts watches for sudden increases in API failure rates:

YAML
 
alert: PublishAPIFailureRateHigh
expr: |
  sum(rate(publish_api_failures[5m])) / sum(rate(publish_api_requests[5m])) > 0.05
for: 2m
labels:
  severity: critical
  team: finance-tech
annotations:
  summary: "Publish API failure rate exceeds 5%"
  description: "The failure rate for the publish API has exceeded 5% for more than 2 minutes."


The key was setting thresholds that were sensitive enough to catch real issues but not so sensitive that we'd get woken up for nothing. Finding that balance took several weeks of tuning.

The Real-World Impact

It's been two months since we fully implemented our monitoring solution, and the impact has been greater than I expected:

  1. Faster resolution: What used to take us 3+ hours to diagnose now typically takes less than an hour. That's not just good for our stress levels—it's directly improving our user experience.
  2. Real cost savings: By identifying resource-hungry operations and optimizing them, we've reduced our cloud infrastructure costs by 22%. 
  3. Breaking down silos: Perhaps the most unexpected benefit has been how the dashboard has improved collaboration between engineering and finance teams. Our finance colleagues now have visibility into technical metrics that affect their business operations, leading to more informed business decisions.

What's Next

We're not stopping here. I'm already working on phase two, which includes:

  • Implementing distributed tracing to better understand end-to-end finance user interaction flows (a rough sketch of what this could look like follows this list)
  • Adding anomaly detection using machine learning models
  • Connecting technical metrics to business outcomes, like correlating API performance with successful publish rates
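
For the tracing work, the starting point will likely look something like the following. It's a sketch with the OpenTelemetry tracing API, not a final design; the span name, attribute, and downstream steps are hypothetical:

Java
 
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class PublishFlowTracing {
    private final Tracer tracer;

    public PublishFlowTracing(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer("publish-processing");
    }

    public void publishDocument(String documentId) {
        // One span per publish request; anything recorded while the span is
        // current (including calls into other services) joins the same trace
        Span span = tracer.spanBuilder("publish.document")
            .setAttribute("document.id", documentId)
            .startSpan();
        try (Scope scope = span.makeCurrent()) {
            validate(documentId);
            persist(documentId);
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }

    // Hypothetical downstream steps in the publish flow
    private void validate(String documentId) { }
    private void persist(String documentId) { }
}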

Building this monitoring solution has been one of the most satisfying projects of my career. It's transformed how our team works, moving us from reactive firefighting to proactive system management. In the high-stakes world of financial services, that's not just a technical improvement—it's a business transformation.


Opinions expressed by DZone contributors are their own.
