Revolutionizing Financial Monitoring: Building a Team Dashboard With OpenObserve
Find out how building a unified OpenObserve dashboard transformed my financial monitoring journey—cutting downtime, lowering cost, and enabling proactive operations.
After a particularly grueling Thursday spent troubleshooting a publish API outage last year, I remember turning to my colleague and saying, "There has to be a better way." Four years into my software engineering career in fintech, and we were still piecing together information from disparate monitoring tools whenever something went wrong. That frustration kickstarted my two-month journey to build what's now become our team's most valuable asset: a comprehensive OpenObserve dashboard that's transformed how we monitor our services.
Finding the Right Tool
After spending a few weekends researching options, I narrowed down our choices to three observability platforms. OpenObserve won out because it offered:
- A unified approach to logs, metrics, and traces
- Better cost efficiency compared to competitors
- The flexibility we needed for our mixed Java and Node.js stack
The Implementation Journey
Building our dashboard was anything but straightforward. I hit several roadblocks along the way.
The first challenge was instrumenting our services. For our Java-based publish and calculate processing service, I implemented something like this:
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class PublishApiInstrumentation {

    private final LongCounter successCounter;
    private final LongCounter failureCounter;

    public PublishApiInstrumentation(OpenTelemetry openTelemetry) {
        Meter meter = openTelemetry.getMeter("publish-processing");
        successCounter = meter
            .counterBuilder("publish.api.success")
            .setDescription("Number of successful publish API calls")
            .build();
        failureCounter = meter
            .counterBuilder("publish.api.failure")
            .setDescription("Number of failed publish API calls")
            .build();
    }

    public void recordSuccess() {
        successCounter.add(1);
    }

    public void recordFailure() {
        failureCounter.add(1);
    }
}
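Wiring those counters into the publish path itself is the easy part. The wrapper below is a minimal sketch rather than our production handler (the class and method names are illustrative); it just shows where recordSuccess and recordFailure get called so the business logic stays free of metrics code:

import java.util.function.Supplier;

public class InstrumentedPublishHandler {

    private final PublishApiInstrumentation instrumentation;

    public InstrumentedPublishHandler(PublishApiInstrumentation instrumentation) {
        this.instrumentation = instrumentation;
    }

    // Wraps any publish operation and records the outcome on the counters above
    public <T> T withMetrics(Supplier<T> publishOperation) {
        try {
            T result = publishOperation.get();
            instrumentation.recordSuccess();
            return result;
        } catch (RuntimeException e) {
            instrumentation.recordFailure();
            throw e;
        }
    }
}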
Initially, I made the rookie mistake of over-instrumenting everything. Our first iteration was sending so much telemetry data that it was both expensive and overwhelming. I had to take a step back and ask, "What actually matters to us?" This led to a more focused approach.
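One technique that helped rein in the volume (sketched below with example attribute names, not our exact configuration) is registering views on the OpenTelemetry SDK meter provider so that only the attributes the dashboard actually uses get exported:

import java.util.Set;

import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;

public class TelemetryTrimming {

    public static SdkMeterProvider buildMeterProvider() {
        return SdkMeterProvider.builder()
            // For every instrument from the publish meter, drop all attributes
            // except the low-cardinality ones we actually chart
            .registerView(
                InstrumentSelector.builder()
                    .setMeterName("publish-processing")
                    .build(),
                View.builder()
                    .setAttributeFilter(Set.of("endpoint", "statusCode"))
                    .build())
            .build();
    }
}

A real setup also registers a metric reader and exporter on the same builder; this snippet only shows the attribute-filtering piece.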
For our Node.js services, including our account management microservice, I took a similar but slightly different approach:
const opentelemetry = require('@opentelemetry/api');

// Acquire a meter for the account service (the SDK and exporter are configured at startup)
const meter = opentelemetry.metrics.getMeter('account-service');

// Counters and a latency histogram for API metrics
const apiCallCounter = meter.createCounter('account.api.calls', {
  description: 'Count of API calls to account service'
});
const apiErrorCounter = meter.createCounter('account.api.errors', {
  description: 'Count of API errors in account service'
});
const apiDurationHistogram = meter.createHistogram('account.api.duration', {
  description: 'Duration of account service API calls',
  unit: 'ms'
});

// Express middleware to track API metrics
function metricsMiddleware(req, res, next) {
  const startTime = Date.now();

  // Record metrics once the response has been sent
  res.on('finish', () => {
    const duration = Date.now() - startTime;

    // req.route is only set when a route matched; fall back to the raw path
    const attributes = {
      route: req.route ? req.route.path : req.path,
      method: req.method,
      statusCode: res.statusCode
    };

    apiCallCounter.add(1, attributes);
    apiDurationHistogram.record(duration, attributes);

    // Record errors separately
    if (res.statusCode >= 400) {
      apiErrorCounter.add(1, attributes);
    }
  });

  next();
}

// Register in the Express app before the route handlers: app.use(metricsMiddleware)
module.exports = { metricsMiddleware };
The real magic happened when I started building the queries and visualizations. After several iterations, I came up with a set of OpenObserve Query Language (OQL) queries that gave us the insights we needed:
-- For monitoring API success rates
logs
| json_extract(body, '$.service', '$.statusCode', '$.endpoint', '$.duration')
| where service = 'publish-api'
| timechart span=1m,
    count(*) as total_requests,
    count(statusCode < 400) as successful_requests,
    count(statusCode >= 400) as failed_requests
| eval success_rate = (successful_requests * 100.0) / total_requests
For tracking CPU utilization, which had been a persistent blind spot for us, I used:
metrics
| filter name = 'system.cpu.utilization'
| filter service = 'publish-gateway'
| timechart avg(value) by host span=5m
This helped us identify a particularly CPU-hungry endpoint that was consuming resources disproportionately during peak hours. We optimized it, reducing CPU usage by 40% and saving us from having to provision additional instances.
Setting Up Alerts That Don't Drive Us Crazy
Anyone who's worked in operations knows the pain of alert fatigue. I'd been burned before by noisy alerts, so I was determined to get this right.
One of our most valuable alerts watches for sudden increases in API failure rates:
alert: PublishAPIFailureRateHigh
expr: |
  sum(rate(publish_api_failures[5m])) / sum(rate(publish_api_requests[5m])) > 0.05
for: 2m
labels:
  severity: critical
  team: finance-tech
annotations:
  summary: "Publish API failure rate exceeds 5%"
  description: "The failure rate for the publish API has exceeded 5% for more than 2 minutes."
The key was setting thresholds that were sensitive enough to catch real issues but not so sensitive that we'd get woken up for nothing. Finding that balance took several weeks of tuning.
The Real-World Impact
It's been two months since we fully implemented our monitoring solution, and the impact has been greater than I expected:
- Faster resolution: What used to take us 3+ hours to diagnose now typically takes less than an hour. That's not just good for our stress levels—it's directly improving our user experience.
- Real cost savings: By identifying resource-hungry operations and optimizing them, we've reduced our cloud infrastructure costs by 22%.
- Breaking down silos: Perhaps the most unexpected benefit has been how the dashboard has improved collaboration between engineering and finance teams. Our finance colleagues now have visibility into technical metrics that affect their business operations, leading to more informed business decisions.
What's Next
We're not stopping here. I'm already working on phase two, which includes:
- Implementing distributed tracing to better understand end-to-end finance user interaction flows (a rough sketch of where I'm starting is below)
- Adding anomaly detection using machine learning models
- Connecting technical metrics to business outcomes, like correlating API performance with successful publish rates
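For the tracing piece, the plan is to build on the same OpenTelemetry API we already use for metrics. The sketch below is just the direction I'm exploring; the span and class names are placeholders, and none of this is in production yet:

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class PublishTracing {

    private final Tracer tracer;

    public PublishTracing(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer("publish-processing");
    }

    // Wraps a publish flow in a span so downstream calls show up as children
    public void tracedPublish(Runnable publishOperation) {
        Span span = tracer.spanBuilder("publish.request").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            publishOperation.run();
        } catch (RuntimeException e) {
            span.setStatus(StatusCode.ERROR, "publish failed");
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}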
Building this monitoring solution has been one of the most satisfying projects of my career. It's transformed how our team works, moving us from reactive firefighting to proactive system management. In the high-stakes world of financial services, that's not just a technical improvement—it's a business transformation.