Green coding - Confluent: Sustainability through data streaming
This is a guest post for the Computer Weekly Developer Network written by Peter Pugh-Jones in his capacity as director of financial services at Confluent.
Confluent is a full-scale data streaming platform that enables users to access, store and manage data as continuous, real-time streams. Built by the original creators of Apache Kafka, Confluent expands the benefits of Kafka with enterprise-grade features while removing the burden of Kafka management and monitoring.
As TechTarget reminds us, Apache Kafka is a distributed publish-subscribe messaging system that receives data from disparate source systems and makes it available to target systems in real time.
Pugh-Jones writes in full as follows…
It’s no secret that big data can come with a big environmental impact.
For every gigabyte stored in the cloud, around seven kilowatt-hours of energy are used. That means that if a business stores ten terabytes of data, it’s creating a carbon footprint equivalent to 500 kg of CO2.
But it’s not just storage: all aspects of processing, analysis and software development come with their own environmental burdens. Here’s the good news, though — data-driven technologies are also helping to change the world for the better.
Systematic sustainability
Digital data is helping to streamline processes, drive environmental decision-making and uncover new, more sustainable ways of working.
Developers are also pushing for their own solutions. The concept of ‘green coding’ — a practice that prioritises efficiency and sustainability in software development — is gaining traction, as is a push for greater awareness of tech supply chains and more ethical data storage decisions.
So how can businesses adopt greener coding practices and where does data streaming fit into this sustainable future?
Batch vs real-time streaming
When it comes to managing data, businesses must make a choice between batch processing (processing large volumes at scheduled intervals) and data streaming (continuously processing data in real-time as it arrives).
The common misconception is that continuous streaming must be less efficient because it requires constant processing power. Similar to a TV on standby, the logic goes that running something continuously is worse than switching it on and off when needed.
In reality, the opposite is true. Batch processing requires such regular and intensive spikes in processing power that it’s significantly less efficient than a continuous low-level stream. To continue our television analogy, it’s equivalent to turning a TV on and off 500 times a minute rather than leaving it on standby. The result is actually far more energy being consumed.
In contrast, data streaming means a constant flow of data can be processed at all times. Rather than waiting for a mass of messages to accumulate and then processing them in one huge spike of CPU power, you’re handling one message at a time and drawing a far smaller amount of energy at any given moment. In terms of resource consumption, the result is a gently flowing stream rather than a sudden tidal wave.
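The contrast above can be sketched in a few lines of code. This is a toy illustration in plain Python — not Kafka, Confluent or any real streaming API — and the `process` function and event counts are hypothetical stand-ins. The point it demonstrates is that the total work is identical either way; what differs is the peak load that hits the infrastructure in any one interval.

```python
def process(event):
    """Stand-in for any per-message transformation."""
    return event * 2

def run_batch(events):
    """Batch mode: events accumulate, then are processed in a single burst."""
    peak = len(events)          # the whole backlog hits the CPU at once
    results = [process(e) for e in events]
    return results, peak

def run_streaming(events):
    """Streaming mode: each event is handled the moment it arrives."""
    peak = 1                    # never more than one message in flight
    results = []
    for e in events:
        results.append(process(e))
    return results, peak

events = list(range(1000))
batch_out, batch_peak = run_batch(events)
stream_out, stream_peak = run_streaming(events)

assert batch_out == stream_out  # identical output either way
print(batch_peak, stream_peak)  # peak of 1000 events vs. 1 event
```

Real systems are, of course, messier than this — a streaming platform amortises the work continuously, which is what makes capacity so much easier to provision for.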
Providing predictable processing
Streaming data doesn’t just benefit from lower CPU usage; it also makes processing more predictable overall. By switching from sudden spikes in processing to a continuous, predictable flow, data streaming allows organisations to better forecast their requirements. Take Apache Flink as an example. Apache Flink is an open source stream-processing framework — a popular technology for those who opt for real-time data streaming.
With Flink Actions (the operations that are applied to data streams when using Apache Flink), users can enable not only real-time data processing but also real-time analytics. These analytics help organisations develop a clear and reliable understanding of their usage, with less need to build in contingency for sudden, unexpected spikes. By lowering this need for unused cloud contingency, organisations can increase efficiency, reduce costs and ultimately develop more sustainable processing.
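The kind of continuous aggregation described above can be illustrated with a tumbling-window average — one of the standard windowed operations a stream processor such as Flink provides. The sketch below is plain Python, not the Flink API, and the CPU readings are made-up sample values; it only shows the shape of the computation.

```python
def tumbling_window_averages(stream, window_size):
    """Yield the mean of each consecutive, non-overlapping window of readings."""
    window = []
    for reading in stream:
        window.append(reading)
        if len(window) == window_size:
            yield sum(window) / window_size
            window = []   # tumbling windows never overlap

# Hypothetical CPU-usage samples arriving as a stream
cpu_readings = [10, 20, 30, 40, 50, 60]
print(list(tumbling_window_averages(cpu_readings, 3)))  # [20.0, 50.0]
```

Because each window produces its result as soon as it closes, usage figures are available continuously rather than after a nightly batch run — which is what makes capacity forecasting so much more reliable.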
Frameworks like Apache Flink, when managed via a data streaming platform (Ed: could our erstwhile author here be referring to Confluent perchance?), also come with the added benefit of being serverless.