yahoohadoop.tumblr.com/post/122544585051/apache-storm-roadmap-at-yahoo

By Bobby Evans and the Apache Storm team at Yahoo

Yahoo uses Apache Storm for large-scale distributed stream processing. Think of stream processing as Apache Hadoop, except that instead of batching the data first, it processes the data as it arrives. Stream processing gives Yahoo the ability to take action on data in near real-time, and we use it for just about everything that is latency sensitive.
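To make the batch-versus-stream distinction concrete, here is a minimal sketch in plain Python (not the Storm API): a batch job collects everything before computing, while a stream processor keeps an up-to-date answer per event.

```python
# Illustrative sketch only (plain Python, not Storm): batch processing
# computes once over the full dataset; stream processing updates its
# state per event, so an answer is available with low latency.

def batch_count(events):
    """Batch style: collect everything first, then compute once."""
    counts = {}
    for word in events:
        counts[word] = counts.get(word, 0) + 1
    return counts

class StreamingCounter:
    """Stream style: state is updated as each event arrives."""
    def __init__(self):
        self.counts = {}

    def on_event(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        return self.counts[word]  # current count, immediately

events = ["news", "spam", "news"]
sc = StreamingCounter()
for w in events:
    sc.on_event(w)

# Both arrive at the same totals; only the latency profile differs.
assert sc.counts == batch_count(events)
```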

Storm is commonly used when lower latency is required to deliver a better user experience. These use cases include identifying a breaking news story and promoting it, showing trending search terms as they happen, helping to identify and block spam, and letting advertisers see the impact of their campaigns as quickly as possible.

In my opinion, the most interesting use cases are feedback loops. One of my favorites is ad budgeting. We have multiple data centers with ad servers all around the globe. Advertisers give us a budget for how much they want to spend on a particular campaign. If we show too few ads, we are leaving money on the table; if we show too many, we cannot charge for the excess ad impressions that we could have sold to someone else. Storm gives us the ability to globally adjust to changes as they happen, and accurately control the number of ads shown in a campaign. So, for example, when Yahoo live streams an NFL game and England unexpectedly goes crazy over it, we can shift ad budget from US data centers across the pond and maximize our revenue in real-time.
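The pacing logic in such a feedback loop can be sketched very simply. This is a hypothetical controller, not Yahoo's actual system: `serve_probability` compares spend so far against the ideal pace and throttles or opens up ad serving accordingly.

```python
# Hedged sketch of budget pacing (names and logic are illustrative,
# not Yahoo's production system). The controller returns the fraction
# of eligible requests that should show an ad right now.

def serve_probability(spent, budget, elapsed, duration):
    """Compare actual spend to the ideal pace and adjust serving.

    spent, budget  -- dollars spent so far / total campaign budget
    elapsed, duration -- time into the campaign / total campaign length
    """
    if spent >= budget:
        return 0.0                       # budget exhausted: stop serving
    if spent <= 0:
        return 1.0                       # nothing spent yet: serve freely
    target = budget * (elapsed / duration)   # ideal spend at this moment
    # Behind pace (target > spent) -> serve every request (capped at 1.0);
    # ahead of pace -> throttle proportionally.
    return min(1.0, target / spent)

# Halfway through a $100 campaign with $60 already spent, the ideal
# spend is $50, so serving is throttled to 50/60 of requests.
p = serve_probability(60, 100, 0.5, 1.0)
assert abs(p - 50 / 60) < 1e-9
```

A real system would feed each data center's spend events through Storm to update `spent` globally, which is what makes the cross-data-center budget shifting described above possible.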

Our Storm deployments have grown dramatically over the last few years, with the number of nodes we have doubling in the last six months alone.

Fig 1. Storm cluster growth

Fig 2. Storm topology growth

Much of this growth is made possible by our previous development effort to add security and multi-tenancy to Storm. We are very excited to announce that this is now in open source under storm-0.10.1-beta. This is a critical milestone for adopting Storm in the many enterprise use cases that have security concerns.

It is not just Yahoo that sees the power of stream processing. There seem to be new announcements about stream processing showing up all the time: Twitter publishing their results on Heron, and DataTorrent announcing that they will release their core as open source under the name Apex. It is great to see so much competition happening in the market. I am a firm believer in competition driving positive change.

Recently, I gave a talk at Hadoop Summit 2015 on our efforts at Yahoo to scale Apache Storm to 4000+ nodes (96,000 cores) in a single cluster. The question I was asked the most after the talk was “When will this be in open source?” I was also frequently asked “How can we help out?” As a result, we are shifting our development efforts to better align with open source and to reduce the amount of time it takes for this work to be made available to end users.

To help with the debuggability and management of Storm, we will be pushing back STORM-902 to allow for easy searching of the logs out of the box; STORM-901 so that developers can get access to heap dumps, gc logs, or anything else a topology may want to export for debugging; and STORM-412 to let devs change the logging config for a topology on the fly. Additionally, in 0.10.0 we switched the logging framework to log4j2 with support for RFC5424 to help external tools consume the logs produced.

We are adding in resource aware scheduling through STORM-893 to let users get better utilization of their clusters and better predictability in the execution of their topologies. We are going beyond what YARN and Mesos currently can do by adding in network as a resource, which can solve huge problems on shared clusters or very large topologies. We have some groups that want to consume 6 GBytes/sec; scheduled poorly, that traffic can saturate a 40 gigabit ToR (top-of-rack) downlink. I recently talked with some people working on the Large Hadron Collider. They were exploring if Storm could be used to process the 4 TBytes/sec that they deal with. I would love to see Storm working at that scale, and I firmly believe that we will get there.
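The arithmetic behind that saturation claim is worth spelling out, since it is the whole motivation for treating network as a schedulable resource:

```python
# Why a network-oblivious scheduler can hurt: 6 GBytes/sec of topology
# traffic is 48 gigabits/sec, more than a single 40 gigabit top-of-rack
# downlink can carry. A scheduler that models network as a resource can
# spread that load across racks instead of piling it behind one link.
topology_gbytes_per_sec = 6
demand_gbits = topology_gbytes_per_sec * 8   # 48 Gbit/s
tor_downlink_gbits = 40

assert demand_gbits > tor_downlink_gbits     # one downlink saturates
```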

We are pushing back STORM-411 to let users distribute more than just a single jar to their topology with a distributed cache like API, while additionally giving them the ability to update items in the cache without bringing down a topology. This also sets the groundwork for STORM-167 to let users do a rolling upgrade of a topology.

To make it simpler for users to meet SLAs, we are exploring intelligent backpressure through STORM-886 and STORM-907, and automatic topology scaling through STORM-815 and STORM-594.
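For readers unfamiliar with the term, backpressure just means a slow downstream stage throttles a fast upstream one instead of letting unprocessed data pile up without bound. A minimal illustration in plain Python (this is not Storm's actual mechanism) is a bounded queue between producer and consumer:

```python
# Minimal backpressure illustration (plain Python, not Storm's
# implementation): a bounded queue makes a fast producer block until
# the consumer catches up, bounding in-flight data.
import queue
import threading

q = queue.Queue(maxsize=4)   # the bound is the backpressure threshold
consumed = []

def consumer():
    while True:
        item = q.get()
        if item is None:     # sentinel: producer is done
            break
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(100):
    q.put(i)                 # blocks whenever the queue is full
q.put(None)
t.join()

assert consumed == list(range(100))  # nothing lost, bounded memory
```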

We are exploring failure monitoring to detect and blacklist bad nodes through STORM-909, and plan to help with STORM-166 to ensure Nimbus is not a single point of failure.

We are also committed to making sure Storm is competitive with other streaming technologies in performance and stability through things like STORM-889.

And longer term, we plan to drive STORM-911, giving Storm the ability to run topologies on bare hardware, YARN, Mesos, or just about any other system.

We intend to up-merge to the master branch on a weekly basis. This combines with our continuous integration and continuous deployment environment so that after a feature goes into open source, we will have it deployed to a staging/development environment at scale within two weeks, and to a production environment after another few weeks. This should give end users a lot of confidence in using Storm for business critical use cases.

Yahoo loves Storm. Adoption is growing and Storm is solving real world use cases all the time.  We want to share the phenomenal success we have been seeing with the rest of the open source community.  Expect to hear more from us as our development efforts continue.

