Computer Science

Stream Processing

Stream processing is a method of data processing in which data is processed continuously, in real time, as it is generated. It relies on streaming data pipelines that can handle large volumes of data and process them in parallel. Stream processing is commonly used in applications such as financial trading, social media analytics, and IoT data processing.

Written by Perlego with AI-assistance

8 Key excerpts on "Stream Processing"

  • Parallel Computing Architectures and APIs

    IoT Big Data Stream Processing

    To tackle the challenge of processing big data, the MapReduce framework was developed by Google, whose systems need to process millions of webpages every second. Over the last five years, thanks to the availability of several open-source implementations of this framework, such as Hadoop, MapReduce has become a dominant framework for solving big data processing problems. Batch processing refers to the computational analysis of bounded data sets. In practical terms, this means that those data sets are available and retrievable as a whole from some form of storage. We know the size of the data set at the start of the computational process, and the duration of that process is limited in time.
    However, volume is not the only challenge when processing sensing data that is continuously generated; the velocity at which data is generated also needs to be tackled.
    There are a number of applications where data is expected to be ingested not only in high volume but also with high velocity, requiring real-time analytics or data processing. These applications include stock trading, network monitoring, social media–based business analytics, and so on.
    There are two fundamental aspects that have changed the way data needs to be processed, which create the need for a complete paradigm shift:
    • The size of data has evolved to the point that it has become intractable by existing data management systems.
    • The rate of change in data is so rapid that processing also needs to be done in real time.
    The limitation of the previous approach in managing and analyzing high-volume, high-velocity data in real time has led to the development of new and sophisticated distributed techniques and technologies. The processing of data with such features is defined as data Stream Processing or simply Stream Processing. Stream Processing refers to the computational analysis of unbounded data sets; it is concerned with the processing of data as it arrives at the system. Given the unbounded nature of data streams, Stream Processing systems need to run constantly for as long as the stream is delivering new data, which is theoretically forever.
    Stream Processing can be subdivided into three areas:
    1. DSMSs (data stream management systems), where online query languages have been explored.
    2. Online data processing algorithms, where the aim is to process the data in a single pass (a minimal sketch of such an algorithm follows this list).
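    What a single-pass algorithm looks like is easy to picture with a small example. The sketch below is a plain-Python illustration, not taken from the book: Welford's online method keeps a running mean and variance while reading each element of the stream exactly once and storing nothing but three scalars.

```python
class RunningStats:
    """Single-pass (online) mean and variance over an unbounded stream.

    Welford's method: each element is seen exactly once, and only
    three scalars are kept in memory.
    """

    def __init__(self):
        self.n = 0          # number of elements seen so far
        self.mean = 0.0     # running mean
        self.m2 = 0.0       # running sum of squared deviations

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# Feed the statistics object one reading at a time, as the readings
# would arrive from a sensor stream.
stats = RunningStats()
for reading in [10.0, 12.5, 9.8, 11.2]:
    stats.update(reading)
print(stats.mean, stats.variance)
```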
  • Designing Big Data Platforms

    How to Use, Deploy, and Maintain Big Data Systems

    • Yusuf Aytas (Author)
    • 2021 (Publication Date)
    • Wiley (Publisher)
    Stream Big Data processing is a vital part of a modern Big Data platform, and it can help such a platform from many different perspectives. Streaming gives the platform the ability to cleanse, filter, categorize, and analyze data while it is in motion, so we don't have to write irrelevant, low-value data to disk. With stream Big Data processing, we get a chance to respond to user interactions or events swiftly rather than waiting for longer periods. Fast loops of discovery and action can give businesses a competitive advantage. Streaming solutions bring additional agility with added risk: we can change the processing pipeline and see the results very quickly, but we also face the threat of losing data or processing it incorrectly.
    We can employ a new set of solutions with stream Big Data processing. Some of these solutions are as follows:
    • Fraud detection
    • Anomaly detection
    • Alerting
    • Real‐time monitoring
    • Instant machine learning updates
    • Ad hoc analysis of real‐time data.

    6.2 Defining Stream Data Processing

    In Chapter 5, I defined offline Big Data processing as having no time commitment to the user or to events. Stream Big Data processing, on the other hand, has to respond promptly. The delay between requests and responses can range from seconds to minutes, but there is still a commitment to respond within a specified amount of time. A stream never ends; hence, there is a commitment to responding to continuous data when processing streaming Big Data.
    Data streams are unbounded, yet most analytical functions, like sum or average, require bounded sets. Stream Processing has the concept of windowing to get bounded sets of data. Windows enable computations that would otherwise not be possible, since there is no end to a stream. For example, we can't find the average page views per visitor to a website, since there will be more visitors and more page views every second. The solution, called windowing, is to split the data into defined chunks so that we can reason about it in a bounded fashion. There are several types of windows, but the most notable ones are time and count windows, where we divide data into finite chunks by time or by count, respectively. In addition to windowing, some Stream Processing frameworks support watermarking, which helps a Stream Processing engine manage the arrival of late events. Fundamentally, a watermark defines a waiting period for an engine to consider late events. If an event arrives within the watermark, the engine recomputes the query; if the event arrives later than the watermark, the engine drops it. When dealing with the average page views problem, a better approach would be finding the average page views per visitor in the last five minutes, as shown in Figure 6.1. We can also use a count window to calculate averages after receiving x
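    The idea described above can be sketched in a few lines of plain Python. This is a conceptual illustration rather than any framework's API; the five-minute tumbling window and the sixty-second watermark delay are arbitrary choices for the example.

```python
from collections import defaultdict

WINDOW = 300          # tumbling window size in seconds (five minutes)
WATERMARK_DELAY = 60  # how long we wait for late events, in seconds

windows = defaultdict(lambda: defaultdict(int))  # window_start -> visitor -> page views
max_event_time = 0    # highest event timestamp seen so far


def watermark() -> int:
    """Anything older than this is considered too late and is dropped."""
    return max_event_time - WATERMARK_DELAY


def on_event(visitor: str, event_time: int) -> None:
    """Assign one page-view event to its five-minute tumbling window."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    if event_time < watermark():
        return  # later than the watermark: drop the event
    window_start = event_time - (event_time % WINDOW)
    windows[window_start][visitor] += 1


def avg_page_views(window_start: int) -> float:
    """Average page views per visitor for one window."""
    visitors = windows[window_start]
    return sum(visitors.values()) / len(visitors) if visitors else 0.0


on_event("alice", event_time=10)
on_event("bob", event_time=250)
on_event("alice", event_time=260)
print(avg_page_views(0))  # window [0, 300): alice has 2 views, bob has 1 -> 1.5
```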
  • The Internet of Things

    From Data to Insight

    • John Davies, Carolina Fortuna (Authors)
    • 2020 (Publication Date)
    • Wiley (Publisher)
    … [22], and their purpose is to support human understanding of, and decision making about, the phenomena under consideration. For instance, a genomics researcher has to understand the temporal behavior of specific genes to plan their research, while a power plant manager has to understand what is happening with a nuclear cooling system to be able to schedule maintenance, etc.
    Advances in stream data processing can be seen in algorithms, platforms, as well as application areas. As also explained in this section, new algorithms are being developed for compressing [10], reducing [13], summarizing [15, 17], learning, and mining [23], as well as visualizing [21] streams of data.
    With respect to supporting technology and infrastructure, several platforms such as S4, Storm, Millwheel, Samza, Spark, Flink, Heron, Summingbird, etc., exist. A good analysis of these platforms can be found in [5]. Various commercial cloud platforms also provide some stream analytics services, including for IoT: Microsoft Azure Stream Analytics, Amazon AWS Kinesis, etc.
    With respect to application areas, stream analytics is used for fraud detection; for audience, network, and traffic analysis [5]; for disaster management [24]; health [25]; smart grids [26]; and so on.

    5.3 Architectures and Languages

    We distinguish two main types of stream data processing systems. The first type is based on existing and well‐understood relational database principles. These systems are sometimes called first‐generation streaming systems [27] or data stream management systems (DSMSs). Examples include STREAM [28] and Aurora [29]. This generation of streaming systems also introduced the concept of continuous queries over infinite continuous streams of data, as opposed to the well‐known one‐time queries of traditional database management systems (DBMS) over bounded, stored datasets [28]. While adapting relational query languages such as SQL might suffice for simple continuous queries, growing complexity of queries, e.g. adding aggregation, subqueries, and windowing constructs, makes the applied semantics obscure [30]. Hence, various continuous query languages were created: CQL [30], ESL [31], Hancock [32], …. Specialized streaming systems and query languages for sensor network applications also exist [33
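    The contrast between a one-time query and a continuous query can be made concrete with a small, purely conceptual Python sketch. It mimics no particular DSMS or query language: the one-time query runs once over a stored, bounded table, while the continuous query is re-evaluated for every tuple that arrives on the stream and emits results incrementally.

```python
# One-time query over a bounded, stored table: runs once, returns a result, done.
stored_trades = [("ACME", 101.5), ("INIT", 99.2), ("ACME", 103.0)]
one_time_result = [price for symbol, price in stored_trades if symbol == "ACME"]
print(one_time_result)


# Continuous query over an unbounded stream: the same predicate is evaluated
# for every arriving tuple, and matching results are emitted as they appear.
def continuous_query(stream, symbol="ACME"):
    for sym, price in stream:  # never terminates if the stream never ends
        if sym == symbol:
            yield price        # incremental result, emitted immediately


def trade_stream():
    """Placeholder infinite source; a real stream would carry changing events."""
    while True:
        yield ("ACME", 102.0)


matches = continuous_query(trade_stream())
print(next(matches))  # results arrive one at a time, indefinitely
```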
  • Big Data Analytics with Hadoop 3

    Build highly effective analytics solutions to gain valuable insight into your big data

    Stream Processing with Apache Flink

    In this chapter, we will look at Stream Processing using Apache Flink and how the framework can be used to process data as soon as it arrives to build exciting real-time applications. We will start with the DataStream API and look at various operations that can be performed.
    We will be looking at the following:
    • Data processing using the DataStream API 
    • Transformations 
    • Aggregations 
    • Window 
    • Physical partitioning 
    • Rescaling 
    • Data sinks
    • Event time and watermarks 
    • Kafka connector 
    • Twitter connector 
    • Elasticsearch connector 
    • Cassandra connector 

    Introduction to streaming execution model

    Flink is an open source framework for distributed Stream Processing that:
    • Provides results that are accurate, even in the case of out-of-order or late-arriving data
    • Is stateful and fault tolerant, and can seamlessly recover from failures while maintaining an exactly-once application state
    • Performs on a large scale, running on thousands of nodes with very good throughput and latency characteristics
    The following diagram is a generalized view of Stream Processing:
    Many of Flink's features (state management, handling out-of-order data, flexible windowing) are essential for computing accurate results on unbounded datasets and are enabled by Flink's streaming execution model:
    • Flink guarantees exactly-once semantics for stateful computations. Stateful means that applications can maintain an aggregation or summary of data that has been processed over time, and Flink's checkpointing mechanism ensures exactly-once semantics for an application's state in the event of a failure.
    • Flink supports Stream Processing and windowing with event-time semantics. Event time makes it easy to compute accurate results over streams where events arrive out of order and where events may arrive delayed.
    • Flink supports flexible windowing based on time, count, or sessions, in addition to data-driven windows. Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns. Flink's windowing makes it possible to model the reality of the environment in which data is created.
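    The three window types named above (time, count, and session) are easy to confuse. The toy Python functions below illustrate only the assignment logic, over small bounded lists; they are not Flink's DataStream API, and the window size and session gap are arbitrary.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (event time in seconds, payload)


def tumbling_time_windows(events: List[Event], size: int):
    """Time window: group events into fixed, non-overlapping time intervals."""
    out = {}
    for t, payload in events:
        out.setdefault(t - t % size, []).append(payload)
    return out


def count_windows(events: List[Event], n: int):
    """Count window: emit a window every n events, regardless of time."""
    return [events[i:i + n] for i in range(0, len(events), n)]


def session_windows(events: List[Event], gap: int):
    """Session window: a new window starts when the gap between
    consecutive events exceeds `gap` seconds."""
    sessions, current = [], []
    for t, payload in sorted(events):
        if current and t - current[-1][0] > gap:
            sessions.append(current)
            current = []
        current.append((t, payload))
    if current:
        sessions.append(current)
    return sessions


events = [(1, "a"), (3, "b"), (10, "c"), (12, "d")]
print(tumbling_time_windows(events, size=5))
print(count_windows(events, n=2))
print(session_windows(events, gap=5))  # [[(1,'a'),(3,'b')], [(10,'c'),(12,'d')]]
```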
  • Modern Big Data Processing with Hadoop
    • V Naresh Kumar, Prashant Shindgikar (Authors)
    • 2018 (Publication Date)
    • Packt Publishing (Publisher)

    Designing Real-Time Streaming Data Pipelines

    The first three chapters of this book all dealt with batch data. Having learned about the installation of Hadoop, data ingestion tools and techniques, and data stores, let's turn to data streaming. Not only will we look at how we can handle real-time data streams, but also how to design pipelines around them.
    In this chapter, we will cover the following topics:
    • Real-time streaming concepts
    • Real-time streaming components
    • Apache Flink versus Spark
    • Apache Spark versus Storm

    Real-time streaming concepts

    Let's understand a few key concepts relating to real-time streaming applications in the following sections.

    Data stream

    A data stream is a continuous flow of data from one end to the other: from sender to receiver, from producer to consumer. The speed and volume of the data may vary; it may be 1 GB of data per second, or it may be 1 KB per second or per minute.
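    That flow can be pictured as nothing more than a producer handing records to a consumer, one at a time. The hypothetical Python sketch below uses a generator as the producer; the record fields and rate are placeholders.

```python
import random
import time


def producer():
    """Sender: emits one reading per iteration, at whatever rate the source produces."""
    while True:
        yield {"sensor_id": 42, "value": random.random(), "ts": time.time()}


def consumer(stream, limit=5):
    """Receiver: processes each record as soon as it arrives."""
    for i, record in enumerate(stream):
        print(record)
        if i + 1 >= limit:  # stop after a few records so the example terminates
            break


consumer(producer())
```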

    Batch processing versus real-time data processing

    In batch processing, data is collected in batches and each batch is sent for processing. The batch interval can be anything from one day to one minute. In today's data analytics and business intelligence world, data will not be processed in batches of more than one day; otherwise, business teams would have no insight into what is happening to the business on a day-to-day basis. For example, the enterprise data warehousing team may collect all the orders made during the last 24 hours and send all these collected orders to the analytics engine for reporting.
    The batch interval can also be as short as one minute. In the Spark framework (we will learn about Spark in Chapter 7, Large-Scale Data Processing Frameworks), data is processed in micro-batches.
    In real-time processing, data (an event) is transferred (streamed) from the producer (sender) to the consumer (receiver) as soon as the event is produced at the source. For example, on an e-commerce website, an order is processed by the analytics engine as soon as the customer places it on the website. The advantage is that the company's business team gets full insight into its business in real time (within milliseconds), which helps them adjust their promotions to increase revenue, all in real time.
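    The contrast above can be reduced to a few lines of plain Python. This is a conceptual sketch, not a framework example; the order values are toy data.

```python
orders = [120.0, 35.5, 99.9, 12.0]  # a day's worth of order values (toy data)


# Batch processing: collect first, process later; insight arrives once the batch closes.
def batch_revenue(collected_orders):
    return sum(collected_orders)


# Real-time processing: update the figure as each order event arrives.
def streaming_revenue(order_events):
    total = 0.0
    for amount in order_events:
        total += amount
        yield total  # the business sees the running total immediately


print(batch_revenue(orders))            # one number, once the batch is complete
print(list(streaming_revenue(orders)))  # an updated result after every single order
```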
  • Intelligence at the Edge

    Using SAS with the Internet of Things (Hardcover edition)

    • Michael Harvey (Author)
    • 2020 (Publication Date)
    • SAS Institute (Publisher)
    The answer: SAS Event Stream Processing. It enables you to process and analyze continuously flowing real-world events in real time. Events arrive through high-throughput, low-latency data flows called event streams. These data flows are generated by occurrences such as sensor readings or market data. Each event within an event stream can be represented as a data record that consists of any number of fields. For example, an event generated by a pressure sensor could include two fields: a pressure reading and a timestamp. A more complex financial trade event could include multiple fields for transaction type, shares traded, price, broker, seller, stock symbol, timestamp, and so on. SAS Event Stream Processing can process the pressure data or the trades at any given moment. It can alert you to events of interest the instant that they occur.
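    What such an event record might look like is easy to sketch in plain Python. SAS Event Stream Processing defines its own schemas and processing models; the dataclass, field names, and threshold below are purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class PressureEvent:
    pressure: float   # pressure reading from the sensor
    timestamp: float  # when the reading was taken (epoch seconds)


PRESSURE_LIMIT = 8.5  # illustrative threshold, not a real specification


def on_event(event: PressureEvent) -> None:
    """Process each event as it streams in; alert the instant it is of interest."""
    if event.pressure > PRESSURE_LIMIT:
        print(f"ALERT: pressure {event.pressure} at {event.timestamp}")


on_event(PressureEvent(pressure=9.1, timestamp=1_700_000_000.0))
```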
    Innovations in technology have enabled the reduction of the cost and size of sensors. Now sensors can be readily deployed within industrial equipment and consumer products. The number of sensors available has exploded, and a large portion of these sensors are now connected through the internet. The deluge of resulting data streams is often called Big Data. The Internet of Things (IoT) attaches a plethora of devices, sensors, and objects in our world to the internet. Big Data is collected and processed in real time from these “things.”
    SAS Event Stream Processing processes real-world data as it is generated. This instantly processed data is called streaming data. Processing streaming data introduces a paradigm shift from the traditional approach, where data is captured and stored in a database. After an event from an event stream is processed, it can be stored or discarded. Subsequent results of event stream data processing can also be stored and explored.
    When time-sensitivity is important, processing streaming data at the point of generation is critical. For example, suppose that you are using sensing devices to track a customer who is browsing products at a retail establishment or online. Based on customer or product location (in real space or cyberspace), a system processing streaming data can generate an offer in real time to entice a purchase. An application that uses data at rest is not nimble enough to make these suggestions. Another example, also involving sensing devices, is the real-time tracking of vibrations in airliner engines. When anomalous patterns are detected (perhaps as the result of a bird impact), pilots can be alerted immediately so that they can take corrective action. Catastrophic failure can be avoided.
  • Designing Cloud Data Platforms
    Real-Time Processing

    As we discussed in chapter 3, the processing layer, highlighted in figure 6.1, is the heart of the data platform implementation. This is where all the required business logic is applied and all the data validations and data transformations take place. The processing layer also plays an important role in providing ad hoc access to the data in the data platform.
    Figure 6.1 The processing layer is where business logic is applied and all data validations and data transformations take place, as well as where ad hoc access to data is provided.
    So far in this book, we’ve used data processing and analytics scenarios focused on batch data processing. In these scenarios, we assumed that data can be extracted from the source system at regular intervals or that it naturally arrives in the form of files that need to be processed. Batch is not the only way data can be delivered and analyzed in our cloud data platform. You may have already heard the term “real-time data processing,” and in this chapter, we will explore this form of processing and its common use cases. Let’s start with some definitions and use cases.
    When people use the terms “real-time” or “streaming” in the context of a data platform, it can mean different things to different people, and it is relevant in two layers of a data platform: the ingestion layer and the processing layer. Real-time or streaming ingestion takes place when you have a pipeline that streams data, one message at a time, from a source into a destination such as storage or the data warehouse or both. While the term real-time processing isn’t clearly defined anywhere, much of the available product documentation, blog posts, and books use the term to refer to straightforward data transformations applied to streaming data
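    Streaming ingestion as defined above can be sketched in a few lines of Python. Nothing here assumes a particular cloud service: the local file is a stand-in for object storage, and the in-memory list stands in for a warehouse table.

```python
import json


def ingest(source, storage_path="events.jsonl"):
    """Streaming ingestion: move messages one at a time from a source to destinations."""
    warehouse_table = []                      # stand-in for a data warehouse table
    with open(storage_path, "a") as storage:  # stand-in for cloud object storage
        for message in source:
            storage.write(json.dumps(message) + "\n")  # raw copy kept in storage
            warehouse_table.append(message)            # loaded row, ready for querying
    return warehouse_table


rows = ingest([{"user": "a", "action": "click"}, {"user": "b", "action": "view"}])
print(rows)
```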
  • Mastering Hadoop 3

    Big data processing at scale to unlock unique business insights

    • Chanchal Singh, Manish Kumar (Authors)
    • 2019 (Publication Date)
    • Packt Publishing (Publisher)

    Real-Time Stream Processing in Hadoop

    All industries have started adopting big data technology, as they have seen the advantages that companies are gaining after implementing it into their existing business model. Traditionally, companies were more focused on batch job implementation, and there has always been a lag of several minutes, or sometimes hours, between the arrival of data and it being displayed to the user. This leads to a delay in decision making, which in turn leads to revenue loss. This is where real-time analytics comes into the picture.
    Real-time analytics is a methodology in which data is processed immediately after the system receives it, and the processed data becomes available for use. Spark Streaming helps achieve such objectives very efficiently (a minimal sketch appears after the topic list below). This chapter will cover a brief introduction to the following topics:
    • Spark Streaming
    • Integration of Apache Kafka with Apache Spark Streaming
    • Common stream data patterns
    • Streaming design considerations
    • Case studies
    This chapter is intended to help all developers and business analysts to understand the overall integration strategy, the advantages of different integration APIs, and what scenarios to keep in mind during project implementation.
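    As a first taste of this kind of processing, here is a minimal word-count sketch using PySpark Structured Streaming. It is an illustration, not code from the chapter; the socket source and port are placeholders (feed it text with `nc -lk 9999`).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read lines from a local socket as an unbounded streaming DataFrame.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the updated counts to the console after every micro-batch.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```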

    Technical requirements

    You will be required to have basic knowledge of Linux and Apache Hadoop 3.0.
    The code files of this chapter can be found on GitHub: https://github.com/PacktPublishing/Mastering-Hadoop-3/tree/master/Chapter09
    Check out the following video to see the code in action: http://bit.ly/2T3yYfz

    What are streaming datasets?

    Streaming datasets involve data processing not on bounded data, but on unbounded data. Typical datasets are bounded, which means they are complete; at the very least, you process the data as if it were complete. Realistically, we know that there will always be new data, but as far as data processing is concerned, we treat the dataset as if it were complete. In the case of bounded data, data processing is done in phases, and until one phase is complete, the next phase does not start. Another way to think about bounded data processing is that we will be done analyzing the data before new data comes in. Bounded datasets are finite in size. The following diagram represents how bounded data is processed using a typical MapReduce batch processing engine:
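    The phased, bounded processing the diagram depicts can be sketched in a few lines of Python: a MapReduce-style word count in which each phase starts only after the previous phase has seen the complete dataset. The toy documents are placeholders.

```python
from collections import defaultdict

documents = ["stream processing", "batch processing", "stream analytics"]  # bounded dataset

# Map phase: runs over the *complete* dataset before anything else starts.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/group phase: starts only once the map phase has finished.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce phase: starts only once grouping is complete.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)  # {'stream': 2, 'processing': 2, 'batch': 1, 'analytics': 1}
```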