
Apache Flink

Apache Flink is an open-source, distributed computing system that processes large amounts of data in real time. It is designed to support batch processing, stream processing, and graph processing. Flink provides a unified programming model for all these processing modes and can run on various platforms, including Hadoop, Kubernetes, and standalone clusters.

Written by Perlego with AI assistance

8 Key excerpts on "Apache Flink"

  • Data Lake for Enterprises
    • Tomcy John, Pankaj Misra (Authors)
    • 2017 (Publication Date)
    • Packt Publishing (Publisher)
    Figure 03: Technology mapping for Data Ingestion Layer
    In line with our use case of SCV, the data from the messaging layer is taken in by this layer, enriched and transformed accordingly, and passed on to the Lambda Layer. We might also pass this data to the Data Storage Layer for persistence.
    In this layer there might be other technologies, such as Kafka Consumer, Flume, and so on, to take care of certain aspects in the real working example of SCV. Part 3 will bring these technologies together so that a clear SCV is derived for enterprise use.

    What is Apache Flink?

    Apache Flink is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. - flink.apache.org
    Apache Flink is a community-driven open source framework for distributed big data analytics, like Hadoop and Spark. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs.
    - Wikipedia
    Apache's definition of Flink is fairly easy to understand, while the second part of Wikipedia's definition is harder to follow. For the time being, just understand that Flink brings a unified programming model for handling stream and batch data using one technology.
    This chapter in no way covers the workings of Apache Flink comprehensively; Apache Flink is a topic that could span an entire book by itself.
    However, without going into too much detail, it tries to cover many aspects of this awesome tool. We will skim through some of the core aspects and also give you enough information to actually use Flink in your Data Lake implementation.
  • Big Data Analytics with Hadoop 3

    Build highly effective analytics solutions to gain valuable insight into your big data

    Stream Processing with Apache Flink

    In this chapter, we will look at stream processing using Apache Flink and how the framework can be used to process data as soon as it arrives to build exciting real-time applications. We will start with the DataStream API and look at various operations that can be performed.
    We will be looking at the following:
    • Data processing using the DataStream API 
    • Transformations 
    • Aggregations 
    • Window 
    • Physical partitioning 
    • Rescaling 
    • Data sinks
    • Event time and watermarks 
    • Kafka connector 
    • Twitter connector 
    • Elasticsearch connector 
    • Cassandra connector 
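    To give a feel for the DataStream API covered by the topics above, the following is a minimal Scala sketch (not taken from the book) that chains a source, a few transformations, a keyed rolling aggregation, and a print sink; the event names and values are purely illustrative.

    import org.apache.flink.streaming.api.scala._

    object DataStreamBasics {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Small in-memory source standing in for a real stream (Kafka, sockets, files, ...)
        val events: DataStream[(String, Int)] = env.fromElements(
          ("clicks", 3), ("views", 10), ("clicks", 1), ("views", 4))

        events
          .filter(_._2 > 0)                    // transformation: drop empty events
          .map(e => (e._1.toUpperCase, e._2))  // transformation: per-element mapping
          .keyBy(_._1)                         // logical partitioning by key
          .sum(1)                              // rolling aggregation per key
          .print()                             // data sink: write results to stdout

        env.execute("DataStream basics")
      }
    }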

    Introduction to streaming execution model

    Flink is an open source framework for distributed stream processing that:
    • Provides results that are accurate, even in the case of out-of-order or late-arriving data
    • Is stateful and fault tolerant, and can seamlessly recover from failures while maintaining an exactly-once application state
    • Performs on a large scale, running on thousands of nodes with very good throughput and latency characteristics
    The following diagram is a generalized view of stream processing:
    Many of Flink's features (state management, handling out-of-order data, flexible windowing) are essential for computing accurate results on unbounded datasets and are enabled by Flink's streaming execution model:
    • Flink guarantees exactly-once semantics for stateful computations. Stateful means that applications can maintain an aggregation or summary of data that has been processed over time, and Flink's checkpointing mechanism ensures exactly-once semantics for an application's state in the event of a failure.
    • Flink supports stream processing and windowing with event-time semantics. Event time makes it easy to compute accurate results over streams where events arrive out of order and where events may arrive delayed.
    • Flink supports flexible windowing based on time, count, or sessions, in addition to data-driven windows. Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns. Flink's windowing makes it possible to model the reality of the environment in which data is created.
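    As a rough illustration of these three points together, here is a hedged Scala sketch (not from the book, written against the Flink 1.x-era DataStream API) that enables checkpointing for exactly-once state, assigns event-time timestamps and watermarks to handle out-of-order data, and sums readings in tumbling event-time windows; the SensorReading type and the sample data are assumptions made for the example.

    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    case class SensorReading(sensorId: String, timestamp: Long, value: Double)

    object EventTimeWindowing {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // use event time, not arrival time
        env.enableCheckpointing(10000)                                // checkpoint state every 10 s for exactly-once recovery

        // Bounded in-memory source standing in for a real stream; note the out-of-order element
        val readings: DataStream[SensorReading] = env.fromElements(
          SensorReading("s1", 1000L, 21.0),
          SensorReading("s1", 4000L, 22.5),
          SensorReading("s1", 2000L, 20.1))

        readings
          // Watermarks that tolerate events arriving up to 5 s late
          .assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(5)) {
              override def extractTimestamp(r: SensorReading): Long = r.timestamp
            })
          .keyBy(_.sensorId)
          .window(TumblingEventTimeWindows.of(Time.seconds(10))) // 10 s tumbling event-time windows
          .reduce((a, b) => SensorReading(a.sensorId, math.max(a.timestamp, b.timestamp), a.value + b.value))
          .print()

        env.execute("Event-time windowing with checkpointing")
      }
    }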
  • Mastering Hadoop 3

    Big data processing at scale to unlock unique business insights

    • Chanchal Singh, Manish Kumar (Authors)
    • 2019 (Publication Date)
    • Packt Publishing (Publisher)
    We have a huge number of data processing tools available in the market. Most of them are open source and a few of them are commercial. The question is, how many processing tools or engines do we need? Can't we have just one processing framework that can fulfill the processing requirements of each and every use case, even though they have different processing patterns? Apache Spark was built for the purpose of solving these problems and came up with a unified system architecture where use cases ranging from batch and near-real-time processing to machine learning models, and so on, can be solved using the rich Spark API.
    Apache Spark was not suitable for real-time processing use cases where event-by-event processing is needed. Apache Flink came up with a few new design models to solve similar problems to those Spark was addressing, in addition to providing real-time processing capability.
    Apache Flink is an open source distributed processing framework for stream and batch processing. The dataflow engine is the core of Flink and provides capabilities such as distribution, communication, and fault tolerance. Flink is very similar to Spark but has APIs for custom memory management, real-time data processing, and so on, which makes it a little different from Spark, which works on micro-batches instead of in real time.

    Flink architecture

    Like other distributed processing engines, Apache Flink also follows a master-slave architecture. The Job manager is the master and the Task managers are the worker processes. The following diagram shows the Apache Flink architecture:
    • Job manager: The Job manager is the master process of the Flink cluster and works as a coordinator. The responsibility of the Job manager is not only to manage the entire life cycle of the dataflow but also to track the progress and state of each stream and operator. It also coordinates dataflow execution in distributed environments. The Job manager maintains the checkpoint metadata in fault-tolerant storage systems so that if an active Job manager goes down, the standby Job manager
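    To show where a client program fits into this master-slave picture, here is a small illustrative Scala sketch (not from the book): the client builds the dataflow and submits it to the Job manager, which schedules the operators onto Task manager slots. The host name, port, and jar path are assumptions made for the example; 6123 is the classic default Job manager RPC port.

    import org.apache.flink.streaming.api.scala._

    object RemoteSubmit {
      def main(args: Array[String]): Unit = {
        // Connect to a running Job manager; the jar carries the user code shipped to the Task managers
        val env = StreamExecutionEnvironment.createRemoteEnvironment(
          "jobmanager-host",            // Job manager address (assumed)
          6123,                         // Job manager RPC port (classic default)
          "/path/to/streaming-job.jar") // user-code jar (assumed path)
        env.setParallelism(4)           // each parallel subtask occupies a Task manager slot

        env.fromElements(1, 2, 3, 4, 5)
          .map(_ * 2)
          .print()

        env.execute("Remote job submission")
      }
    }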
  • Parallel Computing Architectures and APIs

    IoT Big Data Stream Processing

    However, maintaining a Lambda system was a hassle to the extent that it entailed building, provisioning, and maintaining two independent versions of the processing pipelines that had to be reconciled at the end. This begged for a unified solution: a functioning stream processing system with the ability to effectively provide “batch processing” functionality on demand. Apache Flink is a noteworthy example (see Subsection 21.1.2.3 ‘Stream Processing Platforms/Engine’).

    20.1.2 Data Stream Processing

    Due to different challenges and requirements posed by different application domains, several open-source platforms have emerged for real-time data stream processing. Although different platforms share the concept of handling data as continuous unbounded streams, and process it immediately as the data is collected, they follow different architectural models and offer different capabilities and features. Each platform offers very specific special features that make its architecture unique, and some features make a stream processing platform more applicable than others for different scenarios.
    A taxonomy is derived after studying different open-source platforms, including data stream management systems (DSMSs), CEP systems, and stream processing systems such as Storm.
    20.1.2.1 Data Stream Processing Systems
    The aspects underlying stream processing platforms or engines consist of distributed computing, parallel computing, and message passing. Dataflow has always been the core element of stream processing systems. “Data streams” or “streams” in the stream processing context refers to the infinite dataflow within the system. Stream processing platforms are designed to run on top of distributed and parallel computing technologies such as clusters to process real-time streams of data. The cluster computing technology allows a collection of connected computers to work together, providing adequate computing power for the processing of large data sets. Such a feature is so important for stream processing platforms that clusters have become one of their essential modules for processing high-velocity big data.
  • Apache Spark 2: Data Processing and Real-Time Analytics

    Master complex big data processing, stream analytics, and machine learning with Apache Spark

    • Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei (Authors)
    • 2018 (Publication Date)
    • Packt Publishing (Publisher)

    Apache Spark Streaming

    The Apache Spark Streaming module is a stream processing-based module within Apache Spark. It uses the Spark cluster to offer the ability to scale to a high degree. Being based on Spark, it is also highly fault tolerant, having the ability to rerun failed tasks by checkpointing the data stream that is being processed. The following topics will be covered in this chapter after an introductory section, which will provide a practical overview of how Apache Spark processes stream-based data:
    • Error recovery and checkpointing
    • TCP-based stream processing
    • File streams
    • Kafka stream source
    For each topic, we will provide a worked example in Scala and show how the stream-based architecture can be set up and tested.

    Overview

    The following diagram shows potential data sources for Apache Spark Streaming, such as Kafka, Flume, and HDFS:
    These feed into the Spark Streaming module and are processed as Discrete Streams. The diagram also shows that other Spark module functionality, such as machine learning, can be used to process stream-based data.
    The fully processed data can then be output to HDFS, databases, or dashboards. This diagram is based on the one at the Spark Streaming website, but we wanted to extend it to express the Spark module functionality:
    When discussing Spark Discrete Streams, the previous figure, taken from the Spark website at http://spark.apache.org/, is the diagram that we would like to use.
    The green boxes in the previous figure show the continuous data stream sent to Spark being broken down into a Discrete Stream (DStream).
    A DStream is nothing other than an ordered set of RDDs. Therefore, Apache Spark Streaming is not real streaming, but micro-batching. The size of the RDDs backing the DStream determines the batch size. This way, DStreams can make use of all the functionality provided by RDDs, including fault tolerance and the capability of spilling to disk. The size of each element in the stream is then based on a batch time, which might be two seconds.
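    To make the micro-batching idea concrete, here is a minimal Scala sketch (not from the book) that creates a StreamingContext with a two-second batch interval, so every two seconds of input becomes one RDD in the DStream; the local master, socket source, and checkpoint directory are assumptions made for the example.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
        // Every 2 seconds of received data becomes one RDD in the DStream
        val ssc = new StreamingContext(conf, Seconds(2))
        ssc.checkpoint("/tmp/spark-checkpoint") // enables recovery and stateful operations

        // Source: lines from a TCP socket (e.g. started with `nc -lk 9999`)
        val lines = ssc.socketTextStream("localhost", 9999)

        val counts = lines
          .flatMap(_.split("\\s+")) // split each line into words
          .map(word => (word, 1))
          .reduceByKey(_ + _)       // per-batch word counts

        counts.print()              // sink: print the first elements of each batch

        ssc.start()                 // start receiving and processing
        ssc.awaitTermination()      // run until stopped
      }
    }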
  • Recent Advances in Security, Privacy, and Trust for Internet of Things (IoT) and Cyber-Physical Systems (CPS)
    • Kuan-Ching Li, Brij B. Gupta, Dharma P. Agrawal (Authors)
    • 2020 (Publication Date)
    IDC.com, 2019). Almost all IoT devices generate data continuously in streams, so live stream data-processing techniques have become a critical part of big data-processing techniques. Today, there are a lot of live stream processing frameworks, such as Apache Spark, Apache Storm, Apache Samza, Apache Flink, and Amazon Kinesis Streams. Each one has different pros and cons, but Apache Spark is the most standout framework because it is a mature framework with a large community, has been proven in a lot of real projects, and readily supports SQL querying (Nasiri et al., 2019). In Apache Spark, there are two APIs that support live stream data processing: Spark Streaming and Structured Streaming.
    • Spark Streaming is an old API. It processes data streams using DStreams built on RDDs.
    • Structured Streaming is a new API introduced in Apache Spark 2.x. This API processes structured data streams with relational queries on Datasets and DataFrames.

    11.3.4.2 Spark Streaming API

    Spark Streaming enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be received from many sources like Kafka, Flume, Kinesis, Twitter, TCP sockets, and can be processed using functions like map, reduce, join, and window. Finally, processed data can be pushed out to file systems, databases, and live dashboards. Figure 11.3 shows the Spark Streaming Model. In the real world, you can apply Spark's machine learning and graph-processing algorithms on data streams (Spark Streaming Programming Guide, 2019); (Kienzler, 2017).
    FIGURE 11.3
    The Spark Streaming model. (Source: https://spark.apache.org).
    Internally, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Figure 11.4 shows the inner workings of Spark Streaming.
    FIGURE 11.4
    The inner workings of Spark Streaming. (Source: https://spark.apache.org)
  • Machine Learning with Apache Spark Quick Start Guide

    Uncover patterns, derive actionable insights, and learn from big data using MLlib

    stream processing engines available to allow us to do this, including, but not limited to, the following:
    • Apache Spark: https://spark.apache.org/
    • Apache Storm: http://storm.apache.org/
    • Apache Flink: https://flink.apache.org/
    • Apache Samza: http://samza.apache.org/
    • Apache Kafka (via its Streams API): https://kafka.apache.org/documentation/
    • KSQL: https://www.confluent.io/product/ksql/
    Though a detailed comparison of the available stream processing engines is beyond the scope of this book, you are encouraged to explore the preceding links and study the differing architectures available. For the purposes of this chapter, we will be using Apache Spark's Structured Streaming engine as our stream processing engine of choice.

    Streaming using Apache Spark

    At the time of writing, there are two stream processing APIs available in Spark:
    • Spark Streaming (DStreams): https://spark.apache.org/docs/latest/streaming-programming-guide.html
    • Structured Streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

    Spark Streaming (DStreams)

    Spark Streaming (DStreams) extends the core Spark API and works by dividing real-time data streams into input batches that are then processed by Spark's core API, resulting in a final stream of processed batches, as illustrated in Figure 8.2. A sequence of RDDs forms what is known as a discretized stream (or DStream), which represents the continuous stream of data:
    Figure 8.2: Spark Streaming (DStreams)

    Structured Streaming

    Structured Streaming, on the other hand, is a newer and highly optimized stream processing engine built on the Spark SQL engine in which streaming data can be stored and processed using Spark's Dataset/DataFrame API (see Chapter 1, The Big Data Ecosystem). As of Spark 2.3, Structured Streaming offers the ability to process data streams using both micro-batch processing, with latencies as low as 100 milliseconds, and continuous processing, with latencies as low as 1 millisecond (thereby providing true
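    The two latency modes described above correspond to the trigger chosen for a streaming query. Below is a hedged Scala sketch (not from the text) contrasting a micro-batch trigger with the experimental continuous trigger introduced in Spark 2.3; the rate source, console sink, and local master are illustrative assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object TriggerModes {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("TriggerModes")
          .master("local[2]")
          .getOrCreate()

        // Built-in "rate" source: generates rows with a timestamp and a monotonically increasing value
        val stream = spark.readStream
          .format("rate")
          .option("rowsPerSecond", "10")
          .load()

        val query = stream.writeStream
          .format("console")
          // Micro-batch mode: plan a new batch roughly every 100 ms
          .trigger(Trigger.ProcessingTime("100 milliseconds"))
          // Continuous mode (experimental, ~1 ms latency) would instead use:
          //   .trigger(Trigger.Continuous("1 second"))  // 1 s checkpoint interval
          .start()

        query.awaitTermination()
      }
    }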
  • Modern Big Data Processing with Hadoop
    • V Naresh Kumar, Prashant Shindgikar (Authors)
    • 2018 (Publication Date)
    • Packt Publishing (Publisher)
    Any real-time application is expected to be available all the time with no stoppage whatsoever. The event collection, processing, and storage components should be configured with the underlying assumption of high availability. Any failure of any component will cause major disruption to the running of the business. For example, in a credit card fraud detection application, all the fraudulent transactions need to be declined. If the application stops midway and is unable to decline fraudulent transactions, then it will result in heavy losses.

    Low latency

    In any real-time application, the event should flow from source to target in a few milliseconds. The source collects the event, and a processing framework moves the event to its target data store, where it can be analyzed further to find trends and patterns. All of this should happen in real time; otherwise, it may impact business decisions. For example, in a credit card fraud detection application, all incoming transactions should be analyzed to find possible fraudulent transactions, if any. If the stream processing takes more than the desired period of time, these transactions may pass through the system, causing heavy losses to the business.

    Scalable processing frameworks

    Hardware failure may cause disruption to the stream processing application. To avoid this common scenario, we always need a processing framework that offers built-in APIs to support continuous computation, fault-tolerant event state management, checkpointing in the event of failures, in-flight aggregations, windowing, and so on. Fortunately, all the recent Apache projects such as Storm, Spark, Flink, and Kafka support all of these features and more out of the box. Developers can use these APIs from Java, Python, and Scala.

    Horizontal scalability

    The stream-processing platform should support horizontal scalability. That means adding more physical servers to the cluster in the event of a higher incoming data load in order to maintain the throughput SLA. This way, processing performance can be increased by adding more nodes rather than by adding more CPUs and memory to the existing servers, which is called vertical scalability.

    Storage

    The preferable format for a stream is the key-value pair. This format is very well represented by the JSON and Avro formats. The preferred storage for persisting key-value data is a NoSQL data store such as HBase or Cassandra. There are around 100 open source NoSQL databases on the market these days. It is very challenging to choose the right database, one which supports the storage of real-time events, because all these databases offer some unique features for data persistence. A few examples are schema agnosticism, high distributability, commodity hardware support, data replication, and so on.