码迷,mamicode.com
首页 > 其他好文 > 详细

Spark Streaming

时间:2020-07-08 13:26:09      阅读:59      评论:0      收藏:0      [点我收藏+]

标签:mina   ide   boa   suse   __name__   tom   bin   pad   upd   

Concept

http://spark.apache.org/streaming/

Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.

 

Ease of Use

Build applications through high-level operators.

Spark Streaming brings Apache Spark‘s language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python.

 

Fault Tolerance

Stateful exactly-once semantics out of the box.

Spark Streaming recovers both lost work and operator state (e.g. sliding windows) out of the box, without any extra code on your part.


Spark Integration

Combine streaming with batch and interactive queries.

By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state. Build powerful interactive applications, not just analytics.

 

Deployment Options

Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. You can also define your own custom data sources.

You can run Spark Streaming on Spark‘s standalone cluster mode or other supported cluster resource managers. It also includes a local run mode for development. In production, Spark Streaming uses ZooKeeper and HDFS for high availability.

 

 

Overview

http://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.

 

技术图片

 

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

技术图片

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

This guide shows you how to start writing Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2), all of which are presented in this guide. You will find tabs throughout this guide that let you choose between code snippets of different languages.

 

Example

http://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example

http://dblab.xmu.edu.cn/blog/1749-2/

from __future__ import print_function
 
import sys
 
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
 
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: stateful_network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingStatefulNetworkWordCount")
    ssc = StreamingContext(sc, 1)
    ssc.checkpoint("file:///usr/local/spark/mycode/streaming/")
 
    # RDD with initial state (key, value) pairs
    initialStateRDD = sc.parallelize([(uhello, 1), (uworld, 1)])
 
    def updateFunc(new_values, last_sum):
        return sum(new_values) + (last_sum or 0)
 
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    running_counts = lines.flatMap(lambda line: line.split(" "))                          .map(lambda word: (word, 1))                          .updateStateByKey(updateFunc, initialRDD=initialStateRDD)
 
    running_counts.pprint()
 
    ssc.start()
    ssc.awaitTermination()
 
 

 

Spark Streaming

标签:mina   ide   boa   suse   __name__   tom   bin   pad   upd   

原文地址:https://www.cnblogs.com/lightsong/p/13266293.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!