2，StructuredStreaming的事件时间和窗口操作

时间：2018-09-11 14:12:20 阅读：424 评论：0 收藏：0 [点我收藏+]

标签：term trigger imp 例子情况 str net structure roc

推荐阅读：1，StructuredStreaming简介

使用Structured Streaming基于事件时间的滑动窗口的聚合操作是很简单的，很像分组聚合。在一个分组聚合操作中，聚合值被唯一保存在用户指定的列中。在基于窗口的聚合的情况下，对于行的事件时间的每个窗口，维护聚合值。

如前面的例子，我们运行wordcount操作，希望以10min窗口计算，每五分钟滑动一次窗口。也即，12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20 这些十分钟窗口中进行单词统计。12:00 - 12:10意思是在12:00之后到达12:10之前到达的数据，比如一个单词在12:07收到。这个单词会影响12:00 - 12:10, 12:05 - 12:15两个窗口。

结果表将如下所示。

技术分享图片

import org.apache.spark.sql.streaming.Trigger
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._

val lines=spark.readStream.format("socket").option("host", "127.0.0.1").option("port", 9999).option("includeTimestamp", true).load()

val words=lines.as[(String, Timestamp)].flatMap(line=>line._1.split(" ").map(word=> (word, line._2))).toDF("word", "timestamp")
val windowedCounts=words.withWatermark("timestamp", "30 seconds").groupBy(window($"timestamp", "30 seconds", "15 seconds"), $"word").count()
val query=windowedCounts.writeStream.outputMode("Append").format("console").trigger(Trigger.ProcessingTime(5000)).option("truncate", "false").start()
query.awaitTermination()

推荐阅读：

Spark Structured Streaming高级特性

Spark Streaming 中管理 Kafka Offsets 的几种方式

2，StructuredStreaming的事件时间和窗口操作

标签：term trigger imp 例子情况 str net structure roc

原文地址：https://www.cnblogs.com/wangfengxia/p/9626874.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行