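I. Introduction
A watermark is a mechanism for measuring the progress of event time; it is a hidden attribute carried along with the data itself. Data processed in event time usually carries its own timestamp. Watermarks are used to handle out-of-order events, and handling them correctly is usually done by combining the watermark mechanism with windows (https://blog.csdn.net/qq_19968255/article/details/108911958).
In stream processing, an event travels from where it is produced, through the source, and on to the operators, and this takes time. In most cases the data reaches an operator in the order in which the events occurred, but network delays, backpressure (a short load spike during which the system receives data far faster than it can process it), and similar causes can produce out-of-order events (also called late elements).
For late elements we cannot wait indefinitely; there must be a mechanism that guarantees the window is triggered and computed after a certain point. This is where the watermark comes in: when a watermark arrives, it declares that all data with timestamps earlier than the watermark is considered to have arrived (even though delayed data may still show up afterwards).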
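II. Understanding
Consider a stream of timestamped events that, for some reason, do not arrive in order. The numbers in the figure are the timestamps at which the events occurred: the first event to arrive occurred at time 4, and it is followed by an event from an earlier time (2).
Understanding 1
The first element in the stream has timestamp 4, but we cannot simply emit it as the first sorted element: the data arrives out of order, so an earlier element may not have arrived yet. In fact, since we can see a bit of this stream's future, we know that our sorting operator has to wait at least until element 2 arrives before emitting a result.
Buffering necessarily introduces latency.
Understanding 2
The application first sees the element with time 4 and then the one with time 2. Will there be an element earlier than 2? Maybe, maybe not. We could keep waiting, but then we might block forever.
The stream has to be emitted after some bounded interval.
Understanding 3
This is the role of watermarks: they define when to stop waiting for earlier data.
Understanding 4
There are different strategies for generating watermarks.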
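Every event arrives after some delay, and these delays vary considerably, so some events are delayed more than others. A simple approach is to assume that the delay never exceeds some maximum. Flink calls this the "bounded-out-of-orderness" strategy. There are more sophisticated ways to generate watermarks, but for most applications a fixed delay is good enough.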
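To make the fixed-delay idea concrete, here is a minimal plain-Scala sketch. It does not use the Flink API: `BoundedLatenessTracker` and `BoundedLatenessDemo` are hypothetical names chosen for illustration, and the sample timestamps (4, 2, 7, 5) and delay (3) are assumed values. The watermark is simply the largest timestamp seen so far minus the assumed maximum delay.

```scala
// Plain-Scala model of the bounded-out-of-orderness idea.
// BoundedLatenessTracker is a hypothetical name, not a Flink class.
final class BoundedLatenessTracker(maxDelay: Long) {
  // Largest event timestamp observed so far.
  private var maxSeenTs: Long = Long.MinValue + maxDelay // offset avoids underflow

  def onEvent(eventTs: Long): Unit =
    maxSeenTs = math.max(maxSeenTs, eventTs)

  // "Nothing older than this is expected any more."
  def currentWatermark: Long = maxSeenTs - maxDelay
}

object BoundedLatenessDemo {
  // Feed events in arrival order and return the watermark after each one.
  def watermarksFor(arrivals: List[Long], maxDelay: Long): List[Long] = {
    val tracker = new BoundedLatenessTracker(maxDelay)
    arrivals.map { ts =>
      tracker.onEvent(ts)
      tracker.currentWatermark
    }
  }

  def main(args: Array[String]): Unit =
    println(watermarksFor(List(4L, 2L, 7L, 5L), 3L)) // List(1, 1, 4, 4)
}
```

Note that the watermark never moves backwards: the out-of-order element 2 leaves it at 1, and only the larger timestamp 7 advances it.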
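To build something like a sorting stream application, you can use Flink's ProcessFunction. It gives access to event-time timers (callbacks triggered by watermarks) and to a managed state interface that can be used to buffer data.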
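The buffering-and-release logic such a sorting operator needs can be sketched without Flink. The following is a simplified model, not a ProcessFunction: `WatermarkSortDemo` is a hypothetical name, the watermark is recomputed after every element with an assumed fixed delay, and elements that arrive behind the watermark are simply dropped.

```scala
import scala.collection.mutable

object WatermarkSortDemo {
  // Buffers out-of-order timestamps and releases them in order once the
  // watermark (max seen timestamp minus maxDelay) has reached them.
  def sortWithWatermark(arrivals: Seq[Long], maxDelay: Long): List[Long] = {
    // Min-heap: the smallest buffered timestamp is dequeued first.
    val buffer = mutable.PriorityQueue.empty[Long](Ordering[Long].reverse)
    val out = mutable.ArrayBuffer.empty[Long]
    var maxSeen = Long.MinValue
    var watermark = Long.MinValue

    for (ts <- arrivals) {
      // An element at or behind the watermark is too late; this sketch drops it.
      if (ts > watermark) buffer.enqueue(ts)
      maxSeen = math.max(maxSeen, ts)
      watermark = maxSeen - maxDelay
      // Everything at or below the watermark can no longer be overtaken.
      while (buffer.nonEmpty && buffer.head <= watermark) out += buffer.dequeue()
    }
    // End of input: flush whatever is still buffered, in order.
    while (buffer.nonEmpty) out += buffer.dequeue()
    out.toList
  }

  def main(args: Array[String]): Unit =
    println(sortWithWatermark(Seq(4L, 2L, 7L, 5L, 9L, 1L), 3L)) // List(2, 4, 5, 7, 9)
}
```

In this run the element 1 arrives when the watermark is already 6, so it is discarded; a larger `maxDelay` would keep it at the price of holding every element in the buffer longer.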
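Understanding 5
Processing a stream in event time gives more accurate results. Timestamps and watermarks can be obtained in two ways: inside the data stream source, or through a timestamp assigner / watermark generator.
Inside the source, use the collectWithTimestamp and emitWatermark methods of the SourceContext interface defined in SourceFunction; the former assigns the event timestamp and the latter emits a watermark.
Outside the source, call assignTimestampsAndWatermarks on the DataStream and pass a timestampAndWatermarkAssigner, of which there are two types:
AssignerWithPeriodicWatermarks (defines getCurrentWatermark, which returns the current watermark; the period is set with env.getConfig().setAutoWatermarkInterval(1000));
The system checks the progress of event time at a fixed interval.
AssignerWithPunctuatedWatermarks (defines checkAndGetNextWatermark, which is called after extractTimestamp and receives the just-extracted extractedTimestamp as a parameter);
It decides whether to produce a new watermark: nothing is emitted periodically, and the watermark is updated only in response to incoming events.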
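The difference between the two styles can be illustrated with a small plain-Scala model. This is only a sketch and does not implement the Flink interfaces: `AssignerStylesDemo`, the `Event` type, and its `marker` flag are hypothetical, and the periodic timer is approximated by "every N elements" rather than wall-clock time.

```scala
import scala.collection.mutable

object AssignerStylesDemo {
  // Hypothetical event: `marker` flags elements that should produce a watermark.
  final case class Event(ts: Long, marker: Boolean)

  // Punctuated style: the generator is consulted once per element and
  // emits a watermark only when the element itself calls for one.
  def punctuated(events: Seq[Event]): List[Long] =
    events.collect { case e if e.marker => e.ts }.toList

  // Periodic style: the generator tracks the max timestamp and is asked for
  // the current watermark on a timer, modeled here as every `every` elements.
  def periodic(events: Seq[Event], every: Int, maxDelay: Long): List[Long] = {
    var maxSeen = Long.MinValue
    val out = mutable.ArrayBuffer.empty[Long]
    for ((e, i) <- events.zipWithIndex) {
      maxSeen = math.max(maxSeen, e.ts)
      if ((i + 1) % every == 0) out += maxSeen - maxDelay
    }
    out.toList
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(1L, false), Event(3L, true), Event(2L, false), Event(8L, true))
    println(punctuated(events))      // List(3, 8)
    println(periodic(events, 2, 1L)) // List(2, 7)
  }
}
```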
III. Example
A streaming word count
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.StringUtils

object DataStreamDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Use event time as the stream time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // Set the parallelism
    env.setParallelism(1)
    // Emit a watermark every 9 seconds
    env.getConfig.setAutoWatermarkInterval(9000)
    val line = env.socketTextStream("localhost", 9999)
    // Brings the implicit TypeInformation instances into scope
    import org.apache.flink.api.scala._
    val counts = line
      .filter(f => !StringUtils.isNullOrWhitespaceOnly(f))
      .map(new LineSplitter)
      .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[Tuple3[String, Long, Integer]]() {
        var currentMaxTimestamp = 0L
        // Bound on how far out of order (delayed) elements may be
        val maxOutOfOrderness = 10000L

        // Return the current watermark
        override def getCurrentWatermark: Watermark = {
          val tmpTimestamp = currentMaxTimestamp - maxOutOfOrderness
          println(s"wall clock is ${System.currentTimeMillis()} new watermark ${tmpTimestamp}")
          new Watermark(tmpTimestamp)
        }

        // Extract the event time
        override def extractTimestamp(element: (String, Long, Integer), previousElementTimestamp: Long): Long = {
          val timestamp = element._2
          currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp)
          println(s"get timestamp is $timestamp currentMaxTimestamp $currentMaxTimestamp")
          timestamp
        }
      })
      .keyBy(0)
      .timeWindow(Time.seconds(20))
      .sum(2)
    counts.print()
    env.execute("WordCount")
  }
}

// Build the element together with its event time, and set its count to 1
class LineSplitter extends MapFunction[String, Tuple3[String, Long, Integer]] {
  override def map(value: String): (String, Long, Integer) = {
    val arrays = value.toLowerCase.split("\\W+")
    new Tuple3[String, Long, Integer](arrays(0), arrays(1).toLong, 1)
  }
}
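The maxOutOfOrderness parameter is usually set from experience.
If maxOutOfOrderness is too small while the data is heavily out of order or late (because of network delays and so on), many windows fire while holding only part of their records, sometimes a single one, and the correctness of the results suffers badly. If it is too large, the watermark lags far behind and loses its usefulness: all the data for a window may have arrived within a short time, yet the window still has to wait for the watermark to creep forward before the computation is triggered.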
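The trade-off can be seen in a small plain-Scala experiment. This is a sketch under simplifying assumptions: `DelayTradeoffDemo` is a hypothetical name, the arrival sequence is made up, the watermark is updated after every element, and an element counts as late when its timestamp is at or behind the watermark on arrival.

```scala
object DelayTradeoffDemo {
  // Returns (number of late elements, final watermark) for a given bound.
  def evaluate(arrivals: Seq[Long], maxOutOfOrderness: Long): (Int, Long) = {
    var maxSeen = Long.MinValue
    var late = 0
    for (ts <- arrivals) {
      // Compare against the watermark as it stood before this element arrived.
      if (maxSeen != Long.MinValue && ts <= maxSeen - maxOutOfOrderness) late += 1
      maxSeen = math.max(maxSeen, ts)
    }
    (late, maxSeen - maxOutOfOrderness)
  }

  def main(args: Array[String]): Unit = {
    val arrivals = Seq(10L, 12L, 5L, 20L, 14L, 8L) // assumed sample data
    for (bound <- Seq(2L, 7L, 20L))
      println(s"bound=$bound -> ${evaluate(arrivals, bound)}")
    // bound=2  -> (3,18)  many late elements, watermark close to the newest event
    // bound=7  -> (2,13)
    // bound=20 -> (0,0)   nothing late, but the watermark trails far behind
  }
}
```

A small bound keeps the watermark close to the newest event but marks more elements as late; a large bound loses nothing but delays every window by that much.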
Input
aa 1601365080
bb 1601465080
aa 1601365080
aa 1601365080
aa 1601365080
aa 1601466080
bb 1601466080
bb 1601467080
bb 1601468080
ee 1601469080
ee 1601479080
aa 1601489080
cc 1601589080
Server output
get timestamp is 1601365080 currentMaxTimestamp 1601365080
wall clock is 1601365690607 new watermark 1601355080
wall clock is 1601365699610 new watermark 1601355080
get timestamp is 1601465080 currentMaxTimestamp 1601465080
wall clock is 1601365708612 new watermark 1601455080
(aa,1601365080,1)
wall clock is 1601365717615 new watermark 1601455080
wall clock is 1601365726616 new watermark 1601455080
wall clock is 1601365735619 new watermark 1601455080
wall clock is 1601365744621 new watermark 1601455080
wall clock is 1601365753624 new watermark 1601455080
wall clock is 1601365762627 new watermark 1601455080
wall clock is 1601365771629 new watermark 1601455080
wall clock is 1601365780631 new watermark 1601455080
wall clock is 1601365789634 new watermark 1601455080
wall clock is 1601365798636 new watermark 1601455080
wall clock is 1601365807638 new watermark 1601455080
wall clock is 1601365816641 new watermark 1601455080
wall clock is 1601365825643 new watermark 1601455080
wall clock is 1601365834645 new watermark 1601455080
wall clock is 1601365843647 new watermark 1601455080
wall clock is 1601365852648 new watermark 1601455080
wall clock is 1601365861650 new watermark 1601455080
get timestamp is 1601365080 currentMaxTimestamp 1601465080
wall clock is 1601365870652 new watermark 1601455080
wall clock is 1601365879654 new watermark 1601455080
wall clock is 1601365888657 new watermark 1601455080
wall clock is 1601365897659 new watermark 1601455080
get timestamp is 1601365080 currentMaxTimestamp 1601465080
get timestamp is 1601365080 currentMaxTimestamp 1601465080
wall clock is 1601365906661 new watermark 1601455080
wall clock is 1601365915662 new watermark 1601455080
wall clock is 1601365924664 new watermark 1601455080
wall clock is 1601365933667 new watermark 1601455080
wall clock is 1601365942669 new watermark 1601455080
wall clock is 1601365951671 new watermark 1601455080
wall clock is 1601365960674 new watermark 1601455080
get timestamp is 1601466080 currentMaxTimestamp 1601466080
wall clock is 1601365969676 new watermark 1601456080
wall clock is 1601365978678 new watermark 1601456080
wall clock is 1601365987679 new watermark 1601456080
wall clock is 1601365996682 new watermark 1601456080
wall clock is 1601366005685 new watermark 1601456080
get timestamp is 1601466080 currentMaxTimestamp 1601466080
wall clock is 1601366014687 new watermark 1601456080
get timestamp is 1601467080 currentMaxTimestamp 1601467080
wall clock is 1601366023690 new watermark 1601457080
wall clock is 1601366032692 new watermark 1601457080
wall clock is 1601366041693 new watermark 1601457080
get timestamp is 1601468080 currentMaxTimestamp 1601468080
wall clock is 1601366050695 new watermark 1601458080
get timestamp is 1601469080 currentMaxTimestamp 1601469080
wall clock is 1601366059696 new watermark 1601459080
wall clock is 1601366068698 new watermark 1601459080
wall clock is 1601366077701 new watermark 1601459080
wall clock is 1601366086703 new watermark 1601459080
wall clock is 1601366095704 new watermark 1601459080
wall clock is 1601366104707 new watermark 1601459080
wall clock is 1601366113710 new watermark 1601459080
wall clock is 1601366122712 new watermark 1601459080
wall clock is 1601366131714 new watermark 1601459080
wall clock is 1601366140717 new watermark 1601459080
wall clock is 1601366149720 new watermark 1601459080
wall clock is 1601366158722 new watermark 1601459080
wall clock is 1601366167724 new watermark 1601459080
get timestamp is 1601479080 currentMaxTimestamp 1601479080
wall clock is 1601366176727 new watermark 1601469080
wall clock is 1601366185729 new watermark 1601469080
wall clock is 1601366194732 new watermark 1601469080
get timestamp is 1601489080 currentMaxTimestamp 1601489080
wall clock is 1601366203733 new watermark 1601479080
wall clock is 1601366212735 new watermark 1601479080
wall clock is 1601366221736 new watermark 1601479080
wall clock is 1601366230739 new watermark 1601479080
wall clock is 1601366239741 new watermark 1601479080
wall clock is 1601366248744 new watermark 1601479080
wall clock is 1601366257746 new watermark 1601479080
wall clock is 1601366266749 new watermark 1601479080
wall clock is 1601366275750 new watermark 1601479080
wall clock is 1601366284752 new watermark 1601479080
wall clock is 1601366293754 new watermark 1601479080
wall clock is 1601366302755 new watermark 1601479080
wall clock is 1601366311757 new watermark 1601479080
wall clock is 1601366320759 new watermark 1601479080
wall clock is 1601366329760 new watermark 1601479080
wall clock is 1601366338763 new watermark 1601479080
wall clock is 1601366347765 new watermark 1601479080
get timestamp is 1601589080 currentMaxTimestamp 1601589080
wall clock is 1601366356768 new watermark 1601579080
(bb,1601465080,4)
(aa,1601466080,1)
(ee,1601469080,2)
(aa,1601489080,1)
wall clock is 1601366365771 new watermark 1601579080
wall clock is 1601366374773 new watermark 1601579080
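With the job listening on port 9999, the first input is aa 1601365080, which prints
(get timestamp is 1601365080 currentMaxTimestamp 1601365080); every 9 seconds the job also prints a line such as (wall clock is 1601365690607 new watermark 1601355080).
Next, bb 1601465080 is entered and the watermark becomes 1601455080. Flink interprets these timestamps as milliseconds, so aa belongs to the 20-second window [1601360000, 1601380000). The watermark is now past that window's end time, which triggers the computation and prints (aa,1601365080,1). A window fires when both of the following hold:
- the watermark has passed the window's end time;
- the window contains data.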
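The two conditions can be checked against the numbers from the log with a few lines of plain Scala. This is a sketch: `WindowFireDemo` and its functions are hypothetical names, windows are assumed to be tumbling with offset 0, and the comparison against `end - 1` reflects my understanding that Flink's event-time trigger fires once the watermark reaches the window's last millisecond.

```scala
object WindowFireDemo {
  // Start of the tumbling window (offset 0) that contains ts; values in ms.
  def windowStart(ts: Long, sizeMs: Long): Long = ts - (ts % sizeMs)

  // Exclusive end of that window.
  def windowEnd(ts: Long, sizeMs: Long): Long = windowStart(ts, sizeMs) + sizeMs

  // Fires when the watermark has reached the window's last millisecond
  // and the window actually holds data.
  def shouldFire(watermark: Long, windowEndMs: Long, elementsInWindow: Int): Boolean =
    watermark >= windowEndMs - 1 && elementsInWindow > 0

  def main(args: Array[String]): Unit = {
    val end = windowEnd(1601365080L, 20000L)
    println(end)                              // 1601380000
    println(shouldFire(1601355080L, end, 1))  // false: watermark after the first aa
    println(shouldFire(1601455080L, end, 1))  // true: watermark after bb arrives
  }
}
```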
The remaining input:
aa 1601365080
aa 1601365080
aa 1601365080
aa 1601466080
bb 1601466080
bb 1601467080
bb 1601468080
ee 1601469080
ee 1601479080
aa 1601489080
cc 1601589080
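The final input, cc, advances the watermark to 1601579080 (its timestamp 1601589080 minus maxOutOfOrderness), which triggers the computation of the remaining windows. Several records with timestamp 1601365080 are never counted: their window had already fired, and since this job does not set allowedLateness (it defaults to 0), they are discarded. In general, the contents of a fired window are not necessarily deleted right away but are kept for a while longer: as long as watermark < end-time + allowedLateness, data that arrives later for that window is still added to it and triggers the computation again. Once watermark >= end_time + allowedLateness, any further data belonging to that window can only be dropped, because the system cannot wait forever; doing so would grow the window buffer and cause needless performance degradation.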
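The whole run, including the effect of allowedLateness, can be replayed with a plain-Scala simulation. This is a sketch, not Flink: `LateDataDemo` and `Result` are hypothetical names, the watermark is advanced after every element instead of on the 9-second timer, window state is never cleaned up, and lateness is judged against the window's last millisecond (end - 1).

```scala
import scala.collection.mutable

object LateDataDemo {
  final case class Result(key: String, windowEnd: Long, count: Int)

  // The input from the walk-through above, in arrival order.
  val docInput: List[(String, Long)] = List(
    "aa" -> 1601365080L, "bb" -> 1601465080L, "aa" -> 1601365080L,
    "aa" -> 1601365080L, "aa" -> 1601365080L, "aa" -> 1601466080L,
    "bb" -> 1601466080L, "bb" -> 1601467080L, "bb" -> 1601468080L,
    "ee" -> 1601469080L, "ee" -> 1601479080L, "aa" -> 1601489080L,
    "cc" -> 1601589080L)

  // Returns (fired results in order, dropped elements).
  def run(input: Seq[(String, Long)], sizeMs: Long, maxDelayMs: Long,
          allowedLatenessMs: Long): (List[Result], List[(String, Long)]) = {
    val counts = mutable.LinkedHashMap.empty[(Long, String), Int] // (windowEnd, key) -> count
    val fired = mutable.Set.empty[Long]                           // window ends already fired
    val results = mutable.ArrayBuffer.empty[Result]
    val dropped = mutable.ArrayBuffer.empty[(String, Long)]
    var watermark = Long.MinValue

    for ((key, ts) <- input) {
      val end = ts - (ts % sizeMs) + sizeMs
      if (end - 1 + allowedLatenessMs <= watermark) {
        dropped += ((key, ts)) // the window is gone for good
      } else {
        counts((end, key)) = counts.getOrElse((end, key), 0) + 1
        // Late but still allowed: the window fires again with the new count.
        if (fired(end)) results += Result(key, end, counts((end, key)))
      }
      watermark = math.max(watermark, ts - maxDelayMs)
      // Fire every window the watermark has passed, oldest window first.
      val ready = counts.keys
        .filter { case (e, _) => !fired(e) && e - 1 <= watermark }
        .toList.sortBy(_._1)
      for ((e, k) <- ready) results += Result(k, e, counts((e, k)))
      fired ++= ready.map(_._1)
    }
    (results.toList, dropped.toList)
  }

  def main(args: Array[String]): Unit = {
    val (res, dropped) = run(docInput, 20000L, 10000L, 0L)
    res.foreach(println)
    println(s"dropped: ${dropped.size}")
  }
}
```

With allowedLateness set to 0 the simulation yields the same counts as the log above (aa 1, then bb 4, aa 1, ee 2, aa 1) and drops the three late aa records. With an assumed allowedLateness of 200000 ms nothing is dropped, and the first aa window fires again with counts 2, 3, and 4 as the late records arrive.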
References
http://wuchong.me/blog/2018/11/18/flink-tips-watermarks-in-apache-flink-made-easy/
https://www.jianshu.com/p/7d524ef8143c