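Error background
Version
The problem occurs in every Baidu precompiled build from v0.13.15 through v0.14.12.7
A Doris table is exported to Hive using Doris EXPORT
The table holds a bit over 70 GB of data
Single partition
150 buckets
Three replicas
3*150 tablets in total
Export job parameters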
"timeout" = "3600"
"tablet_num_per_task" = "1"
{"partitions":["*"],"exec mem limit":2147483648,"column separator":"\t","line delimiter":"\n","tablet num":150,"broker":"hdfs_broker","coord num":150}
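The "coord num":150 above matches one coordinator per task when each task takes a single tablet. A minimal sketch of that arithmetic, assuming the task count is simply the tablet count divided by tablet_num_per_task and rounded up (the helper name export_task_count is hypothetical, not a Doris API):

```python
import math


def export_task_count(tablet_num: int, tablet_num_per_task: int) -> int:
    """Estimate how many export sub-tasks (coordinators) a job is split into.

    Assumption: tablets are packed into tasks of at most
    `tablet_num_per_task` tablets each, so the count is a ceiling division.
    """
    if tablet_num_per_task <= 0:
        raise ValueError("tablet_num_per_task must be positive")
    return math.ceil(tablet_num / tablet_num_per_task)


# Values taken from the job info in this post: 150 tablets.
print(export_task_count(150, 1))  # one tablet per task
print(export_task_count(150, 5))  # five tablets per task
```

Under this assumption, raising tablet_num_per_task from 1 to 5 cuts the number of sub-tasks from 150 to 30, so each sub-task writes more data through the broker.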
Error details
The export job that runs in the early morning is intermittently CANCELLED
show export order by createtime desc;
The ErrorMsg column shows
type:RUN_FAIL; msg:export exporting job fail. query id: bc2db4bd87dc4893-b40d4025438e10e7, Failed to get query fragments context. Query may be timeout or be cancelled. host:
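Since the rest of the investigation follows the query id through the FE and BE logs, it helps to pull it out of the ErrorMsg first. A minimal sketch, assuming the message always contains the text "query id: " followed by two hex halves joined by a hyphen, as in the message above (the function name is hypothetical):

```python
import re
from typing import Optional

# ErrorMsg copied from the SHOW EXPORT output in this post.
ERROR_MSG = (
    "type:RUN_FAIL; msg:export exporting job fail. "
    "query id: bc2db4bd87dc4893-b40d4025438e10e7, "
    "Failed to get query fragments context. "
    "Query may be timeout or be cancelled. host:"
)

# Assumed format: "<hex>-<hex>" right after the literal "query id: ".
QUERY_ID_RE = re.compile(r"query id: ([0-9a-f]+-[0-9a-f]+)")


def extract_query_id(error_msg: str) -> Optional[str]:
    """Return the query id embedded in an export ErrorMsg, or None."""
    match = QUERY_ID_RE.search(error_msg)
    return match.group(1) if match else None


print(extract_query_id(ERROR_MSG))
```

The extracted id can then be used as the search string in fe.log and in the BE logs.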
Log tracing
fe-master node log
Searching the log by JobId, the last entry shows
the job failing at 14% progress
Scrolling up,
the FE log shows this:
deregister query id bc2db4bd87dc4893-b7a6d8b0d6db5339
unfinished instance: bc2db4bd87dc4893-b7a6d8b0d6db533a
On the first attempt, query bc2db4bd87dc4893-b7a6d8b0d6db5339 had its instance bc2db4bd87dc4893-b7a6d8b0d6db533a fail, and the query was then cancelled.
After that, a new query was registered:
register query id = bc2db4bd87dc4893-b40d4025438e10e7,
The fragment was then dispatched to a single BE:
dispatch load job: bc2db4bd87dc4893-b40d4025438e10e7 to [TNetworkAddress(hostname:be-xx, port:9060)]
But it failed immediately with "exec plan fragment failed":
exec plan fragment failed, errmsg=Failed to get query fragments context. Query may be timeout or be cancelled. host: , code: INTERNAL_ERROR, fragmentId=F43, backend=be-xx:9060
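In the FE log above, the unfinished instance id is the query id with its low half incremented by one (…5339 becomes …533a). A minimal sketch for deriving the instance id to search for in the BE logs, assuming this relationship holds in general; it is an observation from the ids in this post, not something confirmed against the Doris source, and the function name is hypothetical:

```python
def derive_instance_id(query_id: str, index: int = 0) -> str:
    """Derive a fragment instance id from a query id.

    Assumption (inferred from the logs in this post): the instance id
    keeps the high half of the query id and adds index + 1 to the low half.
    """
    hi, lo = query_id.split("-")
    # Wrap at 64 bits so the low half stays within its original width.
    lo_val = (int(lo, 16) + index + 1) & 0xFFFFFFFFFFFFFFFF
    return f"{hi}-{lo_val:0{len(lo)}x}"


# First attempt and retry, both query ids taken from the FE log.
print(derive_instance_id("bc2db4bd87dc4893-b7a6d8b0d6db5339"))
print(derive_instance_id("bc2db4bd87dc4893-b40d4025438e10e7"))
```

Both derived ids match the instance ids that appear in the logs in this post, so grepping the BE log for them finds the entries for each attempt.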
BE node log
After the first job, bc2db4bd87dc4893-b7a6d8b0d6db5339, reached the BE,
its instance
bc2db4bd87dc4893-b7a6d8b0d6db533a
failed
with this error:
bc2db4bd87dc4893-b7a6d8b0d6db533a: Thrift rpc error: Fail to write to broker, broker:TNetworkAddress(hostname=broker-xx, port=8000) failed:THRIFT_EAGAIN (timed out)
The instance of the FE's second attempt, bc2db4bd87dc4893-b40d4025438e10e7,
bc2db4bd87dc4893-b40d4025438e10e8
also returned reason 3.
Does reason:3 have a special meaning?
Looking through the BE code
Broker log
2021-08-05 14:18:06 [ pool-2-thread-20:604944856 ] - [ INFO ] receive a open writer request,
2021-08-05 14:18:26 [ pool-2-thread-20:604965097 ] - [ WARN ] Error closing output stream.
java.net.SocketException: Socket closed
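The writer was closed after a short time (10s).
Then, after changing "tablet_num_per_task"="1" to 5,
it took 120s before the close.
Analysis
This is similar to an issue I ran into before, but the logs are slightly different.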