
Doris error: there is no scanNode Backend

Published: 2023-11-28 01:20:47

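Background

On March 8, the business development team reported that a Spark Streaming job scanning a Doris table (through a SQL query) failed with the following error: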

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 20, hd012.corp.yodao.com, executor 7): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: errCode = 2, 
detailMessage = there is no scanNode Backend. [126101: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), 14587381: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), 213814: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS)]

Error

detailMessage = there is no scanNode Backend. [126101: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), 14587381: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), 213814: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS)]
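
Each entry inside the square brackets of the message has the form "backend id: reason". When comparing many failed tasks, it can help to extract these pairs programmatically. The following is a minimal sketch, not part of Doris or Spark: the class name ScanNodeErrorParser and its parse method are hypothetical, and it assumes the message layout shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Doris API): extracts "backend id -> reason" pairs
// from a "there is no scanNode Backend" error message.
class ScanNodeErrorParser {
    private static final String MARKER = "there is no scanNode Backend.";

    // One entry is "<digits>: <reason>"; it ends where the next ", <digits>: "
    // entry starts or where the bracketed list ends.
    private static final Pattern ENTRY = Pattern.compile("(\\d+): (.*?)(?=, \\d+: |$)");

    static Map<Long, String> parse(String message) {
        Map<Long, String> result = new LinkedHashMap<>();
        int marker = message.indexOf(MARKER);
        if (marker < 0) {
            return result; // not the error this sketch is about
        }
        int open = message.indexOf('[', marker);
        int close = message.lastIndexOf(']');
        if (open < 0 || close <= open) {
            return result; // no bracketed backend list found
        }
        Matcher m = ENTRY.matcher(message.substring(open + 1, close));
        while (m.find()) {
            result.put(Long.parseLong(m.group(1)), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String msg = "detailMessage = there is no scanNode Backend. "
                + "[126101: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), "
                + "14587381: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), "
                + "213814: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS)]";
        // Print one line per backend id with the reason it could not be chosen.
        parse(msg).forEach((id, reason) -> System.out.println(id + " -> " + reason));
    }
}
```

The lookahead keeps the reason text intact even if it contains commas, as long as no comma inside it is followed by digits and a colon.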

Source code analysis

// The blacklist object
private static Map<Long, Pair<Integer, String>> blacklistBackends = Maps.newConcurrentMap();

// getHost() is called during task execution; it returns a TNetworkAddress
public static TNetworkAddress getHost(long backendId,
                                      List<TScanRangeLocation> locations,
                                      ImmutableMap<Long, Backend> backends,
                                      Reference<Long> backendIdRef) throws UserException {
    // Look up the backend object by backendId
    Backend backend = backends.get(backendId);

    // Check whether the backend is available.
    // If it is, return its TNetworkAddress.
    // If it is not, loop over locations to find a candidate backend:
    //   - if the candidate has the same id as the unavailable backend, continue;
    //   - otherwise, if the candidate is available, return that BE's TNetworkAddress;
    //   - if it is not available, move on to the next candidate BE.
    if (isAvailable(backend)) {
        backendIdRef.setRef(backendId);
        return new TNetworkAddress(backend.getHost(), backend.getBePort());
    } else {
        for (TScanRangeLocation location : locations) {
            if (location.backend_id == backendId) {
                continue;
            }
            // choose the first alive backend(in analysis stage, the locations are random)
            Backend candidateBackend = backends.get(location.backend_id);
            if (isAvailable(candidateBackend)) {
                backendIdRef.setRef(location.backend_id);
                return new TNetworkAddress(candidateBackend.getHost(), candidateBackend.getBePort());
            }
        }
    }

    // If no BE could be returned in the end, throw an exception that states the reason
    // no backend returned
    throw new UserException("there is no scanNode Backend. "
            + getBackendErrorMsg(locations.stream().map(l -> l.backend_id).collect(Collectors.toList()),
                    backends, locations.size()));
}

public static boolean isAvailable(Backend backend) {
    return (backend != null && backend.isAlive() && !blacklistBackends.containsKey(backend.getId()));
}

// get the reason why backends can not be chosen.
private static String getBackendErrorMsg(List<Long> backendIds, ImmutableMap<Long, Backend> backends, int limit) {
    List<String> res = Lists.newArrayList();
    for (int i = 0; i < backendIds.size() && i < limit; i++) {
        long beId = backendIds.get(i);
        Backend be = backends.get(beId);
        if (be == null) {
            res.add(beId + ": not exist");
        } else if (!be.isAlive()) {
            res.add(beId + ": not alive");
        } else if (blacklistBackends.containsKey(beId)) {
            Pair<Integer, String> pair = blacklistBackends.get(beId);
            res.add(beId + ": in black list(" + (pair == null ? "unknown" : pair.second) + ")");
        } else {
            res.add(beId + ": unknown");
        }
    }
    return res.toString();
}

// Where entries are put into blacklistBackends
public static void addToBlacklist(Long backendID, String reason) {
    if (backendID == null) {
        return;
    }
    blacklistBackends.put(backendID, Pair.create(FeConstants.heartbeat_interval_second + 1, reason));
    LOG.warn("add backend {} to black list. reason: {}", backendID, reason);
}
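
To see how this logic produces the reported message, the selection and blacklist checks can be mimicked in a small standalone program. This is only a sketch under stated assumptions: BlacklistSketch, its nested Backend class, pickHost, and the host names be1/be2/be3 are hypothetical stand-ins, the blacklist stores only the reason string, and IllegalStateException replaces Doris's UserException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical stand-in for the scheduler logic quoted above.
class BlacklistSketch {
    // Minimal replacement for Doris's Backend class (assumption: only these fields matter here).
    static final class Backend {
        final long id;
        final String host;
        final boolean alive;

        Backend(long id, String host, boolean alive) {
            this.id = id;
            this.host = host;
            this.alive = alive;
        }
    }

    // backend id -> reason it was blacklisted
    private final Map<Long, String> blacklist = new ConcurrentHashMap<>();

    void addToBlacklist(long backendId, String reason) {
        blacklist.put(backendId, reason);
    }

    boolean isAvailable(Backend be) {
        return be != null && be.alive && !blacklist.containsKey(be.id);
    }

    // Returns the host of the preferred backend, or of the first other available
    // replica; throws when every replica is missing, dead, or blacklisted.
    String pickHost(long preferredId, List<Long> replicaIds, Map<Long, Backend> backends) {
        Backend preferred = backends.get(preferredId);
        if (isAvailable(preferred)) {
            return preferred.host;
        }
        for (long id : replicaIds) {
            if (id == preferredId) {
                continue; // already known to be unavailable
            }
            Backend candidate = backends.get(id);
            if (isAvailable(candidate)) {
                return candidate.host;
            }
        }
        throw new IllegalStateException("there is no scanNode Backend. " + describe(replicaIds, backends));
    }

    // Builds the "[id: reason, ...]" list explaining why no replica could be chosen.
    String describe(List<Long> ids, Map<Long, Backend> backends) {
        List<String> res = new ArrayList<>();
        for (long id : ids) {
            Backend be = backends.get(id);
            if (be == null) {
                res.add(id + ": not exist");
            } else if (!be.alive) {
                res.add(id + ": not alive");
            } else if (blacklist.containsKey(id)) {
                res.add(id + ": in black list(" + blacklist.get(id) + ")");
            } else {
                res.add(id + ": unknown");
            }
        }
        return res.toString();
    }

    public static void main(String[] args) {
        BlacklistSketch scheduler = new BlacklistSketch();
        Map<Long, Backend> backends = new HashMap<>();
        backends.put(126101L, new Backend(126101L, "be1", true));
        backends.put(14587381L, new Backend(14587381L, "be2", true));
        backends.put(213814L, new Backend(213814L, "be3", true));
        List<Long> replicas = Arrays.asList(126101L, 14587381L, 213814L);

        // Blacklist every replica, as in the incident.
        for (long id : replicas) {
            scheduler.addToBlacklist(id, "Ocurrs time out with specfied time 10000 MICROSECONDS");
        }
        try {
            scheduler.pickHost(126101L, replicas, backends);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that the error only appears when every replica of the scanned tablet is unavailable at the same moment; a single healthy, non-blacklisted replica is enough for the scan to proceed.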

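Root cause analysis

According to the error reported by the job,
detailMessage = there is no scanNode Backend. [126101: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), 14587381: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), 213814: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS)]
the three nodes with BE ids 126101, 14587381, and 213814 were on the blacklist, most likely for the reason given in the message: Ocurrs time out with specfied time 10000 MICROSECONDS.
This suggests that these three BEs were probably down on March 8.
Based on point 7 of the troubleshooting experience previously shared by community members,
it is reasonable to infer that the BEs went down because of a job or an improper configuration: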

  • A broker job or some other job overwhelmed the BE service
  • max_broker_concurrency
  • max_bytes_per_broker_scanner


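The problem occurred on March 8, and more than 20 days have passed since then. In the meantime the Doris cluster went through scale-out, node rearrangement, and other operations work, so the logs and many backups can no longer be recovered. The only evidence left is "Ocurrs time out with specfied time 10000 MICROSECONDS", from which we infer that the BEs were probably down at the time. All of our services run under supervisord, so they restart automatically (at that point, the Prometheus rules and Alertmanager alerts for unavailable node services had not yet been set up properly).
If the same problem occurs again, this article will be updated.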

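Remediation

Deployed Prometheus rules and Alertmanager alerts for BE node service unavailability.
Adjust the configuration in fe.conf.
Configure the runtime settings of Spark jobs and broker jobs properly.
There is no substantive fix for now; if the problem recurs, we will keep tracking it and add further remediation steps.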
