SkyWalking APM: The Definitive Guide

SkyWalking 6.x architecture diagram

Server

Requirements

Linux: CentOS 7.x
JDK 1.8

Configuration files

alarm-settings.yml

rules:

  # Rule unique name, must be ended with `_rule`.

  endpoint_percent_rule:

    # Metrics value need to be long, double or int

    metrics-name: endpoint_percent

    threshold: 75

    op: <

    # The length of time to evaluate the metrics

    period: 10

    # How many times after the metrics match the condition, will trigger alarm

    count: 3

    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.

    silence-period: 10

  service_percent_rule:

    metrics-name: service_percent

    # [Optional] Default, match all services in this metrics

    include-names:

      - service_a

      - service_b

    threshold: 85

    op: <

    period: 10

    count: 4

The default alarm-settings.yml in the distribution includes the following rules:

Service average response time over 1s in the last 3 minutes.
Service success rate lower than 80% in the last 2 minutes.
Service 90% response time over 1s in the last 3 minutes.
Service Instance average response time over 1s in the last 2 minutes.
Endpoint average response time over 1s in the last 2 minutes.
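
The rule fields shown earlier (threshold, op, period, count) can be read as a sliding-window check. The following is a minimal sketch of that reading, not SkyWalking's actual implementation: it assumes one metric value per minute, and the function name should_alarm and the sample values are hypothetical.

```python
import operator

# Comparison operators as written in the `op` field of a rule.
OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def should_alarm(values, threshold, op, period, count):
    """Return True if at least `count` of the last `period` values
    satisfy `value op threshold`."""
    window = list(values)[-period:]
    matches = sum(1 for v in window if OPS[op](v, threshold))
    return matches >= count


# Hypothetical per-minute success percentages for one endpoint.
endpoint_percent = [99, 98, 70, 97, 60, 96, 95, 72, 99, 98]

# Same numbers as endpoint_percent_rule: threshold 75, op <, period 10, count 3.
triggered = should_alarm(endpoint_percent, threshold=75, op="<", period=10, count=3)
print(triggered)
```

Three of the ten sample values fall below 75, so this sketch reports an alarm for count 3 but would not for count 4.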

application.yml

receiver-register:

  default:

receiver-trace:

  default:

    bufferPath: ../trace-buffer/  # Path to trace buffer files, suggest to use absolute path

    bufferOffsetMaxFileSize: 100 # Unit is MB

    bufferDataMaxFileSize: 500 # Unit is MB

    bufferFileCleanWhenRestart: false

    sampleRate: ${SW_TRACE_SAMPLE_RATE:1000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.

receiver-jvm:

  default:

service-mesh:

  default:

    bufferPath: ../mesh-buffer/  # Path to trace buffer files, suggest to use absolute path

    bufferOffsetMaxFileSize: 100 # Unit is MB

    bufferDataMaxFileSize: 500 # Unit is MB

    bufferFileCleanWhenRestart: false

istio-telemetry:

  default:

envoy-metric:

  default:

receiver_zipkin:

  default:

    host: 0.0.0.0

    port: 9411

    contextPath: /
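
Values written as ${NAME:default}, such as ${SW_TRACE_SAMPLE_RATE:1000} above, take the environment variable NAME when it is set and fall back to the default otherwise. The sketch below only illustrates that lookup; the resolve helper and its regular expression are assumptions for illustration, not code from SkyWalking.

```python
import os
import re

# Matches a whole value of the form ${NAME:default}; the default may contain colons.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_]+):(.*)\}")


def resolve(value, env=None):
    """Return the environment override for a ${NAME:default} value, else the default.

    Values that are not placeholders are returned unchanged.
    """
    env = os.environ if env is None else env
    match = PLACEHOLDER.fullmatch(value.strip())
    if match is None:
        return value
    name, default = match.groups()
    return env.get(name, default)


# No override set: the default after the first colon is used.
print(resolve("${SW_TRACE_SAMPLE_RATE:1000}", env={}))
# Override set in a (hypothetical) environment.
print(resolve("${SW_TRACE_SAMPLE_RATE:1000}", env={"SW_TRACE_SAMPLE_RATE": "5000"}))
```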

TTL

SkyWalking stores two types of observability data in addition to metadata:

Record: traces and alarms. Logs may be added in the future.
Metric: values such as p99/p95/p90/p75/p50, heatmap, success rate, and cpm (rpm). Metrics are stored separately for the minute, hour, day, and month dimensions, each in its own index or table.

# Set a timeout on metrics data. After the timeout has expired, the metrics data will automatically be deleted.

    enableDataKeeperExecutor: ${SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR:true} # Set to false to disable automatic deletion of metrics data.

    dataKeeperExecutePeriod: ${SW_CORE_DATA_KEEPER_EXECUTE_PERIOD:5} # How often the data keeper executor runs, unit is minute

    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:90} # Unit is minute

    minuteMetricsDataTTL: ${SW_CORE_MINUTE_METRIC_DATA_TTL:90} # Unit is minute

    hourMetricsDataTTL: ${SW_CORE_HOUR_METRIC_DATA_TTL:36} # Unit is hour

    dayMetricsDataTTL: ${SW_CORE_DAY_METRIC_DATA_TTL:45} # Unit is day

    monthMetricsDataTTL: ${SW_CORE_MONTH_METRIC_DATA_TTL:18} # Unit is month
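
Each TTL above uses a different unit, which is easy to misread. As a rough illustration, the sketch below turns the default values into cutoff timestamps: data older than the cutoff would be eligible for deletion. The function name expiry_cutoff and the fixed "current time" are hypothetical, and the month TTL is left out because a month is not a fixed-length interval.

```python
from datetime import datetime, timedelta

# Default TTL values copied from the core settings above, with their units.
CORE_TTL = {
    "recordDataTTL": timedelta(minutes=90),
    "minuteMetricsDataTTL": timedelta(minutes=90),
    "hourMetricsDataTTL": timedelta(hours=36),
    "dayMetricsDataTTL": timedelta(days=45),
}


def expiry_cutoff(setting, now):
    """Return the timestamp before which data governed by `setting` has expired."""
    return now - CORE_TTL[setting]


now = datetime(2019, 6, 1, 12, 0)  # fixed, hypothetical "current time"
for name in CORE_TTL:
    print(name, expiry_cutoff(name, now))
```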

ElasticSearch 6 storage TTL

Specifically, the Elasticsearch storage module has the following settings:

# Those data TTL settings will override the same settings in core module.

recordDataTTL: ${SW_STORAGE_ES_RECORD_DATA_TTL:7} # Unit is day

otherMetricsDataTTL: ${SW_STORAGE_ES_OTHER_METRIC_DATA_TTL:45} # Unit is day

monthMetricsDataTTL: ${SW_STORAGE_ES_MONTH_METRIC_DATA_TTL:18} # Unit is month

recordDataTTL affects Record data.
otherMetricsDataTTL affects the minute, hour, and day dimensions of metrics. minuteMetricsDataTTL, hourMetricsDataTTL and dayMetricsDataTTL still exist, but their unit changes to day as well. To set them individually, remove otherMetricsDataTTL.
monthMetricsDataTTL affects the month dimension of metrics.
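
The interaction between otherMetricsDataTTL and the per-dimension settings can be sketched as a small lookup. This follows the description above rather than SkyWalking's source code; the function effective_es_ttl_days and the sample dictionaries are hypothetical.

```python
def effective_es_ttl_days(storage_cfg):
    """Return the TTL in days for the minute, hour, and day metric dimensions.

    If otherMetricsDataTTL is present it applies to all three dimensions;
    otherwise each dimension uses its own setting, read in days.
    """
    other = storage_cfg.get("otherMetricsDataTTL")
    result = {}
    for dim in ("minute", "hour", "day"):
        own = storage_cfg.get(f"{dim}MetricsDataTTL")
        result[dim] = other if other is not None else own
    return result


# With otherMetricsDataTTL set, as in the settings above.
print(effective_es_ttl_days({"otherMetricsDataTTL": 45}))

# With otherMetricsDataTTL removed and hypothetical per-dimension values (in days).
print(effective_es_ttl_days(
    {"minuteMetricsDataTTL": 2, "hourMetricsDataTTL": 7, "dayMetricsDataTTL": 45}
))
```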

storage

Natively supported storage

ElasticSearch 6
MySQL
TiDB

ElasticSearch 6

storage:

  elasticsearch:

    # nameSpace: ${SW_NAMESPACE:""}

    # user: ${SW_ES_USER:""} # User needs to be set when Http Basic authentication is enabled

    # password: ${SW_ES_PASSWORD:""} # Password to be set when Http Basic authentication is enabled

    #trustStorePath: ${SW_SW_STORAGE_ES_SSL_JKS_PATH:""}

    #trustStorePass: ${SW_SW_STORAGE_ES_SSL_JKS_PASS:""}

    clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}

    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}

    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:2}

    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:0}

    # Those data TTL settings will override the same settings in core module.

    recordDataTTL: ${SW_STORAGE_ES_RECORD_DATA_TTL:7} # Unit is day

    otherMetricsDataTTL: ${SW_STORAGE_ES_OTHER_METRIC_DATA_TTL:45} # Unit is day

    monthMetricsDataTTL: ${SW_STORAGE_ES_MONTH_METRIC_DATA_TTL:18} # Unit is month

    # Batch process setting, refer to https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.5/java-docs-bulk-processor.html

    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:2000} # Execute the bulk every 2000 requests

    bulkSize: ${SW_STORAGE_ES_BULK_SIZE:20} # flush the bulk every 20mb

    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # flush the bulk every 10 seconds whatever the number of requests

    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests

sampleRate

receiver-trace:

  default:

    bufferPath: ../trace-buffer/  # Path to trace buffer files, suggest to use absolute path

    bufferOffsetMaxFileSize: 100 # Unit is MB

    bufferDataMaxFileSize: 500 # Unit is MB

    bufferFileCleanWhenRestart: false

    sampleRate: ${SW_TRACE_SAMPLE_RATE:1000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
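
Because the precision is 1/10000, the configured value 1000 corresponds to keeping about 10% of traces, while 10000 keeps all of them. A minimal sketch of that arithmetic follows; the helper name sample_fraction is hypothetical.

```python
def sample_fraction(sample_rate, precision=10000):
    """Convert a sampleRate setting into the fraction of traces kept."""
    if not 0 <= sample_rate <= precision:
        raise ValueError(f"sampleRate must be between 0 and {precision}")
    return sample_rate / precision


# The value configured above.
print(sample_fraction(1000))
# Full sampling.
print(sample_fraction(10000))
```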

Client

Requirements

The agent is available for JDK 6 - 12.
Find the agent folder in the SkyWalking release package.
Set agent.service_name in config/agent.config. It can be any string of English characters.
Set collector.backend_service in config/agent.config. The default points to 127.0.0.1:11800, which only works for a local backend.
Add -javaagent:/path/to/skywalking-package/agent/skywalking-agent.jar to the JVM arguments, and make sure it comes before the -jar argument.
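
The ordering matters because the JVM treats everything after -jar and the jar file as arguments to the application, so a -javaagent option placed there never reaches the JVM. The sketch below builds such a command line; app.jar is a hypothetical application jar and build_java_command is an illustrative helper.

```python
def build_java_command(agent_jar, app_jar):
    """Build an argv list with the -javaagent option placed before -jar."""
    return ["java", f"-javaagent:{agent_jar}", "-jar", app_jar]


# The agent path is the placeholder from the text; app.jar is hypothetical.
cmd = build_java_command(
    "/path/to/skywalking-package/agent/skywalking-agent.jar", "app.jar"
)
print(" ".join(cmd))
```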

Configuration file

agent/config/agent.config

property key	Description	Default
agent.namespace
agent.service_name
agent.sample_n_per_3_secs
agent.authentication
agent.span_limit_per_segment
agent.ignore_suffix
agent.is_open_debugging_class
agent.active_v2_header
agent.instance_uuid
agent.instance_properties[key]=value
agent.cause_exception_depth
agent.active_v1_header
agent.cool_down_threshold
agent.force_reconnection_period
agent.operation_name_threshold
collector.grpc_channel_check_interval
collector.app_and_service_register_check_interval
collector.backend_service
collector.grpc_upstream_timeout
logging.level
logging.file_name
logging.output
logging.dir
logging.pattern
logging.max_file_size
logging.max_history_files
jvm.buffer_size
buffer.channel_size
buffer.buffer_size
dictionary.service_code_buffer_size
dictionary.endpoint_name_buffer_size
plugin.peer_max_length
plugin.mongodb.trace_param
plugin.mongodb.filter_length_limit
plugin.elasticsearch.trace_dsl
plugin.springmvc.use_qualified_name_as_endpoint_name
plugin.toolit.use_qualified_name_as_operation_name
plugin.mysql.trace_sql_parameters
plugin.mysql.sql_parameters_max_length
plugin.postgresql.sql_parameters_max_length
plugin.solrj.trace_statement
plugin.solrj.trace_ops_params
plugin.light4j.trace_handler_chain
plugin.opgroup.*
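
The keys above are set in agent.config as key=value lines. As a rough sketch of how such a file can be inspected, the snippet below parses a small sample, assuming a plain properties layout with # comments; the values shown are hypothetical and parse_agent_config is not part of SkyWalking.

```python
def parse_agent_config(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical values for illustration only.
sample = """
# The service name shown in the UI
agent.service_name=demo-service
collector.backend_service=127.0.0.1:11800
logging.level=DEBUG
"""

config = parse_agent_config(sample)
print(config["agent.service_name"], config["collector.backend_service"])
```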

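Success stories

Intelligent log management platform https://developer.qiniu.com/insight

The Pandora intelligent log management platform is a one-stop log data management platform. It offers unified log storage, real-time search, query and analysis, and monitoring and alerting. It also provides compute engines (stream and batch) for further data analysis and supports machine learning features such as anomaly detection and prediction. It helps users improve operations and business efficiency and locate problems quickly, and is widely used for online service monitoring, operations troubleshooting, security auditing, and user behavior analysis.

References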

https://blog.csdn.net/gzy11/article/details/86679473#1322_mysql_175
https://blog.csdn.net/gzy11/article/details/86679585#_4
https://developer.qiniu.com/insight/manual/5435/skywalking-tracking-tomcat-services