Data Collection for Big Data
Data collection is the foundation of big data. Data scattered across many places only becomes available for further processing once it has been collected and brought together. Since big data technology began to develop, many collection frameworks have appeared; this article walks through several of the popular solutions.
Evaluating whether a framework fits a given business scenario usually involves several aspects:
- Most basic is interface fit: are you collecting socket data or log data, and where does the output need to go?
- Performance: does the framework's throughput meet the needs of the business?
- Flexibility: if you need filtering or custom development, how easy is it?
- Impact on the business systems: collection must not disturb the systems it reads from or consume too many of their resources.
- Operability: some solutions have many dependencies and complex configuration, which makes them error-prone.
- Reliability: the framework should be highly reliable and must not lose data.
Judging candidates against these aspects should be enough to find a suitable framework.
1. Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
Supported source types
| Source Type | Comments |
| --- | --- |
| Avro Source | Listens on an Avro port and receives events from external Avro client streams. |
| Thrift Source | Listens on a Thrift port and receives events from external Thrift client streams. |
| Exec Source | Runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless the property logStdErr is set to true). |
| JMS Source | Reads messages from a JMS destination such as a queue or topic. |
| SSL and JMS Source | JMS client implementations typically support configuring SSL/TLS via Java system properties defined by JSSE (Java Secure Socket Extension). |
| Spooling Directory Source | Lets you ingest data by placing files to be ingested into a “spooling” directory on disk. |
| Taildir Source | Watches the specified files and tails them in near real-time once new lines are appended to each file. |
| Kafka Source | An Apache Kafka consumer that reads messages from Kafka topics. |
| NetCat TCP Source | A netcat-like source that listens on a given port and turns each line of text into an event. |
| NetCat UDP Source | As per the original Netcat (TCP) source, listens on a given port and turns each line of text into an event sent via the connected channel. Acts like nc -u -k -l [host] [port]. |
| Sequence Generator Source | A simple sequence generator that continuously generates events with a counter that starts from 0, increments by 1 and stops at totalEvents. |
| Syslog Sources | Read syslog data and generate Flume events. |
| Syslog TCP Source | The original, tried-and-true syslog TCP source. |
| Multiport Syslog TCP Source | A newer, faster, multi-port capable version of the Syslog TCP source. |
| Syslog UDP Source | Reads syslog messages over UDP and turns them into Flume events. |
| HTTP Source | A source which accepts Flume events by HTTP POST and GET. |
| Stress Source | An internal load-generating source implementation which is very useful for stress tests. |
| Custom Source | Your own implementation of the Source interface. |
| Scribe Source | Scribe is another type of ingest system. To adopt an existing Scribe ingest system, Flume should use ScribeSource, which is based on Thrift with a compatible transfer protocol. |
Supported sink types
| Sink Type | Comments |
| --- | --- |
| HDFS Sink | This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. |
| Hive Sink | This sink streams events containing delimited text or JSON data directly into a Hive table or partition. Events are written using Hive transactions. |
| Logger Sink | Logs events at INFO level. Typically useful for testing/debugging purposes. |
| Avro Sink | This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname/port pair. |
| Thrift Sink | This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Thrift events and sent to the configured hostname/port pair. |
| IRC Sink | Takes messages from the attached channel and relays them to the configured IRC destinations. |
| File Roll Sink | Stores events on the local filesystem. |
| Null Sink | Discards all events it receives from the channel. |
| HBaseSink | This sink writes data to HBase. The HBase configuration is picked up from the first hbase-site.xml encountered in the classpath. A class implementing HbaseEventSerializer, which is specified by the configuration, is used to convert the events into HBase puts and/or increments. |
| HBase2Sink | The equivalent of HBaseSink for HBase version 2. |
| AsyncHBaseSink | Writes data to HBase using an asynchronous model. |
| MorphlineSolrSink | Extracts data from Flume events, transforms it, and loads it in near-real-time into Apache Solr servers, which in turn serve queries to end users or search applications. |
| ElasticSearchSink | Writes data to an Elasticsearch cluster. By default, events are written so that the Kibana graphical interface can display them, just as if Logstash wrote them. |
| Kafka Sink | A Flume sink implementation that can publish data to a Kafka topic. |
| HTTP Sink | Takes events from the channel and sends them to a remote service using an HTTP POST request. The event content is sent as the POST body. |
| Custom Sink | Your own implementation of the Sink interface. A custom sink's class and its dependencies must be included in the agent's classpath when starting the Flume agent. |
As the tables above show, Flume supports a rich set of interfaces; the most common combination is a file-based log collection source paired with a sink that forwards to Kafka.
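To make that combination concrete, here is a minimal agent configuration sketch: a Taildir source feeding a file channel that drains into a Kafka sink. The agent name a1, the file paths, the topic name, and the broker addresses are placeholder assumptions to be adapted to your environment.

```properties
# Minimal Flume agent sketch: taildir source -> file channel -> Kafka sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Taildir source: follow matching log files and remember read offsets
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
a1.sources.r1.channels = c1

# File channel: slower than a memory channel, but events survive an agent restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# Kafka sink: publish each event to a Kafka topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
a1.sinks.k1.kafka.topic = app-logs
a1.sinks.k1.channel = c1
```

Switching the channel type from file to memory trades durability for the higher throughput discussed next.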
Throughput: throughput depends on both the hardware and the data itself, so when evaluating it you should look at both events per second and megabytes per second. Typically, with a file channel Flume can handle tens of megabytes per second, and with a memory channel several hundred megabytes per second.
Flexibility: Flume is quite flexible; every stage of the data flow exposes extension points for custom development, and since Flume itself is written in Java, that work is relatively approachable.
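One of those extension points is the interceptor, which sits between a source and its channel. Below is a minimal Java sketch against the org.apache.flume.interceptor.Interceptor contract; the class name is hypothetical, and the logic (dropping events with an empty body) is only meant to show the shape of a custom component, not production code.

```java
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interceptor that drops events whose body is blank.
public class DropBlankLineInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // No state to set up in this sketch.
    }

    @Override
    public Event intercept(Event event) {
        // Returning null drops the event; anything else passes through unchanged.
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        return body.trim().isEmpty() ? null : event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> kept = new ArrayList<>(events.size());
        for (Event e : events) {
            Event out = intercept(e);
            if (out != null) {
                kept.add(out);
            }
        }
        return kept;
    }

    @Override
    public void close() {
        // Nothing to release in this sketch.
    }

    // Flume instantiates interceptors through a Builder named in the agent config.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new DropBlankLineInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No custom parameters in this sketch.
        }
    }
}
```

It would be wired into a source with lines such as a1.sources.r1.interceptors = i1 and a1.sources.r1.interceptors.i1.type = <your.package>.DropBlankLineInterceptor$Builder.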
Resource consumption: Flume's own footprint is fairly heavy; if you are sensitive to resource usage, parameter tuning can bring it down considerably.
Maintainability: Flume has few dependencies; download the jars and it is ready to use. Some people find the configuration cumbersome, but complexity is the price of flexibility, and personally I find it acceptable.
Reliability: Flume's reliability shows up in several ways. First, choose an appropriate channel to guarantee reliable message handling; second, Flume itself also provides load balancing and failover across sinks.
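The load balancing and failover part is configured through sink groups. A sketch of a failover group over two hypothetical sinks k1 and k2, where k1 is preferred and k2 takes over when k1 fails, might look like this:

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Alternatively, processor.type = load_balance spreads events across the sinks.
```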
http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
https://blog.csdn.net/lijinqi1987/article/details/77506034
2. Logstash
Logstash exists as a member of the ELK stack, responsible for data ingestion; over time it has gradually added support for many more data sources and destinations.
Input plugins
| Input Plugin | Description |
| --- | --- |
| azure_event_hubs | Receives events from Azure Event Hubs |
| beats | Receives events from the Elastic Beats framework |
| cloudwatch | Pulls events from the Amazon Web Services CloudWatch API |
| couchdb_changes | Streams events from CouchDB's _changes URI |
| dead_letter_queue | Reads events from Logstash's dead letter queue |
| elasticsearch | Reads query results from an Elasticsearch cluster |
| exec | Captures the output of a shell command as an event |
| file | Streams events from files |
| ganglia | Reads Ganglia packets over UDP |
| gelf | Reads GELF-format messages from Graylog2 as events |
| generator | Generates random log events for test purposes |
| github | Reads events from a GitHub webhook |
| google_cloud_storage | Extracts events from files in a Google Cloud Storage bucket |
| google_pubsub | Consumes events from a Google Cloud PubSub service |
| graphite | Reads metrics from the graphite tool |
| heartbeat | Generates heartbeat events for testing |
| http | Receives events over HTTP or HTTPS |
| http_poller | Decodes the output of an HTTP API into events |
| imap | Reads mail from an IMAP server |
| irc | Reads events from an IRC server |
| java_generator | Generates synthetic log events |
| java_stdin | Reads events from standard input |
| jdbc | Creates events from JDBC data |
| jms | Reads events from a JMS broker |
| jmx | Retrieves metrics from remote Java applications over JMX |
| kafka | Reads events from a Kafka topic |
| kinesis | Receives events through an AWS Kinesis stream |
| log4j | Reads events over a TCP socket from a Log4j SocketAppender object |
| lumberjack | Receives events using the Lumberjack protocol |
| meetup | Captures the output of command line tools as an event |
| pipe | Streams events from a long-running command pipe |
| puppet_facter | Receives facts from a Puppet server |
| rabbitmq | Pulls events from a RabbitMQ exchange |
| redis | Reads events from a Redis instance |
| relp | Receives RELP events over a TCP socket |
| rss | Captures the output of command line tools as an event |
| s3 | Streams events from files in an S3 bucket |
| s3_sns_sqs | Reads logs from AWS S3 buckets using SQS |
| salesforce | Creates events based on a Salesforce SOQL query |
| snmp | Polls network devices using Simple Network Management Protocol (SNMP) |
| snmptrap | Creates events based on SNMP trap messages |
| sqlite | Creates events based on rows in an SQLite database |
| sqs | Pulls events from an Amazon Web Services Simple Queue Service queue |
| stdin | Reads events from standard input |
| stomp | Creates events received with the STOMP protocol |
| syslog | Reads syslog messages as events |
| tcp | Reads events from a TCP socket |
| twitter | Reads events from the Twitter Streaming API |
| udp | Reads events over UDP |
| unix | Reads events over a UNIX socket |
| varnishlog | Reads from the varnish cache shared memory log |
| websocket | Reads events from a websocket |
| wmi | Creates events based on the results of a WMI query |
| xmpp | Receives events over the XMPP/Jabber protocol |
As you can see, besides the usual inputs such as file, stdin, and kafka, Logstash also supports a long tail of miscellaneous inputs. That makes it a competent ingestion layer for ELK, but whether those are the inputs you actually need requires careful evaluation.
Output plugins
| Output Plugin | Description |
| --- | --- |
| boundary | Sends annotations to Boundary based on Logstash events |
| circonus | Sends annotations to Circonus based on Logstash events |
| cloudwatch | Aggregates and sends metric data to AWS CloudWatch |
| csv | Writes events to disk in a delimited format |
| datadog | Sends events to DataDogHQ based on Logstash events |
| datadog_metrics | Sends metrics to DataDogHQ based on Logstash events |
| elastic_app_search | Sends events to the Elastic App Search solution |
| elasticsearch | Stores logs in Elasticsearch |
| email | Sends email to a specified address when output is received |
| exec | Runs a command for a matching event |
| file | Writes events to files on disk |
| ganglia | Writes metrics to Ganglia's gmond |
| gelf | Generates GELF formatted output for Graylog2 |
| google_bigquery | Writes events to Google BigQuery |
| google_cloud_storage | Uploads log events to Google Cloud Storage |
| google_pubsub | Uploads log events to Google Cloud Pubsub |
| graphite | Writes metrics to Graphite |
| graphtastic | Sends metric data on Windows |
| http | Sends events to a generic HTTP or HTTPS endpoint |
| influxdb | Writes metrics to InfluxDB |
| irc | Writes events to IRC |
| java_sink | Discards any events received |
| java_stdout | Prints events to the STDOUT of the shell |
| juggernaut | Pushes messages to the Juggernaut websockets server |
| kafka | Writes events to a Kafka topic |
| librato | Sends metrics, annotations, and alerts to Librato based on Logstash events |
| loggly | Ships logs to Loggly |
| lumberjack | Sends events using the lumberjack protocol |
| metriccatcher | Writes metrics to MetricCatcher |
| mongodb | Writes events to MongoDB |
| nagios | Sends passive check results to Nagios |
| nagios_nsca | Sends passive check results to Nagios using the NSCA protocol |
| opentsdb | Writes metrics to OpenTSDB |
| pagerduty | Sends notifications based on preconfigured services and escalation policies |
| pipe | Pipes events to another program's standard input |
| rabbitmq | Pushes events to a RabbitMQ exchange |
| redis | Sends events to a Redis queue using the RPUSH command |
| redmine | Creates tickets using the Redmine API |
| riak | Writes events to the Riak distributed key/value store |
| riemann | Sends metrics to Riemann |
| s3 | Sends Logstash events to the Amazon Simple Storage Service |
| sns | Sends events to Amazon's Simple Notification Service |
| solr_http | Stores and indexes logs in Solr |
| sqs | Pushes events to an Amazon Web Services Simple Queue Service queue |
| statsd | Sends metrics using the statsd network daemon |
| stdout | Prints events to the standard output |
| stomp | Writes events using the STOMP protocol |
| syslog | Sends events to a syslog server |
| tcp | Writes events over a TCP socket |
| timber | Sends events to the Timber.io logging service |
| udp | Sends events over UDP |
| webhdfs | Sends Logstash events to HDFS using the webhdfs REST API |
| websocket | Publishes messages to a websocket |
| xmpp | Posts events over XMPP |
| zabbix | Sends events to a Zabbix server |
As with the inputs, the outputs are dominated by Elastic's own ecosystem. Support for the Hadoop stack is limited, but support for NoSQL stores such as Redis and MongoDB is relatively broad.
Throughput: Logstash handles on the order of a few thousand events per second.
Flexibility: Logstash provides powerful filtering and preprocessing of events.
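That preprocessing lives in the filter block of a pipeline definition. The sketch below uses a hypothetical log path, grok pattern, and Elasticsearch address; it parses a timestamp and level out of each line before indexing, and echoes events to stdout for debugging.

```conf
input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  grok {
    # Illustrative pattern: timestamp, level, and the rest of the line
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://es1:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
  stdout { codec => rubydebug }
}
```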
Resource consumption: Logstash is fairly demanding and needs a relatively large amount of memory.
Operations: Logstash itself is written in JRuby, so its dependencies and configuration are comparatively complex.
Reliability: Logstash runs as a single node, so in extreme cases data loss is possible.
https://www.elastic.co/guide/en/logstash/current/index.html
https://doc.yonyoucloud.com/doc/logstash-best-practice-cn/index.html
3. FileBeat
FileBeat is another data collection framework from Elastic. Compared with Logstash its data processing capability is weaker, but that is precisely the point: it is lightweight, so its resource consumption is very low.
Input types
| Input type | Comments |
| --- | --- |
| Log | Reads lines from log files. |
| Stdin | Reads events from standard input. |
| Container | Reads log files from containers. |
| Kafka | Reads from topics in a Kafka cluster. |
| Redis | Reads entries from Redis slowlogs. |
| UDP | Reads events over UDP. |
| Docker | Reads logs from Docker containers. |
| TCP | Reads events over TCP. |
| Syslog | Reads events over TCP or UDP; this input parses BSD (RFC 3164) events and some variants. |
| s3 | Retrieves logs from S3 objects that are pointed to by messages from specific SQS queues. |
| NetFlow | Reads NetFlow and IPFIX exported flows and options records over UDP. |
| Google Pub/Sub | Reads messages from a Google Cloud Pub/Sub topic subscription. |
| Azure eventhub | Reads messages from an Azure event hub. |
FileBeat supports relatively few input types, but the usual ones, such as files, standard input, and Kafka, are covered, and there is some support for specific products as well.
Output types
| Output type | Comments |
| --- | --- |
| Elasticsearch | Sends the transactions directly to Elasticsearch by using the Elasticsearch HTTP API. |
| Logstash | Sends events directly to Logstash by using the lumberjack protocol, which runs over TCP. Logstash allows for additional processing and routing of generated events. |
| Kafka | Sends the events to Apache Kafka. |
| Redis | Inserts the events into a Redis list or a Redis channel. |
| File | Dumps the transactions into a file where each transaction is in JSON format. |
| Console | Writes events in JSON format to stdout. |
As the output list shows, FileBeat is relatively weak on the persistence side; having been incubated within the Elastic ecosystem, most of its outputs are Elastic components, but the support for Kafka and Redis can still cover a fair number of needs.
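Putting one input and one output together, a minimal filebeat.yml sketch that ships log files to Kafka might look like the following; the paths, topic, and broker addresses are placeholders:

```yaml
# Minimal FileBeat sketch: tail application logs and publish them to Kafka.
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092"]
  topic: "app-logs"
  required_acks: 1
```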
Throughput: FileBeat's throughput is moderate; it handles basic needs, on the order of a few thousand events per second, but larger volumes require special handling.
Flexibility: FileBeat offers some flexibility, but Go itself raises the bar for customization, and it is not as flexible or as capable as Flume or Logstash.
Resource consumption: being written in Go, it normally consumes little, but when individual events are large the default configuration is prone to OOM.
Operations: FileBeat is written in Go, so there are no extra dependencies, but its configuration file is YAML and there is quite a lot that can be configured.
Reliability: I have not seen FileBeat take specific measures against loss during transmission itself, so in extreme cases data loss is possible.
https://www.elastic.co/guide/en/beats/filebeat/current/index.html
Original article: https://www.cnblogs.com/029zz010buct/p/12621098.html