
Parsing ORC and Parquet Data with Apache Druid

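Apache Druid can batch-ingest data from local files or from HDFS, and the latest version at the time of writing (0.18) can also parse ORC and Parquet data directly. Using this feature requires a small amount of configuration.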

What the official documentation says

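Apache Druid bundles all of the core extensions (see the appendix of this article). To load an extension, add its name to druid.extensions.loadList in common.runtime.properties. For example, to load the postgresql-metadata-storage and druid-hdfs-storage extensions, use this configuration: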

druid.extensions.loadList=["postgresql-metadata-storage", "druid-hdfs-storage"]
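Editing the load list by hand is easy to get wrong when a properties file already lists several extensions. The following is a minimal Python sketch, not part of Druid itself, of merging extra extension names into an existing druid.extensions.loadList line. The helper name add_extensions is hypothetical, and the sketch assumes the list is written as a JSON array on a single line, as in the example above.

```python
import json
import re

LOAD_LIST_KEY = "druid.extensions.loadList"

# Matches an uncommented single-line "druid.extensions.loadList=[...]" entry.
_LOAD_LIST_RE = re.compile(
    r"^([ \t]*" + re.escape(LOAD_LIST_KEY) + r"[ \t]*=[ \t]*)(\[.*\])[ \t]*$",
    re.MULTILINE,
)


def add_extensions(properties_text, required):
    """Return properties_text with every name in `required` in the load list.

    Hypothetical helper: existing entries keep their order, missing names
    are appended, and duplicates are not added twice.
    """
    match = _LOAD_LIST_RE.search(properties_text)
    if match is None:
        # No load list yet: append a new one at the end of the file.
        merged = []
        for name in required:
            if name not in merged:
                merged.append(name)
        sep = "" if (not properties_text or properties_text.endswith("\n")) else "\n"
        return properties_text + sep + LOAD_LIST_KEY + "=" + json.dumps(merged) + "\n"

    merged = json.loads(match.group(2))
    for name in required:
        if name not in merged:
            merged.append(name)
    new_line = match.group(1) + json.dumps(merged)
    return properties_text[:match.start()] + new_line + properties_text[match.end():]


if __name__ == "__main__":
    original = 'druid.extensions.loadList=["postgresql-metadata-storage"]\n'
    print(add_extensions(original, ["druid-hdfs-storage"]), end="")
```

The function only rewrites the text; writing the result back to common.runtime.properties and restarting the Druid processes is left to the operator.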
           

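So, for Druid to parse ORC and Parquet data, the load list must include druid-orc-extensions and druid-parquet-extensions. The extension table in the appendix also notes that druid-parquet-extensions requires druid-avro-extensions to be loaded, which the example below does not include; add it if your setup needs it.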

druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches","druid-orc-extensions","druid-parquet-extensions"]
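Once the extensions are loaded, an ingestion task still has to say which format to use. The sketch below builds a native batch ingestion spec as a Python dict and prints it as JSON. It rests on assumptions: the field names (index_parallel, an hdfs input source with paths, and an inputFormat of type orc or parquet) follow my reading of the Druid native batch documentation and should be checked against your Druid version. The function name, data source name, HDFS path, and column names are all hypothetical.

```python
import json


def build_ingestion_spec(data_source, hdfs_paths, input_format,
                         timestamp_column, dimensions):
    """Build a native batch ingestion spec for ORC or Parquet files on HDFS.

    Hypothetical helper: it only assembles a dict and does not contact Druid.
    """
    if input_format not in ("orc", "parquet"):
        raise ValueError("input_format must be 'orc' or 'parquet'")
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": data_source,
                "timestampSpec": {"column": timestamp_column, "format": "auto"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "day",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                # Assumed to rely on the druid-hdfs-storage extension.
                "inputSource": {"type": "hdfs", "paths": hdfs_paths},
                # Assumed to rely on druid-orc-extensions or
                # druid-parquet-extensions, depending on the type.
                "inputFormat": {"type": input_format},
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


if __name__ == "__main__":
    # All argument values here are made-up placeholders.
    spec = build_ingestion_spec(
        "events", "hdfs://namenode/data/events/", "parquet",
        "ts", ["user", "country"],
    )
    print(json.dumps(spec, indent=2))
```

The printed JSON is the kind of document one would submit as an ingestion task; switching between ORC and Parquet only changes the inputFormat type.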
           

Appendix

Name Description
druid-avro-extensions Support for data in Apache Avro data format.
druid-azure-extensions Microsoft Azure deep storage.
druid-basic-security Support for Basic HTTP authentication and role-based access control.
druid-bloom-filter Support for providing Bloom filters in druid queries.
druid-datasketches Support for approximate counts and set operations with Apache DataSketches.
druid-google-extensions Google Cloud Storage deep storage.
druid-hdfs-storage HDFS deep storage.
druid-histogram Approximate histograms and quantiles aggregator. Deprecated, please use the DataSketches quantiles aggregator from the druid-datasketches extension instead.
druid-kafka-extraction-namespace Apache Kafka-based namespaced lookup. Requires namespace lookup extension.
druid-kafka-indexing-service Supervised exactly-once Apache Kafka ingestion for the indexing service.
druid-kinesis-indexing-service Supervised exactly-once Kinesis ingestion for the indexing service.
druid-kerberos Kerberos authentication for druid processes.
druid-lookups-cached-global A module for lookups providing a jvm-global eager caching for lookups. It provides JDBC and URI implementations for fetching lookup data.
druid-lookups-cached-single Per-lookup caching module to support use cases where a lookup needs to be isolated from the global pool of lookups.
druid-orc-extensions Support for data in Apache ORC data format.
druid-parquet-extensions Support for data in Apache Parquet data format. Requires druid-avro-extensions to be loaded.
druid-protobuf-extensions Support for data in Protobuf data format.
druid-ranger-security Support for access control through Apache Ranger.
druid-s3-extensions Interfacing with data in AWS S3, and using S3 as deep storage.
druid-ec2-extensions Interfacing with AWS EC2 for autoscaling middle managers (undocumented).
druid-stats Statistics related module including variance and standard deviation.
mysql-metadata-storage MySQL metadata store.
postgresql-metadata-storage PostgreSQL metadata store.
simple-client-sslcontext Simple SSLContext provider module to be used by Druid's internal HttpClient when talking to other Druid processes over HTTPS.
druid-pac4j OpenID Connect authentication for druid processes.
aliyun-oss-extensions Aliyun OSS deep storage
ambari-metrics-emitter Ambari Metrics Emitter
druid-cassandra-storage Apache Cassandra deep storage.
druid-cloudfiles-extensions Rackspace Cloudfiles deep storage and firehose.
druid-distinctcount DistinctCount aggregator
druid-redis-cache A cache implementation for Druid based on Redis.
druid-time-min-max Min/Max aggregator for timestamp.
sqlserver-metadata-storage Microsoft SQLServer metadata store.
graphite-emitter Graphite metrics emitter
statsd-emitter StatsD metrics emitter
kafka-emitter Kafka metrics emitter
druid-thrift-extensions Support thrift ingestion
druid-opentsdb-emitter OpenTSDB metrics emitter
materialized-view-selection, materialized-view-maintenance Materialized View
druid-moving-average-query Support for Moving Average and other Aggregate Window Functions in Druid queries.
druid-influxdb-emitter InfluxDB metrics emitter
druid-momentsketch Support for approximate quantile queries using the momentsketch library
druid-tdigestsketch Support for approximate sketch aggregators based on T-Digest
gce-extensions GCE Extensions