Parquet Encodings

2021-11-11 02:00:15

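Although the Parquet documentation lists many encodings (https://github.com/apache/parquet-format/blob/master/Encodings.md ), in practice you can only choose between two: PLAIN and Dictionary Encoding. The only switch is whether Dictionary is on or off. The setting also applies to the whole file, not to individual columns, so you cannot pick an encoding for one specific column.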
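To make the difference between the two encodings concrete, here is a toy sketch in Python. It only illustrates the idea and is not Parquet's on-disk layout (real Parquet files also compress the dictionary indices with RLE and bit-packing). The function names are made up for this example.

```python
def plain_encode(values):
    """PLAIN: store every value as-is, one after another."""
    return list(values)


def dictionary_encode(values):
    """Dictionary: store each distinct value once, plus one small index per row."""
    dictionary = []   # distinct values, in first-seen order
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one entry per row
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


column = ["beijing", "shanghai", "beijing", "beijing", "shanghai"]
dictionary, indices = dictionary_encode(column)
print(dictionary)  # ['beijing', 'shanghai']
print(indices)     # [0, 1, 0, 0, 1]
```

A column with few distinct values shrinks a lot under dictionary encoding, while a column where almost every value is unique gains nothing, which is why being limited to one switch for the whole file is inconvenient.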

（https://issues.apache.org/jira/browse/PARQUET-1058 ）

https://issues.apache.org/jira/browse/PARQUET-796

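The issue states: "the boolean enableDictionary determines whether dictionary encoding is used for all columns or none of them." One commenter there says you can use OriginalType.TIMESTAMP_MILLIS to turn on Delta Encoding. But the Parquet documentation does not mention this data type at all, so it cannot be treated as an official feature.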

Setting the encoding through the file-level API: when initializing ParquetWriter, choose whether to turn on Dictionary Encoding (the enableDictionary parameter).

(Figure: Parquet API)

If you use Spark, you set the parameter parquet.enable.dictionary:

https://stackoverflow.com/questions/45488227/how-to-set-parquet-file-encoding-in-spark
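A minimal PySpark-style sketch of passing that parameter, assuming the write option is forwarded to the Parquet writer as the linked answer describes. The helper name `write_parquet` is made up for this example, and `df` is expected to be a Spark DataFrame.

```python
def write_parquet(df, path, use_dictionary):
    """Write `df` as Parquet with dictionary encoding on or off for the whole file.

    `df` is assumed to be a pyspark.sql.DataFrame. The option key is the
    parameter named above; it is passed as a string.
    """
    flag = "true" if use_dictionary else "false"
    # Assumption: Spark hands this option through to the Parquet writer.
    df.write.option("parquet.enable.dictionary", flag).parquet(path)
```

For example, `write_parquet(df, "/tmp/out_plain", use_dictionary=False)` would write every column as PLAIN (the path is a placeholder). There is still no way to do this for a single column here.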

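To see how a particular column in a Parquet file is encoded, you can step through ParquetReader's build() function in a debugger. It reads the file's Footer, which holds the Metadata for the whole file, including the encoding of every Column in every RowGroup.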
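For orientation, here is a small sketch of where that Footer sits in the file, assuming a regular unencrypted Parquet file: the file starts and ends with the 4-byte magic PAR1, and the 4 bytes before the trailing magic give the length of the serialized metadata as a little-endian integer. The function name `locate_footer` is made up for this example. It only finds the footer bytes; decoding them (they are Thrift-serialized) to get the per-column encodings is left to a Parquet library.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(data):
    """Return (offset, length) of the serialized file metadata inside `data`.

    `data` is the full content of a Parquet file as bytes.
    """
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    # The 4 bytes before the trailing magic hold the footer length, little-endian.
    (length,) = struct.unpack("<I", data[-8:-4])
    offset = len(data) - 8 - length
    if offset < 4:
        raise ValueError("footer length is larger than the file")
    return offset, length
```

Because the metadata is at the end, a reader has to seek to the tail of the file first, which matches what you see when stepping through the reader.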

