码迷,mamicode.com
首页 > 其他好文 > 详细

<Parquet><Physical Properties>

时间:2017-07-30 13:57:11      阅读:309      评论:0      收藏:0      [点我收藏+]

标签:not   ado   sim   tip   get   nbsp   less   lzo   set   

Parquet

  • Parquet is a columnar storage format for Hadoop.
  • Parquet is designed to make the advantages of compressed, efficient colunmar data representation available to any project in the Hadoop ecosystem.

Physical Properties

  • Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters.
  • Now, parquet file provides the following physical properties.
    • parquet.block.size: The block size is the size of a row group being buffered in memory. This limits the memory usage when writing. Larger values will improve the I/O when reading but consume more memory when writing. Default size is 134217728 bytes (= 128 * 1024 * 1024).
    • parquet.page.size: The page size if for compression. When reading, each page is the smallest unit that must be read fully to access a single record. If the value is too small, the compression will deteriorate. Default size is 1048576 bytes (= 1 * 1024 * 1024).
    • parquet.compression: The compression algorithm used to compress pages. It should be one of uncompressedsnappygziplzo. Default is uncompressed.
    • parquet.enable.dictionary: The boolean value  is to enable/disable dictionary encoding. It should be one of either true or false. Default is true.

Parquet Row Group Size

Row Group

  • Even though Parquet is a column-orientied format, the largest sections of data are groups of row data rows.
  • Records are organized into row groups so that the file is splittable and each split contains complete records.
  • Here’s a simple picture of how data is stored for a simple schema with columns A, in green, and B, in blue:
    技术分享
  • Why row groups? --> If the entire file were organized by columns then the underlying HDFS blocks would contain just a column or two of each record. Reassembling those records to process them would require shuffling almost all of the data around to the right place. As below:
    技术分享
  • There is another benefit to organizing data into row groups: memory consumption. Before Parquet can write the first data value in column B, it needs to write the last value of column A. All column-oriented formas need to buffer record data in memory until those records are written all at once.
  • You can control row group size by setting parquet.block.size, in bytes(default: 128MB). Parquet buffers data in its final encoded and compressed form, which uses less memory and means that the amount of buffered data is the same sa the row group size on disk. 
  • That makes the row group size the most important setting. It controls both:
    • The amount of memory consumed for each open Parquet file, and
    • The layout of column data on disk.

  The row group setting is a trade-off between these two. 

 

 

 

 

 

 

FYI

 

<Parquet><Physical Properties>

标签:not   ado   sim   tip   get   nbsp   less   lzo   set   

原文地址:http://www.cnblogs.com/wttttt/p/7258771.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!