CMU15721-Spring2024 Course Notes

Last updated in the early hours of November 24, 2024

Some papers worth reading:

// todo

Overview

  • data cubes -> data warehouses -> shared-disk -> lakehouse
  • ETL tool
  • push query to data / pull data to query
  • shared-nothing / shared-disk

Data Formats

  • storage model:
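    • n-ary (NSM): store all the attributes of a single tuple contiguously
    • decomposition (DSM): store a single attribute for all tuples contiguously
    • partition attributes across (PAX): hybrid; horizontally partition tuples into row groups, then vertically partition the attributes within each row group
      • each attribute in a row group is stored as a column chunk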
  • open-source: parquet / orc / arrow
  • encoding:
    • dictionary encoding for columns
    • zstd for block compression
    • zone maps / bloom filters for filtering (skipping data that cannot match)
  • nested data in columns:
    • shredding
    • length + presence
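
The PAX layout, dictionary encoding, and zone maps from the list above can be sketched together in a few lines of Python. This is only a toy in-memory illustration, not the actual Parquet or ORC format: the names `ColumnChunk`, `encode_chunk`, `to_pax`, and `scan_equal` are hypothetical, and the sample table is made up.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    dictionary: list  # distinct values, in first-seen order
    codes: list       # one dictionary index per row in the chunk
    zone_min: object  # zone map: smallest value in the chunk
    zone_max: object  # zone map: largest value in the chunk

    def decode(self):
        """Map dictionary codes back to the original values."""
        return [self.dictionary[c] for c in self.codes]


def encode_chunk(values):
    """Dictionary-encode one column chunk and record its zone map."""
    index = {}
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(index)
        codes.append(index[v])
    return ColumnChunk(list(index), codes, min(values), max(values))


def to_pax(rows, columns, row_group_size):
    """Split rows horizontally into row groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), row_group_size):
        group_rows = rows[start:start + row_group_size]
        groups.append({
            name: encode_chunk([r[i] for r in group_rows])
            for i, name in enumerate(columns)
        })
    return groups


def scan_equal(groups, column, target):
    """Return (matching tuples, number of row groups skipped via the zone map)."""
    matches, skipped = [], 0
    for group in groups:
        chunk = group[column]
        # Zone map check: the target cannot be in this group, so skip decoding it.
        if target < chunk.zone_min or target > chunk.zone_max:
            skipped += 1
            continue
        decoded = [c.decode() for c in group.values()]
        for i, v in enumerate(chunk.decode()):
            if v == target:
                matches.append(tuple(col[i] for col in decoded))
    return matches, skipped


# Made-up sample table: (id, country)
rows = [(1, "US"), (2, "US"), (3, "CN"), (4, "CN"), (5, "DE"), (6, "DE")]
groups = to_pax(rows, ["id", "country"], row_group_size=2)
matches, skipped = scan_equal(groups, "id", 4)
print(matches, skipped)
```

With a row group size of 2, the lookup for `id == 4` only decodes the middle row group; the other two are skipped because 4 falls outside their min/max range. Real formats additionally compress the encoded chunks (for example with zstd) and may attach bloom filters, which this sketch leaves out.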

https://gentlecold.top/20241119/cmu15721-note/
Author: GentleCold
Published: November 19, 2024