CMU15721-Spring2024 Course Notes
Last updated on the evening of January 5, 2025
Some papers worth reading:
// todo
Overview
- data cubes -> data warehouses -> shared-disk -> lakehouse
- ETL tool
- push query to data / pull data to query
- shared-nothing / shared-disk
Data Formats
- storage models:
  - n-ary (NSM): store all the attributes of a single tuple contiguously
  - decomposition (DSM): store a single attribute of all tuples contiguously
  - partition attributes across (PAX): hybrid; split rows into row groups, then vertically partition the attributes within each group
    - each attribute of a row group is stored as a column chunk
- open-source formats: Parquet / ORC / Arrow
- encoding:
  - dictionary compression for columns
  - zstd for block compression
  - zone maps / bloom filters for evaluating filters
- nested data in columns:
  - shredding
  - length + presence
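The storage models, dictionary encoding, and zone maps above can be sketched in a few lines of Python. This is only an illustrative sketch, not how Parquet or ORC implement them: the table `rows`, the helper names, and the use of plain lists in place of byte buffers are all assumptions made for the example.

```python
# Hypothetical three-column table (id, country, amount), used only for illustration.
rows = [
    (1, "us", 10.0),
    (2, "de", 12.5),
    (3, "us", 7.0),
    (4, "fr", 3.5),
]

def nsm_layout(rows):
    """N-ary: all attributes of one tuple are stored contiguously."""
    return [value for row in rows for value in row]

def dsm_layout(rows):
    """Decomposition: all values of one attribute are stored contiguously."""
    return [list(column) for column in zip(*rows)]

def pax_layout(rows, group_size):
    """PAX: split rows into row groups, then store each group column by column."""
    return [
        dsm_layout(rows[start:start + group_size])
        for start in range(0, len(rows), group_size)
    ]

def dictionary_encode(column):
    """Replace each value with a small integer code into a sorted dictionary."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in column]

def zone_map(chunk):
    """Per-chunk (min, max) summary, computed once when the chunk is written."""
    return min(chunk), max(chunk)

def groups_to_scan(groups, column_index, lower_bound):
    """Indexes of row groups that may hold values > lower_bound in one column."""
    return [
        index
        for index, group in enumerate(groups)
        if zone_map(group[column_index])[1] > lower_bound
    ]

groups = pax_layout(rows, group_size=2)
dictionary, codes = dictionary_encode(dsm_layout(rows)[1])
candidates = groups_to_scan(groups, column_index=2, lower_bound=11.0)
```

With two rows per group, a filter such as `amount > 11.0` only needs to read the first row group, because the zone map of the second group reports a maximum below the bound.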
Query Execution
- three optimizations:
  - data parallelization (vectorization)
  - task parallelization (multi-threading)
  - code specialization (pre-compilation / JIT)
- processing models:
  - iterator model
  - materialization model
  - vectorized / batch model
    - an output batch may contain tuples that do not satisfy the filters
      - solution: selection vectors (offsets) or bitmaps
- processing direction:
  - top to bottom (pull) (iterator model)
    - easy to control output (e.g. for LIMIT)
    - additional overhead from the `Next()` function calls
  - bottom to top (push)
    - allows tighter control of caches/registers within a pipeline
    - may not have exact control over intermediate result sizes
    - difficult to implement some operators (sort-merge join)
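As a rough illustration of the vectorized model and the two processing directions, the sketch below runs the same filter-then-sum plan once as a pull pipeline and once as a push pipeline. All names here (`scan`, `filter_pull`, `FilterPush`, `SumSink`, and so on) are hypothetical and chosen for the example; Python generators stand in for the `Next()` calls of the iterator model.

```python
def scan(batches):
    """Leaf operator of the pull pipeline: yields one batch per request."""
    for batch in batches:
        yield batch

def filter_pull(child, predicate):
    """Pull-based vectorized filter.

    The batch is passed on unchanged; the selection vector lists the
    offsets of the tuples that satisfy the predicate.
    """
    for batch in child:
        selection = [i for i, value in enumerate(batch) if predicate(value)]
        yield batch, selection

def sum_pull(child):
    """Root operator: pulls batches from its child until it is exhausted."""
    total = 0
    for batch, selection in child:
        total += sum(batch[i] for i in selection)
    return total

def selection_to_bitmap(selection, batch_size):
    """Alternative representation: one flag per tuple instead of offsets."""
    bitmap = [False] * batch_size
    for i in selection:
        bitmap[i] = True
    return bitmap

class SumSink:
    """Last operator of the push pipeline: accumulates what it is given."""
    def __init__(self):
        self.total = 0

    def push(self, batch, selection):
        self.total += sum(batch[i] for i in selection)

class FilterPush:
    """Push-based filter: receives a batch and hands it on to its parent."""
    def __init__(self, predicate, parent):
        self.predicate = predicate
        self.parent = parent

    def push(self, batch):
        selection = [i for i, value in enumerate(batch) if self.predicate(value)]
        self.parent.push(batch, selection)

def run_push(batches, first_operator):
    """The driver starts at the leaf and pushes every batch upward."""
    for batch in batches:
        first_operator.push(batch)

batches = [[1, 5, 8], [10, 2, 7]]
is_large = lambda value: value > 4

pull_total = sum_pull(filter_pull(scan(batches), is_large))

sink = SumSink()
run_push(batches, FilterPush(is_large, sink))
push_total = sink.total
```

Both pipelines compute the same result. The difference is who drives the loop: in the pull version the root asks its child for the next batch, while in the push version the scan loop decides when each operator runs.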
Query Execution II
https://gentlecold.top/20241119/cmu15721-note/