当前位置：首页 > news >正文

Apache Paimon Append Queue表解析

news 2025/7/16 3:45:48

a) 定义

在此模式下，将append table视为由bucket分隔的queue。

同一bucket中的每条record都是严格排序的，流式读取将完全按照写入顺序将record传输到下游。

使用此模式，无需特殊配置，所有数据都将作为queue进入一个bucket，还可以定义bucket和bucket-key，以启用更大的并行度和分散数据。

在这里插入图片描述

b) Compaction

默认情况下，sink node将自动执行compaction以控制文件数量，以下参数调整compaction策略：

Key	Default	Type	Description
write-only	false	Boolean	If set to true, compactions and snapshot expiration will be skipped. This option is used along with dedicated compact jobs.
compaction.min.file-num	5	Integer	For file set [f_0,…,f_N], the minimum file number which satisfies sum(size(f_i)) >= targetFileSize to trigger a compaction for append table. This value avoids almost-full-file to be compacted, which is not cost-effective.
compaction.max.file-num	50	Integer	For file set [f_0,…,f_N], the maximum file number to trigger a compaction for append table, even if sum(size(f_i)) < targetFileSize. This value avoids pending too much small files, which slows down the performance.
full-compaction.delta-commits	(none)	Integer	Full compaction will be constantly triggered after delta commits.

c) Streaming Source

目前仅支持Flink引擎。

i）Streaming Read Order

对于streaming reads，records按以下顺序生成：

两条记录来自不同的分区
- 如果scan.plan-sort-partition设置为true，分区值较小的记录将先生成。
- 否则，将首先生成具有较早分区创建时间的记录。
两条记录来自同一分区的同一个桶，先written的记录将先生成。
两条记录来自同一分区的两个不同桶，不同的桶由不同的任务处理，它们之间不保证有序。

ii) Watermark 定义

定义reading Paimon tables的watermark。

CREATE TABLE T (
    `user` BIGINT,
    product STRING,
    order_time TIMESTAMP(3),
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (...);

-- launch a bounded streaming job to read paimon_table
SELECT window_start, window_end, COUNT(`user`) FROM TABLE(
 TUMBLE(TABLE T, DESCRIPTOR(order_time), INTERVAL '10' MINUTES)) GROUP BY window_start, window_end;

可以启用Flink Watermark alignment，确保没有sources/splits/shards/partitions额外增加watermarks：

Key	Default	Type	Description
scan.watermark.alignment.group	(none)	String	A group of sources to align watermarks.
scan.watermark.alignment.max-drift	(none)	Duration	Maximal drift to align watermarks, before we pause consuming from the source/task/partition.

iii) Bounded Stream

Streaming Source可以有界，指定"scan.bounded.watermark"定义有界流模式的结束条件，遇到更大的watermark snapshot时stream reading将结束。

snapshot中的Watermark由writer生成，例如，指定kafka source并定义watermark，当使用此kafka source写入Paimon表时，Paimon表的snapshots将生成相应的watermark，以便在streaming reads此Paimon表时使用bounded watermark功能。

CREATE TABLE kafka_table (
    `user` BIGINT,
    product STRING,
    order_time TIMESTAMP(3),
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka'...);

-- launch a streaming insert job
INSERT INTO paimon_table SELECT * FROM kakfa_table;

-- launch a bounded streaming job to read paimon_table
SELECT * FROM paimon_table /*+ OPTIONS('scan.bounded.watermark'='...') */;

d）创建Append table并指定bucket key示例

CREATE TABLE MyTable (
  product_id BIGINT,
  price DOUBLE,
  sales BIGINT
) WITH (
      'bucket' = '8',
      'bucket-key' = 'product_id'
      );

查看全文

http://www.dtcms.com/a/8820.html

【语音识别】- 几个主流模型

数据库的介绍、分类、作用和特点

【C++精简版回顾】14.（重载2）流重载

【Python】python离线安装依赖

3D工业相机及品牌集合

蓝月亮，蓝禾，三七互娱，顺丰，康冠科技，金证科技24春招内推

git入门

PCIE Order Set

java spring cloud 企业电子招标采购系统源码:营造全面规范安全的电子招投标环境，促进招投标市场健康可持续发展

大宇、固特、希亦超声波清洗机实测，哪款清洗效果好？一篇掌握

Laravel Octane 和 Swoole 协程的使用分析二

Unity 向量计算、欧拉角与四元数转换、输出文本、告警、错误、修改时间、定时器、路径、

SQL server创建数据库

leetcode--接雨水（双指针法，动态规划，单调栈）

【AI Agent系列】【MetaGPT多智能体学习】3. 开发一个简单的多智能体系统，兼看MetaGPT多智能体运行机制

python66-Python的循环之常用工具函数

pyspark（一） DataFrame结合jupyter入门

Redis内存淘汰策略详解

Java面试题总结6

【GPTs分享】每日GPTs分享之Image Generator Tool

加密和签名的区别及应用场景

详解字符串函数＜string.h＞（上）

详解IP安全：IPSec协议簇 | AH协议 | ESP协议 | IKE协议

回溯 Leetcode 47 全排列II

鸿蒙ArkTs开发WebView问题总结

ChatGPT学习第三周

SpringBoot 自定义映射规则resultMap association一对一

Nacos配置

动态规划--（算法竞赛、蓝桥杯）--二维费用背包

如何学习自然语言处理之语言模型