
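openGauss DB4AI vs. the scikit-learn Module: A Comparative Exploration

The current version of openGauss supports native DB4AI capabilities. By introducing native AI operators, it simplifies the workflow and makes full use of the optimization and execution capabilities of the database optimizer and executor, delivering high-performance in-database model training.

This article describes the author's tests of the openGauss DB4AI feature using the Iris dataset.

Getting the dataset

Download the Iris dataset from Kaggle.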

import kagglehub

# Download latest version
path = kagglehub.dataset_download("saurabh00007/iriscsv")
print("Path to dataset files:", path)

Output

hbu@Pauls-MacBook-Air db4ai % python3 get-Iris-data.py 
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.11), please consider upgrading to the latest version (0.3.12).
Path to dataset files: /Users/hbu/.cache/kagglehub/datasets/saurabh00007/iriscsv/versions/1
hbu@Pauls-MacBook-Air db4ai % ls  -lth /Users/hbu/.cache/kagglehub/datasets/saurabh00007/iriscsv/versions/1
total 16
-rw-r--r--  1 hbu  staff   5.0K Apr 23 15:09 Iris.csv

Move the file to a directory that is convenient to manage within the project, for example project_dir/data/Iris.csv.
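
The copy step can be sketched in Python as follows. The helper name `stage_dataset` and its `project_dir` argument are illustrative assumptions; only the target layout `data/Iris.csv` comes from the text.

```python
import shutil
from pathlib import Path


def stage_dataset(download_dir: str, project_dir: str, filename: str = "Iris.csv") -> Path:
    """Copy the downloaded CSV into <project_dir>/data and return the new path."""
    source = Path(download_dir) / filename
    target_dir = Path(project_dir) / "data"
    target_dir.mkdir(parents=True, exist_ok=True)  # create data/ if it is missing
    # copy2 preserves file metadata; the copy in the kagglehub cache is left in place
    return Path(shutil.copy2(source, target_dir / filename))
```

Calling `stage_dataset(path, "project_dir")` with the `path` printed by kagglehub would place the file at project_dir/data/Iris.csv.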

openGauss SVM classification

  1. Split the Iris dataset into train and test sets and load them into the openGauss tables iris_train and iris_test.
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sqlalchemy.engine import create_engine
from sqlalchemy.dialects.postgresql.base import PGDialect

# openGauss returns a version string that SQLAlchemy cannot parse; report a fixed version instead
PGDialect._get_server_version_info = lambda *args: (9, 2)

current_path = os.path.dirname(os.path.realpath(__file__))
# Read the Iris dataset from the CSV file
iris = pd.read_csv(current_path+'/../data/Iris.csv')
# Check the dataset size
print(iris.shape)
# Split the dataset into training and test sets; the test set is 20% of the data
train, test = train_test_split(iris, test_size=0.2)
# Store the training and test data in the openGauss database
# engine=pg_manager.create_engine()
engine=create_engine("postgresql+psycopg2://username:password@hostname:port/databasename")
train.to_sql("iris_train", engine, if_exists='replace', index=False)
#test.columns = [col.lower() for col in test.columns]
test.to_sql("iris_test", engine, if_exists='replace', index=False)
  1. Check whether the model iris_svm_model already exists, and DROP it if it does.
postgres=> SELECT modelname from gs_model_warehouse where modelname='iris_svm_model';
   modelname    
----------------
 iris_svm_model
(1 row)

postgres=> drop model iris_svm_model;
DROP MODEL
  1. Create an SVM model from the data in table iris_train, leaving all hyperparameters at their defaults (kernel=linear in this case).
postgres=> CREATE MODEL iris_svm_model USING svm_classification FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" TARGET "PetalWidthCm"<=2.5 FROM iris_train;
MODEL CREATED. PROCESSED 1
  1. Inspect the SVM classification model iris_svm_model trained on table iris_train.
postgres=> \x
Expanded display is on.
postgres=> SELECT * from gs_model_warehouse where modelname='iris_svm_model';
-[ RECORD 1 ]---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
modelname             | iris_svm_model
modelowner            | 16413
createtime            | 2025-04-28 17:22:35.526005
processedtuples       | 120
discardedtuples       | 0
preprocesstime        | 0
exectime              | .000798
iterations            | 4
outputtype            | 16
modeltype             | svm_classification
query                 | CREATE MODEL iris_svm_model USING svm_classification FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" TARGET "PetalWidthCm"<=2.5 FROM iris_train;
modeldata             | \x5c7830313030303030303030303030303030303030303030303030303030303030303030363430303030303065383033303030303961393939393939393939396539336636363636363636363636363665653366666361396631643234643632343033666462343830663638303030303030303037623134616534376531376138343366303030303030303030303030303030303030303030303030303030306530336630323030303030303030303030303030303030303030303030303030663033663030303030303030303030303030303030303030303030303030303030303030303430303030303030353030303030303035303030303030303130303030303031613030303030306666666630303030396136613136666661623465633233663434613632616439323533376234336665616539353366383763303162353366383031343437616533333666393933663763336661653363363637343961336636383030303030303031303030303030303030303030303031303030303030303032303030303030303130303030303030303031
weight                | 
hyperparametersnames  | {batch_size,decay,learning_rate,max_iterations,max_seconds,optimizer,tolerance,seed,verbose,lambda,kernel,components,gamma,degree,coef0}
hyperparametersvalues | {1000,.95,.8,100,0,gd,.0005,1745832155,false,.01,linear,0,.5,2,1}
hyperparametersoids   | {23,701,701,23,23,1043,701,23,16,701,1043,23,701,23,701}
coefnames             | 
coefvalues            | 
coefoids              | 
trainingscoresname    | {accuracy,f1,precision,recall,loss}
trainingscoresvalue   | {1,1,1,1,.000535965}
modeldescribe         | 
  1. Use the SVM model to run predictions on table iris_test for validation.
postgres=> SELECT "Species", PREDICT BY iris_svm_model (FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"  ) as "PREDICT" FROM iris_test;
     Species     | PREDICT 
-----------------+---------
 Iris-versicolor | t
 Iris-virginica  | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-virginica  | t
 Iris-setosa     | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-versicolor | t
 Iris-versicolor | t
 Iris-virginica  | t
 Iris-setosa     | t
 Iris-versicolor | t
 Iris-virginica  | t
 Iris-versicolor | t
 Iris-versicolor | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-versicolor | t
(30 rows)

%% Check for false values, i.e. rows where classification failed
postgres=> select "Species", "PREDICT" from (SELECT "Species", PREDICT BY iris_svm_model (FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"  ) as "PREDICT" FROM iris_test) as t  where "PREDICT"=False;
 Species | PREDICT 
---------+---------
(0 rows)

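Note: because the tables inserted through sqlalchemy keep mixed-case column names, a column such as Species cannot be referenced in all upper case or all lower case in SQL. Wherever such a column name appears in a statement, it must be wrapped in double quotes.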
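
When such statements are assembled from Python, the quoting can be applied mechanically. The sketch below is illustrative only: `quote_ident` and `build_predict_query` are hypothetical helpers, not part of openGauss or sqlalchemy, and they simply reproduce the shape of the PREDICT BY query shown above.

```python
from typing import Sequence


def quote_ident(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


def build_predict_query(model: str, features: Sequence[str], label: str, table: str) -> str:
    """Assemble a PREDICT BY query with every mixed-case column quoted."""
    feature_list = ", ".join(quote_ident(col) for col in features)
    return (
        f"SELECT {quote_ident(label)}, "
        f"PREDICT BY {model} (FEATURES {feature_list}) AS \"PREDICT\" "
        f"FROM {table};"
    )
```

An alternative, hinted at by the commented-out line in the loading script, is to lower-case the DataFrame column names before calling to_sql so that no quoting is needed.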

  1. Supplementary notes

Example of changing a model training parameter

postgres=> CREATE MODEL iris_svm_model USING svm_classification FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" TARGET "PetalWidthCm"<=2.5 FROM iris_train with kernel='linear';
MODEL CREATED. PROCESSED 1
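
To try several hyperparameter settings, the WITH suffix can be generated from a dictionary. This is a sketch under assumptions: `render_hyperparameters` is a hypothetical helper, the parameter names are taken from the hyperparametersnames column shown earlier, and the comma-separated form for multiple parameters is assumed rather than demonstrated in this article.

```python
def render_hyperparameters(params: dict) -> str:
    """Render a dict as the `WITH name=value, ...` suffix of a CREATE MODEL statement."""
    parts = []
    for name, value in params.items():
        if isinstance(value, bool):
            # bool must be checked before int, because bool is a subclass of int
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = repr(value)
        else:
            # string values such as kernel names are single-quoted
            rendered = "'" + str(value).replace("'", "''") + "'"
        parts.append(f"{name}={rendered}")
    return "WITH " + ", ".join(parts) if parts else ""
```

For `{"kernel": "linear"}` the helper returns `WITH kernel='linear'`, which matches the suffix used in the statement above.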

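Sklearn SVM classification

  1. When the sklearn module is used for SVM classification, the default kernel function is RBF (that is, gaussian).

Because the SVM algorithm in openGauss DB4AI defaults to the linear kernel, the kernel must be specified explicitly when creating the SVM class in sklearn: model = svm.SVC(verbose=1, kernel='linear').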

import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
current_path = os.path.dirname(os.path.realpath(__file__))
iris = pd.read_csv(current_path+'/../data/Iris.csv')
train, test = train_test_split(iris, test_size=0.3)
# Extract the features and labels of the training and test sets
train_x = train[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
train_y = train.Species
test_x = test[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
test_y = test.Species
# Create the support vector machine (SVM) classifier
# openGauss defaults to the linear SVM kernel and supports linear/gaussian/polynomial kernels
# SVC defaults to the RBF kernel (that is, the gaussian kernel)
# so kernel="linear" is set explicitly to match the openGauss default
model = svm.SVC(verbose=1, kernel='linear')
# Fit the SVM model on the training set
model.fit(train_x, train_y)
# Predict on the test set with the trained model
prediction = model.predict(test_x)
# Print the accuracy of the SVM model
print('The accuracy of the SVM is:', metrics.accuracy_score(test_y, prediction))
# Print the zero-one loss (fraction of misclassified samples)
print(metrics.zero_one_loss(test_y, prediction))

Run log

[LibSVM]*
optimization finished, #iter = 16
obj = -0.748057, rho = -1.452044
nSV = 3, nBSV = 0
*
optimization finished, #iter = 5
obj = -0.203684, rho = -1.507091
nSV = 3, nBSV = 0
*
optimization finished, #iter = 33
obj = -13.730499, rho = -8.624603
nSV = 20, nBSV = 16
Total nSV = 24
The accuracy of the SVM is: 0.9777777777777777
0.022222222222222254
  1. Check the prediction results of the sklearn SVM model.
import numpy as np

def check_prediction(prediction: np.ndarray, test: pd.DataFrame):
    # Compare each prediction with the true species and count the mismatches
    raw = test.Species.tolist()
    cnt = 0
    for idx, specie in enumerate(raw):
        is_equal = prediction[idx] == specie
        if not is_equal:
            cnt += 1
        print(idx, is_equal, specie, prediction[idx])
    return cnt

cnt = check_prediction(prediction, test)
cnt

Output

0 True Iris-versicolor Iris-versicolor
1 True Iris-setosa Iris-setosa
2 True Iris-virginica Iris-virginica
3 True Iris-versicolor Iris-versicolor
4 True Iris-versicolor Iris-versicolor
5 True Iris-setosa Iris-setosa
6 True Iris-setosa Iris-setosa
7 True Iris-versicolor Iris-versicolor
8 True Iris-versicolor Iris-versicolor
9 True Iris-versicolor Iris-versicolor
10 True Iris-versicolor Iris-versicolor
11 True Iris-versicolor Iris-versicolor
12 True Iris-setosa Iris-setosa
13 True Iris-setosa Iris-setosa
14 True Iris-setosa Iris-setosa
15 True Iris-virginica Iris-virginica
16 True Iris-virginica Iris-virginica
17 True Iris-setosa Iris-setosa
18 True Iris-setosa Iris-setosa
19 True Iris-virginica Iris-virginica
20 True Iris-versicolor Iris-versicolor
21 True Iris-setosa Iris-setosa
22 True Iris-versicolor Iris-versicolor
23 True Iris-virginica Iris-virginica
24 True Iris-virginica Iris-virginica
25 True Iris-virginica Iris-virginica
26 True Iris-setosa Iris-setosa
27 True Iris-versicolor Iris-versicolor
28 True Iris-setosa Iris-setosa
29 True Iris-virginica Iris-virginica
30 True Iris-setosa Iris-setosa
31 True Iris-versicolor Iris-versicolor
32 True Iris-setosa Iris-setosa
33 True Iris-virginica Iris-virginica
34 True Iris-virginica Iris-virginica
35 True Iris-versicolor Iris-versicolor
36 True Iris-virginica Iris-virginica
37 True Iris-setosa Iris-setosa
38 True Iris-versicolor Iris-versicolor
39 True Iris-virginica Iris-virginica
40 True Iris-versicolor Iris-versicolor
41 False Iris-versicolor Iris-virginica
42 True Iris-setosa Iris-setosa
43 True Iris-virginica Iris-virginica
44 True Iris-virginica Iris-virginica
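
The accuracy and zero-one loss in the run log follow directly from this listing: 44 of the 45 rows match, and only row 41 differs. A minimal pure-Python sketch of that arithmetic (the helper name `accuracy_and_loss` is illustrative and is not a sklearn function):

```python
def accuracy_and_loss(expected, predicted):
    """Return (accuracy, zero-one loss) for two equally long label sequences."""
    if len(expected) != len(predicted) or len(expected) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    # count positions where the predicted label equals the true label
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    accuracy = correct / len(expected)
    # zero-one loss is the fraction of misclassified samples
    return accuracy, 1.0 - accuracy
```

With 45 labels and a single mismatch, the accuracy is 44/45, roughly 0.978, and the loss is 1/45, roughly 0.022, in line with the run log.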

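Algorithm performance

Both openGauss and sklearn trained SVM models on the Iris dataset. In the runs shown above, the openGauss model reached 100% prediction accuracy while the sklearn model reached about 97.8%. In further training runs, the author also observed the sklearn SVM model reaching 100% accuracy. The difference may come from default parameters other than the kernel function. Note also that, in the scripts shown above, the two splits are drawn independently (test_size=0.2 versus 0.3) and the openGauss model predicts the boolean expression "PetalWidthCm"<=2.5 rather than Species, so the two figures are not strictly comparable.

The author's main goal is to explore the capabilities of openGauss DB4AI, so the difference between the two results is not analyzed in depth.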

Troubleshooting

  • openGauss version problem when pandas operates on the database: AssertionError: Could not determine version from string '(openGauss-lite 5.0.3 build 89d144c2) compiled at 2024-07-31 21:39:16 commit 0 last mr release'

    Before creating the Engine with sqlalchemy.engine, patch the function in sqlalchemy that obtains the server version information:

    from sqlalchemy.dialects.postgresql.base import PGDialect
    PGDialect._get_server_version_info = lambda *args: (9, 2)
    
  • Warning when pandas operates on a PostgreSQL database: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection

    # reference: https://cloud.tencent.com/developer/ask/sof/106578644
    
    /Users/hbu/SUFE/AI/Gemini/db4ai/iris-svm-model.py:41: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
      test=pd.read_sql('''SELECT * FROM  iris_test;''', con = conn,index_col="id")
         sepallengthcm  sepalwidthcm  petallengthcm  petalwidthcm          species
    id                                                                            
    86             6.0           3.4            4.5           1.6  Iris-versicolor
    3              4.7           3.2            1.3           0.2      Iris-setosa
    146            6.7           3.0            5.2           2.3   Iris-virginica
    76             6.6           3.0            4.4           1.4  Iris-versicolor
    41             5.0           3.5            1.3           0.3      Iris-setosa
    35             4.9           3.1            1.5           0.1      Iris-setosa
    

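    Solution: create a SQLAlchemy engine that connects to the PostgreSQL database through the psycopg2 driver, then pass that engine to pandas for the data operations.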

    from sqlalchemy.engine import create_engine
    engine=create_engine("postgresql+psycopg2://username:password@hostname:port/databasename")
    test=pd.read_sql('''SELECT * FROM  iris_test;''', con = engine,index_col="id")
    print(test)
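
For the first issue above, the version banner from the error message can still be parsed separately, for example for logging. The sketch below uses a hypothetical helper, `parse_opengauss_version`; its regular expression is an assumption based only on the banner quoted in the AssertionError, and the result is not meant to replace the (9, 2) tuple handed to SQLAlchemy.

```python
import re

# Assumed banner shape: "(openGauss-lite 5.0.3 build ...)" or "(openGauss 5.0.3 build ...)"
_OPENGAUSS_VERSION = re.compile(r"openGauss(?:-lite)?\s+(\d+)\.(\d+)\.(\d+)")


def parse_opengauss_version(banner: str):
    """Return (major, minor, patch) from an openGauss banner, or None if absent."""
    match = _OPENGAUSS_VERSION.search(banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())
```

Applied to the banner in the error message, the helper yields the tuple (5, 0, 3); a string without an openGauss version yields None.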
    
