
A Comparative Look at openGauss DB4AI and the scikit-learn Module

news 2025/7/3 12:49:35

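The current version of openGauss supports native DB4AI capabilities. By introducing native AI operators, it simplifies the workflow and draws on the optimization and execution capabilities of the database optimizer and executor to deliver high-performance in-database model training.

This article describes the author's tests of the openGauss DB4AI feature using the Iris dataset.

Getting the dataset

Download the Iris dataset from Kaggle.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("saurabh00007/iriscsv")
print("Path to dataset files:", path)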

Output

hbu@Pauls-MacBook-Air db4ai % python3 get-Iris-data.py 
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.11), please consider upgrading to the latest version (0.3.12).
Path to dataset files: /Users/hbu/.cache/kagglehub/datasets/saurabh00007/iriscsv/versions/1
hbu@Pauls-MacBook-Air db4ai % ls  -lth /Users/hbu/.cache/kagglehub/datasets/saurabh00007/iriscsv/versions/1
total 16
-rw-r--r--  1 hbu  staff   5.0K Apr 23 15:09 Iris.csv

Move the file into a directory that is easy to manage within the project, for example project_dir/data/Iris.csv.
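The move can also be scripted. The following is a minimal sketch using only the standard library; the helper name `stage_dataset` and the `data` subdirectory layout are assumptions made for illustration, and it copies the file rather than moving it so that the kagglehub cache stays intact.

```python
import shutil
from pathlib import Path


def stage_dataset(download_dir: str, project_dir: str, filename: str = "Iris.csv") -> Path:
    """Copy the downloaded CSV into <project_dir>/data and return the new path."""
    source = Path(download_dir) / filename
    target_dir = Path(project_dir) / "data"
    target_dir.mkdir(parents=True, exist_ok=True)  # create data/ if it is missing
    # copy2 leaves the cached original in place and preserves file timestamps
    return Path(shutil.copy2(source, target_dir / filename))
```

Calling `stage_dataset(path, "project_dir")` with the `path` printed by the download script would place the file at project_dir/data/Iris.csv.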

SVM classification with openGauss

Split the Iris dataset into train and test sets, and load them into the openGauss tables iris_train and iris_test respectively.

import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sqlalchemy.engine import create_engine
# PGDialect is only needed for the version workaround described under Troubleshooting
from sqlalchemy.dialects.postgresql.base import PGDialect

current_path = os.path.dirname(os.path.realpath(__file__))
# Read the Iris dataset from the CSV file
iris = pd.read_csv(current_path+'/../data/Iris.csv')
# Check the size of the dataset
print(iris.shape)
# Split the dataset into training and test sets; the test set is 20% of the data
train, test = train_test_split(iris, test_size=0.2)
# Store the training and test data in the openGauss database
# engine=pg_manager.create_engine()
engine=create_engine("postgresql+psycopg2://username:password@hostname:port/databasename")
train.to_sql("iris_train", engine, if_exists='replace', index=False)
#test.columns = [col.lower() for col in test.columns]
test.to_sql("iris_test", engine, if_exists='replace', index=False)
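The connection string passed to `create_engine` is a placeholder. When real credentials are substituted, characters such as `@` or `/` in a password need to be percent-encoded before they are embedded in the URL. A minimal sketch, assuming a hypothetical helper `build_pg_url` and made-up credentials:

```python
from urllib.parse import quote_plus


def build_pg_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Assemble a postgresql+psycopg2 URL, escaping special characters in the credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values only; substitute your own connection details.
print(build_pg_url("username", "p@ss/word", "hostname", 5432, "databasename"))
# postgresql+psycopg2://username:p%40ss%2Fword@hostname:5432/databasename
```

The resulting string can be passed to `create_engine` in place of the literal URL used above.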

Check whether the model iris_svm_model already exists, and DROP it if it does.

postgres=> SELECT modelname from gs_model_warehouse where modelname='iris_svm_model';
   modelname    
----------------
 iris_svm_model
(1 row)

postgres=> drop model iris_svm_model;
DROP MODEL

Create an SVM model from the data in table iris_train, leaving all parameters at their default values (so kernel=linear).

postgres=> CREATE MODEL iris_svm_model USING svm_classification FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" TARGET "PetalWidthCm"<=2.5 FROM iris_train;
MODEL CREATED. PROCESSED 1

Inspect the SVM classification model iris_svm_model trained on table iris_train.

postgres=> \x
Expanded display is on.
postgres=> SELECT * from gs_model_warehouse where modelname='iris_svm_model';
-[ RECORD 1 ]---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
modelname             | iris_svm_model
modelowner            | 16413
createtime            | 2025-04-28 17:22:35.526005
processedtuples       | 120
discardedtuples       | 0
preprocesstime        | 0
exectime              | .000798
iterations            | 4
outputtype            | 16
modeltype             | svm_classification
query                 | CREATE MODEL iris_svm_model USING svm_classification FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" TARGET "PetalWidthCm"<=2.5 FROM iris_train;
modeldata             | \x5c7830313030303030303030303030303030303030303030303030303030303030303030363430303030303065383033303030303961393939393939393939396539336636363636363636363636363665653366666361396631643234643632343033666462343830663638303030303030303037623134616534376531376138343366303030303030303030303030303030303030303030303030303030306530336630323030303030303030303030303030303030303030303030303030663033663030303030303030303030303030303030303030303030303030303030303030303430303030303030353030303030303035303030303030303130303030303031613030303030306666666630303030396136613136666661623465633233663434613632616439323533376234336665616539353366383763303162353366383031343437616533333666393933663763336661653363363637343961336636383030303030303031303030303030303030303030303031303030303030303032303030303030303130303030303030303031
weight                | 
hyperparametersnames  | {batch_size,decay,learning_rate,max_iterations,max_seconds,optimizer,tolerance,seed,verbose,lambda,kernel,components,gamma,degree,coef0}
hyperparametersvalues | {1000,.95,.8,100,0,gd,.0005,1745832155,false,.01,linear,0,.5,2,1}
hyperparametersoids   | {23,701,701,23,23,1043,701,23,16,701,1043,23,701,23,701}
coefnames             | 
coefvalues            | 
coefoids              | 
trainingscoresname    | {accuracy,f1,precision,recall,loss}
trainingscoresvalue   | {1,1,1,1,.000535965}
modeldescribe         |
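The hyperparametersnames and hyperparametersvalues columns are parallel arrays, which makes them hard to read side by side. The sketch below pairs them into a dictionary; `parse_pg_array` is a hypothetical helper that only handles simple, unquoted array literals such as the ones in this record.

```python
def parse_pg_array(text: str):
    """Split a simple '{a,b,c}' array literal into a list of strings (no quoting support)."""
    return text.strip().strip("{}").split(",")


# Values copied from the gs_model_warehouse record above.
names = ("{batch_size,decay,learning_rate,max_iterations,max_seconds,optimizer,"
         "tolerance,seed,verbose,lambda,kernel,components,gamma,degree,coef0}")
values = "{1000,.95,.8,100,0,gd,.0005,1745832155,false,.01,linear,0,.5,2,1}"

hyperparameters = dict(zip(parse_pg_array(names), parse_pg_array(values)))
print(hyperparameters["kernel"], hyperparameters["max_iterations"])  # linear 100
```

Read this way, the record confirms that the model was trained with kernel=linear and at most 100 iterations.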

Use the SVM model to run predictions against the data in table iris_test.

postgres=> SELECT "Species", PREDICT BY iris_svm_model (FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"  ) as "PREDICT" FROM iris_test;
     Species     | PREDICT 
-----------------+---------
 Iris-versicolor | t
 Iris-virginica  | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-virginica  | t
 Iris-setosa     | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-versicolor | t
 Iris-versicolor | t
 Iris-virginica  | t
 Iris-setosa     | t
 Iris-versicolor | t
 Iris-virginica  | t
 Iris-versicolor | t
 Iris-versicolor | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-versicolor | t
(30 rows)

-- Check for false values, i.e. rows where classification failed
postgres=> select "Species", "PREDICT" from (SELECT "Species", PREDICT BY iris_svm_model (FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"  ) as "PREDICT" FROM iris_test) as t  where "PREDICT"=False;
 Species | PREDICT 
---------+---------
(0 rows)

Note: the tables written through SQLAlchemy keep their mixed-case column names. In SQL, columns such as Species must therefore be written with exactly that case and cannot be spelled in all uppercase or all lowercase; every such identifier has to be wrapped in double quotes.
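One way to avoid the quoting requirement is to lowercase the column names before the tables are written, as the commented-out line in the loading script hints. The sketch below shows the idea on a plain list of names; `lowercase_columns` is a hypothetical helper, and the approach assumes that openGauss folds unquoted identifiers to lowercase in the way PostgreSQL does.

```python
def lowercase_columns(columns):
    """Return lowercased column names, refusing to proceed if two names would collide."""
    lowered = [name.lower() for name in columns]
    if len(set(lowered)) != len(lowered):
        raise ValueError("lowercasing would produce duplicate column names")
    return lowered


# Column names of the Iris table used in this article.
iris_columns = ["Id", "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]
print(lowercase_columns(iris_columns))
# With a pandas DataFrame this would be: df.columns = lowercase_columns(df.columns)
```

With lowercase names in the table, a query such as SELECT species FROM iris_test should no longer need double quotes.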

Additional notes

Example of changing the model training parameters

postgres=> CREATE MODEL iris_svm_model USING svm_classification FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" TARGET "PetalWidthCm"<=2.5 FROM iris_train with kernel='linear';
MODEL CREATED. PROCESSED 1
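When several parameter combinations are tried, the statement can be generated instead of typed by hand. The following is a minimal sketch that reproduces the shape of the statement above; `quote_ident` and `build_create_model_sql` are hypothetical helpers, the comma-separated form for multiple hyperparameters is an assumption, and the inputs are trusted (no protection against SQL injection).

```python
def quote_ident(name: str) -> str:
    """Double-quote an identifier so its case is preserved; embedded quotes are doubled."""
    return '"' + name.replace('"', '""') + '"'


def build_create_model_sql(model, algorithm, features, target_expr, table, hyperparams=None):
    """Assemble a CREATE MODEL statement shaped like the ones shown in this article."""
    feature_list = ", ".join(quote_ident(f) for f in features)
    sql = (
        f"CREATE MODEL {model} USING {algorithm} "
        f"FEATURES {feature_list} TARGET {target_expr} FROM {table}"
    )
    if hyperparams:
        # String values are single-quoted, numbers are emitted as-is.
        pairs = ", ".join(
            f"{key}='{value}'" if isinstance(value, str) else f"{key}={value}"
            for key, value in hyperparams.items()
        )
        sql += f" WITH {pairs}"
    return sql + ";"


features = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
print(build_create_model_sql("iris_svm_model", "svm_classification", features,
                             '"PetalWidthCm"<=2.5', "iris_train", {"kernel": "linear"}))
```

The printed statement matches the one above apart from the uppercase WITH keyword.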

SVM classification with sklearn

When the sklearn module is used for SVM classification, the default kernel function is RBF (i.e. Gaussian).

Because the SVM algorithm in openGauss DB4AI defaults to the linear kernel, the kernel must be specified explicitly when the SVM class is created in sklearn: model = svm.SVC(verbose=1, kernel='linear').

import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics

current_path = os.path.dirname(os.path.realpath(__file__))
iris = pd.read_csv(current_path+'/../data/Iris.csv')
train, test = train_test_split(iris, test_size=0.3)
# Extract the features and labels of the training and test sets
train_x = train[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
train_y = train.Species
test_x = test[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
test_y = test.Species
# Create the support vector machine (SVM) classifier model
# The openGauss SVM defaults to the linear kernel and supports linear/gaussian/polynomial kernels
# SVC defaults to the RBF kernel (i.e. the Gaussian kernel), so kernel='linear' is set here to match
model = svm.SVC(verbose=1, kernel='linear')
# Fit the SVM model on the training set
model.fit(train_x, train_y)
# Use the trained model to predict on the test set
prediction = model.predict(test_x)
# Print the accuracy of the SVM model
print('The accuracy of the SVM is:', metrics.accuracy_score(test_y, prediction))
# Print the zero-one loss (fraction of misclassified samples)
print(metrics.zero_one_loss(test_y, prediction))

Run log

[LibSVM]*
optimization finished, #iter = 16
obj = -0.748057, rho = -1.452044
nSV = 3, nBSV = 0
*
optimization finished, #iter = 5
obj = -0.203684, rho = -1.507091
nSV = 3, nBSV = 0
*
optimization finished, #iter = 33
obj = -13.730499, rho = -8.624603
nSV = 20, nBSV = 16
Total nSV = 24
The accuracy of the SVM is: 0.9777777777777777
0.022222222222222254
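
The last two numbers in the log are complementary: the zero-one loss is the fraction of misclassified samples, i.e. 1 minus the accuracy. The sketch below re-implements both metrics in plain Python to make that relationship explicit. It is not sklearn's implementation, and the label lists are synthetic, chosen only to reproduce one mistake in 45 samples.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def zero_one_loss(y_true, y_pred):
    """Fraction of misclassified samples, i.e. 1 - accuracy."""
    return 1.0 - accuracy(y_true, y_pred)


# Synthetic labels: 45 test samples with exactly one mistake.
y_true = ["Iris-setosa"] * 44 + ["Iris-versicolor"]
y_pred = ["Iris-setosa"] * 44 + ["Iris-virginica"]
print(accuracy(y_true, y_pred), zero_one_loss(y_true, y_pred))
assert math.isclose(accuracy(y_true, y_pred) + zero_one_loss(y_true, y_pred), 1.0)
```

An accuracy of 44/45 therefore corresponds to a single misclassified row in the 45-row test set.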

Check the prediction results of the sklearn SVM model.

import numpy as np

def check_prediction(prediction: np.ndarray, test: pd.DataFrame):
    # Compare each prediction with the true species and count the mismatches
    raw = test.Species.tolist()
    cnt = 0
    for idx, specie in enumerate(raw):
        is_equal = False
        if prediction[idx] == specie:
            is_equal = True
        else:
            cnt += 1
        print(idx, is_equal, specie, prediction[idx])
    return cnt

cnt = check_prediction(prediction, test)
cnt

Output

0 True Iris-versicolor Iris-versicolor
1 True Iris-setosa Iris-setosa
2 True Iris-virginica Iris-virginica
3 True Iris-versicolor Iris-versicolor
4 True Iris-versicolor Iris-versicolor
5 True Iris-setosa Iris-setosa
6 True Iris-setosa Iris-setosa
7 True Iris-versicolor Iris-versicolor
8 True Iris-versicolor Iris-versicolor
9 True Iris-versicolor Iris-versicolor
10 True Iris-versicolor Iris-versicolor
11 True Iris-versicolor Iris-versicolor
12 True Iris-setosa Iris-setosa
13 True Iris-setosa Iris-setosa
14 True Iris-setosa Iris-setosa
15 True Iris-virginica Iris-virginica
16 True Iris-virginica Iris-virginica
17 True Iris-setosa Iris-setosa
18 True Iris-setosa Iris-setosa
19 True Iris-virginica Iris-virginica
20 True Iris-versicolor Iris-versicolor
21 True Iris-setosa Iris-setosa
22 True Iris-versicolor Iris-versicolor
23 True Iris-virginica Iris-virginica
24 True Iris-virginica Iris-virginica
25 True Iris-virginica Iris-virginica
26 True Iris-setosa Iris-setosa
27 True Iris-versicolor Iris-versicolor
28 True Iris-setosa Iris-setosa
29 True Iris-virginica Iris-virginica
30 True Iris-setosa Iris-setosa
31 True Iris-versicolor Iris-versicolor
32 True Iris-setosa Iris-setosa
33 True Iris-virginica Iris-virginica
34 True Iris-virginica Iris-virginica
35 True Iris-versicolor Iris-versicolor
36 True Iris-virginica Iris-virginica
37 True Iris-setosa Iris-setosa
38 True Iris-versicolor Iris-versicolor
39 True Iris-virginica Iris-virginica
40 True Iris-versicolor Iris-versicolor
41 False Iris-versicolor Iris-virginica
42 True Iris-setosa Iris-setosa
43 True Iris-virginica Iris-virginica
44 True Iris-virginica Iris-virginica

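Algorithm performance

With SVM models trained in openGauss and in sklearn on the same Iris dataset, the predictions above show a higher accuracy for openGauss (100%) than for sklearn (about 97.8%). In further training runs the author also observed the sklearn SVM model reaching 100% prediction accuracy. The difference may be caused by default parameters other than the kernel function. The two runs are also not strictly like for like: the scripts split the data independently and at random (80/20 for openGauss, 70/30 for sklearn), and the openGauss model was trained on the boolean target "PetalWidthCm"<=2.5, which holds for every row of the Iris dataset, whereas the sklearn model predicts the three-class Species label.

The author's main goal is to explore the DB4AI capability of openGauss through this comparison, so the differences between the two results are not analysed in depth here.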
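
Both scripts in this article call `train_test_split` without a fixed seed, so every run draws a different split, which is one reason accuracy varies between runs. sklearn's `train_test_split` accepts a `random_state` argument for this purpose. The sketch below illustrates the same idea with the standard library only; `split_indices` is a hypothetical helper, not sklearn's algorithm.

```python
import random


def split_indices(n_rows: int, test_fraction: float, seed: int):
    """Shuffle row indices with a fixed seed and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(150, 0.2, seed=42)
print(len(train_idx), len(test_idx))  # 120 30
```

Reusing one set of indices for both the openGauss tables and the sklearn DataFrames would let the two models be evaluated on identical rows.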

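Troubleshooting

openGauss version problem when pandas accesses the database: AssertionError: Could not determine version from string '(openGauss-lite 5.0.3 build 89d144c2) compiled at 2024-07-31 21:39:16 commit 0 last mr release'

Before the Engine is created with sqlalchemy.engine, the function SQLAlchemy uses to obtain the server version has to be patched so that it returns a fixed version.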
```
from sqlalchemy.dialects.postgresql.base import PGDialect
PGDialect._get_server_version_info = lambda *args: (9, 2)
```
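
The patch above sidesteps version detection altogether by always reporting (9, 2). The banner in the error message does not contain the word PostgreSQL, which is presumably why the dialect cannot interpret it. As an illustration only, the sketch below extracts the actual openGauss version from that banner; `parse_opengauss_version` is a hypothetical helper and is not part of SQLAlchemy.

```python
import re

# Version banner copied from the AssertionError message quoted above.
BANNER = ("(openGauss-lite 5.0.3 build 89d144c2) compiled at 2024-07-31 21:39:16 "
          "commit 0 last mr release")


def parse_opengauss_version(banner: str):
    """Extract (major, minor, patch) from an openGauss banner, or None if absent."""
    match = re.search(r"openGauss(?:-lite)?\s+(\d+)\.(\d+)\.(\d+)", banner)
    return tuple(int(part) for part in match.groups()) if match else None


print(parse_opengauss_version(BANNER))  # (5, 0, 3)
print("PostgreSQL" in BANNER)           # False
```

The openGauss version number (5.0.3 here) is not a PostgreSQL version, so returning it from the patched function would not necessarily be appropriate; the fixed (9, 2) used in the article is kept as the author's choice.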

Warning when pandas accesses a PostgreSQL database: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection

The warning is raised when a raw DBAPI2 connection, rather than a SQLAlchemy engine, is passed to pd.read_sql, as in the call recorded in the log below.

# reference: https://cloud.tencent.com/developer/ask/sof/106578644
# conn is a raw DBAPI2 connection object, not a SQLAlchemy engine
# (how it was created is not shown in the original; psycopg2.connect is a likely candidate)
test=pd.read_sql('''SELECT * FROM  iris_test;''', con = conn,index_col="id")
print(test)

/Users/hbu/SUFE/AI/Gemini/db4ai/iris-svm-model.py:41: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  test=pd.read_sql('''SELECT * FROM  iris_test;''', con = conn,index_col="id")
     sepallengthcm  sepalwidthcm  petallengthcm  petalwidthcm          species
id                                                                            
86             6.0           3.4            4.5           1.6  Iris-versicolor
3              4.7           3.2            1.3           0.2      Iris-setosa
146            6.7           3.0            5.2           2.3   Iris-virginica
76             6.6           3.0            4.4           1.4  Iris-versicolor
41             5.0           3.5            1.3           0.3      Iris-setosa
35             4.9           3.1            1.5           0.1      Iris-setosa

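Solution: create a SQLAlchemy engine that connects to the database through the psycopg2 driver, and pass that engine to pandas for the data operations.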

from sqlalchemy.engine import create_engine
engine=create_engine("postgresql+psycopg2://username:password@hostname:port/databasename")
test=pd.read_sql('''SELECT * FROM  iris_test;''', con = engine,index_col="id")
print(test)