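openGauss DB4AI vs. scikit-learn: A Comparative Exploration
The current version of openGauss supports native DB4AI capabilities. By introducing native AI operators, it simplifies the workflow and makes full use of the database optimizer and executor, delivering high-performance in-database model training.
This article describes how the author tested the openGauss DB4AI feature using the Iris dataset.
Getting the dataset
Download the Iris dataset from Kaggle: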
import kagglehub
# Download latest version
path = kagglehub.dataset_download("saurabh00007/iriscsv")
print("Path to dataset files:", path)
Output
hbu@Pauls-MacBook-Air db4ai % python3 get-Iris-data.py
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.11), please consider upgrading to the latest version (0.3.12).
Path to dataset files: /Users/hbu/.cache/kagglehub/datasets/saurabh00007/iriscsv/versions/1
hbu@Pauls-MacBook-Air db4ai % ls -lth /Users/hbu/.cache/kagglehub/datasets/saurabh00007/iriscsv/versions/1
total 16
-rw-r--r-- 1 hbu staff 5.0K Apr 23 15:09 Iris.csv
Move the file to a directory that is convenient for the project, e.g. project_dir/data/Iris.csv.
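This step can also be scripted. The following is a minimal sketch, assuming the project keeps its data in a `data/` directory; the helper name `stage_dataset` and its parameters are hypothetical and not part of kagglehub. It copies rather than moves the file, so the kagglehub cache is left untouched.

```python
import shutil
from pathlib import Path


def stage_dataset(download_dir: str, project_dir: str, filename: str = "Iris.csv") -> Path:
    """Copy a downloaded dataset file into <project_dir>/data/ and return the new path."""
    source = Path(download_dir) / filename
    target_dir = Path(project_dir) / "data"
    target_dir.mkdir(parents=True, exist_ok=True)  # create data/ if it does not exist yet
    target = target_dir / filename
    shutil.copy2(source, target)  # copy (not move) so the download cache stays intact
    return target
```

A call would pass the `path` value printed by the download script as `download_dir`, together with the project root as `project_dir`.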
SVM classification in openGauss
- Split the Iris dataset into train and test sets and load them into the openGauss tables iris_train and iris_test.
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sqlalchemy.engine import create_engine
from sqlalchemy.dialects.postgresql.base import PGDialect

# openGauss reports a version string that SQLAlchemy cannot parse (see the version
# issue described at the end of this article), so override the version lookup
# before creating the engine.
PGDialect._get_server_version_info = lambda *args: (9, 2)

current_path = os.path.dirname(os.path.realpath(__file__))
# Read the Iris dataset from the CSV file
iris = pd.read_csv(current_path+'/../data/Iris.csv')
# Check the dataset size
print(iris.shape)
# Split the dataset into training and test sets; the test set is 20% of the data
train, test = train_test_split(iris, test_size=0.2)
# Store the training and test data in the openGauss database
# engine=pg_manager.create_engine()
engine=create_engine("postgresql+psycopg2://username:password@hostname:port/databasename")
train.to_sql("iris_train", engine, if_exists='replace', index=False)
#test.columns = [col.lower() for col in test.columns]
test.to_sql("iris_test", engine, if_exists='replace', index=False)
- Check whether the model iris_svm_model already exists, and DROP it if it does.
postgres=> SELECT modelname from gs_model_warehouse where modelname='iris_svm_model';
   modelname
----------------
 iris_svm_model
(1 row)

postgres=> drop model iris_svm_model;
DROP MODEL
- Create an SVM model from the data in table iris_train, leaving all hyperparameters at their defaults (so kernel=linear).
postgres=> CREATE MODEL iris_svm_model USING svm_classification FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" TARGET "PetalWidthCm"<=2.5 FROM iris_train;
MODEL CREATED. PROCESSED 1
- Inspect the SVM classification model iris_svm_model trained on table iris_train.
postgres=> \x
Expanded display is on.
postgres=> SELECT * from gs_model_warehouse where modelname='iris_svm_model';
-[ RECORD 1 ]---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
modelname | iris_svm_model
modelowner | 16413
createtime | 2025-04-28 17:22:35.526005
processedtuples | 120
discardedtuples | 0
preprocesstime | 0
exectime | .000798
iterations | 4
outputtype | 16
modeltype | svm_classification
query | CREATE MODEL iris_svm_model USING svm_classification FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" TARGET "PetalWidthCm"<=2.5 FROM iris_train;
modeldata | \x5c7830313030303030303030303030303030303030303030303030303030303030303030363430303030303065383033303030303961393939393939393939396539336636363636363636363636363665653366666361396631643234643632343033666462343830663638303030303030303037623134616534376531376138343366303030303030303030303030303030303030303030303030303030306530336630323030303030303030303030303030303030303030303030303030663033663030303030303030303030303030303030303030303030303030303030303030303430303030303030353030303030303035303030303030303130303030303031613030303030306666666630303030396136613136666661623465633233663434613632616439323533376234336665616539353366383763303162353366383031343437616533333666393933663763336661653363363637343961336636383030303030303031303030303030303030303030303031303030303030303032303030303030303130303030303030303031
weight |
hyperparametersnames | {batch_size,decay,learning_rate,max_iterations,max_seconds,optimizer,tolerance,seed,verbose,lambda,kernel,components,gamma,degree,coef0}
hyperparametersvalues | {1000,.95,.8,100,0,gd,.0005,1745832155,false,.01,linear,0,.5,2,1}
hyperparametersoids | {23,701,701,23,23,1043,701,23,16,701,1043,23,701,23,701}
coefnames |
coefvalues |
coefoids |
trainingscoresname | {accuracy,f1,precision,recall,loss}
trainingscoresvalue | {1,1,1,1,.000535965}
modeldescribe |
- Use the SVM model to run predictions against the data in table iris_test as a check.
postgres=> SELECT "Species", PREDICT BY iris_svm_model (FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" ) as "PREDICT" FROM iris_test;
     Species     | PREDICT
-----------------+---------
 Iris-versicolor | t
 Iris-virginica  | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-virginica  | t
 Iris-setosa     | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-versicolor | t
 Iris-versicolor | t
 Iris-virginica  | t
 Iris-setosa     | t
 Iris-versicolor | t
 Iris-virginica  | t
 Iris-versicolor | t
 Iris-versicolor | t
 Iris-versicolor | t
 Iris-setosa     | t
 Iris-virginica  | t
 Iris-versicolor | t
(30 rows)

Check whether there are any false values, i.e. rows where classification failed:
postgres=> select "Species", "PREDICT" from (SELECT "Species", PREDICT BY iris_svm_model (FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" ) as "PREDICT" FROM iris_test) as t where "PREDICT"=False;
 Species | PREDICT
---------+---------
(0 rows)
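Note: tables inserted through SQLAlchemy keep their mixed-case column names, so a column such as Species cannot be written in all-uppercase or all-lowercase in SQL. Every reference to such a column must be enclosed in double quotes.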
- Supplement
Example of changing a model training parameter:
postgres=> CREATE MODEL iris_svm_model USING svm_classification FEATURES "SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm" TARGET "PetalWidthCm"<=2.5 FROM iris_train with kernel='linear';
MODEL CREATED. PROCESSED 1
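When several models with different hyperparameters are tried, the statement can be assembled in Python. The sketch below only builds the SQL string in the shape shown in the examples above; `quote_ident` and `build_create_model` are hypothetical helpers, and the comma-separated form for multiple `WITH` parameters is an assumption to verify against the openGauss documentation. It quotes each feature column so that mixed-case names survive, as the note above requires.

```python
def quote_ident(name: str) -> str:
    """Double-quote an identifier so its case is preserved; embedded quotes are doubled."""
    return '"' + name.replace('"', '""') + '"'


def build_create_model(model, algorithm, features, target_expr, table, hyperparams=None):
    """Assemble a DB4AI CREATE MODEL statement as a string.

    target_expr is inserted verbatim, so the caller must quote column names in it.
    """
    feature_list = ", ".join(quote_ident(f) for f in features)
    sql = (f"CREATE MODEL {model} USING {algorithm} "
           f"FEATURES {feature_list} TARGET {target_expr} FROM {table}")
    if hyperparams:
        parts = []
        for key, value in hyperparams.items():
            # String values get single quotes; numbers are written as-is.
            rendered = f"'{value}'" if isinstance(value, str) else str(value)
            parts.append(f"{key}={rendered}")
        sql += " WITH " + ", ".join(parts)  # assumed: parameters are comma-separated
    return sql + ";"


stmt = build_create_model(
    "iris_svm_model",
    "svm_classification",
    ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
    '"PetalWidthCm"<=2.5',
    "iris_train",
    {"kernel": "linear"},
)
print(stmt)
```

The helper does no escaping beyond identifier quoting, so it is only suitable for trusted, hand-written inputs such as the names used in this article.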
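SVM classification with sklearn
- When the sklearn module is used for SVM classification, the default kernel is RBF (i.e. gaussian).
Because the SVM algorithm in openGauss DB4AI defaults to the linear kernel, the kernel must be specified explicitly when creating the sklearn SVM classifier: model = svm.SVC(verbose=1, kernel='linear').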
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
current_path = os.path.dirname(os.path.realpath(__file__))
iris = pd.read_csv(current_path+'/../data/Iris.csv')
train, test = train_test_split(iris, test_size=0.3)
# Extract the features and labels of the training and test sets
train_x = train[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
train_y = train.Species
test_x = test[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
test_y = test.Species
# Create the support vector machine (SVM) classifier
# openGauss SVM defaults to the linear kernel; it supports linear/gaussian/polynomial kernels
# SVC defaults to the RBF kernel (i.e. the gaussian kernel),
# so kernel='linear' is set explicitly here to match openGauss
model = svm.SVC(verbose=1, kernel='linear')
# Fit the SVM model on the training set
model.fit(train_x, train_y)
# Use the trained model to predict on the test set
prediction = model.predict(test_x)
# Print the accuracy of the SVM model
print('The accuracy of the SVM is:', metrics.accuracy_score(prediction, test_y))
print(metrics.zero_one_loss(test_y, prediction))
Run log
[LibSVM]*
optimization finished, #iter = 16
obj = -0.748057, rho = -1.452044
nSV = 3, nBSV = 0
*
optimization finished, #iter = 5
obj = -0.203684, rho = -1.507091
nSV = 3, nBSV = 0
*
optimization finished, #iter = 33
obj = -13.730499, rho = -8.624603
nSV = 20, nBSV = 16
Total nSV = 24
The accuracy of the SVM is: 0.9777777777777777
0.022222222222222254
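The last two numbers in the log are complementary: the zero-one loss is the fraction of misclassified samples, i.e. one minus the accuracy. As a quick arithmetic check, assuming the 150-row Iris dataset with test_size=0.3 (45 test rows), a single misclassified row is consistent with both logged values. The variable names are illustrative only.

```python
n_samples = 150                      # rows in Iris.csv
n_test = round(n_samples * 0.3)      # test_size=0.3 gives 45 test rows
n_wrong = 1                          # assumed: one misclassified test row

accuracy = (n_test - n_wrong) / n_test   # fraction classified correctly
zero_one_loss = 1 - accuracy             # fraction misclassified

print(n_test, accuracy, zero_one_loss)
```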
- Check the prediction results of the sklearn SVM model
import numpy as np

def check_prediction(prediction: np.ndarray, test: pd.DataFrame):
    # Compare each prediction with the true label and count the mismatches
    raw = test.Species.tolist()
    cnt = 0
    for idx, specie in enumerate(raw):
        is_equal = False
        if prediction[idx] == specie:
            is_equal = True
        else:
            cnt += 1
        print(idx, is_equal, specie, prediction[idx])
    return cnt

cnt = check_prediction(prediction, test)
cnt
Output
0 True Iris-versicolor Iris-versicolor
1 True Iris-setosa Iris-setosa
2 True Iris-virginica Iris-virginica
3 True Iris-versicolor Iris-versicolor
4 True Iris-versicolor Iris-versicolor
5 True Iris-setosa Iris-setosa
6 True Iris-setosa Iris-setosa
7 True Iris-versicolor Iris-versicolor
8 True Iris-versicolor Iris-versicolor
9 True Iris-versicolor Iris-versicolor
10 True Iris-versicolor Iris-versicolor
11 True Iris-versicolor Iris-versicolor
12 True Iris-setosa Iris-setosa
13 True Iris-setosa Iris-setosa
14 True Iris-setosa Iris-setosa
15 True Iris-virginica Iris-virginica
16 True Iris-virginica Iris-virginica
17 True Iris-setosa Iris-setosa
18 True Iris-setosa Iris-setosa
19 True Iris-virginica Iris-virginica
20 True Iris-versicolor Iris-versicolor
21 True Iris-setosa Iris-setosa
22 True Iris-versicolor Iris-versicolor
23 True Iris-virginica Iris-virginica
24 True Iris-virginica Iris-virginica
25 True Iris-virginica Iris-virginica
26 True Iris-setosa Iris-setosa
27 True Iris-versicolor Iris-versicolor
28 True Iris-setosa Iris-setosa
29 True Iris-virginica Iris-virginica
30 True Iris-setosa Iris-setosa
31 True Iris-versicolor Iris-versicolor
32 True Iris-setosa Iris-setosa
33 True Iris-virginica Iris-virginica
34 True Iris-virginica Iris-virginica
35 True Iris-versicolor Iris-versicolor
36 True Iris-virginica Iris-virginica
37 True Iris-setosa Iris-setosa
38 True Iris-versicolor Iris-versicolor
39 True Iris-virginica Iris-virginica
40 True Iris-versicolor Iris-versicolor
41 False Iris-versicolor Iris-virginica
42 True Iris-setosa Iris-setosa
43 True Iris-virginica Iris-virginica
44 True Iris-virginica Iris-virginica
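A per-class tally can make a listing like this easier to scan. Below is a minimal sketch using only the standard library; `summarize` is a hypothetical helper, and the sample labels are made up for illustration rather than taken from the run above.

```python
from collections import Counter


def summarize(actual, predicted):
    """Count (actual, predicted) label pairs and return (pair_counts, n_misclassified)."""
    pairs = Counter(zip(actual, predicted))
    wrong = sum(n for (a, p), n in pairs.items() if a != p)
    return pairs, wrong


# Illustrative labels only, not the real test set
actual = ["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica"]
predicted = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-virginica"]

pairs, wrong = summarize(actual, predicted)
for (a, p), n in sorted(pairs.items()):
    print(f"{a:>16} -> {p:<16} {n}")
print("misclassified:", wrong)
```

With the real data, `test.Species.tolist()` and `prediction` would take the place of the two sample lists.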
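Algorithm performance
openGauss and sklearn each trained an SVM model on the Iris dataset. In the runs shown above, the openGauss predictions were 100% correct while sklearn reached about 97.8%; over further training runs the author also saw the sklearn SVM model reach 100%. The difference may stem from default parameters other than the kernel function. Note also that the two runs are not strictly like-for-like: the openGauss model was trained on the boolean target "PetalWidthCm"<=2.5 with a 20% test split, whereas the sklearn model predicts Species with a 30% test split.
The author's main aim was to explore the openGauss DB4AI capability, so the differences between the two results are not analyzed in depth.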
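For a like-for-like comparison, both systems would need to see exactly the same rows. One way to get there, sketched here with the standard library only, is a seeded shuffle so that the split is reproducible. `split_indices` is a hypothetical helper; scikit-learn's `train_test_split` accepts a `random_state` argument for the same purpose.

```python
import random


def split_indices(n_rows, test_fraction, seed):
    """Return (train_idx, test_idx) from a shuffle that depends only on the seed."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # private RNG, global state untouched
    n_test = round(n_rows * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = split_indices(150, 0.2, seed=42)
print(len(train_idx), len(test_idx))
```

The same index lists could then select the rows written to iris_train and iris_test and the rows passed to sklearn.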
Troubleshooting
- openGauss version issue when pandas accesses the database: AssertionError: Could not determine version from string '(openGauss-lite 5.0.3 build 89d144c2) compiled at 2024-07-31 21:39:16 commit 0 last mr release'
Before creating the Engine with sqlalchemy.engine, override the function that SQLAlchemy uses to obtain the server version:
from sqlalchemy.dialects.postgresql.base import PGDialect
PGDialect._get_server_version_info = lambda *args: (9, 2)
- Warning when pandas accesses a PostgreSQL database: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection
The warning appears when a raw DBAPI connection (conn in the log below) is passed to pd.read_sql:
/Users/hbu/SUFE/AI/Gemini/db4ai/iris-svm-model.py:41: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  test=pd.read_sql('''SELECT * FROM iris_test;''', con = conn,index_col="id")
     sepallengthcm  sepalwidthcm  petallengthcm  petalwidthcm          species
id
86             6.0           3.4            4.5           1.6  Iris-versicolor
3              4.7           3.2            1.3           0.2      Iris-setosa
146            6.7           3.0            5.2           2.3   Iris-virginica
76             6.6           3.0            4.4           1.4  Iris-versicolor
41             5.0           3.5            1.3           0.3      Iris-setosa
35             4.9           3.1            1.5           0.1      Iris-setosa
Solution: create a SQLAlchemy engine that uses the psycopg2 driver to connect to the PostgreSQL database, then pass that engine to pandas for the data operations.
# reference: https://cloud.tencent.com/developer/ask/sof/106578644
from sqlalchemy.engine import create_engine
engine=create_engine("postgresql+psycopg2://username:password@hostname:port/databasename")
test=pd.read_sql('''SELECT * FROM iris_test;''', con = engine,index_col="id")
print(test)
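On the first issue above: the version banner that openGauss returns does not look like a PostgreSQL one, which is presumably why SQLAlchemy cannot parse it. The sketch below, using only `re`, shows how the openGauss version could be read from such a banner. `parse_opengauss_version` is a hypothetical helper and not part of SQLAlchemy, and its result is not what the workaround above injects (that override simply reports a fixed (9, 2)).

```python
import re


def parse_opengauss_version(banner):
    """Return the openGauss version as a tuple of ints, or None if it is not found."""
    match = re.search(r"openGauss(?:-lite)?\s+(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


banner = ("(openGauss-lite 5.0.3 build 89d144c2) compiled at 2024-07-31 21:39:16 "
          "commit 0 last mr release")
print(parse_opengauss_version(banner))
```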