EDB Docs - WarehousePG v7.5 - Data Science Package for Python

The Data Science Package for Python is an EDB-provided package that bundles a curated collection of Python machine learning, statistical modeling, and data science modules for use with the WarehousePG PL/Python procedural language.

The package requires WarehousePG (WHPG) 7.5.0 or later, which includes PL/Python 3.11. For information about PL/Python, see WarehousePG PL/Python Language Extension.

Available modules

The Data Science Package includes the following Python modules. Deep learning modules (TensorFlow, PyTorch) are CPU-only. Transitive dependencies are installed automatically. See Module reference for the full list with descriptions.

Category	Modules
Core scientific	numpy, scipy, pandas, scikit-learn, statsmodels, patsy, joblib
Plotting	matplotlib, seaborn, plotly
Acceleration and I/O	numexpr, bottleneck, pyarrow, h5py, openpyxl, netCDF4
Gradient boosting	xgboost, lightgbm, catboost
Deep learning (CPU)	tensorflow-cpu, keras, torch, torchvision, torchaudio
NLP and HuggingFace	nltk, gensim, spacy, transformers, sentence-transformers, InstructorEmbedding, accelerate, datasets, sacrebleu, rouge
Computer vision	pillow, scikit-image, opencv-python-headless
Probabilistic and time series	pymc, prophet, lifelines, pmdarima, tslearn, gluonts
Math and graph	networkx, sympy
Out-of-core and JIT	xarray, dask, numba
Explainability	shap, lime, pyLDAvis, imbalanced-learn
Data, database, and parsing	SQLAlchemy, psycopg2-binary, python-docx, pdfminer.six, feedparser, graphviz, holidays, formulaic, xmltodict, orjson, cryptography
Utilities	requests, beautifulsoup4, lxml, PyYAML, tqdm, regex, Jinja2, Cython, pydantic, typer
Web and XML schema	CherryPy, PyXB-X
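
Several modules in the table are imported under a name that differs from the package name (for example, scikit-learn is imported as sklearn). The following minimal sketch checks whether a handful of modules can be located by the Python interpreter that runs it. The IMPORT_NAMES mapping and the find_missing helper are illustrative names, not part of the package, and the mapping covers only a few entries. To check what PL/Python itself sees, the same logic would need to run inside a PL/Python function.

```python
import importlib.util

# Package names from the table mapped to the names used in `import` statements.
# Only a few entries are shown here; extend the mapping as needed.
IMPORT_NAMES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "pillow": "PIL",
    "opencv-python-headless": "cv2",
    "beautifulsoup4": "bs4",
    "PyYAML": "yaml",
}


def find_missing(import_names):
    """Return the import names that the current interpreter cannot locate."""
    missing = []
    for name in import_names:
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(IMPORT_NAMES.values())
    print("missing:", missing if missing else "none")
```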

Prerequisites

Before installing the Data Science Package for Python, ensure that:

Your WHPG cluster is running WHPG 7.5.0 or later.
PL/Python is enabled on your cluster. See WarehousePG PL/Python Language Extension.
You have sourced /usr/edb/whpg7/greenplum_path.sh and the $COORDINATOR_DATA_DIRECTORY and $GPHOME environment variables are set.
On air-gapped clusters, python3.11 is installed on all nodes. See Performing a minor upgrade for details.
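
As a quick sanity check for the environment prerequisite, a sketch like the following reports which of the two required variables are unset or empty in the current shell session. The missing_vars helper is a hypothetical name used for illustration. It does not verify that the values point to valid directories.

```python
import os

# The two environment variables named in the prerequisites.
REQUIRED_VARS = ("COORDINATOR_DATA_DIRECTORY", "GPHOME")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All required variables are set.")
```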

Note

The pymc and prophet modules require tk at runtime. If you plan to use either module, install the tk OS package on every node in your cluster before installing the Data Science Package:

sudo yum install tk

Downloading and installing the Data Science Package

The Data Science Package is a large download (approximately 1.2 GB). Install it on each host in your WarehousePG cluster.

From the coordinator, download the package from the EDB repository:

RHEL 8, 9

export EDB_SUBSCRIPTION_TOKEN=<your-token>
export EDB_REPO=gpsupp
curl -1sSLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/$EDB_REPO/setup.rpm.sh" | sudo -E bash
sudo dnf download edb-whpg7-data-science-python311

RHEL 7

export EDB_SUBSCRIPTION_TOKEN=<your-token>
export EDB_REPO=gpsupp
curl -1sSLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/$EDB_REPO/setup.rpm.sh" | sudo -E bash
sudo yumdownloader edb-whpg7-data-science-python311

Where <your-token> is your EDB subscription token.

Create a file named all_hosts on the coordinator that lists every host in the cluster, one hostname per line:
```
cdw
scdw
sdw1
sdw2
sdw3
```

Use gpsync to transfer the package to all hosts, then use gpssh to install it:

RHEL 8, 9

gpsync -f all_hosts <package-name> =:/tmp
gpssh -f all_hosts -e 'sudo dnf install -y /tmp/<package-name>'

RHEL 7

gpsync -f all_hosts <package-name> =:/tmp
gpssh -f all_hosts -e 'sudo yum install -y /tmp/<package-name>'

Where <package-name> is the name of the package file you downloaded.
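
If you script this step, the two commands can be assembled from the package file name and the installer for your OS release. The sketch below only builds and prints the command lines and does not run anything. The build_install_commands helper and the example-package.rpm file name are placeholders for illustration.

```python
import os
import shlex


def build_install_commands(package_file, hosts_file="all_hosts", installer="dnf"):
    """Return the gpsync and gpssh commands from the step above as argument lists."""
    if installer not in ("dnf", "yum"):
        raise ValueError("installer must be 'dnf' (RHEL 8, 9) or 'yum' (RHEL 7)")
    # gpsync copies the file into /tmp on each host, so only the base name is used remotely.
    remote = f"sudo {installer} install -y /tmp/{os.path.basename(package_file)}"
    return [
        ["gpsync", "-f", hosts_file, package_file, "=:/tmp"],
        ["gpssh", "-f", hosts_file, "-e", remote],
    ]


if __name__ == "__main__":
    # "example-package.rpm" is a placeholder, not the real package file name.
    for cmd in build_install_commands("example-package.rpm"):
        print(shlex.join(cmd))
```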

Restart WarehousePG:
```
gpstop -r
```

After installation, the Data Science Package modules are available at $GPHOME/ext/DataSciencePython3.11/lib/python3.11/site-packages/. The package installs a .pth file at /usr/lib/python3.11/site-packages/DataSciencePython.pth that makes the modules available to PL/Python automatically. No changes to greenplum_path.sh or PYTHONPATH are required.
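
A .pth file works through Python's standard site module: each line that names an existing directory is appended to sys.path when the containing site directory is processed. The sketch below demonstrates the mechanism with temporary directories. It does not read or reproduce the contents of DataSciencePython.pth, and demo_pth and demo.pth are illustrative names.

```python
import os
import site
import sys
import tempfile


def demo_pth():
    """Show that a .pth file in a site directory extends sys.path."""
    site_dir = tempfile.mkdtemp()   # stands in for a site-packages directory
    extra_dir = tempfile.mkdtemp()  # stands in for the directory that holds the modules

    # A .pth file lists one directory per line.
    with open(os.path.join(site_dir, "demo.pth"), "w") as fh:
        fh.write(extra_dir + "\n")

    # addsitedir adds site_dir to sys.path and processes any .pth files it contains.
    site.addsitedir(site_dir)
    return extra_dir


if __name__ == "__main__":
    added = demo_pth()
    print("on sys.path:", os.path.abspath(added) in sys.path)
```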

Uninstalling the Data Science Package

Uninstall the package on all hosts and then restart WarehousePG. The command below uses dnf (RHEL 8, 9); on RHEL 7, substitute yum for dnf:

gpssh -f all_hosts -e 'sudo dnf remove -y edb-whpg7-data-science-python311'
gpstop -r

Note

After uninstalling the Data Science Package, any user-defined functions that import modules from this package will return an error.
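
One way to make such failures easier to diagnose is to wrap the import so that the error names the missing module. The following is a minimal sketch of that pattern in plain Python. The require_module helper and its message text are assumptions for illustration and are not provided by the package.

```python
import importlib


def require_module(name):
    """Import a module by name, raising a descriptive error if it is unavailable."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so it is caught here too.
        raise RuntimeError(
            f"Python module '{name}' is not available; "
            "check that the Data Science Package is installed on every host."
        ) from exc


if __name__ == "__main__":
    json = require_module("json")  # a standard-library module, used only for the demo
    print(json.dumps({"ok": True}))
```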

Module reference

Module	Description
accelerate	HuggingFace library for training and inference at scale
beautifulsoup4	Screen-scraping library
bottleneck	Fast NumPy array functions written in C
catboost	High-performance gradient boosting on decision trees
CherryPy	Object-oriented HTTP framework
cryptography	Cryptographic recipes and primitives
Cython	Compiler for writing C extensions for Python
dask	Parallel computing library that scales NumPy, pandas, and scikit-learn
datasets	HuggingFace community-driven open-source library of datasets
feedparser	Universal feed parser for RSS, Atom, and CDF feeds
formulaic	Implementation of Wilkinson formulas
gensim	Python framework for fast Vector Space Modelling
gluonts	Probabilistic time series modeling
graphviz	Simple Python interface for Graphviz
h5py	Read and write HDF5 files from Python
holidays	Generate and work with holidays in Python
imbalanced-learn	Tools for classification with imbalanced classes
InstructorEmbedding	Text embedding using instruction-tuned models
Jinja2	Fast and expressive template engine
joblib	Lightweight pipelining with Python functions
keras	Deep learning API built on TensorFlow
lifelines	Survival analysis including Kaplan-Meier, Nelson-Aalen, and regression
lightgbm	Fast, distributed, high-performance gradient boosting framework
lime	Local Interpretable Model-Agnostic Explanations for machine learning classifiers
lxml	XML and HTML processing library combining libxml2/libxslt with ElementTree
matplotlib	Python plotting package
netCDF4	Object-oriented Python interface to the netCDF version 4 library
networkx	Creation, manipulation, and study of complex networks
nltk	Natural language toolkit
numba	JIT compiler for Python using LLVM
numexpr	Fast numerical expression evaluator for NumPy
numpy	Scientific computing with N-dimensional arrays
opencv-python-headless	Computer vision library (no GUI dependencies)
openpyxl	Read and write Excel 2010 xlsx/xlsm files
orjson	Fast Python JSON library supporting dataclasses, datetimes, and numpy
pandas	Data analysis and manipulation
patsy	Describing statistical models and building design matrices
pdfminer.six	PDF parser and analyzer
pillow	Python Imaging Library
plotly	Interactive graphing library
pmdarima	Python equivalent of R's `forecast::auto.arima`
prophet	Automatic forecasting procedure
psycopg2-binary	PostgreSQL database adapter for Python
pyarrow	Cross-language in-memory data development platform (Apache Arrow)
pydantic	Data validation using Python type hints
pyLDAvis	Interactive topic model visualization
pymc	Statistical modeling and probabilistic machine learning
PyXB-X	Generate Python code for classes corresponding to XMLSchema data structures
PyYAML	YAML parser and emitter for Python
regex	Alternative regular expression module
requests	HTTP library
rouge	Full Python ROUGE score implementation
sacrebleu	Shareable, comparable, and reproducible BLEU, chrF, and TER scores
scikit-image	Image processing algorithms for SciPy
scikit-learn	Machine learning, data mining, and data analysis
scipy	Scientific computing (integration, optimization, signal processing)
seaborn	Statistical data visualization based on matplotlib
sentence-transformers	Multilingual sentence, paragraph, and image embeddings using BERT
shap	Unified approach to explain the output of any machine learning model
spacy	Large-scale natural language processing
SQLAlchemy	Database abstraction library
statsmodels	Statistical modeling and hypothesis testing
sympy	Computer algebra system
tensorflow-cpu	Numerical computation using data flow graphs (CPU-only)
torch	Tensors and dynamic neural networks (CPU-only)
torchaudio	Audio processing for PyTorch (CPU-only)
torchvision	Computer vision datasets, models, and transforms for PyTorch (CPU-only)
tqdm	Fast, extensible progress meter
transformers	State-of-the-art machine learning for JAX, PyTorch, and TensorFlow
tslearn	Machine learning toolkit dedicated to time-series data
typer	CLI builder based on Python type hints
xarray	N-dimensional labeled arrays and datasets
xgboost	Gradient boosting for classification and ranking
xmltodict	Makes working with XML feel like working with JSON
