The Data Science Package for Python is an EDB-provided package that bundles a curated collection of Python machine learning, statistical modeling, and data science modules for use with the WarehousePG PL/Python procedural language.
The package requires WarehousePG (WHPG) 7.5.0 or later, which includes PL/Python 3.11. For information about PL/Python, see WarehousePG PL/Python Language Extension.
Available modules
The Data Science Package includes the following Python modules. Deep learning modules (TensorFlow, PyTorch) are CPU-only. Transitive dependencies are installed automatically. See Module reference for the full list with descriptions.
| Category | Modules |
|---|---|
| Core scientific | numpy, scipy, pandas, scikit-learn, statsmodels, patsy, joblib |
| Plotting | matplotlib, seaborn, plotly |
| Acceleration and I/O | numexpr, bottleneck, pyarrow, h5py, openpyxl, netCDF4 |
| Gradient boosting | xgboost, lightgbm, catboost |
| Deep learning (CPU) | tensorflow-cpu, keras, torch, torchvision, torchaudio |
| NLP and HuggingFace | nltk, gensim, spacy, transformers, sentence-transformers, InstructorEmbedding, accelerate, datasets, sacrebleu, rouge |
| Computer vision | pillow, scikit-image, opencv-python-headless |
| Probabilistic and time series | pymc, prophet, lifelines, pmdarima, tslearn, gluonts |
| Math and graph | networkx, sympy |
| Out-of-core and JIT | xarray, dask, numba |
| Explainability | shap, lime, pyLDAvis, imbalanced-learn |
| Data, database, and parsing | SQLAlchemy, psycopg2-binary, python-docx, pdfminer.six, feedparser, graphviz, holidays, formulaic, xmltodict, orjson, cryptography |
| Utilities | requests, beautifulsoup4, lxml, PyYAML, tqdm, regex, Jinja2, Cython, pydantic, typer |
| Web and XML schema | CherryPy, PyXB-X |
Prerequisites
Before installing the Data Science Package for Python, ensure that:
- Your WHPG cluster is running WHPG 7.5.0 or later.
- PL/Python is enabled on your cluster. See WarehousePG PL/Python Language Extension.
- You have sourced /usr/edb/whpg7/greenplum_path.sh and the $COORDINATOR_DATA_DIRECTORY and $GPHOME environment variables are set.
- On air-gapped clusters, python3.11 is installed on all nodes. See Performing a minor upgrade for details.
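The environment-related prerequisites above can be spot-checked on a single host with a short script. The following is a minimal sketch, not an EDB-provided tool: the function name missing_prerequisites is hypothetical, and it assumes that python3.11 being installed means it is discoverable on PATH. It does not verify the WHPG version or whether PL/Python is enabled.

```python
import os
import shutil

# Environment variables named in the prerequisites list.
REQUIRED_VARS = ("COORDINATOR_DATA_DIRECTORY", "GPHOME")


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of problems found on this host; an empty list means none."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        # Treat an unset or empty variable as missing.
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    # Assumption: an installed python3.11 is reachable through PATH.
    if which("python3.11") is None:
        problems.append("python3.11 was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("MISSING:", problem)
```

Run the script in a shell where greenplum_path.sh has already been sourced. Because it inspects only the local host, it would need to be run on every node to cover the air-gapped case.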
Note
The pymc and prophet modules require tk at runtime. If you plan to use either module, install the tk OS package on every node in your cluster before installing the Data Science Package:
sudo yum install tk
Downloading and installing the Data Science Package
The Data Science Package is a large download (approximately 1.2 GB). Install it on each host in your WarehousePG cluster.
From the coordinator, download the package from the EDB repository:
On systems that use dnf:
export EDB_SUBSCRIPTION_TOKEN=<your-token>
export EDB_REPO=gpsupp
curl -1sSLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/$EDB_REPO/setup.rpm.sh" | sudo -E bash
sudo dnf download edb-whpg7-data-science-python311
On systems that use yum:
export EDB_SUBSCRIPTION_TOKEN=<your-token>
export EDB_REPO=gpsupp
curl -1sSLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/$EDB_REPO/setup.rpm.sh" | sudo -E bash
sudo yumdownloader edb-whpg7-data-science-python311
Where <your-token> is your EDB subscription token.
Create a file named all_hosts on the coordinator that lists all hosts in the cluster:
cdw
scdw
sdw1
sdw2
sdw3
Use gpsync to transfer the package to all hosts, then use gpssh to install it.
On systems that use dnf:
gpsync -f all_hosts <package-name> =:/tmp
gpssh -f all_hosts -e 'sudo dnf install -y /tmp/<package-name>'
On systems that use yum:
gpsync -f all_hosts <package-name> =:/tmp
gpssh -f all_hosts -e 'sudo yum install -y /tmp/<package-name>'
Where <package-name> is the name of the package file you downloaded.
Restart WarehousePG:
gpstop -r
After installation, the Data Science Package modules are available at $GPHOME/ext/DataSciencePython3.11/lib/python3.11/site-packages/. The package installs a .pth file at /usr/lib/python3.11/site-packages/DataSciencePython.pth that makes the modules available to PL/Python automatically. No changes to greenplum_path.sh or PYTHONPATH are required.
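One way to confirm that the modules are visible to a Python 3.11 interpreter is to ask the import system to locate them without actually importing them. The sketch below is illustrative only: the function name unavailable_modules is hypothetical, and the mapping covers just a small sample of the bundled packages. Note that several packages are imported under a different name than the one listed in the tables (for example, scikit-learn is imported as sklearn).

```python
import importlib.util

# Package name as listed in this document -> top-level import name.
# This is a small sample, not the full list of bundled modules.
IMPORT_NAMES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "pillow": "PIL",
    "opencv-python-headless": "cv2",
    "beautifulsoup4": "bs4",
    "PyYAML": "yaml",
}


def unavailable_modules(import_names=IMPORT_NAMES):
    """Return the package names whose import name cannot be located."""
    missing = []
    for package, module in import_names.items():
        # find_spec returns None when a top-level module is not on sys.path.
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    print("Not importable:", unavailable_modules() or "none")
```

Locating a module is cheaper than importing it, which matters for large libraries such as TensorFlow, but it does not prove that the import itself would succeed. The same function body could also be run inside a PL/Python function to check visibility from within the database.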
Uninstalling the Data Science Package
Uninstall the package on all hosts and then restart WarehousePG:
gpssh -f all_hosts -e 'sudo dnf remove -y edb-whpg7-data-science-python311'
gpstop -r
Note
After uninstalling the Data Science Package, any user-defined functions that import modules from this package will return an error.
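Before uninstalling, it can help to identify which functions would be affected. The following is a rough sketch under stated assumptions: it takes the source text of a PL/Python function body (which you might read from the prosrc column of the pg_proc catalog, assuming WHPG exposes it as PostgreSQL does) and reports which package modules it imports. The function name imported_package_modules and the sample set of import names are hypothetical, and imports performed dynamically (for example, through importlib) are not detected.

```python
import ast
import textwrap

# Sample of top-level import names provided by the package (not the full list).
PACKAGE_IMPORTS = {"numpy", "pandas", "sklearn", "scipy", "torch", "xgboost"}


def imported_package_modules(function_body, package_imports=PACKAGE_IMPORTS):
    """Return the sorted package import names used in a PL/Python function body."""
    # A PL/Python body may contain a top-level return, so wrap it in a function
    # definition before parsing.
    wrapped = "def _body():\n" + textwrap.indent(function_body, "    ")
    try:
        tree = ast.parse(wrapped)
    except SyntaxError:
        # Unparseable source: report nothing rather than guess.
        return []
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # Compare only the top-level package, e.g. sklearn for sklearn.svm.
            top = name.split(".")[0]
            if top in package_imports:
                found.add(top)
    return sorted(found)
```

For example, a body containing "import numpy as np" and "from sklearn.linear_model import LinearRegression" would be reported as using numpy and sklearn, while a body that imports only standard library modules yields an empty list.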
Module reference
| Module | Description |
|---|---|
| accelerate | HuggingFace library for training and inference at scale |
| beautifulsoup4 | Screen-scraping library |
| bottleneck | Fast NumPy array functions written in C |
| catboost | High-performance gradient boosting on decision trees |
| CherryPy | Object-oriented HTTP framework |
| cryptography | Cryptographic recipes and primitives |
| Cython | Compiler for writing C extensions for Python |
| dask | Parallel computing library that scales NumPy, pandas, and scikit-learn |
| datasets | HuggingFace community-driven open-source library of datasets |
| feedparser | Universal feed parser for RSS, Atom, and CDF feeds |
| formulaic | Implementation of Wilkinson formulas |
| gensim | Python framework for fast Vector Space Modelling |
| gluonts | Probabilistic time series modeling |
| graphviz | Simple Python interface for Graphviz |
| h5py | Read and write HDF5 files from Python |
| holidays | Generate and work with holidays in Python |
| imbalanced-learn | Tools for classification with imbalanced classes |
| InstructorEmbedding | Text embedding using instruction-tuned models |
| Jinja2 | Fast and expressive template engine |
| joblib | Lightweight pipelining with Python functions |
| keras | Deep learning API built on TensorFlow |
| lifelines | Survival analysis including Kaplan-Meier, Nelson-Aalen, and regression |
| lightgbm | Fast, distributed, high-performance gradient boosting framework |
| lime | Local Interpretable Model-Agnostic Explanations for machine learning classifiers |
| lxml | XML and HTML processing library combining libxml2/libxslt with ElementTree |
| matplotlib | Python plotting package |
| netCDF4 | Object-oriented Python interface to the netCDF version 4 library |
| networkx | Creation, manipulation, and study of complex networks |
| nltk | Natural language toolkit |
| numba | JIT compiler for Python using LLVM |
| numexpr | Fast numerical expression evaluator for NumPy |
| numpy | Scientific computing with N-dimensional arrays |
| opencv-python-headless | Computer vision library (no GUI dependencies) |
| openpyxl | Read and write Excel 2010 xlsx/xlsm files |
| orjson | Fast Python JSON library supporting dataclasses, datetimes, and numpy |
| pandas | Data analysis and manipulation |
| patsy | Describing statistical models and building design matrices |
| pdfminer.six | PDF parser and analyzer |
| pillow | Python Imaging Library |
| plotly | Interactive graphing library |
| pmdarima | Python equivalent of R's forecast::auto.arima |
| prophet | Automatic forecasting procedure |
| psycopg2-binary | PostgreSQL database adapter for Python |
| pyarrow | Cross-language in-memory data development platform (Apache Arrow) |
| pydantic | Data validation using Python type hints |
| pyLDAvis | Interactive topic model visualization |
| pymc | Statistical modeling and probabilistic machine learning |
| PyXB-X | Generate Python code for classes corresponding to XMLSchema data structures |
| PyYAML | YAML parser and emitter for Python |
| regex | Alternative regular expression module |
| requests | HTTP library |
| rouge | Full Python ROUGE score implementation |
| sacrebleu | Shareable, comparable, and reproducible BLEU, chrF, and TER scores |
| scikit-image | Image processing algorithms for SciPy |
| scikit-learn | Machine learning, data mining, and data analysis |
| scipy | Scientific computing (integration, optimization, signal processing) |
| seaborn | Statistical data visualization based on matplotlib |
| sentence-transformers | Multilingual sentence, paragraph, and image embeddings using BERT |
| shap | Unified approach to explain the output of any machine learning model |
| spacy | Large-scale natural language processing |
| SQLAlchemy | Database abstraction library |
| statsmodels | Statistical modeling and hypothesis testing |
| sympy | Computer algebra system |
| tensorflow-cpu | Numerical computation using data flow graphs (CPU-only) |
| torch | Tensors and dynamic neural networks (CPU-only) |
| torchaudio | Audio processing for PyTorch (CPU-only) |
| torchvision | Computer vision datasets, models, and transforms for PyTorch (CPU-only) |
| tqdm | Fast, extensible progress meter |
| transformers | State-of-the-art machine learning for JAX, PyTorch, and TensorFlow |
| tslearn | Machine learning toolkit dedicated to time-series data |
| typer | CLI builder based on Python type hints |
| xarray | N-dimensional labeled arrays and datasets |
| xgboost | Gradient boosting for classification and ranking |
| xmltodict | Makes working with XML feel like working with JSON |