Data Science Package for Python v7.5

The Data Science Package for Python is an EDB-provided package that bundles a curated collection of Python machine learning, statistical modeling, and data science modules for use with the WarehousePG PL/Python procedural language.

The package requires WarehousePG (WHPG) 7.5.0 or later, which includes PL/Python 3.11. For information about PL/Python, see WarehousePG PL/Python Language Extension.

Available modules

The Data Science Package includes the following Python modules. Deep learning modules (TensorFlow, PyTorch) are CPU-only. Transitive dependencies are installed automatically. See Module reference for the full list with descriptions.

| Category | Modules |
|---|---|
| Core scientific | numpy, scipy, pandas, scikit-learn, statsmodels, patsy, joblib |
| Plotting | matplotlib, seaborn, plotly |
| Acceleration and I/O | numexpr, bottleneck, pyarrow, h5py, openpyxl, netCDF4 |
| Gradient boosting | xgboost, lightgbm, catboost |
| Deep learning (CPU) | tensorflow-cpu, keras, torch, torchvision, torchaudio |
| NLP and HuggingFace | nltk, gensim, spacy, transformers, sentence-transformers, InstructorEmbedding, accelerate, datasets, sacrebleu, rouge |
| Computer vision | pillow, scikit-image, opencv-python-headless |
| Probabilistic and time series | pymc, prophet, lifelines, pmdarima, tslearn, gluonts |
| Math and graph | networkx, sympy |
| Out-of-core and JIT | xarray, dask, numba |
| Explainability | shap, lime, pyLDAvis, imbalanced-learn |
| Data, database, and parsing | SQLAlchemy, psycopg2-binary, python-docx, pdfminer.six, feedparser, graphviz, holidays, formulaic, xmltodict, orjson, cryptography |
| Utilities | requests, beautifulsoup4, lxml, PyYAML, tqdm, regex, Jinja2, Cython, pydantic, typer |
| Web and XML schema | CherryPy, PyXB-X |
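
Several package names in the table differ from the name used in an import statement (for example, scikit-learn is imported as sklearn). The following is a minimal sketch, not part of the package, of how you might translate the listed names and check which ones are importable in a given Python environment. The helper names import_name, is_available, and missing_packages are hypothetical, and the mapping covers only a subset of the packages whose import names are well known.

```python
import importlib.util

# Package names from the table whose import name differs from the
# package name. This mapping is illustrative and intentionally partial.
IMPORT_NAMES = {
    "scikit-learn": "sklearn",
    "scikit-image": "skimage",
    "imbalanced-learn": "imblearn",
    "opencv-python-headless": "cv2",
    "pillow": "PIL",
    "beautifulsoup4": "bs4",
    "PyYAML": "yaml",
    "python-docx": "docx",
    "psycopg2-binary": "psycopg2",
    "pdfminer.six": "pdfminer",
    "tensorflow-cpu": "tensorflow",
    "sentence-transformers": "sentence_transformers",
    "Jinja2": "jinja2",
    "SQLAlchemy": "sqlalchemy",
    "CherryPy": "cherrypy",
}


def import_name(package: str) -> str:
    """Return the top-level import name for a listed package name."""
    # Fall back to the package name itself, with hyphens replaced,
    # when no explicit mapping is known.
    return IMPORT_NAMES.get(package, package.replace("-", "_"))


def is_available(package: str) -> bool:
    """Return True if the package's top-level module can be located."""
    # find_spec locates the module without importing it, which keeps
    # the check cheap for heavy modules such as tensorflow or torch.
    return importlib.util.find_spec(import_name(package)) is not None


def missing_packages(packages: list[str]) -> list[str]:
    """Return the subset of packages that cannot be located."""
    return [pkg for pkg in packages if not is_available(pkg)]
```

Calling missing_packages(["numpy", "scikit-learn", "pillow"]) from the body of a PL/Python function returns an empty list when all three modules can be located on the host that runs the function.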

Prerequisites

Before installing the Data Science Package for Python, ensure that:

  • Your WHPG cluster is running WHPG 7.5.0 or later.
  • PL/Python is enabled on your cluster. See WarehousePG PL/Python Language Extension.
  • You have sourced /usr/edb/whpg7/greenplum_path.sh and the $COORDINATOR_DATA_DIRECTORY and $GPHOME environment variables are set.
  • On air-gapped clusters, python3.11 is installed on all nodes. See Performing a minor upgrade for details.
Note

The pymc and prophet modules require tk at runtime. If you plan to use either module, install the tk OS package on every node in your cluster before installing the Data Science Package:

sudo yum install tk

Downloading and installing the Data Science Package

The Data Science Package is a large download (approximately 1.2 GB). Install it on each host in your WarehousePG cluster.

  1. From the coordinator, download the package from the EDB repository:

    Where <your-token> is your EDB subscription token.

  2. Create a file named all_hosts on the coordinator that lists every host in the cluster, one hostname per line:

    cdw
    scdw
    sdw1
    sdw2
    sdw3
  3. Use gpsync to transfer the package to all hosts, then use gpssh to install it:

    Where <package-name> is the name of the package file you downloaded.

  4. Restart WarehousePG:

    gpstop -r

After installation, the Data Science Package modules are available at $GPHOME/ext/DataSciencePython3.11/lib/python3.11/site-packages/. The package installs a .pth file at /usr/lib/python3.11/site-packages/DataSciencePython.pth that makes the modules available to PL/Python automatically. No changes to greenplum_path.sh or PYTHONPATH are required.
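
The .pth mechanism is a standard feature of Python's site module: each line in a .pth file that names an existing directory is appended to sys.path when the containing site directory is processed. The sketch below illustrates that behavior in isolation using temporary directories. It doesn't read the real DataSciencePython.pth file, whose exact contents aren't shown here, and the names demo_pth_resolution, bundle.pth, and hypothetical_bundle_mod are made up for the example.

```python
import importlib
import site
import sys
import tempfile
from pathlib import Path


def demo_pth_resolution() -> bool:
    """Show that a .pth file makes a separate directory importable."""
    with tempfile.TemporaryDirectory() as tmp:
        site_dir = Path(tmp) / "site-packages"  # stands in for the system site-packages
        module_dir = Path(tmp) / "bundle"       # stands in for the package's module directory
        site_dir.mkdir()
        module_dir.mkdir()

        # A tiny module that lives outside the site directory.
        (module_dir / "hypothetical_bundle_mod.py").write_text("VALUE = 42\n")

        # The .pth file contains one line: the directory to add to sys.path.
        (site_dir / "bundle.pth").write_text(f"{module_dir}\n")

        # Process the site directory the way Python does at startup.
        site.addsitedir(str(site_dir))
        try:
            importlib.invalidate_caches()
            mod = importlib.import_module("hypothetical_bundle_mod")
            return mod.VALUE == 42
        finally:
            # Undo the changes so the demo leaves the interpreter untouched.
            sys.modules.pop("hypothetical_bundle_mod", None)
            for entry in (str(site_dir), str(module_dir)):
                while entry in sys.path:
                    sys.path.remove(entry)
```

Because the interpreter processes .pth files in its site directories at startup, this mechanism needs no PYTHONPATH changes, which is consistent with the behavior described above.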

Uninstalling the Data Science Package

Uninstall the package on all hosts and then restart WarehousePG:

gpssh -f all_hosts -e 'sudo dnf remove -y edb-whpg7-data-science-python311'
gpstop -r
Note

After uninstalling the Data Science Package, any user-defined functions that import modules from this package will return an error.
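
If you want such functions to fail with a clearer message, one option is to wrap the import in a small helper inside the function body. This is a minimal sketch under that assumption. The helper name require_module and the wording of the message are hypothetical, not something the package provides.

```python
import importlib


def require_module(import_name: str):
    """Import a module, raising a descriptive error if it is missing."""
    try:
        return importlib.import_module(import_name)
    except ImportError as exc:
        # An unhandled Python exception in a PL/Python function is
        # reported to the caller as an error, so the message below
        # is what the user would see.
        raise RuntimeError(
            f"Module '{import_name}' is not importable on this host; "
            "check that the Data Science Package is installed."
        ) from exc
```

A function body would then call, for example, np = require_module("numpy") instead of using a bare import statement.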

Module reference

| Module | Description |
|---|---|
| accelerate | HuggingFace library for training and inference at scale |
| beautifulsoup4 | Screen-scraping library |
| bottleneck | Fast NumPy array functions written in C |
| catboost | High-performance gradient boosting on decision trees |
| CherryPy | Object-oriented HTTP framework |
| cryptography | Cryptographic recipes and primitives |
| Cython | Compiler for writing C extensions for Python |
| dask | Parallel computing library that scales NumPy, pandas, and scikit-learn |
| datasets | HuggingFace community-driven open-source library of datasets |
| feedparser | Universal feed parser for RSS, Atom, and CDF feeds |
| formulaic | Implementation of Wilkinson formulas |
| gensim | Python framework for fast Vector Space Modelling |
| gluonts | Probabilistic time series modeling |
| graphviz | Simple Python interface for Graphviz |
| h5py | Read and write HDF5 files from Python |
| holidays | Generate and work with holidays in Python |
| imbalanced-learn | Tools for classification with imbalanced classes |
| InstructorEmbedding | Text embedding using instruction-tuned models |
| Jinja2 | Fast and expressive template engine |
| joblib | Lightweight pipelining with Python functions |
| keras | Deep learning API built on TensorFlow |
| lifelines | Survival analysis including Kaplan-Meier, Nelson-Aalen, and regression |
| lightgbm | Fast, distributed, high-performance gradient boosting framework |
| lime | Local Interpretable Model-Agnostic Explanations for machine learning classifiers |
| lxml | XML and HTML processing library combining libxml2/libxslt with ElementTree |
| matplotlib | Python plotting package |
| netCDF4 | Object-oriented Python interface to the netCDF version 4 library |
| networkx | Creation, manipulation, and study of complex networks |
| nltk | Natural language toolkit |
| numba | JIT compiler for Python using LLVM |
| numexpr | Fast numerical expression evaluator for NumPy |
| numpy | Scientific computing with N-dimensional arrays |
| opencv-python-headless | Computer vision library (no GUI dependencies) |
| openpyxl | Read and write Excel 2010 xlsx/xlsm files |
| orjson | Fast Python JSON library supporting dataclasses, datetimes, and numpy |
| pandas | Data analysis and manipulation |
| patsy | Describing statistical models and building design matrices |
| pdfminer.six | PDF parser and analyzer |
| pillow | Python Imaging Library |
| plotly | Interactive graphing library |
| pmdarima | Python equivalent of R's forecast::auto.arima |
| prophet | Automatic forecasting procedure |
| psycopg2-binary | PostgreSQL database adapter for Python |
| pyarrow | Cross-language in-memory data development platform (Apache Arrow) |
| pydantic | Data validation using Python type hints |
| pyLDAvis | Interactive topic model visualization |
| pymc | Statistical modeling and probabilistic machine learning |
| PyXB-X | Generate Python code for classes corresponding to XMLSchema data structures |
| PyYAML | YAML parser and emitter for Python |
| regex | Alternative regular expression module |
| requests | HTTP library |
| rouge | Full Python ROUGE score implementation |
| sacrebleu | Shareable, comparable, and reproducible BLEU, chrF, and TER scores |
| scikit-image | Image processing algorithms for SciPy |
| scikit-learn | Machine learning, data mining, and data analysis |
| scipy | Scientific computing (integration, optimization, signal processing) |
| seaborn | Statistical data visualization based on matplotlib |
| sentence-transformers | Multilingual sentence, paragraph, and image embeddings using BERT |
| shap | Unified approach to explain the output of any machine learning model |
| spacy | Large-scale natural language processing |
| SQLAlchemy | Database abstraction library |
| statsmodels | Statistical modeling and hypothesis testing |
| sympy | Computer algebra system |
| tensorflow-cpu | Numerical computation using data flow graphs (CPU-only) |
| torch | Tensors and dynamic neural networks (CPU-only) |
| torchaudio | Audio processing for PyTorch (CPU-only) |
| torchvision | Computer vision datasets, models, and transforms for PyTorch (CPU-only) |
| tqdm | Fast, extensible progress meter |
| transformers | State-of-the-art machine learning for JAX, PyTorch, and TensorFlow |
| tslearn | Machine learning toolkit dedicated to time-series data |
| typer | CLI builder based on Python type hints |
| xarray | N-dimensional labeled arrays and datasets |
| xgboost | Gradient boosting for classification and ranking |
| xmltodict | Makes working with XML feel like working with JSON |