joojowalker/awesome-python-data-science

Awesome Python Data Science

A curated list of Python libraries used for data science.

Contents

Machine Learning Frameworks

  • scikit-learn - Machine learning.
  • CatBoost - Gradient boosting library with categorical features support.
  • LightGBM - Fast, distributed, high performance gradient boosting.
  • XGBoost - Scalable, portable and distributed gradient boosting.
  • PyMC - Probabilistic Programming.
  • statsmodels - Statistical modeling and econometrics.
  • SymPy - A computer algebra system.
  • NetworkX - Creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
  • dask-ml - Distributed and parallel machine learning.
  • imbalanced-learn - Perform under sampling and over sampling.
  • lightning - Large-scale linear models.
  • sklearn-crfsuite - API for CRFsuite, Conditional Random Fields for labeling sequential data.
  • vowpal_porpoise - Wrapper for vowpal_wabbit.
  • scikit-optimize - Sequential model-based optimization with a scipy.optimize interface.
  • BayesianOptimization - Global optimization with gaussian processes.
  • gplearn - Genetic Programming.
  • scikit-multilearn - Scikit-learn based module for multi-label.
  • mlens - ML-Ensemble high performance ensemble learning.
  • speedml - Speed start machine learning projects.
  • fastFM - Factorization Machines.
  • python-glmnet - glmnet package for fitting generalized linear models.
  • hmmlearn - Hidden Markov Models.
  • vecstack - Stacking (machine learning technique).
  • bayespy - Bayesian inference tools.
  • modAL - Modular active learning framework.
  • deap - Evolutionary computation framework.
  • pyro - Deep universal probabilistic programming with PyTorch.
  • civisml-extensions - scikit-learn-compatible estimators from Civis Analytics.
  • hyperopt-sklearn - Hyper-parameter optimization for sklearn.
  • zhusuan - A Library for Bayesian Deep Learning, Generative Models, Based on Tensorflow.
  • Kaggler - Code for Kaggle Data Science Competitions. Includes FTRL.
  • scikit-survival - Survival analysis built on top of scikit-learn.
  • dstoolbox - Tools that make working with scikit-learn and pandas easier.
  • dowhy - A unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
  • modin - Unify the way you interact with your data.
  • pyomo - Python Optimization MOdels.
  • pymc-learn - Practical probabilistic machine learning.
  • BAMBI - BAyesian Model-Building Interface.
  • pymc4 - A high-level probabilistic programming interface for TensorFlow Probability.
  • combo - A Python Toolbox for Machine Learning Model Combination.

Scientific

  • NumPy - A fundamental package for scientific computing with Python.
  • SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering.
  • Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
  • Numba - NumPy aware dynamic Python compiler using LLVM.
  • blaze - NumPy and Pandas for databases.
  • astropy - Astronomy and astrophysics.
  • Biopython - Biological computation.
  • PyDy - Multibody Dynamics.
  • DIPY - Diffusion MR Imaging.
  • bcolz - Columnar data container that can be compressed.
  • nilearn - NeuroImaging.
  • patsy - Describing statistical models using symbolic formulas.
  • numexpr - Fast numerical array expression evaluator.
  • dask - Parallel computing with task scheduling.
  • or-tools - Google's Operations Research tools. Classical CS algorithms.
  • cvxpy - Python-embedded modeling language for convex optimization problems.

Deep Learning Frameworks

  • TensorFlow - Deep learning framework.
  • PyTorch - Deep learning framework.
  • onnx - Open Neural Network Exchange.
  • Keras - High-level neural networks API.
  • tensorlayer - A Deep Learning and Reinforcement Learning Library for Researchers and Engineers.
  • chainer - A flexible framework of neural networks for deep learning.
  • mxnet - Apache MXNet: A flexible and efficient library for deep learning.

Deep Learning Tools

  • Edward - Probabilistic programming language in TensorFlow.
  • pomegranate - Probabilistic modelling.
  • skorch - Scikit-learn PyTorch.
  • DLTK - Deep Learning Toolkit for Medical Image Analysis.
  • sonnet - TensorFlow-based neural network library.
  • rasa_core - Dialogue engine.
  • luminoth - Computer Vision.
  • allennlp - NLP Research library.
  • spotlight - Pytorch Recommender framework.
  • tensorforce - TensorFlow library for applied reinforcement learning.
  • tensorboard-pytorch - Tensorboard for pytorch.
  • keras-vis - Neural network visualization toolkit for keras.
  • hyperas - Keras + Hyperopt.
  • spaCy - Natural Language processing.
  • tensorboard_logger - Log TensorBoard events without touching TensorFlow.
  • keras-contrib - Keras community contributions.
  • tfdeploy - Deploy tensorflow graphs.
  • ktext - Utilities for preprocessing text for deep learning with Keras.
  • foolbox - Python toolbox to create adversarial examples that fool neural networks.
  • pytorch/vision - Datasets, Transforms and Models specific to Computer Vision.
  • gluon-nlp - NLP made easy.
  • PyTorch-GAN - PyTorch implementations of Generative Adversarial Networks.
  • pytorch/ignite - High-level library to help with training neural networks in PyTorch.
  • NMT - Neural machine translation and neural sequence modeling.
  • Netron - Visualizer for deep learning and machine learning models.
  • gpytorch - A highly efficient and modular implementation of Gaussian Processes in PyTorch.
  • tensorly - Tensor Learning in Python.
  • einops - Deep learning operations reinvented.
  • hiddenlayer - Neural network graphs and training metrics for PyTorch, Tensorflow, and Keras.
  • dgl - Python package built to ease deep learning on graph, on top of existing DL frameworks.
  • segmentation_models.pytorch - Segmentation models with pretrained backbones.
  • pytorch-lightning - The lightweight PyTorch wrapper.

Deep Learning Projects

Visualization

  • matplotlib - 2D plotting.
  • seaborn - Visualization library.
  • bokeh - Interactive web plotting.
  • plotly - Collaborative web plotting.
  • dash - Interactive Web plotting.
  • altair - Declarative statistical visualization.
  • folium - Leaflet.js Maps.
  • geoplot - High-level geospatial data visualization.
  • datashader - Graphics pipeline system.
  • mplleaflet - Converts matplotlib plots into interactive Leaflet web maps.
  • matplotlib-venn - Area-weighted venn-diagrams.
  • pyLDAvis - Interactive topic model visualization.
  • cufflinks - Productivity Tools for Plotly + Pandas.
  • scatterText - Visualizations of how language differs among document types.
  • plotnine - ggplot for python.
  • ggpy - ggplot for python.
  • mizani - scales package.
  • bqplot - Plotting library for IPython/Jupyter Notebooks.
  • PtitPrince - Raincloud plots.
  • joypy - Ridgeline plots.
  • dtreeviz - Decision tree visualization and model interpretation.
  • ipyvolume - 3d plotting for Python in the Jupyter notebook based on IPython widgets using WebGL.

AutoML

  • Nevergrad - Gradient-free optimization.
  • featuretools - Automated feature engineering.
  • auto-sklearn - Automated machine learning.
  • tpot - Automated machine learning.
  • auto_ml - Automated machine learning.
  • MLBox - Automated Machine Learning python library.
  • devol - Automated deep neural network design via genetic programming.
  • skll - SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.
  • autokeras - Automated machine learning in Keras.
  • SMAC3 - Sequential Model-based Algorithm Configuration.

Exploration

  • mlxtend - A library of extension and helper modules for Python's data analysis and machine learning libraries.
  • yellowbrick - Visual analysis and diagnostic tools.
  • pandas-profiling - Profiling reports for pandas DataFrame objects.
  • Skater - Model Agnostic Interpretation.
  • Dora - Exploratory data analysis.
  • sklearn-evaluation - scikit-learn model evaluation.
  • fitter - Simple class to identify the distribution from which a data sample was generated.
  • missingno - Missing data visualization.
  • hypertools - Gaining geometric insights into high-dimensional data.
  • scikit-plot - Plotting functionality to scikit-learn objects.
  • elih - Explain Machine Learning.
  • kmeans_smote - Oversampling for imbalanced learning based on k-means and SMOTE.
  • pyUpSet - UpSet suite of visualisation methods.
  • lime - Explaining the predictions of any machine learning classifier.
  • pandas-summary - An extension to pandas dataframes describe function.
  • SauceCat/PDPbox - Partial dependence plot toolbox.
  • shap - A unified approach to explain the output of any machine learning model.
  • eli5 - Debug machine learning classifiers and explain their predictions.
  • rfpimp - Permutation and drop-column importance for scikit-learn random forests.
  • pypeln - Concurrent data pipelines made easy.
  • pycm - Multi-class confusion matrix library in Python.
  • great_expectations - Always know what to expect from your data.
  • innvestigate - A toolbox to iNNvestigate neural networks' predictions.
  • alibi - Algorithms for monitoring and explaining machine learning models.
  • InterpretML - Fit interpretable models. Explain blackbox machine learning.
  • cleanlab - Finding label errors in datasets and learning with noisy labels.
  • dtale - Flask/React client for visualizing pandas data structures.
  • dabl - Data Analysis Baseline Library.

Feature Extraction

General Feature Extraction

  • sklearn-pandas - Pandas integration with sklearn.
  • pdpipe - Easy pipelines for pandas DataFrames.
  • engarde - Defensive data analysis.
  • datacleaner - Tool that automatically cleans data sets and readies them for analysis.
  • categorical-encoding - sklearn compatible categorical variable encoders.
  • fancyimpute - Multivariate imputation and matrix completion algorithms.
  • raccoon - DataFrame with fast insert and appends.
  • kmodes - k-modes and k-prototypes clustering algorithm.
  • annoy - Approximate Nearest Neighbors.
  • scikit-feature - Filter methods for feature selection.
  • mifs - Parallelized Mutual Information based Feature Selection module.
  • skggm - Scikit-learn compatible estimation of general graphical models.
  • dirty_cat - Encoding methods for dirty categorical variables.
  • Impyute - Data imputations library to preprocess datasets with missing data.
  • eif - Extended Isolation Forest for Anomaly Detection.
  • featexp - Feature exploration for supervised learning.
  • feature_engine - Feature engineering package with sklearn like functionality.
  • stumpy - STUMPY is a powerful and scalable Python library that can be used for a variety of time series data mining tasks.
  • n2 - Lightweight approximate Nearest Neighbor library which runs faster even with large datasets.

Time Series

  • Causality - Causal analysis.
  • traces - Unevenly-spaced time series analysis.
  • PyFlux - Time series library for Python.
  • prophet - Tool for producing high quality forecasts.
  • tsfresh - Automatic extraction of relevant features from time series.
  • tslearn - Machine learning toolkit dedicated to time-series data.
  • pyts - A Python package for time series transformation and classification.
  • sktime - A scikit-learn compatible Python toolbox for learning with time series data.

Audio

  • python_speech_features - Speech features.
  • speechpy - A Library for Speech Processing and Recognition.
  • magenta - Music and Art Generation with Machine Intelligence.
  • librosa - Audio and music analysis.
  • pydub - Manipulate audio with a simple and easy high level interface.
  • pytorch/audio - Simple audio I/O for PyTorch.

Images and Video

  • pillow - PIL fork.
  • scikit-image - Image processing.
  • hmap - Image histogram remapping.
  • pyocr - A wrapper for Tesseract and Cuneiform (Optical Character Recognition).
  • scikit-video - Video processing.
  • moviepy - Video editing.
  • OpenCV - Open Source Computer Vision Library.
  • SimpleCV - Wrapper around OpenCV.
  • label-maker - Data Preparation for Satellite Machine Learning.
  • face_recognition - Facial recognition.
  • imgaug - Image augmentation.
  • pyvips - Fast image processing.
  • aeneas - Set of tools to automagically synchronize audio and text.
  • ImageHash - Image hashing.
  • Augmentor - Image augmentation library.
  • PyAV - Bindings for FFmpeg.
  • imutils - Convenience functions to make basic image processing operations.
  • albumentations - Fast image augmentation library.

Geolocation

Web Content

  • sum - Automatic summarization of text documents and HTML.
  • textract - Extract text from any document.
  • newspaper - News extraction, article extraction and content curation.

Text/NLP

  • BlingFire - A lightning fast Finite State machine and REgular expression manipulation library.
  • BERT-pytorch - Google AI 2018 BERT pytorch implementation.
  • pytorch-pretrained-BERT - PyTorch version of Google AI's BERT model with script to load Google's pre-trained models.
  • gensim - Topic Modeling.
  • pattern - Web mining module.
  • probablepeople - Parsing unstructured western names into name components.
  • Expynent - Regular expression patterns.
  • mimesis - Generate synthetic data.
  • pyenchant - Spell checking.
  • parserator - Domain-specific probabilistic parsers.
  • scrubadub - Clean personally identifiable information from dirty dirty text.
  • usaddress - Parsing unstructured address strings into address components.
  • python-phonenumbers - Python port of Google's libphonenumber.
  • jellyfish - Approximate and phonetic matching of strings.
  • pronouncing - Simple interface for the CMU Pronouncing Dictionary.
  • langid - Stand-alone language identification system.
  • fuzzywuzzy - Fuzzy String Matching.
  • Fuzzy - Soundex, NYSIIS, Double Metaphone.
  • snowball - Snowball compiler and stemming algorithms.
  • leven - Levenshtein edit distance.
  • flashtext - Extract Keywords from sentence or Replace keywords in sentences.
  • polyglot - Multilingual text NLP processing toolkit.
  • sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.
  • pyfasttext - Binding for fastText.
  • python-wordsegment - English word segmentation.
  • pyahocorasick - Exact or approximate multi-pattern string search.
  • Wordbatch - Parallel text feature extraction for machine learning.
  • langdetect - Port of Google's language-detection library.
  • translation - Uses web services for text translation.
  • nltk - Natural Language Toolkit.
  • unidecode - ASCII transliterations of Unicode text.
  • pytorch/text - Data loaders and abstractions for text and NLP.
  • textdistance - Compute distance between sequences.
  • sent2vec - General purpose unsupervised sentence representations.
  • pyhunspell - Python bindings for the Hunspell spellchecker engine.
  • facebook/fastText - Library for fast text representation and classification.
  • textblob - Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
  • facebook/InferSent - Sentence embeddings (InferSent) and training code for NLI.
  • nmslib - Non-Metric Space Library.
  • ftfy - Fixes mojibake and other glitches in Unicode text, after the fact.
  • fletcher - Pandas ExtensionDType/Array backed by Apache Arrow.
  • textacy - NLP, before and after spaCy.
  • hmtl - Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP.
  • pytext - A natural language modeling framework based on PyTorch.
  • flair - A very simple framework for state-of-the-art Natural Language Processing.
  • LASER - Language-Agnostic SEntence Representations.
  • transformer-xl - Attentive Language Models Beyond a Fixed-Length Context.
  • textstat - Calculate readability statistics of a text object - paragraphs, sentences, articles.

Graphs

  • louvain - Louvain Community Detection.

Time

Ranking/Recommender

  • Surprise - Analyzing recommender systems.
  • trueskill - TrueSkill rating system.
  • LightFM - Hybrid recommendation algorithm.
  • implicit - Collaborative Filtering for Implicit Datasets.

Trading

  • Clairvoyant - Identify and monitor social/historical cues.
  • zipline - Algorithmic Trading Library.
  • qstrader - Advanced Trading Infrastructure.

Misc

  • sklearn-porter - Transpile trained scikit-learn estimators.
  • sklearn-compiledtrees - Compiled Decision Trees for scikit-learn.
  • Metrics - Machine learning evaluation metrics.
  • bonobo - Extract Transform Load.
  • pyemd - Earth Mover's Distance metric.
  • fastai - The fast.ai deep learning library, lessons, and tutorials.
  • mmh3 - MurmurHash3, a set of fast and robust hash functions.
  • fbpca - Fast Randomized PCA/SVD.
  • annoy - Approximate Nearest Neighbors.
  • mlcrate - Handy tools and functions.
  • pipeline - Standard Runtime For Every Real-Time Machine Learning.
  • tabulate - Pretty-print tabular data in Python, a library and a command-line utility.
  • crayon - A language-agnostic interface to TensorBoard.
  • faiss - A library for efficient similarity search and clustering of dense vectors.
  • neurtu - A Python package for parametric benchmarks.
  • py-spy - Sampling profiler for Python programs.

Deployment

  • palladium - Framework for setting up predictive analytics services.
  • lore - Lore makes machine learning approachable for Software Engineers and maintainable for Machine Learning Researchers.
  • kubeflow - Machine Learning Toolkit for Kubernetes.
  • great_expectations - A framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests.
  • mara/data-integration - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow.
  • airflow - ETL.
  • mlflow - Open source platform for the complete machine learning lifecycle.

Python Tools

  • pip-tools - Keeps dependencies up to date.
  • devpi - PyPI server and packaging/testing/release tool.
  • Jupyter Notebook - Notebooks are awesome.
  • click - CLI package.
  • sacredboard - Dashboard for sacred.
  • sacred - Reproduce computational experiments.
  • python-flamegraph - Statistical profiler which outputs in format suitable for FlameGraph.
  • magic-wormhole - Get things from one computer to another, safely.
  • memory_profiler - Monitor the memory usage of a Python program.
  • line_profiler - Line-by-line profiling.
  • parse - Parse strings using a specification based on the Python format() syntax.
  • CleverCSV - A Python package for handling messy CSV files.

Data Gathering

  • gain - Web crawling framework based on asyncio.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • camelot - Camelot: PDF Table Extraction for Humans.
