We set out to create an evolving open resource on data modeling practices, covering statistical and machine learning topics from scratch (intro level) up to a good level (graduate or applied research level). Almost all useful data science models and statistical methods will be studied under one cover, accompanied by notebooks in Python. We think this resource covers what a data scientist needs to know when touching data for the first time.
When building a data-based model in today's machine learning and big data era, it is crucial to explore the data holistically and from multivariate perspectives, so that models are trained well and can then be improved. Data mining and feature engineering with statistical and visualization methods are especially important when first touching the data. Completing a modeling task, including predictive and supervised modeling, calls for advanced methods (such as manifold learning, in exploration), automated search algorithms for estimation (such as a grid or smart search method, in exploitation), deep tools (such as scikit-learn and Keras, in deepening), and good choices of activation functions and optimizers. Exploratory data analysis, discovering randomness and interdependence in features, model choice, fit, validation, and improvement, outlier detection, and reproducibility each require the data scientist to comprehend almost all of these methods and tools, in addition to a time commitment.
Here, we apply data science methods and tools in a notebook/workshop style, in an exploratory and applied format with real-world data, aiming to open so-called black boxes. In our group, we have a statistician, a programmer, an applied mathematician, an AI expert, and a chameleon. While preparing our notes, we discuss the many aspects and theories of modeling practices, using both classical and up-to-date methods and tools, and reflect them here. Designing search grids and pipelines, writing Python functions to automate these practices, and setting up for streaming data are among the practices we include as well.
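As a small illustration of what "designing a search grid" and "writing Python functions to automate these practices" can look like, here is a minimal sketch of a cross-validated grid search written from scratch. It is not taken from the group's notebooks: the function names (k_fold_indices, fit_ridge_1d, cv_mse, grid_search), the one-dimensional ridge model, and the toy data are all assumptions chosen to keep the example self-contained. In practice, a library routine such as scikit-learn's GridSearchCV would likely be used instead.

```python
from itertools import product


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train_idx, test_idx)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def fit_ridge_1d(x, y, alpha):
    """Closed-form slope of a 1-D ridge fit without intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


def cv_mse(x, y, alpha, k=3):
    """Mean squared error averaged over k cross-validation folds."""
    errors = []
    for train, test in k_fold_indices(len(x), k):
        w = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], alpha)
        errors.append(sum((y[i] - w * x[i]) ** 2 for i in test) / len(test))
    return sum(errors) / len(errors)


def grid_search(x, y, param_grid, k=3):
    """Evaluate every parameter combination and return the best one."""
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((cv_mse(x, y, k=k, **params), params))
    best = min(results, key=lambda r: r[0])
    return best, results


# Toy, noiseless data (hypothetical): y = 2x, so no regularization is needed.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 * v for v in x]

(best_score, best_params), all_results = grid_search(
    x, y, {"alpha": [0.0, 1.0, 10.0]}, k=3
)
print(best_params, best_score)
```

Because the toy data carry no noise, the search should prefer the smallest penalty; with real data, the preferred value depends on the noise level and the fold assignment, which is why the grid and the validation scheme deserve to be designed together.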
The outline of the resource can be found in the table of contents (notebooks/All_Steps_A_Data_Scientist_Should_Know_and_Apply.ipynb). We post and update the notebooks for each method under the Notebook folder.
We will develop this resource as we learn, discuss, and apply methods. We owe thanks to the data scientists who share their resources, which have benefited our notes. Eventually, all notes and files will be reorganized in the form of workshops and notebooks.
YB on behalf of Data Science Group, Rochester, March 2020