Overview
This hands-on machine learning course advances participants’ data analysis skills. It covers real-world predictive modeling and basic machine learning techniques that help participants excel at data analysis in their organizations. Participants work in R throughout to build a solid data science foundation and learn techniques that enable them to use their data in more sophisticated and powerful ways.
What You'll Learn
- Understanding machine learning and data science
- Introduction to data mining
- Working with missing values, outliers and duplicate records
- Working with linear regression models and classification models
- Performing cluster analysis
- Learning dimension reduction techniques
Curriculum
- Data science as a quantitative discipline
- How to define the scope of Data Science
- The many faces of Data Science: Data Mining, Data Analysis, Data Analytics, Machine Learning, Predictive Modeling, Statistical Learning, Mathematical Modeling. What are these all about?
- Data Mining as a data exploration process
- Machine Learning: supervised vs. unsupervised
- Machine Learning vs. Predictive Analytics
- Big Data Analytics: what it is and why it’s important
- Overview of the data mining process cycle
- Understanding business needs and identifying new business opportunities
- Formulating a business problem and associated requirements
- Defining key quantitative metrics to measure success and evaluating business benefits
- Translating business requirements into technical requirements and documentation
- Formulating data models based on business and technical requirements
- Identifying a set of quantitative models based on technical requirements and metrics of success
- Running the models and evaluating results
- Selecting the best model
- Deploying the model
- Data sources
- Types of data
- Structured vs. unstructured data
- Static data vs. real-time data
- Types of data attributes: numerical vs. categorical
- Role of the time factor and time trends in data analysis
- Working with missing values
- Main causes of missing data
- Understanding the importance of missing information
- Types of missing information
- Restoring missing values
- Imputing missing values and selecting imputation techniques
- Understanding and evaluating potential consequences of manipulating records with missing values
- Working with outliers
- Defining quantitative criteria for outlier detection in 1D cases
- Understanding the role of outliers in model building
- Deciding on outlier removal
- Defining outlier detection metrics in multi-dimensional space
- Working with duplicate records
- Defining duplicates
- Understanding sources of duplicates
- Deciding on duplicate removal
- Why sampling may be important for Machine Learning
- Sampling techniques and sample bias
- Statistical hypothesis
- Z-score, t-score and F statistic
- P-values
- Implementation of hypothesis testing for model evaluation analysis
- What is Machine Learning?
- Supervised vs. unsupervised learning
- Overview of supervised Machine Learning
- Regression models
- Classification models
- Overview of unsupervised Machine Learning
- Clustering methods
- Principal component analysis and dimension reduction
- Association rules
- Overview of major steps in building and testing quantitative models
- Criteria for model selection
- How to prepare a training set
- Criteria for selecting model attributes/predictors
- Working with collinear variables
- Addressing the class imbalance problem
- Dealing with over-fitting; bias-variance tradeoff
- Validation and cross-validation
- Univariate regression vs. multiple regression
- Overview of the mathematical foundation of linear regression: least squares method vs. maximum likelihood method
- Model assumptions
- Working with continuous attributes
- Dealing with collinear variables
- Model subset selection
- Forward stepwise selection
- Backward selection
- Shrinkage methods: ridge regression and Lasso
- Dimension reduction
- Information criteria
- Automating the model selection procedure
- Model parameter evaluation: R-squared vs. adjusted R-squared
- Validating the model
- Working with categorical variables
- Considering input variable interactions
- Dealing with imbalanced training sets
- Understanding the confusion matrix
- Evaluating binary classifiers using ROC / AUC
- Overview of the mathematical foundation of cluster analysis
- K-means clustering method
- Algorithm overview
- Convergence criteria
- How to determine the number of clusters
- What is dimension reduction?
- The practical goals of dimension reduction implementation
- Principal component analysis vs. singular value decomposition
- How many components to choose
- What was not covered in the class
- Big Data Analytics – the future of machine learning: main tools and concepts
Who Should Attend
The course is highly recommended for:
- Data analysts
- Machine learning professionals
- Business analysts
- Data mining specialists
Prerequisites
Participants need intermediate-level data analysis skills and basic knowledge of descriptive statistics. Experience working with R is beneficial.
Technical requirements: R installed, along with some R packages. Installing RStudio is helpful but not required.