This two-part workshop series provides an introduction to using R for
two popular machine learning techniques: clustering and
classification. Clustering involves identifying groups of similar
observations (called clusters) within data. Clustering can be an
effective tool for finding patterns and an important part of exploratory
data analysis. Classification refers to modeling a categorical response
variable. Classification models can provide insight into the relationship between
the predictors and response, as well as a way to make predictions about
new observations.
After this workshop series, learners should be able to:
- Assess whether classification or clustering is relevant to their research problems and data sets;
- Explain the tradeoffs between popular clustering algorithms;
- Run a clustering algorithm on their data;
- Build and train a classification model on their data;
- Use cross-validation to estimate accuracy and tune hyperparameters for classification models;
- Identify strategies to improve results from classification models.
Prerequisites: This workshop is designed for researchers who have data that they are
already working with in R. Participants must have taken DataLab’s
“Overview of Statistical Machine Learning,” “R Basics,” and “Regression
in R” workshop series, or have equivalent prior experience. Completion
of DataLab’s “Intermediate R” series is recommended but not required.
Participants must be comfortable with basic R syntax. The focus of the
workshop is on implementing clustering and classification in R, not on
learning the R language itself. Bring a laptop with the latest versions
of R and RStudio installed and running.