Machine Learning ECEN 250 Exam 1 Notes/Study Guide Kevin Nowka

Introduction:
- ML is about deriving models from data to improve decision-making and outcomes.
- Learning can be supervised (with labeled data) or unsupervised (without labeled data).
- From data, we can learn similarity, associations, structures, and parameters, which can be applied to various real-world problems like image classification, topic formation, and energy forecasting.
Data:
- Types of Data: Numeric (discrete [integer counts], continuous [real-valued]) and categorical (ordinal [ordered], nominal [unordered]).
- Dataframe Structure: Rows (instances), columns (features), and dimensionality (number of attributes).
- Labeling: Labels are the target outputs the model learns to predict; they are typically added as a column in a CSV file or encoded via directory structure or labeling tools.
- Dealing with Data Issues: Standardize inconsistent data, handle missing data (remove or impute), and manage outliers and unnecessary data.
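A minimal pandas sketch of the remove-or-impute options above; the column names and values are invented for illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical dataframe: rows = instances, columns = features, "label" = target column.
df = pd.DataFrame({
    "voltage": [3.3, 5.0, np.nan, 3.3, 120.0],   # one missing value, one outlier
    "vendor":  ["RR", "Lenny", "Ace", "RR", None],
    "label":   [0, 1, 0, 0, 1],
})

# Impute: fill missing numeric values with the mean, missing categories with a placeholder.
df["voltage"] = df["voltage"].fillna(df["voltage"].mean())
df["vendor"] = df["vendor"].fillna("unknown")

# Manage outliers: here, simply drop rows far outside the typical range.
df = df[df["voltage"] < 50]

# Remove: alternatively, drop any rows that still contain missing values.
df = df.dropna()
print(df)
```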
Descriptive Statistics:
- Population vs Sample: Population is the complete set of data, while a sample is a statistically significant subset.
- Measures of Frequency: Count, proportions, and occurrence percentages help understand how often events occur.
- Measures of Central Tendency: Mean (arithmetic; geometric [exponential processes, compound interest]; harmonic [pipeline flow, averaging of flows, average resistance]), median, and mode describe the central point of the data.
- Measures of Dispersion: Range, variance, and standard deviation describe how spread out the data is.
- Scaling: Normalization (rescale to the range 0-1) and standardization (rescale to mean 0, standard deviation 1) are used to rescale data for better performance in ML.
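A short sketch of both rescaling options using scikit-learn; the toy feature values are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [4.0], [7.0], [10.0]])   # toy feature column

# Normalization: rescale values into the range [0, 1].
x_norm = MinMaxScaler().fit_transform(X)

# Standardization: rescale to mean 0 and standard deviation 1.
x_std = StandardScaler().fit_transform(X)

print(x_norm.ravel())   # [0.  0.333  0.667  1.]
print(x_std.ravel())    # mean ~ 0, standard deviation ~ 1
```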
Clustering (Unsupervised Learning):
- Similarity: Measured using distance metrics like Euclidean distance (ED) to group similar data points into clusters.
- k-Means Clustering: An iterative algorithm that alternates between assigning points to clusters and updating cluster centers (Initialization [choose number k], Assignment [assign to the nearest center by ED], Update [recalculate each center as the average of its points], Repeat); see the sketch after this list.
- k in k-Means: The number of clusters; choose carefully to avoid under-clustering (dissimilar points grouped together) or over-clustering (clusters that are too specific).
- Other Clustering Techniques: Includes hierarchical clustering, DBSCAN, Gaussian Mixture Models, Spectral Clustering, Mean Shift, Affinity Propagation, and Agglomerative Clustering.
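A minimal k-means sketch with scikit-learn; the six 2-D points and the choice k=2 are made up so the assign/update loop has two obvious groups to find.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of 2-D points (invented toy data).
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

# k (number of clusters) is chosen up front; KMeans then alternates
# assignment (nearest center by Euclidean distance) and update (recompute the means).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # final cluster centers (the means of each group)
```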
Probability:
- Random Variable: Maps sample space outcomes to values (discrete or continuous) [e.g., weather {Sunny}; temperature {50} or {Hot}].
- Event: A subset of the sample space (the event of rolling an even number is A = {2, 4, 6}).
- Probability: Measures the likelihood of an event. Sum of all probabilities is 1.
- Distributions: PMF (probability mass function)(Assigns probabilities to each value of a discrete variable)
- PMF: Laptop vendor L: P(L=RR) = 0.37, P(L=Lenny) = 0.28, P(L=Ace) = 0.35.
- Joint: Specifies the probability of each combination of values for multiple random variables.
- P(L,T): Probability of a laptop from vendor L having trouble T.
- Marginal: Obtained by summing out one or more variables from a joint distribution.
- P(L) = ∑_T P(L,T): Probability of a laptop being from vendor L, regardless of the trouble type.
- Conditional distributions: Probability distribution of one variable given the value of another.
- P(T∣L=Ace): Probability of trouble T given that the laptop is from Ace.
- Product Rule: Relates joint and conditional probabilities. P(R,T) = P(T)·P(R∣T)
- Bayes’ Rule: Allows us to reverse conditional probabilities. P(y∣x) = P(x∣y)·P(y) / P(x)
- Independence: Variables are independent if knowing one doesn’t affect the other. P(X,Y) = P(X)·P(Y). Conditional: P(X,Y∣Z) = P(X∣Z)·P(Y∣Z)
- Expectation: The average value of a random variable, weighted by probabilities. E(X) = ∑_i x_i·P(X=x_i). Ex: For the set of observations {3,1,9,3}, the expectation is E(X) = 1·P(X=1) + 3·P(X=3) + 9·P(X=9) = 1·0.25 + 3·0.5 + 9·0.25 = 4.
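A small sketch that turns the {3, 1, 9, 3} example above into code: build the empirical PMF from counts, then compute the expectation.

```python
from collections import Counter

observations = [3, 1, 9, 3]          # the example from the notes
n = len(observations)

# Empirical PMF: P(X = x) = count(x) / n.
pmf = {x: count / n for x, count in Counter(observations).items()}

# Expectation: E(X) = sum of x * P(X = x) over all values x.
expectation = sum(x * p for x, p in pmf.items())

print(pmf)           # {3: 0.5, 1: 0.25, 9: 0.25}
print(expectation)   # 1*0.25 + 3*0.5 + 9*0.25 = 4.0
```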
Regression:
- Linear Regression (line of best fit): Fits a linear model to the data by minimizing the error (RSS) between observed and predicted values.
- Error in Fit: The difference between observed and predicted values, which is minimized during the fitting process. Residual = Yi − Ŷi
- Residual Sum of Squares (RSS): Measures the total error between the observed and predicted values. RSS = ∑(Yi − Ŷi)^2
- Simple vs. Multiple vs. Polynomial:
- Simple: One independent variable, linear relationship. Ŷ = w0 + w1X
- Multiple: Multiple independent variables, linear relationship. Ŷ = w0 + w1X1 + w2X2 + … + wnXn
- Polynomial: Non-linear relationships, using polynomial terms of the independent variables. Ŷ = w0 + w1X + w2X^2
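A sketch of simple vs. polynomial regression with scikit-learn; the x/y values are invented so the quadratic fit has something to find.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Invented toy data that roughly follows y = 1 + x^2.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.1, 4.9, 10.2, 16.8])

# Simple linear regression: Y_hat = w0 + w1*X, fit by minimizing RSS.
simple = LinearRegression().fit(X, y)
print(simple.intercept_, simple.coef_)     # w0, w1

# Polynomial regression: add an X^2 feature, then fit the same linear model.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(poly.predict([[5.0]]))               # prediction for a new x
```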
Error, Underfit, Overfit:
- Underfitting occurs when a model is too simple to capture the underlying patterns in the data, while overfitting occurs when a model learns the training data too well, including noise, and performs poorly on new, unseen data.
- Goodness of Fit Metrics: Mean Squared Error (MSE), R-squared (R^2), Standard Error (S)
- MSE: Measures the average squared difference between observed and predicted values. Lower MSE indicates a better fit. MSE = RSS / n
- R^2: Represents the proportion of variance in the dependent variable that is explained by the model. Ranges from 0 to 1, 1 indicates a perfect fit.
- Standard Error: Measures the accuracy of predictions. A lower standard error indicates a better fit.
- Bias vs Variance: Bias leads to underfitting, while variance leads to overfitting.
- As model complexity increases, bias decreases but variance increases.
- Generalization: Refers to how well a model performs on new, unseen data (not used during training). Low test error indicates good generalization.
- Regularization: Adds pressure during model tuning to keep the weights θ small; keeping θ small reduces overfitting.
- Ridge Regression: Penalizes large weights by adding to the cost function (MSE) a fraction of the square of each weight. Scikit-Learn: Ridge(alpha, solver). See the sketch after this list.
- Lasso Regression: Drives the least important weights to zero (resulting in a simpler model). Scikit-Learn: Lasso(alpha).
- Covariance and Correlation: Covariance measures how two variables change together (+ = same direction, − = opposite), while correlation measures the strength and direction of their linear relationship on a scale from −1 to 1. Correlation 1 = perfect positive, −1 = perfect negative, 0 = no linear relationship.
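A sketch of Ridge vs. Lasso on synthetic data (five features, only two informative, values generated randomly for illustration); the alpha values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 5 features, but y depends only on the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Ridge: adds alpha * sum(w^2) to the MSE cost, shrinking large weights.
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso: adds alpha * sum(|w|), driving the least important weights to exactly zero.
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)   # all weights shrunk a little, none exactly zero
print(lasso.coef_)   # weights for the three irrelevant features driven to ~0
```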
Classification (Supervised Learning):
- Classification Problem: Predicting the class (y) of a given input based on its features (X). (Abbreviation: KNN = K-Nearest Neighbors (K-NN).)
- K-NN: A simple, non-parametric classification algorithm that assigns a class based on the majority class of the k nearest neighbors. Choose a distance metric (ED), select the number of neighbors k, and assign the class most frequent among the k neighbors (see the sketch after this list).
  - When to use: Modest dimensionality (works well when the number of features is relatively small, less than ~20) and lots of training data (requires a large amount of training data to perform well).
  - Advantages: Training is very fast; can learn complex target functions.
  - Disadvantages: Slow at query (inference) time; irrelevant features can confuse the classifier.
- Accuracy, Precision, Recall, F1: Metrics used to evaluate the performance of a classification model.
  - Accuracy: The fraction of correctly classified instances out of the total number of instances. Not suitable for imbalanced datasets.
  - Precision: The fraction of true positive predictions out of all positive predictions. Measures how many of the predicted positives are actually correct.
  - Recall: The fraction of true positives out of all actual positives. Measures how many of the actual positives are correctly predicted.
  - F1 Score: The harmonic mean of precision and recall. Useful when you want to balance precision and recall, especially on imbalanced datasets.
- TP = correctly predicted positive class, TN = correctly predicted negative class, FP = incorrectly predicted positive class (actual is negative), FN = incorrectly predicted negative class (actual is positive).
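A sketch of K-NN plus the four metrics above, using a synthetic dataset from scikit-learn; k=5 and the 70/30 split are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic low-dimensional classification data (toy problem for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# k-NN: pick k, then assign each query the majority class of its k nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)    # Euclidean distance by default
knn.fit(X_train, y_train)                    # "training" just stores the data (fast)
y_pred = knn.predict(X_test)                 # inference is the slower step

print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
```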
Inferencing:
- Bayes’ Rule is used in Naïve Bayes to calculate the posterior probability of a class given the features, assuming conditional independence between features. [The formula for Bayes’ Rule is: P(Y∣X) = P(X∣Y)·P(Y) / P(X)]
- P(Y∣X): Posterior probability of class Y given features X. P(X∣Y): Likelihood of observing features X given class Y.
- P(Y): Prior probability of class Y. P(X): Evidence (probability of observing features X).
- Naïve Bayes is a simple, fast, probabilistic classifier that works well with high-dimensional data. It assumes conditional independence between features, is robust to irrelevant features, and relies on prior probabilities P(Y) and likelihoods P(X∣Y), which are estimated from the training data (see the sketch after this list).
- The parameters of Naïve Bayes are the prior probabilities P(Y) and the likelihoods P(Xi∣Y).
- A confusion matrix is a table used to evaluate classification performance by comparing predicted and actual labels.
- Different types of Naïve Bayes classifiers (Bernoulli = binary features, Multinomial = discrete counts such as words, Gaussian = continuous features) are chosen based on the type of data.
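A minimal Gaussian Naïve Bayes sketch on scikit-learn's iris dataset (chosen here only because its features are continuous); it also prints a confusion matrix.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Continuous features, so the Gaussian variant is appropriate.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)      # estimates priors P(Y) and per-feature likelihoods P(Xi|Y)
y_pred = nb.predict(X_test)   # picks the class with the largest posterior P(Y|X)

print(nb.class_prior_)                    # estimated prior probabilities P(Y)
print(confusion_matrix(y_test, y_pred))   # rows = actual class, columns = predicted class
```

For discrete word-count features, MultinomialNB applies the Laplace smoothing discussed below through its alpha parameter (alpha=1.0 by default).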
Robust ML Methodology:
- Supervised ML Methodology: Involves gathering data, splitting it into training/validation/test sets, training the model, validating it, and finally testing it.
- Feature Selection: Choosing the most relevant features for the model. Correlation analysis (seaborn), Random Forest (feature importance), Lasso (regularization).
- Feature Engineering: Modifying or creating new features to improve model performance.
  - Scaling of random variables (recall we did this in linear regression); combining random variables into new features.
  - Adding new features (e.g., polynomial features); modifying random variables (e.g., Fourier transforms).
- Data Rules: Never contaminate the original data; keep training and test sets separate; never choose between alternative models/settings using data that was used to train or test the model (keep a separate validation set).
- Train/Validation/Test Sets: Training set to train the model (50-60% of the data), validation set to tune the model (15-25%), and test set to evaluate it (15-30%); see the split sketch after this list.
- Parameters vs Hyperparameters: Parameters are internal variables learned from data (e.g., coefficients w0), while hyperparameters are set by the user and control the learning process (e.g., the number of neighbors k in KNN or the regularization strength alpha in Lasso).
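A quick sketch of carving out the three sets with two calls to train_test_split; the 60/20/20 proportions and toy labels are one choice within the ranges above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)       # 100 toy instances
y = (X.ravel() > 50).astype(int)        # toy labels

# Carve off the test set first, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20 -> 60% / 20% / 20%
```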
Parameters:
- Parameters in Bayes’ Classifier: Represent the prior probabilities P(Y) and likelihoods P(Xi∣Y), used to calculate the posterior probability P(Y∣X).
- Parameter Estimation: Done by counting the frequency of classes and feature values in the training data.
- Handling Unseen Classes/Features: Use Laplace smoothing to assign a small non-zero probability to unseen events, ensuring the model can still make predictions even if combinations are missing from the training data. It adds a constant (usually 1) to the count of each feature-class combination.
SVM (Support Vector Machine):
- Support Vectors: The data points closest to the decision boundary in SVM that define the optimal hyperplane; they directly influence its position and orientation.
- Margin: The distance between the decision boundary and the closest data points. The goal is to maximize the margin to improve generalization.
- Problems SVM is Good At: Small datasets, high-dimensional data, non-linear classification, and outlier detection.
- Kernel Trick: A method to handle non-linear data by mapping it to a higher-dimensional space using kernel functions (e.g., linear, polynomial, RBF). This allows SVMs to find linear decision boundaries in the transformed space.
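A short SVC sketch showing the kernel trick on scikit-learn's make_circles data, which is not linearly separable in the original 2-D space; C and gamma are left at common defaults.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line separates the two classes in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps points to a higher-dimensional space where a
# maximum-margin linear boundary can separate them.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(clf.support_vectors_.shape)   # the support vectors that define the boundary
print(clf.score(X, y))              # training accuracy (close to 1.0 on this toy problem)
```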
Decision Trees:
- Learning Types: Deductive (general to specific), Inductive (specific to general), and Abductive (incomplete observations to likely explanations). Machine learning primarily uses inductive learning.
- Decision Splits: Made using criteria like Gini impurity (classification) or mean squared error (regression). The goal is to maximize information gain or minimize variance (see the sketch after this list).
- When to Use Decision Trees: Small to medium datasets, when interpretability is important, non-linear relationships, mixed data types, and quick prototyping. However, they can overfit and are sensitive to small changes in the data.
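A small decision-tree sketch on the iris dataset; max_depth=3 is an arbitrary cap to keep the tree readable, and export_text prints the learned splits as rules (the interpretability point above).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Splits are chosen to minimize Gini impurity (the default criterion for classification).
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)

# Interpretability: the learned splits can be printed as human-readable if/else rules.
print(export_text(tree, feature_names=["sepal len", "sepal wid", "petal len", "petal wid"]))
```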
Random Forest:
- Overfitting is when a model learns noise in the training data.
- Detect overfitting by comparing training and validation/test performance.
- Minimize overfitting through regularization, pruning (removing branches that have little importance), cross-validation, and ensemble methods.
- Ensemble methods combine multiple models to improve accuracy; they can reduce variance and improve generalization.
- Weak learners are simple models, while strong learners are more accurate and generalize well to unseen data.
- Boosting (weak learners trained sequentially), Bagging (models trained on different subsets sampled with replacement), and Pasting (subsets sampled without replacement) are ensemble techniques.
- Random Forests can be used for feature selection by measuring feature importance. A higher importance score = a more influential feature (see the sketch below).
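A closing sketch of a random forest used for feature selection; the synthetic data has 10 features of which only 3 are informative, so those should receive the highest importance scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only 3 informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

# Bagging of many decision trees: each tree is trained on a bootstrap sample
# (sampled with replacement) and considers a random subset of features at each split.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Feature selection: a higher importance score means a more influential feature.
for i, score in enumerate(rf.feature_importances_):
    print(f"feature {i}: {score:.3f}")
```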