
Introduction:
- ML is about deriving models from data to improve decision-making and outcomes.
- Learning can be supervised (with labeled data) or unsupervised (without labeled data).
- From data, we can learn similarity, associations, structures, and parameters, which can be applied to various real-world problems like image classification, topic formation, and energy forecasting.
Data:
- Types of Data: Numeric (discrete [integer counts], continuous [real-valued measurements]) and categorical (ordinal [ordered categories], nominal [unordered categories]).
- Dataframe Structure: Rows (instances), columns (features), and dimensionality (number of attributes).
- Labeling: Labels are the target outputs the model learns to predict; store them as a column in a CSV, encode them in the file/directory structure, or assign them with labeling tools.
- Dealing with Data Issues: Standardize inconsistent data, handle missing data (remove or impute), and manage outliers and unnecessary data.
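A minimal pandas sketch of these cleanup steps, assuming a made-up dataframe (column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe: rows = instances, columns = features
df = pd.DataFrame({
    "site": ["A", "a", "B", "B"],            # inconsistent labels
    "power_kw": [1.2, np.nan, 3.4, 250.0],   # missing value + outlier
})

df["site"] = df["site"].str.upper()          # standardize inconsistent data
df["power_kw"] = df["power_kw"].fillna(df["power_kw"].median())  # impute missing
# Alternative to imputing: drop rows with missing data via df.dropna()
df = df[df["power_kw"] < df["power_kw"].quantile(0.99)]  # discard extreme outliers
```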
Descriptive Statistics:
- Population vs Sample: Population is the complete set of data, while a sample is a representative subset drawn from it.
- Measures of Frequency: Count, proportions, and occurrence percentages help understand how often events occur.
- Measures of Central Tendency: Mean (arithmetic; geometric [exponential processes, compound interest]; harmonic [averaging rates, e.g., pipeline flows or parallel resistance]), median, and mode describe the central point of the data (see the sketch after this list).
- Measures of Dispersion: Range, variance, and standard deviation describe how spread out the data is.
- Scaling: Normalization (rescale to the range 0-1) and standardization (rescale to mean 0, standard deviation 1) are used to rescale data for better performance in ML.
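A small NumPy/SciPy sketch of the measures above on toy numbers (values chosen so the means come out clean):

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 8.0])  # toy sample

print(np.mean(x))        # arithmetic mean ≈ 4.67
print(stats.gmean(x))    # geometric mean = 4.0 (growth/compounding processes)
print(stats.hmean(x))    # harmonic mean ≈ 3.43 (averaging rates/flows)
print(np.median(x))      # median = 4.0
print(np.ptp(x), np.var(x), np.std(x))  # range, variance, standard deviation

x_norm = (x - x.min()) / (x.max() - x.min())  # normalization -> values in [0, 1]
x_std = (x - x.mean()) / x.std()              # standardization -> mean 0, std 1
print(x_norm, x_std)
```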
Clustering (Unsupervised Learning):
- Similarity: Measured using distance metrics like Euclidean distance (ED) to group similar data points into clusters.
- k-Means Clustering: An iterative algorithm that alternates between assigning points to clusters and updating cluster centers: Initialization (choose the number of clusters k and initial centers), Assignment (assign each point to the nearest center by ED), Update (recalculate each center as the average of its assigned points), Repeat until assignments stop changing (see the sketch after this list).
- k in k-Means: The number of clusters; choose it carefully to avoid under-clustering (dissimilar points grouped together) or over-clustering (clusters too specific).
- Other Clustering Techniques: Includes hierarchical clustering, DBSCAN, Gaussian Mixture Models, Spectral Clustering, Mean Shift, Affinity Propagation, and Agglomerative Clustering.
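A minimal k-means sketch using scikit-learn's KMeans on made-up 2-D points (k = 2 is assumed here to match the two obvious groups):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two loose groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment of each point
print(km.cluster_centers_)   # each center is the mean of its assigned points
```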
Probability:
- Random Variable: Maps sample space outcomes to values (discrete or continuous), e.g., weather {Sunny}, temperature {50} or {Hot}.
- Event: A subset of the sample space (the event of rolling an even number is A = {2, 4, 6}).
- Probability: Measures the likelihood of an event. Sum of all probabilities is 1.
- Distributions: PMF (probability mass function) assigns a probability to each value of a discrete variable.
- PMF: Laptop Vendor L: P(L=RR) = 0.37, P(L=Lenny) = 0.28, P(L=Ace) = 0.35.
- Joint: Specifies the probability of each combination of values for multiple random variables.
- P(L,T): Probability of a laptop from vendor L having trouble T.
- Marginal: Obtained by summing out one or more variables from a joint distribution.
- P(L) = ∑_T P(L,T): Probability of a laptop being from vendor L, regardless of the trouble type.
- Conditional distributions: Probability distribution of one variable given the value of another.
- P(T∣L=Ace): Probability of trouble T given that the laptop is from Ace.
- Product Rule: Relates joint and conditional probabilities. P(L,T) = P(T)⋅P(L∣T)
- Bayes’ Rule: Allows us to reverse conditional probabilities. P(y∣x) = P(x∣y)⋅P(y) / P(x) (see the worked sketch after this list).
- Independence: Variables are independent if knowing one doesn’t affect the other. P(X,Y)=P(X)⋅P(Y). Conditional: P(X,Y∣Z)=P(X∣Z)⋅P(Y∣Z)
- Expectation: The average value of a random variable, weighted by probabilities. E(X) = ∑_i x_i⋅P(X=x_i). Ex: For the set of observations {3,1,9,3}, the empirical probabilities are P(X=1) = 1/4, P(X=3) = 2/4, P(X=9) = 1/4, so E(X) = 1⋅(1/4) + 3⋅(2/4) + 9⋅(1/4) = 4.
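A worked sketch of the joint/marginal/conditional/Bayes relationships above. The joint values P(L, T) below are invented for illustration; they are chosen so the L-marginals match the PMF from the notes (0.37, 0.28, 0.35):

```python
# Invented joint distribution P(L, T) over vendor L and trouble T in {HW, SW}
joint = {
    ("RR", "HW"): 0.20, ("RR", "SW"): 0.17,
    ("Lenny", "HW"): 0.10, ("Lenny", "SW"): 0.18,
    ("Ace", "HW"): 0.15, ("Ace", "SW"): 0.20,
}

def p_L(l):                        # marginal: P(L) = sum_T P(L, T)
    return sum(p for (li, _), p in joint.items() if li == l)

def p_T(t):                        # marginal: P(T) = sum_L P(L, T)
    return sum(p for (_, ti), p in joint.items() if ti == t)

def p_T_given_L(t, l):             # conditional: P(T | L) = P(L, T) / P(L)
    return joint[(l, t)] / p_L(l)

print(p_L("Ace"))                  # 0.35, matching the PMF above
print(p_T_given_L("HW", "Ace"))    # ≈ 0.429

# Bayes' rule: P(L=Ace | T=HW) = P(T=HW | L=Ace) * P(L=Ace) / P(T=HW)
print(p_T_given_L("HW", "Ace") * p_L("Ace") / p_T("HW"))  # ≈ 0.333
```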
Regression:
- Linear Regression (line of best fit): Fits a linear model to the data by minimizing the error (RSS) between observed and predicted values.
- Error in Fit: The difference between observed and predicted values, which is minimized during the fitting process. Residual = Y_i − Ŷ_i
- Residual Sum of Squares (RSS) measures the total error between the observed and predicted values. RSS = ∑(Y_i − Ŷ_i)^2
- Simple vs. Multiple vs. Polynomial:
- Simple: One independent variable, linear relationship. Ŷ = w_0 + w_1X
- Multiple: Multiple independent variables, linear relationship. Ŷ = w_0 + w_1X_1 + w_2X_2 + ⋯ + w_nX_n
- Polynomial: Non-linear relationships, using polynomial terms of the independent variables. Ŷ = w_0 + w_1X + w_2X^2
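A minimal scikit-learn sketch contrasting a simple and a polynomial fit (the data points are made up to follow a roughly quadratic trend):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.1, 5.2, 10.1, 17.3])    # roughly 1 + x^2

simple = LinearRegression().fit(X, y)        # Ŷ = w_0 + w_1X
print(simple.intercept_, simple.coef_)

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)     # Ŷ = w_0 + w_1X + w_2X^2
print(poly.intercept_, poly.coef_)           # w_2 should come out near 1
```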
Error, Underfit, Overfit:
- Underfitting occurs when a model is too simple to capture the underlying patterns in the data, while overfitting occurs when a model learns the training data too well, including noise, and performs poorly on new, unseen data.
- Goodness of Fit Metrics: Mean Squared Error (MSE), R-squared (R^2), Standard Error (S)
- MSE: Measures the average squared difference between observed and predicted values. Lower MSE indicates a better fit. MSE = (1/n)⋅RSS = (1/n)∑(Y_i − Ŷ_i)^2
- R^2: Represents the proportion of variance in the dependent variable that is explained by the model. Ranges from 0 to 1, where 1 indicates a perfect fit.
- Standard Error: Measures the accuracy of predictions. A lower standard error indicates a better fit.
- Bias vs Variance: Bias leads to underfitting, while variance leads to overfitting.
- As model complexity increases, bias decreases but variance increases.
- Generalization: Refers to how well a model performs on new, unseen data (not used during training). Low test error indicates good generalization.
- Regularization: Adds a penalty during model tuning to keep the weights θ small; keeping θ small reduces overfitting.
- Ridge Regression: penalizes large weights by adding to the cost function (MSE) a fraction of the square of each weight. Scikit-Learn: Ridge(alpha, solver)
- Lasso Regression: drives the least important weights to zero (resulting in a simpler model). Scikit-Learn: Lasso(alpha) (see the sketch after this list)
- Covariance and Correlation: Covariance measures how two variables change together (+ = same direction, − = opposite), while correlation measures the strength and direction of their linear relationship on a scale from −1 to 1. Correlation 1 = perfect positive, −1 = perfect negative, 0 = no linear relationship.
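A minimal sketch of Ridge(alpha) and Lasso(alpha) plus the fit metrics above, on synthetic data where only 2 of 5 features actually matter (all numbers are made up):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                          # 5 features, 2 informative
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all weights toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # drives unimportant weights to exactly zero
print(ridge.coef_)
print(lasso.coef_)                   # near-zero entries for the 3 useless features

pred = ridge.predict(X)
print(mean_squared_error(y, pred))   # MSE: lower = better fit
print(r2_score(y, pred))             # R^2: closer to 1 = better fit
```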