Introduction to Linear Regression
- We do see some other correlations, for example between verbal and status. However, since we are trying to solve a specific problem, we can focus on gamble as the response variable and income as the independent variable (predictor).
- Before we build the regression model and look at predicted values, errors, intercept, or slope, we should check whether the relationship between income (the predictor) and gamble (the response variable) meets the assumptions and conditions that are required for linear regression.
- To create a linear regression model, we need to make sure that the relationship is linear; that the errors are independent, meaning the residuals do not influence each other and do not follow a pattern; that there is homoscedasticity between income and gamble, so the spread of the residuals stays roughly constant and the data does not look like a funnel; and that the errors are normally distributed, with observations mostly close to the predicted value and evenly spread around it.
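The checks above can be sketched with a small least-squares fit. This is a minimal illustration, not the article's code: the `income` and `gamble` values below are made-up numbers, and the helper names `fit_line`, `residuals`, and `durbin_watson` are hypothetical. The Durbin-Watson statistic is one common way to probe independence of errors (values near 2 suggest little autocorrelation); it is swapped in here as an example, and the article may have used a different check.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares with one predictor. Returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(x, y, intercept, slope):
    """Observed value minus predicted value for every observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def durbin_watson(res):
    """Durbin-Watson statistic; always lies between 0 and 4."""
    num = sum((b - a) ** 2 for a, b in zip(res, res[1:]))
    den = sum(e ** 2 for e in res)
    return num / den

# Made-up illustration data, NOT the data set discussed in the article.
income = [1, 2, 3, 4, 5, 6]
gamble = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

intercept, slope = fit_line(income, gamble)
res = residuals(income, gamble, intercept, slope)
dw = durbin_watson(res)
```

Plotting `res` against `income` would be the usual next step: a funnel shape hints at heteroscedasticity, and a curve hints at non-linearity.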
A Brief Tour of Scikit-learn (Sklearn)
- Scikit-learn is a Python library that provides methods for reading data, data preparation, regression, classification, unsupervised clustering, and much more.
- We can see that random forest performance is much better than linear regression.
- We can further improve performance by optimizing parameters in random forests.
- Feel free to train and test on the full data set for a more suitable comparison of performance between models.
- We see that support vector regression performs better than linear regression but worse than random forests.
- Similar to the random forest example, the support vector machine parameters can be optimized such that error is minimized.
- We see that the k-nearest neighbors algorithm outperforms linear regression when trained on the full data set.
- In another post, I will outline some of the most common classification methods in the Python machine learning library.
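The comparison described above can be sketched as follows. This is an assumption-laden illustration: the data is synthetic (generated with NumPy, not the article's data set), the hyperparameters are library defaults or arbitrary picks, and the resulting ranking of models says nothing about the rankings reported in the article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic non-linear regression problem (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The four regressors mentioned in the list above.
models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "support vector regression": SVR(),
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=5),
}

# Fit each model on the training split and record its test-set error.
test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))
```

Tuning parameters such as `n_estimators` for the random forest or `C` and `epsilon` for `SVR` is where the "optimizing parameters" step would come in, for example with `GridSearchCV`.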
How our AI got Top 10 in the Fantasy Premier League using Data Science
- This will help us identify the teams that have too many expensive and under-performing players who rarely play the full 90 minutes each game due to the frequent squad rotation their coach employs, which makes them a bad investment in the long run, since they will not be generating fantasy points consistently each game.
- That will inform our algorithm to pick more players from those teams since their players are expected to generate a higher aggregate ROI in the long run because they would be involved in a lot more game action on average compared to players from teams that use frequent player rotation.
- Since the AVG Joe spent most of his/her budget on 11 very expensive players, he/she had to spend the remaining budget on the cheapest available players to fill all the substitute positions. None of these players can generate fantasy points, because they never actually played in the real EPL games and were only used as team-fillers.
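A rough sketch of the ROI reasoning above, under stated assumptions: the records are invented placeholders (no real players, teams, or prices), ROI is taken to mean fantasy points per unit of price, and `team_summary` is a hypothetical helper. The article's actual algorithm may define these quantities differently.

```python
from collections import defaultdict

# Hypothetical records: (player, team, price, total_points, avg_minutes_per_game)
players = [
    ("Player A", "Team X", 12.0, 60, 55),
    ("Player B", "Team X", 11.0, 44, 50),
    ("Player C", "Team Y", 6.0, 54, 88),
    ("Player D", "Team Y", 5.0, 40, 90),
]

def team_summary(records):
    """Aggregate ROI (points per unit of price) and average minutes per team."""
    totals = defaultdict(lambda: {"price": 0.0, "points": 0, "minutes": 0.0, "n": 0})
    for _, team, price, points, minutes in records:
        t = totals[team]
        t["price"] += price
        t["points"] += points
        t["minutes"] += minutes
        t["n"] += 1
    return {
        team: {
            "roi": t["points"] / t["price"],
            "avg_minutes": t["minutes"] / t["n"],
        }
        for team, t in totals.items()
    }

summary = team_summary(players)
# Teams with the highest aggregate ROI come first.
ranked = sorted(summary, key=lambda team: summary[team]["roi"], reverse=True)
```

In this toy data, the team whose cheaper players stay on the pitch for close to 90 minutes ranks above the team with expensive, heavily rotated players.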
The Threat of the AI-powered Tyranny of Swing Voters
- Many great thinkers and statesmen expressed concerns about the tyranny of the majority problem, with Edmund Burke writing that, “the majority of the citizens is capable of exercising the most cruel oppressions upon the minority.” Founding Fathers of the United States such as James Madison and Thomas Jefferson, meanwhile, aimed to defend the rights of the minority by implementing relevant changes to the U.S. Constitution, which essentially endowed the minority with more power so as to prevent the appearance of said tyranny.
- However, while the tyranny of the majority is undoubtedly an important challenge, in modern democracies, effective checks and balances prevent the monopolization of power, and the minority is often capable of advancing its interests through certain procedures and institutions.
The Exploration Exploitation Trade-off
- In other words, the Agent learns a policy for taking actions in a random Environment that is better than pure chance.
- In training an Agent to learn in a random Environment, the challenges of exploration and exploitation immediately arise.
- However, to find the actions that lead to rewards, the Agent will have to sample from a set of actions and try out different actions not previously selected.
- Exploration is when an Agent has to sample actions from a set of actions in order to obtain better rewards.
- In a multi-armed bandit problem (MAB) (or n-armed bandits), an Agent makes a choice from a set of actions.
- The goal of the Agent in a MAB problem is to maximize the rewards received from the Environment over a specified period.
- As we can see, the Agent has to balance exploring and exploiting actions to maximize the overall long-term reward.
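One simple way to balance the two is an epsilon-greedy strategy, sketched below for a multi-armed bandit. This is an illustrative choice, not necessarily the method the article uses; the Gaussian reward model, the `true_means` values, and the function name are assumptions.

```python
import random

def epsilon_greedy_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy Agent on a bandit with Gaussian rewards.

    Returns (estimated value per arm, pull count per arm, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward

estimates, counts, total_reward = epsilon_greedy_bandit([0.2, 0.5, 1.5])
```

Setting `epsilon=0` gives a purely exploiting Agent that can lock onto a poor arm, while `epsilon=1` gives a purely exploring Agent that never uses what it has learned.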
Geolocations and geocodes instrument set for data analysis
- You need to extract key data, get the necessary details, visualize data points on the map, and prepare them for analysis or a learning algorithm.
- I want to share the set of tools I use for such tasks, using my native city as an example.
- It has very convenient endpoints for searching different venues (tourist places, cafes, bus stations, etc.) by specific coordinates or within some neighborhood, as well as endpoints for retrieving venue details, ratings, and a lot of other information.
- Our dataset contains a lot of venues with their categories, locations and regions attached.
- There are a lot of details in the API response, but we will use only a few of them: extended categories, number of likes from users, rating, and whether the venue is currently open.
- By this time, our dataset, with venue details attached, is almost complete.
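Once venues and their coordinates are in a dataset, a "within some neighborhood" filter can be reproduced locally. The sketch below uses the haversine great-circle distance; the venue records, field names, and coordinates are hypothetical placeholders and do not reflect the response format of the API the article uses.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def venues_within(venues, center, radius_km):
    """Keep only the venues no farther than radius_km from center."""
    lat0, lon0 = center
    return [
        v for v in venues
        if haversine_km(lat0, lon0, v["lat"], v["lon"]) <= radius_km
    ]

# Hypothetical venue records (names, coordinates, and ratings are made up).
venues = [
    {"name": "Cafe (hypothetical)", "lat": 50.000, "lon": 30.000, "rating": 8.1},
    {"name": "Bus station (hypothetical)", "lat": 50.005, "lon": 30.005, "rating": 6.4},
    {"name": "Museum (hypothetical)", "lat": 50.200, "lon": 30.300, "rating": 9.0},
]

nearby = venues_within(venues, center=(50.0, 30.0), radius_km=1.0)
```

The same records can then be passed to a mapping library for visualization or turned into feature columns for a learning algorithm.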
Understanding Power Analysis in AB Testing
- AB testing means taking two randomized samples from a population, a Control sample and a Variant sample, and determining whether the difference between those two samples is significant.
- The graph below has the same distribution means, but with a smaller standard error, which corresponds to a larger sample size.
- If you do reject the null hypothesis, you are now more confident about not making a Type I error because your sample sizes are bigger.
- Now that we have some intuition on sample size and test distributions, we can have a stronger intuition when talking about statistical factors to consider when designing an experiment.
- When you are designing a test, you want to prepare your experiment in a way that you can confidently make statements about the difference (or absence of a difference) in the Variant sample, even if that difference is small.
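The link between sample size, standard error, and detectable difference can be made concrete with the standard normal-approximation formula for comparing two means, n per group = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2. The sketch below assumes equal group sizes, a known common standard deviation, and a two-sided test; the function names are illustrative and the article may use a different calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist

def standard_error(sigma, n):
    """Standard error of a sample mean: shrinks as the sample size grows."""
    return sigma / sqrt(n)

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate per-group sample size to detect a mean difference delta.

    Uses the normal approximation for a two-sided, two-sample test with
    equal group sizes and common standard deviation sigma.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls Type I error
    z_beta = NormalDist().inv_cdf(power)           # controls Type II error
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

n_medium = sample_size_per_group(delta=0.5, sigma=1.0)
n_small = sample_size_per_group(delta=0.2, sigma=1.0)
```

Detecting the smaller difference requires a much larger sample per group, which is why the minimum effect you care about should be fixed before the experiment starts.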
Data Related Project Floundering?
- One common roadblock for the success of data-related projects is role confusion.
- When planning a data-related project (or saving a floundering one), consider whether one or more specific people have taken ownership of each of the following roles (note that some individuals can “own” more than one role).
- This article identified role confusion as a barrier that can stall or prevent group productivity when working on data-related projects.
- Role confusion is when any one or more persons on a team are not sure what part they play in the project.
- When planning (or saving) a data-related project, consider how to prevent (or solve for) role confusion by ensuring there is at least one specific person in charge of 1) requesting the data, 2) subject matter expertise, 3) understanding the audience, 4) providing or finding the data, and 5) method selection, data analysis, and analytical execution.
Understand Neural Networks & Model Generalization
- Training a deep neural network that can generalize well to new data is a challenging problem.
- When it comes to neural networks, regularization is a technique that makes slight modifications to the learning algorithm such that the model generalizes better.
- Proper regularization is a critical reason for better generalization performance because deep neural networks are often over-parametrized and likely to suffer from overfitting problems.
- In other words, this approach attempts to stop an estimator’s training phase early, at the point where it has learned to extract all meaningful relationships from the data, before beginning to model its noise.
- Batch normalization, besides having a regularization effect, helps your model in other ways (for example, it allows the use of higher learning rates).
- Indeed, most of the time, we cannot be sure that for each learning problem, there exists a learnable Neural Network model that can produce a generalization error as low as desired.
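The early-stopping idea described above can be sketched without any deep learning framework: watch the validation loss and stop once it has failed to improve for a number of epochs. The `patience` and `min_delta` parameters and the loss values below are illustrative assumptions, not values from the article.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops after `patience` consecutive epochs in which the loss
    did not improve on the best value by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # Patience never ran out: training used every epoch.
    return best_epoch, len(val_losses) - 1

# Made-up validation curve: improves, then starts to rise (overfitting).
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
```

In practice the weights saved at `best_epoch` are restored, so the model that is kept is the one from before it began to fit noise.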
My NLP learning journey
- But a computer needs specialized processing techniques to understand raw text data.
- That’s why NLP attempts to use a variety of techniques to create structure out of text data.
- I will briefly introduce nltk and spacy, both state-of-the-art libraries in NLP, and the differences between them.
- Spacy is an open-source Python library that parses and “understands” large volumes of text.
- The first step in processing text is to split up all the parts (words & punctuation) into “tokens”.
- And that’s exactly what Spacy is designed to do: you put in raw text and get back a Doc object that comes with a variety of annotations.
- Given enough data, usage, and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances.
- LDA was introduced back in 2003 to tackle the problem of modeling text corpora and collections of discrete data.
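The tokenization step mentioned above can be illustrated with a few lines of plain Python. This is a deliberately naive regular-expression stand-in, not the tokenizer that Spacy or nltk actually use; the function name `tokenize` is hypothetical, and real tokenizers handle contractions, URLs, and many other cases that this sketch does not.

```python
import re
from collections import Counter

# A word is a run of letters/digits/underscores; any other non-space
# character (punctuation) becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Hello, world! NLP turns raw text into structure.")

# A first bit of structure: lowercase word frequencies, punctuation dropped.
word_counts = Counter(t.lower() for t in tokens if t[0].isalnum())
```

Counts like `word_counts` are the kind of input that bag-of-words methods such as LDA start from, whereas Word2vec works with each word's surrounding context instead.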