Security and Cryptography Mistakes You Are Probably Doing All The Time
- It’s common for people to misuse cryptographic primitives, assume that something is secure by default, or use outdated technologies and algorithms.
- What matters is entropy, so if we don’t want to force users to include special characters in their passwords, what alternative keeps the entropy high enough?
- Creating these kinds of passwords satisfies both the human and the computer side of the problem; in other words, it’s easy to remember and reasonably hard to guess (high entropy, impractical to brute-force).
- Note: In an ideal world, everybody would use a password manager and generate their random super high entropy passwords, but that’s not something we can expect of the average, non-tech savvy user.
- A lot of people think that Docker containers are secure by default, but that’s not the case.
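The passphrase advice above comes down to simple arithmetic: a few random common words carry more entropy than a short “complex” password. A quick sketch, assuming a Diceware-style list of 7,776 words and the 94 printable ASCII characters (both assumed figures, not from the article):

```python
import math

# Entropy in bits = log2(number of equally likely possibilities).
WORDLIST_SIZE = 7776  # Diceware-style wordlist
CHARSET_SIZE = 94     # printable ASCII characters

passphrase_bits = 5 * math.log2(WORDLIST_SIZE)  # five random words
password_bits = 8 * math.log2(CHARSET_SIZE)     # eight random characters

print(f"5-word passphrase: {passphrase_bits:.1f} bits")  # ~64.6 bits
print(f"8-char password:   {password_bits:.1f} bits")    # ~52.4 bits
```

The comparison only holds if the words are chosen uniformly at random; a phrase the user makes up themselves has far less entropy than the formula suggests.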
The top 10 ML algorithms for data science in 5 minutes
- Linear regression finds the line that best fits scattered data points on a graph.
- It attempts to represent the relationship between independent variables (the x values) and a numeric outcome (the y values) by fitting the equation of a line to that data.
- In this algorithm, the training model learns to predict values of the target variable by learning decision rules with a tree representation.
- Based on this, SVM finds an optimal boundary, called a hyperplane, which best separates the possible outputs by their class label.
- K-Means is for unsupervised learning, so we only use training data, X, and the number of clusters, K, that we want to identify.
- The algorithm iteratively assigns each data point to one of the K groups based on their features.
- A new data point is added to the cluster with the closest centroid based on similarity.
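The iterative assign-then-update loop that K-Means performs can be sketched in a few lines of plain Python (a toy illustration with made-up 2-D blobs; in practice you’d reach for scikit-learn’s `KMeans`):

```python
import math
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Toy k-means: repeatedly assign points to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize centroids from the data
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        labels = []
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[j].append(p)
            labels.append(j)
        # Update step: move each centroid to the mean of its cluster.
        for j, cluster in enumerate(clusters):
            if cluster:
                centroids[j] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return labels, centroids

# Two well-separated 2-D blobs (synthetic data):
rng = random.Random(7)
blob_a = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(10)]
blob_b = [(rng.uniform(5, 6), rng.uniform(5, 6)) for _ in range(10)]
labels, centroids = kmeans(blob_a + blob_b, k=2)
```

With K=2 and clearly separated blobs, the loop converges within a few iterations and the two blobs end up in different clusters.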
Introducing Bamboolib — a GUI for Pandas
- Our goal is to help people quickly learn and work with pandas, and we want to onboard the next generation of Python data scientists.
- As the developer of the library stated, Bamboolib is designed to help you learn Pandas, so I don’t see a problem with going with the free option — you’re unlikely to be working on a top-secret project when just starting out.
- Many times when preparing data for machine learning you’ll want to create dummy variables, that is, create a new column per unique value of a given attribute.
- It’s a good idea to do so because many machine learning algorithms can’t work with text data.
- Keep in mind — you won’t get any additional features with the paid version — the only real benefit is that your work will be private and that there’s an option for commercial use.
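The dummy-variable step described above is one line with pandas’ built-in `get_dummies` (a minimal sketch with made-up data; bamboolib’s GUI generates equivalent pandas code for you):

```python
import pandas as pd

# Hypothetical data: one categorical column, one numeric column.
df = pd.DataFrame({"city": ["Oslo", "Paris", "Oslo"],
                   "price": [10, 20, 15]})

# One new 0/1 column per unique value of "city": city_Oslo, city_Paris.
dummies = pd.get_dummies(df, columns=["city"])
print(dummies)
```

Each original category value becomes its own indicator column, which is exactly the numeric representation that text-averse algorithms need.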
A Card Game for Teaching Machine Learning
- I have developed a card game for my students to help them understand the important principles of machine learning.
- The students discover the most important ideas in a playful way and develop a feeling for the functioning of machine learning procedures.
- The task is to divide the cards into 4 groups that the students think belong together.
- The task of the students is, therefore, to assign which of the 4 groups from task 1 belongs to which class of objects.
- The additional cards stand for a labeled dataset, and the task clarifies the principle of supervised learning.
- The player with the card with the highest number of vertical lines wins the round and receives all 4 cards.
- (This can be judged by 2 of the players.)
- If there is more than one card of the trump category in the round, the trump card with the most lines wins.
Gaming on Reddit, Revisited
- One is that I haven’t given enough new information — maybe 2400 posts just isn’t enough to effectively train a model on this problem — but this feels like a cop-out since you can almost always say that you “need more data”, and I’ve seen fairly consistent performance between both iterations.
- Another is that I didn’t do enough feature engineering — maybe I need to dig deeper and add data about how many comments there are, or another aspect of these posts hidden in the original JSON files that were pulled.
- As always, thanks for taking this ride on the data science train, and let me know your thoughts on the project — is the SVC really the best model?
Be Yourself: The Data Scientists You See In Public Are Not Representative
- You don’t have to be ANYTHING like those people.
- While these people are definitely awesome and can sometimes be an inspiration, they’re also standing atop unknowable intersections of luck, opportunity, and survivorship bias.
- As data scientists, we’re supposed to be experts at understanding things like sampling, and hidden bias lurking in our data sets.
- The vast majority of people who work in data don’t live and breathe the stuff constantly.
- But despite not doing ANY of these things, these people will provide useful insights to their organizations, engineer robust systems, run experiments, and do important work.
- While we often hear about imposter syndrome, how the very people we admire feel skeptical that they’re worthy of the admiration they get, we hear less about just where the bar is before it’s acceptable to consider yourself part of this great community.
This is how you put the data in Data Science!
- Google’s vertical search engines like Google Images and Google Scholar wouldn’t last long if no one used them, so their varieties tell you a little something about what people tend to look for on the internet.
- (You know those invisibility potions don’t work, right?) You know that quality varies and it’s up to you to think critically about the source before you believe everything you read.
- Data providers use schema.org to tell us there’s a dataset on their page and describe some metadata about it.
- Sharing data (without an intermediary telling you to get lost) means that people can find and provide great resources even if they’ve got niche tastes… or obscure high school websites.
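The schema.org markup mentioned above is JSON-LD embedded in the page (inside a `<script type="application/ld+json">` tag). A minimal illustrative sketch, with a hypothetical dataset name and description — the `@context`, `@type`, `name`, `description`, and `license` fields follow schema.org’s `Dataset` type:

```python
import json

# Illustrative schema.org Dataset metadata a data provider might publish:
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example Weather Observations",  # hypothetical dataset
    "description": "Daily temperature readings from a hypothetical station.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}
print(json.dumps(metadata, indent=2))
```

A crawler that understands schema.org can then discover the dataset from the page itself, with no intermediary involved.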
Pre-processing a Wikipedia dump for NLP model training — a write-up
- Wikipedia dumps are used frequently in modern NLP research for model training, especially with transformers like BERT, RoBERTa, XLNet, XLM, etc.
- As such, for any aspiring NLP researcher intent on getting to grips with models like these themselves, this write-up presents a complete picture (and code) of everything involved in downloading, extracting, cleaning and pre-processing a Wikipedia dump.
- Wikipedia dumps are freely available in multiple formats in many languages.
- For the English language Wikipedia, a full list of all available formats of the latest dump can be found here.
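Because the full English dump is tens of gigabytes compressed, the extraction step is typically done by streaming the bz2 file rather than loading it. A minimal sketch of that idea (the helper name and the tiny in-memory sample are mine, standing in for a real `pages-articles.xml.bz2` file opened with `bz2.open(path)`):

```python
import bz2
import io
import xml.etree.ElementTree as ET

def iter_pages(fileobj):
    """Stream (title, wikitext) pairs from a MediaWiki XML export
    without loading the whole dump into memory."""
    title = None
    for _event, elem in ET.iterparse(fileobj):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if tag == "title":
            title = elem.text
        elif tag == "text":
            yield title, elem.text or ""
            elem.clear()  # free parsed elements as we go

# Tiny in-memory stand-in for a real dump file:
sample = b"""<mediawiki><page><title>Pandas</title>
<revision><text>Pandas is a library...</text></revision></page></mediawiki>"""
with bz2.BZ2File(io.BytesIO(bz2.compress(sample))) as f:
    pages = list(iter_pages(f))
```

The extracted wikitext still contains markup (templates, links, references), which is what the cleaning and pre-processing stages of the write-up deal with.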
Improving Your Algo Trading By Using Monte Carlo Simulation and Probability Cones
- In this article, I will introduce how I use Monte Carlo simulation and Probability Cones to trade more effectively.
- Monte Carlo analysis utilizes computer simulation to solve problems, calculate probabilities and more — without having to solve theoretical equations.
- In its typical form, Monte Carlo analysis takes all the trades generated by a hypothetical backtest and randomly selects trade after trade to generate an equity curve.
- So, by using the Monte Carlo analysis, and combining the results, we can arrive at probabilities of certain events happening.
- There are quite a few dangers in using Monte Carlo for algo trading strategy analysis.
- The historical results should be consistent for a valid Monte Carlo analysis; only one strategy, or one approach, for the whole backtest.
- In such cases, the backtest masterpiece will yield a great looking Monte Carlo result — little chance of ruin, small drawdown probabilities and the like.
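The trade-resampling procedure described above can be sketched in a few lines: draw trades at random (with replacement) from a backtest’s P&L list, accumulate them into an equity curve, and tally statistics such as maximum drawdown across many simulated curves. The trade numbers below are made up for illustration:

```python
import random

def monte_carlo_max_drawdowns(trade_pnls, n_sims=1000, seed=42):
    """Resample backtest trade P&Ls with replacement to build many
    alternative equity curves; return each curve's maximum drawdown."""
    rng = random.Random(seed)
    max_drawdowns = []
    for _ in range(n_sims):
        equity = peak = max_dd = 0.0
        for _ in range(len(trade_pnls)):
            equity += rng.choice(trade_pnls)  # random trade selection
            peak = max(peak, equity)
            max_dd = max(max_dd, peak - equity)  # drop from the running peak
        max_drawdowns.append(max_dd)
    return max_drawdowns

# Hypothetical per-trade P&Ls from a backtest (in dollars):
trades = [120, -80, 200, -150, 90, 60, -40, 300, -100, 50]
dds = monte_carlo_max_drawdowns(trades)
prob_dd_over_500 = sum(d > 500 for d in dds) / len(dds)
```

From the distribution of simulated drawdowns you can read off event probabilities — exactly the kind of “chance of a $500 drawdown” figure the article combines into its probability cones. The caveat from the article applies: if the input backtest is over-fit, these probabilities will look deceptively rosy.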
Investigating Differentiable Neural Architecture Search for Scientific Datasets
- We will compare DARTS to random search (which is actually quite good, see the table below) and state-of-the-art, hand-designed architectures such as ResNet. Most NAS studies, including the original DARTS paper, report experimental results using standard image datasets such as CIFAR and ImageNet. However, we believe that deep learning shows promise for scientific studies including biology, medicine, chemistry, and various physical sciences.
- If we examine the architecture weights and random search performance (below), we see that DARTS learned a much sparser cell than on the Graphene task.
- Here we found that continuous DARTS modestly outperforms ResNet. Examining architecture weights and random search performance (shown below), we see a similar story to Galaxy Zoo. From the random search performance plot, there appear to be some architectures that perform much better than others (again note the log scale).