The New York City Taxi & Limousine Commission has released a staggeringly detailed historical dataset covering over 1.1 billion individual taxi trips in the city from January 2009 through June 2015.
Taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop off coordinates: it’s a story of New York.
Full instructions to download and analyze the data for yourself are available on GitHub. These maps show every taxi pickup and drop off, respectively, in New York City from 2009–2015.
The official TLC trip record dataset contains data for over 1.1 billion taxi trips from January 2009 through June 2015, covering both yellow and green taxis.
The Uber data is not as detailed as the taxi data: in particular, Uber provides time and location for pickups only, not drop offs. Still, I wanted to provide a unified dataset including all available taxi and Uber data.
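One way to picture a unified dataset is a single trip record whose drop off fields are optional, so that pickup-only Uber rows can sit alongside full taxi rows. The sketch below is only an illustration: the `Trip` class, its field names, and the sample values are all hypothetical and are not taken from the actual schema or data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass
class Trip:
    """Hypothetical unified trip record; field names are illustrative only."""
    source: str                                   # e.g. "yellow", "green", or "uber"
    pickup_time: datetime
    pickup_location: Tuple[float, float]          # (longitude, latitude)
    dropoff_time: Optional[datetime] = None       # missing for pickup-only sources
    dropoff_location: Optional[Tuple[float, float]] = None


# Made-up sample rows, not real trips.
taxi_trip = Trip(
    source="yellow",
    pickup_time=datetime(2015, 6, 1, 8, 0),
    pickup_location=(-73.99, 40.75),
    dropoff_time=datetime(2015, 6, 1, 8, 20),
    dropoff_location=(-73.97, 40.76),
)
uber_trip = Trip(
    source="uber",
    pickup_time=datetime(2014, 6, 1, 8, 5),
    pickup_location=(-73.98, 40.74),
)

trips = [taxi_trip, uber_trip]

# Pickup analyses can use every row; drop off analyses must filter first.
with_dropoff = [t for t in trips if t.dropoff_location is not None]
```

With this shape, any analysis based on pickups can use all rows, while anything involving drop offs has to exclude the pickup-only records explicitly.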
This report compares the performance of three machine learning techniques for spam detection: Random Forest (RF), k-Nearest Neighbours (kNN), and Support Vector Machines (SVM).
The idea of automatically classifying spam and non-spam emails with machine learning methods has been popular in academia and a topic of interest for many researchers.
This comparison is a real-time process, and therefore the main drawback of this approach is that the kNN algorithm must compute the distance and sort all the training data for each prediction, which can be slow if given a large training dataset (James, Witten, Hastie, & Tibshirani, 2013, pp.
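The per-prediction cost of kNN described above can be seen in a minimal sketch. This is a plain brute-force version for illustration, not the implementation used in this report: the function name `knn_predict`, the choice of Euclidean distance, the value of `k`, and the toy feature vectors are all assumptions.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # The distance to every training point is computed for each prediction.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # All training points are then sorted by distance, again for each prediction.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy, made-up 2-D feature vectors standing in for email features.
train_points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),   # non-spam cluster
    (1.0, 1.0), (0.9, 1.1), (1.1, 0.9),   # spam cluster
]
train_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

prediction = knn_predict(train_points, train_labels, (0.95, 1.0), k=3)
```

Because there is no separate training step, all of the work happens at prediction time, which is why the running time grows with the size of the training set.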
We determine from the results that k-Nearest Neighbours (kNN) and Support Vector Machine (SVM) perform similarly weakly in terms of accuracy, and that Random Forest (RF) outperforms both.
Therefore, owing to its design, Random Forest performs relatively well "out-of-the-box" compared to k-Nearest Neighbours and Support Vector Machine.
This family tree gleaned from the huge new dataset shows seven generations encompassing 6,000 individuals, with marriages marked in red.
The huge new dataset is the largest scientifically validated family tree based on publicly available information, says Yaniv Erlich, a data scientist and computational biologist at the New York Genome Center.
The team then selected questions around longevity and family dispersal to test the utility of their family tree, Erlich says.
Geni.com and MyHeritage recently established their own DNA test, and Erlich says future work could map genetic information that people provide through that product onto the existing genealogy data.
Also, the family tree Erlich and his team built is publicly available, and he’d like to see other researchers take advantage of the resource to answer any number of genealogical and scientific questions.