Timing csv File Reading in R and Python

Wed, 20 Nov 2013 20:15:46 GMT

I am currently working with trade data organized in large csv files. I took this as an opportunity to learn the Pandas package in Python, mostly for its HDF5 integration. As a sanity check, I decided to time three methods: read.csv in base R, fread in the data.table package in R, and read_csv in the Pandas package in Python.

Let me first say that I am not a programmer. This test is not a definitive benchmark of the three methods. It is a benchmark of the "out-of-the-box" functionality available to the non-programmer trying to use code to get shit done. I used time.time() in Python and proc.time() in R to time each method reading 10,000,000 rows of mixed data types.

Without further ado, here are the results: read.csv in R took 131.45 seconds, read_csv in Pandas took 27.741 seconds, and the winner was fread in the data.table package, clocking in at 14.018 seconds. Unfortunately, fread needs more TLC in determining data types and missing values, whereas Pandas just works right the first time. I also prefer the time manipulation in Pandas to that of the xts package in R, but not enough to deal with two languages in one project.
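For anyone who wants to try the Python side of this kind of timing, here is a minimal sketch, not the script behind the numbers above. It wraps a reader call in time.time() and runs it on a small generated file. The column names, the helper names (write_sample_csv, time_reader, run_demo), and the 1,000-row default are all made up for illustration. Pandas is treated as optional: if it is not installed, only the standard-library csv reader is timed.

```python
import csv
import os
import tempfile
import time


def write_sample_csv(path, n_rows=1000):
    """Write a small mixed-type CSV; the columns are hypothetical stand-ins for trade data."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["trade_id", "symbol", "price", "timestamp"])
        for i in range(n_rows):
            writer.writerow([i, "ABC", 100.0 + i * 0.01, "2013-11-20 20:15:46"])


def time_reader(reader, path):
    """Call reader(path) once and return (elapsed_seconds, result)."""
    start = time.time()  # wall-clock time, the same clock the post mentions
    result = reader(path)
    return time.time() - start, result


def read_with_csv_module(path):
    """Baseline reader: load every row with the standard-library csv module."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def run_demo(n_rows=1000):
    """Time each available reader on a temporary file; return {reader_name: seconds}."""
    readers = {"csv.reader": read_with_csv_module}
    try:
        import pandas as pd  # optional dependency, assumed but not required
        readers["pandas.read_csv"] = pd.read_csv
    except ImportError:
        pass

    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        write_sample_csv(path, n_rows)
        timings = {}
        for name, reader in readers.items():
            elapsed, _ = time_reader(reader, path)
            timings[name] = elapsed
        return timings
    finally:
        os.remove(path)  # clean up the temporary file


if __name__ == "__main__":
    for name, seconds in run_demo().items():
        print(f"{name}: {seconds:.3f} s")
```

A single wall-clock measurement like this is noisy, and a file this small says nothing about 10,000,000 rows. To compare readers on real data, point time_reader at the actual file and repeat the call a few times.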
