The Practical Quant's Blog

GraphChi: Graph analytics over billions of edges using your laptop

December 12, 2012 Comments (0)

[Cross-posted on the O'Reilly Strata blog.] GraphChi is a spinoff project of GraphLab, an open source, distributed, in-memory software system for analytics and machine learning. Designed specifically to run on a single computer with limited memory (DRAM), GraphChi has been used to analyze graphs with billions of edges since its release a few months ago. Running on a single machine means deployment and debugging are simpler. In addition, it is no longer necessary to find (optimal) graph...

Shark: An open source, interactive SQL and Analytics system for Hadoop

November 27, 2012 Comments (0)

[Cross-posted on the O'Reilly Strata blog.] Hadoop's strength is in batch processing; MapReduce isn't particularly suited for interactive/ad hoc queries. Real-time SQL queries (on Hadoop data) are usually performed using custom connectors to MPP databases. In practice this means having connectors between separate Hadoop and database clusters. Over the last few months a number of systems that provide fast SQL access within Hadoop clusters have garnered attention. Connectors between Hadoop and...

Real-time Analytics: Hokusai adds a temporal component to Count-Min Sketch

November 18, 2012 Comments (0)

Introduced in 2003 by Cormode and Muthukrishnan, the Count-Min sketch is a popular and simple algorithm for summarizing data streams. In particular it's often used to calculate simple frequencies, identify frequent elements (sometimes referred to as heavy hitters), and compute quantiles (see [1]). The Count-Min sketch can take advantage of distributed/parallel compute resources. Suppose a data stream has a domain of possible symbols, X = {s1, s2, ..., sD}. The Count-Min sketch is comprised of...
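
For readers who want something concrete, here is a minimal Python sketch of the basic Count-Min idea: a small table of counters, one row per hash function, where the estimate for a symbol is the minimum of its counters. The class name, the `width` and `depth` defaults, and the use of salted BLAKE2b as the hash family are assumptions made for illustration; they are not taken from the paper or from any particular library. The `merge` method hints at why the structure suits distributed settings: sketches built on separate partitions with the same parameters can simply be added together.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (names and defaults are hypothetical)."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Assumption: a salted BLAKE2b digest stands in for the
        # pairwise-independent hash family used in the formal analysis.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows
        # never underestimates the true frequency.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )

    def merge(self, other):
        # Sketches with identical dimensions combine by element-wise addition.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions must match")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
```

A typical use would be to call `add` for each symbol in the stream and `estimate` when a frequency is needed; the result is an upper bound on the true count, and larger `width` and `depth` tighten it at the cost of memory.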

Beyond bag-of-words: Using markup to Understand How Science is Written

November 11, 2012 Comments (0)

Newspapers and academic publications have long been popular data sources among text mining and natural language researchers. The advent of the web shifted some attention to unstructured text from online sources, but many researchers continue to use corpora of academic papers. Their semi-structured nature makes academic publications convenient to work with, and built-in bibliometrics (citations) are standard in information sciences. It was citation analysis that was the basis of the original...

Reconstruction Error: A promising fresh take on Automatic Text Summarization

November 4, 2012 Comments (0)

I've long been intrigued by automatic text summarization. I've never had to dive in and play around with current algorithms - usually the first few sentences (or the abstract if it's available) suffice - but I suspect that over time, the tools will get easier and more effective, and that I'll be using them routinely. The most popular approaches boil down to this simple idea: identify a few important sentences, and present those as the summary. The algorithms differ in how the key sentences are...

Mining Time-series with Trillions of Points: Dynamic Time Warping at scale

October 28, 2012 Comments (0)

Take a similarity measure that's already well known to researchers who work with time series, and devise an algorithm to compute it efficiently at scale. Suddenly intractable problems become tractable, and Big Data mining applications that use the metric are within reach. The classification, clustering, and indexing (search) of time series have important applications in many domains. In medicine, EEG and ECG readings translate to time-series data collections with billions (even trillions) of...

Spark 0.6 Improves Performance and Accessibility

October 16, 2012 Comments (0)

[Cross-posted on the O'Reilly Strata blog.] In an earlier post I listed a few reasons why I've come to embrace and use Spark. In particular I described why Spark is well-suited for many distributed Big Data Analytics tasks such as iterative computations and interactive queries, where it outperforms Hadoop. With version 0.6, Spark becomes even faster and easier to use. The release notes contain all the detailed changes, but as you'll see from the highlights below, version 0.6 is a substantial...

Statwing Simplifies Data Analysis

October 5, 2012 Comments (0)

[Cross-posted on the O'Reilly Strata blog.] With so much focus on Big Data, the needs of many analysts who work with Small Data tend to get ignored. The default tools for many of these users remain spreadsheets and/or statistical packages, which come with a lot of features and options. However, many analysts need only a very small subset of what these tools have to offer. Enter Statwing, a software-as-a-service provider for routine statistical analysis. While the tool is still in the early stages,...

How ZeroVM changes analytics in the cloud

September 19, 2012 Comments (0)

What’s so interesting about another open source virtualization platform? Find out by reading my new post on the O'Reilly Strata blog.

Seven Reasons I like Spark

August 21, 2012 Comments (0)

A large portion of this week’s AMP Camp at UC Berkeley is devoted to an introduction to Spark – an open source, in-memory, cluster computing framework. After playing with Spark over the last month, I’ve come to consider it a key part of my big data toolkit. For details see my Radar post.