The Practical Quant's Blog

HBase looks more appealing to data scientists

June 16, 2013 Comments (0)

[A version of this post appears on the O'Reilly Strata blog.] When Hadoop users need to develop apps that are "latency sensitive", many of them turn to HBase. Its tight integration with Hadoop makes it a popular data store for real-time applications. When I attended the first HBase conference last year, I was pleasantly surprised by the diversity of companies and applications that rely on HBase. This year's conference was even bigger and I ran into attendees from a wide range of companies....

It's getting easier to build Big Data Applications

June 9, 2013 Comments (0)

[A version of this post appears on the O'Reilly Strata blog.] Hadoop's low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL (Impala), analytics (Cloudera ML + a partnership with SAS), and as of early this week, real-time search. The economics that led to Hadoop...

Tracking the progress of large-scale Query Engines

June 4, 2013 Comments (0)

[A version of this post appears on the O'Reilly Strata blog.] As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing to upload data into the cloud are beginning to explore Amazon Redshift, Google BigQuery, and Qubole. A variety of analytic engines built for Hadoop are...

How signals, geometry, and topology are influencing data science

May 27, 2013 Comments (0)

[A version of this post appears on the O'Reilly Strata blog.] I've been noticing unlikely areas of mathematics pop up in data analysis. While signal processing is a natural fit, topology and differential and algebraic geometry aren't exactly areas you associate with data science. But upon further reflection, perhaps it shouldn't be so surprising that areas that deal in shapes, invariants, and dynamics in high dimensions would have something to contribute to the analysis of large data sets. Without...

Improving options for unlocking your graph data

May 19, 2013 Comments (0)

[A version of this post appears on the O'Reilly Strata blog.] The popular open source project GraphLab received a major boost early this week when a new company, composed of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to "push the limits of graph computation and develop new ideas", but having a commercial company will accelerate development, and allow the hiring of resources dedicated to...

11 Essential Features that Visual Analysis Tools Should Have

May 12, 2013 Comments (0)

[A version of this post appeared on the O'Reilly Strata blog.] After recently playing with SAS Visual Analytics, I've been thinking about tools for visual analysis. By visual analysis I mean the type of analysis most recently popularized by Tableau, QlikView, and Spotfire: you encounter a data set for the first time and conduct exploratory data analysis, with the goal of discovering interesting patterns and associations. Having used a few visualization tools myself, here's a quick wish-list of...

Scalable streaming analytics using a single server

May 5, 2013 Comments (0)

[A version of this post appears on the O'Reilly Strata blog.] For many organizations, real-time analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to scale (ingest) massive amounts of data, and fit nicely with Hadoop and other cluster computing tools. For these distributed frameworks peak volume is a function of network topology/bandwidth and the...

Tachyon: An open source, distributed, fault-tolerant, in-memory file system

April 28, 2013 Comments (0)

[A version of this post appears on the O'Reilly Strata blog.] In earlier posts I've written about how Spark and Shark run much faster than Hadoop and Hive by caching data sets in memory. But suppose one wants to share datasets across jobs/frameworks, while retaining the speed gains of being in memory? An example would be performing computations using Spark, saving the output, and accessing the saved results in Hadoop MapReduce. An in-memory storage system would speed up sharing across jobs by...

Simpler workflow tools enable the rapid deployment of models

April 21, 2013 Comments (0)

[A version of this post appears on the O'Reilly Strata blog.] Data science often depends on data pipelines that involve acquiring, transforming, and loading data. (If you're fortunate, most of the data you need is already in usable form.) Data needs to be assembled and wrangled before it can be visualized and analyzed. Many companies have data engineers (adept at using workflow tools like Azkaban and Oozie) who manage pipelines for data scientists and analysts. A workflow tool for data...

Single server systems can tackle Big Data

April 15, 2013 Comments (0)

[A version of this post appears on the O'Reilly Strata blog.] About a year ago a blog post from SAP posited that when it comes to analytics, most companies are in the multi-terabyte range: data sizes that are well within the scope of distributed in-memory solutions like Spark, SAP HANA, ScaleOut Software, GridGain, and Terracotta. Around the same time a team of researchers from Microsoft went a step further. They released a study that concluded that for many data processing tasks, scaling by...