Richard Daley, a co-founder and the chief strategy officer of
analytics and business intelligence specialist Pentaho, believes that
such a stack will begin to come together this year as consensus begins
to develop around certain big data reference architectures--though the
upper layers of the stack may have more proprietary elements than LAMP
does.
"The explosion of dynamic, interactive websites in the late 1990s and
early 2000s was driven, at least in part, by the LAMP stack, consisting
of Linux, Apache HTTP server, MySQL and PHP (or Perl or Python)."
"There's thousands of big data reference architectures out there,"
Daley says. "This is going to be more of a 'history repeats itself' kind
of thing. We saw the exact same thing happen back with the LAMP stack.
It's driven by pain. Pain is what's going to drive it initially; pain in
the form of cost and scale."
But, Daley says, organisations that tackle that pain with big data
technologies--42 percent of organisations were already engaged in some
form of big data initiative in 2013, according to a CompTIA study--quickly
begin to see the upside of that data, particularly organisations that
leverage it for marketing or for network intrusion detection.
"In the last 12 months, we've seen more and more people doing big
data for gain," he says. "There is much more to gain from analysing and
utilising this big data than just storing it."
The explosion of dynamic, interactive websites in the late 1990s and
early 2000s was driven, at least in part, by the LAMP stack, consisting
of Linux, Apache HTTP server, MySQL and PHP (or Perl or Python). These
free and open source components are all individually powerful tools
developed independently, but come together like Voltron to form a Web
development platform that is more powerful than the sum of its parts.
The components are readily available and have open licenses with
relatively few restrictions. Perhaps most important, the source is
available, giving developers a tremendous amount of flexibility.
While the LAMP stack specifies the individual components (though
substitutions at certain layers aren't uncommon), the big data stack
Daley envisions has a lot more options at each layer, depending on the
application you have in mind.
'D' Is for the Data Layer
The bottom layer of the stack, the foundation, is the data layer.
This is the layer for the Hadoop distributions, NoSQL databases (HBase,
MongoDB, CouchDB and many others), even relational databases and
analytical databases like SAS, Greenplum, Teradata and Vertica.
"Any of those technologies can be used for big data applications,"
Daley says. "Hadoop and NoSQL are open, more scalable and more
cost-effective, but they can't do everything. That's where guys like
Greenplum and Vertica have a play for doing some very fast,
speed-of-thought analytical applications."
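To make the data layer concrete, here is a minimal sketch of landing raw events in one of the NoSQL stores Daley mentions, MongoDB, via the pymongo client. The host, database, collection and field names are all hypothetical, invented purely for illustration.

```python
# A minimal sketch of the data layer: landing schema-flexible event
# documents in MongoDB via pymongo. Host, database, collection and
# field names below are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["bigdata_demo"]["raw_events"]

# Schema-on-read: documents need not share a fixed layout up front.
events.insert_one({
    "user_id": 42,
    "action": "page_view",
    "url": "/pricing",
    "ts": "2014-03-01T12:00:00Z",
})

# Retrieve everything recorded for one user.
for doc in events.find({"user_id": 42}):
    print(doc["action"], doc["url"])
```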
In many ways, this layer of the stack has the most work ahead of it,
Daley says. Relational and analytical databases have years of
development behind them, but Hadoop and NoSQL technologies are still
in their relatively early days.
"Hadoop and NoSQL, I have to say we are early," Daley says. ""We're
over the chasm in terms of adoption--we're beyond the early adopters.
But there's still a lot that needs to be done in terms of management,
services and operational capabilities for both of those environments.
Hadoop is a very, very complicated bit of technology and still rough
around the edges. If you look at the NoSQL environment, it's kind of a
mess. Every single NoSQL engine has its own query language."
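To illustrate that fragmentation, the sketch below expresses the same logical lookup against two of the engines mentioned above, MongoDB (via pymongo) and HBase (via the happybase client). The connection details, table names and row-key scheme are all assumptions made for the example.

```python
# The same logical query, two different "query languages": a MongoDB
# filter document versus an HBase row-key prefix scan. All names and
# the row-key layout (b"<user_id>:<timestamp>") are hypothetical.
from pymongo import MongoClient
import happybase

# MongoDB: a declarative filter document.
mongo_events = MongoClient("mongodb://localhost:27017")["demo"]["events"]
mongo_hits = list(mongo_events.find({"user_id": 42}))

# HBase: no filter documents; you design row keys so that a prefix
# scan answers the question instead.
hbase = happybase.Connection("localhost")
events_table = hbase.table("events")
hbase_hits = [row for _key, row in events_table.scan(row_prefix=b"42:")]
```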
'I' Is for the Integration Layer
The next layer up is the integration layer. This is where data prep,
data cleansing, data transformation and data integration happens.
"Very seldom do we only pull data from one source," Daley says. "If
we're looking at a customer-360 app, we're pulling data from three, four
or even five sources. When somebody has to do an analytical app or even
a predictive app, 70 percent of the time is spent in this layer,
mashing the data around."
While this layer is the "non-glamorous" part of big data, it's also
an area that's relatively mature, Daley says, with lots of utilities
(like Sqoop and Flume) and vendors out there filling the gaps.
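As a sketch of what that "mashing" looks like in practice, the snippet below joins three hypothetical extracts into a customer-360 table with pandas; the file names and columns are invented for the example, not tied to any particular product.

```python
# A hypothetical customer-360 mash-up: three source extracts joined on
# a shared customer key, then lightly cleansed. File and column names
# are invented for the example.
import pandas as pd

crm = pd.read_csv("crm_accounts.csv")        # customer_id, name, segment
web = pd.read_csv("web_sessions.csv")        # customer_id, sessions
billing = pd.read_csv("billing_totals.csv")  # customer_id, lifetime_value

customer_360 = (
    crm.merge(web, on="customer_id", how="left")
       .merge(billing, on="customer_id", how="left")
)

# Typical cleansing: fill gaps and normalise types before analysis.
customer_360["sessions"] = customer_360["sessions"].fillna(0).astype(int)
customer_360.to_csv("customer_360.csv", index=False)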
'A' Is for the Analytics Layer
The next layer up is the analytics layer, where analytics and visualisation happen.
"Now I've got the data. I've got it stored and ready to be looked
at," Daley says. "I take a Tableau or Pentaho or Qlikview and visualise
that data. Do I have patterns? This is where people--business users--can
start to get some value out of it. This is also where I would include
search. It's not just slice-and-dice or dashboards."
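To make the slice-and-dice idea concrete, here is a toy aggregation done by hand in pandas, the sort of rollup a Tableau or Pentaho dashboard runs behind the scenes. The data is fabricated purely for illustration.

```python
# Slice-and-dice by hand: aggregate a (fabricated) customer table by
# segment, the kind of rollup a BI dashboard automates.
import pandas as pd
import matplotlib.pyplot as plt

customer_360 = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "segment": ["smb", "smb", "enterprise", "enterprise", "smb"],
    "lifetime_value": [120.0, 80.0, 900.0, 1100.0, 60.0],
})

summary = customer_360.groupby("segment").agg(
    customers=("customer_id", "count"),
    avg_value=("lifetime_value", "mean"),
)
print(summary)

# A quick visual pass over the same rollup.
summary["avg_value"].plot(kind="bar", title="Average lifetime value by segment")
plt.tight_layout()
plt.show()
```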
This area too is relatively mature, though Daley acknowledges there's a way to go yet.
"We've got to figure out as an industry how to squeeze more juice out
of Hadoop--methods to get data faster," he says. "Maybe we acknowledge
that it's a batch environment and we need to put certain data in other
data sources? Vendors are working around the clock to make those
integrations better and better."
'P' Is for the Predictive/Prescriptive Analytics Layer
The top layer of the stack is predictive/prescriptive analytics,
Daley says. This is where organisations start to truly recognise the
value of big data. Predictive analytics uses data (historical data,
external data and real-time data), business rules and machine learning
to make predictions and identify risks and opportunities.
One step further along is prescriptive analytics, sometimes
considered the holy grail of business analytics, which takes those
predictions and offers suggestions for ways to take advantage of future
opportunities or mitigate future risks, along with the implications of
the various options.
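As a minimal sketch of the predictive step, the snippet below trains a classifier on historical records and scores a new one with scikit-learn. The churn framing, features and numbers are all invented for illustration.

```python
# A minimal predictive sketch: fit a classifier on (fabricated)
# historical data, then score a new record. A prescriptive layer would
# go one step further and attach a recommended action to the score.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical features: [sessions_last_30d, support_tickets]; label: churned.
X = np.array([[20, 0], [3, 4], [15, 1], [1, 6], [25, 0], [2, 5]])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# The probability of churn for a new customer drives the "risk" side
# of predictive analytics.
new_customer = np.array([[4, 3]])
print("churn probability:", round(model.predict_proba(new_customer)[0, 1], 3))
```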
"You have to go through and do predictive to get value out of big
data," he says. "It's a low likelihood that you're going to get a lot of
value out of just slicing and dicing data. You've got to go all the way
up the stack."
"At least 70, maybe even 80 percent of what we see around big data
applications is now predictive or even prescriptive analytics," Daley
adds. "That's necessity, they mother of invention. It starts at the
bottom with data technology--storage, data manipulation,
transformations, basic analytics. But what's happening more and more,
finally, is predictive, advanced analytics is coming of age. It's
becoming more and more mainstream."
While predictive analytics is somewhat mature, it's currently an area only data scientists are equipped to handle.
"I think predictive is a lot farther along than the bottom layer of
the stack," Daley says. "From a technology standpoint, I think it's
mature. But we need to figure out how to get it into the hands of a lot
more users. We need to build it into apps that business users can access
versus just data scientists."
What's That Spell? DIAP? PAID?
Call it the DIAP stack. Or maybe start from the top and call it the
PAID stack. The trick now, Daley says, is not just adding more maturity
to component technologies like Hadoop and NoSQL, it's providing
integration up and down the stack.
"That's a very key point," he says. "To date, all these things are
separate. A lot of companies only do one of these things. Hortonworks
will only do the data side, they won't do integration, for example. But
customers like to go through and buy an integrated stack. We should at
least make sure that our products up and down those stacks are truly
integrated. That's where it's going to have to get to. In order to
really get adopted, products and vendors are going to need to work up
and down that stack. I need to support every flavor of Hadoop--at least
the commercially favorable ones. And it's the same thing for NoSQL."