Sunday, February 27, 2011

Big Data, Great Open Source Tools

Enterprises are grappling with how to manage their data as it proliferates into the terabyte and petabyte range.

IT professionals refer to such large datasets as "big data."

Relational databases and desktop statistics or visualization packages cannot handle big data; instead, massively parallel software running on up to thousands of servers is needed to do the job.

Many companies turn to open source tools such as Apache Hadoop when working with big data. For example, Twitter sends its logging messages to Hadoop, writing the data directly into HDFS, the Hadoop Distributed File System.
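A log pipeline of the kind just described typically batches messages and appends them to files under a date-partitioned path in HDFS. The sketch below is a minimal, hypothetical illustration in Python: it writes to a local directory as a stand-in for HDFS (a real pipeline would go through an HDFS client or the WebHDFS API), and all names are invented for the example.

```python
import json
import os
from datetime import datetime, timezone

def partition_path(base, ts):
    """Build a date-partitioned path like the ones log pipelines use in HDFS."""
    return os.path.join(base, ts.strftime("%Y/%m/%d"), "events.log")

def append_log_messages(base_dir, messages):
    """Batch JSON log messages and append them to a partitioned log file.

    Writing to a local directory here stands in for writing to HDFS.
    """
    ts = datetime.now(timezone.utc)
    path = partition_path(base_dir, ts)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        for msg in messages:
            # One JSON object per line is a common log format for HDFS ingest.
            f.write(json.dumps(msg) + "\n")
    return path
```

The newline-delimited JSON layout makes each file easy to split and process later with MapReduce jobs.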

Hadoop, which can support data-intensive applications scaling up to thousands of nodes and multiple petabytes, has received wide acceptance, said David Hill of the Mesabi Group.

However, "big data" is just a general term covering many different types of applications, and Hadoop will not be suitable in every case, Hill warned.

The acquisition, storage and analysis of big data depend on the nature of the particular application, Hill said. For example, scale-out network-attached storage such as EMC's (NYSE: EMC) Isilon or IBM's (NYSE: IBM) SONAS (Scale Out Network Attached Storage) may be better suited to unstructured data such as photographs or video than a tool like Hadoop, he suggested.

Working with big data can be classified into three basic categories, said Mike Minelli, executive vice president at Revolution Analytics.

One is information management, the second is business intelligence, and the third is advanced analytics, Minelli said.

Information management captures and stores information; BI analyzes data to see what happened in the past; and advanced predictive analytics looks at what the data points to for the future, Minelli said.
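To make the third category concrete, a predictive model "looks at what the data points to for the future." The toy sketch below (plain Python with invented numbers, not taken from any of the products mentioned) fits a least-squares trend line to past values and extrapolates one period ahead.

```python
def fit_trend(values):
    """Ordinary least-squares fit of y = a + b*x for x = 0, 1, 2, ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

def forecast_next(values):
    """Predict the next point by extending the fitted trend line."""
    a, b = fit_trend(values)
    return a + b * len(values)
```

Real predictive analytics applies far richer models, but the shape is the same: learn from what happened in the past, then project forward.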

Revolution Analytics offers the open source R language and the commercial Revolution R Enterprise. These provide advanced analytics for terabyte-class datasets. Revolution Analytics is developing connectors to Hadoop and the capability for R to execute jobs on Google's (Nasdaq: GOOG) MapReduce framework, Minelli said.
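The MapReduce model those connectors target splits work into a map phase that emits key/value pairs and a reduce phase that aggregates the values for each key. As a rough sketch of that flow (written in Python rather than R, and running in a single process instead of on a cluster):

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, collecting the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as the framework's shuffle step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word count expressed in the model.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def count_reducer(word, counts):
    return sum(counts)
```

On a real cluster, the map and reduce phases run in parallel across many nodes, which is what lets the model scale to petabyte datasets.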

Proprietary analytics functionality for big data is available from AsterData; Netezza, now owned by IBM; Datameer, which builds on Apache Hadoop but is proprietary; and ParAccel, Minelli said.

IBM's InfoSphere and Netezza products, Oracle's (Nasdaq: ORCL) Exadata, and EMC's Greenplum are other proprietary tools for working with big data.

EMC has introduced a free community edition of its Greenplum database. This community edition is software-only, said Hill of the Mesabi Group.

The Greenplum Community Edition doesn't compete with Hadoop; rather, it is "a project which intends to incorporate best-of-breed technologies together to provide the best choice of platform," said Luke Lonergan, vice president and chief technology officer of EMC's Data Computing Products division.

The initial release of the Greenplum Community Edition includes three modules: Greenplum DB, MADlib and Alpine Miner, Lonergan said.

"The version of Greenplum DB included is an advanced development version that will provide rapid innovation; MADlib provides a set of data mining and machine learning algorithms; and Alpine Miner provides a visual data mining environment that executes its algorithms directly within Greenplum DB," Lonergan explained.

Open source tools for big data include Hadoop, MapReduce and Jaspersoft's business intelligence tools.

Jaspersoft delivers business intelligence tools that provide analysis, reporting and ETL (extract, transform and load) for massively parallel analytic databases, including EMC's Greenplum and HP's (NYSE: HPQ) Vertica. A version for IBM's Netezza is in the works, said Andrew Lampitt, director of business intelligence at Jaspersoft.
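ETL of the kind described here extracts records from a source, transforms them, and loads them into an analytic database. The fragment below is a deliberately simplified, hypothetical illustration in Python; it uses an in-memory SQLite table in place of a massively parallel target such as Greenplum, and the column names are invented for the example.

```python
import csv
import io
import sqlite3

def etl(csv_text, conn):
    """Extract rows from CSV text, transform them, and load them into SQLite."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    rows = csv.DictReader(io.StringIO(csv_text))                   # extract
    cleaned = [(r["region"].strip().upper(), float(r["amount"]))   # transform
               for r in rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)   # load
    conn.commit()
```

A production ETL tool adds scheduling, error handling and bulk-load paths, but the extract/transform/load stages themselves follow this same pattern.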

Jaspersoft also provides native reporting via open source connectors for Hadoop and various NoSQL databases, including MongoDB, Riak, CouchDB and Infinispan.

Further, Jaspersoft's open source offering has a bridge to R, the advanced analytics product from Revolution Analytics.

Open source tools provide insight into the code, so developers can find out what's inside when they integrate Jaspersoft, Lampitt said.

"In nearly every instance, open source analytics will be cheaper and more flexible than traditional proprietary solutions," said Revolution Analytics' Minelli.

"Data volumes are growing to the point where companies are forced to scale their infrastructure, and proprietary license costs skyrocket along the way. Open source technology gets the job done faster and more accurately at a fraction of the price," he added.

Twitter, for example, opted for Hadoop in part because using proprietary tools would have been too expensive.

Open source tools also let enterprises create new analytic techniques to better manage unstructured data such as images and photographs, Minelli said.

"Open source analytics tools allow you to create innovative analytics that you can bake into your company. In today's ultra-competitive global economy, it is no longer feasible to wait for a traditional vendor to develop a new analytical technique," Minelli added.

As in other areas of IT, we're likely to see a mix of proprietary and open source technology used for working with big data.

"In the short term, open source analytics will become more and more widely used and will grow virally," Minelli opined. "In the long run, you will see a mix or blending of techniques in highly competitive markets. My guess is that both will remain valid and necessary."

