50+ Open Source Tools for Big Data

February 5th, 2024

Hi Everyone, I’ve seen a few great lists lately of Open Source Tools for Big Data. So I thought I would share the best of what I’ve seen and use a little crowdsourcing from readers to see what’s missing and create a UPDATED master list.

Here is a very helpful landscape style visual of the Open Source Tools from the blog: Big Data Start-ups

Next we have another list that looks pretty solid from By Fari Payandeh at Blog: Big Data Studio

Fari writes: It was not easy to select a few out of many Open Source projects. My objective was to choose the ones that fit Big Data’s needs most. What has changed in the world of Open Source is that the big players have become stakeholders; IBM’s alliance with Cloud Foundry, Microsoft providing a development platform for Hadoop, Dell’s Open Stack-Powered Cloud Solution, VMware and EMC partnering on Cloud, Oracle releasing its NoSql database as Open Source.

The Final List comes from Datamation.com:

50 Top Open Source Tools for Big Data

1. Hadoop

You simply can’t talk about big data without mentioning Hadoop. The Apache distributed data processing software is so pervasive that often the terms “Hadoop” and “big data” are used synonymously. The Apache Foundation also sponsors a number of related projects that extend the capabilities of Hadoop, and many of them are mentioned below. In addition, numerous vendors offer supported versions of Hadoop and related technologies. Operating System: Windows, Linux, OS X.

2. MapReduce

Originally developed by Google, the MapReduce website describe it as “a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.” It’s used by Hadoop, as well as many other data processing applications. Operating System: OS Independent.

3. GridGain

GridGrain offers an alternative to Hadoop’s MapReduce that is compatible with the Hadoop Distributed File System. It offers in-memory processing for fast analysis of real-time data. You can download the open source version from GitHub or purchase a commercially supported version from the link above. Operating System: Windows, Linux, OS X.

4. HPCC

Developed by LexisNexis Risk Solutions, HPCC is short for “high performance computing cluster.” It claims to offer superior performance to Hadoop. Both free community versions and paid enterprise versions are available. Operating System: Linux.

5. Storm

Now owned by Twitter, Storm offers distributed real-time computation capabilities and is often described as the “Hadoop of realtime.” It’s highly scalable, robust, fault-tolerant and works with nearly all programming languages. Operating System: Linux.

Databases/Data Warehouses

6. Cassandra

Originally developed by Facebook, this NoSQL database is now managed by the Apache Foundation. It’s used by many organizations with large, active datasets, including Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco and Digg. Commercial support and services are available through third-party vendors. Operating System: OS Independent.

7. HBase

Another Apache project, HBase is the non-relational data store for Hadoop. Features include linear and modular scalability, strictly consistent reads and writes, automatic failover support and much more. Operating System: OS Independent.

8. MongoDB

MongoDB was designed to support humongous databases. It’s a NoSQL database with document-oriented storage, full index support, replication and high availability, and more. Commercial support is available through 10gen. Operating system: Windows, Linux, OS X, Solaris.

9. Neo4j

The “world’s leading graph database,” Neo4j boasts performance improvements up to 1000x or more versus relational databases. Interested organizations can purchase advanced or enterprise versions fromNeo Technology. Operating System: Windows, Linux.

10. CouchDB

Designed for the Web, CouchDB stores data in JSON documents that you can access via the Web or or query using JavaScript. It offers distributed scaling with fault-tolerant storage. Operating system: Windows, Linux, OS X, Android.

11. OrientDB

This NoSQL database can store up to 150,000 documents per second and can load graphs in just milliseconds. It combines the flexibility of document databases with the power of graph databases, while supporting features such as ACID transactions, fast indexes, native and SQL queries, and JSON import and export. Operating system: OS Independent.

12. Terrastore

Based on Terracotta, Terrastore boasts “advanced scalability and elasticity features without sacrificing consistency.” It supports custom data partitioning, event processing, push-down predicates, range queries, map/reduce querying and processing and server-side update functions. Operating System: OS Independent.

13. FlockDB

Best known as Twitter’s database, FlockDB was designed to store social graphs (i.e., who is following whom and who is blocking whom). It offers horizontal scaling and very fast reads and writes. Operating System: OS Independent.

14. Hibari

Used by many telecom companies, Hibari is a key-value, big data store with strong consistency, high availability and fast performance. Support is available through Gemini Mobile. Operating System: OS Independent.

15. Riak

Riak humbly claims to be “the most powerful open-source, distributed database you’ll ever put into production.” Users include Comcast, Yammer, Voxer, Boeing, SEOMoz, Joyent, Kiip.me, DotCloud, Formspring, the Danish Government and many others. Operating System: Linux, OS X.

16. Hypertable

This NoSQL database offers efficiency and fast performance that result in cost savings versus similar databases. The code is 100 percent open source, but paid support is available. Operating System: Linux, OS X.

Click Here to See #17 to 50