Born in the UC Berkeley AMPLab, Spark is a fast data processing engine that competes with the likes of MapReduce. It works well within the existing Hadoop ecosystem and offers several advantages over its predecessors. Spark handles conventional batch processing as well as newer workloads such as streaming, interactive queries and iterative algorithms.
A recent survey conducted by Typesafe, the commercial backer of the up-and-coming Scala language, collected responses from more than 2000 developers. Their answers about Apache Spark and its usability bring three primary things into focus.
Of the surveyed developers, 71% have some research or evaluation experience with Spark, and 35% are already using it or plan to use it in future projects. The increasing popularity of Spark for data processing is set to make waves in the world of big data.
Compared to MapReduce, Apache Spark offers many advanced features. Over 78% of respondents have seen improved processing performance with Spark. The ability to process event streams is another significant point in favor of Spark.
One of the biggest advantages of Spark is that, although it is written primarily in Scala, it supports several other development languages. As Dean Wampler, Typesafe’s architect for Big Data Products and Services, puts it, “Spark is written in Scala and it is pulling people towards Scala. Typically they’re coming from a Big Data ecosystem already, and they are used to working with Java, if they are developers, or languages like Python and R, if they are data scientists.
Fortunately for everyone, Spark supports several languages – Scala, Java, Python, and R is coming. So people don’t necessarily have to switch to Scala.”
No Major Roadblocks
The primary roadblock that developers mention is their own lack of experience with Spark. The shortage of support, detailed documentation and troubleshooting guidance for Apache Spark is a hurdle that needs to be overcome. Organizations that do not need processing at such a large scale will not be drawn to it. Spark is also relatively new and does not yet have enough commercial support in the form of big companies backing it.
On its Way to Replacing MapReduce
Dean Wampler addresses this issue best: “Spark still needs to mature in many ways, especially the newer modules, such as Spark-SQL and Spark-Streaming. Older tools, like Hadoop & MapReduce, have had a longer runway and hence more time to be hardened and expertise to be documented. All these issues are being addressed and they should be resolved relatively soon.”
Spark is well on its way to replacing MapReduce in the Hadoop ecosystem. It has not been around long enough for every problem to have a known fix, and more troubleshooting documentation can only emerge once Apache Spark has been used in many different environments. The absence of a big company backing it remains a slight hurdle. Even so, with the superior functionality it provides, Spark is on the brink of becoming the most extensively used data processing system.