It only makes sense that as the community of Spark contributors got bigger the project would get even more ambitious. So when Spark 2.0 comes out in a matter of weeks it’s going to have at least three robust new features, according to Ion Stoica, the founder of Databricks and keynote speaker at Apache Big Data in Vancouver on Tuesday afternoon.
“Spark 2.0 is about taking what has worked and what we have learned from the users and making it even better,” Stoica said.
Queries will be more performant - the goal is 10x faster - through the success of Project Tungsten, an ongoing effort which set out to improve the efficiency of memory and CPU for applications. The three ways it’s succeeded is through cache-aware computation that uses algorithms and data structures to exploit memory hierarchy, code generation to exploit modern compilers and CPUs, and using application semantics to eliminate memory getting bogged down on garbage collection and the JVM object model.
“The more semantics you know the better you can optimize the applications,” Stoica said.
Spark 2.0 will ship with even more components from Project Tungsten, which has been rolling out in pieces, across multiple releases, since Spark 1.4 about a year ago.
Spark 2.0 will also feature improved APIs to make it “even easier” to write applications for Spark, a feature for the influx of data scientists that are now using Spark who aren’t necessarily full-blown developers and database admins. Part of this feature is the introduction of the Dataset API. Datasets are static typed extensions that use Resilient Distributed Dataset (RDD)-like operations, and when added to Spark’s dataframes, it creates a best-of-both-worlds approach.