The fundamental problem is that many open-source analysis tools haven’t kept pace with hardware advances. The computer industry has built incredibly powerful systems but many data analysts lack the tools to use them.
For anyone not a data scientist or programmer, this would not seem an obvious problem. But among data scientists, the widening gulf between the data science world and big data systems is a major anxiety.
“Open source data science software has already become incredibly important to how the world analyzes data and builds production machine learning and AI models,” McKinney noted, but many open-source tools aren’t funded sufficiently to keep up with advances on the compute side, he added.
This could have dire consequences, including significantly lowered productivity among data scientists. If they can’t find an efficient way to extract useful data from big data repositories, which keep getting bigger, a chokepoint is inevitable. If algorithms can’t be shared among different programming environments, another problem looms. If data sets can’t travel over a network to be read by data scientists working in another environment, everything—including the scientists—will tread water.
Computing costs would most likely remain high if data scientists had to keep using less efficient tools. An even bigger threat, McKinney predicted, is that inefficiency might push organizations to forsake open source software and fall back on less flexible, more expensive proprietary software.
To illustrate the gap between data science and big data, let’s take an AI model.
We hear about the development of new machine learning technologies all the time. And yet, “to use machine learning models, data scientists still need to load and access data, clean and manipulate it, explore it, find features, and then they must do it all in a reproducible way so that whenever new data comes in, they can update their model,” McKinney explained in a lecture entitled “Data science without borders” three years ago.
Even as more sophisticated AI algorithms emerge, data scientists can’t put them to use immediately, because they still need all of these tools to shape data to fit their models.
Genesis of Apache Arrow
McKinney believes that huge advancements in hardware made it inevitable that the open-source software (OSS) community would recognize that OSS tools must change substantially.
1. Computing power
First, “the data science tools themselves in languages like Python and R lagged significantly behind advances in computing hardware,” McKinney observed. Most such tools are not designed to run on multi-core CPUs, GPUs, or systems with lots of RAM “because they were designed a decade ago when you didn’t have things like a CPU with 16 cores.”
Further, they were all developed well before the birth of CUDA, a parallel computing platform and application programming interface model created by Nvidia.
Similarly, memory and storage have accelerated massively. McKinney noted, “Disk drives are getting a lot faster. You also have solid-state drives. It’s not just about memory speeds, but your ability to get access to data is getting faster.” Open-source data science tools, however, did not keep pace; again, they became a bottleneck.
3. Need for efficient bulk data movement
Another factor that triggered the movement is “the need to do efficient bulk data movement,” noted McKinney.
In a traditional model, you ingest all data into your Oracle database, or your database system. Then, you send SQL queries. So, all the data is owned by the database.
Under new models like data lakes, which can reside on premises or in the cloud, you commonly need to extract data, a process that can be very slow, partly because of the protocols that database connectivity standards require. McKinney said the solution is a technology like Arrow, designed for very efficient bulk transfers.
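To make the idea of a bulk transfer concrete, here is a minimal sketch in plain Python. It ships an entire column of integers as one contiguous buffer instead of converting values one at a time. The framing (an 8-byte count followed by raw little-endian 64-bit values) and the names `encode_column` and `decode_column` are assumptions invented for this illustration; this is not Arrow's actual wire format or API.

```python
import struct
import sys
from array import array

# Hypothetical framing for this sketch: an 8-byte little-endian value count.
HEADER = struct.Struct("<Q")


def encode_column(values):
    """Pack a whole column of ints into a single contiguous bytes payload."""
    col = array("q", values)          # 'q' = signed 64-bit integers
    if sys.byteorder == "big":
        col.byteswap()                # normalize to little-endian on the wire
    return HEADER.pack(len(col)) + col.tobytes()


def decode_column(payload):
    """Rebuild the column from the payload with one bulk read, not a per-value loop."""
    (count,) = HEADER.unpack_from(payload, 0)
    start = HEADER.size
    end = start + count * array("q").itemsize
    col = array("q")
    col.frombytes(memoryview(payload)[start:end])
    if sys.byteorder == "big":
        col.byteswap()                # convert back to native byte order
    return col


payload = encode_column([1, 2, 3, -4])
restored = decode_column(payload)
```

The receiver never parses individual values as text or rows; it reads one block of bytes whose layout both sides agreed on in advance, which is the general property that makes bulk transfers cheap.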
4. Language independence
Then, there’s the issue of incompatible programming languages. At Google, Facebook, and other industry research labs, Python is now the primary machine learning user interface. But other languages, such as R, Julia, and those that run on the JVM, have not bowed out and gone away. Different people use different programming languages because they regard them as keys to their own productivity, said McKinney.
The desire to unplug the bottleneck associated with data access and interchanging data between systems initially drove McKinney to launch the Apache Arrow Project — marking the birth of a massive endeavor by the open-source community to develop a programming language-independent software framework.
Under such a framework, McKinney explained, “You arrange the data so that it can be analyzed in situ. You run your code directly on the data, rather than moving the data around.”
Hence, a new software framework must be designed for an ecosystem with multiple programming languages. But “we want to deal with the data to not be contaminated with programming language-specific details,” noted McKinney. The goal [of Apache Arrow] is to offer “seamless programming language interoperability at the data level, not necessarily at the code level.”
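The notion of analyzing data in situ can be sketched with Python's built-in `memoryview`. The example assumes a column of 64-bit floats stored in one contiguous buffer in native byte order; the buffer contents and the `mean` helper are hypothetical, and the code only illustrates the zero-copy idea rather than how Arrow itself is implemented.

```python
from array import array

# A column of float64 values held in one contiguous, writable buffer.
buffer = bytearray(array("d", [1.5, 2.5, 3.0, 4.0]).tobytes())

# cast("d") reinterprets the same bytes as doubles; nothing is copied.
column = memoryview(buffer).cast("d")


def mean(view):
    """Compute directly on the view, without materializing a Python list."""
    return sum(view) / len(view)


tail = column[2:]     # slicing a memoryview is also zero-copy
column[0] = 10.0      # the write lands in the underlying buffer
```

Because the layout is just typed values in a flat buffer, any language that can read raw memory could interpret the same bytes, which is one way to picture interoperability "at the data level, not necessarily at the code level."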
The development of Apache Arrow has been in the works for five years. McKinney acknowledged that assembling a team of open source developers shooting for a universal data standard has been “just an extraordinarily big endeavor.” The community bootstrapped the core to create a development platform that developers can use either to build Arrow into their systems or to build new systems based on Arrow.
Acknowledging the process has been “a grind,” McKinney said he is happy to report that things have gone well so far.
The next challenge is to build “a path to invest even more in the open-source community,” McKinney stressed. Rather than folding Ursa Labs, Ursa Computing will maintain a “Labs” team and continue its leadership of the Apache project. Still, a transition from a non-profit to a commercial operation is a big change.
McKinney defended the move, explaining, “Working more closely with enterprises to enhance their data platforms will allow us to learn more about how the Arrow project needs to evolve to meet current and future needs.” One example is working with managed cloud services. “It would be challenging to pursue as an open source project,” he added.
Investment management fund to open source community
McKinney became well known as “the man behind the most important tool in data science,” since he developed pandas, a software library written for the Python programming language for data manipulation and analysis. McKinney started with pandas as a closed source project in 2008 at AQR Capital, a Connecticut-based investment management firm.
The firm hired him after he earned his mathematics degree from MIT. A year later, McKinney had finished pandas and released it as free, open-source software. Looking back, McKinney said, “I thought I would try my hand at quant finance,” when he joined the hedge fund. But he ended up realizing that “working on data tools and data infrastructure was more my cup of tea than finance.”
He emphasized, “I really care a lot about empowerment and enabling people to be more efficient and more productive.” With Ursa Computing, his aspiration is to orchestrate the whole Apache Arrow ecosystem.