When programming toolkits are included in scientific papers, it’s usually an honorable mention, with perhaps some coding information appended to the supplementary materials. So when Nature, a glamorous scientific journal, published a review of the array programming library NumPy, researchers took notice.
Python is an increasingly popular coding language, favored for being open access and easy to learn. NumPy is a library designed for use within Python, and it is the mother of a great many other highly powerful Python tools. Briefly, it allows for easy manipulation of matrices and arrays – operations from linear algebra – which translates to applications in statistics, chemistry, biology, and yes, astronomy (remember when the first detection of gravitational waves made headlines, providing experimental evidence of Einstein’s Relativity? That discovery was made possible with NumPy-based analysis).
Array Programming with NumPy serves as our introduction into this coding toolkit. Let’s dig a little deeper into the review.
Why We Needed NumPy
In the beginning, there was a battle between two different Python array packages. One had more advanced matrix handling features to allow it to deal with data-heavy images. The other was nimble and could work efficiently as an API (it was bilingual in coding languages), allowing it to interface with C++ and Python.
Both had their strengths – a low-level language like C++ makes it super efficient and allows for lots of control over memory storage, but advanced matrix operations opened up the program to work with data like images coming from the Hubble Space Telescope (to a computer, an image is just a multi-dimensional matrix). When NumPy arrived on the scene, it combined the best features from both and unified the coding community.
What is Array Programming?
An array is a multi-dimensional matrix. It is a data structure that can be used to store complex information. Working with the data might include shaping, indexing, and vectorizing an array, and you might use linear algebra or scalar operations on it (multiplication, for example, is very different in linear algebra than in our common use of the word).
Indexing allows you to call specific elements out of the array. This allows access to that particular data point, or maybe you want to update the data in a particular location.
Operators allow for arithmetic on and between the arrays, and they can be scalar or refer to linear algebra operations.
And the secret to NumPy’s power lies largely in vectorization, which operates on the arrays themselves, rather than their indexed elements. Much of scientific analysis requires vectorization, but in a language like C++, code to accomplish it would be long and hard to follow. NumPy is a data-efficient toolkit that streamlines the process (so you can write much more concise code).
In sum, NumPy is an elegant programming library that:
- Provides tools for easy access of array data
- Uses memory efficiently, freeing up computer space
- Relies on tools that allow the user to write more concise, streamlined code
Computers Have Ecosystems, Too
NumPy is fifteen years old, but it’s as strong as ever – and supporting a growing ecosystem of scientific Python libraries. Some that are familiar to me include pandas, Matplotlib, and skimage. Others include Biopython (for biologists), Astropy (for astronomers), Quantecon (for economists), and NLTK (for linguists). Researchers in and out of STEM rely on NumPy-supported tools for their programming and data analysis.
NumPy, along with SciPy and Matplotlib, provide the Big Three as a foundation for the rest of these packages. With NumPy’s array programming, SciPy’s access to common scientific algorithms, and Matplotlib’s data visualization power, you get a scientific powerhouse – and all in the easy-to-learn Python language.
The reviewers even comment on Jupyter notebooks, a personal favorite of mine. Such an accessible interface (unlike that scary command line)! And when you can break your code into convenient blocks, along with written explanations or interpretations, it becomes easier to code accessibly and reproducibly. A Jupyter notebook becomes a place to intuitively tease out a story from the data. It’s also a popular way to share code, as this programmer did as a way to complement a discussion of neural networks (and hey look, it uses NumPy!)
Develops Code, Will Travel
In the final section of this review, the authors comment on the community and dedication of NumPy’s developers. NumPy – like Python – is open access. Developing it has required the patient input of dedicated volunteers – often grad students, with famously low spare time.
But combining C’s performance and Python’s readability was worth it.
NumPy was initially developed by students, faculty and researchers to provide an advanced, open-source array programming library for Python, which was free to use and unencumbered by license servers and software protection dongles. There was a sense of building something consequential together for the benefit of many others. Participating in such an endeavour, within a welcoming community of like-minded individuals, held a powerful attraction for many early contributors.Harris et al. 2020
And it’s still, of course, improving. Like any programming application, there are always improvements and advancements to be added in the next release. In the meantime, it’s a convenient toolkit to work with.
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array Programming with NumPy. Nature, 585(February), 357–362. https://doi.org/10.1038/s41586-020-2649-2