Similarity detection is a common method for identifying items that share traits but not necessarily the same features. Product recommendations and related articles are often driven by similarity metrics. Cosine similarity is one of the most popular of these metrics and is the one covered here. This article compares the performance of cosine similarity calculations in Python using NumPy and TensorFlow.
NumPy is a robust and mature library for working with large multi-dimensional matrices. NumPy has a rich collection of linear algebra functions. It’s well-tuned and runs very fast on CPUs.
TensorFlow is a math and machine learning library that can utilize both the CPU and GPU. TensorFlow is well proven in production and used to power advanced machine learning algorithms. It also has a rich collection of linear algebra functions. Most of what you can do with NumPy can also be done in TensorFlow.
Both are great libraries and both can compute similarity metrics between data sets.
NumPy and TensorFlow can be installed via pip if they are not already installed. This article uses NumPy 1.17.4 and TensorFlow 2.1.
pip install numpy
pip install tensorflow
The following sample code shows how to calculate cosine similarity in NumPy and TensorFlow.
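The listing below is a minimal sketch of that calculation, not necessarily the code used to produce the output in this article. The function names `cosine_similarity_np` and `cosine_similarity_tf` are illustrative assumptions, and the `try`/`except` guard is only there so the NumPy half still runs where TensorFlow is not installed. Both functions scale each row to unit length and then take dot products, which gives the cosine of the angle between the rows.

```python
import numpy as np

try:
    import tensorflow as tf
except ImportError:  # assumption: let the NumPy half run without TensorFlow
    tf = None


def cosine_similarity_np(x, y):
    """Cosine similarity of every row of y against every row of x (NumPy)."""
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    # Dot products of unit-length rows are cosines; shape is (len(y), len(x)).
    return y_unit @ x_unit.T


def cosine_similarity_tf(x, y):
    """Same calculation with TensorFlow ops, which can be placed on a GPU."""
    x_unit = tf.math.l2_normalize(tf.constant(x), axis=1)
    y_unit = tf.math.l2_normalize(tf.constant(y), axis=1)
    return tf.matmul(y_unit, x_unit, transpose_b=True).numpy()


x = np.random.rand(5, 5)
y = np.random.rand(1, 5)

print("x:")
print(x)
print("y:")
print(y)
print("np:")
print(cosine_similarity_np(x, y))
if tf is not None:
    print("tf:")
    print(cosine_similarity_tf(x, y))
```

Normalizing first keeps the two versions structurally identical, so any timing difference comes from the library rather than from the algorithm.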
Running the above will produce output similar to the following. TensorFlow will also print many log messages. Make sure the "Found device 0" line, shown first below, appears among them, since it confirms that the GPU is being used.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
x:
[[0.16522373 0.55317429 0.61327354 0.74017118 0.54443702]
 [0.46154321 0.17925365 0.3785271  0.71921798 0.91126758]
 [0.10342192 0.96612358 0.06434406 0.85482426 0.7329657 ]
 [0.1333734  0.99817565 0.42731489 0.46437847 0.40758545]
 [0.43727089 0.91457894 0.21971166 0.24664127 0.61568784]]
y:
[[0.15979143 0.47441621 0.20148356 0.99541121 0.41767036]]
np:
The first two arrays are the 5x5 array x and the 1x5 array y. The output also shows the cosine similarity of y against each row in x, calculated with both NumPy and TensorFlow.
Let's try to time the calculations with more data.
For the first test, we’ll try two small arrays, calculating cosine similarity between a 1000x25 array for x and a 50x25 array for y.
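A timing harness along these lines could be used for the test. This is a sketch under stated assumptions: the helper `time_it` and the repeat count `REPEATS` are hypothetical and are not taken from the original benchmark, so absolute times will differ from the figures quoted in this article. The `try`/`except` guard only lets the NumPy half run where TensorFlow is missing.

```python
import time

import numpy as np

try:
    import tensorflow as tf
except ImportError:  # assumption: let the NumPy half run without TensorFlow
    tf = None

REPEATS = 100  # hypothetical repeat count, not the article's value


def cosine_similarity_np(x, y):
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    return y_unit @ x_unit.T


def cosine_similarity_tf(x, y):
    x_unit = tf.math.l2_normalize(tf.constant(x), axis=1)
    y_unit = tf.math.l2_normalize(tf.constant(y), axis=1)
    return tf.matmul(y_unit, x_unit, transpose_b=True).numpy()


def time_it(fn, x, y, repeats=REPEATS):
    """Return (elapsed seconds, last result) for `repeats` calls of fn(x, y)."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(x, y)
    return time.perf_counter() - start, result


x = np.random.rand(1000, 25)
y = np.random.rand(50, 25)

np_time, np_result = time_it(cosine_similarity_np, x, y)
print("np time =", np_time)

if tf is not None:
    # The TensorFlow timing includes the host-to-device copy made by tf.constant.
    tf_time, tf_result = time_it(cosine_similarity_tf, x, y)
    print("tf time =", tf_time)
    print("similarity output equal:", np.allclose(np_result, tf_result))
```

Comparing with `np.allclose` rather than exact equality allows for small floating-point differences between the CPU and GPU results.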
Running the above will produce the following output:
np time = 0.2509448528289795
tf time = 0.7871346473693848
similarity output equal: True
In this case, NumPy is faster than TensorFlow, even though TensorFlow is GPU enabled. As with all software development, there is no one-size-fits-all answer to a problem. NumPy still has its place, and it performs better with small data.
There is overhead in copying data from the CPU to the GPU and in building the graph that the GPU executes. As the data grows, that overhead becomes small relative to the overall run time. Let's try a larger array.
We’ll run almost the exact same code as above except we’ll increase the array size.
x = np.random.rand(10000, 25)
y = np.random.rand(50, 25)
For this test, we’ll use a 10000x25 array for x and a 50x25 array for y.
np time = 3.3217058181762695
tf time = 1.129739761352539
similarity output equal: True
TensorFlow is now about 3x faster than NumPy. At this size, the cost of copying data to the GPU and building the graph is outweighed by the faster computation.
Let's try with an even larger array.
x = np.random.rand(25000, 100)
y = np.random.rand(50, 100)
For this test, we’ll use a 25000x100 array for x and a 50x100 array for y.
np time = 23.823707103729248
tf time = 2.93641996383667
similarity output equal: True
TensorFlow is about 8x faster in this case. In these tests the gap widens as the data grows, and that trend should hold as long as the data set fits in GPU memory.
For non-negative data such as the random arrays used here, cosine similarity yields an array of floats from 0 to 1, with 1 being most similar and 0 being least. In general the value can range from -1 to 1. For most use cases, you’ll want the best-matching records along with their similarity scores. You can do that in both NumPy and TensorFlow as follows.
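The sketch below shows one way to do this, assuming a single query row y compared against every row of x. The function names `rank_np` and `rank_tf` are illustrative assumptions rather than the article's original code, and the `try`/`except` guard only lets the NumPy half run where TensorFlow is missing.

```python
import numpy as np

try:
    import tensorflow as tf
except ImportError:  # assumption: let the NumPy half run without TensorFlow
    tf = None


def rank_np(x, y):
    """Return (row indices of x, most similar first) and the similarity scores."""
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    sims = (y_unit @ x_unit.T)[0]      # scores for the single row in y
    order = np.argsort(sims)[::-1]     # argsort is ascending, so reverse it
    return order, sims


def rank_tf(x, y):
    """Same ranking with TensorFlow ops."""
    x_unit = tf.math.l2_normalize(tf.constant(x), axis=1)
    y_unit = tf.math.l2_normalize(tf.constant(y), axis=1)
    sims = tf.matmul(y_unit, x_unit, transpose_b=True)[0]
    order = tf.argsort(sims, direction="DESCENDING")
    return order.numpy(), sims.numpy()


x = np.random.rand(5, 5)
y = np.random.rand(1, 5)

order, sims = rank_np(x, y)
best = order[0]
print("np best index:", best, "similarity:", sims[best])
print("np best record:", x[best])

if tf is not None:
    order_tf, sims_tf = rank_tf(x, y)
    best_tf = order_tf[0]
    print("tf best index:", best_tf, "similarity:", sims_tf[best_tf])
    print("tf best record:", x[best_tf])
```

Taking `order[:k]` instead of `order[0]` would return the k best matches rather than only the top one.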
The above code gets a list of indices sorted by the highest cosine similarity. The top record is selected and printed.
Both NumPy and TensorFlow have their place. GPUs don’t just make everything faster in all cases. Use NumPy for smaller data sets and TensorFlow for larger sets. Always test on your own data to see what works best in your specific situation.