Super-charged similarity metric calculations

Featuring NumPy and TensorFlow


Similarity detection is a common technique for identifying items that share traits without being identical. Product recommendations and related-article suggestions are often driven by similarity metrics. Cosine similarity is the most popular of these metrics and is the one covered here. This article evaluates the performance of cosine similarity calculations in Python using NumPy and TensorFlow.

NumPy and TensorFlow

NumPy is a robust and mature library for working with large multi-dimensional matrices. NumPy has a rich collection of linear algebra functions. It’s well-tuned and runs very fast on CPUs.

TensorFlow is a math and machine learning library that can utilize both the CPU and GPU. TensorFlow is well proven in production and used to power advanced machine learning algorithms. It also has a rich collection of linear algebra functions. Most of what you can do with NumPy can also be done in TensorFlow.

Both are great libraries and both can compute similarity metrics between data sets.

Calculating cosine similarity

If they are not already installed, NumPy and TensorFlow can be installed via pip. This article uses NumPy 1.17.4 and TensorFlow 2.1.

pip install numpy
pip install tensorflow

The following sample code shows how to calculate cosine similarity in NumPy and TensorFlow.

NumPy and TensorFlow cosine similarity
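
The gist itself isn't embedded in this draft, so here is a minimal sketch of what the calculation can look like, assuming an l2_normalize-plus-matmul approach on the TensorFlow side (the variable names x and y match the printed output below, but the exact code is an assumption):

import numpy as np
import tensorflow as tf

# A 5x5 array of candidate rows and a single 1x5 query vector
x = np.random.rand(5, 5)
y = np.random.rand(1, 5)

# NumPy: cos(x_i, y) = (x_i . y) / (||x_i|| * ||y||)
np_sim = np.dot(x, y.T) / (
    np.linalg.norm(x, axis=1, keepdims=True) * np.linalg.norm(y, axis=1)
)

# TensorFlow: L2-normalize each row, then take dot products via matmul
x_tf = tf.math.l2_normalize(tf.constant(x), axis=1)
y_tf = tf.math.l2_normalize(tf.constant(y), axis=1)
tf_sim = tf.matmul(x_tf, y_tf, transpose_b=True)

print("x:\n", x)
print("y:\n", y)
print("np:\n", np_sim)
print("tf:\n", tf_sim.numpy())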

Running the above will produce output similar to the following. There will also be a lot of TensorFlow log messages; it's important to look for a line like the first one below, which confirms the GPU is being used.

I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: ...
x:
[[0.16522373 0.55317429 0.61327354 0.74017118 0.54443702]
[0.46154321 0.17925365 0.3785271 0.71921798 0.91126758]
[0.10342192 0.96612358 0.06434406 0.85482426 0.7329657 ]
[0.1333734 0.99817565 0.42731489 0.46437847 0.40758545]
[0.43727089 0.91457894 0.21971166 0.24664127 0.61568784]]
y:
[[0.15979143 0.47441621 0.20148356 0.99541121 0.41767036]]
np:
[[0.91509268]
[0.83738509]
[0.91553484]
[0.80027872]
[0.7071257 ]]
tf:
[[0.91509268]
[0.83738509]
[0.91553484]
[0.80027872]
[0.7071257 ]]

The first two arrays are the 5x5 array x and the 1x5 array y. Below them is the cosine similarity of y against each row of x, calculated first with NumPy (np) and then with TensorFlow (tf).

Let's try to time the calculations with more data.

Performance with small arrays

For the first test, we’ll try two small arrays, calculating cosine similarity between a 1000x25 array for x and a 50x25 array for y.

Timing for small data array
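
The timing gist isn't embedded here either; the following is a rough sketch under the same assumptions (the array sizes come from the text, the timing harness is mine):

import time
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 25)
y = np.random.rand(50, 25)

def np_cosine(x, y):
    # 1000x50 matrix: each row of x scored against each row of y
    return np.dot(x, y.T) / (
        np.linalg.norm(x, axis=1, keepdims=True) * np.linalg.norm(y, axis=1)
    )

def tf_cosine(x, y):
    x_n = tf.math.l2_normalize(tf.constant(x), axis=1)
    y_n = tf.math.l2_normalize(tf.constant(y), axis=1)
    return tf.matmul(x_n, y_n, transpose_b=True)

start = time.time()
np_result = np_cosine(x, y)
print("np time =", time.time() - start)

start = time.time()
tf_result = tf_cosine(x, y).numpy()
print("tf time =", time.time() - start)

print("similarity output equal:", np.allclose(np_result, tf_result))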

Running the above will produce the following output:

np time = 0.2509448528289795
tf time = 0.7871346473693848
similarity output equal: True

In this case, NumPy is faster than TensorFlow, even though TensorFlow is GPU enabled. As with all software development, there is no one-size-fits-all answer. NumPy still has its place, and it performs better on small data.

There is overhead in copying data from the CPU to the GPU, and overhead in building the GPU graph to execute. As the data grows, that overhead becomes negligible compared to the overall run time. You can see the warm-up cost directly in the small sketch below; after that, let's try a larger data array.
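
This sketch is only an illustration and not part of the original benchmark: the first call to a tf.function pays for graph tracing (and, on a GPU machine, data transfer), while later calls reuse the compiled graph.

import time
import numpy as np
import tensorflow as tf

@tf.function  # traces the computation into a graph on the first call
def tf_cosine(x, y):
    x_n = tf.math.l2_normalize(x, axis=1)
    y_n = tf.math.l2_normalize(y, axis=1)
    return tf.matmul(x_n, y_n, transpose_b=True)

x = tf.constant(np.random.rand(1000, 25))
y = tf.constant(np.random.rand(50, 25))

start = time.time()
tf_cosine(x, y)  # first call: tracing plus transfer overhead
print("first call =", time.time() - start)

start = time.time()
tf_cosine(x, y)  # later calls: just the computation
print("second call =", time.time() - start)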

Performance with large arrays

We’ll run almost exactly the same code as above, except with larger arrays.

x = np.random.rand(10000, 25)
y = np.random.rand(50, 25)

For this test, we’ll use a 10000x25 array for x and a 50x25 array for y.

np time = 3.3217058181762695
tf time = 1.129739761352539
similarity output equal: True

TensorFlow is now about 3x faster than NumPy. The overhead of copying data to the GPU and building the GPU graph is now outweighed by the gain in compute speed.

Let's try with an even larger array.

x = np.random.rand(25000, 100)
y = np.random.rand(50, 100)

For this test, we’ll use a 25000x100 array for x and a 50x100 array for y.

np time = 23.823707103729248
tf time = 2.93641996383667
similarity output equal: True

TensorFlow is 8x faster in this case. The performance gain will continue to grow rapidly as long as the data set can fit on the GPU.

Selecting the Top N indices

Calculating cosine similarity gives you an array of scores; for non-negative data like ours they fall between 0 and 1, with 1 being most similar and 0 least similar. For most use cases, you’ll want not just the scores but the best-matching records. You can do that in both NumPy and TensorFlow as follows.

Cosine similarity and selection to best match
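
Again, the gist isn't embedded in this draft; below is a minimal sketch of selecting the best match, reusing the same style of calculation (np.argsort on the NumPy side and tf.math.top_k on the TensorFlow side are my assumptions, not necessarily the original code):

import numpy as np
import tensorflow as tf

x = np.random.rand(10000, 25)
y = np.random.rand(1, 25)

# NumPy: compute similarities, then sort indices from most to least similar
np_sim = np.dot(x, y.T) / (
    np.linalg.norm(x, axis=1, keepdims=True) * np.linalg.norm(y, axis=1)
)
np_order = np.argsort(np_sim[:, 0])[::-1]
print("np best index:", np_order[0], "score:", np_sim[np_order[0], 0])

# TensorFlow: tf.math.top_k returns the k highest scores and their indices
x_n = tf.math.l2_normalize(tf.constant(x), axis=1)
y_n = tf.math.l2_normalize(tf.constant(y), axis=1)
tf_sim = tf.matmul(x_n, y_n, transpose_b=True)
scores, indices = tf.math.top_k(tf_sim[:, 0], k=1)
print("tf best index:", indices.numpy()[0], "score:", scores.numpy()[0])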

The above code gets the indices ordered from highest to lowest cosine similarity; the best-matching record is then selected and printed.

Conclusion

Both NumPy and TensorFlow have their place. GPUs do not automatically make everything faster. Use NumPy for smaller data sets and TensorFlow for larger ones, and always test on your own data to see what works best in your specific situation.
