# Sklearn pairwise distance

What is the difference between scikit-learn's cosine similarity and its cosine distance?

From the source code documentation: cosine distance is defined as 1.0 minus the cosine similarity. So your result makes sense.

sklearn.metrics.pairwise_distances takes either a vector array or a distance matrix, and returns a distance matrix. If the input is a vector array, the distances are computed. If the input is a distance matrix, it is returned instead. This method provides a safe way to take a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array. If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.
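The two points above can be checked together in a short sketch. The data is made up for illustration, and the comparison uses `cosine_similarity` from `sklearn.metrics.pairwise` as the similarity function, which is an assumption about which function the question meant.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative data: 4 samples and 3 samples, each with 5 features.
rng = np.random.default_rng(0)
X = rng.random((4, 5))
Y = rng.random((3, 5))

# Cosine distance is 1.0 minus cosine similarity.
sim = cosine_similarity(X)
dist = pairwise_distances(X, metric="cosine")
cosine_matches = np.allclose(dist, 1.0 - sim)

# With Y given, the result has one row per sample in X
# and one column per sample in Y.
cross = pairwise_distances(X, Y, metric="cosine")
cross_shape = cross.shape
```

The comparison uses `np.allclose` rather than exact equality because the two code paths can differ by floating-point rounding.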

The metric parameter is the metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.

Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value is recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.
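A minimal sketch of the callable form follows. The metric `max_abs_diff` and the sample array are hypothetical and exist only to show the calling convention: the function receives two rows and returns one number.

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def max_abs_diff(a, b):
    """Hypothetical custom metric: largest absolute coordinate difference."""
    return np.max(np.abs(a - b))


X = np.array([[0.0, 0.0], [1.0, 3.0], [4.0, 1.0]])

# The callable is invoked on pairs of rows of X.
D = pairwise_distances(X, metric=max_abs_diff)
```

A Python callable is invoked once per pair, so it is usually much slower than a built-in string metric on large inputs.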

The n_jobs parameter is the number of jobs to use for the computation. None means 1 unless in a joblib context. Any further parameters are passed directly to the distance function. If using a scipy metric, see the scipy docs for usage examples.

Some metrics support sparse matrix inputs; the metrics from scipy do not. The second array Y is only allowed if the metric is not precomputed.
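The parameters described above can be tried out as follows. This is a sketch on made-up data; the choice of `minkowski` with `p=1` is only an example of a keyword argument being forwarded to a scipy metric.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import pairwise_distances

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Extra keyword arguments are forwarded to the distance function;
# here p goes to scipy's Minkowski metric (p=1 is city-block distance).
D_mink = pairwise_distances(X, metric="minkowski", p=1)

# n_jobs spreads the work over several workers; the result is unchanged.
D_par = pairwise_distances(X, metric="euclidean", n_jobs=2)

# "euclidean" is one of the metrics that accepts sparse input.
D_sparse = pairwise_distances(sparse.csr_matrix(X), metric="euclidean")
```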


I have a 1D array of numbers, and want to calculate all pairwise Euclidean distances. I have a method (thanks to SO) of doing this with broadcasting, but it's inefficient because it calculates each distance twice.
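The asker's own code did not survive in this page, so the following is only a sketch of what a broadcasting approach typically looks like, on made-up data. For 1D points the Euclidean distance is just the absolute difference.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0])  # illustrative data

# Broadcasting an (n, 1) column against a (1, n) row yields the full
# n x n matrix. Every off-diagonal distance is computed twice:
# once as D[i, j] and once as D[j, i].
D = np.abs(x[:, None] - x[None, :])
```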

And it doesn't scale well. Note: the matrix is symmetric, so I'm guessing that it's possible to get at least a 2x speedup by addressing that; I just don't know how.

Neither of the other answers quite answered the question: one was in Cython, one was slower. But both provided very useful hints. Following up on them suggests that scipy. I didn't try the Cython implementation (I can't use it for this project), but comparing my results to the other answer that did, it looks like scipy.

Here is a Cython implementation that gives more than 3X speed improvement for this example on my computer.

This timing should be reviewed for bigger arrays though, because the BLAS routines can probably scale much better than this rather naive code.

Here's an example that gives me what I want with an array of numbers.

There's a function for that: scipy. I dunno whether this is the fastest option, since it needs to have checks for multidimensional data, non-Euclidean norms, and other things, but it's built in.
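The function name is cut off in the answer above. Assuming it refers to `scipy.spatial.distance.pdist` (the function another question on this page names), a sketch of its use on made-up 1D data could look like this:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

x = np.array([1.0, 4.0, 9.0, 16.0])  # illustrative data

# pdist expects a 2-D array of shape (n_samples, n_features),
# so the 1D array is turned into a single-feature column.
condensed = pdist(x[:, None], metric="euclidean")

# condensed holds each pair once: n * (n - 1) / 2 values.
# squareform expands it to the symmetric n x n matrix when needed.
square = squareform(condensed)
```

Keeping the condensed form avoids storing each distance twice, which is the saving the question asks about.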

How fast do you need this to be? Remember that you need to populate millions of entries of output. That's almost half a gigabyte of pairwise distances.

If you follow the source code, in the end this is the function getting called. Not only is there no fancy optimization, but for 1D vectors it is squaring and taking the square root to compute the absolute value. Probably worse than the OP's code for his particular use case.
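For the 1D case discussed above, one possible alternative is to take absolute differences over the upper triangle only. This is a sketch, not a benchmarked recommendation; the helper names `pairwise_abs_1d` and `condensed_bytes` are made up here.

```python
import numpy as np


def pairwise_abs_1d(x):
    """Condensed |x[i] - x[j]| for i < j, computed once per pair."""
    i, j = np.triu_indices(len(x), k=1)
    return np.abs(x[i] - x[j])


def condensed_bytes(n, itemsize=8):
    """Bytes needed to store the n * (n - 1) / 2 distances of n points."""
    return n * (n - 1) // 2 * itemsize


x = np.array([1.0, 4.0, 9.0, 16.0])  # illustrative data
condensed = pairwise_abs_1d(x)
```

Note that `np.triu_indices` allocates two index arrays as long as the output, so this trades extra memory for skipping the square and square root.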

I am currently trying various methods (correlation, mutual information, and distance correlation) to find the strength of the relationship between the variables in X and the dependent variable in y. Correlation is the fastest and simplest (1 hour on a sample of 3 million records and variables).

Mutual information calculation takes approximately 16 hours. I am also looking at distance correlation because of its interesting property: the distance correlation between Xi and Y is zero if and only if they are independent.

However, I am facing a problem while doing the calculation in Python. I want to capture the distance correlation of each variable in X with y and store it in a dataframe. How can I get the distance correlation between each Xi and y in Python? Can someone please help me with this? Update: I tried the approach of repeating the columns of y as per X.

I find that it works super quickly for big datasets. But note that this returns a correlation distance; there are also a bunch of different metrics that you can use, as well as custom metrics.
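As a point of comparison, distance correlation can be computed per column directly from its definition with double-centered distance matrices. This is a sketch in plain NumPy on made-up data; the function names and column names are hypothetical, and it is not taken from any answer in the thread.

```python
import numpy as np


def _centered_distances(v):
    """Double-centered matrix of pairwise |v[i] - v[j]|."""
    v = np.asarray(v, dtype=float)
    d = np.abs(v[:, None] - v[None, :])
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation between two 1D arrays of equal length."""
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2 = (A * B).mean()                      # squared distance covariance
    dvar2 = (A * A).mean() * (B * B).mean()     # product of squared variances
    if dvar2 == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


rng = np.random.default_rng(0)
y = rng.normal(size=200)
columns = {"scaled": 2.0 * y + 1.0, "noise": rng.normal(size=200)}

# One score per variable, each compared against y on its own.
scores = {name: distance_correlation(col, y) for name, col in columns.items()}
```

Each call builds two n x n matrices, so this only works on a subsample when the data has millions of records. The resulting dict can be passed to `pandas.Series` if a pandas object is wanted.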

More details are on the docs page.

Are you sure that you have computed what you wanted? It seems that scipy computes a distance based on Pearson correlation using this method.

Does this require an equal number of features in both X and Y?
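The comment above can be checked on made-up data. Assuming the method in question is scipy's `"correlation"` metric (the answer's code is missing from this page), the sketch below shows that it returns 1 minus the Pearson correlation, which is not the same quantity as distance correlation.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 records, 3 variables
y = X[:, 0] * 2.0 + rng.normal(size=100)      # illustrative target

# cdist compares rows, so each variable must be a row:
# X.T has shape (3, 100) and y[None, :] has shape (1, 100).
corr_dist = cdist(X.T, y[None, :], metric="correlation").ravel()

# Pearson correlation of each column of X with y, for comparison.
pearson = np.array([np.corrcoef(X[:, k], y)[0, 1] for k in range(X.shape[1])])
```

Transposing also addresses the shape question: after the transpose, both inputs have the records as their "features", so the lengths agree without repeating y.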

Can't you just duplicate y for each of the x columns, or do the comparisons one column at a time?

This documentation is for scikit-learn version 0.

## sklearn.metrics.pairwise_distances

This method takes either a vector array or a distance matrix, and returns a distance matrix. If the input is a vector array, the distances are computed. If the input is a distance matrix, it is returned instead. This method provides a safe way to take a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array.
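The "returned instead" behaviour can be seen with `metric="precomputed"`. A small sketch on made-up points:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Vector array in, distance matrix out.
D = pairwise_distances(X, metric="euclidean")

# Distance matrix in: with metric="precomputed" it is handed back.
D_again = pairwise_distances(D, metric="precomputed")
```

This is what lets an algorithm accept either raw features or a ready-made distance matrix through the same call.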

If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y. Please note that support for sparse matrices is currently limited to those metrics listed in pairwise. The metric parameter is the metric to use when calculating distance between instances in a feature array.

If metric is a string, it must be one of the options allowed by scipy. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value is recorded.

The callable should take two arrays from X as input and return a value indicating the distance between them. The n_jobs parameter is the number of jobs to use for the computation. If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. Any further parameters are passed directly to the distance function. If using a scipy metric, see the scipy docs for usage examples.


I have a very large scipy sparse CSR matrix. Let's call it X. Each row is a sample vector in a high-dimensional space. I need to calculate the cosine distances between each pair of samples very efficiently.

I need the result in condensed form, that is, the output of scipy's pdist function. I have memory limitations, so I can't calculate the square form and then get the condensed form. I also cannot use scipy pdist, as it requires a dense matrix X, which again does not fit in memory.

I thought about looping through different chunks of X, calculating the condensed form for each chunk, and joining them together to get the complete condensed form, but this is relatively cumbersome. Any better ideas? Below is a reproducible example (of course, for demonstration purposes X is much smaller).

As you see, dist2 is in the condensed form and is a vector.

But dist1 is in the symmetric square form and is a square matrix.

I can produce the same square array by dividing the xy dot product by the appropriate norm outer product. Or by normalizing X before taking the dot product.
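Both routes described in the answer can be sketched on a small sparse matrix. The data is made up, and the first column is forced to be nonzero so that no row has zero norm. The variable names `dist_a` and `dist_b` are hypothetical; `dist1` follows the question's naming.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
dense = rng.random((20, 50)) * (rng.random((20, 50)) < 0.2)
dense[:, 0] = 1.0                    # guarantee no all-zero rows
X = sparse.csr_matrix(dense)

# Reference: square cosine distance matrix from scikit-learn.
dist1 = pairwise_distances(X, metric="cosine")

# Route 1: divide the dot products by the outer product of the row norms.
dots = (X @ X.T).toarray()
norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
dist_a = 1.0 - dots / np.outer(norms, norms)

# Route 2: normalize the rows first, then take the dot product.
Xn = normalize(X)                    # L2 norm per row by default
dist_b = 1.0 - (Xn @ Xn.T).toarray()
```

Route 2 keeps the intermediate product sparse until the final conversion, which matters more as X grows.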

This appears to be what the scikit version does. That means you need to understand how the sparse matrix multiplication is implemented in C code. The scikit Cython 'fast sparse' code might also give ideas. It does not attempt to save time or space by implementing triangular calculations directly. It's easier to iterate over the rectangular layout of an nd array (with shape and strides) than to do the more complex variable-length steps of a triangular array.

The condensed form only cuts the space and calculation steps by half.
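The chunking idea from the question can be sketched as follows. This is one possible way to do it, not the approach of any answer in the thread; the function name `condensed_cosine` and the `chunk_size` parameter are made up here, and rows are assumed to be nonzero.

```python
import numpy as np
from scipy import sparse
from scipy.spatial.distance import pdist
from sklearn.preprocessing import normalize


def condensed_cosine(X, chunk_size=256):
    """Condensed cosine distances for the rows of X, built block by block."""
    n = X.shape[0]
    Xn = normalize(sparse.csr_matrix(X))          # unit-length rows
    out = np.empty(n * (n - 1) // 2)
    pos = 0
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Similarities of rows [start, stop) against rows [start, n).
        block = (Xn[start:stop] @ Xn[start:].T).toarray()
        for r in range(stop - start):
            # Global row start + r pairs with rows start + r + 1 .. n - 1,
            # which are columns r + 1 onward of this block.
            vals = 1.0 - block[r, r + 1:]
            out[pos:pos + vals.size] = vals
            pos += vals.size
    return out


rng = np.random.default_rng(0)
dense = rng.random((30, 12)) + 0.1                # small, strictly positive demo data
result = condensed_cosine(sparse.csr_matrix(dense), chunk_size=7)
reference = pdist(dense, metric="cosine")
```

Only one dense block of at most `chunk_size` rows exists at a time, but the output vector still has n(n-1)/2 entries and has to fit somewhere, in memory or in a memory-mapped file.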


I'm doing some behavior analysis where I track behaviors over time and then create n-grams of those behaviors. I want to be able to cluster these n-grams, but I need to create a precomputed distance matrix using a custom metric. My metric appears to work fine, but when I try to create the distance matrix using the sklearn function, I get an error.

If you can convert the strings to numbers (encode each string as a specific number) and then pass them, it will work properly.
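One way to apply the suggestion above is to encode each n-gram by its position in a list and let the metric look the strings up again. This is a sketch: the n-grams and the metric `ngram_distance` are hypothetical stand-ins for the asker's data and custom metric, which are not shown on this page.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical n-grams of behaviors.
ngrams = [("walk", "sit"), ("walk", "run"), ("eat", "sleep")]


def ngram_distance(a, b):
    """Hypothetical metric: fraction of positions where two n-grams differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Encode each n-gram by its index so that sklearn only sees numbers.
indices = np.arange(len(ngrams)).reshape(-1, 1)


def index_metric(i, j):
    """Map the numeric rows back to n-grams and apply the real metric."""
    return ngram_distance(ngrams[int(i[0])], ngrams[int(j[0])])


D = pairwise_distances(indices, metric=index_metric)
```

The resulting matrix can then be given to a clustering estimator that accepts precomputed distances.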