The pandas corr function is really fast. Even for a table with 1000 columns x 200 rows it doesn't take more than a minute to calculate a cross-correlation matrix.
I was trying to use a for loop to compute the slope between each pair of columns and it was taking much longer. I went back and tried just computing the correlation in a loop, and that was also taking a long time.
So I guess I have a two-part question: how does the pandas corr function work so fast, and is there a similar method to compute the pairwise slope between each pair of columns?
Here is an example:
a b
0 0
1 1
In that case slope(a,b)=1
a b
0 0
1 2
In that case slope(a,b)=2
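One hedged sketch, assuming "slope" means the ordinary least-squares slope of column j regressed on column i: that slope equals cov(i, j) / var(i), so the whole slope matrix falls out of df.cov() in one vectorized step, which is also essentially why corr is fast (it is a single matrix computation over all columns rather than a Python-level loop over pairs):

import numpy as np
import pandas as pd

# toy data roughly the size mentioned above: 200 rows x 1000 columns
df = pd.DataFrame(np.random.rand(200, 1000))

# slope of column j regressed on column i = cov(i, j) / var(i),
# so divide each row of the covariance matrix by the variance on its diagonal
cov = df.cov()
slopes = cov.div(np.diag(cov), axis=0)   # slopes.loc[i, j] = slope of j on i

With the two tiny tables above this gives slope(a, b) = 1 for the first and slope(a, b) = 2 for the second.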
I have a CSV file with 140K rows and I am working with the pandas library.
The problem is that I have to compare each row with every other row, and it's taking too much time.
At the same time, I am creating another column where I append data for each row based on the comparison. Here I am getting a memory error.
What is the optimal solution, at least for the memory error?
I am working with 12 GB of RAM on Google Colaboratory.
Dataframe sample:
ID x_coordinate y_coordinate
1 2 3
2 3 4
............
X 1 5
Now, I need to find the distance from each row to every other row, and if the distance is within a certain threshold, I assign a new id to those two rows. So, if ID 1 and ID 2 are within that distance, I assign a to both. And if ID 2 and ID X are within that distance, I assign b as a new matched id, like below:
ID x_coordinate y_coordinate Matched ID
1 2 3 [a]
2 3 4 [a, b]
............
X 1 5 [b]
For the distance I am using sqrt((x_i - x_j)^2 + (y_i - y_j)^2).
The threshold can be anything, say m units.
This reads like you are attempting to hold the complete square distance matrix in memory, which obviously doesn't scale very well, as you have noticed.
I'd suggest you read up on how DBSCAN clustering approaches the problem, compared to e.g. hierarchical clustering:
https://en.wikipedia.org/wiki/DBSCAN#Complexity
Instead of computing all the pairwise distances at once, they seem to put the data into a spatial database (for efficient neighborhood queries with a threshold) and then iterate over the points to identify the neighbors and the relevant distances on the fly.
Unfortunately I can't point you to readily available code or pandas functionality to support this though.
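For illustration, a rough sketch of that neighborhood-query idea using scipy.spatial.cKDTree (scipy rather than pandas, and plain integer pair ids instead of a, b — both assumptions about what would be acceptable here):

import pandas as pd
from scipy.spatial import cKDTree   # assumes scipy is available

# toy data in the shape of the question
df = pd.DataFrame({'ID': [1, 2, 3],
                   'x_coordinate': [2, 3, 1],
                   'y_coordinate': [3, 4, 5]})
m = 2.0   # the distance threshold ("m units")

# query all pairs closer than m without ever building the full distance matrix
tree = cKDTree(df[['x_coordinate', 'y_coordinate']].to_numpy())
pairs = tree.query_pairs(r=m)

# give each matching pair its own id and collect the ids per row
matched = [[] for _ in range(len(df))]
for new_id, (i, j) in enumerate(sorted(pairs)):
    matched[i].append(new_id)
    matched[j].append(new_id)
df['Matched ID'] = matched
print(df)

The KD-tree keeps the neighborhood queries cheap, and only the matching pairs are ever materialized, so memory grows with the number of matches rather than with 140K x 140K.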
I have many n-by-2 matrices, like so:
|Distance|Value|Distance.1|Value.1|...|
|:------:|:---:|:--------:|:-----:|:-:|
|-15|1|-15|2|...|
|-14.9|3|-14.995|4|...|
|-14.8|4|-14.992|2|...|
|...|...|...|...|...|
|15|3|8.959|2|...|
| | |...|...|...|
| | |15.048|3|...|
The Distance columns all start at -15 and all finish around +15 ± 0.05.
My goal is to compute the mean of the Value columns and then plot that mean as a function of the distance. What I am stuck on is how to 'align' all of the distances, like so:
|Distance|Value|Distance.1|Value.1|Mean of Values|
|:------:|:---:|:--------:|:-----:|:------------:|
|-15|1|-15|2|mean value at this distance|
|NaN|NaN|-14.995|4|mean value at this distance|
|NaN|NaN|...|...|mean value at this distance|
|-14.8|4|-14.8|2|mean value at this distance|
|...|...|...|...|mean value at this distance|
|NaN|NaN|14.995|2|mean value at this distance|
|15|2|15|3|mean value at this distance|
At the moment I am just plotting all the values against the first Distance column, which means that all the data points in columns longer than the first matrix are lost.
Here is some code. I know that this could probably be solved in another language than Python, or even without pandas, but I am using them in my project to clean and filter the data beforehand.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df0 = pd.DataFrame({'Distance': np.linspace(-15, 14.969, 100),
                    'Value': np.random.rand(100)})
df1 = pd.DataFrame({'Distance.1': np.linspace(-15, 15.034, 500),
                    'Value.1': np.random.rand(500)})
df = pd.concat([df0, df1], ignore_index=False, axis=1)

# row-wise mean of every column whose name contains 'Value'
df['mean'] = df.filter(regex='Value').mean(axis=1)
df.plot(x='Distance', y='mean')
plt.show()
So with the code above, the distances are not aligned. I see two ways of solving the problem:
1) Somehow add NaNs to fill in the missing values in the shorter columns, while aligning the distances that appear in all the Distance columns.
2) Somehow average out the values in the longer columns: say for every distance in the short column there are 3 distances in the long column, take the average of those 3 values and replace the 3 entries with that average.
I think that, statistically, the second method is better, because with the first method the values that only appear in the longer columns will be overrepresented in the mean compared to the values at distances that appear in all the columns.
I'm guessing that there is a very clever way to use a lambda expression in combination with pandas' DataFrame.apply, but I'm really not sure how to do it. (How do I get the lambda expression to look at all the columns except the longest, or vice versa? How do I add values to certain columns and not others?)
I've come up with this so far:
For each row, compare the value in the distance column to the distance value of the longest distance column. If it's the same, go to the next column; otherwise somehow insert NaN values and shift all of the values in the rows below down, except those in the longest matrix (the longest two columns).
Any help is appreciated. I'm sure that I am not the first person to encounter this problem, but I am really not sure how to phrase it well enough to be able to google it.
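A rough sketch of the second idea, under the assumption that the first matrix's distances are an acceptable common grid (the bin edges, the bin_values helper and the use of pd.cut are illustrative choices, not the only way to do this):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df0 = pd.DataFrame({'Distance': np.linspace(-15, 14.969, 100),
                    'Value': np.random.rand(100)})
df1 = pd.DataFrame({'Distance.1': np.linspace(-15, 15.034, 500),
                    'Value.1': np.random.rand(500)})

# common grid = the first matrix's distances (an assumption),
# with bin edges halfway between consecutive grid points
grid = df0['Distance'].to_numpy()
edges = np.concatenate(([-np.inf], (grid[:-1] + grid[1:]) / 2, [np.inf]))

def bin_values(dist, val):
    # assign every sample to its nearest grid point, then average inside each bin
    labels = pd.cut(dist, bins=edges, labels=grid)
    return val.groupby(labels, observed=False).mean()

binned = pd.DataFrame({'Value': bin_values(df0['Distance'], df0['Value']),
                       'Value.1': bin_values(df1['Distance.1'], df1['Value.1'])})
binned['mean'] = binned.mean(axis=1)   # row-wise mean across the aligned Value columns
binned.index = binned.index.astype(float)
binned['mean'].plot()
plt.show()

This averages the extra samples of the longer columns inside each bin before they enter the overall mean, which is exactly the overrepresentation concern behind the second method.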
I'm new to Python, and I'm having problems calculating correlation coefficients for multiple participants.
I've got a dataframe just like this:
|Index|Participant|Condition|ReactionTime1|ReactionTime2|
|:---:|:---------:|:-------:|:-----------:|:-------------:|
|1|1|A|320|542|
|2|1|A|250|623|
|3|1|B|256|547|
|4|1|B|301|645|
|5|2|A|420|521|
|6|2|A|123|456|
|7|2|B|265|362|
|8|2|B|402|631|
I am wondering how to calculate the correlation coefficient between ReactionTime1 and ReactionTime2 for Participant 1 and for Participant 2 in each condition. My real dataset is way bigger than this (hundreds of reaction times for each participant, and a lot of participants too). Is there a general way to calculate this and put the coefficients in a new df like this?
|Index|Participant|Condition|Correlation coeff|
|:---:|:---------:|:-------:|:-----------:|
|1|1|A|?|
|2|1|B|?|
|3|2|A|?|
|4|2|B|?|
Thanks :)
You can try groupby and apply with np.corrcoef, and reset_index afterwards:
import numpy as np

result = (df.groupby(["Participant", "Condition"])
            .apply(lambda gr: np.corrcoef(gr["ReactionTime1"], gr["ReactionTime2"])[0, 1])
            .reset_index(name="Correlation coeff"))
which gives
Participant Condition Correlation coeff
0 1 A -1.0
1 1 B 1.0
2 2 A 1.0
3 2 B 1.0
We use [0, 1] on the return value of np.corrcoef because it returns a symmetric matrix whose diagonal elements are normalized to 1 and whose off-diagonal elements are equal, each giving the desired coefficient (so we might as well index with [1, 0]). That is,
array([[1. , 0.25691558],
[0.25691558, 1. ]])
is an example returned value and we are interested in the off-diagonal entry.
Why it returned all +/- 1 in your case: since each participant & condition pair only has 2 entries for each reaction time, they are always perfectly correlated, and the sign is determined by their orientation, i.e. whether the second reaction time increases or decreases when the first one increases.
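If you prefer to stay within pandas, Series.corr should give the same numbers; a minor variation on the same groupby/apply pattern:

result = (df.groupby(["Participant", "Condition"])
            .apply(lambda gr: gr["ReactionTime1"].corr(gr["ReactionTime2"]))
            .reset_index(name="Correlation coeff"))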
I have this dataframe. I would like to find a way to make a correlation matrix between an hour and the same hour of the day before (for example H01 of 28/09 vs H01 of 27/09).
I thought about two different approaches:
1) Compute the correlation matrix of the transposed dataframe.
dft = df.transpose()
dft.corr()
2) Create a copy of the dataframe lagged by one day (one row) and then use .corrwith() to compare them.
With the first approach I get weird results (for example, rows like 634 and 635 show low correlation even though their values are very similar), and with the second approach I get all ones. Ideally, I'm looking for the correlation between days that are close to each other. Send help please.
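A minimal sketch of the second approach, assuming the layout is one row per day and one column per hour (H01 ... H24); the key point is that the copy really has to be lagged by one row before calling .corrwith():

import numpy as np
import pandas as pd

# assumed layout: one row per day, one column per hour
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 24)),
                  columns=[f"H{h:02d}" for h in range(1, 25)])

# pair each day with the previous day, then correlate column by column
today = df.iloc[1:].reset_index(drop=True)
yesterday = df.iloc[:-1].reset_index(drop=True)
lag1_corr = today.corrwith(yesterday)   # one coefficient per hour column
print(lag1_corr)

If the copy is not actually shifted, each column is compared with itself, which would explain the all-ones result from the second attempt.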
Presented as an example.
Two data sets. One collected over a 1 hour period. One collected over a 20 min period within that hour.
Each data set contains instances of events that can be transformed into single columns of true (-) or false (_), representing whether the event is occurring or not.
DS1.event:
_-__-_--___----_-__--_-__---__
DS2.event:
__--_-__--
I'm looking for a way to automate the correlation (correct me if the terminology is incorrect) of the two data sets and find the offset(s) into DS1 at which DS2 is most (top x many) likely to have occurred. This will probably end up with some matching percentage that I can then threshold to determine the validity of the match.
Such that
_-__-_--___----_-__--_-__---__
                 __--_-__--
DS1.start + 34min ~= DS2.start
Additional information:
DS1 was recorded at roughly 1 Hz. DS2 at roughly 30 Hz. This makes it less likely that there will be a 100% clean match.
Alternate methods (to pandas) will be appreciated, but python/pandas are what I have at my disposal.
Sounds like you just want something like a cross-correlation?
I would first convert the strings to a numeric representation, so replace your - and _ with 1 and 0.
You can do that using a string's replace method (e.g. signal.replace("-", "1")).
Convert them to a list or a numpy array:
event1 = [int(x) for x in signal1]
event2 = [int(x) for x in signal2]
Then calculate the cross correlation between them:
xcor = np.correlate(event1, event2, "full")
That will give you the cross correlation value at each time lag. You just want to find the largest value, and the time lag at which it happens:
nR = max(xcor)
maxLag = np.argmax(xcor) - (len(event2) - 1)   # I imported numpy as np here; with "full", the peak index is offset by len(event2) - 1, so subtract it to get the actual lag
Giving you something like:
Cross correlation value: 5
Lag: 20
It sounds like you're more interested in the lag value here. What the lag tells you is essentially how many time/positional shifts are required to get the maximum cross-correlation value (degree of match) between your 2 signals.
You might want to take a look at the docs for np.correlate and np.convolve to determine the mode (full, same, or valid) you want to use, as that's determined by the length of your data and what you want to happen if your signals are different lengths.
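Putting the pieces together with the strings from the question (a sketch only; it ignores the 1 Hz vs 30 Hz difference, which would need resampling to a common rate first), and encoding the signals as +1/-1 rather than 1/0 so that offsets where DS2's quiet periods land on DS1's active periods are penalised:

import numpy as np

ds1 = "_-__-_--___----_-__--_-__---__"
ds2 = "__--_-__--"

# encode true as +1 and false as -1 so mismatches count against an offset
event1 = np.array([1 if c == "-" else -1 for c in ds1])
event2 = np.array([1 if c == "-" else -1 for c in ds2])

# cross-correlate over every possible shift
xcor = np.correlate(event1, event2, "full")

# convert the peak index of the "full" output into an offset of DS2 into DS1
offset = int(np.argmax(xcor)) - (len(event2) - 1)
score = xcor.max() / len(event2)   # 1.0 would be a perfect match

print("offset of DS2 into DS1 (samples):", offset)
print("match score:", score)

With these strings the peak is at an offset of 17 samples with a score of 1.0, which lines up with the DS1.start + 34min example above (at roughly 2 minutes per character), and the score gives the kind of matching percentage that can then be thresholded.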