Math operation across DataFrames - Python

I have two dataframes, let's call them df A and df B
A =
0 1
123 798
456 845
789 932
B =
0 1
321 593
546 603
937 205
Now I would like to operate on them together, not just a plain multiplication but an expression such as A - 1/B^2, applied element by element:
AB =
0 1
123-1/(321^2) 798-1/(593^2)
456-1/(546^2) 845-1/(603^2)
789-1/(937^2) 932-1/(205^2)
Now, I have figured I could loop through each row and each column and try some sort of
A[i][j]-1/(B[i][j]^2)
But when it goes up to a 1000x1000 matrix, it would take quite some time.
Is there any operation in pandas or numpy that allows this sort of cross-matrix operation? Not just multiplying one matrix by the other, but doing an arbitrary math operation between them.
Maybe calculate the divisor first into a new df B?
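For reference, a minimal sketch of the vectorized approach (frames rebuilt from the example above; note that in Python exponentiation is ** rather than ^). pandas aligns the two frames on their index and column labels and applies the arithmetic element-wise, so no explicit loop is needed:
import pandas as pd

A = pd.DataFrame([[123, 798], [456, 845], [789, 932]])
B = pd.DataFrame([[321, 593], [546, 603], [937, 205]])

# element-wise expression over aligned frames, no loop over rows/columns
AB = A - 1 / B**2
print(AB)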


Random-shuffling Python dataframe rows between categories

I am aware of the numpy.random.permutation method for conveniently shuffling the rows of a dataframe.
I want, however, the values of one column to be shuffled such that after the shuffling, identical values of this column are still associated with identical values of a second column. For instance here:
sid tid cluster_id coherence
0 484 367 0 (-0.7602504647007313-0.12366326038519604j)
1 485 367 0 (-0.7602504647007313-0.12366326038519604j)
2 227 2 1 (0.8285282150429198+0.007917196582272277j)
3 228 2 1 (0.8285282150429198+0.007917196582272277j)
4 488 245 2 (-0.5247187752391191+0.03756613687159624j)
5 489 245 2 (-0.5247187752391191+0.03756613687159624j)
6 76 504 3 (-0.5017704895797781-0.17508351848297674j)
7 59 545 3 (-0.37153924345882344-0.08026706090664427j)
I want to shuffle the values in the "coherence" column.
Right now, rows with identical tids also have the same coherence values.
This should also remain after the shuffling - but the coherence values should be assigned to new tids.
Hence a coherence value which was previously associated with a tid X will be
associated with a new tid Y, but all rows with this new tid Y should have this same coherence value.
Since I'm too lazy to reproduce your dataframe, I just use a toy case. What you want is to shuffle in a groupby:
import numpy as np
import pandas as pd

df = pd.DataFrame({'tid': [1, 1, 1, 2, 2, 2], 'others': [1, 2, 3, 4, 5, 6], 'coherence': [1, 2, 3, 4, 5, 6]})
df['coherence'] = df.groupby('tid').coherence.transform(np.random.permutation)
UPDATE
Okay, so I understood you wrong the first time: the previous answer shuffles within each group of tid, but you want to shuffle the groups themselves. Still, groupby is the solution; just shuffle the groups first:
import random
df = pd.DataFrame({'tid': [1, 1, 2, 2, 3, 3, 4, 4], 'val': [1, 2, 3, 4, 5, 6, 7, 8], 'coherence': [1, 1, 2, 2, 3, 3, 4, 4]})
groups = [g for _, g in df[['tid', 'coherence']].groupby('tid')]
random.shuffle(groups)
df[['tid', 'coherence']] = pd.concat(groups).reset_index(drop=True)
I hope this does it.
UPDATE
What you want is not clear from your question at all. Here's your solution:
df = pd.DataFrame({'tid':[1,1,2,2,3,3,4,4],'val':[1,2,3,4,5,6,7,8],'coherence':[1,1,2,2,3,3,4,4]})
tmp = df[['tid', 'coherence']].drop_duplicates()
tmp['coherence'] = np.random.permutation(tmp.coherence)
pd.merge(df, tmp, 'left', left_on='tid', right_on='tid')
coherence_x is the old one and coherence_y is the new one.
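If you assign the merge result, a small follow-up sketch to keep only the shuffled values under the original column name (the _x/_y suffixes are the pandas merge defaults):
out = pd.merge(df, tmp, 'left', on='tid')
out = out.drop(columns='coherence_x').rename(columns={'coherence_y': 'coherence'})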

Pandas: get pairs of rows with similar (the difference being within some bound) column values

I have a Pandas dataframe with 1M rows and 3 columns (TrackP, TrackPt, NumLongTracks), and I want to find pairs of 'matching' rows, such that for two 'matching' rows the differences between the values of column 1 (TrackP), column 2 (TrackPt) and column 3 (NumLongTracks) are each within some bound, i.e. no more than ±1:
TrackP TrackPt NumLongTracks
1 2801 544 102
2 2805 407 65
3 2802 587 70
4 2807 251 145
5 2802 543 101
6 2800 545 111
For this particular case you would only retain the pair row 1 and row 5, because for this pair
TrackP(row 1) - TrackP(row 5) = -1,
TrackPt(row 1) - TrackPt(row 5) = +1,
NumLongTracks(row 1) - NumLongTracks(row 5) = +1
This is trivial when the values are exactly the same between rows, but I'm having trouble figuring out the best way to do this for this particular case.
I think it is easier to handle the columns as a single value for comparison.
# new series: concatenate the three columns into a single string per row
# (this assumes 4 digits for TrackP, 3 for TrackPt, the rest for NumLongTracks)
tr = track.TrackP.astype(str) + track.TrackPt.astype(str) + track.NumLongTracks.astype(str)
# finding matching rows
matching = []
for r in tr:
    close = (int(r[0:4]) - 1, int(r[0:4]) + 1)    # TrackP range, 1 up/down
    ptRange = (int(r[4:7]) - 1, int(r[4:7]) + 1)  # TrackPt range
    nLRange = (int(r[7:]) - 1, int(r[7:]) + 1)    # NumLongTracks range
    for r2 in tr:
        if int(r2[0:4]) in close:              # TrackP in range
            if int(r2[4:7]) in ptRange:        # TrackPt in range
                if int(r2[7:]) in nLRange:     # NumLongTracks in range
                    matching.append([r, r2])
# back to the original format
# [['2801544102', '2802543101'], ['2802543101', '2801544102']]
import collections
routes = collections.defaultdict(list)
for seq in matching:
    routes['TrackP'].append(int(seq[0][0:4]))
    routes['TrackPt'].append(int(seq[0][4:7]))
    routes['NumLongTracks'].append(int(seq[0][7:]))
Now you can easily convert it back into a dataframe:
df = pd.DataFrame.from_dict(dict(routes))
print(df)
TrackP TrackPt NumLongTracks
0 2801 544 102
1 2802 543 101
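For completeness, a vectorized sketch of the bound check itself using NumPy broadcasting (only an illustration: pairwise broadcasting needs O(n^2) memory, so it suits moderate subsets rather than the full 1M rows; it also counts a difference of 0 as within the ±1 bound, per the question):
import numpy as np
import pandas as pd

track = pd.DataFrame({
    'TrackP': [2801, 2805, 2802, 2807, 2802, 2800],
    'TrackPt': [544, 407, 587, 251, 543, 545],
    'NumLongTracks': [102, 65, 70, 145, 101, 111]},
    index=range(1, 7))

vals = track.to_numpy()
# |row_i - row_j| <= 1 in every column, checked for all pairs at once
ok = (np.abs(vals[:, None, :] - vals[None, :, :]) <= 1).all(axis=2)
np.fill_diagonal(ok, False)  # drop self-matches
i, j = np.nonzero(ok)
pairs = [(track.index[a], track.index[b]) for a, b in zip(i, j) if a < b]
print(pairs)  # [(1, 5)]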

Finding top N values for each group, 200 million rows

I have a pandas DataFrame which has around 200 million rows and looks like this:
UserID MovieID Rating
1 455 5
2 411 4
1 288 2
2 300 3
2 137 5
1 300 3
...
I want to get top N movies for each user sorted by rating in descending order, so for N=2 the output should look like this:
UserID MovieID Rating
1 455 5
1 300 3
2 137 5
2 411 4
When I try to do it like this, I get a 'memory error' caused by the 'groupby' (I have 8 GB of RAM on my machine):
df.sort_values(by=['rating']).groupby('userID').head(2)
Any suggestions?
Quick and dirty answer
Given that the sort works, you may be able to squeak by with the following, which uses a NumPy-based, memory-efficient alternative to the Pandas groupby:
import numpy as np
import pandas as pd
from io import StringIO
d = '''UserID MovieID Rating
1 455 5
2 411 4
3 207 5
1 288 2
3 69 2
2 300 3
3 410 4
3 108 3
2 137 5
3 308 3
1 300 3'''
df = pd.read_csv(StringIO(d), sep=r'\s+', index_col='UserID')
df = df.sort_values(['UserID', 'Rating'])
# carefully handle the construction of ix to ensure no copies are made
ix = np.zeros(df.shape[0], np.int8)
np.subtract(df.index.values[1:], df.index.values[:-1], out=ix[:-1])
# the above assumes that UserID is the index of df. If it's just a column, use this instead
#np.subtract(df['UserID'].values[1:], df['UserID'].values[:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(np.bool_)
print(df.iloc[ix])
Output:
MovieID Rating
UserID
1 300 3
1 455 5
2 411 4
2 137 5
3 410 4
3 207 5
More memory efficient answer
Instead of a Pandas dataframe, for stuff this big you should just work with Numpy arrays (which Pandas uses for storing data under the hood). If you use an appropriate structured array, you should be able to fit all of your data into a single array roughly of size:
2 * 10**8 * (4 + 2 + 1) = 1,400,000,000 bytes, or ~1.304 GiB,
which means that it (and a couple of temporaries for calculations) should easily fit into your 8 GB system memory.
Here are some details:
The trickiest part will be initializing the structured array. You may be able to get away with manually initializing the array and then copying the data over:
dfdtype = np.dtype([('UserID', np.uint32), ('MovieID', np.uint16), ('Rating', np.uint8)])
arr = np.empty(df.shape[0], dtype=dfdtype)
arr['UserID'] = df.index.values
for n in dfdtype.names[1:]:
    arr[n] = df[n].values
If the above causes an out of memory error, from the start of your program you'll have to build and populate a structured array instead of a dataframe:
arr = np.empty(rowcount, dtype=dfdtype)
...
adapt the code you use to populate the df and put it here
...
Once you have arr, here's how you'd do the groupby you're aiming for:
arr.sort(order=['UserID', 'Rating'])
ix = np.zeros(arr.shape[0], np.int8)
np.subtract(arr['UserID'][1:], arr['UserID'][:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(np.bool_)
print(arr[ix])
The above size calculation and dtype assume that no UserID is larger than 4,294,967,295, no MovieID is larger than 65535, and no rating is larger than 255. This means that the columns of your dataframe can be (np.uint32, np.uint16, np.uint8) without losing any data.
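If you then want the (much smaller) result back as a dataframe, a one-line follow-up, assuming arr and ix from the snippet above:
result = pd.DataFrame(arr[ix])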
If you want to keep working with pandas, you can divide your data into batches - 10K rows at a time, for example. You can split the data either after loading the source data to the DF, or even better, load the data in parts.
You can save the results of each iteration (batch) into a dictionary keeping only the number of movies you're interested with:
{userID: {MovieID_1: score1, MovieID_2: s2, ... MovieID_N: sN}, ...}
and update the nested dictionary on each iteration, keeping only the best N movies per user.
This way you'll be able to analyze data much larger than your computer's memory.
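A minimal sketch of that batching idea, assuming the data can be streamed from a CSV file ('ratings.csv' is a hypothetical path) and using a simple concat-and-prune loop in place of the nested dictionary:
import pandas as pd

N = 2
top = None  # running top-N rows per user

# read the source in chunks so only a small slice is in memory at a time
for chunk in pd.read_csv('ratings.csv', chunksize=100_000):
    combined = chunk if top is None else pd.concat([top, chunk])
    # keep only the N best ratings per user seen so far
    top = (combined.sort_values('Rating', ascending=False)
                   .groupby('UserID')
                   .head(N))

print(top.sort_values(['UserID', 'Rating'], ascending=[True, False]))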

Calculating Precision and Recall from confusion matrix in python

I have a confusion matrix for 2 classes with pre-calculated totals in a pandas dataframe format:
Actual_class Predicted_class_0 Predicted_class_1 Total
0 0 39 73 112
1 1 52 561 613
2 All 91 634 725
I need to calculate precision and recall using a loop as I need a general case solution for more classes.
Precision for class 0 would be 39/91 and for class 1 would be 561/634.
Recall for class 0 would be 39/112 and for class 1 would be 561/613.
So I need to iterate over the diagonal and the totals to get the following result:
Actual_class Predicted_class_0 Predicted_class_1 Total Precision Recall
0 0 39 73 112 43% 35%
1 1 52 561 613 88% 92%
2 All 91 634 725
Totals (the All row and Total column) will be removed afterwards, so there is no need to compute anything for them.
I tried the following code, but it doesn't follow the diagonal and it loses the data for class 0:
cols = [c for c in cross_tab.columns if c.lower()[:4] == 'pred']
for c in cols:
    cross_tab["Precision"] = cross_tab[c]/cross_tab[c].iloc[-1]
for c in cols:
    cross_tab["Recall"] = cross_tab[c]/cross_tab['Total']
I'm a novice at pandas matrix operations and really need your help.
I'm sure there is a way to proceed without pre-calculating totals.
Thank you very much!!!
I found a solution using numpy diagonal:
import numpy as np
cols = [c for c in cross_tab.columns if c.lower()[:4] == 'pred' or c == 'Total']
denomPrecision = []
for c in cols:
    denomPrecision.append(cross_tab[c].iloc[-1])
# offset the diagonal by 1 to skip the leading Actual_class column
diag = np.diagonal(cross_tab.values, 1)
cross_tab["Precision"] = np.round(diag.astype(float)/denomPrecision*100,1)
cross_tab["Recall"] = np.round(diag.astype(float)/cross_tab.Total*100,1)

Grouping numerical values in pandas

In my DataFrame I have one column with numeric values, let's say distance. I want to find out which group (range) of distances has the biggest number of records (rows).
Doing simple:
df.distance.value_counts() returns:
74 1
90 1
94 1
893 1
889 1
885 1
877 1
833 1
122 1
545 1
What I want to achieve is something like buckets from histogram, so I am expecting output like this:
900 4 #all values < 900 and > 850
100 3
150 1
550 1
850 1
The one approach I've figured out so far, though I don't think it's the best or most optimal one, is to find the max and min values, divide the range by my step (50 in this case), and then loop over all the values, assigning each to the appropriate group.
Is there any other, better approach for that?
I'd suggest doing the following, assuming your value column is labeled val
import numpy as np
df['bin'] = df['val'].apply(lambda x: 50*np.floor(x/50))
You can then get the counts per bin with:
df.groupby('bin')['val'].count()
Thanks to EdChum's suggestion and based on this example, I've figured out that the best way (at least for me) is to do something like this:
import numpy as np
import pandas as pd
step = 50
#...
max_val = df.distance.max()
bins = list(range(0,int(np.ceil(max_val/step))*step+step,step))
clusters = pd.cut(df.distance,bins,labels=bins[1:])
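To get the histogram-like counts per bucket from that, a short follow-up:
print(clusters.value_counts().sort_index())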
