Fast conversion to multiindexed pandas dataframe using bincounts - python

I have data from users who have left star ratings (1, 2 or 3 stars) on items in various categories, where each item may belong to multiple categories. In my current dataframe, each row represents a rating and the categories are one-hot encoded, like so:
import numpy as np
import pandas as pd
df_old = pd.DataFrame({
    'user': [1, 1, 2, 2, 2],
    'rate': [3, 2, 1, 1, 2],
    'cat1': [1, 0, 1, 1, 1],
    'cat2': [0, 1, 0, 0, 1]
})
#    user  rate  cat1  cat2
# 0     1     3     1     0
# 1     1     2     0     1
# 2     2     1     1     0
# 3     2     1     1     0
# 4     2     2     1     1
I want to convert this to a new dataframe, multiindexed by user and rate, which shows the per-category bincounts for each star rating. I'm currently doing this with loops:
multi_idx = pd.MultiIndex.from_product(
    [df_old.user.unique(), range(1, 4)],
    names=['user', 'rate']
)
df_new = pd.DataFrame(  # preallocate in an attempt to speed up the code
    {'cat1': np.nan, 'cat2': np.nan},
    index=multi_idx
)
df_new.sort_index(inplace=True)
idx = pd.IndexSlice
for uid in df_old.user.unique():
    for cat in ['cat1', 'cat2']:
        df_new.loc[idx[uid, :], cat] = np.bincount(
            df_old.loc[(df_old.user == uid) & (df_old[cat] == 1),
                       'rate'].values, minlength=4)[1:]
#            cat1  cat2
# user rate
# 1    1      0.0   0.0
#      2      0.0   1.0
#      3      1.0   0.0
# 2    1      2.0   0.0
#      2      1.0   1.0
#      3      0.0   0.0
Unfortunately the above code is hopelessly slow on my real dataframe, which is long and contains many categories. How can I eliminate the loops please?

With your multi-index, you can aggregate your old data frame, and reindex it:
df_old.groupby(['user', 'rate']).sum().reindex(multi_idx).fillna(0)
Or, as @piRSquared commented, do the reindex and fill the missing values in one step:
df_old.groupby(['user', 'rate']).sum().reindex(multi_idx, fill_value=0)
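For the sample data above, a quick check shows this reproduces the desired df_new (as integer counts rather than floats):
df_new = (df_old.groupby(['user', 'rate'])
                .sum()
                .reindex(multi_idx, fill_value=0))
print(df_new)
#            cat1  cat2
# user rate
# 1    1        0     0
#      2        0     1
#      3        1     0
# 2    1        2     0
#      2        1     1
#      3        0     0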

Related

Compare a DataFrame to itself? pandas

I have a dataframe with week number as int, item name, and ranking.
For instance:
item_name ranking week_number
0 test 4 1
1 test 3 2
I'd like to add a new column with the ranking evolution since the last week.
The math is very simple:
df['ranking_evolution'] = ranking_previous_week - df['ranking']
It would only require exception handling for week 1.
But I'm not sure how to return the ranking previous week.
I could do it by iterating over the rows but I'm wondering if there is a cleaner way so I can just declare a column?
The issue is that I'd have to compare the dataframe to itself.
I've candidly tried:
df['ranking_evolution'] = df['ranking'].loc[(df['item_name'] == df['item_name']) & (df['week_number'] == df['week_number'] - 1)] - df['ranking']
But this return NaN values.
Even using a copy returned NaN values.
I assume this is a simplified example; you probably have different products and maybe missing weeks?
A robust way would be to perform a self-merge with the week+1:
(df.merge(df.assign(week_number=df['week_number'] + 1),
          on=['item_name', 'week_number'],
          suffixes=(None, '_evolution'),
          how='left')
   .assign(ranking_evolution=lambda d: d['ranking_evolution'].sub(d['ranking']))
)
Output:
item_name ranking week_number ranking_evolution
0 test 4 1 NaN
1 test 3 2 1.0
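A shift-based sketch gives the same result without the self-merge (note this takes the previous available week for each item, whereas the merge above requires exactly the previous week number):
prev = (df.sort_values('week_number')
          .groupby('item_name')['ranking']
          .shift(1))                        # previous week's ranking per item
df['ranking_evolution'] = prev - df['ranking']
# week 1 rows get NaN; week 2 of 'test' gets 4 - 3 = 1.0, matching the merge output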
In short, try this code to see the trick.
import pandas as pd

data = {
    'item_name': ['test', 'test', 'test', 'test', 'test', 'test', 'test', 'test', 'test', 'test'],
    'ranking': [4, 3, 2, 1, 2, 3, 4, 5, 6, 7],
    'week_number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
df = pd.DataFrame(data)
df['ranking_evolution'] = df['ranking'].diff(-1)  # this is the line that does the trick
print(df)
Results
item_name ranking week_number ranking_evolution
test 4 1 1.0
test 3 2 1.0
test 2 3 1.0
test 1 4 -1.0

how to apply multiplication within pandas dataframe

Please advise how to get the following output:
df1 = pd.DataFrame([['1, 2', '2, 2','3, 2','1, 1', '2, 1','3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])
import numpy as np
df22 = df2.rename(index = lambda x: x + 1).set_axis(np.arange(1, len(df2.columns) + 1), inplace=False, axis=1)
f = lambda x: df22.loc[tuple(map(int, x.split(',')))]
df = df1.applymap(f)
print (df)
Output:
0 1 2 3 4 5
0 2 4 6 1 3 5
df1 holds 'addresses' into df2 in row, col format (1,2 is the first row, second column, which is 2; 2,2 is 4; 3,2 is 6, and so on).
I need to bring in the values from the 3rd and 4th columns to get something like (2*100x, 4*200y, 6*300x, 1*100x, 3*200y, 5*300x).
The output should be 5000 (the sum of the x's and y's) and 0.28 (1400/5000, the share of y's).
It's not clear to me why you need df1 and df... Maybe your question is lacking some details?
You can compute your values directly:
df22['val'] = (df22[1] + df22[2])*df22[3]
Output:
1 2 3 4 val
1 1 2 100 x 300
2 3 4 200 y 1400
3 5 6 300 x 3300
From there it's straightforward to compute the sums (total and grouped by column 4):
total = df22['val'].sum() # 5000
y_sum = df22.groupby(4).sum().loc['y', 'val'] # 1400
print(y_sum/total) # 0.28
Edit: if df1 doesn't necessarily contain all members of columns 1 and 2, you could loop through it (it's not clear from your question why df1 is a DataFrame or whether it can have more than one row, so I flattened it):
df22['val'] = 0
for c in df1.to_numpy().flatten():
    i, j = map(int, c.split(','))
    df22.loc[i, 'val'] += df22.loc[i, j] * df22.loc[i, 3]
This gives you the same output as above for your example but will ignore values that are not in df1.
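If the loop becomes a bottleneck, a vectorized sketch (assuming, as in the example, that df1 is a single row of "row, col" address strings; the names below are only illustrative) parses the addresses once and uses NumPy indexing:
import numpy as np

addr = np.array([list(map(int, s.split(','))) for s in df1.to_numpy().ravel()])
rows, cols = addr[:, 0] - 1, addr[:, 1] - 1           # 1-based addresses -> 0-based positions
vals = df22.to_numpy()[rows, cols].astype(float)      # addressed values: [2, 4, 6, 1, 3, 5]
weights = df22[3].to_numpy()[rows]                    # 3rd-column factor of each addressed row
labels = df22[4].to_numpy()[rows]                     # 'x'/'y' label of each addressed row
products = vals * weights
total = products.sum()                                # 5000
print(total, products[labels == 'y'].sum() / total)   # 5000.0 0.28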

Reshape the structure of a dataframe

In a dataframe df containing points (row) and coordinates (columns), I want to compute, for each point, the n closest neighbors points and the corresponding distances.
I did something like this:
df = pd.DataFrame(np.random.rand(4, 6))

def dist(p, q):
    return ((p - q)**2).sum(axis=1)

def f(s):
    closest = dist(s, df).nsmallest(3)
    return list(closest.index) + list(closest)

df.apply(f, axis=1, result_type="expand")
which gives:
0 1 2 3 4 5
0 0.0 3.0 2.0 0.0 0.743722 1.140251
1 1.0 2.0 0.0 0.0 1.548676 1.695104
2 2.0 3.0 0.0 0.0 0.702797 1.140251
3 3.0 2.0 0.0 0.0 0.702797 0.743722
(first 3 columns are the indices of the closest points, the next 3 columns are the corresponding distances)
However, I would prefer to get a dataframe with 3 columns: point, closest point to it, distance between them.
Put another way: I want one column per distance, and not one column per point.
I tried pd.melt and pd.pivot, but couldn't find a good way to do it...
Option 1: Scikit-learn NearestNeighbors class
To find k-nearest-neighbors (kNN), sklearn.neighbors.NearestNeighbors serves the purpose.
Data
import numpy as np
import pandas as pd
np.random.seed(52) # reproducibility
df = pd.DataFrame(np.random.rand(4, 6))
print(df)
0 1 2 3 4 5
0 0.823110 0.026118 0.210771 0.618422 0.098284 0.620131
1 0.053890 0.960654 0.980429 0.521128 0.636553 0.764757
2 0.764955 0.417686 0.768805 0.423202 0.926104 0.681926
3 0.368456 0.858910 0.380496 0.094954 0.324891 0.415112
Code
from sklearn.neighbors import NearestNeighbors
k = 3
dist, indices = NearestNeighbors(n_neighbors=k).fit(df).kneighbors(df)
Result
print(dist)
array([[0.00000000e+00, 1.09330867e+00, 1.13862254e+00],
[0.00000000e+00, 9.32862532e-01, 9.72369661e-01],
[0.00000000e+00, 9.72369661e-01, 1.02130721e+00],
[2.10734243e-08, 9.32862532e-01, 1.02130721e+00]])
print(indices)
array([[0, 2, 3],
[1, 3, 2],
[2, 1, 3],
[3, 1, 2]])
The obtained distances and indices can be easily rearranged.
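For instance, a sketch of one way to rearrange them into the long format the question asks for (one row per point/neighbour pair; column 0 is dropped because it is the point itself):
neighbours = pd.DataFrame({
    "point": np.repeat(np.arange(len(df)), k - 1),
    "closest_point": indices[:, 1:].ravel(),
    "distance": dist[:, 1:].ravel(),
})
print(neighbours)
# e.g. the first rows are point 0 -> 2 (1.093309) and point 0 -> 3 (1.138623)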
Option 2: compute manually (nearest except self)
sklearn.metrics has a built-in Euclidean distance function, which outputs an array of shape [#rows x #rows]. You can exclude the diagonal elements (distance to itself, namely 0) from min() and argmin() by filling it with infinity.
Code
from sklearn.metrics import euclidean_distances
dist = euclidean_distances(df.values, df.values)
np.fill_diagonal(dist, np.inf) # exclude self from min()
df_want = pd.DataFrame({
"point": range(df.shape[0]),
"closest_point": dist.argmin(axis=1),
"distance": dist.min(axis=1)
})
Result
print(df_want)
point closest_point distance
0 0 2 1.093309
1 1 3 0.932863
2 2 1 0.972370
3 3 1 0.932863

Python 3.4 - Pandas - Help in proper arrangement of dataframe columns and deletion of invalid columns

This question is based on Python - Pandas - Combining rows of multiple columns into single row in dataframe based on categorical value which I had asked earlier.
I have a table in the following format:
       Var1     Var2      Var3      Var4  ID
0   0.70089  0.93120  1.867650  0.658020   1
1   0.15893 -0.74950  1.089150 -0.045123   1
2   0.13690  0.59210 -0.032990  0.672860   1
3  -0.50136  0.89913  0.440200  0.812150   1
4   1.08940  0.43036  0.669470  1.286000   1
5   0.09310  0.14979 -0.392335  0.040500   1
6   0.63339  1.27161  0.852072  0.474800   2
7  -0.54944 -0.04547  0.867050 -0.234800   2
8   1.28600  1.87650  0.976670  0.440200   2
I have created the above table using the following code:
import pandas as pd

df1 = {'Var1': [0.70089, 0.15893, 0.1369, -0.50136, 1.0894, 0.0931, 0.63339, -0.54944, 1.286],
       'Var2': [0.9312, -0.7495, 0.5921, 0.89913, 0.43036, 0.14979, 1.27161, -0.04547, 1.8765],
       'Var3': [1.86765, 1.08915, -0.03299, 0.4402, 0.66947, -0.392335, 0.852072, 0.86705, 0.97667],
       'Var4': [0.65802, -0.045123, 0.67286, 0.81215, 1.286, 0.0405, 0.4748, -0.2348, 0.4402],
       'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2]}
df = pd.DataFrame(data=df1)
I want to bring it into a particular format by grouping it based on the column 'ID'.
The desired output is similar in structure to the table below:
ID V1_0_0 V2_0_1 V3_0_2 V4_0_3 V1_1_0 V2_1_1 V3_1_2 V4_1_3
1 A B C D E F G H
2 I J K L 0 0 0 0
I achieved it with the help of user Allen in the last question that is referenced above. The code is printed below:
num_V = 4
max_row = df.groupby('ID').ID.count().max()
df = (df.groupby('ID')
        .apply(lambda x: x.values[:, 1:].reshape(1, -1)[0])
        .apply(pd.Series)
        .fillna(0))
df.columns = ['V{}_{}_{}'.format(i + 1, j, i) for j in range(max_row)
              for i in range(num_V)]
print(df)
The result of which produces the below output table:
V1_0_0 V2_0_1 V3_0_2 ***V4_0_3** V1_1_0 V2_1_1 V3_1_2 \
ID
1 0.93120 1.867650 0.65802 1 -0.74950 1.08915 -0.045123
2 1.27161 0.852072 0.47480 2 -0.04547 0.86705 -0.234800
**V4_1_3*** V1_2_0 V2_2_1 ...V3_3_2 **V4_3_3** V1_4_0 V2_4_1 \
ID ...
1 1 0.5921 -0.03299 ... 0.81215 1 0.43036 0.66947
2 2 1.8765 0.97667 ... 0.00000 0 0.00000 0.00000
V3_4_2 **V4_4_3** V1_5_0 V2_5_1 V3_5_2 **V4_5_3**
ID
1 1.286 1 0.14979 -0.392335 0.0405 1
2 0.000 0 0.00000 0.000000 0.0000 0
This is partially correct, but the problem is that there are certain columns that give the value of 1 and 2 after every 3 columns (the ones between ** **).
It then prints 1 and 0 after there are no values pertaining to the 'ID' value 2.
After examining it, I realize that it is not printing the "Var1" values, and the values are off by one column. (That is, V1_0_0 should be 0.70089, and V4_0_3 should hold the value currently shown under V3_0_2, which equals 0.65802.)
Is there any way to rectify this so that I get something exactly like my desired output table? How do I make sure the ** ** marked columns delete the values they have and return the proper values?
I am using Python 3.4, running it in a Linux terminal.
Thanks.
Not sure what's wrong with the code you have provided, but try this out and let me know if it gives you what you want:
import pandas as pd

df = {'Var1': [0.70089, 0.15893, 0.1369, -0.50136, 1.0894, 0.0931, 0.63339, -0.54944, 1.286],
      'Var2': [0.9312, -0.7495, 0.5921, 0.89913, 0.43036, 0.14979, 1.27161, -0.04547, 1.8765],
      'Var3': [1.86765, 1.08915, -0.03299, 0.4402, 0.66947, -0.392335, 0.852072, 0.86705, 0.97667],
      'Var4': [0.65802, -0.045123, 0.67286, 0.81215, 1.286, 0.0405, 0.4748, -0.2348, 0.4402],
      'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2]}
df = pd.DataFrame(df)

newdataframe = pd.DataFrame(columns=df.columns)
newID = []
for agroup in df.ID.unique():
    temp_df = pd.DataFrame(columns=df.columns)
    adf = df[df.ID == agroup]
    for aline in adf.itertuples():
        a = ((pd.DataFrame(list(aline))).T).drop(columns=[0])
        a.columns = df.columns
        if a.ID.values[0] not in newID:
            suffix_count = 1
            temp_df = pd.concat([temp_df, a])
            newID.append(a.ID.values[0])
        else:
            temp_df = temp_df.merge(a, how='outer', on='ID', suffixes=('', '_' + str(suffix_count)))
            suffix_count += 1
    newdataframe = pd.concat([newdataframe, temp_df])
print(newdataframe)
Output :
ID Var1 Var1_1 Var1_2 Var1_3 Var1_4 Var1_5 Var2 Var2_1 \
0 1.0 0.70089 0.15893 0.1369 -0.50136 1.0894 0.0931 0.93120 -0.74950
0 2.0 0.63339 -0.54944 1.2860 NaN NaN NaN 1.27161 -0.04547
Var2_2 ... Var3_2 Var3_3 Var3_4 Var3_5 Var4 Var4_1 \
0 0.5921 ... -0.03299 0.4402 0.66947 -0.392335 0.65802 -0.045123
0 1.8765 ... 0.97667 NaN NaN NaN 0.47480 -0.234800
Var4_2 Var4_3 Var4_4 Var4_5
0 0.67286 0.81215 1.286 0.0405
0 0.44020 NaN NaN NaN
Another piece of code for achieving the output you're looking for:
import pandas as pd
import numpy as np
import re

df = {'Var1': [0.70089, 0.15893, 0.1369, -0.50136, 1.0894, 0.0931, 0.63339, -0.54944, 1.286],
      'Var2': [0.9312, -0.7495, 0.5921, 0.89913, 0.43036, 0.14979, 1.27161, -0.04547, 1.8765],
      'Var3': [1.86765, 1.08915, -0.03299, 0.4402, 0.66947, -0.392335, 0.852072, 0.86705, 0.97667],
      'Var4': [0.65802, -0.045123, 0.67286, 0.81215, 1.286, 0.0405, 0.4748, -0.2348, 0.4402],
      'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2]}
df = pd.DataFrame(df)

# split the first occurrence of each ID from its duplicates
df['duplicateID'] = df['ID'].duplicated()
newdf = df[df['duplicateID'] == False]
newdf = newdf.reset_index()
newdf = newdf.iloc[:, 1:]
df = df[df['duplicateID'] == True]
df = df.reset_index()
df = df.iloc[:, 1:]
del newdf['duplicateID']
del df['duplicateID']

# merge each duplicate row back onto its ID, suffixing the new columns
merge_count = 0
newID = []
for aline in df.itertuples():
    a = ((pd.DataFrame(list(aline))).T).drop(columns=[0])
    a.columns = df.columns
    newdf = newdf.merge(a, how='left', on='ID', suffixes=('_' + str(merge_count), '_' + str(merge_count + 1)))
    merge_count += 1

newdf.index = newdf['ID']
del newdf['ID']
# shift the numeric suffixes down by one so they start at 0
newdf.columns = [col + '_' + str(int(re.findall(r'\d+', col)[0]) - 1) for col in newdf.columns]
print(newdf)
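For reference, a more compact sketch of the reshape described in the question, starting from the original df built in the question and naming the value columns explicitly so Python 3.4's unordered dicts cannot shuffle them:
value_cols = ['Var1', 'Var2', 'Var3', 'Var4']
num_V = len(value_cols)
max_row = df.groupby('ID').cumcount().max() + 1
wide = (df.assign(seq=df.groupby('ID').cumcount())   # running row number within each ID
          .set_index(['ID', 'seq'])[value_cols]
          .unstack('seq')                            # one row per ID, NaN where an ID has fewer rows
          .swaplevel(axis=1)
          .sort_index(axis=1)                        # order columns as Var1..Var4 for each row number
          .fillna(0))
wide.columns = ['V{}_{}_{}'.format(i + 1, j, i)
                for j in range(max_row) for i in range(num_V)]
print(wide)  # V1_0_0 for ID 1 is 0.70089 and V4_0_3 is 0.65802, as desired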

Manipulate pandas.DataFrame with multiple criteria

For example I have a dataframe:
df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
'DayofWeek': [1, 1, 3, 2, 4, 2],
'Hour_Bucket': [1, 5, 7, 4, 3, 12],
'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000+ rows). I'm looking to perform functions on 'Values' when 'Value_Bucket' equals 5, for each possible combination of 'DayofWeek' and 'Hour_Bucket'.
Essentially the data will be grouped into a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), with each cell filled with the result of a function (say the average, for example). I can use groupby for one criterion; can someone explain how I can group by two criteria and tabulate the result in a table?
Query to subset, groupby, then unstack:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek 1 3
Hour_Bucket
1 1.0 NaN
5 1.5 NaN
7 NaN 2.0
If you want to have zeros instead of NaN:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)
DayofWeek 1 3
Hour_Bucket
1 1.0 0.0
5 1.5 0.0
7 0.0 2.0
Pivot tables seem more natural to me than groupby paired with unstack, though they do the exact same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
               index='Hour_Bucket',
               columns='DayofWeek',
               values='Values',
               aggfunc='mean',
               fill_value=0)
Output
DayofWeek 1 3
Hour_Bucket
1 1.0 0
5 1.5 0
7 0.0 2
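If the full 24 x 7 grid mentioned in the question is wanted even for hour/day combinations with no data, one option (a sketch assuming hour buckets 0-23 and weekdays numbered 1-7) is to reindex the pivoted result:
full_grid = (pd.pivot_table(data=df.query('Value_Bucket == 5'),
                            index='Hour_Bucket',
                            columns='DayofWeek',
                            values='Values',
                            aggfunc='mean',
                            fill_value=0)
               .reindex(index=range(24), columns=range(1, 8), fill_value=0))
print(full_grid)  # 24 rows of hours by 7 weekday columns, zeros where no data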
