Multiply dataframes with differently ordered column names - python

I have two dataframes; the first one, df1, contains only one row:
A B C D E
0 5 8 9 5 0
and the second one has multiple rows, but the same number of columns:
D C E A B
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
3 6 7 7 8 1
4 5 9 8 9 4
5 3 0 3 5 0
6 2 3 8 1 3
7 3 3 7 0 1
8 9 9 0 4 7
9 3 2 7 2 0
In my real example I have many more columns (over 100). Both dataframes have the same number of columns and the same column names, but the order of the columns is different, as shown in the example.
I need to multiply the two dataframes element-wise (matrix-like multiplication), but I can't simply do df2.values * df1.values because the columns are not in the same order: for instance, the second column of df1 is B, but the second column of df2 is C, and B is the 5th column of df2.
Is there a simple, pythonic way to multiply the dataframes by matching column names rather than column positions?

df1[df2.columns] returns a dataframe where the columns are ordered as in df2:
df1
Out[91]:
A B C D E
0 3 8 9 5 0
df1[df2.columns]
Out[92]:
D C E A B
0 5 9 0 3 8
So, you just need:
df2.values * df1[df2.columns].values
This will raise a KeyError if df2 has columns that are missing from df1; and it will only select df2's columns even if df1 has additional columns.
As @MaxU noted, since you are operating on numpy arrays, to get back to a dataframe structure you will need:
pd.DataFrame(df2.values * df1[df2.columns].values, columns = df2.columns)
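As a quick sanity check, the whole recipe can be run end-to-end on toy frames shaped like the question's (the values here are made up):

```python
import pandas as pd

# Hypothetical small frames mirroring the question's setup
df1 = pd.DataFrame({'A': [5], 'B': [8], 'C': [9], 'D': [5], 'E': [0]})
df2 = pd.DataFrame({'D': [5, 9], 'C': [0, 3], 'E': [3, 5],
                    'A': [3, 2], 'B': [7, 4]})

# Reorder df1's columns to match df2 before multiplying the raw arrays
result = pd.DataFrame(df2.values * df1[df2.columns].values,
                      columns=df2.columns)
print(result)
```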

You can use mul; df1 is converted to a Series by ix (note that .ix has since been removed from pandas, and .iloc[0] is the modern equivalent):
print df1.ix[0]
A 5
B 8
C 9
D 5
E 0
Name: 0, dtype: int64
print df2.mul(df1.ix[0])
A B C D E
0 15 56 0 25 0
1 10 32 27 45 0
2 40 8 54 35 0
3 40 8 63 30 0
4 45 32 81 25 0
5 25 0 0 15 0
6 5 24 27 10 0
7 0 8 27 15 0
8 20 56 81 45 0
9 10 0 18 15 0
If you need to change the column order of the final DataFrame, use reindex_axis (deprecated in newer pandas in favor of reindex(columns=...)):
print df2.mul(df1.ix[0]).reindex_axis(df2.columns.tolist(), axis=1)
D C E A B
0 25 0 0 15 56
1 45 27 0 10 32
2 35 54 0 40 8
3 30 63 0 40 8
4 25 81 0 45 32
5 15 0 0 25 0
6 10 27 0 5 24
7 15 27 0 0 8
8 45 81 0 20 56
9 15 18 0 10 0
Another solution is to reorder the columns by reindexing the Series with df2.columns:
print df2.mul(df1.ix[0].reindex(df2.columns))
D C E A B
0 25 0 0 15 56
1 45 27 0 10 32
2 35 54 0 40 8
3 30 63 0 40 8
4 25 81 0 45 32
5 15 0 0 25 0
6 10 27 0 5 24
7 15 27 0 0 8
8 45 81 0 20 56
9 15 18 0 10 0
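Worth adding: mul aligns the Series' index labels with the DataFrame's column labels by name, so the differing column order is handled automatically; the reindex calls above mainly control the output column order. A sketch in modern pandas, using .iloc[0] since .ix has been removed:

```python
import pandas as pd

# Hypothetical small frames mirroring the question's setup
df1 = pd.DataFrame({'A': [5], 'B': [8], 'C': [9], 'D': [5], 'E': [0]})
df2 = pd.DataFrame({'D': [5, 9], 'C': [0, 3], 'E': [3, 5],
                    'A': [3, 2], 'B': [7, 4]})

# mul matches the Series' index labels against df2's column labels,
# so no manual reordering of columns is needed
out = df2.mul(df1.iloc[0])
```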


Python Dataframe GroupBy Function

I am having a hard time understanding what the code below does. I initially thought it was counting the unique appearances of the values in (Weight, Age) and (Weight, Height), but when I ran this example, I found out it was doing something else.
data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],[1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data,columns=["Lbl","Weight","Age","Height"])
print (df)
def group_fea(df, key, target):
    '''
    Adds columns for feature combinations
    '''
    tmp = df.groupby(key, as_index=False)[target].agg({
        key + target + '_nunique': 'nunique',
    }).reset_index()
    del tmp['index']
    print("****{}****".format(target))
    return tmp
#Add feature combinations
feature_key = ['Weight']
feature_target = ['Age','Height']
for key in feature_key:
    for target in feature_target:
        tmp = group_fea(df, key, target)
        df = df.merge(tmp, on=key, how='left')
print (df)
Lbl Weight Age Height
0 0 33 15 4
1 1 44 12 3
2 0 44 12 5
3 1 33 15 4
4 0 77 13 4
5 1 33 15 4
6 1 99 40 7
7 0 58 45 4
8 1 11 13 4
****Age****
****Height****
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
I want to understand what the values in WeightAge_nunique and WeightHeight_nunique mean.
The value of WeightAge_nunique on a given row is the number of unique Ages that have the same Weight. The corresponding thing is true of WeightHeight_nunique. E.g., for people of Weight=44, there is only 1 unique age (12), hence WeightAge_nunique=1 on those rows, but there are 2 unique Heights (3 and 5), hence WeightHeight_nunique=2 on those same rows.
You can see that this happens because the grouping function groups by the "key" column (Weight), then performs the "nunique" aggregation function on the "target" column (either Age or Height).
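To see this directly, you can compute the per-Weight nunique without the merge machinery (same data as the question):

```python
import pandas as pd

data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],
        [1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data, columns=["Lbl", "Weight", "Age", "Height"])

# Distinct Ages and Heights observed for each Weight value
age_n = df.groupby("Weight")["Age"].nunique()
height_n = df.groupby("Weight")["Height"].nunique()
print(age_n[44], height_n[44])  # Weight 44: one Age (12), two Heights (3, 5)
```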
Let us try transform
g = df.groupby('Weight').transform('nunique')
df['WeightAge_nunique'] = g['Age']
df['WeightHeight_nunique'] = g['Height']
df
Out[196]:
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1

Remove duplicates after a certain number of occurrences

How do we filter the dataframe below to remove all duplicate ID rows after a certain number of occurrences of that ID? I.e., remove all rows with ID == 0 after the 3rd occurrence of ID == 0.
Thanks
pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=['ID', 'Value']).sort_values('ID')
Output:
ID Value
0 7
0 8
0 5
0 5
... ... ...
9 7
9 7
9 1
9 3
Desired Output for filter_count = 3:
Output:
ID Value
0 7
0 8
0 5
1 7
1 7
1 1
2 3
If you want to do this for all IDs, use:
df.groupby("ID").head(3)
For single ID, you can assign a new column using cumcount and then filter by conditions:
df["count"] = df.groupby("ID")["Value"].cumcount()
print (df.loc[(df["ID"].ne(0))|((df["ID"].eq(0)&(df["count"]<3)))])
ID Value count
64 0 6 0
77 0 6 1
83 0 0 2
44 1 7 0
58 1 5 1
40 1 2 2
35 1 7 3
89 1 9 4
19 1 7 5
10 1 3 6
45 2 4 0
68 2 1 1
74 2 4 2
75 2 8 3
34 2 4 4
60 2 6 5
78 2 0 6
31 2 8 7
97 2 9 8
2 2 6 9
93 2 8 10
13 2 2 11
...
I would do it without groupby:
df = pd.concat([df.loc[df.ID==0].head(3),df.loc[df.ID!=0]])
Thanks Henry, I modified your code and I think this should work as well. Your df.groupby("ID").head(3) is great.
df["count"] = df.groupby("ID")["Value"].cumcount()
df.loc[df["count"]<3].drop(['count'], axis=1)
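For reference, the cumcount idea can also be written as a single filter with no helper column; a sketch on seeded random data shaped like the question's:

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the question's random frame
rng = np.random.default_rng(42)
df = pd.DataFrame({'ID': rng.integers(0, 10, 100),
                   'Value': rng.integers(0, 10, 100)}).sort_values('ID')

# Keep at most the first 3 occurrences of every ID
kept = df[df.groupby('ID').cumcount() < 3]
```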

Several Layers of If Statements with String

I have a data frame
df = pd.DataFrame([[3,2,1,5,'Stay',2],[4,5,6,10,'Leave',10],
[10,20,30,40,'Stay',11],[12,2,3,3,'Leave',15],
[31,23,31,45,'Stay',25],[12,21,17,6,'Stay',15],
[15,17,18,12,'Leave',10],[3,2,1,5,'Stay',3],
[12,2,3,3,'Leave',12]], columns = ['A','B','C','D','Status','E'])
A B C D Status E
0 3 2 1 5 Stay 2
1 4 5 6 10 Leave 10
2 10 20 30 40 Stay 11
3 12 2 3 3 Leave 15
4 31 23 31 45 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
I want to apply a condition: if Status is 'Stay' and column E is smaller than column A, shift the data one column to the right, so column D takes column C's value, C takes B's, B takes A's, and A takes E's.
If Status is 'Leave' and column E is larger than column A, apply the same shift.
So the result is:
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
My attempt:
if df['Status'] == 'Stay':
    if df['E'] < df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
elif df['Status'] == 'Leave':
    if df['E'] > df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
This runs into several problems, including errors from comparing a whole Series against a string. Your help is kindly appreciated.
I think you want boolean indexing:
s1 = df.Status.eq('Stay') & df['E'].lt(df['A'])
s2 = df.Status.eq('Leave') & df['E'].gt(df['A'])
s = s1 | s2
df.loc[s, ['A','B','C','D']] = df.loc[s, ['E','A','B','C']].to_numpy()
Output:
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
Using np.roll with .loc:
shift = np.roll(df.select_dtypes(exclude='object'),1,axis=1)[:, :-1]
m1 = df['Status'].eq('Stay') & (df['E'] < df['A'])
m2 = df['Status'].eq('Leave') & (df['E'] > df['A'])
df.loc[m1|m2, ['A','B','C','D']] = shift[m1|m2]
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
Use DataFrame.mask + DataFrame.shift:
#Move Status into the index so shift only moves the numeric columns
new_df=df.set_index('Status')
#DataFrame to replace
df_modify=new_df.shift(axis=1,fill_value=df['E'])
#Creating boolean mask
under_mask=(df.Status.eq('Stay'))&(df.E<df.A)
over_mask=(df.Status.eq('Leave'))&(df.E>df.A)
#Using DataFrame.mask
new_df=new_df.mask(under_mask|over_mask,df_modify).reset_index()
print(new_df)
Output
Status A B C D E
0 Stay 2 3 2 1 5
1 Leave 10 4 5 6 10
2 Stay 10 20 30 40 11
3 Leave 15 12 2 3 3
4 Stay 25 31 23 31 45
5 Stay 12 21 17 6 15
6 Leave 15 17 18 12 10
7 Stay 3 2 1 5 3
8 Leave 12 2 3 3 12
It sounds like you want to do this for each row of the data, but your code is written at the level of whole columns. You can iterate over the rows with DataFrame.iterrows (iterating over a DataFrame directly yields column labels, not rows):
for _, row in df.iterrows():
    if row['Status'] == 'Stay':
        ... etc ...
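For completeness, the boolean-mask approach can be run end-to-end on a few of the question's rows:

```python
import pandas as pd

df = pd.DataFrame([[3, 2, 1, 5, 'Stay', 2],
                   [4, 5, 6, 10, 'Leave', 10],
                   [10, 20, 30, 40, 'Stay', 11]],
                  columns=['A', 'B', 'C', 'D', 'Status', 'E'])

# Rows to rotate: Stay with E < A, or Leave with E > A
m = (df['Status'].eq('Stay') & df['E'].lt(df['A'])) | \
    (df['Status'].eq('Leave') & df['E'].gt(df['A']))

# Shift E -> A -> B -> C -> D on the selected rows only
df.loc[m, ['A', 'B', 'C', 'D']] = df.loc[m, ['E', 'A', 'B', 'C']].to_numpy()
```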

Pivot column and column values in pandas dataframe

I have a dataframe that looks like this, but with 26 rows and 110 columns:
index/io 1 2 3 4
0 42 53 23 4
1 53 24 6 12
2 63 12 65 34
3 13 64 23 43
Desired output:
index io value
0 1 42
0 2 53
0 3 23
0 4 4
1 1 53
1 2 24
1 3 6
1 4 12
2 1 63
2 2 12
...
I have tried with dicts and lists by transforming the dataframe to a dict, then creating a new list with the index values and updating a new dict with io.
indx = []
for key, value in mydict.iteritems():
    for k, v in value.iteritems():
        indx.append(key)
indxio = {}
for element in indx:
    for key, value in mydict.iteritems():
        for k, v in value.iteritems():
            indxio.update({element: k})
I know this is probably going too far, but it's the only thing I could think of. The process was too long, so I stopped.
You can use set_index, stack, and reset_index().
df.set_index("index/io").stack().reset_index(name="value")\
.rename(columns={'index/io':'index','level_1':'io'})
Output:
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
You need set_index + stack + rename_axis + reset_index:
df = df.set_index('index/io').stack().rename_axis(('index','io')).reset_index(name='value')
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
A solution with melt and rename; the values come out in a different order, so sort_values is necessary:
d = {'index/io':'index'}
df = df.melt('index/io', var_name='io', value_name='value') \
.rename(columns=d).sort_values(['index','io']).reset_index(drop=True)
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
And an alternative solution for numpy lovers:
df = df.set_index('index/io')
a = np.repeat(df.index, len(df.columns))
b = np.tile(df.columns, len(df.index))
c = df.values.ravel()
cols = ['index','io','value']
df = pd.DataFrame(np.column_stack([a,b,c]), columns = cols)
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
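A minimal runnable check of the stack-based recipe, on a toy frame shaped like the question's (column names '1' and '2' are hypothetical stand-ins):

```python
import pandas as pd

# Toy wide frame: an 'index/io' column plus numbered value columns
df = pd.DataFrame({'index/io': [0, 1], '1': [42, 53], '2': [53, 24]})

# Wide -> long: the old index becomes 'index', the column labels become 'io'
long_df = (df.set_index('index/io').stack()
             .rename_axis(('index', 'io')).reset_index(name='value'))
print(long_df)
```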

Changing structure of pandas dataframe

Is there a function that can swap between the following dataframes (df1, df2):
import random
import pandas as pd
numbers = random.sample(range(1,50), 10)
d = {'num': list(range(1,6)) + list(range(1,6)),'values':numbers,'type':['a']*5 + ['b']*5}
df = pd.DataFrame(d)
e = {'num': list(range(1,6)) ,'a':numbers[:5],'b':numbers[5:]}
df2 = pd.DataFrame(e)
Dataframe df1:
#df1
num type values
0 1 a 18
1 2 a 26
2 3 a 34
3 4 a 21
4 5 a 48
5 1 b 1
6 2 b 19
7 3 b 36
8 4 b 42
9 5 b 30
Dataframe df2:
a b num
0 18 1 1
1 26 19 2
2 34 36 3
3 21 42 4
4 48 30 5
I take the first df, and the type column values become column names holding the corresponding values. Is there a function that can do this (from df1 to df2) and the vice-versa action (from df2 to df1)?
You can use stack and pivot:
print df
num type values
0 1 a 20
1 2 a 25
2 3 a 2
3 4 a 27
4 5 a 29
5 1 b 39
6 2 b 40
7 3 b 6
8 4 b 17
9 5 b 47
print df2
a b num
0 20 39 1
1 25 40 2
2 2 6 3
3 27 17 4
4 29 47 5
df1 = df2.set_index('num').stack().reset_index()
df1.columns = ['num','type','values']
df1 = df1.sort_values('type')
print df1
num type values
0 1 a 20
2 2 a 46
4 3 a 21
6 4 a 33
8 5 a 10
1 1 b 45
3 2 b 39
5 3 b 38
7 4 b 37
9 5 b 34
df3 = df.pivot(index='num', columns='type', values='values').reset_index()
df3.columns.name = None
df3 = df3[['a','b','num']]
print df3
a b num
0 46 23 1
1 38 6 2
2 36 47 3
3 33 34 4
4 15 1 5
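The two directions above can be sketched as a round trip on a small hand-made frame:

```python
import pandas as pd

# Long format: one row per (num, type) pair
long_df = pd.DataFrame({'num': [1, 2, 1, 2],
                        'type': ['a', 'a', 'b', 'b'],
                        'values': [18, 26, 1, 19]})

# long -> wide: one column per 'type'
wide = long_df.pivot(index='num', columns='type', values='values').reset_index()
wide.columns.name = None

# wide -> long again
back = wide.set_index('num').stack().reset_index()
back.columns = ['num', 'type', 'values']
```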
