Pandas Boolean indexing with two dataframes - python

I have two pandas dataframes:
df1
'A' 'B'
0 0
0 2
1 1
1 1
1 3
df2
'ID' 'value'
0 62
1 70
2 76
3 4674
4 3746
I want to assign df2.value as a new column D to df1, but only where df1.A == 0.
df1.B and df2.ID are the identifiers to match on.
Example output:
df1
'A' 'B' 'D'
0 0 62
0 2 76
1 1 NaN
1 1 NaN
1 3 NaN
I tried the following:
df1['D'][ df1.A == 0 ] = df2['value'][df2.ID == df1.B]
However, since df2 and df1 don't have the same length, I get a ValueError:
ValueError: Series lengths must match to compare
This is almost certainly due to the boolean indexing in the last part: [df2.ID == df1.B]
Does anyone know how to solve the problem without needing to iterate over the dataframe(s)?
Thanks a bunch!
==============
Edit in reply to #EdChum: It worked perfectly with the example data, but I have issues with my real data. df1 is a huge dataset. df2 looks like this:
df2
ID value
0 1 1.00000
1 2 1.00000
2 3 1.00000
3 4 1.00000
4 5 1.00000
5 6 1.00000
6 7 1.00000
7 8 1.00000
8 9 0.98148
9 10 0.23330
10 11 0.56918
11 12 0.53251
12 13 0.58107
13 14 0.92405
14 15 0.00025
15 16 0.14863
16 17 0.53629
17 18 0.67130
18 19 0.53249
19 20 0.75853
20 21 0.58647
21 22 0.00156
22 23 0.00000
23 24 0.00152
24 25 1.00000
After doing the merge, the output looks like this: the value 0.98148 appears 133 times in a row, then 0.00025 appears 47 times, and it continues with more runs of values from df2 until finally a sequence of NaN entries appears...
Out[91]: df1
A B D
0 1 3 0.98148
1 0 9 0.98148
2 0 9 0.98148
3 0 7 0.98148
5 1 21 0.98148
7 1 12 0.98148
... ... ... ...
2592 0 2 NaN
2593 1 17 NaN
2594 1 16 NaN
2596 0 17 NaN
2597 0 6 NaN
Any idea what might have happened here? They are all int64.
==============
Here are two csv files with data that reproduce the problem.
df1:
https://owncloud.tu-berlin.de/public.php?service=files&t=2a7d244f55a5772f16aab364e78d3546
df2:
https://owncloud.tu-berlin.de/public.php?service=files&t=6fa8e0c2de465cb4f8a3f8890c325eac
To reproduce:
import pandas as pd
df1 = pd.read_csv("../../df1.csv")
df2 = pd.read_csv("../../df2.csv")
df1['D'] = df1[df1.A == 0].merge(df2,left_on='B', right_on='ID', how='left')['value']

Slightly tricky, this one. There are two steps here: first select only the rows in df where 'A' is 0, then merge the other df onto this where 'B' and 'ID' match, using a 'left' merge, then select the 'value' column from the result and assign it to the df (note that in this answer df is the question's df1 and df1 is the question's df2):
In [142]:
df['D'] = df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')['value']
df
Out[142]:
A B D
0 0 0 62
1 0 2 76
2 1 1 NaN
3 1 1 NaN
4 1 3 NaN
Breaking this down will show what is happening:
In [143]:
# boolean mask on condition
df[df.A == 0]
Out[143]:
A B D
0 0 0 62
1 0 2 76
In [144]:
# merge using 'B' and 'ID' columns
df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')
Out[144]:
A B D ID value
0 0 0 62 0 62
1 0 2 76 2 76
After all the above you can then assign directly:
df['D'] = df[df.A == 0].merge(df1, left_on='B',right_on='ID', how='left')['value']
This works because the result aligns with the left-hand side index, so any missing values are automatically assigned NaN.
EDIT
Another method, and one that seems to work for your real data, is to use map to perform the lookup for you. map accepts a dict or Series as a param and looks up the corresponding value; in this case you need to set the index to the 'ID' column, which reduces your df to one with just the 'value' column:
df['D'] = df[df.A==0]['B'].map(df1.set_index('ID')['value'])
So the above performs the boolean indexing as before, then calls map on the 'B' column and looks up the corresponding 'value' in the other df after we set its index to 'ID'.
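For reference, here is a minimal, self-contained sketch of that map lookup on the toy frames from the question (keeping the question's names df1/df2 rather than the answer's df/df1):
import pandas as pd

# Toy frames matching the example in the question.
df1 = pd.DataFrame({'A': [0, 0, 1, 1, 1], 'B': [0, 2, 1, 1, 3]})
df2 = pd.DataFrame({'ID': [0, 1, 2, 3, 4], 'value': [62, 70, 76, 4674, 3746]})

# Keep only the rows where A == 0, then map 'B' through a Series indexed by 'ID'.
# Assigning the result back aligns on the index, so the other rows become NaN.
df1['D'] = df1.loc[df1.A == 0, 'B'].map(df2.set_index('ID')['value'])
print(df1)
#    A  B     D
# 0  0  0  62.0
# 1  0  2  76.0
# 2  1  1   NaN
# 3  1  1   NaN
# 4  1  3   NaN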
Update
I looked at your data and my first method, and I can see why this fails: the alignment to the left-hand side df fails, so you get 1192 values in a continuous run and then the rest of the rows are NaN up to row 2500.
What does work is if you apply the same mask to the left hand side like so:
df1.loc[df1.A==0, 'D'] = df1[df1.A == 0].merge(df2,left_on='B', right_on='ID', how='left')['value']
So this masks the rows on the left-hand side correctly and assigns the result of the merge.
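For completeness, here is a self-contained sketch of that masked assignment on the toy data from the question; as a small variation it assigns .values instead of the merged Series, which side-steps index alignment between the merge result (which gets a fresh RangeIndex) and the masked rows:
import pandas as pd

df1 = pd.DataFrame({'A': [0, 0, 1, 1, 1], 'B': [0, 2, 1, 1, 3]})
df2 = pd.DataFrame({'ID': [0, 1, 2, 3, 4], 'value': [62, 70, 76, 4674, 3746]})

# Merge only the masked rows; a left merge on the unique 'ID' column keeps
# exactly one output row per masked row, in the same order.
merged = df1[df1.A == 0].merge(df2, left_on='B', right_on='ID', how='left')

# Assign positionally via .values so the fresh RangeIndex of the merge
# result cannot misalign with the labels of the masked rows.
df1.loc[df1.A == 0, 'D'] = merged['value'].values
print(df1)
#    A  B     D
# 0  0  0  62.0
# 1  0  2  76.0
# 2  1  1   NaN
# 3  1  1   NaN
# 4  1  3   NaN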

Related

Changing a cell value based on other rows and columns

I have a dataframe called Result that comes from a SQL query:
Loc ID Bank
1 23 NULL
1 24 NULL
1 25 NULL
2 23 6
2 24 7
2 25 8
I am trying to set the Bank values of the Loc == 1 rows equal to the Bank of the Loc == 2 rows with the same ID, resulting in:
Loc ID Bank
1 23 6
1 24 7
1 25 8
2 23 6
2 24 7
2 25 8
Here is where I am at with the code. I know the ending is super simple, but I just can't wrap my head around a solution that doesn't involve iterating over every row (~9000).
result.loc[(result['Loc'] == '1'), 'bank'] = ???
You can try this. It uses map() to get the values from ID.
for_map = df.loc[df['Loc'] == 2].set_index('ID')['Bank'].squeeze().to_dict()
df.loc[df['Loc'] == 1,'Bank'] = df.loc[df['Loc'] == 1,'Bank'].fillna(df['ID'].map(for_map))
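Below is a self-contained sketch of that approach, reconstructing the example frame from the question (None stands in for the SQL NULLs; the .squeeze() call is dropped since it is a no-op on a multi-row Series):
import pandas as pd

# Reconstruction of the 'Result' table from the question.
df = pd.DataFrame({
    'Loc':  [1, 1, 1, 2, 2, 2],
    'ID':   [23, 24, 25, 23, 24, 25],
    'Bank': [None, None, None, 6, 7, 8],
})

# Build an ID -> Bank mapping from the Loc == 2 rows.
for_map = df.loc[df['Loc'] == 2].set_index('ID')['Bank'].to_dict()

# Fill the missing Bank values for Loc == 1 by looking up ID in the mapping.
df.loc[df['Loc'] == 1, 'Bank'] = df.loc[df['Loc'] == 1, 'Bank'].fillna(df['ID'].map(for_map))
print(df)
#    Loc  ID  Bank
# 0    1  23   6.0
# 1    1  24   7.0
# 2    1  25   8.0
# 3    2  23   6.0
# 4    2  24   7.0
# 5    2  25   8.0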
You could do a self merge on the dataframe, on ID, then filter for rows where it is equal to 2:
(
    df.merge(df, on="ID")
    .loc[lambda df: df.Loc_y == 2, ["Loc_x", "ID", "Bank_y"]]
    .rename(columns=lambda x: x.split("_")[0] if "_" in x else x)
    .astype({"Bank": "Int8"})
    .sort_values("Loc", ignore_index=True)
)
Loc ID Bank
0 1 23 6
1 1 24 7
2 1 25 8
3 2 23 6
4 2 24 7
5 2 25 8
You could also stack/unstack, although this fails if you have duplicate indices:
(
    df.set_index(["Loc", "ID"])
    .unstack("Loc")
    .bfill(axis=1)
    .stack()
    .reset_index()
    .reindex(columns=df.columns)
)
Loc ID Bank
0 1 23 6.0
1 2 23 6.0
2 1 24 7.0
3 2 24 7.0
4 1 25 8.0
5 2 25 8.0
Why not use pandas.MultiIndex?
Commonalities
# Arguments,
_0th_level = 'Loc'
merge_key = 'ID'
value_key = 'Bank' # or a list of colnames or `slice(None)` to propagate all columns values.
src_key = '2'
dst_key = '1'
# Computed once for all,
df = result.set_index([_0th_level, merge_key])
df2 = df.xs(key=src_key, level=_0th_level, drop_level=False)
df1_ = df2.rename(level=_0th_level, index={src_key: dst_key})
First (naive) approach
df.loc[df1_.index, value_key] = df1_
# to get `result` back : df.reset_index()
Second (robust) approach
That said, the first approach may not be allowed (since pandas version 1.0.0) if there are one or more missing labels [...].
So if you must ensure that indexes exist both at source and destination, the following does the job on shared IDs only.
df1 = df.xs(key=dst_key, level=_0th_level, drop_level=False)
idx = df1.index.intersection(df1_.index) # <-----
df.loc[idx, value_key] = df1_.loc[idx, value_key]

Grouping and aggregating over columns duplicates columns in pandas

I am joining two tables left_table and right_table on non-unique keys, which results in row explosion. I then want to aggregate rows to match the number of rows in left_table. To do this I aggregate over the left_table columns.
Weirdly, when I save the table, the columns from left_table are doubled. It seems like the columns of left_table become an index of the resulting dataframe...
Left table
k1 k2 s v c target
0 1 3 20 40 2 2
1 1 2 10 20 1 1
2 1 2 10 80 2 1
Right table
k11 k22 s2 v2
0 1 2 0 100
1 2 3 30 200
2 1 2 10 300
Left join
k1 k2 s v c target s2 v2
0 1 3 20 40 2 2 NaN NaN
1 1 2 10 20 1 1 0.0 100.0
2 1 2 10 20 1 1 10.0 300.0
3 1 2 10 80 2 1 0.0 100.0
4 1 2 10 80 2 1 10.0 300.0
Aggregation code
dic = {}
keys_to_agg_over = left_table_col_names
for col in numeric_cols:
    if col in all_cols:
        dic[col] = 'median'
left_join = left_join.groupby(keys_to_agg_over).aggregate(dic)
After aggregation (doubled number of left table cols)
k1 k2 s v c target s2 v2
k1 k2 s v c target
1 2 10 20 1 1 1 2 10 20 1 1 5.0 200.0
80 2 1 1 2 10 80 2 1 5.0 200.0
3 20 40 2 2 1 3 20 40 2 2 NaN NaN
Saved to csv file
k1,k2,s,v,c,target,k1,k2,s,v,c,target,s2,v2
1,2,10,20,1,1,1,2,10,20,1,1,5.0,200.0
1,2,10,80,2,1,1,2,10,80,2,1,5.0,200.0
1,3,20,40,2,2,1,3,20,40,2,2,,
I tried resetting the index with left_join.reset_index(), but I get
ValueError: cannot insert target, already exists
How to fix the issue of column-doubling?
You have a couple of options:
Store the csv without the index: I guess you are using the to_csv method to store the result in a csv. By default it includes your index columns in the generated csv. You can do to_csv(index=False) to avoid storing them.
reset_index dropping it: you can use left_join.reset_index(drop=True) in order to discard the index columns instead of adding them back into the dataframe. By default reset_index adds the current index columns to the dataframe, which generates the ValueError you got.
It seems like you are using:
left_join = left_table.merge(right_table, left_on=["k1", "k2"], right_on=["k11", "k22"], how="left")
This will result in a dataframe with repeated rows since indexes 1 and 2 from the left table both can be joined to indexes 0 and 2 of the right table. If that is the behavior you expected, and just want to get rid of duplicated rows you can try using:
left_join = left_join.drop_duplicates()
before aggregating. This solution won't prevent the row duplication; rather, it removes the duplicates so they don't cause any trouble.
You can also pass the parameter as_index = False in the groupby function like this:
left_join = left_join.groupby(keys_to_agg_over, as_index = False).aggregate(dic)
To stop getting the "grouping columns" as indexes.
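Here is a short sketch, with hypothetical toy data (not the poster's real tables), showing how as_index=False keeps the grouping keys as ordinary columns, so neither the index nor a later reset_index duplicates them when the result is written out:
import pandas as pd

# Hypothetical stand-in for the joined table after the left merge.
left_join = pd.DataFrame({
    'k1': [1, 1, 1, 1, 1],
    'k2': [3, 2, 2, 2, 2],
    'v':  [40, 20, 20, 80, 80],
    'v2': [None, 100.0, 300.0, 100.0, 300.0],
})

# Grouping keys stay as regular columns, so to_csv writes each of them once.
agg = left_join.groupby(['k1', 'k2', 'v'], as_index=False).aggregate({'v2': 'median'})
agg.to_csv('aggregated.csv', index=False)
print(agg)
#    k1  k2   v     v2
# 0   1   2  20  200.0
# 1   1   2  80  200.0
# 2   1   3  40    NaN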

Pandas reshape a multicolumn dataframe long to wide with conditional check

I have a pandas data frame as follows:
id group type action cost
101 A 1 10
101 A 1 repair 3
102 B 1 5
102 B 1 repair 7
102 B 1 grease 2
102 B 1 inflate 1
103 A 2 12
104 B 2 9
I need to reshape it from long to wide, but depending on the value of the action column, as follows:
id group type action_std action_extra
101 A 1 10 3
102 B 1 5 10
103 A 2 12 0
104 B 2 9 0
In other words, for rows with an empty action field the cost value should go under the action_std column, while for rows with a non-empty action field the cost values should be summed under the action_extra column.
I've tried several combinations of groupby / agg / pivot, but I cannot find a fully working solution...
I would suggest you simply split the cost column into a cost and a cost_extra column. Something like the following:
import numpy as np
result = df.assign(
    cost_extra=lambda df: np.where(
        df['action'].notnull(), df['cost'], np.nan
    )
).assign(
    cost=lambda df: np.where(
        df['action'].isnull(), df['cost'], np.nan
    )
).groupby(
    ["id", "group", "type"]
)[["cost", "cost_extra"]].agg(
    "sum"
)
result looks like:
cost cost_extra
id group type
101 A 1 10.0 3.0
102 B 1 5.0 10.0
103 A 2 12.0 0.0
104 B 2 9.0 0.0
Check groupby with unstack
df.cost.groupby([df.id,df.group,df.type,df.action.eq('')]).sum().unstack(fill_value=0)
action False True
id group type
101 A 1 3 10
102 B 1 10 5
103 A 2 0 12
104 B 2 0 9
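To match the exact column names in the desired output, one possible follow-up (a hypothetical addition, not part of the original answer) is to rename the boolean columns produced by unstack and flatten the index:
import pandas as pd

# Hypothetical reconstruction of the question's frame ('' marks an empty action field).
df = pd.DataFrame({
    'id':     [101, 101, 102, 102, 102, 102, 103, 104],
    'group':  ['A', 'A', 'B', 'B', 'B', 'B', 'A', 'B'],
    'type':   [1, 1, 1, 1, 1, 1, 2, 2],
    'action': ['', 'repair', '', 'repair', 'grease', 'inflate', '', ''],
    'cost':   [10, 3, 5, 7, 2, 1, 12, 9],
})

# Same groupby/unstack as above, then rename True -> action_std and False -> action_extra.
out = (
    df.cost.groupby([df.id, df.group, df.type, df.action.eq('')]).sum()
      .unstack(fill_value=0)
      .rename(columns={True: 'action_std', False: 'action_extra'})
      .reset_index()
)
out.columns.name = None  # drop the leftover 'action' columns name for a clean display
print(out)
#     id group  type  action_extra  action_std
# 0  101     A     1             3          10
# 1  102     B     1            10           5
# 2  103     A     2             0          12
# 3  104     B     2             0           9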
Thanks for your hints; this is the solution that I finally like the most (also for its simplicity):
df["action_std"] = df["cost"].where(df["action"] == "")
df["action_extra"] = df["cost"].where(df["action"] != "")
df = df.groupby(["id", "group", "type"])[["action_std", "action_extra"]].sum().reset_index()

Is it possible to split a Pandas dataframe using groupby and merge each group with separate dataframes

I have a Pandas dataframe that contains a grouping variable. I would like to merge each group with other dataframes based on the contents of one of the columns. So, for example, I have a dataframe, dfA, which can be defined as:
dfA = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                    'b': [0, 1, 0, 0, 1, 1],
                    'c': ['a', 'b', 'c', 'd', 'e', 'f']})
a b c
0 1 0 a
1 2 1 b
2 3 0 c
3 4 0 d
4 5 1 e
5 6 1 f
Two other dataframes, dfB and dfC, contain a common column ('a') and an extra column ('d') and can be defined as:
dfB = pd.DataFrame({'a': [1, 2, 3],
                    'd': [11, 12, 13]})
a d
0 1 11
1 2 12
2 3 13
dfC = pd.DataFrame({'a': [4, 5, 6],
                    'd': [21, 22, 23]})
a d
0 4 21
1 5 22
2 6 23
I would like to be able to split dfA based on column 'b' and merge one of the groups with dfB and the other group with dfC to produce an output that looks like:
a b c d
0 1 0 a 11
1 2 1 b 12
2 3 0 c 13
3 4 0 d 21
4 5 1 e 22
5 6 1 f 23
In this simplified version, I could concatenate dfB and dfC and merge with dfA without splitting into groups as shown below:
dfX = pd.concat([dfB,dfC])
dfA = dfA.merge(dfX,on='a',how='left')
print(dfA)
a b c d
0 1 0 a 11
1 2 1 b 12
2 3 0 c 13
3 4 0 d 21
4 5 1 e 22
5 6 1 f 23
However, in the real-world situation, the smaller dataframes will be generated from multiple different complex sources; generating the dataframes and combining into a single dataframe beforehand may not be feasible because there may be overlapping data on the column that will be used for merging the dataframes (but this will be avoided if the dataframe can be split based on the grouping variable). Is it possible to use Pandas groupby() method to do this instead? I was thinking of something like the following (which doesn't work, perhaps because I'm not combining the groups into a new dataframe correctly):
grouped = dfA.groupby('b')
for name, group in grouped:
    if name == 0:
        group = group.merge(dfB, on='a', how='left')
    elif name == 1:
        group = group.merge(dfC, on='a', how='left')
Any thoughts would be appreciated.
This will fix your code
l = []
grouped = dfA.groupby('b')
for name, group in grouped:
    if name == 0:
        group = group.merge(dfB, on='a', how='left')
    elif name == 1:
        group = group.merge(dfC, on='a', how='left')
    l.append(group)
pd.concat(l)
Out[215]:
a b c d
0 1 0 a 11.0
1 3 0 c 13.0
2 4 0 d NaN
0 2 1 b NaN
1 5 1 e 22.0
2 6 1 f 23.0

Filtering Pandas Dataframe Aggregate

I have a pandas dataframe that I groupby, and then perform an aggregate calculation to get the mean for:
grouped = df.groupby(['year_month', 'company'])
means = grouped.agg({'size':['mean']})
Which gives me a dataframe back, but I can't seem to filter it to the specific company and year_month that I want:
means[(means['year_month']=='201412')]
gives me a KeyError
The issue is that you are grouping based on 'year_month' and 'company'. Hence, in the means DataFrame, year_month and company are part of the index (a MultiIndex). You cannot access them the way you access other columns.
One method is to filter on the values of the 'year_month' level of the index. Example -
means.loc[means.index.get_level_values('year_month') == '201412']
Demo -
In [38]: df
Out[38]:
A B C
0 1 2 10
1 3 4 11
2 5 6 12
3 1 7 13
4 2 8 14
5 1 9 15
In [39]: means = df.groupby(['A','B']).mean()
In [40]: means
Out[40]:
C
A B
1 2 10
7 13
9 15
2 8 14
3 4 11
5 6 12
In [41]: means.loc[means.index.get_level_values('A') == 1]
Out[41]:
C
A B
1 2 10
7 13
9 15
As already pointed out, you will end up with a 2-level index. You could try to unstack the aggregated dataframe:
means = df.groupby(['year_month', 'company']).agg({'size':['mean']}).unstack(level=1)
This should give you a single 'year_month' index, 'company' as columns and your aggregate size as values. You can then slice by the index:
means.loc['201412']
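Finally, a small end-to-end sketch with hypothetical data (the question's frame isn't shown) that combines both suggestions, filtering by index level and slicing the unstacked frame:
import pandas as pd

# Hypothetical stand-in for the poster's dataframe.
df = pd.DataFrame({
    'year_month': ['201411', '201411', '201412', '201412', '201412'],
    'company':    ['X', 'Y', 'X', 'X', 'Y'],
    'size':       [10, 20, 30, 50, 40],
})

# Filter the MultiIndex result on the 'year_month' level.
means = df.groupby(['year_month', 'company']).agg({'size': ['mean']})
print(means.loc[means.index.get_level_values('year_month') == '201412'])

# Or unstack 'company' into the columns and slice by the remaining index.
wide = df.groupby(['year_month', 'company']).agg({'size': ['mean']}).unstack(level=1)
print(wide.loc['201412'])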
