I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to deduct columns C and D in Df2 from columns C and D in Df1 based on the key contained in column A.
I also want to ensure that column B remains untouched, example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However, the answer does not explain what to do when the primary df has other columns (such as column B) that should be used neither as an index nor in the operation.
Is somebody please able to advise?
I was originally running a loop that finds the value in the other df and deducts it, but this takes too long with the size of data I am working with.
The idea is to specify the column(s) used for matching and the column(s) to subtract, move all columns not listed in cols into a MultiIndex, and then subtract:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or set only the key as the index, subtract, and then fill the resulting missing values (such as column B) back from the original Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
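For comparison, here is a minimal self-contained sketch of a different route to the same result: align Df2 to the row order of Df1 with reindex and subtract the underlying arrays. It is not the answer's method. It assumes the keys in column A of Df2 are unique, and it treats keys missing from Df2 as 0 (an assumption, not something the question specifies).

```python
import pandas as pd

Df1 = pd.DataFrame({'A': list('xyz'), 'B': list('jkl'),
                    'C': [5, 7, 9], 'D': [2, 3, 4]})
Df2 = pd.DataFrame({'A': list('zxy'), 'B': list('opq'),
                    'C': [1, 2, 3], 'D': [1, 1, 1]})

cols = ['C', 'D']  # columns to subtract

# Reorder Df2's values so that they follow the keys of Df1 row by row.
# Assumption: keys in Df2['A'] are unique; keys missing in Df2 count as 0.
aligned = Df2.set_index('A')[cols].reindex(Df1['A']).fillna(0)

# Subtract position-wise; column B of Df1 is never touched.
Df3 = Df1.copy()
Df3[cols] = Df1[cols].to_numpy() - aligned.to_numpy()
print(Df3)
```

Because the subtraction is done on plain arrays, the original index and row order of Df1 are preserved.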
I have this data in a dataframe. The Code column holds several values per row and is of object dtype.
I want to split each row so that every value in Code gets its own row.
I tried to change the datatype by using
df['Code'] = df['Code'].astype(str)
and then tried to split on the commas and reset the index on the basis of ID (which is unique), but I only get two columns back. I need the entire dataset.
df = (pd.DataFrame(df.Code.str.split(',').tolist(), index=df.ID).stack()).reset_index([0, 'ID'])
df.columns = ['ID', 'Code']
Can someone help me out? I don't understand how to twist this code.
Attaching the setup code:
import pandas as pd
x = {'ID': ['1','2','3','4','5','6','7'],
'A': ['a','b','c','a','b','b','c'],
'B': ['z','x','y','x','y','z','x'],
'C': ['s','d','w','','s','s','s'],
'D': ['m','j','j','h','m','h','h'],
'Code': ['AB,BC,A','AD,KL','AD,KL','AB,BC','A','A','B']
}
df = pd.DataFrame(x, columns = ['ID', 'A','B','C','D','Code'])
df
You can first split the Code column on the comma, then explode it to get the desired output.
df['Code']=df['Code'].str.split(',')
df=df.explode('Code')
OUTPUT:
ID A B C D Code
0 1 a z s m AB
0 1 a z s m BC
0 1 a z s m A
1 2 b x d j AD
1 2 b x d j KL
2 3 c y w j AD
2 3 c y w j KL
3 4 a x h AB
3 4 a x h BC
4 5 b y s m A
5 6 b z s h A
6 7 c x s h B
If needed, you can replace empty strings with NaN.
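A minimal sketch of that last step, on a cut-down version of the setup data (the two-row frame is made up for illustration). The reset_index call is optional and only there to avoid the repeated index labels that explode produces.

```python
import numpy as np
import pandas as pd

# Cut-down illustration data, not the full frame from the question
df = pd.DataFrame({'ID': ['1', '2'],
                   'C': ['s', ''],
                   'Code': ['AB,BC', 'A']})

# Split on commas and give every code its own row
df['Code'] = df['Code'].str.split(',')
df = df.explode('Code').reset_index(drop=True)

# Turn empty strings into proper missing values
df = df.replace('', np.nan)
print(df)
```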
I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
{
'A': list('aaabdcde'),
'B': list('smnipiuy'),
'C': list('zzzqqwll')
}
)
df2 = pd.DataFrame(
{
'mapcol': list('abpppozl')
}
)
A B C
0 a s z
1 a m z
2 a n z
3 b i q
4 d p q
5 c i w
6 d u l
7 e y l
mapcol
0 a
1 b
2 p
3 p
4 p
5 o
6 z
7 l
Now I want to create an additional column in df1 which should be filled with values coming from the columns A, B and C respectively, depending on whether their values can be found in df2['mapcol']. If the values in one row can be found in more than one column, they should be first used from A, then B and then C, so my expected outcome looks like this:
A B C final
0 a s z a # <- values can be found in A and C, but A is preferred
1 a m z a # <- values can be found in A and C, but A is preferred
2 a n z a # <- values can be found in A and C, but A is preferred
3 b i q b # <- value can be found in A
4 d p q p # <- value can be found in B
5 c i w NaN # none of the values can be mapped
6 d u l l # value can be found in C
7 e y l l # value can be found in C
A straightforward implementation could look like this (filling the column final iteratively using fillna in the preferred order):
preferred_order = ['A', 'B', 'C']
df1['final'] = np.nan
for col in preferred_order:
df1['final'] = df1['final'].fillna(df1[col][df1[col].isin(df2['mapcol'])])
which gives the desired outcome.
Does anyone see a solution that avoids the loop?
You can use where and isin on the full dataframe df1 to mask the values that are not in df2['mapcol'], then select the columns in preferred_order, bfill along the columns (axis=1), and keep the first column with iloc:
preferred_order = ['A', 'B', 'C']
df1['final'] = (df1.where(df1.isin(df2['mapcol'].to_numpy()))
[preferred_order]
.bfill(axis=1)
.iloc[:, 0]
)
print (df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l
Use:
order = ['A', 'B', 'C'] # order of columns
d = df1[order].isin(df2['mapcol'].tolist()).loc[lambda x: x.any(axis=1)].idxmax(axis=1)
df1.loc[d.index, 'final'] = df1.lookup(d.index, d)
Details:
Use DataFrame.isin and filter the rows using boolean masking with DataFrame.any along axis=1, then use DataFrame.idxmax along axis=1 to get the name of the first column in each row that holds a match (the first True value).
print(d)
0 A
1 A
2 A
3 A
4 B
6 C
7 C
dtype: object
Use DataFrame.lookup (deprecated in pandas 1.2.0 and removed in 2.0) to look up the values in df1 corresponding to the index and columns of d, and assign these values to the column final:
print(df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l
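Since DataFrame.lookup is no longer available in pandas 2.x, the same lookup can be sketched with positional NumPy indexing. This is an assumed substitute rather than part of the original answer; the names mask, row_pos and col_pos are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('aaabdcde'),
                    'B': list('smnipiuy'),
                    'C': list('zzzqqwll')})
df2 = pd.DataFrame({'mapcol': list('abpppozl')})

order = ['A', 'B', 'C']  # preferred order of columns
mask = df1[order].isin(df2['mapcol'].tolist())

# For rows with at least one match: name of the first matching column
d = mask.loc[mask.any(axis=1)].idxmax(axis=1)

# Translate labels into integer positions and pick the values directly
row_pos = df1.index.get_indexer(d.index)
col_pos = df1.columns.get_indexer(d)
df1.loc[d.index, 'final'] = df1.to_numpy()[row_pos, col_pos]
print(df1)
```

Rows without any match (row 5 here) are never assigned and therefore stay NaN in final.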
I have two pandas dataframes, A and B, and I want to left join A with B on two conditions: 1) A.id = B.id and 2) A.x is contained in B.x_set. Can anyone help me with this?
A:
id x
1 a
2 b
3 c
B:
id x_set detail
1 a,b,c x
1 d y
2 a,c z
2 d m
2 b n
3 a i
3 b,c j
The final table should be like this:
id x detail
1 a x
2 b n
3 c j
If using pandas>=0.25 (where DataFrame.explode is available), you can:
Transform the values to list
Explode the list into new rows
Merge back with A using pd.merge
B['x_set'] = B['x_set'].apply(lambda x: x.split(','))
B = B.explode('x_set')
A.merge(B, left_on=['id','x'], right_on=['id','x_set'])
Out[11]:
id x x_set detail
0 1 a a x
1 2 b b n
2 3 c c j
If pandas<0.25:
Transform the values to list
Get a flattened list of x
Create a new dataframe with the new list
Pass the id and detail using pd.Series.repeat
Merge with A (we can use the same keys here)
B['x_set'] = B['x_set'].apply(lambda x: x.split(','))
len_set = B['x_set'].apply(len).values
values = B['x_set'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]
new_B = pd.DataFrame(flat_results, columns=['x'])
new_B['id'] = B['id'].repeat(len_set).values
new_B['detail'] = B['detail'].repeat(len_set).values
A.merge(new_B, on=['id','x'])
Out[32]:
id x detail
0 1 a x
1 2 b n
2 3 c j
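The merges above are inner joins by default. Since the question asks for a left join, here is a hedged sketch that passes how='left' and drops the helper column before merging, so that rows of A without a match would be kept with NaN in detail. B_long is a made-up name, and the frames are rebuilt from the tables in the question.

```python
import pandas as pd

A = pd.DataFrame({'id': [1, 2, 3], 'x': ['a', 'b', 'c']})
B = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 3],
                  'x_set': ['a,b,c', 'd', 'a,c', 'd', 'b', 'a', 'b,c'],
                  'detail': list('xyzmnij')})

# One row per (id, x) pair; the original x_set column is no longer needed
B_long = (B.assign(x=B['x_set'].str.split(','))
           .explode('x')
           .drop(columns='x_set'))

# Left join keeps every row of A, matched or not
result = A.merge(B_long, on=['id', 'x'], how='left')
print(result)
```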
Consider the following hdfstore and dataframes df and df2
import pandas as pd
store = pd.HDFStore('test.h5')
midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)
df
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
df2 = pd.DataFrame(dict(C=range(6)), midx2)
df2
C
A B
0 V 0
W 1
X 2
1 V 3
W 4
X 5
I want to first write df to the store.
store.append('df', df)
store.get('df')
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
At a later point in time I will have another dataframe that I want to update the store with. I want to overwrite the rows with the same index values as are in my new dataframe while keeping the old ones.
When I do
store.append('df', df2)
store.get('df')
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
0 V 0
W 1
X 2
1 V 3
W 4
X 5
This isn't at all what I want. Notice that (0, 'X') and (1, 'X') are repeated. I can manipulate the combined dataframe and overwrite, but I expect to be working with a lot of data where this wouldn't be feasible.
How do I update the store to get the following?
C
A B
0 V 0
W 1
X 2
Y 1
Z 2
1 V 3
W 4
X 5
Y 4
Z 5
You'll see that for each level of 'A', 'Y' and 'Z' are the same, 'V' and 'W' are new, and 'X' is updated.
What is the correct way to do this?
Idea: remove matching rows (with matching index values) from the HDF first and then append df2 to HDFStore.
Problem: I couldn't find a way to use where="index in df2.index" for multi-index indexes.
Solution: first convert multiindexes to normal ones:
df.index = df.index.get_level_values(0).astype(str) + '_' + df.index.get_level_values(1).astype(str)
df2.index = df2.index.get_level_values(0).astype(str) + '_' + df2.index.get_level_values(1).astype(str)
this yields:
In [348]: df
Out[348]:
C
0_X 0
0_Y 1
0_Z 2
1_X 3
1_Y 4
1_Z 5
In [349]: df2
Out[349]:
C
0_V 0
0_W 1
0_X 2
1_V 3
1_W 4
1_X 5
Make sure that you use format='t' and data_columns=True (this saves the index and indexes all columns in the HDF5 file, allowing us to use them in the where clause) when you create/append HDF5 files:
store = pd.HDFStore('d:/temp/test1.h5')
store.append('df', df, format='t', data_columns=True)
store.close()
now we can first remove those rows from the HDFStore with matching indexes:
store = pd.HDFStore('d:/temp/test1.h5')
In [345]: store.remove('df', where="index in df2.index")
Out[345]: 2
and append df2:
In [346]: store.append('df', df2, format='t', data_columns=True, append=True)
Result:
In [347]: store.get('df')
Out[347]:
C
0_Y 1
0_Z 2
1_Y 4
1_Z 5
0_V 0
0_W 1
0_X 2
1_V 3
1_W 4
1_X 5
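To make the remove-then-append logic easier to check, here is the same idea sketched purely in memory on the small example frames, keeping the original MultiIndex. It only illustrates which rows survive; it is not a replacement for the HDFStore approach when the data does not fit in memory. The names kept and updated are made up for illustration.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)
midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
df2 = pd.DataFrame(dict(C=range(6)), midx2)

# "Remove": keep only the old rows whose index is absent from df2
kept = df[~df.index.isin(df2.index)]

# "Append": add all rows of df2, then sort for readability
updated = pd.concat([kept, df2]).sort_index()
print(updated)
```

After sorting, this matches the layout the question asks for: 'Y' and 'Z' unchanged, 'V' and 'W' added, 'X' overwritten.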
I want to group my data set and enrich it with a formatted representation of the aggregated information.
This is my data set:
h = ['A', 'B', 'C']
d = [["a", "x", 1], ["a", "y", 2], ["b", "y", 4]]
rows = pd.DataFrame(d, columns=h)
A B C
0 a x 1
1 a y 2
2 b y 4
I create a pivot table to generate 0 for missing values:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
C
B x y
A
a 1 2
b 0 4
I group by A to remove dimension B:
wanted = rows.groupby("A").sum()
C
A
a 3
b 4
I try to add a column with the string representation of the aggregate details:
wanted["D"] = pivot["C"].applymap(lambda vs: reduce(lambda a,b: str(a)+"+"+str(b), vs.values))
AttributeError: ("'int' object has no attribute 'values'", u'occurred at index x')
It seems that I don't understand applymap.
What I want to achieve is:
C D
A
a 3 1+2
b 4 0+4
You can first remove the [] from the parameters in pivot_table, which avoids creating a MultiIndex in the columns:
pivot = pd.pivot_table(rows,index="A", values="C", columns="B",fill_value=0)
Then sum values by columns:
pivot['C'] = pivot.sum(axis=1)
print (pivot)
B x y C
A
a 1 2 3
b 0 4 4
Cast the integer columns x and y to str with astype and write the concatenation to D:
pivot['D'] = pivot['x'].astype(str) + '+' + pivot['y'].astype(str)
print (pivot)
B x y C D
A
a 1 2 3 1+2
b 0 4 4 0+4
Finally, remove the columns' name with rename_axis (new in pandas 0.18.0) and drop the unnecessary columns:
pivot = pivot.rename_axis(None, axis=1).drop(['x', 'y'], axis=1)
print (pivot)
C D
A
a 3 1+2
b 4 0+4
But if you want to keep the MultiIndex in the columns:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
pivot['E'] = pivot["C"].sum(1)
print (pivot)
C E
B x y
A
a 1 2 3
b 0 4 4
pivot["D"] = pivot[('C','x')].astype(str) + '+' + pivot[('C','y')].astype(str)
print (pivot)
C E D
B x y
A
a 1 2 3 1+2
b 0 4 4 0+4
pivot = pivot.rename_axis((None,None), axis=1).drop('C', axis=1).rename(columns={'E':'C'})
pivot.columns = pivot.columns.droplevel(-1)
print (pivot)
C D
A
a 3 1+2
b 4 0+4
EDIT:
Another solution with groupby and MultiIndex.droplevel:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
#remove top level of Multiindex in columns
pivot.columns = pivot.columns.droplevel(0)
print (pivot)
B x y
A
a 1 2
b 0 4
#select C explicitly so the string column B is not summed in newer pandas
wanted = rows.groupby("A")[["C"]].sum()
wanted['D'] = pivot['x'].astype(str) + '+' + pivot['y'].astype(str)
print (wanted)
C D
A
a 3 1+2
b 4 0+4
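The solutions above name the columns x and y explicitly. As a hedged sketch of what the reduce attempt in the question was aiming for, the details string can also be built for any number of B values by joining each row of the wide table. This variant uses groupby with unstack instead of pivot_table so that the values stay integers; wide is a made-up name.

```python
import pandas as pd

rows = pd.DataFrame([["a", "x", 1], ["a", "y", 2], ["b", "y", 4]],
                    columns=["A", "B", "C"])

# One column per value of B, missing combinations filled with 0
wide = rows.groupby(["A", "B"])["C"].sum().unstack(fill_value=0)

# Total per A, plus a "+"-joined string of the per-B details
wanted = wide.sum(axis=1).to_frame("C")
wanted["D"] = wide.astype(str).apply("+".join, axis=1)
print(wanted)
```

Joining row by row with apply is slower than vectorised string concatenation, so for very wide or long tables the explicit column version may still be preferable.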