Map values based on matched columns - Python

I want to map values based on how two columns are matched. For instance, the df below contains different labels, A or B. I want to assign a new column that describes these labels. This is done by comparing columns Z L and Z P. Z L will always contain values from ['X1','X2','X3','X4'], while Z P will correspondingly contain values from ['LA','LB','LC','LD'].
These will always be in ascending order or reverse order. Ascending order means X1 corresponds to LA, X2 corresponds to LB, etc. Reverse order means X1 corresponds to LD, X2 corresponds to LC, etc.
If ascending order, I want to map an R. If reverse order, I want to map an L.
import numpy as np
import pandas as pd

X = ['X1','X2','X3','X4']
R = ['LA','LB','LC','LD']
L = ['LD','LC','LB','LA']

df = pd.DataFrame({
    'Period' : [1,1,1,1,1,2,2,2,2,2],
    'labels' : ['A','B','A','B','A','B','A','B','A','B'],
    'Z L' : [np.nan,np.nan,'X3','X2','X4',np.nan,'X2','X3','X3','X1'],
    'Z P' : [np.nan,np.nan,'LC','LC','LD',np.nan,'LC','LC','LB','LA'],
})
df = df.dropna()
This is the dataset, after dropping NaNs, used to determine the combinations. I have a large df with repeated combinations, so I'm not too concerned with returning all of them; I'm mainly concerned with the unique Mapped values for each Period.
Period labels Z L Z P
2 1 A X3 LC
3 1 B X2 LC
4 1 A X4 LD
6 2 A X2 LC
7 2 B X3 LC
8 2 A X3 LB
9 2 B X1 LA
Attempt:
labels = df['labels'].unique().tolist()
I = df.loc[df['labels'] == labels[0]]
J = df.loc[df['labels'] == labels[1]]
I['Map'] = ((I['Z L'].isin(X)) | (I['Z P'].isin(R))).map({True:'R', False:'L'})
J['Map'] = ((J['Z L'].isin(X)) | (J['Z P'].isin(R))).map({True:'R', False:'L'})
This doesn't work: every non-null Z L value is in X (and every Z P value is in R), so the condition is always True and Map always becomes 'R'.
If I drop duplicates from period and labels the intended df is:
Period labels Map
0 1 A R
1 1 B L
2 2 A L
3 2 B R

Here's my approach:
# the ascending orders
lst1,lst2 = ['X1','X2','X3','X4'], ['LA','LB','LC','LD']
# enumerate the orders
d1, d2 = ({v:k for k,v in enumerate(l)} for l in (lst1, lst2))
# check if the enumerations in `Z L` and `Z P` are the same
df['Map'] = np.where(df['Z L'].map(d1) == df['Z P'].map(d2), 'R', 'L')
Output:
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
4 1 A X4 LD R
6 2 A X2 LC L
7 2 B X3 LC R
8 2 A X3 LB L
9 2 B X1 LA R
and df.drop_duplicates(['Period', 'labels']):
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
6 2 A X2 LC L
7 2 B X3 LC R

You said your data is always either in ascending or reversed order. You only need to define a fixed mapping between Z L and Z P for the ascending (R) case and check against it: if a row matches, it is R, else L. I may be wrong, but I think the solution can be reduced to this:
r_dict = dict(zip(['X1','X2','X3','X4'], ['LA','LB','LC','LD']))
df['Map'] = (df['Z L'].map(r_dict) == df['Z P']).map({True: 'R', False: 'L'})
Out[292]:
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
4 1 A X4 LD R
6 2 A X2 LC L
7 2 B X3 LC R
8 2 A X3 LB L
9 2 B X1 LA R
For the bottom desired output, you just drop_duplicates as QuangHoang suggested.
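Concretely, that last step might look like this (a sketch against the same df as above):
df[['Period', 'labels', 'Map']].drop_duplicates(['Period', 'labels'])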

Related

Group By Sum Multiple Columns in Pandas (Ignoring duplicates)

I have the following code, where my dataframe contains three columns to be summed plus one extra column:
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
oneframe = pd.concat([df['toBeSummed'],df['toBeSummed2'],df['toBesummed3']], axis=1).reset_index()
temp = oneframe.groupby(['toBeSummed']).size().reset_index()
temp2 = oneframe.groupby(['toBeSummed2']).size().reset_index()
temp3 = oneframe.groupby(['toBesummed3']).size().reset_index()
temp.columns.values[0] = "SameName"
temp2.columns.values[0] = "SameName"
temp3.columns.values[0] = "SameName"
final = pd.concat([temp,temp2,temp3]).groupby(['SameName']).sum().reset_index()
final.columns.values[0] = "Letter"
final.columns.values[1] = "Sum"
The problem here is that my code sums up all instances of each value, so calling final results in
Letter Sum
0 X 3
1 Y 4
2 Z 5
However, I want a value counted at most once per row (i.e., the first row has two X's, so only one X should be counted).
Meaning the desired output is
Letter Sum
0 X 2
1 Y 3
2 Z 3
I can update or add more comments if this is confusing.
Given df:
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
Doing:
sum_cols = ['toBeSummed', 'toBeSummed2', 'toBesummed3']
out = df[sum_cols].apply(lambda x: x.unique(), axis=1).explode().value_counts()
print(out.to_frame('Sum'))
Output:
Sum
Y 3
Z 3
X 2
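If the row-wise apply feels opaque, an equivalent formulation (just a sketch against the same frame and column names) is to stack to long format and deduplicate within each original row:
out = (df[sum_cols].stack()           # long format: (row, column) -> value
         .groupby(level=0).unique()   # unique values within each original row
         .explode()
         .value_counts())
print(out.to_frame('Sum'))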

how to solve pandas multi-column explode issue?

I am trying to explode multiple columns at a time, systematically.
Such that, given a frame where the sauce and meal columns hold lists (see the setup in the last answer below for an example), I want the final output with each list element expanded onto its own row.
I tried
df = df.explode('sauce', 'meal')
but this only explodes the first column (sauce in this case); the second one was not exploded.
I also tried:
df=df.explode(['sauce', 'meal'])
but this code provides
ValueError: column must be a scalar
error.
I tried this approach, and also this one; neither worked.
Note: I cannot use fruits as the index, as it contains some non-unique values.
Prior to pandas 1.3.0 use:
df.set_index(['fruits', 'veggies'])[['sauce', 'meal']].apply(pd.Series.explode).reset_index()
Output:
fruits veggies sauce meal
0 x1 y2 a d
1 x1 y2 b e
2 x1 y2 c f
3 x2 y2 g k
4 x2 y2 h l
Many columns? Try:
df.set_index(df.columns.difference(['sauce', 'meal']).tolist())\
  .apply(pd.Series.explode).reset_index()
Output:
fruits veggies sauce meal
0 x1 y2 a d
1 x1 y2 b e
2 x1 y2 c f
3 x2 y2 g k
4 x2 y2 h l
Update your version of Pandas: multi-column explode is supported natively from 1.3.0 onwards.
# Setup
df = pd.DataFrame({'fruits': ['x1', 'x2'],
                   'veggies': ['y1', 'y2'],
                   'sauce': [list('abc'), list('gh')],
                   'meal': [list('def'), list('kl')]})
print(df)
# Output
fruits veggies sauce meal
0 x1 y1 [a, b, c] [d, e, f]
1 x2 y2 [g, h] [k, l]
Explode (Pandas 1.3.5):
out = df.explode(['sauce', 'meal'])
print(out)
# Output
fruits veggies sauce meal
0 x1 y1 a d
0 x1 y1 b e
0 x1 y1 c f
1 x2 y2 g k
1 x2 y2 h l
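Since explode repeats the original index (note the duplicated 0s and 1s above), resetting it is a common follow-up if you want a clean RangeIndex:
out = df.explode(['sauce', 'meal']).reset_index(drop=True)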

pd.DataFrame.update puts the result at the top of the dataframe

Let's say I have two dataframes like this:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'], 'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m)
I want to update the zeroes in the z column of the nf dataframe with the values from the z column of the mf dataframe, matching rows on the keys in column x.
When I call
nf.update(mf)
I get
x y z
b 1 10
d 2 100
e 3 1000
d 4 0
e 5 0
instead of the desired output
x y z
a 1 0
b 2 10
c 3 0
d 4 100
e 5 1000
To solve your problem, you need to match the indexes of both dataframes. Here's how you can do it:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n).set_index('x')
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m).set_index('x')
nf.update(mf)
nf = nf.reset_index()
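Printing nf now gives the desired output:
   x  y     z
0  a  1     0
1  b  2    10
2  c  3     0
3  d  4   100
4  e  5  1000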

Replace contents of cell with another cell if condition on a separate cell is met

I have the following data frame
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
We can see four columns A, B, C, D. The intended outcome is to replace the contents of B with the contents of D if a condition on C is met; for this example the condition is C == 1.
The intended output is
A = [1,2,5,4,3,1]
B = ["y","No","y","y","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
output_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
output_df.drop('D', axis = 1)
What is the best way to apply this logic to a data frame?
There are many ways to solve this; here is another one:
test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
This can be done with np.where:
test_df['B'] = np.where(test_df['C']==1, test_df['D'], test_df['B'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
The desired output is achieved using .loc with column 'C' as the mask.
test_df.loc[test_df['C']==1,'B'] = test_df.loc[test_df['C']==1,'D']
UPDATE: I just found out a similar answer was posted by @QuangHoang. This answer is slightly different in that it does not require numpy.
I don't know if inverse is the right word here, but I noticed recently that mask and where are "inverses" of each other: if you negate the condition of a .where with ~, you get the same result as mask:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
test_df['B'] = test_df['B'].where(~(test_df['C'] == 1), test_df['D'])
# test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D']) - Scott Boston's answer
test_df
Out[1]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
You can also use df.where:
test_df['B'] = test_df['D'].where(test_df.C.eq(1), test_df.B)
Output:
In [875]: test_df
Out[875]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n

How to repeat certain rows of a dataframe?

I have a dataframe like this
import pandas as pd
df1 = pd.DataFrame({
    'key': list('AAABBC'),
    'prop1': list('xyzuuy'),
    'prop2': list('mnbnbb')
})
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
and a dictionary like this (user input):
d = {
    'A': 2,
    'B': 1,
    'C': 3,
}
The keys of d refer to entries in column key of df1; the values indicate how often the rows of df1 belonging to the respective keys should be present: 1 means nothing has to be done, 2 means all rows should be copied once, 3 means they should be copied twice.
For the example above, the expected output looks as follows:
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
6 A x m # <-- copied, copy 1
7 A y n # <-- copied, copy 1
8 A z b # <-- copied, copy 1
9 C y b # <-- copied, copy 1
10 C y b # <-- copied, copy 2
So, the rows that belong to A have been copied once and added to df1, nothing had to be done about the rows that belong to B, and the rows that belong to C have been copied twice and were also added to df1.
I currently implement this as follows:
dfs_to_add = []
for el, val in d.items():
    if val > 1:
        _temp_df = pd.concat(
            [df1[df1['key'] == el]] * (val - 1)
        )
        dfs_to_add.append(_temp_df)
df_to_add = pd.concat(dfs_to_add)
df_final = pd.concat([df1, df_to_add]).reset_index(drop=True)
which gives me the desired output.
The code is rather ugly; does anyone see a more straightforward way to get to the same output?
The order is important, so in the case of A, I would need
0 A x m
1 A y n
2 A z b
0 A x m
1 A y n
2 A z b
and not
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
We can use concat + groupby:
df = pd.concat([pd.concat([y] * d.get(x)) for x, y in df1.groupby('key')])
key prop1 prop2
0 A x m
1 A y n
2 A z b
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
One way, using Index.repeat with loc[] and Series.map:
m = df1.set_index('key',append=True)
out = m.loc[m.index.repeat(df1['key'].map(d))].reset_index('key')
print(out)
key prop1 prop2
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
You can try repeat:
df1.loc[df1.index.repeat(df1['key'].map(d))]
Output:
key prop1 prop2
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
If order is not important, use one of the other solutions.
If order is important, get the indices of the repeated values, repeat them with loc and append them to the original:
idx = [x for k, v in d.items() for x in df1.index[df1['key'] == k].repeat(v-1)]
df = df1.append(df1.loc[idx], ignore_index=True)
print (df)
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
6 A x m
7 A y n
8 A z b
9 C y b
10 C y b
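One caveat, assuming a recent environment: DataFrame.append was removed in pandas 2.0, so on current versions the last step becomes
df = pd.concat([df1, df1.loc[idx]], ignore_index=True)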
Using DataFrame.merge and np.repeat (note this also places the copies of each row next to each other rather than appending whole blocks, so it fits the order-insensitive case):
df = df1.merge(
    pd.Series(np.repeat(list(d.keys()), list(d.values())), name='key'), on='key')
Result:
# print(df)
key prop1 prop2
0 A x m
1 A x m
2 A y n
3 A y n
4 A z b
5 A z b
6 B u n
7 B u b
8 C y b
9 C y b
10 C y b
