Fill cols where equal to value, until another value - Pandas - python

I'm trying to ffill() values in two columns of a df based on a separate column. I'm hoping to continue filling until a condition is met. Using the df below, wherever Val1 and Val2 are equal to C, I want to fill subsequent rows until a string in Code begins with one of ['FR','GE','GA'].
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Code': ['CA','GA','YA','GE','XA','CA','YA','FR','XA'],
    'Val1': ['A','B','C','A','B','C','A','B','C'],
    'Val2': ['A','B','C','A','B','C','A','B','C'],
})
mask = (df['Val1'] == 'C') & (df['Val2'] == 'C')
cols = ['Val1', 'Val2']
df[cols] = np.where(mask, df[cols].ffill(), df[cols])
Intended output:
Code Val1 Val2
0 CA A A
1 GA B B
2 YA C C
3 GE A A
4 XA B B
5 CA C C
6 YA C C
7 FR B B
8 XA C C
Note: Strings in Code are shortened to two characters here but are longer in my dataset, so I'm hoping to use startswith.

The problem is similar to a start/stop-signal question that I have answered before, but I couldn't find it. So here's the solution, along with the other things you mentioned:
# check for C
is_C = df.Val1.eq('C') & df.Val2.eq('C')
# check for start substring with regex
startswith = df.Code.str.match("^(FR|GE|GA)")
# merge the two series
# startswith is 0, is_C is 1
mask = np.select((startswith,is_C), (0,1), np.nan)
# update mask with ffill
# rows after an `is_C` and before a `startswith` will be marked with 1
mask = pd.Series(mask, df.index).ffill().fillna(0).astype(bool)
# update the dataframe
df.loc[mask, ['Val1','Val2']] = 'C'
Output
Code Val1 Val2
0 CA A A
1 GA B B
2 YA C C
3 GE A A
4 XA B B
5 CA C C
6 YA C C
7 FR B B
8 XA C C
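Since the real codes are longer than two characters, the regex start check can also be swapped for str.startswith, which in recent pandas versions accepts a tuple of prefixes; a minimal sketch of that one-line swap:
# equivalent start check without regex (recent pandas lets startswith take a tuple of prefixes)
startswith = df.Code.str.startswith(('FR', 'GE', 'GA'))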

Related

Subtracting multiple columns between dataframes based on key

I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to subtract columns C and D in Df2 from columns C and D in Df1, based on the key contained in column A.
I also want to ensure that column B remains untouched, for example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However, what the answer does not explain is how to handle other columns in the primary df (such as column B) that should not be used as an index or involved in the operation.
Could somebody please advise?
I was originally using a loop that finds the value in the other df and subtracts it, but this takes too long to run with the size of data I am working with.
The idea is to specify the column(s) for matching and the column(s) to subtract, convert all remaining column names (those not in match or cols) into a MultiIndex, and then subtract:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or, filling values that have no match back from the original Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
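A different sketch that avoids the MultiIndex, assuming the keys in column A are unique in Df2: align Df2 to Df1's key order with reindex and subtract the raw arrays, leaving every other column alone.
match, cols = 'A', ['C', 'D']
# align Df2's rows to the order of the keys in Df1, then subtract position-wise
aligned = Df2.set_index(match)[cols].reindex(Df1[match])
out = Df1.copy()
out[cols] = Df1[cols].to_numpy() - aligned.to_numpy()
print(out)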

Splitting a column into multiple rows

I have this data in a dataframe; the Code column holds several comma-separated values and is of object dtype.
I want to split the rows so that each comma-separated value in Code ends up on its own row, keeping the other columns.
I tried to change the datatype by using
df['Code'] = df['Code'].astype(str)
and then tried to split on the commas and reset the index on the basis of ID (which is unique), but I only get those two columns back. I need the entire dataset.
df = (pd.DataFrame(df.Code.str.split(',').tolist(), index=df.ID).stack()).reset_index([0, 'ID'])
df.columns = ['ID', 'Code']
Can someone help me out? I don't understand how to adapt this code.
Attaching the setup code:
import pandas as pd
x = {'ID': ['1','2','3','4','5','6','7'],
     'A': ['a','b','c','a','b','b','c'],
     'B': ['z','x','y','x','y','z','x'],
     'C': ['s','d','w','','s','s','s'],
     'D': ['m','j','j','h','m','h','h'],
     'Code': ['AB,BC,A','AD,KL','AD,KL','AB,BC','A','A','B']
    }
df = pd.DataFrame(x, columns=['ID', 'A', 'B', 'C', 'D', 'Code'])
df
You can first split the Code column on commas, then explode it to get the desired output:
df['Code']=df['Code'].str.split(',')
df=df.explode('Code')
OUTPUT:
ID A B C D Code
0 1 a z s m AB
0 1 a z s m BC
0 1 a z s m A
1 2 b x d j AD
1 2 b x d j KL
2 3 c y w j AD
2 3 c y w j KL
3 4 a x h AB
3 4 a x h BC
4 5 b y s m A
5 6 b z s h A
6 7 c x s h B
If needed, you can replace the empty strings with NaN, for example as sketched below.
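A minimal sketch of that cleanup (and of rebuilding a clean index, since explode repeats the original index labels):
import numpy as np
# turn empty strings into proper missing values and reset the duplicated index
df = df.replace('', np.nan).reset_index(drop=True)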

How to fill a column based on several other columns?

I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
    {
        'A': list('aaabdcde'),
        'B': list('smnipiuy'),
        'C': list('zzzqqwll')
    }
)
df2 = pd.DataFrame(
    {
        'mapcol': list('abpppozl')
    }
)
A B C
0 a s z
1 a m z
2 a n z
3 b i q
4 d p q
5 c i w
6 d u l
7 e y l
mapcol
0 a
1 b
2 p
3 p
4 p
5 o
6 z
7 l
Now I want to create an additional column in df1 which should be filled with values coming from the columns A, B and C respectively, depending on whether their values can be found in df2['mapcol']. If the values in one row can be found in more than one column, they should be first used from A, then B and then C, so my expected outcome looks like this:
A B C final
0 a s z a # <- values can be found in A and C, but A is preferred
1 a m z a # <- values can be found in A and C, but A is preferred
2 a n z a # <- values can be found in A and C, but A is preferred
3 b i q b # <- value can be found in A
4 d p q p # <- value can be found in B
5 c i w NaN # none of the values can be mapped
6 d u l l # value can be found in C
7 e y l l # value can be found in C
A straightforward implementation could look like this (filling the column final iteratively using fillna in the preferred order):
preferred_order = ['A', 'B', 'C']
df1['final'] = np.nan
for col in preferred_order:
    df1['final'] = df1['final'].fillna(df1[col][df1[col].isin(df2['mapcol'])])
which gives the desired outcome.
Does anyone see a solution that avoids the loop?
You can use where and isin on the full dataframe df1 to mask the values not found in df2, then reorder the columns with preferred_order, bfill along the columns, and keep the first column with iloc:
preferred_order = ['A', 'B', 'C']
df1['final'] = (df1.where(df1.isin(df2['mapcol'].to_numpy()))
                   [preferred_order]
                   .bfill(axis=1)
                   .iloc[:, 0]
               )
print (df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l
Use:
order = ['A', 'B', 'C'] # order of columns
d = df1[order].isin(df2['mapcol'].tolist()).loc[lambda x: x.any(axis=1)].idxmax(axis=1)
df1.loc[d.index, 'final'] = df1.lookup(d.index, d)
Details:
Use DataFrame.isin and filter the rows with a boolean mask built from DataFrame.any along axis=1, then use DataFrame.idxmax along axis=1 to get the name of the first matching column in each remaining row.
print(d)
0 A
1 A
2 A
3 A
4 B
6 C
7 C
dtype: object
Use DataFrame.lookup to look up the values in df1 corresponding to the index and column labels held in d, and assign these values to column final:
print(df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l
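Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on newer versions that last step needs a replacement; a rough sketch using positional NumPy indexing, assuming the same d as above:
# translate the labels in d into positional row/column indices
rows = df1.index.get_indexer(d.index)
cols = df1.columns.get_indexer(d)
df1.loc[d.index, 'final'] = df1.to_numpy()[rows, cols]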

Trouble with ignore_index and concat()

I'm new to Python. I have 2 dataframes each with a single column. I want to join them together and keep the values based on their respective positions in each of the tables.
My code looks something like this:
huh = pd.DataFrame(columns=['result'], data=['a','b','c','d'])
huh2 = pd.DataFrame(columns=['result2'], data=['aa','bb','cc','dd'])
huh2 = huh2.sort_values('result2', ascending=False)
tmp = pd.concat([huh,huh2], ignore_index=True, axis=1)
tmp
From the documentation it looks like the ignore_index flag and axis=1 should be sufficient to achieve this but the results obviously disagree.
Current Output:
0 1
0 a aa
1 b bb
2 c cc
3 d dd
Desired Output:
result result2
0 a dd
1 b cc
2 c bb
3 d aa
If you concatenate the DataFrames horizontally (axis=1), ignore_index=True discards the column names (they become 0, 1, ...); if you concatenate vertically, it discards the row index instead. It never affects the other axis, which is still aligned on its labels, so the rows here are matched on their original index values rather than by position.
In your case, I would recommend setting the index of "huh2" to be the same as that of "huh".
pd.concat([huh, huh2.set_index(huh.index)], axis=1)
result result2
0 a dd
1 b cc
2 c bb
3 d aa
If you aren't dealing with custom indices, reset_index will suffice.
pd.concat([huh, huh2.reset_index(drop=True)], axis=1)
result result2
0 a dd
1 b cc
2 c bb
3 d aa
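If all you need is the second column attached positionally, another small sketch is to bypass index alignment entirely by assigning the underlying array:
# assigning a NumPy array ignores huh2's index, so values are paired purely by position
huh['result2'] = huh2['result2'].to_numpy()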

How to broadcast a scalar to a filtered column in a pandas dataframe

I wish to broadcast the result of an expression to a dataframe, but not to the entire column, just to a filtered subset. A simplification below:
In [6]: df1 = DataFrame({"A": [1, 2, 3, 4], "B": ["w", "x", "y", "z"], "C": numpy.zeros((4), dtype='S1')})
In [7]: df1
Out[7]:
A B C
0 1 w
1 2 x
2 3 y
3 4 z
A and B contain my existing data, and column C is prepared for my results to be entered into. I can broadcast to the entire column as below:
In [9]: df1['C'] = 'H'
In [10]: df1
Out[10]:
A B C
0 1 w H
1 2 x H
2 3 y H
3 4 z H
But if I try and broadcast (in this example, the letter "R") to a filtered subset:
In [14]: (df1[df1['A'] > 2])['C']
Out[14]:
2 H
3 H
Name: C
(just to prove the filtering works)
so now I try and assign "R" to this subset..
In [12]: (df1[df1['A'] > 2])['C'] = "R"
In [13]: df1
Out[13]:
A B C
0 1 w H
1 2 x H
2 3 y H
3 4 z H
But my values remain unchanged :( (though interestingly I do not receive an error!?)
Please can anyone suggest a way I can achieve this?
Many thanks,
Select the rows you want to change and the target column in a single .loc call, then assign:
df1.loc[df1['A'] > 2, 'C'] = "R"
A B C
0 1 w H
1 2 x H
2 3 y R
3 4 z R
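Equivalently, a quick sketch using Series.mask instead of .loc assignment: write "R" wherever the condition holds and keep the existing value elsewhere.
# replace values where the condition is True, leave the rest untouched
df1['C'] = df1['C'].mask(df1['A'] > 2, 'R')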
Just as a remark: pandas was nicely improved to give a warning in this case:
In [12]: (df1[df1['A'] > 2])['C'] = "R"
/Users/tismer/anaconda/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
