Assign value to a column based on a string-based hierarchy - python

I am trying to create a new column in a pandas DataFrame by comparing two existing columns and keeping whichever value ranks higher in a predefined hierarchy. From highest to lowest, the hierarchy is:
A1
A2
A3
A4
A5
The DataFrame df is seen below.
sales_code price_bucket_a price_bucket_b
101 A1 A2
102 A3 A4
202 A2 A3
201 A4 A5
301 A2 A2
302 A5 A1
The desired output I am attempting to achieve is seen below.
sales_code price_bucket_a price_bucket_b price_bucket_hier
101 A1 A2 A1
102 A3 A4 A3
202 A2 A3 A2
201 A4 A5 A4
301 A2 A2 A2
302 A5 A1 A1
The hierarchy and DataFrame shown here are only a small sample of the full data.
Any assistance that anyone could provide would be greatly appreciated.

First convert the columns to an ordered categorical; then min or max across the columns gives the right answer:
cat=['A1','A2','A3','A4','A5']
df[['price_bucket_a','price_bucket_b']].apply(lambda x : pd.Categorical(x, categories=cat,ordered=True )).min(axis=1)
0 A1
1 A3
2 A2
3 A4
4 A2
dtype: object

Here's one approach IIUC:
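import numpy as np
# Strip the leading 'A', compare the numeric parts, and get the name of the column holding the smaller number
ix = df.filter(like='price').apply(lambda x: x.str.lstrip('A')).astype(int).idxmin(axis=1)
# DataFrame.lookup was removed in pandas 2.0, so index the underlying array by row and column position instead
df['price_bucket_hier'] = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(ix)]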
print(df)
sales_code price_bucket_a price_bucket_b price_bucket_hier
0 101 A1 A2 A1
1 102 A3 A4 A3
2 202 A2 A3 A2
3 201 A4 A5 A4
4 301 A2 A2 A2
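For reference, the same comparison can also be sketched by mapping each label to its rank in the hierarchy, which avoids both categoricals and DataFrame.lookup. This is only a minimal sketch: the names hierarchy, rank, and cols are illustrative, and it assumes every bucket value appears in the hierarchy list.

```python
import pandas as pd

df = pd.DataFrame({
    "sales_code": [101, 102, 202, 201, 301, 302],
    "price_bucket_a": ["A1", "A3", "A2", "A4", "A2", "A5"],
    "price_bucket_b": ["A2", "A4", "A3", "A5", "A2", "A1"],
})

# Hierarchy from highest to lowest (taken from the question).
hierarchy = ["A1", "A2", "A3", "A4", "A5"]
# Illustrative helper: label -> rank, where 0 is the highest level.
rank = {label: i for i, label in enumerate(hierarchy)}

cols = ["price_bucket_a", "price_bucket_b"]
# Replace every label by its rank, column by column.
ranks = df[cols].apply(lambda s: s.map(rank))
# The smallest rank in each row corresponds to the highest level.
best = ranks.min(axis=1)
# Translate the winning rank back into its label.
df["price_bucket_hier"] = best.map(lambda i: hierarchy[i])

print(df)
```

Because the hierarchy is an explicit list, it should also work for labels that do not happen to sort alphabetically or numerically.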

Related

Check if value of one column exists in another column, put a value in another column in pandas

Say I have a data frame like the following:
A B C D E
a1 b1 c1 d1 e1
a2 a1 c2 d2 e2
a3 a1 a2 d3 e3
a4 a1 a2 a3 e4
I want to create a new column with predefined values when a value is found in the other columns.
Something like this:
A B C D E F
a1 b1 c1 d1 e1 NA
a2 a1 c2 d2 e2 in_B
a3 a1 a2 d3 e3 in_B, in_C
a4 a1 a2 a3 e4 in_B, in_C, in_D
The labels in_B, in_C, and so on could be any string of my choice. If the value is present in multiple columns, F should list all of them, as in rows 3 and 4 (two labels in row 3, three in row 4). So far I have tried the following:
DF.F=np.where(DF.A.isin(DF.B), DF.A,'in_B')
But it does not give the expected result. Any help would be appreciated.
Steps:
Stack the DataFrame.
Flag the duplicated values.
Unstack to get the original shape back.
Use dot to build the labels.
df['new_col'] = df.stack().duplicated().unstack().dot(
    'In ' + df.columns + ',').str.strip(',')
Output:
A B C D E new_col
0 a1 b1 c1 d1 e1
1 a2 a1 c2 d2 e2 In B
2 a3 a1 a2 d3 e3 In B,In C
3 a4 a1 a2 a3 e4 In B,In C,In D
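If the goal is specifically to flag cells whose value occurs in column A (rather than any repeated value in the frame), a more explicit variant can be sketched as follows. This is an assumption about the intent, not the approach of the answer above; other_cols and found are illustrative names, and rows without a match get an empty string instead of NA.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "a1", "a1", "a1"],
    "C": ["c1", "c2", "a2", "a2"],
    "D": ["d1", "d2", "d3", "a3"],
    "E": ["e1", "e2", "e3", "e4"],
})

# Columns to search for values that also occur in column A.
other_cols = ["B", "C", "D", "E"]

# True where a cell holds a value that appears anywhere in column A.
found = df[other_cols].isin(df["A"].tolist())

# For each row, join the labels of the columns that matched.
df["F"] = found.apply(
    lambda row: ", ".join(f"in_{col}" for col, hit in row.items() if hit),
    axis=1,
)

print(df)
```

The label format (in_B, in_C, ...) is built in one place, so it can be swapped for any other string.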

Dataframe slicing with string values

I have a DataFrame of strings that I would like to modify. I need to cut off each row at a given value, say A4, and either replace the values after A4 with -- or remove them. In other words, I would like a new DataFrame that keeps values only up to the string "A4". How would I do this?
import pandas as pd
columns = ['c1','c2','c3','c4','c5','c6']
values = [['A1', 'A2','A3','A4','A5','A6'],['A1','A3','A2','A5','A4','A6'],['A1','A2','A4','A3','A6','A5'],['A2','A1','A3','A4','A5','A6'], ['A2','A1','A3','A4','A6','A5'],['A1','A2','A4','A3','A5','A6']]
input = pd.DataFrame(values, columns)
columns = ['c1','c2','c3','c4','c5','c6']
values = [['A1', 'A2','A3','A4','--','--'],['A1','A3','A2','A5','A4','--'],['A1','A2','A4','--','--','--'],['A2','A1','A3','A4','--','--'], ['A2','A1','A3','A4','--','--'],['A1','A2','A4','--','--','--']]
output = pd.DataFrame(values, columns)
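You can write a small function that takes a row and overwrites every value after the desired one:
def myfunc(x, val):
    # find the position of the first occurrence of val (i stays at the last position if val is absent)
    for i in range(len(x)):
        if x[i] == val:
            break
    # overwrite everything after that position
    x[(i+1):] = '--'
    return x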
Then you need to apply the function to the dataframe in a rowwise (axis = 1) manner:
input.apply(lambda x: myfunc(x, 'A4'), axis = 1)
0 1 2 3 4 5
c1 A1 A2 A3 A4 -- --
c2 A1 A3 A2 A5 A4 --
c3 A1 A2 A4 -- -- --
c4 A2 A1 A3 A5 A4 --
c5 A2 A1 A4 -- -- --
c6 A1 A2 A4 -- -- --
This assumes the values to mask are all greater than A4 (A5 through A9). Note that it masks them wherever they occur, not only after A4:
df.replace('A([5-9])', '--', regex=True)
0 1 2 3 4 5
c1 A1 A2 A3 A4 -- --
c2 A1 A3 A2 -- A4 --
c3 A1 A2 A4 A3 -- --
c4 A2 A1 A3 -- A4 --
c5 A2 A1 A4 A3 -- --
c6 A1 A2 A4 A3 -- --
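The row-wise cut at A4 can also be sketched without a Python loop by counting how many A4 cells appear strictly before each position. This is a minimal sketch using the question's sample values; frame, hits, and after are illustrative names, and rows without any A4 are left unchanged.

```python
import pandas as pd

columns = ['c1', 'c2', 'c3', 'c4', 'c5', 'c6']
values = [
    ['A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
    ['A1', 'A3', 'A2', 'A5', 'A4', 'A6'],
    ['A1', 'A2', 'A4', 'A3', 'A6', 'A5'],
    ['A2', 'A1', 'A3', 'A4', 'A5', 'A6'],
    ['A2', 'A1', 'A3', 'A4', 'A6', 'A5'],
    ['A1', 'A2', 'A4', 'A3', 'A5', 'A6'],
]
# As in the question, the labels c1..c6 end up as the row index.
frame = pd.DataFrame(values, index=columns)

# 1 where the cell equals the cut-off value, 0 elsewhere.
hits = frame.eq('A4').astype(int)
# Running count of hits along each row, minus the current cell,
# gives the number of 'A4' cells strictly to the left.
after = (hits.cumsum(axis=1) - hits) > 0
# Replace everything to the right of the first 'A4' with '--'.
result = frame.mask(after, '--')

print(result)
```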

Summing columns from different dataframe according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add value from each of these frames to total in main_df, matching rows on the Cri columns,
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it with a for loop, but in the end I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Are there other ways to solve it?
Thank you!
First you should align your numeric column names so that every frame uses value. The code below refers to the question's main_df as df_main:
df_main = main_df.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = pd.concat([df_main, df_1, df_2, df_3])\
.groupby(['Cri1', 'Cri2', 'Cri3']).sum()\
.reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0
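If main_df should keep its own rows and its total column name, one possible variant is to sum the other frames first and then merge the result back. This is a sketch under the assumption that each combination of the three Cri columns is unique in main_df; keys, extra, and merged are illustrative names.

```python
import pandas as pd

keys = ["Cri1", "Cri2", "Cri3"]

main_df = pd.DataFrame({"Cri1": ["A1", "B1", "C1"], "Cri2": ["A2", "B2", "C2"],
                        "Cri3": ["A3", "B3", "C3"], "total": [4, 5, 6]})
df_1 = pd.DataFrame({"Cri1": ["A1", "B1"], "Cri2": ["A2", "B2"],
                     "Cri3": ["A3", "B3"], "value": [1, 2]})
df_2 = pd.DataFrame({"Cri1": ["A1", "C1"], "Cri2": ["A2", "C2"],
                     "Cri3": ["A3", "C3"], "value": [9, 10]})
df_3 = pd.DataFrame({"Cri1": ["B1", "C1"], "Cri2": ["B2", "C2"],
                     "Cri3": ["B3", "C3"], "value": [15, 17]})

# Sum the contributions of the three frames per criteria combination.
extra = pd.concat([df_1, df_2, df_3]).groupby(keys, as_index=False)["value"].sum()

# A left merge keeps exactly the rows of main_df, in their original order.
merged = main_df.merge(extra, on=keys, how="left")
# Rows without any contribution would get NaN, so treat those as 0.
merged["total"] = merged["total"] + merged["value"].fillna(0).astype(int)
result = merged.drop(columns="value")

print(result)
```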

select the first N elements of each row in a column

I am looking to select the first two elements of each row in column a and column b.
Here is an example
df = pd.DataFrame({'a': ['A123', 'A567','A100'], 'b': ['A156', 'A266666','A35555']})
>>> df
a b
0 A123 A156
1 A567 A266666
2 A100 A35555
desired output
>>> df
a b
0 A1 A1
1 A5 A2
2 A1 A3
I have been trying to use df.loc but have not been successful.
Use the .str accessor on each column:
In [905]: df.apply(lambda x: x.str[:2])
Out[905]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
Or, element-wise with applymap (deprecated since pandas 2.1 in favor of DataFrame.map):
In [908]: df.applymap(lambda x: x[:2])
Out[908]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
Another option is str.slice:
In [107]: df.apply(lambda c: c.str.slice(stop=2))
Out[107]:
a b
0 A1 A1
1 A5 A2
2 A1 A3
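To make the number of characters a parameter, the .str slicing shown above can be wrapped in a small helper. This is just a sketch: first_n_chars is a hypothetical name, and the astype(str) call is an added assumption so that non-string columns do not break the .str accessor.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A123', 'A567', 'A100'],
                   'b': ['A156', 'A266666', 'A35555']})


def first_n_chars(frame, n):
    """Return a new DataFrame with every column cut to its first n characters."""
    # astype(str) is an assumption made here so numeric columns are handled too.
    return frame.apply(lambda col: col.astype(str).str[:n])


trimmed = first_n_chars(df, 2)
print(trimmed)
```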

Taking last characters of a column of objects and making it the column on a dataframe - pandas python

I have a dataframe like the following:
df =
A B D
a1 b1 9052091001A
a2 b2 95993854906
a3 b3 93492480190
a4 b4 93240941993
What I want:
df_resp =
A B D
a1 b1 001A
a2 b2 4906
a3 b3 0190
a4 b4 1993
What I tried:
for i in (0,len(df['D'])):
    df['D'][i]= df['D'][i][-4:]
Error I got:
KeyError: 4906
Also, it takes a really long time and I think there should be a quicker way with pandas.
Use pd.Series.str string accessor for vectorized string operations. These are preferred over using apply.
If D elements are already strings
df.assign(D=df.D.str[-4:])
A B D
0 a1 b1 001A
1 a2 b2 4906
2 a3 b3 0190
3 a4 b4 1993
If not
df.assign(D=df.D.astype(str).str[-4:])
A B D
0 a1 b1 001A
1 a2 b2 4906
2 a3 b3 0190
3 a4 b4 1993
You can change in place with
df['D'] = df.D.str[-4:]
Use the apply() method of pandas.Series; it will be much faster than iterating with a for loop.
This should work (provided the column contains only strings):
df_resp = df.copy()
df_resp['D'] = df_resp['D'].apply(lambda x : x[-4:])
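As for the KeyError, it probably comes from your DataFrame's index: df['D'][i] looks i up as an index label (like df.loc[i, 'D']), not as a position. Note also that for i in (0, len(df['D'])) loops over just two values, 0 and len(df['D']), and with a default integer index the label len(df['D']) does not exist, which raises a KeyError. Looping over range(len(df['D'])) and using df['D'].iloc[i], which refers to the row at position i, would (probably) work.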
I hope this helps!
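The difference between label-based and position-based access mentioned above can be illustrated with a small sketch. The index labels 10, 20, 30 are made up for the example so that labels and positions differ.

```python
import pandas as pd

df = pd.DataFrame(
    {"D": ["9052091001A", "95993854906", "93492480190"]},
    index=[10, 20, 30],  # hypothetical non-default labels
)

# Label-based lookup: 0 is not a label in this index, so a KeyError is raised.
try:
    df["D"].loc[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Position-based lookup counts rows from 0 regardless of the labels.
first_by_position = df["D"].iloc[0]

# The vectorized string slice needs neither labels nor positions.
last_four = df["D"].str[-4:]

print(label_lookup_failed, first_by_position)
print(last_four)
```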
