Pandas (Python): max across columns defines a new value in a new column

I have a df with about 50 columns:
Product ID | Cat1 | Cat2 | Cat3 | ... other columns ...
8937456    |    0 |    5 |   10 |
8497534    |   25 |    3 |    0 |
8754392    |    4 |   15 |    7 |
Each Cat column counts how many units of that product fell into that category. Now I want to add a column "Category" denoting the majority category for each product (ignoring the other columns and considering only the Cat columns).
df_goal:
Product ID | Cat1 | Cat2 | Cat3 | Category | ... other columns ...
8937456    |    0 |    5 |   10 |        3 |
8497534    |   25 |    3 |    0 |        1 |
8754392    |    4 |   15 |    7 |        2 |
I think I need to use max and apply or map?
I found the questions below on Stack Overflow, but they don't address the category assignment. In Excel I renamed the columns from Cat 1 to 1 and used INDEX(MATCH(MAX)).
Python Pandas max value of selected columns
How should I take the max of 2 columns in a dataframe and make it another column?
Assign new value in DataFrame column based on group max

Here's a NumPy way with numpy.argmax -
df['Category'] = df.values[:,1:].argmax(1)+1
To restrict the selection to those columns, list those column headers/names explicitly, then use idxmax, and finally replace the string Cat with an empty string, like so -
df['Category'] = df[['Cat1','Cat2','Cat3']].idxmax(1).str.replace('Cat','')
numpy.argmax or pandas' idxmax basically gets us the index of the max element along an axis.
If we know that the Cat columns are contiguous, occupying the 2nd through 4th columns, we can slice the DataFrame - df.iloc[:,1:4] - instead of writing out df[['Cat1','Cat2','Cat3']].
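A minimal runnable sketch of both approaches, using the sample data above (note that the idxmax + str.replace route yields strings, so cast to int if a numeric column is needed):

```python
import pandas as pd

df = pd.DataFrame({
    'Product ID': [8937456, 8497534, 8754392],
    'Cat1': [0, 25, 4],
    'Cat2': [5, 3, 15],
    'Cat3': [10, 0, 7],
})

cat_cols = ['Cat1', 'Cat2', 'Cat3']

# NumPy route: argmax returns 0-based positions, hence the +1
df['Category'] = df[cat_cols].values.argmax(1) + 1

# Label route: idxmax returns the column name; strip the 'Cat' prefix
df['Category'] = df[cat_cols].idxmax(axis=1).str.replace('Cat', '').astype(int)

print(df['Category'].tolist())  # → [3, 1, 2]
```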

Related

Pandas: Adding incremental numbers to the suffix of repeating values of a column that are grouped by the value of another column and ordered by index

I am trying to add an underscore and an incremental number to any repeating values, ordered by index, within a group defined by another column.
For example, I would like the repeating values in the Chemistry column to get underscore-and-number suffixes, ordered by index and grouped by the Cycle column.
df = pd.DataFrame([[1,1,1,1,1,1,2,2,2,2,2,2], ['NaOH', 'H20', 'MWS', 'H20', 'MWS', 'NaOh', 'NaOH', 'H20', 'MWS', 'H20', 'MWS', 'NaOh']]).transpose()
df.columns = ['Cycle', 'Chemistry']
df
(The original table and desired output table were shown as images in the original post.)
IIUC:
Use pandas.Series.str.cat with groupby.cumcount:
df['Chemistry'] = df.Chemistry.str.cat(
df.groupby(['Cycle', 'Chemistry']).cumcount().add(1).astype(str),
sep='_'
)
df
Cycle Chemistry
0 1 NaOH_1
1 1 H20_1
2 1 MWS_1
3 1 H20_2
4 1 MWS_2
5 1 NaOh_1
6 2 NaOH_1
7 2 H20_1
8 2 MWS_1
9 2 H20_2
10 2 MWS_2
11 2 NaOh_1

Handling rows with 2 lines of data in Python

My DataFrame looks like this (shown as an image in the original post):
There are some rows (for example, row 297) where the "Price" column has two values (Plugs and Quarts). I have filled the NaNs with the previous row's value, since it belongs to the same Latin Name. However, I was thinking of splitting the Price column further into two columns named "Quarts" and "Plugs" and filling in the amount, with 0 where no Plugs are found, and the same for Quarts.
Example :
Plugs | Quarts
0 | 2
2 | 3
4 | 0
Can someone help me with this?
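No sample data survives from the original post, so here is a hedged sketch under assumed column names (Latin Name, a Type column holding 'Plugs'/'Quarts', and an Amount); pivot_table with fill_value=0 produces the two-column layout described:

```python
import pandas as pd

# Hypothetical long-format input: one row per (plant, container type)
df = pd.DataFrame({
    'Latin Name': ['Hosta', 'Fern', 'Fern', 'Sedum'],
    'Type': ['Quarts', 'Plugs', 'Quarts', 'Plugs'],
    'Amount': [2, 2, 3, 4],
})

# Pivot to one row per plant; fill_value=0 covers missing combinations
wide = (df.pivot_table(index='Latin Name', columns='Type',
                       values='Amount', fill_value=0)
          .rename_axis(None, axis=1)
          .reset_index())
print(wide)
```

Plants that appear with only one container type (Hosta, Sedum) get 0 in the other column, matching the example table above.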

Replace & filter dataframe row values

I have two DataFrames, one with expressions and another with values. The Criteria column of DataFrame 1 contains expressions that reference column names of DataFrame 2. I need to take each row of values from DataFrame 2 and substitute them into the DataFrame 1 criteria, without a loop.
How should I do it in an optimized way?
DataFrame 1:
   Criteria           point
0  chgdsl='10'        1
1  chgdt ='01022007'  2
3  chgdsl='9'         3
DataFrame 2:
   chgdsl  chgdt     chgname
0  10      01022007  namrr
1  9       02022007  chard
2  9       01022007  exprr
I expect that when I take the first row of DataFrame 2, the output for DataFrame 1 will be 10='10', 01022007 ='01022007', 10='9'.
I need to take one row at a time from DataFrame 2 and substitute it into all rows of DataFrame 1.
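One loop-free sketch: build a column-name-to-value mapping from a single row of DataFrame 2, then let Series.replace with regex=True substitute every column name that appears inside the Criteria strings. The frame constructions below are assumptions reconstructed from the tables above:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Criteria': ["chgdsl='10'", "chgdt ='01022007'", "chgdsl='9'"],
    'point': [1, 2, 3],
})
df2 = pd.DataFrame({
    'chgdsl': ['10', '9', '9'],
    'chgdt': ['01022007', '02022007', '01022007'],
    'chgname': ['namrr', 'chard', 'exprr'],
})

# Map each DataFrame 2 column name to the value in the chosen row,
# then replace those names wherever they occur in the Criteria strings.
row = df2.iloc[0]
mapping = {col: str(val) for col, val in row.items()}
result = df1['Criteria'].replace(mapping, regex=True)
print(result.tolist())  # → ["10='10'", "01022007 ='01022007'", "10='9'"]
```

To apply every row of DataFrame 2 in turn, repeat the mapping/replace step per row; the substitution itself stays vectorized over all rows of DataFrame 1.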

Python Pandas differing value_counts() in two columns of same len()

I have a pandas DataFrame that contains two columns, with trace numbers [col_1] and ID numbers [col_2]. Trace numbers can be duplicates, as can ID numbers - however, each trace and ID should correspond to only one specific partner in the adjacent column.
My two columns have the same length but different unique value counts, which should be equal, as shown below:
in[1]: Trace | ID
1 | 5054
2 | 8291
3 | 9323
4 | 9323
... |
100 | 8928
in[2]: print('unique traces: ', df['Trace'].nunique())
print('unique IDs: ', df['ID'].nunique())
out[3]: unique traces: 100
unique IDs: 99
In the data above, the same ID number (9323) is represented by two Trace numbers (3 & 4) - how can I isolate these incidences? Thanks for looking!
By using the duplicated() function (docs), you can do the following:
df[df['ID'].duplicated(keep=False)]
By setting keep to False, we get all the duplicates (instead of excluding the first or the last one).
Which returns:
Trace ID
2 3 9323
3 4 9323
You can use groupby and filter:
df.groupby('ID').filter(lambda x: x.Trace.nunique() > 1)
Output:
Trace ID
2 3 9323.0
3 4 9323.0
# this should tell you the indices of non-unique Traces or IDs
df.groupby('ID').filter(lambda x: len(x)>1)
Out[85]:
Trace ID
2 3 9323
3 4 9323
df.groupby('Trace').filter(lambda x: len(x)>1)
Out[86]:
Empty DataFrame
Columns: [Trace, ID]
Index: []

Sort a column within groups in Pandas

I am new to pandas. I'm trying to sort a column within each group. So far, I've been able to group by the first and second column values and calculate the mean of the third column. But I'm still struggling to sort the third column.
(The input DataFrame and the result after applying groupby and mean were shown as images in the original post.)
I used the following line of code to group the input DataFrame:
df_o=df.groupby(by=['Organization Group','Department']).agg({'Total Compensation':np.mean})
Please let me know how to sort the last column within each group of the first column using pandas.
It seems you need sort_values:
# to get a DataFrame back, pass as_index=False
df_o=df.groupby(['Organization Group','Department'],
as_index=False)['Total Compensation'].mean()
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
Sample:
df = pd.DataFrame({'Organization Group':['a','b','a','a'],
'Department':['d','f','a','a'],
'Total Compensation':[1,8,9,1]})
print (df)
Department Organization Group Total Compensation
0 d a 1
1 f b 8
2 a a 9
3 a a 1
df_o=df.groupby(['Organization Group','Department'],
as_index=False)['Total Compensation'].mean()
print (df_o)
Organization Group Department Total Compensation
0 a a 5
1 a d 1
2 b f 8
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
print (df_o)
Organization Group Department Total Compensation
1 a d 1
0 a a 5
2 b f 8
