I have a pandas DataFrame with two columns. One is an identifier and the other is the name of the product attached to that identifier. Both the identifiers and the product names contain duplicates. I want to spread the single column of product names across several columns so that each identifier appears only once. Maybe I need to aggregate the product names by identifier.
My dataframe looks like:
ID Product_Name
100 Apple
100 Banana
200 Cherries
200 Apricots
200 Apple
300 Avocados
I want to have a DataFrame like this:
ID
100 Apple Banana
200 Cherries Apricots Apple
300 Avocados
Each product for a given identifier has to be in a separate column.
I tried pd.melt, pd.pivot and pd.pivot_table, but I only get errors, such as: No numeric types to aggregate
Any idea how to do this?
Use cumcount to number the rows within each ID (these numbers become the new column names), add that counter to a MultiIndex with set_index, and reshape with unstack:
df = df.set_index(['ID',df.groupby('ID').cumcount()])['Product_Name'].unstack()
Or create a Series of lists and build a new DataFrame with the constructor:
s = df.groupby('ID')['Product_Name'].apply(list)
df = pd.DataFrame(s.values.tolist(), index=s.index)
print (df)
0 1 2
ID
100 Apple Banana NaN
200 Cherries Apricots Apple
300 Avocados NaN NaN
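For reference, here is a self-contained sketch of the cumcount approach on the sample data from the question. The names `pos` and `wide` are only illustrative and do not appear in the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [100, 100, 200, 200, 200, 300],
    "Product_Name": ["Apple", "Banana", "Cherries", "Apricots", "Apple", "Avocados"],
})

# Position of each row within its ID group: 0, 1, 0, 1, 2, 0
pos = df.groupby("ID").cumcount()

# Use (ID, position) as a MultiIndex, then move the position level into columns
wide = df.set_index(["ID", pos])["Product_Name"].unstack()
print(wide)
```

IDs with fewer products than the largest group end up with missing values in the trailing columns.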
But if you want a two-column DataFrame:
df1 = df.groupby('ID')['Product_Name'].apply(' '.join).reset_index(name='new')
print (df1)
ID new
0 100 Apple Banana
1 200 Cherries Apricots Apple
2 300 Avocados
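Use the pivot function; pivoting can do what is required. Note that pivot needs a column whose values become the new column labels, for example a per-ID counter built with cumcount.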
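As a rough sketch of how pivot could be applied here (one reading of the suggestion, not necessarily what the answerer had in mind), the example below assumes a helper column `n`, a hypothetical name, holding each row's position within its ID. pivot_table fails in the question because its default aggregation is the mean, which is not defined for strings.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [100, 100, 200, 200, 200, 300],
    "Product_Name": ["Apple", "Banana", "Cherries", "Apricots", "Apple", "Avocados"],
})

# 'n' is a hypothetical helper column: the row's position within its ID group
numbered = df.assign(n=df.groupby("ID").cumcount())

# Each distinct value of 'n' becomes one output column
wide = numbered.pivot(index="ID", columns="n", values="Product_Name")
print(wide)
```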
I want to subset a DataFrame by two columns in different dataframes if the values in the columns are the same. Here is an example of df1 and df2:
df1
A
0 apple
1 pear
2 orange
3 apple
df2
B
0 apple
1 orange
2 orange
3 pear
I would like the output to be a subsetted df1 based upon the df2 column:
A
0 apple
2 orange
I tried
df1 = df1[df1.A == df2.B] but get the following error:
ValueError: Can only compare identically-labeled Series objects
I do not want to rename the column in either.
What is the best way to do this? Thanks
If you need to compare the index values together with both columns, create a MultiIndex from each and use Index.isin:
df = df1[df1.set_index('A', append=True).index.isin(df2.set_index('B', append=True).index)]
print (df)
A
0 apple
2 orange
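A self-contained version of the same idea, using the frames from the question, might look like the sketch below. The intermediate names `left`, `right` and `result` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["apple", "pear", "orange", "apple"]})
df2 = pd.DataFrame({"B": ["apple", "orange", "orange", "pear"]})

# Pair every value with its row label: (0, 'apple'), (1, 'pear'), ...
left = df1.set_index("A", append=True).index
right = df2.set_index("B", append=True).index

# Keep the rows of df1 whose (label, value) pair also occurs in df2
result = df1[left.isin(right)]
print(result)
```

Because only the pairs are compared, neither column has to be renamed.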
I'd like to group a data frame by a column called 'Fruit' and calculate the percentage of each fruit that is 'Good'.
See below for my initial dataframe
import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple','Apple','Banana'], 'Condition': ['Good','Bad','Good']})
Dataframe
Fruit Condition
0 Apple Good
1 Apple Bad
2 Banana Good
See below for my desired output data frame
Fruit Percentage
0 Apple 50%
1 Banana 100%
Note: Because there is 1 "Good" Apple and 1 "Bad" Apple, the percentage of Good Apples is 50%.
See below for my attempt, which overwrites all the columns:
groupedDF = df.groupby('Fruit')
groupedDF.apply(lambda x: x[(x['Condition'] == 'Good')].count()/x.count())
See below for resulting table, which seems to calculate percentage but within existing columns instead of new column:
Fruit Condition
Fruit
Apple 0.5 0.5
Banana 1.0 1.0
We can compare Condition with eq and take advantage of the fact that True counts as 1 and False as 0 when processed as numbers, then take the groupby mean over Fruit:
new_df = (
df['Condition'].eq('Good').groupby(df['Fruit']).mean().reset_index()
)
new_df:
Fruit Condition
0 Apple 0.5
1 Banana 1.0
We can further map to a format string and rename the column to match the desired output shown in the question:
new_df = (
df['Condition'].eq('Good')
.groupby(df['Fruit']).mean()
.map('{:.0%}'.format) # Change to Percent Format
.rename('Percentage') # Rename Column to Percentage
.reset_index() # Restore RangeIndex and make Fruit a Column
)
new_df:
Fruit Percentage
0 Apple 50%
1 Banana 100%
*Naturally further manipulations can be done as well.
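To make the "True counts as 1" step concrete, here is a small sketch that computes the share both as a mean and as an explicit count of good rows divided by the group size. The names `is_good`, `share` and `ratio` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Banana"],
    "Condition": ["Good", "Bad", "Good"],
})

is_good = df["Condition"].eq("Good")      # True, False, True
by_fruit = is_good.groupby(df["Fruit"])

share = by_fruit.mean()                   # mean of booleans = fraction that is True
ratio = by_fruit.sum() / by_fruit.count() # the same value, spelled out

# Format as a percentage string and restore Fruit as a column
new_df = share.map("{:.0%}".format).rename("Percentage").reset_index()
print(new_df)
```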
I am trying to drop rows in pandas based on whether the cell in column "Price" contains "/". I have referred to the question: Drop rows in pandas if they contains "???".
As such, I have tried both of these snippets:
df = df[~df["Price"].str.contains('/')]
and
df = df[~df["Price"].str.contains('/',regex=False)]
However, both snippets give the error:
AttributeError: Can only use .str accessor with string values!
For reference, the first few rows of my dataframe are as follows:
Fruit Price
0 Apple 3
1 Apple 2/3
2 Banana 2
3 Orange 6/7
May I know what went wrong and how can I fix this problem? Thank you very much!
Try this:
df = df[~df['Price'].astype(str).str.contains('/')]
print(df)
Fruit Price
0 Apple 3
2 Banana 2
You need to convert the Price column to string first and then apply this operation. I believe the Price column does not have a string dtype.
df['Price'] = df['Price'].astype(str)
and then try
df = df[~df["Price"].str.contains('/',regex=False)]
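A minimal sketch combining both suggestions is shown below. It assumes the Price column mixes integers and strings, as the sample rows suggest; the actual dtype in the asker's data is not known. The names `has_slash` and `filtered` are illustrative.

```python
import pandas as pd

# Assumption: Price holds a mix of integers and strings
df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Banana", "Orange"],
    "Price": [3, "2/3", 2, "6/7"],
})

# Convert to string only for the test, so the original values are kept
has_slash = df["Price"].astype(str).str.contains("/", regex=False)

filtered = df[~has_slash]
print(filtered)
```

Building the mask from a converted copy, rather than overwriting the column, leaves the remaining prices in their original type.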
I have a dataframe as below.
Date Fruit level_0 Num Color
0 2013-11-25 Apple DF2 22.1 Red
1 2013-11-24 Banana DF1 22.1 Yellow
2 2013-11-24 Banana DF2 122.1 Yellow
3 2013-11-23 Celery DF1 10.2 Green
4 2013-11-24 Orange DF1 8.6 Orange
5 2013-11-24 Orange DF2 8.6 Orange1
6 2013-11-25 Orange DF1 8.6 Orange
I need to compare rows within the dataframe and see which columns have mismatched data. Only rows that have the same "Date" and "Fruit" values but different "level_0" values should be compared. So in this dataframe I need to compare the rows with index 1 and 2, since they have the same values for "Date" and "Fruit" but different "level_0" values. Since they differ in the "Num" column, a label (say "NM") should be suffixed to the value in both rows. Rows whose "Date" and "Fruit" combination occurs only once need a label (say "Miss") suffixed to the value in the "Fruit" column.
Example of expected output below:
1. Is it possible to get such an output?
2. Is there a fast way to get it, as my actual dataset contains millions of rows and 20-25 columns?
This is pretty complex, since there are a lot of different filters you want to apply. If I understand you correctly, you want
rows that have the same "Date" and "Fruit" values, and
of those rows, those that have different "level_0" values, and
of those rows, those that have different "Num" values, to get -NM. From your example, you want to do the same with the "Color" column.
Rows that are the only occurrence of a "Date" and "Fruit" value get -Miss.
First, you'll need to make Num a string column, since we are adding suffixes. Then we group by Date and Fruit (1). Then, since you wanted the groups to have different level_0 values, we build a filter on that, called diff_frames (2). Then we add the suffixes using transform on both columns if they have two unique elements (3).
df['Num'] = df['Num'].astype(str)
g = df.groupby(['Date', 'Fruit'])
diff_frames = g['level_0'].transform(lambda s: s.nunique() == 2)
# assign only to the filtered rows so that the other rows keep their values
df.loc[diff_frames, ['Num', 'Color']] = df[diff_frames].groupby(['Date', 'Fruit'])[['Num', 'Color']].transform(
    lambda s: s + '-NM' if s.nunique() == 2 else s)
Then, for the second part, we get the non-duplicated rows in Date and Fruit, and add -Miss to the Fruit column. (4)
df.loc[~df.duplicated(subset=['Date', 'Fruit'], keep=False), 'Fruit'] += '-Miss'
print(df)
Date Fruit level_0 Num Color
0 0 Apple-Miss DF2 22.1 Red
1 1 Banana DF1 22.1-NM Yellow
2 1 Banana DF2 122.1-NM Yellow
3 2 Celery-Miss DF1 10.2 Green
4 3 Orange DF1 8.6 Orange-NM
5 3 Orange DF2 8.6 Orange1-NM
6 4 Orange-Miss DF2 8.6 Orange
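Since the question mentions millions of rows, a variant without Python lambdas may be worth trying. The sketch below is an alternative under stated assumptions, not the method from the answer above: it uses transform("nunique") to build boolean masks and treats "more than one distinct value" as a mismatch. The names `keys`, `mixed_source`, `differs`, `flag` and `single` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2013-11-25", "2013-11-24", "2013-11-24", "2013-11-23",
             "2013-11-24", "2013-11-24", "2013-11-25"],
    "Fruit": ["Apple", "Banana", "Banana", "Celery", "Orange", "Orange", "Orange"],
    "level_0": ["DF2", "DF1", "DF2", "DF1", "DF1", "DF2", "DF1"],
    "Num": [22.1, 22.1, 122.1, 10.2, 8.6, 8.6, 8.6],
    "Color": ["Red", "Yellow", "Yellow", "Green", "Orange", "Orange1", "Orange"],
})

keys = ["Date", "Fruit"]
df["Num"] = df["Num"].astype(str)  # suffixes can only be added to strings

# Rows whose (Date, Fruit) group contains more than one level_0 value
mixed_source = df.groupby(keys)["level_0"].transform("nunique").gt(1)

for col in ["Num", "Color"]:
    # Rows whose group disagrees on this column
    differs = df.groupby(keys)[col].transform("nunique").gt(1)
    flag = mixed_source & differs
    df.loc[flag, col] = df.loc[flag, col] + "-NM"

# Rows whose (Date, Fruit) combination occurs only once
single = ~df.duplicated(subset=keys, keep=False)
df.loc[single, "Fruit"] = df.loc[single, "Fruit"] + "-Miss"
print(df)
```

The -Miss step runs last on purpose, so that the grouping on Fruit is not affected by the added suffix.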
I have a df with about 50 columns:
Product ID | Cat1 | Cat2 |Cat3 | ... other columns ...
8937456 0 5 10
8497534 25 3 0
8754392 4 15 7
Each Cat column gives the quantity of that product that fell into that category. Now I want to add a column "Category" denoting the majority category for a product (ignoring the other columns and just considering the Cat columns).
df_goal:
Product ID | Cat1 | Cat2 |Cat3 | Category | ... other columns ...
8937456 0 5 10 3
8497534 25 3 0 1
8754392 4 15 7 2
I think I need to use max and apply or map?
I found these on Stack Overflow, but they do not address the category assignment. In Excel I renamed the columns from Cat 1 to 1 and used index(match(max)).
Python Pandas max value of selected columns
How should I take the max of 2 columns in a dataframe and make it another column?
Assign new value in DataFrame column based on group max
Here's a NumPy way with numpy.argmax -
df['Category'] = df.values[:,1:].argmax(1)+1
To restrict the selection to those columns, use their column names explicitly, then use idxmax, and finally replace the string Cat with an empty string, like so -
df['Category'] = df[['Cat1','Cat2','Cat3']].idxmax(1).str.replace('Cat','')
numpy.argmax gives the position of the max element along an axis, while pandas' idxmax gives its label.
If we know that the Cat columns sit at positions 1 to 3 (the second through fourth columns), we can slice the dataframe: df.iloc[:,1:4] instead of df[['Cat1','Cat2','Cat3']].
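Put together on the sample data, the idxmax variant could look like the sketch below. The column "Other" is a hypothetical stand-in for the remaining columns, and selecting the Cat columns with a regular expression is an assumption about how they are named.

```python
import pandas as pd

df = pd.DataFrame({
    "Product ID": [8937456, 8497534, 8754392],
    "Cat1": [0, 25, 4],
    "Cat2": [5, 3, 15],
    "Cat3": [10, 0, 7],
    "Other": ["x", "y", "z"],  # hypothetical stand-in for the other columns
})

# Assumption: category columns are named 'Cat' followed by digits
cat_cols = df.filter(regex=r"^Cat\d+$").columns

# Label of the largest value per row, e.g. 'Cat3', reduced to the number 3
df["Category"] = (
    df[cat_cols]
    .idxmax(axis=1)
    .str.replace("Cat", "", regex=False)
    .astype(int)
)
print(df)
```

If two categories tie for the maximum, idxmax returns the first one in column order.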