How to 'explode' Pandas Column Value to unique row - python

So, what I mean by 'explode' is this: I want to transform a dataframe like:
ID | Name    | Food          | Drink
1  | John    | Apple, Orange | Tea, Water
2  | Shawn   |               | Milk
3  | Patrick | Chichken      |
4  | Halley  | Fish Nugget   |
into this dataframe:
ID | Name    | Order Type | Items
1  | John    | Food       | Apple
2  | John    | Food       | Orange
3  | John    | Drink      | Tea
4  | John    | Drink      | Water
5  | Shawn   | Drink      | Milk
6  | Patrick | Food       | Chichken
I don't know how to make this happen. Any help would be appreciated!

IIUC, this is a stack-plus-unnest process. Here I would not change the ID; I think keeping the original one is better:
s = df.set_index(['ID', 'Name']).stack()
pd.DataFrame(data=s.str.split(',').sum(),
             index=s.index.repeat(s.str.split(',').str.len())).reset_index()
Out[289]:
   ID     Name level_2            0
0   1     John    Food        Apple
1   1     John    Food       Orange
2   1     John   Drink          Tea
3   1     John   Drink        Water
4   2    Shawn   Drink         Milk
5   3  Patrick    Food     Chichken
6   4   Halley    Food  Fish Nugget
# if you need to rename the last column to Item, try:
# pd.DataFrame(data=s.str.split(',').sum(),
#              index=s.index.repeat(s.str.split(',').str.len())).rename(columns={0: 'Item'}).reset_index()
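For what it's worth, on pandas 0.25+ the same reshape can also be written with melt plus DataFrame.explode. This is only a sketch under the same assumptions as above (the blank Food/Drink cells are NaN, and Order Type is derived from the original column names):

import pandas as pd

out = (
    df.melt(id_vars=['ID', 'Name'], var_name='Order Type', value_name='Items')
      .dropna(subset=['Items'])                           # Shawn has no Food, etc.
      .assign(Items=lambda d: d['Items'].str.split(','))  # 'Apple, Orange' -> list
      .explode('Items')                                   # one row per list element
      .assign(Items=lambda d: d['Items'].str.strip())     # drop the space after ','
      .sort_values('ID', kind='mergesort')                # stable sort keeps Food before Drink
      .reset_index(drop=True)
)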

You can use pd.melt to convert the data from wide to long format. I think this will be easier to understand step by step.
# first split into separate columns
df[['Food1','Food2']] = df.Food.str.split(',', expand=True)
df[['Drink1','Drink2']] = df.Drink.str.split(',', expand=True)
# now melt the df into long format
df = pd.melt(df, id_vars=['Name'], value_vars=['Food1','Food2','Drink1','Drink2'])
# remove unwanted rows and filter data
df = df[df['value'].notnull()].sort_values('Name').reset_index(drop=True)
# rename the columns and strip the trailing digits from the order type
df.rename(columns={'variable':'Order Type', 'value':'Items'}, inplace=True)
df['Order Type'] = df['Order Type'].str.replace(r'\d', '', regex=True)
# output
print(df)
Name Order Type Items
0 Halley Food Fish Nugget
1 John Food Apple
2 John Food Orange
3 John Drink Tea
4 John Drink Water
5 Patrick Food Chichken
6 Shawn Drink Milk
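Note that hard-coding Food1/Food2 assumes at most two comma-separated items per cell. A hedged tweak, if the item count varies: let split(..., expand=True) decide how many columns are needed and prefix them automatically (the digit-stripping step above works the same on Food0, Food1, ...):

# a sketch: split() creates as many columns as the longest list needs
food = df.Food.str.split(',', expand=True).add_prefix('Food')
drink = df.Drink.str.split(',', expand=True).add_prefix('Drink')
wide = df[['Name']].join([food, drink])
long = pd.melt(wide, id_vars=['Name']).dropna(subset=['value'])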

Related

Concatenate strings in multiple csv files into one dataframe along x and y axis (in pandas)

I have a folder with many csv files. They all look similar, they all have the same names for columns and rows. They all have strings as values in their cells. I want to concatenate them along columns AND rows so that all the strings are concatenated into their respective cells.
Example:
file1.csv

0     1      2      3      4
b1    peter  house  ash    plane
b2    carl   horse  paul   knife
b3    mary   apple  linda  carrot
b4    hank   car    herb   beer
file2.csv

0     1     2       3       4
b1    mark  green   hello   band
b2    no    phone   spoon   goodbye
b3    red   cherry  charly  hammer
b4    good  yes     ok      simon
What I want is this result with no delimiter between the string values:
concatenated.csv

0     1           2             3             4
b1    peter mark  house green   ash hello     plane band
b2    carl no     horse phone   paul spoon    knife goodbye
b3    mary red    apple cherry  linda charly  carrot hammer
b4    hank good   car yes       herb ok       beer simon
And I don't know how to do this in pandas in a Jupyter notebook.
I have tried a couple of things, but all of them either kept a separate set of rows or of columns.
If these are your dataframes:
df1_data = {
    1: ['peter', 'carl', 'mary', 'hank'],
    2: ['house', 'horse', 'apple', 'car']
}
df1 = pd.DataFrame(df1_data)

df2_data = {
    1: ['mark', 'no', 'red', 'good'],
    2: ['green', 'phone', 'cherry', 'yes']
}
df2 = pd.DataFrame(df2_data)
df1:
1 2
0 peter house
1 carl horse
2 mary apple
3 hank car
df2:
1 2
0 mark green
1 no phone
2 red cherry
3 good yes
You can reach your requested dataframe like this:
df = pd.DataFrame()
df[1] = df1[1] + ' ' + df2[1]
df[2] = df1[2] + ' ' + df2[2]
print(df)
Output:
1 2
0 peter mark house green
1 carl no horse phone
2 mary red apple cherry
3 hank good car yes
Loop for csv files:
Now, if you have a lot of csv files with names like file1.csv, file2.csv and so on, you can store them all in a dict d like this:
d = {}
# N is the number of csv files; because the range starts at 1,
# the stop has to be N + 1
for i in range(1, N + 1):
    d[i] = pd.read_csv('.../file' + str(i) + '.csv')
And build the dataframe you want like this:
concatenated_df = pd.DataFrame()
# M is the number of value columns; fold in every file for each column
for col in range(1, M + 1):
    combined = d[1].iloc[:, col]
    for i in range(2, N + 1):
        combined = combined + ' ' + d[i].iloc[:, col]
    concatenated_df[col] = combined
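If the files all share the same layout, a more general sketch (my own variation, not part of the answer above) folds any number of frames together with functools.reduce. It assumes the first column holds the b1..b4 labels and the files match a fileN.csv pattern:

import glob
from functools import reduce

import pandas as pd

# read every file; sorted() keeps file1, file2, ... in a stable order
frames = [pd.read_csv(f) for f in sorted(glob.glob('file*.csv'))]

label_col = frames[0].columns[0]     # the column holding the b1..b4 labels
labels = frames[0][[label_col]]

# element-wise string concatenation across all value columns
values = reduce(lambda a, b: a + ' ' + b,
                [f.drop(columns=label_col) for f in frames])

concatenated = labels.join(values)
concatenated.to_csv('concatenated.csv', index=False)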
If performance is not an issue, you can use pandas.DataFrame.applymap with pandas.Series.add:
out = df1[[0]].join(df1.iloc[:, 1:].applymap(lambda v: f"{v} ").add(df2.iloc[:, 1:]))
Or, for a large dataset, you can use pandas.concat with a listcomp:
out = (
    df1[[0]]
    .join(pd.concat([df1.merge(df2, on=0)
                        .filter(regex=rf"{p}_\w")
                        .agg(" ".join, axis=1)
                        .rename(idx)
                     for idx, p in enumerate(range(1, len(df1.columns)), start=1)],
                    axis=1))
)
Output:
print(out)
0 1 2 3 4
0 b1 peter mark house green ash hello plane band
1 b2 carl no horse phone paul spoon knife goodbye
2 b3 mary red apple cherry linda charly carrot hammer
3 b4 hank good car yes herb ok beer simon
Reading many csv files into a single DF is a pretty common answer, and is the first part of your question. You can find a suitable answer here.
After that, in an effort to allow you to perform this on all of the files at the same time, you can melt and pivot with a custom agg function like so:
import glob
import pandas as pd

# See the linked answer if you need help finding csv files in a different directory
all_files = glob.glob('*.csv')
df = pd.concat((pd.read_csv(f) for f in all_files))
output = (df.melt(id_vars='0')
            .pivot_table(index='0',
                         columns='variable',
                         values='value',
                         aggfunc=lambda x: ' '.join(x)))
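One caveat worth flagging: glob.glob returns file names in arbitrary order, so the joined strings are not guaranteed to read file1-then-file2. Sorting the list first is a cheap way to stabilize that:

# sort so values from file1.csv come before file2.csv, and so on
all_files = sorted(glob.glob('*.csv'))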

How can you group a data frame and reshape from long to wide?

I am fairly new to Python, so excuse me if this question has been answered before or can be easily solved.
I have a long data frame with numerical variables and categorical variables. It looks something like this:
Category  Detail  Gender  Weight
Food      Apple   Female  30
Food      Apple   Male    40
Beverage  Milk    Female  10
Beverage  Milk    Male    5
Beverage  Milk    Male    20
Food      Banana  Female  50
What I want to do is this: group by Category and Detail and then count all instances of 'Female' and 'Male'. I then want to weight these instances (see column 'Weight'). This should be done by taking the value from column 'Weight' and dividing it by the summed weight (so here, for the group Beverage/Milk/Male, it would be 25 divided by 35). It would also be nice to have the share of each gender.
At the end of the day I want my data frame to look something like this:
Category  Detail  Female  Male
Beverage  Milk    29%     71%
Food      Apple   43%     57%
Food      Banana  100%    0%
So in addition to the grouping, I want to kind of 'unmelt' the data frame by taking Female and Male and adding them as new columns.
I could just sum the weights with groupby on different levels, but how can I reshape the data frame in that way of adding these new columns?
Is there any way to do that? Thanks for any help in advance!
Use DataFrame.pivot_table to sum the weights, then divide by the row sums; last, multiply by 100 and round:
df = df.pivot_table(index=['Category', 'Detail'],
                    columns='Gender', values='Weight',
                    aggfunc='sum', fill_value=0)
df = df.div(df.sum(axis=1), axis=0).mul(100).round().reset_index()
print(df)
Gender Category Detail Female Male
0 Beverage Milk 29.0 71.0
1 Food Apple 43.0 57.0
2 Food Banana 100.0 0.0
For percentages use:
df = df.pivot_table(index=['Category', 'Detail'],
                    columns='Gender', values='Weight',
                    aggfunc='sum', fill_value=0)
df = df.div(df.sum(axis=1), axis=0).applymap("{:.2%}".format).reset_index()
print(df)
Gender Category Detail Female Male
0 Beverage Milk 28.57% 71.43%
1 Food Apple 42.86% 57.14%
2 Food Banana 100.00% 0.00%
Like so:
df2 = df.pivot_table(
    index=['Category', 'Detail'],
    columns='Gender',
    values='Weight',
    aggfunc='sum'
).fillna(0)
final = df2[['Female', 'Male']].div(df2.sum(axis=1), axis=0)
Gender             Female      Male
Category Detail
Beverage Milk    0.285714  0.714286
Food     Apple   0.428571  0.571429
         Banana  1.000000  0.000000
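For completeness, pd.crosstab can do the sum and the row-wise normalization in one call. A sketch on the same df (fillna(0) covers Gender/Detail combinations that never occur):

out = (pd.crosstab(index=[df['Category'], df['Detail']],
                   columns=df['Gender'],
                   values=df['Weight'],
                   aggfunc='sum',
                   normalize='index')   # divide each row by its total
         .fillna(0)
         .mul(100)
         .round()
         .reset_index())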

Convert Key and Value in Dictionary Column as Different Columns Pandas

I have a table like this
Name  Type Food  Variant and Price
A     Cake       {'Choco': 100, 'Cheese': 100, 'Mix': 125}
B     Drinks     {'Cola': 25, 'Milk': 35}
C     Side dish  {'French Fries': 20}
D     Bread      {None: 10}
I want to use the keys and values of the dictionaries in the Variant and Price column as 2 different columns, but I am still confused. Here is the output that I want:
Name  Type Food  Variant       Price
A     Cake       Choco         100
A     Cake       Cheese        100
A     Cake       Mix           125
B     Drinks     Cola          25
B     Drinks     Milk          35
C     Side dish  French Fries  20
D     Bread      NaN           10
Can anyone help me to figure it out?
Create a list of tuples per row and then use DataFrame.explode; last, create the 2 columns:
df['Variant and Price'] = df['Variant and Price'].apply(lambda x: list(x.items()))
df = df.explode('Variant and Price').reset_index(drop=True)
df[['Variant','Price']] = df.pop('Variant and Price').to_numpy().tolist()
print (df)
Name Type Food Variant Price
0 A Cake Choco 100
1 A Cake Cheese 100
2 A Cake Mix 125
3 B Drinks Cola 25
4 B Drinks Milk 35
5 C Side dish French Fries 20
6 D Bread None 10
Or create 2 columns and then use DataFrame.explode:
df['Variant'] = df['Variant and Price'].apply(lambda x: list(x.keys()))
df['Price'] = df.pop('Variant and Price').apply(lambda x: list(x.values()))
df = df.explode(['Variant', 'Price']).reset_index(drop=True)
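If the column holds plain Python dicts, a minimal alternative sketch flattens them with a comprehension and rebuilds the frame:

rows = [
    (name, type_food, variant, price)
    for name, type_food, d in df[['Name', 'Type Food', 'Variant and Price']].itertuples(index=False)
    for variant, price in d.items()   # one output row per dict entry
]
out = pd.DataFrame(rows, columns=['Name', 'Type Food', 'Variant', 'Price'])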

How to append a new row in a dataframe by searching for an existing column value without iterating?

I'm trying to find the best way to create several new rows from every single row where a certain value is contained in a column.
Example Dataframe
Index  Person  Drink_Order
1      Sam     Jack and Coke
2      John    Coke
3      Steve   Dr. Pepper
I'd like to search the DataFrame for 'Jack and Coke', remove it, and add 2 new records, as Jack and Coke are 2 different drink sources.
Index  Person  Drink_Order
2      John    Coke
3      Steve   Dr. Pepper
4      Sam     Jack Daniels
5      Sam     Coke
Example code that I want to replace, as my understanding is that you should never modify rows you are iterating over:
for index, row in df.loc[df['Drink_Order'].str.contains('Jack and Coke')].iterrows():
    df.loc[len(df)] = [row['Person'], 'Jack Daniels']
    df.loc[len(df)] = [row['Person'], 'Coke']
df = df[df['Drink_Order'] != 'Jack and Coke']
Split on ' and '; that results in a list. Explode the list so each element appears as an individual row. Then conditionally rename 'Jack' to 'Jack Daniels':
import numpy as np

df = df.assign(Drink_Order=df['Drink_Order'].str.split(' and ')).explode('Drink_Order')
df['Drink_Order'] = np.where(df['Drink_Order'].str.contains('Jack'),
                             'Jack Daniels', df['Drink_Order'])
Index Person Drink_Order
0 1 Sam Jack Daniels
0 1 Sam Coke
1 2 John Coke
2 3 Steve Dr. Pepper
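An alternative sketch that skips the np.where step: rewrite the combined order up front, then split and explode:

out = (df.assign(Drink_Order=df['Drink_Order']
                   .str.replace('Jack and Coke', 'Jack Daniels and Coke')
                   .str.split(' and '))
         .explode('Drink_Order')
         .reset_index(drop=True))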

Combine two rows in Pandas depending on values in a column and creating a new category

I am most interested in how this is done in a clean, idiomatic pandas way.
In this example data, Tim from Osaka has two fruits.
import pandas as pd
data = {'name': ['Susan', 'Tim', 'Tim', 'Anna'],
        'fruit': ['Apple', 'Apple', 'Banana', 'Banana'],
        'town': ['Berlin', 'Osaka', 'Osaka', 'Singabpur']}
df = pd.DataFrame(data)
print(df)
Result
name fruit town
0 Susan Apple Berlin
1 Tim Apple Osaka
2 Tim Banana Osaka
3 Anna Banana Singabpur
I investigate the data and see that one of the persons has multiple fruits. I want to create a new "category" for it named Apple&Banana (or something else). The point is that Tim's other fields are equal in their values.
df.groupby(['name', 'town', 'fruit']).size()
I am not sure if this is the correct way to explore this data set. The underlying question is whether some of the person+town combinations have multiple fruits.
As a result I want this
name fruit town
0 Susan Apple Berlin
1 Tim Apple&Banana Osaka
2 Anna Banana Singabpur
Use groupby agg:
new_df = (
    df.groupby(['name', 'town'], as_index=False, sort=False)
      .agg(fruit=('fruit', '&'.join))
)
new_df:
name town fruit
0 Susan Berlin Apple
1 Tim Osaka Apple&Banana
2 Anna Singabpur Banana
>>> (df.groupby(["name", "town"])["fruit"]
...     .apply(lambda f: "&".join(f)).reset_index())
name town fruit
0 Anna Singabpur Banana
1 Susan Berlin Apple
2 Tim Osaka Apple&Banana
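As a side note on the groupby(...).size() probe in the question: nunique answers "which name+town pairs have multiple fruits" directly (a small check, not part of either answer):

counts = df.groupby(['name', 'town'])['fruit'].nunique()
print(counts[counts > 1])   # -> Tim / Osaka: 2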
