Add data to new column based on previous row - python

df:

   type  content
1  task  buy xbox
2  task  buy fruit from supermarket
3  note  orange with squash\buy if cheap
4  note  apple
5  task  buy sunglasses
Each note refers to the task directly above it. How can I manipulate the df to get the following?
Expected Output:

   task                        comment1                         comment2
1  buy xbox
2  buy fruit from supermarket  orange with squash\buy if cheap  apple
3  buy sunglasses
...

Create a helper Series that assigns each row to a task group (compare the values with 'task' and take the cumulative sum), get a within-group counter with GroupBy.cumcount, and reshape with DataFrame.set_index and Series.unstack:
s = df['type'].eq('task').cumsum()
g = df.groupby(s).cumcount()

df1 = (df.set_index([s, g])['content']
         .unstack(fill_value='')
         .add_prefix('comment')
         .rename(columns={'comment0': 'task'})
         .reset_index(drop=True))
print(df1)
                         task                         comment1 comment2
0                    buy xbox
1  buy fruit from supermarket  orange with squash\buy if cheap    apple
2              buy sunglasses
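For reference, here is a minimal, self-contained version of this solution (the DataFrame constructor is my reconstruction of the question's frame), with the two helper Series shown explicitly:

import pandas as pd

df = pd.DataFrame({'type': ['task', 'task', 'note', 'note', 'task'],
                   'content': ['buy xbox', 'buy fruit from supermarket',
                               'orange with squash\\buy if cheap', 'apple',
                               'buy sunglasses']})

s = df['type'].eq('task').cumsum()   # 1 2 2 2 3 -> one group per task and its notes
g = df.groupby(s).cumcount()         # 0 0 1 2 0 -> position within each group

df1 = (df.set_index([s, g])['content']
         .unstack(fill_value='')
         .add_prefix('comment')
         .rename(columns={'comment0': 'task'})
         .reset_index(drop=True))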

How to divide a list to allocate it to another dataframe based on sum of values?

I have two dataframes, for example.
The first dataframe contains each person's name and the kind of chocolate they want:
Name    Chocolate
Kirti   Nutella
Rahul   Lindt
Sam     Lindt
Joy     Lindt
Mrinal  Kit Kat
Sai     Lindt
The second dataframe contains each shop and the availability of each item in that shop:

Shop    Chocolate  Count
Shop 1  Lindt      2
Shop 2  Lindt      3
Shop 1  Nutella    5
The end result I'm looking for is a dataframe indicating which shop each person can go to.
Rahul, Sam, Joy and Sai are the 4 people who want Lindt. 2 of them can go to Shop 1 and the other 2 can go to Shop 2, so that everyone can get Lindt chocolate.
We can randomly assign 2 of them to Shop 1 and 2 of them to Shop 2.
Similarly with the other chocolates; the resulting dataframe will be:
Name    Chocolate  Shop
Kirti   Nutella    Shop 1
Rahul   Lindt      Shop 1
Sam     Lindt      Shop 1
Joy     Lindt      Shop 2
Mrinal  Kit Kat    NA
Sai     Lindt      Shop 2
In the above case, Mrinal doesn't get assigned any shop because no shop has Kit Kat available.
I've been trying to do a vlookup-style lookup in Python using map, but all people who want Lindt get assigned Shop 2. I want to split the assignment according to the quantity available in each shop, so that as many people as possible get chocolate.
Here's the code that I've written so far:

import pandas as pd

df_demand = pd.DataFrame({'Name': ['Kirti', 'Rahul', 'Sam', 'Joy', 'Mrinal', 'Sai'],
                          'Chocolate': ['Nutella', 'Lindt', 'Lindt', 'Lindt', 'Kit-Kat', 'Lindt']})
df_inventory = pd.DataFrame({'Shop': ['Shop1', 'Shop2', 'Shop1'],
                             'Chocolate': ['Lindt', 'Lindt', 'Nutella'],
                             'Count': [2, 3, 5]})

df_inventory = df_inventory.sort_values(by=['Count'], ascending=False, kind='mergesort')
df_inventory = df_inventory.drop_duplicates(subset='Chocolate')
df_inv1 = df_inventory.set_index('Chocolate').to_dict()['Shop']
df_demand['Shop'] = df_demand['Chocolate'].map(df_inv1)

Output of the above code (every Lindt request maps to Shop2, the single shop kept per chocolate):

     Name Chocolate   Shop
0   Kirti   Nutella  Shop1
1   Rahul     Lindt  Shop2
2     Sam     Lindt  Shop2
3     Joy     Lindt  Shop2
4  Mrinal   Kit-Kat    NaN
5     Sai     Lindt  Shop2
One way to do this is to number each chocolate's sale opportunities cumulatively and then use that number to merge each person's request with the corresponding shop.
import pandas as pd

df = pd.DataFrame(
    [['Shop1', 'Lindt', 1],
     ['Shop1', 'Milka', 1],
     ['Shop2', 'Lindt', 3],
     ['Shop3', 'Lindt', 3],
     ['Shop3', 'Milka', 3]],
    columns=['Shop', 'Chocolate', 'Count'])
dk = pd.DataFrame(
    [['Alfred', 'Milka'],
     ['Berta', 'Milka'],
     ['Charlie', 'Milka'],
     ['Darius', 'Milka'],
     ['Emil', 'Milka'],
     ['George', 'Lindt'],
     ['Francois', 'Milka']],
    columns=['Name', 'Chocolate'])

# running total of units available per chocolate across the shops
df['max_satisfaction'] = df.groupby('Chocolate')['Count'].cumsum()
df['min_satisfaction'] = df['max_satisfaction'] - df['Count']
# each shop covers the requests numbered [min_satisfaction, max_satisfaction)
df['satisfies'] = df.apply(
    lambda x: list(range(x['min_satisfaction'], x['max_satisfaction'])), axis=1)
df = df.explode('satisfies')
df['satisfies'] = df['satisfies'].astype(int)   # object -> int so the merge keys match

# number each request per chocolate and match it to the shop that covers it
dk['request_number'] = dk.groupby('Chocolate').cumcount()
dk = dk.merge(df, how='left',
              left_on=['Chocolate', 'request_number'],
              right_on=['Chocolate', 'satisfies'])
dk[['Name', 'Chocolate', 'Shop']]
Note that this solution can be quite expensive if the shops have far more supply than demand. A limit to prevent the explosion of df could, however, easily be implemented, as sketched below.
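For example, a minimal sketch of such a limit (my addition, not part of the original answer): before exploding, cap each shop's 'satisfies' range at the total demand for that chocolate, so explode never creates more rows than there are requests.

demand = dk['Chocolate'].value_counts()   # number of requests per chocolate
df['satisfies'] = df.apply(
    lambda x: list(range(x['min_satisfaction'],
                         min(x['max_satisfaction'],
                             demand.get(x['Chocolate'], 0)))),
    axis=1)
# shops whose whole range falls beyond the demand get an empty list;
# drop those rows after the explode, before casting to int
df = df.explode('satisfies').dropna(subset=['satisfies'])
df['satisfies'] = df['satisfies'].astype(int)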

Pandas - GroupBy 2 Columns - Unable to reset index

I have a DF as follows:

Date Bought  Fruit
2018-01      Apple
2018-02      Orange
2018-02      Orange
2018-02      Lemon
I wish to group the data by 'Date Bought' & 'Fruit' and count the occurrences.
Expected result:

Date Bought  Fruit   Count
2018-01      Apple   1
2018-02      Orange  2
2018-02      Lemon   1
What I get:

Date Bought  Fruit   Count
2018-01      Apple   1
2018-02      Orange  2
             Lemon   1
Code used:

Initial attempt:

df.groupby(['Date Bought', 'Fruit'])['Fruit'].agg('count')

#2
df.groupby(['Date Bought', 'Fruit'])['Fruit'].agg('count').reset_index()

ERROR: cannot insert Fruit, already exists

#3
df.groupby(['Date Bought', 'Fruit'])['Fruit'].agg('count').reset_index(inplace=True)

ERROR: TypeError: Cannot reset_index inplace on a Series to create a DataFrame
The documentation shows that groupby returns a 'GroupBy object', not a standard DF. How can I group the data as above and retain the DF format?
The problem here is that resetting the index would create two columns with the same name. Because you are working with a Series, you can set the name parameter in Series.reset_index:
df1 = (df.groupby(['Date Bought', 'Fruit'], sort=False)['Fruit']
         .agg('count')
         .reset_index(name='Count'))
print(df1)
  Date Bought   Fruit  Count
0     2018-01   Apple      1
1     2018-02  Orange      2
2     2018-02   Lemon      1
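Equivalently (a variant, not from the original answer), GroupBy.size sidesteps the name collision entirely, since its result is an unnamed Series:

df1 = (df.groupby(['Date Bought', 'Fruit'], sort=False)
         .size()
         .reset_index(name='Count'))

Note that size counts all rows, including NaN values, while count only counts non-NaN values; for this data the two agree.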

Use Excel sheet to create dictionary in order to replace values

I have an Excel file with product names. The first row holds the categories (A1: Water, A2: Sparkling, A3: Still, B1: Soft Drinks, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.), and each cell below a category is a different product. I want to keep this list in a viewable format (not comma-separated etc.), as that makes it very easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps, I can also save the Excel file as CSV, and I can also move the categories from the top row to the first column.
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If a product is not in the Excel file, it is not replaced (e.g. Cookie).
print(df)

     Product  Quantity
0  Coca Cola      1234
1     Cookie         4
2      Still       333
3      Chips        88
Expected Outcome:

print(df1)

       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88
Use DataFrame.melt with DataFrame.dropna, or DataFrame.stack, to build a helper Series (here df1 is the product sheet loaded as a DataFrame, with the categories as column headers), and then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']

Alternative:

s = df1.stack().reset_index(name='v').set_index('v')['level_1']

df['Product'] = df['Product'].replace(s)
# if performance is important:
# df['Product'] = df['Product'].map(s).fillna(df['Product'])
print(df)
       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88
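For completeness, a minimal sketch of loading the product sheet into df1 (the file name products.xlsx is an assumption; all the answer needs is that the categories end up as the column headers):

import pandas as pd

# hypothetical file name; the first row (the categories) becomes the column headers
df1 = pd.read_excel('products.xlsx')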

Pandas group and join

I am new to pandas. I want to analyse the following case. Say a fruit market publishes the prices of its fruits daily from 18:00 to 22:00, updating the price list every half hour. Suppose the market gives the prices of the fruits at 18:00 as follows:
Fruit   Price
Apple   10
Banana  20
Half an hour later, at 18:30, the list has been updated as follows:
Fruit      Price
Apple      10
Banana     21
Orange     30
Grapes     25
Pineapple  65
I want to check whether the prices of the fruits have changed between the recent list [18:30] and the earlier one [18:00].
The result I want is:

Fruit   18:00  18:30
Banana  20     21
To solve this I am thinking of the following:
1) Add a time column to both dataframes.
2) Merge the tables into one.
3) Make a pivot table with the fruit name as the index and ['Time', 'Price'] as the columns.
I don't know how to intersect the two dataframes grouped by time. How do I get the common rows of the two dataframes?
You don't need to pivot in this case; we can simply use merge with the suffixes argument to get the desired result:
df_update = pd.merge(df, df2, on='Fruit', how='outer', suffixes=['_1800h', '_1830h'])

       Fruit  Price_1800h  Price_1830h
0      Apple         10.0         10.0
1     Banana         20.0         21.0
2     Orange          NaN         30.0
3     Grapes          NaN         25.0
4  Pineapple          NaN         65.0
Edit

Why use the outer argument? Because we want to keep all the new data that appears in df2. If we used inner, for example, we would lose the newly added fruits, as below. Unless that is the desired output, which is not clear from the question.

df_update = pd.merge(df, df2, on='Fruit', how='inner', suffixes=['_1800h', '_1830h'])

    Fruit  Price_1800h  Price_1830h
0   Apple           10         10.0
1  Banana           20         21.0
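To reduce the merged frame to just the fruits whose price actually changed, as in the expected output, one possible follow-up filter (my sketch, not part of the original answer) is:

changed = df_update.dropna(subset=['Price_1800h', 'Price_1830h'])   # quoted at both times
print(changed[changed['Price_1800h'] != changed['Price_1830h']])    # -> the Banana row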
If Fruit is the index of your dataframes, the following should work. The idea is to return the rows where the prices are unequal:

df = pd.DataFrame()          # collects both price columns, aligned on the Fruit index
df['1800'] = df1['Price']
df['1830'] = df2['Price']
print(df.loc[df['1800'] != df['1830']])
You can also use a datetime as the column heading.
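For completeness, the concat-and-pivot route sketched in the question also works; a minimal sketch, assuming df and df2 are the 18:00 and 18:30 frames with Fruit as a regular column:

df['Time'] = '18:00'
df2['Time'] = '18:30'
wide = pd.concat([df, df2]).pivot(index='Fruit', columns='Time', values='Price')

present = wide.dropna()   # fruits quoted at both times
print(present[present['18:00'] != present['18:30']])   # -> Banana  20  21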

Summing in a Dataframe over one column while keeping others

In a pandas Dataframe df I have columns like this:

     NAME  KEYWORD  AMOUNT          INFO
0  orange    fruit      13    from italy
1  potato   veggie       7  from germany
2  potato   veggie       9  from germany
3  orange    fruit       8    from italy
4  potato   veggie       6  from germany
Doing a groupby on KEYWORD, I want to build the sum of the AMOUNT values per group and keep the first value from each of the other columns, so that the result reads:

     NAME  KEYWORD  AMOUNT          INFO
0  orange    fruit      21    from italy
1  potato   veggie      22  from germany
I tried

df.groupby('KEYWORD').sum()

but this "summarises" over all columns, i.e. I get

                 NAME  KEYWORD  AMOUNT                                   INFO
0        orangeorange    fruit      21                   from italyfrom italy
1  potatopotatopotato   veggie      22  from germanyfrom germanyfrom germany
Then I tried to use different functions for different columns:

df.groupby('KEYWORD').agg({'AMOUNT': sum, 'NAME': first, ...})

with

def first(f_arg, *args):
    return f_arg

But this unfortunately gives me a "ValueError: function does not reduce" error.
So I am a bit at a loss. How can I apply sum only to the AMOUNT column, while keeping the others?
Use groupby + agg with a custom aggfunc dict.
f = dict.fromkeys(df.columns.difference(['KEYWORD']), 'first')
f['AMOUNT'] = sum
df = df.groupby('KEYWORD', as_index=False).agg(f)
df

  KEYWORD    NAME  AMOUNT          INFO
0   fruit  orange      21    from italy
1  veggie  potato      22  from germany

dict.fromkeys gives me a nice way of generalising this for N columns. If column order matters, add a reindex operation at the end:

df = df.groupby('KEYWORD', as_index=False).agg(f).reindex(columns=df.columns)
df

     NAME  KEYWORD  AMOUNT          INFO
0  orange    fruit      21    from italy
1  potato   veggie      22  from germany
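On pandas 0.25+, the same result can also be written with named aggregation (a variant, not part of the original answer), which spells out each output column explicitly:

df = df.groupby('KEYWORD', as_index=False).agg(
    NAME=('NAME', 'first'),
    AMOUNT=('AMOUNT', 'sum'),
    INFO=('INFO', 'first'))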
Use drop_duplicates by column KEYWORD and then assign the aggregated values:

df = df.drop_duplicates('KEYWORD').assign(
         AMOUNT=df.groupby('KEYWORD')['AMOUNT'].sum().values)
print(df)

     NAME  KEYWORD  AMOUNT          INFO
0  orange    fruit      21    from italy
1  potato   veggie      22  from germany
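One caveat (my observation, not from the original answer): the .values trick relies on the groupby result, which is sorted by key, lining up with the first-occurrence order that drop_duplicates keeps; here 'fruit' happens to come first both ways. Mapping the sums by key avoids that assumption:

sums = df.groupby('KEYWORD')['AMOUNT'].sum()
df = df.drop_duplicates('KEYWORD').assign(AMOUNT=lambda d: d['KEYWORD'].map(sums))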
