I have 2 pandas DataFrames, this one:
item inStock description
Apples 10 a juicy treat
Oranges 34 mediocre at best
Bananas 21 can be used as phone prop
<...many other fruits...>
Kiwi 0 too fuzzy
and a lookup table with only a subset of the items above:
item Price
Apples 1.99
Oranges 6.99
I would like to scan through the first table and fill in a price column for the DataFrame when the fruit in the first DataFrame matches the fruit in the second:
item inStock description Price
Apples 10 a juicy treat 1.99
Oranges 34 mediocre at best 6.99
Bananas 21 can be used as phone prop
<...many other fruits...>
Kiwi 0 too fuzzy
I've looked at examples with the built-in lookup function, as well as using a where-in type function but I cannot seem to get the syntax to work. Can someone help me out?
import pandas as pd
df_item= pd.read_csv('Item.txt')
df_price= pd.read_csv('Price.txt')
df_final=pd.merge(df_item,df_price ,on='item',how='left' )
print df_final
output
item inStock description Price
0 Apples 10 a juicy treat 1.99
1 Oranges 34 mediocre at best 6.99
2 Bananas 21 can be used as phone prop NaN
3 Kiwi 0 too fuzzy NaN
Related
I'm struggling with next task: I would like to identify using pandas (or any other tool on python) if any of multiple cells (Fruit 1 through Fruit 3) in each row from Table 2 contains in column Fruits of Table1. And at the end obtain "Contains Fruits Table 2?" table.
Fruits
apple
orange
grape
melon
Name
Fruit 1
Fruit 2
Fruit 3
Contains Fruits Table 2?
Mike
apple
Yes
Bob
peach
pear
orange
Yes
Jack
banana
No
Rob
peach
banana
No
Rita
apple
orange
banana
Yes
Fruits in Table 2 can be up to 40 columns. Number of rows in Table1 is about 300.
I hope it is understandable, and someone can help me resolve this.
I really appreciate the support in advance!
Try:
filter DataFrame to include columns that contain the word "Fruit"
Use isin to check if the values are in table1["Fruits"]
Return True if any of fruits are found
map True/False to "Yes"/"No"
table2["Contains Fruits Table 2"] = table2.filter(like="Fruit")
.isin(table1["Fruits"].tolist())
.any(axis=1)
.map({True: "Yes", False: "No"})
>>> table2
Name Fruit 1 Fruit 2 Fruit 3 Contains Fruits Table 2
0 Mike apple None None Yes
1 Bob peach pear orange Yes
2 Jack banana None None No
3 Rob peach banana None No
4 Rita apple orange banana Yes
~~~
I have two dataframes the first one:
df1:
product price
0 apples 1.99
1 bananas 1.20
2 oranges 1.49
3 lemons 0.5
4 Olive Oil 8.99
df2:
product product.1 product.2
0 apples bananas Olive Oil
1 bananas lemons oranges
2 Olive Oil bananas oranges
3 lemons apples bananas
I want a column in the second dataframe to be the sum of the prices base on the price of each item in the first dataframe. So desired outcome would be:
product product.1 product.2 total_price
0 apples bananas Olive Oil 12.18
1 bananas lemons oranges 3.19
2 Olive Oil bananas oranges 11.68
3 lemons apples bananas 3.69
What is the best way to accomplish this? I have tried merging the dataframes on the name for each of the columns in df2 but this seems time consuming especially as df1 gets more rows and df2 gets more columns.
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product')
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product.1')
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product.2')
df['Total_Price'] = df['price']+df['price.1']+df['price.2']
You can try something like below:
First, converting df1 to dictionary of keys and values
Using dictionary in above with applymap followed by sum
May be following snippet will do something similar:
dictionary_val = { k[0]: k[1] for k in df1.values }
df2['Total_Price'] = df2.applymap(lambda row: dictionary_val[row]).sum(axis=1) # Note not creating new dataframe but using existing one
Then result is df2:
product product.1 product.2 Total_Price
0 apples bananas Olive Oil 12.18
1 bananas lemons oranges 3.19
2 Olive Oil bananas oranges 11.68
3 lemons apples bananas 3.69
Suppose I have a DataFrame like so:
Item Check Date Inventory
Apple 1/1/2020 50
Banana 1/1/2020 80
Apple 1/2/2020 75
Banana 1/2/2020 300
Apple 2/1/2020 100
Apple 2/2/2020 98
Banana 2/2/2020 341
Apple 2/3/2020 95
Banana 2/3/2020 328
Apple 2/4/2020 90
Apple 2/5/2020 85
Banana 2/5/2020 325
I want to find the average rate of change in the inventory for a given item starting from the max inventory count, then use that to compute what day the inventory will reach zero.
So for apples it would be starting from 2/1: 2+3+5+5/4 = 3.75, similarly for bananas starting from 2/2 13+3/2 = 8.
Since there are different items, I have used:
apples = df[df["Item"] == "apples"]
to get a dataframe for just the apples, then used:
apples["Inventory"].idxmax()
to find the row with the max inventory count.
However, this gives me the row label of the row for the original dataframe. So I'm not sure where to go from here since my plan was to then get the date off the row with the max inventory count, then ignore any dates before that.
You can still use the idxmax but with transform
s=df[df.index>=df.groupby('Item').Inventory.transform('idxmax')]
out=s.groupby('Item')['Inventory'].apply(lambda x : -x.diff().mean())
Item
Apple 3.75
Banana 8.00
Name: Inventory, dtype: float64
I am new to pandas. I want to analysis the following case. Let say, A fruit market is giving the prices of the fruits daily the time from 18:00 to 22:00. For every half an hour they are updating the price of the fruits between the time lab. Consider the market giving the prices of the fruits at 18:00 as follows,
Fruit Price
Apple 10
Banana 20
After half an hour at 18:30, the list has been updated as follows,
Fruit Price
Apple 10
Banana 21
Orange 30
Grapes 25
Pineapple 65
I want to check has the prices of the fruits been changed of recent one[18:30] with the earlier one[18:00].
Here I want to get the result as,
Fruit 18:00 18:30
Banana 20 21
To solve this I am thinking to do the following,
1) Add time column in the two data frames.
2) Merge the tables into one.
3) Make a Pivot table with Index Fruit name and Column as ['Time','Price'].
I don't know how to get intersect the two data frames of grouped by Time. How to get the common rows of the two Data Frames.
You dont need to pivot in this case, we can simply use merge and use suffixes argument to get the desired results:
df_update = pd.merge(df, df2, on='Fruit', how='outer', suffixes=['_1800h', '_1830h'])
Fruit Price_1800h Price_1830h
0 Apple 10.0 10.0
1 Banana 20.0 21.0
2 Orange NaN 30.0
3 Grapes NaN 25.0
4 Pineapple NaN 65.0
Edit
Why are we using the outer argument? We want to keep all the new data that is updated in df2. If we use inner for example, we will not get the updated fruits, like below. Unless this is the desired output by OP, which is not clear in this case.
df_update = pd.merge(df, df2, on='Fruit', how='inner', suffixes=['_1800h', '_1830h'])
Fruit Price_1800h Price_1830h
0 Apple 10 10.0
1 Banana 20 21.0
If Fruit is the index of your data frame the following code should work. The Idea is to return rows with inequality:
df['1800'] = df1['Price']
df['1830'] = df2['Price']
print(df.loc[df['1800'] != df1['1830']])
You can also use datetime in your column heading.
In a pandas Dataframe df I have columns likes this:
NAME KEYWORD AMOUNT INFO
0 orange fruit 13 from italy
1 potato veggie 7 from germany
2 potato veggie 9 from germany
3 orange fruit 8 from italy
4 potato veggie 6 from germany
Doing a groupby KEYWORD operation I want to build the sum of the AMOUNT values per group and keep from the other columns always the first value, so that the result reads:
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
I tried
df.groupby('KEYWORD).sum()
but this "summarises" over all columns, i.e I get
NAME KEYWORD AMOUNT INFO
0 orangeorange fruit 21 from italyfrom italy
1 potatopotatopotato veggie 22 from germanyfrom germanyfrom germany
Then I tried to use different functions for different columns:
df.groupby('KEYWORD).agg({'AMOUNT': sum, 'NAME': first, ....})
with
def first(f_arg, *args):
return f_arg
But this gives me unfortunately a "ValueError: function does not reduce" error.
So I am a bit at a loss. How can I apply sum only to the AMOUNT column, while keeping the others?
Use groupby + agg with a custom aggfunc dict.
f = dict.fromkeys(df.columns.difference(['KEYWORD']), 'first')
f['AMOUNT'] = sum
df = df.groupby('KEYWORD', as_index=False).agg(f)
df
KEYWORD NAME AMOUNT INFO
0 fruit orange 21 from italy
1 veggie potato 22 from germany
dict.fromkeys gives me a nice way of generalising this for N number of columns. If column order matters, add a reindex operation at the end:
df = df.groupby('KEYWORD', as_index=False).agg(f).reindex(columns=df.columns)
df
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
Use drop_duplicates by column KEYWORD and then assign aggregate values:
df=df.drop_duplicates('KEYWORD').assign(AMOUNT=df.groupby('KEYWORD')['AMOUNT'].sum().values)
print (df)
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany