Pandas group and join - python

I am new to pandas. I want to analyze the following case. Say a fruit market publishes fruit prices daily from 18:00 to 22:00, updating the prices every half hour within that window. Suppose the market gives the prices of the fruits at 18:00 as follows,
Fruit Price
Apple 10
Banana 20
Half an hour later, at 18:30, the list has been updated as follows,
Fruit Price
Apple 10
Banana 21
Orange 30
Grapes 25
Pineapple 65
I want to check whether the prices of the fruits have changed between the recent list [18:30] and the earlier one [18:00].
Here I want to get the result as,
Fruit 18:00 18:30
Banana 20 21
To solve this I am thinking of doing the following:
1) Add a time column to the two data frames.
2) Merge the tables into one.
3) Make a pivot table with the fruit name as the index and ['Time','Price'] as the columns.
I don't know how to intersect the two data frames grouped by Time. How do I get the common rows of the two data frames?

You don't need to pivot in this case; we can simply use merge with the suffixes argument to get the desired result:
df_update = pd.merge(df, df2, on='Fruit', how='outer', suffixes=['_1800h', '_1830h'])
Fruit Price_1800h Price_1830h
0 Apple 10.0 10.0
1 Banana 20.0 21.0
2 Orange NaN 30.0
3 Grapes NaN 25.0
4 Pineapple NaN 65.0
Edit
Why are we using the outer argument? Because we want to keep all the new data that is updated in df2. If we use inner, for example, we will not get the newly added fruits, as shown below. Unless that is the output the OP wants, which is not clear in this case.
df_update = pd.merge(df, df2, on='Fruit', how='inner', suffixes=['_1800h', '_1830h'])
Fruit Price_1800h Price_1830h
0 Apple 10 10.0
1 Banana 20 21.0
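And if the goal is to show only the fruits whose price actually changed (the Banana row in the OP's example), you can filter the merged frame afterwards. A minimal sketch, assuming df and df2 hold the 18:00 and 18:30 lists:
import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Banana'], 'Price': [10, 20]})
df2 = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
                    'Price': [10, 21, 30, 25, 65]})

df_update = pd.merge(df, df2, on='Fruit', how='outer', suffixes=['_1800h', '_1830h'])

# keep rows where both prices exist and differ
changed = df_update[df_update['Price_1800h'].notna()
                    & df_update['Price_1830h'].notna()
                    & (df_update['Price_1800h'] != df_update['Price_1830h'])]
print(changed)
#     Fruit  Price_1800h  Price_1830h
# 1  Banana         20.0         21.0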

If Fruit is the index of your data frames, the following code should work. The idea is to return rows with inequality:
df['1800'] = df1['Price']
df['1830'] = df2['Price']
print(df.loc[df['1800'] != df['1830']])
You can also use datetime values in your column headings.
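For completeness, a runnable version of this idea; the union index for df is my assumption, and note that new fruits show up too, because comparisons against NaN evaluate as not-equal:
import pandas as pd

df1 = pd.DataFrame({'Price': [10, 20]}, index=['Apple', 'Banana'])
df2 = pd.DataFrame({'Price': [10, 21, 30, 25, 65]},
                   index=['Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'])

df = pd.DataFrame(index=df1.index.union(df2.index))
df['1800'] = df1['Price']   # aligned on the Fruit index
df['1830'] = df2['Price']
print(df.loc[df['1800'] != df['1830']])   # rows where prices differ, incl. new fruits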

Related

Merge two data frames by comparing values but not the column names

DataFrame 1 - Price of Fruits by date (Index is a date)
fruits_price = {'Apple': [9,5,14],
                'Orange': [10,12,10],
                'Kiwi': [5,4,20],
                'Watermelon': [4.4,5.4,6.4]}
df1 = pd.DataFrame(fruits_price,
                   columns=['Apple','Orange','Kiwi','Watermelon'],
                   index=['2020-01-01','2020-01-02','2020-01-10'])
date        Apple  Orange  Kiwi  Watermelon  ...  Fruit_100
2020-01-01      9      10     5         4.4
2020-01-02      5      12     4         5.4
...
2020-01-10     14      10    20         6.4
DataFrame 2 - Top fruits by rank (Index is a date)
top_fruits = {'Fruit_1': ['Apple','Apple','Apple'],
              'Fruit_2': ['Kiwi','Orange','Kiwi'],
              'Fruit_3': ['Orange','Watermelon','Watermelon'],
              'Fruit_4': ['Watermelon','Kiwi','Orange']}
df2 = pd.DataFrame(top_fruits,
                   columns=['Fruit_1','Fruit_2','Fruit_3','Fruit_4'],
                   index=['2020-01-01','2020-01-02','2020-01-10'])
date        Fruit_1  Fruit_2  Fruit_3     Fruit_4     ...  Fruit_100
2020-01-01  Apple    Kiwi     Orange      Watermelon       Pineapple
2020-01-02  Apple    Orange   Watermelon  Kiwi             Pineapple
...
2020-01-10  Apple    Kiwi     Watermelon  Orange           Pineapple
I want DataFrame 3 (the price of the top fruit for the given date), which tells me the price of each top fruit on that date:
date Price_1 Price_2 Price_3 Price_4 ..... Price_100
2020-01-01 9 5 10 4.4
2020-01-02 5 12 5.4 4
...
2020-01-10 14 20 6.4 10
I spent almost a whole night on this. I tried iterating over DataFrame 2 with an inner loop over DataFrame 1 and adding values to DataFrame 3, in almost 6-7 different ways using iterrows, iteritems, and storing output directly via iloc into df3. None of them worked.
Just wondering whether there is an easier way to do this.
I will later multiply this by the sales of the fruits in the same DataFrame format.
Just use the apply function with axis=1. This works row by row; each row is a Series whose name is the date, so we replace its values using the corresponding row of df1.
df2.apply(lambda x: x.replace(df1.to_dict('index')[x.name]), axis=1)
Make a dict from df1, and then use replace on df2:
import pandas as pd
fruits_price = {'Apple': [9,5,14],
                'Orange': [10,12,10],
                'Kiwi': [5,4,20],
                'Watermelon': [4.4,5.4,6.4]}
df1 = pd.DataFrame(fruits_price,
                   columns=['Apple','Orange','Kiwi','Watermelon'],
                   index=['2020-01-01','2020-01-02','2020-01-10'])
top_fruits = {'Fruit_1': ['Apple','Apple','Apple'],
              'Fruit_2': ['Kiwi','Orange','Kiwi'],
              'Fruit_3': ['Orange','Watermelon','Watermelon'],
              'Fruit_4': ['Watermelon','Kiwi','Orange']}
df2 = pd.DataFrame(top_fruits,
                   columns=['Fruit_1','Fruit_2','Fruit_3','Fruit_4'],
                   index=['2020-01-01','2020-01-02','2020-01-10'])
result = df2.T.replace(df1.T.to_dict()).T
result.columns = [f"Price_{i}" for i in range(1, len(result.columns)+1)]
result
output:
Price_1 Price_2 Price_3 Price_4
2020-01-01 9.0 5.0 10.0 4.4
2020-01-02 5.0 12.0 5.4 4.0
2020-01-10 14.0 20.0 6.4 10.0

How to calculate when inventory will run out using pandas?

Suppose I have a DataFrame like so:
Item Check Date Inventory
Apple 1/1/2020 50
Banana 1/1/2020 80
Apple 1/2/2020 75
Banana 1/2/2020 300
Apple 2/1/2020 100
Apple 2/2/2020 98
Banana 2/2/2020 341
Apple 2/3/2020 95
Banana 2/3/2020 328
Apple 2/4/2020 90
Apple 2/5/2020 85
Banana 2/5/2020 325
I want to find the average rate of change in the inventory for a given item, starting from the max inventory count, then use that to compute the day the inventory will reach zero.
So for apples it would be, starting from 2/1: (2+3+5+5)/4 = 3.75; similarly for bananas, starting from 2/2: (13+3)/2 = 8.
Since there are different items, I have used:
apples = df[df["Item"] == "Apple"]
to get a dataframe for just the apples, then used:
apples["Inventory"].idxmax()
to find the row with the max inventory count.
However, this gives me the row label relative to the original dataframe. So I'm not sure where to go from here, since my plan was to get the date of the row with the max inventory count and then ignore any earlier dates.
You can still use idxmax, but with transform:
s = df[df.index >= df.groupby('Item').Inventory.transform('idxmax')]
out = s.groupby('Item')['Inventory'].apply(lambda x: -x.diff().mean())
Item
Apple 3.75
Banana 8.00
Name: Inventory, dtype: float64
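The question also asks on what day the inventory will reach zero. One possible extension, not part of the answer above: treat the average drop per check as a daily rate (which matches the daily February checks in the example) and extrapolate from each item's latest reading. A sketch, rebuilding the example frame with parsed dates:
import pandas as pd

df = pd.DataFrame({
    'Item': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple', 'Apple',
             'Banana', 'Apple', 'Banana', 'Apple', 'Apple', 'Banana'],
    'Check Date': ['1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020',
                   '2/1/2020', '2/2/2020', '2/2/2020', '2/3/2020',
                   '2/3/2020', '2/4/2020', '2/5/2020', '2/5/2020'],
    'Inventory': [50, 80, 75, 300, 100, 98, 341, 95, 328, 90, 85, 325],
})
df['Check Date'] = pd.to_datetime(df['Check Date'])

# average drop per check from each item's peak, as in the answer above
s = df[df.index >= df.groupby('Item').Inventory.transform('idxmax')]
out = s.groupby('Item')['Inventory'].apply(lambda x: -x.diff().mean())

last = df.groupby('Item').last()          # latest reading per item
days_left = last['Inventory'] / out       # e.g. Apple: 85 / 3.75 ≈ 22.7
run_out = last['Check Date'] + pd.to_timedelta(days_left.round(), unit='D')
print(run_out)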

mapping two data frames using one common column

I have two data frames (FYI, one was created using groupby).
I want to map data array x, which looks like the below,
Fruit #
Apple 2
Pear 5
lemon 1
into Data Frame y which looks like the below
Date Fruit Cost
Mon Apple 1.00
Mon Pear 2.00
Tues lemon 1.50
Tues Apple 1.00
When mapping into y I want to create a new column called #. So the final outcome should look like the below:
Date Fruit Cost #
Mon Apple 1.00 2
Mon Pear 2.00 5
Tues lemon 1.50 1
Tues Apple 1.00 2
I have tried using the below
y['#'] = np.where(y['Fruit'].map(x.set_index('Fruit')['#']))
however, this raises ValueError: Length of values does not match length of index.
The length of both data arrays could also vary depending on the underlying data. Any suggestions would be most appreciated.
Thanks
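One likely fix, for illustration: y['Fruit'].map(...) already returns a Series aligned with y, so it can be assigned directly. np.where called with a single argument returns a tuple of index arrays rather than values, which would explain the length mismatch. A minimal sketch, assuming x and y as shown above:
import pandas as pd

x = pd.DataFrame({'Fruit': ['Apple', 'Pear', 'lemon'], '#': [2, 5, 1]})
y = pd.DataFrame({'Date': ['Mon', 'Mon', 'Tues', 'Tues'],
                  'Fruit': ['Apple', 'Pear', 'lemon', 'Apple'],
                  'Cost': [1.00, 2.00, 1.50, 1.00]})

# map Fruit -> # through a lookup Series; no np.where needed
y['#'] = y['Fruit'].map(x.set_index('Fruit')['#'])
print(y)
#    Date  Fruit  Cost  #
# 0   Mon  Apple  1.00  2
# 1   Mon   Pear  2.00  5
# 2  Tues  lemon  1.50  1
# 3  Tues  Apple  1.00  2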

Regress by group in pandas dataframe and add columns with forecast values and beta/t-stats

here is an example of my dataframe df:
Category Y X1 X2
0 Apple 0.083050996 0.164056482 0.519875358
1 Apple 0.411044939 0.774160332 0.002869499
2 Apple 0.524315907 0.422193005 0.97720091
3 Apple 0.721124638 0.645927536 0.750210715
4 Berry 0.134488729 0.299288214 0.522933484
5 Berry 0.733162132 0.608742944 0.957595544
6 Berry 0.113051075 0.641533175 0.19799635
7 Berry 0.275379123 0.249143751 0.049082766
8 Carrot 0.588121494 0.750480977 0.615399987
9 Carrot 0.878221581 0.021366296 0.069184879
Now I want the code to run a regression for each Category (i.e., a cross-sectional regression grouped by Category: Apple, Berry, Carrot, etc.).
Then I want to add a new column df['Y_hat'] with the forecast values from the regression, plus the corresponding two beta and two t-statistic values (the beta and t-stat values would be repeated across rows of the same category).
The final df would have 5 additional columns: Y_hat, beta 1, beta 2, t-stat 1 and t-stat 2.
You want to do a lot of things with one "GroupBy" :)
I think it is better if you slice the DataFrame by Category, then store each category's results in a dictionary, which you use at the end of the loop to build your result DataFrame.
result = {}
# loop over every category
for category in df['Category'].unique():
    # slice
    df_slice = df[df['Category'] == category]
    # run all the stuff you want to do
    result[category] = {
        'predicted_value': ...,
        'Y_hat': ...,
        # etc.
    }
# build a dataframe with all your results
final_df = pd.DataFrame(result)
It will also be much easier if you ever need to debug! Good luck! :)
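To make the placeholders concrete, here is one possible way to fill them in with statsmodels; the OLS fit and the exact output column names are assumptions on my part, not something the answer prescribes (data values shortened from the question's example):
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    'Category': ['Apple'] * 4 + ['Berry'] * 4 + ['Carrot'] * 2,
    'Y':  [0.083, 0.411, 0.524, 0.721, 0.134, 0.733, 0.113, 0.275, 0.588, 0.878],
    'X1': [0.164, 0.774, 0.422, 0.646, 0.299, 0.609, 0.642, 0.249, 0.750, 0.021],
    'X2': [0.520, 0.003, 0.977, 0.750, 0.523, 0.958, 0.198, 0.049, 0.615, 0.069],
})

pieces = []
for category, df_slice in df.groupby('Category'):
    X = sm.add_constant(df_slice[['X1', 'X2']])   # intercept + the two regressors
    model = sm.OLS(df_slice['Y'], X).fit()
    out = df_slice.copy()
    out['Y_hat'] = model.predict(X)
    # betas and t-stats repeat on every row of the category, as requested
    out['beta 1'], out['beta 2'] = model.params['X1'], model.params['X2']
    out['t-stat 1'], out['t-stat 2'] = model.tvalues['X1'], model.tvalues['X2']
    pieces.append(out)

# note: Carrot has only 2 rows, fewer than the 3 parameters, so its
# t-stats come out as NaN; a real cross-section needs more observations
final_df = pd.concat(pieces)
print(final_df)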

Changing CSV files in python

I have a bunch of CSV files with 4-line headers. In these files, I want to change the values in the sixth column based on the values in the second column. For example, if the second column, under the name PRODUCT, is Banana, I want to change the value in the same row under TIME to 10m. If the product is Apple, I want the time to be 15m, and so on.
When 12:07
Area Produce
Store Name FF
Eatfresh
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 5m
2 Apple 400000 F4 8m
3 Pair 6m
4 Banana 4000 G3 7m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 6m
Desired Output
When 12:07
Area Produce
Store Name FF
Eatfresh
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 10m
2 Apple 400000 F4 15m
3 Pair 6m
4 Banana 4000 G3 10m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 15m
I want all of them to be output to a directory called NTime. Here is what I have thus far, but being new to coding, I don't really understand a great deal and have gotten stuck on how to make the actual changes. I found Python/pandas idiom for if/then/else and it seems similar to what I want to do, but I don't completely understand what is going on.
import pandas as pd
import glob
import os
fns = glob.glob('*.csv')
colname1 = 'PRODUCT'
colname2 = 'TIME'
for csv in fns:
    s = pd.read_csv(csv, usecols=[colname1], squeeze=True, skiprows=4, header=0)
    with open(os.path.join('NTime', csv), 'wb') as f:   # 'fn' was undefined; use the loop variable
        ...  # stuck here: how do I change TIME based on PRODUCT and write the result?
Can someone help me?
You can do this with a combination of groupby, replace and a dict
In [76]: from pandas import DataFrame, Series
In [77]: fruits = ['banana', 'apple', 'pear', 'banana', 'watermelon', 'orange', 'apple']
In [78]: times = ['5m', '8m', '6m', '7m', '13m', '2m', '6m']
In [79]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m'}
In [80]: df = DataFrame({'fruits': fruits, 'time': times})
Out[80]:
fruits time
0 banana 5m
1 apple 8m
2 pear 6m
3 banana 7m
4 watermelon 13m
5 orange 2m
6 apple 6m
In [81]: def replacer(g, time_map):
   ....:     tv = g.time.values
   ....:     return g.replace(to_replace=tv, value=time_map.get(g.name, tv))
In [82]: df.groupby('fruits').apply(replacer, time_map)
Out[82]:
fruits time
0 banana 10m
1 apple 15m
2 pear 5m
3 banana 10m
4 watermelon 13m
5 orange 2m
6 apple 15m
You said you're new to programming so I'll explain what's going on.
df.groupby('fruits') splits the DataFrame into subsets (which are DataFrames or Series objects) using the values of the fruits column.
The apply method applies a function to each of the aforementioned subsets and concatenates the result (if needed).
replacer is where the "magic" happens: each group's time values get replaced (to_replace) with the new value that's defined in time_map. The get method of dicts allows you to provide a default value if the key you're searching for (the fruit name in this case) is not there. nan is commonly used for this purpose, but here I'm actually just using the time that was already there if there isn't a new one defined for it in the time_map dict.
One thing to note is my use of g.name. This doesn't normally exist as an attribute on DataFrames (you can of course define it yourself if you want to), but it is set on each group so you can perform computations that may require the group name. In this case that's the "current" fruit you're looking at when you apply your function.
If you have a new value for each fruit or you write in the old values manually you can shorten this to a one-liner:
In [130]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m', 'orange': '10m', 'watermelon': '100m'}
In [131]: s = Series(time_map, name='time')
In [132]: s[df.fruits]
Out[132]:
fruits
banana 10m
apple 15m
pear 5m
banana 10m
watermelon 100m
orange 10m
apple 15m
Name: time, dtype: object
In [133]: s[df.fruits].reset_index()
Out[133]:
fruits time
0 banana 10m
1 apple 15m
2 pear 5m
3 banana 10m
4 watermelon 100m
5 orange 10m
6 apple 15m
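To tie this back to the OP's actual task of writing each modified file into NTime while keeping the 4-line header, one possible loop; this assumes the files really are comma-separated after the preamble, and uses the OP's capitalized product names:
import glob
import os
import pandas as pd

time_map = {'Banana': '10m', 'Apple': '15m'}
os.makedirs('NTime', exist_ok=True)

for path in glob.glob('*.csv'):
    with open(path) as f:
        preamble = [next(f) for _ in range(4)]   # the 4 header lines, kept verbatim
        df = pd.read_csv(f)                      # column names start on line 5
    df['TIME'] = df['PRODUCT'].map(time_map).fillna(df['TIME'])
    with open(os.path.join('NTime', os.path.basename(path)), 'w') as out:
        out.writelines(preamble)
        df.to_csv(out, index=False)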
Assuming that your data is in a Pandas DataFrame and looks something like this:
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 5m
2 Apple 400000 F4 8m
3 Pair 6m
4 Banana 4000 G3 7m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 6m
Then you should be able to manipulate values in one column based on values in another column (in the same row) using a simple loop like this:
for numi, i in enumerate(df["PRODUCT"]):
    if i == "Banana":
        df.loc[numi, "TIME"] = "10m"   # .loc avoids pandas' chained-assignment warning
    if i == "Apple":
        df.loc[numi, "TIME"] = "15m"
The code first loops through the rows of the dataframe column "PRODUCT", with the row value stored as i and the row-number stored as numi. It then uses if statements to identify the different levels of interest in the Product column. For those rows with the levels of interest (eg "Banana" or "Apple"), it uses the row-numbers to change the value of another column in the same row.
There are lots of ways to do this, and depending on the size of your data and the number of levels (in this case products) you want to change, this isn't necessarily the most efficient approach. But since you're a beginner, it's a good basic way to start.
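As a point of comparison, the loop above can usually be replaced by a vectorized map plus fillna, which avoids the row-by-row writes (a sketch, not something the answer itself proposes):
import pandas as pd

df = pd.DataFrame({'PRODUCT': ['Banana', 'Apple', 'Pair', 'Banana'],
                   'TIME': ['5m', '8m', '6m', '7m']})

# look up each product's new time; keep the old time where no mapping exists
new_time = df['PRODUCT'].map({'Banana': '10m', 'Apple': '15m'})
df['TIME'] = new_time.fillna(df['TIME'])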
