DataFrame 1 - Price of Fruits by date (Index is a date)
fruits_price = {'Apple': [9,5,14],
'Orange': [10,12,10],
'Kiwi': [5,4,20],
'Watermelon': [4.4,5.4,6.4]}
df1 = pd.DataFrame(fruits_price,
columns = ['Apple','Orange','Kiwi','Watermelon'],
index=['2020-01-01','2020-01-02','2020-01-10'])
date        Apple  Orange  Kiwi  Watermelon ... Fruit_100
2020-01-01      9      10     5         4.4
2020-01-02      5      12     4         5.4
...
2020-01-10     14      10    20         6.4
DataFrame 2 - Top fruits by rank (Index is a date)
top_fruits = {'Fruit_1': ['Apple','Apple','Apple'],
'Fruit_2': ['Kiwi','Orange','Kiwi'],
'Fruit_3': ['Orange','Watermelon','Watermelon'],
'Fruit_4': ['Watermelon','Kiwi','Orange']}
df2 = pd.DataFrame(top_fruits,
columns = ['Fruit_1','Fruit_2','Fruit_3','Fruit_4'],
index=['2020-01-01','2020-01-02','2020-01-10'])
date        Fruit_1  Fruit_2  Fruit_3     Fruit_4     ... Fruit_100
2020-01-01  Apple    Kiwi     Orange      Watermelon      Pineapple
2020-01-02  Apple    Orange   Watermelon  Kiwi            Pineapple
...
2020-01-10  Apple    Kiwi     Watermelon  Orange          Pineapple
I want DataFrame 3 (price of the top fruits for each date), which tells me the price of each ranked fruit on the given date:
date        Price_1  Price_2  Price_3  Price_4 ... Price_100
2020-01-01        9        5       10      4.4
2020-01-02        5       12      5.4        4
...
2020-01-10       14       20      6.4       10
I spent almost a whole night on this. I tried iterating over DataFrame 2 with an inner loop over DataFrame 1, adding values to DataFrame 3. I tried 6-7 different approaches with iterrows, iteritems, and storing the output directly via iloc into df3. None of them worked.
Just wondering whether there is an easier way to do this.
I will later multiply this by the sales of fruits in the same DataFrame format.
Just use the apply function with axis=1. What this does is go row by row; each row is a Series whose name is the date, and each value is replaced using the corresponding row of df1.
df2.apply(lambda x: x.replace(df1.to_dict('index')[x.name]), axis=1)
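As a runnable sketch of this approach, reusing the question's data and renaming the columns to Price_i (the rename step is an addition here, not part of the one-liner above):

```python
import pandas as pd

fruits_price = {'Apple': [9, 5, 14], 'Orange': [10, 12, 10],
                'Kiwi': [5, 4, 20], 'Watermelon': [4.4, 5.4, 6.4]}
idx = ['2020-01-01', '2020-01-02', '2020-01-10']
df1 = pd.DataFrame(fruits_price, index=idx)

top_fruits = {'Fruit_1': ['Apple', 'Apple', 'Apple'],
              'Fruit_2': ['Kiwi', 'Orange', 'Kiwi'],
              'Fruit_3': ['Orange', 'Watermelon', 'Watermelon'],
              'Fruit_4': ['Watermelon', 'Kiwi', 'Orange']}
df2 = pd.DataFrame(top_fruits, index=idx)

# each row of df2 is a Series named by its date; replace the fruit names
# with that date's prices taken from the matching row of df1
prices = df2.apply(lambda x: x.replace(df1.to_dict('index')[x.name]), axis=1)
prices.columns = prices.columns.str.replace('Fruit', 'Price')
```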
Make a dict from df1, and then use replace on df2:
import pandas as pd
fruits_price = {'Apple': [9,5,14],
'Orange': [10,12,10],
'Kiwi': [5,4,20],
'Watermelon': [4.4,5.4,6.4]}
df1 = pd.DataFrame(fruits_price,
columns = ['Apple','Orange','Kiwi','Watermelon'],
index=['2020-01-01','2020-01-02','2020-01-10'])
top_fruits = {'Fruit_1': ['Apple','Apple','Apple'],
'Fruit_2': ['Kiwi','Orange','Kiwi'],
'Fruit_3': ['Orange','Watermelon','Watermelon'],
'Fruit_4': ['Watermelon','Kiwi','Orange']}
df2 = pd.DataFrame(top_fruits,
columns = ['Fruit_1','Fruit_2','Fruit_3','Fruit_4'],
index=['2020-01-01','2020-01-02','2020-01-10'])
result = df2.T.replace(df1.T.to_dict()).T
result.columns = [f"Price_{i}" for i in range(1, len(result.columns)+1)]
result
output:
Price_1 Price_2 Price_3 Price_4
2020-01-01 9.0 5.0 10.0 4.4
2020-01-02 5.0 12.0 5.4 4.0
2020-01-10 14.0 20.0 6.4 10.0
Input Data:
sn  fruits  Quality  Date
1   Apple   A        2022-09-01
2   Apple   A        2022-08-15
3   Apple   A        2022-07-15
4   Apple   B        2022-06-01
5   Apple   A        2022-05-15
6   Apple   A        2022-04-15
7   Banana  A        2022-08-15
8   Orange  A        2022-08-15
Get the average date difference for each type of fruit, but only where Quality = A and there are consecutive records with quality A.
If there are three consecutive rows of quality A, only the first 2 make a valid pair. The third one is not part of a valid pair, because the 4th record has Quality = B.
So in the data above we have 2 valid pairs for Apple: 1st pair = (1, 2) = 15 days date diff and 2nd pair = (5, 6) = 15 days diff, so the average for Apple is 15 days.
Expected output
fruits  avg time diff
Apple   15 days
Banana  null
Orange  null
How can I do this without using any looping in pandas dataframe?
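One loop-free sketch: mark runs of consecutive A-quality rows within each fruit, pair off rows inside each run, and average each pair's gap. (Note that with the sample dates above the two Apple pairs actually come out to 17 and 30 days, so the mean below is 23.5 days rather than the 15 quoted; the pairing logic is the point.)

```python
import pandas as pd

df = pd.DataFrame({
    'sn': [1, 2, 3, 4, 5, 6, 7, 8],
    'fruits': ['Apple'] * 6 + ['Banana', 'Orange'],
    'Quality': ['A', 'A', 'A', 'B', 'A', 'A', 'A', 'A'],
    'Date': pd.to_datetime(['2022-09-01', '2022-08-15', '2022-07-15',
                            '2022-06-01', '2022-05-15', '2022-04-15',
                            '2022-08-15', '2022-08-15']),
})

# a new "run" starts whenever the fruit or the quality changes
run = ((df['fruits'] != df['fruits'].shift())
       | (df['Quality'] != df['Quality'].shift())).cumsum()

# inside each A-quality run, rows at odd positions close a non-overlapping pair
pos = df.groupby(run).cumcount()
is_pair_end = df['Quality'].eq('A') & (pos % 2 == 1)

# date diff for each valid pair (dates are descending, so previous - current)
diff = (df['Date'].shift() - df['Date']).where(is_pair_end)
result = diff.groupby(df['fruits']).mean()
```

Fruits with no valid pair (Banana, Orange) come out as NaT, matching the nulls in the expected output.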
I have this scenario: I'm in the process of learning, and I'm cleaning a dataset. Now I have a problem.
There are a lot of rows with this issue: I have the key but not the product name, or I have the product name but not the key.
prod_key product
0 21.0 NaN
1 21.0 NaN
2 0.0 metal
3 35.0 NaN
4 22.0 NaN
5 0.0 wood
I know that the key of metal is 24 and the key of wood is 25
The product name that belongs to key 21 is plastic and the product name that belongs to key 22 is paper
There are hundreds of rows in the same situation, so renaming each and every one of them by hand would take a lot of time.
I created a dictionary and then used the .map() method, but I'm still unable to 'merge' (or 'mix') the missing values in both columns without removing the other column's value.
Thank you
You can create an extra DataFrame and do the merge two times:
import pandas as pd

lst = [
['metal', 24],
['wood', 25],
['plastic', 21],
['paper', 22]
]
df2 = pd.DataFrame(lst, columns=['name', 'key'])
df1['product'].update(df1.merge(df2, left_on='prod_key', right_on='key', how='left')['name'])
df1['prod_key'].update(df1.merge(df2, left_on='product', right_on='name', how='left')['key'])
print(df2)
name key
0 metal 24
1 wood 25
2 plastic 21
3 paper 22
print(df1)
prod_key product
0 21.0 plastic
1 21.0 plastic
2 24.0 metal
3 35.0 NaN
4 22.0 paper
5 25.0 wood
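If you'd rather stay with .map() as the question attempted, a sketch using fillna and mask also works. The lookup dicts here are built from the key/name pairs stated in the question, and a prod_key of 0 is treated as "key unknown":

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'prod_key': [21.0, 21.0, 0.0, 35.0, 22.0, 0.0],
                    'product': [np.nan, np.nan, 'metal', np.nan, np.nan, 'wood']})

name_by_key = {24: 'metal', 25: 'wood', 21: 'plastic', 22: 'paper'}
key_by_name = {name: key for key, name in name_by_key.items()}

# fill missing product names by looking up the key
df1['product'] = df1['product'].fillna(df1['prod_key'].map(name_by_key))
# replace 0 ("unknown") keys with the key looked up from the product name
df1['prod_key'] = df1['prod_key'].mask(df1['prod_key'].eq(0),
                                       df1['product'].map(key_by_name))
```

Keys without a mapping (like 35.0 here) are simply left with a NaN product, the same as the merge-based answer.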
I am new to pandas and want to analyse the following case. Say a fruit market publishes the prices of its fruits daily from 18:00 to 22:00, updating the prices every half hour within that time slot. Suppose the market gives the prices at 18:00 as follows:
Fruit Price
Apple 10
Banana 20
After half an hour at 18:30, the list has been updated as follows,
Fruit Price
Apple 10
Banana 21
Orange 30
Grapes 25
Pineapple 65
I want to check whether the fruit prices have changed between the recent list [18:30] and the earlier one [18:00].
The result I want is:
Fruit 18:00 18:30
Banana 20 21
To solve this I am thinking of the following:
1) Add a time column to both data frames.
2) Merge the tables into one.
3) Make a pivot table with the fruit name as the index and ['Time','Price'] as the columns.
I don't know how to intersect the two data frames grouped by time, i.e. how to get the common rows of the two DataFrames.
You don't need to pivot in this case; we can simply use merge with the suffixes argument to get the desired result:
df_update = pd.merge(df, df2, on='Fruit', how='outer', suffixes=['_1800h', '_1830h'])
Fruit Price_1800h Price_1830h
0 Apple 10.0 10.0
1 Banana 20.0 21.0
2 Orange NaN 30.0
3 Grapes NaN 25.0
4 Pineapple NaN 65.0
Edit
Why are we using the outer argument? We want to keep all the new data that was added in df2. If we use inner, for example, we will not get the newly added fruits, as below. Unless that is the desired output for the OP, which is not clear in this case.
df_update = pd.merge(df, df2, on='Fruit', how='inner', suffixes=['_1800h', '_1830h'])
Fruit Price_1800h Price_1830h
0 Apple 10 10.0
1 Banana 20 21.0
If Fruit is the index of both data frames, the following should work. The idea is to return the rows where the prices are unequal:
df = pd.DataFrame()
df['1800'] = df1['Price']
df['1830'] = df2['Price']
print(df.loc[df['1800'] != df['1830']])
You can also use datetime in your column heading.
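To reduce the merged frame to just the fruits whose price actually changed (the Banana 20 → 21 row the question asks for), one small follow-up on the inner merge, sketched end to end with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Banana'], 'Price': [10, 20]})
df2 = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
                    'Price': [10, 21, 30, 25, 65]})

merged = pd.merge(df, df2, on='Fruit', how='inner',
                  suffixes=['_1800h', '_1830h'])
# keep only fruits present in both lists whose price changed
changed = merged[merged['Price_1800h'] != merged['Price_1830h']]
```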
I want to add missing dates for a specific date range but keep all columns. I found many posts using asfreq(), resample(), and reindex(), but they seemed to be for Series and I couldn't get them to work for my DataFrame.
Given a sample dataframe:
data = [{'id' : '123', 'product' : 'apple', 'color' : 'red', 'qty' : 10, 'week' : '2019-3-7'}, {'id' : '123', 'product' : 'apple', 'color' : 'blue', 'qty' : 20, 'week' : '2019-3-21'}, {'id' : '123', 'product' : 'orange', 'color' : 'orange', 'qty' : 8, 'week' : '2019-3-21'}]
df = pd.DataFrame(data)
color id product qty week
0 red 123 apple 10 2019-3-7
1 blue 123 apple 20 2019-3-21
2 orange 123 orange 8 2019-3-21
My goal is to return the below, filling qty with 0 but keeping the values of the other columns. Of course, I have many other ids. I would like to be able to specify the start/end dates to fill; this example uses 3/7 to 3/21.
color id product qty week
0 red 123 apple 10 2019-3-7
1 blue 123 apple 20 2019-3-21
2 orange 123 orange 8 2019-3-21
3 red 123 apple 0 2019-3-14
4 red 123 apple 0 2019-3-21
5 blue 123 apple 0 2019-3-7
6 blue 123 apple 0 2019-3-14
7 orange 123 orange 0 2019-3-7
8 orange 123 orange 0 2019-3-14
How can I keep the remainder of my DataFrame intact?
In your case, you just need unstack and stack plus reindex:
df.week = pd.to_datetime(df.week)
s = pd.date_range(df.week.min(), df.week.max(), freq='7 D')
df = (df.set_index(['color', 'id', 'product', 'week'])
        .qty.unstack().reindex(columns=s, fill_value=0)
        .stack().reset_index())
df
color id product level_3 0
0 blue 123 apple 2019-03-14 0.0
1 blue 123 apple 2019-03-21 20.0
2 orange 123 orange 2019-03-14 0.0
3 orange 123 orange 2019-03-21 8.0
4 red 123 apple 2019-03-07 10.0
5 red 123 apple 2019-03-14 0.0
One option is to use the complete function from pyjanitor to expose the implicitly missing rows; afterwards you can fill with fillna:
# pip install pyjanitor
import pandas as pd
import janitor
df.week = pd.to_datetime(df.week)
# create new dates, which will be used to expand the dataframe
new_dates = {"week": pd.date_range(df.week.min(), df.week.max(), freq="7D")}
# use the complete function
# note how color, id and product are wrapped together
# this ensures only missing values based on data in the dataframe are exposed
# if you want all combinations, then get rid of the tuple
(df
.complete(("color", "id", "product"), new_dates, sort = False)
.fillna({'qty': 0}, downcast='infer')
)
id product color qty week
0 123 apple red 10 2019-03-07
1 123 apple blue 20 2019-03-21
2 123 orange orange 8 2019-03-21
3 123 apple red 0 2019-03-14
4 123 apple red 0 2019-03-21
5 123 apple blue 0 2019-03-07
6 123 apple blue 0 2019-03-14
7 123 orange orange 0 2019-03-07
8 123 orange orange 0 2019-03-14
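Without pyjanitor, the same "complete the grid" idea can be sketched in plain pandas: cross-join the observed (color, id, product) combinations with every week, then left-join the real rows back in (how='cross' needs pandas ≥ 1.2):

```python
import pandas as pd

data = [{'id': '123', 'product': 'apple', 'color': 'red', 'qty': 10, 'week': '2019-3-7'},
        {'id': '123', 'product': 'apple', 'color': 'blue', 'qty': 20, 'week': '2019-3-21'},
        {'id': '123', 'product': 'orange', 'color': 'orange', 'qty': 8, 'week': '2019-3-21'}]
df = pd.DataFrame(data)
df['week'] = pd.to_datetime(df['week'])

# every observed (color, id, product) combination crossed with every week in range
weeks = pd.DataFrame({'week': pd.date_range(df['week'].min(), df['week'].max(), freq='7D')})
grid = df[['color', 'id', 'product']].drop_duplicates().merge(weeks, how='cross')

# left-join the real rows back in; weeks with no sales get qty 0
out = grid.merge(df, on=['color', 'id', 'product', 'week'], how='left')
out['qty'] = out['qty'].fillna(0).astype(int)
```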
I have a dataframe which contains a list of tuples in one of its columns. I need to split the list of tuples into corresponding columns. My dataframe df looks like the below:
A B
[('Apple',50),('Orange',30),('banana',10)] Winter
[('Orange',69),('WaterMelon',50)] Summer
The expected output should be:
Fruit rate B
Apple 50 winter
Orange 30 winter
banana 10 winter
Orange 69 summer
WaterMelon 50 summer
You can use DataFrame constructor with numpy.repeat and numpy.concatenate:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.concatenate(df.A), columns=['Fruit','rate']).reset_index(drop=True)
df1['B'] = np.repeat(df.B.values, df['A'].str.len())
print (df1)
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
Another solution with chain.from_iterable:
from itertools import chain
df1 = (pd.DataFrame(list(chain.from_iterable(df.A)), columns=['Fruit','rate'])
         .reset_index(drop=True))
df1['B'] = np.repeat(df.B.values, df['A'].str.len())
print (df1)
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
This should work:
fruits = []
rates = []
seasons = []
def create_lists(row):
    tuples = row['A']
    season = row['B']
    for t in tuples:
        fruits.append(t[0])
        rates.append(t[1])
        seasons.append(season)

df.apply(create_lists, axis=1)
new_df = pd.DataFrame({"Fruit" :fruits, "Rate": rates, "B": seasons})[["Fruit", "Rate", "B"]]
output:
Fruit Rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
You can do this in a chained operation:
(
df.apply(lambda x: [[k,v,x.B] for k,v in x.A],axis=1)
.apply(pd.Series)
.stack()
.apply(pd.Series)
.reset_index(drop=True)
.rename(columns={0:'Fruit',1:'rate',2:'B'})
)
Out[1036]:
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
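On newer pandas the same reshape can be sketched with explode (ignore_index needs pandas ≥ 1.1), which avoids building the helper arrays by hand:

```python
import pandas as pd

df = pd.DataFrame({'A': [[('Apple', 50), ('Orange', 30), ('banana', 10)],
                         [('Orange', 69), ('WaterMelon', 50)]],
                   'B': ['Winter', 'Summer']})

# one row per tuple, then split each tuple into two columns
exploded = df.explode('A', ignore_index=True)
df1 = pd.DataFrame(exploded['A'].tolist(), columns=['Fruit', 'rate'])
df1['B'] = exploded['B']
```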