How Do I Merge These Two Datasets? - python

I have two datasets that I would like to merge using the index.
The 1st data set:
index        A   B   C
01/01/2010  15  20  30
15/01/2010  12  15  25
17/02/2010  14  13  35
19/02/2010  11  10  22
The 2nd data set:
index  year  month     price
0      2010  january      70
1      2010  february     80
I want them to be joined like:
index        A   B   C  price
01/01/2010  15  20  30     70
15/01/2010  12  15  25     70
17/02/2010  14  13  35     80
19/02/2010  11  10  22     80
The problem is how to use two columns (year and month of the 2nd dataset) to create a temporary datetime index.
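One way to build that temporary datetime is to concatenate the two columns into a string and let pandas parse it. A minimal sketch, assuming English month names as in the sample data (the 'date' column name is just illustrative):

import pandas as pd

df2 = pd.DataFrame({'year': [2010, 2010],
                    'month': ['january', 'february'],
                    'price': [70, 80]})

# %B matches a full month name such as 'January', so capitalize the
# lower-case names first
df2['date'] = pd.to_datetime(df2['year'].astype(str) + ' ' + df2['month'].str.capitalize(),
                             format='%Y %B')

The answers below take a related route: instead of building a datetime on df2, they derive the year and month from df1's dates at merge time.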

Try this: extract the year (.dt.year) and lower-cased month name (.dt.month_name()) from df1's index column and merge with df2. Note the dates are DD/MM/YYYY, so parse them with dayfirst=True:
>>> df1
        index   A   B   C
0  01/01/2010  15  20  30
1  15/01/2010  12  15  25
2  17/02/2010  14  13  35
3  19/02/2010  11  10  22
>>> df2
   index  year     month  price
0      0  2010   january     70
1      1  2010  february     80
# merging df1 and df2 by month and year
>>> df1.merge(df2,
              left_on=[pd.to_datetime(df1['index'], dayfirst=True).dt.year,
                       pd.to_datetime(df1['index'], dayfirst=True).dt.month_name().str.lower()],
              right_on=['year', 'month'])
Output:
      index_x   A   B   C  index_y  year     month  price
0  01/01/2010  15  20  30        0  2010   january     70
1  15/01/2010  12  15  25        0  2010   january     70
2  17/02/2010  14  13  35        1  2010  february     80
3  19/02/2010  11  10  22        1  2010  february     80
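If you do not need df2's helper columns in the result, you can drop them after the merge and restore the original column name; a small follow-up sketch (column names taken from the output above):

dates = pd.to_datetime(df1['index'], dayfirst=True)
merged = df1.merge(df2,
                   left_on=[dates.dt.year, dates.dt.month_name().str.lower()],
                   right_on=['year', 'month'])
# keep only df1's columns plus price
merged = (merged.drop(columns=['index_y', 'year', 'month'])
                .rename(columns={'index_x': 'index'}))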

Here is the stupid answer! I'm sure you can do something smarter than that :) But it works, assuming your tables are lists of dictionaries (you can easily convert your SQL tables to this format). I'm aware this is not a clean solution, but you asked for an easy one, and this is probably the easiest to understand :)
months = {'january': "01",
          'february': "02",
          'march': "03",
          'april': "04",
          'may': "05",
          'june': "06",
          'july': "07",
          'august': "08",
          'september': "09",
          'october': "10",
          'november': "11",
          'december': "12"}

table1 = [{'index': '01/01/2010', 'A': 15, 'B': 20, 'C': 30},
          {'index': '15/01/2010', 'A': 12, 'B': 15, 'C': 25},
          {'index': '17/02/2010', 'A': 14, 'B': 13, 'C': 35},
          {'index': '19/02/2010', 'A': 11, 'B': 10, 'C': 22}]
table2 = [{'index': 0, 'year': 2010, 'month': 'january', 'price': 70},
          {'index': 1, 'year': 2010, 'month': 'february', 'price': 80}]

def joiner(table1, table2):
    # tag each table2 row with an "MM/YYYY" key built from its month and year
    for row in table2:
        row['tempDate'] = "{0}/{1}".format(months[row['month']], str(row['year']))
    # for table1, the "MM/YYYY" key is the date string without the leading "DD/"
    for row in table1:
        row['tempDate'] = row['index'][3:]
    table3 = []
    for row1 in table1:
        row3 = row1.copy()
        for row2 in table2:
            if row2['tempDate'] == row1['tempDate']:
                row3['price'] = row2['price']
                break
        table3.append(row3)
    return table3

table3 = joiner(table1, table2)
print(table3)
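For reference, after the join each row of table3 carries the original fields plus the temporary key and the matched price; the first element printed is:

{'index': '01/01/2010', 'A': 15, 'B': 20, 'C': 30, 'tempDate': '01/2010', 'price': 70}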

Related

How to add an empty row for each "blank" year with no data?

I have a df that has country-year data from 2000-2020, with various columns containing the sum total of given events in each country-year unit. In some countries the event only happened in some of the years, so there are no rows for the remaining years; for those missing years I would like a row with "0" in all the value columns.
country  iyear  nwound  Med  claimed
Nigeria   2000       2    5        7
Nigeria   2001       3   15        9
Nigeria   2005       4    6       14
Nigeria   2017       9   41       20
Benin     2004       2    5        7
Benin     2008       3   15        9
Benin     2010       4    6       14
Benin     2019       9   41       20
In short, I'm looking for a way to add rows for all the years 2000-2020 for Nigeria and Benin (and all the other countries not listed) that are missing, with each value in the row (for nwound, Med and claimed) being 0. Keep in mind, this data set has 18 countries in it, so I would want the code to be reproducible.
Use the reindex method from pandas:
import pandas as pd

df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Nigeria', 'Benin', 'Benin', 'Benin', 'Benin'],
                   'iyear': [2000, 2001, 2005, 2017, 2004, 2008, 2010, 2019],
                   'nwound': [2, 3, 4, 9, 2, 3, 4, 9],
                   'Med': [5, 15, 6, 41, 5, 15, 6, 41],
                   'claimed': [7, 9, 14, 20, 7, 9, 14, 20]})

df = df.set_index(['country', 'iyear'])
# build the full (country, year) grid for 2000-2020
countries = df.index.levels[0].tolist()
index = pd.MultiIndex.from_product([countries, range(2000, 2021)], names=['country', 'iyear'])
# reindex onto the full grid; missing rows get 0 in every column
df = df.reindex(index, fill_value=0)
df = df.reset_index()
print(df)
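A quick spot-check of the fill (the values, not the row order, are what matter here): Nigeria has no data for 2002-2004, so those rows should now exist and be all zeros. A hypothetical check, not part of the original answer:

# prints three Nigeria rows with nwound, Med and claimed all equal to 0
print(df[(df['country'] == 'Nigeria') & (df['iyear'].between(2002, 2004))])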

Sort values intra group [duplicate]

This question already has an answer here:
Pandas groupby sort each group values and order dataframe groups based on max of each group (1 answer)
Closed 1 year ago.
Suppose I have this dataframe:
df = pd.DataFrame({
    'price': [2, 13, 24, 15, 11, 44],
    'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})

   price category
0      2   shirts
1     13    pants
2     24   shirts
3     15     tops
4     11      hat
5     44     tops
I want to sort values in such a way that:
1. Find the highest price of each category.
2. Sort the categories according to their highest price (in this case, in descending order: tops, shirts, pants, hat).
3. Sort the rows within each category by price, also in descending order.
The final dataframe would look like:
   price category
0     44     tops
1     15     tops
2     24   shirts
3      2   shirts
4     13    pants
5     11      hat
I'm not a big fan of one-liners, so here's my solution:
# Add a column with the max price of each category
df = df.merge(df.groupby('category')['price'].max().rename('max_cat_price'),
              left_on='category', right_index=True)

# Sort by the per-category max first, then by price within each category
df = df.sort_values(['max_cat_price', 'price'], ascending=False)

# Drop the column that has the max price for each category
df.drop('max_cat_price', axis=1, inplace=True)
print(df)
   price category
5     44     tops
3     15     tops
2     24   shirts
0      2   shirts
1     13    pants
4     11      hat
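As a follow-up note, the helper column can also be produced with groupby.transform, which broadcasts each category's max back onto its rows; a minimal sketch of the same idea:

# same idea with transform instead of merge
df['max_cat_price'] = df.groupby('category')['price'].transform('max')
out = df.sort_values(['max_cat_price', 'price'], ascending=False).drop(columns='max_cat_price')
print(out)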
You can use .groupby and .sort_values:
df.join(df.groupby("category").agg("max"), on="category", rsuffix="_r").sort_values(
    ["price_r", "price"], ascending=False
)
Output:
   price category  price_r
5     44     tops       44
3     15     tops       44
2     24   shirts       24
0      2   shirts       24
1     13    pants       13
4     11      hat       11
I used get_group inside a DataFrame apply to get the max price for each category:
df = pd.DataFrame({
    'price': [2, 13, 24, 15, 11, 44],
    'category': ["shirts", "pants", "shirts", "tops", "hat", "tops"],
})

grouped = df.groupby('category')
df['price_r'] = df['category'].apply(lambda cat: grouped.get_group(cat).price.max())
df = df.sort_values(['price_r', 'price'], ascending=False)
print(df)
Output:
   price category  price_r
5     44     tops       44
3     15     tops       44
2     24   shirts       24
0      2   shirts       24
1     13    pants       13
4     11      hat       11

Removing duplicates python dataframe

I have a dataframe with some duplicates that I need to remove. In the dataframe below, where the month, year and type are all the same, it should keep the row with the highest sale. E.g.:
df = pd.DataFrame({'month': [1, 1, 7, 10],
                   'year': [2012, 2012, 2013, 2014],
                   'type': ['C', 'C', 'S', 'C'],
                   'sale': [55, 40, 84, 31]})
After removing duplicates and keeping the highest value of column 'sale', it should look like:
df_2 = pd.DataFrame({'month': [1, 7, 10],
                     'year': [2012, 2013, 2014],
                     'type': ['C', 'S', 'C'],
                     'sale': [55, 84, 31]})
You can use:
(df.sort_values('sale', ascending=False)
   .drop_duplicates(['month', 'year', 'type'])
   .sort_index())

   month  year type  sale
0      1  2012    C    55
2      7  2013    S    84
3     10  2014    C    31
You could groupby and take the max of sale:
df.groupby(['month', 'year', 'type']).max().reset_index()

   month  year type  sale
0      1  2012    C    55
1      7  2013    S    84
2     10  2014    C    31
If you have another column, say other, then you must specify which column to take the max of, like this:
df.groupby(['month', 'year', 'type'])[['sale']].max().reset_index()

   month  year type  sale
0      1  2012    C    55
1      7  2013    S    84
2     10  2014    C    31
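If the other columns should come from the winning row rather than being aggregated independently, a third option (a sketch, not from the original answers) is idxmax, which keeps the whole row holding each group's maximum sale:

# select the complete row with the highest sale per (month, year, type)
df.loc[df.groupby(['month', 'year', 'type'])['sale'].idxmax()]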

Pandas - Create column with difference in values

I have the below dataset. How can I create a new column that shows the difference in money for each person, for each expiry?
The column in yellow is what I want. You can see that it is the difference in money at each expiry point for the person. I highlighted the other rows in colors so it is clearer.
Thanks a lot.
Example
import pandas as pd

example = pd.DataFrame(data={'Day': ['2020-08-30', '2020-08-30', '2020-08-30', '2020-08-30',
                                     '2020-08-29', '2020-08-29', '2020-08-29', '2020-08-29'],
                             'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
                             'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
                             'Expiry': ['1Y', '1Y', '2Y', '2Y', '1Y', '1Y', '2Y', '2Y']})

# split the two days and join them on a Name+Expiry key
example_0830 = example[example['Day'] == '2020-08-30'].reset_index()
example_0829 = example[example['Day'] == '2020-08-29'].reset_index()
example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
example_0829 = pd.DataFrame(example_0829, columns=['key', 'Money'])
example_0830 = pd.merge(example_0830, example_0829, on='key')

# difference between the 08-30 and 08-29 money for the same Name/Expiry
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y', 'index'])
Result:
          Day  Name  Money_x Expiry  Difference
0  2020-08-30  John      100     1Y          50
1  2020-08-30  Mike      950     1Y         900
2  2020-08-30  John      200     2Y         -50
3  2020-08-30  Mike     1000     2Y        -200
If the difference should always be taken against the previous date, you can define date variables at the beginning to find today (t) and the previous day (t-1), and use those to filter the original dataframe instead of hard-coding the dates.
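A minimal sketch of that suggestion (the variable names are illustrative, and it assumes the Day strings sort chronologically):

# derive t and t-1 from the data instead of hard-coding '2020-08-30'/'2020-08-29'
days = sorted(example['Day'].unique(), reverse=True)
t, t_minus_1 = days[0], days[1]
example_t = example[example['Day'] == t].reset_index()
example_t_minus_1 = example[example['Day'] == t_minus_1].reset_index()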
You can solve it with groupby.diff
Take the dataframe:
df = pd.DataFrame({
    'Day': [30, 30, 30, 30, 29, 29, 28, 28],
    'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
    'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
    'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)
Which looks like:
   Day  Name  Money  Expiry
0   30  John    100       1
1   30  Mike    950       1
2   30  John    200       2
3   30  Mike   1000       2
4   29  John     50       1
5   29  Mike     50       1
6   28  John    250       2
7   28  Mike   1200       2
And the code:
# make sure the days are in the order we want (newest first)
df = df.sort_values('Day', ascending=False)
# group by Name and Expiry, then take the difference from the next row in each group;
# diff(1) computes the difference from the previous row, so diff(-1) uses the next one
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)
Output:
   Day  Name  Money  Expiry  Difference
0   30  John    100       1        50.0
1   30  Mike    950       1       900.0
2   30  John    200       2       -50.0
3   30  Mike   1000       2      -200.0
4   29  John     50       1         NaN
5   29  Mike     50       1         NaN
6   28  John    250       2         NaN
7   28  Mike   1200       2         NaN

How to groupby in Pandas and keep all columns [duplicate]

This question already has answers here:
Pandas DataFrame Groupby two columns and get counts (8 answers)
Closed 3 years ago.
I have a data frame like this:
   year       drug_name  avg_number_of_ingredients
0  2019     NEXIUM I.V.                          8
1  2016         ZOLADEX                         10
2  2017        PRILOSEC                         59
3  2017  BYDUREON BCise                         24
4  2019        Lynparza                         28
And I need to group drug names and mean number of ingredients by year like this:
   year     drug_name  avg_number_of_ingredients
0  2019  drug a,b,c..      mean value for column
1  2018  drug a,b,c..      mean value for column
2  2017  drug a,b,c..      mean value for column
If I do df.groupby('year'), I lose drug names. How can I do it?
Let me show you the solution on a simple example. First, I build the same data frame as yours:
>>> df = pd.DataFrame(
...     [
...         {'year': 2019, 'drug_name': 'NEXIUM I.V.', 'avg_number_of_ingredients': 8},
...         {'year': 2016, 'drug_name': 'ZOLADEX', 'avg_number_of_ingredients': 10},
...         {'year': 2017, 'drug_name': 'PRILOSEC', 'avg_number_of_ingredients': 59},
...         {'year': 2017, 'drug_name': 'BYDUREON BCise', 'avg_number_of_ingredients': 24},
...         {'year': 2019, 'drug_name': 'Lynparza', 'avg_number_of_ingredients': 28},
...     ]
... )
>>> print(df)
   year       drug_name  avg_number_of_ingredients
0  2019     NEXIUM I.V.                          8
1  2016         ZOLADEX                         10
2  2017        PRILOSEC                         59
3  2017  BYDUREON BCise                         24
4  2019        Lynparza                         28
Now I make df_grouped, which still keeps the drug-name information:
>>> df_grouped = df.groupby('year', as_index=False).agg(
...     {'drug_name': ', '.join, 'avg_number_of_ingredients': 'mean'})
>>> print(df_grouped)
   year                 drug_name  avg_number_of_ingredients
0  2016                   ZOLADEX                       10.0
1  2017  PRILOSEC, BYDUREON BCise                       41.5
2  2019     NEXIUM I.V., Lynparza                       18.0
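The same aggregation can also be written with named aggregation (available since pandas 0.25), which keeps the output column names explicit; a minimal equivalent sketch:

df_grouped = df.groupby('year', as_index=False).agg(
    drug_name=('drug_name', ', '.join),
    avg_number_of_ingredients=('avg_number_of_ingredients', 'mean'),
)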
