I have a dataframe with some duplicates that I need to remove. In the dataframe below, wherever month, year and type are all the same, only the row with the highest sale should be kept. E.g.:
import pandas as pd

df = pd.DataFrame({'month': [1, 1, 7, 10],
'year': [2012, 2012, 2013, 2014],
'type':['C','C','S','C'],
'sale': [55, 40, 84, 31]})
After removing duplicates and keeping the highest value of column 'sale', the result should look like:
df_2 = pd.DataFrame({'month': [1, 7, 10],
'year': [2012, 2013, 2014],
'type':['C','S','C'],
'sale': [55, 84, 31]})
You can use:
(df.sort_values('sale', ascending=False)
   .drop_duplicates(['month', 'year', 'type'])
   .sort_index())
month year type sale
0 1 2012 C 55
2 7 2013 S 84
3 10 2014 C 31
You could groupby and take the max of sale:
df.groupby(['month', 'year', 'type']).max().reset_index()
month year type sale
0 1 2012 C 55
1 7 2013 S 84
2 10 2014 C 31
If you have another column, like other, then you must specify which column to take the max of, like this:
df.groupby(['month', 'year', 'type'])[['sale']].max().reset_index()
month year type sale
0 1 2012 C 55
1 7 2013 S 84
2 10 2014 C 31
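If the frame has more columns and you want to keep the entire row of the maximum sale (rather than the column-wise max), one option is idxmax; a minimal sketch, assuming the index is unique:

# Index labels of the row with the highest sale within each group,
# then select those full rows and restore the original order.
idx = df.groupby(['month', 'year', 'type'])['sale'].idxmax()
df_2 = df.loc[idx].sort_index()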
I was wondering if someone could help me find a more efficient way to run my code.
I have a dataset that contains 7 columns: country, sector, year, month, week, weekday, value.
The year column has only 3 values: 2019, 2020 and 2021.
What I have to do here is subtract every value in 2020 and 2021 from the corresponding value in 2019.
But it's a bit more complicated, because I also need to match the weekday column.
For example, I need to take the year 2020, month 1, week 1, weekday 0 (Monday) value and subtract from it
the year 2019, month 1, week 1, weekday 0 (Monday) value; if no match can be found, it is skipped. In other words, the weekday (Monday, Tuesday, ...) must be matched.
And here is my code. It runs, but it takes hours :(
import itertools

# co2, country, sector, year, month, week and weekday are empty lists
# that collect the results.
for i in itertools.product(year_list, country_list, sector_list, month_list, week_list, weekday_list):
    try:
        data_2 = df_carbon[(df_carbon['country'] == i[1])
                           & (df_carbon['sector'] == i[2])
                           & (df_carbon['year'] == i[0])
                           & (df_carbon['month'] == i[3])
                           & (df_carbon['week'] == i[4])
                           & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        data_1 = df_carbon[(df_carbon['country'] == i[1])
                           & (df_carbon['sector'] == i[2])
                           & (df_carbon['year'] == 2019)
                           & (df_carbon['month'] == i[3])
                           & (df_carbon['week'] == i[4])
                           & (df_carbon['weekday'] == i[5])]['co2'].tolist()[0]
        co2.append(data_2 - data_1)
        country.append(i[1])
        sector.append(i[2])
        year.append(i[0])
        month.append(i[3])
        week.append(i[4])
        weekday.append(i[5])
    except IndexError:  # no matching row found, skip this combination
        pass
I changed the for loops to itertools, but it's still not fast enough. Any other ideas?
Many thanks :)
##############################
Here is the sample dataset:
country co2 sector date week weekday year month
Brazil 108.767782 Power 2019-01-01 0 1 2019 1
China 14251.044482 Power 2019-01-01 0 1 2019 1
EU27 & UK 1886.493814 Power 2019-01-01 0 1 2019 1
France 53.856398 Power 2019-01-01 0 1 2019 1
Germany 378.323440 Power 2019-01-01 0 1 2019 1
Japan 21.898788 IA 2021-11-30 48 1 2021 11
Russia 19.773822 IA 2021-11-30 48 1 2021 11
Spain 42.293944 IA 2021-11-30 48 1 2021 11
UK 56.425121 IA 2021-11-30 48 1 2021 11
US 166.425000 IA 2021-11-30 48 1 2021 11
Or this:
import pandas as pd
df_carbon = pd.DataFrame({
'year': [2019, 2020, 2021],
'co2': [1,2,3],
'country': ['Brazil', 'Brazil', 'Brazil'],
'sector': ['power', 'power', 'power'],
'month': [1, 1, 1],
'week': [0,0,0],
'weekday': [0,0,0]
})
pandas can subtract two DataFrames index by index, so the idea is to split your data into a minuend and a subtrahend, set ['country', 'sector', 'month', 'week', 'weekday'] as their index, subtract them, and remove the rows (with dropna) where no match in year 2019 is found.
df_carbon = pd.DataFrame({
'year': [2019, 2020, 2021],
'co2': [1,2,3],
'country': ['ab', 'ab', 'bc']
})
index = ['country']
# index = ['country', 'sector', 'month', 'week', 'weekday']
df_2019 = df_carbon[df_carbon['year']==2019].set_index(index)
df_rest = df_carbon[df_carbon['year']!=2019].set_index(index)
ans = (df_rest - df_2019).reset_index().dropna()
ans['year'] += 2019
Two additional points:
In this subtraction the year column is also subtracted (giving 1 and 2), so I need to add 2019 back.
I created a small example of df_carbon to test my code. If you had provided a more realistic version in text form, I would have tested my code against your data.
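For completeness, a minimal sketch of the same approach with the full five-column key, using the second sample from the question ('co2' stands in for 'value'):

import pandas as pd

df_carbon = pd.DataFrame({
    'year': [2019, 2020, 2021],
    'co2': [1, 2, 3],
    'country': ['Brazil', 'Brazil', 'Brazil'],
    'sector': ['power', 'power', 'power'],
    'month': [1, 1, 1],
    'week': [0, 0, 0],
    'weekday': [0, 0, 0]
})

index = ['country', 'sector', 'month', 'week', 'weekday']
df_2019 = df_carbon[df_carbon['year'] == 2019].set_index(index)
df_rest = df_carbon[df_carbon['year'] != 2019].set_index(index)

# Rows align on the five-part key; combinations without a 2019
# counterpart become NaN and are removed by dropna.
ans = (df_rest - df_2019).reset_index().dropna()
ans['year'] += 2019  # the year column was subtracted too (1 and 2), so restore it
print(ans)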
I am trying to clean up some data, out of which I need to keep only the most recent rows, but all of them if there is more than one in that year. What confuses me is that the data are actually organised in "groups". I have an example dataframe below, along with comments that might make it clearer:
method year proteins values
0 John 2017 A 10
1 John 2017 B 20
2 John 2018 A 30 # John's method in 2018 is most recent, keep this line and drop indices 0 and 1
3 Kate 2018 B 11
4 Kate 2018 C 22 # Kate's method appears only in 2018 so keep both lines (index 3 and 4)
5 Patrick 2017 A 90
6 Patrick 2018 A 80
7 Patrick 2018 B 85
8 Patrick 2018 C 70
9 Patrick 2019 A 60
10 Patrick 2019 C 50 # Patrick's method in 2019 is the most recent of Patrick's so keep index 9 and 10 only
So the desired output dataframe does not depend on which proteins are measured, but all proteins measured in the most recent year should be included:
method year proteins values
0 John 2018 A 30
1 Kate 2018 B 11
2 Kate 2018 C 22
3 Patrick 2019 A 60
4 Patrick 2019 C 50
Hope this is clear. I have tried something like my_df.sort_values('year').drop_duplicates('method', keep='last'), but it gives the wrong output. Any ideas? Thank you!
PS: To replicate my initial df, you can copy the below lines:
import pandas as pd
import numpy as np
methodology=["John", "John", "John", "Kate", "Kate", "Patrick", "Patrick", "Patrick", "Patrick", "Patrick", "Patrick"]
year_pract=[2017, 2017, 2018, 2018, 2018, 2017, 2018, 2018, 2018, 2019, 2019]
proteins=['A', 'B', 'A', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'C']
values=[10, 20, 30, 11, 22, 90, 80, 85, 70, 60, 50]
my_df=pd.DataFrame(zip(methodology,year_pract,proteins,values), columns=['method','year','proteins','values'])
my_df['year']=my_df['year'].astype(str)
my_df['year']=pd.to_datetime(my_df['year'], format='%Y') # the format never works for me and this is why I add the line below
my_df['year']=my_df['year'].dt.year
Because there are duplicate years per method, use GroupBy.transform with 'max' to get the maximum year per method, compare it with the original year column using Series.eq, and filter with boolean indexing:
df = my_df[my_df['year'].eq(my_df.groupby('method')['year'].transform('max'))]
print(df)
method year proteins values
2 John 2018 A 30
3 Kate 2018 B 11
4 Kate 2018 C 22
9 Patrick 2019 A 60
10 Patrick 2019 C 50
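An equivalent way to get the same rows, if you prefer an explicit join: compute the maximum year per method and inner-merge it back. A small sketch using the my_df built above:

# Max year per method; the inner merge then keeps only rows from that year.
max_years = my_df.groupby('method', as_index=False)['year'].max()
out = my_df.merge(max_years, on=['method', 'year'])
print(out)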
I need to limit a dataset so that it returns only the rows that contain a specific string; however, that string can exist in many (8) of the columns.
How can I do this? I've seen str.isin methods, but they return a single series for a single row. How can I keep only the rows that contain the string in ANY of the columns?
Example code
If I had the dataframe df generated by
import pandas as pd
data = {'year': [2011, 2012, 2013, 2014, 2014, 2011, 2012, 2015],
'year2': [2012, 2016, 2015, 2015, 2012, 2013, 2019, 2016],
'reports': [52, 20, 43, 33, 41, 11, 43, 72]}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df
year year2 reports
a 2011 2012 52
b 2012 2016 20
c 2013 2015 43
d 2014 2015 33
e 2014 2012 41
f 2011 2013 11
g 2012 2019 43
h 2015 2016 72
I want the code to remove all rows that do not contain the value 2012. Note that in my actual dataset it is a string, not an int (it is people's names).
So in the above code it would remove rows c, d, f, and h.
You can use df.eq with df.any on axis=1:
df[df.eq('2012').any(axis=1)]  # for year as string
Or:
df[df.eq(2012).any(axis=1)]  # for year as int
year year2 reports
a 2011 2012 52
b 2012 2016 20
e 2014 2012 41
g 2012 2019 43
Try simple code like this:
import pandas as pd
data = {'year': [2011, 2012, 2013, 2014, 2014, 2011, 2012, 2015],
'year2': [2012, 2016, 2015, 2015, 2012, 2013, 2019, 2016],
'reports': [52, 20, 43, 33, 41, 11, 43, 72]}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df = df.drop(['c', 'd', 'f', 'h'])
df
It will give you a dataframe like this:
year year2 reports
a 2011 2012 52
b 2012 2016 20
e 2014 2012 41
g 2012 2019 43
To find the dataframe made of the rows that contain the value in any column:
df[(df == '2012').any(axis=1)]
To find the dataframe made of the rows that do not contain the value in any column:
df[~(df == '2012').any(axis=1)]
or
df[(df != '2012').all(axis=1)]
See the related https://stackoverflow.com/a/35682788/12411517.
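If the value should only be searched in a subset of the columns (the question mentions 8 of them), you can restrict the check to those columns. A minimal sketch against the df built above, where cols is a hypothetical list of the relevant column names:

# cols is hypothetical: replace with the columns that can contain the value.
cols = ['year', 'year2']
mask = df[cols].eq(2012).any(axis=1)  # True where any listed column matches
df_kept = df[mask]                    # keep only the matching rows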
I have a data frame like this:
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
And I need to group drug names and mean number of ingredients by year like this:
year drug_name avg_number_of_ingredients
0 2019 drug a,b,c.. mean value for column
1 2018 drug a,b,c.. mean value for column
2 2017 drug a,b,c.. mean value for column
If I do df.groupby('year'), I lose drug names. How can I do it?
Let me show you the solution on a simple example. First, I make the same data frame as yours:
>>> df = pd.DataFrame(
[
{'year': 2019, 'drug_name': 'NEXIUM I.V.', 'avg_number_of_ingredients': 8},
{'year': 2016, 'drug_name': 'ZOLADEX', 'avg_number_of_ingredients': 10},
{'year': 2017, 'drug_name': 'PRILOSEC', 'avg_number_of_ingredients': 59},
{'year': 2017, 'drug_name': 'BYDUREON BCise', 'avg_number_of_ingredients': 24},
{'year': 2019, 'drug_name': 'Lynparza', 'avg_number_of_ingredients': 28},
]
)
>>> print(df)
year drug_name avg_number_of_ingredients
0 2019 NEXIUM I.V. 8
1 2016 ZOLADEX 10
2 2017 PRILOSEC 59
3 2017 BYDUREON BCise 24
4 2019 Lynparza 28
Now I make df_grouped, which still keeps the information about the drug names.
>>> df_grouped = df.groupby('year', as_index=False).agg({'drug_name': ', '.join, 'avg_number_of_ingredients': 'mean'})
>>> print(df_grouped)
year drug_name avg_number_of_ingredients
0 2016 ZOLADEX 10.0
1 2017 PRILOSEC, BYDUREON BCise 41.5
2 2019 NEXIUM I.V., Lynparza 18.0
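The same result can be written with named aggregation (available since pandas 0.25), which some find more readable; this is a stylistic alternative, not a different method:

df_grouped = df.groupby('year', as_index=False).agg(
    drug_name=('drug_name', ', '.join),                               # concatenate names per year
    avg_number_of_ingredients=('avg_number_of_ingredients', 'mean'),  # mean per year
)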
I have two datasets. I would like to merge them using the index.
The 1st data set:
index A B C
01/01/2010 15 20 30
15/01/2010 12 15 25
17/02/2010 14 13 35
19/02/2010 11 10 22
The 2nd data set:
index year month price
0 2010 january 70
1 2010 february 80
I want them to be joined like:
index A B C price
01/01/2010 15 20 30 70
15/01/2010 12 15 25 70
17/02/2010 14 13 35 80
19/02/2010 11 10 22 80
The problem is how to use two columns (year and month of the 2nd dataset) to create a temporary datetime index.
Try this: extract the month name (.dt.month_name()) and the year (.dt.year) from df1's date column, and merge it with df2.
>>> df1
index A B C
0 01/01/2010 15 20 30
1 15/01/2010 12 15 25
2 17/02/2010 14 13 35
3 19/02/2010 11 10 22
>>> df2
index year month price
0 0 2010 january 70
1 1 2010 february 80
# merging df1 and df2 by month and year.
>>> df1.merge(df2,
left_on = [pd.to_datetime(df1['index']).dt.year,
pd.to_datetime(df1['index']).dt.month_name().str.lower()],
right_on = ['year', 'month'])
Output:
index_x A B C index_y year month price
0 01/01/2010 15 20 30 0 2010 january 70
1 15/01/2010 12 15 25 0 2010 january 70
2 17/02/2010 14 13 35 1 2010 february 80
3 19/02/2010 11 10 22 1 2010 february 80
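To get the output shaped like the desired table (just the date, A, B, C and price), a small follow-up sketch; it assumes the merge result above was assigned to a variable named merged:

# merged = df1.merge(df2, ...) as shown above
out = (merged
       .drop(columns=['index_y', 'year', 'month'])  # drop the helper key columns
       .rename(columns={'index_x': 'index'}))       # restore the original column name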
Here is the stupid answer! I'm sure you can do smarter than that :) But it works, assuming your tables are lists of dictionaries (you can easily convert your SQL tables to this format). I'm aware this is not a clean solution, but you asked for an easy one, and this is probably the easiest to understand :)
months = {'january': "01",
'february': "02",
'march': "03",
'april':"04",
'may': "05",
'june': "06",
'july': "07",
'august': "08",
'september': "09",
'october': "10",
'november': "11",
'december': "12"}
table1 = [{'index': '01/01/2010', 'A': 15, 'B': 20, 'C': 30},
{'index': '15/01/2010', 'A': 12, 'B': 15, 'C': 25},
{'index': '17/02/2010', 'A': 14, 'B': 13, 'C': 35},
{'index': '19/02/2010', 'A': 11, 'B': 10, 'C': 22}]
table2 = [{'index': 0, 'year': 2010, 'month': 'january', 'price':70},
{'index': 1, 'year': 2010, 'month': 'february', 'price':80}]
def joiner(table1, table2):
    # Build a month/year key for each row of table2.
    for row in table2:
        row['tempDate'] = "{0}/{1}".format(months[row['month']], str(row['year']))
    # Dates in table1 are dd/mm/yyyy, so mm/yyyy is everything after the first 3 characters.
    for row in table1:
        row['tempDate'] = row['index'][3:]
    table3 = []
    for row1 in table1:
        row3 = row1.copy()
        for row2 in table2:
            if row2['tempDate'] == row1['tempDate']:
                row3['price'] = row2['price']
                break
        table3.append(row3)
    return table3
table3 = joiner(table1, table2)
print(table3)