I have a simple dataframe that I would like to split into two dataframes based on some conditions.
Car    Year  Speed  Cond
BMW    2001  150    X
BMW    2000  150
Audi   1997  200
Audi   2000  200
Audi   2012  200    X
Fiat   2020  180
Mazda  2022  183
What I need to do is move the duplicates to another dataframe and leave only one row per car in my main dataframe. Rows that are duplicates in the Car column should go into a separate dataframe, but I don't need the ones that have X in the Cond column. In the main dataframe I would like to keep one row per car, and it should be the one that contains X in the Cond column (if there is one).
I have this code:
import pandas as pd
import numpy as np
cars = {'Car': {0: 'BMW', 1: 'BMW', 2: 'Audi', 3: 'Audi', 4: 'Audi', 5: 'Fiat', 6: 'Mazda'},
'Year': {0: 2001, 1: 2000, 2: 1997, 3: 2000, 4: 2012, 5: 2020, 6: 2022},
'Speed': {0: 150, 1: 150, 2: 200, 3: 200, 4: 200, 5: 180, 6: 183},
'Cond': {0: 'X', 1: np.nan, 2: 'X', 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan}}
df = pd.DataFrame.from_dict(cars)
df_duplicates = df.loc[df.duplicated(subset=['Car'], keep = False)].loc[df['Cond']!='X']
I don't know how to keep only one row per car in the main dataframe, preferably the one that contains X in the Cond column. Maybe there is a single command that can both drop these rows and select them into another dataframe according to the rules above?
If I understand the desired logic correctly, you can use groupby.idxmax to select, per group, the first row with an X if there is one (else the first row of the group) to keep in the main DataFrame. The rest goes into the other DataFrame (df2).
# get, per group, the index of the row with X if any, else of the first row
keep = df['Cond'].eq('X').groupby(df['Car']).idxmax()
# everything except the selected rows goes into df2
df2 = df.drop(keep)
# keep only the selected rows in the main DataFrame
df = df.loc[keep]
Output:
# updated df
Car Year Speed Cond
0 BMW 2001 150 X
2 Audi 1997 200 X
5 Fiat 2020 180 NaN
6 Mazda 2022 183 NaN
# df2
Car Year Speed Cond
1 BMW 2000 150 NaN
3 Audi 2000 200 NaN
4 Audi 2012 200 NaN
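For reference, this works because idxmax on a boolean Series returns the index of the first True value, and the index of the group's first element when there is no True at all. A minimal sketch of the intermediate step, using the df defined above:
# boolean mask: True where Cond is 'X'
mask = df['Cond'].eq('X')
# per Car: index of the first True, or of the group's first row if none
print(mask.groupby(df['Car']).idxmax())
# Car
# Audi     2
# BMW      0
# Fiat     5
# Mazda    6
# Name: Cond, dtype: int64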
I am a newbie in pandas, so please bear with me.
I have this dataframe:
Name,Year,Engine,Price
Car1,2001,100 CC,1000
Car2,2002,150 CC,2000
Car1,2001,100 CC,nan
Car1,2001,100 CC,100
I can't figure out how to change the nan or null value of "Car1" + Year + "100 CC" from nan to 1000. I need to look up the value of "Price" for each "Name + Year + Engine" combination and use it to replace the nulls. There are a number of rows in the csv file which have a null "Price" for a given "Name + Engine", while other rows with the same "Name + Year + Engine" have a "Price" associated with them.
Thanks for the help.
With the update of your question (an extra row with Price == 100, where Name == Car1 and Engine == 100 CC), the logic behind choosing 1000.0 to fill the NaN values in this group has become ambiguous. Let's add yet another row:
import pandas as pd
import numpy as np
data = {'Name': {0: 'Car1', 1: 'Car2', 2: 'Car1', 3: 'Car1', 4: 'Car1'},
'Year': {0: 2001, 1: 2002, 2: 2001, 3: 2001, 4: 2001},
'Engine': {0: '100 CC', 1: '150 CC', 2: '100 CC', 3: '100 CC', 4: '100 CC'},
'Price': {0: 1000.0, 1: 2000.0, 2: np.nan, 3: 100.0, 4: np.nan}}
df = pd.DataFrame(data)
print(df)
Name Year Engine Price
0 Car1 2001 100 CC 1000.0
1 Car2 2002 150 CC 2000.0
2 Car1 2001 100 CC NaN
3 Car1 2001 100 CC 100.0
4 Car1 2001 100 CC NaN
In this case, what should happen with the second associated NaN value? If you want to fill all NaNs with the first value, you could limit the assignment to the rows that contain NaNs by combining df.loc with pd.Series.isna(). This way you'll only overwrite the NaNs:
df.loc[df['Price'].isna(), 'Price'] = df.groupby(['Name', 'Engine'])\
    ['Price'].transform('first')
print(df)
Name Year Engine Price
0 Car1 2001 100 CC 1000.0
1 Car2 2002 150 CC 2000.0
2 Car1 2001 100 CC 1000.0
3 Car1 2001 100 CC 100.0
4 Car1 2001 100 CC 1000.0
But you can of course change the function (here: "first") passed to DataFrameGroupBy.transform. E.g. use "max" if you are choosing 1000.0 because it is the highest value; or, if you want the mode, you could do .transform(lambda x: x.mode().iloc[0]) (and get 100.0 in this case!); or use "mean" (550.0), "last" (100.0), etc.
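For instance, a minimal sketch of the mode variant (same df as above; note that .mode().iloc[0] assumes every group has at least one non-NaN price):
# fill NaNs with the per-group mode instead of the first value
df.loc[df['Price'].isna(), 'Price'] = (
    df.groupby(['Name', 'Engine'])['Price']
      .transform(lambda x: x.mode().iloc[0])
)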
More likely, you would want to use ffill, i.e. "forward fill", to propagate the last valid value forward: the first NaN gets filled with 1000.0 and the second with 100.0. If so, use:
df['Price'] = df.groupby(['Name','Engine'])['Price'].transform('ffill')
print(df)
Name Year Engine Price
0 Car1 2001 100 CC 1000.0
1 Car2 2002 150 CC 2000.0
2 Car1 2001 100 CC 1000.0
3 Car1 2001 100 CC 100.0
4 Car1 2001 100 CC 100.0
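Note that a forward fill cannot fill a NaN that appears before the first valid value of its group. If that can happen in your data, a hedged sketch that chains a backward fill:
# forward fill, then backward fill any leading NaNs within each group
df['Price'] = df.groupby(['Name', 'Engine'])['Price'].transform(
    lambda s: s.ffill().bfill()
)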
I am trying to create a new dataframe new_df with a new column containing the difference in values obtained by subtracting identical columns in two separate dataframes, df1 and df2.
I have tried new_df.loc['difference'] = df1.loc['s_values'] - df2.loc['s_values'], but I cannot achieve my result.
where df1 =
                     stats  s_values
gender year
women  2007         height        40
       2007  cigarette use        31
and df2 =
                     stats  s_values
gender year
Men    2007         height        10
       2007  cigarette use        11
desired output (I do not want to include the gender index):
new_df =
              stats  difference
year
2007         height          30
2007  cigarette use          20
You can try this (full example):
Input:
import pandas as pd
df1 = pd.DataFrame({'gender': {0: 'woman', 1: 'woman'},
'year': {0: 2007, 1: 2007},
'stats': {0: 'height', 1: 'cigarette use'},
's_values': {0: 40, 1: 31}})
df2 = pd.DataFrame({'gender': {0: 'men', 1: 'men'},
'year': {0: 2007, 1: 2007},
'stats': {0: 'height', 1: 'cigarette use'},
's_values': {0: 10, 1: 11}})
Code:
df = pd.concat([df1,df2], ignore_index=True)
df['s_values'] = df.groupby(['year', 'stats'])['s_values'].diff().abs()
df.dropna(subset=['s_values']).drop('gender', axis=1)
Output:
year stats s_values
2 2007 height 30.0
3 2007 cigarette use 20.0
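Note that diff subtracts in row order after the concat (here: men minus women, hence the .abs()). If you need a signed difference, concatenate the frames in the order you want to subtract and drop the .abs().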
Note:
If both dataframes are structured completely identically, it's even shorter:
df1.drop('gender', axis=1).assign(s_values=df1['s_values'] - df2['s_values'])
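If you also want the result column named difference, as in your desired output, a small sketch under the same assumption (both frames share the same row order):
# same as above, but with the result column named 'difference'
new_df = df1.drop(columns=['gender', 's_values']).assign(
    difference=df1['s_values'] - df2['s_values']
)
print(new_df)
#    year          stats  difference
# 0  2007         height          30
# 1  2007  cigarette use          20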
# alternative: build the result column by column
new_df = pd.DataFrame()
new_df["year"] = df1["year"]
new_df["stats"] = df1["stats"]
# subtract the paired values row by row
for i, (val1, val2) in enumerate(zip(df1["s_values"], df2["s_values"])):
    new_df.at[i, "difference"] = val1 - val2
I have a pandas DataFrame with the following information:
year state candidate percvotes electoral_votes perc_evotes vote_frac vote_int
1976 ALABAMA CARTER, JIMMY 55.727269 9 5.015454 0.015454 5
1976 ALABAMA FORD, GERALD 42.614871 9 3.835338 0.835338 3
1976 ALABAMA MADDOX, LESTER 0.777613 9 0.069985 0.069985 0
1976 ALABAMA BUBAR, BENJAMIN 0.563808 9 0.050743 0.050743 0
1976 ALABAMA HALL, GUS 0.165194 9 0.014867 0.014867 0
where percvotes is the percentage of the total votes cast that the candidate received (calculated beforehand), electoral_votes is the electoral college votes for that state, perc_evotes is the calculated percentage of the electoral votes, and vote_frac and vote_int are the fractional and whole-number parts of the electoral votes earned, respectively. This data repeats for each election year and then by state within each year. Each candidate has a row per state with similar data.
What I want to do is allocate the leftover electoral votes to the candidate with the highest fraction. This number is different for each state and year. In this case there would be 1 leftover electoral vote (9 total votes, and 5+3=8 are already allocated) and the remaining one will go to 'FORD, GERALD', since he has 0.835338 in the vote_frac column. Sometimes there are 2 or 3 left unallocated.
I have a solution that adds the data to a dictionary, but it uses for loops. I know there must be a better way to do this in a more "pandas" way. I have touched on groupby in this loop, but I feel like I am not utilizing pandas to its full potential.
My for loop:
results = {}
grouped = electdf.groupby(["year", "state"])
for key, group in grouped:
    year, state = key
    group['vote_remaining'] = group['electoral_votes'] - group['vote_int'].sum()
    remaining = group['vote_remaining'].iloc[0]
    top_fracs = group['vote_frac'].nlargest(remaining)
    group['total'] = (group['vote_frac'].isin(top_fracs)).astype(int) + group['vote_int']
    if year not in results:
        results[year] = {}
    for candidate, evotes in zip(group['candidate'], group['total']):
        if candidate not in results[year] and evotes:
            results[year][candidate] = 0
        if evotes:
            results[year][candidate] += evotes
Thanks in advance!
Perhaps an apply function which finds the available electoral votes and the number of votes cast, and conditionally updates the 'vote_int' column of the max 'vote_frac' row with the difference between available and cast votes:
import pandas as pd
df = pd.DataFrame({'year': {0: 1976, 1: 1976, 2: 1976, 3: 1976, 4: 1976},
'state': {0: 'ALABAMA', 1: 'ALABAMA', 2: 'ALABAMA',
3: 'ALABAMA', 4: 'ALABAMA'},
'candidate': {0: 'CARTER, JIMMY', 1: 'FORD, GERALD',
2: 'MADDOX, LESTER', 3: 'BUBAR, BENJAMIN',
4: 'HALL, GUS'},
'percvotes': {0: 55.727269, 1: 42.614871, 2: 0.777613, 3: 0.563808,
4: 0.165194},
'electoral_votes': {0: 9, 1: 9, 2: 9, 3: 9, 4: 9},
'perc_evotes': {0: 5.015454, 1: 3.835338, 2: 0.069985,
3: 0.050743, 4: 0.014867},
'vote_frac': {0: 0.015454, 1: 0.835338, 2: 0.069985,
3: 0.050743, 4: 0.014867},
'vote_int': {0: 5, 1: 3, 2: 0, 3: 0, 4: 0}})
def apply_extra_e_votes(grp):
    # Get the number of available electoral votes
    # (assumes the first row in the group contains the
    # correct number of electoral votes for the group)
    available_e_votes = grp['electoral_votes'].iloc[0]
    # Get the sum of the vote_int column
    current_e_votes = grp['vote_int'].sum()
    # If there are more available votes than votes cast
    if available_e_votes > current_e_votes:
        # Update the 'vote_int' column at the max value of 'vote_frac'
        grp.loc[
            grp['vote_frac'].idxmax(),
            'vote_int'
        ] += available_e_votes - current_e_votes  # (remaining votes)
    return grp
# Groupby and Apply Function
new_df = df.groupby(['year', 'state']).apply(apply_extra_e_votes)
# For Display
print(new_df.to_string(index=False))
Output:
 year    state        candidate  percvotes  electoral_votes  perc_evotes  vote_frac  vote_int
 1976  ALABAMA    CARTER, JIMMY  55.727269                9     5.015454   0.015454         5
 1976  ALABAMA     FORD, GERALD  42.614871                9     3.835338   0.835338         4
 1976  ALABAMA   MADDOX, LESTER   0.777613                9     0.069985   0.069985         0
 1976  ALABAMA  BUBAR, BENJAMIN   0.563808                9     0.050743   0.050743         0
 1976  ALABAMA        HALL, GUS   0.165194                9     0.014867   0.014867         0
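Note that this gives all leftover votes to the single row with the largest fraction. If each of the 2 or 3 leftover votes should instead go to a different candidate (one vote per row, matching the nlargest logic in your own loop), a sketch of a variant could look like this (apply_extra_e_votes_spread is a hypothetical name):
def apply_extra_e_votes_spread(grp):
    # leftover = total electoral votes minus those already allocated
    leftover = grp['electoral_votes'].iloc[0] - grp['vote_int'].sum()
    if leftover > 0:
        # give one extra vote to each of the 'leftover' largest fractions
        top_idx = grp['vote_frac'].nlargest(leftover).index
        grp.loc[top_idx, 'vote_int'] += 1
    return grp

new_df = df.groupby(['year', 'state']).apply(apply_extra_e_votes_spread)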
I have data for a certain country that gives the population of certain age groups as a time series. I am trying to multiply the female population numbers by -1 to display them on the other side of the pyramid graph. I have achieved that for one year, i.e. 1960 (see the code below). Now I want to achieve the same result for all the columns from 1960-2020.
PakPopulation.loc[PakPopulation['Gender']=="Female",['1960']]=PakPopulation['1960'].apply(lambda x:-x)
I have also tried the following solution but no luck:
PakPopulation.loc[PakPopulation['Gender']=="Female",[:,['1960':'2019']]=PakPopulation[:,['1960':'2019']].apply(lambda x:-x)
Schema:
Country  Age Group  Gender  1960   1961   1962
XYZ      0-4        Male    5880k  5887k  6998k
XYZ      0-4        Female  5980k  6887k  7998k
You could build a list of years and use that list as part of your selection:
import pandas as pd
PakPopulation = pd.DataFrame({
'Country': {0: 'XYZ', 1: 'ABC'},
'Age Group': {0: '0-4', 1: '0-4'},
'Gender': {0: 'Male', 1: 'Female'},
'1960': {0: 5880, 1: 5980},
'1961': {0: 5887, 1: 6887},
'1962': {0: 6998, 1: 7998},
})
start_year = 1960
end_year = 1962
years_lst = list(map(str, range(start_year, end_year + 1)))
PakPopulation.loc[PakPopulation['Gender'] == "Female", years_lst] = \
PakPopulation[years_lst].apply(lambda x: -x)
print(PakPopulation)
Output:
Country Age Group Gender 1960 1961 1962
0 XYZ 0-4 Male 5880 5887 6998
1 ABC 0-4 Female -5980 -6887 -7998
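As a side note, since the selected columns are numeric in this example, you could also negate them in place without the apply (a sketch under the same assumption that the year columns hold numbers, not strings like "5880k"):
# equivalent: in-place negation of the selected rows/columns
PakPopulation.loc[PakPopulation['Gender'] == "Female", years_lst] *= -1
For your real data, set end_year to the last year column you have (e.g. 2020).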
I have large dataframe of data in Pandas (let's say of courses at a university) looking like:
ID name credits enrolled ugrad/grad year semester
1 Math 4 62 ugrad 2016 Fall
2 History 3 15 ugrad 2016 Spring
3 Adv Math 3 8 grad 2017 Fall
...
and I want to group it by year and semester and then get several different aggregates on it, all at once if possible. For example, I want a total count of courses, a count of only undergraduate courses, and the sum of enrollment for a given semester. I can do each of these individually using value_counts, but I'd like to get an output such as:
year semester count count_ugrad total_enroll
2016 Fall # # #
Spring # # #
2017 Fall # # #
Spring # # #
...
Is this possible?
Here I added a new Python course and provided the data as a dict to load into a dataframe. The solution combines the agg() method on a groupby, where the aggregations are provided in a dictionary, with a custom aggregation function for your ugrad requirement:
import pandas as pd

def my_custom_ugrad_aggregator(arr):
    return sum(arr == 'ugrad')

data = {'name': {0: 'Math', 1: 'History', 2: 'Adv Math', 3: 'Python'},
        'year': {0: 2016, 1: 2016, 2: 2017, 3: 2017}, 'credits': {0: 4, 1: 3, 2: 3, 3: 4},
        'semester': {0: 'Fall', 1: 'Spring', 2: 'Fall', 3: 'Spring'},
        'ugrad/grad': {0: 'ugrad', 1: 'ugrad', 2: 'grad', 3: 'ugrad'},
        'enrolled': {0: 62, 1: 15, 2: 8, 3: 8}, 'ID': {0: 1, 1: 2, 2: 3, 3: 4}}
df = pd.DataFrame(data)
ID credits enrolled name semester ugrad/grad year
0 1 4 62 Math Fall ugrad 2016
1 2 3 15 History Spring ugrad 2016
2 3 3 8 Adv Math Fall grad 2017
3 4 4 8 Python Spring ugrad 2017
print(df.groupby(['year', 'semester']).agg({'name': ['count'],
                                            'enrolled': ['sum'],
                                            'ugrad/grad': my_custom_ugrad_aggregator}))
gives:
name ugrad/grad enrolled
count my_custom_ugrad_aggregator sum
year semester
2016 Fall 1 1 62
Spring 1 1 15
2017 Fall 1 0 8
Spring 1 1 8
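As a side note, on pandas 0.25+, named aggregation gives you exactly the flat column names from your desired output (a sketch over the same df):
out = df.groupby(['year', 'semester']).agg(
    count=('name', 'count'),
    count_ugrad=('ugrad/grad', lambda x: (x == 'ugrad').sum()),
    total_enroll=('enrolled', 'sum'),
)
print(out)
#                count  count_ugrad  total_enroll
# year semester
# 2016 Fall          1            1            62
#      Spring        1            1            15
# 2017 Fall          1            0             8
#      Spring        1            1             8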
Use agg with a dictionary describing how to roll up/aggregate each column:
df_out = df.groupby(['year', 'semester'])[['enrolled', 'ugrad/grad']]\
    .agg({'ugrad/grad': lambda x: (x == 'ugrad').sum(), 'enrolled': ['sum', 'size']})\
    .set_axis(['Ugrad Count', 'Total Enrolled', 'Count Courses'], axis=1)
df_out
Output:
               Ugrad Count  Total Enrolled  Count Courses
year semester
2016 Fall                1              62              1
     Spring              1              15              1
2017 Fall                0               8              1
     Spring              1               8              1