Performing an Operation On Grouped Pandas Data - python

I have a pandas DataFrame with the following information:
year state candidate percvotes electoral_votes perc_evotes vote_frac vote_int
1976 ALABAMA CARTER, JIMMY 55.727269 9 5.015454 0.015454 5
1976 ALABAMA FORD, GERALD 42.614871 9 3.835338 0.835338 3
1976 ALABAMA MADDOX, LESTER 0.777613 9 0.069985 0.069985 0
1976 ALABAMA BUBAR, BENJAMIN 0.563808 9 0.050743 0.050743 0
1976 ALABAMA HALL, GUS 0.165194 9 0.014867 0.014867 0
where percvotes is the percentage of the total votes cast that the candidate received (calculated beforehand), electoral_votes is the number of electoral college votes for that state, perc_evotes is the candidate's calculated share of the electoral votes, and vote_frac and vote_int are the fractional and whole-number parts of that share, respectively. This data repeats for each election year and then for each state within a year. Each candidate has a row for each state, with similar data.
What I want to do is allocate the leftover electoral votes to the candidate with the highest fraction. This number is different for each state and year. In this case there would be 1 leftover electoral vote (9 total votes and 5+3=8 are already allocated) and the remaining one will go to 'FORD, GERALD' since he has the highest value, 0.835338, in the vote_frac column. Sometimes there are 2 or 3 left unallocated.
I have a solution that adds the data to a dictionary, but it uses for loops. I know there must be a better way to do this in a more "pandas" way. I have touched on groupby in this loop, but I feel like I am not utilizing pandas to its full potential.
My for loop:
results = {}
grouped = electdf.groupby(["year", "state"])
for key, group in grouped:
    year, state = key
    group['vote_remaining'] = group['electoral_votes'] - group['vote_int'].sum()
    remaining = group['vote_remaining'].iloc[0]
    top_fracs = group['vote_frac'].nlargest(remaining)
    group['total'] = (group['vote_frac'].isin(top_fracs)).astype(int) + group['vote_int']
    if year not in results:
        results[year] = {}
    for candidate, evotes in zip(group['candidate'], group['total']):
        if candidate not in results[year] and evotes:
            results[year][candidate] = 0
        if evotes:
            results[year][candidate] += evotes
Thanks in advance!

Perhaps an apply function which finds the available electoral votes, the amount of votes cast, and conditionally updates the max 'vote_frac' row's 'vote_int' column with the difference of available and cast votes:
import pandas as pd
df = pd.DataFrame({'year': {0: 1976, 1: 1976, 2: 1976, 3: 1976, 4: 1976},
                   'state': {0: 'ALABAMA', 1: 'ALABAMA', 2: 'ALABAMA',
                             3: 'ALABAMA', 4: 'ALABAMA'},
                   'candidate': {0: 'CARTER, JIMMY', 1: 'FORD, GERALD',
                                 2: 'MADDOX, LESTER', 3: 'BUBAR, BENJAMIN',
                                 4: 'HALL, GUS'},
                   'percvotes': {0: 55.727269, 1: 42.614871, 2: 0.777613, 3: 0.563808,
                                 4: 0.165194},
                   'electoral_votes': {0: 9, 1: 9, 2: 9, 3: 9, 4: 9},
                   'perc_evotes': {0: 5.015454, 1: 3.835338, 2: 0.069985,
                                   3: 0.050743, 4: 0.014867},
                   'vote_frac': {0: 0.015454, 1: 0.835338, 2: 0.069985,
                                 3: 0.050743, 4: 0.014867},
                   'vote_int': {0: 5, 1: 3, 2: 0, 3: 0, 4: 0}})
def apply_extra_e_votes(grp):
    # Get First Electoral Vote
    # (Assumes first row in group contains the
    # correct number of electoral votes for the group)
    available_e_votes = grp['electoral_votes'].iloc[0]
    # Get the Sum of the vote_int column
    current_e_votes = grp['vote_int'].sum()
    # If there are more available votes than votes cast
    if available_e_votes > current_e_votes:
        # Update the 'vote_int' column at the max value of 'vote_frac'
        grp.loc[
            grp['vote_frac'].idxmax(),
            'vote_int'
        ] += available_e_votes - current_e_votes  # (Remaining Votes)
    return grp
# Groupby and Apply Function
new_df = df.groupby(['year', 'state']).apply(apply_extra_e_votes)
# For Display
print(new_df.to_string(index=False))
Output:
year  state    candidate        percvotes  electoral_votes  perc_evotes  vote_frac  vote_int
1976  ALABAMA  CARTER, JIMMY    55.727269  9                5.015454     0.015454   5
1976  ALABAMA  FORD, GERALD     42.614871  9                3.835338     0.835338   4
1976  ALABAMA  MADDOX, LESTER   0.777613   9                0.069985     0.069985   0
1976  ALABAMA  BUBAR, BENJAMIN  0.563808   9                0.050743     0.050743   0
1976  ALABAMA  HALL, GUS        0.165194   9                0.014867     0.014867   0
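The question also mentions that 2 or 3 votes are sometimes left over, with each going to the next-highest fraction. A minimal loop-free sketch of that rule, assuming ties are broken by row order (a choice the question does not specify), ranks the fractions within each group and gives one extra vote to each of the top `remaining` rows. The `total` column and the `frac_rank` and `totals` names follow the question's loop or are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [1976] * 5,
    "state": ["ALABAMA"] * 5,
    "candidate": ["CARTER, JIMMY", "FORD, GERALD", "MADDOX, LESTER",
                  "BUBAR, BENJAMIN", "HALL, GUS"],
    "electoral_votes": [9] * 5,
    "vote_frac": [0.015454, 0.835338, 0.069985, 0.050743, 0.014867],
    "vote_int": [5, 3, 0, 0, 0],
})

grouped = df.groupby(["year", "state"])

# Leftover votes per (year, state), broadcast back to every row of the group.
remaining = df["electoral_votes"] - grouped["vote_int"].transform("sum")

# Rank fractions within each group: 1 = largest.
# method="first" breaks ties by row order (an assumption, not from the question).
frac_rank = grouped["vote_frac"].rank(method="first", ascending=False)

# One extra vote for each of the top `remaining` fractions in the group.
df["total"] = df["vote_int"] + (frac_rank <= remaining).astype(int)

# Electoral votes per candidate and year, replacing the nested dictionary.
totals = df.groupby(["year", "candidate"])["total"].sum()
print(totals[totals > 0])
```

On the sample data this adds the single leftover vote to 'FORD, GERALD', the same result as the apply-based answer; the two approaches only differ when more than one vote is left over.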

Related

selecting duplicates by condition python pandas

I have a simple dataframe which I would like to separate from each other with some conditions.
Car    Year  Speed  Cond
BMW    2001  150    X
BMW    2000  150
Audi   1997  200
Audi   2000  200
Audi   2012  200    X
Fiat   2020  180
Mazda  2022  183
What I have to do is move the duplicates to another dataframe and leave only one row per car in my main dataframe.
Rows that are duplicates in the Car column should go into a separate dataframe, but I don't need the ones that have X in the Cond column there.
In the main dataframe I would like to keep one row per car, and the row that remains should be the one that contains X in the Cond column.
I have code:
import pandas as pd
import numpy as np
cars = {'Car': {0: 'BMW', 1: 'BMW', 2: 'Audi', 3: 'Audi', 4: 'Audi', 5: 'Fiat', 6: 'Mazda'},
        'Year': {0: 2001, 1: 2000, 2: 1997, 3: 2000, 4: 2012, 5: 2020, 6: 2022},
        'Speed': {0: 150, 1: 150, 2: 200, 3: 200, 4: 200, 5: 180, 6: 183},
        'Cond': {0: 'X', 1: np.nan, 2: 'X', 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan}}
df = pd.DataFrame.from_dict(cars)
df_duplicates = df.loc[df.duplicated(subset=['Car'], keep=False)].loc[df['Cond'] != 'X']
I don't know how I can leave the main dataframe with only one row per car, namely the one that contains X in the Cond column.
Maybe it's possible to have one command that will both delete the rows and move them to another dataframe according to the rules above?
If I understand correctly the desired logic, you can use groupby.idxmax to select the first X per group if any (else the first row of the group), to keep in the main DataFrame. The rest goes in the other DataFrame (df2).
# get indices of the row with X if any, else of the first one per group
keep = df['Cond'].eq('X').groupby(df['Car']).idxmax()
# drop selected rows
df2 = df.drop(keep)
# keep selected rows
df = df.loc[keep]
Output:
# updated df
Car Year Speed Cond
0 BMW 2001 150 X
2 Audi 1997 200 X
5 Fiat 2020 180 NaN
6 Mazda 2022 183 NaN
# df2
Car Year Speed Cond
1 BMW 2000 150 NaN
3 Audi 2000 200 NaN
4 Audi 2012 200 NaN
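For reference, a self-contained sketch of the same idea, using the data from the question's code. Two details here are assumptions added for illustration rather than part of the answer: the boolean flag is cast to int before idxmax, and sort=False is passed to groupby so the kept rows stay in their original order. The names has_x, main_df and duplicates_df are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Car": ["BMW", "BMW", "Audi", "Audi", "Audi", "Fiat", "Mazda"],
    "Year": [2001, 2000, 1997, 2000, 2012, 2020, 2022],
    "Speed": [150, 150, 200, 200, 200, 180, 183],
    "Cond": ["X", np.nan, "X", np.nan, np.nan, np.nan, np.nan],
})

# 1 where Cond is X, else 0. idxmax returns the index of the first maximum,
# so a group without any X falls back to its first row.
has_x = df["Cond"].eq("X").astype(int)
keep = has_x.groupby(df["Car"], sort=False).idxmax()

main_df = df.loc[keep]          # one row per car, preferring the X row
duplicates_df = df.drop(keep)   # everything else

print(main_df)
print(duplicates_df)
```

Because idxmax picks the first maximum, a car with several X rows keeps only the first of them; the others end up in duplicates_df and would need an extra filter if they are unwanted there.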

Multiply a specific value with a series of columns based on a Condition in Pandas Dataframe

I have data for a certain country that gives the population of each age group as a time series. I am trying to multiply the female population by -1 to display it on the other side of a pyramid graph. I have achieved that for one year, i.e. 1960 (see code below). Now I want to achieve the same result for all the columns from 1960 to 2020.
PakPopulation.loc[PakPopulation['Gender']=="Female",['1960']]=PakPopulation['1960'].apply(lambda x:-x)
I have also tried the following solution but no luck:
PakPopulation.loc[PakPopulation['Gender']=="Female",[:,['1960':'2019']]=PakPopulation[:,['1960':'2019']].apply(lambda x:-x)
Schema:
Country  Age Group  Gender  1960   1961   1962
XYZ      0-4        Male    5880k  5887k  6998k
XYZ      0-4        Female  5980k  6887k  7998k
You could build a list of years and use that list as part of your selection:
import pandas as pd
PakPopulation = pd.DataFrame({
    'Country': {0: 'XYZ', 1: 'ABC'},
    'Age Group': {0: '0-4', 1: '0-4'},
    'Gender': {0: 'Male', 1: 'Female'},
    '1960': {0: 5880, 1: 5980},
    '1961': {0: 5887, 1: 6887},
    '1962': {0: 6998, 1: 7998},
})
start_year = 1960
end_year = 1962
years_lst = list(map(str, range(start_year, end_year + 1)))
PakPopulation.loc[PakPopulation['Gender'] == "Female", years_lst] = \
    PakPopulation[years_lst].apply(lambda x: -x)
print(PakPopulation)
Output:
Country Age Group Gender 1960 1961 1962
0 XYZ 0-4 Male 5880 5887 6998
1 ABC 0-4 Female -5980 -6887 -7998
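As a possible variation, the negation can be done in place without apply. This is only a sketch: it assumes the year columns are the ones whose names consist solely of digits and that they already hold numeric values (not strings such as "5880k"). The names year_cols and is_female are illustrative.

```python
import pandas as pd

PakPopulation = pd.DataFrame({
    "Country": ["XYZ", "XYZ"],
    "Age Group": ["0-4", "0-4"],
    "Gender": ["Male", "Female"],
    "1960": [5880, 5980],
    "1961": [5887, 6887],
    "1962": [6998, 7998],
})

# Assumption: every column whose name is all digits is a year column.
year_cols = [col for col in PakPopulation.columns if col.isdigit()]

# Flip the sign of the year columns for the female rows only.
is_female = PakPopulation["Gender"] == "Female"
PakPopulation.loc[is_female, year_cols] *= -1

print(PakPopulation)
```

Since .loc slices by label and includes both endpoints, PakPopulation.loc[is_female, '1960':'1962'] selects the same columns when they are stored contiguously.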

Adding missing dates to the dataframe

I have a dataframe that has a list of One Piece manga, which currently looks like this:
0 # Title Pages
Date
1997-07-19 1 Romance Dawn - The Dawn of the Adventure 53
1997-07-28 2 That Guy, "Straw Hat Luffy" 23
1997-08-04 3 Introducing "Pirate Hunter Zoro" 21
1997-08-11 4 Marine Captain "Axe-Hand Morgan" 19
1997-08-25 5 Pirate King and Master Swordsman 19
1997-09-01 6 The First Crew Member 23
1997-09-08 7 Friends 20
1997-09-13 8 Introducing Nami 19
Although every episode is to be issued weekly, sometimes they are delayed or on break, resulting in an irregular interval in the dates. What I would like to do is to add a missing date. For example, between 1997-08-11 and 1997-08-25, there should be 1997-08-18 (7 days from 1997-08-11) where the episode was not issued. Could you help me out with how to operate this code?
Thank you.
You should use the built-in shift method to compute the gap between consecutive dates.
df['day_between'] = df['Date'].shift(-1) - df['Date']
output of print(df[['Date', 'day_between']]) is then:
Date day_between
0 1997-07-19 9 days
1 1997-07-28 7 days
2 1997-08-04 7 days
3 1997-08-11 14 days
4 1997-08-25 7 days
5 1997-09-01 7 days
6 1997-09-08 5 days
7 1997-09-13 NaT
I used relativedelta and a list comprehension to get a date 14 days after each row's date, and .shift(1) with np.where() to compare it against the next row, returning 1 for every row before which we would want to insert a row. Then, I looped through the dataframe and appended the relevant rows to another dataframe. Finally, I used pd.concat to bring the two dataframes together, sorted by date, deleted the helper columns and reset the index.
There may be some gaps as others have mentioned like 22 days+ but this should get you in the right direction. Perhaps you could turn it into a function and run it multiple times, which is why I added .reset_index(drop=True) at the end. Obviously, you could just make this more advanced, but I hope this helps.
from dateutil.relativedelta import relativedelta
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': {0: '1997-07-19',
                            1: '1997-07-28',
                            2: '1997-08-04',
                            3: '1997-08-11',
                            4: '1997-08-25',
                            5: '1997-09-01',
                            6: '1997-09-08',
                            7: '1997-09-13'},
                   '#': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
                   'Title': {0: 'Romance Dawn - The Dawn of the Adventure',
                             1: 'That Guy, "Straw Hat Luffy"',
                             2: 'Introducing "Pirate Hunter Zoro"',
                             3: 'Marine Captain "Axe-Hand Morgan"',
                             4: 'Pirate King and Master Swordsman',
                             5: 'The First Crew Member',
                             6: 'Friends',
                             7: 'Introducing Nami'},
                   'Pages': {0: 53, 1: 23, 2: 21, 3: 19, 4: 19, 5: 23, 6: 20, 7: 19}})
df['Date'] = pd.to_datetime(df['Date'])
# Date2 is each date plus 14 days (subtracting a negative delta)
df['Date2'] = [d - relativedelta(days=-14) for d in df['Date']]
# Flag rows that fall 14 or more days after the previous row
df['Date3'] = np.where((df['Date'] >= df['Date2'].shift(1)), 1, 0)
df1 = pd.DataFrame({})
n = 0
for j in df['Date3']:
    n += 1
    if j == 1:
        new_row = pd.DataFrame({"Date": df['Date'][n-1] - relativedelta(days=7)}, index=[n])
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        df1 = pd.concat([df1, new_row])
df = pd.concat([df, df1]).sort_values('Date').drop(['Date2', 'Date3'], axis=1).reset_index(drop=True)
df
Output:
Date # Title Pages
0 1997-07-19 1.0 Romance Dawn - The Dawn of the Adventure 53.0
1 1997-07-28 2.0 That Guy, "Straw Hat Luffy" 23.0
2 1997-08-04 3.0 Introducing "Pirate Hunter Zoro" 21.0
3 1997-08-11 4.0 Marine Captain "Axe-Hand Morgan" 19.0
4 1997-08-18 NaN NaN NaN
5 1997-08-25 5.0 Pirate King and Master Swordsman 19.0
6 1997-09-01 6.0 The First Crew Member 23.0
7 1997-09-08 7.0 Friends 20.0
8 1997-09-13 8.0 Introducing Nami 19.0
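The approach above inserts at most one row per gap. A rough sketch that also covers longer gaps (21 days or more) in a single pass is shown below. It assumes a date counts as missing only when at least 14 days separate two issues, so the 9-day and 5-day gaps are left alone; that rule, and the names previous, gap_days, missing and filled, are illustrative choices rather than part of the answer. Only the Date and # columns are reproduced to keep it short.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "1997-07-19", "1997-07-28", "1997-08-04", "1997-08-11",
        "1997-08-25", "1997-09-01", "1997-09-08", "1997-09-13",
    ]),
    "#": range(1, 9),
})

previous = df["Date"].shift(1)
gap_days = (df["Date"] - previous).dt.days  # NaN for the first row

missing = []
for prev, days in zip(previous, gap_days):
    if pd.isna(days):
        continue
    # A gap of 14-20 days yields one missing date, 21-27 days yields two, etc.
    for k in range(1, int(days) // 7):
        missing.append(prev + pd.Timedelta(days=7 * k))

filled = (
    pd.concat([df, pd.DataFrame({"Date": missing})])
    .sort_values("Date")
    .reset_index(drop=True)
)
print(filled)
```

The inserted rows carry NaN in every column except Date, as in the output above.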

Check if column in dataframe is missing values

I have a column full of state names.
I know how to iterate down through it, but I don't know what syntax to use to have it check for empty values as it goes. I tried "isnull()" but that seems to be the wrong approach. Does anyone know a way?
I was thinking of something like:
for state_name in datFrame.state_name:
    if datFrame.state_name.isnull():
        print ('no name value' + other values from row)
    else:
        print(row is good.)
df.head():
state_name state_ab city zip_code
0 Alabama AL Chickasaw 36611
1 Alabama AL Louisville 36048
2 Alabama AL Columbiana 35051
3 Alabama AL Satsuma 36572
4 Alabama AL Dauphin Island 36528
to_dict():
{'state_name': {0: 'Alabama',
1: 'Alabama',
2: 'Alabama',
3: 'Alabama',
4: 'Alabama'},
'state_ab': {0: 'AL', 1: 'AL', 2: 'AL', 3: 'AL', 4: 'AL'},
'city': {0: 'Chickasaw',
1: 'Louisville',
2: 'Columbiana',
3: 'Satsuma',
4: 'Dauphin Island'},
'zip_code': {0: '36611', 1: '36048', 2: '35051', 3: '36572', 4: '36528'}}
Based on your description, you can use np.where to check if rows are either null or empty strings.
df['status'] = np.where(df['state'].eq('') | df['state'].isnull(), 'Not Good', 'Good')
(MCVE) For example, suppose you have the following dataframe
state
0 New York
1 Nevada
2
3 None
4 New Jersey
then,
state status
0 New York Good
1 Nevada Good
2 Not Good
3 None Not Good
4 New Jersey Good
It's always worth mentioning that you should avoid loops whenever possible, because they are much slower than vectorized masking.
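A self-contained sketch of the example above, reusing the same mask to pull out the offending rows instead of looping. The is_missing and bad_rows names are illustrative, and the column is called state as in the MCVE (in the question it would be state_name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"state": ["New York", "Nevada", "", None, "New Jersey"]})

# True where the value is null (None/NaN) or an empty string.
is_missing = df["state"].isna() | df["state"].eq("")

# Label every row in one vectorized step.
df["status"] = np.where(is_missing, "Not Good", "Good")

# The same mask selects the rows that need attention.
bad_rows = df[is_missing]

print(df)
print(bad_rows)
```

If whitespace-only values should also count as missing, one option is to compare df["state"].str.strip() against the empty string instead.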

Aggregate a bunch of different data in a single groupby with multiple columns

I have a large dataframe in Pandas (let's say of courses at a university) that looks like:
ID name credits enrolled ugrad/grad year semester
1 Math 4 62 ugrad 2016 Fall
2 History 3 15 ugrad 2016 Spring
3 Adv Math 3 8 grad 2017 Fall
...
and I want to group it by year and semester, and then get a bunch of different aggregate data on it, but all at one time if I can. For example, I want a total count of courses, count of only undergraduate courses, and sum of enrollment for a given semester. I can do each of these individually using value_counts, but I'd like to get an output such as:
year semester count count_ugrad total_enroll
2016 Fall # # #
Spring # # #
2017 Fall # # #
Spring # # #
...
Is this possible?
Here I added a new subject for Python and provided as a dict to load into dataframe.
Solution is a combination of the agg() method on a groupby, where the aggregations are provided in a dictionary, and then the use of a custom aggregation function for your ugrad requirement:
import pandas as pd
def my_custom_ugrad_aggregator(arr):
    # Count how many entries in the group are undergraduate courses
    return sum(arr == 'ugrad')
# Named `data` rather than `dict` to avoid shadowing the built-in
data = {'name': {0: 'Math', 1: 'History', 2: 'Adv Math', 3: 'Python'}, 'year': {0: 2016, 1: 2016, 2: 2017, 3: 2017}, 'credits': {0: 4, 1: 3, 2: 3, 3: 4}, 'semester': {0: 'Fall', 1: 'Spring', 2: 'Fall', 3: 'Spring'}, 'ugrad/grad': {0: 'ugrad', 1: 'ugrad', 2: 'grad', 3: 'ugrad'}, 'enrolled': {0: 62, 1: 15, 2: 8, 3: 8}, 'ID': {0: 1, 1: 2, 2: 3, 3: 4}}
df = pd.DataFrame(data)
ID credits enrolled name semester ugrad/grad year
0 1 4 62 Math Fall ugrad 2016
1 2 3 15 History Spring ugrad 2016
2 3 3 8 Adv Math Fall grad 2017
3 4 4 8 Python Spring ugrad 2017
print(df.groupby(['year','semester']).agg({'name':['count'],'enrolled':['sum'],'ugrad/grad':my_custom_ugrad_aggregator}))
gives:
name ugrad/grad enrolled
count my_custom_ugrad_aggregator sum
year semester
2016 Fall 1 1 62
Spring 1 1 15
2017 Fall 1 0 8
Spring 1 1 8
Use agg with dictionary on how to rollup/aggregate each column:
df_out = df.groupby(['year','semester'])[['enrolled','ugrad/grad']]\
    .agg({'ugrad/grad':lambda x: (x=='ugrad').sum(),'enrolled':['sum','size']})\
    .set_axis(['Ugrad Count','Total Enrolled','Count Courses'], axis=1)
df_out
Output:
Ugrad Count Total Enrolled Count Courses
year semester
2016 Fall 1 62 1
Spring 1 15 1
2017 Fall 0 8 1
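Another option, sketched here under the assumption of a pandas version with named aggregation (0.25 or later), is to name the output columns directly in agg, which avoids renaming them afterwards. The output names count, count_ugrad and total_enroll are taken from the desired output in the question; summary is an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "name": ["Math", "History", "Adv Math", "Python"],
    "credits": [4, 3, 3, 4],
    "enrolled": [62, 15, 8, 8],
    "ugrad/grad": ["ugrad", "ugrad", "grad", "ugrad"],
    "year": [2016, 2016, 2017, 2017],
    "semester": ["Fall", "Spring", "Fall", "Spring"],
})

# Each keyword is an output column: (input column, aggregation).
summary = df.groupby(["year", "semester"]).agg(
    count=("name", "count"),
    count_ugrad=("ugrad/grad", lambda s: (s == "ugrad").sum()),
    total_enroll=("enrolled", "sum"),
)
print(summary)
```

The result has flat column names and keeps year and semester as a MultiIndex; calling reset_index() on it would turn them back into ordinary columns.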
