This is my first post so pardon any missing information. I do have dataset like this below
Dataset:-
My expected final output should be like this
Final Output:-
Basically I would like to iterate over Discounted price and apply last discounted price to next year
For example in 2019 , NYC Budget is $10,000 and Discount is 0.05 so Discounted Price is $9,500. Next year discount becomes 0.64 which should be calculated on $9,500, which is $3,420 and in 2021 , 0.04 which comes out to be $3,283 as final discounted price for NYC.
I need to code it in Python using pandas data frame. I think I need to write For loop and then IF inside it . But struggling so far.
Really appreciate any help.
You could use groupby and apply the function to obtain the total discount amount for all the years for a particular city. Then use first over the Budget column groups to get the first row in the group with the rows indexed as the original dataframe. This holds true since "Groupby preserves the order of rows within each group." is a guaranteed behavior.
import pandas as pd
data = {'Year': {0: 2019, 1: 2020, 2: 2021, 3: 2019, 4: 2020, 5: 2019},
'City': {0: 'NYC', 1: 'NYC', 2: 'NYC', 3: 'Edison', 4: 'Edison', 5: 'Princeton'},
'Budget': {0: 10000, 1: 10000, 2: 10000, 3: 5000, 4: 5000, 5: 2000},
'Discount': {0: 0.05, 1: 0.64, 2: 0.04, 3: 0.35, 4: 0.06, 5: 0.45}}
df = pd.DataFrame(data)
g = df.groupby("City")
disc_prod = g['Discount'].apply(lambda x: (1-x).prod())
budget = g['Budget'].first()
result = disc_prod * budget
print(result)
City
Edison 3055.0
NYC 3283.2
Princeton 1100.0
Related
I have a simple dataframe which I would like to separate from each other with some conditions.
Car
Year
Speed
Cond
BMW
2001
150
X
BMW
2000
150
Audi
1997
200
Audi
2000
200
Audi
2012
200
X
Fiat
2020
180
Mazda
2022
183
What i have to do is take duplicates to another dataframe and in my main dataframe leave only one line.
Rows that are duplicates in the Car column I would like to separate into a separate dataframe, but I don't need the ones that have X in the cond column.
In main dataframe I would like keep one row. I would like the left row to be the one that contains X in the cond column
I have code:
import pandas as pd
import numpy as np
cars = {'Car': {0: 'BMW', 1: 'BMW', 2: 'Audi', 3: 'Audi', 4: 'Audi', 5: 'Fiat', 6: 'Mazda'},
'Year': {0: 2001, 1: 2000, 2: 1997, 3: 2000, 4: 2012, 5: 2020, 6: 2022},
'Speed': {0: 150, 1: 150, 2: 200, 3: 200, 4: 200, 5: 180, 6: 183},
'Cond': {0: 'X', 1: np.nan, 2: 'X', 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan}}
df = pd.DataFrame.from_dict(cars)
df_duplicates = df.loc[df.duplicated(subset=['Car'], keep = False)].loc[df['Cond']!='X']
I don't know how i can leave the main dataframe with only one row which additionally contains X in cond column
Maybe it's possible to have one command that will delete and select another dataframe according to the rules above?
If I understand correctly the desired logic, you can use groupby.idxmax to select the first X per group if any (else the first row of the group), to keep in the main DataFrame. The rest goes in the other DataFrame (df2).
# get indices of the row with X is any, else of the first one per group
keep = df['Cond'].eq('X').groupby(df['Car']).idxmax()
# drop selected rows
df2 = df.drop(keep)
# keep selected rows
df = df.loc[keep]
Output:
# updated df1
Car Year Speed Cond
0 BMW 2001 150 X
2 Audi 1997 200 X
5 Fiat 2020 180 NaN
6 Mazda 2022 183 NaN
# df2
Car Year Speed Cond
1 BMW 2000 150 NaN
3 Audi 2000 200 NaN
4 Audi 2012 200 NaN
I am trying to create a new dataframe new_df with a new column containing the difference in values from subtracting identical columns in 2 separate dataframes: df1 df2
I have tried to use the code new_df.loc['difference'] = df1.loc['s_values'] - df2.loc['s_values']
but I cannot achieve my result.
where df1 =
stats s_values
gender year
women 2007 height 40
2007 cigarette use 31
and df2 =
stats s_values
gender year
Men 2007 height 10
2007 cigarette use 11
desired output achieved (I do not want to include the gender index)
new_df =
stats difference
year
2007 height 30
2007 cigarette use 20
You can try this (full example):
Input:
import pandas as pd
df1 = pd.DataFrame({'gender': {0: 'woman', 1: 'woman'},
'year': {0: 2007, 1: 2007},
'stats': {0: 'height', 1: 'cigarette use'},
's_values': {0: 40, 1: 31}})
df2 = pd.DataFrame({'gender': {0: 'men', 1: 'men'},
'year': {0: 2007, 1: 2007},
'stats': {0: 'height', 1: 'cigarette use'},
's_values': {0: 10, 1: 11}})
Code:
df = pd.concat([df1,df2], ignore_index=True)
df['s_values'] = df.groupby(['year', 'stats'])['s_values'].diff().abs()
df.dropna(subset=['s_values']).drop('gender', axis=1)
Output:
year stats s_values
2 2007 height 30.0
3 2007 cigarette use 20.0
Note:
If both dataframes are completely identicaly structured, its even shorter:
df1.drop('gender', axis=1).assign(s_values=df1['s_values'] - df2['s_values'])
new_df = pd.DataFrame()
new_df["year"] = df1["year"]
new_df["stats"] = df1["stats"]
for i, (val1, val2) in enumerate(zip(df1["s_values"],df2["s_values"])):
new_df.at[i,"difference"] = val1-val2
I have a pandas DataFrame with the following information:
year state candidate percvotes electoral_votes perc_evotes vote_frac vote_int
1976 ALABAMA CARTER, JIMMY 55.727269 9 5.015454 0.015454 5
1976 ALABAMA FORD, GERALD 42.614871 9 3.835338 0.835338 3
1976 ALABAMA MADDOX, LESTER 0.777613 9 0.069985 0.069985 0
1976 ALABAMA BUBAR, BENJAMIN 0.563808 9 0.050743 0.050743 0
1976 ALABAMA HALL, GUS 0.165194 9 0.014867 0.014867 0
where pervotes is the percentage of the total votes cast the candidate received (calculated before), electoral_votes are the electoral college votes for that state, perc_evotes is the calculated percent of the electoral votes, and vote_frac and vote_int are the fraction and whole number part of the electoral votes earned respectively. This data repeats for each year of an election and then by state per year. The candidates each have a row for each state, and it is similar data.
What I want to do is allocate the leftover electoral votes to the candidate with the highest fraction. This number is different for each state and year. In this case there would be 1 leftover electoral vote (9 total votes and 5+3=8 are already allocated) and the remaining one will go to 'FORD, GERALD' since he has 0.85338 in the vote_frac column. Sometimes there are 2 or 3 left unallocated.
I have a solution that adds the data to a dictionary, but it is using for loops. I know there must be a better way to do this in a more "pandas" way. I have touched on groupby in this loop but I feel like I am not utilizing pandas to it's full potential.
My for loop:
results = {}
grouped = electdf.groupby(["year", "state"])
for key, group in grouped:
year, state = key
group['vote_remaining'] = group['electoral_votes'] - group['vote_int'].sum()
remaining = group['vote_remaining'].iloc[0]
top_fracs = group['vote_frac'].nlargest(remaining)
group['total'] = (group['vote_frac'].isin(top_fracs)).astype(int) + group['vote_int']
if year not in results:
results[year] = {}
for candidate, evotes in zip(group['candidate'], group['total']):
if candidate not in results[year] and evotes:
results[year][candidate] = 0
if evotes:
results[year][candidate] += evotes
Thanks in advance!
Perhaps an apply function which finds the available electoral votes, the amount of votes cast, and conditionally updates the max 'vote_frac' row's 'vote_int' column with the difference of available and cast votes:
import pandas as pd
df = pd.DataFrame({'year': {0: 1976, 1: 1976, 2: 1976, 3: 1976, 4: 1976},
'state': {0: 'ALABAMA', 1: 'ALABAMA', 2: 'ALABAMA',
3: 'ALABAMA', 4: 'ALABAMA'},
'candidate': {0: 'CARTER, JIMMY', 1: 'FORD, GERALD',
2: 'MADDOX, LESTER', 3: 'BUBAR, BENJAMIN',
4: 'HALL, GUS'},
'percvotes': {0: 55.727269, 1: 42.614871, 2: 0.777613, 3: 0.563808,
4: 0.165194},
'electoral_votes': {0: 9, 1: 9, 2: 9, 3: 9, 4: 9},
'perc_evotes': {0: 5.015454, 1: 3.835338, 2: 0.069985,
3: 0.050743, 4: 0.014867},
'vote_frac': {0: 0.015454, 1: 0.835338, 2: 0.069985,
3: 0.050743, 4: 0.014867},
'vote_int': {0: 5, 1: 3, 2: 0, 3: 0, 4: 0}})
def apply_extra_e_votes(grp):
# Get First Electoral Vote
# (Assumes first row in group contains the
# correct number of electoral votes for the group)
available_e_votes = grp['electoral_votes'].iloc[0]
# Get the Sum of the vote_int column
current_e_votes = grp['vote_int'].sum()
# If there more available votes than votes cast
if available_e_votes > current_e_votes:
# Update the 'vote_int' column at the max value of 'vote_frac'
grp.loc[
grp['vote_frac'].idxmax(),
'vote_int'
] += available_e_votes - current_e_votes # (Remaining Votes)
return grp
# Groupby and Apply Function
new_df = df.groupby(['year', 'state']).apply(apply_extra_e_votes)
# For Display
print(new_df.to_string(index=False))
Output:
year
state
candidate
percvotes
electoral_votes
perc_evotes
vote_frac
vote_int
1976
ALABAMA
CARTER, JIMMY
55.727269
9
5.015454
0.015454
5
1976
ALABAMA
FORD, GERALD
42.614871
9
3.835338
0.835338
4
1976
ALABAMA
MADDOX, LESTER
0.777613
9
0.069985
0.069985
0
1976
ALABAMA
BUBAR, BENJAMIN
0.563808
9
0.050743
0.050743
0
1976
ALABAMA
HALL, GUS
0.165194
9
0.014867
0.014867
0
I have data of a certain country that gives the certain age group population in a time series. I am trying to multiply the number of the female population with -1 to display it on the other side of the pyramid graph. I have achieved that for one year i.e 1960 (see code below). Now I want to achieve the same results for all the columns from 1960-2020
PakPopulation.loc[PakPopulation['Gender']=="Female",['1960']]=PakPopulation['1960'].apply(lambda x:-x)
I have also tried the following solution but no luck:
PakPopulation.loc[PakPopulation['Gender']=="Female",[:,['1960':'2019']]=PakPopulation[:,['1960':'2019']].apply(lambda x:-x)
Schema:
Country
Age Group
Gender
1960
1961
1962
XYZ
0-4
Male
5880k
5887k
6998k
XYZ
0-4
Female
5980k
6887k
7998k
You could build a list of years and use that list as part of your selection:
import pandas as pd
PakPopulation = pd.DataFrame({
'Country': {0: 'XYZ', 1: 'ABC'},
'Age Group': {0: '0-4', 1: '0-4'},
'Gender': {0: 'Male', 1: 'Female'},
'1960': {0: 5880, 1: 5980},
'1961': {0: 5887, 1: 6887},
'1962': {0: 6998, 1: 7998},
})
start_year = 1960
end_year = 1962
years_lst = list(map(str, range(start_year, end_year + 1)))
PakPopulation.loc[PakPopulation['Gender'] == "Female", years_lst] = \
PakPopulation[years_lst].apply(lambda x: -x)
print(PakPopulation)
Output:
Country Age Group Gender 1960 1961 1962
0 XYZ 0-4 Male 5880 5887 6998
1 ABC 0-4 Female -5980 -6887 -7998
I am expanding on earlier thread: Including missing combinations of values in a pandas groupby aggregation
In above thread, the accepted answer computes all possible combinations for the grouping variable. In this version, I'd like to compute combinations based on group of groups.
Let's take an example.
Here's input dataframe:
Here, one group is [Year,Quarter] i.e.
Year Quarter
2014 Q1
2015 Q2
2015 Q3
Another set of group is Name:
Name
Adam
Smith
Now, I want to apply groupby and sum such that missing values of the combination of above groups is detected as NaN
Here's sample output:
I'd appreciate any help.
Here's sample input and output in dict format:
input=
{'Year': {0: 2014, 1: 2014, 2: 2015, 3: 2015, 4: 2015},
'Quarter': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q2', 4: 'Q3'},
'Name': {0: 'Adam', 1: 'Smith', 2: 'Adam', 3: 'Adam', 4: 'Smith'},
'Value': {0: 2, 1: 3, 2: 4, 3: 5, 4: 5}}
output=
{'Year': {0: 2014, 1: 2014, 2: 2015, 3: 2015, 4: 2015, 5: 2015},
'Quarter': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q2', 4: 'Q3', 5: 'Q3'},
'Name': {0: 'Adam', 1: 'Smith', 2: 'Adam', 3: 'Smith', 4: 'Smith', 5: 'Adam'},
'Value': {0: 2.0, 1: 3.0, 2: 9.0, 3: nan, 4: 5.0, 5: nan}}
Clarification:
I am looking for a method without doing melt and cast. i.e. without playing around with long and wide format.
The example post you posted is the correct answer: groupby get the sum then unstack to find the missing value then stack with the param dropna=False here are the docs on stack
df.groupby(['Year','Quarter','Name']).sum().unstack().stack(dropna=False).reset_index()
Year Quarter Name Value
0 2014 Q1 Adam 2.0
1 2014 Q1 Smith 3.0
2 2015 Q2 Adam 9.0
3 2015 Q2 Smith NaN
4 2015 Q3 Adam NaN
5 2015 Q3 Smith 5.0
Using pivot_table, PS you can add reset_index at the end
df.pivot_table(index=['Year','Quarter'],columns='Name',values='Value',aggfunc='sum').stack(dropna=False)
Year Quarter Name
2014 Q1 Adam 2.0
Smith 3.0
2015 Q2 Adam 9.0
Smith NaN
Q3 Adam NaN
Smith 5.0
dtype: float64