Filling in missing data in pandas - python

I have a lot of missing data between the years and months of my dataframe, which looks like:
Year Month State Value
1969 12 NJ 5500
1969 12 NY 6418
1970 8 IL 10093
1970 12 WI 6430
1970 7 NY 6140
1971 10 IL 10093
1971 6 MN 6850
1971 3 SC 7686
1972 12 FL 8772
2016 1 NJ 9000
For each state, I need to fill in all the missing months from the year its values begin through 2018. The existing data mostly falls between 1969 and 1990, so I just need to fill in the blanks.
The desired output (for NJ but needed for all states) would be:
Year Month State Value
1969 12 NJ 5500
1970 1 NJ 5500
1970 2 NJ 5500
1970 3 NJ 5500
1970 4 NJ 5500
1970 5 NJ 5500
1970 6 NJ 5500
.
.
1970 12 NJ 5500
.
.
2010 1 NJ 5500
2010 2 NJ 5500
2010 3 NJ 5500
.
.
2018 1 NJ 9000
I've tried turning the months into categorical values ranging from 1-12, regrouping and resetting the index, and then using ffill to forward-fill the values into the newly created rows:
df['Month'] = pd.Categorical(df['Month'], categories=range(1, 13))
df = df.groupby(['State', 'Year', 'Month']).first().reset_index()
df['Value'] = df.groupby('Region')['Value'].ffill()
But this method gives me NaN values like:
State Year Month Value
NJ 1969 12 5500.0
NJ 1970 1 NaN
NJ 1970 2 NaN
NJ 1970 3 NaN
.
.
NJ 2016 1 9000.0
I can't understand why this isn't working, since I've tested the same method on other data and gotten correct results.

Sorry to all those who took time to correct this. It was a simple matter of accidentally grouping by the wrong column.
I had previously created a 'Region' column based on a grouping of the State values, and that column was being referenced rather than State itself.
So to clarify:
df['Value'] = df.groupby('Region')['Value'].ffill()
Needs to be changed into:
df['Value'] = df.groupby('State')['Value'].ffill()
This method works correctly.
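For completeness, here is a minimal self-contained sketch of the corrected pipeline on a toy frame (the frame below is illustrative, not the question's full dataset):

import pandas as pd

# Illustrative toy frame with gaps, mirroring the question's structure
df = pd.DataFrame({'Year':  [1969, 1969, 1970],
                   'Month': [12, 12, 7],
                   'State': ['NJ', 'NY', 'NY'],
                   'Value': [5500, 6418, 6140]})

# Categorical months make groupby emit all 12 months for every group
df['Month'] = pd.Categorical(df['Month'], categories=range(1, 13))
df = df.groupby(['State', 'Year', 'Month'], observed=False).first().reset_index()

# Forward-fill within each state -- 'State', not the stray 'Region' column
df['Value'] = df.groupby('State')['Value'].ffill()

Note that the Categorical trick only expands month/year combinations already observed somewhere in the data; extending every state out to 2018 would additionally require a reindex over the full year range.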

How to create a multi-index pivot table that sums the max values within a sub-group

I have a somewhat large dataframe of customers assigned to a hub and each hub is in a specific location. The hubs get flagged whenever there's an issue and I'd like to know the number of customers affected each time this happens.
So I'd like to find the max number of customers assigned to each hub (this would then exclude the times the hub may have been flagged multiple times) and then group the rows by location and the columns by type, then show the sum of the max count of customers over a period of months.
The data looks like:
Hub  Location  DateTime    Month    Type    Customers
J01  NY        01/01/2022  January  Type 1  250
J03  CA        01/21/2022  January  Type 2  111
J01  NY        04/01/2022  April    Type 1  250
J05  CA        06/01/2022  June     Type 1  14
J03  CA        08/18/2022  August   Type 2  111
I used the following code to generate a pivot table, and it produces the max values for each hub, but there are hundreds of hubs.
pd.pivot_table(out, values='Customers', index=['Location', 'Hub'],
               columns=['Type', 'Month'], aggfunc='max')
Results mostly look like:
Type            Type 1                   Type 2
Month           January  February  March January
Location  Hub
NY        J01   0        250       250   NaN
          J04   222      222       222   NaN
CA        J03   NaN      NaN       NaN   111
          J05   14       14        0     NaN
I would like the results to look like:
Type      Type 1                   Type 2
Month     January  February  March January
Location
NY        222      472       472   0
CA        14       14        0     111
Is there an easier way to achieve this?
You're off to a good start! pivot_table is the right way to reshape the table with columns by Type, and you've correctly identified that the max aggregation can be performed at pivot time:
df = pd.pivot_table(
    out,
    values='Customers',
    index=['Location', 'Hub'],
    columns=['Type', 'Month'],
    aggfunc='max'
)
Next, we need to group by Location and add (sum) the values:
df = df.groupby("Location").sum()
The final result (trying to build a table that matches your data):
import io
import pandas as pd

# Note: the fields in the sample string are tab-separated, matching sep='\t'
out = pd.read_csv(io.StringIO("""
Hub	Location	DateTime	Month	Type	Customers
J01	NY	01/01/2022	January	Type 1	250
J03	CA	01/21/2022	January	Type 2	111
J01	NY	04/01/2022	April	Type 1	250
J04	NY	01/01/2022	January	Type 1	222
J04	NY	02/01/2022	February	Type 1	222
J04	NY	04/01/2022	April	Type 1	222
J05	CA	06/01/2022	June	Type 1	14
J03	CA	08/18/2022	August	Type 2	111
"""), sep='\t')

pd.pivot_table(
    out,
    values='Customers',
    index=['Location', 'Hub'],
    columns=['Type', 'Month'],
    aggfunc='max'
).groupby("Location").sum()
gives:
Type Type 1 Type 2
Month April February January June August January
Location
CA 0.0 0.0 0.0 14.0 111.0 111.0
NY 472.0 222.0 472.0 0.0 0.0 0.0
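As an aside (an equivalent formulation, not from the original answer): because Location is the outer level of the pivoted row index, the final collapse can also be written against the index level directly:

# Building on the out frame defined above
pivoted = pd.pivot_table(out, values='Customers',
                         index=['Location', 'Hub'],
                         columns=['Type', 'Month'], aggfunc='max')
result = pivoted.groupby(level='Location').sum()  # same as .groupby('Location')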

Create a new column based on condition from a column in another dataset

I have a weird pandas problem and I'm not sure where to begin. Here are examples of my two datasets:
df1: This dataset has a yearly metric per state.
Metric Year State
8 1996 AL
6 1997 AL
4 1998 AL
5 1999 AL
7 2000 AL
20 2001 AL
21 2002 AL
20 2003 AL
34 1996 CA
35 1997 CA
36 1998 CA
22 1999 CA
20 2000 CA
22 2001 CA
24 2002 CA
df2: This dataset has a law (which I'm referring to as ID) instituted in the year for the state.
ID State Year
ABC123 AL 1999
DEF456 AL 2000
GHI789 AL 2001
JKL012 AL 2001
PQR678 CA 1999
STU901 CA 2000
YZA567 CA 2001
I want to determine if there was a significant difference in the average of the metric before and after the law was instituted in that state for each ID. I would essentially want a fourth column in df2 that is just avg(metric after) - avg(metric before). My first instinct was to use an np.where statement, but I was unsure how to write it properly. Here's my attempt:
df2['diff'] = np.where((df2['Year']==2000) & (df2['State']=='AL'), df1[df1['Year']<2000]['Metric'].mean() - df1[df1['Year']>2000]['Metric'].mean(), 0)
I know this is not correct: the fallback value is just 0, and the condition only covers the year 2000 for Alabama. It doesn't filter the California metrics out of the calculation either.
So, what I am looking for is an iterative way of getting that difference for every state-year combination.
Any help would be appreciated! Thank you!
First merge df1 and df2 on State (a many-to-many join that pairs every law with every metric row for its state):
df3 = df2.merge(df1, on='State', suffixes=('', '_metric'))
Then get the average metrics before and after the law was introduced for each ID, State, Year combination:
df3.groupby(['ID', 'State', 'Year']).apply(
    lambda x: pd.Series([x.loc[x.Year_metric < x.Year, 'Metric'].mean(),
                         x.loc[x.Year_metric > x.Year, 'Metric'].mean()],
                        index=['before', 'after'])
)
Result:
before after
ID State Year
ABC123 AL 1999 6.00 17.000000
DEF456 AL 2000 5.75 20.333333
GHI789 AL 2001 6.00 20.500000
JKL012 AL 2001 6.00 20.500000
PQR678 CA 1999 35.00 22.000000
STU901 CA 2000 31.75 23.000000
YZA567 CA 2001 29.40 24.000000
To just see the difference you can do instead:
df3.groupby(['ID', 'State', 'Year'], as_index=False).apply(
    lambda x: x.loc[x.Year_metric > x.Year, 'Metric'].mean() -
              x.loc[x.Year_metric < x.Year, 'Metric'].mean()
).rename(columns={None: 'Difference'})
Result:
ID State Year Difference
0 ABC123 AL 1999 11.000000
1 DEF456 AL 2000 14.583333
2 GHI789 AL 2001 14.500000
3 JKL012 AL 2001 14.500000
4 PQR678 CA 1999 -13.000000
5 STU901 CA 2000 -8.750000
6 YZA567 CA 2001 -5.400000
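A vectorized alternative (a sketch reusing the merged frame df3 from above; before and after are hypothetical intermediate names) avoids the per-group apply:

# Mean metric strictly before / strictly after each law year, per law ID
before = df3.loc[df3['Year_metric'] < df3['Year']].groupby('ID')['Metric'].mean()
after = df3.loc[df3['Year_metric'] > df3['Year']].groupby('ID')['Metric'].mean()

# The two Series align on ID, so the difference maps straight back onto df2
df2['Difference'] = df2['ID'].map(after - before)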

Cumsum with groupby

I have a dataframe containing:
State Country Date Cases
0 NaN Afghanistan 2020-01-22 0
271 NaN Afghanistan 2020-01-23 0
... ... ... ... ...
85093 NaN Zimbabwe 2020-11-30 9950
85364 NaN Zimbabwe 2020-12-01 10129
I'm trying to create a new column of cumulative cases but grouped by Country AND State.
State Country Date Cases Total Cases
231 California USA 2020-01-22 5 5
342 California USA 2020-01-23 10 15
233 Texas USA 2020-01-22 4 4
322 Texas USA 2020-01-23 12 16
I have been trying to follow Pandas groupby cumulative sum and have tried things such as:
df['Total'] = df.groupby(['State','Country'])['Cases'].cumsum()
Returns a series of -1's
df['Total'] = df.groupby(['State', 'Country']).sum() \
                .groupby(level=0).cumsum().reset_index()
Returns the sum.
df['Total'] = df.groupby(['Country'])['Cases'].apply(lambda x: x.cumsum())
Doesn't separate the sums by state.
df_f['Total'] = df_f.groupby(['Region','State'])['Cases'].apply(lambda x: x.cumsum())
This one works except that when 'State' is NaN, 'Total' is also NaN.
arrays = [['California', 'California', 'Texas', 'Texas'],
          ['USA', 'USA', 'USA', 'USA'],
          ['2020-01-22', '2020-01-23', '2020-01-22', '2020-01-23'],
          [5, 10, 4, 12]]
df = pd.DataFrame(list(zip(*arrays)), columns=['State', 'Country', 'Date', 'Cases'])
df
State Country Date Cases
0 California USA 2020-01-22 5
1 California USA 2020-01-23 10
2 Texas USA 2020-01-22 4
3 Texas USA 2020-01-23 12
temp = df.set_index(['State', 'Country', 'Date'], drop=True).sort_index()
df['Total Cases'] = temp.groupby(['State', 'Country']).cumsum().reset_index()['Cases']
df
State Country Date Cases Total Cases
0 California USA 2020-01-22 5 5
1 California USA 2020-01-23 10 15
2 Texas USA 2020-01-22 4 4
3 Texas USA 2020-01-23 12 16
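On the NaN issue raised in the question: groupby drops rows whose key is NaN by default, which is why 'Total' came out NaN for the stateless rows. Since pandas 1.1 you can keep those rows as their own group with dropna=False (a minimal sketch, assuming the question's column names):

# NaN states form their own group instead of being dropped from the cumsum
df['Total Cases'] = df.groupby(['State', 'Country'], dropna=False)['Cases'].cumsum()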

Creating a dataframe with months x years based on time series in pandas

I have time series data with the number of days for each month over several years, and I'm trying to create a new dataframe with months as rows and years as columns.
I have this
DateTime Days Month Year
2004-11-30 3 November 2004
2004-12-31 16 December 2004
2005-01-31 12 January 2005
2005-02-28 11 February 2005
2005-03-31 11 March 2005
... ... ... ...
2019-06-30 0 June 2019
2019-07-31 2 July 2019
2019-08-31 5 August 2019
2019-09-30 5 September 2019
2019-10-31 3 October 2019
And I'm trying to get this
Month 2004 2005 ... 2019
January nan 12 7
February nan 11 9
...
November 17 17 nan
December 14 15 nan
I created a new dataframe whose first column holds the months and tried to iterate through the first dataframe to add the new columns (years) and fill in the cells. However, the condition that checks whether the month in the first dataframe (days) matches the month in the new dataframe (output) is never True, so the new dataframe never gets updated. I guess this is because the month in days is never the same as the month in output within the same iteration.
for index, row in days.iterrows():
    print(days.loc[index, 'Days'])  # this prints out as expected
    for month in output.items():
        print(index.month_name())  # this prints out as expected
        if index.month_name() == month:
            output.at[month, index.year] = days.loc[index, 'Days']  # I wanted to use this to fill up the cells, is this right?
            print(days.loc[index, 'Days'])  # this never gets printed out
Could you tell me how to fix this? Or maybe there's a better way to accomplish the result rather than iteration?
It's my first attempt to use libraries in python, so I would appreciate some help.
Use pivot if your input dataframe has a single value per month and year:
df.pivot(index='Month', columns='Year', values='Days')
Output:
Year 2004 2005 2019
Month
August NaN NaN 5
December 16 NaN NaN
February NaN 11 NaN
January NaN 12 NaN
July NaN NaN 2
June NaN NaN 0
March NaN 11 NaN
November 3 NaN NaN
October NaN NaN 3
September NaN NaN 5
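Note that the rows above come out in alphabetical order. If you want calendar order as in the desired output (a sketch, not part of the original answer), make Month an ordered categorical before pivoting:

import calendar
import pandas as pd

# Calendar-ordered category list: ['January', ..., 'December']
months = list(calendar.month_name)[1:]
df['Month'] = pd.Categorical(df['Month'], categories=months, ordered=True)
df.pivot(index='Month', columns='Year', values='Days').sort_index()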

Exclude rows in a dataframe based on matching values in rows from another dataframe

I have two dataframes (A and B). I want to remove all the rows in B where the values for columns Month, Year, Type, Name are an exact match.
Dataframe A
Name Type Month Year country Amount Expiration Paid
0 EXTRON GOLD March 2019 CA 20000 2019-09-07 yes
0 LEAF SILVER March 2019 PL 4893 2019-02-02 yes
0 JMC GOLD March 2019 IN 7000 2020-01-16 no
Dataframe B
Name Type Month Year country Amount Expiration Paid
0 JONS GOLD March 2018 PL 500 2019-10-17 yes
0 ABBY BRONZE March 2019 AU 60000 2019-02-02 yes
0 BUYT GOLD March 2018 BR 50 2018-03-22 no
0 EXTRON GOLD March 2019 CA 90000 2019-09-07 yes
0 JAYB PURPLE March 2019 PL 9.90 2018-04-20 yes
0 JMC GOLD March 2019 IN 6000 2020-01-16 no
0 JMC GOLD April 2019 IN 1000 2020-01-16 no
Desired Output:
Dataframe B
Name Type Month Year country Amount Expiration Paid
0 JONS GOLD March 2018 PL 500 2019-10-17 yes
0 ABBY BRONZE March 2019 AU 60000 2019-02-02 yes
0 BUYT GOLD March 2018 BR 50 2018-03-22 no
0 JAYB PURPLE March 2019 PL 9.90 2018-04-20 yes
0 JMC GOLD April 2019 IN 1000 2020-01-16 no
We can use merge here:
l = ['Month', 'Year', 'Type', 'Name']
B = B.merge(A[l], on=l, indicator=True, how='outer') \
     .loc[lambda x: x['_merge'] == 'left_only'].copy()
# you can drop the indicator column afterwards: B = B.drop(columns='_merge')
Name Type Month Year country Amount Expiration Paid _merge
0 JONS GOLD March 2018 PL 500.0 2019-10-17 yes left_only
1 ABBY BRONZE March 2019 AU 60000.0 2019-02-02 yes left_only
2 BUYT GOLD March 2018 BR 50.0 2018-03-22 no left_only
4 JAYB PURPLE March 2019 PL 9.9 2018-04-20 yes left_only
6 JMC GOLD April 2019 IN 1000.0 2020-01-16 no left_only
I tried using a MultiIndex to do the same thing.
cols =['Month', 'Year','Type', 'Name']
index1 = pd.MultiIndex.from_arrays([df1[col] for col in cols])
index2 = pd.MultiIndex.from_arrays([df2[col] for col in cols])
df2 = df2.loc[~index2.isin(index1)]
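The same idea packaged as a small reusable helper (a sketch; anti_join is a hypothetical name, and pd.MultiIndex.from_frame requires pandas >= 0.24):

import pandas as pd

def anti_join(left, right, cols):
    # Return rows of `left` whose `cols` combination never appears in `right`
    left_keys = pd.MultiIndex.from_frame(left[cols])
    right_keys = pd.MultiIndex.from_frame(right[cols])
    return left[~left_keys.isin(right_keys)]

# Rows of B with no (Month, Year, Type, Name) match in A
result = anti_join(B, A, ['Month', 'Year', 'Type', 'Name'])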
