Imputing Values Based on FirstYear and LastYear in Long Table Format - python

I have a long table at the firm level that has each firm's first and last active year and its zip code.
import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 2018],
                   'LastYear': [2021, 2022, 2019],
                   'Zipcode': ['00000', '00001', '00003']})
Firm  FirstYear  LastYear  Zipcode
A          2020      2021    00000
B          2019      2022    00001
C          2018      2019    00003
I want to get panel data that has the zipcode for every active year. Ideally, I would like a wide table that imputes the Zipcode value for the first year, the last year, and every year in between.
It should look like this:
     2020   2021   2019   2022   2018
A   00000  00000
B   00001  00001  00001  00001
C                 00003         00003
I have some code that creates the long table row by row, but I have many millions of rows and it takes a long time. What's the best way, in terms of performance and memory use, to transform the long table I have and impute every year's zipcode value in pandas?
Thanks in advance.
Responding to the answer's update:
Imagine there is a firm whose first and last years don't overlap with those of the other firms.
df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 1997],
                   'LastYear': [2021, 2022, 2008],
                   'Zipcode': ['00000', '00001', '00003']})
The output from the code looks like this:
Firm   2020   2021   2019   2022   1997   2008
A     00000  00000
B     00001  00001  00001  00001
C                                 00003  00003

Here is a solution with pd.melt()
# Melt FirstYear/LastYear into a single 'value' column, then pivot
# the years into columns, one row per firm.
d = (pd.melt(df, id_vars=['Firm', 'Zipcode'])
       .set_index(['Firm', 'value'])['Zipcode']
       .unstack(level=1))

# ffill propagates each zipcode rightward across the (sorted) year
# columns; the mask keeps only cells lying between a firm's first and
# last year, where both the forward and backward fills are non-null.
d = (d.ffill(axis=1)
       .where(d.ffill(axis=1).notna() &
              d.bfill(axis=1).notna())
       .reindex(df[['FirstYear', 'LastYear']].stack().unique(), axis=1))
Original Answer:
(pd.melt(df, id_vars=['Firm', 'Zipcode'])
   .set_index(['Firm', 'value'])['Zipcode']
   .unstack(level=1)
   .reindex(df[['FirstYear', 'LastYear']].stack().unique(), axis=1))
Output:
value   2020   2021   2019   2022   2018
Firm
A      00000  00000    NaN    NaN    NaN
B      00001  00001  00001  00001    NaN
C        NaN    NaN  00003    NaN  00003
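Since the question is explicitly about performance and memory on millions of rows, here is a minimal alternative sketch (not from the answer above) that builds the full firm-year panel with NumPy's repeat instead of per-row loops. It assumes the expanded long table fits in memory, and unlike the melt approach it also produces years that never appear in any firm's FirstYear/LastYear columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 2018],
                   'LastYear': [2021, 2022, 2019],
                   'Zipcode': ['00000', '00001', '00003']})

# Number of active years per firm, e.g. 2020-2021 -> 2.
n_years = (df['LastYear'] - df['FirstYear'] + 1).to_numpy()

# Repeat each firm/zipcode once per active year, then enumerate the years.
long_panel = pd.DataFrame({
    'Firm': np.repeat(df['Firm'].to_numpy(), n_years),
    'Zipcode': np.repeat(df['Zipcode'].to_numpy(), n_years),
    'Year': np.concatenate([np.arange(f, l + 1) for f, l
                            in zip(df['FirstYear'], df['LastYear'])]),
})

# Pivot to the wide firm-by-year panel if that shape is needed.
wide = long_panel.pivot(index='Firm', columns='Year', values='Zipcode')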

Related

I want to filter rows from a data frame where the year is 2020 or 2021, using the re.search and re.match functions.

Data Frame:
   Unnamed: 0        date          target                                              insult                                              tweet  year
0           1  2014-10-09  thomas-frieden                                                fool  Can you believe this fool, Dr. Thomas Frieden ...  2014
1           2  2014-10-09  thomas-frieden                                                DOPE  Can you believe this fool, Dr. Thomas Frieden ...  2014
2           3  2015-06-16     politicians                              all talk and no action  Big time in U.S. today - MAKE AMERICA GREAT AG...  2015
3           4  2015-06-24      ben-cardin  It's politicians like Cardin that have destroy...  Politician #SenatorCardin didn't like that I s...  2015
4           5  2015-06-24      neil-young                                     total hypocrite  For the nonbeliever, here is a photo of #Neily...  2015
I want the data frame that contains only the rows for the matching years, using the search and match methods.
# str.contains applies re.search; the '== True' comparison is redundant.
df_filtered = df.loc[df['year'].astype(str).str.contains('2014|2015', regex=True)]
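Note that this filters on the 2014/2015 years actually present in the sample data. If you specifically need re.search versus re.match semantics, as the question asks: str.contains uses re.search under the hood, while str.match anchors at the start of the string like re.match. A small sketch under that assumption:

years = df['year'].astype(str)

# re.search semantics: the pattern may occur anywhere in the string.
df_search = df.loc[years.str.contains(r'2014|2015')]

# re.match semantics: the pattern must match from the start of the string.
df_match = df.loc[years.str.match(r'2014|2015')]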

Aggregating similar rows in Pandas

I've got a dataframe that's currently aggregated by zip code, and looks similar to this:
Year  Organization  State    Zip  Number_of_people
2021             A     NJ  07090                 5
2020             B     AZ  09876                 3
2021             A     NJ  01234                 2
2021             C     VA  23456                 7
2019             A     NJ  05385                 1
I want to aggregate the dataframe and the Number_of_people column by state instead, combining rows that are identical aside from Number_of_people, so that the data above instead looks like this:
Year  Organization  State  Number_of_people
2021             A     NJ                 7
2020             B     AZ                 3
2021             C     VA                 7
2019             A     NJ                 1
In other words, if rows are identical in all columns EXCEPT Number_of_people, I want to combine the rows and add up Number_of_people.
I'm stuck on how to approach this problem after deleting the Zip column -- I think I need to group by Year, Organization, and State, but I'm not sure what to do after that.
A more pythonic version that selects only the Number_of_people column, so the Zip column is ignored:
df.groupby(['Year', 'Organization', 'State'], as_index=False)['Number_of_people'].sum()
A more pythonic version that keeps Zip in the frame (note that .sum() will then also try to sum the Zip column):
df.groupby(['Year', 'Organization', 'State'], as_index=False).sum()
You don't have to drop Zip first if you don't want to; use the syntax below.
import io
import pandas as pd

data = '''Year Organization State Zip Number_of_people
2021 A NJ 07090 5
2020 B AZ 09876 3
2021 A NJ 01234 2
2021 C VA 23456 7
2019 A NJ 05385 1'''

# dtype={'Zip': str} keeps the leading zeros in the zip codes.
df = pd.read_csv(io.StringIO(data), sep=r'\s+', engine='python',
                 dtype={'Zip': str})

(df[['Year', 'Organization', 'State', 'Number_of_people']]
   .groupby(['Year', 'Organization', 'State'])
   .sum()
   .reset_index())
Output
   Year Organization State  Number_of_people
0  2019            A    NJ                 1
1  2020            B    AZ                 3
2  2021            A    NJ                 7
3  2021            C    VA                 7
If you do want to drop the zip code first, then use this:
df.drop(columns='Zip').groupby(['Year', 'Organization', 'State']).sum().reset_index()
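One more variant worth knowing (available since pandas 0.25) is named aggregation, which only touches the columns you name, so the Zip column can never be summed by accident:

out = (df.groupby(['Year', 'Organization', 'State'], as_index=False)
         .agg(Number_of_people=('Number_of_people', 'sum')))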

Merging DataFrames with "uneven" data

Excuse the title, I'm not even sure how to label what I'm trying to do. I have data in a DataFrame that looks like this:
Name    Month  Status
------  -----  ------
Bob     Jan    Good
Bob     Feb    Good
Bob     Mar    Bad
Martha  Feb    Bad
John    Jan    Good
John    Mar    Bad
Not every 'Name' has a row for every 'Month'. What I want to get is:
Name    Month  Status
------  -----  ------
Bob     Jan    Good
Bob     Feb    Good
Bob     Mar    Bad
Martha  Jan    N/A
Martha  Feb    Bad
Martha  Mar    N/A
John    Jan    Good
John    Feb    N/A
John    Mar    Bad
Where the missing months are filled in with a value in the 'Status' column.
What I've tried so far is exporting all of the unique 'Month' values to a list, converting it to a DataFrame, and then joining/merging the two DataFrames. But I can't get anything to work.
What is the best way to do this?
You have to take advantage of pandas' indexing to reshape the data.
Step 1: create a new index from the unique values of the Name and Month columns:
new_index = pd.MultiIndex.from_product(
    (df.Name.unique(), df.Month.unique()), names=["Name", "Month"]
)
Step 2: set Name and Month as the new index, reindex with new_index, and reset_index to get your final output:
df.set_index(["Name", "Month"]).reindex(new_index).reset_index()
UPDATE 2021/01/08:
You can use the complete function from pyjanitor; at the moment you have to install the latest development version from github:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import pyjanitor
df.complete("Name", "Month")
You can treat the month as a categorical column, then allow GroupBy to do the heavy lifting:
df['Month'] = pd.Categorical(df['Month'])
df.groupby(['Name', 'Month'], as_index=False).first()
     Name Month Status
0     Bob   Feb   Good
1     Bob   Jan   Good
2     Bob   Mar    Bad
3    John   Feb    NaN
4    John   Jan   Good
5    John   Mar    Bad
6  Martha   Feb    Bad
7  Martha   Jan    NaN
8  Martha   Mar    NaN
The secret sauce here is that pandas treats missing "categories" by inserting a NaN there.
Caveat: This always sorts your data.
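A second caveat: in recent pandas versions, grouping on a categorical column warns that the default of the observed parameter is changing, so to keep the all-combinations behaviour shown above it is safer to pass it explicitly:

df.groupby(['Name', 'Month'], as_index=False, observed=False).first()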
Do a pivot:
# pivot(*df) relied on positional arguments, which newer pandas
# versions reject; the keyword form below works everywhere.
df = (df.pivot(index='Name', columns='Month', values='Status')
        .stack(dropna=False)
        .to_frame('Status')
        .reset_index())
     Name Month Status
0     Bob   Feb   Good
1     Bob   Jan   Good
2     Bob   Mar    Bad
3    John   Feb    NaN
4    John   Jan   Good
5    John   Mar    Bad
6  Martha   Feb    Bad
7  Martha   Jan    NaN
8  Martha   Mar    NaN

Reshaping this pandas dataframe

My current dataframe:
   Year  Abilene, TX   Akron, OH  Albany, GA  Albany, OR
0  2012   141.997500   92.033333  105.662500  116.250833
1  2013   150.175000   95.971667  109.942500  125.361667
2  2014   157.588333   98.930833  109.628333  132.511667
3  2015   161.584167  102.416667  109.717500  142.058333
4  2016   168.106667  107.449167  110.175833  157.204167
I want to reshape it, preferably in place, into a long Year/City/Value format:
Year  City         Value
2012  Abilene, TX  somevalue
2013  Abilene, TX  somevalue
...
and so on for every city.
How do I go about this in an efficient manner?
I figured it out.
pd.melt(df, id_vars="Year", value_vars=df.columns[1:])
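To get the exact City and Value column names from the desired output, melt also accepts var_name and value_name; a small follow-up sketch:

pd.melt(df, id_vars='Year', var_name='City', value_name='Value')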

How to save split data in pandas in reverse order?

You can use this to create the dataframe:
import pandas as pd

dataframe = pd.DataFrame({'release': ['7 June 2013', '2012', '31 January 2013',
                                      'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save it into 3 columns named day, month, and year, using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
The resulting dataframe is:
           release       day    month  year
0      7 June 2013         7     June  2013
1             2012      2012     None  None
2  31 January 2013        31  January  2013
3    February 2008  February     2008  None
4     17 June 2014        17     June  2014
5             2013      2013     None  None
As you can see, it works perfectly when there are 3 strings, but whenever there are fewer than 3 strings, it saves the data in the wrong place.
I have tried split and rsplit; both give the same result.
Any solution to get the data into the right place?
The year comes last and is present in every row, so it should be saved first; then the month, if present, and likewise the day.
You could
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
    ...:     lambda x: pd.Series(x.split()[::-1]))

In [18]: dataframe
Out[18]:
           release  year    month  day
0      7 June 2013  2013     June    7
1             2012  2012      NaN  NaN
2  31 January 2013  2013  January   31
3    February 2008  2008 February  NaN
4     17 June 2014  2014     June   17
5             2013  2013      NaN  NaN
Try reversing the split result. A DataFrame has no .reverse() method, but you can reverse each row's list of parts with the .str slice accessor before expanding:
parts = dataframe['release'].str.split().str[::-1]
dataframe[['year', 'month', 'day']] = pd.DataFrame(parts.tolist(),
                                                   index=dataframe.index)
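Another route, not from the answers above, is a single regex with str.extract: the day and month are optional groups and the four-digit year is mandatory, so each part lands in its own column no matter how many parts a row has. A minimal sketch under that assumption:

# Optional day, optional month name, mandatory 4-digit year.
pattern = r'^(?:(?P<day>\d{1,2})\s+)?(?:(?P<month>[A-Za-z]+)\s+)?(?P<year>\d{4})$'
dataframe[['day', 'month', 'year']] = dataframe['release'].str.extract(pattern)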
