Creating a multi-index from csv census data - python

I would like to create a multi-indexed dataframe so I can calculate values in a more organized way.
I know a MUCH more elegant solution is out there, but I'm struggling to find it. Most of the stuff I've found involves series and tuples. I'm fairly new to pandas (and programming) and this is my first attempt at using/creating multi-indexes.
After downloading the census data as a CSV and creating a dataframe with the pertinent fields, I have:
county housingunits2010 housingunits2012 occupiedunits2010 occupiedunits2012
8001 120 200 50 100
8002 100 200 75 125
And I want to end up with:
id    Year  housingunits  occupiedunits
8001  2010  120           50
      2012  200           100
8002  2010  100           75
      2012  200           125
And then I'd like to be able to add columns of calculated values (i.e. the difference between years, % change) and columns from other dataframes, merging on county and year.
I figured out a workaround with the basic methods that I've learned (see below), but...it certainly isn't elegant. Any suggestion would be appreciated.
First, creating two different dataframes
df3 = df2[["county_id","housingunits2012"]]
df4 = df2[["county_id","housingunits2010"]]
Adding the year column
df3['year'] = np.array(['2012'] * 7)
df4['year'] = np.array(['2010'] * 7)
df3.columns = ['county_id','housingunits','year']
df4.columns = ['county_id','housingunits','year']
Appending
df5 = pd.concat([df3, df4])  # concat stacks the two frames row-wise (DataFrame.append was removed in pandas 2.0)
Writing to csv
df5.to_csv('/Users/ntapia/df5.csv', index = False)
Reading & sorting
df6 = pd.read_csv('/Users/ntapia/df5.csv', index_col=[0, 2])
df6 = df6.sort_index()  # sort_index returns a new frame, so assign it back
Result (actual data):
                housingunits
county_id year
8001      2010        163229
          2012        163986
8005      2010        238457
          2012        239685
8013      2010        127115
          2012        128106
8031      2010        285859
          2012        288191
8035      2010        107056
          2012        109115
8059      2010        230006
          2012        230850
8123      2010         96406
          2012         97525
Thanks!

import re
import pandas as pd

df = df.set_index('county')
# split each column name into a (label, year) tuple, e.g. 'housingunits2010' -> ('housingunits', '2010')
df = df.rename(columns=lambda x: re.search(r'([a-zA-Z_]+)(\d{4})', x).groups())
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['label', 'year'])
s = df.unstack()
s.name = 'count'
print(s)
gives
label          year  county
housingunits   2010  8001      120
                     8002      100
               2012  8001      200
                     8002      200
occupiedunits  2010  8001       50
                     8002       75
               2012  8001      100
                     8002      125
Name: count, dtype: int64
If you want that in a DataFrame call reset_index():
print(s.reset_index())
yields
           label  year  county  count
0   housingunits  2010    8001    120
1   housingunits  2010    8002    100
2   housingunits  2012    8001    200
3   housingunits  2012    8002    200
4  occupiedunits  2010    8001     50
5  occupiedunits  2010    8002     75
6  occupiedunits  2012    8001    100
7  occupiedunits  2012    8002    125
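If you want the layout from the question instead, with county and year as the index and one column per measure, pd.wide_to_long gets there directly. Below is a minimal sketch assuming the original wide column names (housingunits2010, occupiedunits2012, etc.); the small wide frame is built inline purely for illustration:

import pandas as pd

# stand-in for the wide census frame from the question
wide = pd.DataFrame({
    'county': [8001, 8002],
    'housingunits2010': [120, 100],
    'housingunits2012': [200, 200],
    'occupiedunits2010': [50, 75],
    'occupiedunits2012': [100, 125],
})

# stubnames are the label prefixes; the 4-digit suffix becomes the 'year' index level
tidy = pd.wide_to_long(wide,
                       stubnames=['housingunits', 'occupiedunits'],
                       i='county', j='year', sep='', suffix=r'\d{4}').sort_index()
print(tidy)

From there, year-over-year differences or percent changes can be computed per county, and other dataframes can be merged on the (county, year) index.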

Related

Keep observations with two or more consecutive years of data by group

I have a dataset consisting of director_id, match_id, and calyear. I would like to keep only those observations, by director_id and match_id, that have at least 2 consecutive years of data. I have tried a few different ways to do this and haven't been able to get it quite right. The approaches I have tried also required multiple steps and weren't particularly clean.
Here is what I have:
director_id  match_id  calyear
282          1111      2006
282          1111      2007
356          2222      2005
356          2222      2007
600          3333      2010
600          3333      2011
600          3333      2012
600          3355      2013
600          3355      2015
600          3355      2016
753          4444      2005
753          4444      2008
753          4444      2009
Here is what I want:
director_id  match_id  calyear
282          1111      2006
282          1111      2007
600          3333      2010
600          3333      2011
600          3333      2012
600          3355      2015
600          3355      2016
753          4444      2008
753          4444      2009
I started by creating a variable equal to one:
df['tosum'] = 1
And then counted the number of observations where the difference in calyear by group is equal to 1:
df['num_years'] = (
    df.groupby(['director_id', 'match_id'])['tosum'].transform('sum')
      .where(df.groupby(['match_id'])['calyear'].diff() == 1, np.nan)
)
And then I keep all observations with 'num_years' greater than 1.
However, the first observation per director_id match_id gets set equal to NaN. In general, I think I am going about this in a convoluted way...it feels like there should be a simpler way to achieve my goal. Any help is greatly appreciated!
Yes, you need to group by 'director_id' and 'match_id' and then do a transform, but the transform just needs to look at the difference from the neighbouring element in both directions: in one direction check whether it equals 1, in the other whether it equals -1, and then subset using the resulting True/False values.
df = df[
    df.groupby(["director_id", "match_id"])["calyear"].transform(
        lambda x: (x.diff().eq(1)) | (x[::-1].diff().eq(-1))
    )
]
print(df):
director_id match_id calyear
0 282 1111 2006
1 282 1111 2007
4 600 3333 2010
5 600 3333 2011
6 600 3333 2012
8 600 3355 2015
9 600 3355 2016
11 753 4444 2008
12 753 4444 2009
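For reference, here is a minimal, self-contained sketch that rebuilds the sample data above and applies the same transform, so the snippet can be tested in isolation (the inline construction of df is purely illustrative):

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'director_id': [282, 282, 356, 356, 600, 600, 600, 600, 600, 600, 753, 753, 753],
    'match_id':    [1111, 1111, 2222, 2222, 3333, 3333, 3333, 3355, 3355, 3355, 4444, 4444, 4444],
    'calyear':     [2006, 2007, 2005, 2007, 2010, 2011, 2012, 2013, 2015, 2016, 2005, 2008, 2009],
})

# keep a row if its calyear is adjacent to the previous or the next year within its group
mask = df.groupby(['director_id', 'match_id'])['calyear'].transform(
    lambda x: x.diff().eq(1) | x[::-1].diff().eq(-1)
)
print(df[mask])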
A bit late to the party, but here is my solution to this problem:
use the .diff() function, within each director_id/match_id group, to calculate the difference between consecutive rows
use a list comprehension and a conversion to a set to extract the relevant row positions
once you have the necessary positions, use the .loc indexer to select the rows.
Code:
diffs = df.groupby(['director_id', 'match_id'])['calyear'].diff()  # year gaps within each group
indices = list(set(sum([[i - 1, i] for i, d in enumerate(diffs) if d == 1], [])))
new_df = df.loc[sorted(indices)]
print(new_df)
Output:

Combine rows from different tables based on common columns pandas

I have 3 tables/dataframes. All have the same column names. Basically they are dataframes for data from different months.
October (df1 name)
Sales_value Sales_units Unique_Customer_id Countries Month
1000 10 4 1 Oct
20 2 4 3 Oct
November (df2 name)
Sales_value Sales_units Unique_Customer_id Countries Month
2000 1000 40 14 Nov
112 200 30 10 Nov
December (df3 name)
Sales_value Sales_units Unique_Customer_id Countries Month
20009090 4809509 4500 30 Dec
etc. This is dummy data; each table has thousands of rows in reality. How do I combine these 3 tables so that the columns appear only once and all rows are kept, with the October rows first, followed by the November rows, then the December rows? When I use joins I get the column names repeated.
Expected output:
Sales_value Sales_units Unique_Customer_id Countries Month
1000 10 4 1 Oct
20 2 4 3 Oct
2000 1000 40 14 Nov
112 200 30 10 Nov
20009090 4809509 4500 30 Dec
concat combines rows from different tables, aligning them on the common columns:
pd.concat([df1, df2, df3])
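If you also want a fresh 0..n-1 row index rather than each frame's original index repeating, pass ignore_index=True. A minimal sketch, with small stand-ins for the real monthly dataframes:

import pandas as pd

df1 = pd.DataFrame({'Sales_value': [1000, 20], 'Sales_units': [10, 2],
                    'Unique_Customer_id': [4, 4], 'Countries': [1, 3], 'Month': ['Oct', 'Oct']})
df2 = pd.DataFrame({'Sales_value': [2000, 112], 'Sales_units': [1000, 200],
                    'Unique_Customer_id': [40, 30], 'Countries': [14, 10], 'Month': ['Nov', 'Nov']})
df3 = pd.DataFrame({'Sales_value': [20009090], 'Sales_units': [4809509],
                    'Unique_Customer_id': [4500], 'Countries': [30], 'Month': ['Dec']})

# rows keep their October -> November -> December order; ignore_index renumbers them 0..4
combined = pd.concat([df1, df2, df3], ignore_index=True)
print(combined)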

How to fill dataframe's empty/nan cell with conditional column mean

I am trying to fill the (pandas) dataframe's null/empty values using the mean of that specific column.
The data looks like this:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019
.
.
I am trying to fill that empty cell with the mean of the Revenue column where Industry == 'Construction'.
To get our numerical mean value I did:
df.groupby(['Industry'], as_index = False).mean()
I am trying to do something like this to fill up that empty cell in-place:
(df[df['Industry'] == "Construction"]['Revenue']).fillna("$21212121.01", inplace = True)
...but it is not working. Can anyone tell me how to achieve this? Thanks a lot.
Expected Output:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013 $21212121.01
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019 $21212121.01
.
.
The numbers used as averages differ, so two types of average are presented below: the normal average (NaN rows excluded) and the average computed over all cases, including the NaN rows.
df['Revenue'] = df['Revenue'].replace({r'\$': '', ',': ''}, regex=True)
df['Revenue'] = df['Revenue'].astype(float)
df_mean = df.groupby(['Industry'], as_index=False)['Revenue'].mean()
df_mean
Industry Revenue
0 Construction 4.358071e+06
1 Financial Services 8.858420e+06
2 IT Services 1.175702e+07
df_mean_nan = df.groupby(['Industry'], as_index=False)['Revenue'].agg(Sum='sum', Size='size')
df_mean_nan['Mean_nan'] = df_mean_nan['Sum'] / df_mean_nan['Size']
df_mean_nan
Industry Sum Size Mean_nan
0 Construction 13074212.0 5.0 2614842.4
1 Financial Services 17716840.0 2.0 8858420.0
2 IT Services 11757018.0 1.0 11757018.0
Average computed over all cases, including the NaN rows:
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean_nan.loc[df_mean_nan['Industry'] == 'Construction',['Mean_nan']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5387469.0
1 2 Rednimdox Construction 2013 2614842.4
2 3 Lamtone IT Services 2009 11757018.0
3 4 Stripfind Financial Services 2010 12329371.0
4 5 Openjocon Construction 2013 4273207.0
5 6 Villadox Construction 2012 1097353.0
6 7 Sumzoomit Construction 2010 7703652.0
7 8 Abcddd Construction 2019 2614842.4
Normal average (NaN excluded):
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean.loc[df_mean['Industry'] == 'Construction',['Revenue']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5.387469e+06
1 2 Rednimdox Construction 2013 4.358071e+06
2 3 Lamtone IT Services 2009 1.175702e+07
3 4 Stripfind Financial Services 2010 1.232937e+07
4 5 Openjocon Construction 2013 4.273207e+06
5 6 Villadox Construction 2012 1.097353e+06
6 7 Sumzoomit Construction 2010 7.703652e+06
7 8 Abcddd Construction 2019 4.358071e+06
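As an aside, if the NaN-excluded industry mean is all you need, a shorter route (a sketch, not part of the answer above; it assumes Revenue has already been converted to float as shown) is to feed groupby().transform('mean') straight into fillna:

# fill every missing Revenue with the NaN-excluded mean of its own Industry
df['Revenue'] = df['Revenue'].fillna(
    df.groupby('Industry')['Revenue'].transform('mean')
)

This reproduces the "normal average" result for every industry at once, without building a separate lookup table.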

Merge pandas groupBy objects

I have a huge dataset of 292 million rows (6 GB) in CSV format. Pandas' read_csv function does not work for such a big file, so I am reading the data in small chunks (10 million rows) iteratively using this code:
for chunk in pd.read_csv('hugeData.csv', chunksize=10**7):
    # something ...
In the #something I am grouping rows according to some columns. So in each iteration, I get new groupBy objects. I am not able to merge these groupBy objects.
A smaller dummy example is as follows.
Here dummy.csv is a 28-row CSV file containing trade reports between some countries in some years; sitc is a product code and export is the export amount in USD billion. (Please note that the data is fictional.)
year,origin,dest,sitc,export
2000,ind,chn,2146,2
2000,ind,chn,4132,7
2001,ind,chn,2146,3
2001,ind,chn,4132,10
2002,ind,chn,2227,7
2002,ind,chn,4132,7
2000,ind,aus,7777,19
2001,ind,aus,2146,30
2001,ind,aus,4132,12
2002,ind,aus,4133,30
2000,aus,ind,4132,6
2001,aus,ind,2146,8
2001,chn,aus,1777,9
2001,chn,aus,1977,31
2001,chn,aus,1754,12
2002,chn,aus,8987,7
2001,chn,aus,4879,3
2002,aus,chn,3489,7
2002,chn,aus,2092,30
2002,chn,aus,4133,13
2002,aus,ind,0193,6
2002,aus,ind,0289,8
2003,chn,aus,0839,9
2003,chn,aus,9867,31
2003,aus,chn,3442,3
2004,aus,chn,3344,17
2005,aus,chn,3489,11
2001,aus,ind,0893,17
I split it into two 14-row chunks and grouped them by year, origin, dest.
for chunk in pd.read_csv('dummy.csv', chunksize=14):
    xd = chunk.groupby(['origin', 'dest', 'year'])['export'].sum()
    print(xd)
Results :
origin  dest  year
aus     ind   2000     6
              2001     8
chn     aus   2001    40
ind     aus   2000    19
              2001    42
              2002    30
        chn   2000     9
              2001    13
              2002    14
Name: export, dtype: int64
origin  dest  year
aus     chn   2002     7
              2003     3
              2004    17
              2005    11
        ind   2001    17
              2002    14
chn     aus   2001    15
              2002    50
              2003    40
Name: export, dtype: int64
How can I merge the two groupby results? And will merging them create memory issues again on the big data? Judging by the nature of the data, the number of rows should shrink by at least 10-15 times if they are merged properly.
The basic aim is: given an origin country and a dest country, I need to plot the total exports between them year-wise. Querying this every time over the whole data takes a lot of time.
xd = chunk.loc[(chunk.origin == country1) & (chunk.dest == country2)]
Hence I was thinking to save time by arranging the data in a grouped form once.
Any suggestion is greatly appreciated.
You can use pd.concat to join the two groupby results (xd0 and xd1 below, the Series from the first and second chunk) and then apply sum across the columns:
>>> pd.concat([xd0, xd1], axis=1)
                  export  export
origin dest year
aus    ind  2000       6       6
            2001       8       8
chn    aus  2001      40      40
ind    aus  2000      19      19
            2001      42      42
            2002      30      30
       chn  2000       9       9
            2001      13      13
            2002      14      14
>>> pd.concat([xd0, xd1], axis=1).sum(axis=1)
origin  dest  year
aus     ind   2000    12
              2001    16
chn     aus   2001    80
ind     aus   2000    38
              2001    84
              2002    60
        chn   2000    18
              2001    26
              2002    28
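To scale this beyond two chunks, one option (a sketch, not from the answer above) is to collect each chunk's grouped sums in a list and collapse them with a second groupby at the end; each partial result is tiny compared with the raw rows, so memory stays bounded:

import pandas as pd

pieces = []
for chunk in pd.read_csv('dummy.csv', chunksize=14):
    # each piece is a small Series indexed by (origin, dest, year)
    pieces.append(chunk.groupby(['origin', 'dest', 'year'])['export'].sum())

# stack the partial sums and add up duplicate (origin, dest, year) keys
totals = pd.concat(pieces).groupby(level=['origin', 'dest', 'year']).sum()

# yearwise exports for one country pair, ready to plot
print(totals.loc[('ind', 'aus')])

Unlike concat(axis=1), this also works when the chunks do not contain exactly the same groups.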

Pandas Groupby with multiple columns selecting rows with full range of values

I am working with a pandas dataframe. From the code:
contracts.groupby(['State','Year'])['$'].mean()
I have a pandas groupby object with two group layers: State and Year.
State  Year    $
NY     2009    5
       2010   10
       2011    5
       2012   15
NJ     2009    2
       2012   12
DE     2009    1
       2010    2
       2011    3
       2012    6
I would like to look at only those states for which I have data on all the years (i.e. NY and DE, not NJ as it is missing 2010). Is there a way to suppress those nested groups with less than full rank?
After grouping by State and Year and taking the mean,
means = contracts.groupby(['State', 'Year'])['$'].mean()
you could groupby the State alone, and use filter to keep the desired groups:
result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))
For example,
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 15
states = ['NY','NJ','DE']
years = range(2009, 2013)
contracts = pd.DataFrame({
    'State': np.random.choice(states, size=N),
    'Year': np.random.choice(years, size=N),
    '$': np.random.randint(10, size=N)})
means = contracts.groupby(['State', 'Year'])['$'].mean()
result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))
print(result)
yields
State  Year
DE     2009    8
       2010    5
       2011    3
       2012    6
NY     2009    2
       2010    1
       2011    5
       2012    9
Name: $, dtype: int64
Alternatively, you could filter first and then take the mean:
filtered = contracts.groupby(['State']).filter(lambda x: x['Year'].nunique() >= len(years))
result = filtered.groupby(['State', 'Year'])['$'].mean()
but playing with various examples suggests this is typically slower than taking the mean first and then filtering.
