I have a dataset consisting of director_id, match_id, and calyear. I would like to keep only the observations, by director_id and match_id, that have at least two consecutive years of data. I have tried a few different approaches but haven't been able to get it quite right; the things I tried also required multiple steps and weren't particularly clean.
Here is what I have:
director_id  match_id  calyear
282          1111      2006
282          1111      2007
356          2222      2005
356          2222      2007
600          3333      2010
600          3333      2011
600          3333      2012
600          3355      2013
600          3355      2015
600          3355      2016
753          4444      2005
753          4444      2008
753          4444      2009
Here is what I want:
director_id  match_id  calyear
282          1111      2006
282          1111      2007
600          3333      2010
600          3333      2011
600          3333      2012
600          3355      2015
600          3355      2016
753          4444      2008
753          4444      2009
I started by creating a variable equal to one:
df['tosum'] = 1
And then count the number of observations where the difference in calyear by group is equal to 1.
df['num_years'] = (
    df.groupby(['director_id', 'match_id'])['tosum']
      .transform('sum')
      .where(df.groupby(['match_id'])['calyear'].diff() == 1, np.nan)
)
And then I keep all observations with 'num_years' greater than 1.
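(For completeness, the keep step I had in mind was roughly this sketch:)
# keep rows whose group has more than one consecutive-year observation
df = df[df['num_years'] > 1]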
However, the first observation per director_id/match_id group gets set to NaN. In general, I think I am going about this in a convoluted way... it feels like there should be a simpler way to achieve my goal. Any help is greatly appreciated!
Yes, you need to group by 'director_id' and 'match_id' and then do a transform, but the transform only needs to look at the difference to each row's neighbours in both directions: whether the previous year is exactly one less (forward diff equals 1) or the next year is exactly one more (reversed diff equals -1). Then subset using the resulting True/False values.
df = df[
    df.groupby(["director_id", "match_id"])["calyear"].transform(
        # keep a row if the previous year in its group is one less,
        # or the next year in its group is one more
        lambda x: (x.diff().eq(1)) | (x[::-1].diff().eq(-1))
    )
]
print(df)
director_id match_id calyear
0 282 1111 2006
1 282 1111 2007
4 600 3333 2010
5 600 3333 2011
6 600 3333 2012
8 600 3355 2015
9 600 3355 2016
11 753 4444 2008
12 753 4444 2009
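As a follow-up, another common idiom (a sketch, assuming calyear is already sorted within each director_id/match_id group) labels each run of consecutive years and keeps only runs of length two or more; this also copes with a group that contains several separate runs:
# a new run starts whenever the gap to the previous year is not exactly 1
runs = (
    df.groupby(['director_id', 'match_id'])['calyear']
      .transform(lambda x: x.diff().ne(1).cumsum())
)
# keep rows belonging to runs with at least two years
df = df[df.groupby(['director_id', 'match_id', runs])['calyear'].transform('size').ge(2)]
On the sample data above this keeps exactly the rows shown in the expected output.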
A bit late to the party, but here is my solution to this problem:
use .diff() within each director_id/match_id group to calculate the year difference between rows
use a list comprehension and a conversion to a set to collect the positions of the consecutive pairs
once you have the necessary positions, use .loc to select the rows.
Code:
# diff within each group, so years from different groups are never matched
diffs = df.groupby(['director_id', 'match_id'])['calyear'].diff()
indices = sorted(set(sum([[i - 1, i] for i, d in enumerate(diffs) if d == 1], [])))
new_df = df.loc[indices]
print(new_df)
Output: the same nine rows shown in the expected result above.
I have 3 tables/DataFrames. All have the same column names. Basically they are DataFrames for data from different months.
October (df1 name)
Sales_value Sales_units Unique_Customer_id Countries Month
1000 10 4 1 Oct
20 2 4 3 Oct
November (df2 name)
Sales_value Sales_units Unique_Customer_id Countries Month
2000 1000 40 14 Nov
112 200 30 10 Nov
December (df3 name)
Sales_value Sales_units Unique_Customer_id Countries Month
20009090 4809509 4500 30 Dec
etc. This is dummy data; each table has thousands of rows in reality. How do I combine these 3 tables so that the columns appear only once and all rows are kept, with the October rows first, followed by the November rows, then the December rows? When I use joins, the column names get repeated.
Expected output:
Sales_value Sales_units Unique_Customer_id Countries Month
1000 10 4 1 Oct
20 2 4 3 Oct
2000 1000 40 14 Nov
112 200 30 10 Nov
20009090 4809509 4500 30 Dec
concat stacks the rows from the different tables on top of each other, aligning them on the common column names:
pd.concat([df1, df2, df3])
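If the original row numbers of each monthly frame are not meaningful, a small variant (a sketch, using the same df1, df2, df3 as above) renumbers the combined rows:
# stack the three monthly frames vertically; ignore_index gives the result
# a fresh 0..n-1 index instead of repeating each frame's original index
combined = pd.concat([df1, df2, df3], ignore_index=True)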
I am trying to fill the null/empty values in a pandas DataFrame using the mean of that specific column.
The data looks like this:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019
.
.
I am trying to fill that empty cell with the mean of Revenue column where Industry is == 'Construction'.
To get our numerical mean value I did:
df.groupby(['Industry'], as_index = False).mean()
I am trying to do something like this to fill up that empty cell in-place:
(df[df['Industry'] == "Construction"]['Revenue']).fillna("$21212121.01", inplace = True)
...but it is not working. Can anyone tell me how to achieve this? Thanks a lot.
Expected Output:
ID Name Industry Year Revenue
1 Treslam Financial Services 2009 $5,387,469
2 Rednimdox Construction 2013 $21212121.01
3 Lamtone IT Services 2009 $11,757,018
4 Stripfind Financial Services 2010 $12,329,371
5 Openjocon Construction 2013 $4,273,207
6 Villadox Construction 2012 $1,097,353
7 Sumzoomit Construction 2010 $7,703,652
8 Abcddd Construction 2019 $21212121.01
.
.
The two fills below use different values, because we present two types of average: the normal average (NaN rows excluded) and an average whose denominator also counts the NaN rows.
# strip the '$' and the thousands separators, then convert Revenue to float
df['Revenue'] = df['Revenue'].replace({r'\$': '', ',': ''}, regex=True)
df['Revenue'] = df['Revenue'].astype(float)
df_mean = df.groupby(['Industry'], as_index = False)['Revenue'].mean()
df_mean
Industry Revenue
0 Construction 4.358071e+06
1 Financial Services 8.858420e+06
2 IT Services 1.175702e+07
# per-industry sum of Revenue and group size ('size' also counts the NaN rows)
df_mean_nan = df.groupby('Industry')['Revenue'].agg(Sum='sum', Size='size').reset_index()
df_mean_nan['Mean_nan'] = df_mean_nan['Sum'] / df_mean_nan['Size']
df_mean_nan
Industry Sum Size Mean_nan
0 Construction 13074212.0 5.0 2614842.4
1 Financial Services 17716840.0 2.0 8858420.0
2 IT Services 11757018.0 1.0 11757018.0
Fill using the average whose denominator counts the NaN rows:
# fill the missing Revenue rows with the Construction average computed above
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean_nan.loc[df_mean_nan['Industry'] == 'Construction',['Mean_nan']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5387469.0
1 2 Rednimdox Construction 2013 2614842.4
2 3 Lamtone IT Services 2009 11757018.0
3 4 Stripfind Financial Services 2010 12329371.0
4 5 Openjocon Construction 2013 4273207.0
5 6 Villadox Construction 2012 1097353.0
6 7 Sumzoomit Construction 2010 7703652.0
7 8 Abcddd Construction 2019 2614842.4
Normal average (NaN excluded), applied to the original data as an alternative to the fill above:
df.loc[df['Revenue'].isna(),['Revenue']] = df_mean.loc[df_mean['Industry'] == 'Construction',['Revenue']].values
df
ID Name Industry Year Revenue
0 1 Treslam Financial Services 2009 5.387469e+06
1 2 Rednimdox Construction 2013 4.358071e+06
2 3 Lamtone IT Services 2009 1.175702e+07
3 4 Stripfind Financial Services 2010 1.232937e+07
4 5 Openjocon Construction 2013 4.273207e+06
5 6 Villadox Construction 2012 1.097353e+06
6 7 Sumzoomit Construction 2010 7.703652e+06
7 8 Abcddd Construction 2019 4.358071e+06
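For reference, the per-industry fill can also be done in one step (a sketch, assuming Revenue has already been cleaned and converted to float as above): combine fillna with a group-wise transform so every industry's missing values get that industry's own mean.
# fill each missing Revenue with the mean of its own Industry group
df['Revenue'] = df['Revenue'].fillna(df.groupby('Industry')['Revenue'].transform('mean'))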
I have a huge dataset of 292 million rows (6 GB) in CSV format. pandas' read_csv is not working for such a big file, so I am reading the data iteratively in small chunks (10 million rows at a time) using this code:
for chunk in pd.read_csv('hugeData.csv', chunksize=10**7):
    # something ...
In the # something I am grouping rows according to some columns, so in each iteration I get a new grouped result. I am not able to merge these results.
A smaller dummy example is as follows :
Here dummy.csv is a 28-row CSV file of trade reports between some countries in some years. sitc is a product code and export is the export amount in USD billion. (Please note that the data is fictional.)
year,origin,dest,sitc,export
2000,ind,chn,2146,2
2000,ind,chn,4132,7
2001,ind,chn,2146,3
2001,ind,chn,4132,10
2002,ind,chn,2227,7
2002,ind,chn,4132,7
2000,ind,aus,7777,19
2001,ind,aus,2146,30
2001,ind,aus,4132,12
2002,ind,aus,4133,30
2000,aus,ind,4132,6
2001,aus,ind,2146,8
2001,chn,aus,1777,9
2001,chn,aus,1977,31
2001,chn,aus,1754,12
2002,chn,aus,8987,7
2001,chn,aus,4879,3
2002,aus,chn,3489,7
2002,chn,aus,2092,30
2002,chn,aus,4133,13
2002,aus,ind,0193,6
2002,aus,ind,0289,8
2003,chn,aus,0839,9
2003,chn,aus,9867,31
2003,aus,chn,3442,3
2004,aus,chn,3344,17
2005,aus,chn,3489,11
2001,aus,ind,0893,17
I split it into two 14-row chunks and grouped each according to origin, dest, and year.
for chunk in pd.read_csv('dummy.csv', chunksize=14):
    xd = chunk.groupby(['origin', 'dest', 'year'])['export'].sum()
    print(xd)
Results :
origin dest year
aus ind 2000 6
2001 8
chn aus 2001 40
ind aus 2000 19
2001 42
2002 30
chn 2000 9
2001 13
2002 14
Name: export, dtype: int64
origin dest year
aus chn 2002 7
2003 3
2004 17
2005 11
ind 2001 17
2002 14
chn aus 2001 15
2002 50
2003 40
Name: export, dtype: int64
How can I merge the two grouped results?
Will merging them create memory issues again on the big data? Judging by the nature of the data, if merged properly the number of rows should shrink by at least a factor of 10-15.
The basic aim is :
Given origin country and dest country,
I need to plot total exports between them yearwise.
Querying this every time over the whole data takes a lot of time:
xd = chunk.loc[(chunk.origin == country1) & (chunk.dest == country2)]
Hence I was thinking of saving time by arranging the data in grouped form once.
Any suggestion is greatly appreciated.
You can use pd.concat to join the groupby results (here xd0 and xd1, the two Series printed above) and then apply sum:
>>> pd.concat([xd0,xd1],axis=1)
export export
origin dest year
aus ind 2000 6 6
2001 8 8
chn aus 2001 40 40
ind aus 2000 19 19
2001 42 42
2002 30 30
chn 2000 9 9
2001 13 13
2002 14 14
>>> pd.concat([xd0,xd1],axis=1).sum(axis=1)
origin dest year
aus ind 2000 12
2001 16
chn aus 2001 80
ind aus 2000 38
2001 84
2002 60
chn 2000 18
2001 26
2002 28
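To scale this beyond two chunks, one option (a sketch, assuming the same hugeData.csv and column names as in the question) is to collect the per-chunk sums in a list and re-aggregate them on the same index levels; the aggregated result can then be queried or plotted cheaply for any origin/dest pair.
import pandas as pd

partial_sums = []
for chunk in pd.read_csv('hugeData.csv', chunksize=10**7):
    # sum exports within this chunk only
    partial_sums.append(chunk.groupby(['origin', 'dest', 'year'])['export'].sum())

# stack the partial results and sum again across chunks for each (origin, dest, year)
result = pd.concat(partial_sums).groupby(level=['origin', 'dest', 'year']).sum()

# year-wise exports for one pair, e.g. ind -> chn, is now a cheap lookup
result.loc[('ind', 'chn')].plot()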
I am working with a pandas dataframe. From the code:
contracts.groupby(['State','Year'])['$'].mean()
I have a pandas groupby object with two group layers: State and Year.
State / Year / $
NY 2009 5
2010 10
2011 5
2012 15
NJ 2009 2
2012 12
DE 2009 1
2010 2
2011 3
2012 6
I would like to look at only those states for which I have data on all the years (i.e. NY and DE, not NJ as it is missing 2010). Is there a way to suppress those nested groups with less than full rank?
After grouping by State and Year and taking the mean,
means = contracts.groupby(['State', 'Year'])['$'].mean()
you could groupby the State alone, and use filter to keep the desired groups:
result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))
For example,
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 15
states = ['NY','NJ','DE']
years = range(2009, 2013)
contracts = pd.DataFrame({
    'State': np.random.choice(states, size=N),
    'Year': np.random.choice(years, size=N),
    '$': np.random.randint(10, size=N)})
means = contracts.groupby(['State', 'Year'])['$'].mean()
result = means.groupby(level='State').filter(lambda x: len(x)>=len(years))
print(result)
yields
State Year
DE 2009 8
2010 5
2011 3
2012 6
NY 2009 2
2010 1
2011 5
2012 9
Name: $, dtype: int64
Alternatively, you could filter first and then take the mean:
filtered = contracts.groupby(['State']).filter(lambda x: x['Year'].nunique() >= len(years))
result = filtered.groupby(['State', 'Year'])['$'].mean()
but playing with various examples suggests this is typically slower than taking the mean first and then filtering.
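Another option worth mentioning (a sketch, using the same means Series as above, and assuming every year occurs for at least one state) pivots the years into columns, drops any state with a missing year, and stacks back:
# states missing any year show up as NaN in the wide layout and get dropped
result = means.unstack('Year').dropna().stack()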