Merge pandas groupBy objects - python

I have a huge dataset of 292 million rows (6 GB) in CSV format. pandas' read_csv function does not work for such a big file, so I am reading the data in smaller chunks (10 million rows) iteratively using this code:
for chunk in pd.read_csv('hugeData.csv', chunksize=10**7):
    # something ...
In the #something part I group rows by some columns, so each iteration produces a new grouped result. I am not able to merge these groupBy objects.
A smaller dummy example is as follows:
Here dummy.csv is a 28-row CSV file containing a trade report between some countries over some years. sitc is a product code and export is the export amount in billions of USD. (Please note that the data is fictional.)
year,origin,dest,sitc,export
2000,ind,chn,2146,2
2000,ind,chn,4132,7
2001,ind,chn,2146,3
2001,ind,chn,4132,10
2002,ind,chn,2227,7
2002,ind,chn,4132,7
2000,ind,aus,7777,19
2001,ind,aus,2146,30
2001,ind,aus,4132,12
2002,ind,aus,4133,30
2000,aus,ind,4132,6
2001,aus,ind,2146,8
2001,chn,aus,1777,9
2001,chn,aus,1977,31
2001,chn,aus,1754,12
2002,chn,aus,8987,7
2001,chn,aus,4879,3
2002,aus,chn,3489,7
2002,chn,aus,2092,30
2002,chn,aus,4133,13
2002,aus,ind,0193,6
2002,aus,ind,0289,8
2003,chn,aus,0839,9
2003,chn,aus,9867,31
2003,aus,chn,3442,3
2004,aus,chn,3344,17
2005,aus,chn,3489,11
2001,aus,ind,0893,17
I split it into two chunks of 14 rows each and grouped them by origin, dest and year.
for chunk in pd.read_csv('dummy.csv', chunksize=14):
    xd = chunk.groupby(['origin','dest','year'])['export'].sum()
    print(xd)
Results :
origin  dest  year
aus     ind   2000     6
              2001     8
chn     aus   2001    40
ind     aus   2000    19
              2001    42
              2002    30
        chn   2000     9
              2001    13
              2002    14
Name: export, dtype: int64
origin  dest  year
aus     chn   2002     7
              2003     3
              2004    17
              2005    11
        ind   2001    17
              2002    14
chn     aus   2001    15
              2002    50
              2003    40
Name: export, dtype: int64
How can I merge the two GroupBy objects?
Will merging them again create memory issues with the big data? Judging by the nature of the data, a proper merge should reduce the number of rows by at least 10-15 times.
The basic aim is: given an origin country and a dest country, I need to plot the total exports between them year by year.
Querying this every time over the whole data takes a lot of time:
xd = chunk.loc[(chunk.origin == country1) & (chunk.dest == country2)]
Hence I was thinking of saving time by arranging the data in grouped form once.
Any suggestion is greatly appreciated.

You can use pd.concat to join the groupby results (here xd0 and xd1, the per-chunk sums) side by side and then apply sum across the columns:
>>> pd.concat([xd0,xd1],axis=1)
                   export  export
origin  dest  year
aus     ind   2000       6       6
              2001       8       8
chn     aus   2001      40      40
ind     aus   2000      19      19
              2001      42      42
              2002      30      30
        chn   2000       9       9
              2001      13      13
              2002      14      14
>>> pd.concat([xd0,xd1],axis=1).sum(axis=1)
origin  dest  year
aus     ind   2000    12
              2001    16
chn     aus   2001    80
ind     aus   2000    38
              2001    84
              2002    60
        chn   2000    18
              2001    26
              2002    28
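As a minimal sketch of how the per-chunk sums could be combined over a whole file, one option is to collect each chunk's grouped result, concatenate the pieces row-wise, and group again on the index levels. The inline CSV_TEXT (six rows taken from the question's dummy data), the chunk size of 3, and the names keys, partials and total are assumptions made only for this illustration.

```python
import io
import pandas as pd

# Six rows from the question's dummy.csv, inlined so the sketch is self-contained.
CSV_TEXT = """year,origin,dest,sitc,export
2000,ind,chn,2146,2
2000,ind,chn,4132,7
2001,chn,aus,1777,9
2001,chn,aus,1977,31
2001,chn,aus,1754,12
2001,chn,aus,4879,3
"""

keys = ['origin', 'dest', 'year']
partials = []  # one small Series of sums per chunk

# chunksize=3 stands in for the much larger chunk size used on the real file.
for chunk in pd.read_csv(io.StringIO(CSV_TEXT), chunksize=3):
    partials.append(chunk.groupby(keys)['export'].sum())

# Stack the partial sums, then add up rows that share the same (origin, dest, year).
total = pd.concat(partials).groupby(level=keys).sum()
print(total)
```

Only the per-chunk sums are kept in memory, so the combined result should stay much smaller than the raw data. The yearly exports for one pair of countries can then be looked up with something like total.loc[('chn', 'aus')].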

Related

Replace NA in DataFrame for multiple columns with mean per country

I want to replace NA values with the mean of the same column over the other years.
Note: To replace NA values in the Canada data, I want to use only the mean for Canada, not the mean of the whole dataset.
Here's a sample dataframe filled with random numbers, with some NA values as I find them in my dataframe:
Country | Inhabitants | Year | Area | Cats | Dogs
Canada | 38 000 000 | 2021 | 4 | 32 | 21
Canada | 37 000 000 | 2020 | 4 | NA | 21
Canada | 36 000 000 | 2019 | 3 | 32 | 21
Canada | NA | 2018 | 2 | 32 | 21
Canada | 34 000 000 | 2017 | NA | 32 | 21
Canada | 35 000 000 | 2016 | 3 | 32 | NA
Brazil | 212 000 000 | 2021 | 5 | 32 | 21
Brazil | 211 000 000 | 2020 | 4 | NA | 21
Brazil | 210 000 000 | 2019 | NA | 32 | 21
Brazil | 209 000 000 | 2018 | 4 | 32 | 21
Brazil | NA | 2017 | 2 | 32 | 21
Brazil | 207 000 000 | 2016 | 4 | 32 | NA
What's the easiest way with pandas to replace those NA values with the mean of the other years? And is it possible to write code that goes through every NA and replaces it (for Inhabitants, Area, Cats and Dogs at once)?
Note: The example is based on the additional data source from your comments.
To replace the NA values in multiple columns with the mean, you can combine the following three methods:
- fillna() (to iterate per column, axis should be 0, which is the default value of fillna())
- groupby()
- transform()
Create data frame from your example:
df = pd.read_excel('https://happiness-report.s3.amazonaws.com/2021/DataPanelWHR2021C2.xls')
Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect
Canada | 2005 | 7.41805 | 10.6518 | 0.961552 | 71.3 | 0.957306 | 0.25623 | 0.502681 | 0.838544 | 0.233278
Canada | 2007 | 7.48175 | 10.7392 | nan | 71.66 | 0.930341 | 0.249479 | 0.405608 | 0.871604 | 0.25681
Canada | 2008 | 7.4856 | 10.7384 | 0.938707 | 71.84 | 0.926315 | 0.261585 | 0.369588 | 0.89022 | 0.202175
Canada | 2009 | 7.48782 | 10.6972 | 0.942845 | 72.02 | 0.915058 | 0.246217 | 0.412622 | 0.867433 | 0.247633
Canada | 2010 | 7.65035 | 10.7165 | 0.953765 | 72.2 | 0.933949 | 0.230451 | 0.41266 | 0.878868 | 0.233113
Call fillna() and iterate over all columns grouped by name of country:
df = df.fillna(df.groupby('Country name').transform('mean'))
Check your result for Canada:
df[df['Country name'] == 'Canada']
Country name | year | Life Ladder | Log GDP per capita | Social support | Healthy life expectancy at birth | Freedom to make life choices | Generosity | Perceptions of corruption | Positive affect | Negative affect
Canada | 2005 | 7.41805 | 10.6518 | 0.961552 | 71.3 | 0.957306 | 0.25623 | 0.502681 | 0.838544 | 0.233278
Canada | 2007 | 7.48175 | 10.7392 | 0.93547 | 71.66 | 0.930341 | 0.249479 | 0.405608 | 0.871604 | 0.25681
Canada | 2008 | 7.4856 | 10.7384 | 0.938707 | 71.84 | 0.926315 | 0.261585 | 0.369588 | 0.89022 | 0.202175
Canada | 2009 | 7.48782 | 10.6972 | 0.942845 | 72.02 | 0.915058 | 0.246217 | 0.412622 | 0.867433 | 0.247633
Canada | 2010 | 7.65035 | 10.7165 | 0.953765 | 72.2 | 0.933949 | 0.230451 | 0.41266 | 0.878868 | 0.233113
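This also works, but note that it fills each gap with the mean of the whole column rather than the mean of the respective country: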
In [2]:
df = pd.read_excel('DataPanelWHR2021C2.xls')
In [3]:
# Check for number of null values in df
df.isnull().sum()
Out [3]:
Country name                          0
year                                  0
Life Ladder                           0
Log GDP per capita                   36
Social support                       13
Healthy life expectancy at birth     55
Freedom to make life choices         32
Generosity                           89
Perceptions of corruption           110
Positive affect                      22
Negative affect                      16
dtype: int64
SOLUTION
In [4]:
# Fill NULL values with the overall mean of each numeric column
# (numeric_only=True skips the text column 'Country name')
df.fillna(df.mean(numeric_only=True), inplace=True)
In [5]:
# 2nd check for number of null values
df.isnull().sum()
Out [5]: No more NULL values
Country name                        0
year                                0
Life Ladder                         0
Log GDP per capita                  0
Social support                      0
Healthy life expectancy at birth    0
Freedom to make life choices        0
Generosity                          0
Perceptions of corruption           0
Positive affect                     0
Negative affect                     0
dtype: int64
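To make the difference between the two kinds of fill concrete, here is a small sketch that compares filling with the overall column mean against filling with the per-country mean. The toy frame below is hypothetical (it is not taken from the report file), and the names value_cols, overall and per_country are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the report file.
df = pd.DataFrame({
    'Country name': ['Canada', 'Canada', 'Canada', 'Brazil', 'Brazil', 'Brazil'],
    'year': [2019, 2020, 2021, 2019, 2020, 2021],
    'Cats': [30.0, np.nan, 34.0, 10.0, 12.0, np.nan],
})

# Overall column mean: every gap gets the same value, regardless of country.
overall = df.fillna(df.mean(numeric_only=True))

# Per-country mean: each gap gets the mean of that country's own rows.
value_cols = ['Cats']
per_country = df.copy()
per_country[value_cols] = df[value_cols].fillna(
    df.groupby('Country name')[value_cols].transform('mean')
)

print(overall)
print(per_country)
```

Listing the value columns explicitly keeps the year column out of the averaging, which seems safer when the frame also holds columns that should never be imputed.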

Get latest value looked up from other dataframe

My first data frame
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My 2nd data frame has transactions
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want the Price from the 1st data frame to appear in the merged data frame, joined on the common column 'Product_ID'. Note that for Product_ID 101 there are 2 prices, 299.00 and 9898.00. I want the later one, i.e. 9898.0, in the merged data set (since this is the latest price).
Currently my code is not giving the right answer. It gives both:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
   id    name  age  Product_ID Purchased_Product    City   Price
0   1  Olivia   20         101             Watch  Mumbai   299.0
1   1  Olivia   20         101             Watch  Mumbai  9898.0
There is no explicit timestamp so I assume the index is the order of the dataframe. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
   id     name  age  Product_ID Purchased_Product       City    Price
1   1   Olivia   20         101             Watch     Mumbai   9898.0
2   2   Aditya   25           0                NA      Delhi      NaN
3   3     Cory   15         106               Oil  Bangalore    110.0
4   4  Isabell   10           0                NA    Chennai      NaN
5   5  Dominic   30         103             Shoes    Chennai   2999.0
6   6    Tyler   65         104        Smartphone      Delhi  14999.0
7   7   Samuel   35           0                NA    Kolkata      NaN
8   8   Daniel   18           0                NA      Delhi      NaN
9   9   Jeremy   23         107            Laptop     Mumbai  79999.0
Please note the keep = 'last' argument, since we are keeping only the last price registered.
Deduplication should be done before merging if you care about performance or the dataset is huge:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')
In your data frame there is no indicator of the latest entry, so you might first need to remove the first entry for id 101 from the product dataframe as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
It will keep the last entry based on Product_ID and you can do the merge as:
pd.merge(result_product, customer, on='Product_ID')
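A minimal sketch of the deduplicate-then-merge approach, using a cut-down version of the question's frames, could look like the following. It assumes that row order reflects recency (as both answers do), and it uses how='left' so that customers without a matching product are kept. The names latest and customerpur are only illustrative.

```python
import pandas as pd

# Cut-down versions of the question's frames (fewer rows and columns).
product = pd.DataFrame({
    'Product_ID': [101, 102, 103, 101],
    'Price': [299.0, 1350.50, 2999.0, 9898.0],
})
customer = pd.DataFrame({
    'id': [1, 2, 5],
    'name': ['Olivia', 'Aditya', 'Dominic'],
    'Product_ID': [101, 0, 103],
})

# Keep only the last row per Product_ID (assumes row order reflects recency).
latest = product.drop_duplicates(subset=['Product_ID'], keep='last')

# how='left' keeps customers whose Product_ID has no match (Price becomes NaN).
customerpur = customer.merge(latest[['Product_ID', 'Price']],
                             on='Product_ID', how='left')
print(customerpur)
```

With the default inner merge, customers whose Product_ID is 0 would be dropped from the result, so the choice of how depends on whether those rows are wanted.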

80 Gb file - Creating a data frame that submits data based upon a list of counties

I am working with an 80 Gb data set in Python. The data has 30 columns and ~180,000,000 rows.
I am using the chunk size parameter in pd.read_csv to read the data in chunks where I then iterate through the data to create a dictionary of the counties with their associated frequency.
This is where I am stuck. Once I have the list of counties, I want to iterate through the chunks row-by-row again summing the values of 2 - 3 other columns associated with each county and place it into a new DataFrame. This would roughly be 4 cols and 3000 rows which is more manageable for my computer.
I really don't know how to do this, this is my first time working with a large data set in python.
import pandas as pd
from collections import defaultdict

df_chunk = pd.read_csv('file.tsv', sep='\t', chunksize=8000000)
county_dict = defaultdict(int)
for chunk in df_chunk:
    for county in chunk['COUNTY']:
        county_dict[county] += 1
# Note: df_chunk is an iterator and is already exhausted at this point,
# so pd.read_csv would have to be called again before a second pass.
for chunk in df_chunk:
    for row in chunk:
        # I don't know where to go from here
        pass
I expect to be able to make a DataFrame with a column of all the counties, a column for total sales of product "1" per county, another column for sales of product per county, and then more columns of the same as needed.
The idea
I was not sure whether you have data for different counties (e.g. in UK or USA) or countries (in the world), so I decided to have data concerning countries.
The idea is to:
1. Group data from each chunk by country.
2. Generate a partial result for this chunk, as a DataFrame with:
   - sums of each column of interest (per country),
   - number of rows per country.
3. To perform concatenation of partial results (in a moment), each partial result should contain the chunk number, as an additional index level.
4. Concatenate partial results vertically (due to the additional index level, each row has a different index).
5. The final result (total sums and row counts) can be computed as the sum of the above result, grouped by country (discarding the chunk number).
Test data
The source CSV file contains country names and 2 columns to sum (Tab separated):
Country Amount_1 Amount_2
Austria 41 46
Belgium 30 50
Austria 45 44
Denmark 31 42
Finland 42 32
Austria 10 12
France 74 54
Germany 81 65
France 40 20
Italy 54 42
France 51 16
Norway 14 33
Italy 12 33
France 21 30
For test purposes I assumed a chunk size of just 5 rows:
chunksize = 5
Solution
The main processing loop (and preparatory steps) are as follows:
df_chunk = pd.read_csv('Input.csv', sep='\t', chunksize=chunksize)
chunkPartRes = []  # Partial results from each chunk
chunkNo = 0
for chunk in df_chunk:
    chunkNo += 1
    gr = chunk.groupby('Country')
    # Sum the desired columns and size of each group
    res = gr.agg(Amount_1=('Amount_1', 'sum'), Amount_2=('Amount_2', 'sum'))\
        .join(gr.size().rename('Count'))
    # Add top index level (chunk No), then append
    chunkPartRes.append(pd.concat([res], keys=[chunkNo], names=['ChunkNo']))
To concatenate the above partial results into a single DataFrame,
but still with separate results from each chunk, run:
chunkRes = pd.concat(chunkPartRes)
For my test data, the result is:
                 Amount_1  Amount_2  Count
ChunkNo Country
1       Austria        86        90      2
        Belgium        30        50      1
        Denmark        31        42      1
        Finland        42        32      1
2       Austria        10        12      1
        France        114        74      2
        Germany        81        65      1
        Italy          54        42      1
3       France         72        46      2
        Italy          12        33      1
        Norway         14        33      1
And to generate the final result, summing data from all chunks,
but keeping separation by countries, run:
res = chunkRes.groupby(level=1).sum()
The result is:
         Amount_1  Amount_2  Count
Country
Austria        96       102      3
Belgium        30        50      1
Denmark        31        42      1
Finland        42        32      1
France        186       120      4
Germany        81        65      1
Italy          66        75      2
Norway         14        33      1
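If keeping every partial result in a list is undesirable, a possible variant is to fold each chunk's result into a running total as it is produced. This is a sketch rather than part of the solution above: the test data is inlined and space-separated so that the block is self-contained, and the names part and total are illustrative.

```python
import io
import pandas as pd

# The test data from above, inlined (space-separated here instead of tabs).
ROWS = """Country Amount_1 Amount_2
Austria 41 46
Belgium 30 50
Austria 45 44
Denmark 31 42
Finland 42 32
Austria 10 12
France 74 54
Germany 81 65
France 40 20
Italy 54 42
France 51 16
Norway 14 33
Italy 12 33
France 21 30
"""

total = None
for chunk in pd.read_csv(io.StringIO(ROWS), sep=' ', chunksize=5):
    gr = chunk.groupby('Country')
    part = gr[['Amount_1', 'Amount_2']].sum().join(gr.size().rename('Count'))
    # add() aligns on Country; fill_value=0 covers countries missing on one side.
    total = part if total is None else total.add(part, fill_value=0)

# Filling gaps with 0 turns the columns into floats, so convert back to int.
total = total.astype(int)
print(total)
```

This holds only one frame of per-country totals at any time, at the cost of losing the per-chunk breakdown.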
To sum up
Even if we look only at how the number of rows per country is computed, this solution is more "pandasonic" and elegant than using defaultdict and incrementing in a loop that processes each row.
Grouping and counting rows per group works significantly faster than a loop operating on rows.

Pandas aggfunc sum based on multiple columns

I'm trying to sum data from multiple columns in my dataframe by pivoting the table and using aggfunc. My dataframe gives emission data for various regions. I don't want to sum some rows so I make a selection of the rows that I want to sum. The output however is two rows for each column:
one is named True and gives the sum of the rows that I defined (this is the column that I want)
the other is named False and gives the sum of the remainder of the rows that I did not define (this one I would like to drop/omit)
The data is numeric regional data for multiple years so what I want to do is add data from some regions in order to get data for larger regions. The years are listed in columns.
The data looks something like this:
inp = [{'Scenario':'Baseline', 'Region':'CHINA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':5,'1995':10,'2000':15},
{'Scenario':'Baseline', 'Region':'INDIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':6,'1995':11,'2000':16},
{'Scenario':'Baseline', 'Region':'INDONESIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':7,'1995':12,'2000':17},
{'Scenario':'Baseline', 'Region':'KOREA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':8,'1995':13,'2000':18},
{'Scenario':'Baseline', 'Region':'JAPAN', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':9,'1995':14,'2000':19},
{'Scenario':'Baseline', 'Region':'THAILAND', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':10,'1995':15,'2000':20},
{'Scenario':'Baseline', 'Region':'RUSSIA', 'Variable':'Methane', 'Unit':'MtCO2eq', '1990':11,'1995':16,'2000':21}]
dt = pd.DataFrame(inp)
dt
   1990  1995  2000     Region  Scenario     Unit Variable
0     5    10    15      CHINA  Baseline  MtCO2eq  Methane
1     6    11    16      INDIA  Baseline  MtCO2eq  Methane
2     7    12    17  INDONESIA  Baseline  MtCO2eq  Methane
3     8    13    18      KOREA  Baseline  MtCO2eq  Methane
4     9    14    19      JAPAN  Baseline  MtCO2eq  Methane
5    10    15    20   THAILAND  Baseline  MtCO2eq  Methane
6    11    16    21     RUSSIA  Baseline  MtCO2eq  Methane
I run this piece of code:
dt_test = dt.pivot_table(dt, index=['Scenario','Variable','Unit'],
                         columns=[(df['Region'] == 'CHINA') |
                                  (df['Region'] == 'INDIA') |
                                  (df['Region'] == 'INDONESIA') |
                                  (df['Region'] == 'KOREA')],
                         aggfunc=np.sum)
And get this as output:
                            1990        1995        2000
Region                     False True  False True  False True
Scenario Variable Unit
Baseline Methane  MtCO2eq     46   10     76   15    106   20
If someone could help me out with either a way to drop this False column for all the years or another nifty way to get the totals that I want that would be amazing.
Use xs:
print (dt_test.xs(True, axis=1, level=1))
                           1990  1995  2000
Scenario Variable Unit
Baseline Methane  MtCO2eq    26    46    66
But it is better to filter first with isin and boolean indexing (dt is the DataFrame from the question):
df = dt[dt['Region'].isin(['CHINA','INDIA','INDONESIA','KOREA'])]
print (df)
   1990  1995  2000     Region  Scenario     Unit Variable
0     5    10    15      CHINA  Baseline  MtCO2eq  Methane
1     6    11    16      INDIA  Baseline  MtCO2eq  Methane
2     7    12    17  INDONESIA  Baseline  MtCO2eq  Methane
3     8    13    18      KOREA  Baseline  MtCO2eq  Methane
And then aggregate sum per groups:
dt_test = df.groupby(['Scenario','Variable','Unit']).sum()
print (dt_test)
                           1990  1995  2000
Scenario Variable Unit
Baseline Methane  MtCO2eq    26    46    66
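Depending on the pandas version, summing a grouped frame may also try to aggregate the remaining text column (Region) instead of silently dropping it. A sketch that avoids relying on that behaviour is to select the year columns explicitly before summing. The names regions, year_cols, subset and totals are illustrative, and the input is a trimmed copy of the question's data.

```python
import pandas as pd

# Trimmed copy of the question's data (same regions and values).
inp = [
    {'Scenario': 'Baseline', 'Region': 'CHINA', 'Variable': 'Methane', 'Unit': 'MtCO2eq', '1990': 5, '1995': 10, '2000': 15},
    {'Scenario': 'Baseline', 'Region': 'INDIA', 'Variable': 'Methane', 'Unit': 'MtCO2eq', '1990': 6, '1995': 11, '2000': 16},
    {'Scenario': 'Baseline', 'Region': 'INDONESIA', 'Variable': 'Methane', 'Unit': 'MtCO2eq', '1990': 7, '1995': 12, '2000': 17},
    {'Scenario': 'Baseline', 'Region': 'KOREA', 'Variable': 'Methane', 'Unit': 'MtCO2eq', '1990': 8, '1995': 13, '2000': 18},
    {'Scenario': 'Baseline', 'Region': 'JAPAN', 'Variable': 'Methane', 'Unit': 'MtCO2eq', '1990': 9, '1995': 14, '2000': 19},
    {'Scenario': 'Baseline', 'Region': 'THAILAND', 'Variable': 'Methane', 'Unit': 'MtCO2eq', '1990': 10, '1995': 15, '2000': 20},
    {'Scenario': 'Baseline', 'Region': 'RUSSIA', 'Variable': 'Methane', 'Unit': 'MtCO2eq', '1990': 11, '1995': 16, '2000': 21},
]
dt = pd.DataFrame(inp)

regions = ['CHINA', 'INDIA', 'INDONESIA', 'KOREA']
year_cols = ['1990', '1995', '2000']

# Filter the rows first, then sum only the year columns within each group.
subset = dt[dt['Region'].isin(regions)]
totals = subset.groupby(['Scenario', 'Variable', 'Unit'])[year_cols].sum()
print(totals)
```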

Add calculated column to a pandas pivot table

I have created a pandas data frame and then converted it into pivot table.
My pivot table looks like this:
Operators       TotalCB  Qd(cb)  Autopass(cb)
Aircel India         55      11            44
Airtel Ghana         20      17             3
Airtel India         41       9             9
Airtel Kenya          9       4             5
Airtel Nigeria       24      17             7
AT&T USA             18      10             8
I was wondering how to add calculated columns so that I get my pivot table with Autopass% (Autopass(cb)/TotalCB*100) just like we are able to create them in Excel using calculated field option.
I want my pivot table output to be something like below:
Operators       TotalCB  Qd(cb)  Autopass(cb)  Qd(cb)%  Autopass(cb)%
Aircel India         55      11            44      20%            80%
Airtel Ghana         20      17             3      85%            15%
Airtel India         41      29             9      71%            22%
Airtel Kenya          9       4             5      44%            56%
AT&T USA             18      10             8      56%            44%
How do I define a function that calculates the percentage columns, and how do I apply it to my two columns, Qd(cb) and Autopass(cb), to get the additional calculated columns?
This should do it, assuming data is your pivoted dataframe:
data['Autopass(cb)%'] = data['Autopass(cb)'] / data['TotalCB'] * 100
data['Qd(cb)%'] = data['Qd(cb)'] / data['TotalCB'] * 100
Adding a new column to a dataframe is as simple as df['colname'] = new_series. Here we assign the result of your requested calculation; because it is done as a vector operation, it creates a new series.
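As a small runnable sketch of the same idea, the frame below rebuilds three rows of the question's pivot table by hand (an assumption, since the original pivot code is not shown) and adds both percentage columns in a loop, rounded to whole numbers as in the desired output.

```python
import pandas as pd

# Three rows of the question's pivot table, rebuilt by hand for illustration.
data = pd.DataFrame(
    {'TotalCB': [55, 20, 9], 'Qd(cb)': [11, 17, 4], 'Autopass(cb)': [44, 3, 5]},
    index=pd.Index(['Aircel India', 'Airtel Ghana', 'Airtel Kenya'], name='Operators'),
)

for col in ['Qd(cb)', 'Autopass(cb)']:
    # Vectorised division; round to whole percentages as in the desired output.
    data[col + '%'] = (data[col] / data['TotalCB'] * 100).round().astype(int)

print(data)
```

The values are kept as integers here. If a literal percent sign is wanted for display, the columns could be converted to strings afterwards, at the cost of no longer being numeric.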
