How to add conditional row to pandas dataframe - python

I looked for a succinct answer and nothing helped. I am trying to add a row to a dataframe that takes a string for the first column and the column sum for each remaining column. I ran into a scalar issue, so I tried building the desired row as a Series and converting it to a dataframe, but apparently I was adding four rows with one column value each instead of one row with the four column values.
My code:
import os
import pandas as pd

def country_csv():
    # loop through absolute paths of each file in source
    for filename in os.listdir(source):
        filepath = os.path.join(source, filename)
        if not os.path.isfile(filepath):
            continue
        df = pd.read_csv(filepath)
        df = df.groupby(['Country']).sum()
        df.reset_index()  # note: this is a no-op unless assigned back (df = df.reset_index())
        print(df)
        # df.to_csv(os.path.join(path1, filename))
Sample dataframe:
             Confirmed  Deaths  Recovered
Country
Afghanistan        299       7         10
Albania            333      20         99
I would like to see this as the first row:
World              632      27        109

import pandas as pd

df
             Confirmed  Deaths  Recovered
Country
Afghanistan        299       7         10
Albania            333      20         99

df.loc['World'] = [df['Confirmed'].sum(), df['Deaths'].sum(), df['Recovered'].sum()]
df.sort_values(by=['Confirmed'], ascending=False)
             Confirmed  Deaths  Recovered
Country
World              632      27        109
Albania            333      20         99
Afghanistan        299       7         10
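As a side note, since every remaining column here is numeric, the whole row can be built with a single sum() call; a minimal equivalent sketch, not from the original answer:
df.loc['World'] = df.sum()  # sums each numeric column at once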

IIUC, you can build a dict of the column sums, then pass it back into a DataFrame and concat:
data = df.sum(axis=0).to_dict()
data.update({'Country': 'World'})
df2 = pd.concat([pd.DataFrame(data, index=[0]).set_index('Country'), df], axis=0)
print(df2)
             Confirmed  Deaths  Recovered
Country
World              632      27        109
Afghanistan        299       7         10
Albania            333      20         99
Or a one-liner using assign and transpose:
df2 = pd.concat(
    [df.sum(axis=0).to_frame().T.assign(Country="World").set_index("Country"), df],
    axis=0,
)
print(df2)
             Confirmed  Deaths  Recovered
Country
World              632      27        109
Afghanistan        299       7         10
Albania            333      20         99
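Either approach slots back into the question's loop just before the to_csv call; a rough sketch, assuming source and path1 are defined elsewhere as in the question:
import os
import pandas as pd

def country_csv():
    for filename in os.listdir(source):
        filepath = os.path.join(source, filename)
        if not os.path.isfile(filepath):
            continue
        df = pd.read_csv(filepath).groupby('Country').sum()
        df.loc['World'] = df.sum()  # add the total row
        df = df.sort_values(by=['Confirmed'], ascending=False)
        df.to_csv(os.path.join(path1, filename))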

Related

Removing duplicate columns in pandas dataframe

I am trying to parse data, but duplicate names have started appearing under the columns.
Code:
import pandas as pd

def parseData():
    countries = pd.read_csv('Int_Monthly_Visitor.csv')
    cols = [e.strip() for e in list(countries.columns)]
    regions = {
        'Others': cols[30:]
    }
    countries.rename(str.strip, axis='columns', inplace=True)
    regionlist = pd.DataFrame({'Columns': regions['Others'],
                               'Non-Null count': countries.loc[0:120, regions['Others']].count()})
    print(regionlist)

parseData()
Output:
                 Columns  Non-Null count
USA                  USA             121
Canada            Canada             121
Australia      Australia             121
New Zealand  New Zealand             121
Africa            Africa             121
Expected output:
    Columns  Non-Null count
        USA             121
     Canada             121
  Australia             121
New Zealand             121
     Africa             121
Is there a solution to remove the duplicate names under columns?
Since you're reading your dataframe from a .csv file, you can use pandas.read_csv and define the usecols argument as shown below:
countries = pd.read_csv('Int_Monthly_Visitor.csv', usecols=lambda c: not c.startswith('Unnamed:'))
>>> print(countries)
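For what it's worth, the duplication in the question's output comes from the index rather than from the CSV itself: countries.loc[0:120, regions['Others']].count() returns a Series indexed by the column names, so the resulting frame shows those names both as the index and in the 'Columns' column. Printing without the index gives the expected layout:
print(regionlist.to_string(index=False))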

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows for each country where the year is increased by 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values, though; I just want to append the new ones.
This is how the dataframe looks:
Preferably, I would also like to create a loop that runs the code a certain number of times.
Super grateful for any help!
If you need to copy the values from the dataframe as an example, here they are:
          Country    avgTemp  year
0     Afghanistan  14.481583  2012
1          Africa  24.725917  2012
2         Albania  13.768250  2012
3         Algeria  23.954833  2012
4  American Samoa  27.201417  2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation in the new dataframe (add 20 years, multiply the temperature by a constant or an array, etc.), and then use concat() to append it to the original dataframe:
import pandas as pd

tempChange = 1.15
data = {'Country': ['Afghanistan', 'Africa', 'Albania', 'Algeria', 'American Samoa'],
        'avgTemp': [14, 24, 13, 23, 27],
        'Year': [2012, 2012, 2012, 2012, 2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp'] * tempChange
df_2['Year'] = df['Year'] + 20
df = pd.concat([df, df_2])  # pass ignore_index=True if you don't want repeated index values
print(df)
Output:
          Country  avgTemp  Year
0     Afghanistan    14.00  2012
1          Africa    24.00  2012
2         Albania    13.00  2012
3         Algeria    23.00  2012
4  American Samoa    27.00  2012
0     Afghanistan    16.10  2032
1          Africa    27.60  2032
2         Albania    14.95  2032
3         Algeria    26.45  2032
4  American Samoa    31.05  2032
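To also get the loop mentioned in the question (running the projection a certain number of times), the same copy-and-concat step can be repeated; a minimal sketch, where n_steps is a hypothetical name for the number of 20-year projections:
import pandas as pd

tempChange = 1.15
n_steps = 3  # hypothetical: how many 20-year projections to append
data = {'Country': ['Afghanistan', 'Africa'], 'avgTemp': [14.0, 24.0], 'Year': [2012, 2012]}
df = pd.DataFrame(data)

frames = [df]
for _ in range(n_steps):
    step = frames[-1].copy()       # start from the most recent projection
    step['avgTemp'] *= tempChange  # scale the temperature
    step['Year'] += 20             # advance 20 years
    frames.append(step)

df = pd.concat(frames, ignore_index=True)
print(df)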
Where df is your dataframe name:
df['tempChange'] = df['year'] + 20 * df['avgTemp']
This will add a new column to your df with the logic above (note that operator precedence makes this year + (20 * avgTemp)). I'm not sure I understood your logic correctly, so the math may need some work.
I believe that what you're looking for is:
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20, axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp'] * tempChange, axis=1)
This is how you apply a function to each row.
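For reference, the same two columns can be computed without apply, since pandas arithmetic is vectorized over whole columns:
dfName['newYear'] = dfName['year'] + 20
dfName['tempDiff'] = dfName['avgTemp'] * tempChange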

80 GB file - Creating a data frame that submits data based upon a list of counties

I am working with an 80 GB data set in Python. The data has 30 columns and ~180,000,000 rows.
I am using the chunksize parameter in pd.read_csv to read the data in chunks, where I then iterate through the data to create a dictionary of the counties with their associated frequency.
This is where I am stuck. Once I have the list of counties, I want to iterate through the chunks row by row again, summing the values of 2-3 other columns associated with each county, and place the results into a new DataFrame. That would be roughly 4 columns and 3,000 rows, which is far more manageable for my computer.
I really don't know how to do this; this is my first time working with a large data set in Python.
import pandas as pd
from collections import defaultdict

df_chunk = pd.read_csv('file.tsv', sep='\t', chunksize=8000000)
county_dict = defaultdict(int)
for chunk in df_chunk:
    for county in chunk['COUNTY']:
        county_dict[county] += 1
for chunk in df_chunk:
    for row in chunk:
        pass  # I don't know where to go from here
I expect to be able to make a DataFrame with a column of all the counties, a column for total sales of product "1" per county, another column for sales of product per county, and then more columns of the same as needed.
The idea
I was not sure whether you have data for different counties (e.g. in the UK or USA) or countries (around the world), so I decided to use data concerning countries.
The idea is to:
1. Group the data from each chunk by country.
2. Generate a partial result for the chunk, as a DataFrame with:
   - sums of each column of interest (per country),
   - the number of rows per country.
3. To allow concatenation of the partial results (in a moment), give each partial result the chunk number as an additional index level.
4. Concatenate the partial results vertically (thanks to the additional index level, each row has a distinct index).
5. Compute the final result (total sums and row counts) as the sum of the above, grouped by country (discarding the chunk number).
Test data
The source CSV file contains country names and 2 columns to sum (Tab separated):
Country  Amount_1  Amount_2
Austria        41        46
Belgium        30        50
Austria        45        44
Denmark        31        42
Finland        42        32
Austria        10        12
France         74        54
Germany        81        65
France         40        20
Italy          54        42
France         51        16
Norway         14        33
Italy          12        33
France         21        30
For test purposes I assumed a chunk size of just 5 rows:
chunksize = 5
Solution
The main processing loop (and preparatory steps) are as follows:
df_chunk = pd.read_csv('Input.csv', sep='\t', chunksize=chunksize)
chunkPartRes = []  # Partial results from each chunk
chunkNo = 0
for chunk in df_chunk:
    chunkNo += 1
    gr = chunk.groupby('Country')
    # Sum the desired columns and take the size of each group
    res = gr.agg(Amount_1=('Amount_1', 'sum'), Amount_2=('Amount_2', 'sum'))\
        .join(gr.size().rename('Count'))
    # Add a top index level (chunk No), then append
    chunkPartRes.append(pd.concat([res], keys=[chunkNo], names=['ChunkNo']))
To concatenate the above partial results into a single DataFrame,
but still with separate results from each chunk, run:
chunkRes = pd.concat(chunkPartRes)
For my test data, the result is:
                 Amount_1  Amount_2  Count
ChunkNo Country
1       Austria        86        90      2
        Belgium        30        50      1
        Denmark        31        42      1
        Finland        42        32      1
2       Austria        10        12      1
        France        114        74      2
        Germany        81        65      1
        Italy          54        42      1
3       France         72        46      2
        Italy          12        33      1
        Norway         14        33      1
And to generate the final result, summing data from all chunks,
but keeping separation by countries, run:
res = chunkRes.groupby(level=1).sum()
The result is:
         Amount_1  Amount_2  Count
Country
Austria        96       102      3
Belgium        30        50      1
Denmark        31        42      1
Finland        42        32      1
France        186       120      4
Germany        81        65      1
Italy          66        75      2
Norway         14        33      1
To sum up
Even looking only at how the numbers of rows per country are computed, this solution is more "pandasonic" and elegant than using a defaultdict and incrementing it in a loop over each row.
Grouping and counting rows per group also runs significantly faster than a loop operating on individual rows.
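Mapped back to the question's county data, the same pattern would look roughly like this; COUNTY, PRODUCT_1_SALES and PRODUCT_2_SALES are assumed column names, not taken from the actual 80 GB file:
import pandas as pd

chunkPartRes = []
for chunk in pd.read_csv('file.tsv', sep='\t', chunksize=8000000):
    gr = chunk.groupby('COUNTY')
    res = gr.agg(product_1=('PRODUCT_1_SALES', 'sum'),
                 product_2=('PRODUCT_2_SALES', 'sum'))\
        .join(gr.size().rename('Count'))
    chunkPartRes.append(res)

# One groupby-sum over the concatenated partial results gives the final totals
res = pd.concat(chunkPartRes).groupby(level='COUNTY').sum()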

Iterate over rows and save as csv

I am working with this DataFrame, indexed by country, and it looks like this:
         year  personal  economic  human  rank
country
Albania  2008      7.78      7.22   7.50    49
Albania  2009      7.86      7.31   7.59    46
Albania  2010      7.76      7.35   7.55    49
Germany  2011      7.76      7.24   7.50    53
Germany  2012      7.67      7.20   7.44    54
It has 162 countries over 9 years. What I would like to do is:
1. Create a for loop that returns a new dataframe for each country, showing only the values for personal, economic, human, and rank.
2. Save each dataframe as a .csv with the name of the country the data belongs to.
Iterate through unique values of country and year. Get data related to that country and year in another dataframe. Save it.
df.reset_index(inplace=True)  # to convert the multi-index in the example to columns
unique_val = df[['country', 'year']].drop_duplicates()
for _, country, year in unique_val.itertuples():
    file_name = country + '_' + str(year) + '.csv'
    out_df = df[(df.country == country) & (df.year == year)]
    out_df = out_df.loc[:, ~out_df.columns.isin(['country', 'year'])]
    print(out_df)
    out_df.to_csv(file_name)
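If you want exactly one .csv per country, as the question's list asks, a minimal groupby-based sketch (assuming reset_index() has already been called, so country is a regular column):
for country, group in df.groupby('country'):
    group[['personal', 'economic', 'human', 'rank']].to_csv(country + '.csv', index=False)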

How to convert list to pandas DataFrame?

I use BeautifulSoup to get some data from a webpage:
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content, 'html5lib')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
df.head()  # fails: df is a list, not a DataFrame
But df is a list, not the pandas DataFrame I expected from pd.read_html.
How can I get pandas DataFrame out of it?
You can use read_html with your url:
df = pd.read_html("http://www.nationmaster.com/country-info/stats/Media/Internet-users")[0]
And then, if necessary, remove the GRAPH and HISTORY columns and forward-fill the NaNs in column #:
df = df.drop(['GRAPH','HISTORY'], axis=1)
df['#'] = df['#'].ffill()
print(df.head())
   #                                      COUNTRY         AMOUNT  DATE
0  1                                        China    389 million  2009
1  2                                United States    245 million  2009
2  3                                        Japan  99.18 million  2009
3  3  Group of 7 countries (G7) average (profile)  80.32 million  2009
4  4                                       Brazil  75.98 million  2009
print(df.tail())
        #                                         COUNTRY AMOUNT  DATE
244   214                                            Niue   1100  2009
245  =215  Saint Helena, Ascension, and Tristan da Cunha    900  2009
246  =215                                    Saint Helena    900  2009
247   217                                         Tokelau    800  2008
248   218                                Christmas Island    464  2001
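If you would rather keep the BeautifulSoup step from the question, the same fix applies there: pd.read_html always returns a list of DataFrames, so take the first element:
df = pd.read_html(str(table))[0]
df.head()  # now a DataFrame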
