Merging multiple CSV files into one dataframe using Pandas - python

I have 26 CSV files with file names similar to the below:
LA 2016
LA 2017
LA 2019
LA 2020
NY 2016
NY 2017
NY 2019
NY 2020
All the files have the same column names. The column names are:
Month A B C Total
Jan 156 132 968 1256
Feb 863 363 657 1883
Mar 142 437 857 1436
I am trying to merge them all on Month. I tried pd.concat, but for some reason the dataframes are not merging.
I am using the below code:
import pandas as pd

files = []  # avoid shadowing the built-in `list`
city = ['LA ', 'NY ', 'MA ', 'TX ']
year = ['2016', '2017', '2018', '2019', '2020']
for i in city:
    for j in year:
        files.append(i + j + ".csv")
df = pd.concat([pd.read_csv(f) for f in files])
Can someone help me with this?

The following should work:
from functools import reduce

list_of_dataframes = []
for f in files:
    list_of_dataframes.append(pd.read_csv(f))
df_final = reduce(lambda left, right: pd.merge(left, right, on='Month'), list_of_dataframes)
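One caveat: since every file shares the A, B, C, and Total column names, repeated merges on Month will produce colliding column labels. A minimal sketch of one way around it, assuming the files list built above: tag each file's data columns with the filename stem before merging.

from functools import reduce
import pandas as pd

frames = []
for name in files:
    frame = pd.read_csv(name)
    stem = name.replace(".csv", "")  # e.g. "LA 2016"
    # Rename every column except the merge key so labels stay unique.
    frame = frame.rename(columns={c: f"{c} {stem}" for c in frame.columns if c != "Month"})
    frames.append(frame)

df_final = reduce(lambda left, right: pd.merge(left, right, on="Month"), frames)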

Related

Sorting grouped DataFrame column without changing index sorting

I have a df as below:
I want only the top 5 countries from each year but keeping the year ascending.
First I grouped the df by year and country name and then ran the following code:
df.sort_values(['year','hydro_total'], ascending=False).groupby(['year']).head(5)
The result didn't keep the index ascending; instead, it sorted the year index too. How do I get the top 5 countries and keep the year's group ascending?
The CSV file is uploaded HERE.
You are already sorting by year and hydro_total, both descending. You need to sort the year ascending:
(df.sort_values(['year', 'hydro_total'],
                ascending=[True, False])
   .groupby('year').head(5)
)
Output:
country year hydro_total hydro_per_person
440 Japan 1971 7240000.0 0.06890
160 China 1971 2580000.0 0.00308
240 India 1971 2410000.0 0.00425
760 North Korea 1971 788000.0 0.05380
800 Pakistan 1971 316000.0 0.00518
... ... ... ... ...
199 China 2010 62100000.0 0.04630
279 India 2010 9840000.0 0.00803
479 Japan 2010 7070000.0 0.05590
1119 Turkey 2010 4450000.0 0.06120
839 Pakistan 2010 2740000.0 0.01580
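If you prefer to state the intent directly, nlargest per group is an equivalent alternative (a sketch, assuming the same df; groupby iterates the year keys in ascending order, so no extra sort is needed):

top5 = (df.groupby('year', group_keys=False)
          .apply(lambda g: g.nlargest(5, 'hydro_total')))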

How to add conditional row to pandas dataframe

I tried looking for a succinct answer and nothing helped. I am trying to add a row to a dataframe that takes a string for the first column and then, for each remaining column, that column's sum. I ran into a scalar issue, so I tried to build the desired row as a Series and convert it to a dataframe, but apparently I was adding four rows with one column value each instead of one row with the four column values.
My code:
import os
import pandas as pd

def country_csv():
    # loop through absolute paths of each file in source
    for filename in os.listdir(source):
        filepath = os.path.join(source, filename)
        if not os.path.isfile(filepath):
            continue
        df = pd.read_csv(filepath)
        df = df.groupby(['Country']).sum()
        df = df.reset_index()  # reset_index returns a new frame; reassign it
        print(df)
        # df.to_csv(os.path.join(path1, filename))
Sample dataframe:
Confirmed Deaths Recovered
Country
Afghanistan 299 7 10
Albania 333 20 99
I would like to see this as the first row:
World 632 27 109
import pandas as pd
df
Confirmed Deaths Recovered
Country
Afghanistan 299 7 10
Albania 333 20 99
df.loc['World'] = [df['Confirmed'].sum(), df['Deaths'].sum(), df['Recovered'].sum()]
df.sort_values(by=['Confirmed'], ascending=False)
Confirmed Deaths Recovered
Country
World 632 27 109
Albania 333 20 99
Afghanistan 299 7 10
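As an aside, the three per-column sums can likely be collapsed into one call (a sketch, same df assumed):

# df.sum() sums each column and aligns the result by column name.
df.loc['World'] = df.sum(numeric_only=True)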
IIUC, you can create a dict, then pass it back into a dataframe to concat:
data = df.sum(axis=0).to_dict()
data.update({'Country': 'World'})
df2 = pd.concat([pd.DataFrame(data, index=[0]).set_index('Country'), df], axis=0)
print(df2)
Confirmed Deaths Recovered
Country
World 632 27 109
Afghanistan 299 7 10
Albania 333 20 99
Or a one-liner using assign and transpose:
df2 = pd.concat(
    [df.sum(axis=0).to_frame().T.assign(Country="World").set_index("Country"), df],
    axis=0,
)
print(df2)
Confirmed Deaths Recovered
Country
World 632 27 109
Afghanistan 299 7 10
Albania 333 20 99
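If the goal is simply to have World on top regardless of the Confirmed ordering, moving the row explicitly also works (a sketch, assuming the 'World' row has already been appended as above):

# Reorder the index so 'World' comes first instead of sorting by Confirmed.
df = df.reindex(['World', *[c for c in df.index if c != 'World']])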

Adding a subindex to merged dataframes

I have 3 dataframes each with the same columns (years) and same indexes (countries).
Now I want to merge these 3 dataframes. But since they all have the same columns, merging just appends those columns.
So I'd like to keep the country index and add a subindex for each dataframe, because they all represent different numbers for each year.
#dataframe 1
#CO2:
2005 2010 2015 2020
country
Afghanistan 169405 210161 259855 319447
Albania 762 940 1154 1408
Algeria 158336 215865 294768 400126
#dataframe 2
#Arrivals + Departures:
2005 2010 2015 2020
country
Afghanistan 977896 1326120 1794547 2414943
Albania 103132 154219 224308 319440
Algeria 3775374 5307448 7389427 10159656
#data frame 3
#Travel distance in km:
2005 2010 2015 2020
country
Afghanistan 9330447004 12529259781 16776152792 22337458954
Albania 63159063 82810491 107799357 139543748
Algeria 12254674181 17776784271 25782632480 37150057977
The result should be something like:
2005 2010 2015 2020
country
Afghanistan co2 169405 210161 259855 319447
flights 977896 1326120 1794547 2414943
traveldistance 9330447004 12529259781 16776152792 22337458954
Albania ....
How can I do this?
NOTE: The years are an input so these are not fixed. They could just be 2005,2010 for example.
Thanks in advance.
I have tried to solve the problem using concat and groupby with your dataset.
First, concat the 3 dfs:
l = [df, df2, df3]
f = (pd.concat(l, keys=['CO2', 'Flights', 'traveldistance'], axis=0)
       .reset_index()
       .rename(columns={'level_0': 'Category'}))
Then use groupby to get the values:
result_df = f.groupby(['country', 'Category'])[f.columns[2:]].first()
Hope this helps and solves your problem.
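An alternative that builds the MultiIndex directly, without the reset_index/groupby round-trip (a sketch; df_co2, df_flights, df_dist are hypothetical names for the three frames above):

import pandas as pd

result = (pd.concat([df_co2, df_flights, df_dist],
                    keys=['co2', 'flights', 'traveldistance'])
            .swaplevel(0, 1)                       # put country on the outer level
            .sort_index(level=0, sort_remaining=False))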

How to use pandas to pull out the counties with the largest amount of water used in a given year?

I am new to python and pandas and I am struggling to figure out how to pull out the 10 counties with the most water used for irrigation in 2014.
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('info.csv')  # reads csv
data['Year'] = pd.to_datetime(data['Year'], format='%Y')  # converts string to datetime
data.index = data['Year']  # makes year the index
del data['Year']  # delete the duplicate year column
This is what the data looks like (this is only part of the data):
County WUCode RegNo Year SourceCode SourceID Annual CountyName
1 IR 311 2014 WELL 1 946 Adams
1 IN 311 2014 INTAKE 1 268056 Adams
1 IN 312 2014 WELL 1 48 Adams
1 IN 312 2014 WELL 2 96 Adams
1 IR 312 2014 INTAKE 1 337968 Adams
3 IR 315 2014 WELL 5 81900 Putnam
3 PS 315 2014 WELL 6 104400 Putnam
I have a couple questions:
I am not sure how to pull out only the "IR" rows in the WUCode column with pandas, and I am not sure how to print out a table with the 10 counties with the highest water usage for irrigation in 2014.
I have been able to use the .loc function to pull out the information I need, with something like this:
data.loc['2014', ['CountyName', 'Annual', 'WUCode']]
From here I am kind of lost. Help would be appreciated!
import numpy as np
import pandas as pd
import string

df = pd.DataFrame(data={"Annual": np.random.randint(20, 1000000, 1000),
                        "Year": np.random.randint(2012, 2016, 1000),
                        "CountyName": np.random.choice(list(string.ascii_letters), 1000)},
                  columns=["Annual", "Year", "CountyName"])
Say df looks like:
Annual Year CountyName
0 518966 2012 s
1 44511 2013 E
2 332010 2012 e
3 382168 2013 c
4 202816 2013 y
For the year 2014...
df[df['Year'] == 2014]
Group by CountyName...
df[df['Year'] == 2014].groupby("CountyName")
Look at Annual...
df[df['Year'] == 2014].groupby("CountyName")["Annual"]
Get the sum...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum()
Sort the result descending...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False)
Take the top 10...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False).head(10)
This example prints out (your actual result may vary since my data was random):
CountyName
Q 5191814
y 4335358
r 4315072
f 3985170
A 3685844
a 3583360
S 3301817
I 3231621
t 3228578
u 3164965
This may work for you:
res = df[df['WUCode'] == 'IR'].groupby(['Year', 'CountyName'])['Annual'].sum()\
                              .reset_index()\
                              .sort_values('Annual', ascending=False)\
                              .head(10)
# Year CountyName Annual
# 0 2014 Adams 338914
# 1 2014 Putnam 81900
Explanation
Filter by WUCode, as required, and groupby Year and CountyName.
Use reset_index so your result is a dataframe rather than a series.
Use sort_values and extract top 10 via pd.DataFrame.head.
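A compact variant, if you only care about 2014 (a sketch, assuming Year is still an integer column as in the sample data): filter first, then let nlargest do the sort-and-head in one step.

top10 = (df[(df['WUCode'] == 'IR') & (df['Year'] == 2014)]
           .groupby('CountyName')['Annual']
           .sum()
           .nlargest(10))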

How to convert list to pandas DataFrame?

I use BeautifulSoup to get some data from a webpage:
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'html5lib')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
df.head()
But df is a list, not the pandas DataFrame as I expected from using pd.read_html.
How can I get pandas DataFrame out of it?
You can use read_html with your url:
df = pd.read_html("http://www.nationmaster.com/country-info/stats/Media/Internet-users")[0]
And then, if necessary, remove the GRAPH and HISTORY columns and replace NaNs in the # column by forward filling:
df = df.drop(['GRAPH','HISTORY'], axis=1)
df['#'] = df['#'].ffill()
print(df.head())
# COUNTRY AMOUNT DATE
0 1 China 389 million 2009
1 2 United States 245 million 2009
2 3 Japan 99.18 million 2009
3 3 Group of 7 countries (G7) average (profile) 80.32 million 2009
4 4 Brazil 75.98 million 2009
print(df.tail())
# COUNTRY AMOUNT DATE
244 214 Niue 1100 2009
245 =215 Saint Helena, Ascension, and Tristan da Cunha 900 2009
246 =215 Saint Helena 900 2009
247 217 Tokelau 800 2008
248 218 Christmas Island 464 2001
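Alternatively, keeping the original BeautifulSoup approach: pd.read_html returns a list of DataFrames, so take the first element.

# pd.read_html parses every <table> it finds and returns a list
# of DataFrames; index into it to get the one you want.
df = pd.read_html(str(table))[0]
df.head()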
