Importing Excel into a pandas DataFrame - python

The following is only the beginning of a Coursera assignment on Data Science. I hope this is not too trivial to ask, but I am lost on this and could not find an answer.
I am asked to import an Excel file into a pandas dataframe and to manipulate it afterwards. The file can be found here: http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls
What makes it difficult for me is:
a) there is an 'overhead' of 17 lines and a footer
b) the first two columns are empty
c) the index column has no header name
After hours of searching and reading I came up with this useless call:
energy = pd.read_excel('Energy Indicators.xls',
                       sheetname='Energy',
                       header=16,
                       skiprows=[17],
                       skipfooter=38,
                       skipcolumns=2
                       )
This seems to produce a MultiIndex dataframe, though the command energy.head() returns nothing.
I have two questions:
What did I do wrong? Up to this exercise I thought I understood dataframes, but now I am totally clueless and lost :-((
How do I have to tackle this? What do I have to do to get this Excel data into a dataframe with the index consisting of the countries?
Thanks.

I think you need to add these parameters:
index_col - to convert a column to the index
usecols - to parse columns by position
and to change the header position to 15:
energy = pd.read_excel('Energy Indicators.xls',
                       sheet_name='Energy',
                       skiprows=[17],
                       skipfooter=38,
                       header=15,
                       index_col=[0],
                       usecols=[2,3,4,5]
                       )
print (energy.head())
                Energy Supply  Energy Supply per capita  Renewable Electricity Production
Afghanistan               321                        10                         78.669280
Albania                   102                        35                        100.000000
Algeria                  1959                        51                          0.551010
American Samoa            ...                       ...                          0.641026
Andorra                     9                       121                         88.695650
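Since the index column in the file has no header (point c in the question), the resulting index is unnamed; as a small follow-up, you can name it yourself afterwards:
energy.index.name = 'Country'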

I installed the xlrd package with pip install xlrd, and then loaded the file successfully as follows:
In [17]: df = pd.read_excel(r"http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls",
    ...:                    sheetname='Energy',
    ...:                    header=16,
    ...:                    skiprows=[17],
    ...:                    skipfooter=38,
    ...:                    skipcolumns=2)
In [18]: df.shape
Out[18]: (227, 3)
In [19]: df.head()
Out[19]:
                                   Energy Supply  Energy Supply per capita  Renewable Electricity Production
NaN Afghanistan     Afghanistan              321                        10                         78.669280
    Albania         Albania                  102                        35                        100.000000
    Algeria         Algeria                 1959                        51                          0.551010
    American Samoa  American Samoa           ...                       ...                          0.641026
    Andorra         Andorra                    9                       121                         88.695650
In [20]: pd.__version__
Out[20]: u'0.20.3'
In [21]: df.columns
Out[21]:
Index([u'Energy Supply', u'Energy Supply per capita',
u'Renewable Electricity Production'],
dtype='object')
Notice that I am using the latest version of pandas at the time, 0.20.3; make sure you have the latest version on your system.
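A note added for readers on newer pandas: the sheetname keyword was renamed to sheet_name in pandas 0.21, and skipcolumns is not a read_excel parameter (recent versions will reject it); column selection is done with usecols. A sketch under those assumptions (the exact header/skiprows values still depend on the file layout):
df = pd.read_excel("http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls",
                   sheet_name='Energy',
                   header=16,
                   skiprows=[17],
                   skipfooter=38,
                   usecols=[2, 3, 4, 5])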

I modified your code and was able to get the data into the dataframe. Instead of skipcolumns (which did not work), I used the argument usecols, as follows:
energy = pd.read_excel('Energy_Indicators.xls',
                       sheetname='Energy',
                       header=16,
                       skiprows=[16],
                       skipfooter=38,
                       usecols=[2,3,4,5]
                       )
Unnamed: 2 Petajoules Gigajoules %
0 Afghanistan 321 10 78.669280
1 Albania 102 35 100.000000
2 Algeria 1959 51 0.551010
3 American Samoa ... ... 0.641026
4 Andorra 9 121 88.695650
In order to make the countries the index, you can do the following:
# Rename the column Unnamed: 2 to Country
energy = energy.rename(columns={'Unnamed: 2':'Country'})
# Change the index to country column
energy.index = energy['Country']
# Drop the extra country column
energy = energy.drop('Country', axis=1)
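Equivalently, the same cleanup can be chained in one step (assuming the column really arrives named 'Unnamed: 2'):
energy = energy.rename(columns={'Unnamed: 2': 'Country'}).set_index('Country')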

Adding Multiple Columns in Single groupby in Pandas

[dataset snapshot]
Please help: I have a dataset with a Country column, a Gas column, and year columns from 2019 back to 1991 (see the snapshot of the dataset above). I want to sum all the values of each country, column-wise. For example, for Afghanistan the 2019 value should come to 56.4 (adding 28.79 + 6.23 + 16.37 + 5.01 = 56.4), and I want this calculated for every year. I have used the code below for the 2019 data.
df.groupby(by='Country')['2019'].sum()
This is the output of that code:
Country
Afghanistan     56.40
Albania         17.31
Algeria        558.67
Andorra          1.18
Angola         256.10
                  ...
Venezuela      588.72
Vietnam        868.40
Yemen           50.05
Zambia         182.08
Zimbabwe       235.06
I have grouped the data country-wise and summed the 2019 column values, but how can I sum the values of the other years in a single line of code?
I can use the code shown here to show multiple columns, but it would be a tedious task to write out each column name.
df.groupby(by='Country')[['2019','2018','2017']].sum()
If you don't specify the columns, it will sum all the numeric columns.
df.groupby(by='Country').sum()
               2019   2018  ...
Country
Afghanistan   56.40   32.4  ...
Albania       17.31   12.5  ...
Algeria      558.67  241.5  ...
Andorra        1.18    1.5  ...
Angola       256.10   32.1  ...
...             ...    ...
Venezuela    588.72  247.3  ...
Vietnam      868.40  323.5  ...
Yemen         50.05   55.7  ...
Zambia       182.08   23.4  ...
Zimbabwe     235.06  199.4  ...
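One caveat worth adding (not in the original answer): since pandas 2.0, groupby().sum() no longer silently drops non-numeric columns such as Gas, so you may need to pass numeric_only:
df.groupby(by='Country').sum(numeric_only=True)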
Do a reset_index() to turn the Country index back into a regular column:
df.groupby(by='Country').sum().reset_index()
Country          2019   2018  ...
Afghanistan     56.40   32.4  ...
Albania         17.31   12.5  ...
Algeria        558.67  241.5  ...
Andorra          1.18    1.5  ...
Angola         256.10   32.1  ...
...               ...    ...
Venezuela      588.72  247.3  ...
Vietnam        868.40  323.5  ...
Yemen           50.05   55.7  ...
Zambia         182.08   23.4  ...
Zimbabwe       235.06  199.4  ...
You can select the column keys in your dataframe starting from the 2019 column through to the last column key in this way:
df.groupby(by='Country')[df.keys()[2:]].sum()
The df.keys() method returns all the dataframe's column keys; you can slice them from the index of the 2019 key, which is 2, to the end of the keys.
Suppose you want to select the columns from 2016 through 1992:
df.groupby(by='Country')[df.keys()[5:-1]].sum()
You just need to slice the column keys with the correct indices.
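Equivalently, a sketch that slices by label rather than position (assuming the year columns are contiguous and their headers are the strings '2019' through '1991'):
# label-based slice of the year columns, then group and sum
year_cols = df.loc[:, '2019':'1991'].columns
df.groupby(by='Country')[list(year_cols)].sum()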

Using Query in Pandas to remove a vector of values

I work in R, and this operation would be easy in the tidyverse; however, I'm having trouble figuring out how to do it in Python and pandas.
Let's say we're using the gapminder dataset
data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
gapminder = pd.read_csv(data_url)
and let's say that I want to filter out of the dataset all rows whose year is equal to 1952 or 1957. I would think that something like this would work, but it doesn't:
vector = [1952, 1957]
gapminder.query("year isin(vector)")
I realize here that what I've made as a vector is really a list. When I try to pass those two year values into an array as vector = pd.array(1952, 1957), that doesn't work either.
In R, for instance, you would just do something simple like
vector = c(1952, 1957)
gapminder %>% filter(year %in% vector)
#or
gapminder %>% filter(year %in% c(1952, 1957))
So really this is a two-part question: first, how can I create a vector of many values (if I were pulling these values from another dataset, I believe I could just use .to_numpy()), and then how do I remove all rows from a dataframe based on that vector of observations?
I've looked at a lot of different variations for using query like here, for instance, https://www.geeksforgeeks.org/python-filtering-data-with-pandas-query-method/, but this has been surprisingly hard to find.
*Here I am updating my question: I found that this isn't working if I pull a vector from another dataset (or even from the same dataset); for instance:
vector = (1952, 1957)
#how to take a dataframe and make a vector
#how to make a vector
gapminder.vec = gapminder\
    .query('year == [1952, 1958]')\
    [['country']]\
    .to_numpy()
gap_sum = gapminder.query("year != @gapminder.vec")
gap_sum
I receive the following error:
Thanks much!
James
You can use in or even == inside the query string like so:
# gapminder.query("year == @vector") returns the same result
print(gapminder.query("year in @vector"))
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
12 Albania 1952 1282697.0 Europe 55.230 1601.056136
13 Albania 1957 1476505.0 Europe 59.280 1942.284244
24 Algeria 1952 9279525.0 Africa 43.077 2449.008185
... ... ... ... ... ... ...
1669 Yemen Rep. 1957 5498090.0 Asia 33.970 804.830455
1680 Zambia 1952 2672000.0 Africa 42.038 1147.388831
1681 Zambia 1957 3016000.0 Africa 44.077 1311.956766
1692 Zimbabwe 1952 3080907.0 Africa 48.451 406.884115
1693 Zimbabwe 1957 3646340.0 Africa 50.469 518.764268
The @ symbol tells the query string to look for a variable named vector outside of the context of the dataframe.
There are a couple of issues with the updated component of your question that I'll address:
The direct issue you're receiving is because you're using double square brackets to select a column. By using double square brackets, you're forcing the selected column to be returned as a 2D table (i.e. a dataframe that contains a single column) instead of just the column itself. To resolve this issue, simply get rid of the double brackets. The to_numpy call is also not necessary.
In your gap_sum variable, you're checking where the values in "year" are not in gapminder.vec, which is a pd.Series (an array, in more generic terms) of country names, so the comparison doesn't really make sense.
Don't use . notation to create variables in Python. You're not making a new variable, but attaching a new attribute to an existing object. Instead, use underscores, as is common practice in Python (e.g. use gapminder_vec instead of gapminder.vec):
# countries that have years that are either 1952 or 1958
# will contain duplicate country names
gapminder_vec = gapminder.query('year == [1952, 1958]')['country']
# This won't actually filter anything- because `gapminder_vec` is
# a bunch of country names. Not years.
gapminder.query("year not in #gapminder_vec")
Also, to take either the matching subset or its complement:
vec = (1952, 1958)
# returns a subset containing the rows who have a year in `vec`
subset_with_years_in_vec = gapminder.query('year in @vec')
# return subset containing rows who DO NOT have a year in `vec`
subset_without_years_in_vec = gapminder.query('year not in @vec')
To filter out years 1952 and 1957 you can use:
print(gapminder.loc[~(gapminder.year.isin([1952, 1957]))])
Prints:
country year pop continent lifeExp gdpPercap
2 Afghanistan 1962 1.026708e+07 Asia 31.99700 853.100710
3 Afghanistan 1967 1.153797e+07 Asia 34.02000 836.197138
4 Afghanistan 1972 1.307946e+07 Asia 36.08800 739.981106
5 Afghanistan 1977 1.488037e+07 Asia 38.43800 786.113360
6 Afghanistan 1982 1.288182e+07 Asia 39.85400 978.011439
7 Afghanistan 1987 1.386796e+07 Asia 40.82200 852.395945
8 Afghanistan 1992 1.631792e+07 Asia 41.67400 649.341395
9 Afghanistan 1997 2.222742e+07 Asia 41.76300 635.341351
10 Afghanistan 2002 2.526840e+07 Asia 42.12900 726.734055
11 Afghanistan 2007 3.188992e+07 Asia 43.82800 974.580338
14 Albania 1962 1.728137e+06 Europe 64.82000 2312.888958
15 Albania 1967 1.984060e+06 Europe 66.22000 2760.196931
16 Albania 1972 2.263554e+06 Europe 67.69000 3313.422188
17 Albania 1977 2.509048e+06 Europe 68.93000 3533.003910
...

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows for each country where the year is increased by 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though; I just want to append the new values.
This is how the dataframe looks (a copyable sample is below):
Preferably, I would also want to create a loop that runs the code a certain number of times.
Super grateful for any help!
If you need to copy the values from the dataframe as an example, here they are:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation on it (add 20 years, multiply the temperature by a constant or an array, etc...), and then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
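Since the question also asks for a loop that runs the projection a certain number of times, here is a minimal sketch building on the same idea (n_steps is an assumed name for however many 20-year steps you want):
n_steps = 3  # assumed: how many 20-year projections to append
latest = df[df['Year'] == df['Year'].max()].copy()  # start from the newest rows
for _ in range(n_steps):
    latest = latest.copy()                          # don't mutate rows already appended
    latest['avgTemp'] = latest['avgTemp'] * tempChange
    latest['Year'] = latest['Year'] + 20
    df = pd.concat([df, latest], ignore_index=True)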
Where df is your dataframe name:
df['tempChange'] = df['year'] + 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure I understood your logic correctly, so the math may need some work; note that, as written, operator precedence makes this year + (20 * avgTemp).
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20, axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange, axis=1)
This is how you apply a function to each row.
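As a side note (an addition): apply with axis=1 runs Python code row by row; the same result can be obtained with vectorized column arithmetic, which is usually faster:
dfName['newYear'] = dfName['year'] + 20
dfName['tempDiff'] = dfName['avgTemp'] * tempChange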

Problem with New Column in Pandas Dataframe

I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
    df = answer_one()
    columns_to_keep = ['Self-citations', 'Citations']
    df = df[columns_to_keep]
    df['ratio'] = df['Self-citations'] / df['Citations']
    return df

answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1's and 0's when I want float values? I tried the solutions given in the suggested link and none of them worked. I've tried to convert the values to floats using a few different methods, including .astype('float'), float(df['A']), and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations'], but none have worked so far.
Without having the exact dataframe it is difficult to say, but it is most likely a casting problem.
Let's build an MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce the type, you can cast the dataframe using the astype method:
df.astype(float)
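One more guess worth checking (an assumption, since the dataframe returned by answer_one() isn't shown): if the two columns were read in as strings (dtype object), the division can misbehave; convert them to numbers first with pd.to_numeric:
# coerce object columns to numbers before dividing
df['Self_cite'] = pd.to_numeric(df['Self_cite'], errors='coerce')
df['Citations'] = pd.to_numeric(df['Citations'], errors='coerce')
df['ratio'] = df['Self_cite'] / df['Citations']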

Pandas: inserting column names

When reading from Excel into pandas, it shows up like this:
t0001 Albania 0.03914382317658349
0 t0001 Algeria 0.298994
1 t0001 Austria 1.01137
2 t0001 Belgium 0.306369
What I want to achieve is inserting the column names 'time', 'region', 'value', so that it shows like below:
time region value
0 t0001 Albania 0.0391438
1 t0001 Algeria 0.298994
2 t0001 Austria 1.01137
3 t0001 Belgium 0.306369
Is it possible to achieve this in pandas?
When reading your Excel file, read it like so, with the header and names parameters:
df = pd.read_excel(..., header=None, names=['time', 'region', 'value'])
If you are curious, the fix would be to call reset_index and assign columns:
df = df.T.reset_index().T
df.columns = ['time', 'region', 'value']
df['value'] = df['value'].astype(float)
df
time region value
index t0001 Albania 0.039144
0 t0001 Algeria 0.298994
1 t0001 Austria 1.011370
2 t0001 Belgium 0.306369
You should strive as much as possible to not reach a point that would necessitate running cleanup code like this.
header=None is the proper solution, but as an alternative you can also do:
df.loc[-1] = df.columns    # re-insert the consumed header row as a data row at index -1
df.index += 1              # shift all labels up by one so the inserted row lands at index 0
df.columns = ['time', 'region', 'value']
df.value = df.value.astype(float)
time region value
1 t0001 Algeria 0.298994
2 t0001 Austria 1.01137
3 t0001 Belgium 0.306369
0 t0001 Albania 0.039143
