[Image: snapshot of the dataset]
Please help: I have a dataset with columns Country, Gas, and one column per year from 2019 back to 1991 (a snapshot of the dataset is attached above). I want to sum all the values for each country, column-wise. For example, for Afghanistan the value under 2019 should come to 56.4 (28.79 + 6.23 + 16.37 + 5.01 = 56.4). I want this result calculated for every year. I have used the code below to get the 2019 data.
df.groupby(by='Country')['2019'].sum()
This is the output of that code:
Country
Afghanistan 56.40
Albania 17.31
Algeria 558.67
Andorra 1.18
Angola 256.10
...
Venezuela 588.72
Vietnam 868.40
Yemen 50.05
Zambia 182.08
Zimbabwe 235.06
I have grouped the data country-wise and summed the 2019 column values, but how can I sum the values of the other years in a single line of code?
Please help.
I can use the code shown here to sum the rows and display multiple columns, but it would be a tedious task to write out each column name:
df.groupby(by='Country')[['2019','2018','2017']].sum()
If you don't specify the columns, it will sum all the numeric columns:
df.groupby(by='Country').sum()
2019 2018 ...
Country
Afghanistan 56.40 32.4 ...
Albania 17.31 12.5 ...
Algeria 558.67 241.5 ...
Andorra 1.18 1.5 ...
Angola 256.10 32.1 ...
... ... ...
Venezuela 588.72 247.3 ...
Vietnam 868.40 323.5 ...
Yemen 50.05 55.7 ...
Zambia 182.08 23.4 ...
Zimbabwe 235.06 199.4 ...
Do a reset_index() to turn the Country index back into a regular column:
df.groupby(by='Country').sum().reset_index()
Country 2019 2018 ...
Afghanistan 56.40 32.4 ...
Albania 17.31 12.5 ...
Algeria 558.67 241.5 ...
Andorra 1.18 1.5 ...
Angola 256.10 32.1 ...
... ... ...
Venezuela 588.72 247.3 ...
Vietnam 868.40 323.5 ...
Yemen 50.05 55.7 ...
Zambia 182.08 23.4 ...
Zimbabwe 235.06 199.4 ...
You can select the column keys in your dataframe, starting from the 2019 column through the last column key, in this way:
df.groupby(by='Country')[df.keys()[2:]].sum()
The df.keys() method returns all of the dataframe's column keys, which you can then slice from the index of the 2019 key (which is 2) to the end of the column keys.
Suppose you want to select the columns from 2016 through 1992:
df.groupby(by='Country')[df.keys()[5:-1]].sum()
You just need to slice the list of column keys at the correct indices.
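Putting the pieces together, here is a minimal self-contained sketch of the whole approach; the Gas values, the numbers, and the 2018 column are made up for illustration:

import pandas as pd

# Hypothetical miniature of the dataset: Country, Gas, then one column per year
df = pd.DataFrame({
    'Country': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
    'Gas': ['CO2', 'CH4', 'CO2', 'CH4'],
    '2019': [28.79, 6.23, 10.00, 7.31],
    '2018': [25.10, 5.90, 9.40, 6.80],
})

# Slicing df.keys() from position 2 skips 'Country' and 'Gas', so every
# year column is summed per country in a single call
totals = df.groupby('Country')[df.keys()[2:]].sum()
print(totals)
#               2019  2018
# Country
# Afghanistan  35.02  31.0
# Albania      17.31  16.2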
I am dealing with a dataset that uses ".." as a placeholder for null values. These null values span across all of my columns. My dataset looks as follows:
Country Code  Year  GDP growth (%)  GDP (constant)
AFG           2010  3.5             ..
AFG           2011  ..              2345
AFG           2012  1.4             3372
ALB           2010  ..              4567
ALB           2011  ..              5678
ALB           2012  4.2             ..
DZA           2010  2.0             4321
DZA           2011  ..              5432
DZA           2012  3.8             6543
I want to remove the rows containing missing data from my dataset; however, my solutions are not very clean.
I have tried:
df_GDP_1[df_GDP_1.str.contains("..")==False]
which I had hoped would deal with all the columns at once; however, this returns an error.
Otherwise I have tried:
df_GDP_1[df_GDP_1.col1 != '..' | df_GDP_1.col2 != '..']
However, this solution requires me to alter the column names to remove spaces and then reverse this afterwards, which seems unnecessarily long for the task at hand.
Any ideas which enable me to perform this in a cleaner manner would be greatly appreciated!
With a combination of the pandas.DataFrame.eq and pandas.DataFrame.any functions:
.any(axis=1) tells pandas to look for a match across the columns (axis=1)
the negation ~ tells it to omit the records with matches
In [269]: df[~df.eq("..").any(axis=1)]
Out[269]:
Country Code Year GDP growth (%) GDP (constant)
2 AFG 2012 1.4 3372
6 DZA 2010 2.0 4321
8 DZA 2012 3.8 6543
This is a typical case with World Bank data. Here's the simplest way to deal with it:
# This is just for reproducing your example dataset;
# the fields in the string are tab-separated, which the split("\t") calls below rely on
import pandas as pd

your_example = """Country Code	Year	GDP growth (%)	GDP (constant)
AFG	2010	3.5	..
AFG	2011	..	2345
AFG	2012	1.4	3372
ALB	2010	..	4567
ALB	2011	..	5678
ALB	2012	4.2	..
DZA	2010	2.0	4321
DZA	2011	..	5432
DZA	2012	3.8	6543"""
your_example = your_example.split("\n")
your_example = pd.DataFrame(
    [row.split("\t") for row in your_example[1:]],
    columns=your_example[0].split("\t"),
)
# You just have to do this:
your_example = your_example.replace({"..": None})
your_example = your_example.dropna()
print("DF after dropping rows with ..", your_example)
>>> Country Code Year GDP growth (%) GDP (constant)
>>> 2 AFG 2012 1.4 3372
>>> 6 DZA 2010 2.0 4321
>>> 8 DZA 2012 3.8 6543
I'm just replacing the ".." with None, since you say ".." represents a NULL. Then I delete those rows using the dropna() method of the pandas dataframe, which is what you wanted to achieve.
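The two steps can also be chained into one line; a small variant using numpy's NaN, which dropna() treats the same as None:

import numpy as np

your_example = your_example.replace('..', np.nan).dropna()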
Following your original approach (you were almost there!), you could use:
df_GDP_1 = df_GDP_1[(df_GDP_1['GDP growth (%)'] != '..') & (df_GDP_1['GDP (constant)'] != '..')]
Column names with spaces have to go in [ ] instead of dot notation. Also, you want to keep rows where both columns are free of the .. marker, so use & rather than |. Each condition needs its own ( ) parentheses, because & binds more tightly than the comparison operators.
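If the list of columns grows, the same filter can be written once instead of repeating each comparison; a sketch assuming the two column names above:

cols = ['GDP growth (%)', 'GDP (constant)']
df_GDP_1 = df_GDP_1[df_GDP_1[cols].ne('..').all(axis=1)]  # keep rows where no listed column is '..'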
I have a df as below:
I want only the top 5 countries from each year, while keeping the years in ascending order.
First I grouped the df by year and country name and then ran the following code:
df.sort_values(['year','hydro_total'], ascending=False).groupby(['year']).head(5)
The result didn't keep the years ascending; instead, it sorted the year column descending too. How do I get the top 5 countries per year while keeping the year groups ascending?
The CSV file is uploaded HERE.
You are already sorting by year and hydro_total, both descending. You need to sort the year ascending while keeping hydro_total descending:
(df.sort_values(['year', 'hydro_total'],
                ascending=[True, False])
   .groupby('year')
   .head(5)
)
Output:
country year hydro_total hydro_per_person
440 Japan 1971 7240000.0 0.06890
160 China 1971 2580000.0 0.00308
240 India 1971 2410000.0 0.00425
760 North Korea 1971 788000.0 0.05380
800 Pakistan 1971 316000.0 0.00518
... ... ... ... ...
199 China 2010 62100000.0 0.04630
279 India 2010 9840000.0 0.00803
479 Japan 2010 7070000.0 0.05590
1119 Turkey 2010 4450000.0 0.06120
839 Pakistan 2010 2740000.0 0.01580
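An alternative sketch uses nlargest per group, which avoids the double sort key; this assumes the same column names and has not been tested against the linked CSV:

top5 = (df.groupby('year', group_keys=False)
          .apply(lambda g: g.nlargest(5, 'hydro_total')))

groupby sorts the year keys ascending by default, and nlargest picks each year's five largest hydro_total rows.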
I have a data frame:
Unnamed: 0 COUNTRY GDP (BILLIONS) CODE
0 0 Afghanistan 21.71 AFG
1 1 Albania 13.40 ALB
2 2 Algeria 227.80 DZA
3 3 American Samoa 0.75 ASM
4 4 Andorra 4.80 AND
... ... ... ... ...
217 217 Virgin Islands 5.08 VGB
218 218 West Bank 6.64 WBG
219 219 Yemen 45.45 YEM
220 220 Zambia 25.61 ZMB
221 221 Zimbabwe 13.74 ZWE
I would like to know how I can output the Max and Min GDP from this dataframe.
I tried
df.loc[df['GDP(BILLIONS)'].idxmax()]
but I got an error message.
Thank you in advance.
Using idxmax:
Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
Series.idxmax:
Return the row label of the maximum value.
You can use idxmax if you want to return the row corresponding to the max value, and then pick the max value out of it:
row_of_max_index = df.loc[df['GDP (BILLIONS)'].idxmax()]  # Series holding the row with the max value
print(row_of_max_index)                    # 2  Algeria  227.80  DZA
print(row_of_max_index['GDP (BILLIONS)'])  # the GDP entry of that row: 227.8
The same thing works for idxmin:
row_of_min_index = df.loc[df['GDP (BILLIONS)'].idxmin()]
You can use
max_val = df['GDP (BILLIONS)'].max()
for the maximum value and
min_val = df['GDP (BILLIONS)'].min()
for the minimum value.
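If you only need the two numbers rather than the whole rows, both can come from a single agg call; a sketch assuming the column is numeric:

extremes = df['GDP (BILLIONS)'].agg(['min', 'max'])  # Series indexed by 'min' and 'max'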
I have 3 dataframes each with the same columns (years) and same indexes (countries).
Now I want to merge these 3 dataframes, but since they all have the same columns, concatenating just appends them. So I'd like to keep the country index and add a sub-index for each dataframe, because they all represent different numbers for each year.
#dataframe 1
#CO2:
2005 2010 2015 2020
country
Afghanistan 169405 210161 259855 319447
Albania 762 940 1154 1408
Algeria 158336 215865 294768 400126
#dataframe 2
#Arrivals + Departures:
2005 2010 2015 2020
country
Afghanistan 977896 1326120 1794547 2414943
Albania 103132 154219 224308 319440
Algeria 3775374 5307448 7389427 10159656
#data frame 3
#Travel distance in km:
2005 2010 2015 2020
country
Afghanistan 9330447004 12529259781 16776152792 22337458954
Albania 63159063 82810491 107799357 139543748
Algeria 12254674181 17776784271 25782632480 37150057977
The result should be something like:
2005 2010 2015 2020
country
Afghanistan co2 169405 210161 259855 319447
flights 977896 1326120 1794547 2414943
traveldistance 9330447004 12529259781 16776152792 22337458954
Albania ....
How can I do this?
NOTE: The years are an input so these are not fixed. They could just be 2005,2010 for example.
Thanks in advance.
I solved the problem using concat and groupby with your dataset; hope it helps.
First, concat the 3 dfs:
l = [df, df2, df3]
f = (pd.concat(l, keys=['CO2', 'Flights', 'traveldistance'], axis=0)
       .reset_index()
       .rename(columns={'level_0': 'Category'}))
Then use groupby to get the values:
result_df=f.groupby(['country', 'Category'])[f.columns[2:]].first()
Hope it helps and solves your problem. The output looks like this: [image of the resulting dataframe]
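For what it's worth, here is a sketch that builds the country/category MultiIndex directly, assuming the three frames are named df, df2, and df3 as above:

import pandas as pd

result = (pd.concat([df, df2, df3],
                    keys=['co2', 'flights', 'traveldistance'])
            .swaplevel(0, 1)                             # (category, country) -> (country, category)
            .sort_index(level=0, sort_remaining=False))  # group rows by country, keep category order

swaplevel puts country first, and sort_remaining=False keeps the co2/flights/traveldistance order inside each country block.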
The following is only the beginning of a Coursera assignment on Data Science. I hope this is not too trivial for you, but I am lost on this and could not find an answer.
I am asked to import an Excel file into a pandas dataframe and to manipulate it afterwards. The file can be found here: http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls
What makes it difficult for me is
a) there is an 'overhead' of 17 lines and a footer
b) the first two columns are empty
c) the index column has no header name
After hours of searching and reading, I came up with this useless line:
energy = pd.read_excel('Energy Indicators.xls',
                       sheetname='Energy',
                       header=16,
                       skiprows=[17],
                       skipfooter=38,
                       skipcolumns=2
                       )
This seems to produce a multi-index dataframe, yet the command energy.head() returns nothing.
I have two questions:
What did I do wrong? Up to this exercise I thought I understood dataframes, but now I am totally clueless and lost :-((
How should I tackle this? What do I have to do to get this Excel data into a dataframe with an index consisting of the countries?
Thanks.
I think you need to add these parameters:
index_col to convert a column to the index
usecols to parse columns by position
change the header position to 15
energy = pd.read_excel('Energy Indicators.xls',
                       sheet_name='Energy',
                       skiprows=[17],
                       skipfooter=38,
                       header=15,
                       index_col=[0],
                       usecols=[2, 3, 4, 5]
                       )
print (energy.head())
Energy Supply Energy Supply per capita \
Afghanistan 321 10
Albania 102 35
Algeria 1959 51
American Samoa ... ...
Andorra 9 121
Renewable Electricity Production
Afghanistan 78.669280
Albania 100.000000
Algeria 0.551010
American Samoa 0.641026
Andorra 88.695650
I installed the xlrd package with pip install xlrd and then loaded the file successfully, as follows:
In [17]: df = pd.read_excel(r"http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls",
...: sheetname='Energy',
...: header=16,
...: skiprows=[17],
...: skipfooter=38,
...: skipcolumns=2)
In [18]: df.shape
Out[18]: (227, 3)
In [19]: df.head()
Out[19]:
Energy Supply Energy Supply per capita \
NaN Afghanistan Afghanistan 321 10
Albania Albania 102 35
Algeria Algeria 1959 51
American Samoa American Samoa ... ...
Andorra Andorra 9 121
Renewable Electricity Production
NaN Afghanistan Afghanistan 78.669280
Albania Albania 100.000000
Algeria Algeria 0.551010
American Samoa American Samoa 0.641026
Andorra Andorra 88.695650
In [20]: pd.__version__
Out[20]: u'0.20.3'
In [21]: df.columns
Out[21]:
Index([u'Energy Supply', u'Energy Supply per capita',
u'Renewable Electricity Production'],
dtype='object')
Notice that I am using the latest version of pandas, 0.20.3; make sure you have the latest version on your system.
I modified your code and was able to get the data into the dataframe. Instead of skipcolumns (which did not work), I used the argument usecols, as follows:
energy = pd.read_excel('Energy_Indicators.xls',
                       sheetname='Energy',
                       header=16,
                       skiprows=[16],
                       skipfooter=38,
                       usecols=[2, 3, 4, 5]
                       )
Unnamed: 2 Petajoules Gigajoules %
0 Afghanistan 321 10 78.669280
1 Albania 102 35 100.000000
2 Algeria 1959 51 0.551010
3 American Samoa ... ... 0.641026
4 Andorra 9 121 88.695650
In order to make the countries the index, you can do the following:
# Rename the column Unnamed: 2 to Country
energy = energy.rename(columns={'Unnamed: 2':'Country'})
# Change the index to country column
energy.index = energy['Country']
# Drop the extra country column
energy = energy.drop('Country', axis=1)
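The rename and re-indexing steps can also be collapsed into one chain; set_index drops the column by default, so the explicit drop is not needed:

energy = (energy.rename(columns={'Unnamed: 2': 'Country'})
                .set_index('Country'))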