I am using a basic python editor (Wing 101). When printing out tables (for analysis), I get rows and columns all messed up. Please see below:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import os
import matplotlib.pyplot as plt
#STEP 2: Access the Data files
data = pd.read_csv('data.csv', index_col=0)
data.sort_values(['Year', "Happiness Score"], ascending=[True, False], inplace=True)
#diplay first 10 rows
data.head(10)
And I get this garbage:
[evaluate World Happness Data Analysis.py]
data.head(10)
Country Region ... Dystopia Residual Year
141 Switzerland Western Europe ... 2.51738 2015
60 Iceland Western Europe ... 2.70201 2015
38 Denmark Western Europe ... 2.49204 2015
108 Norway Western Europe ... 2.46531 2015
25 Canada North America ... 2.45176 2015
46 Finland Western Europe ... 2.61955 2015
102 Netherlands Western Europe ... 2.46570 2015
140 Sweden Western Europe ... 2.37119 2015
103 New Zealand Australia and New Zealand ... 2.26425 2015
6 Australia Australia and New Zealand ... 2.26646 2015
[10 rows x 12 columns]
Actually, it looks reasonable above. But in Python shell, it is totally messed up.
here's the actual output
My question: How can I get the table printout with cell borders?
Change the font settings of your Python shell. Use some monospace font. Most common selections is Courier New or Courier.
Related
I currently have the following information in a Data Frame.
I need to create a graph that compares the Budget against the Worldwide Gross of the 5 films with the highest 'porcentage de ganancia' (or income).
Nothing seems to be working.
Update:
datos.nlargest(5, ['Porcentaje de ganancia'])
Movie Budget ... Year Porcentaje de ganancia
19 E.T. the Extra-Terrestrial 10500000 ... 1982 7551.529086
69 Star Wars: Episode IV - A New Hope 11000000 ... 1977 7049.072791
97 Wolf Warrior 2 30100000 ... 2017 2891.446641
83 The Lion King 45000000 ... 1994 2408.268616
45 Joker 55000000 ... 2019 1953.184202
Any help is greatly appreciated!
First, you'll want to get the n_largest values for the porcentage de ganancia column.
top_5 = df.nlargest(n=5, columns='porcentage de ganancia')
Then, plot using seaborn. Read more here to learn about how to customize the plot.
import seaborn as sns
g = sns.scatterplot(data=top_5, x='Budget', y='Worldwide Gross', hue='Movie')
I work in R and this operation would be easy in tidyverse; However, I'm having trouble figuring out how to do it in Python and Pandas.
Let's say we're using the gapminder dataset
data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
gapminder = pd.read_csv(data_url)
and let's say that I want to filter out from the dataset all year values that are equal to 1952 and 1957. I would think that something like this would work, but it doesn't:
vector = [1952, 1957]
gapminder.query("year isin(vector)")
I realize here that I've made a vector in what is really a list. When I try to pass those two year values into an array as vector = pd.array(1952, 1957) That doesn't work either.
In R, for instance, you would have to do something simple like
vector = c(1952, 1957)
gapminder %>% filter(year %in% vector)
#or
gapminder %>% filter(year %in% c(1952, 1957))
So really this is a two part question: first, how can I create a vector of many values (if I were pulling these values from another dataset, I believe that I could just use pd.to_numpy) and then how do I then remove all rows based on that vector of observations from a dataframe?
I've looked at a lot of different variations for using query like here, for instance, https://www.geeksforgeeks.org/python-filtering-data-with-pandas-query-method/, but this has been surprisingly hard to find.
*Here I am updating my question: I found that this isn't working if I pull a vector from another dataset (or even from the same dataset); for instance:
vector = (1952, 1957)
#how to take a dataframe and make a vector
#how to make a vector
gapminder.vec = gapminder\
.query('year == [1952, 1958]')\
[['country']]\
.to_numpy()
gap_sum = gapminder.query("year != #gapminder.vec")
gap_sum
I receive the following error:
Thanks much!
James
You can use in or even == inside the query string like so:
# gapminder.query("year == #vector") returns the same result
print(gapminder.query("year in #vector"))
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
12 Albania 1952 1282697.0 Europe 55.230 1601.056136
13 Albania 1957 1476505.0 Europe 59.280 1942.284244
24 Algeria 1952 9279525.0 Africa 43.077 2449.008185
... ... ... ... ... ... ...
1669 Yemen Rep. 1957 5498090.0 Asia 33.970 804.830455
1680 Zambia 1952 2672000.0 Africa 42.038 1147.388831
1681 Zambia 1957 3016000.0 Africa 44.077 1311.956766
1692 Zimbabwe 1952 3080907.0 Africa 48.451 406.884115
1693 Zimbabwe 1957 3646340.0 Africa 50.469 518.764268
The # symbol tells the query string to look for a variable named vector outside of the context of the dataframe.
There are a couple of issues with the updated component of your question that I'll address:
The direct issue you're receiving is because you're using double square brackets to select a column. By using a double square bracket, you're forcing the selected column to be returned as a 2d table (e.g. a dataframe that contains a single column), instead of just the column itself. To resolve this issue, simply get rid of the double brackets. The to_numpy is also not necessary.
in your gap_sum variable, you're checking where the values in "year" are not in your gapminder.vec - which is a pd.Series (array for more generic term) of country names. So these don't really make sense to compare.
Don't use . notation to create variables in python. You're not making a new variable, but are attaching a new attribute to an existing object. Instead use underscores as is common practice in python (e.g. use gapminder_vec instead of gapminder.vec)
# countries that have years that are either 1952 or 1958
# will contain duplicate country names
gapminder_vec = gapminder.query('year == [1952, 1958]')['country']
# This won't actually filter anything- because `gapminder_vec` is
# a bunch of country names. Not years.
gapminder.query("year not in #gapminder_vec")
Also to perform a filter rather than a subset:
vec = (1952, 1958)
# returns a subset containing the rows who have a year in `vec`
subset_with_years_in_vec = gapminder.query('year in #vec')
# return subset containing rows who DO NOT have a year in `vec`
subset_without_years_in_vec = gapminder.query('year not in #vec')
To filter out years 1952 and 1957 you can use:
print(gapminder.loc[~(gapminder.year.isin([1952, 1957]))])
Prints:
country year pop continent lifeExp gdpPercap
2 Afghanistan 1962 1.026708e+07 Asia 31.99700 853.100710
3 Afghanistan 1967 1.153797e+07 Asia 34.02000 836.197138
4 Afghanistan 1972 1.307946e+07 Asia 36.08800 739.981106
5 Afghanistan 1977 1.488037e+07 Asia 38.43800 786.113360
6 Afghanistan 1982 1.288182e+07 Asia 39.85400 978.011439
7 Afghanistan 1987 1.386796e+07 Asia 40.82200 852.395945
8 Afghanistan 1992 1.631792e+07 Asia 41.67400 649.341395
9 Afghanistan 1997 2.222742e+07 Asia 41.76300 635.341351
10 Afghanistan 2002 2.526840e+07 Asia 42.12900 726.734055
11 Afghanistan 2007 3.188992e+07 Asia 43.82800 974.580338
14 Albania 1962 1.728137e+06 Europe 64.82000 2312.888958
15 Albania 1967 1.984060e+06 Europe 66.22000 2760.196931
16 Albania 1972 2.263554e+06 Europe 67.69000 3313.422188
17 Albania 1977 2.509048e+06 Europe 68.93000 3533.003910
...
I have the following dataframe (df):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import matplotlib as mpl
sns.set()
df = pd.DataFrame({
# some ways to create random data
'scenario':np.random.choice( ['BAU','ETS','ESD'], 27),
'region':np.random.choice( ['Italy','France'], 27),
'variable':np.random.choice( ['GDP','GHG'], 27),
# some ways to create systematic groups for indexing or groupby
# this is similar to r's expand.grid(), see note 2 below
'2015':np.random.randn(27),
'2016':np.random.randn(27),
'2017':np.random.randn(27),
'2018':np.random.randn(27),
'2019':np.random.randn(27),
'2020':np.random.randn(27),
'2021':np.random.randn(27)
})
df2=pd.melt(df,id_vars=['scenario','region','variable'],var_name='year')
all_names_index = df2.set_index(['scenario','region','variable','year']).sort_index()
How can I calculate for each variable, scenario and region its % change with respect to initial year (ie 2015)?
As an example:
2016=(2016-2015)/2015
2017=(2017-2015)/2015
...
2021=(2021-2015)/2015
You can try this to subtract the first element from every group, I'm summing the values for the same year:
all_names_index.reset_index(inplace=True)
all_names_index = all_names_index.groupby(by=['scenario', 'region', 'variable', 'year']).sum().reset_index()
all_names_index['pct_change'] = all_names_index.groupby(by=['scenario', 'region', 'variable'])['value'].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
print(all_names_index)
Output:
scenario region variable year value pct_change
0 BAU France GDP 2015 1.786506 0.000000
1 BAU France GDP 2016 0.020103 -98.874740
2 BAU France GDP 2017 3.190068 78.564690
3 BAU France GDP 2018 -3.581261 -300.461753
4 BAU France GDP 2019 0.500374 -71.991488
.. ... ... ... ... ... ...
72 ETS Italy GDP 2017 -0.557029 -153.990905
73 ETS Italy GDP 2018 -0.172391 -116.709261
74 ETS Italy GDP 2019 -0.238212 -123.089063
75 ETS Italy GDP 2020 -1.098866 -206.509438
76 ETS Italy GDP 2021 -0.405364 -139.290556
This is my dataset.
Country Type Disaster Count
0 CHINA P REP Industrial Accident 415
1 CHINA P REP Transport Accident 231
2 CHINA P REP Flood 175
3 INDIA Transport Accident 425
4 INDIA Flood 206
5 INDIA Storm 121
6 UNITED STATES Storm 348
7 UNITED STATES Transport Accident 159
8 UNITED STATES Flood 92
9 PHILIPPINES Storm 249
10 PHILIPPINES Transport Accident 84
11 PHILIPPINES Flood 71
12 INDONESIA Transport Accident 136
13 INDONESIA Flood 110
14 INDONESIA Seismic Activity 77
I would like to make a triple bar chart and the label is based on the column 'Type'. I would also like to group the bar based on the column 'Country'.
I have tried using (with df as the DataFrame object of the pandas library),
df.groupby('Country').plot.bar()
but the result came out as multiple bar charts representing each group in the 'Country' column.
The expected output is similar to this:
What are the codes that I need to run in order to achieve this graph?
There are two ways -
df.set_index('Country').pivot(columns='Type').plot.bar()
df.set_index(['Country','Type']).plot.bar()
The following is only the beginning for an Coursera assignment on Data Science. I hope this is not to trivial for. But I am lost on this and could not find an answer.
I am asked to import an Excelfile into a panda dataframe and to manipulate it afterwards. The file can be found here: http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls
What makes it difficult for me is
a) there is an 'overhead' of 17 lines and a footer
b) the first two columns are empty
c) the index column has no header name
After hours if seraching and reading I came up with this useless line:
energy=pd.read_excel('Energy Indicators.xls',
sheetname='Energy',
header=16,
skiprows=[17],
skipfooter=38,
skipcolumns=2
)
This seems to produce a multindex dataframe. Though the command energy.head() returns nothing.
I have two questions:
what did I wrong. Up to this exercise I thought I understand the dataframe. But now I am totally clueless and lost :-((
How do I have to tackle this? What do I have to do to get this Exceldata into a datafrae with the index consisting of the countries?
Thanks.
I think you need add parameters:
index_col for convert column to index
usecols - parse columns by positions
change header position to 15
energy=pd.read_excel('Energy Indicators.xls',
sheet_name='Energy',
skiprows=[17],
skipfooter=38,
header=15,
index_col=[0],
usecols=[2,3,4,5]
)
print (energy.head())
Energy Supply Energy Supply per capita \
Afghanistan 321 10
Albania 102 35
Algeria 1959 51
American Samoa ... ...
Andorra 9 121
Renewable Electricity Production
Afghanistan 78.669280
Albania 100.000000
Algeria 0.551010
American Samoa 0.641026
Andorra 88.695650
I installed xlrd package, with pip install xlrd and then loaded the file successfully as follows:
In [17]: df = pd.read_excel(r"http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls",
...: sheetname='Energy',
...: header=16,
...: skiprows=[17],
...: skipfooter=38,
...: skipcolumns=2)
In [18]: df.shape
Out[18]: (227, 3)
In [19]: df.head()
Out[19]:
Energy Supply Energy Supply per capita \
NaN Afghanistan Afghanistan 321 10
Albania Albania 102 35
Algeria Algeria 1959 51
American Samoa American Samoa ... ...
Andorra Andorra 9 121
Renewable Electricity Production
NaN Afghanistan Afghanistan 78.669280
Albania Albania 100.000000
Algeria Algeria 0.551010
American Samoa American Samoa 0.641026
Andorra Andorra 88.695650
In [20]: pd.__version__
Out[20]: u'0.20.3'
In [21]: df.columns
Out[21]:
Index([u'Energy Supply', u'Energy Supply per capita',
u'Renewable Electricity Production'],
dtype='object')
Notice that I am using the last version of pandas 0.20.3 make sure you have the latest version on your system.
I modified your code and was able to get the data into the dataframe. Instead of skipcolumns (which did not work), I used the argument usecols as follows
energy=pd.read_excel('Energy_Indicators.xls',
sheetname='Energy',
header=16,
skiprows=[16],
skipfooter=38,
usecols=[2,3,4,5]
)
Unnamed: 2 Petajoules Gigajoules %
0 Afghanistan 321 10 78.669280
1 Albania 102 35 100.000000
2 Algeria 1959 51 0.551010
3 American Samoa ... ... 0.641026
4 Andorra 9 121 88.695650
In order to make the countries as the index, you can do the following
# Rename the column Unnamed: 2 to Country
energy = energy.rename(columns={'Unnamed: 2':'Country'})
# Change the index to country column
energy.index = energy['Country']
# Drop the extra country column
energy = energy.drop('Country', axis=1)