Instructions given by Professor:
1. Using the list of countries by continent from World Atlas data, load in the countries.csv file into a pandas DataFrame and name this data set as countries.
2. Using the data available on Gapminder, load in the Income per person (GDP/capita, PPP$ inflation-adjusted) as a pandas DataFrame and name this data set as income.
3. Transform the data set to have years as the rows and countries as the columns. Show the head of this data set when it is loaded.
4. Graphically display the distribution of income per person across all countries in the world for any given year (e.g. 2000). What kind of plot would be best?
In the code below, I have some of these tasks completed, but I'm having a hard time understanding how to acquire data from a DataFrame row. I want to be able to acquire data from a row and then plot it. It may seem like a trivial concept, but I've been at it for a while and need assistance please.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
countries = pd.read_csv('2014_data/countries.csv')
countries.head(n=3)
income = pd.read_excel('indicator gapminder gdp_per_capita_ppp.xlsx')
income = income.T
def graph_per_year(year):
    stryear = str(year)
    dfList = income[stryear].tolist()
graph_per_year(1801)
Pandas uses three types of indexing.
If you are looking to use integer indexing, you will need to use .iloc
df_1
Out[5]:
consId fan-cnt
0 1155696024483 34.0
1 1155699007557 34.0
2 1155694005571 34.0
3 1155691016680 12.0
4 1155697016945 34.0
df_1.iloc[1,:] #go to the row with index 1 and select all the columns
Out[8]:
consId 1.155699e+12
fan-cnt 3.400000e+01
Name: 1, dtype: float64
And to go to a particular cell, you can use something along the following lines,
df_1.iloc[1, 1] #row at position 1, column at position 1
Out[9]: 34.0
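You need to go through the documentation for the other types of indexing, namely .ix and .loc, as suggested by sohier-dane. Note that .loc is label-based, and that .ix was deprecated and has been removed from recent pandas versions (1.0 and later), so .loc and .iloc are the ones to learn.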
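To make the difference between position-based and label-based indexing concrete, here is a minimal sketch. It rebuilds df_1 from the output printed above; the variable names other than df_1 are only illustrative.

```python
import pandas as pd

# Rebuild the small frame shown above (values copied from the printed output).
df_1 = pd.DataFrame({
    "consId": [1155696024483, 1155699007557, 1155694005571,
               1155691016680, 1155697016945],
    "fan-cnt": [34.0, 34.0, 34.0, 12.0, 34.0],
})

# Position-based: the row at position 1, all columns.
row_by_position = df_1.iloc[1, :]

# Label-based: the row whose index label is 3, one named column.
cell_by_label = df_1.loc[3, "fan-cnt"]

# Single cell by position (row 1, column 1).
cell_by_position = df_1.iloc[1, 1]

print(row_by_position)
print(cell_by_label, cell_by_position)
```

With the default integer index, labels and positions happen to coincide, so .loc[3] and .iloc[3] return the same row here. They diverge as soon as the index holds something else, such as years or country names.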
To answer your first question, a bar graph with a year selector would be best. You'll have to keep countries on the x axis and per capita income on the y axis, and perhaps add a dropdown to select the year for which the graph is drawn.
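For the original problem of pulling one year out of the transposed table, a rough sketch follows. The small income frame is a made-up stand-in for the Gapminder data, and income_for_year is a hypothetical helper. It assumes the years end up as integer index labels after the transpose; if they load as strings, pass '2000' instead. A histogram is used here as one other common way to show a distribution, next to the bar graph suggested above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the transposed Gapminder table:
# one row per year, one column per country (values are made up).
income = pd.DataFrame(
    {
        "Chile": [700.0, 9500.0],
        "Spain": [1500.0, 25000.0],
        "USA": [2000.0, 45000.0],
    },
    index=[1801, 2000],
)


def income_for_year(frame, year):
    """Return one year's values as a Series indexed by country, NaNs dropped."""
    return frame.loc[year].dropna()


values = income_for_year(income, 2000)

# Bin the values; these counts are what a histogram would draw.
counts, bin_edges = np.histogram(values, bins=3)
print(values)
print(counts, bin_edges)

# With matplotlib installed, values.plot.hist(bins=3) draws these bins,
# and values.plot.bar() draws one bar per country.
```

The key point is that after the transpose the years are row labels, so a year is selected with .loc[year] rather than with income[year], which looks for a column.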
I have a DataFrame with the following Columns:
countriesAndTerritories = Name of the Countries (Only contains Portugal and Spain)
cases = Number of Covid Cases
This is what the DataFrame looks like:
We have 1 row per "dateRep".
I tried to create a BarChart with the following code:
df.plot.bar(x="countriesAndTerritories", y="cases", rot=70,
title="Number of Covid cases per Country")
The result is the following:
As you can see, instead of having the total number of cases per country (Portugal and Spain), I have multiple values on the x axis.
I've tried to investigate a little, but the examples I've found all used an inline df. So, if someone can help me, I'd appreciate it.
PS: I'm used to QlikSense, and what I'm trying to achieve would be something along these lines:
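A possible way to get one bar per country is to aggregate before plotting. This is only a sketch: the column names come from the question, while the dates and case counts are made up.

```python
import pandas as pd

# Hypothetical sample with the same column names as in the question;
# the dates and case counts are made up.
df = pd.DataFrame({
    "dateRep": ["01/05/2020", "02/05/2020", "01/05/2020", "02/05/2020"],
    "countriesAndTerritories": ["Portugal", "Portugal", "Spain", "Spain"],
    "cases": [100, 150, 400, 350],
})

# Collapse the daily rows into one total per country.
totals = df.groupby("countriesAndTerritories")["cases"].sum()
print(totals)

# With matplotlib installed, this draws one bar per country:
# totals.plot.bar(rot=70, title="Number of Covid cases per Country")
```

df.plot.bar draws one bar per row, which is why the original chart shows a bar for every dateRep. Summing per country first leaves exactly one row per country to plot.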
I'm dealing with a materials science dataset and I'm in the following situation,
I have data organized like this:
Chemical_Formula   Property_name        Property_Scalar
He                 Electrical conduc.   1
NO_2               Resistance           50
CuO3               Hardness
...                ...                  ...
CuO3               Fluorescence         300
He                 Toxicity             39
NO2                Hardness             80
...                ...                  ...
As you can see, it is really messy because the same chemical formula appears more than once in the dataset, each time paired with a different property. My question is: how can I easily split the dataset into smaller ones, matching every formula with its descriptors in ORDER? (I used fictional names and values, just to explain my problem.)
I'm on Jupyter Notebook and I'm using Pandas.
I'm editing my question trying to be more clear:
My goal would be to plot some histograms of (for example) n° materials vs conductivity at different temperatures (100K, 200K, 300K). So I need to have both conductivity and temperature for each material so that they are clearly comparable. For example, I guess that a more convenient layout would be:
Chemical formula   Conductivity   Temperature
He                 5              10K
NO_2               7              59K
CuO_3              10             300K
...                ...            ...
He                 14             100K
NO_2               5              70K
...                ...            ...
I think that this issue can be related to reshaping the dataset but I should also have each formula to MATCH exactly the temperature and conductivity. Thank you for your help!
If you want to plot Conductivity versus Temperature for a given formula, you can simply select the rows that match this condition.
import pandas as pd
import matplotlib.pyplot as plt
formula = 'NO_2'
subset = df.loc[df['Chemical_Formula'] == formula].sort_values('Temperature')
x = subset['Temperature'].values
y = subset['Conductivity'].values
plt.plot(x, y)
Here, we are defining the formula you want to extract. Then we are selecting only the rows in the DataFrame where the value in the column 'Chemical_Formula' matches your specified formula using df.loc[]. This returns a new DataFrame that is a subset of your original DataFrame and contains only the rows where our condition is satisfied. We sort this subset by 'Temperature' (I assume you want to plot Temperature on the x-axis) and store it as subset. We then select the 'Temperature' and 'Conductivity' columns, which return pandas.Series objects that we convert to numpy arrays by calling .values. We store these in the x and y variables and pass them to the matplotlib plot function.
EDIT:
To get from the first DataFrame to the second DataFrame described in your post, you can use the pivot function (assuming your first DataFrame is named df):
df = df.pivot(index='Chemical_Formula', columns='Property_name', values='Property_Scalar')
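One caveat with pivot: it raises a ValueError if the same (formula, property) pair appears more than once. A hedged alternative is pivot_table, which aggregates duplicates instead. The data below is made up in the layout of the question, and taking the mean of duplicates is an assumption, not something the question specifies.

```python
import pandas as pd

# Made-up long-format data in the layout from the question.
df = pd.DataFrame({
    "Chemical_Formula": ["He", "NO_2", "CuO_3", "CuO_3", "He", "NO_2"],
    "Property_name": ["Conductivity", "Resistance", "Hardness",
                      "Fluorescence", "Toxicity", "Hardness"],
    "Property_Scalar": [1.0, 50.0, None, 300.0, 39.0, 80.0],
})

# One row per formula, one column per property.
# Duplicate (formula, property) pairs would be averaged (an assumption).
wide = df.pivot_table(
    index="Chemical_Formula",
    columns="Property_name",
    values="Property_Scalar",
    aggfunc="mean",
)
print(wide)
```

Properties that were never measured for a formula show up as NaN in the wide table, which makes the gaps in the dataset easy to spot.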
I have the following df:
Country 2013 2014 2015 2016 2017
0 USA 40 30 20 30 30
1 Chile 1 2 4 6 1
So I need to plot the total Infected (the numbers under each year) over time, per country.
So I did:
grid = sns.FacetGrid(data=df, col="Country", col_wrap=5, hue="Country")
grid.map(plt.plot,)
But this is not going to work because each year is a column and I cannot pass that to the grid.map
Any ideas on how to do this?
I'm not sure exactly what kind of plot you wanted, but this is one way I got around your problem:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Country':['USA', 'Chile'],
'2013':[40,1],
'2014':[30,2],
'2015':[20,4],
'2016':[30,6],
'2017':[30,1]})
df = df.T # This will transpose our df: see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html
df.columns = df.iloc[0] #Set the row [0] as our header
df.drop(['Country'], inplace=True, axis=0) # Drop row [0] since we don't want it.
Right now, this is what our df looks like:
From our df we can call:
df.plot.bar()
plt.xticks(rotation=0)
And we get the desired plot:
Plot
PS: I can't post pictures yet, but please take a look at the links StackOverflow provides for them.
This code is one way of solving it, but you can definitely approach this with a different method. Remember that the plot is based on matplotlib, so you can customize it as such.
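As one such different method, the year columns can be melted into rows, which is the long layout that grid-style plotting tools generally expect. This is a sketch using the same sample data; the names long_df, Year and Infected are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "Chile"],
    "2013": [40, 1],
    "2014": [30, 2],
    "2015": [20, 4],
    "2016": [30, 6],
    "2017": [30, 1],
})

# Turn the year columns into rows: one (Country, Year, Infected) row per cell.
long_df = df.melt(id_vars="Country", var_name="Year", value_name="Infected")

# The year labels were column names (strings); make them numeric for plotting.
long_df["Year"] = long_df["Year"].astype(int)
print(long_df.head())
```

With the data in this shape, the original FacetGrid attempt should become workable, for example sns.FacetGrid(long_df, col="Country").map(plt.plot, "Year", "Infected"), assuming seaborn and matplotlib are imported as sns and plt.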
import pandas as pd
import numpy as np
# Show the specified columns and save it to a new file
col_list= ["STATION", "NAME", "DATE", "AWND", "SNOW"]
df = pd.read_csv('Data.csv', usecols=col_list)
df.to_csv('filteredData.csv')
df['year'] = pd.DatetimeIndex(df['DATE']).year
df2016 = df[(df.year==2016)]
df_2016 = df2016.groupby(['NAME', 'DATE'])['SNOW'].mean()
df_2016.to_csv('average2016.csv')
How come my dates are not ordered correctly here? Row 12 should be on the top but it's on the bottom of May instead and same goes for row 25
The average of SNOW per NAME/month is also not being displayed on my excel sheet. Why is that? Basically, I'm trying to calculate the average SNOW for May in ADA 0.7 SE, MI US. Then calculate the average SNOW for June in ADA 0.7 SE, MI US. etc..
I've spent all day and this is all I have got... Any help will be appreciated. Thanks in advance.
original data
https://gofile.io/?c=1gpbyT
Please try
Data
df=pd.read_csv(r'directorywhere the data is\data.csv')
df
Working
df.dtypes  # Check the data type of each column
df.columns  # List the columns
df['DATE'] = pd.to_datetime(df['DATE'])  # Convert DATE from object (string) to datetime
df.set_index(df['DATE'], inplace=True)  # Set the date as the index (the DATE column is kept as well)
df['SNOW'] = df['SNOW'].fillna(0)  # Replace NaN with zeros and assign the result back (fillna returns a new Series). Note that mean() skips NaN by default, so filling with zeros changes the averages
df['SnowMean'] = df.groupby([df.index.month, df.NAME])['SNOW'].transform('mean')  # Group by month and name, compute the mean of SNOW, and store the result in a new column called SnowMean
df
Checking
df.assign(Month=df.index.month).loc[:, ['DATE', 'Month', 'SnowMean']]  # Add a Month column on the fly, then slice the relevant columns to check
I realize you have multiple years. If you wanted mean per month in each year, again extract the year and add it in the groups to groupby as follows
df['SnowMeanPerYearPerMonth']=df.groupby([df.index.month,df.index.year,df.NAME])['SNOW'].transform('mean')
df
Check again
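pd.set_option('display.max_rows', 999)  # Display up to 999 rows to check
df.assign(Month=df.index.month, Year=df.index.year).loc[:, ['DATE', 'Month', 'Year', 'SnowMean', 'SnowMeanPerYearPerMonth']]  # Add Month and Year columns on the fly, then slice the relevant columns to check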
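If the goal is a compact table with one average per station and month, rather than a new column on every row, something along these lines may help. The rows below are made up; only the column names and the station name come from the question. Sorting after converting DATE to datetime may also address the ordering problem, since dates stored as strings sort alphabetically rather than chronologically.

```python
import pandas as pd

# Made-up rows using the column names from the question.
df = pd.DataFrame({
    "NAME": ["ADA 0.7 SE, MI US"] * 4,
    "DATE": ["2016-05-10", "2016-05-02", "2016-06-01", "2016-06-15"],
    "SNOW": [0.0, 2.0, 1.0, 3.0],
})

# Real datetimes sort chronologically; strings sort character by character.
df["DATE"] = pd.to_datetime(df["DATE"])
df = df.sort_values(["NAME", "DATE"])

# Group by station and by calendar month (year-month period), then average.
monthly = (
    df.groupby(["NAME", df["DATE"].dt.to_period("M")])["SNOW"]
      .mean()
      .reset_index(name="SnowMean")
)
print(monthly)
```

Grouping by NAME and the full DATE, as in the question, gives one group per day, so the mean is just each day's own value. Grouping by the month period is what produces a per-month average.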
Can anyone please explain to me how the code below works? My question is: if the y variable only holds the price, how is the last call able to group by doors? I am not able to follow or debug the flow. Please let me know, as I am very new to this field.
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
y.groupby(df.Doors).mean()
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
print("The Doors")
print(df.Doors)
print("The Price")
print(y)
y.groupby(df.Doors).mean()
Try the above code and you will see it: the index positions where df.Doors is 4, together with the prices at those same index positions in y, are treated as one group and their mean is taken. The same happens for the rows where df.Doors is 2, which form the other group.
It works because y is a pandas series, in which the values are prices but also has the index that it had in the df. When you do df.Doors you get a series with different values, but the same indexes (since an index is for the whole row). By comparing the indexes, pandas can perform the group by.
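The index alignment can be checked on a tiny example. The four rows below are made up and merely stand in for the cars dataset.

```python
import pandas as pd

# Made-up stand-in for the cars data: four cars, two door counts.
df = pd.DataFrame({
    "Price": [10000.0, 20000.0, 30000.0, 40000.0],
    "Doors": [2, 4, 2, 4],
})

y = df["Price"]

# Both Series carry the row index of df, so pandas can match them up.
print(y.index.equals(df["Doors"].index))

# Group the prices by the door count found at the same index label.
by_doors = y.groupby(df["Doors"]).mean()
print(by_doors)
```

Rows 0 and 2 have 2 doors, so their prices form one group; rows 1 and 3 form the other. y never needs a Doors column of its own, because the grouping key is looked up through the shared index.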
It loads the popular cars dataset into the dataframe df and assigns the Price column of the dataset to the variable y.
I would recommend you to get a general understanding of the data you loaded with the following commands:
df.info()
#shows you the range of the index as
#well as the data type of the columns
df.describe()
#shows common stats like mean or median
df.head()
#shows you the first 5 rows
The groupby command packs the rows (also called observations) of the cars dataframe df by the number of doors. And shows you the average price for cars with 2 doors or 4 doors and so on.
Check the output by adding a print() around the last line of code
edit: sorry, I answered too fast; I thought you asked for a general explanation of the code and not why it works