Python interpolate throws no errors - but also does nothing

I am trying out some DataFrame manipulation in pandas that I have learnt. The dataset I am playing with is from the EY Data Science Challenge.
This first part may be irrelevant, but just for context - I have gone through and set some indexes:
import pandas as pd
import numpy as np
# loading the main dataset
df_main = pd.read_csv(filename)
'''Sorting Indexes'''
# getting rid of the id column
del df_main['id']
# sorting values by the LOCATION and TIME columns
# setting the index to LOCATION (1st tier) then TIME (2nd tier) and re-sorting
df_main = df_main.sort_values(['LOCATION','TIME'])
df_main = df_main.set_index(['LOCATION','TIME']).sort_index()
The problem I have is with the missing values - I have decided that columns 7 ~ 18 can be interpolated, because a lot of the data is very consistent year by year.
So I made a simple function that takes in a list of column names and applies interpolate to each column.
'''Missing Values'''
x = df_main.groupby("LOCATION")
def interpolate_columns(list_of_column_names):
    for column in list_of_column_names:
        df_main[column] = x[column].apply(lambda s: s.interpolate(method='linear'))
interpolate_columns(list(df_main.columns[7:18]))
However, the problem I am getting is that one of the columns (Access to electricity (% of urban population with access) [1.3_ACCESS.ELECTRICITY.URBAN]) does not seem to interpolate, while all the other columns interpolate successfully.
No errors are thrown when I run the function, and it is not trying to interpolate backwards either.
Any ideas regarding why this problem is occurring?
EDIT: I should also mention that the column in question was missing the same number of values - and in the same rows - as many of the other columns that interpolated successfully.

After looking at the data more closely, it seems interpolate was not working on some columns because data was missing in the first rows of those groups in the groupby object.
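That behaviour is easy to reproduce: linear interpolation needs a known value on each side of a gap, so NaNs at the start of a group are left untouched. A minimal sketch (limit_direction is one way to also fill the leading gap; whether that is appropriate depends on your data):
import pandas as pd
import numpy as np
s = pd.Series([np.nan, np.nan, 3.0, np.nan, 5.0])
# default linear interpolation: the two leading NaNs stay NaN
print(s.interpolate(method='linear'))  # NaN, NaN, 3.0, 4.0, 5.0
# limit_direction='both' also fills the leading gap (here with the first valid value, 3.0)
print(s.interpolate(method='linear', limit_direction='both'))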


pandas computing new column as an average of two other conditions

So I have this dataset of temperatures. Each line describes the temperature in Celsius, measured by hour, in a day.
So, I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. A city in this dataset is represented as estacao and a month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store, in a new column, the average temperature for each city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but it is wrong: it averages over every city in the dataset together, and I don't want that because it adds noise to my data. I need to separate each temperature by month and city and then calculate the mean.
The DataFrame you get after a groupby is smaller than the initial DataFrame, which is why your code runs into an error.
There are two ways to solve this problem. The first is to use transform, which keeps the original row count:
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new DataFrame dfn from the groupby and then merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
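For a quick check of the difference, here is a minimal sketch with made-up data in the shape the question describes (column names as in the question, values invented):
import pandas as pd
df = pd.DataFrame({
    'estacao': ['A', 'A', 'B', 'B'],
    'mes':     [1,   1,   1,   2],
    'temp_ar': [20.0, 22.0, 18.0, 25.0],
})
# transform keeps the original row count, so the assignment aligns by index
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
print(df)
# the two rows for ('A', mes 1) get 21.0; the single rows for B keep their own value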
You are calling a groupby on a single column when you are doing df2['temp_ar'].groupby(...). Note that df2['mes', 'estacao'] does not select two columns; it looks for a single column named with the tuple ('mes', 'estacao'), which is what fails.
Instead, perform the groupby on all the columns you need, and make sure the final output is a Series aligned with the original index:
df['new_column'] = df[['city_column', 'month_column', 'temp_column']].groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df

Not able to understand the below pandas code

Can anyone please explain how the code below works? My question is: if the variable y contains only Price, how is the last line able to group by Doors? I am not able to follow or debug the flow. Please let me know, as I am very new to this field.
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
y.groupby(df.Doors).mean()
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
print("The Doors")
print(df.Doors)
print("The Price")
print(y)
y.groupby(df.Doors).mean()
Try the code above and you will see how it works: at the positions where df.Doors is 4, the prices at those same index positions in y are treated as one group and their mean is taken; the rows where df.Doors is 2 form the other group.
It works because y is a pandas Series whose values are prices but which keeps the index it had in the DataFrame. When you do df.Doors you get a Series with different values but the same index (since an index belongs to the whole row). By aligning these indexes, pandas can perform the groupby.
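To see that index alignment at work, here is a minimal sketch with invented numbers, grouping one Series by another:
import pandas as pd
prices = pd.Series([10000, 12000, 9000, 15000], name='Price')  # plays the role of y
doors = pd.Series([2, 4, 2, 4], name='Doors')                  # plays the role of df.Doors
# both Series share the default index 0..3, so pandas can line them up
print(prices.groupby(doors).mean())
# Doors
# 2     9500.0
# 4    13500.0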
It loads the popular cars dataset into the DataFrame df and assigns the Price column of the dataset to the variable y.
I would recommend getting a general understanding of the data you loaded with the following commands:
df.info()
# shows the range of the index as
# well as the data types of the columns
df.describe()
# shows common stats like mean or median
df.head()
# shows the first 5 rows
The groupby command packs the rows (also called observations) of the cars DataFrame df by the number of doors, and shows you the average price for cars with 2 doors, 4 doors, and so on.
Check the output by adding a print() around the last line of code.
Edit: sorry, I answered too fast - I thought you asked for a general explanation of the code and not why it works.

Subtract each column by the preceding column on a DataFrame in Python

Using the following code I can build a simple table with the current COVID-19 cases worldwide, per country:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
raw_data = pd.read_csv(url, sep=",")
raw_data.drop(['Province/State','Lat','Long'], axis = 1, inplace = True)
plot_data = raw_data.groupby('Country/Region').sum()
plot_data is a simple DataFrame, indexed by country, with one column per date.
What I would like to do now is to subtract from each column the values of the column for the prior day - i.e., I want to get the new cases per day.
If I do something like plot_data['3/30/20'].add(-plot_data['3/29/20']), it works well. But if I do something like plot_data.iloc[:,68:69].add(-plot_data.iloc[:,67:68]), I get two columns full of NaN values; i.e. pandas tries to "preserve" the column headers and doesn't perform the operation the way I would like it to.
My goal was to perform this operation in an "elegant" way. I was thinking of something along the lines of plot_data.iloc[:,1:69].add(-plot_data.iloc[:,0:68]). But of course, if it doesn't work in the single-column example, it doesn't work with multiple columns either (pandas will match the column headers and return a bunch of NaN values).
Maybe there is a way to tell Python to ignore the headers during an operation with a DataFrame? I know that I can transform my DataFrame into a NumPy array and do a bunch of operations. However, since this is a simple/small table, I thought I would try to keep using a DataFrame data type.
The good old shift can be used on the horizontal axis:
plot_data - plot_data.shift(-1, axis=1)
should be what you want.
Thank you very much @Serge Ballesta! Your answer is exactly the type of "elegant solution" I was looking for. The only comment is that the shift sign should be positive:
plot_data - plot_data.shift(1, axis=1)
This way we bring the historical figures forward one day, and can subtract them from the actual numbers on each day.
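As a side note, pandas has a built-in for exactly this day-over-day difference: DataFrame.diff(axis=1) is equivalent to subtracting the shifted frame. A small sketch with invented figures:
import pandas as pd
demo = pd.DataFrame(
    {'3/28/20': [100, 50], '3/29/20': [120, 55], '3/30/20': [150, 70]},
    index=['CountryA', 'CountryB'],
)
# same result as demo - demo.shift(1, axis=1); the first column becomes NaN
print(demo.diff(axis=1))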

Z-score calculation/standardisation using pandas

I came across this video and it bugs me.
Essentially, at 5:50, they calculate the Z-score for the whole data frame with the following snippet:
df_z = (df - df.describe().T['mean']) / df.describe().T['std']
It is a neat and beautiful line.
However, df.describe().T takes the column names of df as its index and the describe statistics as its columns, so df.describe().T['mean'] and df.describe().T['std'] are two individual Series indexed by the column names of df, while df is an ordinary pd.DataFrame with a numerical index and the column names in the usual place.
My question is: how does that line make sense when their shapes do not match at all? In particular, how is it ensured that every value x_i is matched with the mean and std of its own column?
Thank you.
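For what it is worth, the line works because subtracting a Series from a DataFrame aligns the Series index against the DataFrame columns, then broadcasts down the rows. A minimal sketch with made-up numbers:
import pandas as pd
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
mean = df.describe().T['mean']  # Series indexed by the column names 'a' and 'b'
std = df.describe().T['std']
# 'a' and 'b' in the Series index are matched to the columns of df
df_z = (df - mean) / std
print(df_z)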

How to deal with KeyError: 0 or KeyError: 1, etc.

I am new to Python and this data science world, and I am trying to play with different datasets.
In this case I am using the housing price index from Quandl, but unfortunately I get stuck when I need to take the state abbreviations from the wiki page, always getting the same KeyError.
import quandl
import pandas as pd
# pull every single housing price index from quandl
# quandl api key
api_key = 'xxxxxxxxxxxx'
# get stuff from quandl
df = quandl.get('FMAC/HPI_AK', authtoken=api_key)  # Alaska
##print(df.head())
# get 50 states using pandas read_html from wikipedia
fifty_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
##print(fifty_states[0][1])  # first data frame is index 0, looking for column 1, from element 1 on
# get the quandl frannymac query names for each of the 50 states
for abbv in fifty_states[0][1][2:]:
    print('FMAC/HPI_' + str(abbv))
So the problem comes in the following step:
# get 50 states using pandas read_html from wikipedia
fifty_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
##print(fifty_states[0][1])  # first data frame is index 0, looking for column 1, from element 1 on
I have tried different ways to get just the abbreviations, but none of them work:
for abbv in fifty_states[0][1][2:]:
    print('FMAC/HPI_' + str(abbv))
for abbv in fifty_states[0][1][1:]:
    print('FMAC/HPI_' + str(abbv))
Always KeyError: 0.
I just need this step to work and to get the following output:
FMAC/HPI_AL,
FMAC/HPI_AK,
FMAC/HPI_AZ,
FMAC/HPI_AR,
FMAC/HPI_CA,
FMAC/HPI_CO,
FMAC/HPI_CT,
FMAC/HPI_DE,
FMAC/HPI_FL,
FMAC/HPI_GA,
FMAC/HPI_HI,
FMAC/HPI_ID,
FMAC/HPI_IL,
FMAC/HPI_IN,
FMAC/HPI_IA,
FMAC/HPI_KS,
FMAC/HPI_KY,
FMAC/HPI_LA,
FMAC/HPI_ME
for the 50 US states, and then to proceed with a data analysis of this data.
Can anybody tell me what I am doing wrong? Cheers.
Note that fifty_states is a list of DataFrames, filled with
content of tables from the source page.
The first of them (at index 0 in fifty_states) is the table of US states.
If you don't know the column names in a DataFrame (e.g. df),
then to get column 1 from it (numbering from 0), run:
df.iloc[:, 1]
So, since we want this column from fifty_states[0], run:
fifty_states[0].iloc[:, 1]
Your code failed because you attempted to apply [1] to this DataFrame,
but this DataFrame has no column named 1.
Note that e.g. fifty_states[0][('Cities', 'Capital')] gives a proper result,
because this DataFrame has a MultiIndex on columns, and one of the columns
has Cities at the first MultiIndex level and Capital at the second level.
And getting back to your code, run:
for abbv in fifty_states[0].iloc[:, 1]:
    print('FMAC/HPI_' + str(abbv))
Note that [2:] is not needed. You probably wanted to skip 2 initial rows
of the <table> HTML tag, containing column names,
but in Pandas they are actually kept in the MultiIndex on columns,
so to get all values, you don't need to skip anything.
If you want these strings as a list, for future use, the code can be:
your_list = ('FMAC/HPI_' + fifty_states[0].iloc[:, 1]).tolist()
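And here is a hedged sketch of the next step the question mentions - fetching each state's index and joining the frames into one (the column relabelling is an assumption about what the feed returns; adjust to the actual column names):
main_df = pd.DataFrame()
for query in your_list:
    state_df = quandl.get(query, authtoken=api_key)
    # prefix the columns with the query name so the states stay distinguishable
    state_df.columns = [query + ' ' + str(c) for c in state_df.columns]
    main_df = state_df if main_df.empty else main_df.join(state_df)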
