Not able to understand the below pandas code - python

can anyone please explain me how the below code is working? My Question is like if y variable has only price than how the last function is able to grouby doors? I am not able to get the flow and debug the flow. Please let me know as i am very new to this field.
import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
y.groupby(df.Doors).mean()

import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
y = df['Price']
print("The Doors")
print(df.Doors)
print("The Price")
print(y)
y.groupby(df.Doors).mean()
Try the above code you will understand the position or the index where the "df.Doors" given 4 and the price at that index in "y" are considered as one group and mean is taken, same is for 2 doors in "df.Doors" the other group.

It works because y is a pandas series, in which the values are prices but also has the index that it had in the df. When you do df.Doors you get a series with different values, but the same indexes (since an index is for the whole row). By comparing the indexes, pandas can perform the group by.

It loads the popular cars dataset to the dataframe df and assigns the colum price of the dataset to the variable y.
I would recommend you to get a general understanding of the data you loaded with the following commands:
df.info()
#shows you the range of the index as
#well as the data type of the colums
df.describe()
#shows common stats like mean or median
df.head()
#shows you the first 5 rows
The groupby command packs the rows (also called observations) of the cars dataframe df by the number of doors. And shows you the average price for cars with 2 doors or 4 doors and so on.
Check the output by adding a print() around the last line of code
edit: sorry I answered to fast, thought u asked for a general explanation of the code and not why is it working

Related

pandas computing new column as a average of other two conditions

So I have this dataset of temperatures. Each line describe the temperature in celsius measured by hour in a day.
So, I need to compute a new variable called avg_temp_ar_mensal which representsthe average temperature of a city in a month. City in this dataset is represented as estacao and month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average of the temperature of the city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It will works, but it is wrong. It will calculate for every city of the dataset and I don't want it because it will cause noise in my data. I need to separate each temperature based on month and city and then calculate the mean.
The dataframe after groupby is smaller than the initial dataframe, that is why your code run into error.
There is two ways to solve this problem. The first one is using transform as:
df.groupby(['mes', 'estacao'])['temp_ar'].transform(lambda g: g.mean())
The second is to create a new dfn from groupby then merge back to df
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left']
You are calling a groupby on a single column when you are doing df2['temp_ar'].groupby(...). This doesn't make much sense since in a single column, there's nothing to group by.
Instead, you have to perform the groupby on all the columns you need. Also, make sure that the final output is a series and not a dataframe
df['new_column'] = df[['city_column', 'month_column', 'temp_column']].groupby(['city_column', 'month_column']).mean()['temp_column']
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df

Pandas - How do you find the top n elements of 1 column based on a condition from another column

I am struggling with a question based on Pandas. I have an earthquake data set with columns of countries and magnitudes. I am asked to:
"Find the top 10 states / countries where the strongest and weakest
earthquakes occurred."
From this question, I garnered that I am meant to find the top 10 countries ["country"] with the highest values (value_counts) , but sorting by magnitude ["mag"].
How would I go about doing this? I've looked around but there's nothing I've found about this online.
Are you sure you did not find something useful? If I understand your question correct, it is a simple one. After creating a dataframe by using below methods, you will get what you need.
import pandas as pd
df = pd.read_csv(".csv")
df.nlargest(x, ['Column Name'])
x is the number of elements which are the largest ones.
Same is goes for nsmallest too. Just use these:
DataFrame.nsmallest(n, columns, keep='first')
DataFrame.nlargest(n, columns, keep='first')
Please read and check the documentation first.

How can I create a multi-indexed pandas dataframe within a for loop?

I have a weekly time-series of multiple varibles and I am trying to view what percentrank the last 26week correlation would be in vs. all previous 26week correlations.
So I can generate a correlation matrix for the first 26wk period using the pd.corr function in pandas, but I dont know how I can loop through all previous periods too find the different values for these correlations to then rank.
I hope there is a better way to achieve this if so please let me know
I have tried calculating parallel dataframes but i couldnt write a formula to rank the most recent - so i beleive that the solution lays with multi-indexing.
'''python
daterange = pd.date_range('20160701', periods = 100, freq= '1w')
np.random.seed(120)
df_corr = pd.DataFrame(np.random.rand(100,5), index= daterange, columns = list('abcde'))
df_corr_chg=df_corr.diff()
df_corr_chg=df_corr_chg[1:]
df_corr_chg=df_corr_chg.replace(0, 0.01)
d=df_corr_chg.shape[0]
df_CCC=df_corr_chg[::-1]
for s in range(0,d-26):
i=df_CCC.iloc[s:26+s]
I am looking for a multi-indexed table showing the correlations at different times
Example of output
e.g. (formatting issues)
a b
a 1 1 -0.101713
2 1 -0.031109
n 1 0.471764
b 1 -0.101713 1
2 -0.031109 1
n 0.471764 1
Here is a receipe how you could approach the problem.
I assume, you have one price per week (otherwise just preaggregate your dataframe).
# in case you your weeks are not numbered
# Sort your dataframe for symbol (EUR, SPX, ...) and week descending.
df.sort_values(['symbol', 'date'], ascending=False, inplace=True)
# Now add a pseudo
indexer= df.groupby('symbol').cumcount() < 26
df.loc[indexer, 'pricecolumn'].corr()
One more hint, in case you need to preaggregate your dataframe. You could add another aux column with the week number in your frame like:
df['week_number']=df['datefield'].dt.week
Then I guess you would like to have the last price of each week. You could do that as follows:
df_last= df.sort_values(['symbol', 'week_number', 'date'], ascending=True, inplace=False).groupby(['symbol', 'week_number']).aggregate('last')
df_last.reset_index(inplace=True)
Then use df_last in in place of the df above. Please check/change the field names, I assumed.

Unable to Plot using Seaborn

Hi there My dataset is as follows
username switch_state time
abcd sw-off 07:53:15 +05:00
abcd sw-on 07:53:15 +05:00
Now using this i need to find that on a given day how many times in a day the switch state is manipulated i.e switch on or switch off. My test code is given below
switch_off=df.loc[df['switch_state']=='sw-off']#only off switches
groupy_result=switch_off.groupby(['time','username']).count()['switch_state'].unstack#grouping the data on the base of time and username and finding the count on a given day. fair enough
the result of this groupby clause is given as
print(groupy_result)
username abcd
time
05:08:35 3
07:53:15 3
07:58:40 1
Now as you can see that the count is concatenated in the time column. I need to separate them so that i can plot it using Seaborn scatter plot. I need to have the x and y values which in my case will be x=time,y=count
Kindly help me out that how can i plot this column.
`
You can try the following to get the data as a DataFrame itself
df = df.loc[df['switch_state']=='sw-off']
df['count'] = df.groupby(['username','time'])['username'].transform('count')
The two lines of code will give you an updated data frame df, which will add a column called count.
df = df.drop_duplicates(subset=['username', 'time'], keep='first')
The above line will remove the duplicate rows. Then you can plot df['time'] and df['count'].
plt.scatter(df['time'], df['count'])

Python interpolate throws no errors - but also does nothing

I am trying some DataFrame manipulation in Pandas that I have learnt. The dataset that I am playing with is from the EY Data Science Challenge.
This first part may be irrelevant but just for context - I have gone through and set some indexes:
import pandas as pd
import numpy as np
# loading the main dataset
df_main = pd.read_csv(filename)
'''Sorting Indexes'''
# getting rid of the id column
del df_main['id']
# sorting values by LOCATION and GENDER columns
# setting index to LOCATION (1st tier) then GENDER (2nd tier) and then re-
#sorting
df_main = df_main.sort_values(['LOCATION','TIME'])
df_main = df_main.set_index(['LOCATION','TIME']).sort_index()
The problem I have is with the missing values - I have decided that columns 7 ~ 18 can be interpolate because a lot of the data is very consistent year by year.
So I made a simple function to take in a list of columns and apply the interpolate function for each column.
'''Missing Values'''
x = df_main.groupby("LOCATION")
def interpolate_columns(list_of_column_names):
for column in list_of_column_names:
df_main[column] = x[column].apply(lambda x: x.interpolate(how = 'linear'))
interpolate_columns( list(df_main.columns[7:18]) )
However, the problem I am getting is one of the columns (Access to electricity (% of urban population with access) [1.3_ACCESS.ELECTRICITY.URBAN]) seems to not be interpolating when all the other columns are interpolated successfully.
I get no errors thrown when I run the function, and it is not trying to interpolate backwards either.
Any ideas regarding why this problem is occurring?
EDIT: I should also mention that the column in question was missing the same number of values - and in the same rows - as many of the other columns that interpolated successfully.
After looking at the data more closely, it seems like interpolate was not working on some columns because I was missing data at the first rows of the group in the groupby object.

Categories

Resources