I came across this video and it bugs me.
Essentially, at 5:50, they calculate the Z-score for the whole data frame with the following snippet:
df_z = (df - df.describe().T['mean'])/df.describe().T['std']
It is a neat and beautiful line.
However, df.describe().T looks like this and df looks like this.
df.describe().T['mean'] and df.describe().T['std'] are two individual Series that take the df column names as their index (the describe statistics are the columns of df.describe().T), while df is an ordinary pd.DataFrame with a numerical index and the column names in their usual place.
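For concreteness, here is a minimal sketch of the shapes involved (made-up numbers; the comments reflect my understanding of how pandas aligns a Series against a DataFrame's columns):

import pandas as pd

# hypothetical stand-in for the video's df
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

stats = df.describe().T                   # index: 'a', 'b'; columns: count, mean, std, ...
mean, std = stats['mean'], stats['std']   # two Series indexed by df's column names

# DataFrame - Series arithmetic aligns the Series index with the frame's
# *column* labels, so each column of df meets its own mean and std
df_z = (df - mean) / std
print(df_z)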
My question is: how does that line make sense when the two indexes do not match at all? In particular, how is every variable value (x_i) matched with the mean and std of its own column?
Thank you.
So I have this dataset of temperatures. Each line describes the temperature in Celsius measured by hour in a day.
I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. In this dataset the city is represented as estacao and the month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average temperature for each city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but it is wrong: it averages over every city in the dataset, and I don't want that because it will add noise to my data. I need to separate the temperatures by month and city and then calculate the mean.
The dataframe produced by the groupby is smaller than the initial dataframe; that is why your code runs into an error.
There are two ways to solve this problem. The first is to use transform, which returns a result aligned with the original rows:
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new dfn from the groupby and then merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
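A quick toy check of both approaches (made-up temperatures, reusing the question's column names):

import pandas as pd

df = pd.DataFrame({
    'mes':     [1, 1, 1, 2],
    'estacao': ['A', 'A', 'B', 'A'],
    'temp_ar': [20.0, 22.0, 30.0, 18.0],
})

# transform keeps one value per original row
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')

# the merge route produces the same values in a column named 'average'
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')

print(df)   # avg_temp_ar_mensal and average agree: 21.0, 21.0, 30.0, 18.0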
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...). This doesn't make much sense, since within a single column there is nothing to group by.
Instead, perform the groupby on the dataframe with all the columns you need, and make sure that the final output is a Series and not a DataFrame:
df['new_column'] = df.groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick, if I understand your dataset correctly. If not, please provide a reproducible version of your df.
I am struggling with a question based on Pandas. I have an earthquake data set with columns of countries and magnitudes. I am asked to:
"Find the top 10 states / countries where the strongest and weakest
earthquakes occurred."
From this question, I gathered that I am meant to find the top 10 countries ["country"] with the highest values (value_counts), sorted by magnitude ["mag"].
How would I go about doing this? I've looked around but there's nothing I've found about this online.
Are you sure you did not find anything useful? If I understand your question correctly, it is a simple one. After creating a dataframe, the methods below will get you what you need.
import pandas as pd
df = pd.read_csv(".csv")
df.nlargest(x, ['Column Name'])
Here x is the number of largest elements you want.
The same goes for nsmallest. Their signatures are:
DataFrame.nsmallest(n, columns, keep='first')
DataFrame.nlargest(n, columns, keep='first')
Please read and check the documentation first.
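Applied to the question's frame (assuming the columns are literally named "country" and "mag", as in the post), a sketch might look like:

# df is the frame read above; 'mag' holds the magnitudes
strongest = df.nlargest(10, 'mag')[['country', 'mag']]    # 10 strongest quakes
weakest = df.nsmallest(10, 'mag')[['country', 'mag']]     # 10 weakest quakes
print(strongest)
print(weakest)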
I am having trouble going from a long format to a wide one, in pandas. There are plenty of examples going from wide to long, but I did not find one from long to wide.
I am trying to reformat my dataframe and pivot, groupby, unstack are a bit confusing for my use case.
This is how I want it to be. The numbers are actually the intensity column from the second image.
And this is how it is now
I tried to build a MultiIndex based on Peptide, Charge and Protein. Then I tried to pivot based on that multi index, and keep all the samples and their intensity as values:
df.set_index(['Peptide', 'Charge', 'Protein'], append=False)
df.pivot(index=df.index, columns='Sample', values='Intensity')
Of course, this does not work since my index is now a combination of the 3 and not an actual column in the dataframe.
It tells me
KeyError: None of [RangeIndex(start=0, stop=3397898, step=1)] are in the [columns]
I tried also to group by, but I am not sure how to move from the long format back to wide. I am quite new to the dataframe way of thinking and I want to learn how to do this right.
It was very tempting for me to fall back on an old-school, "Java"-like approach with four for loops, building the result as a matrix. Thank you in advance!
I think based on your attempt that this might work:
df2 = df.pivot(index=['Peptide', 'Charge', 'Protein'], columns='Sample', values='Intensity').reset_index()
After that, if you want to remove the name from the column axis:
df2 = df2.rename_axis(None, axis=1)
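A toy long-format frame to sanity-check the call (made-up peptides and intensities):

import pandas as pd

long_df = pd.DataFrame({
    'Peptide':   ['AAK', 'AAK', 'GGR', 'GGR'],
    'Charge':    [2, 2, 3, 3],
    'Protein':   ['P1', 'P1', 'P2', 'P2'],
    'Sample':    ['s1', 's2', 's1', 's2'],
    'Intensity': [1.0, 2.0, 3.0, 4.0],
})

# one row per (Peptide, Charge, Protein), one column per Sample
wide = (long_df
        .pivot(index=['Peptide', 'Charge', 'Protein'],
               columns='Sample', values='Intensity')
        .reset_index()
        .rename_axis(None, axis=1))
print(wide)

Note that pivot with a list for index requires pandas >= 1.1.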
Using the following code I can build a simple table with the current COVID-19 cases worldwide, per country:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
raw_data = pd.read_csv(url, sep=",")
raw_data.drop(['Province/State','Lat','Long'], axis = 1, inplace = True)
plot_data = raw_data.groupby('Country/Region').sum()
The plot_data is a simple DataFrame:
What I would like to do now is to subtract from the values in each column the values of the previous day's column, i.e., I want to get the new cases per day.
If I do something like plot_data['3/30/20'].add(-plot_data['3/29/20']), it works well. But if I do something like plot_data.iloc[:,68:69].add(-plot_data.iloc[:,67:68]), I get two columns full of NaN values. That is, pandas tries to "preserve" the column headers and doesn't perform the operation the way I would like it to.
My goal was to perform this operation in an "elegant" way. I was thinking of something along the lines of plot_data.iloc[:,1:69].add(-plot_data.iloc[:,0:68]). But of course, since it doesn't work in the single-column example, it doesn't work with multiple columns either (pandas matches the column headers and returns a bunch of zeros/NaN values).
Maybe there is a way to tell pandas to ignore the headers during an operation on a DataFrame? I know that I can transform my DataFrame into a NumPy array and do the operations there. However, since this is a simple/small table, I thought I would try to keep using the DataFrame type.
The good old shift can be used on the horizontal axis:
plot_data - plot_data.shift(-1, axis=1)
should be what you want.
Thank you very much @Serge Ballesta! Your answer is exactly the type of "elegant solution" I was looking for. The only comment is that the shift sign should be positive:
plot_data - plot_data.shift(1, axis=1)
This way we bring the historical figures forward one day, and then I can subtract them from the actual numbers on each day.
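For what it's worth, the built-in diff computes exactly this column-over-column difference in one call:

daily_new = plot_data.diff(axis=1)   # same as plot_data - plot_data.shift(1, axis=1)

As with the shift version, the first date's column comes out as NaN, since there is no prior day to subtract.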
I'm new to Pandas.
I've got a dataframe where I want to group by user and then, for each row, find that user's lowest value in the speed column up until that date.
So I can't just use df.groupby(['user'])['speed'].transform('min'), as this would give the min over all values, not just from the first row up to the current one.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly. The problem boils down to the following: you need to select the range of data you want to work with (so select the rows for the date range and the columns for the user/speed).
That would look something like x = df.loc["2-4-2018":"2-4-2019", ['users', 'speed']]
From there you could do a simple x['speed'].min() for the value or x['speed'].idxmin() for the index of the value.
I haven't played around with DataFrames for a while, but what you're looking for is how to slice DataFrames.
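For the running "lowest so far" per user specifically, here is a minimal sketch (assuming hypothetical columns user, date and speed) that sorts by date and then takes the cumulative minimum within each group:

import pandas as pd

df = pd.DataFrame({
    'user':  ['a', 'a', 'a', 'b', 'b'],
    'date':  pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03',
                             '2018-01-01', '2018-01-02']),
    'speed': [5.0, 3.0, 4.0, 7.0, 6.0],
})

# running minimum per user, from each user's first row up to the current one
df = df.sort_values(['user', 'date'])
df['min_so_far'] = df.groupby('user')['speed'].cummin()
print(df)   # user a: 5, 3, 3; user b: 7, 6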