Set column value as the mean of a group in pandas - python

I have a data frame with columns X, Y, temperature, label.
label is an integer between 1 and 9.
I want to add an additional column my_label_mean_temperature which will contain, for each row, the mean of the temperatures of the rows that have the same label.
I'm pretty sure I need to start with my_df.groupby('label'), but I'm not sure how to calculate the mean of temperature and propagate the values onto all the rows of my original data frame.

Your problem can be solved with the transform method of pandas.
You could try something like this:
df['my_label_mean_temperature'] = df.groupby('label')['temperature'].transform('mean')

Something like this?
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'x': np.random.rand(19),
                        'y': np.arange(19),
                        'temp': [22,33,22,55,3,7,55,1,33,4,5,6,7,8,9,4,3,6,2],
                        'label': [1,2,3,4,2,3,9,3,2,9,2,3,9,4,1,2,9,7,1]})
df['my_label_mean_temperature'] = df.groupby(['label'], sort=False)['temp'].transform('mean')

means = df.groupby('label', as_index=False)['temperature'].mean()
df = df.merge(means.rename(columns={'temperature': 'my_label_mean_temperature'}), on='label')
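To see why transform (rather than a plain groupby mean) is the natural fit here, a minimal runnable sketch with invented values and the column names from the question: transform returns a result aligned to the original rows, so it can be assigned straight back as a new column.

```python
import pandas as pd

df = pd.DataFrame({
    'X': [0.1, 0.2, 0.3, 0.4],
    'Y': [1, 2, 3, 4],
    'temperature': [10.0, 20.0, 30.0, 40.0],
    'label': [1, 2, 1, 2],
})

# transform('mean') returns a Series aligned to the original index,
# so each row gets the mean temperature of its own label group
df['my_label_mean_temperature'] = (
    df.groupby('label')['temperature'].transform('mean')
)
print(df['my_label_mean_temperature'].tolist())  # [20.0, 30.0, 20.0, 30.0]
```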

Related

Adding random values in column depending on other columns with pandas

I have a dataframe with the columns "OfferID", "SiteID" and "CategoryID", which should represent an online ad on a website. I then want to add a new column called "NPS" for the net promoter score. The values should be given randomly between 1 and 10, but where the OfferID, the SiteID and the CategoryID are the same, they need to have the same value for the NPS. I thought of using a dictionary where the NPS is the key and the pairs of different IDs are the values, but I haven't found a good way to do this.
Are there any recommendations?
Thanks in advance.
Alina
The easiest would be to first remove all duplicates; you can do this using:
uniques = df[['OfferID', 'SiteID', 'CategoryID']].drop_duplicates(keep='first')
Afterwards, you can do something like this (note that your random values are not necessarily unique):
uniques['NPS'] = [random.randint(1, 10) for x in uniques.index]
And then:
df = df.merge(uniques, on=['OfferID', 'SiteID', 'CategoryID'], how='left')
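Put together end to end, the drop_duplicates + merge approach looks like this (toy IDs invented for illustration):

```python
import random
import pandas as pd

df = pd.DataFrame({
    'OfferID':    [1, 1, 2, 2, 3],
    'SiteID':     [10, 10, 10, 20, 20],
    'CategoryID': [5, 5, 6, 6, 7],
})

keys = ['OfferID', 'SiteID', 'CategoryID']
uniques = df[keys].drop_duplicates(keep='first')

# one random NPS per unique key combination (1..10, as in the question);
# different combinations may still draw the same score by chance
uniques['NPS'] = [random.randint(1, 10) for _ in range(len(uniques))]

df = df.merge(uniques, on=keys, how='left')

# rows 0 and 1 share the same (OfferID, SiteID, CategoryID) triple,
# so they must have received the same NPS
assert df.loc[0, 'NPS'] == df.loc[1, 'NPS']
```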

How to select the first column of a dataset?

I am trying to get the first column of dataset to calculate the summary of the data such as mean, median, variance, stdev etc...
This is how I read my csv file
wine_data = pd.read_csv('winequality-white.csv')
I tried to select the first column in two ways
first_col = wine_data[wine_data.columns[0]]
wine_data.iloc[:,0]
But I get this whole result:
0       7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
1       6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9...
2       8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;1...
...
4896    5.5;0.29;0.3;1.1;0.022;20;110;0.98869;3.34;0.3...
4897    6;0.21;0.38;0.8;0.02;22;98;0.98941;3.26;0.32;1...
Name: fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality", Length: 4898, dtype: object
How can I just select the first columns such as 7,6.3,8.1,5.5,6.0.
You might use the following:
#to see all columns
df.columns
#Selecting one column
df['column_name']
#Selecting multiple columns
df[['column_one', 'column_two','column_four', 'column_seven']]
Or, if you prefer, you can use df.iloc
You can try this (.ix has been removed from recent pandas, so use .iloc instead):
first_col = wine_data.iloc[:, 0]
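One detail worth adding: the output above shows the whole file landing in a single column, which means the csv is semicolon-delimited, so read_csv needs sep=';'. A self-contained sketch, with an inline string standing in for winequality-white.csv:

```python
import io
import pandas as pd

# inline stand-in for winequality-white.csv, which uses ';' as its delimiter
csv_text = '"fixed acidity";"volatile acidity";"quality"\n7;0.27;6\n6.3;0.3;6\n'

wine_data = pd.read_csv(io.StringIO(csv_text), sep=';')

first_col = wine_data.iloc[:, 0]   # or wine_data[wine_data.columns[0]]
print(first_col.tolist())          # [7.0, 6.3]
print(first_col.mean())            # summary stats now work on real numbers
```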

Create a matrix with a set of ranges in columns and a set of ranges in rows with Pandas

I have a data frame in which one column 'F' has values from 0 to 100 and a second column 'E' has values from 0 to 500. I want to create a matrix of frequencies of values that fall within ranges in both 'F' and 'E'. For example, I want to know the frequency in the range 20 to 30 for 'F' and the range 400 to 500 for 'E'.
What I expect to have is the following matrix:
matrix of ranges
I have tried to group ranges using pd.cut() and groupby() but I don't know how to join data.
I really appreciate your help in creating the matrix with pandas.
You can use the cut function to create the bin "tag/name" for each column.
After that you can pivot the data frame.
df['rows'] = pd.cut(df['F'], 5)
df['cols'] = pd.cut(df['E'], 5)
df = df.groupby(['rows', 'cols']).agg('sum').reset_index([0, 1], False)  # your agg func here
df = df.pivot(columns='cols', index='rows')
So this is the way I found to create the matrix, which was obviously inspired by @usher's answer. I know it's more convoluted, but I wanted to share it. Thanks again @usher
E = df.E
F = df.F
bins_E = pd.cut(E, bins=int((max(E) - min(E)) / 100))
bins_F = pd.cut(F, bins=int((max(F) - min(F)) / 10))
bins_EF = bins_E.to_frame().join(bins_F)
freq_EF = bins_EF.groupby(['E', 'F']).size().reset_index(name="counts")
Mat_FE = freq_EF.pivot(columns='E', index='F')

Compute change from baseline by id

I have some longitudinal data with animal_id by study day, as shown below (table not included):
How would I create a column which computes the change from baseline for each animal_id? Here, baseline would be where ord = 0.
Using transform with 'first'; note this assumes your df is sorted already:
df['New'] = df['Body_Weight'] - df.groupby('Animal_id')['Body_Weight'].transform('first')
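Sketched on toy data (values invented), that one-liner works as long as the frame is sorted so the ord == 0 row comes first within each Animal_id:

```python
import pandas as pd

df = pd.DataFrame({
    'Animal_id':   [1, 1, 1, 2, 2, 2],
    'ord':         [0, 1, 2, 0, 1, 2],
    'Body_Weight': [100.0, 104.0, 110.0, 200.0, 195.0, 210.0],
})

# baseline per animal = the first row's weight (the ord == 0 row,
# since the frame is sorted within each Animal_id)
baseline = df.groupby('Animal_id')['Body_Weight'].transform('first')
df['New'] = df['Body_Weight'] - baseline
print(df['New'].tolist())  # [0.0, 4.0, 10.0, 0.0, -5.0, 10.0]
```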
To achieve this without sorting your data, you can build a surrogate data frame with the value of body weight for ord == 0 and then merge it into the original data frame.
df_ord = df.query("ord==0").rename(columns={'body_weight':'body_weight_base'})
df_ord = df_ord.drop('ord',axis=1)
df = df.merge(df_ord)
df['change'] = df['body_weight'] - df['body_weight_base']

Trying to divide a dataframe column by a float yields NaN

Background
I deal with a csv datasheet that prints out columns of numbers. I am working on a program that will take the first column, ask a user for a time as a float (i.e. 45 and a half hours = 45.5), and then subtract that number from the first column. I have been successful in that regard. Now I need to find the row index of the "zero" time point. I use min to find that index and then use it on the following column, A1.1. I need to find the reading at time 0 and then normalize A1.1 to it, so that on a graph the reading at the 0 time point is 1 in column A1.1 (and eventually all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line will correctly identify a series that I can pull off of for all my other columns. Next r1 correctly identifies the proper A1.1 value and this value is a float when I use type(r1).
However when I divide df[' A1.1']/r1 it yields only one correct value and that value is where r1/r1 = 1. All other values come out NaN.
My Questions:
How to divide a column by a float I guess? Why am I getting NaN?
Is there a faster way to do this, as I need to do it for 16 columns (i.e. 'A2/r2', 'A3/r3', etc.)?
Do I need to use inplace = True anywhere to make the operations stick prior to resaving the data? Or is that only for adding/deleting rows?
Example
Dataframe that looks like this:
http://i.imgur.com/ObUzY7p.png
zero time sets properly (image not shown)
after dividing the column:
http://i.imgur.com/TpLUiyE.png
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work is that r1 is a Series. Try r1? instead of type(r1) and you will see that r1 is a Series, not an individual float number.
To do it in one attempt, you have to iterate over each column, like this:
for c in df:
    df[c] = df[c] / df[c].min()
If you want to divide every value in the column by r1, it's best to use apply, for example:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4, 5])
# apply an anonymous function to the first column ([0]), dividing every
# value in the column by 3
df = df[0].apply(lambda x: x / 3.0)
print(df)
So you'd probably want something like this (with r1 as a plain float):
df = df["A1.1"].apply(lambda x: x / r1)
This really only answers part 2 of your question. Apply is probably your best bet for running a function on multiple rows and columns quickly. As for why you're getting NaNs when dividing by a float, is it possible the values in your columns are anything other than floats or integers?
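To answer the "why NaN" part concretely: dividing one Series by another aligns them on their index labels, so a one-row divisor produces NaN everywhere except its own label. Extracting the scalar first with .iloc[0] (a fix not spelled out in the answers above, but consistent with the Series explanation) behaves as intended:

```python
import pandas as pd

df = pd.DataFrame({'A1': [-2.0, 0.0, 3.0], 'A1.1': [4.0, 8.0, 16.0]})

zero_row = df[df['A1'] == df['A1'].min()]  # one-row DataFrame (index 0)
r1 = zero_row['A1.1']                      # still a Series with index [0]

aligned = df['A1.1'] / r1                  # index-aligned Series division
print(aligned.tolist())                    # [1.0, nan, nan]

r1_scalar = r1.iloc[0]                     # plain float
print((df['A1.1'] / r1_scalar).tolist())   # [1.0, 2.0, 4.0]
```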
