I need to create a new column in my df that holds the mean of another existing column, but I need it to take into account each individual location over time rather than the mean of all the values in the existing column.
Based on the sample dataset below, what I am looking for is a new column that contains the Mean for each Site, not the mean of all the values independent of Site.
Sample Dataset
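Use groupby and agg to get the mean of that column per Site, then merge it back onto df (the new column is named TIME_HOUR_MEAN here, an illustrative name, so it does not clash with the existing TIME_HOUR):
df = df.merge(df.groupby('Site', as_index=False).agg(TIME_HOUR_MEAN=('TIME_HOUR', 'mean')), on='Site', how='left')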
Use groupby:
df.groupby('Site')['TIME_HOUR'].mean().reset_index()
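Then assign the result to a column. Because it has one row per Site rather than one per original row, merge or map it back onto df instead of assigning it directly.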
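One way to do the "assign to a column" step is to map the per-site means back onto the rows. This is only a minimal sketch: the sample values and the column name TIME_HOUR_MEAN are made up for illustration, and only Site and TIME_HOUR come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Site": ["A", "A", "B", "B", "B"],
    "TIME_HOUR": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One mean per Site, returned as a Series indexed by Site.
site_means = df.groupby("Site")["TIME_HOUR"].mean()

# Look up each row's Site in that Series, giving a column aligned with df.
df["TIME_HOUR_MEAN"] = df["Site"].map(site_means)
print(df)
```

Every row keeps its original TIME_HOUR and gains the mean of its own Site, so the frame keeps its original length.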
Related
I have a pyspark dataframe with columns "A", "B", "C", and "D". I want to add a column holding the row-wise mean, where the columns to average are taken from a list, l=["A","C"].
The reason for the list is that the column names and their number might vary, so I need it to be flexible. For example, I might want the row-level mean for cols l=["A","B","C"] or just l=["A","D"].
Finally, I want this mean column to be appended to the original pyspark dataframe.
How do I code this in pyspark?
When you say you want the mean, I assume that you want the arithmetic mean.
In that case, it's really simple. You can create a function like this:
from pyspark.sql import functions as F
def arithmetic_mean(*cols):
    return sum(F.col(col) for col in cols)/len(cols)
Assuming df is your dataframe, you simply use it like this (for a list such as l=["A","C"], unpack it with arithmetic_mean(*l)):
df.withColumn("mean", arithmetic_mean("A", "C"))
I have a data frame like this (Main DataFrame).
How can I extract a new data frame from it, like this one:
Second DataFrame
I know df['Group'].value_counts().index shows me counts of each 'Group', but I don't know how to create a new dataframe from it.
You can get a new dataframe with the counts by using groupby, for example:
df.groupby("Group", as_index=False).size()
Other interesting options for groupby that you might want to use depending on use case are dropna=False and sort=False.
To rename the size column to Count, you can use rename: .rename(columns={'size': 'Count'})
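Putting the two steps together, here is a minimal sketch. The Group values are made up, and it assumes pandas 1.1 or later, where size() with as_index=False returns a DataFrame with a size column.

```python
import pandas as pd

# Hypothetical data; only the column name "Group" comes from the question.
df = pd.DataFrame({"Group": ["x", "y", "x", "z", "x", "y"]})

# Count rows per group, keep Group as a regular column, and rename the
# automatically named "size" column to "Count".
counts = (
    df.groupby("Group", as_index=False)
      .size()
      .rename(columns={"size": "Count"})
)
print(counts)
```

The result is an ordinary DataFrame with one row per group, so it can be filtered, sorted, or merged like any other.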
I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, is there a better way to generate this new DataFrame?
The result is a Series, not a DataFrame. To get a one-column DataFrame, use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
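To see the difference, here is a small sketch with made-up data. The column names a to d and the replacement name row_max are hypothetical.

```python
import pandas as pd

# Hypothetical four-column frame standing in for df1.
df1 = pd.DataFrame({
    "a": [1, 5, 3],
    "b": [4, 2, 6],
    "c": [7, 0, 1],
    "d": [2, 9, 4],
})

# max(axis=1) reduces across columns and returns a Series (one value per row).
row_max = df1.max(axis=1)
print(type(row_max).__name__)

# to_frame turns that Series into a one-column DataFrame with a chosen name.
df2 = row_max.to_frame("maximum")
print(df2.dtypes)

# The column can now be inspected and renamed like any DataFrame column.
df2 = df2.rename(columns={"maximum": "row_max"})
```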
I have two columns in a data frame that I want to merge together. The attached image shows the columns:
Image of the two columns I want to merge
I want the "precio_uf_y" column to take precedence over the "precio_uf_x" column in a new column, but if there is a NaN value in the "precio_uf_y" column, I want the value from the "precio_uf_x" column to go into the new column. My ideal new merged column would look like this:
Desired new column
I have tried different merge functions, and taking min and max with numpy, but maybe there is a way to write a function with these parameters?
Thank you in advance for any help.
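You can use df.apply:
import numpy as np

def get_new_val(x):
    # Fall back to precio_uf_x when precio_uf_y is missing.
    if np.isnan(x.precio_uf_y):
        return x.precio_uf_x
    else:
        return x.precio_uf_y

# axis=1 passes each row to the function.
df["new_precio_uf"] = df.apply(get_new_val, axis=1)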
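A vectorized alternative is Series.combine_first, which is a different technique from the apply version above: it takes values from the first Series and fills its missing entries from the second. This is a minimal sketch with made-up prices; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names come from the question.
df = pd.DataFrame({
    "precio_uf_x": [10.0, 20.0, np.nan],
    "precio_uf_y": [np.nan, 25.0, 30.0],
})

# Prefer precio_uf_y; where it is NaN, take the value from precio_uf_x.
df["new_precio_uf"] = df["precio_uf_y"].combine_first(df["precio_uf_x"])
print(df)
```

This avoids calling a Python function once per row, which tends to matter on larger frames.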
I have a pandas DataFrame where the first column is a country label and the second column contains a number. Most countries are in the list multiple times. I want to do 2 operations:
Calculate the mean for every country
Append the mean of every country as a third column
Perform a groupby on 'Country' and use transform to apply a function to each group. The result has the same index as the original df, so it lines up row by row:
df.groupby('Country').transform('mean')
See the online docs: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation
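Because the transformed result is aligned with the original index, it can be assigned straight to a third column. A minimal sketch, where the column names Value and Mean and the data are made up for illustration:

```python
import pandas as pd

# Hypothetical data: a country label and a number, with repeated countries.
df = pd.DataFrame({
    "Country": ["FR", "DE", "FR", "DE", "IT"],
    "Value": [10, 20, 30, 40, 50],
})

# transform("mean") returns one value per original row (the mean of that
# row's country), so it can be assigned directly as a new column.
df["Mean"] = df.groupby("Country")["Value"].transform("mean")
print(df)
```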
Try something like this to get the mean for every country (the result has one row per country):
df.groupby(['Country']).mean()