I have a pandas DataFrame where the first column is a country label and the second column contains a number. Most countries appear in the list multiple times. I want to do two operations:
Calculate the mean for every country
Append the mean of every country as a third column
Group by 'Country' and use transform to apply a function to each group; the result is aligned to the index of the original df:
df.groupby('Country').transform('mean')
See the online docs: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation
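To make the second operation concrete, here is a minimal sketch of appending the per-country mean as a third column. The column names "Value" and "Mean" and the sample rows are assumptions for illustration; the question does not give the real names.

```python
import pandas as pd

# Hypothetical sample data; the real column names are not given in the question.
df = pd.DataFrame({
    "Country": ["US", "US", "DE", "DE", "FR"],
    "Value": [1.0, 3.0, 10.0, 20.0, 5.0],
})

# transform returns one value per original row, so it can be assigned directly.
df["Mean"] = df.groupby("Country")["Value"].transform("mean")
print(df)
```

Because transform keeps the original index, no merge is needed to line the means up with the rows.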
Try something like :
df.groupby(['Country']).mean()
Related
How can I group the rows of the dataframe shown in (existing dataframe) by date and time and take the mean of the values? That is, if col B has two values within the same minute, I want the average of those values, and the same for the rest of the columns. The goal is to have one value per minute, as shown in (preprocessed dataframe).
Thank you
If your dataframe is called df, you can do the following:
df.groupby(['DataTime']).mean()
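The line above only collapses rows whose DataTime values are exactly equal. If the timestamps carry seconds, one option is to floor them to the minute before grouping. This is a sketch under the assumption that DataTime is (or can be parsed as) a datetime column; the value columns "B" and "C" and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample data with second-level timestamps.
df = pd.DataFrame({
    "DataTime": pd.to_datetime([
        "2021-01-01 10:00:05",
        "2021-01-01 10:00:40",
        "2021-01-01 10:01:10",
    ]),
    "B": [1.0, 3.0, 5.0],
    "C": [10.0, 20.0, 30.0],
})

# Floor each timestamp to its minute, then average the value columns per minute.
minute = df["DataTime"].dt.floor("min")
per_minute = df.groupby(minute)[["B", "C"]].mean()
print(per_minute)
```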
I have a dataframe in which values are repeated in a column called label. I want to show only the two values that are repeated the most.
I attach an image as an example
I tried
# the name of the column is label
# the name of the dataframe is pd_data
filter = ['a','b']
pd_data = pd_data[~pd_data.label.isin(filter)]
print(len(pd_data))
pd_data.groupby('label').size().sort_values(ascending=False)
You may use value_counts on the label column:
pd_data['label'].value_counts()
or
pd_data.groupby(['label']).count()
You can use pandas.Series.value_counts() to get the counts of unique values in the Series, then slice the result to keep the top 2:
pd_data['label'].value_counts()[:2]
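As a small worked example of the value_counts approach, the sketch below finds the two most frequent labels and then keeps only the rows carrying them. The sample labels are invented; value_counts sorts in descending order of count by default, which is what makes taking the first two entries work.

```python
import pandas as pd

# Hypothetical sample data standing in for the image in the question.
pd_data = pd.DataFrame({"label": ["a", "b", "a", "c", "a", "b", "d"]})

# Counts per label, most frequent first; keep the first two entries.
top_two = pd_data["label"].value_counts().head(2)

# Optionally keep only the rows whose label is one of the top two.
filtered = pd_data[pd_data["label"].isin(top_two.index)]
print(top_two)
print(len(filtered))
```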
I need to create a new column in my df that holds the mean of another existing column, but I need it to take into account each individual location over time rather than the mean of all the values in the existing column.
Based on the sample dataset below, what I am looking for is a new column that contains the mean for each Site, not the mean of all the values independent of Site.
Sample Dataset
Use groupby to aggregate the mean of that column, then merge the result back on Site:
df = df.merge(df.groupby('Site',as_index=False).agg({'TIME_HOUR':'mean'})[['Site','TIME_HOUR']],on='Site',how='left')
Use groupby:
df.groupby('Site')['TIME_HOUR'].mean().reset_index()
Then assign the result to a new column.
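The grouped result has one row per Site, so it cannot be assigned to the original frame as-is. One way to do the assignment, sketched here, is to map each row's Site onto the per-site means. The column name "SITE_MEAN" and the sample rows are assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data in place of the missing sample dataset.
df = pd.DataFrame({
    "Site": ["A", "A", "B", "B"],
    "TIME_HOUR": [1.0, 3.0, 10.0, 30.0],
})

# Series of means indexed by Site.
site_means = df.groupby("Site")["TIME_HOUR"].mean()

# Look up each row's Site in that Series to get a column aligned with df.
df["SITE_MEAN"] = df["Site"].map(site_means)
print(df)
```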
I have a dataframe with two columns, Stock and DueDate, where I need to select the first row from each run of consecutive repeated entries in the Stock column.
df:
I am expecting output like below,
Expected output:
My Approach
The approach I tried is to first mark which rows repeat, based on the Stock column, by creating a new column repeated_yes, and then select only the first row of each run in which rows repeat.
I used the lines below to create the new column "repeated_yes":
ss = df.Stock.ne(df.Stock.shift())
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1
so the new updated dataframe looks like this,
df_new
But I am stuck on selecting only rows 3 and 8 in order to get the result. If there is a more effective approach, that would be helpful.
Edited:
Forgot to include the actual full question,
If there are any other rows below the last row in the dataframe df it should not display any output.
Chain another mask, created by Series.duplicated with keep=False, using & for bitwise AND, and filter with boolean indexing:
ss = df.Stock.ne(df.Stock.shift())
ss1 = ss.cumsum().duplicated(keep=False)
df = df[ss & ss1]
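To see what each mask does, here is the same logic on a small made-up frame (the Stock and DueDate values are assumptions, since the original tables are missing).

```python
import pandas as pd

# Hypothetical sample data: runs of 10 (x2), 20 (x1), 30 (x3), 40 (x1).
df = pd.DataFrame({
    "Stock": [10, 10, 20, 30, 30, 30, 40],
    "DueDate": pd.date_range("2021-01-01", periods=7, freq="D"),
})

# True on the first row of every run of consecutive equal Stock values.
ss = df["Stock"].ne(df["Stock"].shift())

# ss.cumsum() labels each run; duplicated(keep=False) is True for runs longer than one row.
ss1 = ss.cumsum().duplicated(keep=False)

# First row of each run that actually repeats.
result = df[ss & ss1]
print(result)
```

Runs of length one (20 and 40 here) are dropped because their run label is not duplicated.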
I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, is there a better way to generate this new DataFrame?
The result is a Series; for a one-column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
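A short sketch of the difference, using a made-up four-column frame: the row-wise max is a Series, which has a dtype and an optional name rather than a column label, and to_frame turns it into a DataFrame with a named column.

```python
import pandas as pd

# Hypothetical sample data with four columns.
df1 = pd.DataFrame({"a": [1, 5], "b": [4, 2], "c": [3, 3], "d": [0, 9]})

# Row-wise maximum: the result is a Series, not a DataFrame.
s = df1.max(axis=1)
print(type(s).__name__, s.dtype)

# Convert to a one-column DataFrame with a chosen column name.
df2 = s.to_frame("maximum")
print(df2)
```

If a Series is enough for the later processing, s.rename("maximum") gives it a name without changing its type.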