I have a dataframe with values spread over several columns. I want to calculate the mean value of all items from specific columns.
All the solutions I looked up end up giving me either the separate means of each column or the mean of the means of the selected columns.
E.g. my Dataframe looks like this:
Name a b c d
Alice 1 2 3 4
Alice 2 4 2
Alice 3 2
Alice 1 5 2
Ben 3 3 1 3
Ben 4 1 2 3
Ben 1 2 2
And I want to see the mean of the values in columns b & c for each "Alice":
When I try:
df[df["Name"]=="Alice"][["b","c"]].mean()
The result is:
b 2.00
c 4.00
dtype: float64
In another post I found a suggestion to try a "double" mean one time for each axis e.g:
df[df["Name"]=="Alice"][["b","c"]].mean(axis=1).mean()
But the result was then:
3.00
which is the mean of the means of both columns.
I am expecting a way to calculate:
(2 + 3 + 4 + 5) / 4 = 3.50
Is there a way to do this in Python?
You can use NumPy's np.nanmean here: it will simply treat your selection of the dataframe as an array and calculate the mean over the entire selection by default:
>>> import numpy as np
>>> np.nanmean(df.loc[df['Name'] == 'Alice', ['b', 'c']])
3.5
Or if you want to group by name, you can first stack the dataframe, like:
>>> df[['Name','b','c']].set_index('Name').stack().reset_index().groupby('Name').agg('mean')
0
Name
Alice 3.500000
Ben 1.833333
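A roughly equivalent sketch using melt instead of stack (assuming the blank cells are real NaNs rather than empty strings) gives the same per-name means:
>>> df.melt(id_vars='Name', value_vars=['b', 'c']).groupby('Name')['value'].mean()
Name
Alice    3.500000
Ben      1.833333
Name: value, dtype: float64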
You can use groupby to sum all values and get their respective counts, then divide to get the mean. This way you get the result for all Names at once.
g = df.groupby('Name')[['b', 'c']]
g.sum().sum(1)/g.count().sum(1)
Name
Alice 3.500000
Ben 1.833333
dtype: float64
PS: In your example, it looks like you have empty strings in some cells. That's not advisable, since your columns will end up with dtype object. Try to have NaNs instead, to take full advantage of vectorized operations.
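If the blanks really are empty strings, here is a minimal sketch for converting them to NaN first (so the numeric approaches above work); the column list is just the value columns from your example:
import pandas as pd

num_cols = ['a', 'b', 'c', 'd']  # the value columns in the example
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')  # '' becomes NaN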
Assuming all your columns are of numeric type and the empty cells are NaN, a simple set_index, stack and direct mean does it:
df.set_index('Name')[['b','c']].stack().mean(level=0)
Out[117]:
Name
Alice 3.500000
Ben 1.833333
dtype: float64
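Note that mean(level=0) was deprecated in pandas 1.3 and removed in 2.0; a sketch of the equivalent spelling with an explicit groupby:
df.set_index('Name')[['b', 'c']].stack().groupby(level=0).mean()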
I have found examples of how to remove a column when all of its values are NaN, or based on a threshold, but I have not been able to find a solution to my particular problem, which is dropping a column if its last row is NaN. The reason for this is that I'm using time series data in which the collection of data doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. I do, however, not want data whose most recent value is NaN, as that means it's defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
You can use .iloc, .loc and .notna() to sort out your problem.
df = pd.DataFrame({"A":[np.nan, 1,"x",4],
"B":["t",2,"y",np.nan],
"C":["x",3,"z",6]})
df = df.loc[:,df.iloc[-1,:].notna()]
You can use a boolean mask over the last row to select the columns to drop:
df.drop(df.columns[df.iloc[-1].isna()], axis=1)
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
for col in list(temp_df.columns):
    if temp_df[col].iloc[-1] == 'nan':
        temp_df = temp_df.drop(col, axis=1)
This will work for you.
Basically what I'm doing here is looping over all columns and checking whether the last entry is 'nan', then dropping that column.
list(temp_df.columns)
takes a snapshot of the column labels, so dropping columns inside the loop is safe.
temp_df.drop(col, axis=1)
col is the column label and axis=1 means you want to drop a column rather than a row.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that someone can compare anything they wish to.
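For example, a hypothetical sketch of that idea, where some_check is a placeholder for whatever test you want to run on the last entry of each column:
# some_check is a placeholder predicate, not a real pandas function
keep = [col for col in temp_df.columns if some_check(temp_df[col].iloc[-1])]
temp_df = temp_df[keep]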
Another method I found is isnull(), which is a function in the pandas library and works like this:
for col in list(temp_df.columns):
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)
I have the following dataframe:
A,B,C,D
10,1,2,3
1,4,7,3
10,5,2,3
40,7,9,3
9,9,9,9
I would like to create another dataframe, starting from the previous one, which has only two rows. The selection of these two rows is based on the minimum and maximum value in column "A". I would like to get:
A,B,C,D
1,4,7,3
40,7,9,3
Do you think I should work with something like index.min and index.max and then select only the two rows and append them to a new dataframe? Do you have any other suggestions?
Thanks for any kind of help,
Best
IIUC you can simply subset the dataframe with an OR condition on df.A.min() and df.A.max():
df = df[(df.A==df.A.min())|(df.A==df.A.max())]
df
A B C D
1 1 4 7 3
3 40 7 9 3
Yes, you can use idxmin/idxmax and then use loc:
df.loc[df['A'].agg(['idxmin','idxmax']) ]
Output:
A B C D
1 1 4 7 3
3 40 7 9 3
Note that this only gives one row for the min and one for the max. If you want all rows that attain those values, you should use #CHRD's solution.
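If you do need every row that ties the minimum or maximum, a small sketch using isin works as well:
df[df['A'].isin([df['A'].min(), df['A'].max()])]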
I have a DataFrame A in Jupyter that looks like the following:
Index Var1.A.1 Var1.B.1 Var1.CA.1 Var2.A.1 Var2.B.1 Var2.CA.1
0 1 21 3 3 4 4
1 3 5 4 9 5 1
....
100 9 75 2 4 8 2
I'd like to assess the mean value based on the extension of the name, i.e.
Mean value of .A.1
Mean Value of .B.1
Mean value of .CA.1
For example, to assess the mean value of the variables with extension .A.1, I've tried the following, which doesn't return what I'm looking for:
List=['.A.1', '.B.1', '.CA.1']
A[List[List.str.contains('.A.1')]].mean()
However, in this way I get the mean values of the different variables, also picking up .CA.1, which is not what I'm looking for.
Any advice?
thanks
If you want the mean per row over all columns sharing the same name after the first ., use groupby with a lambda function and mean:
df = df.groupby(lambda x: x.split('.', 1)[-1], axis=1).mean()
print (df)
A.1 B.1 CA.1
0 2.0 12.5 3.5
1 6.0 5.0 2.5
100 6.5 41.5 2.0
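Note that axis=1 in groupby is deprecated in recent pandas versions; a sketch of the same idea via a transpose, assuming all the Var columns are numeric:
df = df.T.groupby(lambda x: x.split('.', 1)[-1]).mean().T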
Here is a third option:
columns = A.columns
A[[s for s in columns if ".A.1" in s]].stack().mean()
A.filter(like='.A.1') gives you the columns whose names contain the '.A.1' substring.
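If what you actually want is a single pooled mean per extension (one number for .A.1, one for .B.1, one for .CA.1), here is a minimal sketch building on filter, assuming the matching columns are numeric:
for ext in ['.A.1', '.B.1', '.CA.1']:
    # flatten all matching columns into one Series and take a single mean
    print(ext, A.filter(like=ext).stack().mean())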
I've heard in Pandas there's often multiple ways to do the same thing, but I was wondering –
If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts() ?
There is a difference. value_counts returns:
The resulting object will be in descending order so that the first element is the most frequently-occurring element.
but count does not; it sorts the output by the index (created from the column passed to groupby('col')).
df.groupby('colA').count()
aggregates all columns of df with the function count, so it counts values excluding NaNs.
So if you need to count only one column, use:
df.groupby('colA')['colA'].count()
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'colB': list('abcdefg'),
                   'colC': [1, 3, 5, 7, np.nan, np.nan, 4],
                   'colD': [np.nan, 3, 6, 9, 2, 4, np.nan],
                   'colA': ['c', 'c', 'b', 'a', np.nan, 'b', 'b']})
print (df)
colA colB colC colD
0 c a 1.0 NaN
1 c b 3.0 3.0
2 b c 5.0 6.0
3 a d 7.0 9.0
4 NaN e NaN 2.0
5 b f NaN 4.0
6 b g 4.0 NaN
print (df['colA'].value_counts())
b 3
c 2
a 1
Name: colA, dtype: int64
print (df.groupby('colA').count())
colB colC colD
colA
a 1 1 1
b 3 2 2
c 2 2 1
print (df.groupby('colA')['colA'].count())
colA
a 1
b 3
c 2
Name: colA, dtype: int64
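As a side note, value_counts drops NaN keys by default just like groupby does; a small sketch if you want the NaN bucket counted as well:
print(df['colA'].value_counts(dropna=False))  # also shows a NaN entry with count 1 for this sample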
Groupby and value_counts are totally different functions. You cannot perform value_counts on a whole dataframe (at least not in older pandas versions; DataFrame.value_counts was only added in pandas 1.1).
value_counts is limited to a single column or Series, and its sole purpose is to return a Series of the frequencies of its values.
Groupby returns an object, so you can perform statistical computations over it. So when you do df.groupby(col).count() it returns the count of non-null values present in each column with respect to the specific column(s) in the groupby.
When should value_counts be used, and when should groupby.count be used?
Let's take an example:
df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]})
Groupby count:
df.groupby('color').count()
id size
color
b 2 2
g 2 2
r 3 3
Groupby count is generally used for getting the number of valid values
present in all the columns with respect to the one or more columns
specified in the groupby, so not-a-number (NaN) values are excluded.
To find the frequency using groupby, you need to aggregate the specified column against itself, like #jez did above (maybe value_counts was implemented to avoid this and make developers' lives easier).
Value Counts:
df['color'].value_counts()
r 3
g 2
b 2
Name: color, dtype: int64
Value count is generally used for finding the frequency of the values
present in one particular column.
In conclusion:
.groupby(col).count() should be used when you want to find the count of valid (non-NaN) values present in the other columns with respect to the specified col.
.value_counts() should be used to find the frequencies of a series.
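A quick sketch to see the overlap on the example above: once value_counts is re-sorted by its index, it lines up with the groupby count on the same column:
print(df['color'].value_counts().sort_index())  # b 2, g 2, r 3
print(df.groupby('color')['color'].count())     # same counts, already sorted by the group key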
in simple words: .value_counts() returns a Series containing counts of unique rows in the DataFrame, which means it counts how many times each distinct combination of values occurs across the columns:
imagine we have a dataframe like:
import pandas as pd

df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
                   'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
first_name middle_name
0 John Smith
1 Anne <NA>
2 John <NA>
3 Beth Louise
then we apply value_counts on it:
df.value_counts()
first_name middle_name
Beth Louise 1
John Smith 1
dtype: int64
as you can see it didn't count rows with NA values.
however, count() counts non-NA cells for each column or row.
in our example:
df.count()
first_name 4
middle_name 2
dtype: int64
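In newer pandas versions, DataFrame.value_counts also accepts dropna, so a sketch of keeping the NA rows in the count:
df.value_counts(dropna=False)  # every row in this example then appears with count 1, including the <NA> ones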
Suppose I have a Pandas dataframe df with columns a, b, c, d, ..., z, and I want to do df.groupby('a').apply(my_func) for columns d-z while leaving columns 'b' and 'c' unchanged. How do I do that?
I notice Pandas can apply different functions to different columns by passing a dict, but I have a long column list and would just like a parameter or tip that tells Pandas to bypass some columns and apply my_func to the rest of the columns. (Otherwise I have to build a long dict.)
One simple (and general) approach is to create a view of the dataframe with the subset you are interested in (or, stated for your case, a view with all columns except the ones you want to ignore), and then use apply on that view.
In [116]: df
Out[116]:
a b c d f
0 one 3 0.493808 40 bob
1 two 8 0.150585 50 alice
2 one 6 0.641816 56 michael
3 two 5 0.935653 56 joe
4 one 1 0.521159 48 kate
Use your favorite method to create the view you need. You could select a range of columns like so: df_view = df.loc[:, 'b':'d'] (the old .ix indexer has since been removed from pandas), but the following might be more useful for your scenario:
# I want all columns except two
cols = df.columns.tolist()
mycols = [x for x in cols if x not in ['a', 'f']]
df_view = df[mycols]
Apply your function to that view. (Note this doesn't yet change anything in df.)
In [158]: df_view.apply(lambda x: x /2)
Out[158]:
b c d
0 1 0.246904 20
1 4 0.075293 25
2 3 0.320908 28
3 2 0.467827 28
4 0 0.260579 24
Update the df using update()
In [156]: df.update(df_view.apply(lambda x: x/2))
In [157]: df
Out[157]:
a b c d f
0 one 1 0.246904 20 bob
1 two 4 0.075293 25 alice
2 one 3 0.320908 28 michael
3 two 2 0.467827 28 joe
4 one 0 0.260579 24 kate
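Since the question was specifically about groupby, here is a hedged sketch of the same column-selection idea combined with a per-group transform, stated in the question's own terms (group by 'a', leave 'b' and 'c' untouched); my_func here is only a stand-in, and the d-z columns are assumed to be numeric:
def my_func(x):
    # placeholder for the real per-group computation (halving, to mirror the example)
    return x / 2

target_cols = df.columns.difference(['a', 'b', 'c']).tolist()  # d ... z
df[target_cols] = df.groupby('a')[target_cols].transform(my_func)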