Is there a way to omit some of the output from the pandas describe?
This command gives me exactly what I want with a table output (count and mean of executeTime's by a simpleDate)
df.groupby('simpleDate').executeTime.describe().unstack(1)
However that's all I want, count and mean. I want to drop std, min, max, etc... So far I've only read how to modify column size.
I'm guessing the answer is going to be to re-write the line, not using describe, but I haven't had any luck grouping by simpleDate and getting the count with a mean on executeTime.
I can do count by date:
df.groupby(['simpleDate']).size()
or executeTime by date:
df.groupby(['simpleDate']).mean()['executeTime'].reset_index()
But can't figure out the syntax to combine them.
My desired output:
count mean
09-10-2013 8 20.523
09-11-2013 4 21.112
09-12-2013 3 18.531
... .. ...
.describe() attribute generates a Dataframe where count, std, max ... are values of the index, so according to the documentation you should use .loc to retrieve just the index values desired:
df.describe().loc[['count','max']]
Describe returns a series, so you can just select out what you want
In [6]: s = Series(np.random.rand(10))
In [7]: s
Out[7]:
0 0.302041
1 0.353838
2 0.421416
3 0.174497
4 0.600932
5 0.871461
6 0.116874
7 0.233738
8 0.859147
9 0.145515
dtype: float64
In [8]: s.describe()
Out[8]:
count 10.000000
mean 0.407946
std 0.280562
min 0.116874
25% 0.189307
50% 0.327940
75% 0.556053
max 0.871461
dtype: float64
In [9]: s.describe()[['count','mean']]
Out[9]:
count 10.000000
mean 0.407946
dtype: float64
Looking at the answers, I don't see one that actually works on a DataFrame returned from describe() after using groupby().
The documentation on MultiIndex selection gives a hint at the answer. The .xs() function works for one but not multiple selections, but .loc works.
df.groupby(['simpleDate']).describe().loc[:,(slice(None),['count','max'])]
This keeps the nice MultiIndex returned by .describe() but with only the columns selected.
The solution #Jeff provided just works for series.
#Rafa is on the point: df.describe().info() reveals that the resulting dataframe has Index: 8 entries, count to max
df.describe().loc[['count','max']] does work, but df.groupby('simpleDate').describe().loc[['count','max']], which is what the OP asked, does not work.
I think a solution may be this:
df = pd.DataFrame({'Y': ['A', 'B', 'B', 'A', 'B'],
'Z': [10, 5, 6, 11, 12],
})
grouping the df by Y:
df_grouped=df.groupby(by='Y')
In [207]df_grouped.agg([np.mean, len])
Out[207]:
Z
mean len
Y
A 10.500 2
B 7.667 3
Sticking with describe, you can unstack the indexes and then slice normally too
df.describe().unstack()[['count','max']]
Related
I can't get the average or mean of a column in pandas. A have a dataframe. Neither of things I tried below gives me the average of the column weight
>>> allDF
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 0.94212
The following returns several values, not one:
allDF[['weight']].mean(axis=1)
So does this:
allDF.groupby('weight').mean()
If you only want the mean of the weight column, select the column (which is a Series) and call .mean():
In [479]: df
Out[479]:
ID birthyear weight
0 619040 1962 0.123123
1 600161 1963 0.981742
2 25602033 1963 1.312312
3 624870 1987 0.942120
In [480]: df.loc[:, 'weight'].mean()
Out[480]: 0.83982437500000007
Try df.mean(axis=0) , axis=0 argument calculates the column wise mean of the dataframe so the result will be axis=1 is row wise mean so you are getting multiple values.
Do try to give print (df.describe()) a shot. I hope it will be very helpful to get an overall description of your dataframe.
Mean for each column in df :
A B C
0 5 3 8
1 5 3 9
2 8 4 9
df.mean()
A 6.000000
B 3.333333
C 8.666667
dtype: float64
and if you want average of all columns:
df.stack().mean()
6.0
you can use
df.describe()
you will get basic statistics of the dataframe and to get mean of specific column you can use
df["columnname"].mean()
You can also access a column using the dot notation (also called attribute access) and then calculate its mean:
df.your_column_name.mean()
You can use either of the two statements below:
numpy.mean(df['col_name'])
# or
df['col_name'].mean()
Additionally if you want to get the round value after finding the mean.
#Create a DataFrame
df1 = {
'Subject':['semester1','semester2','semester3','semester4','semester1',
'semester2','semester3'],
'Score':[62.73,47.76,55.61,74.67,31.55,77.31,85.47]}
df1 = pd.DataFrame(df1,columns=['Subject','Score'])
rounded_mean = round(df1['Score'].mean()) # specified nothing as decimal place
print(rounded_mean) # 62
rounded_mean_decimal_0 = round(df1['Score'].mean(), 0) # specified decimal place as 0
print(rounded_mean_decimal_0) # 62.0
rounded_mean_decimal_1 = round(df1['Score'].mean(), 1) # specified decimal place as 1
print(rounded_mean_decimal_1) # 62.2
Do note that it needs to be in the numeric data type in the first place.
import pandas as pd
df['column'] = pd.to_numeric(df['column'], errors='coerce')
Next find the mean on one column or for all numeric columns using describe().
df['column'].mean()
df.describe()
Example of result from describe:
column
count 62.000000
mean 84.678548
std 216.694615
min 13.100000
25% 27.012500
50% 41.220000
75% 70.817500
max 1666.860000
You can simply go for:
df.describe()
that will provide you with all the relevant details you need, but to find the min, max or average value of a particular column (say 'weights' in your case), use:
df['weights'].mean(): For average value
df['weights'].max(): For maximum value
df['weights'].min(): For minimum value
You can use the method agg (aggregate):
df.agg('mean')
It's possible to apply multiple statistics:
df.agg(['mean', 'max', 'min'])
You can easily follow the following code
import pandas as pd
import numpy as np
classxii = {'Name':['Karan','Ishan','Aditya','Anant','Ronit'],
'Subject':['Accounts','Economics','Accounts','Economics','Accounts'],
'Score':[87,64,58,74,87],
'Grade':['A1','B2','C1','B1','A2']}
df = pd.DataFrame(classxii,index = ['a','b','c','d','e'],columns=['Name','Subject','Score','Grade'])
print(df)
#use the below for mean if you already have a dataframe
print('mean of score is:')
print(df[['Score']].mean())
I have a dataframe with 11 000k rows. There are multiple columns but I am interested only in 2 of them: TagName and Samples_Value. One tag can repeat itself multiple times among rows. I want to calculate the average value for each tag and create a new dataframe with the average value for each tag. I don't really know how to walk through rows and how to calculate the average. Any help will be highly appreciated. Thank you!
Name DataType TimeStamp Value Quality
Food Float 2019-01-01 13:00:00 105.75 122
Food Float 2019-01-01 17:30:00 11.8110352 122
Food Float 2019-01-01 17:45:00 12.7932892 122
Water Float 2019-01-01 14:01:00 16446.875 122
Water Float 2019-01-01 14:00:00 146.875 122
RangeIndex: 11140487 entries, 0 to 11140486
Data columns (total 6 columns):
Name object
Value object
This is what I have and I know it is really noob ish but I am having a difficult time walking through rows.
for i in range(0, len(df):
if((df.iloc[i]['DataType']!='Undefined')):
print df.loc[df['Name'] == df.iloc[i]['Name'], df.iloc[i]['Value']].mean()
It sounds like the groupby() functionality is what you want. You define the column where your groups are and then you can take the mean() of each group. An example from the documentation:
df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
df.groupby('A').mean()
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
In your case it would be something like this:
df.groupby('TagName')['Samples_value'].mean()
Edit: So, I applied the code to your provided input dataframe and following is the output:
TagName
Steam 1.081447e+06
Utilities 3.536931e+05
Name: Sample_value, dtype: float64
Is this what you are looking for?
You don't need to walk through the rows, you can just take all of the fields that match your criteria
d = {'col1': [1,2,1,2,1,2], 'col2': [3, 4,5,6,7,8]}
df = pd.DataFrame(data=d)
#iterate over all unique entries in col1
for entry in df["col1"].unique():
# get all the col2 values where col1 is the current iter of col1 entries
meanofcurrententry=df[df["col1"]==entry]["col2"].mean()
print(meanofcurrententry)
This is not a full solution, but I think it helps more to understand the necessary logic. You still need to wrap it up into your own dataframe, however it hopefuly helps to understand how to use the indexing
You should avoid as much as possible to iterate rows in a dataframe, because it is very unefficient...
groupby is the way to go when you want to apply the same processing to various groups of rows identified by their values in one or more columns. Here what you want is (*):
df.groupby('TagName')['Sample_value'].mean().reset_index()
it gives as expected:
TagName Sample_value
0 Steam 1.081447e+06
1 Utilities 3.536931e+05
Details on the magic words:
groupby: identifies the column(s) used to group the rows (same values)
['Sample_values']: restrict the groupby object to the column of interest
mean(): computes the mean per group
reset_index(): by default the grouping columns go into the index, which is fine for the mean operation. reset_index make them back normal columns
I have a dataframe with several columns (the features).
>>> print(df)
col1 col2
a 1 1
b 2 2
c 3 3
d 3 2
I would like to compute the mode of one of them. This is what happens:
>>> print(df['col1'].mode())
0 3
dtype: int64
I would like to output simply the value 3.
This behavoiur is quite strange, if you consider that the following very similar code is working:
>>> print(df['col1'].mean())
2.25
So two questions: why does this happen? How can I obtain the pure mode value as it happens for the mean?
Because Series.mode() can return multiple values:
consider the following DF:
In [77]: df
Out[77]:
col1 col2
a 1 1
b 2 2
c 3 3
d 3 2
e 2 3
In [78]: df['col1'].mode()
Out[78]:
0 2
1 3
dtype: int64
From docstring:
Empty if nothing occurs at least 2 times. Always returns Series
even if only one value.
If you want to chose the first value:
In [83]: df['col1'].mode().iloc[0]
Out[83]: 2
In [84]: df['col1'].mode()[0]
Out[84]: 2
I agree that it's too cumbersome
df['col1'].mode().iloc[0].values[0]
a series can have one mean(), but a series can have more than one mode()
like
<2,2,2,3,3,3,4,4,4,5,6,7,8> its mode 2,3,4.
the output must be indexed
mode() will return all values that tie for the most frequent value.
In order to support that functionality, it must return a collection, which takes the form of a dataFrame or Series.
For example, if you had a series:
[2, 2, 3, 3, 5, 5, 6]
Then the most frequent values occur twice. The result would then be the series [2, 3, 5] since each of these occur twice.
If you want to collapse this into a single value, you can access the first value, compute the max(), min(), or whatever makes most sense for your application.
Say I want a function that changes the value of a named column in a given row number of a DataFrame.
One option is to find the column's location and use iloc, like that:
def ChangeValue(df, rowNumber, fieldName, newValue):
columnNumber = df.columns.get_loc(fieldName)
df.iloc[rowNumber, columnNumber] = newValue
But I wonder if there is a way to use the magic of iloc and loc in one go, and skip the manual conversion.
Any ideas?
I suggest just using iloc combined with the Index.get_loc method. eg:
df.iloc[0:10, df.columns.get_loc('column_name')]
A bit clumsy, but simple enough.
MultiIndex has both get_loc and get_locs which takes a sequence; unfortunately Index just seems to have the former.
Using loc
One has to resort to either employing integer location iloc all the way —as suggested in this answer,— or using plain location loc all the way, as shown here:
df.loc[df.index[[0, 7, 13]], 'column_name']
According to this answer,
ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.
So you should especially be able to use df.ix[rowNumber, fieldname] in case type(df.index) != type(rowNumber).
Even it does not hold for each case, I'd like to add an easy one, if you are looking for top or bottom entries:
df.head(1)['column_name'] # first entry in 'column_name'
df.tail(5)['column_name'] # last 5 entries in 'column_name'
Edit: doing the following is not a good idea. I leave the answer as a counter example.
You can do this:
df.iloc[rowNumber].loc[fieldName] = newValue
Example
import pandas as pd
def ChangeValue(df, rowNumber, fieldName, newValue):
df.iloc[rowNumber].loc[fieldName] = newValue
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
index=[4, 5, 6], columns=['A', 'B', 'C'])
print(df)
A B C
4 0 2 3
5 0 4 1
6 10 20 30
ChangeValue(df, 1, "B", 999)
print(df)
A B C
4 0 2 3
5 0 999 1
6 10 20 30
But be careful, if newValue is not the same type it does not work and will fail silently
ChangeValue(df, 1, "B", "Oops")
print(df)
A B C
4 0 2 3
5 0 999 1
6 10 20 30
There is some good info about working with columns data types here: Change column type in pandas
I have a pandas DataFrame with a multi-level index ("instance" and "index"). I want to find all the first-level ("instance") index values which are non-unique and to print out those values.
My frame looks like this:
A
instance index
a 1 10
2 12
3 4
b 1 12
2 5
3 2
b 1 12
2 5
3 2
I want to find "b" as the duplicate 0-level index and print its value ("b") out.
You can use the get_duplicates() method:
>>> df.index.get_level_values('instance').get_duplicates()
[0, 1]
(In my example data 0 and 1 both appear multiple times.)
The get_level_values() method can accept a label (such as 'instance') or an integer and retrieves the relevant part of the MultiIndex.
Assuming that your df has an index made of 'instance' and 'index' you could do this:
df1 = df.reset_index().pivot_table(index=['instance','index'], values='A', aggfunc='count')
df1[df1 > 1].index.get_level_values(0).drop_duplicates()
Which yields:
Index([u'b'], dtype='object')
Adding .values at the end (.drop_duplicates().values) will make an array:
array(['b'], dtype=object)
Or the same with one line using .groupby:
df[df.groupby(level=['instance','index']).count() > 1].dropna().index.get_level_values(0).drop_duplicates()
This should give you the whole row which isn't quite what you asked for but might be close enough:
df[df.index.get_level_values('instance').duplicated()]
You want the duplicated method:
df['Instance'].duplicated()