How to calculate the mean of specific rows in a Python dataframe?

I have a dataframe with about 11 million rows. There are multiple columns, but I am interested in only two of them: TagName and Samples_Value. One tag can repeat itself multiple times across rows. I want to calculate the average value for each tag and create a new dataframe with the average value for each tag. I don't really know how to walk through the rows or how to calculate the average. Any help will be highly appreciated. Thank you!
Name DataType TimeStamp Value Quality
Food Float 2019-01-01 13:00:00 105.75 122
Food Float 2019-01-01 17:30:00 11.8110352 122
Food Float 2019-01-01 17:45:00 12.7932892 122
Water Float 2019-01-01 14:01:00 16446.875 122
Water Float 2019-01-01 14:00:00 146.875 122
RangeIndex: 11140487 entries, 0 to 11140486
Data columns (total 6 columns):
Name object
Value object
This is what I have and I know it is really noob ish but I am having a difficult time walking through rows.
for i in range(0, len(df)):
    if df.iloc[i]['DataType'] != 'Undefined':
        print(df.loc[df['Name'] == df.iloc[i]['Name'], 'Value'].mean())

It sounds like the groupby() functionality is what you want. You define the column that holds your groups, and then you can take the mean() of each group. An example from the documentation:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
df.groupby('A').mean()
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
In your case it would be something like this:
df.groupby('TagName')['Sample_value'].mean()
Edit: So, I applied the code to your provided input dataframe and following is the output:
TagName
Steam 1.081447e+06
Utilities 3.536931e+05
Name: Sample_value, dtype: float64
Is this what you are looking for?

You don't need to walk through the rows; you can just take all of the fields that match your criteria.
import pandas as pd

d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data=d)

# iterate over all unique entries in col1
for entry in df["col1"].unique():
    # get the mean of all col2 values where col1 equals the current entry
    mean_of_current_entry = df[df["col1"] == entry]["col2"].mean()
    print(mean_of_current_entry)
This is not a full solution, but I think it helps to understand the necessary logic. You still need to wrap the results up into your own dataframe (see the sketch below), but it hopefully shows how to use the indexing.
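For reference, here is a minimal sketch (using the same toy col1/col2 data) of one way to wrap those per-entry means into a new dataframe; the column name col2_mean is just an illustrative choice:
import pandas as pd

d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data=d)

# collect the mean of col2 for each unique col1 entry into a dict
means = {entry: df[df["col1"] == entry]["col2"].mean()
         for entry in df["col1"].unique()}

# build a new dataframe from the dict (col2_mean is an illustrative name)
result = pd.DataFrame(list(means.items()), columns=["col1", "col2_mean"])
print(result)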

You should avoid iterating over the rows of a dataframe as much as possible, because it is very inefficient...
groupby is the way to go when you want to apply the same processing to various groups of rows identified by their values in one or more columns. Here, what you want is:
df.groupby('TagName')['Sample_value'].mean().reset_index()
It gives, as expected:
TagName Sample_value
0 Steam 1.081447e+06
1 Utilities 3.536931e+05
Details on the magic words:
groupby: identifies the column(s) used to group the rows (rows with the same values end up in the same group)
['Sample_value']: restricts the groupby object to the column of interest
mean(): computes the mean per group
reset_index(): by default the grouping columns go into the index, which is fine for the mean operation; reset_index() turns them back into normal columns
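Applied to the small sample in the question (its Name and Value columns), a sketch of the same chain would look like this; note the pd.to_numeric step, since the question's df.info() shows Value stored as object:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Food', 'Food', 'Food', 'Water', 'Water'],
    'Value': ['105.75', '11.8110352', '12.7932892', '16446.875', '146.875'],
})

# Value is stored as object in the question's data, so convert it to numeric first
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')

print(df.groupby('Name')['Value'].mean().reset_index())
#     Name        Value
# 0   Food    43.451442
# 1  Water  8296.875000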


pd.Grouper with datetime key in conjunction with another grouping key seemingly creates the wrong number of groups

Using pd.Grouper with a datetime key in conjunction with another key creates a set of groups, but this does not seem to encompass all of the groups that need to be created, in my opinion.
>>> test = pd.DataFrame({"id":["a","b"]*3, "b":pd.date_range("2000-01-01","2000-01-03", freq="9H")})
>>> test
id b
0 a 2000-01-01 00:00:00
1 b 2000-01-01 09:00:00
2 a 2000-01-01 18:00:00
3 b 2000-01-02 03:00:00
4 a 2000-01-02 12:00:00
5 b 2000-01-02 21:00:00
When I tried to create groups based on the date and id values:
>>> g = test.groupby([pd.Grouper(key='b', freq="D"), 'id'])
>>> g.groups
{(2000-01-01 00:00:00, 'a'): [0], (2000-01-02 00:00:00, 'b'): [1]}
g.groups shows only 2 groups when I expected 4 groups: both "a" and "b" for each day.
However, when I created another column based off of "b":
>>> test['date'] = test.b.dt.date
>>> g = test.groupby(['date', 'id'])
>>> g.groups
{(2000-01-01, 'a'): [0, 2], (2000-01-01, 'b'): [1], (2000-01-02, 'a'): [4], (2000-01-02, 'b'): [3, 5]}
The outcome was exactly what I expected.
I don't know how to make sense of these different outcomes. Please enlighten me.
You do have 4 groups with the Grouper; the output of g.groups is misleading (maybe worth reporting as a bug?):
g = test.groupby([pd.Grouper(key='b', freq="D"), 'id'])
g.ngroups
# 4
g.size()
# b id
# 2000-01-01 a 2
# b 1
# 2000-01-02 a 1
# b 2
# dtype: int64
I believe it is because of the difference between pd.Grouper and the dt.date accessor in pandas. pd.Grouper groups by a range of values (e.g., daily, hourly, etc.), while dt.date returns just the date part of a datetime object, effectively creating a categorical variable.
When you use pd.Grouper with a frequency of "D", it groups by full days, so each day is represented by one bin. In your case each id has multiple records per day, and the g.groups output does not appear to list all of the groups you expect.
On the other hand, when you use the dt.date accessor to extract the date part of the datetime, it creates a categorical variable that represents each date independently.
So when you group by this new date column along with the id column, each group corresponds to a unique combination of date and id, giving you the expected outcome.
In summary, pd.Grouper is useful when you want to group by a range of values (e.g., daily, hourly), while using a separate column with the exact values (e.g., a column holding only the date) is useful when you want to group by specific values.
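To check that the two approaches really do agree, here is a small sketch (reusing the test frame from the question) that compares their group sizes:
import pandas as pd

test = pd.DataFrame({"id": ["a", "b"] * 3,
                     "b": pd.date_range("2000-01-01", "2000-01-03", freq="9H")})

# grouping with pd.Grouper on daily bins
by_grouper = test.groupby([pd.Grouper(key="b", freq="D"), "id"]).size()

# grouping on the extracted calendar date
by_date = test.groupby([test["b"].dt.date, "id"]).size()

print(by_grouper)
print(by_date)
# both report the same four (day, id) groups, with sizes 2, 1, 1, 2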

python pandas column with averages [duplicate]

This question already has an answer here:
Aggregation over Partition - pandas Dataframe
(1 answer)
I have a dataframe with locations in column "A" and values in column "B". Locations occur multiple times in this DataFrame. Now I'd like to add a third column in which I store the average of the column "B" values that share the same location in column "A".
- I know that .mean() can be used to get an average
- I know how to filter with .loc()
I could make a list of all unique values in column A and compute the average for each of them with a for loop. However, this seems cumbersome to me. Any idea how this can be done more efficiently?
Sounds like what you need is GroupBy. Take a look at the pandas GroupBy documentation.
Given
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
You can use
df.groupby('A').mean()
to group the values based on the common values in column "A" and find the mean.
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
I could make a list of all unique values in column A, and compute the
average for all of them by making a for loop.
This can be done using pandas.DataFrame.groupby; consider the following simple example:
import pandas as pd
df = pd.DataFrame({"A":["X","Y","Y","X","X"],"B":[1,3,7,10,20]})
means = df.groupby('A').agg('mean')
print(means)
which gives the output:
B
A
X 10.333333
Y 5.000000
import pandas as pd
data = {'A': ['a', 'a', 'b', 'c'], 'B': [32, 61, 40, 45]}
df = pd.DataFrame(data)
df2 = df.groupby(['A']).mean()
print(df2)
Based on your description, I'm not sure if you are simply trying to calculate the averages for each group, or if you want to maintain the long format of your data. I'll break down a solution for each option.
The data I'll use below can be generated by running the following...
import pandas as pd
df = pd.DataFrame([['group1', 2],
                   ['group2', 4],
                   ['group1', 5],
                   ['group2', 2],
                   ['group1', 2],
                   ['group2', 0]], columns=['A', 'B'])
Option 1 - Calculate Group Averages
This one is super simple. It uses the .groupby method, which is the bread and butter of crunching data calculations.
df.groupby('A').B.mean()
Output:
A
group1 3.0
group2 2.0
If you wish for this to return a dataframe instead of a series, you can add .to_frame() to the end of the above line.
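For example (continuing with the df defined above):
group_means = df.groupby('A').B.mean().to_frame()
print(group_means)
#           B
# A
# group1  3.0
# group2  2.0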
Option 2 - Calculate Group Averages and Maintain Long Format
By long format, I mean you want your data to be structured the same as it is currently, but with a third column (we'll call it C) containing a mean that is connected to the A column. ie...
A        B   C (average)
group1   2   3
group2   4   2
group1   5   3
group2   2   2
group1   2   3
group2   0   2
Where the averages for each group are...
group1 = (2+5+2)/3 = 3
group2 = (4+2+0)/3 = 2
The most efficient solution would be to use .transform, which behaves like an SQL window function, but I think this method can be a little confusing when you're new to pandas.
import numpy as np
df.assign(C=df.groupby('A').B.transform(np.mean))
A less efficient but more beginner-friendly option is to store the averages in a dictionary and then map each row to its group average.
I find myself using this option a lot for modeling projects, when I want to impute a historical average rather than the average of my sampled data.
To accomplish this, you can...
Create a dictionary containing the grouped averages
For every row in the dataframe, pass the group name into the dictionary
# Create the group averages
group_averages = df.groupby('A').B.mean().to_dict()
# For every row, pass the group name into the dictionary
new_column = df.A.map(group_averages)
# Add the new column to the dataframe
df = df.assign(C=new_column)
You can also, optionally, do all of this in a single line:
df = df.assign(C=df.A.map(df.groupby('A').B.mean().to_dict()))

Pandas series inserted into dataframe are read as NaN

I'm finding that when I add a series, based on the same time period, to an existing dataframe, it gets inserted as NaNs. The dataframe has a Field column, but I don't understand why that should change anything. To see the steps of my code, you can review the attached image. I hope that someone can help!
[Image: illustration of the dataframe that the series is inserted into, and how the series gets read as NaN]
Assuming that the value in the Field Index column is "actual" for every row, a solution could be the following:
test.reset_index().set_index('Date').assign(m1=m1)
That solution works, but it can be done more concisely:
days = pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31'])
df = pd.DataFrame({'Field': ['Actual']*3, 'Date': days, 'Val':[1, 2, 3]}).set_index(['Field', 'Date'])
m1 = pd.Series([0, 2, 4], index=days)
df.reset_index(level='Field').assign(m1=m1)
Field Val m1
Date
2018-01-31 Actual 1 0
2018-02-28 Actual 2 2
2018-03-31 Actual 3 4
By the way, that would make a nice MCVE (minimal, complete, and verifiable example).

Replace a column in Pandas dataframe with another that has same index but in a different order

I'm trying to re-insert into a pandas dataframe a column that I extracted and whose order I changed by sorting it.
Very simply, I have extracted a column from a pandas df:
col1 = df.col1
This column contains integers, and I used the .sort() method to order it from smallest to largest. Then I did some operations on the data.
col1.sort()
# do stuff that changes the values of col1
Now the indexes of col1 are the same as the indexes of the overall df, but in a different order.
I was wondering how I can insert the column back into the original dataframe (replacing the col1 that is there at the moment).
I have tried both of the following methods:
1)
df.col1 = col1
2)
df.insert(column_index_of_col1, "col1", col1)
but both methods give me the following error:
ValueError: cannot reindex from a duplicate axis
Any help will be greatly appreciated.
Thank you.
Consider this DataFrame:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])
df
Out:
A B
0 1 6
0 2 5
1 3 4
Assign the second column to b, sort it, and square it, for example:
b = df['B']
b = b.sort_values()
b = b**2
Now b is:
b
Out:
1 16
0 25
0 36
Name: B, dtype: int64
Without knowing the exact operation you've applied to the column, there is no way to know whether 25 corresponds to the first row of the original DataFrame or to the second one. You could invert the operation (take the square root and match, for example), but I think that would be unnecessary. If you start with an index that has unique elements (df = df.reset_index()), it becomes much easier. In that case,
df['B'] = b
should work just fine.
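A small sketch of why that works, assuming the default unique RangeIndex: the assignment aligns on index labels, not on position, so the sorted and transformed values land back on their original rows:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]})  # unique default index 0, 1, 2

b = df['B'].sort_values() ** 2   # order changes, but the index labels travel with the values
df['B'] = b                      # assignment realigns on the index, not on position
print(df)
#    A   B
# 0  1  36
# 1  2  25
# 2  3  16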

Modify output from Python Pandas describe

Is there a way to omit some of the output from pandas describe()?
This command gives me exactly what I want as a table output (count and mean of executeTime by simpleDate):
df.groupby('simpleDate').executeTime.describe().unstack(1)
However, count and mean are all I want; I'd like to drop std, min, max, etc. So far I've only read about how to modify the column size.
I'm guessing the answer is going to be to rewrite the line without describe, but I haven't had any luck grouping by simpleDate and getting both the count and the mean of executeTime.
I can do count by date:
df.groupby(['simpleDate']).size()
or the mean of executeTime by date:
df.groupby(['simpleDate']).mean()['executeTime'].reset_index()
But I can't figure out the syntax to combine them.
My desired output:
count mean
09-10-2013 8 20.523
09-11-2013 4 21.112
09-12-2013 3 18.531
... .. ...
The .describe() method generates a DataFrame where count, std, max, ... are values of the index, so, according to the documentation, you should use .loc to retrieve just the desired index values:
df.describe().loc[['count','max']]
For a Series, describe() returns a Series, so you can just select out what you want:
In [6]: s = Series(np.random.rand(10))
In [7]: s
Out[7]:
0 0.302041
1 0.353838
2 0.421416
3 0.174497
4 0.600932
5 0.871461
6 0.116874
7 0.233738
8 0.859147
9 0.145515
dtype: float64
In [8]: s.describe()
Out[8]:
count 10.000000
mean 0.407946
std 0.280562
min 0.116874
25% 0.189307
50% 0.327940
75% 0.556053
max 0.871461
dtype: float64
In [9]: s.describe()[['count','mean']]
Out[9]:
count 10.000000
mean 0.407946
dtype: float64
Looking at the answers, I don't see one that actually works on a DataFrame returned from describe() after using groupby().
The documentation on MultiIndex selection hints at the answer. The .xs() method works for a single selection but not for multiple selections, whereas .loc does.
df.groupby(['simpleDate']).describe().loc[:,(slice(None),['count','max'])]
This keeps the nice MultiIndex columns returned by .describe(), but with only the selected statistics.
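An equivalent and arguably more readable spelling uses pd.IndexSlice; a small self-contained sketch (with made-up stand-in data for simpleDate and executeTime):
import pandas as pd

df = pd.DataFrame({'simpleDate': ['09-10-2013', '09-10-2013', '09-11-2013'],
                   'executeTime': [20.1, 21.0, 21.1]})

idx = pd.IndexSlice
print(df.groupby(['simpleDate']).describe().loc[:, idx[:, ['count', 'max']]])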
The solution @Jeff provided only works for a Series.
@Rafa is on point: df.describe().info() reveals that the resulting dataframe has Index: 8 entries, count to max.
df.describe().loc[['count','max']] does work, but df.groupby('simpleDate').describe().loc[['count','max']], which is what the OP asked for, does not.
I think a solution may be this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Y': ['A', 'B', 'B', 'A', 'B'],
                   'Z': [10, 5, 6, 11, 12],
                   })
Grouping the df by Y:
df_grouped = df.groupby(by='Y')
In [207]: df_grouped.agg([np.mean, len])
Out[207]:
Z
mean len
Y
A 10.500 2
B 7.667 3
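A variant of the same idea, using string aggregation names, gives exactly the count and mean columns the original question asks for (a sketch with the same toy data):
import pandas as pd

df = pd.DataFrame({'Y': ['A', 'B', 'B', 'A', 'B'],
                   'Z': [10, 5, 6, 11, 12]})

print(df.groupby('Y')['Z'].agg(['count', 'mean']))
#    count       mean
# Y
# A      2  10.500000
# B      3   7.666667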
Sticking with describe, you can unstack the index and then slice normally too:
df.describe().unstack()[['count','max']]
