The code below calculates a moving average for every row within each group.
However, I am only interested in the moving average of the last 2 rows for each group of id.
Since my data is quite large, this code takes too much time to run.
The desired output is a column avg that is NaN for all rows except those with time = 4 and 5.
Thank you so much for your help. HC
import pandas as pd
df = {'id':    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
      'time':  [1, 2, 3, 4, 5, 5, 1, 2, 3, 4],
      'value': [1, 2, 3, 4, 2, 16, 26, 50, 10, 30],
      }
df = pd.DataFrame(data=df)
df.sort_values(by=['id', 'time'], ascending=[True, True], inplace=True)
df['avg'] = df['value'].groupby(df['id']).apply(lambda g: g.rolling(3).mean())
df
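A possible way to speed this up (a sketch, not an answer taken from the thread; the tail(4)/tail(2) slice sizes are my assumption, chosen because a window of 3 ending on a group's last 2 rows only needs that group's last 4 rows) is to restrict the rolling computation to each group's tail and write back only the rows of interest:
# Sketch: compute the rolling mean only where it is needed.
tail = df.groupby('id').tail(4)                       # last 4 rows per id are enough history
tail_avg = (tail.groupby('id')['value']
                .rolling(3)
                .mean()
                .reset_index(level=0, drop=True))     # drop the 'id' level -> original row index

df['avg'] = float('nan')
last2 = df.groupby('id').tail(2).index                # only the last 2 rows per id get a value
df.loc[last2, 'avg'] = tail_avg.loc[last2]
df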
import pandas as pd
test = pd.DataFrame({'Area': ['Tipperary', 'Tipperary', 'Cork', 'Dublin'],
                     'Deaths': [11, 33, 44, 55]})
I have this problem on a much larger scale, but for readability I have created a smaller version. What groupby logic do I need to group by the Area column and sum, so that I end up with 3 rows instead of 4, because Tipperary is in there twice? Say I had 6 columns altogether: how would I do this and keep my existing dataframe as it is, i.e. just reduce the row count because of the duplicated values in 'Area'?
If the other columns have more than just numbers, you can use .groupby and .agg with different functions for each column. If you do not want to move the grouping column to the index, you can set the parameter as_index = False in groupby.
import pandas as pd
test = pd.DataFrame({'Area': ['Tipperary', 'Tipperary', 'Cork', 'Dublin'],
                     'Deaths': [11, 33, 44, 55],
                     'Text': ['a', 'b', 'c', 'd'],
                     'Numbers': [1, 4, 3, 2]})
out = test.groupby('Area', as_index=False).agg(
    {'Deaths': 'sum',
     'Text': lambda x: ','.join(x),
     'Numbers': 'max'})
print(out)
Prints:
Area Deaths Text Numbers
0 Cork 44 c 3
1 Dublin 55 d 2
2 Tipperary 44 a,b 4
You can simply use the .groupby method:
import pandas as pd
test = pd.DataFrame({'Area': ['Tipperary', 'Tipperary', 'Cork', 'Dublin'],
                     'Deaths': [11, 33, 44, 55]})
test.groupby('Area').sum()
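For the example frame this collapses the two Tipperary rows into one and moves Area into the index (pass as_index=False to keep it as a regular column, as in the answer above). It prints roughly:
           Deaths
Area
Cork           44
Dublin         55
Tipperary      44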
I have the following pandas DataFrame example. I am trying to get the sum of some specific rows. I have researched how to carry this out, but I could not find a solution. Could you give me a direction, please? The example is below. I thought I could apply groupby and sum, but there is a column (Value_3) that I would not like to sum; it should just stay the same. Value_3 is a constant value determined by the Machine and Shift values.
data = {'Machine': ['Mch_1', 'Mch_1', 'Mch_1', 'Mch_1', 'Mch_2', 'Mch_2'],
        'Shift': ['Day', 'Day', 'Night', 'Night', 'Night', 'Night'],
        'Value_1': [1, 2, 0, 0, 1, 3],
        'Value_2': [0, 2, 2, 1, 3, 0],
        'Value_3': [5, 5, 2, 2, 6, 6]}
df = pd.DataFrame(data)
Output:
  Machine  Shift  Value_1  Value_2  Value_3
0   Mch_1    Day        1        0        5
1   Mch_1    Day        2        2        5
2   Mch_1  Night        0        2        2
3   Mch_1  Night        0        1        2
4   Mch_2  Night        1        3        6
5   Mch_2  Night        3        0        6
What I would like to have is shown in the dataframe below.
expected = {'Machine': ['Mch_1', 'Mch_1', 'Mch_2'],
            'Shift': ['Day', 'Night', 'Night'],
            'Value_1': [3, 0, 4],
            'Value_2': [2, 3, 3],
            'Value_3': [5, 2, 6]}
df_expected = pd.DataFrame(expected)
df_expected
Output:
  Machine  Shift  Value_1  Value_2  Value_3
0   Mch_1    Day        3        2        5
1   Mch_1  Night        0        3        2
2   Mch_2  Night        4        3        6
Thank you very much.
The first idea is to pass a dictionary of aggregation functions; for the last column you can use the first or last function:
d = {'Value_1':'sum','Value_2':'sum','Value_3':'first'}
df1 = df.groupby(['Machine','Shift'], as_index=False).agg(d)
If you want a more dynamic solution, meaning you sum all columns except Value_3, create the dictionary from all columns not listed as keys, using dict.fromkeys and Index.difference:
d = dict.fromkeys(df.columns.difference(['Machine','Shift', 'Value_3']), 'sum')
d['Value_3'] = 'first'
df1 = df.groupby(['Machine','Shift'], as_index=False).agg(d)
print(df1)
Machine Shift Value_1 Value_2 Value_3
0 Mch_1 Day 3 2 5
1 Mch_1 Night 0 3 2
2 Mch_2 Night 4 3 6
Given a data frame that looks like this
GROUP VALUE
1 5
2 2
1 10
2 20
1 7
I would like to compute the difference between the largest and smallest value within each group. That is, the result should be
GROUP DIFF
1 5
2 18
What is an easy way to do this in Pandas?
What is a fast way to do this in Pandas for a data frame with about 2 million rows and 1 million groups?
Using @unutbu's df. Per the timings below, @unutbu's solution is best over large data sets.
import pandas as pd
import numpy as np
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
df.groupby('GROUP')['VALUE'].agg(np.ptp)
GROUP
1 5
2 18
Name: VALUE, dtype: int64
np.ptp (see the NumPy docs) returns the range of an array (maximum minus minimum).
Timing setups:

small df: the 5-row example above

large df:
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 100, VALUE=np.random.rand(1000000)))

large df, many groups:
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000, VALUE=np.random.rand(1000000)))
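A minimal sketch of how such a timing comparison could be rerun with timeit (the wrapper function names below are illustrative, not part of the original answer):
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000,
                       VALUE=np.random.rand(1000000)))

def ptp_agg(frame):
    # per-group range via np.ptp (max - min)
    return frame.groupby('GROUP')['VALUE'].agg(np.ptp)

def max_minus_min(frame):
    # built-in 'max'/'min' aggregators, then subtract
    agg = frame.groupby('GROUP')['VALUE'].agg(['max', 'min'])
    return agg['max'] - agg['min']

for fn in (ptp_agg, max_minus_min):
    print(fn.__name__, timeit.timeit(lambda: fn(df), number=10))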
groupby/agg generally performs best when you take advantage of the built-in aggregators such as 'max' and 'min'. So to obtain the difference, first compute the max and min and then subtract:
import pandas as pd
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
result = df.groupby('GROUP')['VALUE'].agg(['max','min'])
result['diff'] = result['max']-result['min']
print(result[['diff']])
yields
diff
GROUP
1 5
2 18
Note: this will get the job done, but @piRSquared's answer has faster methods.
You can use groupby(), min(), and max():
df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())
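The apply/lambda version is concise, but the same result also comes straight from the built-in aggregations, which, per the answers above, is usually faster on large frames:
grouped = df.groupby('GROUP')['VALUE']
diff = grouped.max() - grouped.min()   # Series of per-group ranges, indexed by GROUP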
How can I efficiently find overlapping dates between many date ranges?
I have a pandas dataframe containing information on the daily warehouse stock of many products. There are only records for those dates where stock actually changed.
import pandas as pd
df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'stock': [10, 0, 10, 5, 0, 5],
                   'date': ['2016-01-01', '2016-01-05', '2016-01-15',
                            '2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
Out[4]:
date product stock
0 2016-01-01 a 10
1 2016-01-05 a 0
2 2016-01-15 a 10
3 2016-01-01 b 5
4 2016-01-10 b 0
5 2016-01-20 b 5
From this data I want to identify the number of days where stock of all products was 0. In the example this would be 5 days (from 2016-01-10 to 2016-01-14).
I initially tried resampling the dates to create one record for every day and then comparing day by day. This works, but it creates a very large dataframe that I can hardly keep in memory, because my data contains many dates where stock does not change.
Is there a more memory-efficient way to calculate overlaps other than creating a record for every date and comparing day by day?
Maybe I can somehow create a period representation for the time range implicit in every record and then compare all periods for all products?
Another option could be to first subset only those time periods where a product has zero stock (relatively few) and then apply the resampling only on that subset of the data.
What other, more efficient ways are there?
You can pivot the table using the dates as index and the products as columns, then forward-fill NaNs with the previous values, convert to daily frequency, and look for rows with 0s in all columns.
ptable = (df.pivot(index='date', columns='product', values='stock')
            .ffill().asfreq('D', method='ffill'))
cond = ptable.apply(lambda x: (x == 0).all(), axis='columns')
print(ptable.index[cond])
DatetimeIndex(['2016-01-10', '2016-01-11', '2016-01-12', '2016-01-13',
               '2016-01-14'],
              dtype='datetime64[ns]', name='date', freq='D')
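Since the question asks for the number of such days, the count follows directly from the boolean mask:
print(cond.sum())  # 5 days on which every product had zero stock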
Try this. I know it's not the prettiest code, but given the data provided here it should work:
from datetime import timedelta
import pandas as pd

df = pd.DataFrame({'product': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'stock': [10, 0, 10, 5, 0, 5],
                   'date': ['2016-01-01', '2016-01-05', '2016-01-15',
                            '2016-01-01', '2016-01-10', '2016-01-20']})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date', ascending=True)

no_stock_dates = []   # (start, end) tuples of periods with zero total stock
product_stock = {}    # last known stock level per product
in_flag = False       # True while inside a zero-stock period
begin = df['date'].iloc[0]

for index, row in df.iterrows():
    current = row['date']
    product_stock[row['product']] = row['stock']
    if current > begin:
        if sum(product_stock.values()) == 0 and not in_flag:
            in_flag = True
            begin = row['date']
        if sum(product_stock.values()) != 0 and in_flag:
            in_flag = False
            no_stock_dates.append((begin, current - timedelta(days=1)))

print(no_stock_dates)
This code should run in O(n*k) time, where n is the number of rows and k is the number of products.
Each column of the DataFrame needs its values to be normalized according to the value of the first element in that column.
for timestamp, prices in data.items():
    normalizedPrices = prices / prices[0]
    print(normalizedPrices)  # how do we update the DataFrame with this Series?
However, how do we update the DataFrame once we have created the normalized column of data? I believe that if we do prices = normalizedPrices we are merely acting on a copy/view of the DataFrame rather than on the original DataFrame itself.
It might be simplest to normalize the entire DataFrame in one go (and avoid looping over rows/columns altogether):
>>> df = pd.DataFrame({'a': [2, 4, 5], 'b': [3, 9, 4]}, dtype=float)  # a DataFrame
>>> df
a b
0 2 3
1 4 9
2 5 4
>>> df = df.div(df.loc[0]) # normalise DataFrame and bind back to df
>>> df
a b
0 1.0 1.000000
1 2.0 3.000000
2 2.5 1.333333
Assign to data[col]:
for col in data:
data[col] /= data[col].iloc[0]
import numpy  # not strictly required; .values already returns NumPy arrays

# Divide every row by the first row; broadcasting over the raw arrays does the normalization.
data[0:] = data[0:].values / data[0:1].values
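As a quick sanity check (a sketch reusing the small example frame from the first answer), dividing by the first row directly produces the same normalized values:
import pandas as pd

df = pd.DataFrame({'a': [2.0, 4.0, 5.0], 'b': [3.0, 9.0, 4.0]})
print(df / df.iloc[0])   # each column divided by its own first value via broadcasting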