GroupBy and Shift based on the values of a column - python

Let's suppose that I have the following dataset:
Stock_id  Week  Stock_value
1         1     2
1         2     4
1         4     7
1         5     1
2         3     8
2         4     6
2         5     5
2         6     3
I want to shift the values of the Stock_value column down by one position, but only within runs of consecutive weeks.
This should give the following output:
Stock_id  Week  Stock_value
1         1     NA
1         2     2
1         4     NA
1         5     7
2         3     NA
2         4     8
2         5     6
2         6     5
So, for example, for stock 1 the Stock_value of week 2 should not be shifted over to week 4 (since I want a one-week shift for now).
How can I do this easily?

IIUC, use Week with its diff to create another group key:
df.groupby([df.Stock_id,df.Week.diff().ne(1).cumsum()]).Stock_value.shift()
Out[157]:
0    NaN
1    2.0
2    NaN
3    7.0
4    NaN
5    8.0
6    6.0
7    5.0
Name: Stock_value, dtype: float64
# to assign back as a new column:
df['Stock_value2'] = df.groupby([df.Stock_id, df.Week.diff().ne(1).cumsum()]).Stock_value.shift()
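For reference, here is what the helper key evaluates to on the question's data, as a minimal self-contained sketch (the data is reconstructed from the question; the name consec is just illustrative):

import pandas as pd

df = pd.DataFrame({
    'Stock_id':    [1, 1, 1, 1, 2, 2, 2, 2],
    'Week':        [1, 2, 4, 5, 3, 4, 5, 6],
    'Stock_value': [2, 4, 7, 1, 8, 6, 5, 3],
})

# Week.diff() is NaN/2/-2 wherever the week sequence breaks, so ne(1)
# flags every break and cumsum() turns the flags into run ids: 1 1 2 2 3 3 3 3
consec = df.Week.diff().ne(1).cumsum()

# shifting within (Stock_id, run id) pairs keeps values from leaking
# across gaps in the weeks
print(df.groupby([df.Stock_id, consec]).Stock_value.shift())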

Related

Count preceding non NaN values in pandas

I have a DataFrame that looks like the following:
     a  b    c
0  NaN  8  NaN
1  NaN  7  NaN
2  NaN  5  NaN
3  7.0  3  NaN
4  3.0  5  NaN
5  5.0  4  NaN
6  7.0  1  NaN
7  8.0  9  3.0
8  NaN  5  5.0
9  NaN  6  4.0
What I want to create is a new DataFrame where each cell contains the count of non-NaN values seen so far in the same column (up to and including that row). The resulting DataFrame would look like this:
   a   b  c
0  0   1  0
1  0   2  0
2  0   3  0
3  1   4  0
4  2   5  0
5  3   6  0
6  4   7  0
7  5   8  1
8  5   9  2
9  5  10  3
I have achieved it with the following loop:
out = df.copy()
for i in range(len(df)):
    # count the non-NaN values in each column up to and including row i
    out.iloc[i] = df.iloc[:i + 1].notna().sum()
However, this iterates row by row in Python. My real DataFrame contains thousands of columns, so this approach is far too slow. What can I do? Maybe it should be something related to the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna marks the non-NaN values and cumsum produces the running counts):
out = df.notna().cumsum()
Output:
   a   b  c
0  0   1  0
1  0   2  0
2  0   3  0
3  1   4  0
4  2   5  0
5  3   6  0
6  4   7  0
7  5   8  1
8  5   9  2
9  5  10  3
Check notna with cumsum:
out = df.notna().cumsum()
Out[220]:
   a   b  c
0  0   1  0
1  0   2  0
2  0   3  0
3  1   4  0
4  2   5  0
5  3   6  0
6  4   7  0
7  5   8  1
8  5   9  2
9  5  10  3
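For completeness, a self-contained sketch of why the one-liner works: notna() returns a boolean frame (True where a value is present) and cumsum() adds the Trues up column by column in a single vectorized pass (the data below is reconstructed from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [np.nan, np.nan, np.nan, 7.0, 3.0, 5.0, 7.0, 8.0, np.nan, np.nan],
    'b': [8, 7, 5, 3, 5, 4, 1, 9, 5, 6],
    'c': [np.nan] * 7 + [3.0, 5.0, 4.0],
})

# booleans are treated as 0/1 by cumsum, so this yields running counts per column
out = df.notna().cumsum()
print(out)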

Grouping data in pandas by rows

I have data with this structure:
id  month  val
1   0      4
2   0      4
3   0      5
1   1      3
2   1      7
3   1      9
1   2      12
2   2      1
3   2      5
1   3      10
2   3      4
3   3      7
...
I want to get the mean val for each id, grouped by pairs of months. Expected result:
id  two_months  val
1   0           3.5
2   0           5.5
3   0           7
1   1           11
2   1           2.5
3   1           6
What's the simplest way to do it using Pandas?
If months are consecutive integers starting from 0, use integer division by 2:
df = df.groupby(['id',df['month'] // 2])['val'].mean().sort_index(level=[1,0]).reset_index()
print (df)
   id  month   val
0   1      0   3.5
1   2      0   5.5
2   3      0   7.0
3   1      1  11.0
4   2      1   2.5
5   3      1   6.0
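To see what the second group key looks like, a minimal illustration (hypothetical months 0 through 5):

import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5], name='month')
print(s // 2)  # 0 0 1 1 2 2 -> each pair of months shares one group id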
A possible alternative is to convert the months to datetimes:
df.index = pd.to_datetime(df['month'].add(1), format='%m')
df = df.groupby(['id', pd.Grouper(freq='2MS')])['val'].mean().sort_index(level=[1,0]).reset_index()
print (df)
   id      month   val
0   1 1900-01-01   3.5
1   2 1900-01-01   5.5
2   3 1900-01-01   7.0
3   1 1900-03-01  11.0
4   2 1900-03-01   2.5
5   3 1900-03-01   6.0
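If you want the two-month bins as integers rather than datetimes, one hedged option is to map the bin starts (1900-01-01, 1900-03-01, ...) back, assuming the datetime column comes out named month as in the printed output:

# months 1 and 2 -> 0, months 3 and 4 -> 1, and so on
df['month'] = (df['month'].dt.month - 1) // 2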

How to average a pandas data frame column before and after a given row based on some odd value n?

If I have a pandas data frame like this:
   Col A
0      2
1      3
2      5
3      4
4      6
5      2
6      3
7      1
8      1
9      7
and some odd numerical value like this:
n = 3
How do I replace each value with the average of itself and the values before and after it, based on half of my n? For n = 3 I would be averaging the value itself with the one before and the one after, such that I get a pandas DataFrame like this:
   Col A   Col B
0      2  np.nan
1      3    3.33
2      5    4
3      4    5
4      6    4
5      2    3.67
6      3    2
7      1    1.67
8      1    3
9      7  np.nan
The first and last values are np.nan because there is not a value before/after them.
You can use rolling with the center option:
n = 3
df['Col A'].rolling(n, center=True).mean()
Output:
0         NaN
1    3.333333
2    4.000000
3    5.000000
4    4.000000
5    3.666667
6    2.000000
7    1.666667
8    3.000000
9         NaN
Name: Col A, dtype: float64
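To attach the result as a new column, as in the question's expected output (the name Col B follows the question; the edges come out NaN because a centered window of 3 is incomplete there):

import pandas as pd

df = pd.DataFrame({'Col A': [2, 3, 5, 4, 6, 2, 3, 1, 1, 7]})

n = 3  # a centered window of n covers (n - 1) / 2 rows before and after
df['Col B'] = df['Col A'].rolling(n, center=True).mean()
print(df)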

GroupBy and Transform does not keep all columns of dataframe

Let's suppose that I have the following dataset:
Stock_id  Week  Stock_value
1         1     2
1         2     4
1         4     7
1         5     1
2         3     8
2         4     6
2         5     5
2         6     3
I want to shift the values of the Stock_value column by one position within each Stock_id so that I get the following:
Stock_id  Week  Stock_value
1         1     NA
1         2     2
1         4     4
1         5     7
2         3     NA
2         4     8
2         5     6
2         6     5
What I am doing is the following:
df = pd.read_csv('C:/Users/user/Desktop/test.txt', keep_default_na=True, sep='\t')
df = df.groupby('Store_id', as_index=False)['Waiting_time'].transform(lambda x:x.shift(periods=1))
But then this gives me:
   Waiting_time
0           NaN
1           2.0
2           4.0
3           7.0
4           NaN
5           8.0
6           6.0
7           5.0
So it gives me the values shifted but it does not retain all the columns of the dataframe.
How do I also retain all the columns of the dataframe along with shifting the values of one column?
You can simplify the solution with DataFrameGroupBy.shift and assign the result back to a new column:
df['Waiting_time'] = df.groupby('Stock_id')['Stock_value'].shift()
This works the same as:
df['Waiting_time']=df.groupby('Stock_id')['Stock_value'].transform(lambda x:x.shift(periods=1))
print (df)
   Stock_id  Week  Stock_value  Waiting_time
0         1     1            2           NaN
1         1     2            4           2.0
2         1     4            7           4.0
3         1     5            1           7.0
4         2     3            8           NaN
5         2     4            6           8.0
6         2     5            5           6.0
7         2     6            3           5.0
When you do df.groupby('Store_id', as_index=False)['Waiting_time'], you obtain an object that contains only the 'Waiting_time' column; the other columns cannot be recovered from it.
As suggested by jezrael in the comment, you should do
df['new col'] = df.groupby('Store_id...
to add this new column to your previously existing DataFrame.
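Putting the pieces together in a runnable sketch (the question's code mixes Store_id and Waiting_time with the Stock_id and Stock_value columns shown in the data, so this reconstructs the dataset and uses one consistent set of names):

import pandas as pd

df = pd.DataFrame({
    'Stock_id':    [1, 1, 1, 1, 2, 2, 2, 2],
    'Week':        [1, 2, 4, 5, 3, 4, 5, 6],
    'Stock_value': [2, 4, 7, 1, 8, 6, 5, 3],
})

# assign the shifted Series to a new column instead of replacing df,
# so every original column is retained
df['Waiting_time'] = df.groupby('Stock_id')['Stock_value'].shift()
print(df)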

Sorting the dates and assigning a rank - python

Suppose I have data like,
user  date
1     3/18/2016
1     1/11/2015
1     1/11/2015
1     5/8/2015
1     7/8/2015
2     3/17/2016
2     2/10/2015
2     9/8/2015
2     1/1/2016
2     1/1/2016
I want to sort the rows based on the dates for each user and then create a new column, which would assign 1-5 rank for each date.
Here is what I have tried:
df.groupby(['user']).sort_values(['date']) for sorting the dates for each user. But I also want to create a new column that holds the rank after sorting.
My ideal output would be:
user  date       rank
1     1/11/2015  1
1     1/11/2015  1
1     5/8/2015   2
1     7/8/2015   3
1     3/18/2016  4
2     2/10/2015  1
2     9/8/2015   2
2     1/1/2016   3
2     1/1/2016   3
2     3/17/2016  4
Can anybody help me in doing this? Thanks
Try this:
In [274]: df['rank'] = df.sort_values(['user','date']) \
.groupby(['user'])['date'] \
.rank(method='min').astype(int)
In [277]: df.sort_values(['user','date'])
Out[277]:
   user       date  rank
1     1 2015-01-11     1
2     1 2015-01-11     1
3     1 2015-05-08     3
4     1 2015-07-08     4
0     1 2016-03-18     5
6     2 2015-02-10     1
7     2 2015-09-08     2
8     2 2016-01-01     3
9     2 2016-01-01     3
5     2 2016-03-17     5
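Note that method='min' skips ranks after ties (1, 1, 3, ...), while the expected output in the question is a dense ranking (1, 1, 2, ...). Assuming the date column starts out as strings, a sketch that matches the expected output:

import pandas as pd

# strings like '1/11/2015' do not sort chronologically, so parse them first
df['date'] = pd.to_datetime(df['date'])

# method='dense' assigns consecutive ranks with no gaps after ties
df['rank'] = df.groupby('user')['date'].rank(method='dense').astype(int)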
