I have data with this structure:
id month val
1 0 4
2 0 4
3 0 5
1 1 3
2 1 7
3 1 9
1 2 12
2 2 1
3 2 5
1 3 10
2 3 4
3 3 7
...
I want to get the mean of val for each id, grouped into two-month periods. Expected result:
id two_months val
1 0 3.5
2 0 5.5
3 0 7
1 1 11
2 1 2.5
3 1 6
What's the simplest way to do it using Pandas?
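For reference, a minimal reconstruction of the sample frame (only the rows shown above; the trailing "..." rows are omitted):
import pandas as pd
df = pd.DataFrame({
    'id':    [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    'month': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'val':   [4, 4, 5, 3, 7, 9, 12, 1, 5, 10, 4, 7],
})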
If the months are consecutive integers starting at 0, use integer division by 2:
df = df.groupby(['id',df['month'] // 2])['val'].mean().sort_index(level=[1,0]).reset_index()
print (df)
id month val
0 1 0 3.5
1 2 0 5.5
2 3 0 7.0
3 1 1 11.0
4 2 1 2.5
5 3 1 6.0
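To see why this buckets months in pairs: integer division by 2 maps every pair of consecutive months to the same group key. A quick standalone check:
import pandas as pd
months = pd.Series([0, 1, 2, 3, 4, 5])
print(months // 2)  # 0, 0, 1, 1, 2, 2 -- months 0-1 share key 0, months 2-3 share key 1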
A possible solution is converting the months to datetimes and grouping with pd.Grouper (to_datetime with format='%m' defaults the year to 1900, which is why 1900 dates appear below):
df.index = pd.to_datetime(df['month'].add(1), format='%m')
df = df.groupby(['id', pd.Grouper(freq='2MS')])['val'].mean().sort_index(level=[1,0]).reset_index()
print (df)
id month val
0 1 1900-01-01 3.5
1 2 1900-01-01 5.5
2 3 1900-01-01 7.0
3 1 1900-03-01 11.0
4 2 1900-03-01 2.5
5 3 1900-03-01 6.0
I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each value contains the count of non-NaN values up to and including that row in the same column. The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
out = df.copy()
for i in range(len(df)):
    out.iloc[i] = df.iloc[:i + 1].notna().sum()
However, I can only do this for an individual column at a time. My real DataFrame contains thousands of columns, so iterating over them is impractical due to the low processing speed. What can I do? Maybe something related to the pandas .apply() function?
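For reference, a minimal reconstruction of the sample frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': [np.nan, np.nan, np.nan, 7.0, 3.0, 5.0, 7.0, 8.0, np.nan, np.nan],
    'b': [8, 7, 5, 3, 5, 4, 1, 9, 5, 6],
    'c': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 3.0, 5.0, 4.0],
})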
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna flags the non-NaN values, cumsum turns the flags into running counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
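To make the two steps concrete, here is the boolean intermediate on a tiny standalone series:
import numpy as np
import pandas as pd
s = pd.Series([np.nan, 1.0, np.nan, 2.0])
print(s.notna())           # False, True, False, True
print(s.notna().cumsum())  # 0, 1, 1, 2 -- running count of non-NaN values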
Check with notna combined with cumsum:
out = df.notna().cumsum()
Out[220]:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have the following DataFrame of individuals and the time of an event.
id time
1 0
2 0
3 0
4 0
2 1
3 1
1 2
4 2
1 3
2 3
1 4
2 4
3 4
4 4
I want a column of left-exclusive time points (start: the time of that id's previous event). The column of right-inclusive time points (stop) is the existing time column.
id start stop
1 0 0
2 0 0
3 0 0
4 0 0
2 0 1
3 0 1
1 0 2
4 0 2
1 2 3
2 1 3
1 3 4
2 3 4
3 1 4
4 2 4
Any straightforward functions that accomplish this?
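For reference, a minimal reconstruction of the events frame:
import pandas as pd
df = pd.DataFrame({
    'id':   [1, 2, 3, 4, 2, 3, 1, 4, 1, 2, 1, 2, 3, 4],
    'time': [0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4],
})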
Use DataFrameGroupBy.shift inside DataFrame.insert to add the new column in the second position, then rename the time column:
df.insert(1, 'start', df.groupby('id')['time'].shift(fill_value=0))
df = df.rename(columns={'time':'stop'})
print (df)
id start stop
0 1 0 0
1 2 0 0
2 3 0 0
3 4 0 0
4 2 0 1
5 3 0 1
6 1 0 2
7 4 0 2
8 1 2 3
9 2 1 3
10 1 3 4
11 2 3 4
12 3 1 4
13 4 2 4
To get the previous value of every id, you want to group by 'id' and retrieve the previous value by using shift as your new column 'start':
df['start'] = df.groupby('id').time.shift(1, fill_value=0)
id time start
0 1 0 0.0
1 2 0 0.0
2 3 0 0.0
3 4 0 0.0
4 2 1 0.0
5 3 1 0.0
6 1 2 0.0
7 4 2 0.0
8 1 3 2.0
9 2 3 1.0
10 1 4 3.0
11 2 4 3.0
12 3 4 1.0
13 4 4 2.0
Then you might want to rename your 'time' column to 'stop':
df.rename({'time': 'stop'}, axis=1, inplace=True)
If you want start to come before stop, reorder the columns (assign the result back, since column selection returns a new DataFrame):
df = df[['id', 'start', 'stop']]
I need to fill missing days in the column 'day':
id month day trans
0 0 8 1 9
1 0 8 2 5
2 0 8 3 10
3 0 8 4 6
4 0 8 6 4
5 0 8 8 4
I am looking for output:
id month day trans
0 0 8 1 9
1 0 8 2 5
2 0 8 3 10
3 0 8 4 6
4 0 8 5 NaN
5 0 8 6 4
6 0 8 7 NaN
7 0 8 8 4
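For reference, a minimal reconstruction of the sample frame:
import pandas as pd
df = pd.DataFrame({
    'id':    [0, 0, 0, 0, 0, 0],
    'month': [8, 8, 8, 8, 8, 8],
    'day':   [1, 2, 3, 4, 6, 8],
    'trans': [9, 5, 10, 6, 4, 4],
})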
Use reindex() (note the day range has to run through 8 so the last row is kept):
df1 = df.set_index('day').reindex([1, 2, 3, 4, 5, 6, 7, 8]).reset_index()
df1[['month', 'id']] = df1[['month', 'id']].ffill()
Following up on your comment:
mux = pd.MultiIndex.from_product([df['id'].unique(), [1, 2, 3, 4, 5, 6, 7, 8]], names=['id', 'day'])
df1 = df.set_index(['id', 'day']).reindex(mux).reset_index()
df1['month'] = df1['month'].ffill()
   id  day  month  trans
0   0    1    8.0    9.0
1   0    2    8.0    5.0
2   0    3    8.0   10.0
3   0    4    8.0    6.0
4   0    5    8.0    NaN
5   0    6    8.0    4.0
6   0    7    8.0    NaN
7   0    8    8.0    4.0
I think the best way to deal with this is to build a DataFrame that has all the [id, month, day] combinations of your desired output, and left-merge your original df onto it on the [id, month, day] key; a sketch follows below.
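A minimal sketch of that idea, assuming df is the frame from the question and days should run 1-8 (MultiIndex.from_product is just one way to build the grid):
import pandas as pd
full = pd.MultiIndex.from_product(
    [df['id'].unique(), df['month'].unique(), range(1, 9)],
    names=['id', 'month', 'day'],
).to_frame(index=False)
# Left-merge keeps every grid row and leaves NaN where trans is missing
out = full.merge(df, on=['id', 'month', 'day'], how='left')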
Using pandas upsampling:
from datetime import datetime
df['date'] = df.apply(lambda x: datetime(2020, x['month'], x['day']), axis=1)
df = df.set_index('date')
# Upsampling to daily frequency inserts NaN rows for the missing days
df_daily = df.resample('D').asfreq().reset_index()
# Reassign month and day from the generated dates
df_daily['month'] = df_daily.date.dt.month
df_daily['day'] = df_daily.date.dt.day
df_daily['id'] = df_daily['id'].ffill().astype(int)
del df_daily['date']
I am trying to write a small Python application that creates a CSV file containing data for a recipe system.
Imagine the following structure of Excel data:
Manufacturer Product Data 1 Data 2 Data 3
Test 1 Product 1 1 2 3
Test 1 Product 2 4 5 6
Test 2 Product 1 1 2 3
Test 3 Product 1 1 2 3
Test 3 Product 1 4 5 6
Test 3 Product 1 7 8 9
When merged, I would like the data to be displayed in the following format:
Test 1 Product 1 1 2 3 0 0 0 0 0 0
Test 1 Product 2 4 5 6 0 0 0 0 0 0
Test 2 Product 1 1 2 3 0 0 0 0 0 0
Test 3 Product 1 1 2 3 4 5 6 7 8 9
Any help would be gratefully received. So far I can read the pandas dataset and convert it to a CSV.
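For reference, a minimal reconstruction of the sample data:
import pandas as pd
df = pd.DataFrame({
    'Manufacturer': ['Test 1', 'Test 1', 'Test 2', 'Test 3', 'Test 3', 'Test 3'],
    'Product': ['Product 1', 'Product 2', 'Product 1', 'Product 1', 'Product 1', 'Product 1'],
    'Data 1': [1, 4, 1, 1, 4, 7],
    'Data 2': [2, 5, 2, 2, 5, 8],
    'Data 3': [3, 6, 3, 3, 6, 9],
})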
Use melt, groupby, pd.Series, and unstack:
(df.melt(['Manufacturer','Product'])
.groupby(['Manufacturer','Product'])['value']
.apply(lambda x: pd.Series(x.tolist()))
.unstack(fill_value=0)
.reset_index())
Output:
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 4 7 2 5 8 3 6 9
Note that melt stacks column-wise, so within a group the values come out as all of Data 1, then Data 2, then Data 3 (hence 1 4 7 2 5 8 3 6 9 for Test 3) rather than in the row-wise 1-9 order shown in the question.
With groupby
df.groupby(['Manufacturer', 'Product']).agg(tuple).sum(axis=1).apply(pd.Series).fillna(0)
Out[85]:
0 1 2 3 4 5 6 7 8
Manufacturer Product
Test 1       Product 1   1.0  2.0  3.0  0.0  0.0  0.0  0.0  0.0  0.0
             Product 2   4.0  5.0  6.0  0.0  0.0  0.0  0.0  0.0  0.0
Test 2       Product 1   1.0  2.0  3.0  0.0  0.0  0.0  0.0  0.0  0.0
Test 3       Product 1   1.0  4.0  7.0  2.0  5.0  8.0  3.0  6.0  9.0
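This works because summing an object-dtype row of tuples concatenates them; a quick standalone check:
import pandas as pd
row = pd.Series([(1, 4, 7), (2, 5, 8), (3, 6, 9)])
print(row.sum())  # (1, 4, 7, 2, 5, 8, 3, 6, 9) -- tuple addition concatenates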
cols = ['Manufacturer', 'Product']
d = df.set_index(cols + [df.groupby(cols).cumcount()]).unstack(fill_value=0)
d
Gets me
Data 1 Data 2 Data 3
0 1 2 0 1 2 0 1 2
Manufacturer Product
Test 1 Product 1 1 0 0 2 0 0 3 0 0
Product 2 4 0 0 5 0 0 6 0 0
Test 2 Product 1 1 0 0 2 0 0 3 0 0
Test 3 Product 1 1 4 7 2 5 8 3 6 9
Followed up with:
d.sort_index(axis=1, level=1).pipe(lambda d: d.set_axis(range(d.shape[1]), axis=1).reset_index())
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 2 3 4 5 6 7 8 9
Or
cols = ['Manufacturer', 'Product']
pd.Series({
n: d.values.ravel() for n, d in df.set_index(cols).groupby(cols)
}).apply(pd.Series).fillna(0, downcast='infer').rename_axis(cols).reset_index()
Manufacturer Product 0 1 2 3 4 5 6 7 8
0 Test 1 Product 1 1 2 3 0 0 0 0 0 0
1 Test 1 Product 2 4 5 6 0 0 0 0 0 0
2 Test 2 Product 1 1 2 3 0 0 0 0 0 0
3 Test 3 Product 1 1 2 3 4 5 6 7 8 9
With defaultdict and itertools.count
from itertools import count
from collections import defaultdict
c = defaultdict(count)
pd.Series({(
m, p, next(c[(m, p)])): v
for _, m, p, *V in df.itertuples()
for v in V
}).unstack(fill_value=0)
0 1 2 3 4 5 6 7 8
Test 1 Product 1 1 2 3 0 0 0 0 0 0
Product 2 4 5 6 0 0 0 0 0 0
Test 2 Product 1 1 2 3 0 0 0 0 0 0
Test 3 Product 1 1 2 3 4 5 6 7 8 9
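The defaultdict(count) trick gives every (Manufacturer, Product) key its own independent counter, which is what produces the 0, 1, 2, ... column positions; a quick standalone check:
from collections import defaultdict
from itertools import count
c = defaultdict(count)
print(next(c['x']), next(c['x']), next(c['y']))  # 0 1 0 -- a fresh counter per key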
How do I count the number of unique strings in a rolling window of a pandas dataframe?
a = pd.DataFrame(['a','b','a','a','b','c','d','e','e','e','e'])
a.rolling(3).apply(lambda x: len(np.unique(x)))
Output, same as original dataframe:
0
0 a
1 b
2 a
3 a
4 b
5 c
6 d
7 e
8 e
9 e
10 e
Expected:
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
I think you first need to convert the values to numeric, either by factorize or by rank. The min_periods parameter is also necessary to avoid NaN at the start of the column:
a[0] = pd.factorize(a[0])[0]
print (a)
0
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 4
9 4
10 4
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
Or:
a[0] = a[0].rank(method='dense')
0
0 1.0
1 2.0
2 1.0
3 1.0
4 2.0
5 3.0
6 4.0
7 5.0
8 5.0
9 5.0
10 5.0
b = a.rolling(3, min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
print (b)
0
0 1
1 2
2 2
3 2
4 2
5 3
6 3
7 3
8 2
9 1
10 1
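As a side note, a sketch of an equivalent formulation: with raw=False (the default in recent pandas versions), each rolling window is passed as a Series, so Series.nunique can replace len(np.unique(...)):
import pandas as pd
a = pd.DataFrame(['a', 'b', 'a', 'a', 'b', 'c', 'd', 'e', 'e', 'e', 'e'])
a[0] = pd.factorize(a[0])[0]
b = a.rolling(3, min_periods=1).apply(lambda x: x.nunique()).astype(int)
print(b)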