I have a DataFrame that looks something like this:
time status A
0 0 2 20
1 1 2 21
2 2 2 20
3 3 2 19
4 4 10 18
5 5 2 17
6 6 2 18
7 7 2 19
8 8 2 18
9 9 10 17
... ... ... ...
Now I'd like to select all rows with status == 2 and group the resulting rows into runs that are not interrupted by any other status, so that I can access each group separately afterwards.
Something like:
print(df1)
time status A
0 0 2 20
1 1 2 21
2 2 2 20
3 3 2 19
print(df2)
time status A
0 5 2 17
1 6 2 18
2 7 2 19
3 8 2 18
Is there an efficient, loop-avoiding way to achieve this?
Thank you in advance!
Input data:
>>> df
time status A
0 0 2 20 # group 1
1 1 2 21 # 1
2 2 2 20 # 1
3 3 2 19 # 1
4 4 10 18 # group 2
5 5 2 17 # group 3
6 6 2 18 # 3
7 7 2 19 # 3
8 8 2 18 # 3
9 9 10 17 # group 4
df["group"] = df.status.ne(df.status.shift()).cumsum()
>>> df
time status A group
0 0 2 20 1
1 1 2 21 1
2 2 2 20 1
3 3 2 19 1
4 4 10 18 2
5 5 2 17 3
6 6 2 18 3
7 7 2 19 3
8 8 2 18 3
9 9 10 17 4
Now you can do what you want. For example:
(_, df1), (_, df2) = list(df.loc[df["status"] == 2].groupby("group"))
>>> df1
time status A group
0 0 2 20 1
1 1 2 21 1
2 2 2 20 1
3 3 2 19 1
>>> df2
time status A group
5 5 2 17 3
6 6 2 18 3
7 7 2 19 3
8 8 2 18 3
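If the number of uninterrupted runs isn't known up front, the same idea generalizes. A minimal self-contained sketch (the runs list is my own addition, not part of the answer above):
import pandas as pd

df = pd.DataFrame({"time": range(10),
                   "status": [2, 2, 2, 2, 10, 2, 2, 2, 2, 10],
                   "A": [20, 21, 20, 19, 18, 17, 18, 19, 18, 17]})
# label each uninterrupted run of equal status values
df["group"] = df["status"].ne(df["status"].shift()).cumsum()
# one DataFrame per status == 2 run, however many runs there are
runs = [g for _, g in df.loc[df["status"] == 2].groupby("group")]
df1, df2 = runs  # this particular input happens to contain exactly two runs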
Related
Assume I have this df:
pd.DataFrame({'data': [0,0,0,1,1,1,2,2,2,3,3,4,4,5,5,0,0,0,0,2,2,2,2,4,4,4,4]})
data
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
10 3
11 4
12 4
13 5
14 5
15 0
16 0
17 0
18 0
19 2
20 2
21 2
22 2
23 4
24 4
25 4
26 4
I'm looking for a way to create a new column in df that counts how many times the value has been repeated consecutively so far. For example:
data new
0 0 1
1 0 2
2 0 3
3 1 1
4 1 2
5 1 3
6 2 1
7 2 2
8 2 3
9 3 1
10 3 2
11 4 1
12 4 2
13 5 1
14 5 2
15 0 1
16 0 2
17 0 3
18 0 4
19 2 1
20 2 2
21 2 3
22 2 4
23 4 1
24 4 2
25 4 3
26 4 4
My idea was to pull the rows into a Python list, compare consecutive items, and build a new list.
Is there a simple way to do this?
Example
df = pd.DataFrame({'data': [0,0,0,1,1,1,2,2,2,3,3,4,4,5,5,0,0,0,0,2,2,2,2,4,4,4,4]})
Code
grouper = df['data'].ne(df['data'].shift(1)).cumsum()  # label each consecutive run
df['new'] = df.groupby(grouper).cumcount().add(1)      # 1-based position within the run
df
data new
0 0 1
1 0 2
2 0 3
3 1 1
4 1 2
5 1 3
6 2 1
7 2 2
8 2 3
9 3 1
10 3 2
11 4 1
12 4 2
13 5 1
14 5 2
15 0 1
16 0 2
17 0 3
18 0 4
19 2 1
20 2 2
21 2 3
22 2 4
23 4 1
24 4 2
25 4 3
26 4 4
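To see why the grouper works, here's a tiny sketch of my own on a five-row series, printing the intermediate run labels:
import pandas as pd

s = pd.Series([0, 0, 1, 1, 0])
grouper = s.ne(s.shift(1)).cumsum()
print(grouper.tolist())  # [1, 1, 2, 2, 3] -- a new label starts at every value change
print(s.groupby(grouper).cumcount().add(1).tolist())  # [1, 2, 1, 2, 1]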
I have the following data:
df = pd.DataFrame({'id' : [1,2,3,4,5,6], 'category' : [1,3,1,4,3,2], 'day1' : [10,20,30,40,50,60], 'day2' : [1,2,3,4,5,7], 'day3' : [0,1,2,3,7,9] })
df
id category day1 day2 day3
0 1 1 10 1 0
1 2 3 20 2 1
2 3 1 30 3 2
3 4 4 40 4 3
4 5 3 50 5 7
5 6 2 60 7 9
It is time series data and I need to prepare the new DataFrame as records of ('id', 'category', 'day'):
df = pd.DataFrame({'id' : [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6], 'category' : [1,1,1,3,3,3,1,1,1,4,4,4,3,3,3,2,2,2], 'day' : [10,1,0,20,2,1,30,3,2,40,4,3,50,5,7,60,7,9]})
df
id category day
0 1 1 10
1 1 1 1
2 1 1 0
3 2 3 20
4 2 3 2
5 2 3 1
6 3 1 30
7 3 1 3
8 3 1 2
9 4 4 40
10 4 4 4
11 4 4 3
12 5 3 50
13 5 3 5
14 5 3 7
15 6 2 60
16 6 2 7
17 6 2 9
But I don't know how to do it without looping over every DataFrame cell.
A possible solution:
(df.set_index(['id', 'category'])
   .stack().rename('day')                  # one row per (id, category, dayN) value
   .reset_index().drop('level_2', axis=1)) # drop the dayN label column
Output:
id category day
0 1 1 10
1 1 1 1
2 1 1 0
3 2 3 20
4 2 3 2
5 2 3 1
6 3 1 30
7 3 1 3
8 3 1 2
9 4 4 40
10 4 4 4
11 4 4 3
12 5 3 50
13 5 3 5
14 5 3 7
15 6 2 60
16 6 2 7
17 6 2 9
You can use pandas.melt:
df_new = (df.melt(id_vars=['id', 'category'], value_name='day')
            .sort_values(['id', 'variable'])  # day1 < day2 < day3 keeps per-id day order
            .drop('variable', axis=1)
            .reset_index(drop=True))
print(df_new)
Output:
id category day
0 1 1 10
1 1 1 1
2 1 1 0
3 2 3 20
4 2 3 2
5 2 3 1
6 3 1 30
7 3 1 3
8 3 1 2
9 4 4 40
10 4 4 4
11 4 4 3
12 5 3 50
13 5 3 5
14 5 3 7
15 6 2 60
16 6 2 7
17 6 2 9
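One caveat of my own about the melt route: sort_values(['id', 'variable']) sorts the day names lexicographically, which matches numeric order only up to day9 ('day10' would sort before 'day2'). With the three columns here that's fine; with many day columns, spelling out the order is safer, e.g. this sketch (the day_cols list is assumed, matching the question's column names):
day_cols = ['day1', 'day2', 'day3']  # explicit intended order
df_new = df.melt(id_vars=['id', 'category'], value_vars=day_cols, value_name='day')
df_new['variable'] = pd.Categorical(df_new['variable'], categories=day_cols, ordered=True)
df_new = (df_new.sort_values(['id', 'variable'])
                .drop(columns='variable')
                .reset_index(drop=True))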
I am trying to conduct a mixed model analysis but would like to include only individuals who have data at all available timepoints. Here is an example of what my dataframe looks like:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
print(df)
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 3 1 5
13 3 2 4
14 3 4 5
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
I want to keep only the individuals in the id column who have all 6 timepoints, i.e. IDs 1, 2, and 4 (and drop all of ID 3's data).
Here's the ideal output:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 4 1 8
13 4 2 4
14 4 3 5
15 4 4 6
16 4 5 2
17 4 6 3
Any help much appreciated.
You can count the number of distinct timepoints, then filter your dataframe with transform('nunique') and loc, keeping only the IDs that have all 6 of them:
t = df['timepoint'].nunique()  # 6 distinct timepoints in the data
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]
Prints:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
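An alternative pattern (not from the answer above, just a common one) is groupby().filter, which keeps every whole group that satisfies a predicate; it reads naturally, though transform is usually faster on large frames:
t = df['timepoint'].nunique()
res = df.groupby('id').filter(lambda g: g['timepoint'].nunique() == t)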
Good morning,
I have a problem trying to replace some values. I have a dataframe with a column "loc10p" that separates the records into 10 groups, and within each group the records are divided into smaller subgroups. However, the subgroup numbering restarts at 1 in every group instead of continuing from where the previous group's numbering left off. For example:
c2[c2.loc10p.isin([1,2])].sort_values(['loc10p','subgrupoloc10'])[['loc10p','subgrupoloc10']]
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 1
8 2 1
9 2 1
16 2 1
17 2 1
18 2 2
23 2 2
How can I transform that into something like the following:
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 3
8 2 3
9 2 3
16 2 3
17 2 3
18 2 4
23 2 4
I tried a loop that splits each group into a separate dataframe and then replaces the subgroup values with a running counter, but it didn't replace anything:
w = 1
temporal = []
for e in range(1, 11):
    temp = c2[c2['loc10p'] == e]   # one dataframe per loc10p group
    temporal.append(temp)
for e, i in zip(temporal, range(1, 9)):
    try:
        e.loc[:, 'subgrupoloc10'] = w   # assigns to a copy of c2, so c2 never changes
        w += 1
    except:
        pass
Any help will be really appreciated!!
Try with ngroup:
df['out'] = df.groupby(['loc10p','subgrupoloc10']).ngroup()+1
Out[204]:
1 1
7 1
15 1
0 2
14 2
30 2
31 2
2 3
8 3
9 3
16 3
17 3
18 4
23 4
dtype: int64
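A small note of my own: ngroup numbers the groups in sorted key order by default, which happens to be what you want here. If you ever need numbering by order of first appearance instead, pass sort=False to groupby:
df['out'] = df.groupby(['loc10p', 'subgrupoloc10'], sort=False).ngroup() + 1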
Try:
groups = (df["subgrupoloc10"] != df["subgrupoloc10"].shift()).cumsum()  # new label at each change
df["subgrupoloc10"] = groups
print(df)
Prints:
loc10p subgrupoloc10
1 1 1
7 1 1
15 1 1
0 1 2
14 1 2
30 1 2
31 1 2
2 2 3
8 2 3
9 2 3
16 2 3
17 2 3
18 2 4
23 2 4
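One caveat of my own about this shift-based version: it compares only subgrupoloc10, so it would miss a boundary where one loc10p group ends and the next begins with the same subgroup number. Comparing both columns guards against that:
cols = ['loc10p', 'subgrupoloc10']
boundary = df[cols].ne(df[cols].shift()).any(axis=1)  # True wherever either column changes
df['subgrupoloc10'] = boundary.cumsum()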
I'm a beginner in Python data science. I'm working on clickstream data and trying to count consecutive clicks on an item within a given session. I get a cumulative sum in the 'Block' column, then aggregate on Block to get the count for each block. In the end I want to group by session and item and sum the block counts, since there are cases (Sid=6 here) where an item occurs consecutively m times, and then, after other items, consecutively n more times, so its consecutive count should be m+n.
Here is the dataset-
Sid Tstamp Itemid
0 1 2014-04-07T10:51:09.277Z 214536502
1 1 2014-04-07T10:54:09.868Z 214536500
2 1 2014-04-07T10:54:46.998Z 214536506
3 1 2014-04-07T10:57:00.306Z 214577561
4 2 2014-04-07T13:56:37.614Z 214662742
5 2 2014-04-07T13:57:19.373Z 214662742
6 2 2014-04-07T13:58:37.446Z 214825110
7 2 2014-04-07T13:59:50.710Z 214757390
8 2 2014-04-07T14:00:38.247Z 214757407
9 2 2014-04-07T14:02:36.889Z 214551617
10 3 2014-04-02T13:17:46.940Z 214716935
11 3 2014-04-02T13:26:02.515Z 214774687
12 3 2014-04-02T13:30:12.318Z 214832672
13 4 2014-04-07T12:09:10.948Z 214836765
14 4 2014-04-07T12:26:25.416Z 214706482
15 6 2014-04-03T10:44:35.672Z 214821275
16 6 2014-04-03T10:45:01.674Z 214821275
17 6 2014-04-03T10:45:29.873Z 214821371
18 6 2014-04-03T10:46:12.162Z 214821371
19 6 2014-04-03T10:46:57.355Z 214821371
20 6 2014-04-03T10:53:22.572Z 214717089
21 6 2014-04-03T10:53:49.875Z 214563337
22 6 2014-04-03T10:55:19.267Z 214706462
23 6 2014-04-03T10:55:47.327Z 214821371
24 6 2014-04-03T10:56:30.520Z 214821371
25 6 2014-04-03T10:57:19.331Z 214821371
26 6 2014-04-03T10:57:39.433Z 214819762
Here is my code-
k['Block'] = (k['Itemid'] != k['Itemid'].shift(1)).astype(int).cumsum()  # label consecutive runs
y = k.groupby('Block').count()
# this is the step I can't get right -- summing the per-block counts per (Sid, Itemid):
z = k.groupby(['Sid','Itemid']).agg({"y[Count]": lambda x: x.sum()})
Won't this work? Summing the per-block counts over all of an item's blocks within a session is just that item's total row count in the session, so a plain count suffices:
k.groupby(['Sid', 'Itemid']).Block.count()
Sid Itemid
1 214536500 1
214536502 1
214536506 1
214577561 1
2 214551617 1
214662742 2
214757390 1
214757407 1
214825110 1
3 214716935 1
214774687 1
214832672 1
4 214706482 1
214836765 1
6 214563337 1
214706462 1
214717089 1
214819762 1
214821275 2
214821371 6
Name: Block, dtype: int64
IIUC you can:
k['Block'] = (k['Itemid'] != k['Itemid'].shift(1)).astype(int).cumsum()  # label consecutive runs
z = (k.groupby(['Sid', 'Itemid', 'Block']).size()  # size of each consecutive run
      .groupby(level=[0, 1]).sum()                 # add up the run sizes per (Sid, Itemid)
      .reset_index(name='sum_counts'))
print(z)
Sid Itemid sum_counts
0 1 214536500 1
1 1 214536502 1
2 1 214536506 1
3 1 214577561 1
4 2 214551617 1
5 2 214662742 2
6 2 214757390 1
7 2 214757407 1
8 2 214825110 1
9 3 214716935 1
10 3 214774687 1
11 3 214832672 1
12 4 214706482 1
13 4 214836765 1
14 6 214563337 1
15 6 214706462 1
16 6 214717089 1
17 6 214819762 1
18 6 214821275 2
19 6 214821371 6
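A closing simplification of my own: because summing the per-block sizes over all of a pair's blocks gives the pair's total row count, the same table falls out of a single size() without building Block at all:
z = k.groupby(['Sid', 'Itemid']).size().reset_index(name='sum_counts')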