Good morning
I have a problem when trying to replace some values. I have a dataframe with a column "loc10p" that separates the records into 10 groups, and within each group I have divided the records into smaller subgroups. However, the subgroup numbering restarts at 1 in every group instead of continuing from the last subgroup of the previous group. For example:
c2[c2.loc10p.isin([1,2])].sort_values(['loc10p','subgrupoloc10'])[['loc10p','subgrupoloc10']]
    loc10p  subgrupoloc10
1        1              1
7        1              1
15       1              1
0        1              2
14       1              2
30       1              2
31       1              2
2        2              1
8        2              1
9        2              1
16       2              1
17       2              1
18       2              2
23       2              2
How can I transform that into something like the following:
    loc10p  subgrupoloc10
1        1              1
7        1              1
15       1              1
0        1              2
14       1              2
30       1              2
31       1              2
2        2              3
8        2              3
9        2              3
16       2              3
17       2              3
18       2              4
23       2              4
I tried a loop that splits each group into a separate dataframe and then replaces the subgroup values with a counter carried over from the previous group, but it didn't replace anything:
w = 1
temporal = []
for e in range(1, 11):
    temp = c2[c2['loc10p'] == e]
    temporal.append(temp)
for e, i in zip(temporal, range(1, 9)):
    try:
        e.loc[:, 'subgrupoloc10'] = w
        w += 1
    except:
        pass
Any help will be really appreciated!!
Try with ngroup
df['out'] = df.groupby(['loc10p','subgrupoloc10']).ngroup()+1
Out[204]:
1     1
7     1
15    1
0     2
14    2
30    2
31    2
2     3
8     3
9     3
16    3
17    3
18    4
23    4
dtype: int64
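Note that by default ngroup numbers groups in sorted key order, which happens to match the order of appearance here. If you need numbering strictly by order of appearance regardless of the key values, a minimal sketch (assuming the frame is already ordered the way you want):
df['out'] = df.groupby(['loc10p', 'subgrupoloc10'], sort=False).ngroup() + 1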
Try:
# Start a new number whenever the subgroup label changes from the previous row
groups = (df["subgrupoloc10"] != df["subgrupoloc10"].shift()).cumsum()
df["subgrupoloc10"] = groups
print(df)
Prints:
    loc10p  subgrupoloc10
1        1              1
7        1              1
15       1              1
0        1              2
14       1              2
30       1              2
31       1              2
2        2              3
8        2              3
9        2              3
16       2              3
17       2              3
18       2              4
23       2              4
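This works because comparing the column with its shifted self flags every row where the subgroup label changes, and the cumulative sum of those flags assigns a running number to each run. If two adjacent groups could end and start with the same subgroup label, a safer sketch keys on both columns instead of the one-column version above:
# Start a new running number whenever either column changes
key = df[["loc10p", "subgrupoloc10"]]
df["subgrupoloc10"] = key.ne(key.shift()).any(axis=1).cumsum()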
Assume I have df:
pd.DataFrame({'data': [0,0,0,1,1,1,2,2,2,3,3,4,4,5,5,0,0,0,0,2,2,2,2,4,4,4,4]})
    data
0      0
1      0
2      0
3      1
4      1
5      1
6      2
7      2
8      2
9      3
10     3
11     4
12     4
13     5
14     5
15     0
16     0
17     0
18     0
19     2
20     2
21     2
22     2
23     4
24     4
25     4
26     4
I'm looking for a way to create a new column in df that counts how many times the value has repeated consecutively so far. For example:
    data  new
0      0    1
1      0    2
2      0    3
3      1    1
4      1    2
5      1    3
6      2    1
7      2    2
8      2    3
9      3    1
10     3    2
11     4    1
12     4    2
13     5    1
14     5    2
15     0    1
16     0    2
17     0    3
18     0    4
19     2    1
20     2    2
21     2    3
22     2    4
23     4    1
24     4    2
25     4    3
26     4    4
My idea was to pull the rows into a Python list, compare the elements, and build a new list.
Is there a simple way to do this?
Example
df = pd.DataFrame({'data': [0,0,0,1,1,1,2,2,2,3,3,4,4,5,5,0,0,0,0,2,2,2,2,4,4,4,4]})
Code
# New run id whenever the value differs from the previous row
grouper = df['data'].ne(df['data'].shift(1)).cumsum()
# Number the rows within each run, starting at 1
df['new'] = df.groupby(grouper).cumcount().add(1)
df
    data  new
0      0    1
1      0    2
2      0    3
3      1    1
4      1    2
5      1    3
6      2    1
7      2    2
8      2    3
9      3    1
10     3    2
11     4    1
12     4    2
13     5    1
14     5    2
15     0    1
16     0    2
17     0    3
18     0    4
19     2    1
20     2    2
21     2    3
22     2    4
23     4    1
24     4    2
25     4    3
26     4    4
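To see why this works, the intermediate grouper assigns one id per consecutive run (the exact ids don't matter, only that they change at every boundary); cumcount then numbers the rows inside each run from 0, and .add(1) starts the count at 1:
print(grouper.tolist())
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9]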
I have a DataFrame that looks something like the following one:
   time  status   A
0     0       2  20
1     1       2  21
2     2       2  20
3     3       2  19
4     4      10  18
5     5       2  17
6     6       2  18
7     7       2  19
8     8       2  18
9     9      10  17
..   ...     ...  ..
Now I'd like to select all rows with status == 2 and group the resulting runs of rows that are not interrupted by any other status, so that I can access each group separately afterwards.
Something like:
print(df1)
   time  status   A
0     0       2  20
1     1       2  21
2     2       2  20
3     3       2  19
print(df2)
   time  status   A
0     5       2  17
1     6       2  18
2     7       2  19
3     8       2  18
Is there an efficient, loop-avoiding way to achieve this?
Thank you in advance!
Input data:
>>> df
   time  status   A
0     0       2  20   # group 1
1     1       2  21   # 1
2     2       2  20   # 1
3     3       2  19   # 1
4     4      10  18   # group 2
5     5       2  17   # group 3
6     6       2  18   # 3
7     7       2  19   # 3
8     8       2  18   # 3
9     9      10  17   # group 4
df["group"] = df.status.ne(df.status.shift()).cumsum()
>>> df
   time  status   A  group
0     0       2  20      1
1     1       2  21      1
2     2       2  20      1
3     3       2  19      1
4     4      10  18      2
5     5       2  17      3
6     6       2  18      3
7     7       2  19      3
8     8       2  18      3
9     9      10  17      4
Now you can do what you want. For example:
(_, df1), (_, df2) = list(df.loc[df["status"] == 2].groupby("group"))
>>> df1
   time  status   A  group
0     0       2  20      1
1     1       2  21      1
2     2       2  20      1
3     3       2  19      1
>>> df2
   time  status   A  group
5     5       2  17      3
6     6       2  18      3
7     7       2  19      3
8     8       2  18      3
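If the number of runs isn't known in advance, a small sketch (the runs name is just for illustration) collects every uninterrupted status == 2 run into a list:
# One DataFrame per uninterrupted run of status == 2
runs = [g for _, g in df[df["status"] == 2].groupby("group")]
print(len(runs))  # 2 for this sample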
I have a large dataframe in the following format. For each ID, I need to keep only the rows from the first occurrence of values == 1 through the end of that ID; the selection should reset at each new ID.
d = {'ID': [1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4,4,4,4,5,5,5,5,5],
     'values': [0,0,0,1,0,1,0,1,1,1,0,1,0,0,0,0,0,0,1,1,0,1,0,1,1,1,1,1]}
df = pd.DataFrame(data=d)
df
The desired result:
ND = {'ID': [1,1,2,2,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5],
      'values': [1,0,1,0,1,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1]}
df_final = pd.DataFrame(ND)
df_final
IIUC,
df[df.groupby('ID')['values'].transform('cummax')==1]
Output:
    ID  values
3    1       1
4    1       0
5    2       1
6    2       0
7    2       1
8    2       1
9    2       1
11   3       1
12   3       0
13   3       0
18   4       1
19   4       1
20   4       0
21   4       1
22   4       0
23   5       1
24   5       1
25   5       1
26   5       1
27   5       1
Details: cummax keeps the value at 1 after the first 1 is found within each ID. Comparing to 1 then yields a boolean Series, which is used for boolean indexing.
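For reference, the intermediate per-ID cummax on this data looks like this (a 0 stays 0 only until the first 1 within its ID):
print(df.groupby('ID')['values'].transform('cummax').tolist())
# [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]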
If your values column contains only 0 and 1, you can use groupby.cummax, which turns every 0 that follows a 1 within an ID into a 1, and use the result as a boolean mask:
df_ = df[df.groupby('ID')['values'].cummax().astype(bool).to_numpy()]
print(df_)
    ID  values
3    1       1
4    1       0
5    2       1
6    2       0
7    2       1
8    2       1
9    2       1
11   3       1
12   3       0
13   3       0
18   4       1
19   4       1
20   4       0
21   4       1
22   4       0
23   5       1
24   5       1
25   5       1
26   5       1
27   5       1
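Comparing to 1 instead of casting gives the same mask here, and .to_numpy() is optional since a boolean Series aligns on the index anyway; a quick sanity check against the df_ from above:
mask = df.groupby('ID')['values'].cummax().eq(1)
print(df[mask].equals(df_))  # True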
I'm a beginner in Python data science. I'm working on clickstream data and trying to count consecutive clicks on an item within a given session. I compute a cumulative sum in the 'Block' column, then aggregate on Block to get the count for each block. Finally I want to group by session and item and sum the block counts, since there are cases (Sid=6 here) where an item appears consecutively m times and then, after other items, consecutively n more times; the consecutive count should be m+n.
Here is the dataset:
    Sid                    Tstamp     Itemid
0     1  2014-04-07T10:51:09.277Z  214536502
1     1  2014-04-07T10:54:09.868Z  214536500
2     1  2014-04-07T10:54:46.998Z  214536506
3     1  2014-04-07T10:57:00.306Z  214577561
4     2  2014-04-07T13:56:37.614Z  214662742
5     2  2014-04-07T13:57:19.373Z  214662742
6     2  2014-04-07T13:58:37.446Z  214825110
7     2  2014-04-07T13:59:50.710Z  214757390
8     2  2014-04-07T14:00:38.247Z  214757407
9     2  2014-04-07T14:02:36.889Z  214551617
10    3  2014-04-02T13:17:46.940Z  214716935
11    3  2014-04-02T13:26:02.515Z  214774687
12    3  2014-04-02T13:30:12.318Z  214832672
13    4  2014-04-07T12:09:10.948Z  214836765
14    4  2014-04-07T12:26:25.416Z  214706482
15    6  2014-04-03T10:44:35.672Z  214821275
16    6  2014-04-03T10:45:01.674Z  214821275
17    6  2014-04-03T10:45:29.873Z  214821371
18    6  2014-04-03T10:46:12.162Z  214821371
19    6  2014-04-03T10:46:57.355Z  214821371
20    6  2014-04-03T10:53:22.572Z  214717089
21    6  2014-04-03T10:53:49.875Z  214563337
22    6  2014-04-03T10:55:19.267Z  214706462
23    6  2014-04-03T10:55:47.327Z  214821371
24    6  2014-04-03T10:56:30.520Z  214821371
25    6  2014-04-03T10:57:19.331Z  214821371
26    6  2014-04-03T10:57:39.433Z  214819762
Here is my code:
k['Block'] = (k['Itemid'] != k['Itemid'].shift(1)).astype(int).cumsum()
y = k.groupby('Block').count()
z = k.groupby(['Sid','Itemid']).agg({"y[Count]": lambda x: x.sum()})
Won't this work? Every occurrence of an item within a session contributes to the m+n total, so a plain row count per (Sid, Itemid) already gives it:
k.groupby(['Sid', 'Itemid']).Block.count()
Sid  Itemid
1    214536500    1
     214536502    1
     214536506    1
     214577561    1
2    214551617    1
     214662742    2
     214757390    1
     214757407    1
     214825110    1
3    214716935    1
     214774687    1
     214832672    1
4    214706482    1
     214836765    1
6    214563337    1
     214706462    1
     214717089    1
     214819762    1
     214821275    2
     214821371    6
Name: Block, dtype: int64
IIUC you can:
k['Block'] = (k['Itemid'] != k['Itemid'].shift(1)).astype(int).cumsum()
# print(k)
z = (k.groupby(['Sid', 'Itemid', 'Block'])
      .size()
      .groupby(level=[0, 1])
      .sum()
      .reset_index(name='sum_counts'))
print(z)
    Sid     Itemid  sum_counts
0     1  214536500           1
1     1  214536502           1
2     1  214536506           1
3     1  214577561           1
4     2  214551617           1
5     2  214662742           2
6     2  214757390           1
7     2  214757407           1
8     2  214825110           1
9     3  214716935           1
10    3  214774687           1
11    3  214832672           1
12    4  214706482           1
13    4  214836765           1
14    6  214563337           1
15    6  214706462           1
16    6  214717089           1
17    6  214819762           1
18    6  214821275           2
19    6  214821371           6
I have a data frame in pandas that looks like the following:
df =
Image_Number  Parent_Object  Child_Object
1             1              1
1             1              2
1             1              3
1             1              4
1             2              5
1             1              6
1             2              7
1             2              8
1             3              9
1             3              10
1             3              11
1             3              12
2             1              13
2             1              14
2             1              15
2             1              16
2             2              17
2             2              18
2             2              19
2             3              20
2             3              21
2             3              22
2             2              23
2             3              24
2             3              25
3             1              26
3             1              27
3             1              28
3             2              29
3             2              30
How could I write something that counts the child objects belonging to each parent object for each image?
It would be extremely helpful to get an output like the following:
Image_Number  Parent_Object  Number_of_Child_Objects
1             1              5
1             2              3
1             3              4
2             1              4
2             2              4
2             3              5
3             1              3
3             2              2
What you want is to calculate something (counts) for different groups of values of Image_Number and Parent_Object. This can be done with the groupby method (see the docs: http://pandas.pydata.org/pandas-docs/stable/groupby.html).
In your case:
df.groupby(by=['Image_Number', 'Parent_Object']).count()
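Note that count() counts the non-null entries of the remaining columns (here Child_Object), so the result keeps that column name. A sketch producing exactly the layout asked for, using size and renaming to the asker's Number_of_Child_Objects:
out = (df.groupby(['Image_Number', 'Parent_Object'])
         .size()
         .reset_index(name='Number_of_Child_Objects'))
print(out)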