I have two data frames. One looks like this:
point sector
1 1 4
2 2 5
3 3 2
4 4 1
5 5 5
6 6 1
7 7 4
8 8 3
10 10 5
11 11 2
12 12 1
13 13 3
14 14 1
15 15 4
16 16 3
17 17 2
18 18 1
19 19 1
20 20 1
21 alt 1 2
22 alt 3 3
23 alt 2 5
And the other like this, where each entry is the sector I want the corresponding point to come from:
p1 p2 p3 p4
1 2 3 4
1 2 3 5
1 2 4 5
1 3 4 5
2 3 4 5
What I want to do is create another data frame that will give me a randomly selected set of points from the first dataframe based on their sector.
For example:
p1 p2 p3 p4
lane 1: 12 3 8 7
As you can see, the points in lane 1 all have sectors that appear in row 1 of the second dataframe. I have been trying to use df.loc, but was wondering if there is a better way?
For each row, fetch the matching points from the first dataframe (which should be indexed by sector for the .loc lookup) and randomly choose one per sector:
df2.apply(lambda r: df.loc[r].groupby(level=0).point.apply(np.random.choice).values, axis=1)
Out[132]:
p1 p2 p3 p4
0 4 11 alt 3 1
1 6 11 13 alt 2
2 4 17 7 alt 2
3 19 alt 3 15 5
4 alt 1 13 7 10
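For reference, a minimal self-contained sketch of this approach (the tables are abbreviated from the question; on recent pandas you may need result_type='broadcast' to keep the p1..p4 labels, and groupby sorts by sector, so the columns line up only when each row of df2 lists its sectors in ascending order, as the question's data does):
import numpy as np
import pandas as pd

# Points table, indexed by sector so df.loc can fetch every point in a sector.
df = pd.DataFrame({
    'point': ['1', '2', '3', '4', '5', 'alt 1'],
    'sector': [4, 5, 2, 1, 5, 2],
}).set_index('sector')

# Each row lists the sectors the four points should come from.
df2 = pd.DataFrame([[1, 2, 4, 5]], columns=['p1', 'p2', 'p3', 'p4'])

# Look up all candidate points per sector, then pick one at random per sector.
picks = df2.apply(
    lambda r: df.loc[r].groupby(level=0)['point'].apply(np.random.choice).values,
    axis=1, result_type='broadcast',
)
print(picks)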
Assume I have this df:
pd.DataFrame({'data': [0,0,0,1,1,1,2,2,2,3,3,4,4,5,5,0,0,0,0,2,2,2,2,4,4,4,4]})
data
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
10 3
11 4
12 4
13 5
14 5
15 0
16 0
17 0
18 0
19 2
20 2
21 2
22 2
23 4
24 4
25 4
26 4
I'm looking for a way to create a new column in df that counts how many times each value has repeated consecutively so far. For example:
data new
0 0 1
1 0 2
2 0 3
3 1 1
4 1 2
5 1 3
6 2 1
7 2 2
8 2 3
9 3 1
10 3 2
11 4 1
12 4 2
13 5 1
14 5 2
15 0 1
16 0 2
17 0 3
18 0 4
19 2 1
20 2 2
21 2 3
22 2 4
23 4 1
24 4 2
25 4 3
26 4 4
My idea was to pull the column into a Python list, compare adjacent entries, and build a new list. Is there a simpler way to do this?
Example
df = pd.DataFrame({'data': [0,0,0,1,1,1,2,2,2,3,3,4,4,5,5,0,0,0,0,2,2,2,2,4,4,4,4]})
Code
grouper = df['data'].ne(df['data'].shift(1)).cumsum()
df['new'] = df.groupby(grouper).cumcount().add(1)
df
data new
0 0 1
1 0 2
2 0 3
3 1 1
4 1 2
5 1 3
6 2 1
7 2 2
8 2 3
9 3 1
10 3 2
11 4 1
12 4 2
13 5 1
14 5 2
15 0 1
16 0 2
17 0 3
18 0 4
19 2 1
20 2 2
21 2 3
22 2 4
23 4 1
24 4 2
25 4 3
26 4 4
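The trick is in the grouper: ne(shift) flags each row whose value differs from the previous one, and cumsum turns those flags into run ids, so cumcount restarts inside every run. A small sketch on toy data (not the question's frame):
import pandas as pd

toy = pd.DataFrame({'data': [0, 0, 1, 1, 0]})
# True at every value change -> the cumulative sum yields one id per run.
grouper = toy['data'].ne(toy['data'].shift(1)).cumsum()
print(grouper.tolist())                                  # [1, 1, 2, 2, 3]
print(toy.groupby(grouper).cumcount().add(1).tolist())   # [1, 2, 1, 2, 1]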
I have a DataFrame that looks somehow like the following one:
time status A
0 0 2 20
1 1 2 21
2 2 2 20
3 3 2 19
4 4 10 18
5 5 2 17
6 6 2 18
7 7 2 19
8 8 2 18
9 9 10 17
... ... ... ...
Now, I'd like to select all rows with status == 2 and group the resulting runs of rows that are not interrupted by any other status, so that I can access each group separately afterwards.
Something like:
print df1
time status A
0 0 2 20
1 1 2 21
2 2 2 20
3 3 2 19
print df2
time status A
0 5 2 17
1 6 2 18
2 7 2 19
3 8 2 18
Is there an efficient, loop-avoiding way to achieve this?
Thank you in advance!
Input data:
>>> df
time status A
0 0 2 20 # group 1
1 1 2 21 # 1
2 2 2 20 # 1
3 3 2 19 # 1
4 4 10 18 # group 2
5 5 2 17 # group 3
6 6 2 18 # 3
7 7 2 19 # 3
8 8 2 18 # 3
9 9 10 17 # group 4
df["group"] = df.status.ne(df.status.shift()).cumsum()
>>> df
time status A group
0 0 2 20 1
1 1 2 21 1
2 2 2 20 1
3 3 2 19 1
4 4 10 18 2
5 5 2 17 3
6 6 2 18 3
7 7 2 19 3
8 8 2 18 3
9 9 10 17 4
Now you can do what you want. For example:
(_, df1), (_, df2) = list(df.loc[df["status"] == 2].groupby("group"))
>>> df1
time status A group
0 0 2 20 1
1 1 2 21 1
2 2 2 20 1
3 3 2 19 1
>>> df2
time status A group
5 5 2 17 3
6 6 2 18 3
7 7 2 19 3
8 8 2 18 3
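If the number of status == 2 runs isn't known in advance, the tuple unpacking above breaks; collecting the groups into a dict keyed by group id generalizes it (a small sketch reusing the group column computed above):
# One sub-frame per uninterrupted run of status == 2, keyed by its group id.
runs = {g: sub for g, sub in df[df["status"] == 2].groupby("group")}
df1, df2 = runs[1], runs[3]   # the two runs from this example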
My apologies if the title is confusing; I will do my best to explain the problem. I have an example data set here:
Ex.1
Segment Reach OutSeg Elevation
1 1 3 50
1 2 3 74
1 3 3 87
1 4 3 53
1 5 3 97
2 1 3 16
2 2 3 14
2 3 3 31
2 4 3 35
2 5 3 27
3 1 4 193
3 2 4 176
3 3 4 158
3 4 4 154
4 1 6 21
4 2 6 45
4 3 6 42
4 4 6 22
4 5 6 22
5 1 6 10
5 2 6 21
5 3 6 14
5 4 6 16
I would like to calculate the moving average (with a window of 3) of Elevation along each Segment, in the sequential order of its Reach. However, where a Segment has an OutSeg value, I would like the moving average toward the end of the Segment to use Elevation values from the beginning of the referenced segment (OutSeg). For example, at Segment 1, Reach 5 (1,5), the moving average should account for the values at (1,4), (1,5), and (3,1).
I believe some kind of for loop may be needed. I have tried the code below, but it only calculates the moving average within each group:
df["moving"] = df.groupby("Segment")["Elevation"].transform(lambda x: x.rolling(3, 1).mean())
Thanks in advance!
I could not come up with a fully vectorized way, so I would just use a groupby on Segment applying a specific function. That function appends the first row (if any) of the OutSeg sub-dataframe, computes the rolling mean, and returns only the original rows.
Code could be:
df = df.sort_values(['Segment', 'Reach']) # ensure correct order
def tx(dg):
    seg = dg['Segment'].iat[0]      # id of the segment being processed
    outseg = dg['OutSeg'].iat[0]    # segment whose first row extends the tail
    # Append the first row (if any) of the OutSeg segment, then roll over it.
    x = pd.concat([dg, df[df['Segment'] == outseg].head(1)])
    y = x['Elevation'].rolling(3, 1, center=True).mean()
    return y[x['Segment'] == seg]   # keep only the original segment's rows

df['Mean3'] = df.groupby('Segment', as_index=False, group_keys=False).apply(tx)
print(df)
It gives:
Segment Reach OutSeg Elevation Mean3
0 1 1 3 50 62.000000
1 1 2 3 74 70.333333
2 1 3 3 87 71.333333
3 1 4 3 53 79.000000
4 1 5 3 97 114.333333
5 2 1 3 16 15.000000
6 2 2 3 14 20.333333
7 2 3 3 31 26.666667
8 2 4 3 35 31.000000
9 2 5 3 27 85.000000
10 3 1 4 193 184.500000
11 3 2 4 176 175.666667
12 3 3 4 158 162.666667
13 3 4 4 154 111.000000
14 4 1 6 21 33.000000
15 4 2 6 45 36.000000
16 4 3 6 42 36.333333
17 4 4 6 22 28.666667
18 4 5 6 22 22.000000
19 5 1 6 10 15.500000
20 5 2 6 21 15.000000
21 5 3 6 14 17.000000
22 5 4 6 16 15.000000
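As a quick sanity check of the hand-off at a segment boundary: the Mean3 at Segment 1, Reach 5 averages (1,4), (1,5), and the first row of its OutSeg, Segment 3:
print((53 + 97 + 193) / 3)   # 114.333..., matching row 4 of the output above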
Say I have a data frame that looks like this:
Id ColA
1 2
2 2
3 3
4 5
5 10
6 12
7 18
8 20
9 25
10 26
I would like my code to create a new column at the end of the DataFrame that splits the observations into 5 equal-sized segments, labeled from 5 down to 1.
Id ColA Segment
1 2 5
2 2 5
3 3 4
4 5 4
5 10 3
6 12 3
7 18 2
8 20 2
9 25 1
10 26 1
I tried the following code, but it doesn't work:
df['segment'] = pd.qcut(df['Id'],5)
I also want to know what would happen if the total number of observations was not divisible by 5.
Actually, you were closer to the answer than you think. This will work regardless of whether len(df) is a multiple of 5 or not.
bins = 5
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 4
4 5 10 3
5 6 12 3
6 7 18 2
7 8 20 2
8 9 25 1
9 10 26 1
Where,
pd.qcut(df['Id'], bins).cat.codes
0    0
1    0
2    1
3    1
4    2
5    2
6    3
7    3
8    4
9    4
dtype: int8
represents the categorical intervals returned by pd.qcut as integer codes (0 for the lowest quintile, 4 for the highest), so bins - codes maps them to 5 down to 1.
Another example, for a DataFrame with 7 rows (not a multiple of 5), where qcut still forms five near-equal bins:
df = df.head(7).copy()
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 3
4 5 10 2
5 6 12 1
6 7 18 1
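For reference, the quantile intervals behind those codes can be inspected directly:
print(pd.qcut(df['Id'], bins).cat.categories)   # the five intervals qcut chose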
This should work:
df['segment'] = np.linspace(1, 6, len(df), False, dtype=int)
It creates an array of ints from 1 to 5, one value per row. If you want them to run from 5 down to 1, just append [::-1] at the end of the line.
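Spelled out with the reversal applied (a small sketch; the positional False in the one-liner above is endpoint, and dtype=int truncates the evenly spaced floats):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': range(1, 11),
                   'ColA': [2, 2, 3, 5, 10, 12, 18, 20, 25, 26]})

# 10 evenly spaced values in [1, 6) truncate to 1,1,2,2,3,3,4,4,5,5;
# reversing them gives the 5-down-to-1 labeling from the question.
df['segment'] = np.linspace(1, 6, len(df), endpoint=False, dtype=int)[::-1]
print(df)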
I'm a beginner in Python data science. I'm working on clickstream data and trying to count consecutive clicks on an item within a given session. I compute a cumulative run id in the 'Block' column, then aggregate on Block to get the count for each block. Finally, I want to group by session and item and sum the block counts, since there are cases (Sid=6 here) where an item appears consecutively m times and then, after other items, appears consecutively n more times; the consecutive count should then be m+n.
Here is the dataset-
Sid Tstamp Itemid
0 1 2014-04-07T10:51:09.277Z 214536502
1 1 2014-04-07T10:54:09.868Z 214536500
2 1 2014-04-07T10:54:46.998Z 214536506
3 1 2014-04-07T10:57:00.306Z 214577561
4 2 2014-04-07T13:56:37.614Z 214662742
5 2 2014-04-07T13:57:19.373Z 214662742
6 2 2014-04-07T13:58:37.446Z 214825110
7 2 2014-04-07T13:59:50.710Z 214757390
8 2 2014-04-07T14:00:38.247Z 214757407
9 2 2014-04-07T14:02:36.889Z 214551617
10 3 2014-04-02T13:17:46.940Z 214716935
11 3 2014-04-02T13:26:02.515Z 214774687
12 3 2014-04-02T13:30:12.318Z 214832672
13 4 2014-04-07T12:09:10.948Z 214836765
14 4 2014-04-07T12:26:25.416Z 214706482
15 6 2014-04-03T10:44:35.672Z 214821275
16 6 2014-04-03T10:45:01.674Z 214821275
17 6 2014-04-03T10:45:29.873Z 214821371
18 6 2014-04-03T10:46:12.162Z 214821371
19 6 2014-04-03T10:46:57.355Z 214821371
20 6 2014-04-03T10:53:22.572Z 214717089
21 6 2014-04-03T10:53:49.875Z 214563337
22 6 2014-04-03T10:55:19.267Z 214706462
23 6 2014-04-03T10:55:47.327Z 214821371
24 6 2014-04-03T10:56:30.520Z 214821371
25 6 2014-04-03T10:57:19.331Z 214821371
26 6 2014-04-03T10:57:39.433Z 214819762
Here is my code-
k['Block'] =( k['Itemid'] != k['Itemid'].shift(1) ).astype(int).cumsum()
y=k.groupby('Block').count()
z=k.groupby(['Sid','Itemid']).agg({"y[Count]": lambda x: x.sum()})
Won't this work?
k.groupby(['Sid', 'Itemid']).Block.count()
Sid Itemid
1 214536500 1
214536502 1
214536506 1
214577561 1
2 214551617 1
214662742 2
214757390 1
214757407 1
214825110 1
3 214716935 1
214774687 1
214832672 1
4 214706482 1
214836765 1
6 214563337 1
214706462 1
214717089 1
214819762 1
214821275 2
214821371 6
Name: Block, dtype: int64
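This works because every click belongs to exactly one consecutive block, so a plain row count per (Sid, Itemid) already equals m + n; the Block column isn't needed for the total. A quick check on the Sid=6 item that appears in two runs:
print(len(k[(k.Sid == 6) & (k.Itemid == 214821371)]))   # 6, i.e. 3 + 3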
IIUC you can:
k['Block'] =( k['Itemid'] != k['Itemid'].shift(1) ).astype(int).cumsum()
#print k
z=k.groupby(['Sid','Itemid', 'Block']).size().groupby(level=[0,1]).sum().reset_index(name='sum_counts')
print z
Sid Itemid sum_counts
0 1 214536500 1
1 1 214536502 1
2 1 214536506 1
3 1 214577561 1
4 2 214551617 1
5 2 214662742 2
6 2 214757390 1
7 2 214757407 1
8 2 214825110 1
9 3 214716935 1
10 3 214774687 1
11 3 214832672 1
12 4 214706482 1
13 4 214836765 1
14 6 214563337 1
15 6 214706462 1
16 6 214717089 1
17 6 214819762 1
18 6 214821275 2
19 6 214821371 6
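To see the intermediate per-block sizes before they are summed back per item, here is a small sketch on toy data (the names mirror the answer's):
import pandas as pd

k = pd.DataFrame({'Sid': [6] * 5, 'Itemid': ['A', 'A', 'B', 'A', 'A']})
k['Block'] = (k['Itemid'] != k['Itemid'].shift(1)).astype(int).cumsum()
per_block = k.groupby(['Sid', 'Itemid', 'Block']).size()
print(per_block)                              # 'A' shows up as two runs of 2
print(per_block.groupby(level=[0, 1]).sum())  # summed per item: A -> 4, B -> 1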