Pandas DataFrame iterate over a window of rows quickly - python

I've got a time-series dataframe that looks something like:
datetime gesture left-5-x ...30 columns omitted
2022-09-27 19:54:54.396680 gesture0255 533
2022-09-27 19:54:54.403298 gesture0255 534
2022-09-27 19:54:54.408938 gesture0255 535
2022-09-27 19:54:54.413995 gesture0255 523
2022-09-27 19:54:54.418666 gesture0255 522
... 95 000 rows omitted
And I want to create a new column df['cross_correlation'] which is a function of multiple sequential rows, so the cross_correlation of row i depends on the data from rows i-10 to i+10.
I could do this with df.iterrows(), but that seems like the non-idiomatic way. Is there a function like
df.window(-10, +10).apply(lambda rows: calculate_cross_correlation(rows))
or similar?
EDIT:
Thanks @chris, who pointed me towards df.rolling(). I now have this example which better reflects the problem I'm having:
Here's a simplified version of the function I want to apply over the moving window. Note that the actual version requires that the input be the full 2D window of shape (window_size, num_columns) but the toy function below doesn't actually need the input to be 2D. I've added an assertion to make sure this is true:
def sum_over_2d(x):
    assert len(x.shape) == 2, f'shape of input is {x.shape} and not of length 2'
    return x.sum()
And now if I use .rolling with .apply
df.rolling(window=10, center=True).apply(
    sum_over_2d
)
I get an assertion error:
AssertionError: shape of input is (10,) and not of length 2
and if I print the input x before the assertion, I get:
0 533.0
1 534.0
2 535.0
3 523.0
4 522.0
5 526.0
6 510.0
7 509.0
8 502.0
9 496.0
dtype: float64
which is a single column from my many-columned dataset. What I want is for the input x to be a DataFrame or 2D numpy array.

IIUC, one way is using pandas.Series.rolling.
Example with sum:
df["new"] = df["left-5-x"].rolling(3, center=True, min_periods=1).sum()
Output:
datetime gesture left-5-x new explain
0 2022-09-27 19:54:54.396680 gesture0255 533 1067.0 533+534
1 2022-09-27 19:54:54.403298 gesture0255 534 1602.0 533+534+535
2 2022-09-27 19:54:54.408938 gesture0255 535 1592.0 534+535+523
3 2022-09-27 19:54:54.413995 gesture0255 523 1580.0 535+523+522
4 2022-09-27 19:54:54.418666 gesture0255 522 1045.0 523+522
You can see each left-5-x value is summed with its -1 and +1 neighbors.
Edit:
If you want to use the rolled dataframe, one way would be to iterate over the rolling object:
new_df = pd.concat([sum_over_2d(d) for d in df.rolling(window=10)],axis=1).T
Output:
0 1 2 3
0 0 1 2 3
1 4 6 8 10
2 12 15 18 21
3 24 28 32 36
4 40 45 50 55
5 60 66 72 78
6 84 91 98 105
7 112 120 128 136
8 144 153 162 171
9 180 190 200 210
Or, as per @Sandwichnick's comment, you can use method="table", but only if you pass engine="numba". In other words, your sum_over_2d must be numba-compilable (which is beyond the scope of this question and my knowledge):
df.rolling(window=10, center=True, method="table").sum(engine="numba")
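For the original ±10-row requirement, here is a minimal sketch (not from the answers above) that slices each centered 2D window by position; calculate_cross_correlation is only a placeholder, and the column left-5-y is a hypothetical name:
import numpy as np

def calculate_cross_correlation(window):
    # placeholder for the real computation on the (21, num_columns) window;
    # here it just correlates two assumed columns
    return window["left-5-x"].corr(window["left-5-y"])

half = 10
out = np.full(len(df), np.nan)
for i in range(half, len(df) - half):
    win = df.iloc[i - half : i + half + 1]  # full 2D window of shape (21, num_columns)
    out[i] = calculate_cross_correlation(win)
df["cross_correlation"] = out
This is plain iteration, so it is no faster than iterrows, but it hands the function the full 2D window that a plain .rolling(...).apply does not provide.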

Related

How to replace a classic for loop with df.iterrows()?

I have a huge data frame.
I am using a for loop in the below sample code:
for i in range(1, len(df_A2C), 1):
    A2C_TT = df_A2C.loc[(df_A2C['TO_ID'] == i)].sort_values('DURATION_H').head(1)
    if A2C_TT.size > 0:
        print(A2C_TT)
This is working fine, but I want to use df.iterrows() since it will help me automatically avoid empty-frame issues.
I want to iterate through TO_ID and look for the minimum values accordingly.
How should I replace my classical i loop counter with df.iterrows()?
Sample Data:
FROM_ID TO_ID DURATION_H DIST_KM
1 7 0.528555556 38.4398
2 26 0.512511111 37.38515
3 71 0.432452778 32.57571
4 83 0.599486111 39.26188
5 98 0.590516667 35.53107
6 108 1.077794444 76.79874
7 139 0.838972222 58.86963
8 146 1.185088889 76.39174
9 158 0.625872222 45.6373
10 208 0.500122222 31.85239
11 209 0.530916667 29.50249
12 221 0.945444444 62.69099
13 224 1.080883333 66.06291
14 240 0.734269444 48.1778
15 272 0.822875 57.5008
16 349 1.171163889 76.43536
17 350 1.080097222 71.16137
18 412 0.503583333 38.19685
19 416 1.144961111 74.35502
As far as I understand your question, you want to group your data by TO_ID and select the row where DURATION_H is the smallest? Is that right?
df.loc[df.groupby('TO_ID').DURATION_H.idxmin()]
here is one way about it
# run the loop for as many unique TO_ID values as there are,
# instead of iterrows, which runs over the whole DF (to the size of the DF)
for idx in np.unique(df['TO_ID']):
    A2C_TT = df.loc[(df['TO_ID'] == idx)].sort_values('DURATION_H').head(1)
    print(A2C_TT)
FROM_ID TO_ID DURATION_H DIST_KM
498660 39 7 0.434833 25.53808
here is another way about it
df.loc[df['DURATION_H'].eq(df.groupby('TO_ID')['DURATION_H'].transform(min))]
FROM_ID TO_ID DURATION_H DIST_KM
498660 39 7 0.434833 25.53808
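For completeness, a sketch of another loop-free idiom (not from the answers above): sort so the smallest DURATION_H comes first, then keep one row per TO_ID; column names are taken from the sample data.
result = (
    df.sort_values('DURATION_H')
      .drop_duplicates(subset='TO_ID', keep='first')   # first row per TO_ID = smallest DURATION_H
      .sort_values('TO_ID')
)
print(result)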

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
I can't seem to use this as a regular dataframe, and the datetime is messed up, as it doesn't show the month but gives the last day of the month back. Also the station name is a single index, and not repeated for the whole column. Plus the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that, in order to return a normal dataframe, I should use
as_index = False
in the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
ok, how about this:
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
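If the month itself (rather than the month-end date) is wanted, here is a sketch of an equivalent using pd.Grouper, assuming Date is already a datetime64 column:
out = (
    df.groupby(['Station_Name', pd.Grouper(key='Date', freq='M')])['Value']
      .mean()
      .reset_index()
)
# convert the month-end timestamps to year-month periods, e.g. 2006-01
out['Date'] = out['Date'].dt.to_period('M')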

How can I loop though pandas groupby and manipulate data?

I am trying to work out the time delta between values in a grouped pandas df.
My df looks like this:
Location ID Item Qty Time
0 7 202545942 100130 1 07:19:46
1 8 202545943 100130 1 07:20:08
2 11 202545950 100130 1 07:20:31
3 13 202545955 100130 1 07:21:08
4 15 202545958 100130 1 07:21:18
5 18 202545963 100130 3 07:21:53
6 217 202546320 100130 1 07:22:43
7 219 202546324 100130 1 07:22:54
8 229 202546351 100130 1 07:23:32
9 246 202546376 100130 1 07:24:09
10 273 202546438 100130 1 07:24:37
11 286 202546464 100130 1 07:24:59
12 296 202546490 100130 1 07:25:16
13 297 202546491 100130 1 07:25:24
14 310 202546516 100130 1 07:25:59
15 321 202546538 100130 1 07:26:17
16 329 202546549 100130 1 07:28:09
17 388 202546669 100130 1 07:29:02
18 420 202546717 100130 2 07:30:01
19 451 202546766 100130 1 07:30:19
20 456 202546773 100130 1 07:30:27
(...)
42688 458 202546777 999969 1 06:51:16
42689 509 202546884 999969 1 06:53:09
42690 567 202546977 999969 1 06:54:21
42691 656 202547104 999969 1 06:57:27
I have grouped this using the following method:
ndf = df.groupby(['ID','Location','Time'])
If I add .size() to the end of the above and print(ndf) I get the following output:
(...)
ID Location Time
995812 696 07:10:36 1
730 07:11:41 1
761 07:12:30 1
771 07:20:49 1
995820 381 06:55:07 1
761 07:12:44 1
(...)
This is as desired.
My challenge is that I need to work out the time delta between each time per Item and add this as a column in the dataframe grouping. It should give me the following:
ID Location Time Delta
(...)
995812 696 07:10:36 0
730 07:11:41 00:01:05
761 07:12:30 00:00:49
771 07:20:49 00:08:19
995820 381 06:55:07 0
761 07:12:44 00:17:37
(...)
I am pulling my hair out trying to work out a method of doing this, so I'm turning to the greats.
Please help. Thanks in advance.
Convert the Time column to timedeltas with to_timedelta, sort by all 3 columns with DataFrame.sort_values, get the difference per group with DataFrameGroupBy.diff, and replace missing values with a 0 timedelta using Series.fillna:
# if the values are already strings, astype can be omitted
df['Time'] = pd.to_timedelta(df['Time'].astype(str))
df = df.sort_values(['ID','Location','Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
It is also possible to convert the timedeltas to seconds by adding Series.dt.total_seconds:
df['Delta_sec'] = df.groupby('ID')['Time'].diff().dt.total_seconds().fillna(0)
If you just want to iterate over the groupby object, based on your original question title, you can do:
for (x, y) in df.groupby(['ID','Location','Time']):
    print("{0}, {1}".format(x, y))
    # your logic
However, this works for 10,000 or 100,000 rows, but not so well for 10^6 rows or more.
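As a sketch, once Delta has been computed as above, the grouped view from the question can be reproduced with a MultiIndex (column names taken from the question):
view = (
    df.sort_values(['ID', 'Location', 'Time'])
      .set_index(['ID', 'Location', 'Time'])['Delta']  # ID/Location/Time index, Delta values
)
print(view)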

How can I Extract only numbers from this columns?

Suppose you have a column in Excel with values like this... there are only 5500 numbers present, but it shows length 5602, which means that 102 strings are present:
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
and I want to get only numeric values like this in python using pandas
37001
37002
37003
37004
37005
How can I do this? I have attached my code in python using pandas:
def selection(sle):
    if sle in re.match('[3-4][0-9]{4}', sle):
        return 1
    else:
        return 0
select['status'] = select['Selection No.'].apply(selection)
and now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal to select only numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or, as another approach, importing numbers + lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there is a problem with how you are extracting the column. You are using ['Selection No.'], but the name actually contains a space at the end, i.e. ['Selection No. ']; that is the reason you get a KeyError while executing it - try and see!
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): - it tries to find the column value sle IN a match object, which "always has a boolean value of True" (and re.match returns None when there's no match).
I would suggest proceeding with the pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required, use the pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
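Another possible approach (a sketch, not from the answers above): coerce the column to numeric and drop whatever fails, assuming the column is named 'Selection No.':
# non-numeric entries become NaN and are dropped
numbers_only = (
    pd.to_numeric(df['Selection No.'], errors='coerce')
      .dropna()
      .astype(int)
)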

python statistic top 10

Using python 2.6. I have a large text file.
Below are the first 3 entries, but there are over 50 users I need to check.
html_log:jeff 1153.3 1.84 625:54 1 2 71 3 2 10 7:58 499 3 5 616:36 241 36 html_log:fred 28.7 1.04 27:34 -10 18 13 0:48 37 18 8 -3.63 html_log:bob 1217.1 1.75 696:48 1 5 38 6 109 61 14:42 633 223 25 435:36 182 34 ... continues
I need to be able to find the username, in this case the text after the "html_log:" tag.
I also need the rating (the first value next to the username).
The output should check the entire txt file and list the top 10 highest-rated players.
Please note that there are not always 16 sets of values; some contain far fewer.
producing:
bob 1217.1
jeff 1153
fred 28.7
In this case I would actually use a regular expression.
Just consider html_log: as a record start marker; the next part up until a whitespace is the name, and the part after that is the score, which you can convert to float for comparison:
s = "html_log:jeff 1153.3 1.84 625:54 1 2 71 3 2 10 7:58 499 3 5 616:36 241 36 html_log:fred 28.7 1.04 27:34 -10 18 13 0:48 37 18 8 -3.63 html_log:bob 1217.1 1.75 696:48 1 538 6 109 61 14:42 633 223 25 435:36 182 34"
pattern = re.compile("html_log:(?P<name>[^ ]*) (?P<score>[^ ]*)")
print sorted(pattern.findall(s), key=lambda x: float(x[1]), reverse=True)
# [('bob', '1217.1'), ('jeff', '1153.3'), ('fred', '28.7')]
If you are wondering how to read this file, the straightforward algorithm would be: first, read the whole file into a string; then use string.split(' ') to split everything on spaces; then loop over the pieces with a for loop and check whether an element contains html_log: - if yes, that element holds the username and the next element is the rating - and store all of this in a dictionary for further sorting or other operations.
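A minimal sketch of that idea, assuming a made-up file name (html_log.txt) and that every html_log: token is immediately followed by the rating:
def top_players(path, n=10):
    with open(path) as f:
        tokens = f.read().split()          # whole file split on whitespace
    ratings = {}
    for i, tok in enumerate(tokens):
        if tok.startswith("html_log:"):    # record marker: username follows the prefix
            ratings[tok[len("html_log:"):]] = float(tokens[i + 1])  # rating is the next token
    # sort by rating, highest first, and keep the top n
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)[:n]

for name, rating in top_players("html_log.txt"):
    print("%s %s" % (name, rating))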
