I have the below sequence of data as a pandas dataframe
id,start,end,duration
303,2012-06-25 17:59:43,2012-06-25 18:01:29,105
404,2012-06-25 18:01:29,2012-06-25 18:01:55,25
303,2012-06-25 18:01:56,2012-06-25 18:02:06,10
303,2012-06-25 18:02:23,2012-06-25 18:02:44,21
404,2012-06-25 18:02:45,2012-06-25 18:02:51,6
303,2012-06-25 18:02:54,2012-06-25 18:03:17,23
404,2012-06-25 18:03:24,2012-06-25 18:03:41,17
303,2012-06-25 18:03:43,2012-06-25 18:05:51,128
101,2012-06-25 18:05:58,2012-06-25 18:24:22,1104
404,2012-06-25 18:24:24,2012-06-25 18:25:25,61
101,2012-06-25 18:25:25,2012-06-25 18:25:46,21
404,2012-06-25 18:25:49,2012-06-25 18:26:00,11
101,2012-06-25 18:26:01,2012-06-25 18:26:04,3
404,2012-06-25 18:26:05,2012-06-25 18:28:49,164
202,2012-06-25 18:28:52,2012-06-25 18:28:57,5
404,2012-06-25 18:29:00,2012-06-25 18:29:24,24
It should always be the case that id 404 appears again after every other (non-404) id.
For example, if the above is motion sensors in a house, e.g. 404:hallway, 202:bedroom, 303:kitchen, 201:studyroom, where the hallway is in the middle, then moving from bedroom to kitchen to studyroom and back to bedroom should trigger 202, 404, 303, 404, 201, 404, 202 in that order, because one always passes through the hallway (404) to reach any room. My output has cases that violate this sequence and I want to drop such rows.
For example, from the snippet dataframe above, the rows below violate this:
303,2012-06-25 18:01:56,2012-06-25 18:02:06,10
303,2012-06-25 18:02:23,2012-06-25 18:02:44,21
303,2012-06-25 18:03:43,2012-06-25 18:05:51,128
101,2012-06-25 18:05:58,2012-06-25 18:24:22,1104
and therefore the rows below should be dropped (but of course I have a much larger dataset).
303,2012-06-25 18:02:23,2012-06-25 18:02:44,21
101,2012-06-25 18:05:58,2012-06-25 18:24:22,1104
I have tried shift and drop but the result still has some inconsistencies.
df['id_ns'] = df['id'].shift(-1)
df['id_ps'] = df['id'].shift(1)
if (df['id'] != 404):
df.drop(df[(df.id_ns != 404) & (df.id_ps != 404)].index, axis=0, inplace=True)
How best can I approach this?
Use Series.ne + Series.shift with the optional fill_value parameter to create a boolean mask, then use this mask to filter/drop the rows:
mask = df['id'].ne(404) & df['id'].shift(fill_value=404).ne(404)
df = df[~mask]
Result:
print(df)
id start end duration
0 303 2012-06-25 17:59:43 2012-06-25 18:01:29 105
1 404 2012-06-25 18:01:29 2012-06-25 18:01:55 25
2 303 2012-06-25 18:01:56 2012-06-25 18:02:06 10
4 404 2012-06-25 18:02:45 2012-06-25 18:02:51 6
5 303 2012-06-25 18:02:54 2012-06-25 18:03:17 23
6 404 2012-06-25 18:03:24 2012-06-25 18:03:41 17
7 303 2012-06-25 18:03:43 2012-06-25 18:05:51 128
9 404 2012-06-25 18:24:24 2012-06-25 18:25:25 61
10 101 2012-06-25 18:25:25 2012-06-25 18:25:46 21
11 404 2012-06-25 18:25:49 2012-06-25 18:26:00 11
12 101 2012-06-25 18:26:01 2012-06-25 18:26:04 3
13 404 2012-06-25 18:26:05 2012-06-25 18:28:49 164
14 202 2012-06-25 18:28:52 2012-06-25 18:28:57 5
15 404 2012-06-25 18:29:00 2012-06-25 18:29:24 24
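As a quick sanity check on a larger dataset (just a sketch, not part of the original answer), you can assert that no remaining non-404 row directly follows another non-404 row:
# after filtering, every non-404 row should be preceded by a 404 row
assert not (df['id'].ne(404) & df['id'].shift(fill_value=404).ne(404)).any()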
I've got a time-series dataframe that looks something like:
datetime gesture left-5-x ...30 columns omitted
2022-09-27 19:54:54.396680 gesture0255 533
2022-09-27 19:54:54.403298 gesture0255 534
2022-09-27 19:54:54.408938 gesture0255 535
2022-09-27 19:54:54.413995 gesture0255 523
2022-09-27 19:54:54.418666 gesture0255 522
... 95 000 rows omitted
And I want to create a new column df['cross_correlation'] which is a function of multiple sequential rows. So the cross_correlation of row i depends on the data from rows i-10 to i+10.
I could do this with df.iterrows(), but that seems like the non-idiomatic version. Is there a function like
df.window(-10, +10).apply(lambda rows: calculate_cross_correlation(rows))
or similar?
EDIT:
Thanks @chris, who pointed me towards df.rolling(). However, I now have this example, which better reflects the problem I'm having:
Here's a simplified version of the function I want to apply over the moving window. Note that the actual version requires that the input be the full 2D window of shape (window_size, num_columns) but the toy function below doesn't actually need the input to be 2D. I've added an assertion to make sure this is true:
def sum_over_2d(x):
    assert len(x.shape) == 2, f'shape of input is {x.shape} and not of length 2'
    return x.sum()
And now if I use .rolling with .apply:
df.rolling(window=10, center=True).apply(
    sum_over_2d
)
I get an assertion error:
AssertionError: shape of input is (10,) and not of length 2
and if I print the input x before the assertion, I get:
0 533.0
1 534.0
2 535.0
3 523.0
4 522.0
5 526.0
6 510.0
7 509.0
8 502.0
9 496.0
dtype: float64
which is one column from my many-columned dataset. What I want is for the input x to be a dataframe or 2D numpy array.
IIUC, one way is using pandas.Series.rolling.apply.
Example with sum:
df["new"] = df["left-5-x"].rolling(3, center=True, min_periods=1).sum()
Output:
datetime gesture left-5-x new explain
0 2022-09-27 19:54:54.396680 gesture0255 533 1067.0 533+534
1 2022-09-27 19:54:54.403298 gesture0255 534 1602.0 533+534+535
2 2022-09-27 19:54:54.408938 gesture0255 535 1592.0 534+535+523
3 2022-09-27 19:54:54.413995 gesture0255 523 1580.0 535+523+522
4 2022-09-27 19:54:54.418666 gesture0255 522 1045.0 523+522
You can see the left-5-x values are summed with their -1 to +1 neighbors.
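If you need a custom function rather than a built-in aggregation on a single column, the same rolling call accepts apply; a small sketch (raw=True just hands each window to the function as a 1D numpy array):
df["new"] = df["left-5-x"].rolling(3, center=True, min_periods=1).apply(lambda x: x.sum(), raw=True)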
Edit:
If you want to operate on the rolled dataframe, one way would be to iterate over the rolling object:
new_df = pd.concat([sum_over_2d(d) for d in df.rolling(window=10)], axis=1).T
Output:
0 1 2 3
0 0 1 2 3
1 4 6 8 10
2 12 15 18 21
3 24 28 32 36
4 40 45 50 55
5 60 66 72 78
6 84 91 98 105
7 112 120 128 136
8 144 153 162 171
9 180 190 200 210
Or, as per @Sandwichnick's comment, you can use method="table", but only if passing engine="numba". In other words, your sum_over_2d must be numba-compilable (which is beyond the scope of this question and my knowledge):
df.rolling(window=10, center=True, method="table").sum(engine="numba")
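For reference, here is a minimal sketch of the method="table" route combined with apply, loosely following the pattern in the pandas Numba documentation; it assumes pandas >= 1.3 with numba installed, and the toy data and column names are made up:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("abcd"))

def sum_over_2d(x):
    # with method="table" and raw=True, x is the full (window_size, num_columns) array;
    # the function has to return one value per column
    return np.full((1, x.shape[1]), x.sum())

out = df.rolling(window=3, min_periods=1, method="table").apply(
    sum_over_2d, raw=True, engine="numba"
)
print(out)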
I am importing a CSV into a pandas dataframe. One column contains an 18-digit LDAP timestamp. I am trying to convert this timestamp; however, it appears that it is being rounded, causing an incorrect calculation.
data.csv:
Event ID,Clock-Time,ProcessID,Size
10,133081599160584000,2824,44
10,133081599160584000,2824,84
10,133081599160667000,2824,44
10,133081599160667000,2824,92
10,133081599160667000,2824,116
10,133081599160667000,2824,132
script.py:
#!/usr/bin/python
from datetime import datetime, timedelta
import pandas
pandas.set_option('display.max_colwidth', None)
pandas.set_option('display.float_format','{:.0f}'.format)
pandas.set_option('display.precision', 20)
in_csv = "data.csv"
df = pandas.read_csv(in_csv,sep=',',header=0, float_precision='high')
print(df.dtypes)
print(df)
# convert win32 timestamp to unix timestamp
df['Clock-Time'] = df['Clock-Time'].apply(lambda timestamp: datetime(1601, 1, 1) + timedelta(seconds=(timestamp/10000000)))
print(df)
output:
Event ID int64
Clock-Time float64
ProcessID int64
Size int64
dtype: object
Event ID Clock-Time ProcessID Size
0 10 133082000000000000 2824 44
1 10 133082000000000000 2824 84
2 10 133082000000000000 2824 44
3 10 133082000000000000 2824 92
4 10 133082000000000000 2824 116
5 10 133082000000000000 2824 132
Event ID Clock-Time ProcessID Size
0 10 2022-09-21 02:13:20 2824 44
1 10 2022-09-21 02:13:20 2824 84
2 10 2022-09-21 02:13:20 2824 44
3 10 2022-09-21 02:13:20 2824 92
4 10 2022-09-21 02:13:20 2824 116
5 10 2022-09-21 02:13:20 2824 132
How can I get pandas to respect the full value so I can get the accurate timestamp?
I can't reproduce the issue by creating your .csv from Windows Notepad.
Event ID Clock-Time ProcessID Size
0 10 2022-09-20 15:05:16.058399 2824 44
1 10 2022-09-20 15:05:16.058399 2824 84
2 10 2022-09-20 15:05:16.066700 2824 44
3 10 2022-09-20 15:05:16.066700 2824 92
4 10 2022-09-20 15:05:16.066700 2824 116
5 10 2022-09-20 15:05:16.066700 2824 132
I bet you created your .csv file from an Excel spreadsheet. By default, when you enter a number over 12 digits (e.g., an 18-digit LDAP timestamp) in a spreadsheet cell, the number is auto-corrected to scientific notation. For example, "133081599160584000" is converted to "1.33082E+17" (hence the single value "133082000000000000" in your initial dataframe after calling pandas.read_csv). So when you export the spreadsheet to a text file (e.g., .csv), it is the scientific notation (what Excel sees) that gets exported, and not the actual (x>12)-digit value.
You can fix that upstream by changing the type of the concerned column in Excel before creating the .csv.
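If the .csv itself does still contain the full 18-digit values, you can also guard against the lossy float64 parse on the pandas side. A small sketch (column names as in the question; the dtype mapping is the only addition):
from datetime import datetime, timedelta
import pandas

# read the ticks as int64 so they are never parsed into a rounded float64
df = pandas.read_csv("data.csv", dtype={"Clock-Time": "int64"})
# 18-digit LDAP/FILETIME values are 100-nanosecond ticks since 1601-01-01
df['Clock-Time'] = df['Clock-Time'].apply(
    lambda ticks: datetime(1601, 1, 1) + timedelta(microseconds=ticks // 10)
)
print(df)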
I've had a good look and I can't seem to find the answer to this question. I want to replace all NaN values in the Department Code column of my DataFrame with values from a dictionary, using the Job Number column as the key matching that of the dictionary. The data can be seen below (please note there are many extra columns; these are just the two):
df =
Job Number Department Code
0 3525 403
1 4555 NaN
2 5575 407
3 6515 407
4 7525 NaN
5 8535 102
6 3545 403
7 7455 102
8 3365 NaN
9 8275 403
10 3185 408
dict = {'4555': '012', '7525': '077', '3365': '034'}
What I am hoping the output to look like is:
Job Number Department Code
0 3525 403
1 4555 012
2 5575 407
3 6515 407
4 7525 077
5 8535 102
6 3545 403
7 7455 102
8 3365 034
9 8275 403
10 3185 408
The two columns are object datatypes, and I have tried the replace function, which I have used before, but that only replaces the value if the key is in the same column.
df['Department Code'].replace(dict, inplace=True)
This does not replace the NaN values.
I'm sure the answer is very simple and I apologize in advance, but I'm just stuck.
(Excuse my poor code display; it's handwritten, as I'm not sure how to export code from Python to here.)
Better to avoid the variable name dict, because it shadows a Python builtin. Then use Series.fillna to replace the missing values with the result of Series.map; where there is no match, map returns NaN, so no replacement happens:
d = {'4555': '012', '7525': '077', '3365': '034'}
df['Department Code'] = df['Department Code'].fillna(df['Job Number'].astype(str).map(d))
print (df)
Job Number Department Code
0 3525 403
1 4555 012
2 5575 407
3 6515 407
4 7525 077
5 8535 102
6 3545 403
7 7455 102
8 3365 034
9 8275 403
10 3185 408
Another way is using set_index and fillna:
df['Department Code'] = (df.set_index('Job Number')['Department Code']
                           .fillna(d)
                           .values)
print(df)
Job Number Department Code
0 3525 403
1 4555 012
2 5575 407
3 6515 407
4 7525 077
5 8535 102
6 3545 403
7 7455 102
8 3365 034
9 8275 403
10 3185 408
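For reference, a stripped-down, runnable sketch of the fillna + map idea with a few made-up rows (column names as in the question; both columns are kept as strings here, so astype(str) is not needed):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Job Number': ['3525', '4555', '5575', '7525'],
    'Department Code': ['403', np.nan, '407', np.nan],
})
d = {'4555': '012', '7525': '077'}

# map builds the replacement values from Job Number, fillna only touches the NaNs
df['Department Code'] = df['Department Code'].fillna(df['Job Number'].map(d))
print(df)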
Let's say I have data like this and I want to group it by feature and type.
feature type size
Alabama 1 100
Alabama 2 50
Alabama 3 40
Wyoming 1 180
Wyoming 2 150
Wyoming 3 56
When I apply df=df.groupby(['feature','type']).sum()[['size']], I get this as expected.
size
(Alabama,1) 100
(Alabama,2) 50
(Alabama,3) 40
(Wyoming,1) 180
(Wyoming,2) 150
(Wyoming,3) 56
However, I want to sum sizes over the same type only, not both type and feature. While doing this I want to keep the indexes as (feature, type) tuples. I mean I want to get something like this:
size
(Alabama,1) 280
(Alabama,2) 200
(Alabama,3) 96
(Wyoming,1) 280
(Wyoming,2) 200
(Wyoming,3) 96
I am stuck trying to find a way to do this. I need some help, thanks.
Use set_index to create the MultiIndex and then use transform with sum, which returns a Series of the same length as the original from the aggregate function:
df = df.set_index(['feature','type'])
df['size'] = df.groupby(['type'])['size'].transform('sum')
print (df)
size
feature type
Alabama 1 280
2 200
3 96
Wyoming 1 280
2 200
3 96
EDIT: First aggregate by both columns and then use transform:
df = df.groupby(['feature','type']).sum()
df['size'] = df.groupby(['type'])['size'].transform('sum')
print (df)
size
feature type
Alabama 1 280
2 200
3 96
Wyoming 1 280
2 200
3 96
Here is one way:
df['size_type'] = df['type'].map(df.groupby('type')['size'].sum())
df.groupby(['feature', 'type'])['size_type'].sum()
# feature type
# Alabama 1 280
# 2 200
# 3 96
# Wyoming 1 280
# 2 200
# 3 96
# Name: size_type, dtype: int64
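For completeness, a self-contained sketch of the set_index + transform approach on the sample data from the question:
import pandas as pd

df = pd.DataFrame({
    'feature': ['Alabama', 'Alabama', 'Alabama', 'Wyoming', 'Wyoming', 'Wyoming'],
    'type': [1, 2, 3, 1, 2, 3],
    'size': [100, 50, 40, 180, 150, 56],
})

df = df.set_index(['feature', 'type'])
# grouping by the 'type' index level sums across features while keeping the (feature, type) index
df['size'] = df.groupby('type')['size'].transform('sum')
print(df)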
I have two dataframes:
My stock solutions (df1):
pH salt_conc
5.5 0 23596.0
200 19167.0
400 17052.5
6.0 0 37008.5
200 27652.0
400 30385.5
6.5 0 43752.5
200 41146.0
400 39965.0
and my measurements after I did something (df2):
pH salt_conc id
5.5 0 8 20953.0
11 24858.0
200 3 20022.5
400 13 17691.0
20 18774.0
6.0 0 14 38639.0
200 1 37223.5
2 36597.0
7 37039.0
10 37088.5
15 35968.5
16 36344.5
17 34894.0
18 36388.5
400 9 33386.0
6.5 0 4 41401.5
12 44933.5
200 5 43074.5
400 6 42210.5
19 41332.5
I would like to normalize each measurement in the second dataframe (df2) with its corresponding stock solution from which I took the sample.
Any suggestions ?
Figured it out with the help of this post:
SO: Binary operation broadcasting across multiindex
I had to reset the index of both grouped dataframes and set it again.
df_initial = df_initial.reset_index().set_index(['pH','salt_conc'])
df_second = df_second.reset_index().set_index(['pH','salt_conc'])
Now I can do any calculation I want to do.
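For completeness, once both frames share the ['pH', 'salt_conc'] index, the stock values can be looked up per measurement and divided out. A sketch of that last step; the column name 'value' is an assumption, since the actual column names are not shown in the question:
# look up the matching stock value for every measurement (one-to-many), keeping df_second's row order
stock = df_initial['value'].reindex(df_second.index)
df_second['normalized'] = df_second['value'] / stock.values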