I am a beginner working with a clinical data set using Pandas in Jupyter Notebook.
A column of my data contains census tract codes and I am trying to merge my data with a large transportation data file that also has a column with census tract codes.
I initially wanted only 2 of the columns from that transportation file, so after downloading it I removed every column except those 2 and the census tract column.
This is the code I used:
import pandas as pd

df_my_data = pd.read_excel("my_data.xlsx")
df_transportation_data = pd.read_excel("transportation_data.xlsx")
df_merged_file = pd.merge(df_my_data, df_transportation_data)
df_merged_file.to_excel('my_merged_file.xlsx', index=False)
This worked, but then I wanted to add the remaining columns from the transportation file, so I went back to my initial file (prior to adding the 2 transportation columns) and tried to merge the entire transportation file. This resulted in a new DataFrame with all of the desired columns but only 4 rows.
I thought maybe the transportation file was too big, so I tried merging individual columns (other than the 2 I was initially able to merge), and this again resulted in all of the correct columns but only 4 rows.
Any help would be much appreciated.
Edits:
Sorry for not being more clear.
Here is the code for the 2 initial columns I merged:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_two_columns = pd.read_excel('two_columns_from_transportation_file.xlsx')
df_two_columns_merged = pd.merge(df_my_data, df_two_columns, on=['census_tract'])
df_two_columns_merged.to_excel('two_columns_merged.xlsx', index=False)
The outputs were:
df_my_data.head()
census_tract id e t
0 6037408401 1 1 1092
1 6037700200 2 1 1517
2 6065042740 3 1 2796
3 6037231210 4 1 1
4 6059076201 5 1 41
df_two_columns.head()
census_tract households_with_no_vehicle vehicles_per_household
0 6001400100 2.16 2.08
1 6001400200 6.90 1.50
2 6001400300 17.33 1.38
3 6001400400 8.97 1.41
4 6001400500 11.59 1.39
df_two_columns_merged.head()
census_tract id e t households_with_no_vehicle vehicles_per_household
0 6037408401 1 1 1092 4.52 2.43
1 6037700200 2 1 1517 9.88 1.26
2 6065042740 3 1 2796 2.71 1.49
3 6037231210 4 1 1 25.75 1.35
4 6059076201 5 1 41 1.63 2.22
df_my_data has 657 rows and df_two_columns_merged came out with 657 rows.
The code for when I tried to merge the entire transportation file:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')
df_merged_file = pd.merge(df_my_data, df_transportation_data, on=['census_tract'])
df_merged_file.to_excel('my_merged_file.xlsx', index=False)
The output:
df_transportation_data.head()
census_tract Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6001400100 0.00 12.60 65.95 2.16 20.69 0.76 2.08
1 6001400200 5.68 3.66 45.79 6.90 39.01 5.22 1.50
2 6001400300 7.55 6.61 46.77 17.33 31.19 6.39 1.38
3 6001400400 8.85 11.29 43.91 8.97 27.67 4.33 1.41
4 6001400500 8.45 7.45 46.94 11.59 29.56 4.49 1.39
df_merged_file.head()
census_tract id e t Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6041119100 18 0 2755 1.71 3.02 82.12 4.78 8.96 3.32 2.10
1 6061023100 74 1 1201 0.00 9.85 86.01 0.50 2.43 1.16 2.22
2 6041110100 80 1 9 0.30 4.40 72.89 6.47 13.15 7.89 1.82
3 6029004902 123 0 1873 0.00 18.38 78.69 4.12 0.00 0.00 2.40
The df_merged_file only has 4 total rows.
So my question is: why am I able to merge those initial 2 columns from the transportation file and keep all of the rows from my file, but when I try to merge the entire transportation file I only get 4 rows of output?
I recommend specifying the merge type and the merge column(s) explicitly.
When you use pd.merge(), the default merge type is an inner merge, and by default pandas joins on every column that has the same name in both DataFrames. Specify them explicitly, e.g.:
df_merged_file = pd.merge(df_my_data, df_transportation_data, how='left', on='census_tract')
It is possible that one of the columns you previously removed from the "transportation_data.xlsx" file has the same name as a column in your "my_data.xlsx"; an inner merge then also requires a match on that column, and unmatched rows are dropped.
A 'left' merge keeps every row of "my_data.xlsx" and attaches the columns from "transportation_data.xlsx" where there is a match (NaN where there is not). This means your merged DataFrame will have the same number of rows as your "my_data.xlsx" has currently.
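To see the difference, here is a minimal sketch with made-up tract codes (not your actual data):
import pandas as pd

left = pd.DataFrame({'census_tract': [1, 2, 3], 'id': [10, 20, 30]})
right = pd.DataFrame({'census_tract': [1, 2, 4], 'vehicles_per_household': [2.1, 1.5, 1.8]})

# inner merge keeps only tracts present in both frames -> 2 rows
print(pd.merge(left, right, on='census_tract'))

# left merge keeps every row of `left`; unmatched tracts get NaN -> 3 rows
print(pd.merge(left, right, on='census_tract', how='left'))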
Well, I think there was something wrong with the initial download of the transportation file. I downloaded it again and this time I was able to get a complete merge. Sorry for being an idiot. Thank you all for your help.
I have data like this:
timestamp high windSpeed windDir windU windV
04/05/2019 10:02 100 4.39 179.1 -0.14 8.53
150 2.44 164.5 -1.26 4.57
200 4.29 180.9 0.12 8.32
04/05/2019 10:03 100 4.39 179.1 -0.15 8.53
150 2.44 164.5 -1.26 4.57
200 4.29 180.9 0.12 8.32
04/05/2019 10:04 100 4.52 179.1 -0.16 8.79
150 2.15 162.8 -1.24 4
200 3.34 181.9 0.21 6.49
04/05/2019 10:05 100 4.52 179.1 -0.17 8.79
150 2.15 162.8 -1.24 4
200 3.34 181.9 0.21 6.49
and I want to subtract the lower level's value from the higher level's value at each time. This is what I have so far, but it only gives me one value. Can anyone help me, please? Thank you.
for timestamp, group in grouped:
    HeightIndices = group["high"].keys()
    for heightIndex in range(HeightIndices[0], HeightIndices[0] + len(HeightIndices) - 1):
        windMag = sqrt(group["windU"] ** 2 + group["windV"] ** 2)
        diffMag = windMag[heightIndex+1] - windMag[heightIndex]
I'm not sure if I'm understanding exactly what you're asking, but from looking at your code, it seems you are trying to get the difference between the i-th and (i+1)-th values in the column "high" and call that variable diffMag. If that's the case, you can probably use one of the two methods below.
Solution 1:
diff_mag = []
for i in range(len(wind['height']) - 1):
    diff_mag.append(wind['height'][i+1] - wind['height'][i])
Solution 2:
Use numpy's diff:
import numpy as np

np.diff(wind['height'])
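For example, np.diff returns the differences between consecutive elements:
heights = np.array([100, 150, 200])
np.diff(heights)  # array([50, 50])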
I made the assumption you're using pandas here based on what your code block looks like. Hope that helps.
EDIT
Okay... I think I understand what you are saying now.
I think this should work:
import numpy as np

windMag = []
for timestamp, group in grouped:
    # wind-vector magnitude at each height level for this timestamp
    windMag.append(np.sqrt(group["windU"] ** 2 + group["windV"] ** 2))
# differences between consecutive height levels, computed per timestamp
diffMag = [np.diff(mag) for mag in windMag]
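For reference, here is a vectorized sketch that avoids the explicit loops entirely (it assumes the blank timestamp cells are forward-filled so each row carries its timestamp, and that rows are ordered by height within each timestamp):
import numpy as np

df['timestamp'] = df['timestamp'].ffill()  # fill the blank repeated-timestamp cells
df['windMag'] = np.sqrt(df['windU'] ** 2 + df['windV'] ** 2)
# diff between consecutive height levels within each timestamp
df['diffMag'] = df.groupby('timestamp')['windMag'].diff()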
I have a pandas DataFrame with the following columns:
Stock ROC5 ROC20 ROC63 ROCmean
0 IBGL.SW -0.59 3.55 6.57 3.18
0 EHYA.SW 0.98 4.00 6.98 3.99
0 HIGH.SW 0.94 4.22 7.18 4.11
0 IHYG.SW 0.56 2.46 6.16 3.06
0 HYGU.SW 1.12 4.56 7.82 4.50
0 IBCI.SW 0.64 3.57 6.04 3.42
0 IAEX.SW 8.34 18.49 14.95 13.93
0 AGED.SW 9.45 24.74 28.13 20.77
0 ISAG.SW 7.97 21.61 34.34 21.31
0 IAPD.SW 0.51 6.62 19.54 8.89
0 IASP.SW 1.08 2.54 12.18 5.27
0 RBOT.SW 10.35 30.53 39.15 26.68
0 RBOD.SW 11.33 30.50 39.69 27.17
0 BRIC.SW 7.24 11.08 75.60 31.31
0 CNYB.SW 1.14 4.78 8.36 4.76
0 FXC.SW 5.68 13.84 19.29 12.94
0 DJSXE.SW 3.11 9.24 6.44 6.26
0 CSSX5E.SW -0.53 5.29 11.85 5.54
How can I add a new column "Symbol" to the DataFrame containing the stock ticker without the ".SW" suffix?
For example, the first row's result should be IBGL (from IBGL.SW).
The last row's result should be CSSX5E (split from CSSX5E.SW).
If I run the following command:
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
Then I receive a warning message:
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
How can I solve this problem?
Thanks a lot for your support.
METHOD 1:
You can do a vectorized operation with str.get(0):
df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)
METHOD 2:
You can do another vectorized operation by using expand=True in str.split() and then taking the first column:
df['SYMBOL'] = df['Stock'].str.split('.', expand=True)[0]
METHOD 3:
Or you can write a custom lambda function with apply (useful for more complex processing). Note that this is slower, but it is handy if you need your own UDF:
df['SYMBOL'] = df['Stock'].apply(lambda x: x.split('.')[0])
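As a quick sanity check on a toy frame (made-up values, not your data):
import pandas as pd

df = pd.DataFrame({'Stock': ['IBGL.SW', 'CSSX5E.SW']})
df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)
print(df['SYMBOL'].tolist())  # ['IBGL', 'CSSX5E']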
This is not an error but a warning; as you may have noticed, your script still finishes executing.
Edit: Given your comments, it seems the issue originates earlier in your code, so I suggest you use the following:
new_df = new_df.copy(deep=False)
And then proceed to solve it with:
new_df.loc[:, 'Symbol'] = new_df['Stock'].str.split('.').str[0]
new_df = new_df.copy()
new_df['Symbol'] = new_df.Stock.str.replace('.SW', '', regex=False)  # regex=False so '.' is treated literally
Total time: 1.01876 s
Function: prepare at line 91

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    91                                           @profile
    92                                           def prepare():
   ...
    98         1        536.0    536.0      0.1      tss = df.groupby('user_id').timestamp
    99         1     949643.0 949643.0     93.2      delta = tss.diff()
   ...
(remaining lines omitted; each accounts for less than 2% of the total time)
I have a DataFrame which I group by some key, then select a column from each group and perform diff on that column (per group). As shown in the profiling results, the diff operation is super slow compared to the rest and is the bottleneck. Is this expected? Are there faster alternatives that achieve the same result?
Edit: some more explanation
In my use case the timestamps represent the times of a user's actions, and I want to calculate the deltas between consecutive actions (they are already sorted), but each user's actions are completely independent of the other users'.
Edit: Sample code
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ts': [1,2,3,4,60,61,62,63,64,150,155,156,
           1,2,3,4,60,61,62,63,64,150,155,163,
           1,2,3,4,60,61,62,63,64,150,155,183],
    'id': [1,2,3,4,60,61,62,63,64,150,155,156,
           71,72,73,74,80,81,82,83,64,160,165,166,
           21,22,23,24,90,91,92,93,94,180,185,186],
    'other': ['x','x','x','','x','x','','x','x','','x','',
              'y','y','y','','y','y','','y','y','','y','',
              'z','z','z','','z','z','','z','z','','z',''],
    'user': ['x','x','x','x','x','x','x','x','z','x','x','y',
             'y','y','y','y','y','y','y','y','x','y','y','x',
             'z','z','z','z','z','z','z','z','y','z','z','z']
})
df.set_index('id', inplace=True)
deltas = df.groupby('user').ts.transform(pd.Series.diff)
If you do not wish to sort your data or drop down to numpy, then a significant performance improvement may be possible by changing your user series to Categorical. Categorical data is stored effectively as integer pointers.
In the example below, I see an improvement from 86 ms to 59 ms in total (35.7 ms for the groupby plus a one-off 23.4 ms for the conversion). This may improve further for larger datasets and where more users are repeated.
df = pd.concat([df]*10000)
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 86.1 ms per loop
%timeit df['user'].astype('category') # 23.4 ms per loop
df['user'] = df['user'].astype('category')
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 35.7 ms per loop
If you are performing multiple operations, then the one-off cost of converting to categorical can be discounted.
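For comparison, here is a sketch of the "drop down to numpy" route mentioned above (my own addition, assuming a stable sort by user is acceptable; it reproduces the per-group diff without groupby):
import numpy as np

# stable sort by user so each user's rows become contiguous,
# preserving the original within-user order
order = np.argsort(df['user'].to_numpy(), kind='stable')
ts = df['ts'].to_numpy()[order]
users = df['user'].to_numpy()[order]

deltas = np.empty(len(ts))
deltas[0] = np.nan
deltas[1:] = np.diff(ts)
deltas[1:][users[1:] != users[:-1]] = np.nan  # NaN at each user boundary

# scatter back to the original row order
out = np.empty_like(deltas)
out[order] = deltas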