Conditional average of values in a row, depending on data qualifiers - python
I hope you're all doing well.
So I've been working with Excel my whole life and I'm now switching to Python & Pandas. The learning curve is proving to be quite steep for me, so please bear with me.
Day after day it's getting better. I've already managed to aggregate values, input/output from CSV/Excel, drop "na" values and much more. However, I've stumbled upon a wall too high for me to climb right now...
I created an extract of the dataframe I'm working with. You can download it here, so you can understand what I'll be writing about: https://filetransfer.io/data-package/pWE9L29S#link
df_example
t_stamp,1_wind,2_wind,3_wind,4_wind,5_wind,6_wind,7_wind,1_wind_Q,2_wind_Q,3_wind_Q,4_wind_Q,5_wind_Q,6_wind_Q,7_wind_Q
2021-06-06 18:20:00,12.14397093693768,12.14570426940918,10.97993184016605,11.16468568605988,9.961717914791588,10.34653735907099,11.6856901451427,True,False,True,True,True,True,True
2021-05-10 19:00:00,8.045154709031468,8.572511270557484,8.499070711427668,7.949358210396142,8.252115912454919,7.116505042782365,8.815732567915179,True,True,True,True,True,True,True
2021-05-27 22:20:00,8.38946901817802,6.713454777683985,7.269814675171176,7.141862659613969,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-05 18:20:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-06-06 12:20:00,11.95525872119988,12.14570426940918,12.26086164116684,12.89527716859738,11.77172234144684,12.12409015586662,12.52180822809299,True,False,True,True,True,True,True
2021-06-04 03:30:00,14.72553364088618,12.72900662616056,10.59386275508178,10.96070182287055,12.38239256540934,12.07846616943932,10.58384464064597,True,True,True,True,False,True,True
2021-05-05 13:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-05-24 18:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.60747853230812,17.18577813727543,17.70745523935796,False,False,False,False,True,True,True
2021-05-07 19:00:00,13.94341927008482,10.95456999345216,13.36533234604886,0.0,3.782910539990379,10.86996953698871,13.45072022532649,True,True,True,False,False,True,True
2021-05-13 00:40:00,10.70940582779898,10.22222264510213,9.043496015164536,9.03805802580422,11.53775481234347,10.09538681656049,10.19345618536208,True,True,True,True,True,True,True
2021-05-27 19:40:00,10.8317678500958,7.929683248532885,8.264301219025942,8.184133252794958,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-09 12:00:00,10.55571650269678,7.635778078425459,10.43683108425784,7.847532146733346,8.100127641989639,7.770247510198059,8.040702032061867,True,True,True,True,True,True,True
2021-05-19 19:00:00,2.322496225799398,2.193219010982461,2.301622604435732,2.204278609893358,2.285408405883714,1.813280858368885,1.667207419773053,True,True,True,True,True,True,True
2021-05-30 12:30:00,5.776450801637788,8.488826231951345,10.98525552709715,7.03016556196849,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-24 14:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.93466266883504,17.04697174496121,17.0739475214739,False,False,False,False,True,False,True
What you are looking at:
"n" represents the number of measuring points.
First column: Timestamp of values
Columns index 1 to "n": Average wind speed at the different measuring points over the last 10 minutes
Columns index "n+1" to last (-1): Qualify whether the value of the respective point is valid (True) or invalid (False). So the qualifier "1_wind_Q" applies to the value "1_wind"
What I'm trying to achieve:
The goal is to create a new column called "Avg_WS" that, for every row, calculates the following:
The average of the wind values, ONLY where the corresponding qualifier is True
Example: if in a given row the column "4_wind_Q" is False, the value "4_wind" should be excluded from the average for that row.
Extra: if all qualifiers are False in a given row, "Avg_WS" should be NaN in that same row.
I've tried to use apply, but I can't figure out how to match the value-qualifier pairs.
Thank you so much in advance!
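(For illustration, a row-wise apply version of what is being asked for could look like the sketch below - the filename and the helper name row_avg are assumptions, and the vectorised answer that follows is preferable for large frames.)

import numpy as np
import pandas as pd

def row_avg(row):
    # Keep only the wind values whose matching "_Q" qualifier is True.
    vals = [row[f'{i}_wind'] for i in range(1, 8) if row[f'{i}_wind_Q']]
    return np.mean(vals) if vals else np.nan

df = pd.read_csv('df_example.csv', parse_dates=['t_stamp'])
df['Avg_WS'] = df.apply(row_avg, axis=1)

Each pair is matched purely by the shared number prefix, which is what makes the pairing explicit.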
I tried using mask for this.
quals = ['1_wind_Q','2_wind_Q','3_wind_Q','4_wind_Q','5_wind_Q','6_wind_Q','7_wind_Q']
fields = ['1_wind', '2_wind', '3_wind', '4_wind', '5_wind', '6_wind', '7_wind']
df[fields].mask( ~df[quals].values ).mean( axis=1 )
# output
0 11.047089
1 8.178635
2 7.378650
3 NaN
4 12.254836
5 11.945236
6 NaN
7 17.500237
8 12.516802
9 10.119969
10 8.802471
11 8.626705
12 2.112502
13 8.070175
14 17.504305
dtype: float64
# assign this to the dataframe
df.loc[ :, 'Avg_WS' ] = df[fields].mask( ~df[quals].values ).mean( axis=1 )
mask works by essentially applying a boolean mask to each of the "fields" columns - the caveat is that the boolean mask must be the same shape as the data you are applying it to (i.e. it must have the same n x m dimensions). That is why .values is used: it strips the "_Q" column labels so pandas applies the mask positionally instead of trying to align column names.
mean(axis=1) tells the dataframe to apply the mean function across each row (rather than down each column, which axis=0 would imply). Since mean skips NaN by default, a row whose qualifiers are all False - and is therefore fully masked - comes out as NaN, which covers the "Extra" requirement.
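An equivalent way to line the pairs up without dropping to NumPy - a sketch, assuming df was read with pd.read_csv so the *_Q columns are real booleans - is to rename the qualifier columns so that pandas can align them by label:

fields = ['1_wind', '2_wind', '3_wind', '4_wind', '5_wind', '6_wind', '7_wind']
quals = [f + '_Q' for f in fields]

# Strip the "_Q" suffix so the boolean frame carries the same column labels
# as the value frame; where() keeps values where the condition is True and
# puts NaN elsewhere, exactly like mask() with the condition inverted.
valid = df[quals].rename(columns=lambda c: c[:-2])
df['Avg_WS'] = df[fields].where(valid).mean(axis=1)

Label alignment means the result does not depend on the qualifier columns appearing in the same order as the value columns.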
Related
Pandas: combining information from rows with consecutive duplicate values in a column
The title might be a bit confusing here, but in essence: we have a dataframe, let's say like this:

   SKUCODE PROCCODE   RUNCODE  RSNCODE           BEGTIME           ENDTIME                 RSNNAME
0   218032        A  21183528     1010  2020-11-19 04:00  2020-11-19 04:15           Bank holidays
1   218032        A  21183959  300.318  2021-01-27 10:50  2021-01-27 10:53  Base paper roll change
2   218032        A  21183959  300.302  2021-01-27 11:50  2021-01-27 11:53    Product/grade change
3   218032        A  21183959  300.302  2021-01-27 12:02  2021-01-27 13:12    Product/grade change
4   200402        A  21184021  300.302  2021-01-27 13:16  2021-01-27 13:19    Product/grade change

Where each row is a break event happening on a production line. As can be seen, some singular break events (common RSNNAME) are spread out over multiple consecutive rows (for one data gathering reason or another), and we would like to compress all of these into just one row, for example compressing rows 2 through 4 of our example dataframe into a single row, resulting in something like this:

   SKUCODE PROCCODE   RUNCODE  RSNCODE           BEGTIME           ENDTIME                 RSNNAME
0   218032        A  21183528     1010  2020-11-19 04:00  2020-11-19 04:15           Bank holidays
1   218032        A  21183959  300.318  2021-01-27 10:50  2021-01-27 10:53  Base paper roll change
2   218032        A  21183959  300.302  2021-01-27 11:50  2021-01-27 13:19    Product/grade change

The resulting single row would have the BEGTIME (signifying the start of the break) of the first row that was combined, and the ENDTIME (signifying the end of the break) of the last row that was combined, this way making sure we capture the correct timestamps for the entire break event.

If we want to make the problem harder still, we might want to add a time threshold for row combining. Say, if there is a period of more than 15 minutes between the ENDTIME of the former and the BEGTIME of the latter of two rows seemingly belonging to the same break event, we would treat them as separate events instead.

This is accomplished quite easily through iterrows by comparing one row to the next, checking if they contain a duplicate value in the RSNNAME column, and copying the ENDTIME of the latter one onto the former one if that is the case. The latter row can then be dropped as useless. Here we might also introduce logic to see whether the seemingly singular break events might actually be two different ones of the same nature merely happening some time apart. However, using iterrows for this purpose gets quite slow. Is there a way to solve this problem through vectorized functions or other more efficient means?

I've played around with shifting the rows and comparing them to each other - shifting and comparing two adjacent rows is quite simple and allows us to easily grab the ENDTIME of the latter row if a duplicate is detected, but we run into issues in the case of n consecutive duplicate causes. Another idea would be to create a boolean mask to check if the row below the current one is a duplicate, resulting in a scenario where, in the case of multiple consecutive duplicate rows, we have multiple corresponding consecutive "True" labels, the last of which before a "False" label signifies the last row we would want to grab the ENDTIME from for the first consecutive "True" label of that particular series of consecutive "Trues". I'm yet to find a way to implement this in practice using vectorization, however.
For the basic problem, drop_duplicates can be used. Essentially, you drop the duplicates on the RSNNAME column keeping the first occurrence, then replace the ENDTIME column with the end times obtained by dropping duplicates again, this time keeping the last occurrence:

(
    df.drop_duplicates("RSNNAME", keep="first").assign(
        ENDTIME=df.drop_duplicates("RSNNAME", keep="last").ENDTIME.values
    )
)

(By using .values we ignore the index in the assignment.)

To give you an idea for the more complex scenario: you are on the right track with your last idea. You want to .shift the column in question by one row and compare that to the original column. That gives you flags where new consecutive events start:

>>> df.RSNNAME != df.shift().RSNNAME
0     True
1     True
2     True
3    False
4    False
Name: RSNNAME, dtype: bool

To turn that into something .groupby-able, you compute the cumulative sum:

>>> (df.RSNNAME != df.shift().RSNNAME).cumsum()
0    1
1    2
2    3
3    3
4    3
Name: RSNNAME, dtype: int64

For your case, one option could be to extend the df.RSNNAME != df.shift().RSNNAME condition with some time-difference check to get the proper flags, but I suggest you play a bit with this shift/cumsum approach.
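Putting the shift/cumsum hint together with a grouped aggregation, something along these lines should compress the rows and also honour the 15-minute threshold (a sketch against the example frame above; the threshold handling and the aggregation spec are my assumptions, not the answer author's code):

import pandas as pd

# A new event starts whenever the reason changes or the gap to the
# previous row's ENDTIME exceeds 15 minutes.
gap = pd.to_datetime(df.BEGTIME) - pd.to_datetime(df.ENDTIME).shift()
new_event = (df.RSNNAME != df.RSNNAME.shift()) | (gap > pd.Timedelta(minutes=15))
event_id = new_event.cumsum()

compressed = df.groupby(event_id).agg(
    SKUCODE=("SKUCODE", "first"),
    PROCCODE=("PROCCODE", "first"),
    RUNCODE=("RUNCODE", "first"),
    RSNCODE=("RSNCODE", "first"),
    BEGTIME=("BEGTIME", "first"),
    ENDTIME=("ENDTIME", "last"),
    RSNNAME=("RSNNAME", "first"),
).reset_index(drop=True)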
Assuming df1 looks like the example frame above:

new_df = pd.DataFrame(columns=df1.columns)

for name, df in df1.groupby((df1.RSNNAME != df1.RSNNAME.shift()).cumsum()):
    if df.shape[0] == 1:
        new_df = pd.concat([new_df, df])
    else:
        df.iloc[0, df.columns.get_loc('ENDTIME')] = df.iloc[-1]["ENDTIME"]
        new_df = pd.concat([new_df, df.head(1)])

new_df then contains the compressed rows. If you really think that I was using chatgpt, then how do you explain all this? Be grateful that someone is trying to help you.
Remove zeros in pandas dataframe without affecting the imputation result
I have a timeseries dataset with 5M rows. The column has 19.5% missing values and 80% zeroes (don't go by the percentage values - although only 0.5% of the data is useful, 0.5% of 5M is enough). Now, I need to impute this column. Given the number of rows, it's taking around 2.5 hours for KNN to impute the whole thing.

To make it faster, I thought of deleting all the zero-value rows and then carrying out the imputation. But I feel that using KNN naively after this would lead to overestimation (since all the zero values are gone and, keeping the number of neighbours fixed, the mean is expected to increase). So, is there a way:

1. To modify the data input to the KNN model
2. To carry out the imputation after removing the rows with zeros, so that the values obtained are the same or at least close

To understand the problem more clearly, consider the following dummy dataframe:

        DATE  VALUE
0 2018-01-01    0.0
1 2018-01-02    8.0
2 2018-01-03    0.0
3 2018-01-04    0.0
4 2018-01-05    0.0
5 2018-01-06   10.0
6 2018-01-07    NaN
7 2018-01-08    9.0
8 2018-01-09    0.0
9 2018-01-10    0.0

Now, if I use KNN (k=3), then with the zeros the missing value would be the weighted mean of 0, 10 and 9. But if I remove the zeros naively, the value will be imputed with the weighted mean of 8, 10 and 9.

A few rough ideas which I thought of but could not carry through:

1. Modifying the weights (used in the weighted mean computation) of the KNN imputation process so that the removed 0s are taken into account during the imputation.
2. Adding a column which says how many neighbouring zeros a particular row has and then somehow using it to modify the imputation process.

Points 1 and 2 are just rough ideas which came to mind while thinking about how to solve the problem and might help while answering.

PS - Obviously, I am not feeding the time series data directly into KNN. What I am doing is extracting month, day, etc. from the date column and then using those features for imputation. I do not need parallel processing as an answer to make the code run faster. The data is so large that high RAM usage hangs my laptop.
Let's think logically and leave the machine learning part aside for the moment. Since we are dealing with a time series, it would be good to impute the data with the average of the values for the same date in different years, say 2-3 years (if we consider 2 years, then 1 year before and 1 year after the missing-value year); I would recommend not going beyond 3 years. Call this computed value x. Further, to bring x closer to the current data, use the average of x and y, where y is the linear interpolation value. In the above example, y = (10 + 9)/2, i.e. the average of one value before and one value after the point to be imputed.
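A rough sketch of that recipe - assuming a DataFrame df with a datetime DATE column and a VALUE column from which the zero rows have already been dropped, and taking "same date in different years" to mean grouping by calendar month and day:

import pandas as pd

df = df.set_index("DATE").sort_index()

# x: average of the values observed on the same calendar day across years
same_day_avg = df.groupby([df.index.month, df.index.day])["VALUE"].transform("mean")

# y: plain linear interpolation between the neighbouring observations
linear = df["VALUE"].interpolate(method="linear")

# final imputation: the average of the two, used only where VALUE is missing
df["VALUE"] = df["VALUE"].fillna((same_day_avg + linear) / 2)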
linear interpolation between values stored in separate dataframes
I have numbers stored in 2 data frames (the real ones are much bigger) which look like

df1
                   A       B         C       T      Z
13/03/2017  1.321674  3.1790  3.774602  30.898  13.22
06/02/2017  1.306358  3.1387  3.712554  30.847  13.36
09/01/2017  1.361103  3.2280  3.738500  32.062  13.75
05/12/2016  1.339258  3.4560  3.548593  31.978  13.81
07/11/2016  1.295137  3.2323  3.188161  31.463  13.43

df2
                   A       B       C       T        Z
13/03/2017  1.320829  3.1530  3.7418  30.933  13.1450
06/02/2017  1.305483  3.1160  3.6839  30.870  13.2985
09/01/2017  1.359989  3.1969  3.7129  32.098  13.6700
05/12/2016  1.338151  3.4215  3.5231  32.035  13.7243
07/11/2016  1.293996  3.2020  3.1681  31.480  13.3587

and a list where I have stored all daily dates from 13/03/2017 back to 07/11/2016. I would like to create a dataframe with the following features:

the list of daily dates is the row index
I would like to create columns (in this case from A to Z) and for each row/day compute the linear interpolation value between the value in df1 and the corresponding value in df2 shifted by -1.

For example, in the row '12/03/2017' for column A I want to compute [(34/35)*1.321674] + [(1/35)*1.305483] = 1.3212114, where 35 is the number of days between 13/03/2017 and 06/02/2017, 1.321674 is the value in df1 in column A for the day 13/03/2017, and 1.305483 is the value in df2 in column A for the day 06/02/2017. For 11/03/2017 in column A I want to compute [(33/35)*1.321674] + [(2/35)*1.305483] = 1.3207488. Thus the values 1.321674 and 1.305483 stay fixed for the time interval down to 06/02/2017, where the result should show 1.305483.

Finally, the linear interpolation should shift the interpolating values when the row's date falls in the next time interval. For example, once I reach 05/02/2017, the linear interpolation should be between 1.306358 (df1, column A) and 1.359989 (df2, column A), that is, shifted one position down.

For clarity, the date format is 'dd/mm/yyyy'. I would greatly appreciate any piece of advice or suggestion; I am aware it's a lot of work, so any hint is valued! Please let me know if you need more clarification. Thanks!
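One way to set this up - sketched under the assumption that df1 and df2 have already been read with a DatetimeIndex (dayfirst=True) and share the same dates and columns; the explicit loop mirrors the worked arithmetic above, and a vectorised version is possible on top of the same idea:

import pandas as pd

anchors = df1.index.sort_values()                      # oldest ... newest
daily = pd.date_range(anchors.min(), anchors.max(), freq="D")
out = pd.DataFrame(index=daily, columns=df1.columns, dtype=float)

# For each pair of consecutive anchor dates, blend df1 at the newer date with
# df2 at the older date. Newer intervals are processed last, so the shared
# boundary day (e.g. 06/02/2017) ends up showing the df2 value, as requested.
for t0, t1 in zip(anchors[:-1], anchors[1:]):
    span = (t1 - t0).days
    for day in pd.date_range(t0, t1, freq="D"):
        w = (day - t0).days / span                     # 0 at t0, 1 at t1
        out.loc[day] = w * df1.loc[t1] + (1 - w) * df2.loc[t0]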
Filling missing time values in a multi-indexed dataframe
Problem and what I want

I have a data file that comprises time series read asynchronously from multiple sensors. Basically, for every data element in my file I have a sensor ID and the time at which it was read, but I do not always have all sensors for every time, and read times may not be evenly spaced. Something like:

ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5 # skip some sensors for some time steps
0,2,6
2,2,7
2,3,8
1,5,9 # skip some time steps
2,5,10

Important note: the actual time column is of datetime type.

What I want is to be able to zero-order hold (forward fill) values for every sensor for any time steps where that sensor does not exist, and either set to zero or back fill any sensors that are not read at the earliest time steps. What I want is a dataframe that looks like it was read from:

ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
1,1,2 # ID 1 holds value from time step 0
2,1,5
0,2,6
1,2,2 # ID 1 still holding
2,2,7
0,3,6 # ID 0 holding
1,3,2 # ID 1 still holding
2,3,8
0,5,6 # ID 0 still holding, can skip totally missing time steps
1,5,9 # ID 1 finally updates
2,5,10

Pandas attempts so far

I initialize my dataframe and set my indices:

df = pd.read_csv(filename, dtype=np.int)
df.set_index(['ID', 'time'], inplace=True)

I try to mess with things like:

filled = df.reindex(method='ffill')

or the like, with various values passed to the index keyword argument like df.index, ['time'], etc. This always either throws an error because I passed an invalid keyword argument, or does nothing visible to the dataframe. I think it is not recognizing that the data I am looking for is "missing".

I also tried:

df.update(df.groupby(level=0).ffill())

or level=1, based on "Multi-Indexed fillna in Pandas", but again I get no visible change to the dataframe, I think because I don't have anything currently where I want my values to go.

Numpy attempt so far

I have had some luck with numpy and non-integer indexing using something like:

data = [np.array(df.loc[level].data) for level in df.index.levels[0]]
shapes = [arr.shape for arr in data]
print(shapes)
# [(3,), (2,), (5,)]
data = [np.array([arr[i] for i in np.linspace(0, arr.shape[0]-1, num=max(shapes)[0])]) for arr in data]
print([arr.shape for arr in data])
# [(5,), (5,), (5,)]

But this has two problems:

It takes me out of the pandas world, and I now have to manually maintain my sensor IDs, time index, etc. along with my feature vector (the actual data column is not just one column but a ton of values from a sensor suite).
Given the number of columns and the size of the actual dataset, this is going to be clunky and inelegant to implement on my real example. I would prefer a way of doing it in pandas.

The application

Ultimately this is just the data-cleaning step for training a recurrent neural network, where for each time step I will need to feed a feature vector that always has the same structure (one set of measurements for each sensor ID for each time step). Thank you for your help!
Here is one way, by using reindex and category:

df.time = df.time.astype('category', categories=[0, 1, 2, 3, 4, 5])
new_df = df.groupby('time', as_index=False).apply(lambda x: x.set_index('ID').reindex([0, 1, 2])).reset_index()
new_df['data'] = new_df.groupby('ID')['data'].ffill()
new_df.drop('time', 1).rename(columns={'level_0': 'time'})

Out[311]:
    time  ID  data
0      0   0   1.0
1      0   1   2.0
2      0   2   3.0
3      1   0   4.0
4      1   1   2.0
5      1   2   5.0
6      2   0   6.0
7      2   1   2.0
8      2   2   7.0
9      3   0   6.0
10     3   1   2.0
11     3   2   8.0
12     4   0   6.0
13     4   1   2.0
14     4   2   8.0
15     5   0   6.0
16     5   1   9.0
17     5   2  10.0
You can have a dictionary of last readings for each sensor. You'll have to pick some initial value; the most logical choice is probably to back-fill the earliest reading to earlier times. Once you've populated your last_reading dictionary, you can just sort all the readings by time, update the dictionary for each reading, and then fill in rows according to the dictionary. So after you have your last_reading dictionary initialized:

last_time = readings[1][time]
for reading in readings:
    if reading[time] > last_time:
        for ID in ID_list:
            df.loc[last_time, ID] = last_reading[ID]
        last_time = reading[time]
    last_reading[reading[ID]] = reading[data]

# the above for loop doesn't update for the last time,
# so you'll have to handle that separately
for ID in ID_list:
    df.loc[last_time, ID] = last_reading[ID]
last_time = reading[time]

This assumes that you have only one reading for each time/sensor pair, and that readings is a list of dictionaries sorted by time. It also assumes that df has the different sensors as columns and the different times as index. Adjust the code as necessary if otherwise. You can also probably optimize it a bit more by updating a whole row at once instead of using a for loop, but I didn't want to deal with making sure I had the Pandas syntax right.

Looking at the application, though, you might want each cell in the dataframe to be not a number but a tuple of the last value and the time it was read, so replace last_reading[reading[ID]] = reading[data] with last_reading[reading[ID]] = [reading[data], reading[time]]. Your neural net can then decide how to weight data based on how old it is.
I got this to work with the following, which I think is pretty general for any case like this where the time index for which you want to fill values is the second in a multi-index with two indices:

# Remove duplicate time indices (happens some in the dataset, pandas freaks out).
df = df[~df.index.duplicated(keep='first')]

# Unstack the dataframe and fill values per serial number forward, backward.
df = df.unstack(level=0)
df.update(df.ffill())  # first ZOH forward
df.update(df.bfill())  # now back fill values that are not seen at the beginning

# Restack the dataframe and re-order the indices.
df = df.stack(level=1)
df = df.swaplevel()

This gets me what I want, although I would love to be able to keep the duplicate time entries if anybody knows of a good way to do this. You could also use df.update(df.fillna(0)) instead of backfilling if starting unseen values at zero is preferable for a particular application. I put the above code block in a function called clean_df that takes the dataframe as argument and returns the cleaned dataframe.
Pandas/Python - Updating dataframes based on value match
I want to update the NaN values in the mergeAllGB.Intensity column with values from another dataframe where ID, weekday and hour are matching. I'm trying:

mergeAllGB.Intensity[mergeAllGB.Intensity.isnull()] = precip_hourly[precip_hourly.SId == mergeAllGB.SId & precip_hourly.Hour == mergeAllGB.Hour & precip_hourly.Weekday == mergeAllGB.Weekday].Intensity

However, this returns ValueError: Series lengths must match to compare. How could I do this?

Minimal example:

Inputs:

mergeAllGB
SId  Hour  Weekday  Intensity
1    12    5        NaN
2    5     6        3

precip_hourly
SId  Hour  Weekday  Intensity
1    12    5        2

Desired output:

mergeAllGB
SId  Hour  Weekday  Intensity
1    12    5        2
2    5     6        3
TL;DR this will (hopefully) work:

# Set the index to compare by
df = mergeAllGB.set_index(["SId", "Hour", "Weekday"])
fill_df = precip_hourly.set_index(["SId", "Hour", "Weekday"])

# Fill the nulls with the relevant values of intensity
df["Intensity"] = df.Intensity.fillna(fill_df.Intensity)

# Cancel the special indexes
mergeAllGB = df.reset_index()

Alternatively, the line before the last could be

df.loc[df.Intensity.isnull(), "Intensity"] = fill_df.Intensity

Assignment and comparison in pandas are done by index (which isn't shown in your example). In the example, running precip_hourly.SId == mergeAllGB.SId results in ValueError: Can only compare identically-labeled Series objects. This is because we try to compare the two columns by value, but precip_hourly doesn't have a row with index 1 (default indexing starts at 0), so the comparison fails.

Even if we assume the comparison succeeded, the assignment stage is problematic. Pandas tries to assign according to the index - but this doesn't have the intended meaning. Luckily, we can use it to our own benefit: by setting the index to be ["SId", "Hour", "Weekday"], any comparison and assignment will be done with relation to this index, so running df.Intensity = fill_df.Intensity will assign to df.Intensity the values in fill_df.Intensity wherever the index matches, that is, wherever they have the same ["SId", "Hour", "Weekday"].

In order to assign only to the places where the Intensity is NA, we need to filter first (or use fillna). Note that filtering by df.Intensity[df.Intensity.isnull()] will work, but assignment to it will probably fail if you have several rows with the same (SId, Hour, Weekday) values.