I hope you're all doing well.
So I've been working with Excel my whole life and I'm now switching to Python & Pandas. The learning curve is proving to be quite steep for me, so please bear with me.
Day after day it's getting better. I've already managed to aggregate values, input/output from CSV/Excel, drop "na" values and much more. However, I've stumbled upon a wall too high for me to climb right now...
I created an extract of the dataframe I'm working with. You can download it here, so you can understand what I'll be writing about: https://filetransfer.io/data-package/pWE9L29S#link
df_example
t_stamp,1_wind,2_wind,3_wind,4_wind,5_wind,6_wind,7_wind,1_wind_Q,2_wind_Q,3_wind_Q,4_wind_Q,5_wind_Q,6_wind_Q,7_wind_Q
2021-06-06 18:20:00,12.14397093693768,12.14570426940918,10.97993184016605,11.16468568605988,9.961717914791588,10.34653735907099,11.6856901451427,True,False,True,True,True,True,True
2021-05-10 19:00:00,8.045154709031468,8.572511270557484,8.499070711427668,7.949358210396142,8.252115912454919,7.116505042782365,8.815732567915179,True,True,True,True,True,True,True
2021-05-27 22:20:00,8.38946901817802,6.713454777683985,7.269814675171176,7.141862659613969,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-05 18:20:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-06-06 12:20:00,11.95525872119988,12.14570426940918,12.26086164116684,12.89527716859738,11.77172234144684,12.12409015586662,12.52180822809299,True,False,True,True,True,True,True
2021-06-04 03:30:00,14.72553364088618,12.72900662616056,10.59386275508178,10.96070182287055,12.38239256540934,12.07846616943932,10.58384464064597,True,True,True,True,False,True,True
2021-05-05 13:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-05-24 18:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.60747853230812,17.18577813727543,17.70745523935796,False,False,False,False,True,True,True
2021-05-07 19:00:00,13.94341927008482,10.95456999345216,13.36533234604886,0.0,3.782910539990379,10.86996953698871,13.45072022532649,True,True,True,False,False,True,True
2021-05-13 00:40:00,10.70940582779898,10.22222264510213,9.043496015164536,9.03805802580422,11.53775481234347,10.09538681656049,10.19345618536208,True,True,True,True,True,True,True
2021-05-27 19:40:00,10.8317678500958,7.929683248532885,8.264301219025942,8.184133252794958,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-09 12:00:00,10.55571650269678,7.635778078425459,10.43683108425784,7.847532146733346,8.100127641989639,7.770247510198059,8.040702032061867,True,True,True,True,True,True,True
2021-05-19 19:00:00,2.322496225799398,2.193219010982461,2.301622604435732,2.204278609893358,2.285408405883714,1.813280858368885,1.667207419773053,True,True,True,True,True,True,True
2021-05-30 12:30:00,5.776450801637788,8.488826231951345,10.98525552709715,7.03016556196849,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-24 14:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.93466266883504,17.04697174496121,17.0739475214739,False,False,False,False,True,False,True
What you are looking at:
"n" represents the number of measuring points.
First column: Timestamp of values
Columns 1 to "n": average wind speed at the different points over the last 10 minutes
Columns "n+1" to the last: indicates whether the value of the respective point is valid (True) or invalid (False). So the qualifier "1_wind_Q" applies to the value "1_wind"
What I'm trying to achieve:
The goal is to create a new column called "Avg_WS" which iterates through every row and calculates the following:
The average of the wind values in each row, using ONLY those whose corresponding qualifier is True
Example: So if in a given row, the column "4_wind_Q" is "False", the value "4_wind" should be excluded from the average on that given row.
Extra: If all Qualifiers are "False" in a given row, "Avg_WS" should equal to "NaN" in that same row.
I've tried to use apply, but I can't figure out how to match up the pairs of values and qualifiers.
Thank you so much in advance!
I tried using mask for this.
quals = ['1_wind_Q', '2_wind_Q', '3_wind_Q', '4_wind_Q', '5_wind_Q', '6_wind_Q', '7_wind_Q']
fields = ['1_wind', '2_wind', '3_wind', '4_wind', '5_wind', '6_wind', '7_wind']

# hide each value whose qualifier is False, then average the rest row-wise
df[fields].mask(~df[quals].values).mean(axis=1)
# output
0 11.047089
1 8.178635
2 7.378650
3 NaN
4 12.254836
5 11.945236
6 NaN
7 17.500237
8 12.516802
9 10.119969
10 8.802471
11 8.626705
12 2.112502
13 8.070175
14 17.504305
dtype: float64
# assign this to the dataframe
df.loc[:, 'Avg_WS'] = df[fields].mask(~df[quals].values).mean(axis=1)
mask works by applying a boolean mask to each of the "fields" columns; the caveat is that the mask must have the same shape as the data you apply it to (i.e. the same n x m dimensions).
mean(axis=1) tells the DataFrame to apply the mean across each row (rather than down each column, which axis=0 would imply).
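As a side note, if the number of measuring points "n" changes, the two column lists can be generated instead of typed out. A minimal sketch, assuming the linked extract is saved locally as df_example.csv (an assumed filename) and the columns follow the {i}_wind / {i}_wind_Q naming from the sample:

import pandas as pd

df = pd.read_csv('df_example.csv')  # the extract linked above (assumed filename)

n = 7  # number of measuring points
fields = [f'{i}_wind' for i in range(1, n + 1)]
quals = [f'{f}_Q' for f in fields]

# invalid readings become NaN; mean() skips NaN, and a row where every
# qualifier is False is all-NaN, so Avg_WS comes out as NaN there
df['Avg_WS'] = df[fields].mask(~df[quals].to_numpy()).mean(axis=1)

This also covers the "Extra" requirement from the question: an all-False row produces an all-NaN slice, whose mean is NaN.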
For a time series analysis, I have to drop instances that occur on the same date. However, I want to keep some of the 'deleted' information and add it to the remaining 'duplicate' instance. Below is a short example of part of my dataset.
import numpy as np
import pandas as pd

z = pd.DataFrame({'lat': [49.125, 49.125],
                  'lon': [-114.125, -114.125],
                  'time': [np.datetime64('2005-08-09'), np.datetime64('2005-08-09')],
                  'duration': [3, 6],
                  'size': [4, 10]})
lat lon time duration size
0 49.125 -114.125 2005-08-09 3 4
1 49.125 -114.125 2005-08-09 6 10
I would like to drop the (duplicate) instance which has the lowest 'duration' value but at the same time sum the 'size' variables. Output would look like:
lat lon time duration size
0 49.125 -114.125 2005-08-09 6 14
Does anyone know how I would be able to tackle such a problem? Furthermore, for another variable, I would like to take the mean of these values, though I expect the process would be similar to summing them.
edit: so far I know how to get the highest duration value to remain using:
z.sort_values(by='duration', ascending=False).drop_duplicates(subset=['lat', 'lon', 'time'], keep='first')
If those are all the columns in your dataframe, you can get your result using a groupby on your time column, and passing in your aggregations for each column.
More specifically, you can drop the (duplicate) instance which has the lowest 'duration' by keeping the max() duration, and at the same time sum the 'size' variables by using sum() on your size column.
res = z.groupby('time').agg({'lat': 'first',
                             'lon': 'first',
                             'duration': 'max',
                             'size': 'sum'}).reset_index()
res
time lat lon duration size
0 2005-08-09 49.125 -114.125 6 14
The only difference is that 'time' is now your first column, which you can quickly fix:
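A quick sketch of the fix, plus the mean aggregation asked about in the question (the 'temp' column below is hypothetical, just to illustrate adding a 'mean' entry):

# restore the original column order
res = res[['lat', 'lon', 'time', 'duration', 'size']]

# for another variable where you want the mean instead of the sum,
# just add it to the aggregation dict ('temp' is a hypothetical column)
# res = z.groupby('time').agg({'lat': 'first', 'lon': 'first',
#                              'duration': 'max', 'size': 'sum',
#                              'temp': 'mean'}).reset_index()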
Group by to get the sum, then merge it back onto the deduplicated df:
import pandas as pd
import numpy as np

# original data
z = pd.DataFrame({'lat': [49.125, 49.125],
                  'lon': [-114.125, -114.125],
                  'time': [np.datetime64('2005-08-09'), np.datetime64('2005-08-09')],
                  'duration': [3, 6],
                  'size': [4, 10]})

# sum of 'size' for each unique combination of lat, lon, time
gp = z.groupby(['lat', 'lon', 'time'], as_index=False)[['size']].sum()

# drop duplicates, keeping the row with the highest duration
df = z.sort_values(by='duration', ascending=True).drop_duplicates(subset=['lat', 'lon', 'time'], keep='last')

# merge the summed column back onto the deduplicated df
pd.merge(df[['lat', 'lon', 'time', 'duration']], gp, on=['lat', 'lon', 'time'])
lat lon time duration size
0 49.125 -114.125 2005-08-09 6 14
Another way, based on sophocles' answer:
res = z.sort_values(by='duration', ascending=False).groupby(['time', 'lat', 'lon']).agg({
    'duration': 'first',  # same as 'max' since we've sorted the data by duration DESC
    'size': 'sum'})
This one could become less readable if you have several columns you want to keep (you'd end up with a lot of 'first' entries in the agg dict); the sketch below shows one way around that.
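If many columns just need to be carried along, the agg dict can be built programmatically instead of spelling out 'first' for each one. A minimal sketch against the example frame z (the set of key columns is an assumption):

# columns that should keep their first (post-sort) value
carry = [c for c in z.columns if c not in ('time', 'lat', 'lon', 'size')]

agg_spec = {c: 'first' for c in carry}
agg_spec['size'] = 'sum'

res = (z.sort_values(by='duration', ascending=False)
        .groupby(['time', 'lat', 'lon'])
        .agg(agg_spec))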
I have a weekly time series of multiple variables and I am trying to see at what percentrank the last 26-week correlation would sit versus all previous 26-week correlations.
So I can generate a correlation matrix for the first 26-week period using the corr function in pandas, but I don't know how to loop through all previous periods to find the different values for these correlations to then rank.
I hope there is a better way to achieve this; if so, please let me know.
I have tried calculating parallel dataframes but I couldn't write a formula to rank the most recent, so I believe the solution lies with multi-indexing.
import numpy as np
import pandas as pd

daterange = pd.date_range('20160701', periods=100, freq='1W')
np.random.seed(120)
df_corr = pd.DataFrame(np.random.rand(100, 5), index=daterange, columns=list('abcde'))
df_corr_chg = df_corr.diff()
df_corr_chg = df_corr_chg[1:]
df_corr_chg = df_corr_chg.replace(0, 0.01)
d = df_corr_chg.shape[0]
df_CCC = df_corr_chg[::-1]
for s in range(0, d - 26):
    i = df_CCC.iloc[s:26 + s]  # each i is one 26-week window (most recent first)
I am looking for a multi-indexed table showing the correlations at different times
Example of output
e.g.:
              a          b
a  1   1.000000  -0.101713
   2   1.000000  -0.031109
   n   1.000000   0.471764
b  1  -0.101713   1.000000
   2  -0.031109   1.000000
   n   0.471764   1.000000
Here is a recipe for how you could approach the problem.
I assume you have one price per week (otherwise just pre-aggregate your dataframe, see the hint below).
# in case your weeks are not numbered:
# sort your dataframe by symbol (EUR, SPX, ...) and date descending
df.sort_values(['symbol', 'date'], ascending=False, inplace=True)

# now add a pseudo row counter per symbol and keep the 26 most recent rows
indexer = df.groupby('symbol').cumcount() < 26

# corr() needs one column per symbol, so pivot the long format first
# (assumes columns named 'symbol', 'date' and 'pricecolumn')
df[indexer].pivot(index='date', columns='symbol', values='pricecolumn').corr()
One more hint, in case you need to pre-aggregate your dataframe: you could add an auxiliary column with the week number, like:
df['week_number'] = df['datefield'].dt.isocalendar().week  # .dt.week is deprecated in newer pandas
Then I guess you would like to have the last price of each week. You could do that as follows:
df_last = (df.sort_values(['symbol', 'week_number', 'date'], ascending=True)
             .groupby(['symbol', 'week_number'])
             .aggregate('last'))
df_last.reset_index(inplace=True)
Then use df_last in place of the df above. Please check/adjust the field names; I just assumed them.
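Tying this back to the original question, here is one possible sketch built on the asker's own wide df_corr_chg frame: compute the correlation matrix for every 26-week window, stack them into a multi-indexed frame, and percent-rank the most recent window's correlations against all earlier ones. The window numbering and the rank-based percentile are my assumptions, not part of the recipe above.

import numpy as np
import pandas as pd

# the asker's setup
daterange = pd.date_range('20160701', periods=100, freq='1W')
np.random.seed(120)
df_corr = pd.DataFrame(np.random.rand(100, 5), index=daterange, columns=list('abcde'))
df_corr_chg = df_corr.diff().iloc[1:].replace(0, 0.01)

window = 26
corrs = {}
for end in range(window, len(df_corr_chg) + 1):
    # correlation matrix of the 26 weeks ending at row `end`
    corrs[end] = df_corr_chg.iloc[end - window:end].corr()

# multi-indexed frame: rows are (window, variable), columns are the variables
all_corrs = pd.concat(corrs, names=['window', 'var'])

# percentile rank of the latest window's correlations vs. all windows
latest = max(corrs)
pct = all_corrs.groupby(level='var').rank(pct=True).xs(latest, level='window')
print(pct)

The resulting pct frame has the shape of a correlation matrix, where each entry is the percentile of the latest 26-week correlation of that variable pair relative to every earlier window.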