As per the following code, using pandas, I am doing some analysis on one of the columns (HR):
aa = New_Data['index'].tolist()
aa = [0] + aa
avg = []
for i in range(1, len(aa)):
    val = raw_data.loc[(raw_data['index'] >= aa[i-1]) & (raw_data['index'] <= aa[i]), 'HR'].diff().mean()  # **
    avg.append(val)
New_Data['slope'] = avg
At the end of the day, it adds a new column ('slope') to the data.
That is fine and is working. The problem is that I want to redo the line marked with ** for every other column (not just HR) as well. In other words:
    val = raw_data.loc[(raw_data['index'] >= aa[i-1]) & (raw_data['index'] <= aa[i]), '**another column**'].diff().mean()  # **
    avg.append(val)
New_Data['slope'] = avg
Is there any way to do it automatically? I have around 100 columns, so doing it manually is not appealing. Thanks for your help.
Not sure about a pure pandas way, but you could just wrap it in an external loop:
aa = New_Data['index'].tolist()
aa = [0] + aa
for col in raw_data.columns:
    avg = []
    for i in range(1, len(aa)):
        val = raw_data.loc[(raw_data['index'] >= aa[i-1]) & (raw_data['index'] <= aa[i]), col].diff().mean()
        avg.append(val)
    New_Data['slope_' + col] = avg  # one slope column per input column
In the line
for col in raw_data.columns
you can modify it to only use the columns you need.
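If you want something closer to a single pandas operation, one possible alternative is to bin the rows once with pd.cut and let a single groupby compute the mean diff for every column at the same time. This is only a sketch, assuming raw_data['index'] is numeric and increasing and that aa holds the interval boundaries built above; note that the pd.cut bins are half-open, so the edge handling differs slightly from the >=/<= comparison in the loop:
import pandas as pd

# assign every row of raw_data to an interval defined by the boundaries in aa
bins = pd.cut(raw_data['index'], bins=aa, include_lowest=True)

# mean of the row-to-row differences, per interval and per column
slopes = raw_data.groupby(bins, observed=True).apply(lambda g: g.diff().mean())

# slopes has one row per interval and one column per original column, e.g.:
# New_Data['slope_HR'] = slopes['HR'].to_numpy()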
Related
I am trying to split a pd.Series of sorted dates that sometimes has gaps between consecutive entries that are bigger than the normal one. To do this, I calculated the size of the gaps with pd.Series.diff() and then iterated over all the elements in the series with a while loop. Unfortunately, this is quite computationally intensive. Is there a better way (apart from parallelization)?
Minimal example with my function:
import pandas as pd
import time
def get_samples_separated_at_gaps(data: pd.Series, normal_gap) -> list:
    diff = data.diff()
    # list that should contain all samples
    samples_list = [pd.Series(data[0])]
    i = 1
    while i < len(data):
        if diff[i] == normal_gap:
            # normal gap: add data[i] to the last sample in samples_list
            samples_list[-1] = samples_list[-1].append(pd.Series(data[i]))
        else:
            # not a normal gap: start a new sample in samples_list
            samples_list.append(pd.Series(data[i]))
        i += 1
    return samples_list
# make sample data as example
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
# start sampling
start_time = time.time()
my_list_with_samples = get_samples_separated_at_gaps(data_with_samples, normal_distance)
print(f"Duration: {time.time() - start_time}")
The real data have over 150k entries and the processing takes several minutes... :/
I'm not sure I understand completely what you want but I think this could work:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
idx = data_with_samples[data_with_samples.diff(1) > normal_distance].index
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
idx collects the indices directly after a gap, and the rest just splits the series at these indices and packs the pieces into the list samples_list.
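As a quick sanity check with the example data from the question (two runs of 10,000 timestamps separated by one large gap), this should give exactly two pieces:
print(len(samples_list))                # 2
print([len(s) for s in samples_list])   # [10000, 10000]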
If the index is non-standard, then you need some overhead (resetting index and later setting the index back to the original) to make sure that iloc can be used.
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
data_with_samples = data_with_samples.reset_index(drop=False).rename(columns={0: 'data'})
idx = data_with_samples.data[data_with_samples.data.diff(1) > normal_distance].index
data_with_samples.set_index('index', drop=True, inplace=True)
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
(You don't need that for your example.)
Your code is a bit unclear regarding the way these two different lists are stored. Specifically, I'm not sure what the correct structure of samples_list is that you have in mind.
Regardless, using Series.pct_change and np.unique() you should achieve approximately what you're looking for.
import numpy as np

uniques, indices = np.unique(data_with_samples.diff()[1:].pct_change(),
                             return_index=True)
Now indices points you to the start and end of that irregular gap.
If your data has more than one gap, then you'd want to use only diff()[1:].pct_change() and look for all values that are different from 0 using where().
Using the same sample data as in the question above:
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
Use the time diff to compare with normal_distance.seconds.
Create an auxiliary column tag to separate the gap groups.
import numpy as np

# start sampling
start_time = time.time()

df = data_with_samples.to_frame()
df['time_diff'] = df[0].diff().dt.seconds
cond = (df['time_diff'] > normal_distance.seconds) | (df['time_diff'].isnull())
df['tag'] = np.where(cond, 1, 0)
df['tag'] = df['tag'].cumsum()

my_list_with_samples = []
for _, group in df.groupby('tag'):
    my_list_with_samples.append(group[0])
print(f"Duration: {time.time() - start_time}")
I have a dataframe with a column x.
I want to make a new column x_new, but I want the first row of this new column to be set to a specific number (say -2).
Then, from the 2nd row on, each value should be computed from the previous row via the cx function:
data = {'x': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

def cx(x):
    if df.loc[1, 'x_new'] == 0:
        df.loc[1, 'x_new'] = -2
    else:
        x_new = -10*x + 2
        return x_new

df['x_new'] = cx(df['x'])
The final dataframe should have x_new starting at -2, with every following row computed from the previous one.
I am not sure how to do this.
Thank you for your help.
This is what I have so far:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
df

# calculate equation
def depth_cal(d):
    z = -3*d + 1  # d must be the previous row
    return z

depth_cal = depth_cal(df['depth'])  # how to set d as the previous row?
print(depth_cal)

depth_new = []
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal)  # does not put the list in a column
df['Depth_correct'] = depth_new
Correct output:
There are still two problems with this:
1. It does not put the depth_cal list properly into the column.
2. In the depth_cal function, I want d to be the previous row.
Thank you
I would do this by just using a loop to generate your new data. It might not be ideal if the data is particularly huge, but it's a quick operation. Let me know how you get on with this:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

res = df['depth'].tolist()  # work on a plain list copy of the column
res[0] = -5.63
for i in range(1, len(res)):
    res[i] = -3 * res[i-1] + 1

df['new_depth'] = res
print(df)
To get
   depth  new_depth
0      1      -5.63
1      2      17.89
2      3     -52.67
3      4     159.01
4      5    -476.03
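For reference, the same recurrence can also be written with itertools.accumulate, which avoids the explicit index arithmetic. This is just a sketch that should reproduce the values above; the seed list is a placeholder, since accumulate only looks at the previously computed value:
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'depth': [1, 2, 3, 4, 5]})

# start from -5.63 and apply z = -3*prev + 1 for every further row
seed = [-5.63] * len(df)
df['new_depth'] = list(accumulate(seed, lambda prev, _: -3 * prev + 1))
print(df)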
Basically, I'm aggregating prices over three indices to determine the mean and std, as well as an upper/lower limit. So far so good. However, now I also want to find the lowest identified price which is still >= the computed lower limit.
My first idea was to use np.min to find the lowest price, but this obviously disregards the lower limit and is not useful. Now I'm trying to store all the values the pivot table identified in order to find the lowest price which is still >= the lower limit. Any ideas?
pivot = pd.pivot_table(temp, index=['A','B','C'],values=['price'], aggfunc=[np.mean,np.std],fill_value=0)
pivot['lower_limit'] = pivot['mean'] - 2 * pivot['std']
pivot['upper_limit'] = pivot['mean'] + 2 * pivot['std']
First, merge pivoted['lower_limit'] back into temp. Thus, for each price in temp there is also a lower_limit value:
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
Then you can restrict your attention to those rows in temp for which the price is >= lower_limit:
temp.loc[temp['price'] >= temp['lower_limit']]
The desired result can be found by computing a groupby/min:
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
For example,
import numpy as np
import pandas as pd
np.random.seed(2017)
N = 1000
ABC = list('ABC')
temp = pd.DataFrame(np.random.randint(2, size=(N,3)), columns=ABC)
temp['price'] = np.random.random(N)
pivoted = pd.pivot_table(temp, index=['A','B','C'], values=['price'],
                         aggfunc=[np.mean, np.std], fill_value=0)
pivoted['lower_limit'] = pivoted['mean'] - 2 * pivoted['std']
pivoted['upper_limit'] = pivoted['mean'] + 2 * pivoted['std']
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
print(result)
yields
A  B  C
0  0  0    0.003628
      1    0.000132
   1  0    0.005833
      1    0.000159
1  0  0    0.006203
      1    0.000536
   1  0    0.001745
      1    0.025713
I have a Pandas data frame with the following columns:
blocked, rolling_mean, cumulative_i
I am trying to create a new column where:
c_i = max(0, blocked_i - (rolling_mean_i + k) + c_(i-1)), where k = 2
My current approach is:
k = 2
for i in range(df.shape[0]):
    if i > 1:
        df.loc[i, 'cumulative_i'] = max(0, df['blocked'].iloc[i] - (df['rolling_mean'].iloc[i] + k) + df['cumulative_i'].iloc[i - 1])
Is there a more pythonic way of doing this?
Edit:
I tried doing the following.
df['cumulative_i'] = np.maximum(0, df['blocked'] - (df['rolling_mean'] + k) + df['cumulative_i'].shift())
On the third row under vectorised_output, the value 153.48 is what one would get if we don't add the previous value 213.72 (i.e. 367.2 - 213.72 = 153.48).
This is the output I would expect if I were only doing
df['cumulative_i'] = np.maximum(0, df['blocked'] - (df['rolling_mean'] + k))
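The reason the shift-based attempt reduces to that expression is that the recurrence needs the cumulative value that was just computed, so a single shift() over the original column cannot reproduce it. One way to keep the sequential loop but make it fast is to run it over plain NumPy arrays instead of indexing the DataFrame row by row; this is only a sketch, assuming the columns from above and k = 2:
import numpy as np

k = 2
blocked = df['blocked'].to_numpy()
rolling_mean = df['rolling_mean'].to_numpy()

# c[0] stays 0; every further value uses the value computed in the previous step
c = np.zeros(len(df))
for i in range(1, len(df)):
    c[i] = max(0.0, blocked[i] - (rolling_mean[i] + k) + c[i - 1])

df['cumulative_i'] = c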
I have two pandas DataFrames A and B, with columns ['start', 'end', 'value'] but not the same number of rows. I'd like to set the value for each row in A along these lines (pseudocode):
A.iloc(i) = B['value'][(B['start'] < A[i,'start']) & (B['end'] > A[i,'end'])]
It is possible that multiple rows of B satisfy this condition for a given i; in that case the max (or sum) of the corresponding rows would be the result. In case none satisfies it, the value of A.iloc[i] should not be updated, or should be set to a default value of 0 (either way is fine).
I'm interested in the most efficient way of doing this.
import numpy as np
import pandas as pd
np.random.seed(1)
lenB = 10
lenA = 20
B_start = np.random.rand(lenB)
B_end = B_start + np.random.rand(lenB)
B_value = np.random.randint(100, 200, lenB)
A_start = np.random.rand(lenA)
A_end = A_start + np.random.rand(lenA)
#if you use dataframe
#B_start = B["start"].values
#B_end = ...
mask = (A_start[:, None ] > B_start) & (A_end[:, None] < B_end)
r, c = np.where(mask)
result = pd.Series(B_value[c]).groupby(r).max()
print(result)
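To fold that back into A, one option is to reindex the result over all row positions of A, filling unmatched rows with 0. This is a sketch, assuming A is a DataFrame built from A_start/A_end as in the commented hint above:
A = pd.DataFrame({'start': A_start, 'end': A_end})

# result is indexed by the positions of A that had at least one match;
# reindex over every row position and default the rest to 0
A['value'] = result.reindex(range(lenA), fill_value=0).to_numpy()
print(A)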