I have a Pandas data frame with the following columns:
blocked, rolling_mean, cumulative_i
I am trying to create a new column where:
c_i = max(0, blocked_i - (rolling_mean_i + k) + c_(i-1)), where k = 2
My current approach is:
k = 2
for i in range(df.shape[0]):
    if i > 0:
        df.loc[i, 'cumulative_i'] = max(0, df['blocked'].iloc[i] - (df['rolling_mean'].iloc[i] + k) + df['cumulative_i'].iloc[i - 1])
Is there a more pythonic way of doing this?
Edit:
I tried doing the following.
df['cumulative_i'] = np.maximum(0, df['blocked'] - (df['rolling_mean'] + k) + df['cumulative_i'].shift())
In the third row of the vectorised output, the value 153.48 is what one would get without adding the previous value 213.72 (i.e. 367.2 - 213.72 = 153.48).
This is the output I would expect if I were only doing
df['cumulative_i'] = np.maximum(0, df['blocked'] - (df['rolling_mean'] + k))
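Because each c_i depends on the value just computed for c_(i-1), a single shift() cannot reproduce the recurrence: the shifted column only ever sees the original, pre-update values. One way to keep it fast is to run the loop over plain NumPy arrays rather than through DataFrame indexing. This is a sketch, assuming the column names above and that the series starts from c_0 = max(0, blocked_0 - (rolling_mean_0 + k)); numba.njit could speed the loop up further if needed:

import numpy as np
import pandas as pd

def cumulative(blocked: pd.Series, rolling_mean: pd.Series, k: float = 2) -> pd.Series:
    x = (blocked - (rolling_mean + k)).to_numpy(dtype=float)
    c = np.zeros_like(x)
    c[0] = max(0.0, x[0])
    for i in range(1, len(x)):
        # c_i = max(0, blocked_i - (rolling_mean_i + k) + c_(i-1))
        c[i] = max(0.0, x[i] + c[i - 1])
    return pd.Series(c, index=blocked.index)

# df['cumulative_i'] = cumulative(df['blocked'], df['rolling_mean'], k=2)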
Related
I have the following algorithm that iterates over a DataFrame with a few million rows.
It takes a lot of time for the algorithm to finish. Do you have any suggestions?
from random import choice, uniform

import pandas as pd

def k_nn_averaging(df: pd.DataFrame, k: int = 15, use_abs_value: bool = False) -> pd.DataFrame:
    df_averaged = df.copy()
    df[helper.modifiable_columns] = df[helper.modifiable_columns].astype(float)
    df_averaged[helper.modifiable_columns] = df_averaged[helper.modifiable_columns].astype(float)

    for i in range(0, df.shape[0]):
        neighbours = list(range(i - k if i - k >= 0 else 0,
                                i + k if i + k <= df_averaged.shape[0] else df_averaged.shape[0]))
        neighbours.remove(i)
        selectedNeighbourIndex = choice(neighbours)
        factor = uniform(0, 1)

        currentSampleValues = df[helper.modifiable_columns].iloc[i]
        neighbourSampleValues = df[helper.modifiable_columns].iloc[selectedNeighbourIndex]

        average = 0
        if not use_abs_value:
            average = factor * (currentSampleValues - neighbourSampleValues)
        else:
            average = factor * abs(currentSampleValues - neighbourSampleValues)

        df_averaged.loc[i, helper.modifiable_columns] = currentSampleValues + average

    return df_averaged
The first thing to reach for is vectorizing the loop. Here is modified code that avoids the Python loop and uses NumPy operations instead:
import pandas as pd
import numpy as np
def k_nn_averaging(df: pd.DataFrame, k: int = 15, use_abs_value: bool = False) -> pd.DataFrame:
    df_averaged = df.copy()
    df_averaged[helper.modifiable_columns] = df_averaged[helper.modifiable_columns].astype(float)

    num_rows = df.shape[0]
    modifiable_columns = helper.modifiable_columns
    rng = np.random.default_rng()

    # candidate neighbour indices for every row: i-k, ..., i-1, i+1, ..., i+k
    offsets = np.concatenate([np.arange(-k, 0), np.arange(1, k + 1)])
    neighbour_indices = np.arange(num_rows)[:, None] + offsets[None, :]

    # mark candidates that fall outside the DataFrame as invalid
    valid = (neighbour_indices >= 0) & (neighbour_indices < num_rows)

    # pick one valid neighbour per row uniformly at random:
    # random scores, invalid candidates forced to lose, argmax selects the winner
    scores = rng.random(neighbour_indices.shape)
    scores[~valid] = -1.0
    picked = neighbour_indices[np.arange(num_rows), scores.argmax(axis=1)]

    # one random factor per row, broadcast across the modifiable columns
    factors = rng.uniform(0.0, 1.0, size=(num_rows, 1))

    current_values = df[modifiable_columns].to_numpy(dtype=float)
    neighbour_values = current_values[picked]

    diff = current_values - neighbour_values
    if use_abs_value:
        diff = np.abs(diff)

    df_averaged[modifiable_columns] = current_values + factors * diff
    return df_averaged
I think this will be much faster than the original script.
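Note that both versions draw their neighbours and factors at random, so the outputs only agree in distribution, not element for element. A quick sanity check (a sketch, assuming helper.modifiable_columns is defined and df is a frame of realistic size) is to time a run and confirm the shape is unchanged:

import time

start = time.time()
result = k_nn_averaging(df, k=15)
print(f"vectorised run: {time.time() - start:.3f}s, shape unchanged: {result.shape == df.shape}")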
I am trying to split a pd.Series of sorted dates at places where the gap between consecutive dates is bigger than the normal one. To do this, I calculated the gap sizes with pd.Series.diff() and then iterated over all the elements of the series with a while loop. Unfortunately this is quite computationally intensive. Is there a better way (apart from parallelization)?
Minimal example with my function:
import pandas as pd
import time
def get_samples_separated_at_gaps(data: pd.Series, normal_gap) -> list:
    diff = data.diff()
    # list that should contain all samples
    samples_list = [pd.Series(data[0])]
    i = 1
    while i < len(data):
        if diff[i] == normal_gap:
            # normal gap: append data[i] to the last sample in samples_list
            samples_list[-1] = samples_list[-1].append(pd.Series(data[i]))
        else:
            # bigger gap: start a new sample in samples_list
            samples_list.append(pd.Series(data[i]))
        i += 1
    return samples_list
# make sample data as example
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
# start sampling
start_time = time.time()
my_list_with_samples = get_samples_separated_at_gaps(data_with_samples, normal_distance)
print(f"Duration: {time.time() - start_time}")
The real data has over 150k entries and takes several minutes to process... :/
I'm not sure I understand completely what you want but I think this could work:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
idx = data_with_samples[data_with_samples.diff(1) > normal_distance].index
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i - 1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
idx collects the indices directly after a gap, and the rest is just splitting the series at these indices and packing the pieces into the list samples_list.
If the index is non-standard, you need some overhead (resetting the index and later setting it back to the original) to make sure that iloc can be used.
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
data_with_samples = data_with_samples.reset_index(drop=False).rename(columns={0: 'data'})
idx = data_with_samples.data[data_with_samples.data.diff(1) > normal_distance].index
data_with_samples.set_index('index', drop=True, inplace=True)
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i - 1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
(You don't need that for your example.)
Your code is a bit unclear about how the resulting pieces should be stored; specifically, I'm not sure what structure you have in mind for samples_list.
Regardless, using Series.pct_change and np.unique() you should achieve approximately what you're looking for.
uniques, indices = np.unique(
    data_with_samples.diff()[1:].pct_change(),
    return_index=True)
Now indices points you to the start and end of that wrong gap.
If your data has more than one gap, you'd want to use only diff()[1:].pct_change() and look for all values different from 0 using where().
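A minimal sketch of that multi-gap variant, assuming the same data_with_samples as above and using np.where to find the non-zero changes (the first pct_change value is NaN, so it is filled with 0 here):

import numpy as np

changes = data_with_samples.diff()[1:].pct_change()
# rows where the relative change of the gap size is non-zero sit at a gap boundary
gap_positions = np.where(changes.fillna(0) != 0)[0]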
Using the same example data as the question:
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
Use the time difference to compare against the normal distance.
Create an auxiliary column tag to separate the gap groups.
# start sampling
start_time = time.time()
df = data_with_samples.to_frame()
df['time_diff'] = df[0].diff().dt.total_seconds()
cond = (df['time_diff'] > normal_distance.total_seconds()) | (df['time_diff'].isnull())
df['tag'] = np.where(cond, 1, 0)
df['tag'] = df['tag'].cumsum()
my_list_with_samples = []
for _, group in df.groupby('tag'):
    my_list_with_samples.append(group[0])
print(f"Duration: {time.time() - start_time}")
In the following code, using pandas, I am doing some analysis on one of the columns (HR):
aa = New_Data['index'].tolist()
aa = [0] + aa
avg = []
for i in range(1, len(aa)):
    ** val = raw_data.loc[(raw_data['index'] >= aa[i-1]) & (raw_data['index'] <= aa[i]), 'HR'].diff().mean()
    avg.append(val)
New_Data['slope'] = avg
At the end, this adds a new column ('slope') to the data.
That is fine and works. The problem is that I want to redo the marked line (the one flagged with **) for every other column (not just HR) as well. In other words:
    ** val = raw_data.loc[(raw_data['index'] >= aa[i-1]) & (raw_data['index'] <= aa[i]), '**another column**'].diff().mean()
    avg.append(val)
New_Data['slope'] = avg
Is there any way to do this automatically? I have around 100 columns, so doing it manually is not enticing. Thanks for your help.
Not sure about a pure pandas way, but you could just wrap it in an external loop:
aa = New_Data['index'].tolist()
aa = [0] + aa

for col in raw_data.columns:
    avg = []
    for i in range(1, len(aa)):
        val = raw_data.loc[(raw_data['index'] >= aa[i-1]) & (raw_data['index'] <= aa[i]), col].diff().mean()
        avg.append(val)
    New_Data[f'{col}_slope'] = avg
In the line
for col in raw_data.columns:
you can restrict the iteration to only the columns you need.
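If you only need some of the columns and want to add all the new slope columns in one go, a variation (a sketch, reusing the same aa and raw_data as above) is to collect them in a dict first:

wanted = [c for c in raw_data.columns if c != 'index']  # adjust to the columns you need
slopes = {
    f'{col}_slope': [
        raw_data.loc[(raw_data['index'] >= aa[i - 1]) & (raw_data['index'] <= aa[i]), col].diff().mean()
        for i in range(1, len(aa))
    ]
    for col in wanted
}
New_Data = New_Data.assign(**slopes)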
I have the following sequential code:
c = []
for ind, a in df.iterrows():
    for ind, b in df.iterrows():
        if a.hit_id < b.hit_id:
            c.append(dist(a, b))
c = numpy.array(c)
But the number of rows in the DataFrame is close to 10^6, so I want to speed up this operation somehow. I am thinking of using dask for this, along with groupby.
Following is my approach:
#dask.delayed
def compute_pairwise_distance(val1, val2):
    for i1 in val1:
        for i2 in val2:
            dist = np.sqrt(np.square(i1.x - i2.x) + np.square(i1.y - i2.y) + np.square(i1.z - i2.z))
            gV.min_dist = min(gV.min_dist, dist)
            gV.max_dist = max(gV.max_dist, dist)

def wrapper():
    gV.grouped_df = gV.df_hits.groupby('layer_id')
    unique_groups = gV.df_hits['layer_id'].compute().unique()
    results = []
    for gp1 in unique_groups:
        for gp2 in unique_groups:
            if gp1 < gp2:
                y = delay(compute_pairwise_distance)(gV.grouped_df.get_group(gp1), gV.grouped_df.get_group(gp2))
                results.append(y)
    results = dask.compute(*results)

wrapper()
print(str(gV.max_dist) + " " + str(gV.min_dist))
I don't know why, but I am getting a KeyError: 'l'. Also, is this the right way of using dask?
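For reference, one loop-free way to get the global min/max pairwise distance between layers is to run SciPy's cdist on each pair of layer groups. This is a sketch, assuming a plain pandas DataFrame df_hits with columns x, y, z and layer_id as in the code above (not a dask DataFrame); for very large layers the cdist call may need to be chunked to fit in memory:

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def min_max_between_layers(df_hits: pd.DataFrame):
    groups = {g: sub[['x', 'y', 'z']].to_numpy() for g, sub in df_hits.groupby('layer_id')}
    keys = sorted(groups)
    min_dist, max_dist = np.inf, -np.inf
    for i, gp1 in enumerate(keys):
        for gp2 in keys[i + 1:]:
            d = cdist(groups[gp1], groups[gp2])  # all pairwise distances between the two layers
            min_dist = min(min_dist, d.min())
            max_dist = max(max_dist, d.max())
    return min_dist, max_dist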
for div in soup.find_all('table', class_="W(100%)"):
    for y in div.find_all('td'):
        print(y.text)
returns an output of:
Previous Close
38.08
Open
38.23
Bid
37.67 x 100
Ask
38.16 x 500
Day's Range
37.35 - 38.25
52 Week Range
23.50 - 40.92
Volume
29,152
Avg. Volume
118,446
Market Cap
1.66B
I want to return the output as
(Previous Close, 38.08)
(Open, 38.23)
(Bid, 37.67 x 100)
I honestly have no idea how to tackle this.
I thought about implementing an odd and even counter, but even then, how would I join the previous entry with the next entry?
Look for rows (<tr> elements), not cells; you can then group cells per row:
for row in soup.select('table.W(100%) tr'):
    columns = tuple(cell.text for cell in row.find_all('td'))
    print(columns)
I used the CSS select() method to concisely request all table rows for your given table.
You can do this by going through every even index and taking that element and the one after it, and putting them into a pair. The following would work as a generator:
def pairs(array):
    for i in range(0, len(array), 2):
        yield (array[i], array[i + 1])
Alternatively, if you want the function to return a list of the pairs:
def pairs(array):
    output = []
    for i in range(0, len(array), 2):
        output.append((array[i], array[i + 1]))
    return output
Or, for a shorter (but less readable) version:
def pairs(array):
    return map(lambda index: (array[index], array[index + 1]), range(0, len(array), 2))
If you only want to output it in that format, there's a different way to do that directly, other than converting it to tuples and outputting those:
def outputPairs(array):
    for i in range(0, len(array), 2):
        print("(" + str(array[i]) + ", " + str(array[i + 1]) + ")")