For a data frame with two columns, "series1" and "series2", I want to find, for each value of interest in series1, the first index at which series2 exceeds that value by a threshold. I can't see any way to design the algorithm other than iterating through series1 and finding the minimum matching index on each iteration.
With 50M rows it takes days to compute. Is there perhaps a different approach that I'm not seeing?
def pandas_version(df: pd.DataFrame, signal_series: pd.Series, threshold_max, threshold_min):
    results = []
    series1 = df[df.columns[0]]
    series2 = df[df.columns[1]]
    # select only values of interest
    selection = signal_series[signal_series >= threshold_max]
    series1 = series1.loc[selection.index]
    series2 = series2.loc[selection.index[0]:]
    for idx_dt, val in zip(series1.index, series1):
        window_series2 = series2.loc[idx_dt:]
        results.append(window_series2[window_series2 > val + threshold_min].index.min())
    return pd.Series(results, index=selection.index, copy=False)
s1 = [i for i in range(10)]
s2 = [2*i for i in range(10)]
signal = s1  # or something else
df = pd.DataFrame({"series1": s1, "series2": s2})
signal_series = pd.Series(signal)
pandas_version(df, signal_series, 3, 2)
Output:
3 3
4 4
5 5
6 6
7 7
8 8
9 9
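Not an exact drop-in, but here is a minimal numpy sketch of the same brute-force idea, assuming df and signal_series share a default RangeIndex; it avoids the per-iteration pandas slicing (often the dominant cost), although the worst case is still O(n^2):
import numpy as np
import pandas as pd

def numpy_version(df, signal_series, threshold_max, threshold_min):
    s1 = df["series1"].to_numpy()
    s2 = df["series2"].to_numpy()
    # positions of the values of interest (same selection as above)
    positions = np.flatnonzero(signal_series.to_numpy() >= threshold_max)
    results = []
    for pos in positions:
        # first later position where series2 exceeds the value plus the threshold
        hits = np.flatnonzero(s2[pos:] > s1[pos] + threshold_min)
        results.append(pos + hits[0] if hits.size else np.nan)
    return pd.Series(results, index=df.index[positions])
On the toy example above this reproduces the same result (3 through 9).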
I want to get a specific number of rows before and after a given index. However, when the requested range extends past the available indices, my slicing returns nothing. In that case I would like it to continue looking for rows by wrapping around, as I show below:
df = pd.DataFrame({'column': range(1, 6)})
column
0 1
1 2
2 3
3 4
4 5
index = 2
df.iloc[index]
3
# Now I want to get three values before and after that index.
# Something like this:
def get_before_after_rows(index):
    rows_before = df[(index-1): (index-1)-2]
    rows_after = df[(index+1): (index+1)-2]
    return rows_before, rows_after
rows_before, rows_after = get_before_after_rows(index)
rows_before
column
0 1
1 2
4 5
rows_after
column
0 1
3 4
4 5
You are mixing iloc and loc which is very dangerous. It works in your example because the index is sequentially numbered starting from zero so these two functions behave identically.
Anyhow, what you want is basically taking rows with wrap-around:
def get_around(df: pd.DataFrame, index: int, n: int) -> (pd.DataFrame, pd.DataFrame):
    """Return n rows before and n rows after the specified positional index"""
    idx = index - np.arange(1, n + 1)
    before = df.iloc[idx].sort_index()
    idx = (index + np.arange(1, n + 1)) % len(df)
    after = df.iloc[idx].sort_index()
    return before, after
# Get 3 rows before and 3 rows after the *positional index* 2
before, after = get_around(df, 2, 3)
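For reference, with the 5-row example frame from the question this produces exactly the wrap-around result that was asked for:
before
   column
0       1
1       2
4       5
after
   column
0       1
3       4
4       5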
I have scraped a webpage table, and the table items are in a sequential 1D list, with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm to do this, but I'd like to know if there is a more pythonic/efficient way to achieve it. NB: I don't necessarily know how many columns there are in my table. Here's an example:
input = ['A',1,'B',5,'C',9,
'A',2,'B',6,'C',10,
'A',3,'B',7,'C',11,
'A',4,'B',8,'C',12]
output = {}
it = iter(input)
val = next(it)
while val:
    if val in output:
        output[val].append(next(it))
    else:
        output[val] = [next(it)]
    val = next(it, None)
df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your data is always "well behaved", then something like this should suffice:
import pandas as pd
data = ['A',1,'B',5,'C',9,
'A',2,'B',6,'C',10,
'A',3,'B',7,'C',11,
'A',4,'B',8,'C',12]
result = {}
for k, v in zip(data[::2], data[1::2]):
    result.setdefault(k, []).append(v)
df = pd.DataFrame(result)
You can also use numpy reshape:
import numpy as np
cols = sorted(set(data[::2]))
df = pd.DataFrame(np.reshape(data, (int(len(data) / len(cols) / 2), len(cols) * 2)).T[1::2].T, columns=cols)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explanation:
# get columns
cols = sorted(set(data[::2]))
# reshape the list into a list of lists
shape = (int(len(data) / len(cols) / 2), len(cols) * 2)
np.reshape(data, shape)
# get only the values of the data
.T[1::2].T
# this transposes the data and keeps every second row (the value rows)
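To make the slicing concrete, here are the intermediate arrays for the example list (note that the mixed list is converted to a single string dtype, so the resulting cells are strings unless you cast them):
arr = np.reshape(data, (4, 6))   # rows of alternating header/value pairs
# [['A' '1' 'B' '5' 'C' '9']
#  ['A' '2' 'B' '6' 'C' '10']
#  ['A' '3' 'B' '7' 'C' '11']
#  ['A' '4' 'B' '8' 'C' '12']]
arr.T[1::2]                      # transpose, keep every second row: the value rows
# [['1' '2' '3' '4']
#  ['5' '6' '7' '8']
#  ['9' '10' '11' '12']]
arr.T[1::2].T                    # transpose back: one column per header
# [['1' '5' '9']
#  ['2' '6' '10']
#  ['3' '7' '11']
#  ['4' '8' '12']]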
I'm looking to make it so that NaN values in a dataframe are filled in by the mean of all the values up to that point, as such:
A
0 1
1 2
2 3
3 4
4 5
5 NaN
6 NaN
7 11
8 NaN
Would become
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
You can solve it by running the following code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
"A": [ 1, 2, 3, 4, 5, pd.NA, pd.NA, 11, pd.NA ]
})
for idx in df[pd.isna(df["A"])].index:
    df.loc[idx, "A"] = np.mean(df.loc[:idx, "A"])
It iterates over each NaN and fills it with the mean of the previous values, including the NaNs already filled. For index 8, for example, that gives (1+2+3+4+5+3+3+11)/8 = 4 rather than the mean of the original non-NaN values alone.
At the end you will have:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
EDIT
As stated by RichieV, performance may be an issue with this solution when there are many NaNs (its runtime complexity is O(N^2)), and we should also avoid Python iteration, since it is slow compared to native pandas/numpy calls.
Here is an optimized version:
last_idx = None
cumsum = 0
cumnum = 0
for idx in df[pd.isna(df["A"])].index:
    prev_values = df.loc[last_idx:idx, "A"]
    # label-based .loc slicing is inclusive of both endpoints, so drop idx itself
    prev_values = prev_values[:-1]
    cumsum += prev_values.sum()
    cumnum += len(prev_values)
    df.loc[idx, "A"] = int(cumsum / cumnum)  # int() truncates; drop it if fractional means are wanted
    last_idx = idx
Result:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
Since in the worst case the script passes over the dataframe twice, the runtime complexity is now O(N).
Marco's answer works fine but it can be optimized with incremental average formulas, from math.stackexchange.com
Here is an adaptation of that other question (not the exact formula, just the concept).
cumsum = 0
expanding_mean = []
for i, xi in enumerate(df['A']):
    if pd.isna(xi):
        mean = cumsum / i  # divide by number of items up to the previous row
        expanding_mean.append(mean)
        cumsum += mean
    else:
        cumsum += xi
df.loc[df['A'].isna(), 'A'] = expanding_mean
The main advantage with this code is not having to read all items up to the current index on each iteration to get the mean.
This option still uses a Python loop, which is not the best choice with pandas, but there seems to be no way around it for this use case (hopefully someone will get inspired by this and find such a method without a loop).
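For reference, here is a minimal check of the incremental-mean identity that answer alludes to; the loop above keeps a running sum instead, which is equivalent:
# Incremental mean update: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
xs = [1, 2, 3, 4, 5]
mean = 0.0
for n, x in enumerate(xs, start=1):
    mean += (x - mean) / n
assert mean == sum(xs) / len(xs)  # both give 3.0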
Performance tests
Three alternative functions were defined:
incremental: My answer.
from_origin: Marco's original answer.
incremental_pandas: Marco's updated answer.
Tests were done using the timeit module with 3 repetitions on random samples with 0.4 probability of NaN.
Full code for testing
import pandas as pd
import numpy as np
import timeit
import collections
from matplotlib import pyplot as plt
def incremental(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0

    cumsum = 0
    expanding_mean = []
    for i, xi in enumerate(df['A']):
        if pd.isna(xi):
            mean = cumsum / i  # divide by number of items up to the previous row
            expanding_mean.append(mean)
            cumsum += mean
        else:
            cumsum += xi
    df.loc[df['A'].isna(), 'A'] = expanding_mean
    return df

def incremental_pandas(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0

    last_idx = None
    cumsum = 0
    cumnum = 0
    for idx in df[pd.isna(df["A"])].index:
        prev_values = df.loc[last_idx:idx, "A"]
        # label-based .loc slicing is inclusive of both endpoints, so drop idx itself
        prev_values = prev_values[:-1]
        cumsum += prev_values.sum()
        cumnum += len(prev_values)
        df.loc[idx, "A"] = cumsum / cumnum
        last_idx = idx
    return df

def from_origin(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0

    for idx in df[pd.isna(df["A"])].index:
        df.loc[idx, "A"] = np.mean(df.loc[:idx, "A"])
    return df

def get_random_sample(n, p):
    np.random.seed(123)
    return pd.DataFrame({'A':
        np.random.choice(list(range(10)) + [np.nan],
                         size=n, p=[(1 - p) / 10] * 10 + [p])})
r = 3
p = 0.4 # portion of NaNs
# check result from all functions
results = []
for func in [from_origin, incremental, incremental_pandas]:
    random_df = get_random_sample(1000, p)
    new_df = random_df.copy(deep=True)
    results.append(func(new_df))
print('Passed' if all(np.allclose(res, results[0]) for res in results[1:])
      else 'Failed', 'implementation test')

timings = {}
for n in np.geomspace(10, 10000, 10):
    random_df = get_random_sample(int(n), p)
    timings[n] = collections.defaultdict(float)
    results = {}
    for func in ['incremental', 'from_origin', 'incremental_pandas']:
        timings[n][func] = (
            timeit.timeit(f'{func}(random_df.copy(deep=True))', number=r, globals=globals())
            / r
        )
timings = pd.DataFrame(timings).T
print(timings)
timings.plot()
plt.xlabel('size of array')
plt.ylabel('avg runtime (s)')
plt.ylim(0)
plt.grid(True)
plt.tight_layout()
plt.show()
plt.close('all')
I would like to merge two dataframes based on the overlap of spans (indicated by pairs (s, e), s = start of span, e = end of span), and while I have pretty bad code for doing it, I would like to know if there is a good way to implement it. Here is an example:
df1 = pd.DataFrame({'s':[0,10,20,33,424,5345],
'e':[3,17,30,39,1000,10987],
'data1':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'s':[1,45,0],
'e':[50,46,90],
'data2':[1,2,3]})
def overlap(a1, a2, b1, b2):
    if type(b1) == list or type(b1) == np.ndarray:
        assert len(b1) == len(b2)
        return np.asarray([overlap(a1, a2, b1[k], b2[k]) for k in range(len(b1))])
    else:
        return max((a2 - a1) + (b2 - b1) + min(a1, b1) - max(b2, a2) + 1, 0)

overlaps = [overlap(df1['s'].iloc[i], df1['e'].iloc[i], df2['s'].values, df2['e'].values) > 0
            for i in range(len(df1))]
df1['data2'] = [df2['data2'][o].tolist() for o in overlaps]
Output is:
s e data1 data2
0 0 3 1 [1, 3]
1 10 17 2 [1, 3]
2 20 30 3 [1, 3]
3 33 39 4 [1, 3]
4 424 1000 5 []
5 5345 10987 6 []
Edit: also, in my particular case I am guaranteed that the spans in df1 are non-overlapping and sequential (i.e. s[i] > s[i-1], e[i] > s[i], e[i] < s[i+1]).
Edit2: you can generate an arbitrary amount of almost-valid fake data (here there is no guarantee that the spans in the first df are non-overlapping):
N=int(1e3)
sdf1=np.random.randint(0, high=10*N, size=(N,))
sdf1.sort()
edf1=sdf1+np.random.randint(1, high=10, size=(N,))
data1=range(N)
sdf2=np.random.randint(0, high=10*N, size=(N,))
edf2=sdf2+np.random.randint(1, high=10, size=(N,))
data2=range(N)
df1 = pd.DataFrame({'s':sdf1,
'e':edf1,
'data1':data1})
df2 = pd.DataFrame({'s':sdf2,
'e':edf2,
'data2':data2})
When it comes to pandas dataframes, you should always avoid for loops for processing rows/columns and use apply, transform or other pandas functions instead. For example, to get the overlaps you can do:
def has_overlap(a1, a2, b1, b2):
    '''Return True if the spans overlap, otherwise False.'''
    return (a2 - a1) + (b2 - b1) + min(a1, b1) - max(b2, a2) + 1 > 0

def find_overlap(row1):
    '''Return the df2["data2"] values whose spans overlap the given row of df1, as a list.'''
    df2['has_overlap'] = df2.apply(lambda row2: has_overlap(row1.s, row1.e, row2.s, row2.e), axis=1)
    return list(df2['data2'].loc[df2['has_overlap']])

df1['data2'] = df1.apply(lambda row: find_overlap(row), axis=1)
print('df1: {}'.format(df1))
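Given the Edit's guarantee that df1's spans are sorted and non-overlapping, the df1 rows overlapping any df2 span form one contiguous block, so np.searchsorted can locate it without the row-by-row apply. This is only a rough sketch under that assumption (the helper name overlaps_sorted is mine), not a drop-in replacement:
import numpy as np

def overlaps_sorted(df1, df2):
    s1, e1 = df1['s'].to_numpy(), df1['e'].to_numpy()
    # first df1 row whose end reaches the df2 start ...
    left = np.searchsorted(e1, df2['s'].to_numpy(), side='left')
    # ... and one past the last df1 row whose start is still inside the df2 span
    # (touching endpoints count as overlap, matching the formula above)
    right = np.searchsorted(s1, df2['e'].to_numpy(), side='right')
    out = [[] for _ in range(len(df1))]
    for lo, hi, d2 in zip(left, right, df2['data2']):
        for i in range(lo, hi):
            out[i].append(d2)
    return out

df1['data2'] = overlaps_sorted(df1, df2)
On the example data this reproduces the output above: [1, 3] for the first four rows and empty lists for the last two.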
I have a dataframe where the row indices and column headings should determine the content of each cell. I'm working with a much larger version of the following df:
df = pd.DataFrame(index = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'],
columns = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
Specifically, I want to apply the custom function edit_distance() or equivalent (see here for function code) which calculates a difference score between two strings. The two inputs are the row and column names. The following works but is extremely slow:
for seq in df.index:
    for seq2 in df.columns:
        df.loc[seq, seq2] = edit_distance(seq, seq2)
This produces the result I want:
ae azde afgle arlde afghijklbcmde
afghijklde 8 7 5 6 3
afghijklmde 9 8 6 7 2
ade 1 1 3 2 10
afghilmde 7 6 4 5 4
amde 2 1 3 2 9
What is a better way to do this, perhaps using applymap()? Everything I've tried with applymap(), apply, or df.iterrows() has returned errors of the kind AttributeError: 'float' object has no attribute 'index'. Thanks.
Turns out there's an even better way to do this. onepan's dictionary-comprehension answer (below) is good but returns the df index and columns in a different order. Using a nested .apply() accomplishes the same thing at about the same speed and doesn't change the row/column order. The key is not to get hung up on naming the df's rows and columns first and filling in the values second. Instead, do it the other way around, initially treating the future index and columns as standalone pandas Series.
series_rows = pd.Series(['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'])
series_cols = pd.Series(['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
df.index = series_rows
df.columns = series_cols
you could use comprehensions, which speeds it up ~4.5x on my pc
first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']
pd.DataFrame.from_dict({f:{s:edit_distance(f, s) for s in second} for f in first}, orient='index')
# output
# ae azde afgle arlde afghijklbcmde
# ade 1 1 3 2 10
# afghijklde 8 7 5 6 3
# afghijklmde 9 8 6 7 2
# afghilmde 7 6 4 5 4
# amde 2 1 3 2 9
# this matches to edit_distance('ae', 'afghijklde') == 8, e.g.
note I used this code for edit_distance (first response in your link):
def edit_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
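For reference, a quick sanity check of that implementation against the value quoted in the comment above:
print(edit_distance('ae', 'afghijklde'))  # 8, matching the question's table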