Pandas dataframe, merge by intersection of spans? - python

I would like to merge two dataframes based on overlap of spans (each given by a pair (s, e), where s is the start of the span and e its end). While I have pretty bad code for doing it, I would like to know if there is a good way to implement it. Here is an example:
df1 = pd.DataFrame({'s': [0, 10, 20, 33, 424, 5345],
                    'e': [3, 17, 30, 39, 1000, 10987],
                    'data1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'s': [1, 45, 0],
                    'e': [50, 46, 90],
                    'data2': [1, 2, 3]})

def overlap(a1, a2, b1, b2):
    if type(b1) == list or type(b1) == np.ndarray:
        assert len(b1) == len(b2)
        return np.asarray([overlap(a1, a2, b1[k], b2[k]) for k in range(len(b1))])
    else:
        return max((a2-a1) + (b2-b1) + min(a1, b1) - max(b2, a2) + 1, 0)

overlaps = [overlap(df1['s'].iloc[i], df1['e'].iloc[i], df2['s'].values, df2['e'].values) > 0
            for i in range(len(df1))]
df1['data2'] = [df2['data2'][o].tolist() for o in overlaps]
Output is:
      s      e  data1   data2
0     0      3      1  [1, 3]
1    10     17      2  [1, 3]
2    20     30      3  [1, 3]
3    33     39      4  [1, 3]
4   424   1000      5      []
5  5345  10987      6      []
Edit: also, in my particular case I am guaranteed that the spans in df1 are non-overlapping and sequential (i.e. s[i] > s[i-1], e[i] > s[i], e[i] < s[i+1]).
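That guarantee makes a much cheaper approach possible: with df1 sorted and non-overlapping, every df2 span overlaps a contiguous block of df1 rows, and two binary searches can locate that block. A minimal sketch under those assumptions (the helper name merge_overlaps_sorted is mine, and intervals are treated as closed to match the +1 in the overlap formula):

import numpy as np
import pandas as pd

def merge_overlaps_sorted(df1, df2):
    """Attach to each df1 row the df2['data2'] values whose spans overlap it.
    Assumes df1's spans are sorted and non-overlapping (closed intervals)."""
    s1, e1 = df1['s'].to_numpy(), df1['e'].to_numpy()
    # For each df2 span [s, e]: lo = first df1 row whose end >= s,
    # hi = last df1 row whose start <= e; overlaps are exactly rows lo..hi.
    lo = np.searchsorted(e1, df2['s'].to_numpy(), side='left')
    hi = np.searchsorted(s1, df2['e'].to_numpy(), side='right') - 1
    result = [[] for _ in range(len(df1))]
    for j, (a, b) in enumerate(zip(lo, hi)):
        for i in range(a, b + 1):  # contiguous block of overlapping df1 rows
            result[i].append(df2['data2'].iloc[j])
    out = df1.copy()
    out['data2'] = result
    return out

On the example data this reproduces the output above; the cost is O(m log n) for the searches plus the size of the result itself.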
Edit2: you can generate an arbitrary amount of almost-valid fake data (here we have no guarantee that the spans in the first df are non-overlapping):
N = int(1e3)
sdf1 = np.random.randint(0, high=10*N, size=(N,))
sdf1.sort()
edf1 = sdf1 + np.random.randint(1, high=10, size=(N,))
data1 = range(N)
sdf2 = np.random.randint(0, high=10*N, size=(N,))
edf2 = sdf2 + np.random.randint(1, high=10, size=(N,))
data2 = range(N)
df1 = pd.DataFrame({'s': sdf1,
                    'e': edf1,
                    'data1': data1})
df2 = pd.DataFrame({'s': sdf2,
                    'e': edf2,
                    'data2': data2})

When it comes to pandas dataframes, you should generally avoid for loops for processing rows/columns and use apply, transform or other pandas functions instead. For example, to get the overlaps you can do:
def has_overlap(a1, a2, b1, b2):
    '''Return True if the spans overlap, otherwise False.'''
    return (a2-a1) + (b2-b1) + min(a1, b1) - max(b2, a2) + 1 > 0

def find_overlap(row1):
    '''Return the df2 entries that overlap with the given row of df1, as a list.'''
    df2['has_overlap'] = df2.apply(lambda row2: has_overlap(row1.s, row1.e, row2.s, row2.e), axis=1)
    return list(df2['data2'].loc[df2['has_overlap']])

df1['data2'] = df1.apply(lambda row: find_overlap(row), axis=1)
print('df1: {}'.format(df1))
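A caveat on the answer above: apply with axis=1 still calls a Python-level function once per row, so it is rarely faster than an explicit loop. For a genuinely vectorized version, one can test all span pairs at once with NumPy broadcasting; a minimal sketch (closed intervals, trading O(len(df1) * len(df2)) memory for speed):

import numpy as np

s1 = df1['s'].to_numpy()[:, None]   # column vectors for df1
e1 = df1['e'].to_numpy()[:, None]
s2 = df2['s'].to_numpy()[None, :]   # row vectors for df2
e2 = df2['e'].to_numpy()[None, :]

# Closed intervals [s1, e1] and [s2, e2] overlap iff s1 <= e2 and s2 <= e1;
# mask[i, j] is True when df1 row i overlaps df2 row j.
mask = (s1 <= e2) & (s2 <= e1)
data2 = df2['data2'].to_numpy()
df1['data2'] = [data2[row].tolist() for row in mask]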

Related

Adding two numeric pandas columns with different lengths based on condition

I am writing a piece of simulation software in python using pandas, here is my problem:
Imagine you have two pandas dataframes dfA and dfB with numeric columns A and B respectively.
Both dataframes have a different number of rows denoted by n and m.
Let's assume that n > m.
Moreover, dfA includes a binary column C, which contains 1 exactly m times and 0 otherwise.
Assume both dfA and dfB are sorted.
My question: going in order, I want to add the values in B to the values in column A wherever column C == 0.
In the example n = 6, m = 3.
Example data:
dataA = {'A': [7, 7, 7, 7, 7, 7],
         'C': [1, 0, 1, 0, 0, 1]}
dfA = pd.DataFrame(dataA)
dfB = pd.DataFrame([3, 5, 4], columns=['B'])
Example pseudocode:
# DOES NOT WORK
if dfA['C'] == 1:
    dfD['D'] = dfA['A']
else:
    dfD['D'] = dfA['A'] + dfB['B']
Expected result:
dfD['D']
[7,10,7,12,11,7]
I can only think of obscure for loops with index counters for each of the three vectors, but I am sure that there is a faster way by writing a function and using apply. But maybe there is something completely different that I am missing.
*NOTE: In the real problem the rows are not single values but row vectors of equal length. Moreover, in the real problem it is not simple addition but a weighted average over the two row vectors.
You can use:
m = dfA['C'].eq(1)
dfA['C'] = dfA['A'].where(m, dfA['A']+dfB['B'].set_axis(dfA.index[~m]))
Or:
dfA.loc[m, 'C'] = dfA.loc[m, 'A']
dfA.loc[~m, 'C'] = dfB['B'].values
Output:
   A   C
0  7   7
1  7  10
2  7   7
3  7  12
4  7  11
5  7   7
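The set_axis call is doing the real work in the first variant: dfB['B'] initially carries the index 0..m-1, and relabeling it with dfA.index[~m] lines each B value up with the next row where C == 0 before the addition. A quick way to see the alignment with the example data:

m = dfA['C'].eq(1)
print(dfA.index[~m].tolist())               # [1, 3, 4]: the rows where C == 0
print(dfB['B'].set_axis(dfA.index[~m]))     # B's values 3, 5, 4 re-indexed to 1, 3, 4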
The alternative answer is pretty clever. I am just showing a different way if you would like to do it using loops:
# Create an empty df
dfD = pd.DataFrame()

# Create loop
k = 0
for i in range(len(dfA)):
    if dfA.loc[i, "C"] == 1:
        dfD.loc[i, "D"] = dfA.loc[i, "A"]
    else:
        dfD.loc[i, "D"] = dfA.loc[i, "A"] + dfB.loc[k, "B"]
        k = k + 1

# Show results
dfD
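One caveat on the loop version: writing into an empty DataFrame cell by cell with .loc is slow, since pandas may reallocate on every insertion. If you want to keep the loop, collecting the values in a plain list and building the frame once is usually much faster; a minimal sketch:

values = []
k = 0
for i in range(len(dfA)):
    if dfA.loc[i, "C"] == 1:
        values.append(dfA.loc[i, "A"])
    else:
        values.append(dfA.loc[i, "A"] + dfB.loc[k, "B"])
        k += 1
dfD = pd.DataFrame({"D": values})   # built once, at the end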

Get rows before and after from an index in pandas dataframe

I want to get a specific number of rows before and after a specific index. However, when the range extends past the available indices, nothing is returned. In that case I would like the lookup to keep going and wrap around to the other end, as I show below:
df = pd.DataFrame({'column': range(1, 6)})

   column
0       1
1       2
2       3
3       4
4       5

index = 2
df.iloc[index]
3
# Now I want to get three values before and after that index.
# Something like this:
def get_before_after_rows(index):
    rows_before = df[(index-1): (index-1)-2]
    rows_after = df[(index+1): (index+1)-2]
    return rows_before, rows_after
rows_before, rows_after = get_before_after_rows(index)
rows_before
   column
0       1
1       2
4       5

rows_after
   column
0       1
3       4
4       5
You are mixing iloc and loc, which is very dangerous. It happens to work in your example because the index is sequentially numbered starting from zero, so the two behave identically.
Anyhow, what you want is basically taking rows with wrap-around:
def get_around(df: pd.DataFrame, index: int, n: int) -> (pd.DataFrame, pd.DataFrame):
    """Return n rows before and n rows after the specified positional index"""
    idx = index - np.arange(1, n+1)
    before = df.iloc[idx].sort_index()
    idx = (index + np.arange(1, n+1)) % len(df)
    after = df.iloc[idx].sort_index()
    return before, after

# Get 3 rows before and 3 rows after the *positional index* 2
before, after = get_around(df, 2, 3)
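With the question's five-row frame this reproduces the wrap-around the asker sketched (the negative positions produced for before wrap automatically because iloc accepts negative indices):

df = pd.DataFrame({'column': range(1, 6)})
before, after = get_around(df, 2, 3)
print(before)   # rows 0, 1 and the wrapped row 4
print(after)    # rows 3, 4 and the wrapped row 0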

fill in entire dataframe cell by cell based on index AND column names?

I have a dataframe where the row indices and column headings should determine the content of each cell. I'm working with a much larger version of the following df:
df = pd.DataFrame(index=['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'],
                  columns=['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
Specifically, I want to apply the custom function edit_distance() or equivalent (see here for function code) which calculates a difference score between two strings. The two inputs are the row and column names. The following works but is extremely slow:
for seq in df.index:
    for seq2 in df.columns:
        df.loc[seq, seq2] = edit_distance(seq, seq2)
This produces the result I want:
              ae  azde  afgle  arlde  afghijklbcmde
afghijklde     8     7      5      6              3
afghijklmde    9     8      6      7              2
ade            1     1      3      2             10
afghilmde      7     6      4      5              4
amde           2     1      3      2              9
What is a better way to do this, perhaps using applymap()? Everything I've tried with applymap(), apply, or df.iterrows() has returned errors of the kind AttributeError: 'float' object has no attribute 'index'. Thanks.
Turns out there's an even better way to do this. onepan's dictionary comprehension answer (below) is good, but it returns the df index and columns in random order. A nested .apply() accomplishes the same thing at about the same speed and doesn't change the row/column order. The key is not to get hung up on naming the df's rows and columns first and filling in the values second; instead, do it the other way around, initially treating the future index and columns as standalone pandas Series.
series_rows = pd.Series(['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'])
series_cols = pd.Series(['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
df.index = series_rows
df.columns = series_cols
You could use comprehensions, which speeds it up ~4.5x on my PC:
first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']
pd.DataFrame.from_dict({f: {s: edit_distance(f, s) for s in second} for f in first}, orient='index')
# output
#                ae  azde  afgle  arlde  afghijklbcmde
# ade             1     1      3      2             10
# afghijklde      8     7      5      6              3
# afghijklmde     9     8      6      7              2
# afghilmde       7     6      4      5              4
# amde            2     1      3      2              9
# this matches edit_distance('ae', 'afghijklde') == 8, e.g.
Note: I used this code for edit_distance (first response in your link):
def edit_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
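A quick sanity check of this implementation against the question's target table:

print(edit_distance('afghijklde', 'ae'))   # 8, the first cell of the expected output
print(edit_distance('ade', 'ae'))          # 1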

From a dataframe using the apply() method, how to return a new column with lists of elements from the dataframe?

There's an operation that is a little counterintuitive when using the pandas apply() method. It took me a couple of hours of reading to solve, so here it is.
So here is what I was trying to accomplish.
I have a pandas dataframe like so:
test = pd.DataFrame({'one': [[2], ['test']], 'two': [[5], [10]]})

      one   two
0     [2]   [5]
1  [test]  [10]
and I want to concatenate the columns per row, creating a resulting list whose length equals the DataFrame's original length, like so:
def combine(row):
    result = row['one'] + row['two']
    return result
When running it through the dataframe using the apply() method:
test.apply(lambda x: combine(x), axis=1)
    one  two
0     2    5
1  test   10
Which isn't quite what we wanted. What we want is:
       result
0      [2, 5]
1  [test, 10]
EDIT
I know there are simpler solutions to this example. But this is an abstraction of a much more complex operation. Here's an example of a more complex one:
df_one:
   org_id        date  status  id
0       2  2015/02/01    True   3
1      10  2015/05/01    True  27
2      10  2015/06/01    True  18
3      10  2015/04/01   False  27
4      10  2015/03/01    True  40
df_two:
   org_id        date
0      12  2015/04/01
1      10  2015/02/01
2       2  2015/08/01
3      10  2015/08/01
Here's a more complex operation:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return id_list
then finally run:
df_one.sort_values('date', inplace=True)
df_two['id_list'] = df_two.apply(
    operation,
    axis=1,
    args=(df_one,)
)
This would be impossible with the simpler solutions. Hence my proposal below is to rewrite operation to:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return pd.Series({'id_list': id_list})
We'd expect the following result:
        id_list
0            []
1            []
2           [3]
3  [27, 18, 40]
IIUC we can simply sum two columns:
In [93]: test.sum(axis=1).to_frame('result')
Out[93]:
       result
0      [2, 5]
1  [test, 10]
because when we sum lists:
In [94]: [2] + [5]
Out[94]: [2, 5]
they get concatenated...
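Since the columns hold Python lists in object dtype, the elementwise + works on its own too; the sum reduction is not required (a minimal sketch):

(test['one'] + test['two']).to_frame('result')   # same row-wise list concatenation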
So the answer to this problem lies in how pandas.apply() method works.
When defining
def combine(row):
    result = row['one'] + row['two']
    return result
the function returns a list for each row that gets passed in. This is a problem when we use the function with the .apply() method, because pandas interprets the resulting list as a Series where each element belongs to a column of that same row.
To solve this we need to create a Series where we specify a new column name like so:
def combine(row):
    result = row['one'] + row['two']
    return pd.Series({'result': result})
And if we run this again:
test.apply(lambda x: combine(x), axis=1)
       result
0      [2, 5]
1  [test, 10]
We'll get what we originally wanted! Again, this is because we are forcing pandas to interpret the entire result as a column.
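As a side note: newer pandas versions expose a result_type parameter on DataFrame.apply, and result_type='reduce' asks pandas to keep list-like returns whole instead of expanding them, so the original list-returning combine also works (exact behavior depends on your pandas version):

# using the original combine that returns a plain list
result = test.apply(combine, axis=1, result_type='reduce').to_frame('result')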

Pandas assign label based on index value

I have a dataframe with an index and multiple columns. Secondly, I have a few lists containing index values sampled according to certain criteria. Now I want to create columns with labels based on whether or not the index of a given row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'

df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This works, but it is quite slow. Each call takes about 3 minutes, and considering I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
                        np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
                        np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
   A        2_name
0  8             A
1  8             A
2  3             B
3  7             B
4  7             A
5  0             B
6  4             B
7  2             A
8  5  not_assigned
9  2  not_assigned
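If more label groups get added later, nesting np.where becomes hard to read; np.select takes parallel lists of conditions and choices and gives the same result here (a sketch):

conditions = [df.index.isin(random_m + model_m),
              df.index.isin(random_y + model_y)]
df['2_name'] = np.select(conditions, ['A', 'B'], default='not_assigned')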
