Pairwise match and merging of overlapping sequences in pandas dataframe - python

I have a pandas dataframe containing four columns: a reference sequence, a read from that reference sequence, and the start/end positions of that read. I am trying to iterate over this dataframe and check rows pairwise to see if the reads overlap based on their start and end positions, and merge them if they do. Next, I want to check this newly merged read against the next read in the dataframe to see if they overlap, and merge these as well if they do. So far I have put my data in a pandas DataFrame, but I'm starting to believe that this may not be the optimal structure, and that e.g. a dictionary might be better suited for this kind of operation.
I have tried multiple things, without any luck, so I am hoping that one of you wonderful people may be able to come up with a solution from the data:
data = [
["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "ABCDE", 1, 5],
["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "DEFGHIJK", 4, 11],
["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "IJKLMNOPQRST", 9, 20],
["TESTINGONETWOTHREE", "TEST", 1, 4],
["TESTINGONETWOTHREE", "NGONE", 6, 10],
["TESTINGONETWOTHREE", "NETWOTHR", 9, 16],
]
df = pd.DataFrame(
data, columns=["reference", "read", "start", "end"]
)
print(df)
reference read start end
0 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDE 1 5
1 ABCDEFGHIJKLMNOPQRSTUVWXYZ DEFGHIJK 4 11
2 ABCDEFGHIJKLMNOPQRSTUVWXYZ IJKLMNOPQRST 9 20
3 TESTINGONETWOTHREE TEST 1 4
4 TESTINGONETWOTHREE NGONE 6 10
5 TESTINGONETWOTHREE NETWOTHR 9 16
In this case, I would like to end up with a new dataframe (or dictionary) that has the merged reads, the reference sequence that they are from and their start and stop positions, like so:
reference read start end
0 ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCDEFGHIJKLMNOPQRST 1 20
1 TESTINGONETWOTHREE TEST 1 4
2 TESTINGONETWOTHREE NGONETWOTHR 6 16
I would very much appreciate any help on this :)
Cheers!

You could use a custom grouper to identify the stretches of overlapping reads, then use it to aggregate with join/min/max:
group = df['start'].gt(df.groupby('reference')['end'].shift()-1).cumsum()
# [0, 0, 0, 0, 1, 1]
(df.groupby(['reference', group])
   .agg({'read': ''.join, 'start': 'min', 'end': 'max'})
)
output:
                                                   read  start  end
reference
ABCDEFGHIJKLMNOPQRSTUVWXYZ 0  ABCDEDEFGHIJKIJKLMNOPQRST      1   20
TESTINGONETWOTHREE         0                       TEST      1    4
                           1              NGONENETWOTHR      6   16
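Note that ''.join simply concatenates the reads, so the overlapping characters are duplicated (visible in the output above) rather than merged as in the desired output. If you want the overlaps trimmed, one option is a custom per-group merge that uses the start positions; a rough sketch, assuming 1-based inclusive positions (newer pandas versions may warn that groupby.apply operates on the grouping columns):
def merge_group(g):
    # merge the overlapping reads of one stretch, trimming the duplicated overlap
    g = g.sort_values('start')
    merged, end = '', None
    for read, start in zip(g['read'], g['start']):
        overlap = 0 if end is None else max(0, end - start + 1)
        merged += read[overlap:]
        end = start + len(read) - 1
    return pd.Series({'read': merged, 'start': g['start'].min(), 'end': g['end'].max()})

out = (df.groupby(['reference', group])
         .apply(merge_group)
         .droplevel(1)
         .reset_index()
      )
# reference                   read                  start  end
# ABCDEFGHIJKLMNOPQRSTUVWXYZ  ABCDEFGHIJKLMNOPQRST      1   20
# TESTINGONETWOTHREE          TEST                      1    4
# TESTINGONETWOTHREE          NGONETWOTHR               6   16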

Related

Creating a DataFrame from a dictionary of Series results in lost indices and NaNs

dict_with_series = {'Even': pd.Series([2, 4, 6, 8, 10]), 'Odd': pd.Series([1, 3, 5, 7, 9])}
Data_frame_using_dic_Series = pd.DataFrame(dict_with_series)
# Data_frame_using_dic_Series = pd.DataFrame(dict_with_series, index=[1,2,3,4,5]) gives a NaN value, I don't know why
display(Data_frame_using_dic_Series)
I tried labeling the index, but when I did, it eliminated the first row and instead printed an extra row at the bottom with NaN values. Can anyone explain why it behaves like this? Have I done something wrong?
If I don't use the index labeling argument, it works fine.
When you run:
Data_frame_using_dic_Series = pd.DataFrame(dict_with_series,index=[1,2,3,4,5])
You request only the index labels 1-5 from the provided Series, but a Series is indexed from 0 by default (0-4 here). The constructor aligns on those labels, so label 0 is dropped and label 5 has no match and is filled with NaN; this is a reindexing, not a relabeling.
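To see the alignment in action with just one of the Series (a minimal sketch; the dtype also becomes float because of the NaN):
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])  # default index 0-4
print(pd.DataFrame({'Even': s}, index=[1, 2, 3, 4, 5]))
#    Even
# 1   4.0
# 2   6.0
# 3   8.0
# 4  10.0
# 5   NaN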
If you want to change the index, do it afterwards:
Data_frame_using_dic_Series = (pd.DataFrame(dict_with_series)
                               .set_axis([1, 2, 3, 4, 5])
)
Output:
Even Odd
1 2 1
2 4 3
3 6 5
4 8 7
5 10 9

Store values from rows of a DataFrame and use them in another operation

I have a DataFrame that I read from a CSV file, and I want to store the individual values from its rows in some variables. I then want to use those values in another step to perform another operation. Note that I do not want the result as a Series, but as plain values such as integers. I am still learning and could not understand the resources I have consulted. Thank you in advance.
X  Y  Z
1  2  3
3  2  1
4  5  6
I want the values in a variable as x=1,3,4 and so on, as stated above.
There are many ways you can do this, but one simple method is to use the index attribute. Other people may give other methods, but let me illustrate the index approach here. I will create a dictionary, change it to a DataFrame, and then iterate over its rows.
# Start by importing pandas as pd
import pandas as pd

# Proceed by defining a dictionary that contains some club stats (just for
# illustration, not real data)
myData = {'Football Club': ['Chelsea', 'Man Utd', 'Inter Milan', 'Everton'],
          'Matches Played': [2, 32, 36, 37],
          'Goals Scored': [1, 12, 24, 25],
          'Assist Given': [0, 0, 11, 6],
          'Red card': [0, 0, 0, 0],
          'Yellow Card': [0, 4, 4, 3]}

# Next, create a DataFrame from the dictionary from the previous step
df = pd.DataFrame(myData, columns=['Football Club', 'Matches Played',
                                   'Goals Scored', 'Red card', 'Yellow Card'])

# See what the data look like
print("This is the created Dataframe from the dictionary:\n", df)
print("\nNow, you can iterate over selected rows or all the rows using the index attribute as follows:\n")

# Store the values in variables
for indIte in df.index:
    clubs = df['Football Club'][indIte]
    goals = df['Goals Scored'][indIte]
    matches = df['Matches Played'][indIte]
    # To see the results that can be used later in the same program
    print(clubs, matches, goals)
#You will get the following results:
This is the created Dataframe from the dictionary:
Football Club Matches Played Goals Scored Red card Yellow Card
0 Chelsea 2 1 0 0
1 Man Utd 32 12 0 4
2 Inter Milan 36 24 0 4
3 Everton 37 25 0 3
Now, you can iterate over selected rows or all the rows using the index attribute as follows:
Chelsea 2 1
Man Utd 32 12
Inter Milan 36 24
Everton 37 25
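As a side note (my addition, not part of the answer above): itertuples is often a more idiomatic way to pull plain Python values out of each row. A small sketch using the question's own X/Y/Z data:
import pandas as pd

df = pd.DataFrame({'X': [1, 3, 4], 'Y': [2, 2, 5], 'Z': [3, 1, 6]})
for row in df.itertuples(index=False):
    x, y, z = row.X, row.Y, row.Z  # plain scalars, not Series
    print(x, y, z)
# 1 2 3
# 3 2 1
# 4 5 6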
Use:
x, y, z = df.to_dict(orient='list').values()
>>> x
[1, 3, 4]
>>> y
[2, 2, 5]
>>> z
[3, 1, 6]
df.values is the DataFrame's data as a numpy array, so you can also manipulate df.values for subsequent processing.
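For instance, with the X/Y/Z DataFrame from the question (a small sketch; .to_numpy() is the recommended modern spelling of .values):
arr = df.values    # 2D numpy array of shape (3, 3)
x = arr[:, 0]      # first column: array([1, 3, 4])
first = arr[0, 0]  # a single plain value: 1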

Is there a more efficient or concise way to divide a df according to a list of indexes?

I'm trying to slice/divide the following dataframe
df = pd.DataFrame(
    {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
)
according to a list of indexes to split on :
[5, 7, 9]
The first and last items of the list are not the first and last indexes of the dataframe. I'm trying to get the following four dataframes as a result (defined by the three given indexes plus the beginning and end of the original df), each assigned to its own variable:
time value
0 4 0
1 10 0
2 15 0
3 6 50
4 0 100
time value
5 20 0
6 40 0
time value
7 11 70
8 9 100
time value
9 12 0
10 11 100
11 25 20
My current solution gives me a list of dataframes that I could then assign to variables manually by list index, but the code is a bit complex, and I'm wondering if there's a simpler/more efficient way to do this.
indexes = [5, 7, 9]
indexes.insert(0, 0)
indexes.append(df.index[-1] + 1)
i = 0
df_list = []
while i + 1 < len(indexes):
    df_list.append(df.iloc[indexes[i]:indexes[i+1]])
    i += 1
This is all coming off of my attempt to answer this question. I'm sure there's a better approach to that answer, but I did feel like there should be a simpler way to do this kind of slicing than what I thought of.
You can use np.split:
df_list = np.split(df, indexes)
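np.split takes the split points directly, with no need to prepend 0 or append the final index, and returns a list you can unpack into separate variables. A small usage sketch (splitting a DataFrame this way may emit warnings on newer pandas/NumPy versions; a list comprehension over iloc, as in the question's own loop, is an assumption-free fallback):
import numpy as np

df1, df2, df3, df4 = np.split(df, [5, 7, 9])
print(df2)
#    time  value
# 5    20      0
# 6    40      0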

Is it possible to choose the right rows in an ordered sequence of events without loops?

Example rows:
B1
S1
B2
B3/S2
B4
B5
B6/S4
S3
Rules:
A row can be B (buy), S (sell), or both
It is known which sell belongs to which buy and vice versa
Buys are ordered; sells are possibly not ordered
When a buy has no matching sell, all the subsequent buys are discarded
We want all the buy rows such that, if there is a sell for a row, all the buy rows from that point up to the respective sell row are discarded
This can be done with a simple loop that skips the overlapping buys, but implementing it with vectorized operations has been challenging, and I am wondering if it is possible.
The most promising method I tried was padding (forward-filling) the indexes of the buys and backfilling the indexes of the sells, then making sense of the possible combinations, although I am not sure they can give a unique view of the state...
Output from example would be:
B1
B2
B4
Here is a suggestion using pandas. I don't know if it is more efficient than what you are doing, but if the goal is to avoid looping, I think this will do it.
I will assume your buy/sell data can be split into two dataframes, one for buys and one for sells. I also add a 'Time' column to each frame, i.e. when the order to buy/sell is placed. Putting your data in a dataframe and then splitting it into the two abovementioned dataframes is probably an easy exercise, so I will skip it.
import pandas as pd

# Your data split into two frames (for instance, in df_buy, Num=2 would be
# equivalent to B2 occurring at the second, zero-indexed, time step)
df_buy = pd.DataFrame({'Num': [1, 2, 3, 4, 5, 6],
                       'Time': [0, 2, 3, 4, 5, 6]})
# S1, S2, S4, S3 happening at times 1, 3, 6 and 7
df_sell = pd.DataFrame({'Num': [1, 2, 4, 3],
                        'Time': [1, 3, 6, 7]})
# Merge buy/sell to find all possible trades
df_trades = pd.merge(df_buy, df_sell, on='Num', suffixes=['_Buy', '_Sell'])
# Order all trades according to the time at which they would happen, i.e. Time_Sell
# (or perhaps at max(Time_Sell, Time_Buy)?)
df_trades.sort_values(by='Time_Sell', inplace=True)
# Only trades that happen in increasing order are allowed, so we filter out
# the trades that happen in decreasing order (i.e. trade 3 cannot come after trade 4)
df_final = df_trades[df_trades['Num'].sub(df_trades['Num'].shift(), fill_value=0) >= 0]
# Here we have Num = 1, 2, 4, i.e. B1/S1, B2/S2 and B4/S4
Out[11]:
Num Time_Buy Time_Sell
0 1 0 1
1 2 2 3
3 4 4 6
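One caveat worth noting (my observation, not part of the answer above): the shift-based filter compares each trade only to the immediately preceding row, including rows that were themselves filtered out. Comparing against the running maximum instead checks each trade against everything kept so far, which may match the intent more closely; on this data both give Num = 1, 2, 4:
df_final = df_trades[df_trades['Num'] >= df_trades['Num'].cummax()]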

Creating columns with numpy Python

I have some elements stored in a numpy array. I wish to store them in a ".txt" file. The catch is that it needs to fit a certain standard, which means each element needs to start at a fixed position (line and column) in the file.
Example:
numpy.array[0] needs to start in line 1, col 26.
numpy.array[1] needs to start in line 1, col 34.
I use numpy.savetxt() to save the arrays to file.
Later I will implement this in a loop to create a large ".txt" file with coordinates.
Edit: This good example was provided below; it illustrates my struggle:
In [117]: np.savetxt('test.txt',A.T,'%20d %10d')
In [118]: cat test.txt
                   0          6
                   1          7
                   2          8
                   3          9
                   4         10
                   5         11
The fmt option '%20d %10d' gives spacing which depends on the number of digits in each integer. What I need is an option which lets me set the spacing from the left side, regardless of the other integers.
Template the numbers need to fit into:
XXXXXXXX.XXX YYYYYYY.YYY ZZZZ.ZZZ
Final Edit:
I solved it by creating a test which checks how many spaces the last float used. I was then able to predict the number of spaces the next float needed to fit the template.
Have you played with the fmt of np.savetxt?
Let me illustrate with a concrete example (the sort that you should have given us)
Make a 2 row array:
In [111]: A=np.arange((12)).reshape(2,6)
In [112]: A
Out[112]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11]])
Save it, and get 2 rows, 6 columns
In [113]: np.savetxt('test.txt',A,'%d')
In [114]: cat test.txt
0 1 2 3 4 5
6 7 8 9 10 11
save its transpose, and get 6 rows, 2 columns
In [115]: np.savetxt('test.txt',A.T,'%d')
In [116]: cat test.txt
0 6
1 7
2 8
3 9
4 10
5 11
Put more detail into fmt to space out the columns
In [117]: np.savetxt('test.txt',A.T,'%20d %10d')
In [118]: cat test.txt
                   0          6
                   1          7
                   2          8
                   3          9
                   4         10
                   5         11
I think you can figure out how to make a fmt string that puts your numbers in the correct columns (join 26 spaces etc, or use left and right justification - the usual Python formatting issues).
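For example, here is a sketch aimed at the XXXXXXXX.XXX YYYYYYY.YYY ZZZZ.ZZZ template from the question (the exact field widths are my reading of that template, so adjust as needed). Left-justified fixed-width fields make each column start at the same offset regardless of the values' magnitudes:
import numpy as np

coords = np.array([[12345.678, 9876.543, 12.345],
                   [1.0, 2.0, 3.0]])
# '%-13.3f': left-justified, 13 characters wide, 3 decimals, so the next
# field always starts at a fixed column independent of this value's size
np.savetxt('coords.txt', coords, fmt='%-13.3f%-12.3f%-8.3f')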
savetxt also takes an opened file. So you can open a file for writing, write one array, add some filler lines, and write another. Also, savetxt doesn't do anything fancy. It just iterates through the rows of the array, and writes each row to a line, e.g.
for row in A:
    file.write(fmt % tuple(row))
So if you don't like the control that savetxt gives you, write the file directly.
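A minimal sketch of that multi-array approach (the file name and filler text are placeholders):
import numpy as np

A = np.arange(12).reshape(2, 6)
B = np.arange(6).reshape(2, 3)

with open('combined.txt', 'w') as f:
    np.savetxt(f, A, fmt='%d')
    f.write('# filler line between blocks\n')
    np.savetxt(f, B, fmt='%d')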
