I am new to Python. I want to identify a motif sequence in a large protein data set.
Using the one-line code below, I was able to identify the proteins I am interested in. However, I also want the start and end positions of the motif in these proteins. It would be helpful if someone could suggest what additional arguments I need to use with the code below.
Thank you in advance.
import re
df.loc[df['Protein_sequence'].str.contains("WA[T]R", regex=True)]
Protein_name Protein_sequence
242 >PST130_487694 MLRFFRLAALVLLMTSWEVAGDTYDPKTKTTYFGCHKNVDAVCSEP...
358 >Pucstr1_10722 MLRFFRSIALVWLMASWEVSTAGKYPNNPDPVNGAKYFGCHKNVDA...
475 >Pucstr1_2774 MLRFLILTALVLLVASWQVTDTLSQDPGDILFWCHKNVDAVCSETI...
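One option using re.search (note that WAT?R makes the T optional, whereas WA[T]R in the question requires it):
import re
import pandas as pd

pat = re.compile('WAT?R')
# For each sequence, take the (start, end) span of the first match, or (NA, NA) if there is none
out = df.join(pd.DataFrame([m.span(0) if (m := pat.search(x)) else (pd.NA,) * 2
                            for x in df['Protein_sequence']],
                           index=df.index, columns=['start', 'end'])
              ).dropna(subset=['start', 'end'], how='all')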
Output (on a modified input):
Protein_name Protein_sequence start end
242 >PST130_487694 MLRFFRLAALVLLMTSWARAGDTYDPKTKTTYFGCHKNVDAVCSEP... 16 19
358 >Pucstr1_10722 MLRFFRSIALVWATRSWEVSTAGKYPNNPDPVNGAKYFGCHKNVDA... 11 15
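The approach above reports only the first hit per sequence. If the motif can occur several times in one protein, a minimal sketch with re.finditer could collect every span. The helper name motif_spans and the sample sequences below are made up for illustration and are not taken from the real data set.

```python
import re

# Hypothetical helper: return (start, end) for every motif hit in one sequence.
# The end position is exclusive, as with m.span().
def motif_spans(sequence, pattern=r"WAT?R"):
    return [m.span() for m in re.finditer(pattern, sequence)]

# Made-up sequences for illustration only.
sequences = {
    "seq_a": "MLRFFWARAGDTWATRKK",
    "seq_b": "MLRFFRSIALV",
}
spans = {name: motif_spans(seq) for name, seq in sequences.items()}
print(spans)
```

With pandas, something like df['Protein_sequence'].apply(motif_spans) should give a column holding one list of spans per protein.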
I'm currently trying to convert character-level spans to token-level spans and am wondering if there's a functionality in the library that I may not be taking advantage of.
The data that I'm currently using consists of "proper" text (I say "proper" as in it's written as if it's a normal document, not with things like extra whitespaces for easier split operations) and annotated entities. The entities are annotated at the character level but I would like to obtain the tokenized subword-level span.
My plan was to first convert character-level spans to word-level spans, then convert that to subword-level spans. A piece of code that I wrote looks like this:
new_text = []
for word in original_text.split():
    # Detach a trailing punctuation mark so it becomes its own word
    if (len(word) > 1) and (word[-1] in ['.', ',', ';', ':']):
        new_text.append(word[:-1] + ' ' + word[-1])
    else:
        new_text.append(word)
new_text = ' '.join(new_text).split()

# Map each word index to its (start, end) character span
word2char_span = {}
start_idx = 0
for idx, word in enumerate(new_text):
    char_start = start_idx
    char_end = char_start + len(word)
    word2char_span[idx] = (char_start, char_end)
    start_idx += len(word) + 1
This seems to work well but one edge case I didn't think of is parentheses. To give a more concrete example, one paragraph-entity pair looks like this:
>>> original_text = "RDH12, a retinol dehydrogenase causing Leber's congenital amaurosis, is also involved in \
steroid metabolism. Three retinol dehydrogenases (RDHs) were tested for steroid converting abilities: human and \
murine RDH 12 and human RDH13. RDH12 is involved in retinal degeneration in Leber's congenital amaurosis (LCA). \
We show that murine Rdh12 and human RDH13 do not reveal activity towards the checked steroids, but that human type \
12 RDH reduces dihydrotestosterone to androstanediol, and is thus also involved in steroid metabolism. Furthermore, \
we analyzed both expression and subcellular localization of these enzymes."
>>> entity_span = [139, 143]
>>> print(original_text[139:143])
'RDHs'
This example actually returns a KeyError when I try to refer to (139, 143) because the adjustment code I wrote takes (RDHs) as the entity rather than RDHs. I don't want to hardcode parentheses handling either because there are some entities where the parentheses are included.
I feel like there should be a simpler approach to this issue and I'm overthinking things a bit. Any feedback on how I could achieve what I want is appreciated.
Tokenization is tricky. I'd suggest using spaCy to process your data, as it gives you the character-level offset of each token in the source text, which should make mapping your character spans to tokens straightforward:
import spacy
nlp = spacy.load("en_core_web_sm")
original_text = "Three retinol dehydrogenases (RDHs) were tested for steroid converting abilities:"
# Process data
doc = nlp(original_text)
for token in doc:
    print(token.idx, token, sep="\t")
Output:
0 Three
6 retinol
14 dehydrogenases
29 (
30 RDHs
34 )
36 were
41 tested
48 for
52 steroid
60 converting
71 abilities
80 :
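To illustrate the mapping step itself without depending on spaCy, here is a rough sketch that uses a simple regex tokenizer from the standard library. The tokenizer pattern and the helper names token_offsets and char_span_to_token_span are my own assumptions, not part of any library. With spaCy, Doc.char_span(start, end) should do a similar job directly.

```python
import re

def token_offsets(text):
    # Words (letters, digits, apostrophes) or single punctuation marks,
    # each with its character start and end in the original text.
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"[\w']+|[^\w\s]", text)]

def char_span_to_token_span(tokens, char_start, char_end):
    # Indices of all tokens that lie fully inside [char_start, char_end).
    inside = [i for i, (s, e, _) in enumerate(tokens)
              if s >= char_start and e <= char_end]
    if not inside:
        return None
    return inside[0], inside[-1] + 1  # token end index is exclusive

text = "Three retinol dehydrogenases (RDHs) were tested for steroid converting abilities:"
tokens = token_offsets(text)
start = text.index("RDHs")
token_span = char_span_to_token_span(tokens, start, start + len("RDHs"))
print(token_span, [t[2] for t in tokens[token_span[0]:token_span[1]]])
```

Because parentheses become tokens of their own, an entity annotated with or without them maps to a valid token range either way.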
I'm trying to find and extract the date and time in a column that contains text sentences. The example data is as below.
df = {'Id': ['001', '002',...],
      'Description': ['THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. THE INTERUPTION ALSO INVOLVED A, B, C AND SOME OTHER TOWN AREAS. OTC AND SST SERVICES INTERRUPTED AS GENSET ALSO WORKING AT THAT TIME. WE CALL FOR SERVICE. THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM',
                      'today is 23/3/2013 #10:AM we have',...],
      ....
      }
df = pd.DataFrame(df, columns=['Id', 'Description'])
I have tried the datefinder library as shown below, but it returns today's date, which is wrong.
import datefinder as dtf

findDate = dtf.find_dates(le['Description'][0])
for dates in findDate:
    print(dates)
Does anyone know the best way to extract it and automatically put it into a new column? Or does anyone know of a library that can calculate the duration between times and dates in a text string? Thank you.
So you have two issues here:
You want to know how to apply a function to a DataFrame.
You want a function that extracts a pattern from a piece of text.
Here is how to apply a function to a Series (selecting a single column, as I did, gives you a Series). Bonus points: read the DataFrame.apply() and Series.apply() documentation (30s) to become a Pandas-chad!
def do_something(x):
    # Placeholder: replace with your own logic; the return value becomes the new cell value
    return x

df['new_text_column'] = df['original_text_column'].apply(do_something)
And here is one way to extract patterns from a string using regexes. Read the regex docs (or follow a course) and play around with RegExr to become an omniscient god (that is, if you use a command line on Linux, along with your regex knowledge).
Modified from: How to extract the substring between two markers?
import re

text = 'gfgfdAAA1234ZZZuijjk'
# Search for the first run of digits (raw string, so \d is not treated as a string escape).
m = re.search(r'\d+', text)
if m:
    found = m.group(0)
# found: 1234
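Putting the two pieces together for the question's data, a rough sketch could look like the following. The two patterns are my own guesses based on the sample row (dates like 27.1.2020 or 23/3/2013, times like 10.30AM) and will probably need extending for other formats, such as the "10:AM" in the second sample row. The function name is hypothetical.

```python
import re

# Assumed formats: day.month.year or day/month/year, and h.mmAM / h:mmPM style times.
DATE_RE = re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{4}\b")
TIME_RE = re.compile(r"\b\d{1,2}[.:]\d{1,2}\s?(?:AM|PM)\b", re.IGNORECASE)

def extract_dates_and_times(text):
    # Return every date-like and time-like substring found in the text.
    return {"dates": DATE_RE.findall(text), "times": TIME_RE.findall(text)}

# Shortened version of the first description in the question.
sample = ("THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. "
          "THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM")
found = extract_dates_and_times(sample)
print(found)
```

The function can then be passed to df['Description'].apply(...) as described above. The extracted strings would still have to be parsed (for example with datetime.strptime) before any duration can be calculated.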
joined_Gravity1.head()
Comments
____________________________________________________
0 Why the old Pike/Lyrik?
1 This is good
2 So clean
3 Looks like a Decoy
Input: type(joined_Gravity1)
Output: pandas.core.frame.DataFrame
The following code allows me to select the rows whose comments contain the keyword "ender":
joined_Gravity1[joined_Gravity1["Comments"].str.contains("ender", na=False)]
Output:
Comments
___________________________
194 We need a new Sender 😂
7 What about the sender
179 what about the sender?😏
How can I revise the code to also match words similar to 'Sender', such as 'snder' or 'bnder'?
I don't see a reason why regex=True inside the contains function won't work here.
joined_Gravity1[joined_Gravity1["Comments"].str.contains(pat="ender|snder|bnder", na=False, regex=True)]
I have used "ender|snder|bnder" only. You can make a list of all such words, say list_words, and pass pat='|'.join(list_words) to the contains function above.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
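As a small illustration of building the pattern from a list, here is a sketch that uses only the re module so it can be run on its own. The comments list is made up, and re.escape is an extra precaution I added in case a word contains regex metacharacters.

```python
import re

list_words = ["ender", "snder", "bnder"]
# Escape each word so characters like '.' or '?' are matched literally.
pat = "|".join(re.escape(w) for w in list_words)

# Made-up comments for illustration.
comments = ["We need a new Sender", "This is good", "what about the snder?", "So clean"]
matches = [c for c in comments if re.search(pat, c)]
print(pat)
print(matches)
```

The same pat string can be passed to str.contains(pat=pat, na=False, regex=True) on the Comments column.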
There can be a massive number of possible letter combinations in such words. What you are trying to do is a fuzzy match between two strings. I can recommend using the following -
#!pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process
word = 'sender'
others = ['bnder', 'snder', 'sender', 'hello']
process.extractBests(word, others)
[('sender', 100), ('snder', 91), ('bnder', 73), ('hello', 18)]
Based on this you can decide which threshold to choose and then mark the ones that are above the threshold as a match (using the code you used above)
Here is a method to do this in your exact problem statement with a function -
df = pd.DataFrame(['hi there i am a sender',
'I dont wanna be a bnder',
'can i be the snder?',
'i think i am a nerd'], columns=['text'])
# s = sentence, w = match word, t = match threshold
def get_match(s, w, t):
    ss = process.extractBests(w, s.split())
    return any([i[1] > t for i in ss])

# What it's doing: match each word in each row of df.text against
# the word 'sender' and see if any of the words has a match ratio
# greater than the threshold of 70.
df['match'] = df['text'].apply(get_match, w='sender', t=70)
print(df)
text match
0 hi there i am a sender True
1 I dont wanna be a bnder True
2 can i be the snder? True
3 i think i am a nerd False
Tweak the t value from 70 to 80 if you want a stricter match, or lower it for a more relaxed match.
Finally you can filter it out -
df[df['match']==True][['text']]
text
0 hi there i am a sender
1 I dont wanna be a bnder
2 can i be the snder?
from difflib import get_close_matches

def closeMatches(patterns, word):
    print(get_close_matches(word, patterns))

list_patterns = joined_Gravity1[joined_Gravity1["Comments"].str.contains("ender", na=False)]
word = 'Sender'
# get_close_matches expects a list of strings, so pass the individual words of the comments
patterns = list_patterns["Comments"].str.split().explode().tolist()
closeMatches(patterns, word)
I'm very new to coding and only know the very basics. I am using Python and trying to print everything between two sentences in a text. I only want the content in between, not what comes before or after. It's probably very easy, but I couldn't figure it out.
Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) Ev 134 Haukelifjell Hordaland / Telemark — Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)
I want to collect the bold text to use on a website later. Everything except the italic text (the sentences before and after) is dynamic, if that matters.
You can use split to cut the string and access the parts that you are interested in.
If you know how to get the full text already, it's easy to get the bold sentence by removing the two constant sentences before and after.
full_text = "Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) Ev 134 Haukelifjell Hordaland / Telemark — Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
s1 = "Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom)"
s2 = "Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
bold_text = full_text.split(s1)[1] # Remove the left part.
bold_text = bold_text.split(s2)[0] # Remove the right part.
bold_text = bold_text.strip() # Clean up spaces on each side if needed.
print(bold_text)
It looks like a job for regular expressions, there is the re module in Python.
You should:
Open the file
Read its content in a variable
Use search or match function in the re module
In particular, in the last step you should use your "surrounding" strings as "delimiters" and capture everything between them. You can achieve this using a regex pattern like str1 + "(.*)" + str2 (escape the strings with re.escape if they contain special characters such as parentheses).
You can take a look at the regex documentation, but just to give you an idea:
".*" matches any sequence of characters (except newlines, by default)
"()" allows you to capture the content inside them and access it later with an index (e.g. re.search(pattern, original_string).group(1))
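The steps above can be sketched as follows. This is only an illustration: the helper name text_between is made up, the sample text is a shortened version of the one in the question, and I use a non-greedy (.*?) with re.DOTALL so the match stops at the first closing delimiter and may span several lines.

```python
import re

def text_between(full_text, before, after):
    # Escape both delimiters because they may contain characters such as '(' and ')'.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    m = re.search(pattern, full_text, re.DOTALL)
    return m.group(1).strip() if m else None

# Shortened sample text; in practice this would be read from the file.
full_text = ("No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) "
             "Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 "
             "Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)")
result = text_between(full_text, "(Liamyrane bom - Fjellstad bom)", "Rv 3 Kvikne")
print(result)
```

Returning None when nothing matches avoids an AttributeError from calling group() on a failed search.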
I am trying to write a regular expression that returns the part of a string that comes after a given substring. For example, I want to get the part of the string, including spaces, that comes after "15/08/2017".
a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
Is there a way to get 'AFFIDAVIT OF' and 'CASH & MTGE' as separate strings?
Here is the expression I have pieced together so far:
doc = (a.split('15/08/2017', 1)[1]).strip()
'AFFIDAVIT OF CASH & MTGE'
Not a regex-based solution, but it does the trick.
a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342    15/08/2017    AFFIDAVIT OF    CASH & MTGE'''
doc = (a.split('15/08/2017', 1)[1]).strip()
# The columns in the last line are separated by runs of spaces, so
# split on two spaces instead of one to get the desired result
print(doc.split("  ")[0].strip()) # outputs AFFIDAVIT OF
print(doc.split("  ")[-1].strip()) # outputs CASH & MTGE
Hope it helps.
re based code snippet
import re
foo = '''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
pattern = r'.*\d{2}/\d{2}/\d{4}\s+(\w+\s+\w+)\s+(\w+\s+.*\s+\w+)'
result = re.findall(pattern, foo, re.MULTILINE)
print("1st match: ", result[0][0])
print("2nd match: ", result[0][1])
Output
1st match: AFFIDAVIT OF
2nd match: CASH & MTGE
We can try using re.findall with the following pattern:
PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)
Searching in multiline and DOTALL mode, the above pattern will match everything occurring between PHASED OF until, but not including, CONDOMINIUM PLAN.
import re

# Named text rather than input, to avoid shadowing the built-in input()
text = "182 246 612 01/10/2018 PHASED OF CASH & MTGE\n CONDOMINIUM PLAN"
result = re.findall(r'PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)', text, re.DOTALL|re.MULTILINE)
output = result[0][0].strip()
print(output)
CASH & MTGE
Note that I also strip off whitespace from the match. We might be able to modify the regex pattern to do this, but in a general solution, maybe you want to keep some of the whitespace, in certain cases.
Why regular expressions?
It looks like you know the exact delimiting string, just str.split() by it and get the first part:
In [1]: a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
In [2]: a.split("15/08/2017", 1)[0]
Out[2]: '172 211 342 '
I would avoid matching the whole line with a single regex here, because the only meaningful separator between the logical terms appears to be two or more spaces. Individual terms, including the one you want to match, may themselves contain single spaces. So I recommend doing a regex split on the input using \s{2,} as the pattern. This yields a list containing all the terms. Then we can walk down the list once, and when we find the term that follows the one we want (here, the date), we return the previous term in the list.
import re

# Columns are separated by runs of two or more spaces
a = "172 211 342    15/08/2017    TRANSFER OF LAND    $610,000    CASH & MTGE"
parts = re.split(r"\s{2,}", a)
print(parts)
for i in range(1, len(parts)):
    if parts[i] == "15/08/2017":
        print(parts[i-1])
['172 211 342', '15/08/2017', 'TRANSFER OF LAND', '$610,000', 'CASH & MTGE']
172 211 342
Use a positive lookbehind assertion:
m=re.search('(?<=15/08/2017).*', a)
m.group(0)
You have to return the right group:
re.match("(.*?)15/08/2017",a).group(1)
You need to use group(1):
import re
re.match("(.*?)15/08/2017",a).group(1)
Output
'172 211 342 '
Building on your expression, this is what I believe you need:
import re
a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
re.match(r"(.*?)(\w+/)", a).group(1)
Output:
'172 211 342 '
You can do this by using group(1)
re.match("(.*?)15/08/2017",a).group(1)
UPDATE
For updated string you can use .search instead of .match
re.search("(.*?)15\/08\/2017",a).group(1)
Your problem is that your string is formatted the way it is.
The line you are looking for is
182 246 612 01/10/2018 PHASED OF CASH & MTGE
And then you are looking for whatever comes after 'PHASED OF' and some spaces.
You want to search for
(?<=PHASED OF)\s*(?P<your_text>.*?)\n
in your string. This will return a match object containing the value you are looking for in the named group your_text.
m = re.search(r'(?<=PHASED OF)\s*(?P<your_text>.*?)\n', a)
your_desired_text = m.group('your_text')
Also: there are many good online regex testers where you can fiddle around with your regexes.
Once the regex works there, just copy and paste it into Python.
I use this one: https://regex101.com/
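For reference, a self-contained run of the named-group search might look like this. The sample string is the one used in another answer here, and the None check is my addition, since re.search returns None when nothing matches.

```python
import re

# Sample line followed by a newline, as in the other answer's example.
a = "182 246 612 01/10/2018 PHASED OF CASH & MTGE\n CONDOMINIUM PLAN"

# Lookbehind anchors the match right after 'PHASED OF'; the named group
# lazily captures everything up to the end of that line.
m = re.search(r"(?<=PHASED OF)\s*(?P<your_text>.*?)\n", a)
your_desired_text = m.group("your_text") if m else None
print(your_desired_text)
```

Note that the pattern requires a newline after the value, so it would not match if the line were the very last one in the string without a trailing newline.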