How to not match string not containing two consecutive newlines

How to not match string not containing two consecutive newlines - python

Demo at regex101. I have the following text file (a bibtex .bbl file):
\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
De~Franceschi, Romano, Aquino, Dodson, and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., L.~Spogli, G.~De~Franceschi, V.~Romano, M.~Aquino, A.~Dodson, and
C.~N. Mitchell (2011{\natexlab{a}}), Bipolar climatology of {GPS} ionospheric
scintillation at solar minimum, \textit{Radio Science}, \textit{46}(3),
\doi{10.1029/2010RS004571}.
\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and
Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., L.~Spogli, J.~Tong, G.~De~Franceschi, V.~Romano, A.~Bourdillon,
M.~Le~Huy, and C.~Mitchell (2011{\natexlab{b}}), {GPS} scintillation and
{TEC} gradients at equatorial latitudes in april 2006, \textit{Advances in
Space Research}, \textit{47}(10), 1750--1757,
\doi{10.1016/j.asr.2010.04.020}.
\bibitem[{\textit{Anghel et~al.}(2008)\textit{Anghel, Astilean, Letia, and
Komjathy}}]{anghel2008nrm}
Anghel, A., A.~Astilean, T.~Letia, and A.~Komjathy (2008), Near real-time
monitoring of the ionosphere using dual frequency {GPS} data in a kalman
filter approach, in \textit{{IEEE} International Conference on Automation,
Quality and Testing, Robotics, 2008. {AQTR} 2008}, vol.~2, pp. 54--58,
\doi{10.1109/AQTR.2008.4588793}.
\bibitem[{\textit{Baker and Wing}(1989)}]{baker1989nmc}
Baker, K.~B., and S.~Wing (1989), A new magnetic coordinate system for
conjugate studies at high latitudes, \textit{Journal of Geophysical Research:
Space Physics}, \textit{94}(A7), 9139--9143, \doi{10.1029/JA094iA07p09139}.
I want to match the whole \bibitem command for a single entry (with some capture groups) if I know the reference code at the end of the command. I use this regex, which works for the first entry, but not for the rest (second entry exemplified below):
\\bibitem\[{(.*?)\((.*?)\)(.*?)}\]{alfonsi2011gsa}
This doesn't work, since it matches everything from the start of the first \bibitem command to the end of the second \bibitem command. How can I match only the second \bibitem command? I have tried using a negative lookahead for ^$ and \n\n, but I couldn't get either to work - basically, I want the third (.*?) to match any string not including two consecutive newlines. (If there's any other way to do this, I'm all ears.)

You can use negative look-arounds (?!) to prevent the match from having multiple occurrences of 'bibitem'. With this, the match will start with the 'bibitem' which immediately precedes your reference code. This seems to work:
\\bibitem\[{(((?!bibitem).)*?)\((((?!bibitem).)*?)\)(((?!bibitem).)*?)}\]{alfonsi2011gsa}

regex is not my strong point but this will get all the content you want without reading all the content into memory at once:
from itertools import groupby
import re
with open("file.txt") as f:
r = re.compile(r"\[{(.*?)\((.*?)\)(.*?)}\]\{alfonsi2011gsa\}")
for k, v in groupby(map(str.strip, f), key=lambda x: bool(x.strip())):
match = r.search("".join(v))
if match:
print(match.groups())
('\\textit{Alfonsi et~al.}', '2011{\\natexlab{b}}', '\\textit{Alfonsi, Spogli,Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, andMitchell}')

Related

How to extract all comma delimited numbers inside () bracked and ignore any text

I am trying to extract the comma delimited numbers inside () brackets from a string. I can get the numbers if that are alone in a line. But i cant seem to find a solution to get the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I current use in python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(')+1:line.find(')')]
if not re.search('[a-zA-Z]', refline):

Remove the ^, $ is whats preventing you from getting all the numbers. And gm flags wont work in python re.
You can change your regex to :([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern tells to match numbers which begin with 1-9 followed by one or more digits, only if the numbers begin with or end with either comma or brackets.

Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: At least one number (would \d+ work too?)
(?:,[1-9]\d*)*: Zero or multiple numbers after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]

Using the pypi regex module you could also use capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Regex demo | Python demo
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for _, m in enumerate(matches, start=1):
print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}

Delete regex matches within loop and continue with updated string version

I have a string that I want to run through four wordlists, one with four-grams, one with tri-grams, one with bigrams and one with single terms. To avoid that a word of the single term wordlist gets counted twice when it also forms part of a bigram or trigrams for example, I start with counting for four-grams, then want to update the string in terms of removing the matches to only check the remaining part of the string for matches of trigrams, bigrams and single terms, respectively. I have used the following code and illustrate it here just starting with fourgrams and then trigrams:
financial_trigrams_count=0
financial_fourgrams_count=0
strn="thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."
pattern_fourgrams=["value to the business", "car and truck sales"]
pattern_trigrams=["cash flow statement", "chief financial officer"]
for i in pattern_fourgrams:
financial_fourgrams_count=financial_fourgrams_count+strn.count(i)
new_strn=strn
def clean_text1(pattern_fourgrams, new_strn):
for r in pattern_fourgrams:
new_strn = re.sub(r, '', new_strn)
return new_strn
for i in pattern_trigrams:
financial_trigrams_count=financial_trigrams_count+new_strn.count(i)
new_strn1=new_strn
def clean_text2(pattern_trigrams, new_strn1):
for r in pattern_trigrams:
new_strn1 = re.sub(r, '', new_strn1)
return new_strn1
print(financial_fourgrams_count)
print(financial_trigrams_count)
word_count_wostop=len(strn.split())
print(word_count_wostop)
For fourgrams there is not match, so new_strn will be similar to strn. However, there is one match with trigrams ("chief financial officer"), however, I do not succees in deleteing the match from new_strn1. Instead, I again yield the full string, namely strn (or new_strn which is the same).
Could someone help me find the mistake here?

(As a complement to Tilak Putta's answer)
Note that you are searching the string twice: once when counting the occurrences of the ngrams with .count() and once more when you remove the matches using re.sub().
You can increase performance by counting and removing at the same time.
This can be done using re.subn. This function takes the same parameters as re.sub but returns a tuple containing the cleaned string as well as the number of matches.
Example:
for i in pattern_fourgrams:
new_strn, n = re.subn(r, '', new_strn)
financial_fourgrams_count += n
Note that this assumes the n-grams are pairwaise different (for fixed n), i.e. they shouldn't have a common word, since subn will delete that word the firs time it sees it and thus won't be able to find occurence of other ngrams containing that particular word.

you need to remove def
import re
financial_trigrams_count=0
financial_fourgrams_count=0
strn="thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."
pattern_fourgrams=["value to the business", "car and truck sales"]
pattern_trigrams=["cash flow statement", "chief financial officer"]
for i in pattern_fourgrams:
financial_fourgrams_count=financial_fourgrams_count+strn.count(i)
new_strn=strn
for r in pattern_fourgrams:
new_strn = re.sub(r, '', new_strn)
for i in pattern_trigrams:
financial_trigrams_count=financial_trigrams_count+new_strn.count(i)
new_strn1=new_strn
for r in pattern_trigrams:
new_strn1 = re.sub(r, '', new_strn1)
print(new_strn1)
print(financial_fourgrams_count)
print(financial_trigrams_count)
word_count_wostop=len(strn.split())
print(word_count_wostop)

regex to find image sequences from list of filenames

I need some help with a regex string to pull any filename that looks like it might be part of a frame sequence out of a previously generated list of filenames.
Frames in a sequence will generally have a minimum padding of 3 and will be preceeded by either a '.' or a '_' An exception is: if the filename is only made up of a number and the .jpg extension (e.g 0001.jpg, 0002.jpg, etc.). I'd like to capture all these in one line of regex, if possible.
Here's what I have so far:
(.*?)(.|_)(\d{3,})(.*)\.jpg
Now I know this doesn't do the "preceeded by . or _" bit and instead just finds a . or _ anywhere in the string to return a positive. I've tried a bit of negative lookbehind testing, but can't get the syntax to work.
A sample of data is:
test_canon_shot02.jpg
test_shot01-04.jpg
test_shot02-03.jpg
test_shot02-02.jpg
test_shot01-03.jpg
test_canon_shot03.jpg
test_shot01-02.jpg
test_shot02.jpg
test_canon_shot02.jpg
test_shot01.jpg
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
OrangeXmas2015_Print_A ct2.jpg
sh120_HF_V01-01.jpg
sh120_HF_V01-02.jpg
sh200_DMP_v04.jpg
sh120_HF_V04.jpg
sh120_HF_V03.jpg
sh120_HF_V02.jpg
blah_v02.jpg
blah_v01.jpg
blah_Capture0 4.jpg
blah_Capture03 .jpg
blah_Capture01. jpg
blah_Capture02.jpg
Wall_GraniteBlock_G rey_TC041813.jpg
Renders10_wire.jpg
Renders10.jpg
Renders09_wire.jpg
Renders09.jpg
Renders08_wire.jpg
Renders08.jpg
Renders07_wire.jpg
Renders07.jpg
Renders06_wire.jpg
Renders06.jpg
Renders05_wire.jpg
Renders05.jpg
Renders04_wire.jpg
Renders04.jpg
Renders03_wire.jpg
Renders03.jpg
Renders02_wire.jpg
Renders02.jpg
Renders01_wire.jpg
Renders01.jpg
archmodels58_057_carpinusbetulus_leaf_diffuse.jpg
archmodels58_042_bark_bump.jpg
archmodels58_023_leaf_diffuse.jpg
WINDY TECHNICZNE-reflect00.jpg
archmodels58_057_leaf_opacity.jpg
archmodels58_057_bark_reflect.jpg
archmodels58_057_bark_bump.jpg
blahC-00-oknaka.jpg
bed
debt
cab
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg
The result I'm after is 2 sequences identified:
GameAssets_.00000.jpg to GameAssets_.00024.jpg
00000.jpg to 00018.jpg

Based on the rules you specified in your question, this pattern should accomplish what you need:
(^|\r?\n|.*_|.*\.)\d{3,}.*\.jpg

for item in re.findall(r'.*?[._]?0{3,}.*',data):
print(item)
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg

Try
(.*?)(\.|_?)(000\d{0,})(.*)\.jpg
Notice that I had to escape the '.' in the second group. Also, I had to make the search for '.' and '_' optional in the second group. Finally, I had to add the minimum padding to the third group.
I used regex101.com to test and refine the regex: regex101

Python regex string groups capture

I have a number of medical reports from each which i am trying to capture 6 groups (groups 5 and 6 are optional):
(clinical details | clinical indication) + (text1) + (result|report) + (text2) + (interpretation|conclusion) + (text3).
The regex I am using is:
reportPat=re.compile(r'(Clinical details|indication)(.*?)(result|description|report)(.*?)(Interpretation|conclusion)(.*)',re.IGNORECASE|re.DOTALL)
works except on strings missing the optional groups on whom it fails.i have tried putting a question mark after group5 like so: (Interpretation|conclusion)?(.*) but then this group gets merged into group4. I am pasting two conflicting strings (one containing group 5/6 and the other without it) for people to test their regex. thanks for helping
text 1 (all groups present)
Technical Report:\nAdministrations:\n1.04 ml of Fluorine 18, fluorodeoxyglucose with aco - Bronchus and lung\nJA - Staging\n\nClinical Details:\nSquamous cell lung cancer, histology confirmed ?stage\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion. \n\nThere is a large mass noted in the left upper lobe proximally, with lower grade uptake within a collapsed left upper lobe. This lesi\n\nInterpretation: \nThe scan findings are in keeping with the known lung primary in the left upper lobe and involvement of the lymph nodes as dThere is no evidence of distant metastatic disease.
text 2 (without group 5 and 6)
Technical Report:\nAdministrations:\n0.81 ml of Fluorine 18, fluorodeoxyglucose with activity 312.79\nScanner: 3D Static\nPatient Position: Supine, Head First. Arms up\n\n\nDiagnosis Codes:\n- Bronchus and lung\nJA - Staging\n\nClinical Indication:\nNewly diagnosed primary lung cancer with cranial metastasis. PET scan to assess any further metastatic disease.\n\nScanner DST 3D\n\nSession 1 - \n\n.\n\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion.\n\nThere is increased FDG uptake in the right lower lobe mass abutting the medial and posterior pleura with central necrosis (maximum SUV 18.2). small nodule at the right paracolic gutte

It seems like that what you were missing is basically an end of pattern match to fool the greedy matches when combining with the optional presence of the group 5 & 6. This regexp should do the trick, maintaining your current group numbering:
reportPat=re.compile(
r'(Clinical details|indication)(.*)'
r'(result|description|report)(.*?)'
r'(?:(Interpretation|conclusion)(.*))?$',
re.IGNORECASE|re.DOTALL)
Changes done are adding the $ to the end, and enclosing the two last groups in a optional non-capturing group, (?: ... )?. Also note how you easily can make the entire regexp more readable by splitting the lines (which the interpreter will autoconnect when compiling).
Added: When reviewing the result of the matches I saw some :\n or : \n, which can easily be cleaned up by adding (?:[:\s]*)? inbetween the header and text groups. This is an optional non-capturing group of colons and whitespace. Your regexp does then look like this:
reportPat=re.compile(
r'(Clinical details|indication)(?:[:\s]*)?(.*)'
r'(result|description|report)(?:[:\s]*)?(.*?)'
r'(?:(Interpretation|conclusion)(?:[:\s]*)?(.*))?$',
re.IGNORECASE|re.DOTALL)
Added 2: At this link: https://regex101.com/r/gU9eV7/3, you can see the regex in action. I've also added some unit test cases to verify that it works against both texts, and that in for text1 it has a match for text1, and that for text2 it has nothing. I used this parallell to direct editing in a python script to verify my answer.

The following pattern works for both your test cases though given the format of the data you're having to parse I wouldn't be confident that the pattern will work for all cases (for example I've added : after each of the keyword matches to try to prevent inadvertent matches against more common words like result or description):
re.compile(
r'(Clinical details|indication):(.+?)(result|description|report):(.+?)((Interpretation|conclusion):(.+?)){0,1}\Z',
re.IGNORECASE|re.DOTALL
)
I grouped the last 2 groups and marked them as optional using {0,1}. This means the output groups will vary a little from your original pattern (you'll have an extra group, the 4th group will now contain the output of both the last 2 groups and the data for the last 2 groups will be in groups 5 and 6).

Use regex to extract unit number

I have a list of descriptions and I want to extract the unit information using regular expression
I watched a video on regex and here's what I got
import re
x = ["Four 10-story towers - five 11-story residential towers around Lake Peterson - two 9-story hotel towers facing Devon Avenue & four levels of retail below the hotels",
"265 rental units",
"10 stories and contain 200 apartments",
"801 residential properties that include row homes, town homes, condos, single-family housing, apartments, and senior rental units",
"4-unit townhouse building (6,528 square feet of living space & 2,755 square feet of unheated garage)"]
unit=[]
for item in x:
extract = re.findall('[0-9]+.unit',item)
unit.append(extract)
print unit
This works with string ends in unit, but I also strings end with 'rental unit','apartment','bed' and other as in this example.
I could do this with multiple regex, but is there a way to do this within one regex?
Thanks!

As long as your not afraid of making a hideously long regex you could use something to the extent of:
compiled_re = re.compile(ur"(\d*)-unit|(\d*)\srental unit|(\d*)\sbed|(\d*)\sappartment")
unit = []
for item in x:
extract = re.findall(compiled_re, item)
unit.append(extract)
You would have to extend the regex pattern with a new "|" followed by a search pattern for each possible type of reference to unit numbers. Unfortunately, if there is very low consistency in the entries this approach would become basically unusable.
Also, might I suggest using a regex tester like Regex101. It really helps determining if your regex will do what you want it to.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to not match string not containing two consecutive newlines - python

Related

How to extract all comma delimited numbers inside () bracked and ignore any text

Delete regex matches within loop and continue with updated string version

regex to find image sequences from list of filenames

Python regex string groups capture

Use regex to extract unit number

Categories

Resources