I have strings like this:
this is SBN: 1245897 some text some text SBN4589364 some text some
text SBN-7859157, some text some text Part Num: 2615, some text SBN:7859148 asasas
YYY-485 some text some text...
I need to get only the SBN-related numbers and replace them with links to some site. I'm trying this code, but it does not work:
p = re.compile('SBN[:- ] ( \d+ )')
p.sub(r'\1', string)
Please help..
All whitespace counts in a regular expression! Only include whitespace exactly when you would want whitespace in your match as well. You can limit it to just spaces ( ), but generally 'all whitespace' is more useful (\s).
I would also suggest always writing a regular expression as a raw string.
This regex catches all examples in your question (the separator is optional, so SBN4589364 matches too):
p = re.compile(r'SBN[:-]?\s*(\d+)')
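To turn the matches into links, the captured digits can be reused in the replacement string. A minimal sketch, assuming the sample text from the question and a made-up URL (https://example.com/sbn/ is only a placeholder for the real site); the separator is treated as optional so that SBN4589364 is handled too:

```python
import re

# Sample text taken from the question.
text = (
    "this is SBN: 1245897 some text some text SBN4589364 some text some "
    "text SBN-7859157, some text some text Part Num: 2615, some text "
    "SBN:7859148 asasas YYY-485 some text some text..."
)

# "SBN", an optional ':' or '-', optional whitespace, then the digits (group 1).
sbn_pattern = re.compile(r'SBN[:-]?\s*(\d+)')

# All SBN numbers; "Part Num: 2615" and "YYY-485" are ignored.
numbers = sbn_pattern.findall(text)

# Placeholder URL (an assumption): replace it with the real site.
linked = sbn_pattern.sub(r'<a href="https://example.com/sbn/\1">SBN \1</a>', text)

print(numbers)
print(linked)
```

Because `\1` refers to the captured group, only the digits end up in the link, whatever separator was used in the original text.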
I am trying to get better with regex in Python, and I am trying to figure out how I would isolate a specific substring in some text. I have some text, and that text could look like any of the following:
possible_strings = [
"some text (and words) more text (and words)",
"textrighthere(with some more)",
"little trickier (this time) with (all of (the)(values))"
]
With each string, despite the fact that I don't know what's in them, I know it always ends with some information in parentheses. This includes cases like #3, where the final pair of parentheses has parentheses inside it.
How could I go about using re to isolate only the text inside the last pair of parentheses? So in the previous example, I would want the output to be:
output = [
"and words",
"with some more",
"all of (the)(values)"
]
Any tips or help would be much appreciated!
In Python you can use the third-party regex module, as it supports recursion:
import regex
pat = r'(\((?:[^()]|(?1))*\))$'
regex.findall(pat, '\n'.join(possible_strings), regex.M)
['(and words)', '(with some more)', '(all of (the)(values))']
The regex might be quite complicated for a beginner.
A bit of explanation:
( # start of the 1st capturing group
\( # matches the character (
(?: # start of a non-capturing group
[^()] # 1st alternative: a single character that is neither ( nor )
| # or
(?1) # 2nd alternative: recursively matches the expression defined in the 1st capturing group
) # closes the non-capturing group
* # repeats the non-capturing group zero or more times
\) # matches the character )
) # closes the 1st capturing group
$ # asserts position at the end of a line
For the first two, start matching an opening bracket, that could be either of these:
"some text (and words) more text (and words)"
           ^                     ^
followed by anything which isn't an opening bracket:
"some text (and words) more text (and words)"
           ^^^^^^^^^^^^^^^^^^^^^^X^^^^^^^^^^^
                                 |- starting at the first ( hit
                                    another ( which isn't allowed.
followed by end of line. Only the last () fits "no more ( until end of line".
>>> import re
>>> re.findall(r'\([^(]+\)$', "some text (and words) more text (and words)")
['(and words)']
Regex is not a good fit for your third example: there's no easy way to pair up the parens with the built-in re module, so you may have to install and use a different regex engine to get nested structure support. See also:
Matching Nested Structures With Regular Expressions in Python
Python: How to match nested parentheses with regex?
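If installing another engine is not an option, the nesting can also be handled without a regex by scanning backwards from the last closing parenthesis and counting depth. This is only a sketch of that alternative, not the method from either answer; the function name last_parenthesized is made up for illustration, and it assumes the parentheses in the string are balanced:

```python
def last_parenthesized(text):
    """Return the contents of the final balanced (...) group, or None."""
    end = text.rfind(')')
    if end == -1:
        return None  # no closing parenthesis at all
    depth = 0
    # Walk backwards from the last ')' until its matching '(' is found.
    for i in range(end, -1, -1):
        if text[i] == ')':
            depth += 1
        elif text[i] == '(':
            depth -= 1
            if depth == 0:
                return text[i + 1:end]
    return None  # unbalanced input


possible_strings = [
    "some text (and words) more text (and words)",
    "textrighthere(with some more)",
    "little trickier (this time) with (all of (the)(values))",
]
output = [last_parenthesized(s) for s in possible_strings]
print(output)
```

The depth counter plays the role that recursion plays in the regex version: it only stops at the opening parenthesis that balances the final closing one.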
I have scanned a PDF with Tika; it contains text in the following format, with multiple line breaks:
Some non Interview text
interview with Mr.XYZ
Question: How are you?
Answer: I am fine.
Question: What do you do?
Answer: Nothing
Some non Interview text
How do I apply a regex? I can match words and spaces, but the match does not go across multiple lines. I tried the following regex:
https://regex101.com/r/sekUyT/1
All I want is the interview-related text, which starts with "interview with" and is considered to end when the text does not contain any more Question: and Answer: lines.
Use the re.findall function to get all the occurrences of a particular text.
match = re.findall(r'interview with \s*?\w+\.\w+', text)
match is a list of occurrences of the matched text. If you only want the names, use r'interview with \s*?(\w+\.\w+)' as the search string.
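To pull out the whole interview block rather than just the name, one option is to match the "interview with" line and then every following line that starts with Question: or Answer:. A rough sketch, assuming the sample layout from the question (one entry per line, possibly separated by blank lines); real Tika output may need the pattern adjusted:

```python
import re

# Sample text modelled on the question.
text = """Some non Interview text
interview with Mr.XYZ
Question: How are you?
Answer: I am fine.
Question: What do you do?
Answer: Nothing
Some non Interview text
"""

interview_pattern = re.compile(
    r'^interview with .*\n'                # the header line
    r'(?:\s*(?:Question|Answer):.*\n?)+',  # Q/A lines, blank lines allowed in between
    re.MULTILINE,
)

match = interview_pattern.search(text)
block = match.group().rstrip() if match else ''
print(block)
```

The `.` never crosses a line break here (re.DOTALL is not set), so the match stops at the first line that is neither a Question: nor an Answer: line.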
I have lines I am iterating through that look like this:
random text and A08524SDD here (00-04) more random text
lame text (junk data) more text (08-12) more text 4000 5553
random text and numbers 44553349 (2008)
random text (2005) junk text (junk)
nothing important (13-15) not important (not important)
I am trying to figure out how to pull ONLY the dates (range or single year) from the parentheses without pulling the other random junk from the parentheses.
Currently using this, but it is returning the random text as well:
date = re.findall('\(([^)]+)', line)
Edit: I am iterating over the text one line at a time; it is not one single string. I have a for loop that searches each line and tries to extract the date range from it. Also, there are random numbers included in the random text, so I cannot just search for ##-## or #### in the entire string. The date will have to be enclosed in ()'s.
Edit2: @CarySwoveland has answered my original question. As a bonus, I do have a few lines that look like this; it would be nice if they could also be included.
random text and numbers 44553349 (2008 important text)
random text (2005 important text) junk text (junk) 55555555 (08-09 important text)
nothing important (13-15) not important (not important)(2008 important text)
In the lines with more than one () that start with a ##-## or a ####, I need to grab all of them WITH the text. Out of about 35,000 lines of text, only 50 or so have these random issues, and I do not mind doing them by hand. But if a solution exists, it would be nice to implement.
THANK YOU TO ALL WHO HAVE POSTED! THIS HAS HELPED ME OUT GREATLY!!!!
As per both your question and your added comments I would suggest the following pattern:
(?<=\()\d\d-?\d\d.*?(?=\))
This would cater for all patterns of interest, like: (####), (##-##) and (##-## text) and possibly (#### text).
From left to right:
(?<=\() - Positive lookbehind for an opening parenthesis
\d\d-?\d\d - Two digits followed by an optional hyphen and again two more digits
.*? - Match any character except newlines, but non-greedy
(?=\)) - A positive lookahead to check for a closing parenthesis.
If you want to be very explicit about a closing parenthesis right after the fourth digit, with optional text in between, you could extend the pattern. For example (?<=\()\d\d-?\d\d(?:\s\w+)?(?=\)), where the non-capturing group (?:...) checks for a whitespace character \s followed by one or more word characters \w+. The trailing ? makes that group optional (use * instead to allow several words, as in "2008 important text"), and the same positive lookahead as above follows.
Don't forget, using these patterns in Python, you'd need to make sure to use them as raw strings.
Note: I escaped both the opening and closing parentheses in the lookarounds with a backslash, e.g. \( and \), to use them as literals. Not doing so will prematurely open/close another (non-)capturing group!
A Python example:
import re
lines = ['random text and A08524SDD here (00-04) more random text',
'lame text (junk data) more text (08-12) more text 4000 5553',
'random text and numbers 44553349 (2008)',
'random text (2005) junk text (junk)',
'nothing important (13-15) not important (not important)',
'random text and numbers 44553349 (2008 important text)',
'random text (2005 important text) junk text (junk) 55555555 (08-09 important text)',
'nothing important (13-15) not important (not important)(2008 important text)']
for line in lines:
    print(re.findall(r'(?<=\()\d\d-?\d\d.*?(?=\))', line))
Returns:
['00-04']
['08-12']
['2008']
['2005']
['13-15']
['2008 important text']
['2005 important text', '08-09 important text']
['13-15', '2008 important text']
You can use the following regular expression.
(?m)(?<=\()(?:\d{4}|\d{2}-\d{2})(?=\))
Python's regex engine performs the following operations.
(?m) multiline mode
(?<=\() match is preceded by '(' (positive lookbehind)
(?: begin non-capture group
\d{4} match 4 digits
| or
\d{2}-\d{2} match 2 digits, a hyphen, 2 digits
) end non-capture group
(?=\)) match is followed by ')' (positive lookahead)
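As a quick check, the stricter pattern can be run line by line with re.findall. A small sketch using lines from the question (the (?m) flag is left out because each line is searched on its own); note that this variant deliberately rejects parentheses that contain extra text after the date:

```python
import re

# Four digits, or two digits-hyphen-two digits, directly enclosed in parentheses.
date_in_parens = re.compile(r'(?<=\()(?:\d{4}|\d{2}-\d{2})(?=\))')

lines = [
    'random text and A08524SDD here (00-04) more random text',
    'lame text (junk data) more text (08-12) more text 4000 5553',
    'random text and numbers 44553349 (2008)',
    'random text (2005) junk text (junk)',
    'random text and numbers 44553349 (2008 important text)',
]

results = [date_in_parens.findall(line) for line in lines]
for found in results:
    print(found)
```

The last line yields an empty list, because the lookahead requires the closing parenthesis immediately after the date.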
Does something like this work for you?
This is assuming strings is a list of your lines.
import re

def getter(string):
    # Return the first parenthesized date, or None if the line has none.
    m = re.search(r'\(\d{4}\)|\(\d{2}-\d{2}\)', string)
    return m.group() if m else None

list(map(getter, strings))
output:
['(00-04)', '(08-12)', '(2008)', '(2005)', '(13-15)']
As per your edit: if you are looping, just apply the function to each line inside the loop.
I have a regular expression that captures text from a text file. The regex can be viewed at the following URL:
https://regex101.com/r/wwEjTF/1
In my Python code I would like to extract only the text matched by the regex from all the other text in the file.
I have the following Python code for matching the regex and storing the result in a variable.
match = re.findall(r'test\s.+\n\sdescription\s\"(.+)\"', text, re.S)
I am expecting all the matches to be in the match variable, returned as a list.
But when I do print(match) I get an empty list. I do not understand why the list is empty. How do I capture the matched part of the regex in the variable?
Thanks for your help.
Just in case there is an issue with the above URL, here is the regex and the sample text string:
test\s.+\n\sdescription\s\"(.+)\"
some random text
test 111.333.555.666
description "text10"
some random text
some random text
test 22.44.55.66
description "text12"
some random text
some random text
test 77.77.88.99
description "text13"
some random text
some random text
test 14.22.55.99
description "text16"
some random text
some random text
test 13.33.55.66
description "text17"
some random text
Like I said in my comment, try omitting re.S, because it makes the '.' special character match any character at all, including a newline.
Also, \n\s is not appropriate, since \s already includes the newline. You will have to use \s+ instead.
Your regex will be:
match = re.findall(r'test\s.+\s+description\s\"(.+)\"', text)
The example on regex101 uses a tab as the separator before description, so a single \s works there. Replace it with a repetition, and drop re.S so that .+ stays on one line:
match = re.findall(r'test\s.+\n\s+description\s\"(.+)\"', text)
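The effect of re.S is easy to see on a small sample. The sketch below assumes the description lines are indented with several spaces (the exact indentation in the original file is not shown in the question) and compares the pattern with and without the flag:

```python
import re

# Assumed layout: each description line is indented with spaces.
text = """some random text
test 111.333.555.666
    description "text10"
some random text
test 22.44.55.66
    description "text12"
some random text
test 77.77.88.99
    description "text13"
some random text
"""

pattern = r'test\s.+\s+description\s"(.+)"'

# Without re.S, '.' stops at the end of each line: one match per entry.
matches = re.findall(pattern, text)

# With re.S, the greedy '.+' runs across lines and swallows the earlier entries.
greedy = re.findall(pattern, text, re.S)

print(matches)
print(greedy)
```

With the flag set, only the last description survives, because the first `.+` consumes everything up to the final "description" line.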
THE COMPANY
ABCD is a new company in the field of Marketing. bla bla bla.
DESCRIPTION
xyz....
SUMMARY
asdf.......
For text like the above, I want to pick up headings like THE COMPANY and SUMMARY. I would like to write the code in Python. I have been trying to use regex but have not found a way to write a pattern that matches centrally aligned words. I am open to any new method as long as it can be implemented in Python.
If I understand you correctly, you want to match lines composed of some sort of indentation, followed by some number of upper case words.
If so, the following regex should do the trick:
(?m)^[ \t]+([A-Z][A-Z \t]*)$
Let's take that piece by piece.
(?m) tells the regex matcher to treat ^ and $ as the beginning and end of the line instead of the beginning and end of the string.
^ matches the beginning of the line.
[ \t]+ matches one or more spaces or tabs of indentation. It sits outside the capturing group, so re.findall does not return it. After all, we're not looking for the indentation, we're looking for the text. If you want the indentation too, move the opening parenthesis in front of [ \t]+.
([A-Z][A-Z \t]*) captures one uppercase letter followed by any number of uppercase letters, spaces, or tabs. \s is avoided inside the class because it also matches newlines, which would let a single match run across several lines.
$ matches the end of the line.
Putting it all together (and into Python) we get:
import re
headers = re.findall(r'(?m)^[ \t]+([A-Z][A-Z \t]*)$', your_string)
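Here is a runnable sketch of a variant that puts the heading text in a capturing group, so the indentation is left out of the result. It assumes the headings are indented with spaces and the body lines are not; the eight-space indentation in the sample is a guess, since the original alignment is not preserved:

```python
import re

# Assumed sample: headings are indented, body text starts at column 0.
sample = (
    "        THE COMPANY\n"
    "ABCD is a new company in the field of Marketing. bla bla bla.\n"
    "        DESCRIPTION\n"
    "xyz....\n"
    "        SUMMARY\n"
    "asdf.......\n"
)

# Indentation, then an all-uppercase heading captured in group 1.
heading_pattern = re.compile(r'^[ \t]+([A-Z][A-Z \t]*)$', re.MULTILINE)

headers = heading_pattern.findall(sample)
print(headers)
```

Using `[ \t]` instead of `\s` keeps each match on a single line, since `\s` would also match the newline between lines.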
I prefer to use re.match:
import re

example = '''        THE COMPANY
ABCD is a new company in the field of Marketing. bla bla bla.
        DESCRIPTION
xyz....
        SUMMARY
asdf.......'''

headlines = []
for line in example.split('\n'):
    # A headline is indented by at least four whitespace characters.
    m = re.match(r'^\s{4,}([A-Z0-9 \t\._-]+)', line)
    if m:
        headlines.append(m.group(1))
print(headlines)
Another way is re.findall:
headlines = [x.lstrip(' \n') for x in re.findall(r'^\s{4,}[A-Z0-9 \t\._-]+', example, re.M)]
print(headlines)