Python extract all substrings between character and another string - python

Similar questions out there, but my use case is to extract all substrings that exist between a marker and another string that also includes a '(', which seems to be throwing off regex. Like this-
qry_text -
with
qry_1 as ( qry text)
,
qry_2 as (qry text)
I'd like to extract all subqueries with something like extract between ' ' and 'as ('
re.findall(r'''(.+?)as (',qry_text)
To get -
qry_1,qry2
Regex is not well understood to me, so any suggestions are appreciated.

Maybe named groups in regex can bring you some handy features:
import re
input_str = """with
qry_1 as ( qry text)
,
qry_2 as (qry text)"""
for text in input_str.splitlines():
match = re.search(r'(?P<query>^.*?) as \((?P<text>.*?)\)', text)
if match:
print(match.groupdict())
# {'query': 'qry_1', 'text': ' qry text'}
# {'query': 'qry_2', 'text': 'qry text'}

Related

Python Regex - Get strings before and after substring

import re
txt = '<li>one. URL : http://local.ru (10.02.2022).</li><li>Two</li><li>Three. URL : https://local.ru (15.11.2021).</li>'
re.findall(r'(<li>.*?)\s?URL\s?:\s?(<a.*?>).*?(</a>.*?</li>)', txt)
I need gen output
[('<li>one.', '', ' (10.02.2022).</li>'),
('<li>Three.', '', ' (15.11.2021).</li>')]
If without the first brackets, then it works. But it does not output the text
Seems like your regex was too generous on the .*?, if you limit to non-node with [^<>], then you get the expected output.
import re
txt = (
'<li>one. URL : http://local.ru (10.02.2022).</li>'
'<li>Two</li>'
'<li>Three. URL : https://local.ru (15.11.2021).</li>'
)
re.findall(r"(<li>[^<>]*?)\s?URL\s?:\s?(<a[^>]*?>).*?(</a>.*?</li>)", txt)
gives
[('<li>one.', '', ' (10.02.2022).</li>'),
('<li>Three.', '', ' (15.11.2021).</li>')]

Parse sentences with [value](type) format

I want to parse and extract key, values from a given sentence which follow the following format:
I want to get [samsung](brand) within [1 week](duration) to be happy.
I want to convert it into a split list like below:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
I have tried to split it using [ or ) :
re.split('\[|\]|\(|\)',s)
which is giving output:
['I want to get ',
'samsung',
'',
'brand',
' within ',
'1 week',
'',
'duration',
' to be happy.']
and
re.split('\[||\]|\(|\)',s)
is giving below output :
['I want to get ',
'samsung](brand) within ',
'1 week](duration) to be happy.']
Any help is appreciated.
Note: This is similar to stackoverflow inline links as well where if we type : go to [this link](http://google.com) it parse it as link.
As first step we split the string, and in second step we modify the string:
s = 'I want to get [samsung](brand) within [1 week](duration) to be happy.'
import re
s = re.split('(\[[^]]*\]\([^)]*\))', s)
s = [re.sub('\[([^]]*)\]\(([^)]*)\)', r'\1:\2', i) for i in s]
print(s)
Prints:
['I want to get ', 'samsung:brand', ' within ', '1 week:duration', ' to be happy.']
You may use a two step approach: process the [...](...) first to format as needed and protect these using some rare/unused chars, and then split with that pattern.
Example:
s = "I want to get [samsung](brand) within [1 week](duration) to be happy.";
print(re.split(r'⦅([^⦅⦆]+)⦆', re.sub(r'\[([^][]*)]\(([^()]*)\)', r'⦅\1:\2⦆', s)))
See the Python demo
The \[([^\][]*)]\(([^()]*)\) pattern matches
\[ - a [ char
([^\][]*) - Group 1 ($1): any 0+ chars other than [ and ]
]\( - ]( substring
([^()]*) - Group 2 ($2): any 0+ chars other than ( and )
\) - a ) char.
The ⦅([^⦅⦆]+)⦆ pattern just matches any ⦅...⦆ substring but keeps what is in between as it is captured.
You could replace the ]( pattern first, then split on [) characters
re.replace('\)\[', ':').split('\[|\)',s)
One approach, using re.split with a lambda function:
sentence = "I want to get [samsung](brand) within [1 week](duration) to be happy."
parts = re.split(r'(?<=[\])])\s+|\s+(?=[\[(])', sentence)
processTerms = lambda x: re.sub('\[([^\]]+)\]\(([^)]+)\)', '\\1:\\2', x)
parts = list(map(processTerms, parts))
print(parts)
['I want to get', 'samsung:brand', 'within', '1 week:duration', 'to be happy.']

match single keyword from re.compile which has a list of keyword

i have keywords like
cat="AUTHORISATION,FORTHCOMING BOARD MEETINGS,PREVIOUS BOARD MEETINGS,BOARD MEETINGS,BOARD MEETING,MINUTES,BOARD PAPERS,AGENDA,COMMUNITY PROFILES,FORTHCOMING GOVERNOR MEETINGS,PREVIOUS GOVERNOR MEETINGS,GOVERNOR MEETINGS,GOVERNOR MEETING,GOVERNOR,COUNCIL OF GOVERNORS,GOVERNING BODY MEETINGS,COMPARISON,APC SUMMARY OF DECISIONS"
i have some pre-processing like this
cat_list=cat.split(',')
cat_list=filter(None, cat_list)
cat_list=[s.strip() for s in cat_list]
cat_list=[re.sub('\r\n' , ' ', s) for s in cat_list]
cat_list=[re.sub(r'([^\s])\s([^\s])', r'\1+(.)+\2',x) for x in cat_list]
cat_list=[re.sub(r'([a-z][a-z]+)', r'(\1)',a,flags=re.I) for a in cat_list]
regexes_cat=[re.compile((r'(?:%s)' % '|'.join(cat_list)),re.IGNORECASE),]
which gives re.compile expressions in list for me to perform re.search
so the final regex pattern after processing looks like this
(?:(AUTHORISATION)|(FORTHCOMING)+(.)+(BOARD)+(.)+(MEETINGS)|(PREVIOUS)+(.)+(BOARD)+(.)+(MEETINGS)|(BOARD)+(.)+(MEETINGS)|(BOARD)+(.)+(MEETING)|(MINUTES)|(BOARD)+(.)+(PAPERS)|(AGENDA)|(COMMUNITY)+(.)+(PROFILES)|(FORTHCOMING)+(.)+(GOVERNOR)+(.)+(MEETINGS)|(PREVIOUS)+(.)+(GOVERNOR)+(.)+(MEETINGS)|(GOVERNOR)+(.)+(MEETINGS)|(GOVERNOR)+(.)+(MEETING)|(GOVERNOR)|(COUNCIL)+(.)+(OF)+(.)+(GOVERNORS)|(GOVERNING)+(.)+(BODY)+(.)+(MEETINGS)|(COMPARISON)|(APC)+(.)+(SUMMARY)+(.)+(OF)+(.)+(DECISIONS))
but i got results like this if i print group(0)
GOVERNORS-MEETINGS.ASP?P=GOVERNORS%27.COUNCIL.MEETINGS
so i searched and found that i have to use ? to make it non-greedy but i am unable get the required output
which should be
GOVERNORS-MEETINGS
i am performing re.search against URL and text present on webpage
http://www.qehkl.nhs.uk/governors-meetings.asp?p=governors%27.council.meetings&s=main&ss=becoming.a.foundation.trust
The solution I suggest is based on the following assumptions:
The regex match should happen in the last subpart of the path (i.e. in the file part, before any eventual query string)
The query string is optional
So, the solution is to parse the URL first with urlparse to only get the string to run the regex on, and forget about lookarounds. Instead of (.)+, just use a lazy (.*?) to match any 0+ chars as few as possible:
import re
from urlparse import urlparse
cat="AUTHORISATION,FORTHCOMING BOARD MEETINGS,PREVIOUS BOARD MEETINGS,BOARD MEETINGS,BOARD MEETING,MINUTES,BOARD PAPERS,AGENDA,COMMUNITY PROFILES,FORTHCOMING GOVERNOR MEETINGS,PREVIOUS GOVERNOR MEETINGS,GOVERNOR MEETINGS,GOVERNOR MEETING,GOVERNOR,COUNCIL OF GOVERNORS,GOVERNING BODY MEETINGS,COMPARISON,APC SUMMARY OF DECISIONS"
cat_list=cat.split(',')
cat_list=filter(None, cat_list)
cat_list=[s.strip() for s in cat_list]
cat_list=[re.sub('\r\n' , ' ', s) for s in cat_list]
cat_list=[re.sub(r'([^\s])\s([^\s])', r'\1(.*?)\2',x) for x in cat_list] # Allow anything in between the keywords, but as few as possible
cat_list=[re.sub(r'([a-z][a-z]+)', r'(\1)', a, flags=re.I) for a in cat_list]
regex_cat=re.compile(r"(?:{})".format('|'.join(cat_list)),re.IGNORECASE)
#print(regex_cat.pattern)
urls = "GOVERNORS/GOVERNORS-MEETINGS.ASP?P=GOVERNORS%27.COUNCIL.MEETINGS "
o = urlparse(urls) # Parse the URL
last_subpart = o.path.split('/').pop() # Get the last subpart
m = regex_cat.search(last_subpart) # Run the regex search
if m: # If there is a match...
print(m.group()) # Print or do anything with the value
See the Python demo
Try the following code -
cat_list=cat.split(',')
cat_list=filter(None, cat_list)
cat_list=[s.strip() for s in cat_list]
cat_list=[re.sub('\r\n' , ' ', s) for s in cat_list]
#Till now all same, following statements have changes
cat_list=[re.sub(r'([^\s])\s([^\s])', r'\1+.+?\2',x) for x in cat_list]
cat_list=['(%s)'%re.sub(r'([a-z]+)', r'(\1)',a,flags=re.I) for a in cat_list]
regexes_cat=[re.compile((r'(?:%s)' % '|'.join(cat_list)),re.IGNORECASE),]
Here's the working demo.

Extract unquoted text from a string

I have a string that may contain random segments of quoted and unquoted texts. For example,
s = "\"java jobs in delhi\" it software \"pune\" hello".
I want to separate out the quoted and unquoted parts of this string in python.
So, basically I expect the output to be:
quoted_string = "\"java jobs in delhi\"" "\"pune\""
unquoted_string = "it software hello"
I believe using a regex is the best way to do it. But I am not very good with regex. Is there some regex expression that can help me with this?
Or is there a better solution available?
I dislike regex for something like this, why not just use a split like this?
s = "\"java jobs in delhi\" it software \"pune\" hello"
print s.split("\"")[0::2] # Unquoted
print s.split("\"")[1::2] # Quoted
If your quotes are as basic as in your example, you could just split; example:
for s in (
'"java jobs in delhi" it software "pune" hello',
'foo "bar"',
):
result = s.split('"')
print 'text between quotes: %s' % (result[1::2],)
print 'text outside quotes: %s' % (result[::2],)
Otherwise you could try:
import re
pattern = re.compile(
r'(?<!\\)(?:\\\\)*(?P<quote>["\'])(?P<value>.*?)(?<!\\)(?:\\\\)*(?P=quote)'
)
for s in data:
print pattern.findall(s)
I explain the regex (I use it in ihih):
(?<!\\)(?:\\\\)* # find backslash
(?P<quote>["\']) # any quote character (either " or ')
# which is *not* escaped (by a backslash)
(?P<value>.*?) # text between the quotes
(?<!\\)(?:\\\\)*(?P=quote) # end (matching) quote
Debuggex Demo
/
Regex101 Demo
Use a regex for that:
re.findall(r'"(.*?)"', s)
will return
['java jobs in delhi', 'pune']
You should use Python's shlex module, it's very nice:
>>> from shlex import shlex
>>> def get_quoted_unquoted(s):
... lexer = shlex(s)
... items = list(iter(lexer.get_token, ''))
... return ([i for i in items if i[0] in "\"'"],
[i for i in items if i[0] not in "\"'"])
...
>>> get_quoted_unquoted("\"java jobs in delhi\" it software \"pune\" hello")
(['"java jobs in delhi"', '"pune"'], ['it', 'software', 'hello'])
>>> get_quoted_unquoted("hello 'world' \"foo 'bar' baz\" hi")
(["'world'", '"foo \'bar\' baz"'], ['hello', 'hi'])
>>> get_quoted_unquoted("does 'nested \"quotes\" work' yes")
(['\'nested "quotes" work\''], ['does', 'yes'])
>>> get_quoted_unquoted("what's up with single quotes?")
([], ["what's", 'up', 'with', 'single', 'quotes', '?'])
>>> get_quoted_unquoted("what's up when there's two single quotes")
([], ["what's", 'up', 'when', "there's", 'two', 'single', 'quotes'])
I think this solution is as simple as any other solution (basically a oneliner, if you remove the function declaration and grouping) and it handles nested quotes well etc.

Python regex: Match ALL consecutive capitalized words

Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?
You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print re.findall(pattern, title)
A "normal" pattern can't match overlapping substrings, all characters are founded once for all. However, a lookahead (?=..) (i.e. "followed by") is only a check and match nothing. It can parse the string several times. Thus if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate through the capture groups like so testing with this regex: (^[A-Z][a-z.-]+$) to ensure the matched group(current) matches the matched group(next).
Working example:
import re
title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
i = 1
if m:
for i in range(len(m)):
if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
matchlist.append([m[i - 1], m[i]])
print matchlist
Output:
[
['Browser', 'Announcing'],
['Announcing', 'Elasticsearch.js'],
['Elasticsearch.js', 'For'],
['For', 'Node.js'],
['Node.js', 'And'],
['And', 'The'],
['The', 'Browser']
]
If your Python code at the moment is this
title="Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall("[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping odd numbered pairs. An easy solution would be to research the pattern after skipping the first word like this:
m = re.match("[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
results2 = re.findall("[A-Z][a-z]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and result2 together.

Categories

Resources