I have a string that may contain random segments of quoted and unquoted text. For example,
s = "\"java jobs in delhi\" it software \"pune\" hello".
I want to separate out the quoted and unquoted parts of this string in Python.
So, basically I expect the output to be:
quoted_string = "\"java jobs in delhi\"" "\"pune\""
unquoted_string = "it software hello"
I believe a regex is the best way to do this, but I am not very good with regex. Is there a regex that can help me with this?
Or is there a better solution available?
I dislike regex for something like this; why not just use a split, like this?
s = "\"java jobs in delhi\" it software \"pune\" hello"
print s.split("\"")[0::2] # Unquoted
print s.split("\"")[1::2] # Quoted
If your quotes are as basic as in your example, you could just split; example:
for s in (
    '"java jobs in delhi" it software "pune" hello',
    'foo "bar"',
):
    result = s.split('"')
    print 'text between quotes: %s' % (result[1::2],)
    print 'text outside quotes: %s' % (result[::2],)
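which should print:
text between quotes: ['java jobs in delhi', 'pune']
text outside quotes: ['', ' it software ', ' hello']
text between quotes: ['bar']
text outside quotes: ['foo ', '']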
Otherwise you could try:
import re

pattern = re.compile(
    r'(?<!\\)(?:\\\\)*(?P<quote>["\'])(?P<value>.*?)(?<!\\)(?:\\\\)*(?P=quote)'
)
data = ['"java jobs in delhi" it software "pune" hello']  # example input
for s in data:
    print pattern.findall(s)
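With the example input above, findall returns the (quote, value) group pairs:
[('"', 'java jobs in delhi'), ('"', 'pune')]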
To explain the regex:
(?<!\\)(?:\\\\)*            # an even number of backslashes, so the quote itself is not escaped
(?P<quote>["\'])            # any quote character (either " or ')
(?P<value>.*?)              # text between the quotes (non-greedy)
(?<!\\)(?:\\\\)*(?P=quote)  # closing quote matching the opening one, also not escaped
Debuggex Demo / Regex101 Demo
Use a regex for that:
re.findall(r'"(.*?)"', s)
will return
['java jobs in delhi', 'pune']
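If you also need the unquoted parts, the same non-greedy pattern works with re.split; the capturing group keeps the quoted text in the result:

>>> re.split(r'"(.*?)"', s)
['', 'java jobs in delhi', ' it software ', 'pune', ' hello']

The quoted parts end up at the odd indices and the unquoted parts at the even ones.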
You should use Python's shlex module; it's very nice:
>>> from shlex import shlex
>>> def get_quoted_unquoted(s):
...     lexer = shlex(s)
...     items = list(iter(lexer.get_token, ''))
...     return ([i for i in items if i[0] in "\"'"],
...             [i for i in items if i[0] not in "\"'"])
...
>>> get_quoted_unquoted("\"java jobs in delhi\" it software \"pune\" hello")
(['"java jobs in delhi"', '"pune"'], ['it', 'software', 'hello'])
>>> get_quoted_unquoted("hello 'world' \"foo 'bar' baz\" hi")
(["'world'", '"foo \'bar\' baz"'], ['hello', 'hi'])
>>> get_quoted_unquoted("does 'nested \"quotes\" work' yes")
(['\'nested "quotes" work\''], ['does', 'yes'])
>>> get_quoted_unquoted("what's up with single quotes?")
([], ["what's", 'up', 'with', 'single', 'quotes', '?'])
>>> get_quoted_unquoted("what's up when there's two single quotes")
([], ["what's", 'up', 'when', "there's", 'two', 'single', 'quotes'])
I think this solution is as simple as any other (basically a one-liner, if you remove the function declaration and grouping), and it handles nested quotes well.
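To get the exact strings from the question, just join each list back together:

>>> quoted, unquoted = get_quoted_unquoted("\"java jobs in delhi\" it software \"pune\" hello")
>>> " ".join(quoted)
'"java jobs in delhi" "pune"'
>>> " ".join(unquoted)
'it software hello'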
Related
I need some help with a regex I am writing. I have a list of words that I want to match, along with any words that come after them (where words means [A-Za-z/\s]+, i.e. no parentheses, symbols, or numbers).
words = ['qtr','hard','quarter'] # keywords that must exist
test=['id:12345 cli hard/qtr Mix',
'id:12345 cli qtr 90%',
'id:12345 cli hard (red)',
'id:12345 cli hard work','Hello world']
The expected output is:
['hard/qtr Mix', 'qtr', 'hard', 'hard work', None]
What I have tried so far
re.search(r'((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))',x,re.I)
The problem with the pattern you have, i.e. '((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))', is the extra pair of square brackets: the inner class becomes [[A-Za-z/\s] (which also matches a literal [), and the trailing ] then has to match a literal ] in the string. Instead, you can drop the extra brackets and simply use a space character in the class.
You can join all the words in the words list with | to create the pattern '((qtr|hard|quarter)([a-zA-Z/ ]*))', then search for the pattern in each of the strings in the list; if a match is found, take group 0 and append it to the result list, else append None:
pattern = re.compile('((' + '|'.join(words) + ')([a-zA-Z/ ]*))')
result = []
for x in test:
    groups = pattern.search(x)
    if groups:
        result.append(groups.group(0))
    else:
        result.append(None)
OUTPUT:
result
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
And since you are including the space characters, you may end up with some values that have a space at the end; you can simply strip the whitespace off afterwards.
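For example, using the result list from above:

>>> [r.strip() if r else None for r in result]
['hard/qtr Mix', 'qtr', 'hard', 'hard work', None]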
Here is the idea from the existing answer, made shorter:
>>> pattern = re.compile('(('+'|'.join(words)+')([a-zA-Z/ ]*))')
>>> [pattern.search(x).group(0) if pattern.search(x) else None for x in test]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
As mentioned in a comment:
But it is quite inefficient, because it runs the same search twice: once for pattern.search(x).group(0) and again for if pattern.search(x); a list comprehension is not the best way to go in such scenarios.
We can try this to overcome that issue :
>>> [v.group(0) if v else None for v in (pattern.search(x) for x in test)]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
You can put all the needed words in an alternation and put your word definition after it:
import re

words = ['qtr', 'hard', 'quarter']
regex = r"(" + "|".join(words) + r")[A-Za-z/\s]+"
p = re.compile(regex)
test = ['id:12345 cli hard/qtr Mix(qtr',
        'id:12345 cli qtr 90%',
        'id:12345 cli hard (red)',
        'id:12345 cli hard work',
        'Hello world']
for string in test:
    result = p.search(string)
    if result is not None:
        print(result.group(0))
    else:
        print(result)
Output:
hard/qtr Mix
qtr
hard
hard work
None
Hi, I would like to split some strings like these using Python's re module:
str1 = "-t Hello World -a New Developr -d A description"
str2 = "-t Bye -a sometext -d a large description"
I must keep the - because I'm working on a Python CLI project.
I've tried using
re.split(r'(?=-)', str1)
but I received
['', '-t Hello World ', '-a New Developr ', '-d A description']
instead of
['-t Hello World', '-a New Developr', '-d A description']
Any recommendation?
Use a negative lookbehind to avoid splitting at the start of the string: (?<!^)
Then strip each of the resulting strings.
for s in (str1, str2):
    print([x.strip() for x in re.split(r'(?<!^)(?=\s+-)', s)])
(I also added \s+ to avoid splitting words like Spider-Man.)
Output:
['-t Hello World', '-a New Developr', '-d A description']
['-t Bye', '-a sometext', '-d a large description']
However, I'd recommend looking into libraries that can do this for you; argparse and shlex are a good start. I would also question where these strings are coming from, because normally you deal with command-line arguments as a list before parsing, like this:
['-t', 'Hello', 'World', '-a', 'New', 'Developr', '-d', 'A', 'description']
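For instance, a minimal sketch of that route, using only the standard library (the -t/-a/-d option names are taken from the question):

import argparse
import shlex

parser = argparse.ArgumentParser()
parser.add_argument('-t', nargs='+')
parser.add_argument('-a', nargs='+')
parser.add_argument('-d', nargs='+')

# shlex.split() tokenizes the string the way a shell would
args = parser.parse_args(shlex.split("-t Hello World -a New Developr -d A description"))
print(args.t)  # ['Hello', 'World']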
Try applying the strip() method on the results.
foo = re.split(r'(?=-)', str1)
foo = [s.strip() for s in foo]
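Note that the leading empty string produced by the split survives as ''; if you want to drop empty items as well, you can filter while stripping:

foo = [t.strip() for t in re.split(r'(?=-)', str1) if t.strip()]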
I am writing a LaTeX-to-text converter, and I'm basing my work on a well-known Python parser for LaTeX (python-latex). I am improving it day after day, but now I have a problem when parsing multiple commands on one line. A LaTeX command can take the following four forms:
\commandname
\commandname[text]
\commandname{other text}
\commandname[text]{other text}
Assuming that commands are not split across lines and that there can be spaces in the text (but not in the command name), I ended up with the following regex to catch a command in a line:
'(\\.+\[*.*\]*\{.*\})'
In fact, a sample program works:
string="\documentclass[this is an option]{this is a text} this is other text ..."
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', string)
>>>['', '\\documentclass[this is an option]{this is a text}', ' ', 'this', ' ', 'is', ' ', 'other', ' ', 'text', ' ...']
Well, to be honest, I would prefer an output like this:
>>> [ '\\documentclass[this is an option]{this is a text}', 'this is other text ...' ]
But the first one can work anyway. Now, my problem arises if, in one line, there are more than one command, like in the following example:
dstring=string+" \emph{tt}"
print (dstring)
\documentclass[this is an option]{this is a text} this is other text ... \emph{tt}
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', dstring)
['', '\\documentclass[this is an option]{this is a text} this is other text ... \\emph{tt}', '']
As you can see, the result is quite different from the one that I would like:
[ '\\documentclass[this is an option]{this is a text}', 'this is other text ...', '\\emph{tt}']
I have tried to use lookahead and lookbehind assertions, but since lookbehind requires a fixed number of characters, it is impossible to use them here. I hope there is a solution.
Thank you!
You can accomplish this simply with github.com/alvinwan/TexSoup. This will give you what you want, albeit with whitespace preserved.
>>> from TexSoup import TexSoup
>>> string = "\documentclass[this is an option]{this is a text} this is other text ..."
>>> soup = TexSoup(string)
>>> list(soup.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...']
>>> string2 = string + "\emph{tt}"
>>> soup2 = TexSoup(string2)
>>> list(soup2.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...', \emph{tt}]
Disclaimer: I know (1) I'm posting over a year later and (2) OP asks for regex, but assuming the task is tool-agnostic, I'm leaving this here for folks with similar problems. Also, I wrote TexSoup, so take this suggestion with a grain of salt.
Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So, the expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give desired output?
You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
print re.findall(pattern, title)
A "normal" pattern can't match overlapping substrings, all characters are founded once for all. However, a lookahead (?=..) (i.e. "followed by") is only a check and match nothing. It can parse the string several times. Thus if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate through the matches, testing each consecutive pair against the regex (^[A-Z][a-z.-]+$) to ensure that both the current match and the next one qualify.
Working example:
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
for i in range(1, len(m)):
    if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
        matchlist.append([m[i - 1], m[i]])
print matchlist

(Starting the range at 1 avoids the bogus wrap-around pair ['Browser', 'Announcing'] that m[-1] would produce on the first iteration.)
Output:
[
    ['Announcing', 'Elasticsearch.js'],
    ['Elasticsearch.js', 'For'],
    ['For', 'Node.js'],
    ['Node.js', 'And'],
    ['And', 'The'],
    ['The', 'Browser']
]
If your Python code at the moment is this
title="Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall("[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping every other pair. An easy solution is to run the search again after skipping the first word, like this (note the . added to the first character class so that a word like Elasticsearch.js can also start a pair):
m = re.match("[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
results2 = re.findall("[A-Z][a-z.]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and results2.
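One way to interleave them back into reading order (this sketch assumes each pair occurs only once in the title, so title.find gives its position):

>>> sorted(results + results2, key=title.find)
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']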
I am a Python newbie trying to get a better understanding of regex. Just when I think I have a good grasp of the basics, something throws me, such as the following:
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s' + '|'.join(noun_list) + r'\s'
>>> found = re.findall(noun_patt, text)
>>> found
[' eggs', 'bacon', 'donkey']
Since I set the regex pattern to find 'whitespace' + 'pipe-joined list of nouns' + 'whitespace', how come:
' eggs' was found with a space before it but not after it?
'bacon' was found with no spaces on either side?
'donkey' was found with no spaces on either side, despite there being no whitespace after it at all?
The result I was expecting: [' eggs ', ' bacon ']
I am using Python 2.7
You misunderstand the pattern. There is no group around the joined list of nouns, so the first \s is part of the eggs option, the bacon and donkey options have no spaces at all, and the dog option includes the final \s metacharacter.
You want to put a group around the nouns to delimit what the | option applies to:
noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
The non-capturing group here ((?:...)) puts a limit on what the | options apply to. The \s spaces are now outside of the group and are thus not part of the 4 choices.
You need to use a non-capturing group because if you were to use a regular (capturing) group .findall() would return just the noun, not the spaces.
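For comparison, with a capturing group the spaces are matched but not returned:

>>> re.findall(r'\s(eggs|bacon|donkey|dog)\s', text)
['eggs', 'bacon']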
Demo:
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
>>> re.findall(noun_patt, text)
[' eggs ', ' bacon ']
Now both spaces are part of the output.