Hi, I would like to split some strings like these using the Python re module:
str1 = "-t Hello World -a New Developr -d A description"
str2 = "-t Bye -a sometext -d a large description"
I must keep the - because I'm currently working on a Python CLI project.
I've tried using
re.split(r'(?=-)',aux)
but I received
['', '-t Hello World ', '-a New Developr ', '-d A description']
instead of
['-t Hello World', '-a New Developr', '-d A description']
Any recommendation?
Use a negative lookbehind to avoid splitting at the start of the string: (?<!^)
Then strip each of the resulting strings.
for s in str1, str2:
    print([x.strip() for x in re.split(r'(?<!^)(?=\s+-)', s)])
(I also added \s+ to avoid splitting words like Spider-Man.)
Output:
['-t Hello World', '-a New Developr', '-d A description']
['-t Bye', '-a sometext', '-d a large description']
However, I'd recommend looking into libraries that can do this for you; argparse and shlex are good starting points. I would also question where these strings are coming from, because normally you deal with command-line arguments as a list before parsing, like this:
['-t', 'Hello', 'World', '-a', 'New', 'Developr', '-d', 'A', 'description']
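For example, here is a minimal sketch of that route, assuming the only flags are -t, -a and -d and that each takes one or more whitespace-separated words (nargs='+' gathers them back into lists):
import argparse
import shlex

parser = argparse.ArgumentParser()
parser.add_argument('-t', nargs='+')
parser.add_argument('-a', nargs='+')
parser.add_argument('-d', nargs='+')

# shlex.split turns the raw string into an argv-style list first
args = parser.parse_args(shlex.split(str1))
print(args.t)  # ['Hello', 'World']
print(args.d)  # ['A', 'description']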
Try applying the strip() method to the results, and filter out the empty leading string that re.split produces:
foo = re.split(r'(?=-)', str1)
foo = [s.strip() for s in foo if s]
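For str1, this yields:
['-t Hello World', '-a New Developr', '-d A description']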
I am writing a LaTeX-to-text converter, and I'm basing my work on a well-known Python parser for LaTeX (python-latex). I am improving it day after day, but now I have a problem when parsing multiple commands inside one line. A LaTeX command can take the following four forms:
\commandname
\commandname[text]
\commandname{other text}
\commandname[text]{other text}
Under the assumptions that commands are not split across lines and that there can be spaces in the text (but not in the command name), I ended up with the following regexp to catch a command in a line:
'(\\.+\[*.*\]*\{.*\})'
In fact, a sample program is working:
string="\documentclass[this is an option]{this is a text} this is other text ..."
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', string)
>>>['', '\\documentclass[this is an option]{this is a text}', ' ', 'this', ' ', 'is', ' ', 'other', ' ', 'text', ' ...']
Well, to be honest, I would prefer an output like this:
>>> [ '\\documentclass[this is an option]{this is a text}', 'this is other text ...' ]
But the first one would work anyway. Now, my problem arises when there is more than one command in a line, as in the following example:
dstring = string + " \emph{tt}"
print(dstring)
\documentclass[this is an option]{this is a text} this is other text ... \emph{tt}
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', dstring)
['', '\\documentclass[this is an option]{this is a text} this is other text ... \\emph{tt}', '']
As you can see, the result is quite different from the one that I would like:
[ '\\documentclass[this is an option]{this is a text}', 'this is other text ...', '\\emph{tt}']
I have tried to use lookahead and lookbehind assertions, but since lookbehind requires a fixed-width pattern, I could not make them work. I hope there is a solution.
Thank you!
You can accomplish this simply with github.com/alvinwan/TexSoup. It will give you what you want, albeit with whitespace preserved.
>>> from TexSoup import TexSoup
>>> string = "\documentclass[this is an option]{this is a text} this is other text ..."
>>> soup = TexSoup(string)
>>> list(soup.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...']
>>> string2 = string + "\emph{tt}"
>>> soup2 = TexSoup(string2)
[\documentclass[this is an option]{this is a text}, ' this is other text ...', \emph{tt}]
Disclaimer: I know (1) I'm posting over a year later and (2) OP asks for regex, but assuming the task is tool-agnostic, I'm leaving this here for folks with similar problems. Also, I wrote TexSoup, so take this suggestion with a grain of salt.
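If you'd rather stay with the re module: instead of fighting re.split, you can iterate over command matches with finditer and collect the plain text between them. A minimal sketch (my own, not the OP's parser; it assumes commands never nest and that the bracket/brace arguments contain no nested brackets or braces):
import re

# backslash + command name, then optional [...] and {...} groups
cmd = re.compile(r'\\\w+(?:\[[^\]]*\])?(?:\{[^}]*\})?')

def split_commands(line):
    parts, pos = [], 0
    for m in cmd.finditer(line):
        text = line[pos:m.start()].strip()
        if text:
            parts.append(text)   # plain text between commands
        parts.append(m.group())  # the command itself
        pos = m.end()
    tail = line[pos:].strip()
    if tail:
        parts.append(tail)
    return parts

dstring = r"\documentclass[this is an option]{this is a text} this is other text ... \emph{tt}"
print(split_commands(dstring))
# ['\\documentclass[this is an option]{this is a text}', 'this is other text ...', '\\emph{tt}']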
I have a string that may contain random segments of quoted and unquoted texts. For example,
s = "\"java jobs in delhi\" it software \"pune\" hello".
I want to separate out the quoted and unquoted parts of this string in python.
So, basically I expect the output to be:
quoted_string = '"java jobs in delhi" "pune"'
unquoted_string = 'it software hello'
I believe using a regex is the best way to do it, but I am not very good with regex. Is there some regex that can help me with this?
Or is there a better solution available?
I dislike regex for something like this; why not just use a split?
s = "\"java jobs in delhi\" it software \"pune\" hello"
print(s.split("\"")[0::2])  # Unquoted
print(s.split("\"")[1::2])  # Quoted
If your quotes are as basic as in your example, you could just split; example:
for s in (
    '"java jobs in delhi" it software "pune" hello',
    'foo "bar"',
):
    result = s.split('"')
    print('text between quotes: %s' % (result[1::2],))
    print('text outside quotes: %s' % (result[::2],))
Otherwise you could try:
import re
pattern = re.compile(
r'(?<!\\)(?:\\\\)*(?P<quote>["\'])(?P<value>.*?)(?<!\\)(?:\\\\)*(?P=quote)'
)
data = ['"java jobs in delhi" it software "pune" hello']  # e.g. the question's sample
for s in data:
    print(pattern.findall(s))
An explanation of the regex:
(?<!\\)(?:\\\\)*            # skip over any run of escaped backslashes (\\),
                            # making sure no lone backslash precedes the quote
(?P<quote>["\'])            # an opening quote (either " or '),
                            # which is therefore *not* escaped
(?P<value>.*?)              # text between the quotes (non-greedy)
(?<!\\)(?:\\\\)*(?P=quote)  # the matching closing quote, also not escaped
Use a regex for that:
re.findall(r'"(.*?)"', s)
will return
['java jobs in delhi', 'pune']
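If you want to keep the quotation marks themselves (as in the expected output in the question), match them rather than capturing only the inside, for example:
import re

s = '"java jobs in delhi" it software "pune" hello'
quoted = re.findall(r'"[^"]*"', s)                    # ['"java jobs in delhi"', '"pune"']
unquoted = ' '.join(re.split(r'"[^"]*"', s)).split()  # ['it', 'software', 'hello']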
You should use Python's shlex module; it's very nice:
>>> from shlex import shlex
>>> def get_quoted_unquoted(s):
... lexer = shlex(s)
... items = list(iter(lexer.get_token, ''))
... return ([i for i in items if i[0] in "\"'"],
[i for i in items if i[0] not in "\"'"])
...
>>> get_quoted_unquoted("\"java jobs in delhi\" it software \"pune\" hello")
(['"java jobs in delhi"', '"pune"'], ['it', 'software', 'hello'])
>>> get_quoted_unquoted("hello 'world' \"foo 'bar' baz\" hi")
(["'world'", '"foo \'bar\' baz"'], ['hello', 'hi'])
>>> get_quoted_unquoted("does 'nested \"quotes\" work' yes")
(['\'nested "quotes" work\''], ['does', 'yes'])
>>> get_quoted_unquoted("what's up with single quotes?")
([], ["what's", 'up', 'with', 'single', 'quotes', '?'])
>>> get_quoted_unquoted("what's up when there's two single quotes")
([], ["what's", 'up', 'when', "there's", 'two', 'single', 'quotes'])
I think this solution is as simple as any other (basically a one-liner, if you remove the function declaration and grouping), and it handles nested quotes well.
A script is located at "just a dir/my script.sh",
and a call to this script looks like:
path_options = 'just\ a\ dir/my\ script.sh --option'
How can I split path_options into the following?
['just a dir/my script.sh', '--option']
OK, @Kasra gave an answer for the above example, but what if there may be multiple options, like:
path_options = 'just\ a\ dir/my\ script.sh --option --option2 ... --optionN'
Ended up doing:
a = re.split(r'(?<!\\) ', path_options, maxsplit=1)  # Split at the first space that is not preceded by a backslash
a = [re.sub(r'\\ ', ' ', a[0])] + a[1:]              # Convert all '\ ' in the first element to ' '
gives:
['just a dir/my script.sh', '--option --option2 ... --optionN']
As per @JoelCornett's comment:
a = shlex.split(path_options)
gives:
['just a dir/my script.sh', '--option', '--option2', '...', '--optionN']
You can use re.split to split your path on backslashes and spaces, then slice and join (note that this assumes a single option at the end):
>>> l=re.split(r'[\\ ]+',path_options)
>>> l=[' '.join(l[:-1]),l[-1]]
>>> l
['just a dir/my script.sh', '--option']
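If you also need the multiple-option case from the edit, the same unescaped-space idea works without maxsplit; a sketch (using three concrete options for illustration):
import re

path_options = r'just\ a\ dir/my\ script.sh --option --option2 --optionN'
# Split on spaces not preceded by a backslash, then unescape the '\ ' sequences
parts = [p.replace('\\ ', ' ') for p in re.split(r'(?<!\\) ', path_options)]
print(parts)
# ['just a dir/my script.sh', '--option', '--option2', '--optionN']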
I am trying to remove some garbage from a text and would like to remove all words that have ";" between two characters. I have tried both of the expressions below,
r'\s.*;.*\s' and r'\s.*\W.*\s'
in this text
'the cat as;asas was wjdwi;qs at home'
but they seem to miss some of the whitespace, returning
'cat as;asas was wjdwi;qs at '
When I needed
'the cat was at home'
A simple solution is to not use a regex:
s = 'the cat as;asas was wjdwi;qs at home'
res = ' '.join(w for w in s.split() if ';' not in w)
# the cat was at home
You might need a more complicated check, but split it into "words" first, then apply a check to each "word"...
You can use this:
re.sub(r'(?i)\s?[a-z]+;[a-z]+\s?', ' ', yourstr)
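Applied to the sample string from the question, this yields:
import re

s = 'the cat as;asas was wjdwi;qs at home'
print(re.sub(r'(?i)\s?[a-z]+;[a-z]+\s?', ' ', s))
# the cat was at home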
I am a Python newb trying to get a better understanding of regex. Just when I think I've got a good grasp of the basics, something throws me, such as the following:
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s' + '|'.join(noun_list) + r'\s'
>>> found = re.findall(noun_patt, text)
>>> found
[' eggs', 'bacon', 'donkey']
Since I set the regex pattern to find 'whitespace' + 'pipe-joined list of nouns' + 'whitespace', how come:
' eggs' was found with a space before it but not after it?
'bacon' was found with no spaces on either side?
'donkey' was found with no spaces on either side, even though there is no whitespace after it in the text?
The result I was expecting: [' eggs ', ' bacon ']
I am using Python 2.7
You misunderstand the pattern. There is no group around the joined list of nouns, so the first \s is part of the eggs option, the bacon and donkey options have no spaces, and the dog option includes the final \s metacharacter.
You want to put a group around the nouns to delimit what the | option applies to:
noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
The non-capturing group here ((?:...)) puts a limit on what the | options apply to. The \s spaces are now outside of the group and are thus not part of the 4 choices.
You need to use a non-capturing group because if you were to use a regular (capturing) group .findall() would return just the noun, not the spaces.
Demo:
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
>>> re.findall(noun_patt, text)
[' eggs ', ' bacon ']
Now both spaces are part of the output.
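For contrast, if you used a capturing group instead, findall() would return only what the group matched, without the spaces:
>>> re.findall(r'\s({})\s'.format('|'.join(noun_list)), text)
['eggs', 'bacon']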