Python - How to split a string by non alpha characters - python

I'm trying to use python to parse lines of c++ source code. The only thing I am interested in is include directives.
#include "header.hpp"
I want it to be flexible and still work with poor coding styles like:
# include"header.hpp"
I have gotten to the point where I can read lines and trim whitespace before and after the #. However, I still need to find out which directive it is by reading the string until a non-alpha character is encountered, regardless of whether it is a space, quote, tab or angle bracket.
So basically my question is: How can I split a string starting with alphas until a non alpha is encountered?
I think I might be able to do this with regex, but I have not found anything in the documentation that looks like what I want.
Also if anyone has advice on how I would get the file name inside the quotes or angled brackets that would be a plus.

Your instinct on using regex is correct.
import re
re.split('[^a-zA-Z]', string_to_split)
The [^a-zA-Z] part means "not alphabetic characters".
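For the specific case in the question (read the directive name up to the first non-alpha character), a match anchored at the start may be more direct than a split. A minimal sketch, assuming the leading # and the whitespace around it have already been stripped; the function name split_leading_alpha is made up for illustration:

```python
import re

def split_leading_alpha(s):
    """Return (leading alphabetic run, rest of the string)."""
    # re.match anchors at the start; [a-zA-Z]* may match the empty string,
    # so the match object is never None here.
    m = re.match(r'[a-zA-Z]*', s)
    return s[:m.end()], s[m.end():]

print(split_leading_alpha('include"header.hpp"'))
```

The second element keeps the quotes or angle brackets, so the file name still has to be extracted from it separately.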

You can do that with a regex. However, you can also use a simple while loop.
def splitnonalpha(s):
    pos = 1
    while pos < len(s) and s[pos].isalpha():
        pos += 1
    return (s[:pos], s[pos:])
Test:
>>> splitnonalpha('#include"blah.hpp"')
('#include', '"blah.hpp"')

The two options mentioned by others that are best in my opinion are re.split and re.findall:
>>> import re
>>> re.split(r'\W+', '#include "header.hpp"')
['', 'include', 'header', 'hpp', '']
>>> re.findall(r'\w+', '#include "header.hpp"')
['include', 'header', 'hpp']
A quick benchmark:
>>> import timeit
>>> setup = r"import re; word_pattern = re.compile(r'\w+'); sep_pattern = re.compile(r'\W+')"
>>> iterations = 10**6
>>> timeit.timeit(r"re.findall(r'\w+', '#header foo bar!')", setup=setup, number=iterations)
3.000092029571533
>>> timeit.timeit("word_pattern.findall('#header foo bar!')", setup=setup, number=iterations)
1.5247418880462646
>>> timeit.timeit(r"re.split(r'\W+', '#header foo bar!')", setup=setup, number=iterations)
3.786440134048462
>>> timeit.timeit("sep_pattern.split('#header foo bar!')", setup=setup, number=iterations)
2.256173849105835
The functional difference is that re.split keeps empty tokens. That’s usually not useful for tokenization purposes, but the following should be identical to the re.findall solution (the list() call is needed on Python 3, where filter returns an iterator):
>>> list(filter(bool, re.split(r'\W+', '#include "header.hpp"')))
['include', 'header', 'hpp']

You can use regex. The \W token will match all non-word characters (which is about the same as non-alphanumeric). Word characters are A-Z, a-z, 0-9, and _. If you want to match underscores as well you could just do [\W_].
>>> import re
>>> line = '# include"header.hpp" '
>>> m = re.match(r'^\s*#\s*include\W+([\w\.]+)\W*$', line)
>>> m.group(1)
'header.hpp'

import re
s = 'foo bar- blah/hm.lala'
print(re.findall(r"\w+",s))
output : ['foo', 'bar', 'blah', 'hm', 'lala']

import re
re.split('[^a-zA-Z0-9]', string_to_split)
for all non-alphanumeric characters

While not exact, most header include directives can be parsed like this:
(?m)^\h*#\h*include\h*["<](\w[\w.]*)\h*[">]
Where, (?m) is multi-line mode, \h is horizontal whitespace (aka [^\S\r\n] ).
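Python's re module does not accept \h, so the pattern above needs the [^\S\r\n] spelling there. A rough translation, assuming the whole source file is available as one string; the names H, INCLUDE_RE and source are only for illustration:

```python
import re

# [^\S\r\n] stands in for \h: any whitespace except line breaks.
H = r'[^\S\r\n]'
INCLUDE_RE = re.compile(
    r'^' + H + r'*#' + H + r'*include' + H + r'*["<](\w[\w.]*)' + H + r'*[">]',
    re.MULTILINE,  # same effect as the inline (?m) flag
)

source = '#include <vector>\n  # include"header.hpp"\nint main() {}\n'
print(INCLUDE_RE.findall(source))
```

Like the original pattern, this only accepts names made of word characters and dots, so a path containing a slash would not match.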

This works:
import re
test_str = ' # include "header.hpp"'
match = re.match(r'\s*#\s*include\s*("[\w.]*")', test_str)
if match:
    print(match.group(1))

Related

Remove n characters after certain character

I have an string that looks something like this:
*45hello I'm a string *2jwith some *plweird things
I need to remove all the * and the 2 chars that follow those * to get this:
hello I'm a string with some weird things
Is there a practical way to do it without iterating over the string?
Thanks!
Using regular expression:
import re
s = "*45hello I'm a string *2jwith some *plweird things"
s = re.sub(r'\*..', '', s)
You can use regex:
import re
regex = r"\*(.{2})"
test_str = "*45hello I'm a string *2jwith some *plweird things"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, '', test_str, 0)
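If the marker character or the number of characters to drop may vary, the same substitution can be wrapped in a small helper. This is only a sketch; the function name and its parameters are invented for illustration:

```python
import re

def drop_marker_and_next(s, marker='*', n=2):
    """Remove each marker together with the n characters that follow it."""
    # re.escape keeps markers such as '*' from being read as regex syntax.
    pattern = re.escape(marker) + '.{%d}' % n
    return re.sub(pattern, '', s)

s = "*45hello I'm a string *2jwith some *plweird things"
print(drop_marker_and_next(s))
```

Note that . does not match a newline by default, and a marker followed by fewer than n characters is left in place.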

Python regular expression string search to return start index

Apologies if this has already been asked. Is there a way in Python 3.x to search for a whole word in a string and return its starting index?
Any help would be greatly appreciated.
Yes, with a regex and word boundary anchors:
>>> import re
>>> s = "rebar bar barbed"
>>> regex = re.compile(r"\bbar\b")
>>> for match in regex.finditer(s):
...     print(match.group(), match.start(), match.end())
...
bar 6 9
The \b anchors make sure that only entire words can match. If you're dealing with non-ASCII words, use re.UNICODE to compile the regex, otherwise \b won't work as expected, at least not in Python 2.
If you just want the first occurrence, you can use re.finditer and next.
s = "foo bar foobar"
import re
m = next(re.finditer(r"\bfoobar\b",s),"")
if m:
    print(m.start())
Or, as @Tim Pietzcker commented, use re.search:
import re
m = re.search(r"\bfoobar\b",s)
if m:
    print(m.start())
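When the word to look for comes from a variable, it is probably safer to escape it before building the pattern. A small sketch along those lines; whole_word_starts is a hypothetical helper name:

```python
import re

def whole_word_starts(text, word):
    """Return the start index of every whole-word occurrence of word."""
    # re.escape guards against regex metacharacters in the search word.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    return [m.start() for m in pattern.finditer(text)]

print(whole_word_starts("rebar bar barbed bar", "bar"))
```

This assumes the word begins and ends with a word character; otherwise the \b anchors will not behave as a whole-word test.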

Why can’t I get rid of the L with this python regular expression?

I’m trying to get rid of the Ls at the ends of integers with a regular expression in python:
import re
s = '3535L sadf ddsf df 23L 2323L'
s = re.sub(r'\w(\d+)L\w', '\1', s)
However, this regex doesn't even change the string. I've also tried s = re.sub(r'\w\d+(L)\w', '', s) since I thought that maybe the L could be captured and deleted, but that didn't work either.
I'm not sure what you're trying to do with those \ws in the first place, but to match a string of digits followed by an L, just use \d+L, and to remove the L you just need to put the \d+ part in a capture group so you can sub it for the whole thing:
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> re.sub(r'(\d+)L', r'\1', s)
'3535 sadf ddsf df 23 2323'
Of course this will also convert, e.g., 123LBQ into 123BQ, but I don't see anything in your examples or in your description of the problem that indicates that this is possible, or which possible result you want for that, so…
\w = [a-zA-Z0-9_]
In other words, \w does not include whitespace characters. Each L is at the end of the word and therefore doesn't have any "word characters" following it. Perhaps you were looking for word boundaries?
re.sub(r'\b(\d+)L\b', r'\1', s)
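A short check of the word-boundary version, wrapped in a function whose name is made up here. It also shows why the replacement should be a raw string: in a normal string literal, '\1' is the control character '\x01', not a backreference.

```python
import re

def strip_long_suffix(s):
    """Drop a trailing L from whole-word integers such as 3535L."""
    # The replacement must be a raw string: r'\1' is a backreference,
    # whereas '\1' is the single control character '\x01'.
    return re.sub(r'\b(\d+)L\b', r'\1', s)

print(strip_long_suffix('3535L sadf 123LBQ 23L'))
```

With the boundaries in place, a token such as 123LBQ is left untouched because there is no word boundary between L and B.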
You can use a lookbehind assertion
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> s = re.sub(r'(?<=\d)L\b', '', s)
>>> s
'3535 sadf ddsf df 23 2323'
(?<=\d)L asserts that the L is preceded by a digit, in which case it is replaced with the empty string ''. (A leading \w in the pattern would consume the last digit and delete it together with the L.)
Try this:
re.sub(r'(?<=\d)L', '', s)
This uses a lookbehind to find an "L" that is preceded by a digit. The pattern has no capture group, so the replacement is simply the empty string.
Why not use a - IMO more readable - generator expression?
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> ' '.join(x.rstrip('L') if x[-1:] =='L' and x[:-1].isdigit() else x for x in s.split())
'3535 sadf ddsf df 23 2323'

How can I remove all the punctuations from a string?

for removing all punctuations from a string, x.
i want to use re.findall(), but i've been struggling to know what to write in it..
i know that i can get all the punctuations by writing:
import string
y = string.punctuation
but if i write:
re.findall(y,x)
it says:
raise error("multiple repeat")
sre_constants.error: multiple repeat
can someone explain what exactly we should write in re.findall function?
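You may not even need a regex for this. You can simply use translate, like this (Python 3 form; in Python 2 the call was data.translate(None, string.punctuation)):
import string
print(data.translate(str.maketrans('', '', string.punctuation)))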
Several characters in string.punctuation have special meaning in regular expressions. They should be escaped.
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> import re
>>> re.escape(string.punctuation)
'\\!\\"\\#\\$\\%\\&\\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\@\\[\\\\\\]\\^\\_\\`\\{\\|\\}\\~'
(This output is from an older Python. Since Python 3.7, re.escape only escapes characters that are special in a regex, so fewer backslashes appear.)
And if you want to match any one of them, use a character class ([...])
>>> '[{}]'.format(re.escape(string.punctuation))
'[\\!\\"\\#\\$\\%\\&\\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\@\\[\\\\\\]\\^\\_\\`\\{\\|\\}\\~]'
>>> import re
>>> pattern = '[{}]'.format(re.escape(string.punctuation))
>>> re.sub(pattern, '', 'Hell,o World.')
'Hello World'
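The two approaches from this thread can be put side by side. This is a sketch for Python 3 only (str.maketrans does not exist in Python 2), and the helper names are invented:

```python
import re
import string

# Translation table that deletes every punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)
# Character class built from the escaped punctuation characters.
_PUNCT_RE = re.compile('[{}]'.format(re.escape(string.punctuation)))

def strip_punct_translate(text):
    return text.translate(_PUNCT_TABLE)

def strip_punct_regex(text):
    return _PUNCT_RE.sub('', text)

sample = 'Hell,o World.'
print(strip_punct_translate(sample))
print(strip_punct_regex(sample))
```

Both cover only the ASCII punctuation listed in string.punctuation.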

python regex find all words in text

This sounds very simple, I know, but for some reason I can't get all the results I need
Word in this case is any char but white-space that is separated with white-space
for example in the following string: "Hello there stackoverflow."
the result should be: ['Hello','there','stackoverflow.']
My code:
import re
word_pattern = "^\S*\s|\s\S*\s|\s\S*$"
result = re.findall(word_pattern,text)
print result
but after using this pattern on a string like the one I've shown, it only puts the first and the last words in the list and not the words separated with two spaces
What is the problem with this pattern?
Use the \b boundary test instead:
r'\b\S+\b'
Result:
>>> import re
>>> re.findall(r'\b\S+\b', 'Hello there StackOverflow.')
['Hello', 'there', 'StackOverflow']
or don't use a regular expression at all and just use .split(); the latter would include the punctuation in a sentence (the regex above did not match the . in the sentence).
to find all words in a string best use split
>>> "Hello there stackoverflow.".split()
['Hello', 'there', 'stackoverflow.']
but if you must use regular expressions, then you should change your regex to something simpler and faster: r'\b\S+\b'.
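r makes the string a 'raw' string, meaning Python does not treat backslashes in it as escape sequences.
\b means a word boundary: a zero-width position between a word character (\w) and a non-word character such as a space, newline, or punctuation (or the start/end of the string next to a word character).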
\S you should know, is any non-whitespace character.
+ means one or more of the previous.
so together it means find all visible sets of characters (words/numbers).
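To see the difference between the two approaches on input with repeated spaces, a quick comparison may help; the sample text below is made up:

```python
import re

text = 'Hello  there   stackoverflow.'

# \S+ grabs each run of non-whitespace, punctuation included;
# this gives the same tokens as text.split().
print(re.findall(r'\S+', text))

# With \b\S+\b the closing \b cannot sit after the final '.',
# so the dot is left out of the last token.
print(re.findall(r'\b\S+\b', text))
```

If the trailing punctuation should be kept, \S+ (or plain split()) is the closer fit to the question's definition of a word.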
How about simply using -
>>> s = "Hello there stackoverflow."
>>> s.split()
['Hello', 'there', 'stackoverflow.']
The other answers are good. Depending on what you want (eg. include/exclude punctuation or other non-word characters) an alternative could be to use a regex to split by one or more whitespace characters:
>>> re.split(r'\s+', 'Hello there StackOverflow.')
['Hello', 'there', 'StackOverflow.']
