Splitting a string into an iterator - python

Does Python have a built-in way (meaning in the standard libraries) to do a split on strings that produces an iterator rather than a list? I have in mind working on very long strings and not needing to consume most of the string.

Not directly splitting strings as such, but the re module has re.finditer() (and a corresponding finditer() method on any compiled regular expression).
@Zero asked for an example:
>>> import re
>>> s = "The quick brown\nfox"
>>> for m in re.finditer(r'\S+', s):
...     print(m.span(), m.group(0))
...
(0, 3) The
(4, 9) quick
(10, 15) brown
(16, 19) fox
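For the original question (splitting on whitespace lazily), yielding each match's text already gives an iterator over the pieces - a minimal sketch built on the same finditer call:
import re

def iter_split_ws(s):
    # Lazily yield the whitespace-separated pieces of s,
    # without ever building the full list.
    for m in re.finditer(r'\S+', s):
        yield m.group(0)

print(list(iter_split_ws("The quick brown\nfox")))  # ['The', 'quick', 'brown', 'fox']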

Like S.Lott, I don't quite know what you want. Here is code that may help:
s = "This is a string."
for character in s:
print character
for word in s.split(' '):
print word
There are also s.index() and s.find() for finding the next character.
Later: Okay, something like this.
>>> def tokenizer(s, c):
...     i = 0
...     while True:
...         try:
...             j = s.index(c, i)
...         except ValueError:
...             yield s[i:]
...             return
...         yield s[i:j]
...         i = j + 1
...
>>> for w in tokenizer(s, ' '):
...     print w
...
This
is
a
string.

If you don't need to consume the whole string, that's because you are looking for something specific, right? Then just look for that, with re or .find() instead of splitting. That way you can find the part of the string you are interested in, and split that.
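For example, a small sketch of that idea (the log string and pattern here are just illustrative):
import re

log = "INFO ok\nWARNING disk almost full\nINFO ok\n" * 1000

# Jump straight to the interesting part with a search, then split
# only that small piece instead of splitting the whole text.
m = re.search(r'^WARNING .*$', log, re.MULTILINE)
if m:
    level, message = m.group(0).split(' ', 1)
    print(level, '->', message)  # WARNING -> disk almost full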

There is no built-in iterator-based analog of str.split. Depending on your needs you could make a list iterator:
iterator = iter("abcdcba".split("b"))
iterator
# <list_iterator at 0x49159b0>
next(iterator)
# 'a'
However, the third-party more_itertools library offers what you want: more_itertools.split_at.
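For example (assuming more_itertools is installed; when given a string, split_at lazily yields lists of characters, so join each chunk back into a string):
from more_itertools import split_at

pieces = split_at("abcdcba", lambda ch: ch == "b")  # lazy: yields lists of characters
print([''.join(p) for p in pieces])                 # ['a', 'cdc', 'a']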

Here's an isplit function, which behaves much like split() - you can turn off the regex syntax with the regex argument. It uses re.finditer() and yields the strings in between the matches.
import re

def isplit(s, splitter=r'\s+', regex=True):
    if not regex:
        splitter = re.escape(splitter)
    start = 0
    for m in re.finditer(splitter, s):
        begin, end = m.span()
        if begin != start:
            yield s[start:begin]
        start = end
    if s[start:]:
        yield s[start:]

_examples = ['', 'a', 'a b', ' a b c ', '\na\tb ']

def test_isplit():
    for example in _examples:
        assert list(isplit(example)) == example.split(), 'Wrong for {!r}: {} != {}'.format(
            example, list(isplit(example)), example.split()
        )
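For example, a quick check against str.split semantics:
>>> list(isplit('  a  b  c '))
['a', 'b', 'c']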

Look at itertools. It contains things like takewhile, islice and groupby that allow you to slice an iterable - a string is iterable - into another iterable based on either indexes or a boolean condition of sorts.
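For instance, groupby can split on whitespace lazily (a sketch of the idea, not a drop-in replacement for str.split):
from itertools import groupby

s = "The quick brown\nfox"
# Group consecutive characters by "is it whitespace?", keep the non-whitespace
# runs, and join each run back into a string - all lazily.
words = (''.join(run) for is_space, run in groupby(s, key=str.isspace) if not is_space)
print(list(words))  # ['The', 'quick', 'brown', 'fox']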

You could use something like SPARK (which has been absorbed into the Python distribution itself, though it is not importable from the standard library), but ultimately it uses regular expressions as well, so Duncan's answer would possibly serve you just as well if it were as easy as just "splitting on whitespace".
The other, far more arduous option would be to write your own Python module in C to do it if you really wanted speed, but that's a far larger time investment, of course.

Related

Efficient way of matching and replacing multiple strings in python 3?

I have multiple (>30) compiled regexes
regex_1 = re.compile(...)
regex_2 = re.compile(...)
#... define multiple regex's
regex_n = re.compile(...)
I then have a function which takes a text and replaces some of its words using every one of the regexes above and the re.sub method, as follows:
def sub_func(text):
    text = re.sub(regex_1, "string_1", text)
    # multiple substitutions using all regexes ...
    text = re.sub(regex_n, "string_n", text)
    return text
Question: Is there a more efficient way to make these replacements?
The regexes cannot be generalized or simplified from their current form.
I feel like reassigning the value of text each time for every regex is quite slow, given that the function only replaces a word or two from the entirety of text for each reassignment. Also, given that I have to do this for multiple documents, that slows things down even more.
Thanks in advance!
Reassigning a value takes constant time in Python. Unlike in languages like C, variables are more of a "name tag". So, changing what the name tag points to takes very little time.
If they are constant strings, I would collect them into a tuple:
regexes = (
    (regex_1, 'string_1'),
    (regex_2, 'string_2'),
    (regex_3, 'string_3'),
    ...
)
And then in your function, just iterate over the list:
def sub_func_2(text):
    for regex, sub in regexes:
        text = re.sub(regex, sub, text)
    return text
But if your regexes are actually named regex_1, regex_2, etc., they probably should be directly defined in a list of some sort.
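For instance, a hypothetical sketch of defining the pairs directly (the patterns here are made up for illustration, assuming re is imported as above):
regexes = (
    (re.compile(r'\bcolour\b'), 'color'),
    (re.compile(r'\bfavourite\b'), 'favorite'),
)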
Also note, if you are doing replacements like 'cat' -> 'dog', the str.replace() method might be easier (text = text.replace('cat', 'dog')), and it will probably be faster.
If your strings are very long, re-making them from scratch with all the regexes might take very long. An implementation of @Oliver Charlesworth's method that was mentioned in the comments could be:
# Instead of this:
regexes = (
    ('1(1)', '$1i'),
    ('2(2)(2)', '$1a$2'),
    ('(3)(3)3', '$1a$2')
)
# Merge the regexes:
regex = re.compile('(1(1))|(2(2)(2))|((3)(3)3)')
substitutions = (
    '{1}i', '{1}a{2}', '{1}a{2}'
)
# Keep track of how many groups are in each alternative
group_nos = (1, 2, 2)
cumulative = [1]
for i in group_nos:
    cumulative.append(cumulative[-1] + i + 1)
del i
# Pair each substitution with the first group number of its alternative
# and the first group number of the next alternative
cumulative = tuple(zip(substitutions, cumulative, cumulative[1:]))

def _sub_func(match):
    for sub, start, end in cumulative:
        if match.group(start) is not None:
            return sub.format(*map(match.group, range(start, end)))

def sub_func(text):
    return re.sub(regex, _sub_func, text)
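A quick check of the merged version (each alternative produces what its original pattern/replacement pair would have produced):
>>> sub_func('11 222 333')
'1i 2a2 3a3'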
But this breaks down if you have overlapping text that you need to substitute.
We can pass a function as the repl argument of re.sub.
To simplify, assume there are only 3 regexes, and that regex_1, regex_2 and regex_3 match 111, 222 and 333 respectively. Then regex_replace is the list holding the strings used for the replacements, in the same order as regex_1, regex_2 and regex_3:
regex_1 will be replaced with 'one',
regex_2 with 'two', and so on.
Not sure how much this will improve the runtime though; give it a try:
import re

regex_x = re.compile('(111)|(222)|(333)')
regex_replace = ['one', 'two', 'three']

def sub_func(text):
    return re.sub(regex_x, lambda x: regex_replace[x.lastindex - 1], text)

>>> sub_func('testing 111 222 333')
'testing one two three'

Python replace function [replace once]

I need help with a program I'm making in Python.
Assume I wanted to replace every instance of the word "steak" to "ghost" (just go with it...) but I also wanted to replace every instance of the word "ghost" to "steak" at the same time. The following code does not work:
s="The scary ghost ordered an expensive steak"
print s
s=s.replace("steak","ghost")
s=s.replace("ghost","steak")
print s
it prints: The scary steak ordered an expensive steak
What I'm trying to get is The scary steak ordered an expensive ghost
I'd probably use a regex here:
>>> import re
>>> s = "The scary ghost ordered an expensive steak"
>>> sub_dict = {'ghost':'steak','steak':'ghost'}
>>> regex = '|'.join(sub_dict)
>>> re.sub(regex, lambda m: sub_dict[m.group()], s)
'The scary steak ordered an expensive ghost'
Or, as a function which you can copy/paste:
import re

def word_replace(replace_dict, s):
    regex = '|'.join(replace_dict)
    return re.sub(regex, lambda m: replace_dict[m.group()], s)
Basically, I create a mapping of words that I want to replace with other words (sub_dict). I can create a regular expression from that mapping. In this case, the regular expression is "steak|ghost" (or "ghost|steak" -- order doesn't matter) and the regex engine does the rest of the work of finding non-overlapping sequences and replacing them accordingly.
Some possibly useful modifications
regex = '|'.join(map(re.escape,replace_dict)) -- Allows the words to contain characters with special regular expression meaning (like parentheses). This escapes the special characters so the regular expression matches the literal text.
regex = '|'.join(r'\b{0}\b'.format(x) for x in replace_dict) -- make sure that we don't match if one of our words is a substring in another word. In other words, change he to she but not the to tshe.
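Putting both modifications together (a small sketch; word_replace_exact is just an illustrative name, not part of the original answer):
import re

def word_replace_exact(replace_dict, s):
    # Escape each key and anchor it on word boundaries, so that
    # 'he' -> 'she' does not also turn 'the' into 'tshe'.
    regex = '|'.join(r'\b{}\b'.format(re.escape(k)) for k in replace_dict)
    return re.sub(regex, lambda m: replace_dict[m.group()], s)

print(word_replace_exact({'he': 'she', 'she': 'he'}, "he said she said"))
# she said he said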
Split the string by one of the targets, do the replace, and put the whole thing back together.
pieces = s.split('steak')
s = 'ghost'.join(piece.replace('ghost', 'steak') for piece in pieces)
This works exactly as .replace() would, including ignoring word boundaries. So it will turn "steak ghosts" into "ghost steaks".
Rename one of the words to a temp value that doesn't occur in the text. Note this wouldn't be the most efficient way for a very large text. For that a re.sub might be more appropriate.
s="The scary ghost ordered an expensive steak"
print s
s=s.replace("steak","temp")
s=s.replace("ghost","steak")
S=s.replace("temp","steak")
print s
Use the count variable in the string.replace() method. So using your code, you would have:
s="The scary ghost ordered an expensive steak"
print s
s=s.replace("steak","ghost", 1)
s=s.replace("ghost","steak", 1)
print s
http://docs.python.org/2/library/stdtypes.html
How about something like this? Store the original in a split list, then have a translation dict. Keeps your core code short, then just adjust the dict when you need to adjust the translation. Plus, easy to port to a function:
def translate_line(s, translation_dict):
    line = []
    for i in s.split():
        # To account for punctuation, strip all non-alnum characters from the
        # word before looking up the translation.
        i = ''.join(ch for ch in i if ch.isalnum())
        line.append(translation_dict.get(i, i))
    return ' '.join(line)
>>> translate_line("The scary ghost ordered an expensive steak", {'steak': 'ghost', 'ghost': 'steak'})
'The scary steak ordered an expensive ghost'
Note: Considering the viewership of this question, I undeleted and rewrote it for different types of test cases.
I have considered four competing implementations from the answers:
>>> import re
>>> from uuid import uuid4
>>> from timeit import Timer
>>> def sub_noregex(hay):
    """
    The join-and-replace routine which outperforms the regex implementation. This
    version uses a generator expression
    """
    return 'steak'.join(e.replace('steak','ghost') for e in hay.split('ghost'))

>>> def sub_regex(hay):
    """
    This is a straightforward regex implementation as suggested by @mgilson.
    Note, so that the overheads don't add to the cumulative sum, I have placed
    the regex creation routine outside the function
    """
    return re.sub(regex, lambda m: sub_dict[m.group()], hay)

>>> def sub_temp(hay, _uuid=str(uuid4())):
    """
    Similar to Mark Tolonen's implementation, but uses a uuid for the temporary
    string value to reduce collisions
    """
    hay = hay.replace("steak", _uuid).replace("ghost", "steak").replace(_uuid, "ghost")
    return hay

>>> def sub_noregex_LC(hay):
    """
    The join-and-replace routine which outperforms the regex implementation. This
    version uses a list comprehension
    """
    return 'steak'.join([e.replace('steak','ghost') for e in hay.split('ghost')])
A generalized timeit function
>>> def compare(n, hay):
    foo = {"sub_regex": "re",
           "sub_noregex": "",
           "sub_noregex_LC": "",
           "sub_temp": "",
           }
    stmt = "{}(hay)"
    setup = "from __main__ import hay,"
    for k, v in foo.items():
        t = Timer(stmt=stmt.format(k), setup=setup + ','.join([k, v] if v else [k]))
        yield t.timeit(n)
And the generalized test routine
>>> def test(*args, **kwargs):
    n = kwargs['repeat']
    print "{:50}{:^15}{:^15}{:^15}{:^15}".format("Test Case", "sub_temp",
                                                 "sub_noregex ", "sub_regex",
                                                 "sub_noregex_LC ")
    for hay in args:
        hay, hay_str = hay
        print "{:50}{:15.10}{:15.10}{:15.10}{:15.10}".format(hay_str, *compare(n, hay))
And the Test Results are as follows
>>> test((' '.join(['steak', 'ghost']*1000), "Multiple repetitions of search key"),
         ('garbage '*998 + 'steak ghost', "Single repetition of search key at the end"),
         ('steak ' + 'garbage '*998 + 'ghost', "Single repetition at either end"),
         ("The scary ghost ordered an expensive steak", "Single repetition in a smaller string"),
         repeat = 100000)
Test Case                                             sub_temp      sub_noregex      sub_regex    sub_noregex_LC
Multiple repetitions of search key                    0.2022748797   0.3517142003   0.4518992298   0.1812594258
Single repetition of search key at the end            0.2026047957   0.3508259952   0.4399926194   0.1915298898
Single repetition at either end                       0.1877455356   0.3561734007   0.4228843986   0.2164233388
Single repetition in a smaller string                 0.2061019057   0.3145984487   0.4252060592   0.1989413449
>>>
Based on the test results:
The non-regex LC version and the temp-variable substitution have better performance, though the performance of the temp-variable approach is not consistent.
The LC version has better performance compared to the generator (confirmed).
Regex is more than two times slower (so if this piece of code is a bottleneck, the implementation change can be reconsidered).
The regex and non-regex versions are equivalently robust and can scale.

Applying a Regex to a Substring Without using String Slice

I want to search for a regex match in a larger string from a certain position onwards, and without using string slices.
My background is that I want to search through a string iteratively for matches of various regex's. A natural solution in Python would be keeping track of the current position within the string and using e.g.
re.match(regex, largeString[pos:])
in a loop. But for really large strings (~ 1MB) string slicing as in largeString[pos:] becomes expensive. I'm looking for a way to get around that.
Side note: Funnily, in a niche of the Python documentation, it talks about an optional pos parameter to the match function (which would be exactly what I want), which is not to be found with the functions themselves :-).
The variants with pos and endpos parameters only exist as members of regular expression objects. Try this:
import re
pattern = re.compile("match here")
input = "don't match here, but do match here"
start = input.find(",")
print pattern.search(input, start).span()
... outputs (25, 35)
The pos keyword is only available in the method versions. For example,
re.match("e+", "eee3", pos=1)
is invalid, but
pattern = re.compile("e+")
pattern.match("eee3", pos=1)
works.
>>> import re
>>> m=re.compile ("(o+)")
>>> m.match("oooo").span()
(0, 4)
>>> m.match("oooo",2).span()
(2, 4)
You could also use positive lookbehinds, like so:
import re
test_string = "abcabdabe"
position=3
a = re.search("(?<=.{" + str(position) + "})ab[a-z]",test_string)
print a.group(0)
yields:
abd

An elegant way to get hashtags out of a string in Python?

I'm looking for a clean way to get a set (list, array, whatever) of words starting with # inside a given string.
In C#, I would write
var hashtags = input
.Split (' ')
.Where (s => s[0] == '#')
.Select (s => s.Substring (1))
.Distinct ();
What is comparatively elegant code to do this in Python?
EDIT
Sample input: "Hey guys! #stackoverflow really #rocks #rocks #announcement"
Expected output: ["stackoverflow", "rocks", "announcement"]
With @inspectorG4dget's answer, if you want no duplicates, you can use set comprehensions instead of list comprehensions.
>>> tags="Hey guys! #stackoverflow really #rocks #rocks #announcement"
>>> {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
set(['announcement', 'rocks', 'stackoverflow'])
Note that { } syntax for set comprehensions only works starting with Python 2.7.
If you're working with older versions, feed the list comprehension ([ ]) output to the set function as suggested by @Bertrand.
[i[1:] for i in line.split() if i.startswith("#")]
This version will get rid of any empty strings (as I have read such concerns in the comments) and strings that are only "#". Also, as in Bertrand Marron's code, it's better to turn this into a set as follows (to avoid duplicates and for O(1) lookup time):
set([i[1:] for i in line.split() if i.startswith("#")])
the findall method of regular expression objects can get them all at once:
>>> import re
>>> s = "this #is a #string with several #hashtags"
>>> pat = re.compile(r"#(\w+)")
>>> pat.findall(s)
['is', 'string', 'hashtags']
>>>
I'd say
hashtags = [word[1:] for word in input.split() if word[0] == '#']
Edit: this will create a set without any duplicates.
set(hashtags)
There are some problems with the answers presented here.
1. These approaches:
{tag.strip("#") for tag in tags.split() if tag.startswith("#")}
[i[1:] for i in line.split() if i.startswith("#")]
won't work if you have a hashtag like '#one#two#'.
2. re.compile(r"#(\w+)") won't work for many Unicode languages (even using re.UNICODE).
I had seen more ways to extract hashtags, but found none of them answering all cases, so I wrote some small Python code to handle most of the cases. It works for me.
def get_hashtagslist(string):
    ret = []
    s = ''
    hashtag = False
    for char in string:
        if char == '#':
            hashtag = True
            if s:
                ret.append(s)
                s = ''
            continue
        # Take only the prefix of the hashtag if it contains one of these chars
        # (e.g. for '#happy,but i..' it takes only 'happy')
        if hashtag and char in [' ', '.', ',', '(', ')', ':', '{', '}'] and s:
            ret.append(s)
            s = ''
            hashtag = False
        if hashtag:
            s += char
    if s:
        ret.append(s)
    return set(ret)
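For example, a quick check of the function above:
>>> sorted(get_hashtagslist("i'm #happy today, #one#two"))
['happy', 'one', 'two']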
Another option is a regex:
import re
inputLine = "Hey guys! #stackoverflow really #rocks #rocks #announcement"
re.findall(r'(?i)\#\w+', inputLine)       # will include the #
re.findall(r'(?i)(?<=\#)\w+', inputLine)  # will not include the #

Find out number of capture groups in Python regular expressions

Is there a way to determine how many capture groups there are in a given regular expression?
I would like to be able to do the following:
def groups(regexp, s):
    """ Returns the first result of re.findall, or an empty default
    >>> groups(r'(\d)(\d)(\d)', '123')
    ('1', '2', '3')
    >>> groups(r'(\d)(\d)(\d)', 'abc')
    ('', '', '')
    """
    import re
    m = re.search(regexp, s)
    if m:
        return m.groups()
    return ('',) * num_of_groups(regexp)
This allows me to do stuff like:
first, last, phone = groups(r'(\w+) (\w+) ([\d\-]+)', 'John Doe 555-3456')
However, I don't know how to implement num_of_groups. (Currently I just work around it.)
EDIT: Following the advice from rslite, I replaced re.findall with re.search.
sre_parse seems like the most robust and comprehensive solution, but requires tree traversal and appears to be a bit heavy.
MizardX's regular expression seems to cover all bases, so I'm going to go with that.
import re

def num_groups(regex):
    return re.compile(regex).groups
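A quick check against the pattern from the question:
>>> num_groups(r'(\w+) (\w+) ([\d\-]+)')
3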
f_x = re.search(...)
len_groups = len(f_x.groups())
Something from inside sre_parse might help.
At first glance, maybe something along the lines of:
>>> import sre_parse
>>> sre_parse.parse('(\d)\d(\d)')
[('subpattern', (1, [('in', [('category', 'category_digit')])])),
('in', [('category', 'category_digit')]),
('subpattern', (2, [('in', [('category', 'category_digit')])]))]
I.e. count the items of type 'subpattern':
import sre_parse

def count_patterns(regex):
    """
    >>> count_patterns('foo: \d')
    0
    >>> count_patterns('foo: (\d)')
    1
    >>> count_patterns('foo: (\d(\s))')
    1
    """
    parsed = sre_parse.parse(regex)
    return len([token for token in parsed if token[0] == 'subpattern'])
Note that we're only counting root-level patterns here, so the last example only returns 1. To change this, tokens would need to be searched recursively.
First of all, if you only need the first result of re.findall, it's better to just use re.search, which returns a match or None.
For the number of groups, you could count the number of opening parentheses '(', except those that are escaped by '\'. You could use another regex for that:
import re

def num_of_groups(regexp):
    rg = re.compile(r'(?<!\\)\(')
    return len(rg.findall(regexp))
Note that this doesn't work if the regex contains non-capturing groups and also if '(' is escaped by using it as '[(]'. So this is not very reliable. But depending on the regexes that you use it might help.
Using your code as a basis:
def groups(regexp, s):
    """ Returns the first result of re.findall, or an empty default
    >>> groups(r'(\d)(\d)(\d)', '123')
    ('1', '2', '3')
    >>> groups(r'(\d)(\d)(\d)', 'abc')
    ('', '', '')
    """
    import re
    m = re.search(regexp, s)
    if m:
        return m.groups()
    # m is None here, so count the groups from the compiled pattern instead
    return ('',) * re.compile(regexp).groups
Might be wrong, but I don't think there is a way to find the number of groups that would have been returned had the regex matched. The only way I can think of to make this work the way you want it to is to pass the number of matches your particular regex expects as an argument.
To clarify though: When findall succeeds, you only want the first match to be returned, but when it fails you want a list of empty strings? Because the comment seems to show all matches being returned as a list.
