I would like to cut list elements at any of several chosen delimiters at once: '-', ',' and ':'.
I have an example list:
list_1 = ['some text – some another', 'some text, some another', 'some text: some another']
I'd like to cut the list elements (strings in this case) so that I get the following output:
splitted_list = ['some text', 'some text', 'some text']
I already tried with split() but it only takes 1 delimiter at a time:
splited_list = [i.split(',', 1)[0] for i in list_1]
I would prefer something that is easier for me to understand and that lets me decide which delimiters to use. For example, I don't want to cut the string after '-' but after ' - ' (with the surrounding spaces).
List of delimiters:
': ', ' - ', ', '
Note that '-' has a space before and after it, while ':' and ',' have a space only after.
You may use this regex in re.sub and replace it with an empty string:
\s*[^\w\s].*
This matches zero or more whitespace characters, followed by a character that is neither whitespace nor a word character, and everything after it.
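As a rough illustration of how that pattern behaves on the question's list, here is a minimal sketch. The names `pattern` and `trimmed` are only illustrative, and note that this cuts at the first punctuation character of any kind, not just the three delimiters from the question.

```python
import re

list_1 = ['some text – some another', 'some text, some another', 'some text: some another']

# Optional whitespace, then the first character that is neither a word
# character nor whitespace, then everything up to the end of the string.
pattern = re.compile(r'\s*[^\w\s].*')

trimmed = [pattern.sub('', item) for item in list_1]
print(trimmed)
```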
import re
list_1 = ['some text – some another', 'some text, some another', 'some text: some another']
delims = [',', ':', ' –']
delimre = '(' + '|'.join(delims) + r')\s.*'
splited_list = [re.sub(delimre, '', i) for i in list_1]
print(splited_list)
Output:
['some text', 'some text', 'some text']
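If the delimiters need to be matched exactly, including their surrounding spaces, one possible sketch is to escape each delimiter and split once. The helper name `cut_at_first_delimiter` and the sample strings are hypothetical, chosen only to show that a bare hyphen inside a word is left alone.

```python
import re

def cut_at_first_delimiter(text, delimiters):
    """Return the part of text before the first occurrence of any delimiter."""
    # re.escape makes every delimiter match literally, spaces included.
    pattern = '|'.join(re.escape(d) for d in delimiters)
    return re.split(pattern, text, maxsplit=1)[0]

delimiters = [' - ', ', ', ': ']
samples = [
    'some text - some another',
    'some text, some another',
    'some text: some another',
    'well-known text, some another',  # the bare '-' is not a delimiter here
]
results = [cut_at_first_delimiter(t, delimiters) for t in samples]
print(results)
```

Because the delimiter list is plain data, changing which delimiters apply only means editing `delimiters`.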
The answer to the question at Python remove all whitespace in a string shows separate ways to remove leading/ending, duplicated, and all spaces, respectively, from a string in Python. But strip() removes tabs and newlines, and lstrip() only affects leading spaces. The solution using .join(sentence.split()) also appears to remove Unicode whitespace characters.
Suppose I have a string, in this case scraped from a website using Scrapy, like this:
['\n \n ',
'\n ',
'Some text',
' and some more text\n',
' and on another a line some more text',
' ']
The newlines preserve the formatting of the text when I use it in other contexts, but all the extra space is a nuisance. How do I remove all the leading, trailing, and duplicated internal spaces while preserving the newline characters (in addition to any \r or \t characters, if there are any)?
The result I want (after I join the individual strings) would then be:
['\n\n\nSome text and some more text\nand on another line some more text']
No sample code is provided because what I've tried so far is just the suggestions on the page referenced above, which gets the results I'm trying to avoid.
In that case str.strip() won't help you (even if you use " " as an argument), because it won't remove the spaces inside the string, only those at its start and end, and it would remove the single space before "and" as well.
Instead, use regex to remove 2 or more spaces from your strings:
l= ['\n \n ',
'\n ',
'Some text',
' and some more text\n',
' and on another a line some more text']
import re
result = "".join([re.sub(" {2,}", "", x) for x in l])  # remove runs of 2 or more spaces
print(repr(result))
prints:
'\n\n\nSome text and some more text\n and on another a line some more text'
EDIT: if we apply the regex to each line, we cannot detect \n in some cases, as you noted. So, the alternate and more complex solution would be to join the strings before applying regex, and apply a more complex regex (note that I changed the test list of strings to add more corner cases):
l= ['\n \n ',
'\n ',
'Some text',
' and some more text \n',
'\n and on another a line some more text ']
import re
result = re.sub("(^ |(?<=\n) | {2,}| (?=\n)| $)", "", "".join(l))
print(repr(result))
prints:
'\n\n\nSome text and some more text\n\nand on another a line some more text'
The regex now removes five cases:
a single space at the start of the string
a space following a newline
2 or more consecutive spaces
a space followed by a newline
a single space at the end of the string
Afterthought: this looks (and is) complicated. There is a non-regex solution after all, which gives exactly the same result (if there aren't multiple spaces between words):
result = "\n".join([x.strip(" ") for x in "".join(l).split("\n")])
print(repr(result))
Just join the strings, split on newlines, apply strip with " " as the argument to preserve tabs, and join again with newlines.
Chain with re.sub(" +"," ",x.strip(" ")) to take care of possible double spaces between words:
result = "\n".join([re.sub(" +"," ",x.strip(" ")) for x in "".join(l).split("\n")])
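Put together, the join / split / strip / collapse steps could be wrapped up roughly as follows. This is only a sketch: the function name `normalize_spaces` and the sample input (with deliberately exaggerated spacing and one tab) are assumptions, not part of the original question.

```python
import re

def normalize_spaces(parts):
    """Join parts, then clean spaces line by line, keeping \n, \r and \t."""
    cleaned = []
    for line in "".join(parts).split("\n"):
        line = line.strip(" ")            # only spaces, so tabs survive
        line = re.sub(" +", " ", line)    # collapse runs of spaces to one
        cleaned.append(line)
    return "\n".join(cleaned)

parts = ['\n   \n  ',
         '\n  ',
         'Some text',
         '  and   some more text\n',
         '  and on another line\tsome more text  ']
result = normalize_spaces(parts)
print(repr(result))
```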
You can also do the whole thing with built-in string operations if you like.
l = ['\n \n ',
'\n ',
'Some text',
' and some more text\n',
' and on another a line some more text',
' ']
def remove_duplicate_spaces(line):
    # splitting on ' ' yields empty strings for repeated spaces; drop them
    words = [w for w in line.split(' ') if w != '']
    return ' '.join(words)
lines = ''.join(l).split('\n')
formatted_lines = map(remove_duplicate_spaces, lines)
u = "\n".join(formatted_lines)
print(repr(u))
gives
'\n\n\nSome text and some more text\nand on another a line some more text'
You can also collapse the whole thing into a one-liner:
s = '\n'.join([' '.join([s for s in x.strip(' ').split(' ') if s!='']) for x in ''.join(l).split('\n')])
# OR
t = '\n'.join(map(lambda x: ' '.join(filter(lambda s: s!='', x.strip(' ').split(' '))), ''.join(l).split('\n')))
I'd like to do something like this:
import re
s = 'This is a test'
re.split('(?<= )', s)
and get back something like this:
['This ', 'is ', 'a ', 'test']
but that doesn't work.
Can anybody suggest a simple way to split a string based on a regular expression (my actual code is more complicated and does require a regex) without discarding any content?
The purpose of re.split() is to define a delimiter to split by. While you will find other answers that can actually make your case work, I sense that you would be happier with something like re.findall():
re.findall(r'(\S+\s*)', s)
gives you
['This ', 'is ', 'a ', 'test']
You can use the third-party regex module here.
import regex
s = 'This is a test'
print(regex.split('(?<= )', s, flags=regex.VERSION1))
Output:
['This ', 'is ', 'a ', 'test']
or
import re
s = 'This is a test'
print([i for i in re.split(r'(\w+\s+)', s) if i])
Note: before Python 3.7, re.split() in the re module did not split on zero-width matches such as lookaround assertions.
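For reference, on Python 3.7 or newer the standard re module does split on empty matches, so the lookbehind from the question can be used directly. A minimal sketch, assuming such an interpreter:

```python
import re

s = 'This is a test'

# The lookbehind matches the empty position right after each space,
# so nothing is consumed and every space stays with the preceding word.
parts = re.split(r'(?<= )', s)
print(parts)
```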
Why not just use re.findall?
re.findall(r"(\w+\s*)", s)
Capture the delimiter and then rejoin the delimiter to the previous word:
>>> it = iter(re.split('( )', s)+[''])
>>> [word+delimiter for word, delimiter in zip(it, it)]
['This ', 'is ', 'a ', 'test']
Split on at least one word character followed by at least one space:
[i for i in re.split(r'(\w+ +)', s) if i] # ['This ', 'is ', 'a ', 'test']
I have a large file that I need to load into a list of strings. Each element should contain the text up to a ',' that immediately follows a number.
For example:
this is some text, value 45789, followed by, 1245, and more text 78965, more random text 5252,
this should become:
["this is some text, value 45789", "followed by, 1245", "and more text 78965", "more random text 5252"]
I'm currently doing re.sub(r'([0-9]+),','~', <input-string>) and then splitting on '~' (since my file doesn't contain ~), but this throws out the numbers before the commas. Any thoughts?
You can use re.split with a positive look-behind assertion:
>>> import re
>>>
>>> text = 'this is some text, value 45789, followed by, 1245, and more text 78965, more random text 5252,'
>>> re.split(r'(?<=\d),', text)
['this is some text, value 45789',
' followed by, 1245',
' and more text 78965',
' more random text 5252',
'']
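The result above still carries leading spaces and a trailing empty string. One way to tidy that up, sketched here with the illustrative names `raw_parts` and `parts`, is to strip each piece and drop the empty ones:

```python
import re

text = ('this is some text, value 45789, followed by, 1245, '
        'and more text 78965, more random text 5252,')

# Split only on commas that directly follow a digit; the lookbehind
# keeps the digits in the result.
raw_parts = re.split(r'(?<=\d),', text)

# Strip surrounding whitespace and discard pieces that end up empty.
parts = [p.strip() for p in raw_parts if p.strip()]
print(parts)
```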
If you want it to deal with spaces as well, do this:
string = " blah, lots , of , spaces, here "
pattern = re.compile(r"^\s+|\s*,\s*|\s+$")
result = [x for x in pattern.split(string) if x]
print(result)
# ['blah', 'lots', 'of', 'spaces', 'here']
My program needs to split my natural language text into sentences. I made a mock sentence splitter using re.split in Python 3+. It looks like this:
re.split(r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
I need to split the text at the whitespace when the pattern occurs. But the code, as it should, splits the text wherever the whole pattern matches, not just at the whitespace, so the last character of the sentence and the sentence terminator are lost, along with the first character of the next sentence.
"Is this the number 3? The text goes on..."
will look like
"Is this the number " and "he text goes on..."
Is there a way I can specify at which point the data should be split while keeping my patterns or do I have to look for alternatives?
As @jonrsharpe says, you can use lookaround to reduce the number of characters consumed by the split, for instance to a single one. If you don't mind losing the space characters, you could use something like:
>>> re.split(r'\s(?=[A-Z])',content)
['Is this the number 3?', 'The text goes on...']
This splits on whitespace that is followed by an uppercase letter. The T is not consumed, only the space.
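To also require a sentence terminator before the space, a lookbehind can be combined with the lookahead. This is only a sketch under simplifying assumptions: it treats any '.', '!' or '?' followed by whitespace and an uppercase letter as a sentence boundary, so abbreviations such as "Dr. Smith" would be split incorrectly. The sample text extends the question's example with one extra sentence.

```python
import re

content = "Is this the number 3? The text goes on... And it ends here."

# Split on whitespace that is preceded by a terminator and followed by
# an uppercase letter; both lookarounds are zero-width, so only the
# whitespace itself is removed.
sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', content)
print(sentences)
```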
Alternative approach: alternating split/capture item
You can, however, use another approach. When you split, the matched content is consumed, but you can use the same regex to generate a list of the matches. These matches are the data that sat between the split items. By interleaving the matches with the split items, you reconstruct the full list:
import re
def nonconsumesplit(regex, content):
    outer = re.split(regex, content)            # text between the matches
    inner = re.findall(regex, content) + ['']   # the matches, padded by one
    return [val for pair in zip(outer, inner) for val in pair]
Which results in:
>>> nonconsumesplit('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]',content)
['Is this the number ', '3? ', 'The text goes on...', '']
>>> list(nonconsumesplit('\s',content))
['Is', ' ', 'this', ' ', 'the', ' ', 'number', ' ', '3?', ' ', 'The', ' ', 'text', ' ', 'goes', ' ', 'on...', '']
Or you can use a string concatenation:
def nonconsumesplitconcat(regex, content):
    outer = re.split(regex, content)            # text between the matches
    inner = re.findall(regex, content) + ['']   # the matches, padded by one
    return [pair[0] + pair[1] for pair in zip(outer, inner)]
Which results in:
>>> nonconsumesplitconcat('\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]',content)
['Is this the number 3? ', 'The text goes on...']
>>> nonconsumesplitconcat('\s',content)
['Is ', 'this ', 'the ', 'number ', '3? ', 'The ', 'text ', 'goes ', 'on...']
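A variant of the same idea walks over the matches with re.finditer and slices the original string, which avoids pairing two separately computed lists. This is a sketch rather than a drop-in replacement: the name `split_keep_delimiters` is hypothetical, and skipping empty matches (such as the `$` alternative) is a design choice made here.

```python
import re

def split_keep_delimiters(pattern, content):
    """Split content after each non-empty match, keeping the match text."""
    pieces = []
    last = 0
    for m in re.finditer(pattern, content):
        if m.start() == m.end():
            continue  # ignore empty matches, e.g. from a bare `$`
        pieces.append(content[last:m.end()])
        last = m.end()
    if last < len(content):
        pieces.append(content[last:])  # trailing text after the last match
    return pieces

content = "Is this the number 3? The text goes on..."
by_space = split_keep_delimiters(r'\s', content)
by_sentence = split_keep_delimiters(
    r'\D[.!?]\s[A-Z]|$|\d[!?]\s|\d[.]\s[A-Z]', content)
print(by_space)
print(by_sentence)
```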
I have a piece of code which splits a string after commas and dots (but not when a digit is before or after a comma or dot):
text = "This is, a sample text. Some more text. $1,200 test."
print(re.split(r'(?<!\d)[,.]|[,.](?!\d)', text))
The result is:
['This is', ' a sample text', ' Some more text', ' $1,200 test', '']
I don't want to lose the commas and dots. So what I am looking for is:
['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']
Besides, if there is a dot at the end of the text, it produces an empty string at the end of the list. Furthermore, there is whitespace at the beginning of the split strings. Is there a better method without using re? How would you do this?
In Python versions before 3.7, re.split() does not split on a zero-length match, so unless you can guarantee that there will be whitespace after the comma or dot you will need to use a different approach.
Here is one option that uses re.findall():
>>> text = "This is, a sample text. Some more text. $1,200 test."
>>> print(re.findall(r'(?:\d[,.]|[^,.])*(?:[,.]|$)', text))
['This is,', ' a sample text.', ' Some more text.', ' $1,200 test.', '']
This doesn't strip whitespace and you will get an empty match at the end if the string ends with a comma or dot, but those are pretty easy fixes.
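Those fixes might look like the following sketch, which strips each match and drops the ones that end up empty (the variable names `matches` and `parts` are only illustrative):

```python
import re

text = "This is, a sample text. Some more text. $1,200 test."

# Each match is a run of text ending in a comma or dot; a comma or dot
# directly after a digit is treated as part of the run.
matches = re.findall(r'(?:\d[,.]|[^,.])*(?:[,.]|$)', text)

# Remove surrounding whitespace and discard the empty trailing match.
parts = [m.strip() for m in matches if m.strip()]
print(parts)
```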
If it is a safe assumption that there will be whitespace after every comma and dot you want to split on, then we can just split the string on that whitespace which makes it a little simpler:
>>> print(re.split(r'(?<=[,.])(?<!\d.)\s', text))
['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']