I have a large file that I need to load into a list of strings. Each element should contain the text up to a ',' that immediately follows a number.
For example:
this is some text, value 45789, followed by, 1245, and more text 78965, more random text 5252,
This should become:
["this is some text, value 45789", "followed by, 1245", "and more text 78965", "more random text 5252"]
I am currently doing re.sub(r'([0-9]+),', '~', <input-string>) and then splitting on '~' (since my file doesn't contain '~'), but this throws away the numbers before the commas. Any thoughts?
You can use re.split with a positive look-behind assertion:
>>> import re
>>>
>>> text = 'this is some text, value 45789, followed by, 1245, and more text 78965, more random text 5252,'
>>> re.split(r'(?<=\d),', text)
['this is some text, value 45789',
' followed by, 1245',
' and more text 78965',
' more random text 5252',
'']
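To get exactly the list from the question, the pieces can additionally be stripped and the empty trailing element dropped. A minimal sketch, assuming the whole input fits in memory as one string (the name parts is only illustrative):

```python
import re

text = ('this is some text, value 45789, followed by, 1245, '
        'and more text 78965, more random text 5252,')

# Split on every comma that directly follows a digit, then trim each
# piece and discard pieces that are empty after trimming.
parts = [p.strip() for p in re.split(r'(?<=\d),', text) if p.strip()]
print(parts)
```

The look-behind keeps the digits in the result because it only asserts that a digit precedes the comma; the comma itself is the only character consumed by the split.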
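If you also want to strip the whitespace around each piece, split on the surrounding whitespace as well (note that this pattern splits on every comma, not only on those that follow a digit):
import re
string = " blah, lots , of , spaces, here "
pattern = re.compile(r"^\s+|\s*,\s*|\s+$")
result = [x for x in pattern.split(string) if x]
print(result)
>>> ['blah', 'lots', 'of', 'spaces', 'here']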
I want to remove numeric values without losing numbers that are part of alphanumeric words from a string.
String = "Jobid JD123 has been abended with code 346"
Result = "Jobid JD123 has been abended with code"
I am using the following code:
result = ''.join([i for i in String if not i.isdigit()])
which gives me the result as 'Jobid JD has been abended with code'
Is there any way to remove the words that contain only digits, while retaining those that contain a mix of letters and digits?
You can use regex to find runs of one or more digits \d+ between two word boundaries \b, and replace them with nothing.
>>> import re
>>> string = "Jobid JD123 has been abended with code 346"
>>> re.sub(r"\b\d+\b", "", string).strip()
'Jobid JD123 has been abended with code'
Note that the regex doesn't remove the space that preceded the digits (between "code" and "346"), which is left trailing, so you need to strip() the result of re.sub().
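When a standalone number sits in the middle of the string, strip() alone leaves a double space behind. One possible variation, offered as a sketch rather than as the answer's own method, also consumes the whitespace in front of each number (the function name drop_standalone_numbers is hypothetical):

```python
import re

def drop_standalone_numbers(s):
    # \b\d+\b matches a run of digits with word boundaries on both sides,
    # so "123" inside "JD123" is left alone. The leading \s* also removes
    # the whitespace before each number.
    return re.sub(r"\s*\b\d+\b", "", s).strip()

print(drop_standalone_numbers("Jobid JD123 has been abended with code 346"))
print(drop_standalone_numbers("code 346 on host 12 today"))
```

Be aware that a decimal such as 3.14 is treated as two digit runs around a dot, so this sketch would leave the dot behind.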
Use .isnumeric() to drop every word that consists only of digits:
s = "Jobid JD123 has been abended with code 346"
result = ' '.join(c for c in s.split() if not c.isnumeric())
print(result)
This outputs:
Jobid JD123 has been abended with code
Split the string into words and check whether each entire word is numeric:
Result = " ".join(word for word in String.split() if not word.isnumeric())
>>> Result
'Jobid JD123 has been abended with code'
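Note that str.isnumeric() returns False for tokens such as "3.5" or "-7", so the split-and-join approaches above keep them. If signed or decimal numbers should be dropped as well, a sketch along the following lines might help; the pattern NUMBER and the function drop_number_words are assumptions for illustration, not part of the answers above:

```python
import re

# Hypothetical pattern: optional sign, digits, optional decimal part.
NUMBER = re.compile(r'[+-]?\d+(?:\.\d+)?')

def drop_number_words(s):
    # Keep a word unless the whole word is a number according to NUMBER.
    return ' '.join(w for w in s.split() if not NUMBER.fullmatch(w))

print(drop_number_words("Jobid JD123 has been abended with code 346"))
print(drop_number_words("rate 3.5 delta -7 id A1"))
```

As with the other split-and-join answers, runs of whitespace in the input are collapsed to single spaces.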
I would like to cut list elements at any of several chosen delimiters: '-', ',' and ':'.
I have an example list:
list_1 = ['some text – some another', 'some text, some another', 'some text: some another']
I'd like to cut the list elements(strings in that case) so that it will return the following output:
splitted_list = ['some text', 'some text', 'some text']
I already tried with split() but it only takes 1 delimiter at a time:
splited_list = [i.split(',', 1)[0] for i in list_1]
I would prefer something that is easy for me to understand and that lets me decide which delimiters to use. For example, I don't want to cut the string after '-' but after ' - '.
List of delimiters:
': ', ' - ', ', '
Note that '-' has a space before and after it, while ':' and ',' are only followed by one.
You may use this regex in re.sub and replace it with an empty string:
\s*[^\w\s].*
This matches zero or more whitespace characters, followed by one character that is neither whitespace nor a word character, and everything after it.
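Applied to the list from the question, that regex could be used as follows (a small sketch; the variable name cut is only illustrative):

```python
import re

list_1 = ['some text – some another',
          'some text, some another',
          'some text: some another']

# Remove the first punctuation character, any whitespace directly
# before it, and everything after it.
cut = [re.sub(r'\s*[^\w\s].*', '', s) for s in list_1]
print(cut)
```

Because the pattern reacts to any punctuation character, it would also cut at a hyphen or apostrophe inside a word, so it suits input where the first punctuation mark is always the delimiter.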
import re
list_1 = ['some text – some another', 'some text, some another', 'some text: some another']
delims = [',', ':', ' –']
# Escape each delimiter so that regex metacharacters are matched literally
delimre = '(?:' + '|'.join(map(re.escape, delims)) + r')\s.*'
splited_list = [re.sub(delimre, '', i) for i in list_1]
print(splited_list)
Output:
['some text', 'some text', 'some text']
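If the delimiters should be spelled out including their surrounding spaces, as the question asks, one option is to split once on an alternation of the escaped delimiters and keep the first part. This is a sketch under that assumption; the helper name cut_before_delims is hypothetical:

```python
import re

def cut_before_delims(items, delims):
    # Build one pattern that matches any of the literal delimiters.
    pattern = re.compile('|'.join(re.escape(d) for d in delims))
    # Split at the first delimiter only and keep the text before it.
    return [pattern.split(s, maxsplit=1)[0] for s in items]

list_1 = ['some text – some another',
          'some text, some another',
          'some text: some another']
delims = [' – ', ', ', ': ']
print(cut_before_delims(list_1, delims))
print(cut_before_delims(['well-known - fact'], [' - ']))
```

Since ' - ' includes the spaces, a hyphen inside a word such as "well-known" does not trigger a cut.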
I am trying to extract a paragraph containing a certain string using Python. Example:
text = """test textract.

new line
test word.

another line."""
The following code works:
myword = ("word")
re.findall(r'(?<=(\n)).*?'+ myword + r'+.*?(?=(\n))',text)
and will return:
['test word.']
However, if I want to extract ['new line test word.'], none of the following works:
re.findall(r'(?<=(\.\n)).*?'+ myword + r'+.*?(?=(\.\n))',text) -> []
re.findall(r'(?<=(\.\n)).|\n*?'+ myword + r'+.|\n*?(?=(\.\n))',text) -> [('', '.\n'), ('', '.\n')]
re.findall(r'(?<=(\.\n)).*|\n*?'+ myword + r'+.*|\n*?(?=(\.\n))',text) -> [('', '.\n'), ('.\n', ''), ('', '.\n'), ('.\n', '')]
What should be the right way to do this?
Your pattern matches only a single line, because the newlines are asserted by the lookarounds but never consumed, and . does not match a newline by default.
Instead, you can anchor the pattern with ^, match any preceding lines that contain at least one non-whitespace character, then match a line that contains word, followed by any further non-blank lines.
You could also start the pattern with a newline, but then it would miss a first paragraph that is not preceded by one.
^(?:[^\S\n]*\S.*\n)*.*word.*(?:\n[^\S\n]*\S.*)*
import re
text = """test textract.

new line
test word.

another line."""
pattern = r"^(?:[^\S\n]*\S.*\n)*.*word.*(?:\n[^\S\n]*\S.*)*"
print(re.findall(pattern, text, re.MULTILINE))
Output
['new line\ntest word.']
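An alternative that avoids one large regex is to split the text into paragraphs first and then filter them. The sketch below assumes that paragraphs are separated by blank (or whitespace-only) lines; the names paragraphs and matches are only illustrative:

```python
import re

text = "test textract.\n\nnew line\ntest word.\n\nanother line."
myword = "word"

# A paragraph boundary is a newline, optional whitespace, and another newline.
paragraphs = re.split(r'\n\s*\n', text)

# Keep only the paragraphs that contain the search word.
matches = [p for p in paragraphs if myword in p]
print(matches)
```

This does a plain substring test, so "word" would also match inside "keyword"; use re.search with \b around the escaped word if whole-word matching is needed.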
Use re.DOTALL so that . also matches newlines, which lets a single match span several lines (re.MULTILINE only changes the behaviour of ^ and $, which this pattern does not use, so it is harmless but not required here):
import re
text = """\
test textract.

new line
test word.

another line."""
print(re.findall(r'\n+(.*word.*)\n+', text, re.MULTILINE | re.DOTALL))
Output:
['new line\ntest word.\n']
I have a text and a list.
text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
I want to remove every element of the list from the string. So, I can do this:
for element in remove_list:
text = text.replace(element, '')
I can also use regex.
But can this be done with a list comprehension or any other one-liner?
You can use functools.reduce:
from functools import reduce
text = reduce(lambda x, y: x.replace(y, ''), remove_list, text)
# 'Some texts  that I want to  replace'
I would do this with re.sub to remove all the substrings in one pass:
>>> import re
>>> regex = '|'.join(map(re.escape, remove_list))
>>> re.sub(regex, '', text)
'Some texts  that I want to  replace'
Note that the result has two spaces instead of one where each part was removed. If you want each occurrence to leave just one space, you can use a slightly more complicated regex:
>>> re.sub(r'\s*(' + regex + r')', '', text)
'Some texts that I want to replace'
There are other ways to write similar regexes; this one will remove the space preceding a match, but you could alternatively remove the space following a match instead. Which behaviour you want will depend on your use-case.
You can do this with a regex by building a regex from an alternation of the words to remove, taking care to escape the strings so that the [ and ] in them don't get treated as special characters:
import re
text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
regex = re.compile('|'.join(re.escape(r) for r in remove_list))
text = regex.sub('', text)
print(text)
Output:
Some texts  that I want to  replace
Since this may leave double spaces in the resulting string, you can collapse them with replace, e.g.
text = regex.sub('', text).replace('  ', ' ')
Output:
Some texts that I want to replace
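The regex-based answers can be wrapped into one helper. The following is a sketch, not a definitive implementation: the name remove_all is hypothetical, the longest-first ordering is an added precaution for lists where one entry is a prefix of another, and the whitespace normalisation collapses every run of whitespace, not only the gaps left by removals:

```python
import re

def remove_all(text, remove_list):
    # Try longer substrings first so that "foobar" wins over "foo".
    ordered = sorted(remove_list, key=len, reverse=True)
    # re.escape makes characters such as [ and ] match literally.
    regex = re.compile('|'.join(map(re.escape, ordered)))
    # split() + join() collapses every run of whitespace to one space.
    return ' '.join(regex.sub('', text).split())

text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
print(remove_all(text, remove_list))
```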
I have a piece of code which splits a string after commas and dots (but not when a digit is before or after a comma or dot):
import re
text = "This is, a sample text. Some more text. $1,200 test."
print(re.split(r'(?<!\d)[,.]|[,.](?!\d)', text))
The result is:
['This is', ' a sample text', ' Some more text', ' $1,200 test', '']
I don't want to lose the commas and dots. So what I am looking for is:
['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']
Besides, if the text ends with a dot, this produces an empty string at the end of the list. Furthermore, the split strings start with whitespace. Is there a better method, perhaps without using re? How would you do this?
Unfortunately, re.split() cannot split on a zero-length match in Python versions before 3.7, so unless you can guarantee that there will be whitespace after the comma or dot, you will need to use a different approach there.
Here is one option that uses re.findall():
>>> text = "This is, a sample text. Some more text. $1,200 test."
>>> print(re.findall(r'(?:\d[,.]|[^,.])*(?:[,.]|$)', text))
['This is,', ' a sample text.', ' Some more text.', ' $1,200 test.', '']
This doesn't strip whitespace and you will get an empty match at the end if the string ends with a comma or dot, but those are pretty easy fixes.
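The two fixes mentioned above, stripping whitespace and dropping the empty match, could look like this (a minimal sketch; the names pieces and result are illustrative):

```python
import re

text = "This is, a sample text. Some more text. $1,200 test."

# Each match is a run of text ending in a comma or dot (or the end of
# the string); a comma or dot right after a digit stays inside the run.
pieces = re.findall(r'(?:\d[,.]|[^,.])*(?:[,.]|$)', text)

# Trim surrounding whitespace and drop pieces that end up empty.
result = [p.strip() for p in pieces if p.strip()]
print(result)
```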
If it is a safe assumption that there will be whitespace after every comma and dot you want to split on, then we can just split the string on that whitespace which makes it a little simpler:
>>> print(re.split(r'(?<=[,.])(?<!\d.)\s', text))
['This is,', 'a sample text.', 'Some more text.', '$1,200 test.']
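On Python 3.7 and later, re.split() does accept patterns that can match the empty string, so a zero-width split is possible without relying on whitespace. The sketch below assumes such a Python version; the pattern is one possible formulation, not taken from the answer above:

```python
import re

text = "This is, a sample text. Some more text. $1,200 test."

# Split at the empty position right after a comma or dot, unless that
# comma or dot sits between two digits (as in "1,200").
pattern = r'(?<=[,.])(?!(?<=\d[,.])\d)'

# Requires Python 3.7+; older versions do not split on empty matches.
result = [p.strip() for p in re.split(pattern, text) if p.strip()]
print(result)
```

The commas and dots stay attached to the preceding piece because the split position itself has zero width.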