I have some scraped data that varies slightly in format. To standardise it I need to remove anything within parentheses, including the parentheses themselves, if they exist that is. I have attempted to use strip in various ways, but to no avail.
Some example data:
Text (te)
Text Text (tes)
Text-Text (te)
Text Text
Text-Text (tes)
And how I need it to appear after standardisation:
Text
Text Text
Text-Text
Text Text
Text-Text
Can anyone offer me a solution for this? Thanks SMNALLY
from re import sub
x = sub(r"(?s)\(.*\)", "", x)
This will remove everything between the parentheses (including newlines) as well as the parentheses themselves.
Assuming the parentheses do not nest, and that there is at most one pair per string, try this:
import re
myString = re.sub(r'\(.*\)', '', myString)
A more specific pattern might be:
myString = re.sub(r'\s*\(\w+\)\s*$', '', myString)
The above pattern deletes the whitespace that surrounds the parenthetical expression, and only deletes from the end of the line.
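As a quick sanity check, here is a minimal sketch applying the more specific pattern to the sample data from the question:

```python
import re

samples = ["Text (te)", "Text Text (tes)", "Text-Text (te)",
           "Text Text", "Text-Text (tes)"]

# Remove a trailing parenthetical plus the whitespace around it;
# strings without parentheses pass through unchanged.
cleaned = [re.sub(r'\s*\(\w+\)\s*$', '', s) for s in samples]
print(cleaned)  # ['Text', 'Text Text', 'Text-Text', 'Text Text', 'Text-Text']
```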
I need to get the text inside the parentheses where the text ends with .md, using a regex in Python (if you know another way, feel free to suggest it).
Original string:
[Romanian (Romania)](books/free-programming-books-ro.md)
Expected result:
books/free-programming-books-ro.md
This should work:
import re
s = '[Romanian (Romania)](books/free-programming-books-ro.md)'
result = re.findall(r'[^\(]+\.md(?=\))', s)
print(result)  # ['books/free-programming-books-ro.md']
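An alternative sketch (not from the answer above) captures the target with a group instead of a lookahead:

```python
import re

s = '[Romanian (Romania)](books/free-programming-books-ro.md)'

# Capture everything between a "(" and a ")" that ends in .md
m = re.search(r'\(([^()]*\.md)\)', s)
print(m.group(1))  # books/free-programming-books-ro.md
```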
I have a string formatted like this
s="""
stkcode="10001909" marketid="sh" isstop="S 01" turnover="0" contractid="000000" time="84445850"
"""
I want to capture all the "keyword args" substrings in it, i.e., stkcode="10001909", isstop="S 01". Note that a plain s.split() won't work because of possible whitespace inside certain field values, for example isstop="S 01". The correct way to go seems to be re.split, but I don't know how to write the appropriate regex. Can anyone help? Thanks!
edit
To add more info: we are guaranteed there is no " inside any entry value. In effect, we only need a "protective" split, i.e. only split on whitespace that falls outside a pair of quotes.
EDIT: XML is the way to go, not regex. Apologies
My original data comprises many lines of timestamp + some aux info + an XML string, so it cannot be directly parsed by an XML parser and has to be read line by line as strings. I initially thought I would just stick with strings and a regex for each (relatively easy) single line, but I was apparently wrong: an XML parser is the way to go for sure.
re.findall(r'((?!\<).*?)="(.*?)"', s)
Produces:
[('stkcode', '10001909'),
(' marketid', 'sh'),
(' isstop', 'S 01'),
(' turnover', '0'),
(' contractid', '000000'),
(' time', '84445850')]
Regex Explanation:
(...)="(...)"
Matches everything in this format, the kwarg format you've defined
Now the first group:
((?!\<).*?) lazily matches any characters (.*?), while the negative lookahead ((?!\<)) ensures the match does not begin at a < character
And the second group:
(.*?)
will just match all characters lazily. The closing quote is outside of the group in the matching pattern, so you don't have to worry about it.
EDIT:
To ignore whitespace around characters, add this negative lookahead:
(?!\s)
Not sure where whitespace would appear in your strings, but this new regex would handle it in every relevant place:
((?!\<)(?!\s).*?(?!\s))="(?!\s)(.*?)(?!\s)"
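If the keys are guaranteed to consist of word characters only, as in the sample string, a simpler pattern (an alternative sketch, not part of the answer above) sidesteps the leading-whitespace problem entirely:

```python
import re

s = 'stkcode="10001909" marketid="sh" isstop="S 01" turnover="0" contractid="000000" time="84445850"'

# \w+ matches the key (no whitespace possible), [^"]* matches the
# quoted value, whitespace inside values included.
pairs = re.findall(r'(\w+)="([^"]*)"', s)
print(pairs)
# [('stkcode', '10001909'), ('marketid', 'sh'), ('isstop', 'S 01'),
#  ('turnover', '0'), ('contractid', '000000'), ('time', '84445850')]
```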
I have a string with some markup which I'm trying to parse, generally formatted like this.
'[*]\r\n[list][*][*][/list][*]text[list][*][/list]'
I want to match the asterisks within the [list] tags so I can re.sub them as [**] but I'm having trouble forming an expression to grab them. So far, I have:
match = re.compile(r'\[list\].+?\[/list\]', re.DOTALL)
This gets everything within the list, but I can't figure out a way to narrow it down to the asterisks alone. Any advice would be massively appreciated.
You may use re.sub with a lambda in the replacement part: the match object is passed to the lambda, which applies a plain .replace('*', '**') to the matched value.
Here is the sample code:
import re
s = '[*]\r\n[list][*][*][/list][*]text[list][*][/list]'
match = re.compile(r'\[list].+?\[/list]', re.DOTALL)
print(match.sub(lambda m: m.group().replace('*', '**'), s))
# => [*]
# [list][**][**][/list][*]text[list][**][/list]
Note that a ] outside of a character class does not have to be escaped in Python re regex.
I have been using python with regex to clean up a text file. I have been using the following method and it has generally been working:
mystring = compiledRegex.sub("replacement",mystring)
The string in question is an entire text file that includes many embedded newlines. Some of the compiled regex's cover multiple lines using the re.DOTALL option. If the last character in the compiled regex is a \n the above command will substitute all matches of the regex except the match that ends with the final newline at the end of the string. In fact, I have had several other no doubt related problems dealing with newlines and multiple newlines when they appear at the very end of the string. Can anyone give me a pointer as to what is going on here? Thanks in advance.
If I understood you correctly, and all you need is to get the text without the newline at the end of each line and then iterate over that text to find a required word, then you can try the following:
data = (line for line in text.split('\n') if line.strip())  # all non-empty lines, without '\n' at the end
Now you can search/replace any text you need using either list slicing or regex functionality.
Or you can use replace to swap every '\n' for whatever you want:
text.replace('\n', '')
My bet is that your file does not end with a newline...
>>> content = open('foo').read()
>>> print(content)
TOTAL:.?C2
abcTOTAL:AC2
defTOTAL:C2
>>> content
'TOTAL:.?C2\nabcTOTAL:AC2\ndefTOTAL:C2'
...so the last line does not match the regex:
>>> regex = re.compile('TOTAL:.*?C2\n', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefTOTAL:C2'
If that is the case, the solution is simple: just match either a newline or the end of the file (with $):
>>> regex = re.compile('TOTAL:.*?C2(\n|$)', re.DOTALL)
>>> regex.sub("XXX", content)
'XXXabcXXXdefXXX'
I can't get a good handle on what is going on from your explanation, but you may be able to fix it by replacing all multiple newlines with a single newline as you read in the file. Another option might be to simply remove the trailing \n from the regex, unless you need it for something.
Is the question mark there to prevent the regex matching more than one line at a time? If so, then you probably want the MULTILINE flag instead of the DOTALL flag. The ^ sign will then match just after a newline or at the beginning of the string, and the $ sign will match just before a newline character or at the end of the string.
eg.
regex = re.compile('^TOTAL:.*$', re.MULTILINE)
content = regex.sub('', content)
However, this still leaves you with the problem of empty lines. Why not just run one additional regex at the end that removes blank lines?
regex = re.compile('\n{2,}')
content = regex.sub('\n', content)
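Put together, a minimal sketch of both substitutions (the sample content is invented for illustration):

```python
import re

content = 'TOTAL: 12 C2\nkeep this\nTOTAL: 34 C2\nkeep that\n'

# MULTILINE makes ^ and $ anchor at line boundaries, so each TOTAL
# line is emptied out; the second sub then collapses the blank runs.
content = re.sub(r'^TOTAL:.*$', '', content, flags=re.MULTILINE)
content = re.sub(r'\n{2,}', '\n', content)
print(repr(content))  # '\nkeep this\nkeep that\n'
```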
I have some text with HTML artifacts where the < and > of tags got dropped, so now I need something that will match a lowercase p followed by a capital letter, like
pThe next day they....
And I also need something that will catch the trailing /p, which is easier. These need to be stripped, i.e. replaced with "" in Python.
What RE would I use for that? Thanks!
Stephan.
Try this:
re.sub(r"(/?p)(?=[A-Z]|$)", r"<\1>", text)
You might want to extend the boundary assertion (here (?=[A-Z]|$)) with additional characters like whitespace.
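For instance, on the sample line from the question (the variable name text is just for illustration):

```python
import re

text = "pThe next day they..../p"

# Wrap a bare p or /p in angle brackets when it is followed by a
# capital letter or by the end of the string.
restored = re.sub(r"(/?p)(?=[A-Z]|$)", r"<\1>", text)
print(restored)  # <p>The next day they....</p>
```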
I got it. You use backreferences:
import re
smallBig = re.compile(r'[a-z]([A-Z])')
...
cleanedString = smallBig.sub(r'\1', dirtyString)
This removes the lowercase letter but keeps the capital letter in cases where the < and > of HTML tags were stripped and you're left with text like
pSome new paragraph text /p
Quick and dirty but it works in my case.
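For reference, a run of the snippet above on the sample line (a sketch; note it leaves the trailing /p untouched, since no capital letter follows it):

```python
import re

dirtyString = "pSome new paragraph text /p"

# A lowercase letter directly followed by a capital is replaced by
# just the capital, dropping the stray tag letter.
smallBig = re.compile(r'[a-z]([A-Z])')
cleanedString = smallBig.sub(r'\1', dirtyString)
print(cleanedString)  # Some new paragraph text /p
```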