Python re.sub returning binary characters - python

I'm trying to repair a JSON feed using re.sub() regex expressions in Python. (I'm also working with the feed provider to fix it). I have two expressions to fix:
1.
"milepost": "
"milepost": "723.46
which are missing an end quote, and
2.
},
}
which shouldn't have the comma. Note, there is no blank line between them, it's just "},\n }" (trouble with this editor...)
I have a short snippet of the feed, located at:
http://hardhat.ahmct.ucdavis.edu/tmp/test.txt
Sample code below. Here, I have tests for finding the patterns, and then for doing the replacements. The match for #2 gives some odd results, but I can't see why:
Brace matches found:
[('}', '\r\n }')]
The match for #1 seems good.
Main problem is, when I do the re.sub, my resulting string has "\x01\x02" in it. I have no clue where this is coming from. Any advice greatly appreciated.
Sample code:
import urllib2
import json
import re

if __name__ == "__main__":
    # wget version of real feed:
    # url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.json"
    # Short text, for milepost and brace substitution test:
    url = "http://hardhat.ahmct.ucdavis.edu/tmp/test.txt"
    request = urllib2.urlopen(url)
    rawResponse = request.read()
    # print("Raw response:")
    # print(rawResponse)
    # Find extra comma after end of records:
    p1 = re.compile('(}),(\r?\n *})')
    l1 = p1.findall(rawResponse)
    print("Brace matches found:")
    print(l1)
    # Check milepost:
    #p2 = re.compile('( *\"milepost\": *\")')
    p2 = re.compile('( *\"milepost\": *\")([0-9]*\.?[0-9]*)\r?\n')
    l2 = p2.findall(rawResponse)
    print("Milepost matches found:")
    print(l2)
    # Do brace substitutions:
    subst = "\1\2"
    response = re.sub(p1, subst, rawResponse)
    # Do milepost substitutions:
    subst = "\1\2\""
    response = re.sub(p2, subst, response)
    print(response)

You need to use raw strings; otherwise "\1\2" is interpreted by Python's string-literal parser as the control characters ASCII 0x01 and 0x02 instead of backslash 1, backslash 2.
Instead of
subst = "\1\2"
use
subst = r"\1\2" # or subst = "\\1\\2"
Things get a bit trickier with the second replacement:
subst = "\1\2\""
needs to become
subst = r'\1\2"' # or subst = "\\1\\2\""
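A minimal sketch of the difference, using the brace pattern from the question on a hypothetical two-line fragment:

```python
import re

# Hypothetical fragment containing the stray comma from the question.
s = '},\r\n    }'
p = re.compile(r'(}),(\r?\n *})')

bad = p.sub("\1\2", s)     # "\1\2" is parsed as the string "\x01\x02"
good = p.sub(r"\1\2", s)   # raw string: backreferences to groups 1 and 2

print(repr(bad))   # '\x01\x02' -- control characters, groups lost
print(repr(good))  # '}\r\n    }' -- comma removed, groups preserved
```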

Related

Replacing 3 line string

When calling an external API I get this kind of response. It is 2 lines of 44 characters, 88 in total, which is perfect.
r.text = "P<RUSBASZNAGDCIEWS<<AZIZAS<<<<<<<<<<<<<<<<<<"
"00000000<ORUS5911239F160828525911531023<<<10"
But sometimes I get this kind of response, and I need to make it the same as in example 1: 2 lines of 44 characters.
All these big く characters should be replaced with a normal <, and the spaces removed as well.
r.text = "P<RUSALUZAFEE<<ZUZILLAS<<<<
くくくくくくくくくく、
00000000<ORUS7803118 F210127747803111025<<<64"
expected OUTPUT:
string = "P<RUSALUZAFEE<<ZUZILLAS<<<<<<<<<<<<<<<<<<<<<
00000000<ORUS7803118F210127747803111025<<<64"
Here is my best attempt; I hope you find it helpful:
import re
txt =""" P<RUSALUZAFEE<<ZUZILLAS<<<<
くくくくくくくくくく、
00000000<ORUS7803118 F210127747803111025<<<64"""
txt_1 = re.sub('(く |く)', '<', txt).replace('、','')
txt_2 = re.sub(r'\s+', '', txt_1)
regex = r"(\w<?\w+<+\w+<+)(\w*<?\w+<+\w+)"
result = re.match(regex, txt_2)
print(f'{result.group(1)}\n{result.group(2)}')
Output
P<RUSALUZAFEE<<ZUZILLAS<<<<<<<<<<<<<<
00000000<ORUS7803118F210127747803111025<<<64
import re
pattern = r'\n.*く.*\n'
s = re.compile(pattern)
string = s.sub('\n', r.text)
you can do it with re.sub from the re module, like the following:
new_txt = re.sub("く", "<", old_txt)
or with str.replace, like the following:
new_str = old_str.replace("く", "<")
or combine a regex with an if/else, like:
if pattern:
    re.sub  # or str.replace
else:
    pass
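As an alternative sketch (assuming the only noise is the fullwidth く, the ideographic comma 、 and stray whitespace), the cleanup can also be done without a structural regex by normalizing first and then slicing into 44-character lines:

```python
# Sketch: normalize the noisy characters, then slice into fixed-width lines.
raw = ("P<RUSALUZAFEE<<ZUZILLAS<<<<\n"
       "くくくくくくくくくくくくくくくくく、\n"
       "00000000<ORUS7803118 F210127747803111025<<<64")

flat = raw.replace('く', '<').replace('、', '')
flat = ''.join(flat.split())  # drop spaces and newlines
lines = [flat[i:i + 44] for i in range(0, len(flat), 44)]

print(lines[0])
print(lines[1])
```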

Python - Regex - combination of letters and numbers (undefined length)

I am trying to get a File-ID from a text file. In my example the filename is d735023ds1.htm, which I want to extract in order to build another url. Those filenames differ in length, however, so I need a universal regex expression to cover all possibilities.
Example filenames
d804478ds1a.htm
d618448ds1a.htm
d618448.htm
My code
for cik in leftover_cik_list:
    r = requests.get(filing.url)
    content = str(r.content)
    fileID = None
    for line in content.split("\n"):
        if fileID == None:
            fileIDIndex = line.find("<FILENAME>")
            if fileIDIndex != -1:
                trimmedText = line[fileIDIndex:]
                result = RegEx.search(r"^[\w\d.htm]*$", trimmedText)
                if result:
                    fileID = result.group()
                    print("fileID", fileID)
    document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(cik, accession_number, fileID)
    print("Document Link to S-1:", document_link)
import re
...
result = re.search('^d\d{1,6}.+\.htm$', trimmedText)
if result:
    fileID = result.group()
^d = Start with a d
\d{1,6} = Look for 1-6 digits; if there could be an unlimited amount of digits, replace it with \d{1,} (equivalent to \d+)
.+ = Wild card: one or more of any character
\.htm$ = End in .htm
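A quick check of the pattern explained above against the three sample names from the question (assuming they are representative):

```python
import re

# All three sample names should satisfy the pattern from the answer.
names = ["d804478ds1a.htm", "d618448ds1a.htm", "d618448.htm"]
hits = [bool(re.search(r'^d\d{1,6}.+\.htm$', n)) for n in names]
print(hits)
```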
You should try re.match(), which searches for a pattern at the beginning of the input string. Also, your regex is not right: you have to add a backslash before ., as the dot means "any character" in regex.
import re
result = re.match(r'[\w]+\.htm', trimmedText)
Try this regex:
import re

files = [
    "d804478ds1a.htm",
    "d618448ds1a.htm",
    "d618448.htm"
]
for f in files:
    match = re.search(r"d\w+\.htm", f)
    print(match.group())
d804478ds1a.htm
d618448ds1a.htm
d618448.htm
The assumptions in the above are that the file name starts with a d, ends with .htm and contains only letters, digits and underscores.
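Under those assumptions, anchoring the pattern makes the check strict, so a name with extra leading or trailing characters would be rejected; a small sketch:

```python
import re

# Anchored variant of the pattern above: the whole name must match.
pattern = re.compile(r"^d\w+\.htm$")

files = ["d804478ds1a.htm", "d618448ds1a.htm", "d618448.htm"]
matched = [f for f in files if pattern.match(f)]
print(matched)
```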

Capture ALL strings within a Python script with regex

This question was inspired by my failed attempts after trying to adapt this answer: RegEx: Grabbing values between quotation marks
Consider the following Python script (t.py):
print("This is also an NL test")
variable = "!\n"
print('And this has an escaped quote "don\'t" in it ', variable,
"This has a single quote ' but doesn\'t end the quote as it" + \
" started with double quotes")
if "Foo Bar" != '''Another Value''':
"""
This is just nonsense
"""
aux = '?'
print("Did I \"failed\"?", f"{aux}")
I want to capture all strings in it, as:
This is also an NL test
!\n
And this has an escaped quote "don\'t" in it
This has a single quote ' but doesn\'t end the quote as it
started with double quotes
Foo Bar
Another Value
This is just nonsense
?
Did I \"failed\"?
{aux}
I wrote another Python script using re module and, from my attempts into regex, the one which finds most of them is:
import re

pattern = re.compile(r"""(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)""")
with open('t.py', 'r') as f:
    msg = f.read()
x = pattern.finditer(msg, re.DOTALL)
for i, s in enumerate(x):
    print(f'[{i}]', s.group(0))
with the following result:
[0] And this has an escaped quote "don\'t" in it
[1] This has a single quote ' but doesn\'t end the quote as it started with double quotes
[2] Foo Bar
[3] Another Value
[4] Did I \"failed\"?
On top of my failures, I also couldn't fully replicate what I found with regex101.com:
I'm using Python 3.6.9, by the way, and I'm asking for more insights into regex to crack this one.
Because you want to match ''' or """ or ' or " as the delimiter, put all of that into the first group:
('''|"""|["'])
Don't put \b after it, because then it won't match strings when those strings start with something other than a word character.
Because you want to make sure that the final delimiter isn't treated as a starting delimiter when the engine starts the next iteration, you'll need to fully match it (not just lookahead for it).
The middle part to match anything but the delimiter can be:
((?:\\.|.)*?)
Put it all together:
('''|"""|["'])((?:\\.|.)*?)\1
and the result you want will be in the second capture group:
pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")
with open('t.py', 'r') as f:
    msg = f.read()
x = pattern.finditer(msg)
for i, s in enumerate(x):
    print(f'[{i}]', s.group(2))
https://regex101.com/r/dvw0Bc/1
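A self-contained sketch of this answer's pattern on a small inline sample (f-string interiors and string prefixes are not handled, as in the answer):

```python
import re

pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")

# Inline sample source code; the doubled backslashes are literal in `src`.
src = 'x = "a \\"quoted\\" part"\ny = \'single\'\nz = """triple\nline"""'

# Group 2 holds the string contents, per the answer.
found = [m.group(2) for m in pattern.finditer(src)]
print(found)
```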

How would I match a string that may or may not span multiple lines?

I have a document that when converted to text splits the phone number onto multiple lines like this:
(xxx)-xxx-
xxxx
For a variety of reasons related to my project I can't simply join the lines.
If I know the phonenumber="(555)-555-5555" how can I compile a regex so that if I run it over
(555)-555-
5555
it will match?
**EDIT
To help clarify my question here it is in a more abstract form.
test_string = "xxxx xx x xxxx"
text = """xxxx xx
x
xxxx"""
I need the test string to be found in the text. Newlines can be anywhere in the text and characters that need to be escaped should be taken into consideration.
A simple workaround would be to replace all the \n characters in the document text before you search it:
pat = re.compile(r'\(\d{3}\)-\d{3}-\d{4}')
numbers = pat.findall(text.replace('\n',''))
# ['(555)-555-5555']
If this cannot be done for any reasons, the obvious answer, though unsightly, would be to handle a newline character between each search character:
pat = re.compile(r'\(\n*5\n*5\n*5\n*\)\n*-\n*5\n*5\n*5\n*-\n*5\n*5\n*5\n*5')
If you needed to handle any format, you can pad the format like so:
phonenumber = '(555)-555-5555'
pat = re.compile('\n*'.join(['\\'+i if not i.isalnum() else i for i in phonenumber]))
# pat
# re.compile(r'\(\n*5\n*5\n*5\n*\)\n*\-\n*5\n*5\n*5\n*\-\n*5\n*5\n*5\n*5', re.UNICODE)
Test case:
import random
def rndinsert(s):
    i = random.randrange(len(s)-1)
    return s[:i] + '\n' + s[i:]

for i in range(10):
    print(pat.findall(rndinsert('abc (555)-555-5555 def')))
# ['(555)-555-5555']
# ['(555)-5\n55-5555']
# ['(555)-5\n55-5555']
# ['(555)-555-5555']
# ['(555\n)-555-5555']
# ['(5\n55)-555-5555']
# ['(555)\n-555-5555']
# ['(555)-\n555-5555']
# ['(\n555)-555-5555']
# ['(555)-555-555\n5']
You can search for a possible \n existing in the string:
import re
nums = ["(555)-555-\n5555", "(555)-555-5555"]
new_nums = [i for i in nums if re.findall('\([\d\n]+\)[\n-][\d\n]+-[\d\n]+', i)]
Output:
['(555)-555-\n5555', '(555)-555-5555']
data = ["(555)-555-\n5555", "(55\n5)-555-\n55\n55", "(555\n)-555-\n5555", "(555)-555-5555"]
input = '(555)-555-5555'
#add new lines to input string
input = re.sub(r'(?!^|$)', r'\\n*', input)
#escape brackets ()
input = re.sub(r'(?=[()])', r'\\',input)
r = re.compile(input)
match = filter(r.match, data)
Code demo
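The padding idea above can be written more compactly with re.escape, which handles the characters that need escaping automatically; a sketch:

```python
import re

# Build a newline-tolerant pattern from a known number: escape each
# character and allow optional newlines between consecutive characters.
phonenumber = "(555)-555-5555"
pat = re.compile(r"\n*".join(re.escape(ch) for ch in phonenumber))

text = "call me at (555)-555-\n5555 tomorrow"
m = pat.search(text)
print(m.group())  # matches across the line break
```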

Python: re.compile and re.sub

Question part 1
I got this file f1:
<something #37>
<name>George Washington</name>
<a23c>Joe Taylor</a23c>
</something #37>
and I want to transform it with re.compile/re.sub so that it looks like this f1: (with spaces)
George Washington Joe Taylor
I tried this code but it kind of deletes everything:
import re
file = open('f1.txt')
fixed = open('fnew.txt', 'w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ', text)
fixed.write(fixed_doc)
My guess is the re.compile line but I'm not quite sure what to do with it. I'm not supposed to use 3rd party extensions. Any ideas?
Question part 2
I had a different question about comparing 2 files I got this code from Alfe:
from collections import Counter
def test():
    with open('f1.txt') as f:
        contentsI = f.read()
    with open('f2.txt') as f:
        contentsO = f.read()
    tokensI = Counter(value for value in contentsI.split()
                      if value not in [])
    tokensO = Counter(value for value in contentsO.split()
                      if value not in [])
    return not (tokensI - tokensO) and not (set(tokensO) - set(tokensI))
Is it possible to implement the re.compile and re.sub in the 'if value not in []' section?
I will explain what happens with your code:
import re
file = open('f1.txt')
fixed = open('fnew.txt','w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ', text)
fixed.write(fixed_doc)
The instruction text = file.read() creates an object text of type string named text.
Note that I use bold characters text to express an OBJECT, and text to express the name == IDENTIFIER of this object.
As a consequence of the instruction for unwanted in text:, the identifier unwanted is successively assigned to each character referenced by the text object.
Besides, re.compile('<.*>') creates an object of type RegexObject (which I personally call a compiled regex, or simply a regex; <.*> being only the regex pattern).
You assign this compiled regex object to the identifier match: it's a very bad practice, because match is already the name of a method of regex objects in general, and of the one you created in particular, so then you could write match.match without error.
match is also the name of a function of the re module.
This use of this name for your particular need is very confusing. You must avoid that.
There's the same flaw with the use of file as a name for the file-handler of file f1. file is already an identifier used in the language, you must avoid it.
Well. Now that this badly-named match object is defined, the instruction fixed_doc = match.sub(r' ',text) replaces all the occurrences found by the regex match in text with the replacement r' '.
Note that it's completely superfluous to write r' ' instead of just ' ' because there's absolutely nothing in ' ' that needs to be escaped. It's a fad of some anxious people to write raw strings every time they have to write a string in a regex problem.
Because of its pattern <.*>, in which the dot symbol means "greedily eat every character situated between a < and a > except a newline character", the occurrences caught in the text by match are each line up to the last > in it.
As the name unwanted doesn't appear in this instruction, it is the same operation that is done for each character of the text, one after the other. That is to say: nothing interesting.
To analyze the execution of a program, you should put some printing instructions in your code, allowing you to understand what happens. For example, if you do print repr(fixed_doc), you'll see the repeated printing of this: ' \n \n \n '. As I said: nothing interesting.
There's one more defect in your code: you open files, but you don't close them. It is mandatory to close files, otherwise some weird phenomena can happen, which I personally observed in some of my code before I realized this necessity. Some people pretend it isn't mandatory, but that's false.
By the way, the better manner to open and close files is to use the with statement. It does all the work without you having to worry about it.
.
So , now I can propose you a code for your first problem:
import re
def ripl(mat=None, li=[]):
    if mat == None:
        li[:] = []
        return
    if mat.group(1):
        li.append(mat.span(2))
        return ''
    elif mat.span() in li:
        return ''
    else:
        return mat.group()
r = re.compile('</[^>]+>'
'|'
'<([^>]+)>(?=.*?(</\\1>))',
re.DOTALL)
text = '''<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>'''
print '1------------------------------------1'
print text
print '2------------------------------------2'
ripl()
print r.sub(ripl,text)
print '3------------------------------------3'
result
1------------------------------------1
<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>
2------------------------------------2
George <wxc>Washington
Joe </zazaza>Taylor
3------------------------------------3
The principle is as follows:
When the regex detects a tag,
- if it's an end tag, it matches
- if it's a start tag, it matches only if there is a corresponding end tag somewhere further in the text
For each match, the method sub() of the regex r calls the function ripl() to perform the replacement.
If the match is with a start tag (which is necessarily followed somewhere in the text by its corresponding end tag, by construction of the regex), then ripl() returns ''.
If the match is with an end tag, ripl() returns '' only if this end tag has previously been detected as the corresponding end tag of an earlier start tag. This is made possible by recording in the list li the span of the corresponding end tag each time a start tag is detected and matched.
The recording list li is defined as a default argument so that it's always the same list that is used at each call of the function ripl() (please refer to the functioning of default arguments to understand, because it's subtle).
As a consequence of defining li as a parameter receiving a default argument, the list object li would retain all the spans recorded while analyzing several texts in succession. To avoid li retaining spans from past text matches, it is necessary to empty the list. I wrote the function so that the first parameter has a default argument of None: that allows calling ripl() without argument to reset the list, before any use of it in a regex's sub() method.
So one must remember to call ripl() before any use of it.
.
If you want to remove the newlines of the text in order to obtain the precise result you showed in your question, the code must be modified to:
import re
def ripl(mat=None, li=[]):
    if mat == None:
        li[:] = []
        return
    if mat.group(1):
        return ''
    elif mat.group(2):
        li.append(mat.span(3))
        return ''
    elif mat.span() in li:
        return ''
    else:
        return mat.group()
r = re.compile('( *\n *)'
'|'
'</[^>]+>'
'|'
'<([^>]+)>(?=.*?(</\\2>)) *',
re.DOTALL)
text = '''<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>'''
print '1------------------------------------1'
print text
print '2------------------------------------2'
ripl()
print r.sub(ripl,text)
print '3------------------------------------3'
result
1------------------------------------1
<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>
2------------------------------------2
George <wxc>WashingtonJoe </zazaza>Taylor
3------------------------------------3
You can use Beautiful Soup to do this easily:
from bs4 import BeautifulSoup
file = open('f1.txt')
fixed = open('fnew.txt','w')
#now for some soup
soup = BeautifulSoup(file)
fixed.write(str(soup.get_text()).replace('\n',' '))
The output of the above line will be:
George Washington Joe Taylor
(At least this works with the sample you gave me)
Sorry I don't understand part 2, good luck!
Don't need re.compile
import re

clean_string = ''
with open('f1.txt') as f1:
    for line in f1:
        match = re.search('.+>(.+)<.+', line)
        if match:
            clean_string += match.group(1)
            clean_string += ' '
print(clean_string)  # 'George Washington Joe Taylor'
Figured the first part out it was the missing '?'
match = re.compile('<.*?>')
does the trick.
Anyway, still not sure about the second question. :/
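A minimal sketch of the difference the '?' makes, on one line of the sample input:

```python
import re

text = "<name>George Washington</name>"

greedy = re.sub('<.*>', ' ', text)   # '.*' spans from the first '<' to the last '>'
lazy = re.sub('<.*?>', ' ', text)    # non-greedy: each match stops at the next '>'

print(repr(greedy))  # ' '
print(repr(lazy))    # ' George Washington '
```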
For part 1, try the below code snippet. However, consider using a library like BeautifulSoup, as suggested by Moe Jan.
import re
import os

def main():
    f = open('sample_file.txt')
    fixed = open('fnew.txt', 'w')
    #pattern = re.compile(r'(?P<start_tag>\<.+?\>)(?P<content>.*?)(?P<end_tag>\</.+?\>)')
    pattern = re.compile(r'(?P<start><.+?>)(?P<content>.*?)(</.+?>)')
    output_text = []
    for text in f:
        match = pattern.match(text)
        if match is not None:
            output_text.append(match.group('content'))
    fixed_content = ' '.join(output_text)
    fixed.write(fixed_content)
    f.close()
    fixed.close()

if __name__ == '__main__':
    main()
For part 2:
I am not completely clear on what you are asking; however, my guess is that you want to do something like if re.sub(value) not in []. Note that you only need to call re.compile once, prior to initializing the Counter instance. It would be better if you clarified the second part of your question.
Actually, I would recommend you use Python's built-in difflib module to find the difference between two files. This is better than writing your own diff algorithm, since the diff logic is well tested and widely used, and is not vulnerable to logical or programming errors resulting from spurious newline, tab and space characters.
