python replace unicode characters

I wrote a program to read a Windows DNS debugging log, but the domain field always contains some odd characters.
Below is one example:
'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
I want to replace every \x.. with a ?
Typing \xc2 explicitly, as follows, works:
line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
re.sub('\\\xc2', '?', line)
result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)'
But it does not work if I write it as follows:
re.sub('\\\x..', '?', line)
How can I write a regular expression that replaces them all?

There are better tools for this job than regex. You could try, for example:
>>> line
'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
>>> line.decode('ascii', 'ignore')
u'(13)p(5)example(3)com(0)'
That skips non-ASCII characters. Or, with replace, you can swap each of them for a placeholder (when decoding, this is U+FFFD, the Unicode replacement character, rather than a literal '?'):
>>> print line.decode('ascii', 'replace')
(13)��������p����(5)example(3)com(0)
But the best solution is to find out which erroneous encoding/decoding step caused the mojibake in the first place, so that you can recover the data by using the correct code pages.
There is an excellent answer about unbaking mojibake here. Note that it's an inexact science, and a lot of the crucial information is actually in the comment thread under that answer.
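For readers on Python 3, where the log line would normally be handled as bytes, the ideas above can be sketched as follows. This is only an illustration and not part of the original answer: the variable names are made up, and the last step assumes a single layer of UTF-8-over-Latin-1 double encoding, which may or may not be what actually happened to this log.

```python
import re

# Sample line from the question, as raw bytes (Python 3).
line = b'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

# Regex route: replace every byte outside the ASCII range with "?".
masked = re.sub(rb'[\x80-\xff]', b'?', line)

# Codec route: drop the non-ASCII bytes entirely.
ascii_only = line.decode('ascii', 'ignore')

# Assumption: the text was double encoded (UTF-8 applied over Latin-1).
# Undoing one layer yields the single-byte values underneath.
unwrapped = line.decode('utf-8').encode('latin-1')

print(masked)
print(ascii_only)
print(unwrapped)
```

Whether the unwrapped bytes are meaningful depends on the code page the original client used, so that step is a starting point for investigation and not a fix.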

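What about this? Note that it only matches when the string contains a literal backslash followed by x, and the greedy .+ then consumes the rest of the line: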
line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
pattern = r'\\x.+'
re.sub(pattern, r'?', line)

Related

Python regular expression help needed, multiple lines regex

I was trying to scrape a link out of a .eml file, but my search always returns None. I don't even get the link together with its surrounding brackets; cleaning it up into a valid link once the string is pulled out would be no problem.
One problem I see is that the string the regex should find spans multiple lines, although the regex itself seems to be valid.
The code and regex I use:
def get_url(raw):
    #get rid of whitespaces
    raw = raw.replace(' ', '')
    #search for the link
    url = re.search('href=3D(.*?)token([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)\W([^\s]+)', raw).group(1)
    return url

First, the .eml is encoded in MIME quoted-printable (the hint is the = signs at the ends of lines). You should decode it first instead of dealing with the encoded raw text.
Second, regex is overkill. Some simple string.split() usage will work just as well. Regex is extremely useful in its proper usage scenarios, but plain Python can usually do the same job without regex's flavor of magic, which can be confusing as [REDACTED].
Note that if you're building a regex, it's always advisable to use one of the gazillion regex editors, as these will help you build it. My personal favorite is regex101.
EDIT: added a regex way to do it.
import quopri
import re

def get_url_by_regex(raw):
    decoded = quopri.decodestring(raw).decode("utf-8")
    return re.search('(<a href=")(.*?)(")', decoded).group(2)

def get_url(raw):
    decoded = quopri.decodestring(raw).decode("utf-8")
    for line in decoded.split('\n'):
        if 'token=' in line:
            return line.split('<a href="')[1].split('"')[0]
    return None # just in case this is needed

print(get_url(raw_email))
print(get_url_by_regex(raw_email))
result is:
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
https://app.rule.io/subscriber/optIn?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzd[REST_OF_TOKEN_REDACTED]
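To see why decoding first matters, here is a small self-contained sketch. The email fragment is invented for illustration (it is not the asker's message), but it shows the two quoted-printable features that defeat a regex on the raw text: =3D standing in for =, and a soft line break (= at the end of a line) splitting the token.

```python
import quopri
import re

# Hypothetical quoted-printable fragment; "=3D" encodes "=" and the
# trailing "=" before the newline is a soft line break.
raw_email = (
    b'<p>Hi</p>\n'
    b'<a href=3D"https://example.com/optIn?token=3Dabc=\n'
    b'def">Confirm</a>\n'
)

# Decoding removes the soft line break and restores the "=" characters.
decoded = quopri.decodestring(raw_email).decode("utf-8")

# After decoding, the link sits on a single line and is easy to match.
match = re.search(r'<a href="([^"]*token=[^"]*)"', decoded)
url = match.group(1) if match else None
print(url)
```

Checking the match before calling group() avoids an AttributeError when no link is present.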

python regular expression not matching file contents with re.match and re.MULTILINE flag

I'm reading in a file and storing its contents as a multiline string. Then I loop through some values I get from a Django query and run regexes built from the query results. My regex seems like it should work, and it does work if I copy the values returned by the query by hand, but for some reason it doesn't match when all the parts run together.
My code is:
with open("/path_to_my_file") as myfile:
    data=myfile.read()
#read saved settings then write/overwrite them into the config
items = MyModel.objects.filter(some_id="s100009")
for item in items:
    regexString = "^\s*"+item.feature_key+":"
    print regexString #to verify its what I want it to be, ie debug
    pq = re.compile(regexString, re.M)
    if pq.match(data):
        #do stuff
So basically my problem is that the regex isn't matching. When I copy the file contents into one big string and copy the value(s) printed by the print regexString line, it does match, so I'm thinking there's some esoteric Python/Django thing going on (or maybe not so esoteric, as Python isn't my first language).
For example's sake, the output of print regexString is:
^\s*productDetailOn:
File contents:
productDetailOn:true,
allOff:false,
trendingWidgetOn:true,
trendingWallOn:true,
searchResultOn:false,
bannersOn:true,
homeWidgetOn:true,
}
Running Python 2.7. Also, I dumped the types of both item.feature and data, and both were unicode. Not sure if that matters? Anyway, I'm starting to hit my head on the desk after working on this for a couple of hours, so any help is appreciated. Cheers!

According to documentation, re.match never allows searching at the beginning of a line:
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
You need to use a re.search:
regexString = r"^\s*"+item.feature_key+":"
pq = re.compile(regexString, re.M)
if pq.search(data):
A small note on the raw string (r"^\s*"): in this case it is equivalent to "^\s*", because \s is not a recognized string escape sequence (unlike \r or \n), so Python leaves the backslash in place. Still, it is safer to always declare regex patterns with raw string literals in Python (and with the corresponding notations in other languages), not least because recent Python 3 versions warn about unrecognized escapes in ordinary string literals.
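The difference between match and search under re.M can be checked with a small self-contained sketch. The data string below is a made-up stand-in for the file contents, and wrapping the key in re.escape is an addition of mine, useful if a key could ever contain regex metacharacters.

```python
import re

# Made-up stand-in for the file contents; the key is not on the first line.
data = "{\n  allOff:false,\n  productDetailOn:true,\n}"

feature_key = "productDetailOn"
pattern = re.compile(r"^\s*" + re.escape(feature_key) + ":", re.M)

# match() only tries position 0 of the whole string, even with re.M.
matched_at_start = pattern.match(data) is not None

# search() scans the string, so ^ can anchor at the start of any line.
found_anywhere = pattern.search(data) is not None

print(matched_at_start, found_anywhere)
```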

Issues with string appending - python

I'm trying to append to a string in Python, and the following code produces unexpected output:
buildVersion =request.values.get("buildVersion", None)
pathToSave = 'processedImages/%s/'%buildVersion
print pathToSave
prints out
processedImages/V41
/
I'm expecting the string to have the format: processedImages/V41/
It doesn't seem to be a newline character, because
pathToSave = pathToSave.replace("\n", "")
didn't really help.

It might not be relevant to the actual question, but in addition to Alex Martelli's answer, I would also check whether buildVersion exists in the first place, because otherwise all the solutions posted here will give you other errors:
import re
buildVersion = request.values.get('buildVersion')
if buildVersion is not None:
    return 'processedImages/{}/'.format(re.sub(r'\W+', '', buildVersion))
else:
    return None

It might be a \r or other special whitespace character. Just clean up buildVersion of all such whitespace before executing
pathToSave = 'processedImages/%s/' % buildVersion
You can approach the clean-up task in several ways -- for example, if valid characters in buildVersion are only "word characters" (letters, digits, underscore), something like
import re
buildVersion = re.sub(r'\W+', '', buildVersion)
would usefully clean up even whitespace inside the string. It's hard to be more specific without knowing exactly what characters you need to accept in buildVersion, of course.
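A quick way to find out which invisible character is involved is to print the repr() of the value. The sketch below uses a made-up value ending in a carriage return and newline; the variable names are illustrative and the request object from the question is left out so the snippet runs on its own.

```python
import re

# Made-up value standing in for request.values.get("buildVersion").
build_version = "V41\r\n"

# repr() makes otherwise invisible characters such as \r visible.
print(repr(build_version))

# strip() removes leading and trailing whitespace, including \r and \n.
path_to_save = "processedImages/%s/" % build_version.strip()

# Stricter variant: keep only word characters (letters, digits, underscore).
strict = re.sub(r"\W+", "", build_version)

print(path_to_save)
```

strip() keeps characters such as dots or hyphens inside the version string, while the regex variant removes them, so the choice depends on which characters a valid buildVersion may contain.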

search for hexadecimal number on python using re

I am processing an HTML text file and searching for hexadecimal numbers such as the following:
example \xb7\xc7\xa0....
I tried this code:
t=re.findall (r'\\x[0-9a-fA-F]+', line)
but I only get an empty list.
Please tell me the right way to write the code.

It works fine for me. Two scenarios come to mind that might explain your problem:
You're testing this by assigning the string to a variable line like so:
line = 'example \xb7\xc7\xa0....'
In this case, you need to escape the backslashes:
line = 'example \\xb7\\xc7\\xa0....'
You are viewing the contents of the file or line as a Python string, so that the \xb7 you are seeing is actually the character whose code is B7 hex, not the character sequence '\', 'x', 'b', '7'.

Your code works fine if the backslash is escaped inside the regular expression:
t = re.findall (r'\\x[0-9a-fA-F]+', line)
Result:
['\\xb7', '\\xc7', '\\xa0']
ideone: http://ideone.com/MPO5j
If it doesn't work, it might be because your string contains literal binary characters. Then try something like this instead:
t = re.findall (r'[\x80-\xff]', line)
ideone: http://ideone.com/ChIsh

Your code works fine for me:
>>> line = r'\xb7\xc7\xa0....'
>>> t=re.findall (r'\\x[0-9a-fA-F]+', line)
>>> t
['\\xb7', '\\xc7', '\\xa0']
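The two situations described in these answers can be told apart with a short Python 3 sketch. The sample values are taken from the question; the variable names and the choice of {2} instead of + (so that each match is exactly one two-digit escape) are my own.

```python
import re

# Case 1: the text really contains a backslash, an "x" and two hex digits.
escaped_text = r'example \xb7\xc7\xa0....'
literal_hits = re.findall(r'\\x[0-9a-fA-F]{2}', escaped_text)

# Case 2: the data contains the actual bytes 0xB7, 0xC7 and 0xA0.
raw_bytes = b'example \xb7\xc7\xa0....'
byte_hits = re.findall(rb'[\x80-\xff]', raw_bytes)

# The backslash pattern finds nothing in case 2, because no backslash
# character is present in the data.
no_hits = re.findall(rb'\\x[0-9a-fA-F]{2}', raw_bytes)

print(literal_hits, byte_hits, no_hits)
```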

Python regex for reading CSV-like rows

I want to parse incoming CSV-like rows of data. Values are separated with commas (and there could be leading and trailing whitespaces around commas), and can be quoted either with ' or with ". For example - this is a valid row:
data1, data2 ,"data3'''", 'data4""',,,data5,
but this one is malformed:
data1, data2, da"ta3", 'data4',
-- quotation marks can only be prepended or trailed by spaces.
Such malformed rows should be recognized. Ideally the malformed value within the row would be marked somehow, but it is also acceptable if the regex simply fails to match the whole row.
I'm trying to write a regex that can parse this, using either match() or findall(), but every regex I come up with has problems with edge cases.
So, maybe someone with experience in parsing something similar could help me on this?
(Or maybe this is too complex for regex and I should just write a function)
EDIT1:
csv module is not much of use here:
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
[['2', ' "dat', 'a1"', " 'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
[['2', 'dat,a1', "'dat", "a2'", '']]
-- unless this can be tuned?
EDIT2: A few language edits - I hope it's more valid English now
EDIT3: Thank you for all the answers. I'm now pretty sure that a regular expression is not a good idea here, as (1) covering all edge cases can be tricky and (2) the writer's output is not regular. Given that, I've decided to check out the mentioned pyparsing and either use it or write a custom FSM-like parser.

While the csv module is the right answer here, a regex that could do this is quite doable:
import re
r = re.compile(r'''
\s* # Any whitespace.
( # Start capturing here.
[^,"']+? # Either a series of non-comma non-quote characters.
| # OR
"(?: # A double-quote followed by a string of characters...
[^"\\]|\\. # That are either non-quotes or escaped...
)* # ...repeated any number of times.
" # Followed by a closing double-quote.
| # OR
'(?:[^'\\]|\\.)*'# Same as above, for single quotes.
) # Done capturing.
\s* # Allow arbitrary space before the comma.
(?:,|$) # Followed by a comma or the end of a string.
''', re.VERBOSE)
line = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
print r.findall(line)
# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']
EDIT: To validate lines, you can reuse the regex above with small additions:
import re
r_validation = re.compile(r'''
^(?: # Capture from the start.
# Below is the same regex as above, but condensed.
# One tiny modification is that it allows empty values
# The first plus is replaced by an asterisk.
\s*([^,"']*?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')\s*(?:,|$)
)*$ # And don't stop until the end.
''', re.VERBOSE)
line1 = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""
if r_validation.match(line1):
    print 'Line 1 is valid.'
else:
    print 'Line 1 is INvalid.'
if r_validation.match(line2):
    print 'Line 2 is valid.'
else:
    print 'Line 2 is INvalid.'
# Prints:
# Line 1 is valid.
# Line 2 is INvalid.

Although it would likely be possible with some combination of pre-processing, use of csv module, post-processing, and use of regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).
In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...
You can likely find examples of pyparsing-based CSV parsing easily; this question may be enough to get you started.

Python has a standard library module to read csv files:
import csv
reader = csv.reader(open('file.csv'))
for line in reader:
    print line
For your example input this prints
['data1', ' data2 ', "data3'''", ' \'data4""\'', '', '', 'data5', '']
EDIT:
You need to add skipinitialspace=True to allow spaces before double quotation marks for the extra examples you provided. Not sure about the single quotes yet.
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
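On the open point about single quotes: csv.reader accepts a quotechar parameter, but only one quote character per reader, so a single pass cannot honor both quoting styles. The following Python 3 sketch (io.StringIO replaces the Python 2 StringIO used above; the variable names are mine) shows what each setting does with the asker's row.

```python
import csv
import io

row = '''2, "dat,a1", 'dat,a2','''

# Default quotechar is the double quote; the single-quoted field is split.
double_quoted = list(csv.reader(io.StringIO(row), skipinitialspace=True))

# With quotechar="'" the single-quoted field survives, but the
# double-quoted one is now split at its embedded comma.
single_quoted = list(
    csv.reader(io.StringIO(row), skipinitialspace=True, quotechar="'")
)

print(double_quoted)
print(single_quoted)
```

This supports the asker's conclusion that the csv module alone does not cover input that mixes both quote styles.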

It is not possible to give you an answer, because you have not completely specified the protocol that is being used by the writer.
It evidently contains rules like:
If a field contains any commas or single quotes, quote it with double quotes.
Else if the field contains any double quotes, quote it with single quotes.
Note: the result is still valid if you swap double and single in the above 2 clauses.
Else don't quote it.
The resultant field may have spaces (or other whitespace?) prepended or appended.
The so-augmented fields are assembled into a row, separated by commas and terminated by the platform's newline (LF or CRLF).
What is not mentioned is what the writer does in these cases:
(0) field contains BOTH single quotes and double quotes
(1) field contains leading non-newline whitespace
(2) field contains trailing non-newline whitespace
(3) field contains any newlines.
Where the writer ignores any of these cases, please specify what outcomes you want.
You also mention "quotation marks can only be prepended or trailed by spaces" -- surely you mean commas are allowed also, otherwise your example 'data4""',,,data5, fails on the first comma.
How is your data encoded?

This probably sounds too simple, but from the looks of things you are really looking for a string that matches [a-zA-Z0-9]["']+[a-zA-Z0-9]. Without in-depth testing against the data, what you're looking for is a single or double quote (or any combination of them) between letters (you could also allow digits there).
Based on what you were asking, it doesn't really matter that it's a CSV; it matters that you have data that doesn't conform. I believe you can find it by searching for a letter, then any combination of one or more " or ', and then another letter.
Are you looking to get a count, or just a printout of each line that contains it, so you know which ones to go back and fix?
I'm sorry, I don't know Python's regex syntax, but in Perl this would look something like this:
# Look for one or more letter/number at least one ' or " or more and at least one
# or more letter/number
if ($line =~ m/[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+/ig)
{
# Prints the line if the above regex is found
print $line;
}
Simply adapt that to Python and apply it to each line.
I'm sorry if I misunderstood the question.
I hope it helps!
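A rough Python translation of the Perl check above might look like this. The function name is hypothetical, and the heuristic inherits the limitation of the original: a legitimately quoted field that contains the other kind of quote between letters (for example "it's") would also be flagged.

```python
import re

# Letters/digits, then one or more quote characters, then letters/digits.
EMBEDDED_QUOTE = re.compile(r"""[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+""")

def has_embedded_quote(line):
    """Return True if a quote appears directly between alphanumerics."""
    return EMBEDDED_QUOTE.search(line) is not None

# The two sample rows from the question.
valid_row = '''data1, data2 ,"data3\'\'\'", 'data4""',,,data5,'''
malformed_row = '''data1, data2, da"ta3", 'data4','''

for row in (valid_row, malformed_row):
    if has_embedded_quote(row):
        print(row)  # Only the malformed row is printed.
```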

If your goal is to convert the data to XML (or JSON, or YAML), look at this example for a Gelatin syntax that produces the following output:
<xml>
<line>
<column>data1</column>
<column>data2 </column>
<column>data3'''</column>
<column>data4""</column>
<column/>
<column/>
<column>data5</column>
<column/>
</line>
</xml>
Note that Gelatin also has a Python API:
from Gelatin.util import compile, generate_to_file
syntax = compile('syntax.gel')
generate_to_file(syntax, 'input.csv', 'output.xml', 'xml')
