search for hexadecimal number on python using re - python

I am processing an html text file, and serching for hexadecimal numbers as follows:
example \xb7\xc7\xa0....
I tried with this code
t=re.findall (r'\\x[0-9a-fA-F]+', line)
but I can only gained empty list.
please tell the right way of writing the code.

It works fine for me. Two scenarios come to mind that might explain your problem:
You're testing this by assigning the string to a variable line like so:
line = 'example \xb7\xc7\xa0....'
In this case, you need to escape the backslashes:
line = 'example \\xb7\\xc7\\xa0....'
You are viewing the contents of the file or line as a Python string, so that the \xb7 you are seeing is actually the character who's code is B7 hex, not the character sequence '\', '\x', 'b', '7'.

Your code works fine if the backslash is escaped inside the regular expression:
t = re.findall (r'\\x[0-9a-fA-F]+', line)
Result:
['\\xb7', '\\xc7', '\\xa0']
ideone: http://ideone.com/MPO5j
If it doesn't work it might be because you string contains literal binary characters. Then try something like this instead:
t = re.findall (r'[\x80-\xff]', line)
ideone: http://ideone.com/ChIsh

Your code works fine for me:
>>> line = r'\xb7\xc7\xa0....'
>>> t=re.findall (r'\\x[0-9a-fA-F]+', line)
>>> t
['\\xb7', '\\xc7', '\\xa0']

Related

Python - Parsing JSON formatted text file with regex

I have a text file formatted like a JSON file however everything is on a single line (could be a MongoDB File). Could someone please point me in the direction of how I could extract values using a Python regex method please?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.‌​au\/ns\/business\/wi‌​ki","author":null,"d‌​escription":null,"fi‌​leAssetId":"034b9317‌​-60d9-45c2-b6d6-0f24‌​b59e1991","filename"‌​:"Reports.pdf"},"cre‌​atedBy":1531,"create‌​dByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acro‌​bat.png","id":3041,"‌​inheritedPermissions‌​":false,"name":"map"‌​,"permissions":[23,8‌​7,35,49,65],"type":3‌​,"viewLevel":2},{"__‌​type":"WikiNode:http‌​:\/\/samplesite.com.‌​au\/ns\/business\/wi‌​ki","children":[],"c‌​ontent":
I am wanting to get the "fileAssetId" and filename". Ive tried to load the like with Pythons JSON module but I get an error
For the FileAssetid I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But i get the following 034b9317‌​, 60d9, 45c2, b6d6, 0f24‌​b59e1991
Im not to sure how to get the data as its displayed.
How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator with the objects.
You can use python's walk method and check each entry with re.match.
In case that the string you got is not convertable to a python dict, you can use just regex:
print re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_pattern).group(1)
Solution for your example:
import re
example_string = '{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.u\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
executing this yields:
34b9317‌​-60d9-45c2-b6d6-0f24‌​b59e1991
Try adding \n to the string that you are entering in to the file (\n means new line)
Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
json_pattern = (
r'(?(DEFINE)'
r'(?P<whitespace>( |\n|\r|\t)*)'
r'(?P<boolean>true|false)'
r'(?P<number>-?(0|([1-9]\d*))(\.\d*[1-9])?([eE][+-]?\d+)?)'
r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(? &object)|null)(?&whitespace))'
r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
r'(?P<document>(?&object)|(?&array))'
r')'
r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
You can change last line in json_pattern to match not document but individual objects replacing (?&document) by (?&object). I think the regex is easier than I expected, but I did not run extensive tests on this. It works fine for me and I have tested hundreds of files. I wil try to improve my answer in case I find any issue when running it.

Read regexes from file and avoid or undo escaping

I want to read regular expressions from a file, where each line contains a regex:
lorem.*
dolor\S*
The following code is supposed to read each and append it to a list of regex strings:
vocabulary=[]
with open(path, "r") as vocabularyFile:
for term in vocabularyFile:
term = term.rstrip()
vocabulary.append(term)
This code seems to escape the \ special character in the file as \\. How can I either avoid escaping or unescape the string so that it can be worked with as if I wrote this?
regex = r"dolor\S*"
You are getting confused by echoing the value. The Python interpreter echoes values by printing the repr() function result, and this makes sure to escape any meta characters:
>>> regex = r"dolor\S*"
>>> regex
'dolor\\S*'
regex is still an 8 character string, not 9, and the single character at index 5 is a single backslash:
>>> regex[4]
'r'
>>> regex[5]
'\\'
>>> regex[6]
'S'
Printing the string writes out all characters verbatim, so no escaping takes place:
>>> print(regex)
dolor\S*
The same process is applied to the contents of containers, like a list or a dict:
>>> container = [regex, 'foo\nbar']
>>> print(container)
['dolor\\S*', 'foo\nbar']
Note that I didn't echo there, I printed. str(list_object) produces the same output as repr(list_object) here.
If you were to print individual elements from the list, you get the same unescaped result again:
>>> print(container[0])
dolor\S*
>>> print(container[1])
foo
bar
Note how the \n in the second element was written out as a newline now. It is for that reason that containers use repr() for contents; to make otherwise hard-to-detect or non-printable data visible.
In other words, your strings do not contain escaped strings here.

python replace unicode characters

I wrote a program to read in Windows DNS debugging log, but inside always got some funny characters in the domain field.
Below is one of the example:
(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
I want to replace all the \x.. with a ?
I explicitly type \xc2 as follows works
line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
re.sub('\\\xc2', '?', line)
result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)'
But its not working if I write as follow:
re.sub('\\\x..', '?', line)
How I can write a regular expression to replace them all?
There are better tools for this job than regex, you could try for example:
>>> line
'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
>>> line.decode('ascii', 'ignore')
u'(13)p(5)example(3)com(0)'
That skips non-ascii characters. Or with replace, you can swap them for a '?' placeholder:
>>> print line.decode('ascii', 'replace')
(13)��������p����(5)example(3)com(0)
But the best solution is to find out what erroneous encoding/decoding caused the mojibake to happen in the first place, so you can recover data by using the correct code pages.
There is an excellent answer about unbaking emojibake here. Note that it's an inexact science, and a lot of the crucial information is actually in the comment thread under that answer.
what about this?
line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
pattern = r'\\x.+'
re.sub(pattern, r'?', line)

python regular expression not matching file contents with re.match and re.MULTILINE flag

I'm reading in a file and storing its contents as a multiline string. Then I loop through some values I get from a django query to run regexes based on the query results values. My regex seems like it should be working, and works if I copy the values returned by the query, but for some reason isn't matching when all the parts are working together that ends like this
My code is:
with open("/path_to_my_file") as myfile:
data=myfile.read()
#read saved settings then write/overwrite them into the config
items = MyModel.objects.filter(some_id="s100009")
for item in items:
regexString = "^\s*"+item.feature_key+":"
print regexString #to verify its what I want it to be, ie debug
pq = re.compile(regexString, re.M)
if pq.match(data):
#do stuff
So basically my problem is that the regex isn't matching. When I copy the file contents into a big old string, and copy the value(s) printed by the print regexString line, it does match, so I'm thinking theres some esoteric python/django thing going on (or maybe not so esoteric as python isnt my first language).
And for examples sake, the output of print regexString is :
^\s*productDetailOn:
File contents:
productDetailOn:true,
allOff:false,
trendingWidgetOn:true,
trendingWallOn:true,
searchResultOn:false,
bannersOn:true,
homeWidgetOn:true,
}
Running Python 2.7. Also, dumped the types of both item.feature and data, and both were unicode. Not sure if that matters? Anyway, I'm starting to hit my head off the desk after working this for a couple hours, so any help is appreciated. Cheers!
According to documentation, re.match never allows searching at the beginning of a line:
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
You need to use a re.search:
regexString = r"^\s*"+item.feature_key+":"
pq = re.compile(regexString, re.M)
if pq.search(data):
A small note on the raw string (r"^\s+"): in this case, it is equivalent to "\s+" because there is no \s escape sequence (like \r or \n), thus, Python treats it as a raw string literal. Still, it is safer to always declare regex patterns with raw string literals in Python (and with corresponding notations in other languages, too).

How to eliminate the ☎ unicode?

During web scraping and after getting rid of all html tags, I got the black telephone character \u260e in unicode (☎). But unlike this response I do want to get rid of it too.
I used the following regular expressions in Scrapy to eliminate html tags:
pattern = re.compile("<.*?>| |&",re.DOTALL|re.M)
Then I tried to match \u260e and I think I got caught by the backslash plague. I tried unsuccessfully this patterns:
pattern = re.compile("<.*?>| |&|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>| |&|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>| |&|\\\\u260e",re.DOTALL|re.M)
None of this worked and I still have \u260e as an output.
How can I make this disappear?
Using Python 2.7.3, the following works fine for me:
import re
pattern = re.compile(u"<.*?>| |&|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)
Output:
u'bla ble blo'
As pointed by #Zack, this works due to the fact that the string is now in unicode, i.e., the string is already converted, and the sequence of characters \u260e is now the -- probably -- two bytes used to write that little black phone ☎ (:
Once both the string to be searched and the regular expression have the black phone itself, and not the sequence of characters \u260e, they both match.
If your string is already unicode, there's two easy ways. The second one will affect more than just the ☎, obviously.
>>> import string
>>> foo = u"Lorum ☎ Ipsum"
>>> foo.replace(u'☎', '')
u'Lorum Ipsum'
>>> "".join(s for s in foo if s in string.printable)
u'Lorum Ipsum'
Remove non-ascii characters but leave periods and spaces for more information about string.printable
The SHORTEST way to remove multiple spaces in a string in Python if you don't want multiple whitespaces.
You may try with BeatifulSoup, as explained here, with something like
soup = BeautifulSoup (html.decode('utf-8', 'ignore'))

Categories

Resources