Parse a binary file with Regular Expressions? - python

I've got two bytes-type variables that I've concatenated (separated by a space) so I can send them as one variable to a server (socket programming). What I'm trying to figure out is how to then separate them and assign them back to their original variables using regular expressions. I've consulted regular expressions parsing a binary file but it wouldn't work for me. Here is what I tried, just to get the cipher variable:
ciphertext = re.match(b'\S', ciphertext)
It generally only matches the first couple of characters and returns a match object, which isn't what I want. What am I doing wrong?
Edit: I'm probably doing this the hard way. Honestly, any recommendation on how to send two bytes objects over a socket using UDP would be welcome. It's proving really difficult.

Ended up using str.rpartition (bytes objects support rpartition too) to solve my problem. It wasn't the most obvious answer, but it worked.
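For anyone landing here, a minimal sketch of that approach (variable names are illustrative); note it is only safe if the right-hand part can never itself contain the separator byte:

payload = ciphertext + b' ' + tag          # two hypothetical bytes objects
left, _, right = payload.rpartition(b' ')  # split on the LAST space
assert right == tag                        # holds only if tag contains no b' '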

Why are you using regex to do this? You should take a look at the struct module:
In [1]: import struct
In [2]: magic = b'\xcf\xfa\xed\xfe'
In [3]: decoded = struct.unpack('<I', magic)[0]
In [4]: hex(decoded)
Out[4]: '0xfeedfacf'
Also, you can use this recipe for decoding binary files
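That said, if the goal is just to carry two bytes objects in one UDP datagram, a space separator is fragile, since binary data can contain the space byte itself. A more robust option is length-prefix framing, and struct is a natural fit; here is a minimal sketch (the function names and the 4-byte prefix size are my own choices, not from the answer above):

import struct

def pack_two(a, b):
    # Prefix each part with its length as a 4-byte big-endian unsigned int.
    return struct.pack('>I', len(a)) + a + struct.pack('>I', len(b)) + b

def unpack_two(datagram):
    (len_a,) = struct.unpack_from('>I', datagram, 0)
    a = datagram[4:4 + len_a]
    (len_b,) = struct.unpack_from('>I', datagram, 4 + len_a)
    b = datagram[8 + len_a:8 + len_a + len_b]
    return a, b

The sender passes pack_two(ciphertext, tag) to sendto; the receiver calls unpack_two on the datagram it gets back from recvfrom.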

Related

Fastest way to extract part of a long string in Python

I have a large set of strings, and am looking to extract a certain part of each of the strings. Each string contains a sub string like this:
my_token:[
"key_of_interest"
],
This is the only place in each string where my_token appears. I was thinking about getting the index where ' my_token:[" ' ends, then the index where the following ' "], ' begins, and taking all the text between those two positions.
Is there a better or more efficient way of doing this? I'll be doing this for strings of length ~10,000 and sets of size 100,000.
Edit: The file is a .ion file. From my understanding it can be treated as a flat file - as it is text based and used for describing metadata.
How can this possibly be done the "dumbest and simplest way"?
find the starting position
scan on from there for the ending position
grab everything indiscriminately between the two
This is indeed what you're doing (a baseline sketch follows the list below). Thus any further improvement can only come from the optimization of each step. Possible ways include:
narrow down the search region (requires additional constraints/assumptions as per comment56995056)
speed up the search operation bits, which include:
extracting raw data from the format
you already did this by disregarding the format altogether - so you have to make sure there'll never be any incorrect parsing (e.g. your search terms embedded in strings elsewhere or matching a part of a token) as per comment56995034
elementary pattern comparison operation
unlikely to improve on in pure Python, since str.index is already implemented in C and the implementation is probably already about as simple as it can be
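For reference, a minimal sketch of the baseline find-and-slice approach (the marker strings come from the question's snippet; the exact whitespace around them is an assumption):

def extract_key(s):
    # Jump to the token, then slice out whatever sits between the
    # next pair of double quotes.
    start = s.index('my_token:[')
    start = s.index('"', start) + 1   # opening quote of the value
    end = s.index('"', start)         # closing quote
    return s[start:end]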
The underlying requirement shows through when you clarify:
I was thinking about getting the end index position of ' my_token:[" ' and after that getting the beginning index position of ' "], ' and getting all the text between those two index positions.
That sounds like you're trying to avoid the correct approach: use a parser for whatever language is in the string.
There is no good reason to build directly on top of string primitives for parsing, unless you are interested in writing yet another parsing framework.
So, use libraries written by people who have dealt with the issues before you.
If it's JSON, use the standard library json module; ditto if it's some other language with a parser already in the Python standard library.
If it's some other widely-implemented standard: get whichever already-existing third-party Python library knows how to parse that properly.
If it's not already implemented: write a custom parser using pyparsing or some other well-known solid library.
So to make a good choice you need to know what the data format is (this is not answered by "what are the file names"; rather, you need to know the data format of the content of those files). Then you'll be able to search for a parser library that knows about that data format.
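For instance, if the content turns out to be plain JSON (the question's .ion data may well not be, so treat this as an assumption), the whole task reduces to this, with escaping and nesting handled for you:

import json

def extract_key(s):
    # Assumes valid JSON where my_token maps to a one-element list,
    # as in the question's snippet.
    return json.loads(s)['my_token'][0]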
Well, as already mentioned - a parser seems the best option.
But to answer your question without all this extra advice: if you're just looking at speed, a parser isn't really the best method of doing this. The faster method, given you already have a string like this, would be to use regex.
matches = re.search(r'my_token:\[\s*"(.*?)"\s*\]\s*,', s)
key_of_interest = matches.group(1)
There are other issues that come up. For example, what if your key has a " inside it? Stringified JSON will automatically use an escape character there, and that will be captured by the regex too, so this gets a bit too complicated. And JSON is not regex-parsable in itself (is-json-a-regular-language). So, use at your own risk. But with the appropriate restrictions and assumptions, regex would be faster than a JSON parser.
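Given the question mentions sets of 100,000 strings, it's worth compiling the pattern once and reusing it; a sketch (strings is a hypothetical iterable of the inputs):

import re

pattern = re.compile(r'my_token:\[\s*"(.*?)"\s*\]\s*,')  # compiled once

def extract_all(strings):
    for s in strings:
        m = pattern.search(s)
        if m:
            yield m.group(1)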

Appending '0x' before the hex numbers in a string

I'm parsing an XML file in which I get basic expressions (like id*10+2). What I am trying to do is to evaluate the expression to actually get the value. To do so, I use the eval() method, which works very well.
The only thing is that the numbers are in fact hexadecimal numbers. The eval() method would work well if every hex number were prefixed with '0x', but I could not find a way to do that, nor could I find a similar question here. How would it be done in a clean way?
Use the re module.
>>> import re
>>> re.sub(r'([\dA-F]+)', r'0x\1', 'id*A+2')
'id*0xA+0x2'
>>> eval(re.sub(r'([\dA-F]+)', r'0x\1', 'CAFE+BABE'))
99772
Be warned though: with invalid input, the eval won't work. There are also many risks in using eval.
If your hex numbers have lowercase letters, then you could use this:
>>> re.sub(r'(?<!i)([\da-fA-F]+)', r'0x\1', 'id*a+b')
'id*0xa+0xb'
This uses a negative lookbehind assertion to ensure that the letter i does not come before the section it is trying to convert (preventing 'id' from turning into 'i0xd'). Replace i with I if the variable is Id.
If you can parse the expression into individual numbers, then I would suggest using the int function:
>>> int("CAFE", 16)
51966
Be careful with eval! Never use it on untrusted input.
If it's just simple arithmetic, I'd use a custom parser (there are tons of examples out in the wild)... And using parser generators (flex/bison, antlr, etc.) is a skill that is useful and easily forgotten, so it could be a good chance to refresh or learn it.
One option is to use the parser module (Python 2 era; the module was removed in Python 3.10):
import parser, token, re

def hexify(ast):
    if not isinstance(ast, list):
        return ast
    if ast[0] in (token.NAME, token.NUMBER) and re.match('[0-9a-fA-F]+$', ast[1]):
        return [token.NUMBER, '0x' + ast[1]]
    return [hexify(node) for node in ast]  # a real list, so sequence2st accepts it

def hexified_eval(expr, *args):
    ast = parser.sequence2st(hexify(parser.expr(expr).tolist()))
    return eval(ast.compile(), *args)

>>> hexified_eval('id*10 + BABE', {'id': 0xcafe})
879262
This is somewhat cleaner than a regex solution in that it only attempts to replace tokens that have been positively identified as either names or numbers (and look like hex numbers). It also correctly handles more general python expressions such as id*10 + len('BABE') (it won't replace 'BABE' with '0xBABE').
OTOH, the regex solution is simpler and might cover all the cases you need to deal with anyway.
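On Python 3.10+ (where the parser module no longer exists), the same token-level rewrite can be sketched with the tokenize module; this is a rough equivalent of the idea above, not a drop-in port:

import io
import re
import tokenize

HEX_LIKE = re.compile(r'[0-9a-fA-F]+$')

def hexify_source(expr):
    # Rewrite NAME/NUMBER tokens that look like bare hex numbers into 0x literals.
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(expr).readline):
        if tok.type in (tokenize.NAME, tokenize.NUMBER) and HEX_LIKE.match(tok.string):
            out.append((tokenize.NUMBER, '0x' + tok.string))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)

>>> eval(hexify_source('id*10 + BABE'), {'id': 0xcafe})
879262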

Split shell-like syntax in Haskell?

How can I split a string in shell-style syntax in Haskell? The equivalent in Python is shlex.split.
>>> shlex.split('''/nosuchconf "/this doesn't exist either" "yep"''')
['/nosuchconf', "/this doesn't exist either", 'yep']
I'm not sure what exactly you mean: are you wanting to get all quoted sub-strings from a String? Note that unlike Python, etc., Haskell has only one set of quotes that indicates something is a String, namely "...".
Possibilities to consider:
The words and lines functions
The split package
Write a custom parser using polyparse, uu-parsinglib, parsec, etc.
It may be useful if you specified why you want such functionality: are you trying to parse existing shell scripts? Then language-sh might be of use. But you shouldn't be using such Strings internally in Haskell; use [String] or something similar instead.

How would one convert a Python string representation of a byte-string to an actual byte-string? [duplicate]

This question already has an answer here:
Converting python string into bytes directly without eval()
(1 answer)
Closed 4 years ago.
I'm trying to figure out how one might convert a string representation of a byte-string into an actual byte-string type. I'm not very used to Python (just hacking on it to help a friend), so I'm not sure if there's some easy "casting" method (like my beloved Java has ;) ). Basically I have a text file which has as its contents a byte-string:
b'\x03\xacgB\x16\xf3\xe1\\v\x1e\xe1\xa5\xe2U\xf0g\x956#\xc8\xb3\x88\xb4E\x9e\x13\xf9x\xd7\xc8F\xf4'
I currently read in this file as follows:
aFile = open('test.txt')
x = aFile.read()
print(x) # prints b'\x03\xacgB\x16\xf3\xe1\\v\x1e\xe1\xa5\xe2U\xf0g\x956#\xc8\xb3\x88\xb4E\x9e\x13\xf9x\xd7\xc8F\xf4'
print(type(x)) # prints <class 'str'>
How do I make x be of type <class 'bytes'>? Thanks for any help.
Edit: Having read one of the replies below, I think I'm maybe constraining the question too much. My apologies for that. The input string doesn't have to be in Python byte-string format (i.e. with the b and the quotation marks); it could just be the plain byte-string:
\x03\xacgB\x16\xf3\xe1\\v\x1e\xe1\xa5\xe2U\xf0g\x956#\xc8\xb3\x88\xb4E\x9e\x13\xf9x\xd7\xc8F\xf4
If this makes it easier or is better practice, I can use this.
>>> r'\x03\xacgB\x16\xf3\xe1\\v\x1e\xe1\xa5\xe2U\xf0g\x956#\xc8\xb3\x88\xb4E\x9e\x13\xf9x\xd7\xc8F\xf4'.decode('string-escape')
'\x03\xacgB\x16\xf3\xe1\\v\x1e\xe1\xa5\xe2U\xf0g\x956#\xc8\xb3\x88\xb4E\x9e\x13\xf9x\xd7\xc8F\xf4'
This works on Python 2 for strings that don't have b'...' around them. Otherwise you are encouraged to use ast.literal_eval().
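On Python 3, which the question's <class 'str'> output points to, a minimal sketch covering both cases (test.txt as in the question):

import ast

# Case 1: the text includes the b'...' wrapper -> let literal_eval parse it safely.
with open('test.txt') as f:
    data = ast.literal_eval(f.read())   # -> bytes

# Case 2: bare escape sequences without the b'...' wrapper.
s = r'\x03\xacgB'
# The latin-1 round-trip works because every \xNN escape stays below 256.
data = s.encode('latin-1').decode('unicode_escape').encode('latin-1')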
Since your input is in Python's syntax, for some reason (*), the thing to do here is just call eval:
>>> r"b'\x12\x12'"
"b'\\x12\\x12'"
>>> eval(r"b'\x12\x12'")
b'\x12\x12'
Be careful, though, as this may be a security problem. eval will run any code, so you may need to sanitize the input. In your case it's simple - just check that the thing you're eval-ing is indeed a string in the format you expect. If security isn't an issue here, just don't bother.
Regarding your EDIT: eval is still the simplest approach here (after adding the b'' if it's not there). You could also, of course, do this manually by converting each \xXX to its real value.
(*) Why, really? This seems like a strange choice for a data representation format

Python regex parse stream

Is there any way to use regex match on a stream in python?
like
reg = re.compile(r'\w+')
reg.match(StringIO.StringIO('aa aaa aa'))
And I don't want to do this by getting the value of the whole string. I want to know if there's any way to match a regex on a stream (on-the-fly).
I had the same problem. The first thought was to implement a LazyString class, which acts like a string but only reading as much data from the stream as currently needed (I did this by reimplementing __getitem__ and __iter__ to fetch and buffer characters up to the highest position accessed...).
This didn't work out (I got a "TypeError: expected string or buffer" from re.match), so I looked a bit into the implementation of the re module in the standard library.
Unfortunately using regexes on a stream seems not possible. The core of the module is implemented in C and this implementation expects the whole input to be in memory at once (I guess mainly because of performance reasons). There seems to be no easy way to fix this.
I also had a look at PLY (Python Lex-Yacc), but its lexer uses re internally, so this wouldn't solve the issue.
A possibility could be to use ANTLR which supports a Python backend. It constructs the lexer using pure python code and seems to be able to operate on input streams. Since for me the problem is not that important (I do not expect my input to be extensively large...), I will probably not investigate that further, but it might be worth a look.
In the specific case of a file, if you can memory-map the file with mmap and if you're working with bytestrings instead of Unicode, you can feed a memory-mapped file to re as if it were a bytestring and it'll just work. This is limited by your address space, not your RAM, so a 64-bit machine with 8 GB of RAM can memory-map a 32 GB file just fine.
If you can do this, it's a really nice option. If you can't, you have to turn to messier options.
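A minimal sketch of the mmap route (the file name and pattern are illustrative; a mmap object supports the buffer protocol, so re can scan it directly):

import mmap
import re

with open('big.log', 'rb') as f:   # hypothetical large file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for m in re.finditer(rb'\w+', mm):   # bytes pattern against the mapping
            print(m.group())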
The 3rd-party regex module (not re) offers partial match support, which can be used to build streaming support... but it's messy and has plenty of caveats. Things like lookbehinds and ^ won't work, zero-width matches would be tricky to get right, and I don't know if it'd interact correctly with other advanced features regex offers and re doesn't. Still, it seems to be the closest thing to a complete solution available.
If you pass partial=True to regex.match, regex.fullmatch, regex.search, or regex.finditer, then in addition to reporting complete matches, regex will also report things that could be a match if the data was extended:
In [10]: regex.search(r'1234', '12', partial=True)
Out[10]: <regex.Match object; span=(0, 2), match='12', partial=True>
It'll report a partial match instead of a complete match if more data could change the match result, so for example, regex.search(r'[\s\S]*', anything, partial=True) will always be a partial match.
With this, you can keep a sliding window of data to match, extending it when you hit the end of the window and discarding consumed data from the beginning. Unfortunately, anything that would get confused by data disappearing from the start of the string won't work, so lookbehinds, ^, \b, and \B are out. Zero-width matches would also need careful handling. Here's a proof of concept that uses a sliding window over a file or file-like object:
import regex

def findall_over_file_with_caveats(pattern, file):
    # Caveats:
    # - doesn't support ^ or backreferences, and might not play well with
    #   advanced features I'm not aware of that regex provides and re doesn't.
    # - Doesn't do the careful handling that zero-width matches would need,
    #   so consider behavior undefined in case of zero-width matches.
    # - I have not bothered to implement findall's behavior of returning groups
    #   when the pattern has groups.
    # Unlike findall, produces an iterator instead of a list.

    # bytes window for bytes pattern, unicode window for unicode pattern.
    # We assume the file provides data of the same type.
    window = pattern[:0]
    chunksize = 8192
    sentinel = object()
    last_chunk = False

    while not last_chunk:
        chunk = file.read(chunksize)
        if not chunk:
            last_chunk = True
        window += chunk

        match = sentinel
        for match in regex.finditer(pattern, window, partial=not last_chunk):
            if not match.partial:
                yield match.group()

        if match is sentinel or not match.partial:
            # No partial match at the end (maybe even no matches at all).
            # Discard the window. We don't need that data.
            # The only cases I can find where we do this are if the pattern
            # uses unsupported features or if we're on the last chunk, but
            # there might be some important case I haven't thought of.
            window = window[:0]
        else:
            # Partial match at the end.
            # Discard all data not involved in the match.
            window = window[match.start():]
            if match.start() == 0:
                # Our chunks are too small. Make them bigger.
                chunksize *= 2
This seems to be an old problem. As I have posted to a similar question, you may want to subclass the Matcher class of my solution streamsearch-py and perform regex matching in the buffer. Check out the kmp_example.py for a template. If it turns out classic Knuth-Morris-Pratt matching is all you need, then your problem is solved right now with this little open source library :-)
The answers here are now outdated. The modern Python re package supports bytes-like objects, which have an API you can implement yourself to get streaming behaviour.
Yes - using the getvalue method:
import cStringIO
import re
data = cStringIO.StringIO("some text")
regex = re.compile(r"\w+")
regex.match(data.getvalue())
