Python: unpack a string into an array

I work with Ruby every day, but I have an issue in Python. I find the languages very similar... but I have some trouble migrating from Ruby :)
Please help me convert this action to Python:
string = "qwerty2012"
(var, some_var, another_var) = string.unpack("a1a4a*")
this should return three variables with unpacked values from string:
var = "q" # a1
some_var = "wert" # a4
another_var = "y2012" # a*
Help me to represent it in Python
Thank you!

s = "qwerty2012"
(a, b, c) = s[:1], s[1:5], s[5:]

Python does have a similar module named struct. It lacks the ability to grab the rest of the string in the same way that Ruby and PHP lifted from Perl. You can almost get there though:
>>> import struct
>>> s = 'qwerty2012'
>>> struct.unpack_from('1s4s', s)
('q', 'wert')
>>> def my_unpack(format, packed_string):
...     result = []
...     result.extend(struct.unpack_from(format, packed_string))
...     chars_gobbled = struct.calcsize(format)
...     rest = packed_string[chars_gobbled:]
...     if rest:
...         result.append(rest)
...     return result
...
>>> my_unpack('1s4s', 'qwerty2012')
['q', 'wert', 'y2012']
>>> my_unpack('1s4s', 'qwert')
['q', 'wert']
>>> [hex(x) for x in my_unpack('<I', '\xDE\xAD\xBE\xEF')]
['0xefbeadde']
I wish that the struct module implemented the rest of Perl's unpack and pack since they were incredibly useful functions for ripping apart binary packets but alas.
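On Python 3, struct operates on bytes rather than str, so the same helper can be sketched like this (a minimal sketch, assuming the input arrives as bytes):

```python
import struct

def my_unpack(fmt, packed):
    """Unpack `fmt` from `packed` bytes, appending any leftover
    bytes Perl-style (Python 3: struct wants bytes, not str)."""
    result = list(struct.unpack_from(fmt, packed))
    rest = packed[struct.calcsize(fmt):]
    if rest:
        result.append(rest)
    return result

print(my_unpack('1s4s', b'qwerty2012'))  # [b'q', b'wert', b'y2012']
```

Note the pieces come back as bytes objects; call `.decode()` on them if you need text.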

s = "qwerty2012"
var, some_var, another_var = s[:1], s[1:5], s[5:]
will do the assignment and yield respectively:
q
wert
y2012
The above assignment makes use of slice notation as described by the Python Docs. This SO post Good Primer for Python Slice Notation gives a good explanation too.
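For completeness, Python 3's starred assignment can also capture "the rest", though it yields a list of characters rather than a substring, so slicing remains the closer match to Ruby's a*:

```python
s = "qwerty2012"
var, some_var, another_var = s[:1], s[1:5], s[5:]  # slice notation

# starred unpacking: `rest` is a list of the remaining characters
head, *rest = s
print(var, some_var, another_var)  # q wert y2012
print(head, ''.join(rest))         # q werty2012
```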

Here's a preliminary recreation of unpack:
import re
import StringIO
def unpack(s, fmt):
    fs = StringIO.StringIO(s)
    res = []
    for do, num in unpack.pattern.findall(fmt):
        if num == '*':
            num = len(s)
        elif num == '':
            num = 1
        else:
            num = int(num)
        this = unpack.types[do](num, fs)
        if this is not None:
            res.append(this)
    return res

unpack.types = {
    '#': lambda n,s: s.seek(n),           # skip to offset
    'a': lambda n,s: s.read(n),           # string
    'A': lambda n,s: s.read(n).rstrip(),  # string, right-trimmed
    'b': lambda n,s: bin(reduce(lambda x,y: 256*x+ord(y), s.read(n), 0))[2:].zfill(8*n)[::-1],  # binary, LSB first
    'B': lambda n,s: bin(reduce(lambda x,y: 256*x+ord(y), s.read(n), 0))[2:].zfill(8*n),        # binary, MSB first
}
unpack.pattern = re.compile(r'([a-zA-Z#](?:_|!|<|>|!<|!>|0|))(\d+|\*|)')
It works for your given example,
unpack("qwerty2012", "a1a4a*") # -> ['q', 'wert', 'y2012']
but has a long list of datatypes not yet implemented (see the documentation).

This might ease your migration from Ruby:
import re
import struct
def unpack(format, a_string):
    pattern = r'''a(\*|\d+)\s*'''
    widths = [int(w) if w != '*' else 0 for w in re.findall(pattern, format)]
    if not widths[-1]:
        widths[-1] = len(a_string) - sum(widths)
    fmt = ''.join('%ds' % f for f in widths)
    return struct.unpack_from(fmt, a_string)
(var, some_var, another_var) = unpack('a1a4a*', 'qwerty2012') # also 'a1 a4 a*' OK
print (var, some_var, another_var)
Output:
('q', 'wert', 'y2012')


Parsing text files with "magic" values

Background
I have some large text files used in an automation script for audio tuning. Each line in the text file looks roughly like:
A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]] BANANA # BANANA
The text gets fed to an old command-line program which searches for keywords, and swaps them out. Sample output would be:
A[0] + B[100] - C[0x1000] [[0]] 0 # 0
A[2] + B[200] - C[0x100A] [[2]] 0 # 0
Problem
Sometimes, text files have keywords that are meant to be left untouched (i.e. cases where we don't want "BANANA" substituted). I'd like to modify the text files to use some kind of keyword/delimiter that is unlikely to pop up in normal circumstances, i.e:
A[#1] + B[#2] - C[#3] [[#1]] #1 # #1
Question
Does python's text file parser have any special indexing/escape sequences I could use instead of simple keywords?
Use a regular expression replacement function with a dictionary.
Match everything between brackets (excluding the brackets themselves) and replace it with the value from the dict, keeping the original value if the key is not found:
import re
d = {"BANANA": "400", "PINEAPPLE": "20", "CHERRY": "100"}
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
print(re.sub(r"\[([^\[\]]*)\]", lambda m: "[{}]".format(d.get(m.group(1), m.group(1))), s))
prints:
A[400] + B[20] - C[100] [[400]]
You can use re.sub to perform the substitution. This answer creates a list of randomized values to demonstrate; however, the list can be replaced with the data you are using:
import re
import random
s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
new_s = re.sub(r'(?<=\[)[a-zA-Z0-9]+(?=\])', '{}', s)
random_data = [[random.randint(1, 2000) for i in range(4)] for _ in range(10)]
final_results = [new_s.format(*i) for i in random_data]
for command in final_results:
    print(command)
Output:
A[51] + B[134] - C[864] [[1344]]
A[468] + B[1761] - C[1132] [[1927]]
A[1236] + B[34] - C[494] [[1009]]
A[1330] + B[1002] - C[1751] [[1813]]
A[936] + B[567] - C[393] [[560]]
A[1926] + B[936] - C[906] [[1596]]
A[1532] + B[1881] - C[871] [[1766]]
A[506] + B[1505] - C[1096] [[491]]
A[290] + B[1841] - C[664] [[38]]
A[1552] + B[501] - C[500] [[373]]
Just use
\[([^][]+)\]
And replace this with the desired result, e.g. 123.
Broken down, this says
\[ # opening bracket
([^][]+) # capture anything not brackets, 1+ times
\] # closing bracket
See a demo on regex101.com.
For your changed requirements, you could use an OrderedDict:
import re
from collections import OrderedDict

rx = re.compile(r'\[([^][]+)\]')
d = OrderedDict()

def replacer(match):
    item = match.group(1)
    d[item] = 1
    return '[#{}]'.format(list(d.keys()).index(item) + 1)

string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
string = rx.sub(replacer, string)
print(string)
Which yields
A[#1] + B[#2] - C[#3] [[#1]]
The idea here is to put every (potentially) new item into the dict, then look up its index. OrderedDicts remember insertion order.
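Since Python 3.7, plain dicts also preserve insertion order, so the OrderedDict can be dropped; a minimal sketch of the same idea using setdefault:

```python
import re

rx = re.compile(r'\[([^][]+)\]')
seen = {}

def replacer(match):
    item = match.group(1)
    # setdefault hands out the next index only the first time an item appears
    idx = seen.setdefault(item, len(seen) + 1)
    return '[#{}]'.format(idx)

s = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
result = rx.sub(replacer, s)
print(result)  # A[#1] + B[#2] - C[#3] [[#1]]
```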
For the sake of academic completeness, you could do it all on your own as well:
import re

class Replacer:
    rx = re.compile(r'\[([^][]+)\]')
    keywords = []

    def do_replace(self, match):
        idx = self.lookup(match.group(1))
        return '[#{}]'.format(idx + 1)

    def replace(self, string):
        return self.rx.sub(self.do_replace, string)

    def lookup(self, item):
        for idx, key in enumerate(self.keywords):
            if key == item:
                return idx
        self.keywords.append(item)
        return len(self.keywords) - 1

string = "A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]"
rpl = Replacer()
string = rpl.replace(string)
print(string)
Can also be done using pyparsing.
This parser essentially defines noun to be the uppercase things within square brackets, then defines a sequence of them to be one line of input, as complete.
To replace items identified with other things define a class derived from dict in a suitable way, so that anything not in the class is left unchanged.
>>> import pyparsing as pp
>>> noun = pp.Word(pp.alphas.upper())
>>> between = pp.CharsNotIn('[]')
>>> leftbrackets = pp.OneOrMore('[')
>>> rightbrackets = pp.OneOrMore(']')
>>> stmt = 'A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]'
>>> one = between + leftbrackets + noun + rightbrackets
>>> complete = pp.OneOrMore(one)
>>> complete.parseString(stmt)
(['A', '[', 'BANANA', ']', ' + B', '[', 'PINEAPPLE', ']', ' - C', '[', 'CHERRY', ']', ' ', '[', '[', 'BANANA', ']', ']'], {})
>>> class Replace(dict):
...     def __missing__(self, key):
...         return key
...
>>> replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})
>>> new = []
>>> for item in complete.parseString(stmt).asList():
...     new.append(replace[item])
...
>>> ''.join(new)
'A[1] + B[2] - C[CHERRY] [[1]]'
I think it's easier, and clearer, using plex. The snag is that it appears to be available only for Python 2; it took me an hour or two of conversion work to get it running on Python 3.
Just three types of tokens to watch for, then a similar number of branches within a while loop.
from plex import *
from io import StringIO

stmt = StringIO('A[BANANA] + B[PINEAPPLE] - C[CHERRY] [[BANANA]]')

lexicon = Lexicon([
    (Rep1(AnyBut('[]')), 'not_brackets'),
    (Str('['), 'left_bracket'),
    (Str(']'), 'right_bracket'),
])

class Replace(dict):
    def __missing__(self, key):
        return key

replace = Replace({'BANANA': '1', 'PINEAPPLE': '2'})

scanner = Scanner(lexicon, stmt)
new_statement = []
while True:
    token = scanner.read()
    if token[0] is None:
        break
    elif token[0] == 'not_brackets':
        new_statement.append(replace[token[1]])
    else:
        new_statement.append(token[1])
print(''.join(new_statement))
Result:
A[1] + B[2] - C[CHERRY] [[1]]

Parsing a symbolic equation in Python

I'm new to Python and find myself in the following situation. I work with equations stored as strings, such as:
>>> my_eqn = "A + 3.1B - 4.7D"
I'm looking to parse the string and store the numeric and alphabetic parts separately in two lists, or some other container. A (very) rough sketch of what I'm trying to put together would look like:
>>> foo = parse_and_separate(my_eqn);
>>> foo.numbers
[1.0, 3.1, -4.7]
>>> foo.letters
['A', 'B', 'D']
Any resources/references/pointers would be much appreciated.
Thanks!
Update
Here's one solution I came up with that's probably overly-complicated but seems to work. Thanks again to all those responded!
import re
my_eqn = "A + 3.1B - 4.7D"
# add a "1" in front of bare single letters
my_eqn = re.sub(r"(\b[A-Z]\b)", r"1\1", my_eqn, flags=re.I)
# store the coefficients and variable names separately via regex
variables = re.findall("[a-z]", my_eqn, re.I)
coeffs = re.findall(r"[-+]?\s?\d*\.\d+|\d+", my_eqn)
# strip out '+' characters and white space
coeffs = [s.strip('+') for s in coeffs]
coeffs = [s.replace(' ', '') for s in coeffs]
# coefficients should be floats
coeffs = list(map(float, coeffs))
# confirm answers
print(variables)
print(coeffs)
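A more compact single-pass version of the same parse can be sketched (assuming equations shaped like the example: optional sign, optional coefficient, then a variable name):

```python
import re

my_eqn = "A + 3.1B - 4.7D"
# optional sign, optional coefficient, then the variable name
terms = re.findall(r'([+-]?)\s*(\d*\.?\d*)\s*([A-Za-z]+)', my_eqn)
numbers = [float(sign + (coeff or '1')) for sign, coeff, _ in terms]
letters = [name for _, _, name in terms]
print(numbers)  # [1.0, 3.1, -4.7]
print(letters)  # ['A', 'B', 'D']
```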
This works for your simple scenario if you don't want to include any non-standard Python libraries:
class Foo:
    def __init__(self):
        self.numbers = []
        self.letters = []

def split_symbol(symbol, operator):
    name = ''
    multiplier = ''
    for letter in symbol:
        if letter.isalpha():
            name += letter
        else:
            multiplier += letter
    if not multiplier:
        multiplier = '1.0'
    if operator == '-':
        multiplier = operator + multiplier
    return name, float(multiplier)

def parse_and_separate(my_eqn):
    foo = Foo()
    equation = my_eqn.split()
    operator = ''
    for symbol in equation:
        if symbol in ['-', '+']:
            operator = symbol
        else:
            letter, number = split_symbol(symbol, operator)
            foo.numbers.append(number)
            foo.letters.append(letter)
    return foo

foo = parse_and_separate("A + 3.1B - 4.7D + 45alpha")
print(foo.numbers)
print(foo.letters)

Encoding a numeric string into a shortened alphanumeric string, and back again

Quick question. I'm trying to find or write an encoder in Python to shorten a string of numbers by using upper and lower case letters. The numeric strings look something like this:
20120425161608678259146181504021022591461815040210220120425161608667
The length is always the same.
My initial thought was to write some simple encoder to utilize upper and lower case letters and numbers to shorten this string into something that looks more like this:
a26Dkd38JK
That was completely arbitrary, just trying to be as clear as possible.
I'm certain that there is a really slick way to do this, probably already built in. Maybe this is an embarrassing question to even be asking.
Also, I need to be able to take the shortened string and convert it back to the longer numeric value.
Should I write something and post the code, or is this a one line built in function of Python that I should already know about?
Thanks!
This is a pretty good compression:
import base64

def num_to_alpha(num):
    num = hex(num)[2:].rstrip("L")
    if len(num) % 2:
        num = "0" + num
    return base64.b64encode(num.decode('hex'))
It first turns the integer into a bytestring and then base64 encodes it. Here's the decoder:
def alpha_to_num(alpha):
    num_bytes = base64.b64decode(alpha)
    return int(num_bytes.encode('hex'), 16)
Example:
>>> num_to_alpha(20120425161608678259146181504021022591461815040210220120425161608667)
'vw4LUVm4Ea3fMnoTkHzNOlP6Z7eUAkHNdZjN2w=='
>>> alpha_to_num('vw4LUVm4Ea3fMnoTkHzNOlP6Z7eUAkHNdZjN2w==')
20120425161608678259146181504021022591461815040210220120425161608667
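On Python 3, str.decode('hex') no longer exists; the same round trip can be sketched with int.to_bytes / int.from_bytes (function names follow the answer above):

```python
import base64

def num_to_alpha(num):
    # pack the integer into the fewest big-endian bytes that hold it
    raw = num.to_bytes((num.bit_length() + 7) // 8, 'big')
    return base64.b64encode(raw).decode('ascii')

def alpha_to_num(alpha):
    return int.from_bytes(base64.b64decode(alpha), 'big')

n = 20120425161608678259146181504021022591461815040210220120425161608667
print(num_to_alpha(n))
```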
There are two functions that are custom (not based on base64), but produce shorter output:
chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = len(chrs)

def int_to_cust(i):
    result = ''
    while i:
        result = chrs[i % l] + result
        i = i // l
    if not result:
        result = chrs[0]
    return result

def cust_to_int(s):
    result = 0
    for char in s:
        result = result * l + chrs.find(char)
    return result
And the results are:
>>> int_to_cust(20120425161608678259146181504021022591461815040210220120425161608667)
'9F9mFGkji7k6QFRACqLwuonnoj9SqPrs3G3fRx'
>>> cust_to_int('9F9mFGkji7k6QFRACqLwuonnoj9SqPrs3G3fRx')
20120425161608678259146181504021022591461815040210220120425161608667L
You can also shorten the generated string by adding more characters to the chrs variable.
Done with a class:
VALID_CHRS = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
BASE = len(VALID_CHRS)
MAP_CHRS = {k: v for k, v in zip(VALID_CHRS, range(BASE))}

class TinyNum:
    """Compact number representation in alphanumeric characters."""
    def __init__(self, n):
        result = ''
        while n:
            result = VALID_CHRS[n % BASE] + result
            n //= BASE
        if not result:
            result = VALID_CHRS[0]
        self.num = result

    def to_int(self):
        """Return the number as an int."""
        result = 0
        for char in self.num:
            result = result * BASE + MAP_CHRS[char]
        return result
Sample usage:
>>> n = 4590823745
>>> tn = TinyNum(n)
>>> print(n)
4590823745
>>> print(tn.num)
50GCYh
>>> print(tn.to_int())
4590823745
(Based on Tadeck's answer.)
>>> s="20120425161608678259146181504021022591461815040210220120425161608667"
>>> import base64, zlib
>>> base64.b64encode(zlib.compress(s))
'eJxly8ENACAMA7GVclGblv0X4434WrKFVW5CtJl1HyosrZKRf3hL5gLVZA2b'
>>> zlib.decompress(base64.b64decode(_))
'20120425161608678259146181504021022591461815040210220120425161608667'
so zlib isn't real smart at compressing strings of digits :(
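zlib only sees the byte patterns; since the payload is all digits, treating it as one big integer (as the earlier answers do) packs far tighter. A rough size comparison, assuming Python 3:

```python
import zlib

s = "20120425161608678259146181504021022591461815040210220120425161608667"
n = int(s)
as_int = n.to_bytes((n.bit_length() + 7) // 8, 'big')  # integer packing
as_zlib = zlib.compress(s.encode('ascii'))             # general-purpose
print(len(s), len(as_int), len(as_zlib))
```

The integer packing wins here because each digit carries only about 3.3 bits of information, while zlib has to pay fixed header overhead on a short input.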

Split string by count of characters

I can't figure out how to do this with string methods:
In my file I have something like 1.012345e0070.123414e-004-0.1234567891.21423... which means there is no delimiter between the numbers.
Now if I read a line from this file I get a string like above which I want to split after e.g. 12 characters.
There is no way to do this with something like str.split() or any other string method as far as I've seen but maybe I'm overlooking something?
Thx
Since you want to iterate in an unusual way, a generator is a good way to abstract that:
def chunks(s, n):
    """Produce `n`-character chunks from `s`."""
    for start in range(0, len(s), n):
        yield s[start:start+n]

nums = "1.012345e0070.123414e-004-0.1234567891.21423"
for chunk in chunks(nums, 12):
    print chunk
produces:
1.012345e007
0.123414e-00
4-0.12345678
91.21423
(which doesn't look right, but those are the 12-char chunks)
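The standard library can do the same chunking: textwrap.wrap breaks overlong words at the requested width. A sketch; break_on_hyphens is disabled so the '-' characters in the data don't produce ragged chunks:

```python
import textwrap

nums = "1.012345e0070.123414e-004-0.1234567891.21423"
# with no whitespace in the input, wrap simply chops at the width
chunks = textwrap.wrap(nums, 12, break_on_hyphens=False)
print(chunks)
```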
You're looking for string slicing.
>>> x = "1.012345e0070.123414e-004-0.1234567891.21423"
>>> x[2:10]
'012345e0'
line = "1.012345e0070.123414e-004-0.1234567891.21423"
firstNumber = line[:12]
restOfLine = line[12:]
print firstNumber
print restOfLine
Output:
1.012345e007
0.123414e-004-0.1234567891.21423
you can do it like this:
for i in range(0, len(string), 12):
    slice = string[i:i + 12]
in this way on each iteration you will get one slice of 12 characters.
from itertools import izip_longest

def grouper(n, iterable, padvalue=None):
    return izip_longest(*[iter(iterable)]*n, fillvalue=padvalue)
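On Python 3, izip_longest became zip_longest; here is a sketch of how the grouper above is applied to the original data, joining each tuple of characters back into a string:

```python
from itertools import zip_longest  # izip_longest on Python 2

def grouper(n, iterable, padvalue=None):
    return zip_longest(*[iter(iterable)] * n, fillvalue=padvalue)

nums = "1.012345e0070.123414e-004-0.1234567891.21423"
# each group is a tuple of 12 characters, padded with '' at the end
pieces = [''.join(g) for g in grouper(12, nums, padvalue='')]
print(pieces)  # ['1.012345e007', '0.123414e-00', '4-0.12345678', '91.21423']
```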
I stumbled on this while looking for a solution for a similar problem - but in my case I wanted to split string into chunks of differing lengths. Eventually I solved it with RE
In [13]: import re
In [14]: random_val = '07eb8010e539e2621cb100e4f33a2ff9'
In [15]: dashmap=(8, 4, 4, 4, 12)
In [16]: re.findall(''.join('(\S{{{}}})'.format(l) for l in dashmap), random_val)
Out[16]: [('07eb8010', 'e539', 'e262', '1cb1', '00e4f33a2ff9')]
Bonus
For those who may find it interesting - I tried to create pseudo-random ID by specific rules, so this code is actually part of the following function
import re, time, random

def random_id_from_time_hash(dashmap=(8, 4, 4, 4, 12)):
    random_val = ''
    while len(random_val) < sum(dashmap):
        random_val += '{:016x}'.format(hash(time.time() * random.randint(1, 1000)))
    return '-'.join(re.findall(''.join('(\S{{{}}})'.format(l) for l in dashmap), random_val)[0])
I always thought that since string addition is possible with simple logic, maybe division should be too. When divided by a number, a string should split into pieces of that length. So maybe this is what you are looking for.
class MyString:
    def __init__(self, string):
        self.string = string

    def __div__(self, div):
        l = []
        for i in range(0, len(self.string), div):
            l.append(self.string[i:i+div])
        return l
>>> m = MyString('abcbdbfbfbfb')
>>> m/3
['abc', 'bdb', 'fbf', 'bfb']
>>> m = MyString('abcd')
>>> m/3
['abc', 'd']
If you don't want to create an entirely new class, simply use this function that re-wraps the core of the above code,
>>> def string_divide(string, div):
...     l = []
...     for i in range(0, len(string), div):
...         l.append(string[i:i+div])
...     return l
...
>>> string_divide('abcdefghijklmnopqrstuvwxyz', 15)
['abcdefghijklmno', 'pqrstuvwxyz']
Try this loop:
x = "1.012345e0070.123414e-004-0.1234567891.21423"
while len(x) > 0:
    v = x[:12]
    print v
    x = x[12:]

Python Unknown pattern finding

Okay, basically what I want is to compress a file by reusing code, then replace the missing code at runtime. What I've come up with is really ugly and slow, but at least it works. The problem is that the file has no specific structure; for example 'aGVsbG8=\n', as you can see, is base64 encoding. My function is really slow because the file is 1700+ characters long and it checks for patterns one character at a time. Please help me with new, better code, or at least help me optimize what I've got :). Anything that helps is welcome! BTW I have already tried compression libraries, but they didn't compress as well as my ugly function.
def c_long(inp, cap=False, b=5):
    import re, string
    if cap is False: cap = len(inp)
    es = re.escape; le = len; ref = re.findall; ran = range; fi = string.find
    c = b; inpc = inp; pattern = inpc[:b]; l = []
    rep = string.replace; ins = list.insert
    while True:
        if c == le(inpc) and le(inpc) > b+1: c = b; inpc = inpc[1:]; pattern = inpc[:b]
        elif le(inpc) <= b: break
        if c == cap: c = b; inpc = inpc[1:]; pattern = inpc[:b]
        p = ref(es(pattern), inp)
        pattern += inpc[c]
        if le(p) > 1 and le(pattern) >= b+1:
            if l == []: l = [[pattern, le(p)+le(pattern)]]
            elif le(ref(es(inpc[:c+2]), inp)) + le(inpc[:c+2]) < le(p) + le(pattern):
                x = [pattern, le(p) + le(inpc[:c+1])]
                for i in ran(le(l)):
                    if x[1] >= l[i][1] and x[0][:-1] not in l[i][0]: ins(l, i, x); break
                    elif x[1] >= l[i][1] and x[0][:-1] in l[i][0]: l[i] = x; break
                inpc = inpc[:fi(inpc, x[0])] + inpc[le(x[0]):]
                pattern = inpc[:b]
                c = b-1
        c += 1
    d = {}; c = 0
    s = ran(le(l))
    for x in l: inp = rep(inp, x[0], '{%d}' % s[c]); d[str(s[c])] = x[0]; c += 1
    return [inp, d]

def decompress(inp, l): return apply(inp.format, [l[str(x)] for x in sorted([int(x) for x in l.keys()])])
The easiest way to compress base64-encoded data is to first convert it to binary data -- this will already save 25 percent of the storage space:
>>> s = "YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=\n"
>>> t = s.decode("base64")
>>> len(s)
37
>>> len(t)
26
In most cases, you can compress the string even further using some compression algorithm, like t.encode("bz2") or t.encode("zlib").
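Note that str.encode("bz2") and str.decode("base64") are Python 2 only; on Python 3 the codecs moved to the bz2 and zlib modules and operate on bytes. A minimal sketch of the same measurement:

```python
import base64
import bz2
import zlib

s = b"YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=\n"
t = base64.b64decode(s)   # binary payload, ~25% smaller than the base64 text
print(len(s), len(t))     # 37 26
# for inputs this short, the compressors' overhead can outweigh any gain
print(len(zlib.compress(t)), len(bz2.compress(t)))
```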
A few remarks on your code: There are lots of factors that make the code hard to read: inconsistent spacing, overly long lines, meaningless variable names, unidiomatic code, etc. An example: Your decompress() function could be equivalently written as
def decompress(compressed_string, substitutions):
    subst_list = [substitutions[k] for k in sorted(substitutions, key=int)]
    return compressed_string.format(*subst_list)
Now it's already much more obvious what it does. You could go one step further: Why is substitutions a dictionary with the string keys "0", "1" etc.? Not only is it strange to use strings instead of integers -- you don't need the keys at all! A simple list will do, and decompress() will simplify to
def decompress(compressed_string, substitutions):
    return compressed_string.format(*substitutions)
You might think all this is secondary, but if you make the rest of your code equally readable, you will find the bugs in your code yourself. (There are bugs -- it crashes for "abcdefgabcdefg" and many other strings.)
Typically one would pump the program through a compression algorithm optimized for text, then run that through exec, e.g.
code="""..."""
exec(somelib.decompress(code), globals=???, locals=???)
It may be the case that .pyc/.pyo files are compressed already, and one could check by creating one with x="""aaaaaaaa""", then increasing the length to x="""aaaaaaaaaaaaaaaaaaaaaaa...aaaa""" and seeing if the size changes appreciably.
