python bytes to bit string

I have a value of type bytes that needs to be converted to a BIT STRING
bytes_val = (b'\x80\x00', 14)
the bytes at index zero need to be converted to a bit string whose length is indicated by the second element (14 in this case), formatted as groups of 8 bits like below.
expected output => '10000000 000000'B
Another example
bytes_val2 = (b'\xff\xff\xff\xff\xf0\x00', 45) #=> '11111111 11111111 11111111 11111111 11110000 00000'B

What about some combination of formatting (below with an f-string, but it can be done otherwise) and slicing:
def bytes2binstr(b, n=None):
    s = ' '.join(f'{x:08b}' for x in b)
    return s if n is None else s[:n + n // 8 + (0 if n % 8 else -1)]
If I understood correctly (I am not sure what the B at the end is supposed to mean), it passes your tests and a couple more:
func = bytes2binstr
args = (
    (b'\x80\x00', None),
    (b'\x80\x00', 14),
    (b'\x0f\x00', 14),
    (b'\xff\xff\xff\xff\xf0\x00', 16),
    (b'\xff\xff\xff\xff\xf0\x00', 22),
    (b'\x0f\xff\xff\xff\xf0\x00', 45),
    (b'\xff\xff\xff\xff\xf0\x00', 45),
)
for arg in args:
    print(arg)
    print(repr(func(*arg)))
# (b'\x80\x00', None)
# '10000000 00000000'
# (b'\x80\x00', 14)
# '10000000 000000'
# (b'\x0f\x00', 14)
# '00001111 000000'
# (b'\xff\xff\xff\xff\xf0\x00', 16)
# '11111111 11111111'
# (b'\xff\xff\xff\xff\xf0\x00', 22)
# '11111111 11111111 111111'
# (b'\x0f\xff\xff\xff\xf0\x00', 45)
# '00001111 11111111 11111111 11111111 11110000 00000'
# (b'\xff\xff\xff\xff\xf0\x00', 45)
# '11111111 11111111 11111111 11111111 11110000 00000'
Explanation
we start from a bytes object
iterating through it gives us a single byte as a number
each byte is 8 bits, so formatting byte by byte already gives us the correct separation
each byte is formatted using the b binary specifier, with some additional options: 0 for zero fill, 8 for minimum width
we join (concatenate) the result of the formatting using ' ' as "separator"
finally the result is returned as is if a maximum number of bits n was not specified (set to None), otherwise the result is cropped to n + the number of spaces that were added in-between the 8-character groups.
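If the trailing B in the question's expected output (ASN.1-style BIT STRING notation) is actually needed, a small hypothetical wrapper around bytes2binstr() could add it:
def bytes2binstr_b(b, n=None):
    # assumes bytes2binstr() from above; adds the quotes and trailing B
    return "'{}'B".format(bytes2binstr(b, n))

# bytes2binstr_b(b'\x80\x00', 14)  # => "'10000000 000000'B"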
In the solution above, 8 is somewhat hard-coded.
If you want it to be a parameter, you may want to look into (possibly a variation of) @kederrac's first answer using int.from_bytes().
This could look something like:
def bytes2binstr_frombytes(b, n=None, k=8):
    s = '{x:0{m}b}'.format(m=len(b) * 8, x=int.from_bytes(b, byteorder='big'))[:n]
    return ' '.join([s[i:i + k] for i in range(0, len(s), k)])
which gives the same output as above.
Speedwise, the int.from_bytes()-based solution is also faster:
import random

funcs = (bytes2binstr, bytes2binstr_frombytes)  # the two implementations above
for i in range(2, 7):
    n = 10 ** i
    print(n)
    b = b''.join([random.randint(0, 2 ** 8 - 1).to_bytes(1, 'big') for _ in range(n)])
    for func in funcs:
        print(func.__name__, funcs[0](b, n * 7) == func(b, n * 7))
        %timeit func(b, n * 7)
    print()
# 100
# bytes2binstr True
# 10000 loops, best of 3: 33.9 µs per loop
# bytes2binstr_frombytes True
# 100000 loops, best of 3: 15.1 µs per loop
# 1000
# bytes2binstr True
# 1000 loops, best of 3: 332 µs per loop
# bytes2binstr_frombytes True
# 10000 loops, best of 3: 134 µs per loop
# 10000
# bytes2binstr True
# 100 loops, best of 3: 3.29 ms per loop
# bytes2binstr_frombytes True
# 1000 loops, best of 3: 1.33 ms per loop
# 100000
# bytes2binstr True
# 10 loops, best of 3: 37.7 ms per loop
# bytes2binstr_frombytes True
# 100 loops, best of 3: 16.7 ms per loop
# 1000000
# bytes2binstr True
# 1 loop, best of 3: 400 ms per loop
# bytes2binstr_frombytes True
# 10 loops, best of 3: 190 ms per loop

you can use:
def bytes_to_bit(by, n):
    bi = "{:0{l}b}".format(int.from_bytes(by, byteorder='big'), l=len(by) * 8)[:n]
    return ' '.join([bi[i:i + 8] for i in range(0, len(bi), 8)])

bytes_to_bit(b'\xff\xff\xff\xff\xf0\x00', 45)
output:
'11111111 11111111 11111111 11111111 11110000 00000'
steps:
transform your bytes to an integer using int.from_bytes
str.format method can take a binary format spec.
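A minimal illustration of those two steps, using the first input from the question:
>>> n = int.from_bytes(b'\x80\x00', byteorder='big')
>>> n
32768
>>> '{:016b}'.format(n)  # width 16 == len(bytes) * 8
'1000000000000000'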
also, you can use a more compact form where each byte is formatted:
def bytes_to_bit(by, n):
    bi = ' '.join(map('{:08b}'.format, by))
    return bi[:n + len(by) - 1].rstrip()

bytes_to_bit(b'\xff\xff\xff\xff\xf0\x00', 45)

test_data = [
    (b'\x80\x00', 14),
    (b'\xff\xff\xff\xff\xf0\x00', 45),
]

def get_bit_string(bytes_, length) -> str:
    output_chars = []
    for byte in bytes_:
        for _ in range(8):
            if length <= 0:
                return ''.join(output_chars)
            output_chars.append(str(byte >> 7 & 1))
            byte <<= 1
            length -= 1
        output_chars.append(' ')
    return ''.join(output_chars)

for data in test_data:
    print(get_bit_string(*data))
output:
10000000 000000
11111111 11111111 11111111 11111111 11110000 00000
explanation:
length: starts at the target length and decreases to 0.
if length <= 0: return ...: once we have reached the target length, stop and return.
''.join(output_chars): makes a string from the list.
str(byte >> 7 & 1):
byte >> 7: shifts 7 bits to the right (only the MSB remains, since a byte has 8 bits).
MSB means Most Significant Bit.
(...) & 1: bitwise AND operation; it extracts the LSB of the shifted value.
byte <<= 1: shifts byte 1 bit to the left.
length -= 1: decreases length.
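To see the loop body in isolation, here is a sketch that walks a single made-up byte through the same steps:
byte = 0b10110000
out = []
for _ in range(8):
    out.append(str(byte >> 7 & 1))  # & 1 keeps only the extracted bit
    byte <<= 1                      # Python ints are unbounded, so no masking is needed
print(''.join(out))  # 10110000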

This is a lazy version.
It neither loads nor processes the entire bytes object.
It halts regardless of input size.
The other solutions may not!
I use collections.deque to build the bit string.
from collections import deque
from itertools import chain, repeat, starmap
import os

def bit_length_list(n):
    eights, rem = divmod(n, 8)
    return chain(repeat(8, eights), (rem,))

def build_bitstring(byte, bit_length):
    d = deque("0" * 8, 8)  # maxlen 8: extending pushes the zero padding out on the left
    d.extend(bin(byte)[2:])
    return "".join(d)[:bit_length]

def bytes_to_bits(byte_string, bits):
    return "{!r}B".format(
        " ".join(starmap(build_bitstring, zip(byte_string, bit_length_list(bits))))
    )
Test:
In [1]: bytes_ = os.urandom(int(1e9))
In [2]: timeit bytes_to_bits(bytes_, 0)
4.21 µs ± 27.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: timeit bytes_to_bits(os.urandom(1), int(1e9))
6.8 µs ± 51 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: bytes_ = os.urandom(6)
In [5]: bytes_
Out[5]: b'\xbf\xd5\x08\xbe$\x01'
In [6]: timeit bytes_to_bits(bytes_, 45) #'10111111 11010101 00001000 10111110 00100100 00000'B
12.3 µs ± 85 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: bytes_to_bits(bytes_, 14)
Out[7]: "'10111111 110101'B"

When you say BIT, do you mean binary?
I would try:
bytes_val = b'\x80\x00'
for byte in bytes_val:
    value_in_binary = bin(byte)

This gives the answer without python's binary representation pre-fixed 0b:
bit_str = ' '.join(bin(i).replace('0b', '') for i in bytes_val)

This works in Python 3.x:
def to_bin(l):
    val, length = l
    bit_str = ''.join(bin(i).replace('0b', '') for i in val)
    if len(bit_str) < length:
        # pad with zeros
        return '0'*(length-len(bit_str)) + bit_str
    else:
        # cut to size
        return bit_str[:length]

bytes_val = [b'\x80\x00', 14]
print(to_bin(bytes_val))
and this works in 2.x:
def to_bin(l):
    val, length = l
    bit_str = ''.join(bin(ord(i)).replace('0b', '') for i in val)
    if len(bit_str) < length:
        # pad with zeros
        return '0'*(length-len(bit_str)) + bit_str
    else:
        # cut to size
        return bit_str[:length]

bytes_val = [b'\x80\x00', 14]
print(to_bin(bytes_val))
Both produce the result 00000100000000.

Related

Optimization of numpy array iteration

I'm trying to optimize the performance of my python program and I think I have identified this piece of code as bottleneck:
for i in range(len(green_list)):
    rgb_list = []
    for j in range(len(green_list[i])):
        rgb_list.append('%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]))
    write_file(str(i), rgb_list)
Where red_list, green_list and blue_list are numpy arrays with values like this:
red_list = [[1, 2, 3, 4, 5], [51, 52, 53, 54, 55]]
green_list = [[6, 7, 8, 9, 10], [56, 57, 58, 59, 60]]
blue_list = [[11, 12, 13, 14, 15], [61, 62, 63, 64, 65]]
At the end of each execution of the inner for-loop, rgb_list contains the hex values:
rgb_list = ['01060b', '02070c', '03080d', '04090e', '050a01']
Now, it is not clear to me how to exploit the potential of numpy arrays but I think there is a way to optimize those two nested loops. Any suggestions?
I assume the essential traits of your code could be summarized in the following generator:
import numpy as np
def as_str_OP(r_arr, g_arr, b_arr):
    n, m = r_arr.shape
    for i in range(n):
        rgb = []
        for j in range(m):
            rgb.append('%02x%02x%02x' % (r_arr[i, j], g_arr[i, j], b_arr[i, j]))
        yield rgb
which can be consumed with a for loop, for example to write to disk:
for x in as_str_OP(r_arr, g_arr, b_arr):
    write_to_disk(x)
The generator itself can be written either with the core computation vectorized in Python or in a Numba-friendly way.
The key is to replace the relatively slow string interpolation with a custom-made int-to-hex computation.
This results in substantial speed-up, especially as the size of the input grows (and particularly the second dimension).
Below is the NumPy-vectorized version:
def as_str_np(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for i in range(n):
        rgb = np.empty((m, 2 * l), dtype=np.uint32)
        r0, r1 = divmod(r_arr[i, :], 16)
        g0, g1 = divmod(g_arr[i, :], 16)
        b0, b1 = divmod(b_arr[i, :], 16)
        rgb[:, 0] = hex_to_ascii(r0)
        rgb[:, 1] = hex_to_ascii(r1)
        rgb[:, 2] = hex_to_ascii(g0)
        rgb[:, 3] = hex_to_ascii(g1)
        rgb[:, 4] = hex_to_ascii(b0)
        rgb[:, 5] = hex_to_ascii(b1)
        yield rgb.view(f'<U{2 * l}').reshape(m).tolist()
and the Numba-accelerated version:
import numba as nb

@nb.njit
def hex_to_ascii(x):
    ascii_num_offset = 48  # ord(b'0') == 48
    ascii_alp_offset = 87  # ord(b'a') == 97, (num of non-alpha digits) == 10
    return x + (ascii_num_offset if x < 10 else ascii_alp_offset)

@nb.njit
def _to_hex_2d(x):
    a, b = divmod(x, 16)
    return hex_to_ascii(a), hex_to_ascii(b)

@nb.njit
def _as_str_nb(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for i in range(n):
        rgb = np.empty((m, 2 * l), dtype=np.uint32)
        for j in range(m):
            rgb[j, 0:2] = _to_hex_2d(r_arr[i, j])
            rgb[j, 2:4] = _to_hex_2d(g_arr[i, j])
            rgb[j, 4:6] = _to_hex_2d(b_arr[i, j])
        yield rgb
def as_str_nb(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for x in _as_str_nb(r_arr, g_arr, b_arr):
        yield x.view(f'<U{2 * l}').reshape(m).tolist()
This essentially involves manually writing the number, correctly converted to hexadecimal ASCII chars, into a properly typed array, which can then be converted to give the desired output.
Note that the final numpy.ndarray.tolist() could be avoided if whatever will consume the generator is capable of dealing with the NumPy array itself, thus saving some potentially large and definitely appreciable time, e.g.:
def as_str_nba(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for x in _as_str_nb(r_arr, g_arr, b_arr):
        yield x.view(f'<U{2 * l}').reshape(m)
Overcoming IO-bound bottleneck
However, if you are IO-bound you should modify your code to write in blocks, e.g. using the grouper recipe from the itertools recipes:
from itertools import zip_longest

def grouper(iterable, n, *, incomplete='fill', fillvalue=None):
    "Collect data into non-overlapping fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, fillvalue='x') --> ABC DEF Gxx
    # grouper('ABCDEFG', 3, incomplete='strict') --> ABC DEF ValueError
    # grouper('ABCDEFG', 3, incomplete='ignore') --> ABC DEF
    args = [iter(iterable)] * n
    if incomplete == 'fill':
        return zip_longest(*args, fillvalue=fillvalue)
    if incomplete == 'strict':
        return zip(*args, strict=True)
    if incomplete == 'ignore':
        return zip(*args)
    else:
        raise ValueError('Expected fill, strict, or ignore')
to be used like:
group_size = 3
for x in grouper(as_str_OP(r_arr, g_arr, b_arr), group_size):
    write_many_to_disk(x)
Testing out the output
Some dummy input can be produced easily (r_arr is essentially red_list, etc.):
def gen_color(n, m):
    return np.random.randint(0, 2 ** 8, (n, m))
N, M = 10, 3
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)
and tested by consuming the generator to produce a list:
res_OP = list(as_str_OP(r_arr, g_arr, b_arr))
res_np = list(as_str_np(r_arr, g_arr, b_arr))
res_nb = list(as_str_nb(r_arr, g_arr, b_arr))
res_nba = list(as_str_nba(r_arr, g_arr, b_arr))
print(np.array(res_OP))
# [['1f6984' '916d98' 'f9d779']
# ['65f895' 'ded23e' '332fdc']
# ['b9e059' 'ce8676' 'cb75e9']
# ['bca0fc' '3289a9' 'cc3d3a']
# ['6bb0be' '07134a' 'c3cf05']
# ['152d5c' 'bac081' 'c59a08']
# ['97efcc' '4c31c0' '957693']
# ['15247e' 'af8f0a' 'ffb89a']
# ['161333' '8f41ce' '187b01']
# ['d811ae' '730b17' 'd2e269']]
print(res_OP == res_np)
# True
print(res_OP == res_nb)
# True
print(res_OP == [x.tolist() for x in res_nba])
# True
eventually passing through some grouping:
k = 3
res_OP = list(grouper(as_str_OP(r_arr, g_arr, b_arr), k))
res_np = list(grouper(as_str_np(r_arr, g_arr, b_arr), k))
res_nb = list(grouper(as_str_nb(r_arr, g_arr, b_arr), k))
res_nba = list(grouper(as_str_nba(r_arr, g_arr, b_arr), k))
print(np.array(res_OP))
# [[list(['1f6984', '916d98', 'f9d779'])
# list(['65f895', 'ded23e', '332fdc'])
# list(['b9e059', 'ce8676', 'cb75e9'])]
# [list(['bca0fc', '3289a9', 'cc3d3a'])
# list(['6bb0be', '07134a', 'c3cf05'])
# list(['152d5c', 'bac081', 'c59a08'])]
# [list(['97efcc', '4c31c0', '957693'])
# list(['15247e', 'af8f0a', 'ffb89a'])
# list(['161333', '8f41ce', '187b01'])]
# [list(['d811ae', '730b17', 'd2e269']) None None]]
print(res_OP == res_np)
# True
print(res_OP == res_nb)
# True
print(res_OP == [tuple(y.tolist() if y is not None else y for y in x) for x in res_nba])
# True
Benchmarks
To give you some ideas of the numbers we could be talking, let us use %timeit on much larger inputs:
N, M = 1000, 1000
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)
%timeit -n 1 -r 1 list(as_str_OP(r_arr, g_arr, b_arr))
# 1 loop, best of 1: 1.1 s per loop
%timeit -n 4 -r 4 list(as_str_np(r_arr, g_arr, b_arr))
# 4 loops, best of 4: 279 ms per loop
%timeit -n 4 -r 4 list(as_str_nb(r_arr, g_arr, b_arr))
# 1 loop, best of 1: 96.5 ms per loop
%timeit -n 4 -r 4 list(as_str_nba(r_arr, g_arr, b_arr))
# 4 loops, best of 4: 10.4 ms per loop
To simulate disk writing we could use the following consumer:
import time
import math
def consumer(gen, timeout_sec=0.001, weight=1):
    result = []
    for x in gen:
        result.append(x)
        time.sleep(timeout_sec * weight)
    return result
where disk writing is simulated with a time.sleep() call whose timeout depends on the logarithm of the group size:
N, M = 1000, 1000
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)
%timeit -n 1 -r 1 consumer(as_str_OP(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 2.37 s per loop
%timeit -n 1 -r 1 consumer(as_str_np(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.48 s per loop
%timeit -n 1 -r 1 consumer(as_str_nb(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.27 s per loop
%timeit -n 1 -r 1 consumer(as_str_nba(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.13 s per loop
k = 100
%timeit -n 1 -r 1 consumer(grouper(as_str_OP(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 1.17 s per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_np(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 368 ms per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_nb(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 173 ms per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_nba(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 87.4 ms per loop
Ignoring the disk-writing simulation, the NumPy-vectorized approach is ~4x faster with the test input sizes, while Numba-accelerated approach gets ~10x to ~100x faster depending on whether the potentially useless conversion to list() with numpy.ndarray.tolist() is present or not.
When it comes to the simulated disk-writing, the faster versions are all more or less equivalent, and noticeably less effective without grouping, resulting in ~2x speed-up.
With grouping alone the speed-up gets to be ~2x, but when combining it with the faster approaches, the speed-ups fare between ~3x of the NumPy-vectorized version and the ~7x or ~13x of the Numba-accelerated approaches (with or without numpy.ndarray.tolist()).
Again, this is with the given input, and under the test conditions.
The actual mileage may vary.
you could use reduce for the inner loop (note, though, that functools.reduce runs sequentially; it does not split the computations between different threads behind the scenes):
from functools import reduce

for i in range(len(green_list)):
    rgb_list = reduce(lambda ls, j: ls + ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j])], range(len(green_list[i])), list())
    print(rgb_list)
or you could try to achieve the same goal with a one-liner:
for i in range(len(green_list)):
    rgb_list = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]
    print(rgb_list)
Hope it will do the trick for you.
In the code you show, the slow bit is the string formatting. That we can improve somewhat.
A hex colour consists of eight bits for the red field, eight for the green, and eight for the blue (since your data does not seem to have an alpha channel, I am going to ignore that option). So we need at least twenty four bits to store the rgb colours.
You can create hex values using numpy's bitwise operators. The advantage is that this is completely vectorised. You then only have one value to format into a hex string for each (i, j), instead of three:
for i in range(len(green_list)):
    hx = red_list[i] << 16 | green_list[i] << 8 | blue_list[i]
    hex_list = ['%06x' % val for val in hx]
When the numpy arrays have dimensions (10, 1_000_000), this is about 5.5x faster than your original method (on my machine).
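As a sanity check of the packing arithmetic for one pixel, using the sample values (r, g, b) = (1, 6, 11) from the question:
val = 1 << 16 | 6 << 8 | 11
print('%06x' % val)  # 01060b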
1. for-loop
Modifying how rgb_list.append() is written does not affect the performance much.
import timeit

n = 1000000
red_list = [list(range(1, n+0)), list(range(1, n+2))]
green_list = [list(range(2, n+1)), list(range(2, n+3))]
blue_list = [list(range(3, n+2)), list(range(3, n+4))]

def test_1():
    for i in range(len(green_list)):
        rgb_list = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]

def test_2():
    for i in range(len(green_list)):
        rgb_list = [None] * len(green_list[i])
        for j in range(len(green_list[i])):
            rgb_list[j] = '%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j])

def test_3():
    for i in range(len(green_list)):
        rgb_list = []
        for j in range(len(green_list[i])):
            rgb_list.append('%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]))
%timeit -n 1 -r 7 test_1(): 1.31 s ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit -n 1 -r 7 test_2(): 1.33 s ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit -n 1 -r 7 test_3(): 1.39 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2. disk IO
Modifying the disk IO pattern also does not affect the performance much.
import pickle

n = 20000000

def test_write_each():
    for i in range(len(green_list)):
        rgb_list = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]
        with open("test_%d" % i, "wb") as f:
            pickle.dump(rgb_list, f)

def test_write_once():
    rgb_list_list = [None] * len(green_list)
    for i in range(len(green_list)):
        rgb_list_list[i] = ['%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]) for j in range(len(green_list[i]))]
    with open("test_all", "wb") as f:
        pickle.dump(rgb_list_list, f)
%timeit -n 1 -r 3 test_write_each(): 35.2 s ± 74.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
%timeit -n 1 -r 3 test_write_once(): 35.4 s ± 54.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Conclusion
From the benchmark results, there seems to be no bottleneck to avoid in the question's code.
If the disk IO itself is the problem, I would suggest running the disk-IO code only once, after every other job (including the ones not mentioned in this question) has finished.

convert time string XhYmZs to seconds in python

I have a string which comes in three forms:
XhYmZs or YmZs or Zs
where h, m, s stand for hours, minutes, seconds, and X, Y, Z are the corresponding values.
How do I efficiently convert these strings to seconds in python2.7?
I guess I can do something like:
s="XhYmZs"
if "h" in s:
hours=s.split("h")
elif "m" in s:
mins=s.split("m")[0][-1]
... but this does not seem very efficient to me :(
Split on the delimiters you're interested in, then parse each resulting element into an integer and multiply as needed:
import re
def hms(s):
    l = list(map(int, re.split('[hms]', s)[:-1]))
    if len(l) == 3:
        return l[0]*3600 + l[1]*60 + l[2]
    elif len(l) == 2:
        return l[0]*60 + l[1]
    else:
        return l[0]
This produces a duration normalized to seconds.
>>> hms('3h4m5s')
11045
>>> 3*3600+4*60+5
11045
>>> hms('70m5s')
4205
>>> 70*60+5
4205
>>> hms('300s')
300
You can also make this one line by turning the re.split() result around and multiplying by 60 raised to an incrementing power based on the element's position in the list:
def hms2(s):
    return sum(int(x)*60**i for i, x in enumerate(re.split('[hms]', s)[-2::-1]))
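A quick check that the one-liner matches the earlier results:
>>> hms2('3h4m5s')
11045
>>> hms2('70m5s')
4205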
>>> import datetime
>>> datetime.datetime.strptime('3h4m5s', '%Hh%Mm%Ss').time()
datetime.time(3, 4, 5)
Since it varies which fields are in your strings, you may have to build a matching format string.
>>> def parse(s):
...     fmt = ''.join('%'+c.upper()+c for c in 'hms' if c in s)
...     return datetime.datetime.strptime(s, fmt).time()
The datetime module is the standard library way to handle times.
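For example, with the parse() helper above (note that strptime validates field ranges, so an input like '70m5s' will raise ValueError because %M only accepts 00-59):
>>> parse('3h4m5s')
datetime.time(3, 4, 5)
>>> parse('4m5s')
datetime.time(0, 4, 5)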
Asking to do this "efficiently" is a bit of a fool's errand. String parsing in an interpreted language isn't fast; aim for clarity. In addition, seeming efficient isn't very meaningful; either analyze the algorithm or benchmark, otherwise it's speculation.
Do not know how efficient this is, but this is how I would do it:
import re

test_data = [
    '1h2m3s',
    '1m2s',
    '1s',
    '3s1h2m',
]

HMS_REGEX = re.compile(r'^(\d+)h(\d+)m(\d+)s$')
MS_REGEX = re.compile(r'^(\d+)m(\d+)s$')
S_REGEX = re.compile(r'^(\d+)s$')
def total_seconds(hms_string):
    found = HMS_REGEX.match(hms_string)
    if found:
        return 3600 * int(found.group(1)) + \
               60 * int(found.group(2)) + \
               int(found.group(3))
    found = MS_REGEX.match(hms_string)
    if found:
        return 60 * int(found.group(1)) + int(found.group(2))
    found = S_REGEX.match(hms_string)
    if found:
        return int(found.group(1))
    raise ValueError('Could not convert ' + hms_string)

for datum in test_data:
    try:
        print(total_seconds(datum))
    except ValueError as exc:
        print(exc)
or going to a single match and riffing on TigerhawkT3's one liner, but retaining the error checking of non-matching strings:
HMS_REGEX = re.compile(r'^(\d+)h(\d+)m(\d+)s$|^(\d+)m(\d+)s$|^(\d+)s$')

def total_seconds(hms_string):
    found = HMS_REGEX.match(hms_string)
    if found:
        return sum(
            int(x or 0) * 60 ** i for i, x in enumerate(
                y for y in reversed(found.groups()) if y is not None))
    raise ValueError('Could not convert ' + hms_string)
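Running the same test data through this version gives, for instance:
for datum in test_data:
    try:
        print(total_seconds(datum))  # 3723, 62, 1
    except ValueError as exc:
        print(exc)  # Could not convert 3s1h2m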
My fellow pythonistas, please stop using regular expressions for everything. A regular expression is not needed for such simple tasks. Python is considered a slow language, not because of the GIL or the interpreter, but because of this kind of misuse.
In [1]: import re
   ...: def hms(s):
   ...:     l = list(map(int, re.split('[hms]', s)[:-1]))
   ...:     if len(l) == 3:
   ...:         return l[0]*3600 + l[1]*60 + l[2]
   ...:     elif len(l) == 2:
   ...:         return l[0]*60 + l[1]
   ...:     else:
   ...:         return l[0]
In [2]: %timeit hms("6h7m8s")
5.62 µs ± 722 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: def ehms(s):
   ...:     bases = dict(h=3600, m=60, s=1)
   ...:     secs = 0
   ...:     num = 0
   ...:     for c in s:
   ...:         if c.isdigit():
   ...:             num = num * 10 + int(c)
   ...:         else:
   ...:             secs += bases[c] * num
   ...:             num = 0
   ...:     return secs
In [7]: %timeit ehms("6h7m8s")
2.07 µs ± 70.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit hms("8s")
2.35 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [9]: %timeit ehms("8s")
1.06 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]: bases = dict(h=3600, m=60, s=1)
In [15]: a = ord('0')  # 48; so ord(c) - a == int(c) for digit characters
In [16]: def eehms(s):
    ...:     secs = 0
    ...:     num = 0
    ...:     for c in s:
    ...:         if c.isdigit():
    ...:             num = num * 10 + ord(c) - a
    ...:         else:
    ...:             secs += bases[c] * num
    ...:             num = 0
    ...:     return secs
In [17]: %timeit eehms("6h7m8s")
1.45 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
See, almost 4 times as fast.
There's a library, python-dateutil (pip install python-dateutil); it takes a string and returns a datetime.datetime.
It can parse values like 5h 30m, 0.5h 30m, 0.5h, with or without spaces.
from datetime import datetime
from dateutil import parser
time = '5h15m50s'
midnight_plus_time = parser.parse(time)
midnight: datetime = datetime.combine(datetime.today(), datetime.min.time())
timedelta = midnight_plus_time - midnight
print(timedelta.seconds) # 18950
It can't parse more than 24h at once though.

How to turn a 1D radial profile into a 2D array in python

I have a list that models a phenomenon that is a function of radius. I want to convert this to a 2D array. I wrote some code that does exactly what I want, but since it uses nested for loops, it is quite slow.
l = len(profile1D)/2
critDim = int((l**2 /2.)**(1/2.))
profile2D = np.empty([critDim, critDim])
for x in xrange(0, critDim):
    for y in xrange(0, critDim):
        r = ((x**2 + y**2)**(1/2.))
        profile2D[x,y] = profile1D[int(l+r)]
Is there a more efficient way to do the same thing by avoiding these loops?
Here's a vectorized approach using broadcasting -
a = np.arange(critDim)**2
r2D = np.sqrt(a[:,None] + a)
out = profile1D[(l+r2D).astype(int)]
If there are many repeated indices generated by l+r2D, we can use np.take for some further performance boost, like so -
out = np.take(profile1D,(l+r2D).astype(int))
Runtime test
Function definitions -
def org_app(profile1D, l, critDim):
    profile2D = np.empty([critDim, critDim])
    for x in xrange(0, critDim):
        for y in xrange(0, critDim):
            r = ((x**2 + y**2)**(1/2.))
            profile2D[x,y] = profile1D[int(l+r)]
    return profile2D

def vect_app1(profile1D, l, critDim):
    a = np.arange(critDim)**2
    r2D = np.sqrt(a[:,None] + a)
    out = profile1D[(l+r2D).astype(int)]
    return out

def vect_app2(profile1D, l, critDim):
    a = np.arange(critDim)**2
    r2D = np.sqrt(a[:,None] + a)
    out = np.take(profile1D,(l+r2D).astype(int))
    return out
Timings and verification -
In [25]: # Setup input array and params
...: profile1D = np.random.randint(0,9,(1000))
...: l = len(profile1D)/2
...: critDim = int((l**2 /2.)**(1/2.))
...:
In [26]: np.allclose(org_app(profile1D,l,critDim),vect_app1(profile1D,l,critDim))
Out[26]: True
In [27]: np.allclose(org_app(profile1D,l,critDim),vect_app2(profile1D,l,critDim))
Out[27]: True
In [28]: %timeit org_app(profile1D,l,critDim)
10 loops, best of 3: 154 ms per loop
In [29]: %timeit vect_app1(profile1D,l,critDim)
1000 loops, best of 3: 1.69 ms per loop
In [30]: %timeit vect_app2(profile1D,l,critDim)
1000 loops, best of 3: 1.68 ms per loop
In [31]: # Setup input array and params
...: profile1D = np.random.randint(0,9,(5000))
...: l = len(profile1D)/2
...: critDim = int((l**2 /2.)**(1/2.))
...:
In [32]: %timeit org_app(profile1D,l,critDim)
1 loops, best of 3: 3.76 s per loop
In [33]: %timeit vect_app1(profile1D,l,critDim)
10 loops, best of 3: 59.8 ms per loop
In [34]: %timeit vect_app2(profile1D,l,critDim)
10 loops, best of 3: 59.5 ms per loop

In Python how do I parse the 11th and 12th bit of 3 bytes?

If I have 3 bytes b'\x00\x0c\x00', which can be represented with the bits 00000000 00001100 00000000, how do I then parse the 11th and 12th bits (11) most efficiently?
Here are the bit positions:
00000000 11111110 22222111 tens
87654321 65432109 43210987 ones
|||||||| |||||||| ||||||||
00000000 00001100 00000000
I have the following code:
bytes_input = b'\x00\x0c\x00'
for byte in bytes_input:
    print(byte, '{:08b}'.format(byte), bin(byte))

bit_position = 11-1
bits_per_byte = 8

floor = bit_position//bits_per_byte
print('floor', floor)

byte = bytes_input[floor]
print('byte', byte, type(byte))

modulo = bit_position%bits_per_byte
print('modulo', modulo)

bits = bin(byte >> modulo & 3)
print('bits', bits, type(bits))
Which returns:
0 00000000 0b0
12 00001100 0b1100
0 00000000 0b0
floor 1
byte 12 <class 'int'>
modulo 2
bits 0b11 <class 'str'>
Is there a computationally faster way for me to get the information that doesn't require me to calculate floor and modulo?
To put things into context I am parsing this file format:
http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml
Update 01feb2015:
Thanks to @Dunes I read the documentation on from_bytes and found out that I can avoid doing divmod by just doing int.from_bytes with byteorder='little'. The final function I adapted into my code is fsmall. I can't get timeit to work, so I'm not sure about the relative speeds of the functions.
bytes_input = b'\x00\x0c\x00'
bit_position = 11-1
bpb = bits_per_byte = 8

def foriginal(bytes_input, bit_position):
    floor = bit_position//bpb
    byte = bytes_input[floor]
    modulo = bit_position%bpb
    return byte >> modulo & 0b11

def fdivmod(bytes_input, bit_position):
    div, mod = divmod(bit_position, bpb)
    return bytes_input[div] >> mod & 0b11

def fsmall(bytes_input, bit_position):
    int_bytes = int.from_bytes(bytes_input, byteorder='little')
    shift = bit_position
    bits = int_bytes >> shift & 0b11
    return bits
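For completeness, a sketch of how the three functions above could be timed with the timeit module (plain Python, no IPython magic needed):
import timeit

for f in (foriginal, fdivmod, fsmall):
    t = timeit.timeit(lambda: f(bytes_input, bit_position), number=100000)
    print(f.__name__, t)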
You could try:
(int.from_bytes(bytes_input, 'big') >> bit_position) & 0b11
It doesn't appear to be any quicker though, just terser.
However, int.from_bytes(bytes_input, 'big') is the most time consuming part of that code snippet by a factor 2 to 1. If you can convert your data from bytes to int once, at the beginning of the program, then you will see quicker bit masking operations.
In [52]: %timeit n = int.from_bytes(bytes_input, 'big')
1000000 loops, best of 3: 237 ns per loop
In [53]: %timeit n >> bit_position & 0b11
10000000 loops, best of 3: 107 ns per loop
Is there a computationally faster way for me to get the information that doesn't require me to calculate floor and modulo?
Not really. But there is divmod().
>>> divmod(10, 8)
(1, 2)
The fastest way is to convert your byte string into a numeric type that can be checked using a bitmask:
def check(b, checkbits):
    # python2: use ord(bb)
    bits = sum([bb << (8 * (len(b) - i)) for i, bb in enumerate(b, 1)])
    mask = sum([2 ** (x - 1) for x in checkbits])
    return bits, bits & mask == mask

bytes_input = b'\x00\x0c\x00'
checkbits = (11, 12)
bits, is_set = check(bytes_input, checkbits)
print(bits, bin(bits), is_set)
3072 0b110000000000 True

%timeit check(bytes_input, checkbits)
100000 loops, best of 3: 3.24 µs per loop
I'm not sure about the timing of your code, because I couldn't get it to work.
Update: Turns out there is a faster check() implementation:
def check2(b, mask):
    bits = 0
    i = 0
    for bb in b[::-1]:
        # python2: use ord(bb)
        bits |= bb << i
        i += 8
    return bits, bits & mask == mask
# we now build the mask directly
# note this is the same as 2**10 | 2**11
mask = (2**11 | 2**12) >> 1
%timeit check2(bytes_input, mask)
1000000 loops, best of 3: 1.82 µs per loop
Update 2: adopting Dunes' nicely terse solution, the whole thing becomes a two-liner (note my test runs in Python 2, apparently much slower than Dunes' Python 3):
# python2: from_bytes = lambda str: int(str.encode('hex'), 16)
mask = (2**11 | 2**12) >> 1
check = lambda b, mask: int.from_bytes(b, 'big') & mask
%timeit check(bytes_input, mask)
100000 loops, best of 3: 2.1 µs per loop
You can do binary operations in Python:
Convert the byte[s] you care about to an integer - in this case you only care about the middle byte, so we will just parse that:
>>> bytes_input = b'\x00\x0c\x00'
>>> middle_byte = bytes_input[1]
>>> middle_byte
12
Now you can do binary AND with the & operator:
>>> middle_byte & 0x0C
12
To expand on this more generally, you can convert any arbitrary byte string to its integer value with something like:
>>> int.from_bytes(b'\x00\x0c\x00', 'big')
3072
And now you can apply the bitmask again:
>>> int.from_bytes(b'\x00\x0c\x00', 'big') & 0x000C00
3072

How to convert 'false' to 0 and 'true' to 1?

Is there a way to convert true of type unicode to 1 and false of type unicode to 0 (in Python)?
For example: x == 'true' and type(x) == unicode
I want x = 1
PS: I don’t want to use if-else.
Use int() on a boolean test:
x = int(x == 'true')
int() turns the boolean into 1 or 0. Note that any value not equal to 'true' will result in 0 being returned.
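For example:
>>> int('true' == 'true')
1
>>> int('false' == 'true')
0
>>> int('yes' == 'true')  # any value other than 'true' also gives 0
0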
If B is a Boolean array, write
B = B*1
(A bit code golfy.)
You can use x.astype('uint8') where x is your Boolean array.
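A quick sketch of both array-based approaches (assuming NumPy, since Boolean arrays suggest it):
import numpy as np

B = np.array([True, False, True])
print(B * 1)              # [1 0 1]
print(B.astype('uint8'))  # [1 0 1], as uint8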
Here's yet another solution to your problem:
def to_bool(s):
    return 1 - sum(map(ord, s)) % 2
    # return 1 - sum(s.encode('ascii')) % 2  # alternative for Python 3
It works because the sum of the ASCII codes of 'true' is 448, which is even, while the sum of the ASCII codes of 'false' is 523 which is odd.
The funny thing about this solution is that its result is pretty random if the input is not one of 'true' or 'false'. Half of the time it will return 0, and the other half 1. The variant using encode will raise an encoding error if the input is not ASCII (thus increasing the undefined-ness of the behaviour).
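The parity trick can be verified directly:
>>> sum(map(ord, 'true')), sum(map(ord, 'false'))
(448, 523)
>>> 1 - 448 % 2, 1 - 523 % 2
(1, 0)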
Seriously, I believe the most readable, and faster, solution is to use an if:
def to_bool(s):
    return 1 if s == 'true' else 0
See some microbenchmarks:
In [14]: def most_readable(s):
    ...:     return 1 if s == 'true' else 0
In [15]: def int_cast(s):
    ...:     return int(s == 'true')
In [16]: def str2bool(s):
    ...:     try:
    ...:         return ['false', 'true'].index(s)
    ...:     except (ValueError, AttributeError):
    ...:         raise ValueError()
In [17]: def str2bool2(s):
    ...:     try:
    ...:         return ('false', 'true').index(s)
    ...:     except (ValueError, AttributeError):
    ...:         raise ValueError()
In [18]: def to_bool(s):
    ...:     return 1 - sum(s.encode('ascii')) % 2
In [19]: %timeit most_readable('true')
10000000 loops, best of 3: 112 ns per loop
In [20]: %timeit most_readable('false')
10000000 loops, best of 3: 109 ns per loop
In [21]: %timeit int_cast('true')
1000000 loops, best of 3: 259 ns per loop
In [22]: %timeit int_cast('false')
1000000 loops, best of 3: 262 ns per loop
In [23]: %timeit str2bool('true')
1000000 loops, best of 3: 343 ns per loop
In [24]: %timeit str2bool('false')
1000000 loops, best of 3: 325 ns per loop
In [25]: %timeit str2bool2('true')
1000000 loops, best of 3: 295 ns per loop
In [26]: %timeit str2bool2('false')
1000000 loops, best of 3: 277 ns per loop
In [27]: %timeit to_bool('true')
1000000 loops, best of 3: 607 ns per loop
In [28]: %timeit to_bool('false')
1000000 loops, best of 3: 612 ns per loop
Notice how the if solution is at least 2.5x faster than all the other solutions. It does not make sense to make avoiding ifs a requirement, unless this is some kind of homework (in which case you shouldn't have asked this in the first place).
If you need a general purpose conversion from a string which per se is not a bool, you should better write a routine similar to the one depicted below. In keeping with the spirit of duck typing, I have not silently passed the error but converted it as appropriate for the current scenario.
>>> def str2bool(st):
        try:
            return ['false', 'true'].index(st.lower())
        except (ValueError, AttributeError):
            raise ValueError('no Valid Conversion Possible')

>>> str2bool('garbaze')
Traceback (most recent call last):
  File "<pyshell#106>", line 1, in <module>
    str2bool('garbaze')
  File "<pyshell#105>", line 5, in str2bool
    raise ValueError('no Valid Conversion Possible')
ValueError: no Valid Conversion Possible
>>> str2bool('false')
0
>>> str2bool('True')
1
+(False) converts to 0 and
+(True) converts to 1
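For example:
>>> +False
0
>>> +True
1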
Any of the following will work:
s = "true"
(s == 'true').real
1
(s == 'false').real
0
(s == 'true').conjugate()
1
(s == '').conjugate()
0
(s == 'true').__int__()
1
(s == 'opal').__int__()
0
def as_int(s):
    return (s == 'true').__int__()

>>> as_int('false')
0
>>> as_int('true')
1
bool to int:
x = (x == 'true') + 0
Now x contains 1 if x == 'true', else 0.
Note: x == 'true' returns a bool, which is then coerced to an int (1 if the bool is True, else 0) when 0 is added to it.
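A quick demonstration:
>>> x = 'true'
>>> (x == 'true') + 0
1
>>> ('false' == 'true') + 0
0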
only with this (JavaScript, for comparison):
const a = true;
const b = false;
console.log(+a);//1
console.log(+b);//0
