Best way to replace multiple characters in a string? - python

I need to replace some characters as follows: & ➔ \&, # ➔ \#, ...
I coded as follows, but I guess there should be some better way. Any hints?
strs = strs.replace('&', '\&')
strs = strs.replace('#', '\#')
...

Replacing two characters
I timed all the methods in the current answers along with one extra.
With an input string of abc&def#ghi and replacing & -> \& and # -> \#, the fastest way was to chain together the replacements like this: text.replace('&', '\&').replace('#', '\#').
Timings for each function:
a) 1000000 loops, best of 3: 1.47 μs per loop
b) 1000000 loops, best of 3: 1.51 μs per loop
c) 100000 loops, best of 3: 12.3 μs per loop
d) 100000 loops, best of 3: 12 μs per loop
e) 100000 loops, best of 3: 3.27 μs per loop
f) 1000000 loops, best of 3: 0.817 μs per loop
g) 100000 loops, best of 3: 3.64 μs per loop
h) 1000000 loops, best of 3: 0.927 μs per loop
i) 1000000 loops, best of 3: 0.814 μs per loop
Here are the functions:
def a(text):
    chars = "&#"
    for c in chars:
        text = text.replace(c, "\\" + c)
def b(text):
    for ch in ['&','#']:
        if ch in text:
            text = text.replace(ch,"\\"+ch)
import re
def c(text):
    rx = re.compile('([&#])')
    text = rx.sub(r'\\\1', text)
RX = re.compile('([&#])')
def d(text):
    text = RX.sub(r'\\\1', text)
def mk_esc(esc_chars):
    return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
esc = mk_esc('&#')
def e(text):
    esc(text)
def f(text):
    text = text.replace('&', '\&').replace('#', '\#')
def g(text):
    replacements = {"&": "\&", "#": "\#"}
    text = "".join([replacements.get(c, c) for c in text])
def h(text):
    text = text.replace('&', r'\&')
    text = text.replace('#', r'\#')
def i(text):
    text = text.replace('&', r'\&').replace('#', r'\#')
Timed like this:
python -mtimeit -s"import time_functions" "time_functions.a('abc&def#ghi')"
python -mtimeit -s"import time_functions" "time_functions.b('abc&def#ghi')"
python -mtimeit -s"import time_functions" "time_functions.c('abc&def#ghi')"
python -mtimeit -s"import time_functions" "time_functions.d('abc&def#ghi')"
python -mtimeit -s"import time_functions" "time_functions.e('abc&def#ghi')"
python -mtimeit -s"import time_functions" "time_functions.f('abc&def#ghi')"
python -mtimeit -s"import time_functions" "time_functions.g('abc&def#ghi')"
python -mtimeit -s"import time_functions" "time_functions.h('abc&def#ghi')"
python -mtimeit -s"import time_functions" "time_functions.i('abc&def#ghi')"
Replacing 17 characters
Here's similar code to do the same but with more characters to escape (\`*_{}[]()>#+-.!$):
def a(text):
    chars = "\\`*_{}[]()>#+-.!$"
    for c in chars:
        text = text.replace(c, "\\" + c)
def b(text):
    for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
        if ch in text:
            text = text.replace(ch,"\\"+ch)
import re
def c(text):
    rx = re.compile('([&#])')
    text = rx.sub(r'\\\1', text)
RX = re.compile('([\\`*_{}[]()>#+-.!$])')
def d(text):
    text = RX.sub(r'\\\1', text)
def mk_esc(esc_chars):
    return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
esc = mk_esc('\\`*_{}[]()>#+-.!$')
def e(text):
    esc(text)
def f(text):
    text = text.replace('\\', '\\\\').replace('`', '\`').replace('*', '\*').replace('_', '\_').replace('{', '\{').replace('}', '\}').replace('[', '\[').replace(']', '\]').replace('(', '\(').replace(')', '\)').replace('>', '\>').replace('#', '\#').replace('+', '\+').replace('-', '\-').replace('.', '\.').replace('!', '\!').replace('$', '\$')
def g(text):
    replacements = {
        "\\": "\\\\",
        "`": "\`",
        "*": "\*",
        "_": "\_",
        "{": "\{",
        "}": "\}",
        "[": "\[",
        "]": "\]",
        "(": "\(",
        ")": "\)",
        ">": "\>",
        "#": "\#",
        "+": "\+",
        "-": "\-",
        ".": "\.",
        "!": "\!",
        "$": "\$",
    }
    text = "".join([replacements.get(c, c) for c in text])
def h(text):
    text = text.replace('\\', r'\\')
    text = text.replace('`', r'\`')
    text = text.replace('*', r'\*')
    text = text.replace('_', r'\_')
    text = text.replace('{', r'\{')
    text = text.replace('}', r'\}')
    text = text.replace('[', r'\[')
    text = text.replace(']', r'\]')
    text = text.replace('(', r'\(')
    text = text.replace(')', r'\)')
    text = text.replace('>', r'\>')
    text = text.replace('#', r'\#')
    text = text.replace('+', r'\+')
    text = text.replace('-', r'\-')
    text = text.replace('.', r'\.')
    text = text.replace('!', r'\!')
    text = text.replace('$', r'\$')
def i(text):
    text = text.replace('\\', r'\\').replace('`', r'\`').replace('*', r'\*').replace('_', r'\_').replace('{', r'\{').replace('}', r'\}').replace('[', r'\[').replace(']', r'\]').replace('(', r'\(').replace(')', r'\)').replace('>', r'\>').replace('#', r'\#').replace('+', r'\+').replace('-', r'\-').replace('.', r'\.').replace('!', r'\!').replace('$', r'\$')
Here's the results for the same input string abc&def#ghi:
a) 100000 loops, best of 3: 6.72 μs per loop
b) 100000 loops, best of 3: 2.64 μs per loop
c) 100000 loops, best of 3: 11.9 μs per loop
d) 100000 loops, best of 3: 4.92 μs per loop
e) 100000 loops, best of 3: 2.96 μs per loop
f) 100000 loops, best of 3: 4.29 μs per loop
g) 100000 loops, best of 3: 4.68 μs per loop
h) 100000 loops, best of 3: 4.73 μs per loop
i) 100000 loops, best of 3: 4.24 μs per loop
And with a longer input string (## *Something* and [another] thing in a longer sentence with {more} things to replace$):
a) 100000 loops, best of 3: 7.59 μs per loop
b) 100000 loops, best of 3: 6.54 μs per loop
c) 100000 loops, best of 3: 16.9 μs per loop
d) 100000 loops, best of 3: 7.29 μs per loop
e) 100000 loops, best of 3: 12.2 μs per loop
f) 100000 loops, best of 3: 5.38 μs per loop
g) 10000 loops, best of 3: 21.7 μs per loop
h) 100000 loops, best of 3: 5.7 μs per loop
i) 100000 loops, best of 3: 5.13 μs per loop
Adding a couple of variants:
def ab(text):
    for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
        text = text.replace(ch,"\\"+ch)
def ba(text):
    chars = "\\`*_{}[]()>#+-.!$"
    for c in chars:
        if c in text:
            text = text.replace(c, "\\" + c)
With the shorter input:
ab) 100000 loops, best of 3: 7.05 μs per loop
ba) 100000 loops, best of 3: 2.4 μs per loop
With the longer input:
ab) 100000 loops, best of 3: 7.71 μs per loop
ba) 100000 loops, best of 3: 6.08 μs per loop
So I'm going to use ba for readability and speed.
Addendum
Prompted by haccks in the comments, one difference between ab and ba is the if c in text: check. Let's test them against two more variants:
def ab_with_check(text):
    for ch in ['\\','`','*','_','{','}','[',']','(',')','>','#','+','-','.','!','$','\'']:
        if ch in text:
            text = text.replace(ch,"\\"+ch)
def ba_without_check(text):
    chars = "\\`*_{}[]()>#+-.!$"
    for c in chars:
        text = text.replace(c, "\\" + c)
Times are in μs per loop on Python 2.7.14 and 3.6.3. They were measured on a different machine from the earlier set, so they cannot be compared directly with the earlier results.
╭────────────╥──────┬───────────────┬──────┬──────────────────╮
│ Py, input ║ ab │ ab_with_check │ ba │ ba_without_check │
╞════════════╬══════╪═══════════════╪══════╪══════════════════╡
│ Py2, short ║ 8.81 │ 4.22 │ 3.45 │ 8.01 │
│ Py3, short ║ 5.54 │ 1.34 │ 1.46 │ 5.34 │
├────────────╫──────┼───────────────┼──────┼──────────────────┤
│ Py2, long ║ 9.3 │ 7.15 │ 6.85 │ 8.55 │
│ Py3, long ║ 7.43 │ 4.38 │ 4.41 │ 7.02 │
└────────────╨──────┴───────────────┴──────┴──────────────────┘
We can conclude that:
Those with the check are up to 4x faster than those without the check
ab_with_check is slightly in the lead on Python 3, but ba (with check) has a greater lead on Python 2
However, the biggest lesson here is Python 3 is up to 3x faster than Python 2! There's not a huge difference between the slowest on Python 3 and fastest on Python 2!
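To reproduce this kind of comparison, a small timeit harness is enough. The sketch below is not the code used for the table above; the function names are illustrative, and unlike the originals each variant returns its result so the outputs can be checked against each other before timing:

```python
import timeit

CHARS = "\\`*_{}[]()>#+-.!$"

def without_check(text):
    for c in CHARS:
        text = text.replace(c, "\\" + c)
    return text

def with_check(text):
    for c in CHARS:
        if c in text:  # skip the replace call when there is nothing to replace
            text = text.replace(c, "\\" + c)
    return text

short = "abc&def#ghi"

# Both variants must agree before their timings are worth comparing.
assert without_check(short) == with_check(short)

for func in (without_check, with_check):
    seconds = timeit.timeit(lambda: func(short), number=10000)
    print(func.__name__, seconds / 10000 * 1e6, "us per loop")
```

The absolute numbers depend on the machine and Python version, so only the relative ordering is meaningful.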

>>> string="abc&def#ghi"
>>> for ch in ['&','#']:
...     if ch in string:
...         string=string.replace(ch,"\\"+ch)
...
>>> print string
abc\&def\#ghi

Here is a python3 method using str.translate and str.maketrans:
s = "abc&def#ghi"
print(s.translate(str.maketrans({'&': '\&', '#': '\#'})))
The printed string is abc\&def\#ghi.
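When more characters need escaping, the translation table can be built once and reused. A minimal sketch, assuming Python 3; the names `CHARS_TO_ESCAPE`, `ESCAPE_TABLE` and `escape_chars` are illustrative and not part of the answer above:

```python
# Characters to escape; each one is mapped to a backslash followed by itself.
CHARS_TO_ESCAPE = "&#"

# str.maketrans accepts a dict whose keys are single characters.
ESCAPE_TABLE = str.maketrans({c: "\\" + c for c in CHARS_TO_ESCAPE})

def escape_chars(text):
    """Return text with every character in CHARS_TO_ESCAPE backslash-escaped."""
    return text.translate(ESCAPE_TABLE)

print(escape_chars("abc&def#ghi"))  # abc\&def\#ghi
```

Writing the backslash as "\\" avoids the invalid-escape warning that newer Python versions emit for literals such as '\&'.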

Simply chain the replace functions like this
strs = "abc&def#ghi"
print strs.replace('&', '\&').replace('#', '\#')
# abc\&def\#ghi
If the replacements are going to be more in number, you can do this in this generic way
strs, replacements = "abc&def#ghi", {"&": "\&", "#": "\#"}
print "".join([replacements.get(c, c) for c in strs])
# abc\&def\#ghi

Late to the party, but I lost a lot of time with this issue until I found my answer.
Short and sweet, translate is superior to replace. If you're more interested in functionality than in time optimization, do not use replace.
Also use translate if you don't know if the set of characters to be replaced overlaps the set of characters used to replace.
Case in point:
Using replace, you would naively expect the snippet "1234".replace("1", "2").replace("2", "3").replace("3", "4") to return "2344", but it in fact returns "4444", because each call also rewrites the characters produced by the previous one.
Translation seems to perform what OP originally desired.
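The difference can be checked side by side. A short sketch, assuming Python 3 (the variable names are illustrative):

```python
text = "1234"

# Chained replace: each call sees the output of the previous one,
# so characters produced by an earlier step get replaced again.
chained = text.replace("1", "2").replace("2", "3").replace("3", "4")

# translate: every character of the original string is looked up once.
table = str.maketrans({"1": "2", "2": "3", "3": "4"})
translated = text.translate(table)

print(chained)     # 4444
print(translated)  # 2344
```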

Are you always going to prepend a backslash? If so, try
import re
rx = re.compile('([&#])')
# ^^ fill in the characters here.
strs = rx.sub('\\\\\\1', strs)
It may not be the most efficient method but I think it is the easiest.
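If the set of characters is not fixed, the character class can be built with re.escape so that characters such as ], \ or - are safe inside it. This is a sketch rather than the answer's code; `make_escaper` is a hypothetical helper name:

```python
import re

def make_escaper(chars):
    """Return a function that prefixes each character in chars with a backslash."""
    # re.escape protects characters such as ], \ and - inside the class.
    pattern = re.compile("[" + re.escape(chars) + "]")
    # \g<0> is the whole match; the doubled backslash emits one literal backslash.
    return lambda text: pattern.sub(r"\\\g<0>", text)

escape = make_escaper("&#")
print(escape("abc&def#ghi"))  # abc\&def\#ghi
```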

You may consider writing a generic escape function:
def mk_esc(esc_chars):
    return lambda s: ''.join(['\\' + c if c in esc_chars else c for c in s])
>>> esc = mk_esc('&#')
>>> print esc('Learn & be #1')
Learn \& be \#1
This way you can make your function configurable with a list of character that should be escaped.

For Python 3.8 and above, one can use assignment expressions
[text := text.replace(s, f"\\{s}") for s in "&#" if s in text];
Although, I am quite unsure if this would be considered "appropriate use" of assignment expressions as described in PEP 572, but looks clean and reads quite well (to my eyes). The semicolon at the end suppresses output if you run this in a REPL.
This would be "appropriate" if you wanted all intermediate strings as well. For example, (removing all lowercase vowels):
text = "Lorem ipsum dolor sit amet"
intermediates = [text := text.replace(i, "") for i in "aeiou" if i in text]
['Lorem ipsum dolor sit met',
'Lorm ipsum dolor sit mt',
'Lorm psum dolor st mt',
'Lrm psum dlr st mt',
'Lrm psm dlr st mt']
On the plus side, it does seem (unexpectedly?) faster than some of the faster methods in the accepted answer, and seems to perform nicely with both increasing strings length and an increasing number of substitutions.
The code for the above comparison is below. I am using random strings to make my life a bit simpler, and the characters to replace are chosen randomly from the string itself. (Note: I am using ipython's %timeit magic here, so run this in ipython/jupyter).
import random, string
def make_txt(length):
    "makes a random string of a given length"
    return "".join(random.choices(string.printable, k=length))
def get_substring(s, num):
    "gets a substring"
    return "".join(random.choices(s, k=num))
def a(text, replace): # one of the better performing approaches from the accepted answer
    for i in replace:
        if i in text:
            text = text.replace(i, "")
def b(text, replace):
    # a list comprehension, so the replacements actually run;
    # a generator expression here would be lazy and never execute them
    _ = [text := text.replace(i, "") for i in replace if i in text]
def compare(strlen, replace_length):
    "use ipython / jupyter for the %timeit functionality"
    times_a, times_b = [], []
    for i in range(*strlen):
        el = make_txt(i)
        et = get_substring(el, replace_length)
        res_a = %timeit -n 1000 -o a(el, et) # ipython magic
        el = make_txt(i)
        et = get_substring(el, replace_length)
        res_b = %timeit -n 1000 -o b(el, et) # ipython magic
        times_a.append(res_a.average * 1e6)
        times_b.append(res_b.average * 1e6)
    return times_a, times_b
#----run
t2 = compare((2*2, 1000, 50), 2)
t10 = compare((2*10, 1000, 50), 10)

FYI, this is of little or no use to the OP but it may be of use to other readers (please do not downvote, I'm aware of this).
As a somewhat ridiculous but interesting exercise, I wanted to see if I could use python functional programming to replace multiple chars. I'm pretty sure this does NOT beat just calling replace() twice. And if performance was an issue, you could easily beat this in rust, C, julia, perl, java, javascript and maybe even awk. It uses an external 'helpers' package called pytoolz, accelerated via cython (cytoolz, it's a pypi package).
from cytoolz.functoolz import compose
from cytoolz.itertoolz import chain,sliding_window
from itertools import starmap,imap,ifilter
from operator import itemgetter,contains
from functools import partial
text='&hello#hi&yo&'
char_index_iter=compose(partial(imap, itemgetter(0)), partial(ifilter, compose(partial(contains, '#&'), itemgetter(1))), enumerate)
print '\\'.join(imap(text.__getitem__, starmap(slice, sliding_window(2, chain((0,), char_index_iter(text), (len(text),))))))
I'm not even going to explain this because no one would bother using this to accomplish multiple replace. Nevertheless, I felt somewhat accomplished in doing this and thought it might inspire other readers or win a code obfuscation contest.

How about this?
def replace_all(dict, str):
    for key in dict:
        str = str.replace(key, dict[key])
    return str
then
print(replace_all({"&":"\&", "#":"\#"}, "&#"))
output
\&\#
similar to answer

Using reduce, which is available in python2.7 and python3.*, you can easily replace multiple substrings in a clean and pythonic way.
from functools import reduce  # needed on python3.*; also works on python2.7
# Lets define a helper method to make it easy to use
def replacer(text, replacements):
    return reduce(
        lambda text, ptuple: text.replace(ptuple[0], ptuple[1]),
        replacements, text
    )
if __name__ == '__main__':
    uncleaned_str = "abc&def#ghi"
    cleaned_str = replacer(uncleaned_str, [("&","\&"),("#","\#")])
    print(cleaned_str) # "abc\&def\#ghi"
In python2.7 you don't have to import reduce but in python3.* you have to import it from the functools module.

advanced way using regex
import re
text = "hello ,world!"
replaces = {"hello": "hi", "world":" 2020", "!":"."}
regex = re.sub("|".join(replaces.keys()), lambda match: replaces[match.string[match.start():match.end()]], text)
print(regex)
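The pattern above is built from the raw dictionary keys, so a key containing a regex metacharacter such as "." or "+" would not match literally. A hedged sketch of a more defensive variant; `replace_many` is a hypothetical helper name, and sorting the keys by length is an added assumption to handle keys that are prefixes of each other:

```python
import re

def replace_many(text, replacements):
    """Replace every key of replacements found in text with its value, in one pass."""
    if not replacements:
        return text
    # Longest keys first, so a key that is a prefix of another cannot shadow it.
    keys = sorted(replacements, key=len, reverse=True)
    # re.escape makes keys such as "." or "+" match literally.
    pattern = re.compile("|".join(map(re.escape, keys)))
    return pattern.sub(lambda match: replacements[match.group(0)], text)

print(replace_many("hello ,world!", {"hello": "hi", "world": " 2020", "!": "."}))
# hi , 2020.
```

Because the text is scanned once, a replacement value is never itself replaced, unlike with chained replace calls.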

>>> a = '&#'
>>> print a.replace('&', r'\&')
\&#
>>> print a.replace('#', r'\#')
&\#
>>>
You want to use a 'raw' string (denoted by the 'r' prefixing the replacement string), since raw strings do not treat the backslash specially.

Maybe a simple loop for chars to replace:
a = '&#'
to_replace = ['&', '#']
for char in to_replace:
    a = a.replace(char, "\\"+char)
print(a)
>>> \&\#

This will help someone looking for a simple solution.
def replacemany(our_str, to_be_replaced:tuple, replace_with:str):
    for nextchar in to_be_replaced:
        our_str = our_str.replace(nextchar, replace_with)
    return our_str
os = 'the rain in spain falls mainly on the plain ttttttttt sssssssssss nnnnnnnnnn'
tbr = ('a','t','s','n')
rw = ''
print(replacemany(os,tbr,rw))
Output:
he ri i pi fll mily o he pli

The example below uses the regex alternation operator | (an "or" condition) to delete every ' and , from the given string. Pass as many characters as you want, separated by |.
import re
test = re.sub("('|,)","",str(jsonAtrList))

Related

python bytes to bit string

I have value of the type bytes that need to be converted to BIT STRING
bytes_val = (b'\x80\x00', 14)
the bytes in index zero need to be converted to bit string of length as indicated by the second element (14 in this case) and formatted as groups of 8 bits like below.
expected output => '10000000 000000'B
Another example
bytes_val2 = (b'\xff\xff\xff\xff\xf0\x00', 45) #=> '11111111 11111111 11111111 11111111 11110000 00000'B
What about some combination of formatting (below with f-string but can be done otherwise), and slicing:
def bytes2binstr(b, n=None):
    s = ' '.join(f'{x:08b}' for x in b)
    return s if n is None else s[:n + n // 8 + (0 if n % 8 else -1)]
If I understood correctly (I am not sure what the B at the end is supposed to mean), it passes your tests and a couple more:
func = bytes2binstr
args = (
(b'\x80\x00', None),
(b'\x80\x00', 14),
(b'\x0f\x00', 14),
(b'\xff\xff\xff\xff\xf0\x00', 16),
(b'\xff\xff\xff\xff\xf0\x00', 22),
(b'\x0f\xff\xff\xff\xf0\x00', 45),
(b'\xff\xff\xff\xff\xf0\x00', 45),
)
for arg in args:
print(arg)
print(repr(func(*arg)))
# (b'\x80\x00', None)
# '10000000 00000000'
# (b'\x80\x00', 14)
# '10000000 000000'
# (b'\x0f\x00', 14)
# '00001111 000000'
# (b'\xff\xff\xff\xff\xf0\x00', 16)
# '11111111 11111111'
# (b'\xff\xff\xff\xff\xf0\x00', 22)
# '11111111 11111111 111111'
# (b'\x0f\xff\xff\xff\xf0\x00', 45)
# '00001111 11111111 11111111 11111111 11110000 00000'
# (b'\xff\xff\xff\xff\xf0\x00', 45)
# '11111111 11111111 11111111 11111111 11110000 00000'
Explanation
we start from a bytes object
iterating through it gives us a single byte as a number
each byte is 8 bit, so decoding that will already give us the correct separation
each byte is formatted using the b binary specifier, with some additional formatting: 0 zero fill, 8 minimum length
we join (concatenate) the result of the formatting using ' ' as "separator"
finally the result is returned as is if a maximum number of bits n was not specified (set to None), otherwise the result is cropped to n + the number of spaces that were added in-between the 8-character groups.
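The cropping arithmetic in the last step can be worked through on its own. The sketch below only restates the slice bound from bytes2binstr as a separate function; `crop_length` is an illustrative name, not part of the answer:

```python
def crop_length(n):
    """Number of characters to keep so that exactly n bits survive."""
    full_groups = n // 8                  # each complete 8-bit group is followed by a space
    trailing_space = 0 if n % 8 else -1   # drop the space after the last complete group
    return n + full_groups + trailing_space

for n in (14, 16, 22, 45):
    print(n, crop_length(n))
# 14 15
# 16 17
# 22 24
# 45 50
```

For example, 14 bits need one separating space, so 15 characters are kept, which matches the length of '10000000 000000'.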
In the solution above 8 is somewhat hard-coded.
If you want it to be a parameter, you may want to look into (possibly a variation of) @kederrac's first answer using int.from_bytes().
This could look something like:
def bytes2binstr_frombytes(b, n=None, k=8):
    s = '{x:0{m}b}'.format(m=len(b) * 8, x=int.from_bytes(b, byteorder='big'))[:n]
    return ' '.join([s[i:i + k] for i in range(0, len(s), k)])
which gives the same output as above.
Speedwise, the int.from_bytes()-based solution is also faster:
import random
funcs = bytes2binstr, bytes2binstr_frombytes  # the two functions defined above
for i in range(2, 7):
    n = 10 ** i
    print(n)
    b = b''.join([random.randint(0, 2 ** 8 - 1).to_bytes(1, 'big') for _ in range(n)])
    for func in funcs:
        print(func.__name__, funcs[0](b, n * 7) == func(b, n * 7))
        %timeit func(b, n * 7)  # IPython magic
    print()
# 100
# bytes2binstr True
# 10000 loops, best of 3: 33.9 µs per loop
# bytes2binstr_frombytes True
# 100000 loops, best of 3: 15.1 µs per loop
# 1000
# bytes2binstr True
# 1000 loops, best of 3: 332 µs per loop
# bytes2binstr_frombytes True
# 10000 loops, best of 3: 134 µs per loop
# 10000
# bytes2binstr True
# 100 loops, best of 3: 3.29 ms per loop
# bytes2binstr_frombytes True
# 1000 loops, best of 3: 1.33 ms per loop
# 100000
# bytes2binstr True
# 10 loops, best of 3: 37.7 ms per loop
# bytes2binstr_frombytes True
# 100 loops, best of 3: 16.7 ms per loop
# 1000000
# bytes2binstr True
# 1 loop, best of 3: 400 ms per loop
# bytes2binstr_frombytes True
# 10 loops, best of 3: 190 ms per loop
you can use:
def bytest_to_bit(by, n):
    bi = "{:0{l}b}".format(int.from_bytes(by, byteorder='big'), l=len(by) * 8)[:n]
    return ' '.join([bi[i:i + 8] for i in range(0, len(bi), 8)])
bytest_to_bit(b'\xff\xff\xff\xff\xf0\x00', 45)
output:
'11111111 11111111 11111111 11111111 11110000 00000'
steps:
transform your bytes to an integer using int.from_bytes
str.format method can take a binary format spec.
also, you can use a more compact form where each byte is formatted:
def bytest_to_bit(by, n):
    bi = ' '.join(map('{:08b}'.format, by))
    return bi[:n + len(by) - 1].rstrip()
bytest_to_bit(b'\xff\xff\xff\xff\xf0\x00', 45)
test_data = [
    (b'\x80\x00', 14),
    (b'\xff\xff\xff\xff\xf0\x00', 45),
]
def get_bit_string(bytes_, length) -> str:
    output_chars = []
    for byte in bytes_:
        for _ in range(8):
            if length <= 0:
                return ''.join(output_chars)
            output_chars.append(str(byte >> 7 & 1))
            byte <<= 1
            length -= 1
        output_chars.append(' ')
    return ''.join(output_chars)
for data in test_data:
    print(get_bit_string(*data))
output:
10000000 000000
11111111 11111111 11111111 11111111 11110000 00000
explanation:
length: Starts at the target length and decreases to 0.
if length <= 0: return ...: If we have reached the target length, stop and return.
''.join(output_chars): Makes a string from the list.
str(byte >> 7 & 1)
byte >> 7: Shifts 7 bits to the right, so only the MSB remains in the lowest position (a byte has 8 bits).
MSB means Most Significant Bit.
(...) & 1: Bit-wise AND operation. It extracts the LSB (Least Significant Bit), which discards any higher bits left over from the earlier left shifts.
byte <<= 1: Shifts byte 1 bit to the left.
length -= 1: Decreases length.
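The shift-and-mask step can be tried in isolation on a single byte. A small sketch using the same two operations; `bits_msb_first` is an illustrative name, not part of the answer:

```python
def bits_msb_first(byte):
    """Return the 8 bits of byte as a string, most significant bit first."""
    out = []
    for _ in range(8):
        out.append(str(byte >> 7 & 1))  # bit 7 of the current value
        byte <<= 1                      # move the next bit into position 7
    return "".join(out)

print(bits_msb_first(0x80))  # 10000000
print(bits_msb_first(0xF0))  # 11110000
```

Python integers are unbounded, so the left shift lets the value grow past 8 bits; the & 1 mask is what keeps only the bit of interest.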
This is a lazy version.
It neither loads nor processes the entire bytes object.
It halts regardless of input size.
The other solutions may not!
I use collections.deque to build the bit string.
from collections import deque
from itertools import chain, repeat, starmap
import os
def bit_lenght_list(n):
    eights, rem = divmod(n, 8)
    return chain(repeat(8, eights), (rem,))
def build_bitstring(byte, bit_length):
    d = deque("0" * 8, 8)
    d.extend(bin(byte)[2:])
    return "".join(d)[:bit_length]
def bytes_to_bits(byte_string, bits):
    return "{!r}B".format(
        " ".join(starmap(build_bitstring, zip(byte_string, bit_lenght_list(bits))))
    )
Test:
In [1]: bytes_ = os.urandom(int(1e9))
In [2]: timeit bytes_to_bits(bytes_, 0)
4.21 µs ± 27.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: timeit bytes_to_bits(os.urandom(1), int(1e9))
6.8 µs ± 51 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: bytes_ = os.urandom(6)
In [5]: bytes_
Out[5]: b'\xbf\xd5\x08\xbe$\x01'
In [6]: timeit bytes_to_bits(bytes_, 45) #'10111111 11010101 00001000 10111110 00100100 00000'B
12.3 µs ± 85 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [7]: bytes_to_bits(bytes_, 14)
Out[7]: "'10111111 110101'B"
when you say BIT you mean binary?
I would try
bytes_val = b'\x80\x00'
for byte in bytes_val:
    value_in_binary = bin(byte)
This gives the answer without python's binary representation pre-fixed 0b:
bit_str = ' '.join(bin(i).replace('0b', '') for i in bytes_val)
This works in Python 3.x:
def to_bin(l):
    val, length = l
    bit_str = ''.join(bin(i).replace('0b', '') for i in val)
    if len(bit_str) < length:
        # pad with zeros
        return '0'*(length-len(bit_str)) + bit_str
    else:
        # cut to size
        return bit_str[:length]
bytes_val = [b'\x80\x00',14]
print(to_bin(bytes_val))
and this works in 2.x:
def to_bin(l):
    val, length = l
    bit_str = ''.join(bin(ord(i)).replace('0b', '') for i in val)
    if len(bit_str) < length:
        # pad with zeros
        return '0'*(length-len(bit_str)) + bit_str
    else:
        # cut to size
        return bit_str[:length]
bytes_val = [b'\x80\x00',14]
print(to_bin(bytes_val))
Both produce result 00000100000000

How to parse and evaluate a math expression with Pandas Dataframe columns?

What I would like to do is to parse an expression such this one:
result = A + B + sqrt(B + 4)
Where A and B are columns of a dataframe. So I would have to parse the expresion like this in order to get the result:
new_col = df.B + 4
result = df.A + df.B + new_col.apply(sqrt)
Where df is the dataframe.
I have tried with re.sub but it would be good only to replace the column variables (not the functions) like this:
import re
def repl(match):
    inner_word = match.group(1)
    new_var = "df['{}']".format(inner_word)
    return new_var
eq = 'A + 3 / B'
new_eq = re.sub('([a-zA-Z_]+)', repl, eq)
result = eval(new_eq)
So, my questions are:
Is there a python library to do this? If not, how can I achieve this in a simple way?
Creating a recursive function could be the solution?
If I use the "reverse polish notation" could simplify the parsing?
Would I have to use the ast module?
Pandas DataFrames do have an eval function. Using your example equation:
import pandas as pd
# create an example DataFrame to work with
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
# define equation
eq = 'A + 3 / B'
# actual computation
df.eval(eq)
# more complicated equation
eq = "A + B + sqrt(B + 4)"
df.eval(eq)
Warning
Keep in mind that eval can run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
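When the expression comes from untrusted input, one alternative is to walk the syntax tree with the ast module and allow only whitelisted nodes. The sketch below is an illustration under stated assumptions, not pandas functionality: `safe_eval`, the operator tables and the single allowed function `sqrt` are hypothetical, it assumes Python 3.8+, and it works on plain numbers passed in a dict rather than on DataFrame columns:

```python
import ast
import math
import operator

# Hypothetical whitelist: only these operators and functions are allowed.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_FUNCS = {"sqrt": math.sqrt}

def safe_eval(expr, variables):
    """Evaluate an arithmetic expression using only whitelisted nodes."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS and not node.keywords):
            return _FUNCS[node.func.id](*[walk(arg) for arg in node.args])
        # Anything else (attribute access, other calls, ...) is rejected.
        raise ValueError("unsupported expression element: " + type(node).__name__)
    return walk(ast.parse(expr, mode="eval"))

print(safe_eval("A + B + sqrt(B + 4)", {"A": 1, "B": 5}))  # 9.0
```

Anything outside the whitelist, such as a call to __import__, raises ValueError instead of being executed.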
Following the example provided by @uuazed, a faster way would be to use numexpr:
import pandas as pd
import numpy as np
import numexpr as ne
df = pd.DataFrame(np.random.randn(int(1e6), 2), columns=['A', 'B'])
eq = "A + B + sqrt(B + 4)"
timeit df.eval(eq)
# 15.9 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
timeit A=df.A; B=df.B; ne.evaluate(eq)
# 6.24 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
numexpr may also have more supported operations

How to turn a 1D radial profile into a 2D array in python

I have a list that models a phenomenon that is a function of radius. I want to convert this to a 2D array. I wrote some code that does exactly what I want, but since it uses nested for loops, it is quite slow.
l = len(profile1D)/2
critDim = int((l**2 /2.)**(1/2.))
profile2D = np.empty([critDim, critDim])
for x in xrange(0, critDim):
    for y in xrange(0,critDim):
        r = ((x**2 + y**2)**(1/2.))
        profile2D[x,y] = profile1D[int(l+r)]
Is there a more efficient way to do the same thing by avoiding these loops?
Here's a vectorized approach using broadcasting -
a = np.arange(critDim)**2
r2D = np.sqrt(a[:,None] + a)
out = profile1D[(l+r2D).astype(int)]
If there are many repeated indices generated by l+r2D, we can use np.take for some further performance boost, like so -
out = np.take(profile1D,(l+r2D).astype(int))
Runtime test
Function definitions -
def org_app(profile1D,l,critDim):
    profile2D = np.empty([critDim, critDim])
    for x in xrange(0, critDim):
        for y in xrange(0,critDim):
            r = ((x**2 + y**2)**(1/2.))
            profile2D[x,y] = profile1D[int(l+r)]
    return profile2D
def vect_app1(profile1D,l,critDim):
    a = np.arange(critDim)**2
    r2D = np.sqrt(a[:,None] + a)
    out = profile1D[(l+r2D).astype(int)]
    return out
def vect_app2(profile1D,l,critDim):
    a = np.arange(critDim)**2
    r2D = np.sqrt(a[:,None] + a)
    out = np.take(profile1D,(l+r2D).astype(int))
    return out
Timings and verification -
In [25]: # Setup input array and params
...: profile1D = np.random.randint(0,9,(1000))
...: l = len(profile1D)/2
...: critDim = int((l**2 /2.)**(1/2.))
...:
In [26]: np.allclose(org_app(profile1D,l,critDim),vect_app1(profile1D,l,critDim))
Out[26]: True
In [27]: np.allclose(org_app(profile1D,l,critDim),vect_app2(profile1D,l,critDim))
Out[27]: True
In [28]: %timeit org_app(profile1D,l,critDim)
10 loops, best of 3: 154 ms per loop
In [29]: %timeit vect_app1(profile1D,l,critDim)
1000 loops, best of 3: 1.69 ms per loop
In [30]: %timeit vect_app2(profile1D,l,critDim)
1000 loops, best of 3: 1.68 ms per loop
In [31]: # Setup input array and params
...: profile1D = np.random.randint(0,9,(5000))
...: l = len(profile1D)/2
...: critDim = int((l**2 /2.)**(1/2.))
...:
In [32]: %timeit org_app(profile1D,l,critDim)
1 loops, best of 3: 3.76 s per loop
In [33]: %timeit vect_app1(profile1D,l,critDim)
10 loops, best of 3: 59.8 ms per loop
In [34]: %timeit vect_app2(profile1D,l,critDim)
10 loops, best of 3: 59.5 ms per loop

Performance considerations when populating lists vs dictionaries

Say I need to collect millions of strings in an iterable that I can later randomly index by position.
I need to populate the iterable one item at a time, sequentially, for millions of entries.
Given the above, which method could in principle be more efficient:
Populating a list:
while <condition>:
    if <condition>:
        my_list[count] = value
        count += 1
Populating a dictionary:
while <condition>:
    if <condition>:
        my_dict[count] = value
        count += 1
(the above is pseudocode, everything would be initialized before running the snippets).
I am specifically interested in the CPython implementation for Python 3.4.
Lists are definitely faster, if you use them in the right way.
In [19]: %%timeit l = []
....: for i in range(1000000): l.append(str(i))
....:
1 loops, best of 3: 182 ms per loop
In [20]: %%timeit d = {}
....: for i in range(1000000): d[i] = str(i)
....:
1 loops, best of 3: 207 ms per loop
In [21]: %timeit [str(i) for i in range(1000000)]
10 loops, best of 3: 158 ms per loop
Pushing the Python loop down to the C level with a comprehension buys you quite a bit of time. It also makes more sense to prefer a list for keys that are a prefix of the integers. Pre-allocating saves even more time:
>>> %%timeit
... l = [None] * 1000000
... for i in xrange(1000000): l[i] = str(i)
...
10 loops, best of 3: 147 ms per loop
For completeness, a dict comprehension does not speed things up:
In [22]: %timeit {i: str(i) for i in range(1000000)}
1 loops, best of 3: 213 ms per loop
With larger strings, I see very similar differences in performance (try str(i) * 10). This is CPython 2.7.6 on an x86-64.
I don't understand why you want to create an empty list or dict and then populate it. Why not create a new list or dictionary directly from the generation process?
results = list(a_generator)
# Or if you really want to use a dict for some reason:
results = dict(enumerate(a_generator))
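As a concrete illustration of the two lines above, here is a minimal sketch; `generate_values` is a hypothetical stand-in for whatever produces the strings, and the even-number filter stands in for the question's `<condition>`:

```python
def generate_values(n):
    """Hypothetical stand-in for the process that produces the strings."""
    for i in range(n):
        if i % 2 == 0:  # stand-in for the question's <condition>
            yield str(i)

as_list = list(generate_values(10))
as_dict = dict(enumerate(generate_values(10)))

# Both containers can be indexed by position afterwards.
print(as_list[3], as_dict[3])  # 6 6
```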
You can get even better times by using the map function:
>>> def test1():
        l = []
        for i in range(10 ** 6):
            l.append(str(i))
>>> def test2():
        d = {}
        for i in range(10 ** 6):
            d[i] = str(i)
>>> def test3():
        [str(i) for i in range(10 ** 6)]
>>> def test4():
        {i: str(i) for i in range(10 ** 6)}
>>> def test5():
        list(map(str, range(10 ** 6)))
>>> def test6():
        r = range(10 ** 6)
        dict(zip(r, map(str, r)))
>>> timeit.Timer('test1()', 'from __main__ import test1').timeit(100)
30.628035710889932
>>> timeit.Timer('test2()', 'from __main__ import test2').timeit(100)
31.093550469839613
>>> timeit.Timer('test3()', 'from __main__ import test3').timeit(100)
25.778271498509355
>>> timeit.Timer('test4()', 'from __main__ import test4').timeit(100)
30.10892986559668
>>> timeit.Timer('test5()', 'from __main__ import test5').timeit(100)
20.633583353028826
>>> timeit.Timer('test6()', 'from __main__ import test6').timeit(100)
28.660790917067914

How to convert 'false' to 0 and 'true' to 1?

Is there a way to convert true of type unicode to 1 and false of type unicode to 0 (in Python)?
For example: x == 'true' and type(x) == unicode
I want x = 1
PS: I don’t want to use if-else.
Use int() on a boolean test:
x = int(x == 'true')
int() turns the boolean into 1 or 0. Note that any value not equal to 'true' will result in 0 being returned.
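If silently mapping every other value to 0 is not wanted, a dictionary lookup also avoids if-else while rejecting unknown input. This is a sketch of an alternative, not part of the answer above; `STR_TO_INT` and `strict_to_int` are hypothetical names:

```python
# Hypothetical lookup table; unknown strings raise KeyError instead of mapping to 0.
STR_TO_INT = {"true": 1, "false": 0}

def strict_to_int(s):
    return STR_TO_INT[s]

print(strict_to_int("true"))   # 1
print(strict_to_int("false"))  # 0
```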
If B is a Boolean array, write
B = B*1
(A bit code golfy.)
You can use x.astype('uint8') where x is your Boolean array.
Here's a yet another solution to your problem:
def to_bool(s):
    return 1 - sum(map(ord, s)) % 2
    # return 1 - sum(s.encode('ascii')) % 2 # Alternative for Python 3
It works because the sum of the ASCII codes of 'true' is 448, which is even, while the sum of the ASCII codes of 'false' is 523 which is odd.
The funny thing about this solution is that its result is pretty random if the input is not one of 'true' or 'false'. Half of the time it will return 0, and the other half 1. The variant using encode will raise an encoding error if the input is not ASCII (thus increasing the undefined-ness of the behaviour).
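The arithmetic behind the parity trick can be checked directly; this short sketch just prints the sums quoted above:

```python
for word in ("true", "false"):
    total = sum(map(ord, word))        # sum of the character codes
    print(word, total, 1 - total % 2)  # even sum -> 1, odd sum -> 0
# true 448 1
# false 523 0
```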
Seriously, I believe the most readable, and faster, solution is to use an if:
def to_bool(s):
    return 1 if s == 'true' else 0
See some microbenchmarks:
In [14]: def most_readable(s):
    ...:     return 1 if s == 'true' else 0
In [15]: def int_cast(s):
    ...:     return int(s == 'true')
In [16]: def str2bool(s):
    ...:     try:
    ...:         return ['false', 'true'].index(s)
    ...:     except (ValueError, AttributeError):
    ...:         raise ValueError()
In [17]: def str2bool2(s):
    ...:     try:
    ...:         return ('false', 'true').index(s)
    ...:     except (ValueError, AttributeError):
    ...:         raise ValueError()
In [18]: def to_bool(s):
    ...:     return 1 - sum(s.encode('ascii')) % 2
In [19]: %timeit most_readable('true')
10000000 loops, best of 3: 112 ns per loop
In [20]: %timeit most_readable('false')
10000000 loops, best of 3: 109 ns per loop
In [21]: %timeit int_cast('true')
1000000 loops, best of 3: 259 ns per loop
In [22]: %timeit int_cast('false')
1000000 loops, best of 3: 262 ns per loop
In [23]: %timeit str2bool('true')
1000000 loops, best of 3: 343 ns per loop
In [24]: %timeit str2bool('false')
1000000 loops, best of 3: 325 ns per loop
In [25]: %timeit str2bool2('true')
1000000 loops, best of 3: 295 ns per loop
In [26]: %timeit str2bool2('false')
1000000 loops, best of 3: 277 ns per loop
In [27]: %timeit to_bool('true')
1000000 loops, best of 3: 607 ns per loop
In [28]: %timeit to_bool('false')
1000000 loops, best of 3: 612 ns per loop
Notice how the if solution is more than 2x faster than all the other solutions. It does not make sense to require avoiding ifs unless this is some kind of homework (in which case you shouldn't have asked this in the first place).
If you need a general purpose conversion from a string which per se is not a bool, you should better write a routine similar to the one depicted below. In keeping with the spirit of duck typing, I have not silently passed the error but converted it as appropriate for the current scenario.
>>> def str2bool(st):
        try:
            return ['false', 'true'].index(st.lower())
        except (ValueError, AttributeError):
            raise ValueError('no Valid Conversion Possible')
>>> str2bool('garbaze')
Traceback (most recent call last):
  File "<pyshell#106>", line 1, in <module>
    str2bool('garbaze')
  File "<pyshell#105>", line 5, in str2bool
    raise ValueError('no Valid Conversion Possible')
ValueError: no Valid Conversion Possible
>>> str2bool('false')
0
>>> str2bool('True')
1
+(False) converts to 0 and
+(True) converts to 1
Any of the following will work:
s = "true"
(s == 'true').real
1
(s == 'false').real
0
(s == 'true').conjugate()
1
(s == '').conjugate()
0
(s == 'true').__int__()
1
(s == 'opal').__int__()
0
def as_int(s):
    return (s == 'true').__int__()
>>> as_int('false')
0
>>> as_int('true')
1
bool to int:
x = (x == 'true') + 0
Now the x contains 1 if x == 'true' else 0.
Note: x == 'true' will return bool which then will be typecasted to int having value (1 if bool value is True else 0) when added with 0.
only with this:
const a = true;
const b = false;
console.log(+a);//1
console.log(+b);//0
