I'm having trouble getting a replace() to work
I've tried my_string.replace('\\', '') and re.sub('\\', '', my_string), but neither one works.
I thought \\ was the escape sequence for a backslash. Am I wrong?
The string in question looks like
'<2011315123.04C6DACE618A7C2763810#\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'
or print my_string
<2011315123.04C6DACE618A7C2763810#???ꂩ?猩???邾?낤>
Yes, it's supposed to look like garbage, but I'd rather get
'<2011315123.04C6DACE618A7C2763810#82b182ea82a982e78ca982a682e982be82eb82a4>'
You don't have any backslashes in your string. What you don't have, you can't remove.
Consider what you are showing as '\x82' ... this is a one-byte string.
>>> s = '\x82'
>>> len(s)
1
>>> ord(s)
130
>>> hex(ord(s))
'0x82'
>>> print s
é # my sys.stdout.encoding is 'cp850'
>>> print repr(s)
'\x82'
>>>
What you'd "rather get" ('x82') is meaningless.
Update: The "non-ASCII" part of the string (bounded by # and >) is actually Japanese text written mostly in Hiragana and encoded using shift_jis. Transcript of IDLE session:
>>> y = '\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4'
>>> print y.decode('shift_jis')
これから見えるだろう
Google Translate produces "Can not you see the future" as the English translation.
In a comment on another answer, you say:
I just need ascii
and
What I'm doing with it is seeing how
far apart the two strings are using
nltk.edit_distance(), so this will
give me a multiple of the true
distance. Which is good enough for me.
Why do you think you need ASCII? Edit distance is defined quite independently of any alphabet.
For a start, doing nonsensical transformations of your strings won't give you a consistent or predictable multiple of the true distance. Secondly, out of the following:
x
repr(x)
repr(x).replace('\\', '')
repr(x).replace('\\x', '') # if \ is noise, so is x
x.decode(whatever_the_encoding_is)
why do you choose the third?
Update 2 in response to comments:
(1) You still haven't said why you think you need "ascii". nltk.edit_distance doesn't require "ascii" -- the args are said to be "strings" (whatever that means) but the code will work with any 2 sequences of objects for which != works. In other words, why not just use the first of the above 5 options?
(2) Accepting up to 100% inflation of the edit distance is somewhat astonishing. Note that your currently chosen method will use 4 symbols (hex digits) per Japanese character. repr(x) uses 8 symbols per character. x (the first option) uses 2.
(3) You can mitigate the inflation effect by normalising your edit distance. Instead of comparing distance(s1, s2) with a number_of_symbols threshold, compare distance(s1, s2) / float(max(len(s1), len(s2))) with a fraction threshold. Note normalisation is usually used anyway ... the rationale being that the dissimilarity between 20-symbol strings with an edit distance of 4 is about the same as that between 10-symbol strings with an edit distance of 2, not twice as much.
(4) nltk.edit_distance is the most shockingly inefficient pure-Python implementation of edit_distance that I've ever seen. This implementation by Magnus Lie Hetland is much better, but still capable of improvement.
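The normalisation described in point (3) can be sketched as follows. This is a minimal illustration written for Python 3, not the nltk implementation and not the Hetland one: edit_distance and normalised_distance are names chosen here, and the distance is a plain Levenshtein distance with unit costs. It works on any two sequences whose items can be compared with !=, so no conversion to ASCII is involved.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]                           # turning a[:i] into an empty sequence
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


def normalised_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0                          # two empty sequences are identical
    return edit_distance(a, b) / float(longest)


print(edit_distance('kitten', 'sitting'))
print(normalised_distance('kitten', 'sitting'))
```

The normalised value always lies between 0.0 and 1.0, so it can be compared with a fraction threshold regardless of how many symbols each character occupies.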
This works, I think, if you really want to just strip the "\":
>>> a = '<2011315123.04C6DACE618A7C2763810#\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'
>>> repr(a).replace("\\","")[1:-1]
'<2011315123.04C6DACE618A7C2763810#x82xb1x82xeax82xa9x82xe7x8cxa9x82xa6x82xe9x82xbex82xebx82xa4>'
>>>
But like the answer above, what you get is pretty much meaningless.
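If the hex form shown in the question is still wanted despite the caveats above, here is a minimal sketch for Python 3, where the data would be a bytes object rather than a str. The helper name hexify_non_ascii is made up for illustration. It leaves ASCII bytes as characters and writes every other byte as two hex digits.

```python
def hexify_non_ascii(data):
    """Keep ASCII bytes as characters; write every other byte as 2 hex digits.

    hexify_non_ascii is a hypothetical helper, not part of any library.
    """
    # Iterating over a bytes object in Python 3 yields integers 0..255.
    return ''.join(chr(b) if b < 0x80 else '%02x' % b for b in data)


raw = (b'<2011315123.04C6DACE618A7C2763810#'
       b'\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9'
       b'\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>')
print(hexify_non_ascii(raw))
```

Note that the result cannot be reversed reliably, because hex digits produced this way are indistinguishable from digits and letters that were already in the ASCII part.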
I've tried various search terms in Google and I have tried to find an answer from the official docs but I cannot find a reasonable (read: understandable) explanation why this is the case in Python.
I understand that it's something to do with lexicographical order.
The expression 'a' > 'A' is regarded as True
But why is this the case? Or rather, why does this need to be the case?
Sorry if this is phrased badly, this is my first ever question on this site.
This should make the ordering clearer:
>>> ord('a')
97
>>> ord('A')
65
>>> ord('a') > ord('A')
True
The character code for 'a' is greater than the one for 'A'.
ord('a') returns an integer: the Unicode code point of the character. ord() is the inverse of chr().
The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:
0061 'a'; LATIN SMALL LETTER A
0062 'b'; LATIN SMALL LETTER B
0063 'c'; LATIN SMALL LETTER C
...
007B '{'; LEFT CURLY BRACKET
Read more here: https://docs.python.org/3.3/howto/unicode.html
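A short sketch of what this means for comparisons, written for Python 3: strings are compared character by character using their code points, so every uppercase ASCII letter sorts before every lowercase one. The word lists are made up for illustration. Passing key=str.lower is one common way to get a case-insensitive order.

```python
letters = ['b', 'A', 'a', 'B']

# Default ordering compares code points: 'A' (65) < 'B' (66) < 'a' (97) < 'b' (98).
by_code_point = sorted(letters)

# Case-insensitive ordering: compare the lowercased form of each item instead.
ignoring_case = sorted(letters, key=str.lower)

print('a' > 'A')        # True, because 97 > 65
print('Z' < 'a')        # True, because 90 < 97
print(by_code_point)    # ['A', 'B', 'a', 'b']
print(ignoring_case)    # ['A', 'a', 'b', 'B']
```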
I have searched stackoverflow and I can't find the answer that I am looking for. Apologies if this sounds like a dumb question since I am a newbie learning Python. Spent 1 hour trying to understand this and I can't grasp the concept.
Can somebody explain to me the following:
hilarious = False
joke_evaluation = "Isn't that joke so funny?! %r"
print joke_evaluation % hilarious
hilarious = "False"
joke_evaluation = "Isn't that joke so funny?!"
print hilarious + joke_evaluation
Why is it that you can combine the first with % but not with +?
Is it because in the second one, they are both defined strings with quotations but in the first, hilarious = False is not in quotations?
The % operator on strings isn't exactly a concatenation like the + operator is.
With % you're actually substituting placeholders in the string on the left side of % with values from the right side.
So you could have something like this:
"This is my %s string" % "fantastic"
would yield:
This is my fantastic string
See how you're not concatenating the strings but "inserting" into the string on the left side.
See the documentation for more details.
Update:
As pointed out in the comments below, there are two "issues" with this. As of Python 2.6, this is actually the "old" way of doing string substitution. These days the following format is preferred (kudos to asmacdo):
"This is my {adjective} string".format(adjective='fantastic')
Also from the comments (thanks ErlVolton): I should explain that "%s" refers to a string substitution. That is, the value that gets put in there is converted with str(). Similarly you can have integer substitution ("%d"), decimal floating point substitution ("%f") and, as in the case of the original question, "%r", which inserts the repr() of any value, booleans included. You can also do a lot more formatting (vary the number of decimal places for a floating point number, pad numbers with leading zeros etc.) which is explained much better in the docs.
Finally, it's worth mentioning that you can substitute multiple values into a string but that changes the syntax a tiny bit. Instead of having a single value after the % operator you need to use a tuple. Example:
"this %s substitutes strings, booleans (%r) and an integer (%d)" % ('string', True, 42)
would yield:
this string substitutes strings, booleans (True) and an integer (42)
In this case, the percent sign marks the start of a printf-style specifier. When the first argument is a string, it formats it using the second argument (a boolean value in this case).
Refer to the documentation, or check out this question. They should shed some light on the situation.
As for the plus sign, you simply can't add (or concat) a boolean value to a string.
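To make the difference concrete, here is a small sketch for Python 3 using the values from the question. Concatenation with + works once the boolean is converted explicitly with str(); without the conversion it raises TypeError. The variable names formatted, concatenated and failed are only for illustration.

```python
hilarious = False

# % substitutes the value into the placeholder; %r uses repr(hilarious).
formatted = "Isn't that joke so funny?! %r" % hilarious

# + only joins strings, so the boolean has to be converted first.
concatenated = "Isn't that joke so funny?! " + str(hilarious)

# Without the conversion, + refuses to mix str and bool.
try:
    "Isn't that joke so funny?! " + hilarious
    failed = False
except TypeError:
    failed = True

print(formatted)
print(concatenated)
print(failed)
```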
I am looking for a slick function that reverses the digits of the binary representation of a number.
If f were such a function I would have
int(s[::-1], 2) == f(int(s, 2)) whenever s is a string of zeros and ones starting with 1.
Right now I am using lambda x: int(''.join(reversed(bin(x)[2:])),2)
which is ok as far as conciseness is concerned, but it seems like a pretty roundabout way of doing this.
I was wondering if there was a nicer (perhaps faster) way with bitwise operators and what not.
How about
int('{0:b}'.format(n)[::-1], 2)
or
int(bin(n)[:1:-1], 2)
The second method seems to be the faster of the two; however, both are much faster than your current method:
import timeit
print timeit.timeit("int('{0:b}'.format(n)[::-1], 2)", 'n = 123456')
print timeit.timeit("int(bin(n)[:1:-1], 2)", 'n = 123456')
print timeit.timeit("int(''.join(reversed(bin(n)[2:])),2)", 'n = 123456')
1.13251614571
0.710681915283
2.23476600647
You could do it with shift operators like this:
def revbits(x):
    rev = 0
    while x:
        rev <<= 1
        rev += x & 1
        x >>= 1
    return rev
It doesn't seem any faster than your method, though (in fact, slightly slower for me).
Here is my suggestion:
In [83]: int(''.join(bin(x)[:1:-1]), 2)
Out[83]: 9987
Same method, slightly simplified.
I would argue your current method is perfectly fine, but you can lose the list() call, as str.join() will accept any iterable:
def binary_reverse(num):
    return int(''.join(reversed(bin(num)[2:])), 2)
I would also advise against using lambda for anything but the simplest of functions, where it will only be used once and makes the surrounding code clearer by being inlined.
I feel this is fine because it describes what you want to do: take the binary representation of a number, reverse it, then get a number again. That makes this code very readable, and that should be a priority.
There is an entire half chapter of Hacker's Delight devoted to this issue (Section 7-1: Reversing Bits and Bytes) using binary operations, bit shifts, and other goodies. Seems like these are all possible in Python and it should be much quicker than the binary-to-string-and-reverse methods.
The book isn't available publicly but I found this blog post that discusses some of it. The method shown in the blog post follows the following quote from the book:
Bit reversal can be done quite efficiently by interchanging adjacent
single bits, then interchanging adjacent 2-bit fields, and so on, as
shown below. These five assignment statements can be executed in any
order.
http://blog.sacaluta.com/2011/02/hackers-delight-reversing-bits.html
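A sketch of the swapping technique described in the quote, adapted to Python 3 and assuming a fixed 32-bit width. The function names reverse_bits32 and reverse_significant_bits are chosen here for illustration, and the masks are the usual ones for swapping 1-, 2-, 4-, 8- and 16-bit fields; this is not copied from the book or the blog post. Because it reverses all 32 bits, a final shift is needed to match the question, which reverses only the significant bits.

```python
def reverse_bits32(x):
    """Reverse the order of the 32 low bits of x."""
    x &= 0xFFFFFFFF                                            # assume 32-bit input
    x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555)      # swap adjacent bits
    x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333)      # swap 2-bit fields
    x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F)      # swap nibbles
    x = ((x & 0x00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF)      # swap bytes
    x = ((x & 0x0000FFFF) << 16) | ((x >> 16) & 0x0000FFFF)    # swap 16-bit halves
    return x


def reverse_significant_bits(n):
    """Reverse only the bits up to the highest set bit, for 0 <= n < 2**32."""
    # After the 32-bit reversal the wanted bits sit at the top; shift them down.
    return reverse_bits32(n) >> (32 - n.bit_length())


print(reverse_bits32(1) == 0x80000000)
print(reverse_significant_bits(10))
```

Whether this beats the string-based versions in CPython is not a given, since each arithmetic step still goes through the interpreter; timing it with timeit on your own inputs is the only way to know.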
>>> def bit_rev(n):
...     return int(bin(n)[:1:-1], 2)
...
>>> bit_rev(2)
1
>>> bit_rev(10)
5
What if you wanted to reverse the binary value over a specific number of bits, e.g. treating 1 as the 8-bit value 0b00000001? In that case the reversed value would be 0b10000000, i.e. 128 (decimal) or 0x80 (hex).
def binary_reverse(num, bit_length):
    # Convert to binary and pad with 0s on the left
    bin_val = bin(num)[2:].zfill(bit_length)
    return int(''.join(reversed(bin_val)), 2)
    # Or, alternatively:
    # return int(bin_val[::-1], 2)
My Python code was doing something strange to me (or my numbers, rather):
a)
float(poverb.tangibles[1])*1000
1038277000.0
b)
float(poverb.tangibles[1]*1000)
inf
Which led to discovering that:
long(poverb.tangibles[1]*1000)
produces the largest number I've ever seen.
Uhhh, I didn't read the whole Python tutorial or its docs. Did I miss something critical about how float works?
EDIT:
>>> poverb.tangibles[1]
u'1038277'
What you probably missed is docs on how multiplication works on strings. Your tangibles list contains strings. tangibles[1] is a string. tangibles[1]*1000 is that string repeated 1000 times. Calling float or long on that string interprets it as a number, creating a huge number. If you instead do float(tangibles[1]), you only get the actual number, not the number repeated 1000 times.
What you are seeing is just the same as what goes on in this example:
>>> x = '1'
>>> x
'1'
>>> x*10
'1111111111'
>>> float(x)
1.0
>>> float(x*10)
1111111111.0
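The same effect with the value from the question, as a sketch for Python 3. Repeating the 7-character string 1000 times gives a 7000-digit number, which is far beyond what a float can hold, so float() returns inf. The variable names are chosen here for illustration.

```python
import math

text = u'1038277'                # what poverb.tangibles[1] holds, per the question

scaled = float(text) * 1000      # convert first, then multiply the number
repeated = text * 1000           # multiply first: the string is repeated 1000 times

print(scaled)                    # 1038277000.0
print(len(repeated))             # 7000
print(math.isinf(float(repeated)))
```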
This should be easy.
Here's my array (rather, a method of generating representative test arrays):
>>> ri = numpy.random.randint
>>> ri2 = lambda x: ''.join(ri(0,9,x).astype('S'))
>>> a = array([float(ri2(x)+ '.' + ri2(y)) for x,y in ri(1,10,(10,2))])
>>> a
array([ 7.99914000e+01, 2.08000000e+01, 3.94000000e+02,
4.66100000e+03, 5.00000000e+00, 1.72575100e+03,
3.91500000e+02, 1.90610000e+04, 1.16247000e+04,
3.53920000e+02])
I want a list of strings where '\n'.join(list_o_strings) would print:
79.9914
20.8
394.0
4661.0
5.0
1725.751
391.5
19061.0
11624.7
353.92
I want to space pad to the left and the right (but no more than necessary).
I want a zero after the decimal if that is all that is after the decimal.
I do not want scientific notation.
..and I do not want to lose any significant digits. (in 353.98000000000002 the 2 is not significant)
Yeah, it's nice to want..
Python 2.5's %g, %fx.x, etc. are either befuddling me, or can't do it.
I have not tried import decimal yet. I can't see that NumPy does it either (although array.__str__ and array.__repr__ are decimal aligned, they sometimes return scientific notation).
Oh, and speed counts. I'm dealing with big arrays here.
My current solution approaches are:
to str(a) and parse off NumPy's brackets
to str(e) each element in the array and split('.') then pad and reconstruct
to a.astype('S'+str(i)) where i is the max(len(str(a))), then pad
It seems like there should be some off-the-shelf solution out there... (but not required)
Top suggestion fails when dtype is float64:
>>> a
array([ 5.50056103e+02, 6.77383566e+03, 6.01001513e+05,
3.55425142e+08, 7.07254875e+05, 8.83174744e+02,
8.22320510e+01, 4.25076609e+08, 6.28662635e+07,
1.56503068e+02])
>>> ut0 = re.compile(r'(\d)0+$')
>>> thelist = [ut0.sub(r'\1', "%12f" % x) for x in a]
>>> print '\n'.join(thelist)
550.056103
6773.835663
601001.513
355425141.8471
707254.875038
883.174744
82.232051
425076608.7676
62866263.55
156.503068
Sorry, but after thorough investigation I can't find any way to perform the task you require without a minimum of post-processing (to strip off the trailing zeros you don't want to see); something like:
import re
ut0 = re.compile(r'(\d)0+$')
thelist = [ut0.sub(r'\1', "%12f" % x) for x in a]
print '\n'.join(thelist)
is speedy and concise, but breaks your constraint of being "off-the-shelf" -- it is, instead, a modular combination of general formatting (which almost does what you want but leaves trailing zeros you want to hide) and an RE to remove undesired trailing zeros. Practically, I think it does exactly what you require, but your conditions as stated are, I believe, over-constrained.
Edit: original question was edited to specify more significant digits, require no extra leading space beyond what's required for the largest number, and provide a new example (where my previous suggestion, above, doesn't match the desired output). The work of removing leading whitespace that's common to a bunch of strings is best performed with textwrap.dedent -- but that works on a single string (with newlines) while the required output is a list of strings. No problem, we'll just put the lines together, dedent them, and split them up again:
import re
import textwrap
# Same pattern as above: a digit followed by trailing zeros at the end
ut0 = re.compile(r'(\d)0+$')
a = [ 5.50056103e+02, 6.77383566e+03, 6.01001513e+05,
      3.55425142e+08, 7.07254875e+05, 8.83174744e+02,
      8.22320510e+01, 4.25076609e+08, 6.28662635e+07,
      1.56503068e+02]
thelist = textwrap.dedent(
    '\n'.join(ut0.sub(r'\1', "%20f" % x) for x in a)).splitlines()
print '\n'.join(thelist)
emits:
550.056103
6773.83566
601001.513
355425142.0
707254.875
883.174744
82.232051
425076609.0
62866263.5
156.503068
Python's string formatting can either print only the necessary decimals (with %g) or use a fixed number of decimals (with %f). However, you want to print only the necessary decimals, except that a whole number should get one decimal, and that makes it complex.
This means you would end up with something like:
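import math

def printarr(arr):
    for x in arr:
        if math.floor(x) == x:
            res = '%.1f' % x    # whole number: keep exactly one decimal
        else:
            res = '%.10g' % x   # otherwise: up to 10 significant digits
        # Pad on the left so the decimal point always lands in the same column
        print "%*s" % (15-res.find('.')+len(res), res)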
This will first create a string either with 1 decimal, if the value is a whole number, or with automatic decimals (but only up to 10 significant digits) if it is not a whole number. Lastly it will print it, padded so that the decimal points are aligned.
Probably, though, numpy actually does what you want, because you typically do want it to be in exponential mode if it's too long.
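For the list-of-strings form the question asks for, here is a minimal sketch in Python 3 that pads on both sides of the decimal point, but no more than necessary. The function name align_decimals and the digits parameter are made up for illustration. It assumes every value is small enough (and not so tiny) that '%g' with the chosen precision does not switch to scientific notation; values outside that range would need separate handling.

```python
def align_decimals(values, digits=10):
    """Return strings padded so that all decimal points line up.

    Assumes '%.{digits}g' never yields exponent notation for these values.
    """
    split = []
    for v in values:
        text = '%.*g' % (digits, v)            # drops insignificant trailing digits
        whole, _, frac = text.partition('.')
        split.append((whole, frac or '0'))     # whole numbers get a single '0'
    left = max(len(w) for w, _ in split)       # widest integer part
    right = max(len(f) for _, f in split)      # widest fractional part
    return ['%s.%s' % (w.rjust(left), f.ljust(right)) for w, f in split]


sample = [79.9914, 20.8, 394.0, 4661.0, 5.0,
          1725.751, 391.5, 19061.0, 11624.7, 353.92]
lines = align_decimals(sample)
print('\n'.join(lines))
```

This formats one value at a time in pure Python, so for very large arrays it may be too slow; it is meant to show the padding logic rather than to be the fastest option.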