Reversing C-style format strings in Python (`%`) - python

Introduction and setup
Suppose I have a 'template'* string of the form,
>>> template = """My %(pet)s ate my %(object)s.
... This is a float: %(number)0.2f.
... %(integer)10d is an integer on a newline."""
With this template I can generate a new string with,
>>> d = dict(pet='dog', object='homework', number=7.7375487, integer=743898,)
>>> out_string = template % d
>>> print(out_string)
My dog ate my homework.
This is a float: 7.74.
    743898 is an integer on a newline.
How nice!
Question
I'd like to apply template to out_string to produce a new dict. Something like,
>>> d_approx_copy = reverse_cstyle_template(out_string, template)
>>> print(d_approx_copy)
{'pet': 'dog', 'object': 'homework', 'number': 7.74, 'integer': 743898}
Is there a Pythonic way to do this? Does an implementation already exist?**
Notes
*: I'm not using Template because, AFAIK, they don't currently support reversing.
**: I am aware of the risks associated with the loss of precision in number (from 7.7375487 to 7.74). I can deal with that. I'm just looking for a simple way to do this.

As I was developing this question, I could not find an existing tool to reverse C-style strings this way. That is, I think the answer to this question is: the reverse_cstyle_template function I was looking for does not currently exist.
In the process of researching this topic, I found many questions/answers similar to this one that use regular expressions (e.g. 1, 2, 3). However, I wanted something simpler and I did not want to have to use a different template string for formatting vs. parsing.
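For reference, the regex-based approach those answers use can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: reverse_cstyle_template is the hypothetical name from the question (not an existing function), and the template is assumed to contain only named %(name)s, %(name)d and %(name)f specifiers, each name used once, with no %% escapes.

```python
import re

# Matches named specifiers such as %(pet)s, %(number)0.2f, %(integer)10d.
_SPEC = re.compile(r"%\((\w+)\)([-+ #0]*\d*(?:\.\d+)?)([sdf])")

# Regex fragment and Python type for each supported conversion (an assumption:
# only s, d and f are handled here).
_PATTERNS = {
    "s": r".+?",
    "d": r"\s*[-+]?\d+",
    "f": r"\s*[-+]?\d+(?:\.\d+)?",
}
_CASTS = {"s": str, "d": int, "f": float}


def reverse_cstyle_template(out_string, template):
    """Recover a dict from out_string, assuming out_string == template % d."""
    casts = {}
    parts = []
    pos = 0
    for m in _SPEC.finditer(template):
        # Literal text between specifiers must match exactly.
        parts.append(re.escape(template[pos:m.start()]))
        name, conv = m.group(1), m.group(3)
        parts.append("(?P<%s>%s)" % (name, _PATTERNS[conv]))
        casts[name] = _CASTS[conv]
        pos = m.end()
    parts.append(re.escape(template[pos:]))

    match = re.fullmatch("".join(parts), out_string, flags=re.DOTALL)
    if match is None:
        raise ValueError("out_string does not match template")
    return {name: casts[name](text) for name, text in match.groupdict().items()}
```

The width and precision are parsed only so the specifier can be skipped; padding is absorbed by the leading \s* in the numeric patterns, and int() and float() ignore that whitespace.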
This eventually led me to format string syntax, and Richard Jones' parse package. For example the template above is written in format string syntax as,
>>> template = """My {pet} ate my {object}.
... This is a float: {number:0.2f}.
... {integer:10d} is an integer on a newline."""
With this template, one can use the built-in str.format to create a new string based on d,
>>> out_string = template.format(**d)
Then use the parse package to get d_approx_copy,
>>> from parse import parse
>>> d_approx_copy = parse(template, out_string).named
Note here that I've accessed the .named attribute. This is because parse returns a Result object (defined in parse) that captures both named and fixed format specifiers. For example if one uses,
>>> template = """My {pet} {}ate my {object}.
... This is a float: {number:0.2f}.
... {integer:10d} is an integer on a newline.
... Here is another 'fixed' input: {}"""
>>> out_string = template.format('spot ', 7, **d)
>>> print(out_string)
My dog spot ate my homework.
This is a float: 7.74.
    743898 is an integer on a newline.
Here is another 'fixed' input: 7
Then we can get the fixed and named data back by,
>>> data = parse(template, out_string)
>>> print(data.named)
{'pet': 'dog', 'integer': 743898, 'object': 'homework', 'number': 7.74}
>>> print(data.fixed)
('spot ', '7')
Cool, right?!
Hopefully someday this functionality will be included as a built-in either in str, or in Template. For now though parse works well for my purposes.
Lastly, I think it's important to re-emphasize the loss of precision that occurs through these steps when specifying precision in the format specifier (i.e. 7.7375487 becomes 7.74)! In general using the precision specifier is probably a bad idea except when creating 'readable' strings (e.g. for 'summary' file output) that are not meant for further processing (i.e. will never be parsed). This, of course, negates the point of this Q/A but needs to be mentioned here.
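To make the precision point concrete, here is a small illustration using only built-ins (the variable names are mine). It compares a fixed-precision round trip with a repr-based one, which Python 3 guarantees to round-trip floats exactly.

```python
value = 7.7375487

# Fixed precision: readable, but digits are lost on the way back.
rounded_text = "%0.2f" % value
rounded_back = float(rounded_text)

# %r uses repr(), the shortest string that converts back to the same float.
exact_text = "%r" % value
exact_back = float(exact_text)

print(rounded_text, rounded_back == value)
print(exact_text, exact_back == value)
```

So if a string is meant to be parsed again later, writing the value without a precision specifier (or with %r) avoids the problem.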

Related

How to search for a string like \x60\xe2\x4b(indicating a emoticon) using regular expression in python

import re
string="b'#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 '"
print(re.findall(r"\x[0-9a-z]{2}",string))
The list returned by the findall() function is empty :(
The problem here is that your string is the Python representation of a Python bytes object, which is pretty much useless.
Most likely, you had a bytes object, like this:
b = b'#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 '
… and you converted it to a string, like this:
s = str(b)
Don't do that. Instead, decode it:
s = b.decode('utf-8')
That will get you the actual characters, which you can then match easily, instead of trying to match the characters in the string representation of the bytes representation and then reconstructing the actual characters laboriously from the results.
However, it's worth noting that \xe2\x80\xa6 is not an emoji, it's a horizontal ellipsis character, …. If that isn't what you wanted, you already corrupted the data before this point.
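A short sketch of the difference, using a shortened version of the bytes from the question (the variable names are mine):

```python
import unicodedata

raw = b'already. A\xe2\x80\xa6 '

# str() on bytes gives the *representation*: the text b'...' with literal backslashes.
as_repr = str(raw)

# decode() gives the actual characters.
as_text = raw.decode('utf-8')

# List the non-ASCII characters and their Unicode names.
for ch in as_text:
    if ord(ch) > 127:
        print(hex(ord(ch)), unicodedata.name(ch))
```

The three bytes \xe2\x80\xa6 decode to the single character U+2026.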
Not a regexp per se, but might help you out any way.
def emojis(s):
    # U+1F600..U+1F64F is the Unicode "Emoticons" block; range() excludes its stop value.
    return [c for c in s if ord(c) in range(0x1F600, 0x1F650)]
print(emojis("hello world 😊")) # sample usage
You need to re.compile(ur'A\xe2\x80\xa6', re.UNICODE) (the ur prefix is Python 2 only).
Compile a Unicode regex and use that pattern for your find, findall, sub, etc. calls.
Try this. I joined the string in your question with that in your title to make the final search string
import re
k = r"#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 for a string like \x60\xe2\x4b(indicating a emoticon) using regular expression in python"
print(k)
print()
p = re.findall(r"((\\x[a-z0-9]{1,}){1,})", k)
for each in p:
    print(each[0])
Output
#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 for a string like \x60\xe2\x4b(indicating a emoticon) using regular expression in python
\xe2\x80\xa6
\x60\xe2\x4b

Convert a big number string 2,345,678 into its value in int or float or anything that can be manipulated later in python [duplicate]

I have a string that represents a number which uses commas to separate thousands. How can I convert this to a number in python?
>>> int("1,000,000")
Generates a ValueError.
I could replace the commas with empty strings before I try to convert it, but that feels wrong somehow. Is there a better way?
For float values, see How can I convert a string with dot and comma into a float in Python, although the techniques are essentially the same.
import locale
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )
locale.atoi('1,000,000')
# 1000000
locale.atof('1,000,000.53')
# 1000000.53
There are several ways to parse numbers with thousands separators, and I doubt that the way described by @unutbu is the best in all cases. That's why I list other ways too.
The proper place to call setlocale() is in the __main__ module. It's a global setting and will affect the whole program and even C extensions (although note that the LC_NUMERIC setting is not set at system level, but is emulated by Python). Read the caveats in the documentation and think twice before going this way. It's probably OK in a single application, but never use it in libraries for a wide audience. You should probably avoid requesting a locale with a particular charset encoding, since it might not be available on some systems.
Use one of the third-party libraries for internationalization. For example, PyICU allows using any available locale without affecting the whole process (and even parsing numbers with particular thousands separators without using locales):
NumberFormat.createInstance(Locale('en_US')).parse("1,000,000").getLong()
Write your own parsing function, if you don't want to install third-party libraries to do it the "right way". It can be as simple as int(data.replace(',', '')) when strict validation is not needed.
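If some validation is wanted, the hand-written option could look something like the sketch below. parse_grouped_int is a hypothetical helper name, and it assumes US-style grouping (commas between groups of exactly three digits) with no decimal part.

```python
import re

# Optional sign, 1-3 leading digits, then zero or more ",ddd" groups.
_GROUPED_INT = re.compile(r"[-+]?\d{1,3}(?:,\d{3})*")


def parse_grouped_int(text):
    """Parse '1,000,000' style integers, rejecting misplaced commas."""
    if not _GROUPED_INT.fullmatch(text):
        raise ValueError("not a comma-grouped integer: %r" % (text,))
    return int(text.replace(",", ""))


print(parse_grouped_int("1,000,000"))
```

Unlike a plain replace(), this rejects inputs such as "1,00,0" or "1000,000" instead of silently accepting them.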
Replace the commas with empty strings, and turn the resulting string into an int or a float.
>>> a = '1,000,000'
>>> int(a.replace(',' , ''))
1000000
>>> float(a.replace(',' , ''))
1000000.0
I got a locale error from the accepted answer, but the following change works here in Finland (Windows XP):
import locale
locale.setlocale( locale.LC_ALL, 'english_USA' )
print locale.atoi('1,000,000')
# 1000000
print locale.atof('1,000,000.53')
# 1000000.53
This works:
(A dirty but quick way)
>>> a='-1,234,567,89.0123'
>>> "".join(a.split(","))
'-123456789.0123'
I tried this. It goes a bit beyond the question:
You get an input. It will be converted to string first (if it is a list, for example from Beautiful soup);
then to int,
then to float.
It goes as far as it can get. In worst case, it returns everything unconverted as string.
def to_normal(soupCell):
    ''' converts a html cell from beautiful soup to text, then to int, then to float: as far as it gets.
    US thousands separators are taken into account.
    needs import locale'''
    locale.setlocale( locale.LC_ALL, 'english_USA' )
    output = unicode(soupCell.findAll(text=True)[0].string)
    try:
        return locale.atoi(output)
    except ValueError:
        try: return locale.atof(output)
        except ValueError:
            return output
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'en_US.UTF-8'
>>> print locale.atoi('1,000,000')
1000000
>>> print locale.atof('1,000,000.53')
1000000.53
This was done on Linux in the US.
A little late, but the babel library has parse_decimal and parse_number which do exactly what you want:
from babel.numbers import parse_decimal, parse_number
parse_decimal('10,3453', locale='es_ES')
>>> Decimal('10.3453')
parse_number('20.457', locale='es_ES')
>>> 20457
parse_decimal('10,3453', locale='es_MX')
>>> Decimal('103453')
You can also pass a Locale class instead of a string:
from babel import Locale
parse_decimal('10,3453', locale=Locale('es_MX'))
>>> Decimal('103453')
If you're using pandas and you're trying to parse a CSV that includes numbers with a comma for thousands separators, you can just pass the keyword argument thousands=',' like so:
df = pd.read_csv('your_file.csv', thousands=',')
Try this:
def changenum(data):
    foo = ""
    for i in list(data):
        if i == ",":
            continue
        else:
            foo += i
    return float(int(foo))

Are there other ways to format strings other than comma, percent, plus sign?

I've been looking around and I've been unable to find a definitive answer to this question: what's the recommended way to print variables in Python?
So far, I've seen three ways: using commas, using percent signs, or using plus signs:
>>> a = "hello"
>>> b = "world"
>>> print a, "to the", b
hello to the world
>>> print "%s to the %s" % (a, b)
hello to the world
>>> print a + " to the " + b
hello to the world
Each method seems to have its pros and cons.
Commas allow to write the variable directly and add spaces, as well as automatically perform a string conversion if needed. But I seem to remember that good coding practices say that it's best to separate your variables from your text.
Percent signs allow that, though they require to use a list when there's more than one variable, and you have to write the type of the variable (though it seems able to convert even if the variable type isn't the same, like trying to print a number with %s).
Plus signs seem to be the "worst" as they mix variables and text, and don't convert on the fly; though maybe it is necessary to have more control on your variable from time to time.
I've looked around and it seems some of those methods may be obsolete nowadays. Since they all seem to work and each have their pros and cons, I'm wondering: is there a recommended method, or do they all depend on the context?
Including the values from identifiers inside a string is called string formatting. You can handle string formatting in different ways with various pros and cons.
Using string concatenation (+)
Con: You must manually convert objects to strings
Pro: The objects appear where you want to place them in the string
Con: The final layout may not be clear due to breaking the string literal
Using template strings (i.e. $bash-style substitution):
Pro: You may be familiar with shell variable expansion
Pro: Conversion to string is done automatically
Pro: Final layout is clear.
Con: You cannot specify how to perform the conversion
Using %-style formatting:
Pro: similar to formatting with C's printf.
Pro: conversions are done for you
Pro: you can specify different type of conversions, with some options (e.g. precision for floats)
Pro: The final layout is clear
Pro: You can also specify the name of the elements to substitute as in: %(name)s.
Con: You cannot customize handling of format specifiers.
Con: There are some corner cases that can puzzle you. To avoid them you should always use either tuple or dict as argument.
Using str.format:
All the pros of %-style formatting (except that it is not similar to printf)
Similar to .NET String.Format
Pro: You can manually specify numbered fields which allows you to use a positional argument multiple times
Pro: More options in the format specifiers
Pro: You can customize the formatting specifiers in custom types
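As a quick side-by-side, the four approaches listed above produce the same text for the example from the question (a and b are the question's variables; the other names are mine):

```python
from string import Template

a = "hello"
b = "world"

# 1. Concatenation: non-string objects would need an explicit str().
concat = a + " to the " + b

# 2. Template strings: $-style substitution, no conversion options.
templated = Template("$a to the $b").substitute(a=a, b=b)

# 3. %-style: always pass a tuple (or dict) to avoid corner cases.
percent = "%s to the %s" % (a, b)

# 4. str.format: numbered fields can be reordered or repeated.
formatted = "{0} to the {1}".format(a, b)

print(concat, templated, percent, formatted, sep="\n")
```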
The commas do not do string formatting. They are part of the print statement syntax.
They have a softspace "feature" which is gone in python3 since print is a function now:
>>> print 'something\t', 'other'
something       other
>>> print 'something\tother'
something       other
Note how the above outputs are exactly equivalent even though the first one used a comma.
This is because the comma doesn't introduce whitespace in certain situations (e.g. right after a tab or a newline).
In Python 3 this doesn't happen:
>>> print('something\t', 'other')
something        other
>>> print('something\tother') # note the difference in spacing.
something       other
Since Python 2.6 the preferred way of doing string formatting is the str.format method. It was meant to replace %-style formatting, even though currently there are no plans (and I don't think there ever will be) to remove %-style formatting.
string.format() basics
Here are a couple of examples of basic string substitution; {} is the placeholder for the substituted variables. If no format is specified, the value is inserted and formatted as a string.
s1 = "so much depends upon {}".format("a red wheel barrow")
s2 = "glazed with {} water beside the {} chickens".format("rain", "white")
You can also refer to the variables by numeric position and reorder them in the string. This gives some flexibility when formatting: if you made a mistake in the order, you can easily correct it without shuffling all the variables around.
s1 = " {0} is better than {1} ".format("emacs", "vim")
s2 = " {1} is better than {0} ".format("emacs", "vim")
The format() method offers a fair number of additional features. Here are a few useful tips and tricks using .format():
Named Arguments
You can use the new string format as a templating engine and use named arguments, instead of requiring a strict order.
madlib = " I {verb} the {object} off the {place} ".format(verb="took", object="cheese", place="table")
>>> I took the cheese off the table
Reuse Same Variable Multiple Times
The % formatter requires a strict ordering of variables. The .format() method allows you to put them in any order, as we saw above in the basics, but also allows for reuse.
str = "Oh {0}, {0}! wherefore art thou {0}?".format("Romeo")
>>> Oh Romeo, Romeo! wherefore art thou Romeo?
Use Format as a Function
You can use .format as a function which allows for some separation of text and formatting from code. For example at the beginning of your program you could include all your formats and then use later. This also could be a nice way to handle internationalization which not only requires different text but often requires different formats for numbers.
email_f = "Your email address was {email}".format
print(email_f(email="bob@example.com"))
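Extending that idea to the internationalization case, one possible arrangement is a table of bound .format methods keyed by language. The MESSAGES table, the language keys and the notify helper below are all hypothetical and only illustrate the pattern.

```python
# Hypothetical message table: each value is a bound str.format method.
MESSAGES = {
    "en": "You have {count} new messages, {name}.".format,
    "de": "{name}, Sie haben {count} neue Nachrichten.".format,
}


def notify(lang, name, count):
    """Pick the format for lang and fill it in; field order is up to the text."""
    return MESSAGES[lang](name=name, count=count)


print(notify("en", "Bob", 3))
print(notify("de", "Bob", 3))
```

Because the fields are named, each translation can place them wherever its grammar needs them.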
Escaping Braces
If you need literal braces when using str.format(), just double them up:
print(" The {} set is often represented as {{0}} ".format("empty"))
>>> The empty set is often represented as {0}
The question is whether you want to print variables (case 1) or output formatted text (case 2). Case 1 is good and easy to use, mostly for debug output.
If you want to say something in a defined way, formatting is the better choice. '+' is not the Pythonic way of string manipulation.
An alternative to % is "{0} to the {1}".format(a,b) and is the preferred way of formatting since Python 3.
Depends a bit on which version.
Python 2 will be simply:
print 'string'
print 345
print 'string'+(str(345))
print ''
Python 3 requires parentheses (wish it didn't personally)
print ('string')
print (345)
print ('string'+(str(345)))
Also the most foolproof method to do it is to convert everything into a variable:
a = 'string'
b = 345
c = str(345)
d = a + c

Appending '0x' before the hex numbers in a string

I'm parsing a xml file in which I get basic expressions (like id*10+2). What I am trying to do is to evaluate the expression to actually get the value. To do so, I use the eval() method which works very well.
The only thing is the numbers are in fact hexadecimal numbers. The eval() method could work well if every hex number was prefixed with '0x', but I could not find a way to do it, neither could I find a similar question here. How would it be done in a clean way ?
Use the re module.
>>> import re
>>> re.sub(r'([\dA-F]+)', r'0x\1', 'id*A+2')
'id*0xA+0x2'
>>> eval(re.sub(r'([\dA-F]+)', r'0x\1', 'CAFE+BABE'))
99772
Be warned though, with an invalid input to eval, it won't work. There are also many risks of using eval.
If your hex numbers have lowercase letters, then you could use this:
>>> re.sub(r'(?<!i)([\da-fA-F]+)', r'0x\1', 'id*a+b')
'id*0xa+0xb'
This uses a negative lookbehind assertion to ensure that the letter i is not directly before the section it is trying to convert (preventing 'id' from turning into 'i0xd'). Replace i with I if the variable is Id.
If you can parse the expression into individual numbers, then I would suggest using the int function:
>>> int("CAFE", 16)
51966
Be careful with eval! Do not ever use it in untrusted inputs.
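One way to combine int(..., 16) with the regex idea is to convert each hex token to decimal text before anything is evaluated. This is only a sketch: hex_tokens_to_decimal is a hypothetical helper, and it assumes that no variable name consists solely of hex digits (a name like face or add would be converted too).

```python
import re

# A run of hex digits that stands alone as a word (so the d in "id" is skipped).
_HEX_TOKEN = re.compile(r"\b[0-9A-Fa-f]+\b")


def hex_tokens_to_decimal(expr):
    """Rewrite bare hex numbers in expr as decimal literals."""
    return _HEX_TOKEN.sub(lambda m: str(int(m.group(0), 16)), expr)


print(hex_tokens_to_decimal("id*10+2"))
print(hex_tokens_to_decimal("CAFE+BABE"))
```

The word boundaries take the place of the lookbehind used in the answers above, and the rewritten expression can then be handed to whatever evaluator you trust.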
If it's just simple arithmetic, I'd use a custom parser (there are tons of examples out in the wild)... And using parser generators (flex/bison, antlr, etc.) is a skill that is useful and easily forgotten, so it could be a good chance to refresh or learn it.
One option is to use the parser module:
import parser, token, re  # Python 2 era code: the parser module was removed in Python 3.10
def hexify(ast):
    if not isinstance(ast, list):
        return ast
    if ast[0] in (token.NAME, token.NUMBER) and re.match('[0-9a-fA-F]+$', ast[1]):
        return [token.NUMBER, '0x' + ast[1]]
    return map(hexify, ast)
def hexified_eval(expr, *args):
    ast = parser.sequence2st(hexify(parser.expr(expr).tolist()))
    return eval(ast.compile(), *args)
>>> hexified_eval('id*10 + BABE', {'id':0xcafe})
567466
This is somewhat cleaner than a regex solution in that it only attempts to replace tokens that have been positively identified as either names or numbers (and look like hex numbers). It also correctly handles more general python expressions such as id*10 + len('BABE') (it won't replace 'BABE' with '0xBABE').
OTOH, the regex solution is simpler and might cover all the cases you need to deal with anyway.

Is there an easy way to convert a string containing a string literal into the string it represents?

I'm trying to (slightly) improve a script that does a quick-and-hacky parse of some config files.
Upon recognising "an item" read from the file, I need to try to convert it into a simple python value. The value could be a number or a string.
To convert strings read from the file into Python numbers I can just use int or float and catch the ValueError if it wasn't actually a number. Is there something similar for Python strings? i.e.
s1 = 'Goodbye World. :('
s2 = repr(s1)
s3 = ' "not a string literal" '
s4 = s3.strip()
v1 = parse_string_literal(s1) # throws ValueError
v2 = parse_string_literal(s2) # returns 'Goodbye World. :('
v3 = parse_string_literal(s3) # throws ValueError
v4 = parse_string_literal(s4) # returns 'not a string literal'
In the file, string values are represented very similarly to Python string literals; they can be quoted with either ' or ", and could contain backslash escapes, etc. I could roll my own parser with regexes, but if there's something already existing I'd rather not re-invent the wheel.
I could use eval of course, but that's always somewhat dangerous.
... And sure enough, I just found the answer after I posted.
Even better than what I was looking for is ast.literal_eval: ast — Abstract Syntax Trees
It can evaluate any Python expression consisting solely of literals, which makes it safe. It also means I can recognise items from the config file that are potentially numbers or strings without having to attempt multiple conversions, falling back to the next conversion on a ValueError exception. I don't even have to figure out what type the item is.
It's even way more flexible than I need, which could be a problem if I cared about making sure the item was only a number or a string, but I don't:
>>> ast.literal_eval('{"foo": [23.8, 170, (1, 2, 3)]}')
{'foo': [23.8, 170, (1, 2, 3)]}
ast.literal_eval() handles all simple Python literals, and most compound literals.
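For the stricter behaviour described in the question, a thin wrapper around ast.literal_eval might look like the sketch below. parse_string_literal is the hypothetical name from the question. Rejecting surrounding whitespace is my assumption, made so that the s3 case raises regardless of Python version (newer versions of literal_eval strip leading spaces and tabs themselves).

```python
import ast


def parse_string_literal(text):
    """Return the str that text denotes, or raise ValueError."""
    # Assumption: a literal with surrounding whitespace counts as invalid.
    if text != text.strip():
        raise ValueError("surrounding whitespace: %r" % (text,))
    try:
        value = ast.literal_eval(text)
    except (SyntaxError, ValueError) as exc:
        raise ValueError("not a Python literal: %r" % (text,)) from exc
    # literal_eval also accepts numbers, lists, dicts, etc.; keep only strings.
    if not isinstance(value, str):
        raise ValueError("not a string literal: %r" % (text,))
    return value


print(parse_string_literal(repr('Goodbye World. :(')))
```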
