Python Regex: Formatting use of commas, periods internationally

I'm storing currencies in a Decimal. From the client, I could be receiving strings in the following formats:
US$1,000.00
€1.000,00
So far, I've written:
re.sub(r'[^\d\.]', '', 'US$1,000.00')
which will return 1000.00 (formatted the way I'd like) for the first example and 1.00000 for the second (which I don't want).
What would be the best way to catch both decimals correctly?

You could try splitting and then gluing things back together:
import re
z = re.split("[,.]", re.sub(r"[^\d.,]", "", "$1,000.00"))
''.join(z[0:-2]) + ".".join(z[-2:]) # '1000.00'

You need to have a different expression for each currency. There are a lot of different currency rules and you will be in a world of hurt if you try to handle them all through a single regex. Maybe regex is the right solution here, maybe not.
Anyway, something like this would be OK:
money = "US$1,000.00"
decimal_rep = Decimal(0)
if money.startswith("US$"):
decimal_rep = Decimal(re.sub(r'[^\d\.]', '', money))
elif money.startswith("€"):
...
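For the euro branch, a minimal sketch of what the substitution could look like (parse_euro is just an illustrative helper, not part of the answer above):
# -*- coding: utf-8 -*-
from decimal import Decimal
import re

def parse_euro(money):
    # Keep only digits and the ',' decimal mark, then swap ',' for '.'
    # so Decimal can parse it.
    return Decimal(re.sub(r'[^\d,]', '', money).replace(',', '.'))

print parse_euro(u'€1.000,00')  # -> Decimal('1000.00')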

# -*- coding: cp1252 -*-
import re
text = '''US$1,000.00
US$3,000,000
€1.000,00
€4.000'''
print '%s\n-------------------' % text
pat = '([$€])[ \t]*[\d,.]+'
def ripl(mat, d=dict(('$,', '€.'))):
    # d maps each currency symbol to the thousands separator it uses
    return mat.group().replace(d[mat.group(1)], '')
print re.sub(pat, ripl, text)

I agree with Jordan: if there are more possible currency formats, then this is not the way to go.
However, if you know, that you'll only ever have these two formats, you can remove all non-digit characters except for periods and commas that are followed by nothing but digits:
output = re.sub(r'(?![.,]\d+$)\D', '', input)
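For example, applied to both sample inputs (a quick sketch; the trailing replace(',', '.') is an extra normalisation step, not part of the answer's regex):
# -*- coding: utf-8 -*-
import re
for s in ('US$1,000.00', u'€1.000,00'):
    cleaned = re.sub(r'(?![.,]\d+$)\D', '', s)   # -> '1000.00' and '1000,00'
    print cleaned.replace(',', '.')              # both print 1000.00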

I found a module that takes care of alot of the complexities in currency formatting (in particular with respect to periods, commas and a bunch more things). The package is called Babel, here is a link to the particular method(s) that could help: http://babel.edgewall.org/wiki/ApiDocs/babel.numbers#babel.numbers:parse_decimal
Docs:
http://babel.edgewall.org/wiki/ApiDocs/babel.numbers
Lots of other helpful internationalization utils in there.
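A quick sketch of what that looks like (assuming the en_US and de_DE locale data are available, as they normally are in a default Babel install):
from babel.numbers import parse_decimal
parse_decimal('1,000.00', locale='en_US')
# Decimal('1000.00')
parse_decimal('1.000,00', locale='de_DE')
# Decimal('1000.00')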

Related

Convert a big number string 2,345,678 into its value in int or float or anything that can be manipulated later in python [duplicate]

I have a string that represents a number which uses commas to separate thousands. How can I convert this to a number in python?
>>> int("1,000,000")
Generates a ValueError.
I could replace the commas with empty strings before I try to convert it, but that feels wrong somehow. Is there a better way?
For float values, see How can I convert a string with dot and comma into a float in Python, although the techniques are essentially the same.
import locale
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )
locale.atoi('1,000,000')
# 1000000
locale.atof('1,000,000.53')
# 1000000.53
There are several ways to parse numbers with thousands separators, and I doubt that the way described by @unutbu is the best in all cases; that's why I list other ways too.
The proper place to call setlocale() is in the __main__ module. It's a global setting and will affect the whole program and even C extensions (although note that the LC_NUMERIC setting is not set at the system level, but is emulated by Python). Read the caveats in the documentation and think twice before going this way. It's probably OK in a single application, but never use it in libraries intended for a wide audience. You should probably avoid requesting a locale with a particular charset encoding, since it might not be available on some systems.
Use one of the third-party libraries for internationalization. For example, PyICU allows using any available locale without affecting the whole process (and even parsing numbers with particular thousands separators without using locales):
from icu import Locale, NumberFormat  # the icu module comes with the PyICU package
NumberFormat.createInstance(Locale('en_US')).parse("1,000,000").getLong()
Write your own parsing function, if you don't want to install third-party libraries to do it the "right way". It can be as simple as int(data.replace(',', '')) when strict validation is not needed.
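If you do want a little strictness without a library, a hypothetical helper along these lines would do (the name and the exact pattern are just an example):
import re

def parse_grouped_int(s):
    # Accept only digits grouped in threes by commas, e.g. '1,000,000'.
    if not re.match(r'^\d{1,3}(?:,\d{3})*$', s):
        raise ValueError('not a comma-grouped integer: %r' % s)
    return int(s.replace(',', ''))

print parse_grouped_int('1,000,000')  # 1000000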
Replace the commas with empty strings, and turn the resulting string into an int or a float.
>>> a = '1,000,000'
>>> int(a.replace(',' , ''))
1000000
>>> float(a.replace(',' , ''))
1000000.0
I got a locale error from the accepted answer, but the following change works here in Finland (Windows XP):
import locale
locale.setlocale( locale.LC_ALL, 'english_USA' )
print locale.atoi('1,000,000')
# 1000000
print locale.atof('1,000,000.53')
# 1000000.53
This works (a dirty but quick way):
>>> a='-1,234,567,89.0123'
>>> "".join(a.split(","))
'-123456789.0123'
I tried this. It goes a bit beyond the question:
You get an input. It will be converted to a string first (if it is a list, for example from Beautiful Soup),
then to an int,
then to a float.
It goes as far as it can get. In the worst case, it returns everything unconverted as a string.
def to_normal(soupCell):
    ''' converts a html cell from beautiful soup to text, then to int, then to float: as far as it gets.
    US thousands separators are taken into account.
    needs import locale'''
    locale.setlocale(locale.LC_ALL, 'english_USA')
    output = unicode(soupCell.findAll(text=True)[0].string)
    try:
        return locale.atoi(output)
    except ValueError:
        try:
            return locale.atof(output)
        except ValueError:
            return output
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'en_US.UTF-8'
>>> print locale.atoi('1,000,000')
1000000
>>> print locale.atof('1,000,000.53')
1000000.53
This was done on Linux with a US locale.
A little late, but the babel library has parse_decimal and parse_number which do exactly what you want:
from babel.numbers import parse_decimal, parse_number
parse_decimal('10,3453', locale='es_ES')
# Decimal('10.3453')
parse_number('20.457', locale='es_ES')
# 20457
parse_decimal('10,3453', locale='es_MX')
# Decimal('103453')
You can also pass a Locale class instead of a string:
from babel import Locale
parse_decimal('10,3453', locale=Locale('es_MX'))
# Decimal('103453')
If you're using pandas and you're trying to parse a CSV that includes numbers with a comma for thousands separators, you can just pass the keyword argument thousands=',' like so:
df = pd.read_csv('your_file.csv', thousands=',')
Try this:
def changenum(data):
    foo = ""
    for i in list(data):
        if i == ",":
            continue
        else:
            foo += i
    return float(int(foo))

String slugification in Python

I am in search of the best way to "slugify" a string (a "slug" being a URL-friendly version of the string), and my current solution is based on this recipe.
I have changed it a little bit to:
import re
import unicodedata

s = u'String to slugify'
slug = unicodedata.normalize('NFKD', s)
slug = slug.encode('ascii', 'ignore').lower()
slug = re.sub(r'[^a-z0-9]+', '-', slug).strip('-')
slug = re.sub(r'[-]+', '-', slug)
Anyone see any problems with this code? It is working fine, but maybe I am missing something or you know a better way?
There is a python package named python-slugify, which does a pretty good job of slugifying:
pip install python-slugify
Works like this:
from slugify import slugify
txt = "This is a test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")
txt = "This -- is a ## test ---"
r = slugify(txt)
self.assertEquals(r, "this-is-a-test")
txt = 'C\'est déjà l\'été.'
r = slugify(txt)
self.assertEquals(r, "cest-deja-lete")
txt = 'Nín hǎo. Wǒ shì zhōng guó rén'
r = slugify(txt)
self.assertEquals(r, "nin-hao-wo-shi-zhong-guo-ren")
txt = 'Компьютер'
r = slugify(txt)
self.assertEquals(r, "kompiuter")
txt = 'jaja---lol-méméméoo--a'
r = slugify(txt)
self.assertEquals(r, "jaja-lol-mememeoo-a")
See More examples
This package does a bit more than what you posted (take a look at the source, it's just one file). The project is still active: it was updated two days before I originally answered, and over nine years later (last checked 2022-03-30) it still gets updated.
Careful: there is a second package around named slugify. If you have both of them, you might get a problem, as they have the same name for import. The one just named slugify didn't handle everything in my quick check: "Ich heiße" became "ich-heie" (it should be "ich-heisse"), so be sure to pick the right one when using pip or easy_install.
Install unidecode from here for Unicode support:
pip install unidecode
# -*- coding: utf-8 -*-
import re
import unidecode
def slugify(text):
    text = unidecode.unidecode(text).lower()
    return re.sub(r'[\W_]+', '-', text)
text = u"My custom хелло ворлд"
print slugify(text)
>>> my-custom-khello-vorld
There is a Python package named awesome-slugify:
pip install awesome-slugify
Works like this:
from slugify import slugify
slugify('one kožušček') # one-kozuscek
awesome-slugify github page
def slugify(value):
    """
    Converts to lowercase, removes non-word characters (alphanumerics and
    underscores) and converts spaces to hyphens. Also strips leading and
    trailing whitespace.
    """
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub('[^\w\s-]', '', value).strip().lower()
    return mark_safe(re.sub('[-\s]+', '-', value))
slugify = allow_lazy(slugify, six.text_type)
This is the slugify function present in django.utils.text. It should suffice for your requirement.
It works well in Django, so I don't see why it wouldn't be a good general purpose slugify function.
Are you having any problems with it?
The problem is with the ascii normalization line:
slug = unicodedata.normalize('NFKD', s)
This is called Unicode normalization, and it does not decompose many characters to ASCII. For example, it would strip non-ASCII characters from the following strings:
Mørdag -> mrdag
Æther -> ther
A better way to do it is to use the unidecode module that tries to transliterate strings to ascii. So if you replace the above line with:
import unidecode
slug = unidecode.unidecode(s)
You get better results for the above strings and for many Greek and Russian characters too:
Mørdag -> mordag
Æther -> aether
Unidecode is good; however, be careful: unidecode is GPL. If this license doesn't fit then use this one
A couple of options on GitHub:
https://github.com/dimka665/awesome-slugify
https://github.com/un33k/python-slugify
https://github.com/mozilla/unicode-slugify
Each supports slightly different parameters for its API, so you'll need to look through to figure out what you prefer.
In particular, pay attention to the different options they provide for dealing with non-ASCII characters. Pydanny wrote a very helpful blog post illustrating some of the unicode handling differences in these slugify'ing libraries: http://www.pydanny.com/awesome-slugify-human-readable-url-slugs-from-any-string.html This blog post is slightly outdated because Mozilla's unicode-slugify is no longer Django-specific.
Also note that currently awesome-slugify is GPLv3, though there's an open issue where the author says they'd prefer to release as MIT/BSD, just not sure of the legality: https://github.com/dimka665/awesome-slugify/issues/24
You might consider changing the last line to
slug = re.sub(r'--+', r'-', slug)
since the pattern [-]+ is no different than -+, and you don't really care about matching just one hyphen, only two or more.
But, of course, this is quite minor.
Another option is boltons.strutils.slugify. Boltons has quite a few other useful functions as well, and is distributed under a BSD license.
Going by your example, a quick way to do it could be:
s = 'String to slugify'
slug = s.replace(" ", "-").lower()
Another nice way of creating it could be this:
import re
st = 'String to slugify'
re.sub(r'\W+', '-', st).strip('-').lower()  # 'string-to-slugify'

Verify CSV against given format

I am expecting users to upload a CSV file of max size 1MB to a web form that should fit a given format similar to:
"<String>","<String>",<Int>,<Float>
That will be processed later. I would like to verify the file fits a specified format so that the program that shall later use the file doesn't receive unexpected input and that there are no security concerns (say, some injection attack against the parsing script that does some calculations and a DB insert).
(1) What would be the best way to go about doing this that would be fast and thorough? From what I've researched, I could go the path of regex or something more like this. I've looked at the Python csv module, but that doesn't appear to have any built-in verification.
(2) Assuming I go for a regex, can anyone direct me toward the best way to do this? Do I match for illegal characters and reject on that (e.g. no '/', '\', '<', '>', '{', '}', etc.), or match on all legal, e.g. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions, so pointers or examples would be appreciated.
EDIT:
Strings should contain no commas or quotes; they would just contain a name (i.e. first name, last name). And yes, I forgot to add that they would be double-quoted.
EDIT #2:
Thanks for all the answers. Cutplace is quite interesting, but it is a standalone tool. I decided to go with pyparsing in the end because it gives more flexibility should I add more formats.
Pyparsing will process this data, and will be tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (csv module is too, but regex solutions force you to add "\s*" bits all over the place).
from pyparsing import *
integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
integer + COMMA + floatnum + LineEnd()
tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()
for t in tests:
    print t
    try:
        print validLine.parseString(t).asList()
    except ParseException, pe:
        print pe.markInputline('?')
        print pe.msg
    print
Prints
"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]
"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]
bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes
"bad","good2",100,3
"bad","good2",100,?3
Expected float
"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","
You will probably be stripping those quotation marks off at some future time; pyparsing can do that at parse time by adding:
dblQuotedString.setParseAction(removeQuotes)
If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:
comment = '#' + restOfLine
validLine.ignore(comment)
You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):
validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
integer("qty") + COMMA + floatnum("price") + LineEnd()
And your post-processing code can then do this:
data = validLine.parseString(t)
print "%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data
print data.qty*data.price
I'd vote for parsing the file, checking you've got 4 components per record, that the first two components are strings, the third is an int (checking for NaN conditions), and the fourth is a float (also checking for NaN conditions).
Python would be an excellent tool for the job.
I'm not aware of any libraries in Python to deal with validation of CSV files against a spec, but it really shouldn't be too hard to write.
import csv
import math

def check(path):
    # the early returns need to live inside a function
    dataChecker = csv.reader(open(path))
    for row in dataChecker:
        if len(row) != 4:
            print 'Invalid row length.'
            return
        my_int = int(row[2])
        my_float = float(row[3])
        if math.isnan(my_int):
            print 'Bad int found'
            return
        if math.isnan(my_float):
            print 'Bad float found'
            return
    print 'All good!'

check('data.csv')
Here's a small snippet I made:
import csv
f = csv.reader(open("test.csv"))
for value in f:
    value[0] = str(value[0])
    value[1] = str(value[1])
    value[2] = int(value[2])
    value[3] = float(value[3])
If you run that with a file that doesn't have the format your specified, you'll get an exception:
$ python valid.py
Traceback (most recent call last):
File "valid.py", line 8, in <module>
i[2] = int(i[2])
ValueError: invalid literal for int() with base 10: 'a3'
You can then wrap that in a try/except that catches the ValueError and lets the users know what they did wrong.
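A rough sketch of that idea, building on the snippet above (the message format is just an example):
import csv

for line_no, value in enumerate(csv.reader(open("test.csv")), 1):
    try:
        name, title = str(value[0]), str(value[1])
        qty, price = int(value[2]), float(value[3])
    except (ValueError, IndexError):
        print 'Row %d does not match "<String>","<String>",<Int>,<Float>' % line_no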
There can be a lot of corner-cases for parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built-in to the language that you're using, even if it doesn't do all the "verification" you can think of.
Once you get there, then examine the fields for your list of "illegal" chars, or examine the values in each field to determine they're valid (if you can do so). You also don't even need a regex for this task necessarily, but it may be more concise to do it that way.
You might also disallow embedded \r or \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your csv lib.
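For the character check itself, something like this rough sketch would do (the blacklist and file name are just examples; tailor them to your data):
import csv

ILLEGAL = set('<>{}/\\\r\n\0\t')  # example blacklist

for row in csv.reader(open('upload.csv', 'rb')):
    for field in row:
        if set(field) & ILLEGAL:
            print 'Rejected field: %r' % field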
Try Cutplace. It verifies that tabular data conforms to an interface control document.
Ideally, you want your filtering to be as restrictive as possible - the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only what characters people would have a reason to input - without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.
Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.
Something like...
/"[a-zA-Z' -]+"/
...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.
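Putting those pieces together, a whole-line check could look roughly like this (the exact pattern is only a sketch following the name-style strings discussed above):
import re

LINE_RE = re.compile(r'^"[a-zA-Z\' -]+","[a-zA-Z\' -]+",-?\d+,-?\d+\.\d+$')

print bool(LINE_RE.match('"O\'Neil","Smith",100,3.14'))  # True
print bool(LINE_RE.match('"bad","data",100,abc'))        # False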

Munging non-printable characters to dots using string.translate()

So I've done this before and it's a surprisingly ugly bit of code for such a seemingly simple task.
The goal is to translate any non-printable character into a . (dot). For my purposes "printable" does exclude the last few characters from string.printable (new-lines, tabs, and so on). This is for printing things like the old MS-DOS debug "hex dump" format ... or anything similar to that (where additional whitespace will mangle the intended dump layout).
I know I can use string.translate() and, to use that, I need a translation table. So I use string.maketrans() for that. Here's the best I could come up with:
filter = string.maketrans(
    string.translate(string.maketrans('', ''),
                     string.maketrans('', ''), string.printable[:-5]),
    '.' * len(string.translate(string.maketrans('', ''),
                               string.maketrans('', ''), string.printable[:-5])))
... which is an unreadable mess (though it does work).
From there you can use something like:
for each_line in sometext:
    print string.translate(each_line, filter)
... and be happy. (So long as you don't look under the hood).
Now it is more readable if I break that horrid expression into separate statements:
ascii = string.maketrans('','') # The whole ASCII character set
nonprintable = string.translate(ascii, ascii, string.printable[:-5]) # Optional delchars argument
filter = string.maketrans(nonprintable, '.' * len(nonprintable))
And it's tempting to do that just for legibility.
However, I keep thinking there has to be a more elegant way to express this!
Here's another approach using a list comprehension:
filter = ''.join([['.', chr(x)][chr(x) in string.printable[:-5]] for x in xrange(256)])
Broadest use of "ascii" here, but you get the idea
>>> import string
>>> ascii="".join(map(chr,range(256)))
>>> filter="".join(('.',x)[x in string.printable[:-5]] for x in ascii)
>>> ascii.translate(filter)
'................................ !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~.................................................................................................................................'
If I were golfing, I'd probably use something like this:
filter='.'*32+"".join(map(chr,range(32,127)))+'.'*129
For actual code golf, I imagine you'd avoid string.maketrans entirely:
s = set(string.printable[:-5])
newstring = ''.join(x if x in s else '.' for x in oldstring)
or
newstring = re.sub('[^' + re.escape(string.printable[:-5]) + ']', '.', oldstring)
I don't find this solution ugly. It is certainly more efficient than any regex-based solution. Here is a slightly shorter version, but it only works in Python 2.6+:
nonprintable = string.maketrans('','').translate(None, string.printable[:-5])
filter = string.maketrans(nonprintable, '.' * len(nonprintable))

Sensible python source line wrapping for printout

I am working on a latex document that will require typesetting significant amounts of python source code. I'm using pygments (the python module, not the online demo) to encapsulate this python in latex, which works well except in the case of long individual lines - which simply continue off the page. I could manually wrap these lines except that this just doesn't seem that elegant a solution to me, and I prefer spending time puzzling about crazy automated solutions than on repetitive tasks.
What I would like is some way of processing the Python source code to wrap the lines to a certain maximum character length, while preserving functionality. I've had a play around with some Python and the closest I've come is inserting \\\n at the last whitespace before the maximum line length - but of course, if this ends up in strings and comments, things go wrong. Quite frankly, I'm not sure how to approach this problem.
So, is anyone aware of a module or tool that can process source code so that no lines exceed a certain length - or at least a good way to start to go about coding something like that?
You might want to extend your current approach a bit by using the tokenize module from the standard library to determine where to put your line breaks. That way you can see the actual tokens (COMMENT, STRING, etc.) of your source code rather than just the whitespace-separated words.
Here is a short example of what tokenize can do:
>>> from cStringIO import StringIO
>>> from tokenize import tokenize
>>>
>>> python_code = '''
... def foo(): # This is a comment
... print 'foo'
... '''
>>>
>>> fp = StringIO(python_code)
>>>
>>> tokenize(fp.readline)
1,0-1,1: NL '\n'
2,0-2,3: NAME 'def'
2,4-2,7: NAME 'foo'
2,7-2,8: OP '('
2,8-2,9: OP ')'
2,9-2,10: OP ':'
2,11-2,30: COMMENT '# This is a comment'
2,30-2,31: NEWLINE '\n'
3,0-3,4: INDENT ' '
3,4-3,9: NAME 'print'
3,10-3,15: STRING "'foo'"
3,15-3,16: NEWLINE '\n'
4,0-4,0: DEDENT ''
4,0-4,0: ENDMARKER ''
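As a rough sketch of how that could feed into line wrapping, you could collect the columns of operator tokens as candidate break points (the function name and the 79-column limit are just placeholders):
from cStringIO import StringIO
import tokenize

def break_candidates(line, limit=79):
    # Columns just after operator tokens within the limit: places where a
    # backslash-newline could go without landing inside a string or comment.
    cols = []
    tokens = tokenize.generate_tokens(StringIO(line + '\n').readline)
    for tok_type, tok_str, start, end, logical_line in tokens:
        if tok_type == tokenize.OP and end[1] <= limit:
            cols.append(end[1])
    return cols

print break_candidates("foo(bar, baz, 'a long, long string literal', qux)")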
I use the listings package in LaTeX to insert source code; it does syntax highlighting, line breaking, et al.
Put the following in your preamble:
\usepackage{listings}
%\lstloadlanguages{Python} % Load only these languages
\newcommand{\MyHookSign}{\hbox{\ensuremath\hookleftarrow}}
\lstset{
% Language
language=Python,
% Basic setup
%basicstyle=\footnotesize,
basicstyle=\scriptsize,
keywordstyle=\bfseries,
commentstyle=,
% Looks
frame=single,
% Linebreaks
breaklines,
prebreak={\space\MyHookSign},
% Line numbering
tabsize=4,
stepnumber=5,
numbers=left,
firstnumber=1,
%numberstyle=\scriptsize,
numberstyle=\tiny,
% Above and beyond ASCII!
extendedchars=true
}
The package has hooks for inline code, for including entire files, for showing code as figures, ...
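For instance, to pull an entire file into the document (the file name is just a placeholder):
% Include an external Python source file; the \lstset options above apply.
\lstinputlisting[caption={Example module}]{example.py}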
I'd check out the reformat tool in an editor like NetBeans.
When you reformat Java, it properly fixes the lengths of lines both inside and outside of comments; if the same algorithm were applied to Python, it would work.
For Java it allows you to set any wrapping width and a bunch of other parameters. I'd be pretty surprised if that didn't exist either natively or as a plugin.
Can't tell for sure just from the description, but it's worth a try:
http://www.netbeans.org/features/python/
