Questions concerning using regex in python - python

I am currently reading in a string that starts with a number and runs up to the next delimiter, and testing whether the string read in is a float. I have a few questions here: I believe my regex works; I just think I am not using the proper method when it tries to match.
My particular float will be in the format of
d+(.d+)?(E(+|-)?d+)?
r'(\d+(\.\d+)?([E][+|-]?\d+)?'
Above is the regular expression I'm using, and it is correct for the specification I have set up. My issue is that I will be reading in bad values, and I want to print an error that either flags the whole string as bad or prints the part that passed, followed by an error showing the incorrect part. When I try, I get this error:
print "ERROR: %s" % m.groups()
TypeError: not all arguments converted during string formatting
I feel like I am missing something simple but I cannot figure out what.
So in summary I am trying to use the above regular expression to compare a read in number string to see if it is in the float form. If the whole string conforms I want to print it and if there is a bad part I want to print the whole string as an error or print the good part follow by printing the bad part out with an error message.
p = re.compile(r'(\d+)(\.\d+)?(([E][+-])?\d+)?')
def is_float(str):
    m = p.match(str)
    if m:
        print(m.groups())
        return True
I have provided the piece of code I am working with perhaps there is an error there
Some sample inputs are:
3#33 //should print 3 then an error with #33 printed
3.435E-10 // should print the whole thing
0.45654 //should print the whole thing
4E-2 //should print the whole thing

m.groups() is a tuple, NOT a string. m.group(0) is the entire match, m.group(1) is the 1st set of capturing brackets in your regex, and so forth. The TypeError occurs because %s expects a single value, but m.groups() supplies several.
Try:
print(m.groups())
to see the different values at play.
First thing
You're missing a closing bracket. It should be:
(\d+(\.\d+)?([E][+|-]?\d+)?)
Note the extra bracket at the end, after the final ?.
I then tested it here:
https://regex101.com/r/jF1jX2/1 and it worked.
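To get the output the question asks for (print the whole string if it all matches, otherwise print the good prefix and flag the bad remainder), one approach is to compare how far the match reached against the length of the string. This is a sketch using the corrected pattern; the check_float name is my own:

```python
import re

# Corrected pattern, with the missing closing parenthesis no longer needed
# because we use group(0) for the whole match
p = re.compile(r'\d+(\.\d+)?(E[+-]?\d+)?')

def check_float(s):
    m = p.match(s)
    if m is None:
        print("ERROR: %s" % s)            # no leading number at all
    elif m.end() == len(s):
        print(s)                          # the whole string is a valid float
    else:
        print(m.group(0))                 # the part that passed
        print("ERROR: %s" % s[m.end():])  # the part that failed

check_float("3#33")       # prints 3, then ERROR: #33
check_float("3.435E-10")  # prints 3.435E-10
```

This handles all four sample inputs from the question: 3#33 splits into good and bad parts, while the other three match in full.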

I have to say, I'd not bother with a regex at all. Given a string that is supposed to represent a float, I'd do
def is_float(str):
    try:
        f = float(str)
        return True
    except ValueError:
        return False
(BTW, if the next step was going to be to convert an acceptable str to float, just put the try..except inline: use f if no exception is thrown, and do whatever is appropriate when the exception is caught.)
Also, there's a mistake in your regex: it doesn't handle a leading "-" for a negative number (or a "+" for a positive one). try...except handles anything that you can throw at Python, using Python's own rules.

Related

Why doesn't Python give any error when quotes around a string do not match?

I've started learning Python recently and I don't understand why Python behaves like this:
>>> "OK"
'OK'
>>> """OK"""
'OK'
>>> "not Ok'
File "<stdin>", line 1
"not Ok'
^
SyntaxError: EOL while scanning string literal
>>> "not OK"""
'not OK'
Why doesn't it give an error for the last statement as the number of quotes does not match?
The final """ is not recognized as a triple-quotation, but a single " (to close the current string literal) followed by an empty string ""; the two juxtaposed string literals are concatenated. The same behavior can be more readily recognized by putting a space between the closing and opening ".
>>> "not OK" ""
'not OK'
"not OK"""
Python interprets this as "not OK"+""
If you write "not Ok""ay", you will get the output 'not Okay'.
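The concatenation behaviour described above is easy to verify directly; adjacent string literals are joined at compile time, whether or not there is a space between them:

```python
a = "not OK" ""     # "not OK" followed by an empty string literal
b = "not Ok" "ay"   # two literals joined into one string
print(a)  # not OK
print(b)  # not Okay
```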
You would think that there is no difference between " and ', but in reality Python consumes input greedily: once it sees a matching quotation mark, the string literal ends.
That's why you can write something like "'s" "". Inside the string there is a ', but because you're inside a double-quoted string, Python doesn't raise an error. After that there is a " followed by another ", but that's a different (empty) string.
If you write something like "s' instead, Python keeps looking for the closing " before it runs your command.
Python uses something like a stack to detect quotes opening and closing. If you don't know what a stack is: it's a data structure in which the last element added is the first removed.
Assume your string is A = "''".
For the first occurrence of each single or double quote, it is pushed onto the stack; for the matching second occurrence, it is popped off, unless of course it is """, which is parsed as a single token.
In our example A = "''", iterating over it, the first 2 characters are pushed onto the stack and the next 2 pop them off.
So the quotes are matched if and only if the stack is empty at the end.

Storing int or str in the list

I created a text file and opened it in Python using:
for word_in_line in open("test.txt"):
To loop through the words in a line in txt file.
The text file only has one line, which is:
int 111 = 3 ;
When I make a list using .split():
print("Input: {}".format(word_in_line))
line_list = word_in_line.split()
It creates:
['int', '111', '=', '3', ';']
And I was looking for a way to check if line_list[1] ('111') is an integer.
But when I try type(line_list[1]), it says that it's str because of the quotes.
My goal is to read through the txt file and see if it is integer or str or other data type, etc.
What you have in your list is a string, so the type you see is correct and expected.
What you are looking to do is check to see if what you have are all digits in your string. So to do that use the isdigit string method:
line_list[1].isdigit()
Depending on what exactly you are trying to validate here, there are cases where all you want are purely digits, where this solution provides exactly that.
There could be other cases where you want to check whether you have some kind of number, for example 10.5. This is where isdigit will fail. For cases like that, you can take a look at this answer, which provides an approach to check whether you have a float.
I don't agree with the above answer.
Any string parsing like #idjaw's answer of line_list[1].isdigit() will fail on an odd edge case. For example, what if the number is a float like .50 that starts with a dot? That approach won't work. Technically we only care about ints in this example, so it won't matter here, but in general it is fragile.
In general if you are trying to check whether a string is a valid number, it is best to just try to convert the string to a number and then handle the error accordingly.
def isNumber(string):
    try:
        val = int(string)
        return True
    except ValueError:
        return False
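Extending that idea to the original goal of classifying each token from the split line, one sketch (the classify helper name is my own) is to try the conversions from most specific to least specific:

```python
def classify(token):
    """Return 'int', 'float', or 'str' for one token read from the file."""
    try:
        int(token)
        return 'int'
    except ValueError:
        pass
    try:
        float(token)
        return 'float'
    except ValueError:
        return 'str'

line_list = ['int', '111', '=', '3', ';']
print([classify(t) for t in line_list])  # ['str', 'int', 'str', 'int', 'str']
print(classify('.50'))  # float -- the edge case that isdigit() misses
```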

Python Regex Error--returns either Nonetype or wrong part of string

I couldn't quite find a similar question on here (or don't know Python well enough to figure it out from other questions), so here goes.
I'm trying to extract part of a string with re.search().start() (I've also tried end()), and that line either seems to find something (but a few spaces off) or it returns None, which is baffling me. For example:
def getlsuscore(line):
    print(line)
    start = re.search(' - [0-9]', line).start() + 2
    score = line[start:start+3]
    print(score)
    score = int(score.strip())
    return(score)
The two prints are in there for troubleshooting purposes. The first one prints out:
02:24 LSU 62 - 80 EDDLESTONE,BRANDON SUB IN. SHORTESS,HENRY SUB IN. ROBINSON III,ELBERT SUB OUT. QUARTERMAN,TIM SUB OUT.
Exactly as I expect it to. For the record, I'm trying to extract the 80 in that line and force it to an int. I've tried with various things in the regex match, always including the hyphen, and accordingly different numbers at the end to get me to the correct starting point, and I've tried playing with this in many other ways and still haven't got it to work. As for the print(score), I either get "AttributeError: 'NoneType' object has no attribute 'start'" when I have the start()+whatever correct, or if I change it to something wrong just to try it out, I get something like "ValueError: invalid literal for int() with base 10: '-'" or "ValueError: invalid literal for int() with base 10: '- 8'", with no addition or +1, respectively. So why when I put +2 or +3 at the end of start() does it give me an error? What am I messing up here?
Thanks for the help, I'm a noob at Python so if there's another/better way to do this that isn't regex, that works as well. I've just been using this exact same thing quite a bit on this project and had no problems, so I'm a bit stumped.
Edit: More code/context
def getprevlsuscore(file, time):
    realline = ''
    for line in file:
        line = line.rstrip()
        if line[0:4] == time:
            break
        if re.search('SUB IN', line):
            if not re.search('LSU', line[:9]):
                realline = line
    return(getlsuscore(realline))
It only throws the error when called in this block of code, and it's reading from a text file that has the play by play of a basketball game. Several hundred lines long, formatted like the line above, and it only throws an error towards the end of the file (I've tried on a couple different games).
The above function is called by this one:
def plusminus(file, list):
    for player in list:
        for line in file:
            line = line.rstrip()
            if not re.search('SUB IN', line):
                continue
            if not re.search('LSU', line):
                continue
            if not re.search(player.name, line):
                continue
            lsuscore = getlsuscore(line)
            previouslsuscore = getprevlsuscore(file, line[0:4])
            oppscore = getoppscore(line)
            previousoppscore = getprevoppscore(file, line[0:4])
            print(lsuscore)
            print(previouslsuscore)
            print(oppscore)
            print(previousoppscore)
Obviously not finished, the prints are to check the numbers. The scope of the project is that I'm trying to read a txt file copy/paste of a play by play and create a plus/minus for each player, showing the point differentials for the time they've played (e.g. if player X was in for 5 minutes, and his school scored 15 while the other school scored 5 in that time, he'd be +10).
I think a much easier way of getting the scores extracted from that string, no regex involved, is to use the split() method. That method will split the input string on any whitespace and return an array of the substrings.
def getlsuscore(line):
    # Example line: 02:24 LSU 62 - 80 ...
    splitResults = line.split()
    # Index 0 holds the time,
    # index 1 the team name,
    # index 2 the first score (still a string),
    # index 3 the separating dash,
    # index 4 the second score (still a string),
    # and further indexes hold everything else.
    firstScore = int(splitResults[2])
    secondScore = int(splitResults[4])
    print(firstScore, secondScore)
    return firstScore
You could try something like this:
m = re.search('(\\d+)\\s*-\\s*(\\d+)', line)
s1 = int(m.group(1))
s2 = int(m.group(2))
print(s1, s2)
This just looks for two numbers separated by a hyphen, then decodes them into s1 and s2, after which you can do what you like with them. In practice, you should check m to make sure it isn't None, which would indicate a failed search.
Use a group to extract the number instead of resorting to fiddling with the start index of the match:
>>> import re
>>> line="02:24 LSU 62 - 80 EDDLESTONE,BRANDON SUB IN. blah blah..."
>>> int(re.search(r' - (\d+)', line).group(1))
80
>>>
If you get an error like AttributeError: 'NoneType' object has no attribute 'group', that means the line you are working on doesn't have the " - (\d+)" sequence in it; for instance, maybe it's an empty line. You can catch the problem with a try/except block. Then you have to decide whether this is a bad error or not. If you are absolutely positive that all lines follow your rules, then maybe it's a fatal error and you should exit, warning the user that the data is bad. Or, if you are more lenient about the data, ignore the line and continue.
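A guarded version of that group-based extraction might look like this (a sketch; whether to skip the line or abort on a failed match depends on how clean your data is):

```python
import re

def get_score(line):
    m = re.search(r' - (\d+)', line)
    if m is None:
        return None  # no " - <number>" in this line, e.g. a blank line
    return int(m.group(1))

line = "02:24 LSU 62 - 80 EDDLESTONE,BRANDON SUB IN."
print(get_score(line))  # 80
print(get_score(""))    # None
```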

Import string that looks like a list "[0448521958, +61439800915]" from JSON into Python and make it an actual list?

I am extracting a string out of a JSON document using python that is being sent by an app in development. This question is similar to some other questions, but I'm having trouble just using x = ast.literal_eval('[0448521958, +61439800915]') due to the plus sign.
I'm trying to get each phone number as a string in a python list x, but I'm just not sure how to do it. I'm getting this error:
raise ValueError('malformed string')
ValueError: malformed string
Your problem is not just the +.
The first number starts with 0, which marks it as an octal literal; octal only supports the digits 0-7, but the number ends with 8 (and contains other digits outside that range as well).
But it turns out your problems don't stop there.
You can use a regex to fix the plus:
fixed_string = re.sub('\+(\d+)', '\\1', '[0445521757, +61439800915]')
ast.literal_eval(fixed_string)
I don't know what you can do about the octal-number problem, however.
I think the problem is that ast.literal_eval is trying to interpret the phone numbers as numbers instead of strings. Try this:
s = '[0448521958, +61439800915]'
s.strip('[]').split(', ')
Result:
['0448521958', '+61439800915']
Technically that string isn't valid JSON. If you want to ignore the +, you could strip it out of the file or string before you evaluate it. If you want to preserve it, you'll have to enclose the value with quotes.
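Putting the string-based approach together (a sketch, assuming the input always looks like a bracketed, comma-separated list; parse_number_list is a name of my own):

```python
def parse_number_list(raw):
    # Treat the contents as plain strings rather than numeric literals,
    # which sidesteps both the octal leading-zero and the "+" problems.
    return [part.strip() for part in raw.strip('[]').split(',')]

print(parse_number_list('[0448521958, +61439800915]'))
# ['0448521958', '+61439800915']
```

The strip() per element also makes it tolerant of inconsistent spacing after the commas.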

Verify CSV against given format

I am expecting users to upload a CSV file of max size 1MB to a web form that should fit a given format similar to:
"<String>","<String>",<Int>,<Float>
That will be processed later. I would like to verify that the file fits the specified format, so that the program that will later use the file doesn't receive unexpected input and so that there are no security concerns (say, an injection attack against the parsing script that does some calculations and a DB insert).
(1) What would be the best way to go about doing this that would be fast and thorough? From what I've researched, I could go the path of regex or something more like this. I've looked at the Python csv module, but that doesn't appear to have any built-in verification.
(2) Assuming I go for a regex, can anyone direct me towards the best way to do this? Do I match for illegal characters and reject on that (e.g. no '/', '\', '<', '>', '{', '}', etc.), or match on everything legal, e.g. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions, so pointers or examples would be appreciated.
EDIT:
Strings should contain no commas or quotes; each would just contain a name (i.e. first name, last name). And yes, I forgot to add that they would be double-quoted.
EDIT #2:
Thanks for all the answers. Cutplace is quite interesting, but it is a standalone tool. I decided to go with pyparsing in the end, because it gives more flexibility should I add more formats.
Pyparsing will process this data, and will be tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (the csv module is too, but regex solutions force you to add "\s*" bits all over the place).
from pyparsing import *
integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
    integer + COMMA + floatnum + LineEnd()
tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()
for t in tests:
    print t
    try:
        print validLine.parseString(t).asList()
    except ParseException, pe:
        print pe.markInputline('?')
        print pe.msg
    print
Prints
"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]
"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]
bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes
"bad","good2",100,3
"bad","good2",100,?3
Expected float
"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","
You will probably be stripping those quotation marks off at some future time, pyparsing can do that at parse time by adding:
dblQuotedString.setParseAction(removeQuotes)
If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:
comment = '#' + restOfline
validLine.ignore(comment)
You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):
validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
    integer("qty") + COMMA + floatnum("price") + LineEnd()
And your post-processing code can then do this:
data = validLine.parseString(t)
print "%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data
print data.qty*data.price
I'd vote for parsing the file, checking you've got 4 components per record, that the first two components are strings, the third is an int (checking for NaN conditions), and the fourth is a float (also checking for NaN conditions).
Python would be an excellent tool for the job.
I'm not aware of any libraries in Python to deal with validation of CSV files against a spec, but it really shouldn't be too hard to write.
import csv
import math

def check_csv(filename):  # wrapped in a function so the early returns are valid
    dataChecker = csv.reader(open(filename))
    for row in dataChecker:
        if len(row) != 4:
            print 'Invalid row length.'
            return
        my_int = int(row[2])      # raises ValueError on a bad int
        my_float = float(row[3])  # raises ValueError on a bad float
        if math.isnan(my_float):  # an int can never be NaN, so only the float needs this check
            print 'Bad float found'
            return
    print 'All good!'
Here's a small snippet I made:
import csv
f = csv.reader(open("test.csv"))
for value in f:
    value[0] = str(value[0])
    value[1] = str(value[1])
    value[2] = int(value[2])
    value[3] = float(value[3])
If you run that with a file that doesn't have the format your specified, you'll get an exception:
$ python valid.py
Traceback (most recent call last):
File "valid.py", line 8, in <module>
value[2] = int(value[2])
ValueError: invalid literal for int() with base 10: 'a3'
You can then make a try-except ValueError to catch it and let the users know what they did wrong.
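The try/except wrapper that last sentence suggests might look like this (a sketch; the validate_file name and the exact messages shown to users are my own choices):

```python
import csv

def validate_file(filename):
    """Return None if the file matches the format, else an error message."""
    with open(filename) as fh:
        for lineno, row in enumerate(csv.reader(fh), start=1):
            if len(row) != 4:
                return "line %d: expected 4 fields, got %d" % (lineno, len(row))
            try:
                int(row[2])    # third field must be an int
                float(row[3])  # fourth field must be a float
            except ValueError as e:
                return "line %d: %s" % (lineno, e)
    return None  # every row parsed cleanly
```

A None result means the file passed; anything else pinpoints the first bad line for the user.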
There can be a lot of corner-cases for parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built-in to the language that you're using, even if it doesn't do all the "verification" you can think of.
Once you get there, then examine the fields for your list of "illegal" chars, or examine the values in each field to determine they're valid (if you can do so). You also don't even need a regex for this task necessarily, but it may be more concise to do it that way.
You might also disallow embedded \r or \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your csv lib.
Try Cutplace. It verifies that tabular data conforms to an interface control document.
Ideally, you want your filtering to be as restrictive as possible - the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only what characters people would have a reason to input - without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.
Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.
Something like...
/"[a-zA-Z' -]+"/
...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.
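A full-line pattern combining those pieces for the "<String>","<String>",<Int>,<Float> format might look like this (a sketch; the name character class follows the suggestion above and may need widening for real-world data):

```python
import re

# Two double-quoted name fields, an integer, and a float, comma-separated
LINE_RE = re.compile(r'''^"[a-zA-Z' -]+","[a-zA-Z' -]+",-?\d+,-?\d+\.\d+$''')

def is_valid_line(line):
    return LINE_RE.match(line) is not None

print(is_valid_line('"John","O\'Neil",100,3.14'))   # True
print(is_valid_line('"John","Doe",100,notafloat'))  # False
```

The anchors ^ and $ ensure nothing can be smuggled in before or after the expected fields, which matters for the injection concerns raised in the question.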
