I am expecting users to upload a CSV file of max size 1MB to a web form that should fit a given format similar to:
"<String>","<String>",<Int>,<Float>
That will be processed later. I would like to verify the file fits a specified format so that the program that shall later use the file doesnt receive unexpected input and that there are no security concerns (say some injection attack against the parsing script that does some calculations and db insert).
(1) What would be the best way to go about doing this that would be fast and thorough? From what I've researched I could go the path of regex or something more like this. I've looked at the python csv module but that doesnt appear to have any built in verification.
(2) Assuming I go for a regex, can anyone direct me to towards the best way to do this? Do I match for illegal characters and reject on that? (eg. no '/' '\' '<' '>' '{' '}' etc.) or match on all legal eg. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions so pointers or examples would be appreciated.
EDIT:
Strings should contain no commas or quotes it would just contain a name (ie. first name, last name). And yes I forgot to add they would be double quoted.
EDIT #2:
Thanks for all the answers. Cutplace is quite interesting but is a standalone. Decided to go with pyparsing in the end because it gives more flexibility should I add more formats.
Pyparsing will process this data, and will be tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (csv module is too, but regex solutions force you to add "\s*" bits all over the place).
from pyparsing import *
integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
integer + COMMA + floatnum + LineEnd()
tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()
for t in tests:
print t
try:
print validLine.parseString(t).asList()
except ParseException, pe:
print pe.markInputline('?')
print pe.msg
print
Prints
"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]
"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]
bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes
"bad","good2",100,3
"bad","good2",100,?3
Expected float
"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","
You will probably be stripping those quotation marks off at some future time, pyparsing can do that at parse time by adding:
dblQuotedString.setParseAction(removeQuotes)
If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:
comment = '#' + restOfline
validLine.ignore(comment)
You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):
validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
integer("qty") + COMMA + floatnum("price") + LineEnd()
And your post-processing code can then do this:
data = validLine.parseString(t)
print "%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data
print data.qty*data.price
I'd vote for parsing the file, checking you've got 4 components per record, that the first two components are strings, the third is an int (checking for NaN conditions), and the fourth is a float (also checking for NaN conditions).
Python would be an excellent tool for the job.
I'm not aware of any libraries in Python to deal with validation of CSV files against a spec, but it really shouldn't be too hard to write.
import csv
import math
dataChecker = csv.reader(open('data.csv'))
for row in dataChecker:
if len(row) != 4:
print 'Invalid row length.'
return
my_int = int(row[2])
my_float = float(row[3])
if math.isnan(my_int):
print 'Bad int found'
return
if math.isnan(my_float):
print 'Bad float found'
return
print 'All good!'
Here's a small snippet I made:
import csv
f = csv.reader(open("test.csv"))
for value in f:
value[0] = str(value[0])
value[1] = str(value[1])
value[2] = int(value[2])
value[3] = float(value[3])
If you run that with a file that doesn't have the format your specified, you'll get an exception:
$ python valid.py
Traceback (most recent call last):
File "valid.py", line 8, in <module>
i[2] = int(i[2])
ValueError: invalid literal for int() with base 10: 'a3'
You can then make a try-except ValueError to catch it and let the users know what they did wrong.
There can be a lot of corner-cases for parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built-in to the language that you're using, even if it doesn't do all the "verification" you can think of.
Once you get there, then examine the fields for your list of "illegal" chars, or examine the values in each field to determine they're valid (if you can do so). You also don't even need a regex for this task necessarily, but it may be more concise to do it that way.
You might also disallow embedded \r or \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your csv lib.
Try Cutplace. It verifies that tabluar data conforms to an interface control document.
Ideally, you want your filtering to be as restrictive as possible - the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only what characters people would have a reason to input - without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.
Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.
Something like...
/"[a-zA-Z' -]+"/
...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.
Related
I stream data via Server Send Event and get about 500.000 datasets but instead of getting one json I get this (example of 2 of the 500.000 datasets)(this is how it looks like opening it in gedit, all question marks are \" and all new lines are \n):
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
... -
My goal is to get this into a database. I actually thought I put this into a dictionary and afterwards create a pandas dataframe from here on I should be able to get it into a database. But this ends up to be quite cumbersome. I ended up with something like this:
c1 = data_json[1:-1]
c2 = c1.replace('{data:{', '{\"data\":{')
c3 = c2.replace('}data:{', ', ')
c4 = '{' + c3 + '}'
but even here I have some problems since I have to add /n/n for the new lines. But as soon as I change c3 to c2.replace('}\n\ndata:{', ', ') I get Process finished with exit code 137 (interrupted by signal 9: SIGKILL). Coming from .NET I could handle this quite easy with a deserializer and I am wondering if there is a similar way to deserialize the data.
I get the data via sseclient and would be able to store them as bytes instead of string, if this would help, just fyi.
Any suggestions?
Juggling with replaces is of course a convoluted path -
the language does have the parsers for this kind of escaping built in -
the simpler of which would be passing the string that contains JSON through an eval call. But eval is seldom needed and should be avoided in most cases as "not elegant" - if not outright unsafe (but being unsafe actually just applies when you have no control over the input data - and even them, ast.literal_eval instead of plain eval can mitigate that). Anyway, there are other problems with the format that will prevent eval to work outright - the missing quotes of the outmost data:, for example.
Random rants apart, if your file content is actually:
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
It has two problems: "under-quoting' of the outmost data and an
"over-scaping" of the inner-data.
On an interactive Python session, using the "raw string" marker I can input your example line as it will be read from a file:
In [263]: a = r"""data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n"""
In [264]: print(a)
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
So, on to remove one level of backslashes - Python have an "unicode_escape" text encoding, but it only works from bytes-objects. We then resort to the "latin1" encoding, as it provides a byte-for-byte conversion of the unicode literal in "a" to bytes, and then apply an unicode_escape to remove the "\" :
In [266]: b = a.encode("latin1").decode("unicode_escape")
In [267]: print(b, "\n", repr(b))
data:{"data":["Kendrick","Lamar"]}
data:{"data":["David","Bowie"]}
'data:{"data":["Kendrick","Lamar"]}\n\ndata:{"data":["David","Bowie"]}\n\n'
now it is easy to parse:
We split the resulting string at "\n\n" and have one list with one record
(those you are calling "dataset") per element. Then we resort to string
manipulation to get rid of the starting "data:" and finally, json.load can work on the remaining part.
so:
import json
raw_data = open("mystrangefile.pseudo_json").read()
data = data.encode("latin1").decode("unicode_escape")
records = [json.loads(record.split(":", 1)[-1]) for record in data.split("\n\n")]
And "records" now should contain well behaved Python objects dictionaries, you can put in a database. (Unless Pandas can provide automatic mapping of the columns to a databas, it seems to be an uneeded step - a raw connection.executemany(""" INSERT ...""", records) with a proper open DB connection should suffice.
Also, on a sidenote you mentioned that you could handle this easily with a .NET deserializer: that is only if your files are not as broken as you have shown us - no possible standard serializer could know how to handle such an specific data format out of the box. But, if you actually is that more proeficient in another language/technology to do that, you could resort to write just a converter from the broken input to a properly encoded file, and use that as an intermediate step.
I'm not completely sure if I understood the format in which you get the string correctly, so please correct me if I'm wrong here:
data_json = 'data:{\\"data\\":[\\"Kendrick\\",\\"Lamar\\"]}\\n\\ndata:{\\"data\\":[\\"David\\",\\"Bowie\\"]}\\n\\n'
Your first line seems to strip the first and last character, which I don't see. Are there any additional characters you are stripping away here?
The two following substring replacements seem to have no effect as the substrings are not present in the initial string (if I got it correctly in the first place).
And finally in the last line you are wrapping your result with { and } which is not correct for lists in json. It should be [...]
I can't really tell why you would get a SIGKILL here, though. It does not throw any errors for me, it just does not do what you want it to do. Maybe you're running out of memory with all the 500k examples?
However, this would be a working solution (again, given that I got the initial string correctly):
c1 = data_json.replace('\\n\\n', '') # removing escaped newlines
c2 = c1.replace('data:', ',') # replacing the additional 'data:' with json delimiter ','
c3 = c2.replace('\\', '') # removing artificial escapes
c4 = c3[1:-1] # removing leading ',' (introduced in c2) and trailing newline
c5 = '[' + c4 + ']' # wrapping as list
Now you should be able to json.loads(c5) or whatever you need to do with that string.
I'm looking for a way to optimize an algorithm that I have already developed. As the title of my question says, I am dealing with comma delimited strings that will sometimes contain any number of embedded commas. This is all being done in the context of big data so speed is important. What I have here does everything I need it to, however, I have to believe there would be a faster way of doing it. If you have any suggestions I would love to hear them. Thank you in advance.
code:
import os,re
commaProblemA=re.compile('^"[\s\w\-()/*.#!#%^\'&$\{\}|<>:0-9]+$')
commaProblemB=re.compile('^[\s\w\-()/*.#!#%^\'&$\{\}|<>:0-9]*"$')
#example string
#these are read from a file in practice
z=',,"N/A","DWIGHT\'s BEET FARM,INC.","CAMUS,ALBERT",35.00,0.00,"NIETZSCHE,FRIEDRICH","God, I hope this works, fast.",,,35.00,,,"",,,,,,,,,,,"20,4,2,3,2,33","223,2,3,,34 00:00:00:000000",,,,,,,,,,,,0,,,,,,"ERW-400",,,,,,,,,,,,,,,1,,,,,,,"BLA",,"IGE6560",,,,'
testList=z.split(',')
for i in testList:
if re.match(commaProblemA,i):
startingIndex=testList.index(i)
endingIndex=testList.index(i)
count=0
while True:
endingIndex+=1
if re.match(commaProblemB,testList[endingIndex]):
diff=endingIndex-startingIndex
while count<diff:
testList[startingIndex]=(testList[startingIndex]+","+testList[startingIndex+1])
testList.pop(startingIndex+1)
count+=1
break
print(str(lineList))
print(len(lineList))
If you really want to do this yourself instead of using a library, first some tips:
don't use split() on csv data. (also bad for performance)
for performance: don't use regEx.
The regular way to scan the data would be like this (pseudo code, assuming single line csv):
for each line
bool insideQuotes = false;
while not end of line {
if currentChar == '"'
insideQuotes = !insideQuotes; // ( ! meaning 'not')
// this also handles the case of escaped quotes inside the field
// (if escaped with an extra quote)
else if currentChar == ',' and !insideQuotes
// seperator found - handle field
}
For even better performance you could open the file in binary mode and handle the newlines yourself while scanning. This way you don't need to scan for a line, copy it in a buffer (for example with getline() or a similar function) and then scan that buffer again to extract the fields.
This topic is related to the Parsing a CS:GO script file in Python theme, but there is another problem.
I'm working on a content from CS:GO and now i'm trying to make a python tool importing all data from from /scripts/ folder into Python dictionaries.
The next step after parsing data is parsing Language resource file from /resources and making relations between dictionaries and language.
There is an original file for Eng localization:
https://github.com/spec45as/PySteamBot/blob/master/csgo_english.txt
The file format is similar to the previous task, but I have faced with another problems. All language files is in UTF-16-LE encode, i couldn't understand the way of working with encoded files and strings in Python (I'm mostly working with Java)
I have tried to make some solutions, based on open(fileName, encoding='utf-16-le').read(), but i don't know how to work with such encoded strings in pyparsing.
pyparsing.ParseException: Expected quoted string, starting with "
ending with " (at char 0), (line:1, col:1)
Another problem is lines with \"-like expressions, for example:
"musickit_midnightriders_01_desc" "\"HAPPY HOLIDAYS, ****ERS!\"\n -Midnight Riders"
How to parse these symbols if I want to leave these lines as they are?
There are a few new wrinkles to this input file that were not in the original CS:GO example:
embedded \" escaped quotes in some of the value strings
some of the quoted value strings span multiple lines
some of the values end with a trailing environment condition (such as [$WIN32], [$OSX])
embedded comments in the file, marked with '//'
The first two are addressed by modifying the definition of value_qs. Since values are now more fully-featured than keys, I decided to use separate QuotedString definitions for them:
key_qs = QuotedString('"').setName("key_qs")
value_qs = QuotedString('"', escChar='\\', multiline=True).setName("value_qs")
The third requires a bit more refactoring. The use of these qualifying conditions is similar to #IFDEF macros in C - they enable/disable the definition only if the environment matches the condition. Some of these conditions were even boolean expressions:
[!$PS3]
[$WIN32||$X360||$OSX]
[!$X360&&!$PS3]
This could lead to duplicate keys in the definition file, such as in these lines:
"Menu_Dlg_Leaderboards_Lost_Connection" "You must be connected to Xbox LIVE to view Leaderboards. Please check your connection and try again." [$X360]
"Menu_Dlg_Leaderboards_Lost_Connection" "You must be connected to PlayStation®Network and Steam to view Leaderboards. Please check your connection and try again." [$PS3]
"Menu_Dlg_Leaderboards_Lost_Connection" "You must be connected to Steam to view Leaderboards. Please check your connection and try again."
which contain 3 definitions for the key "Menu_Dlg_Leaderboards_Lost_Connection", depending on what environment values were set.
In order to not lose these values when parsing the file, I chose to modify the key at parse time by appending the condition if one was present. This code implements the change:
LBRACK,RBRACK = map(Suppress, "[]")
qualExpr = Word(alphanums+'$!&|')
qualExprCondition = LBRACK + qualExpr + RBRACK
key_value = Group(key_qs + value + Optional(qualExprCondition("qual")))
def addQualifierToKey(tokens):
tt = tokens[0]
if 'qual' in tt:
tt[0] += '/' + tt.pop(-1)
key_value.setParseAction(addQualifierToKey)
So that in the sample above, you would get 3 keys:
Menu_Dlg_Leaderboards_Lost_Connection/$X360
Menu_Dlg_Leaderboards_Lost_Connection/$PS3
Menu_Dlg_Leaderboards_Lost_Connection
Lastly, the handling of comments, probably the easiest. Pyparsing has built-in support for skipping over comments, just like whitespace. You just need to define the expression for the comment, and have the top-level parser ignore it. To support this feature, several common comment forms are pre-defined in pyparsing. In this case, the solution is just to change the final parser defintion to:
parser.ignore(dblSlashComment)
And LASTLY lastly, there is a minor bug in the implementation of QuotedString, in which standard whitespace string literals like \t and \n are not handled, and are just treated as an unnecessarily-escaped 't' or 'n'. So for now, when this line is parsed:
"SFUI_SteamOverlay_Text" "This feature requires Steam Community In-Game to be enabled.\n\nYou might need to restart the game after you enable this feature in Steam:\nSteam -> File -> Settings -> In-Game: Enable Steam Community In-Game\n" [$WIN32]
For the value string you just get:
This feature requires Steam Community In-Game to be enabled.nnYou
might need to restart the game after you enable this feature in
Steam:nSteam -> File -> Settings -> In-Game: Enable Steam Community
In-Gamen
instead of:
This feature requires Steam Community In-Game to be enabled.
You might need to restart the game after you enable this feature in Steam:
Steam -> File -> Settings -> In-Game: Enable Steam Community In-Game
I will have to fix this behavior in the next release of pyparsing.
Here is the final parser code:
from pyparsing import (Suppress, QuotedString, Forward, Group, Dict,
ZeroOrMore, Word, alphanums, Optional, dblSlashComment)
LBRACE,RBRACE = map(Suppress, "{}")
key_qs = QuotedString('"').setName("key_qs")
value_qs = QuotedString('"', escChar='\\', multiline=True).setName("value_qs")
# use this code to convert integer values to ints at parse time
def convert_integers(tokens):
if tokens[0].isdigit():
tokens[0] = int(tokens[0])
value_qs.setParseAction(convert_integers)
LBRACK,RBRACK = map(Suppress, "[]")
qualExpr = Word(alphanums+'$!&|')
qualExprCondition = LBRACK + qualExpr + RBRACK
value = Forward()
key_value = Group(key_qs + value + Optional(qualExprCondition("qual")))
def addQualifierToKey(tokens):
tt = tokens[0]
if 'qual' in tt:
tt[0] += '/' + tt.pop(-1)
key_value.setParseAction(addQualifierToKey)
struct = (LBRACE + Dict(ZeroOrMore(key_value)) + RBRACE).setName("struct")
value <<= (value_qs | struct)
parser = Dict(key_value)
parser.ignore(dblSlashComment)
sample = open('cs_go_sample2.txt').read()
config = parser.parseString(sample)
print (config.keys())
for k in config.lang.keys():
print ('- ' + k)
#~ config.lang.pprint()
print (config.lang.Tokens.StickerKit_comm01_burn_them_all)
print (config.lang.Tokens['SFUI_SteamOverlay_Text/$WIN32'])
While going through LPTHW, I've set on reading the code here:
https://github.com/BrechtDeMan/secretsanta/blob/master/pairing.py
I've been trying to understand why the output CSV has double-quotes. There are several questions here about the problem but I'm not groking.
Where are the quotes getting introduced?
Edit: I wrote the author a couple of weeks back but haven't heard back.
Edit 2: An example of the output...
"Alice,101,alice#mail.org,Wendy,204,wendy#mail.org"
Double quotes are introduced in write_file function.
CSV files look simple on the surface, but sooner or later you will encounter some more complex problems. The first one is: what should happen if character denoting delimiter occurs in field content? Because there is no real standard for CSV format, different people had different ideas of correct answer for this question.
Python csv library tries to abstract this complexity and various approaches and make it easier to read and write CSV files following different rules. This is done by Dialect class objects.
The author of write_file function decided to construct output row manually by joining all fields and delimiter characters together, but then used csv module to actually write data into file:
writer.writerow([givers_list[ind][1] + ',' + givers_list[ind][2]
+ ',' + givers_list[ind][3]
+ ',' + givers_list[rand_vec[ind]][1] + ','
+ givers_list[rand_vec[ind]][2] + ',' + givers_list[rand_vec[ind]][3]])
This inconsistent usage of csv module resulted in entire row of data being treated as single field. Because that field contains characters used as field delimiters, Dialect.quoting decides how it should be handled. Default quoting configuration, csv.QUOTE_MINIMAL says that field should be quoted using Dialect.quotechar - which defaults to double quote character ("). That's why eventually entire field ends up surrounded by double quote characters.
Fast and easy, but not correct, solution would be changing quoting algorithm to csv.QUOTE_NONE. This will tell writer object to never surround fields, but instead to escape special characters by Dialect.escapechar. According to documentation, setting it to None (default) will raise an error. I guess that setting it to empty string could do the job.
The correct solution is feeding writer.writerrow with expected input data - list of fields. This should do (untested):
writer.writerow([givers_list[ind][1], givers_list[ind][2],
givers_list[ind][3],
givers_list[rand_vec[ind]][1],
givers_list[rand_vec[ind]][2], givers_list[rand_vec[ind]][3]])
In general, (double)quotes are needed when there is a seperator-char inside a field - and if there are quotes inside that field, they need to be 'escaped' with another quote.
Do you have an example of the output and the quotes you are talking about?
Edit (after example):
Ok, the whole row is treated as one field here. As Miroslaw Zalewski mentioned, those values should be treated as seperate fields instead of one long string.
I want to know how to allow multiple inputs in Python.
Ex: If a message is "!comment postid customcomment"
I want to be able to take that post ID, put that somewhere, and then the customcomment, and put that somewhere else.
Here's my code:
import fb
token="access_token_here"
facebook=fb.graph.api(token)
#__________ Later on in the code: __________
elif msg.startswith('!comment '):
postid = msg.replace('!comment ','',1)
send('Commenting...')
facebook.publish(cat="comments", id=postid, message="customcomment")
send('Commented!')
I can't seem to figure it out.
Thank you in advanced.
I can't quite tell what you are asking but it seems that this will do what you want.
Assuming that
msg = "!comment postid customcomment"
you can use the built-in string method split to turn the string into a list of strings, using " " as a separator and a maximum number of splits of 2:
msg_list=msg.split(" ",2)
the zeroth index will contain "!comment" so you can ignore it
postid=msg_list[1] or postid=int(msg_list[1]) if you need a numerical input
message = msg_list[2]
If you don't limit split and just use the default behavior (ie msg_list=msg.split()), you would have to rejoin the rest of the strings separated by spaces. To do so you can use the built-in string method join which does just that:
message=" ".join(msg_list[2:])
and finally
facebook.publish(cat="comments", id=postid, message=message)