While going through LPTHW, I've settled on reading the code here:
https://github.com/BrechtDeMan/secretsanta/blob/master/pairing.py
I've been trying to understand why the output CSV has double quotes. There are several questions here about the problem, but I'm not grokking it.
Where are the quotes getting introduced?
Edit: I wrote the author a couple of weeks back but haven't heard back.
Edit 2: An example of the output...
"Alice,101,alice#mail.org,Wendy,204,wendy#mail.org"
The double quotes are introduced in the write_file function.
CSV files look simple on the surface, but sooner or later you run into more complex problems. The first one is: what should happen if the delimiter character occurs inside a field's content? Because there is no real standard for the CSV format, different people have had different ideas about the correct answer to this question.
Python's csv library tries to abstract away this complexity and the various approaches, making it easier to read and write CSV files that follow different rules. This is done through Dialect class objects.
The author of the write_file function decided to construct the output row manually by joining all fields and delimiter characters together, but then used the csv module to actually write the data to the file:
writer.writerow([givers_list[ind][1] + ',' + givers_list[ind][2]
+ ',' + givers_list[ind][3]
+ ',' + givers_list[rand_vec[ind]][1] + ','
+ givers_list[rand_vec[ind]][2] + ',' + givers_list[rand_vec[ind]][3]])
This inconsistent use of the csv module resulted in the entire row of data being treated as a single field. Because that field contains the character used as the field delimiter, Dialect.quoting decides how it should be handled. The default quoting configuration, csv.QUOTE_MINIMAL, says that such a field should be quoted using Dialect.quotechar, which defaults to the double-quote character ("). That is why the entire field ends up surrounded by double quotes.
A fast and easy, but not correct, solution would be to change the quoting mode to csv.QUOTE_NONE. This tells the writer object never to quote fields, and instead to escape special characters with Dialect.escapechar. According to the documentation, leaving it at None (the default) will raise an error. I guess that setting it to an empty string could do the job.
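To see what QUOTE_NONE actually does, here is a minimal sketch in Python 3. It writes to an in-memory buffer rather than the script's output file, and the backslash escapechar is my own choice for illustration. Without an escapechar the writer refuses to write a field that contains the delimiter; with one, the comma is escaped rather than quoted.

```python
import csv
import io

row = ["Alice,101"]  # a single field that contains the delimiter

# QUOTE_NONE with no escapechar: the writer cannot represent the comma.
try:
    csv.writer(io.StringIO(), quoting=csv.QUOTE_NONE).writerow(row)
    failed = False
except csv.Error:
    failed = True

# QUOTE_NONE with an escapechar: the comma is escaped instead of quoted.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar="\\").writerow(row)
escaped = buf.getvalue()

print(failed)
print(repr(escaped))
```

Either way the result is not the plain multi-column row that was wanted, which is why this is only a workaround.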
The correct solution is to feed writer.writerow the input it expects: a list of fields. This should do (untested):
writer.writerow([givers_list[ind][1], givers_list[ind][2],
givers_list[ind][3],
givers_list[rand_vec[ind]][1],
givers_list[rand_vec[ind]][2], givers_list[rand_vec[ind]][3]])
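The difference between the two calls can be reproduced in isolation. The sketch below is Python 3 and uses an in-memory buffer and made-up lists (giver, receiver) in place of the script's givers_list, so treat the names as assumptions rather than the original code.

```python
import csv
import io

def render(fields):
    """Write one row with the default dialect and return the text produced."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue()

giver = ["Alice", "101", "alice#mail.org"]
receiver = ["Wendy", "204", "wendy#mail.org"]

# One pre-joined string: the writer sees a single field containing commas.
as_one_field = render([",".join(giver + receiver)])

# A list of six fields: the writer adds the delimiters itself.
as_six_fields = render(giver + receiver)

print(as_one_field)
print(as_six_fields)
```

The first line printed is wrapped in double quotes, matching the output in the question; the second is not.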
In general, (double) quotes are needed when there is a separator character inside a field, and if there are quotes inside that field, they need to be 'escaped' with another quote.
Do you have an example of the output and the quotes you are talking about?
Edit (after example):
Ok, the whole row is treated as one field here. As Miroslaw Zalewski mentioned, those values should be treated as separate fields instead of one long string.
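The general quoting rule mentioned above (quote a field that contains the separator, and double any quotes inside it) can be checked with a short Python 3 sketch. The field values are invented for illustration.

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["plain", "has,comma", 'has "quotes"'])
line = buf.getvalue()
print(line)

# Reading the line back recovers the original three fields.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```

Only the fields that need it are quoted, and the round trip through csv.reader gives back the unquoted values.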
I stream data via Server-Sent Events and get about 500.000 datasets, but instead of getting one JSON document I get this (an example of 2 of the 500.000 datasets; this is how it looks when opened in gedit, all quotation marks are \" and all new lines are \n):
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
... -
My goal is to get this into a database. I thought I would put it into a dictionary and then create a pandas DataFrame, from which I should be able to get it into a database. But this turned out to be quite cumbersome. I ended up with something like this:
c1 = data_json[1:-1]
c2 = c1.replace('{data:{', '{\"data\":{')
c3 = c2.replace('}data:{', ', ')
c4 = '{' + c3 + '}'
but even here I have some problems, since I have to add \n\n for the new lines. But as soon as I change c3 to c2.replace('}\n\ndata:{', ', ') I get Process finished with exit code 137 (interrupted by signal 9: SIGKILL). Coming from .NET, I could handle this quite easily with a deserializer, and I am wondering if there is a similar way to deserialize the data.
I get the data via sseclient and would be able to store them as bytes instead of string, if this would help, just fyi.
Any suggestions?
Juggling with replaces is of course a convoluted path. The language has parsers for this kind of escaping built in, the simplest of which would be passing the string that contains the JSON through an eval call. But eval is seldom needed and should be avoided in most cases as "not elegant", if not outright unsafe (although being unsafe only applies when you have no control over the input data, and even then, ast.literal_eval instead of plain eval can mitigate that). Anyway, there are other problems with the format that will prevent eval from working outright: the missing quotes around the outermost data:, for example.
Rants aside, if your file content is actually:
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
It has two problems: "under-quoting" of the outermost data and "over-escaping" of the inner data.
In an interactive Python session, using the "raw string" marker, I can input your example line as it would be read from a file:
In [263]: a = r"""data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n"""
In [264]: print(a)
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
So, on to removing one level of backslashes. Python has a "unicode_escape" text codec, but it only decodes from bytes objects. We therefore first use the "latin1" encoding, which provides a byte-for-byte conversion of the text in "a" to bytes, and then decode with unicode_escape to remove the "\":
In [266]: b = a.encode("latin1").decode("unicode_escape")
In [267]: print(b, "\n", repr(b))
data:{"data":["Kendrick","Lamar"]}
data:{"data":["David","Bowie"]}
'data:{"data":["Kendrick","Lamar"]}\n\ndata:{"data":["David","Bowie"]}\n\n'
Now it is easy to parse. We split the resulting string at "\n\n" and get a list with one record (what you are calling a "dataset") per element. Then we resort to string manipulation to get rid of the leading "data:", and finally json.loads can work on the remaining part.
So:
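import json
# Read the whole file as text.
raw_data = open("mystrangefile.pseudo_json").read()
# Strip one level of backslash escaping.
data = raw_data.encode("latin1").decode("unicode_escape")
# Skip the empty chunk left after the trailing "\n\n", drop the "data:" prefix, parse the rest.
records = [json.loads(record.split(":", 1)[-1]) for record in data.split("\n\n") if record.strip()]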
And "records" should now contain well-behaved Python objects (dictionaries) that you can put in a database. Unless pandas can provide an automatic mapping of the columns to a database, it seems to be an unneeded step: a raw connection.executemany(""" INSERT ...""", records) with a properly opened DB connection should suffice.
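As a rough sketch of that last step, the following Python 3 example parses the two sample records and inserts them with executemany. It assumes an in-memory sqlite3 database, and the table name artists and its two column names are hypothetical; a real schema and driver would differ.

```python
import json
import sqlite3

# The sample from the question, with literal backslashes (raw string).
raw = r'data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n'

# Same decoding steps as above: unescape, split into records, parse each one.
text = raw.encode("latin1").decode("unicode_escape")
records = [
    json.loads(chunk.split(":", 1)[1])
    for chunk in text.split("\n\n")
    if chunk.strip()
]

# Each record is {"data": [first, last]}; turn it into a tuple per row.
rows = [tuple(record["data"]) for record in records]

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (first_name TEXT, last_name TEXT)")
conn.executemany("INSERT INTO artists VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT first_name, last_name FROM artists ORDER BY rowid"
).fetchall()
print(stored)
```

Passing parameters through the ? placeholders, rather than formatting them into the SQL string, keeps the insert safe regardless of what the names contain.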
Also, on a side note, you mentioned that you could handle this easily with a .NET deserializer: that is only true if your files are not as broken as what you have shown us. No standard deserializer could know how to handle such a specific data format out of the box. But if you really are that much more proficient in another language or technology, you could write just a converter from the broken input to a properly encoded file, and use that as an intermediate step.
I'm not completely sure if I understood the format in which you get the string correctly, so please correct me if I'm wrong here:
data_json = 'data:{\\"data\\":[\\"Kendrick\\",\\"Lamar\\"]}\\n\\ndata:{\\"data\\":[\\"David\\",\\"Bowie\\"]}\\n\\n'
Your first line seems to strip the first and last character, which I don't see. Are there any additional characters you are stripping away here?
The two following substring replacements seem to have no effect as the substrings are not present in the initial string (if I got it correctly in the first place).
And finally, in the last line you are wrapping your result in { and }, which is not correct for lists in JSON. It should be [...].
I can't really tell why you would get a SIGKILL here, though. It does not throw any errors for me, it just does not do what you want it to do. Maybe you're running out of memory with all the 500k examples?
However, this would be a working solution (again, given that I got the initial string correctly):
c1 = data_json.replace('\\n\\n', '') # removing escaped newlines
c2 = c1.replace('data:', ',') # replacing the additional 'data:' with json delimiter ','
c3 = c2.replace('\\', '') # removing artificial escapes
c4 = c3[1:] # removing leading ',' (introduced in c2); the trailing newlines are already gone after c1
c5 = '[' + c4 + ']' # wrapping as list
Now you should be able to json.loads(c5) or whatever you need to do with that string.
It appears that the pandas to_csv function only allows single character delimiters/separators.
Is there some way to allow for a string of characters to be used like, "::" or "%%" instead?
I tried:
df.to_csv(local_file, sep = '::', header=None, index=False)
and getting:
TypeError: "delimiter" must be a 1-character string
Use numpy.savetxt.
Ex:
np.savetxt('file.csv', np.char.decode(chunk_data.values.astype(np.bytes_), 'UTF-8'), delimiter='~|', fmt='%s', encoding=None)
np.savetxt('file.dat', chunk_data.values, delimiter='~|', fmt='%s', encoding='utf-8')
Think about what this line, a::b::c, means to a standard CSV tool: an a, an empty column, a b, an empty column, and a c. Even in a more complicated case with quoting or escaping: "abc::def"::2 means an abc::def, an empty column, and a 2.
So, all you have to do is add an empty column between every column, and then use : as a delimiter, and the output will be almost what you want.
I say “almost” because Pandas is going to quote or escape single colons. Depending on the dialect options you’re using, and the tool you’re trying to interact with, this may or may not be a problem. Unnecessary quoting usually isn’t a problem (unless you ask for QUOTE_ALL, because then your columns will be separated by :"":, so hopefully you don’t need that dialect option), but unnecessary escapes might be (e.g., you might end up with every single : in a string turned into a \: or something). So you have to be careful with the options. But it’ll work for the basic “quote as needed, with mostly standard other options” settings.
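The empty-column trick can be illustrated without pandas. The sketch below is Python 3 and uses the standard csv module instead of DataFrame.to_csv, which is a substitution on my part; the helper name write_double_colon is hypothetical. It pads the row with empty fields and writes it with ':' as the delimiter.

```python
import csv
import io

def write_double_colon(fields):
    """Interleave empty fields so a ':' delimiter yields '::' between values."""
    padded = []
    for i, field in enumerate(fields):
        if i:
            padded.append("")  # empty column between every real column
        padded.append(field)
    buf = io.StringIO()
    csv.writer(buf, delimiter=":", lineterminator="\n").writerow(padded)
    return buf.getvalue()

simple = write_double_colon(["a", "b", "c"])
quoted = write_double_colon(["abc::def", "2"])

# A reader with ':' as delimiter sees the empty columns again.
read_back = next(csv.reader(io.StringIO("a::b::c"), delimiter=":"))

print(simple)
print(quoted)
print(read_back)
```

The second call shows the "almost" part: a value that itself contains a colon is quoted under the default QUOTE_MINIMAL setting.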
I am faced with the following problem: when I generate .csv files in python using django-import-export even though the field is a string, when I open it in Excel the leading zeros are omitted. E.g. 000123 > 123.
This is a problem, because if I'd like to display a zipcode I need the zeros the way they are. I can cover it in quotes, but that's not desirable since it will grab unnecessary attention and it just looks bad. I'm also aware that you can do it in Excel files manually by changing the data type, but I don't want to explain that to people who are using my software.
Any suggestions?
Thanks in advance.
I've tried this solution. It's the solution suggested by #jquijano but it hasn't worked.
After generating the CSV, I opened it with 'open office' and 'excel' and in both cases I could see the (') character at the beginning of each string. However, if I added a new value to the CSV in the editor, for example '0895, the (') disappeared and the leading 0 wasn't removed.
Luckily, I found a workaround. I just added a non-printing control character at the beginning.
value = chr(24) + unidecode('00123')
An easy fix would be adding an apostrophe (') at the beginning of each number when exporting with import-export. This way Excel will recognize those numbers as text.
I have two programs that I converted to using the csv module in Python 2.7. The first I just used csv.reader:
f=open(sys.argv[1],'r')
line=csv.reader(f)
for values in line:
    …do some stuff…
In this case commas within quoted strings are removed. Another program was too complex to just read through, so I copied the file, changing the delimiter to a semicolon:
fin=open(sys.argv[1],'rb')
lines=csv.reader(fin)
with open('temp.csv','wb') as fout:
    w=csv.writer(fout,delimiter=';')
    w.writerows(lines)
In this case the commas inside quoted strings are preserved. I find no posts on bugs.python.org about this, so I conclude it must be me. Has anyone else experienced this behavior? In the second case all elements are quoted. In the first case quotes are used only where required.
It's hard to guess without seeing a sample of your input file and the resulting output, but maybe this is related: Read CSV file with comma within fields in Python
Try csv.reader(f, skipinitialspace=True).
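Without the input file it is not possible to confirm that this is the cause, but the effect of skipinitialspace is easy to show. The sketch below is Python 3 (the question uses 2.7) and the sample line is invented: a quoted field preceded by a space is only recognised as quoted when the leading space is skipped.

```python
import csv
import io

line = 'a, "b,c", d\n'

# Default: the space makes the quote an ordinary character, so the comma splits the field.
default = next(csv.reader(io.StringIO(line)))

# skipinitialspace=True: whitespace after the delimiter is dropped, so the quotes work.
skipping = next(csv.reader(io.StringIO(line), skipinitialspace=True))

print(default)
print(skipping)
```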
I am expecting users to upload a CSV file of max size 1MB to a web form that should fit a given format similar to:
"<String>","<String>",<Int>,<Float>
That will be processed later. I would like to verify that the file fits a specified format, so that the program that later uses the file doesn't receive unexpected input and there are no security concerns (say, some injection attack against the parsing script that does some calculations and a db insert).
(1) What would be the best way to go about doing this that would be fast and thorough? From what I've researched I could go the path of regex or something more like this. I've looked at the python csv module, but that doesn't appear to have any built-in verification.
(2) Assuming I go for a regex, can anyone direct me towards the best way to do this? Do I match for illegal characters and reject on that (e.g. no '/' '\' '<' '>' '{' '}' etc.), or match on all legal ones, e.g. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions, so pointers or examples would be appreciated.
EDIT:
Strings should contain no commas or quotes; each would just contain a name (i.e. first name, last name). And yes, I forgot to add that they would be double-quoted.
EDIT #2:
Thanks for all the answers. Cutplace is quite interesting but is a standalone. Decided to go with pyparsing in the end because it gives more flexibility should I add more formats.
Pyparsing will process this data, and will be tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (csv module is too, but regex solutions force you to add "\s*" bits all over the place).
from pyparsing import *
integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
integer + COMMA + floatnum + LineEnd()
tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()
for t in tests:
    print(t)
    try:
        print(validLine.parseString(t).asList())
    except ParseException as pe:
        print(pe.markInputline('?'))
        print(pe.msg)
    print('')
Prints
"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]
"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]
bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes
"bad","good2",100,3
"bad","good2",100,?3
Expected float
"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","
You will probably be stripping those quotation marks off at some future time, pyparsing can do that at parse time by adding:
dblQuotedString.setParseAction(removeQuotes)
If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:
comment = '#' + restOfLine
validLine.ignore(comment)
You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):
validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
integer("qty") + COMMA + floatnum("price") + LineEnd()
And your post-processing code can then do this:
data = validLine.parseString(t)
print("%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data)
print(data.qty*data.price)
I'd vote for parsing the file, checking you've got 4 components per record, that the first two components are strings, the third is an int (checking for NaN conditions), and the fourth is a float (also checking for NaN conditions).
Python would be an excellent tool for the job.
I'm not aware of any libraries in Python to deal with validation of CSV files against a spec, but it really shouldn't be too hard to write.
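import csv
import math

def check_file(path):
    # Wrapped in a function so the early returns below are valid.
    dataChecker = csv.reader(open(path))
    for row in dataChecker:
        if len(row) != 4:
            print('Invalid row length.')
            return
        # int() and float() raise ValueError on malformed input.
        my_int = int(row[2])
        my_float = float(row[3])
        if math.isnan(my_int):
            print('Bad int found')
            return
        if math.isnan(my_float):
            print('Bad float found')
            return
    print('All good!')

check_file('data.csv')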
Here's a small snippet I made:
import csv
f = csv.reader(open("test.csv"))
for value in f:
    value[0] = str(value[0])
    value[1] = str(value[1])
    value[2] = int(value[2])
    value[3] = float(value[3])
If you run that with a file that doesn't have the format you specified, you'll get an exception:
$ python valid.py
Traceback (most recent call last):
File "valid.py", line 8, in <module>
i[2] = int(i[2])
ValueError: invalid literal for int() with base 10: 'a3'
You can then make a try-except ValueError to catch it and let the users know what they did wrong.
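One way the try-except idea could look is sketched below in Python 3. The function name validate and the error messages are my own; it collects a message per bad row instead of stopping at the first exception, so the user can be told everything that is wrong in one pass.

```python
import csv
import io

def validate(lines):
    """Return a list of (line_number, message) for rows that break the format."""
    problems = []
    for number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != 4:
            problems.append((number, "expected 4 fields, got %d" % len(row)))
            continue
        try:
            int(row[2])
        except ValueError:
            problems.append((number, "third field is not an integer"))
            continue
        try:
            float(row[3])
        except ValueError:
            problems.append((number, "fourth field is not a float"))
    return problems

# In-memory stand-in for the uploaded file.
sample = io.StringIO('"good","data",100,3.14\n"bad","row",a3,1.0\n"short",1\n')
problems = validate(sample)
print(problems)
```

An empty list means every row had four fields with a parseable integer and float in the last two positions.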
There can be a lot of corner-cases for parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built-in to the language that you're using, even if it doesn't do all the "verification" you can think of.
Once you get there, then examine the fields for your list of "illegal" chars, or examine the values in each field to determine they're valid (if you can do so). You also don't even need a regex for this task necessarily, but it may be more concise to do it that way.
You might also disallow embedded \r or \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your csv lib.
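A minimal sketch of that post-load check, in Python 3, might look like this. The FORBIDDEN set and the helper name has_forbidden are assumptions for illustration; the sample data contains a real tab character in the second row.

```python
import csv
import io

FORBIDDEN = set("\r\n\0\t")  # the characters named above; extend as needed

def has_forbidden(row):
    """True if any field in the row contains a forbidden character."""
    return any(ch in FORBIDDEN for field in row for ch in field)

sample = io.StringIO('"ok","fine",1,2.0\n"tab\there","x",1,2.0\n')
rows = list(csv.reader(sample))
flags = [has_forbidden(row) for row in rows]
print(flags)
```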
Try Cutplace. It verifies that tabular data conforms to an interface control document.
Ideally, you want your filtering to be as restrictive as possible - the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only what characters people would have a reason to input - without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.
Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.
Something like...
/"[a-zA-Z' -]+"/
...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.
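In Python, that pattern could be applied to a raw, still-quoted field roughly as follows. This is a sketch: the helper name is_valid_name is hypothetical, and fullmatch is used so that the whole field, not just part of it, has to fit the pattern.

```python
import re

# Double-quoted field containing only letters, apostrophes, spaces and dashes.
NAME_FIELD = re.compile('"[a-zA-Z\' -]+"')

def is_valid_name(field):
    """True if the whole field is a double-quoted name made of allowed characters."""
    return NAME_FIELD.fullmatch(field) is not None

print(is_valid_name('"O\'Neil"'))
print(is_valid_name('"Robert<script>"'))
```

Note that this checks the field as it appears in the file, quotes included; after parsing with a CSV library the quotes are gone and the pattern would need to drop them.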