I'm looking for a way to optimize an algorithm that I have already developed. As the title of my question says, I am dealing with comma delimited strings that will sometimes contain any number of embedded commas. This is all being done in the context of big data so speed is important. What I have here does everything I need it to, however, I have to believe there would be a faster way of doing it. If you have any suggestions I would love to hear them. Thank you in advance.
code:
import os,re
commaProblemA=re.compile('^"[\s\w\-()/*.#!#%^\'&$\{\}|<>:0-9]+$')
commaProblemB=re.compile('^[\s\w\-()/*.#!#%^\'&$\{\}|<>:0-9]*"$')
#example string
#these are read from a file in practice
z=',,"N/A","DWIGHT\'s BEET FARM,INC.","CAMUS,ALBERT",35.00,0.00,"NIETZSCHE,FRIEDRICH","God, I hope this works, fast.",,,35.00,,,"",,,,,,,,,,,"20,4,2,3,2,33","223,2,3,,34 00:00:00:000000",,,,,,,,,,,,0,,,,,,"ERW-400",,,,,,,,,,,,,,,1,,,,,,,"BLA",,"IGE6560",,,,'
testList=z.split(',')
for i in testList:
    if re.match(commaProblemA,i):    # field that opens a quote but never closes it
        startingIndex=testList.index(i)
        endingIndex=testList.index(i)
        count=0
        while True:
            endingIndex+=1
            if re.match(commaProblemB,testList[endingIndex]):    # field that closes the quote
                diff=endingIndex-startingIndex
                while count<diff:
                    # rejoin the pieces, restoring the commas lost in split()
                    testList[startingIndex]=(testList[startingIndex]+","+testList[startingIndex+1])
                    testList.pop(startingIndex+1)
                    count+=1
                break
print(str(testList))
print(len(testList))
If you really want to do this yourself instead of using a library, first some tips:
don't use split() on csv data. (also bad for performance)
for performance: don't use regEx.
The regular way to scan the data would be like this (pseudo code, assuming single line csv):
for each line
    bool insideQuotes = false;
    while not end of line {
        if currentChar == '"'
            insideQuotes = !insideQuotes; // ( ! meaning 'not')
            // this also handles the case of escaped quotes inside the field
            // (if escaped with an extra quote)
        else if currentChar == ',' and !insideQuotes
            // separator found - handle field
    }
For even better performance you could open the file in binary mode and handle the newlines yourself while scanning. This way you don't need to scan for a line, copy it in a buffer (for example with getline() or a similar function) and then scan that buffer again to extract the fields.
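By way of illustration, a minimal Python sketch of that scanning loop (it keeps the quote characters in the fields, like the code in the question, and assumes one record per line with quotes escaped by doubling):
def split_csv_line(line):
    """Split one CSV record, ignoring commas inside double-quoted fields."""
    fields, current, inside_quotes = [], [], False
    for ch in line:
        if ch == '"':
            inside_quotes = not inside_quotes   # a doubled "" toggles twice and cancels out
            current.append(ch)
        elif ch == ',' and not inside_quotes:
            fields.append(''.join(current))     # separator found: close the current field
            current = []
        else:
            current.append(ch)
    fields.append(''.join(current))             # the last field has no trailing comma
    return fields

# The standard library gives the same split (and strips the quotes):
# import csv; fields = next(csv.reader([z]))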
I want to read a set of strings separated by linefeed characters, and I want to use sys.stdin.read(), but I'm confused about how it works.
This is what I've tried:
import re
import sys
string = sys.stdin.read()
for i in string:
    thearitmethic = re.findall('[\d\(\)\+\-\*\/\.]', i)
    print(thearitmethic)
The read() call consumes all input until EOF into a single string. While you then could split it on newlines, it's probably better to just read one line at a time.
for line in sys.stdin:
print(re.findall(r'[-\d()+*/.]', line))
Notice also the r'...' raw string to pass backslashes through to the regex engine. However, there is no need to backslash-escape dots, parentheses, etc. inside a [...] character class.
The problem with reading everything into a single string is that you are unnecessarily keeping potentially large amounts of data in memory. Your loop body only needs to see one line at a time; don't keep more than that in memory at any one time. (You could still in theory exhaust your RAM by reading a really long single line, but this is considered obscure enough to not worry many people in actual practice.)
I stream data via Server-Sent Events and get about 500,000 datasets, but instead of getting one JSON document I get this (an example of 2 of the 500,000 datasets; this is how it looks when opened in gedit - all quotation marks appear as \" and all new lines as \n):
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
... -
My goal is to get this into a database. I thought I would put this into a dictionary and afterwards create a pandas DataFrame; from there I should be able to get it into a database. But this turns out to be quite cumbersome. I ended up with something like this:
c1 = data_json[1:-1]
c2 = c1.replace('{data:{', '{\"data\":{')
c3 = c2.replace('}data:{', ', ')
c4 = '{' + c3 + '}'
but even here I have some problems since I have to add \n\n for the new lines. But as soon as I change c3 to c2.replace('}\n\ndata:{', ', ') I get Process finished with exit code 137 (interrupted by signal 9: SIGKILL). Coming from .NET I could handle this quite easily with a deserializer, and I am wondering if there is a similar way to deserialize the data.
I get the data via sseclient and could store it as bytes instead of a string if that would help, just FYI.
Any suggestions?
Juggling with replaces is of course a convoluted path - the language does have parsers for this kind of escaping built in. The simpler of these would be passing the string that contains the JSON through an eval call. But eval is seldom needed and should be avoided in most cases as inelegant, if not outright unsafe (being unsafe really only applies when you have no control over the input data, and even then ast.literal_eval instead of plain eval can mitigate that). Anyway, there are other problems with the format that will prevent eval from working outright - the missing quotes around the outermost data:, for example.
Random rants apart, if your file content is actually:
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
It has two problems: "under-quoting" of the outermost data and "over-escaping" of the inner data.
On an interactive Python session, using the "raw string" marker I can input your example line as it will be read from a file:
In [263]: a = r"""data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n"""
In [264]: print(a)
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
So, on to removing one level of backslashes - Python has a "unicode_escape" text encoding, but it only works on bytes objects. We therefore resort to the "latin1" encoding, as it provides a byte-for-byte conversion of the string in "a" to bytes, and then apply "unicode_escape" to remove the "\":
In [266]: b = a.encode("latin1").decode("unicode_escape")
In [267]: print(b, "\n", repr(b))
data:{"data":["Kendrick","Lamar"]}
data:{"data":["David","Bowie"]}
'data:{"data":["Kendrick","Lamar"]}\n\ndata:{"data":["David","Bowie"]}\n\n'
Now it is easy to parse: we split the resulting string at "\n\n" and get a list with one record (what you are calling a "dataset") per element. Then we resort to string manipulation to get rid of the leading "data:", and finally json.loads can work on the remaining part.
So:
import json

raw_data = open("mystrangefile.pseudo_json").read()
data = raw_data.encode("latin1").decode("unicode_escape")
records = [json.loads(record.split(":", 1)[-1]) for record in data.split("\n\n") if record]
And "records" now should contain well behaved Python objects dictionaries, you can put in a database. (Unless Pandas can provide automatic mapping of the columns to a databas, it seems to be an uneeded step - a raw connection.executemany(""" INSERT ...""", records) with a proper open DB connection should suffice.
Also, as a side note, you mentioned that you could handle this easily with a .NET deserializer: that is only true if your files were not as broken as what you have shown us - no standard serializer could know how to handle such a specific data format out of the box. But if you really are that much more proficient in another language/technology, you could write just a converter from the broken input to a properly encoded file, and use that as an intermediate step.
I'm not completely sure if I understood the format in which you get the string correctly, so please correct me if I'm wrong here:
data_json = 'data:{\\"data\\":[\\"Kendrick\\",\\"Lamar\\"]}\\n\\ndata:{\\"data\\":[\\"David\\",\\"Bowie\\"]}\\n\\n'
Your first line seems to strip the first and last character, which I don't see. Are there any additional characters you are stripping away here?
The two following substring replacements seem to have no effect as the substrings are not present in the initial string (if I got it correctly in the first place).
And finally in the last line you are wrapping your result with { and } which is not correct for lists in json. It should be [...]
I can't really tell why you would get a SIGKILL here, though. It does not throw any errors for me, it just does not do what you want it to do. Maybe you're running out of memory with all the 500k examples?
However, this would be a working solution (again, given that I got the initial string correctly):
c1 = data_json.replace('\\n\\n', '') # removing escaped newlines
c2 = c1.replace('data:', ',') # replacing the additional 'data:' with json delimiter ','
c3 = c2.replace('\\', '') # removing artificial escapes
c4 = c3[1:] # removing leading ',' (introduced in c2); nothing is left to strip at the end
c5 = '[' + c4 + ']' # wrapping as list
Now you should be able to json.loads(c5) or whatever you need to do with that string.
Is there a way to pretty-print Lisp-style code string (in other words, a bunch of balanced parentheses and text within) in Python without re-inventing a wheel?
Short answer
I think a reasonable approach, if you can, is to generate Python lists or custom objects instead of strings and use the pprint module, as suggested by #saulspatz.
Long answer
The whole question looks like an instance of an XY problem. Why? Because you are using Python (why not Lisp?) to manipulate strings (why not data structures?) representing generated Lisp-style code, where Lisp-style is defined as "a bunch of parentheses and text within".
To the question "how to pretty-print?", I would thus respond "I wouldn't start from here!".
The best way to not reinvent the wheel in your case, apart from using existing wheels, is to stick to a simple output format.
But first of all, why do you need to pretty-print? Who will look at the resulting code?
Depending on the exact Lisp dialect you are using and the intended usage of the code, you could format your code very differently. Think about newlines, indentation and maximum width of your text, for example. The Common Lisp pretty-printer is particularly sophisticated, and I doubt you want the same level of configurability.
If you used Lisp, a simple call to pprint would solve your problem, but you are using Python, so stick with the most reasonable output for the moment because pretty-printing is a can of worms.
If your code is intended for human readers, please:
don't put closing parentheses on their own lines
don't vertically align opening and closing parentheses
don't add spaces between opening parentheses
This is ugly:
( * ( + 3 x )
    (f
        x
        y
    )
)
This is better:
(* (+ 3 x)
   (f x y))
Or simply:
(* (+ 3 x) (f x y))
See here for more details.
But before printing, you have to parse your input string and make sure it is well-formed. Maybe you are sure it is well-formed, due to how you generate your forms, but I'd argue that the printer should ignore that and not make too many assumptions. If you passed the pretty-printer an AST represented by Python objects instead of just strings, this would be easier, as suggested in comments. You could build a data-structure or custom classes and use the pprint (python) module. That, as said above, seems to be the way to go in your case, if you can change how you generate your Lisp-style code.
With strings, you are supposed to handle any possible input and reject invalid ones.
This means checking that parenthesis and quotes are balanced (beware of escape characters), etc.
Actually, you don't need to really build an intermediate tree for printing (though it would probably help for other parts of your program), because Lisp-style code is made of forms that are easily nested and use a prefix notation: you can scan your input string from left-to-right and print as required when seeing parenthesis (open parenthesis: recurse; close parenthesis, return from recursion). When you first encounter an unescaped double-quote ", read until the next one ", ...
This, coupled with a simple printing method, could be sufficient for your needs.
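By way of illustration, here is a minimal sketch of that scan-parse-print idea (not production code: it assumes well-formed input, handles strings only crudely, and the 60-character width is an arbitrary choice):
def tokenize(source):
    """Split a Lisp-style string into '(', ')', quoted strings and atoms."""
    tokens, i = [], 0
    while i < len(source):
        c = source[i]
        if c in "()":
            tokens.append(c)
            i += 1
        elif c == '"':                        # read until the closing quote
            j = i + 1
            while j < len(source) and source[j] != '"':
                j += 2 if source[j] == "\\" else 1
            tokens.append(source[i:j + 1])
            i = j + 1
        elif c.isspace():
            i += 1
        else:                                 # a plain atom
            j = i
            while j < len(source) and source[j] not in '() "\t\n':
                j += 1
            tokens.append(source[i:j])
            i = j
    return tokens

def parse(tokens):
    """Build nested Python lists from the token stream."""
    token = tokens.pop(0)
    if token != "(":
        return token
    form = []
    while tokens[0] != ")":
        form.append(parse(tokens))
    tokens.pop(0)                             # drop the closing ')'
    return form

def pretty(form, indent=0):
    """One line if the form fits, otherwise wrap arguments under the operator."""
    if not isinstance(form, list):
        return form
    flat = "(" + " ".join(pretty(item) for item in form) + ")"
    if indent + len(flat) <= 60:
        return flat
    head = pretty(form[0])
    pad = " " * (indent + len(head) + 2)
    rest = [pretty(item, indent + len(head) + 2) for item in form[1:]]
    return "(" + head + " " + ("\n" + pad).join(rest) + ")"

print(pretty(parse(tokenize("(* (+ 3 x) (f x y))"))))   # -> (* (+ 3 x) (f x y))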
I think the easiest method would be to use triple-quoted strings. If you say:
print """
(((This is some lisp code))) """
It should work.
You can format your code any way you like within the triple quotes and it will come out the way you want it to.
Best of luck and happy coding!
I made this rudimentary pretty printer once for prettifying CLIPS, which is based on Lisp. Might help:
import os

def clips_pprint(clips_str: str) -> str:
    """Pretty-prints a CLIPS string.

    Indents a CLIPS string for easier visual confirmation during development
    and verification.

    Assumes the CLIPS string is valid CLIPS, i.e. braces are paired.
    """
    LB = "("
    RB = ")"
    TAB = " " * 4

    formatted_clips_str = ""
    tab_count = 0
    for c in clips_str:
        if c == LB:
            # start each new form on its own line at the current depth
            formatted_clips_str += os.linesep
            for _i in range(tab_count):
                formatted_clips_str += TAB
            tab_count += 1
        elif c == RB:
            tab_count -= 1
        formatted_clips_str += c
    return formatted_clips_str.strip()
I'm outputting a pretty huge XML structure to a file and I want the user to be able to enable/disable pretty printing.
I'm working with approximately 150MB of data; when I tried xml.etree.ElementTree and built the tree structure from its Element objects, it used an awful lot of memory, so I do this manually by storing raw strings and writing them out with .write(). My output sequence looks like this:
ofile.write(pretty_print(u'\
\t\t<LexicalEntry id="%s">\n\
\t\t\t<feat att="languageCode" val="cz"/>\n\
\t\t\t<Lemma>\n\
\t\t\t\t<FormRepresentation>\n\
\t\t\t\t\t<feat att="writtenForm" val="%s"/>\n\
\t\t\t\t</FormRepresentation>\n\
\t\t\t</Lemma>\n\
\t\t\t<Sense>%s\n' % (str(lex_id), word['word'], '' if word['pos']=='' else '\n\t\t\t\t<feat att="partOfSpeech" val="%s"/>' % word['pos'])))
Inside the .write() I call my function pretty_print, which, depending on a command-line option, SHOULD strip all tab and newline characters:
o_parser = OptionParser()
# ....
o_parser.add_option("-p", "--prettyprint", action="store_true", dest="pprint", default=False)
# ....
def pretty_print(string):
    if not options.pprint:
        return string.strip('\n\t')
    return string
I wrote 'should' because it does not: in this particular case it does not strip any of the characters.
BUT in this case, it works fine:
for ss in word['synsets']:
ofile.write(pretty_print(u'\t\t\t\t<Sense synset="%s-synset"/>\n' % ss))
The first thing that came to my mind was that there might be some issue with the substitution, but when I print the passed string inside the pretty_print function it looks perfectly fine.
Any suggestions as to what might cause .strip() not to work?
Or if there is a better way to do this, I'll accept any advice.
Your issue is that str.strip() only removes from the beginning and end of a string.
You either want str.replace() to remove all instances, or to split it into lines and strip each line, if you want to remove them from the beginning and end of lines.
Also note that for your massive string, Python supports multi-line strings with triple quotes that will make it a lot easier to type out, and the old style string formatting with % has been superseded by str.format() - which you probably want to use instead in new code.
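For instance, a minimal sketch of the replace-based fix, assuming the same options object as in the question:
def pretty_print(string):
    if not options.pprint:
        # replace() removes every tab and newline, not just leading/trailing ones
        return string.replace('\n', '').replace('\t', '')
    return string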
I am expecting users to upload a CSV file of max size 1MB to a web form that should fit a given format similar to:
"<String>","<String>",<Int>,<Float>
That will be processed later. I would like to verify the file fits a specified format so that the program that will later use the file doesn't receive unexpected input, and so that there are no security concerns (say, some injection attack against the parsing script that does some calculations and DB inserts).
(1) What would be the best way to go about doing this that would be fast and thorough? From what I've researched I could go the path of regex or something more like this. I've looked at the Python csv module but that doesn't appear to have any built-in verification.
(2) Assuming I go for a regex, can anyone direct me towards the best way to do this? Do I match for illegal characters and reject on that (e.g. no '/' '\' '<' '>' '{' '}' etc.), or match on all legal, e.g. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions, so pointers or examples would be appreciated.
EDIT:
Strings should contain no commas or quotes; they would just contain a name (i.e. first name, last name). And yes, I forgot to add that they would be double-quoted.
EDIT #2:
Thanks for all the answers. Cutplace is quite interesting, but it is a standalone tool. I decided to go with pyparsing in the end because it gives more flexibility should I add more formats.
Pyparsing will process this data, and will be tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (csv module is too, but regex solutions force you to add "\s*" bits all over the place).
from pyparsing import *
integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
integer + COMMA + floatnum + LineEnd()
tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()
for t in tests:
    print t
    try:
        print validLine.parseString(t).asList()
    except ParseException, pe:
        print pe.markInputline('?')
        print pe.msg
    print
Prints
"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]
"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]
bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes
"bad","good2",100,3
"bad","good2",100,?3
Expected float
"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","
You will probably be stripping those quotation marks off at some future time, pyparsing can do that at parse time by adding:
dblQuotedString.setParseAction(removeQuotes)
If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:
comment = '#' + restOfLine
validLine.ignore(comment)
You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):
validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
integer("qty") + COMMA + floatnum("price") + LineEnd()
And your post-processing code can then do this:
data = validLine.parseString(t)
print "%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data
print data.qty*data.price
I'd vote for parsing the file, checking you've got 4 components per record, that the first two components are strings, the third is an int (checking for NaN conditions), and the fourth is a float (also checking for NaN conditions).
Python would be an excellent tool for the job.
I'm not aware of any libraries in Python to deal with validation of CSV files against a spec, but it really shouldn't be too hard to write.
import csv
import math

def check_file(filename):
    dataChecker = csv.reader(open(filename))
    for row in dataChecker:
        if len(row) != 4:
            print 'Invalid row length.'
            return       # bail out on the first invalid row
        my_int = int(row[2])
        my_float = float(row[3])
        if math.isnan(my_int):
            print 'Bad int found'
            return
        if math.isnan(my_float):
            print 'Bad float found'
            return
    print 'All good!'

check_file('data.csv')
Here's a small snippet I made:
import csv
f = csv.reader(open("test.csv"))
for value in f:
    value[0] = str(value[0])
    value[1] = str(value[1])
    value[2] = int(value[2])
    value[3] = float(value[3])
If you run that with a file that doesn't have the format you specified, you'll get an exception:
$ python valid.py
Traceback (most recent call last):
File "valid.py", line 8, in <module>
    value[2] = int(value[2])
ValueError: invalid literal for int() with base 10: 'a3'
You can then wrap this in a try/except ValueError to catch it and let the users know what they did wrong.
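For example, a rough sketch of that wrapping (the filename and messages are placeholders):
import csv

with open("test.csv") as fh:
    for lineno, value in enumerate(csv.reader(fh), start=1):
        try:
            value[2] = int(value[2])
            value[3] = float(value[3])
        except (ValueError, IndexError) as err:
            print("Row %d is invalid: %s" % (lineno, err))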
There can be a lot of corner-cases for parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built-in to the language that you're using, even if it doesn't do all the "verification" you can think of.
Once you get there, then examine the fields for your list of "illegal" chars, or examine the values in each field to determine they're valid (if you can do so). You also don't even need a regex for this task necessarily, but it may be more concise to do it that way.
You might also disallow embedded \r or \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your csv lib.
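A rough sketch of that post-load check (the character blacklist here is only an illustration to adapt to your data):
import csv

FORBIDDEN = set("\r\n\0\t<>{}")

with open("upload.csv") as fh:
    for row in csv.reader(fh):
        for field in row:
            if FORBIDDEN & set(field):
                print("Rejected field: %r" % field)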
Try Cutplace. It verifies that tabular data conforms to an interface control document.
Ideally, you want your filtering to be as restrictive as possible - the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only what characters people would have a reason to input - without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.
Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.
Something like...
/"[a-zA-Z' -]+"/
...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.
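To make that concrete, a sketch of a whole-line check along those lines in Python (the character class and the length cap of 40 are assumptions to adjust):
import re

# "<String>","<String>",<Int>,<Float>
LINE_RE = re.compile(r'^"[a-zA-Z\' -]{1,40}","[a-zA-Z\' -]{1,40}",-?\d+,-?\d+\.\d+$')

def is_valid(line):
    return LINE_RE.match(line.rstrip("\r\n")) is not None

print(is_valid('"O\'Neil","Smith",100,3.14'))      # True
print(is_valid('"O\'Neil","Smith",100,notanum'))   # False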