Python regex for reading CSV-like rows

I want to parse incoming CSV-like rows of data. Values are separated by commas (and there can be leading and trailing whitespace around the commas), and can be quoted either with ' or with ". For example, this is a valid row:
data1, data2 ,"data3'''", 'data4""',,,data5,
but this one is malformed:
data1, data2, da"ta3", 'data4',
-- quotation marks may only be preceded or followed by spaces.
Such malformed rows should be recognized - ideally the malformed value would somehow be marked within the row, but it's also acceptable if the regex simply fails to match the whole row.
I'm trying to write a regex able to parse this, using either match() or findall(), but every regex I come up with has problems with some edge cases.
So, maybe someone with experience parsing something similar could help me out?
(Or maybe this is too complex for regex and I should just write a function)
EDIT1:
The csv module is not of much use here:
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
[['2', ' "dat', 'a1"', " 'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
[['2', 'dat,a1', "'dat", "a2'", '']]
-- unless this can be tuned?
EDIT2: A few language edits - I hope it's more valid English now
EDIT3: Thank you for all the answers. I'm now pretty sure that a regular expression is not such a good idea here, as (1) covering all edge cases can be tricky and (2) the writer's output is not regular. Having written that, I've decided to check the mentioned pyparsing and either use it or write a custom FSM-like parser.
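For reference, a minimal sketch of such a hand-rolled parser, assuming exactly the rules stated above (comma separators, optional spaces around values, ' or " quoting only at value boundaries):
def parse_row(line):
    """Split a CSV-like row into values; raise ValueError on a malformed one."""
    values, i, n = [], 0, len(line)
    while True:
        while i < n and line[i] == ' ':  # skip spaces before the value
            i += 1
        if i < n and line[i] in '\'"':  # quoted value
            quote, j = line[i], i + 1
            while j < n and line[j] != quote:  # scan to the closing quote
                j += 1
            if j >= n:
                raise ValueError('unterminated quote at column %d' % i)
            values.append(line[i:j + 1])
            i = j + 1
        else:  # unquoted (possibly empty) value
            j = i
            while j < n and line[j] != ',':
                if line[j] in '\'"':  # quote inside an unquoted value
                    raise ValueError('stray quote at column %d' % j)
                j += 1
            values.append(line[i:j].rstrip())
            i = j
        while i < n and line[i] == ' ':  # skip spaces after the value
            i += 1
        if i >= n:
            return values
        if line[i] != ',':  # e.g. text right after a closing quote
            raise ValueError('expected comma at column %d' % i)
        i += 1

print(parse_row("""data1, data2 ,"data3'''", 'data4""',,,data5,"""))
# ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', '', '', 'data5', '']
parse_row("""data1, data2, da"ta3", 'data4',""")  # raises ValueError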

While the csv module is the right answer here, writing a regex that can do this is quite doable:
import re
r = re.compile(r'''
    \s*                  # Any whitespace.
    (                    # Start capturing here.
      [^,"']+?           # Either a series of non-comma non-quote characters.
      |                  # OR
      "(?:               # A double-quote followed by a string of characters...
          [^"\\]|\\.     # That are either non-quotes or escaped...
       )*                # ...repeated any number of times.
      "                  # Followed by a closing double-quote.
      |                  # OR
      '(?:[^'\\]|\\.)*'  # Same as above, for single quotes.
    )                    # Done capturing.
    \s*                  # Allow arbitrary space before the comma.
    (?:,|$)              # Followed by a comma or the end of a string.
    ''', re.VERBOSE)
line = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
print(r.findall(line))
# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']
EDIT: To validate lines, you can reuse the regex above with small additions:
import re
r_validation = re.compile(r'''
    ^(?:    # Match from the start of the string...
        # Below is the same regex as above, but condensed.
        # One tiny modification is that it allows empty values:
        # the first plus is replaced by an asterisk.
        \s*([^,"']*?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')\s*(?:,|$)
    )*$     # ...and don't stop until the end.
    ''', re.VERBOSE)
line1 = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""
if r_validation.match(line1):
    print('Line 1 is valid.')
else:
    print('Line 1 is INvalid.')
if r_validation.match(line2):
    print('Line 2 is valid.')
else:
    print('Line 2 is INvalid.')
# Prints:
# Line 1 is valid.
# Line 2 is INvalid.

Although it would likely be possible with some combination of pre-processing, use of the csv module, post-processing, and regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).
In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...
You can likely find examples of pyparsing-based CSV parsing easily, and this question may be enough to get you started; a rough sketch follows.
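For instance, a rough pyparsing sketch of the grammar described in the question (a sketch under the stated rules, not a drop-in solution):
import pyparsing as pp

# Either kind of quoted value (quotes kept in the result), an unquoted
# value containing no quote characters, or an empty value between commas.
quoted = (pp.QuotedString('"', unquoteResults=False)
          | pp.QuotedString("'", unquoteResults=False))
unquoted = pp.CharsNotIn(",'\"").setParseAction(lambda t: t[0].strip())
empty = pp.Empty().setParseAction(pp.replaceWith(''))
row = pp.delimitedList(quoted | unquoted | empty, delim=',')

line = """data1, data2 ,"data3'''", 'data4""',,,data5,"""
print(row.parseString(line, parseAll=True).asList())
# ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', '', '', 'data5', '']

# A malformed value such as da"ta3" leaves unparsed text behind, so
# parseAll=True raises a ParseException, flagging the bad row.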

Python has a standard library module to read csv files:
import csv
reader = csv.reader(open('file.csv'))
for line in reader:
    print(line)
For your example input this prints
['data1', ' data2 ', "data3'''", ' \'data4""\'', '', '', 'data5', '']
EDIT:
you need to add skipinitialspace=True to allow spaces before double quotation marks for the extra examples you provided. Not sure about the single quotes yet.
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
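Note that csv.reader also accepts a quotechar parameter, but only a single character, so switching it to the single quote merely moves the problem over to the double-quoted fields:
import csv
from io import StringIO

print(list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''),
                      quotechar="'", skipinitialspace=True)))
# [['2', '"dat', 'a1"', 'dat,a2', '']]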

It is not possible to give you an answer, because you have not completely specified the protocol that is being used by the writer.
It evidently contains rules like:
If a field contains any commas or single quotes, quote it with double quotes.
Else if the field contains any double quotes, quote it with single quotes.
Note: the result is still valid if you swap double and single in the above 2 clauses.
Else don't quote it.
The resultant field may have spaces (or other whitespace?) prepended or appended.
The so-augmented fields are assembled into a row, separated by commas and terminated by the platform's newline (LF or CRLF).
What is not mentioned is what the writer does in these cases:
(0) field contains BOTH single quotes and double quotes
(1) field contains leading non-newline whitespace
(2) field contains trailing non-newline whitespace
(3) field contains any newlines.
Where the writer ignores any of these cases, please specify what outcomes you want.
You also mention "quotation marks can only be prepended or trailed by spaces" -- surely you mean commas are allowed also, otherwise your example 'data4""',,,data5, fails on the first comma.
How is your data encoded?

This probably sounds too simple, but from the looks of things, what you are really looking for is a string that contains [a-zA-Z0-9]["']+[a-zA-Z0-9]: a quote or double quote (or any combination of the two) in between letters (you could also add numbers there).
Based on what you were asking, it really doesn't matter that it's a CSV; what matters is that you have data that doesn't conform. I believe just searching for a letter, then any combination of one or more " or ', then another letter, will find it.
Now, are you looking to get a "quantity", or just a printout of each line that contains one, so you know which ones to go back and fix?
I'm sorry, I don't know Python regexes, but in Perl this would look something like this:
# Look for one or more letters/numbers, followed by at least one ' or ",
# followed by one or more letters/numbers.
if ($line =~ m/[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+/ig)
{
    # Print the line if the above regex matches.
    print $line;
}
It should be simple to convert that for wherever you look at each line.
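A rough Python translation of the Perl above (with made-up sample lines, purely for illustration):
import re

# A letter/number, one or more ' or " characters, then another letter/number.
pattern = re.compile(r"[a-zA-Z0-9]+['\"]+[a-zA-Z0-9]+")

for line in ['data1, da"ta3", x', "data1, 'data4', x"]:
    if pattern.search(line):
        print(line)  # only the first line, with the embedded quote, prints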
I'm sorry if I misunderstood the question
I hope it helps!

If your goal is to convert the data to XML (or JSON, or YAML), look at this example for a Gelatin syntax that produces the following output:
<xml>
    <line>
        <column>data1</column>
        <column>data2 </column>
        <column>data3'''</column>
        <column>data4""</column>
        <column/>
        <column/>
        <column>data5</column>
        <column/>
    </line>
</xml>
Note that Gelatin also has a Python API:
from Gelatin.util import compile, generate_to_file
syntax = compile('syntax.gel')
generate_to_file(syntax, 'input.csv', 'output.xml', 'xml')

Related

Python: retrieve substring bounded by indentations on a text file

I am having trouble figuring out how to even identify indentations in a text file with Python (the ones that appear when you press Tab). I thought that using the split function would be helpful, but it seems like there has to be a physical character that can act as the 'separator'.
Here is a sample of the text, where I am trying to retrieve the string 'John'. Assume that the spaces are the indentations:
15:50:00 John 1029384
All help is appreciated! Thanks!
Depending on the program you used to create the file, what is actually inserted when you press TAB may be either a TAB character (\t) or a series of spaces.
You were actually right in thinking that split() is a way to do what you want. If you don't pass any arguments to it, it treats any run of whitespace (spaces and tabs alike) as a single separator:
s = "15:50:00 John 1029384"
t = "15:50:00\tJohn\t1029384"
s.split() # Output: ['15:50:00', 'John', '1029384']
t.split() # Output: ['15:50:00', 'John', '1029384']
Tabs are represented by \t. See https://www.w3schools.com/python/gloss_python_escape_characters.asp for a longer list.
So we can do the following:
s = "15:50:00 John 1029384"
s.split("\t") # Output: ['15:50:00', 'John', '1029384']
If you know regex, then you can use look-ahead and look-behind as follows:
import re
re.search(r"(?<=\t).*?(?=\t)", t)[0] # Output: 'John'
Obviously both methods will need to be made more robust by considering edge cases and error handling (e.g., what happens if there are fewer, or more, than two tabs in the string; how do you identify the name in that case?).
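For instance, one simple guard for the wrong-number-of-tabs case:
t = "15:50:00\tJohn\t1029384"
fields = t.split("\t")
# Only trust the middle field if the line split into exactly three parts.
name = fields[1] if len(fields) == 3 else None
print(name)  # John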

Use regex to capture "keyword args"

I have a string formatted like this
s="""
stkcode="10001909" marketid="sh" isstop="S 01" turnover="0" contractid="000000" time="84445850"
"""
I want to capture all the "keyword args" substrings in it, i.e., stkcode="10001909", isstop="S 01". Note that a plain s.split() won't work because of possible whitespace in certain field values, for example isstop="S 01". The correct way to go seems to be re.split, but I don't know how to write the appropriate regex. Can anyone help? Thanks!
edit
To add more info: we are guaranteed there is no " in any entry value. Really, we only need a "protective" split, i.e. one that only splits on the whitespace outside of a pair of "s.
EDIT: XML is the way to go, not regex. Apologies
My original data comprises many lines of timestamp + some aux info + an XML string. So it cannot be directly parsed by an XML parser and has to be read line by line as strings. So I initially thought I'd just stick with strings and regex for each (relatively easy) single string. But I was wrong, apparently. An XML parser is the way to go for sure.
re.findall(r'((?!\<).*?)="(.*?)"', s)
Produces:
[('stkcode', '10001909'),
(' marketid', 'sh'),
(' isstop', 'S 01'),
(' turnover', '0'),
(' contractid', '000000'),
(' time', '84445850')]
Regex Explanation:
(...)="(...)"
Matches everything in this format, the kwarg format you've defined
Now the first group:
((?!\<).*?) will match any characters (.*?), so long as the match does not start with a < (the (?!\<) negative lookahead)
And the second group:
(.*?)
will just match all characters. The closing " is outside of the group in the original matching pattern, so you don't have to worry about it.
EDIT:
To ignore whitespace around the captured characters, add this negative lookahead group
(?!\s)
Not sure where whitespace would appear in your strings, but this new regex would handle it in every relevant place:
((?!\<)(?!\s).*?(?!\s))="(?!\s)(.*?)(?!\s)"
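Given the guarantee above that there is no " inside the values, a simpler pattern that matches the key as a word avoids capturing the leading whitespace entirely (an alternative suggestion, not part of the answer above):
import re

s = 'stkcode="10001909" marketid="sh" isstop="S 01" turnover="0" contractid="000000" time="84445850"'
print(re.findall(r'(\w+)="([^"]*)"', s))
# [('stkcode', '10001909'), ('marketid', 'sh'), ('isstop', 'S 01'),
#  ('turnover', '0'), ('contractid', '000000'), ('time', '84445850')]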

regex: replace hyphens with en-dashes with re.sub

I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
import re

def mychanger(fileName):
    with open(fileName, 'r') as file:
        text = file.read()
        text = text.decode("utf-8")
        text = re.sub(r"[^{]{1,4}(-)", "–", text).encode("utf-8")
    with open(fileName, 'wb') as file:
        file.write(text)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.
In this case there is a { close (within 3-4 characters max) to the hyphen and to the left of it. Of course, this hyphen should not be changed into an en-dash otherwise the latex code will break.
I think the left-part condition of the exclusion is important for writing the correct exception in the regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { close to the right of the hyphen), and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two-step replacement:
import re

s = r"-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
s = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])", "\\1–\\2", s)
s = re.sub(r"(^|[^0-9])-(\d+)", "\\1–\\2", s)
print(s)
# –1 Hello \cmidrule(lr){2-4} range 1–5 other stuff a–5
The first replacement targets all ranges which are not of the LaTeX form {1-9}, i.e. are not contained within curly braces. The second replacement targets every remaining hyphen that precedes a digit and is preceded by a non-digit or the start of the string.
re.sub replaces the entire match. In this case that includes the non-{ character preceding your -. You can wrap that bit in parentheses to create a \1 group and include that in your substitution (you also don't need parentheses around your –):
re.sub(r"([^{]{1,4})-",r"\1–", str)

With pyparsing, how do you parse a quoted string that ends with a backslash

I'm trying to use pyparsing to parse quoted strings under the following conditions:
The quoted string might contain internal quotes.
I want to use backslashes to escape internal quotes.
The quoted string might end with a backslash.
I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).
Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?
Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):
import pyparsing as pp
import re
# A single-quoted string having:
# - Internal escaped quote.
# - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"
# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks = parser.parseString(txt)
print()
print('txt: ', txt)
print('pattern:', parser.pattern)
print('toks: ', toks)
# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print()
print(rgx.search(txt).group(0))
Output:
txt: 'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks: ["ab'"]
'ab\'cd\'
Update
Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.
Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\
# demo.txt
foo = 'ab\'cd\\'
My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because the parsed value ends with two backslashes, rather than just one.
with open('demo.txt') as fh:
    txt = fh.read().split()[-1].strip()

parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks = parser.parseString(txt)

print()
print('txt: ', txt)
print('pattern:', parser.pattern)
print('toks: ', toks)  # ["ab'cd\\\\"]
I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:
qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)
I think you're misunderstanding the use of escQuote. According to the docs:
escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)
So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:
>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})
Notice that the backslash was replaced with '.
As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:
>>> txt = r"'ab\'xxxxxxx"
>>> print(rgx.search(txt).group(0))
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print(rgx.search(txt).group(0))
'ab\'xxxxxxx'
By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.
Although I can't come up with an example at the moment, I also suspect that this has the possibility to cause "catastrophic backtracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like
'one string \' 'or two'
we can't tell whether this is two valid strings (one string \ and or two) or one with invalid material after it (one string \' and the non-string tokens or two followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
What is it about this code that is not working for you?
from pyparsing import *

s = r"foo = 'ab\'cd\\'" # <--- IMPORTANT - use a raw string literal here

ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue

results = strAssign.parseString(s)
print(results.asList()) # displays repr form of each element
for r in results:
    print(r) # displays str form of each element

# count the backslashes
backslash = '\\'
print(results[-1].count(backslash))
prints:
['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2
EDIT:
So "\'" becomes just "'", but "\" is parsed but stays as "\" instead of being an escaped "\". Looks like a bug in QuotedString. For now you can add this workaround:
import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
Which will take every escaped character sequence and just give back the escaped character alone, without the leading '\'.
I'll add this in the next patch release of pyparsing.
PyParsing's QuotedString parser does not handle quoted strings that end with backslashes. This is a fundamental limitation, that doesn't have any easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.
This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" will raise an exception, while r"bar\\" will include both backslashes in the output.
The reason you are getting truncated output (rather than an exception) from your current code is because you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.
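As for using something other than QuotedString, one option is to wrap the question's own regex in pyparsing's Regex class and unescape in a parse action. This is a sketch, assuming Python-style quoting in which a trailing backslash is written as \\:
import re
import pyparsing as pp

# Escaped characters are allowed anywhere in the body; the lone-backslash
# alternative from the question is deliberately dropped, since under
# Python-style quoting a trailing backslash arrives escaped as \\.
quoted = pp.Regex(r"'(?:[^'\n\r\\]|\\.)*'")
# Strip the surrounding quotes and unescape \x -> x.
quoted.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0][1:-1]))

print(quoted.parseString(r"'ab\'cd\\'").asList())  # ["ab'cd\\"], i.e. ab'cd\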

Get a whole unicode sentence

I'm trying to parse a sentence like Base: Lote Numero 1, Marcelo T de Alvear 500. Demanda: otras palabras. I want to: first, split the text by periods; then, use whatever is before the colon as a label for the sentence after the colon.
Right now I have the following definition:
from pyparsing import *
unicode_printables = u''.join(unichr(c) for c in xrange(65536)
                              if not unichr(c).isspace())

def parse_test(text):
    label = Word(alphas) + Suppress(':')
    value = OneOrMore(Word(unicode_printables) | Literal(','))
    group = Group(label.setResultsName('label') + value.setResultsName('value'))
    exp = delimitedList(
        group,
        delim='.'
    )
    return exp.parseString(text)
It kind of works, but it drops the unicode characters (and whatever is not in alphanums), and I'm thinking that I would like to have the value as a whole sentence and not this: 'value': [(([u'Lote', u'Numero', u'1', ',', u'Marcelo', u'T', u'de', u'Alvear', u'500'], {}), 1).
Is there a simple way to tackle this?
To directly answer your question, wrap your value definition with originalTextFor, and this will give you back the string slice that the matching tokens came from, as a single string. You could also add a parse action, like:
value.setParseAction(lambda t : ' '.join(t))
But this would explicitly put a single space between each item, when there might have been no spaces (in the case of a ',' after a word), or more than one space. originalTextFor will give you the exact input substring. But even simpler, if you are just reading everything after the ':', would be to use restOfLine. (Of course, the simplest would be just to use split(':'), but I assume you are specifically asking how to do this with pyparsing.)
A couple of other notes:
xxx.setResultsName('yyy') can be shortened to just xxx('yyy'), improving the readability of your parser definition.
Your definition of value as OneOrMore(Word(unicode_printables) | Literal(',')) has a couple of problems. For one thing, ',' is included in the set of characters in unicode_printables, so ',' will get included with any parsed words. The best way to solve this is to use the excludeChars parameter to Word, so that your sentence words do not include commas: OneOrMore(Word(unicode_printables, excludeChars=',') | ','). Now you can also exclude other possible punctuation, like ';', '-', etc., just by adding them to the excludeChars string. (I just noticed that you are using '.' as the delimiter for a delimitedList - for this to work, you will have to include '.' as an excluded character too.)
Pyparsing is not like a regular expression in this regard - it does not do any lookahead to stop matching the current token just because the next part of the parser could match there. That is why you have to do some extra work of your own to avoid reading too much. In general, something as open-ended as OneOrMore(Word(unicode_printables)) is very likely to eat up the entire rest of your input string. A sketch combining these fixes follows.
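Here is that sketch, restricted to ASCII printables for brevity (swap in the unicode_printables set from the question for full Unicode support):
from pyparsing import (Word, alphas, printables, Suppress, Group,
                       OneOrMore, originalTextFor, delimitedList)

label = Word(alphas)('label') + Suppress(':')
# originalTextFor returns the exact slice of the input that was matched.
value = originalTextFor(OneOrMore(Word(printables, excludeChars='.')))('value')
exp = delimitedList(Group(label + value), delim='.')

text = 'Base: Lote Numero 1, Marcelo T de Alvear 500. Demanda: otras palabras.'
for g in exp.parseString(text):
    print(g['label'], '->', g['value'])
# Base -> Lote Numero 1, Marcelo T de Alvear 500
# Demanda -> otras palabras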
You should look into PyICU, which provides access to the rich Unicode text library provided by ICU, including the BreakIterator class that provides a sentence finder.
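A sketch of that approach, assuming PyICU's icu module (iterating a BreakIterator yields successive boundary offsets):
from icu import BreakIterator, Locale

text = 'Base: Lote Numero 1, Marcelo T de Alvear 500. Demanda: otras palabras.'
bi = BreakIterator.createSentenceInstance(Locale('es'))
bi.setText(text)

start = 0
for end in bi:  # each value is the offset of the next sentence boundary
    print(text[start:end].strip())
    start = end
# Base: Lote Numero 1, Marcelo T de Alvear 500.
# Demanda: otras palabras.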
