Splitting line with escaped separators in Python

TL;DR:
line = "one|two|three\|four\|five"
fields = line.split(whatever)
for what value of whatever does:
fields == ['one', 'two', 'three\|four\|five']
I have a file delimited by pipe characters. Some of the fields in that file also include pipes, escaped by a leading backslash.
For example, a single row of data in this file might have an array representation of ['one', 'two', 'three\|four\|five'], and this will be represented in the file as one|two|three\|four\|five
I have no control over the file. I cannot preprocess the file. I have to do it in a single split.
I ultimately need to split each row of this file into the separate fields, but that leading backslash is proving to be all sorts of trouble. I initially tried using a negative look-ahead, but there's some sort of arcana surrounding python strings and double-escaped characters which I don't understand, and this is stopping me from figuring it out.
Explanation of the solution is appreciated but optional.

You can use a regex like
re.split(r'([^|]+[^\\])\|', line)
which uses a capturing group to grab a run of characters ending in anything but \, so the split only happens on a | that is not preceded by a backslash.
That will leave an extra empty string at the start of the resulting list, but you can work around that by slicing it off:
re.split(r'([^|]+[^\\])\|', line)[1:]
This is still subject to the parsing issues that Wiktor raised though, of course
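For instance, a quick runnable check (escaping the backslashes explicitly in the string literal):
import re

line = "one|two|three\\|four\\|five"
fields = re.split(r'([^|]+[^\\])\|', line)[1:]
print(fields)
# ['one', 'two', 'three\\|four\\|five']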

Maybe you can use something like this :
[^\\]\|
where [^\\] matches any character other than \.
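Note one caveat (a quick demonstration, not part of the answer above): because the character class consumes the character before the |, that character is lost from the preceding field; a negative lookbehind avoids that.
import re

line = "one|two|three\\|four\\|five"
print(re.split(r'[^\\]\|', line))
# ['on', 'tw', 'three\\|four\\|five'] -- the 'e' and 'o' were eaten
print(re.split(r'(?<!\\)\|', line))
# ['one', 'two', 'three\\|four\\|five']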

Related

splitlines of quote splits '\n' in sub-quote

Given I have a quote that contains a double-quoted sub-quote with a '\n' in it,
if one performs splitlines on the parent quote, the child quote is split too.
double_quote_in_simple_quote = 'v\n"x\ny"\nz'
print(double_quote_in_simple_quote.splitlines())
Resulting output
['v', '"x', 'y"', 'z']
I would have expected the following:
['v', '"x\ny"', 'z']
Because the '\n' is in the scope of the sub-quote.
I was hoping to get an explanation of why it behaves as such, and whether you have any alternative to 'splitlines' that works at the level of the main quote only?
Thank you
The split function doesn't care about additional levels of quoting; it simply splits on every occurrence of the character you split on. (There isn't really a concept of nested quoting; a string is a string, and may or may not contain literal quotes, which are treated the same as any other character.)
If you want to implement quoting inside of strings, you have to do it yourself.
Perhaps use a regular expression:
import re
tokens = re.findall(r'"[^"]*"|[^"]*', double_quote_in_simple_quote)
splitresult = [
    x if x.startswith('"') else x.split('\n')
    for x in tokens]
Demo: https://ideone.com/lAgJTb
It is due to the nature of escape sequences in Python.
\n in Python is a newline character. splitlines() splits a string into a list, breaking it at line boundaries, and the line-break characters themselves are dropped from the result. That's why you get a list without newline characters.
If you want to keep the line breaks in the output, you can pass keepends=True, though note the string is still split at every newline:
print(double_quote_in_simple_quote.splitlines(keepends=True))
# ['v\n', '"x\n', 'y"\n', 'z']
I came up with a nasty workaround that can get you by while you look for another method that splits only at the level of the main quote.
double_quote_in_simple_quote = '"x\ny"'
double_quote_in_simple_quote = double_quote_in_simple_quote.replace("\n", "$n")
splitted_quote = double_quote_in_simple_quote.splitlines()
print(splitted_quote)
splitted_quote_decoded = [quote.replace('$n', '\n') for quote in splitted_quote]
print(splitted_quote_decoded)
The idea is to replace the \n with a placeholder that is meaningless but otherwise unused in the data, and then reverse the substitution afterwards. I used your example, but I'm sure you will be able to tune it to fit your needs. My output was:
['"x$ny"']
['"x\ny"']
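For completeness, a regex-based sketch (not from the answers above) that splits only on newlines falling outside double-quoted runs; it assumes the quotes are balanced:
import re

def split_outside_quotes(s):
    # The capturing group makes re.split keep the double-quoted chunks
    # in the result; they land at the odd indexes.
    parts = re.split(r'("(?:[^"\\]|\\.)*")', s)
    out = ['']
    for i, chunk in enumerate(parts):
        if i % 2:                    # quoted chunk: append verbatim
            out[-1] += chunk
        else:                        # unquoted chunk: split on newlines
            pieces = chunk.split('\n')
            out[-1] += pieces[0]
            out.extend(pieces[1:])
    return out

print(split_outside_quotes('v\n"x\ny"\nz'))
# ['v', '"x\ny"', 'z']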
Putting quote characters inside a string in Python doesn't create nested strings, per se. Whatever the outermost quotes are, Python will start and end the string object according to those. Any internal quote-like characters are treated as ordinary characters.
>>> print('dog')
dog
>>> print('"dog"')
"dog"
Note how in the second line, the quotes are also printed, because those actual quote-characters are a part of the string. No nesting happening.

Grabbing text between either double/single quote in Python regex

I have a bunch (thousands) of old unit testing scripts written with the Selenium RC interface in JavaScript. Since we're upgrading to Selenium 3, I want to try and get rid of some of the RC methods in an automated fashion using Python scripts. I'm iterating through these scripts line by line, picking up the Selenese methods, deconstructing them then attempting to rebuild with the WebDriver interface. For example:
selenium.type("xpath=//*[text()='test, xpath']", "test, text");
Would be output as...
driver.findElement(By.xpath("//*[text()='test, xpath']")).sendKeys("test, text");
I have a system for automatically identifying the Selenese methods, storing whitespace and separating the method from the parameters, so what I'm left with is the following string:
("xpath=//*[text()='test, xpath']", "test, text")
A problem I'm running into is, these aren't always consistent. Sometimes there are double-quotes nested in single-quotes, or vice-versa, or escaped double-quotes nested in double-quotes, etc. For example:
("xpath=//*[text()=\"test, xpath\"]", "test, text")
('xpath=//*[text()=\'test, xpath\']', 'test, text')
('xpath=//*[text()="test, xpath"]', 'test, text')
These are all valid. I want to be able to always match the arguments passed into the method, whether double-quotes are used or single-quotes, plus ignore nested quotes opposite of what's used to open the string as well as escaped quotes, then return them as lists.
['xpath=//*[text()="test, xpath"]', 'test, text']
...etc. I've attempted to use the re.findall using the following expression.
([\"'])(?:(?=(\\?))\2.)*?\1
What I'm getting back is this.
>>> print arguments
[('"', ''), ('"', '')]
Is there something I'm missing?
I would not make this so complex with lookbehinds or lookaheads. Rather, I would build a case-specific regex. In your case you have something like the following:
("param1", "param2")
('param1', 'param2')
Inside these params you may have additional escaped quotes, single quotes, or whatnot. But notice one thing: the separators ", " and ', ' are exact patterns that will rarely occur inside param1 or param2.
So the simplest non-regex solution would be to split on ", " or ', '. But there may be extra spaces, or no spaces, around the comma, so we use a pattern:
^\(\s*["']\s*(?P<first_param>.*?)("\s*,\s*"|'\s*,\s*')(?P<second_param>.*?)\s*["']\s*\)$
\(\s*["']\s* to match the opening parenthesis and any starting quote
(?P<first_param>.*?) to match the first parameter
("\s*,\s*"|'\s*,\s*') to match our separator pattern
(?P<second_param>.*?) to match the second parameter
\s*["']\s*\)$ to match the end.
This is not perfect but will work in 95%+ of your cases.
You can check regex fiddle on below link
https://regex101.com/r/z9PytD/1/
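Two notes for the Python side (assumptions beyond the answer above): Python spells named groups (?P<name>...), not (?<name>...), and re.findall returns only the capture groups when a pattern contains them, which is why the original attempt printed tuples of quote characters. A runnable sketch of this approach:
import re

pattern = re.compile(
    r"""^\(\s*["']\s*(?P<first_param>.*?)("\s*,\s*"|'\s*,\s*')(?P<second_param>.*?)\s*["']\s*\)$"""
)

m = pattern.match("""("xpath=//*[text()='test, xpath']", "test, text")""")
if m:
    print([m.group('first_param'), m.group('second_param')])
# ["xpath=//*[text()='test, xpath']", 'test, text']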

How to get rid of trailing \ while reading a file in python3

I am reading a file in python and getting the lines from it.
However, after printing out the values I get, I realize that after each line there is a trailing \ at the end.
I have looked at Python strip with \n and tried everything in it, but nothing has removed the trailing \.
For example
0048\
0051\
0052\
0054\
0056\
0057\
0058\
0059\
How can I get rid of these slashes?
Here is the code I have so far
for line in f:
    line = line.replace('\\n', "")
    line = line.replace('\\n', "")
    print(line)
I've even tried using regex
strings = re.findall(r"\S+", f.read())
But nothing has worked so far.
You're probably confused about what is in the lines, and as a result you're confusing me too. '\n' is a single newline character, as shown using repr() (which is your friend when you want to know what a value is exactly). A line typically ends with that (the exception being the end of file which might not). That does not contain a backslash; that backslash is part of a string literal escape sequence. Your replace argument of '\\n' contains two characters, a backslash followed by the letter n. This wouldn't match a '\n'; the easiest way to remove the newline specifically is to use str.rstrip('\n'). The line reading itself will guarantee that there's only up to one newline, and it is at the end of the string. Frequently we use strip() with no argument instead as we don't want whitespace either.
If your string really does contain backslash, you can process that as well, whether using replace, strip, re or some other string processing. Just keep in mind that it might be used for escape sequences not only at string literal level but at regular expression level too. For instance, re.sub(r'\\$', '', str) will remove a backslash from the end of a string; the backslash itself is doubled to not mean a special sequence in the regular expression, and the string literal is raw to not need another doubling of the backslashes.
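A small illustration of both cases (the sample line here is hypothetical):
import re

line = "0048\\\n"  # the characters 0, 0, 4, 8, a backslash, then a newline
print(repr(line.rstrip('\n')))
# '0048\\' -- the newline is gone, the backslash remains
print(repr(re.sub(r'\\$', '', line.rstrip('\n'))))
# '0048' -- the trailing backslash is gone too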

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and python script :
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
    if re.search('JOL":"(.+?).tr', text):
        print >> needed, text,
I want it to find what's between the two markers (JOL":" and .tr) and then print that. But all it does is print every line of "read.json".
You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
    match = re.search('JOL":"(.+?).tr', text)
    if match:
        print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
import json
d = json.loads(text[1:])
if 'JOL' in d:
    print >> needed, d['JOL']
Finally, make sure the file object you write to is really named needed. If your real code binds the open('needed.txt', 'w') result to some other name, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…
If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'

Python regex for reading CSV-like rows

I want to parse incoming CSV-like rows of data. Values are separated with commas (and there could be leading and trailing whitespaces around commas), and can be quoted either with ' or with ". For example - this is a valid row:
data1, data2 ,"data3'''", 'data4""',,,data5,
but this one is malformed:
data1, data2, da"ta3", 'data4',
-- quotation marks may only be preceded or followed by spaces.
Such malformed rows should be recognized - best would be to somehow mark malformed value within row, but if regex doesn't match the whole row then it's also acceptable.
I'm trying to write a regex able to parse this, using either match() or findall(), but every single regex I'm coming up with has some problems with edge cases.
So, maybe someone with experience in parsing something similar could help me on this?
(Or maybe this is too complex for regex and I should just write a function)
EDIT1:
The csv module is not of much use here:
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
[['2', ' "dat', 'a1"', " 'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
[['2', 'dat,a1', "'dat", "a2'", '']]
-- unless this can be tuned?
EDIT2: A few language edits - I hope it's more valid English now
EDIT3: Thank you for all the answers. I'm now pretty sure that a regular expression is not that good an idea here, as (1) covering all edge cases can be tricky and (2) the writer's output is not regular. Having written that, I've decided to check the mentioned pyparsing and either use it or write a custom FSM-like parser.
While the csv module is the right answer here, a regex that could do this is quite doable:
import re
r = re.compile(r'''
    \s*                 # Any whitespace.
    (                   # Start capturing here.
      [^,"']+?          # Either a series of non-comma non-quote characters.
      |                 # OR
      "(?:              # A double-quote followed by a string of characters...
          [^"\\]|\\.    # That are either non-quotes or escaped...
       )*               # ...repeated any number of times.
      "                 # Followed by a closing double-quote.
      |                 # OR
      '(?:[^'\\]|\\.)*' # Same as above, for single quotes.
    )                   # Done capturing.
    \s*                 # Allow arbitrary space before the comma.
    (?:,|$)             # Followed by a comma or the end of a string.
''', re.VERBOSE)
line = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
print r.findall(line)
# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']
EDIT: To validate lines, you can reuse the regex above with small additions:
import re
r_validation = re.compile(r'''
    ^(?:    # Capture from the start.
        # Below is the same regex as above, but condensed.
        # One tiny modification is that it allows empty values.
        # The first plus is replaced by an asterisk.
        \s*([^,"']*?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')\s*(?:,|$)
    )*$     # And don't stop until the end.
''', re.VERBOSE)
line1 = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""
if r_validation.match(line1):
    print 'Line 1 is valid.'
else:
    print 'Line 1 is INvalid.'
if r_validation.match(line2):
    print 'Line 2 is valid.'
else:
    print 'Line 2 is INvalid.'
# Prints:
# Line 1 is valid.
# Line 2 is INvalid.
Although it would likely be possible with some combination of pre-processing, use of csv module, post-processing, and use of regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).
In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...
You can likely find examples of pyparsing-based CSV parsing easily, with this question maybe enough to get you started.
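As a starting point, here is a minimal sketch using the commaSeparatedList helper that ships with pyparsing (assuming a reasonably recent pyparsing; it understands quoted fields containing commas):
import pyparsing as pp

line = """data1, data2 ,"data3'''", 'data4""',,,data5,"""
fields = pp.commaSeparatedList.parseString(line).asList()
print(fields)  # fields come back with their surrounding quotes retained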
Python has a standard library module to read csv files:
import csv
reader = csv.reader(open('file.csv'))
for line in reader:
    print line
For your example input this prints
['data1', ' data2 ', "data3'''", ' \'data4""\'', '', '', 'data5', '']
EDIT:
you need to add skipinitialspace=True to allow spaces before double quotation marks for the extra examples you provided. Not sure about the single quotes yet.
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
It is not possible to give you an answer, because you have not completely specified the protocol that is being used by the writer.
It evidently contains rules like:
If a field contains any commas or single quotes, quote it with double quotes.
Else if the field contains any double quotes, quote it with single quotes.
Note: the result is still valid if you swap double and single in the above 2 clauses.
Else don't quote it.
The resultant field may have spaces (or other whitespace?) prepended or appended.
The so-augmented fields are assembled into a row, separated by commas and terminated by the platform's newline (LF or CRLF).
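For concreteness, a hypothetical sketch of those inferred writer rules (it deliberately leaves the ambiguous cases listed next unhandled):
def quote_field(field):
    # Inferred writer rules; case (0) below (both quote kinds) is undefined.
    if ',' in field or "'" in field:
        return '"' + field + '"'   # double-quote fields containing , or '
    if '"' in field:
        return "'" + field + "'"   # single-quote fields containing "
    return field                   # otherwise leave the field unquoted

print(quote_field("dat,a1"))   # "dat,a1"
print(quote_field('da"ta3'))   # 'da"ta3'
print(quote_field("data5"))    # data5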
What is not mentioned is what the writer does in these cases:
(0) field contains BOTH single quotes and double quotes
(1) field contains leading non-newline whitespace
(2) field contains trailing non-newline whitespace
(3) field contains any newlines.
Where the writer ignores any of these cases, please specify what outcomes you want.
You also mention "quotation marks can only be prepended or trailed by spaces" -- surely you mean commas are allowed also, otherwise your example 'data4""',,,data5, fails on the first comma.
How is your data encoded?
This probably sounds too simple, but from the looks of things you are looking for a string that contains [a-zA-Z0-9]["']+[a-zA-Z0-9]. Without in-depth testing against the data, what you're really looking for is a single or double quote (or any combination of the two) in between letters (you could also add numbers there).
Based on what you were asking, it doesn't really matter that it's a CSV; what matters is that you have data that doesn't conform. I believe you can find it by searching for a letter or number, then any combination of one or more " or ' characters, then another letter or number.
Now, are you looking to get a "quantity", or just a printout of each line that contains it so you know which ones to go back and fix?
I'm sorry I don't know python regex's but in perl this would look something like this:
# Look for one or more letters/numbers, at least one ' or ", and at least
# one or more letters/numbers
if ($line =~ m/[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+/ig)
{
    # Prints the line if the above regex is found
    print $line;
}
Simply convert that to run against each line you look at.
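For reference, a rough Python equivalent of the Perl snippet above (a sketch; line stands for the row being tested):
import re

# Flag rows where one or more quote characters sit between letters/digits
pattern = re.compile(r'''[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+''')

line = """data1, data2, da"ta3", 'data4',"""
if pattern.search(line):
    print(line)  # the malformed row gets printed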
I'm sorry if I misunderstood the question
I hope it helps!
If your goal is to convert the data to XML (or JSON, or YAML), look at this example for a Gelatin syntax that produces the following output:
<xml>
<line>
<column>data1</column>
<column>data2 </column>
<column>data3'''</column>
<column>data4""</column>
<column/>
<column/>
<column>data5</column>
<column/>
</line>
</xml>
Note that Gelatin also has a Python API:
from Gelatin.util import compile, generate_to_file
syntax = compile('syntax.gel')
generate_to_file(syntax, 'input.csv', 'output.xml', 'xml')
