Python: retrieve substring bounded by indentations on a text file - python

I am having trouble on how to even identify indentations on a text file with Python (the ones that appear when you press tab). I thought that using the split function would be helpful, but it seems like there has to be a physical character that can act as the 'separator'.
Here is a sample of the text, where I am trying to retrieve the string 'John'. Assume that the spaces are the indentations:
15:50:00 John 1029384
All help is appreciated! Thanks!

Dependent on the program that you used for creating the file, what is actually inserted when you press TAB may either be a TAB character (\t) or a series of spaces.
You were actually right in thinking that split() is a way to do what you want. If you don't pass any arguments to it, it treats both series of whitespace and tabs as a single separator:
s = "15:50:00 John 1029384"
t = "15:50:00\tJohn\t1029384"
s.split() # Output: ['15:50:00', 'John', '1029384']
t.split() # Output: ['15:50:00', 'John', '1029384']

Tabs are represented by \t. See https://www.w3schools.com/python/gloss_python_escape_characters.asp for a longer list.
So we can do the following:
s = "15:50:00 John 1029384"
s.split("\t") # Output: ['15:50:00', 'John', '1029384']
If you know regex, then you can use look-ahead and look-behind as follows:
import re
re.search("(?<=\t).*?(?=\t)", s)[0] # Output: "John"
Obviously both methods will need to be made more robust by considering edge cases and error handling (eg., what happens if there are fewer -- or more -- than two tabs in the string -- how do you identify the name in that case?)

Related

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and python script :
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
if re.search('JOL":"(.+?).tr', text):
print >> needed, text,
I want it to find what's between two words (JOL":" and .tr) and then print it. But all it does is printing all the text set in "read.json".
You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
match = re.search('JOL":"(.+?).tr', text)
if match:
print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
d = json.loads(text[1:])
if 'JOL' in d:
print >> needed, d['JOL']
Finally, you don't actually have anything named needed in your code; you opened a file named 'needed.txt', but you called the file object love. If your real code has a similar bug, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…
If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'

Regular Expressions: Special Characters and Tab Spaces

I was testing out a function that I wrote. It is supposed to give me the count of full stops (.) in a line or string. The full stop (.) that I am interested in counting has a tab space before and after it.
Here is what I have written.
def Seek():
a = '1 . . 3 .'
b = a.count(r'\t\.\t')
return b
Seek()
However, when I test it, it returns 0. From a, there are 2 full stops (.) with both a tab space before and after it. Am I using regular expressions improperly? Represented a incorrectly? Any help is appreciated.
Thanks.
It doesn't look like a has any tabs in it. Although you may have hit the tab key on your keyboard, that character would have been interpreted by the text editor as "insert a number of spaces to align with the next tab character". You need your line to look like this:
a = '1\t.\t.\t3\t.'
That should do it.
A more complete example:
from re import *
def Seek():
a = '1\t.\t.\t3\t\.'
re = compile(r'(?<=\t)\.(?=\t)');
return len(re.findall(a))
print Seek()
This uses "lookahead" and "lookbehind" to match the tab character without consuming it. What does that mean? It means that when you have \t.\t.\t, you will actually match both the first and the second \.. The original expression would have matched the initial \t\.\t and discarded them. After, there would have been a \. with nothing in front of it, and thus no second match. The lookaround syntax is "zero width" - the expression is tested but it ends up taking no space in the final match. Thus, the code snippet I just gave returns 2, just as you would expect.
It will work if you replace the '\t' with a single tab key press.
Note that count only counts non-overlapping occurrences of a substring so it won't work as intended unless you use regex instead, or change your substring to only test for a tab in front of the period.

Split string with caret character in python

I have a huge text file, each line seems like this:
Some sort of general menu^a_sub_menu_title^^pagNumber
Notice that the first "general menu" has white spaces, the second part (a subtitle) each word is separate with "_" character and finally a number (a pag number). I want to split each line in 3 (obvious) parts, because I want to create some sort of directory in python.
I was trying with re module, but as the caret character has a strong meaning in such module, I couldn't figure it out how to do it.
Could someone please help me????
>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']
If you only want three pieces you can accomplish this through a generator expression:
line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']
What you need to do is to "escape" the special characters, like r'\^'. But better than regular expressions in this case would be:
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')
That gives you the components in a much more straightforward fashion.
You could just say string.split("^") to divide the string into an array containing each segment. The only caveat is that it will divide consecutive caret characters into an empty string. You could protect against this by either collapsing consecutive carats down into a single one, or detecting empty strings in the resultant array.
For more information see http://docs.python.org/library/stdtypes.html
Does that help?
It's also possible that your file is using a format that's compatible with the csv module, you could also look into that, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and it's just delimiters and text, line.split is probably the best.
Also, for the re module, any special characters can be escaped with \, like r'\^'. I'd suggest before jumping to use re to 1) learn how to write regular expressions, 2) first look for a solution to your problem instead of jumping to regular expressions - «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. »

Python regex for reading CSV-like rows

I want to parse incoming CSV-like rows of data. Values are separated with commas (and there could be leading and trailing whitespaces around commas), and can be quoted either with ' or with ". For example - this is a valid row:
data1, data2 ,"data3'''", 'data4""',,,data5,
but this one is malformed:
data1, data2, da"ta3", 'data4',
-- quotation marks can only be prepended or trailed by spaces.
Such malformed rows should be recognized - best would be to somehow mark malformed value within row, but if regex doesn't match the whole row then it's also acceptable.
I'm trying to write regex able to parse this, using either match() of findall(), but every single regex I'm coming with has some problems with edge cases.
So, maybe someone with experience in parsing something similar could help me on this?
(Or maybe this is too complex for regex and I should just write a function)
EDIT1:
csv module is not much of use here:
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
[['2', ' "dat', 'a1"', " 'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
[['2', 'dat,a1', "'dat", "a2'", '']]
-- unless this can be tuned?
EDIT2: A few language edits - I hope it's more valid English now
EDIT3: Thank you for all answers, I'm now pretty sure that regular expression is not that good idea here as (1) covering all edge cases can be tricky (2) writer output is not regular. Writing that, I've decided to check mentioned pyparsing and either use it, or write custom FSM-like parser.
While the csv module is the right answer here, a regex that could do this is quite doable:
import re
r = re.compile(r'''
\s* # Any whitespace.
( # Start capturing here.
[^,"']+? # Either a series of non-comma non-quote characters.
| # OR
"(?: # A double-quote followed by a string of characters...
[^"\\]|\\. # That are either non-quotes or escaped...
)* # ...repeated any number of times.
" # Followed by a closing double-quote.
| # OR
'(?:[^'\\]|\\.)*'# Same as above, for single quotes.
) # Done capturing.
\s* # Allow arbitrary space before the comma.
(?:,|$) # Followed by a comma or the end of a string.
''', re.VERBOSE)
line = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
print r.findall(line)
# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']
EDIT: To validate lines, you can reuse the regex above with small additions:
import re
r_validation = re.compile(r'''
^(?: # Capture from the start.
# Below is the same regex as above, but condensed.
# One tiny modification is that it allows empty values
# The first plus is replaced by an asterisk.
\s*([^,"']*?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')\s*(?:,|$)
)*$ # And don't stop until the end.
''', re.VERBOSE)
line1 = r"""data1, data2 ,"data3'''", 'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""
if r_validation.match(line1):
print 'Line 1 is valid.'
else:
print 'Line 1 is INvalid.'
if r_validation.match(line2):
print 'Line 2 is valid.'
else:
print 'Line 2 is INvalid.'
# Prints:
# Line 1 is valid.
# Line 2 is INvalid.
Although it would likely be possible with some combination of pre-processing, use of csv module, post-processing, and use of regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).
In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...
You can likely find examples of pyparsing-based CSV parsing easily, with this question maybe enough to get you started.
Python has a standard library module to read csv files:
import csv
reader = csv.reader(open('file.csv'))
for line in reader:
print line
For your example input this prints
['data1', ' data2 ', "data3'''", ' \'data4""\'', '', '', 'data5', '']
EDIT:
you need to add skipinitalspace=True to allow spaces before double quotation marks for the extra examples you provided. Not sure about the single quotes yet.
>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
It is not possible to give you an answer, because you have not completely specified the protocol that is being used by the writer.
It evidently contains rules like:
If a field contains any commas or single quotes, quote it with double quotes.
Else if the field contains any double quotes, quote it with single quotes.
Note: the result is still valid if you swap double and single in the above 2 clauses.
Else don't quote it.
The resultant field may have spaces (or other whitespace?) prepended or appended.
The so-augmented fields are assembled into a row, separated by commas and terminated by the platform's newline (LF or CRLF).
What is not mentioned is what the writer does in these cases:
(0) field contains BOTH single quotes and double quotes
(1) field contains leading non-newline whitespace
(2) field contains trailing non-newline whitespace
(3) field contains any newlines.
Where the writer ignores any of these cases, please specify what outcomes you want.
You also mention "quotation marks can only be prepended or trailed by spaces" -- surely you mean commas are allowed also, otherwise your example 'data4""',,,data5, fails on the first comma.
How is your data encoded?
This probably sounds too simple, but really from the looks of things you are looking for a string that contains either [a-zA-Z0-9]["']+[a-zA-Z0-9], I mean without in depth testing against the data really what you're looking for is a quote or double quote (or any combination) in between letters (you could also add numbers there).
Based on what you were asking, it really doesn't matter that it's a CSV, it matter's that you have data that doesn't conform. Which I believe just doing a search for a letter, then any combination of one or more " or ' and another letter.
Now are you looking to get a "quantity" or just a printout of the line that contains it so you know which ones to go back and fix?
I'm sorry I don't know python regex's but in perl this would look something like this:
# Look for one or more letter/number at least one ' or " or more and at least one
# or more letter/number
if ($line =~ m/[a-zA-Z0-9]+['"]+[a-zA-Z0-9]+/ig)
{
# Prints the line if the above regex is found
print $line;
}
Just simply convert that for when you look at a line.
I'm sorry if I misunderstood the question
I hope it helps!
If your goal is to convert the data to XML (or JSON, or YAML), look at this example for a Gelatin syntax that produces the following output:
<xml>
<line>
<column>data1</column>
<column>data2 </column>
<column>data3'''</column>
<column>data4""</column>
<column/>
<column/>
<column>data5</column>
<column/>
</line>
</xml>
Note that Gelatin also has a Python API:
from Gelatin.util import compile, generate_to_file
syntax = compile('syntax.gel')
generate_to_file(syntax, 'input.csv', 'output.xml', 'xml')

Extra characters Extracted with XPath and Python (html)

I have been using XPath with scrapy to extract text from html tags online, but when I do I get extra characters attached. An example is trying to extract a number, like "204" from a <td> tag and getting [u'204']. In some cases its much worse. For instance trying to extract "1 - Mathoverflow" and instead getting [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or trim the strings so that the extra characters arent a part of the string? (using items to store the data). It looks like it has something to do with formatting, so how do I get xpath to not pick up that stuff?
What does the line of code look like that returns [u'204']? It looks like what is being returned is a Python list containing a unicode string with the value you want. Nothing wront there--just subscript. As for the carriage returns, linefeeds and tabs, as Wai Yip Tung just answered, strip will take them out.
Probably
my_answer = item1['Title'][0].strip()
Or if you are expecting several matches
for ans_i in item1['Title']:
do_something_with( ans_i.strip() )
The standard XPath function normalize-space() has exactly the wanted effect.
It deletes the leading and trailing wite space and replaces any inner whitespace with just one space.
So, you could use:
normalize-space(someExpression)
Use strip() to remove the leading and trailing white spaces.
>>> u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t '.strip()
u'1 \u2013 MathOverflow'

Categories

Resources