I'm a Python (and regex) rookie with relatively little programming experience outside of statistical packages (SAS & Stata). So far, I've gotten by using Python tutorials and answers to other questions on stackoverflow, but I'm stuck. I'm running Python 3.4 on Mac OS X.
I've written a script which downloads and parses SEC filings. The script has four main steps:
Open the URL and load the contents to a string variable.
Remove HTML encoding using BeautifulSoup.
Remove other encoding with regex statements (like jpg definitions, embedded zip files, etc.).
Save the resulting text file.
My goal is to remove as much of the "non-text" information as possible from each filing before saving to my local drive. I have another script written where I do the actual analysis on the residual text.
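In outline, the script does something like this (a simplified sketch; the urllib call and the clean_filing wrapper are just illustrative, not my exact code):

import urllib.request
from bs4 import BeautifulSoup

def clean_filing(url, out_path):
    # 1. Download the filing into a string.
    raw = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    # 2. Strip HTML markup with BeautifulSoup.
    text = BeautifulSoup(raw, "html.parser").get_text()
    # 3. Remove other embedded content (jpg definitions, zip files, etc.)
    #    with a series of re.sub calls; the problematic one is quoted below.
    # 4. Save the resulting text file.
    with open(out_path, "w") as out:
        out.write(text)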
I'm running into a problem with step 3 on at least one filing. The line that is causing the hangup is:
_content1 = re.sub(r'(?i).*\.+(xls|xlsx|pdf|zip|jpg|gif|xml)+?[\d\D]+?(end)',r'',_content1)
where _content1 is a string variable containing the contents of the SEC filing. The regex statement is supposed to capture blocks beginning with a line ending in a file extension (xls, pdf, etc.) and ending with the word "end".
The above code has worked fine for entire years' worth of filings (i.e., I've analyzed all of 2001 and 2002 without issue), but my script is getting hung up on one particular filing in 2013 (http://www.sec.gov/Archives/edgar/data/918160/0000918160-13-000024.txt). I'm unsure how to debug as I'm not getting any error message. The script just hangs up on that one line of code (I've verified this with print statements before and after). Interestingly, if I replace the above line of code with this:
_content1 = re.sub(r'(?i)begin*.*(xls|xlsx|pdf|zip|jpg|gif|xml)+?[\d\D]+?(end)',r'',_content1)
Then everything works fine. Unfortunately, certain kinds of embedded files in the filings don't start with "begin" (like zip files), so it won't work for me.
I'm hoping one of the resident experts can identify something in my regex substitution statement that would cause a problem, as going match-by-match through the linked SEC filing probably isn't feasible (at least I wouldn't know where to begin). Any help is greatly appreciated.
Thanks,
JRM
EDIT:
I was able to get my script working by using the following REGEX:
_content1 = re.sub(r'(?i)begin|\n+?.+?(xls|xlsx|pdf|zip|jpg|gif|xml)+?[\d\D]+?(end)',r'\n',_content1)
This seems to be accomplishing what I want, but I am still curious as to why the original didn't work if anyone has a solution.
I think your biggest problem is the lack of anchors. Your original regex begins with .*, which can start matching anywhere and won't stop matching until it reaches a newline or the end of the text. Then it starts backtracking, giving back one character at a time, trying to match the first falsifiable component of the pattern: the dot and the letters of the file extension.
So it starts at the beginning of the file and consumes potentially thousands of characters, only to backtrack all the way to the beginning before giving up. Then it bumps ahead and does the same thing starting at the second character. And again from the third character, from the fourth, and so on. I know it seems incredibly inefficient, but that's the tradeoff we make for the power and compactness of regexes.
Try this regex:
r"(?im)^[^<>\n]+\.(?:xlsx?|pdf|zip|jpg|gif|xml)\n(?:(?!end$)\S+\n)+end\n"
The start anchor (^) in multiline mode makes sure the match can only start at the beginning of a line. I used [^<>\n]+ for the first part of the line because I'm working with the file you linked to; if you've removed all the HTML and XML markup, you might be able to use .+ instead.
Then I used (?:(?!end$)\S+\n)+ to match one or more complete lines that don't consist entirely of the word end. It's probably more efficient than your [\d\D]+?, but the most important difference is that, when I do match end, I know it's at the beginning of a line (and the $ ensures it's at the end of the line).
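Here's roughly how it could be dropped into your script in place of the original substitution (strip_embedded_files is just an illustrative wrapper name, not anything your code needs to use):

import re

# Compile once, since the script processes many filings.
block_re = re.compile(
    r"(?im)^[^<>\n]+\.(?:xlsx?|pdf|zip|jpg|gif|xml)\n(?:(?!end$)\S+\n)+end\n"
)

def strip_embedded_files(text):
    """Remove blocks that start with a file-name line and end on a line of 'end'."""
    return block_re.sub("", text)

# e.g. _content1 = strip_embedded_files(_content1)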
Try using the following REGEX
_content1 = re.sub(r'(?i).*?\.+(xls|xlsx|pdf|zip|jpg|gif|xml)+?[\d\D]+?(end)',r'',_content1)
I've converted your * quantifier to the non-greedy *?, which is most likely what you want.
To be honest, this isn't anything dire, I just can't find anything on the web about it. I'm working on a big project right now in Python, and I need to comment out a large chunk of code for the moment until it can be implemented. It's about 500+ lines, so I'd really rather not have to go through adding #'s one by one if possible. I've seen posts on here stating that block commenting isn't built in, but is there any way to emulate it, or to easily get the same effect of commenting out a large section of code?
I'd use a decent text editor. Sublime Text lets me select a block and comment it out; # will be inserted on every line, and another command lets me revert the commenting.
If you are stuck with no decent editor, you could use a triple-quoted string:
"""This part turned into a string to ease commenting out
if ...:
    # 500 lines
""" # end of block string.
This will create a giant string object that is then not assigned to anything. You do need to make sure that the opening quotes are indented properly, and that the line following the closing quotes has valid indentation too.
Of course, this presumes that you don't already have a triple-quoted string using the same quoting style somewhere in those 500 lines; you can wrap ''' blocks in """ quotes and vice versa, but if you have existing text blocks using both styles, you'll have to escape those manually.
You can probably get away with putting it into a multi-line string. Or maybe indent it and put the whole thing under:
if False:
so that you can easily toggle it.
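For example:

if False:  # change to True to re-enable the block
    # the ~500 temporarily disabled lines go here, indented one level
    print("this never runs while the condition is False")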
But really, this is an editor's job. I have never seen a code editor that can't comment all lines in a selection.
I have a large (for my experience level, anyway) text file of astrophysical data and I'm trying to get a handle on Python/pandas. As a noob to Python, it's comin' along slowly. Here is a sample of the text file; it's 145 MB in total. When I try to read this into pandas I get confused, because I don't know whether to use pd.read_table(example.txt) or pd.read_csv(example.csv). In either case I can't call on a specific column without IPython freaking out, such as here. I know I'm doing something absent-minded. Can anyone explain what that might be? I've done this same procedure with smaller files and it works great, but this one seems to be limiting its output, or just not working at all.
Thanks.
It looks like your columns are separated by varying amounts of whitespace, so you'll need to specify that as the separator. Try read_csv('example.csv', sep=r'\s+'). \s+ is the regular expression for "one or more whitespace characters". Also, you should remove the # character from the beginning of the first line, as it will be read as an extra column and will mess up the reading.
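Something like this, for instance (the file name and the column name here are placeholders, since I don't know your actual headers):

import pandas as pd

# sep=r'\s+' splits on one or more whitespace characters.  This assumes the
# leading '#' has already been removed from the header line, as noted above.
df = pd.read_csv("example.txt", sep=r"\s+")

print(df.columns.tolist())     # sanity check: did the header parse correctly?
print(df["flux"].describe())   # "flux" is a placeholder column name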
I'm wrestling with Microsoft Word to display my Python code correctly and am in need of some help.
I am trying to paste large amounts of Python scripts into Microsoft Word with documentation text written around the snippets. Some of these Python snippets are a few lines, others are over a page long. Since the document is now around 500 pages long, there are around 200 snippet blocks scattered throughout it.
I have a custom font style applied to the snippets, so I can change font size, color, style, etc. for all of them at once. But I'm having a big problem with word wrapping. Long statements in Python get wrapped in Word, which makes them hard to read since the indents are lost. I am able to successfully indent a level 1 wrapped line using "hanging indents", but I cannot do anything about a level 2 or level 3 indent, since nested code is indented further.
Example (I've used dots instead of spaces because it kept removing them)
This is a statement
This is another statement
if (condition):
.........This is a third statement
.........This is a fourth statement
.........for loop :
..................This is a fifth statement
..................This is a sixth statement
..................if (condition):
...........................This is a seventh statement
Imagine each statement is fairly long and gets wrapped to the next line on a Word page. I get:
This is a statement
This is another statement
if (condition):
.........This is a third
statement
.........This is a fourth
statement
.........for loop:
.................This is
a fifth statement
.................This is a
sixth statement
How can I fix this in Word? A hanging indent will fix the level 1 indents (the statements in the if condition) but not the level 2 indents (the statements in the for loop).
Note: I would like to use some sort of option, plugin, or macro within Word. I cannot use a code editor and copy and paste the code in RTF or some other format. Even if I did this 200 times to replace all my code snippets, the moment I change the font size in my document everything would get messed up again. Another option would be some IDE that embeds into or links with Word (without having to copy and paste) and allows changes to font style and size in its own environment, which would then be updated at all occurrences in Word automatically.
Please help if you can. I have searched like crazy and found nothing that works...
1) Follow PEP-8 recommendations and keep lines < 80 characters.
Sometimes this seems very difficult or inconvenient. In those cases, allow yourself up to 90-95 characters. Longer lines are usually the result of poor code design or poorly chosen variable names. (There are people working with standard line lengths of up to 120 characters, but they are probably not trying to publish the code in Word in portrait mode.)
2) Use a monospaced font.
3) Keep the font size small enough to provide 80-95 characters per line.
Have you tried using Word to draft a plain-text document? You can always convert it later.
Write your code in a Python-enabled code editor with syntax highlighting. Save your snippets. Take screenshots. Paste them into MS Word. Resize and crop the images as desired.
Now all you have to do is fight MS Word on the word-wrapping around images, which is a fight you might even win.
Use docutils.
Instead of fighting against MS Word (and other WYSIWYG editors), it's far, far easier to use docutils.
Write your document in approximately plain text. You'll use RST (reStructuredText) markup, which is very simple and lightweight.
Run the rst2html.py conversion to create nice-looking HTML pages from your source.
Run the rst2latex.py conversion to create LaTeX from your source. There are a variety of tools that can produce PDF from the LaTeX.
In this case, the code snippets are handled perfectly every single time. No work.
If you're writing something really big and complex, you should be using Sphinx for this. It's an extension to docutils with even more cool markup features for code snippets.
I don't use Word, but in LibreOffice you could just use paragraph formatting: create a new paragraph style for each level of indent (pycode, pycode_indent1, ...). Put all of the formatting you want (mono-spaced font, no paragraph spacing, etc.) in the top-level style, and make the indented styles use it as a parent. Then just add the appropriate indent to each of the child styles. This is basically the same idea as a multi-depth bulleted list, without the bullets. Then select the appropriate indent paragraph style for each line (hint: you can select multiple non-contiguous lines using Ctrl+mouse in LibreOffice).
Granted, this way you have to do it line by line, which could be a big pain in the arse. But it might work if it's just a few snippets that are being problematic.
Coming from Perl, I've been used to hitting C-c t to reformat my code according to pre-defined Perl::Tidy rules. Now, with Python, I'm astonished to learn that there is nothing that even remotely resembles the power of Perl::Tidy. PythonTidy 1.20 looks almost appropriate, but barfed at the first misaligned line ("unexpected indent").
In particular, I'm looking for the following:
Put PEP-8 into use as far as possible (the following items are essentially derivations of this one)
Convert indentation tabs to spaces
Remove trailing spaces
Break up code according to the predefined line-length as far as it goes (Eclipse-style string splitting and splitting method chains)
Normalize whitespace around
(bonus feature, optional) Re-format code including indentation.
Right now, I'm going through someone else's code and correcting everything pep8 and pyflakes tell me about, which is mostly "remove trailing space" and "insert additional blank line". While I know that re-indentation is not trivial in Python (even though it should be possible just by going through the code and remembering the indentation), the other features seem easy enough that I can't believe nobody has implemented this before.
Any recommendations?
Update: I'm going to take a deeper look at PythonTidy, since it seems to go into the right direction. Maybe I can find out why it barfs at me.
There is a reindent.py script distributed with Python in the Tools/scripts directory.
untabify.py (Tools/scripts/untabify.py from the root directory of a Python source distribution) should fix the tabs, which may be what's stopping Python Tidy from doing the rest of the work.
Have you tried creating a wrapper around PythonTidy? There's one for the Sublime Text editor here.
Also, does PythonTidy break up long lines properly for you? When I have a long line that ends in a tuple, it creates a new line for every entry in the tuple, instead of using Python's implied line continuation inside parentheses, brackets and braces, as suggested by PEP-8.
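Something like this, with made-up names, is what I mean:

first_value, second_value, third_value, fourth_value = 1, 2, 3, 4

# What I get back (every tuple entry on its own line):
coords = (first_value,
          second_value,
          third_value,
          fourth_value)

# What I'd prefer (wrap only where needed, using the implied line
# continuation inside the parentheses):
coords = (first_value, second_value,
          third_value, fourth_value)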
I have used autopep8 for this purpose and found it handy.
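For example, it can be run on a source string from Python as well as from the command line (autopep8 --in-place yourfile.py); a minimal sketch:

import autopep8

messy = "x=1;  y = [ 1,2 ,3 ]\ndef f( a ) :\n    return a  \n"

# fix_code() returns a cleaned-up copy of the source string (trailing
# whitespace removed, spacing around operators and commas normalized, etc.).
print(autopep8.fix_code(messy))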
So I am trying to write a rich text editor in PyGTK, and originally used the older, third-party script InteractivePangoBuffer from Gourmet to do this. While it worked alright, there were still plenty of bugs which made it frustrating to use at times, so I decided to write my own using text tags. I have got them displaying and generally working alright, but now I am stuck trying to figure out how to export them to a file when saving. I've seen that others have had the same problem, though I haven't seen any solutions. I haven't come across any function (built-in or otherwise) that comes close to giving me the starting and ending position of each piece of text that has a TextTag applied to it, which is what I need.
I have come up with one idea which should theoretically work, by walking the text by utilizing gtk.TextBuffer.get_iter_at_offset(), gtk.TextIter.get_offset(), gtk.TextIter.begins_tag(), and gtk.TextIter.ends_tag() in order to check each and every character to see if it begins or ends a tag and, if so, put the appropriate code. This would be horribly inefficient and slow, especially on larger documents, however, so I am wondering if anyone has any better solutions?
You can probably use gtk.TextIter.forward_to_tag_toggle(). That is, loop over all the tags you have and, for each tag, scan the buffer for the positions where it is toggled.
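A rough, untested sketch of that idea (get_tag_ranges is just an illustrative name, not part of PyGTK):

def get_tag_ranges(buf, tag):
    """Return (start_offset, end_offset) pairs for every run of `tag` in buf."""
    ranges = []
    it = buf.get_start_iter()
    # Handle a tag that is already "on" at the very start of the buffer.
    start = it.get_offset() if it.begins_tag(tag) else None
    while it.forward_to_tag_toggle(tag):   # returns False once no toggles remain
        if it.begins_tag(tag):
            start = it.get_offset()
        elif it.ends_tag(tag) and start is not None:
            ranges.append((start, it.get_offset()))
            start = None
    return ranges

# Usage idea: collect ranges for every tag registered in the buffer's tag table:
# tags = []
# textbuffer.get_tag_table().foreach(lambda tag, data: tags.append(tag), None)
# spans = dict((tag, get_tag_ranges(textbuffer, tag)) for tag in tags)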