How to ensure two line breaks between each paragraph in Python

I am reading txt files into Python and want to make the paragraph breaks consistent. Sometimes there are one, two, three, four... occasionally several tens or hundreds of blank lines between paragraphs.
It is obviously easy to strip out all the breaks, but I can only think of clumsy ways of making everything exactly two breaks (i.e. a single blank line between each paragraph): either specifying multiple strips/replaces for the different possible combinations of breaks, which gets unwieldy when the number of breaks is very large, or iteratively removing excess breaks until only two are left, which I guess would be slow and not particularly scalable to many tens of thousands of txt files.
Is there a reasonably fast and simple way of achieving this?

import re
# Collapse any run of two or more newline characters into exactly two.
x = re.sub(r"([\r\n]){2,}", r"\1\1", x)
You can try this. Here x will be your string containing all the paragraphs.
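A quick demonstration on a small, made-up sample string (hypothetical input):
x = "First paragraph.\n\n\n\n\nSecond paragraph.\n\n\nThird paragraph."
print(re.sub(r"([\r\n]){2,}", r"\1\1", x))
This prints the three paragraphs separated by exactly one blank line each.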

Here's one way.
import os

with open("text.txt") as f:
    r = f.read()

# Drop empty chunks, then rejoin with exactly one blank line between paragraphs.
pars = [p for p in r.split(os.linesep) if p]
print((os.linesep * 2).join(pars))
This is assuming that by "paragraph" we mean a block of text not containing a line break.
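Note that opening a file in text mode normalizes line endings to "\n", so on Windows os.linesep ("\r\n") may never actually appear in the string read from the file. A minimal variant of the same idea that splits on "\n" instead:
with open("text.txt") as f:
    # Text mode already translates platform line endings to "\n".
    pars = [p for p in f.read().split("\n") if p.strip()]
print("\n\n".join(pars))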

Related

Finding sub-strings in LARGE string

# read in a csv file of the form "case, num, val \n case1, 1, baz\n..."
# convert to the form FOO = "casenumval..." (roughly 6 million characters)
for someString in List:  # 60,000 substrings
    if someString not in FOO:
        pass  # do stuff
    else:
        pass  # do other stuff
So my issue is that there are far too many substrings to check against this massive string. I have tried reading the file line by line and checking the substrings against each line, but this still crashes the program. Are there any techniques for checking a lot of substrings against a very large string efficiently?
FOR CONTEXT:
I am performing data checks; suspect data is saved to a csv file to be reviewed/changed. This reviewed/changed file is then compared to the original file. Data which has not changed has been verified as good and must be saved to a new "exceptionFile". Data that has been changed and now passes is disregarded. And data which has been changed, is checked, and is still suspect is then sent off for review again.
The first thing you should do is convert your list of 60,000 strings to search for into one big regular expression:
import re
searcher = re.compile("|".join(re.escape(s) for s in List))
Now you can search for them all at once:
for m in searcher.finditer(FOO):
    print(m.group(0))  # prints the substring that matched
If all you care about is knowing which ones were found:
print(set(m.group(0) for m in searcher.finditer(FOO)))
This is still doing substantially more work than the absolute minimum, but it should be much more efficient than what you were doing before.
Also, if you know that your input is a CSV file and you also know that none of the strings-to-search-for contain a newline, you can operate line by line, which may or may not be faster than what you were doing depending on conditions, but will certainly use less memory:
with open("foo.csv") as FOO:
for line in FOO:
for m in searcher.finditer(line):
# do something with the substring that matched

Splitting a file with multiple but not all delimiters in Python

I know there have been several answers to questions regarding multiple delimiters, but my issue involves needing to split on several delimiters but not all of them. I have a file that contains the following:
((((((Anopheles_coluzzii:0.002798,Anopheles_arabiensis:0.005701):0.001405,(Anopheles_gambiae:0.002824,Anopheles_quadriannulatus:0.004249):0.002085):0,Anopheles_melas:0.008552):0.003211,Anopheles_merus:0.011152):0.068265,Anopheles_christyi:0.086784):0.023746,Anopheles_epiroticus:0.082921):1.101881;
It is newick format so all information is in one long line. What I would like to do is isolate all the numbers that follow another number. So for example the first number I would like to isolate is 0.001405. I would like to put that in a list with all the other numbers that follow a number (not a name etc).
I tried to use the following code:
with open("file.nh", "r") as f:
for line in f:
data = line
z = re.findall(r"[\w']+", data)
The issue here is that this splits on "." as well as on the other delimiters, which is a problem because all the numbers I require have decimal points.
I considered going along with this and converting the numbers in the list to ints and then removing all non-int values and 0 values. However, some of the files contain 0 as a value that needs to be kept.
So is there a way of choosing which delimiters to use and which to avoid when multiple delimiters are required?
It's not necessary to split on multiple-but-not-all delimiters if you set up your regex to capture the wanted parts directly. By your definition, the numbers you want are exactly those that come after ):. Using the re module, a possible solution is this:
with open("file.nh", "r") as f:
for line in f:
z = re.findall(r"\):([0-9.]+)", line)
print(z)
The result is:
['0.001405', '0.002085', '0', '0.003211', '0.068265', '0.023746', '1.101881']
r"\):([0-9.]+)" is searching for ): followed by a part with numbers or decimal point. The second part is the result and is therefore inside parenthesis.
As Alex Hall mentioned, in most cases it's not a good idea to use a regex if the data is well structured; look for a library that works with the given data format instead.
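For example, here is a hedged sketch using Biopython's Bio.Phylo (assuming Biopython is installed and the file is valid Newick). It parses the tree and reads the branch lengths of the internal nodes, which are the numbers that follow ): in the file:
from Bio import Phylo  # assumption: Biopython is available

tree = Phylo.read("file.nh", "newick")
# Branch lengths of internal (non-leaf) clades; None means no length was given.
lengths = [c.branch_length for c in tree.get_nonterminals() if c.branch_length is not None]
print(lengths)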

Deleting prior two lines of a text file in Python where certain characters are found

I have a text file that is an ebook. Sometimes full sentences are broken up by two new lines, and I am trying to get rid of these extra new lines so the sentences are not separated mid-sentence by the new lines. The file looks like
Here is a regular sentence.
This one is fine too.
However, this
sentence got split up.
If I hit delete twice on the keyboard, it'd fix it. Here's what I have so far:
with open("i.txt","r") as input:
with open("o.txt","w") as output:
for line in input:
line = line.strip()
if line[0].isalpha() == True and line[0].isupper() == False:
# something like hitting delete twice on the keyboard
output.write(line + "\n")
else:
output.write(line + "\n")
Any help would be greatly appreciated!
If you can read the entire file into memory, then a simple regex will do the trick: replace any sequence of newlines that precedes a lowercase letter with a single space:
import re
i = open('i.txt').read()
o = re.sub(r'\n+(?=[a-z])', ' ', i)
open('o.txt', 'w').write(o)
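A quick demonstration of the lookahead on the split sentence from the example above (made-up sample string):
sample = "However, this\nsentence got split up.\n"
# (?=[a-z]) only looks ahead, so the lowercase letter that starts the
# continuation is not consumed and stays in the output.
print(re.sub(r'\n+(?=[a-z])', ' ', sample))
This prints: However, this sentence got split up.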
The important difference here is that you are not in an editor but writing out lines; that means you can't 'go back' and delete things, you have to recognise that they are wrong before you write them.
In this case, you will iterate once per line of the file. Each time you will get a string ending in \n to indicate a newline, and for lines that were split mid-sentence you want to remove that newline. The easiest way I can see to identify those lines is that there is no full stop at the end. If we check for that and strip the newline off in that case, we get the result you want:
with open("i.txt", "r") as input, open("o.txt", "w") as output:
for line in input:
if line.endswith(".\n"):
output.write(line)
else:
output.write(line.rstrip("\n"))
Obviously, sometimes it will not be possible to tell in advance that you need to make a change. In such cases, you will either need to make two passes over the file - the first to find where you want to make changes, the second to make them - or store (part of, or all of) the file in memory until you know what you need to change. Note that if your files are extremely large, storing them in memory could cause problems.
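A minimal sketch of that keep-it-in-memory variant, assuming the file fits in memory and using the same "line does not end with a full stop" heuristic:
with open("i.txt") as f:
    lines = f.readlines()

joined = []
for line in lines:
    if joined and not joined[-1].rstrip().endswith("."):
        # The previous line ended mid-sentence: glue this line onto it.
        joined[-1] = joined[-1].rstrip("\n") + " " + line
    else:
        joined.append(line)

with open("o.txt", "w") as out:
    out.writelines(joined)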
You can use fileinput if you just want to fix the lines in the original file in place:
from __future__ import print_function
import fileinput

for line in fileinput.input("i.txt", inplace=True):
    if not line.rstrip().endswith("."):
        # Line was split mid-sentence: print it without the newline, plus a space.
        print(line.rstrip(), end=" ")
    else:
        print(line, end="")
Output:
Here is a regular sentence.
This one is fine too.
However, this sentence got split up.

(How much) does it matter which is sorted through first when checking 2 lists against each other?

I have a list of 800 elements that I'm looking for in approximately 50k files, each approximately 50 lines long. (These are XML tags with non-generic names - the search is simple, so I'm not using Beautiful Soup.)
The list of 800 elements is shortened each time one is found.
Iterating through the files,
does it matter which I go through first - checking each line against all possible elements (check the line for "spot", "rover", "fido", etc.) or going through all lines checking for one element at a time (e.g. check all lines in the file for "spot", then check all lines for "rover", etc.)?
Or is this altogether inefficient? (This is using Python.)
I was thinking of:
for line in somefile:
    for element in somelist:
        if re.search(element, line):
            ....
or:
for element in somelist:
    for line in somefile:
        if re.search(element, line):
            ....
You generally leave the larger dataset as the one that's sequentially accessed, and keep the values you're interested in, in-memory or as an index of the larger dataset. So yes, it does matter, and in your example, you're looking to scan the file multiple times, which is a lot slower.
Let's take an example where each of those files is 50 lines long and you have 800 "words" that you're looking for.
for filename in filenames:
    for line in open(filename):
        if any(word in line for word in words):
            pass  # do something
Since words is in-memory and easy to scan, it's a lot better than opening each file 800 times over - which is an expensive operation.
So, I guess I should phrase it this way: try to sequentially scan the "most expensive" dataset (which may not be the longest) as few times as possible.
The big-O notation, which describes the complexity of the algorithm, is the same either way, but if one of your iterables (for example, the file) is both a lot slower to access and likely larger than the other, you should take pains to iterate over it as few times as possible, i.e., once.
Barring that, the algorithm may be easier to write or understand one way or the other. For example, if you want a list of all the strings in a list that match any regex, it will be easier to iterate over the string list first and check each regex against each line, breaking out of the inner loop when one matches.
Actually the whole task can be a one-liner when you iterate it that way:
foundlines = [line for line in inputlines if any(r.search(line) for r in regexes)]
As a bonus you'll get the fastest iteration Python is capable of by using the list comprehension/generator expression, and any().
Iterating over the regexes first, it is most natural to make a list of lists of lines that match each regex, or else one big list (with duplicates) of lines that have matched any regex, including more than one. If you want to end up with a list of lines that match at most one regex, then you will need to eliminate duplicates somehow (either during the iteration or afterward) which will affect the complexity of the algorithm. The results will likely come out in a different order as well, which may be a concern.
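For illustration, a sketch of that regex-first orientation, reusing the assumed names inputlines and regexes from above - one list of matching lines per regex, where a line may appear in more than one list:
matches_by_regex = [
    [line for line in inputlines if r.search(line)]
    for r in regexes
]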
In short, when the performance characteristics of the iterables are equivalent, choose the approach that best suits the problem you are trying to solve.
The order of complexity is O(n*m) either way, where n and m represent the number of entries in your list and in your file, so in terms of big-O it does not matter which one you loop over first.

re.findall regex hangs or very slow

My input file is a large txt file with concatenated texts I got from an open text library. I am now trying to extract only the content of the book itself and filter out other stuff such as disclaimers etc. So I have around 100 documents in my large text file (around 50 mb).
I then have identified the start and end markers of the contents themselves, and decided to use a Python regex to find me everything between the start and end marker. To sum it up, the regex should look for the start marker, then match everything after it, and stop looking once the end marker is reached, then repeat these steps until the end of the file is reached.
The following code works flawlessly when I feed a small, 100kb sized file into it:
import codecs
import re

outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()
When I use this regex operation on my larger file however, it will not do anything, the program just hangs. I tested it overnight to see if it was just slow and even after around 8 hours the program was still stuck.
I am very sure that the source of the problem is the
(.*?)
part of the regex, in combination with re.DOTALL.
When I use a similar regex on smaller distances, the script will run fine and fast.
My question now is: why is this just freezing up everything? I know the texts between the delimiters are not small, but a 50mb file shouldn't be too much to handle, right?
Am I maybe missing a more efficient solution?
Thanks in advance.
You are correct in thinking that the repeated use of .* is causing problems. The issue is that the regex engine tries many possible ways of dividing the text between the .* sequences, a failure mode known as catastrophic backtracking.
The usual solution is to replace the . with a character class that is much more specific, usually the production that you are trying to terminate the first .* with. Something like:
`[^\n]*(.*)`
so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular expression solution may not be the best approach, and instead use either a context-free grammar parser (such as pyparsing), or first break the input up into smaller, easier-to-digest chunks (for example, with corpus.split('\n')).
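One such chunk-based approach, sketched here under the assumption that the START/END marker text appears on its own line (file names as in the question), avoids backtracking entirely by scanning line by line and toggling a flag:
import codecs

inside = False
with codecs.open("infile.txt", "r", "utf-8-sig") as infile, \
     codecs.open("outfile.txt", "w", "utf-8-sig") as outfile:
    for line in infile:
        if "START OF THE PROJECT GUTENBERG EBOOK" in line:
            inside = True   # start copying after this marker line
        elif "END OF THE PROJECT GUTENBERG EBOOK" in line:
            inside = False  # stop copying until the next START marker
        elif inside:
            outfile.write(line)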
Another workaround to this issue is adding a sane limit to the number of matched characters.
So instead of something like this:
[abc]*.*[def]*
You can limit it to 1-100 instances per character group.
[abc]{1,100}.{1,100}[def]{1,100}
This won't work for every situation, but in some cases it's an acceptable quickfix.
