Concatenate lines together if they are not empty in a file - python

I have a file where some sentences are spread over multiple lines.
For example:
1:1 This is a simple sentence
[NEWLINE]
1:2 This line is spread over
multiple lines and it goes on
and on.
[NEWLINE]
1:3 This is a line spread over
two lines
[NEWLINE]
So I want it to look like this
1:1 This is a simple sentence
[NEWLINE]
1:2 This line is spread over multiple lines and it goes on and on.
[NEWLINE]
1:3 This is a line spread over two lines
Some lines are spread over 2, 3, or 4 lines. If the line that follows is not a blank line, it should be merged into one single line.
I would like to overwrite the given file or make a new file.
I've tried it with a while loop, but without success.
input = open(file, "r")
zin = ""
lines = input.readlines()
# Makes a list of the lines
for i in lines:
    while i != "\n":
        zin += i
        .....
But this creates an infinite loop.

You should not be nesting for and while loops in your use case. What happens in your code is that a line is assigned to the variable i by the for loop, but it isn't modified by the nested while loop, so if the while condition is true, it will remain true, and without a breaking condition you end up with an infinite loop.
A solution might look like this:
single_lines = []
current = []
for i in lines:
    i = i.strip()
    if i:
        current.append(i)
    else:
        if not current:
            continue  # treat multiple blank lines as one
        single_lines.append(' '.join(current))
        current = []
else:
    if current:
        # collect the last line if the file doesn't end with a blank line
        single_lines.append(' '.join(current))
Good ways of overwriting the input file are either to collect all output in memory, close the file after reading it, and reopen it for writing; or to write to a second file while reading the input, and rename the second file over the first after closing both.
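For instance, a minimal sketch of the first approach (read everything, close, then reopen the same file for writing; file is the same variable name used in the question):
input = open(file, "r")
lines = input.readlines()
input.close()

single_lines = []
current = []
for i in lines:
    i = i.strip()
    if i:
        current.append(i)
    else:
        if current:
            single_lines.append(' '.join(current))
            current = []
if current:
    single_lines.append(' '.join(current))

# Reopen the same file for writing, which truncates and overwrites it.
output = open(file, "w")
output.write('\n\n'.join(single_lines) + '\n')
output.close()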

Related

Splitting / Slicing Text File with Python

I'm learning Python. I've been trying to split this txt file into multiple files, grouped by a sliced string at the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 chars and is marked by a space at the end (as in WSON33 and JHSF3 etc...).
Here is an example of the file I would like to split (the first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so this can work:
This code here I adapted from another post, and it kind of works at breaking the input into multiple files, but it requires sorting the lines before I start writing files; I also need to copy the header into each file rather than isolating it in one file.
import itertools

with open('tordist.txt', 'r') as fin:
    # group each line in input file by first part of split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create file to write to suffixed with group number - start = 1
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # for each line in group write it to file
            for line in g:
                fout.write(line.strip() + '\n')
So from what I can gather, you have a text file with many lines, where every line begins with a short string of five or six characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file a number of times, and makes new files in the same folder as your text and Python files.
# Code which separates lines in a file by an identifier,
# and makes new files for each identifier group.
filename = input('type filename')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)

# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # Like Lalit said, split is a simple way to separate a longer string.
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])

# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier then it is
# added to a new file.
for item in Unique:
    # This 'if' skips the header, which has a '/' in it.
    if '/' not in item:
        # The .seek(0) 'rewinds' the file object, which is apparently
        # needed when looping through a file multiple times.
        filehandle.seek(0)
        # Make a new file named after the identifier.
        newfile = open(str(item) + ".txt", "w+")
        # Insert the header, and go to the next line.
        newfile.write(Unique[0])
        newfile.write('\n')
        # Go through the old file, and add relevant lines to the new file.
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
        newfile.close()
print(Unique)
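For reference, a more compact one-pass sketch of the same idea, keeping one open output handle per identifier. It assumes, as above, that the first line is the header and that the output files are named identifier + ".txt":
with open("mk_newfiles.txt") as filehandle:
    header = filehandle.readline()
    outputs = {}
    for line in filehandle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        key = parts[0]
        if key not in outputs:
            # first time this identifier appears: create its file and copy the header
            outputs[key] = open(key + ".txt", "w")
            outputs[key].write(header)
        outputs[key].write(line)
    # close every output handle
    for handle in outputs.values():
        handle.close()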

Threading/Multiprocessing - Match searching a 60gb file with 600k terms

I have a python script that would take ~93 days to complete on 1 CPU, or 1.5 days on 64.
I have a large file (FOO.sdf) and would like to extract the "entries" from FOO.sdf that match a pattern. An "entry" is a block of ~150 lines delimited by "$$$$". The desired output is 600K blocks of ~150 lines. The script I have now is shown below. Is there a way to use multiprocessing or threading to divvy up this task across many cores/CPUs/threads? I have access to a server with 64 cores.
name_list = []
c = 0
# Titles of text blocks I want to extract (form [..., '25163208', ...])
with open('Names.txt', 'r') as names:
    for name in names:
        name_list.append(name.strip())

# Writing the text blocks to this file
with open("subset.sdf", 'w') as subset:
    # Opening the large file with many text blocks I don't want
    with open("FOO.sdf", 'r') as f:
        # Loop through each line in the file
        for line in f:
            # Avoids appending extraneous lines or choking
            if line.split() == []:
                continue
            # Simply, this line checks whether the line matches any name in "name_list".
            # But since I expect this to be expensive, I only want the check to occur
            # if the line passes the first two conditions.
            if ("-" not in line.split()[0]) and (len(line.split()[0]) >= 5) and (line.split()[0] in name_list):
                c = 1  # when c == 1, lines should be written
            # Write this line to the output file
            if c == 1:
                subset.write(line)
                # Stop writing to the file once we see "$$$$"
                if line.split()[0] == "$$$$":
                    c = 0
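One common way to divide this kind of work is to group the lines into "$$$$"-delimited entries and fan those out to a multiprocessing.Pool; holding the names in a set also makes each membership test O(1), which by itself removes much of the cost. A rough sketch, assuming the same file names as above (not a drop-in answer):
import multiprocessing as mp

with open('Names.txt') as names:
    name_set = {name.strip() for name in names}

def entry_matches(entry):
    # entry is the full text of one "$$$$"-delimited block; keep it
    # if any line's first token is one of the wanted names.
    for line in entry.splitlines():
        tokens = line.split()
        if tokens and "-" not in tokens[0] and len(tokens[0]) >= 5 and tokens[0] in name_set:
            return entry
    return None

def entries(f):
    # Group raw lines into "$$$$"-delimited blocks; a trailing
    # partial block without a "$$$$" terminator is ignored.
    block = []
    for line in f:
        block.append(line)
        if line.split() and line.split()[0] == "$$$$":
            yield ''.join(block)
            block = []

if __name__ == '__main__':
    with open("FOO.sdf") as f, open("subset.sdf", "w") as subset:
        with mp.Pool() as pool:
            for kept in pool.imap(entry_matches, entries(f), chunksize=100):
                if kept is not None:
                    subset.write(kept)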

Checking For Next Line

I am attempting to loop over a series of text files I have and I want to do so by checking the value of the next line. The input from the text file looks like:
Person1
(COUNT)|key
1|************
Person2
(COUNT)|key
// and so on
Some people may have a key and others may not. I am trying to write a loop that checks for at least 3 consecutive lines before a blank line (people with keys, as in the Person1 example), where each line begins with a character, and I want to print only those cases.
My current loop looks like this:
for line in input:
    if re.match(r'\S', line):
        line1 = line
        print(line1)
        if re.match(r'\S', input.next()):
            line2 = line
            print(line2)
            if re.match(r'\S', input.next()):
                line3 = line
                print(line3)
However, I cannot seem to get this loop right. It seems to print the person three times and only sometimes print the key. I'm looking for any guidance available.
You can use enumerate to get the current index and be able to check the next lines too. You'll need to beware of the case when you reach the end of the file, though. Reading the lines into a list first is what makes the lookahead possible:
lines = input.readlines()
for i, line in enumerate(lines):
    if i == len(lines) - 1:
        break  # no next line to look at
    next_line = lines[i + 1]
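Building on that, a sketch of the three-consecutive-lines check from the question might look like this (window size 3, printing a group only when every line in the window is non-blank):
import re

lines = input.readlines()
for i in range(len(lines) - 2):
    window = lines[i:i + 3]
    # print the group only when all three consecutive lines start non-blank
    if all(re.match(r'\S', l) for l in window):
        print(''.join(window))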

Update iteration value in Python for loop

Pretty new to Python, and I have been writing up a script to pick out certain lines of a basic log file.
Basically, the function searches the lines of the file, and when it finds one I want to output to a separate file, it adds it to a list, then also adds the next five lines following it. This then gets written to a separate file at the end in a different function.
What I've been trying to do following that is jump the loop forward to continue from the last of those five lines, rather than going over them again. I thought the last line in the code would solve the problem, but unfortunately it did not.
Are there any recommended variations of a for loop I could use for this purpose?
import linecache

def readSingleDayLogs(aDir):
    print 'Processing files in ' + str(aDir) + '\n'
    lineNumber = 0
    try:
        open_aDirFile = open(aDir)  # open the log file
        for aLine in open_aDirFile:  # total the num. lines in file
            lineNumber = lineNumber + 1
        lowerBound = 0
        for lineIDX in range(lowerBound, lineNumber):
            currentLine = linecache.getline(aDir, lineIDX)
            if (bunch of logic conditions):
                issueList.append(currentLine)
                # loop over the next five lines after the error and append to issue list
                for extraLineIDX in range(1, 6):
                    # get the x-th extra line after the problem line
                    extraLine = linecache.getline(aDir, lineIDX + extraLineIDX)
                    issueList.append(extraLine)
                issueList.append('\n\n')
                lowerBound = lineIDX
You should use a while loop:
line = lowerBound
while line < lineNumber:
    ...
    if conditions:
        ...
        for lineIDX in range(line, line + 6):
            ...
        line = line + 6
    else:
        line = line + 1
A for loop uses an iterator over the range, so you don't have the ability to change the loop variable.
Consider using a while loop instead. That way, you can update the line index directly.
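A concrete sketch of that while-loop idea, reusing the question's aDir, lineNumber, and issueList, with is_issue standing in for the "(bunch of logic conditions)":
import linecache

lineIDX = 0
while lineIDX < lineNumber:
    currentLine = linecache.getline(aDir, lineIDX)
    if is_issue(currentLine):  # placeholder for the logic conditions
        issueList.append(currentLine)
        for extraLineIDX in range(1, 6):
            issueList.append(linecache.getline(aDir, lineIDX + extraLineIDX))
        issueList.append('\n\n')
        lineIDX += 6  # skip past the five lines just collected
    else:
        lineIDX += 1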
I would look at something like:
from itertools import islice

with open('somefile') as fin:
    line_count = 0
    my_lines = []
    for line in fin:
        line_count += 1
        if some_logic(line):
            my_lines.append(line)
            next_5 = list(islice(fin, 5))
            line_count += len(next_5)
            my_lines.extend(next_5)
This way, by using islice on the input, you're able to move the iterator ahead and resume after the 5 lines (perhaps fewer if near the end of the file) are exhausted.
This is based on my understanding that you read forward through the file, identify a line, only want a fixed number of lines after that point, and then resume looping as normal. (You may not even require the line counting, as it only appears to serve the getline call and no other purpose.)
If you do want to take the next 5 lines and still consider each of them in the main loop, you can use itertools.tee to branch at the point of the faulty line, islice the branch, and let the fin iterator resume on the next line.
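A rough sketch of that tee variant (some_logic is again a placeholder predicate; the loop is driven by explicit next() calls so the iterator can be swapped mid-stream):
from itertools import islice, tee

my_lines = []
with open('somefile') as f:
    fin = iter(f)
    line = next(fin, None)
    while line is not None:
        if some_logic(line):
            my_lines.append(line)
            # branch the iterator: 'peek' consumes the next five lines,
            # while the replaced 'fin' will still revisit them below
            fin, peek = tee(fin)
            my_lines.extend(islice(peek, 5))
        line = next(fin, None)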

Deleting Relative Lines with Regex

Using pdftotext, a text file was created that includes footers from the source pdf. The footers get in the way of other parsing that needs to be done. The format of the footer is as follows:
This is important text.
9
Title 2012 and 2013
\fCompany
Important text begins again.
The line for Company is the only one that does not recur elsewhere in the file. It appears as \x0cCompany\n. I would like to search for this line and remove it and the preceding three lines (the page number, title, and a blank line) based on where the \x0cCompany\n appears. This is what I have so far:
import re

report = open('file.txt').readlines()
data = range(len(report))
name = []
for line_i in data:
    line = report[line_i]
    if re.match('.*\\x0cCompany', line):
        name.append(report[line_i])
print name
This allows me to build a list of the lines where this occurs, but I do not understand how to delete those lines as well as the three preceding lines. It seems I need to create some other loop based on this one, but I cannot make it work.
Instead of iterating through and getting the indices of the lines you want to delete, iterate through your lines and append only the lines that you want to keep.
It would also be more efficient to iterate over your actual file object, rather than putting it all into one list:
import re

keeplines = []
with open('file.txt') as b:
    for line in b:
        if re.match('.*\\x0cCompany', line):
            keeplines = keeplines[:-3]  # shave off the three preceding lines
        else:
            keeplines.append(line)

with open('file.txt', 'w') as file:
    for line in keeplines:
        file.write(line)
