How to remove newlines but keep blank ones in a text file?

My question is essentially identical to the one found here, but I'd like to perform the operation using Python 3. The text in my file looks something like this:
'''
Chapter One ~~ Introductory
The institution of a leisure class is found in its best development at
the higher stages of the barbarian culture; as, for instance, in feudal...
'''
Per numerous suggestions I've found, I have tried:
with open('veblen_txt_test.txt', 'r') as src:
    with open('write_new.txt', 'w') as dest:
        for line in src:
            if len(line) > 0:
                line = line.replace('\n', ' ')
                dest.write(line)
            else:
                line = line + '\n\n'
                dest.write('%s' % (line))
But this returns:
'''
Chapter One ~~ Introductory The institution of a leisure class is found in its best development at the higher stages of the barbarian culture; as, for instance, in feudal...
'''
The intended output is:
'''
Chapter One ~~ Introductory
The institution of a leisure class is found in its best development at the higher stages of the barbarian culture; as, for instance, in feudal...
'''
I have tried using rstrip():
with open('veblen_txt_test.txt', 'r') as src:
    with open('write_new.txt', 'w') as dest:
        for line in src:
            if len(line) > 0:
                line = line.rstrip('\n')
                dest.write('%s%s' % (line, ' '))
            else:
                line = line + '\n\n'
                dest.write('%s' % (line))
But that yields the same result.
Most of the responses online address removing blank spaces, not keeping them; I have no doubt the solution is simple, but I've been trying different variations of the above code for about an hour and a half and just thought to ask the community. Thanks for your assistance!

Changing len(line) > 0 to len(line) > 1 does the job. A blank line read from the file is the single character \n, so its length is 1; the test len(line) > 0 is therefore true for every line and the else branch never runs. You'll also have to remove this line: line = line + '\n\n', as it adds extra blank lines (there are already two \n between Chapter One ~~ Introductory and The institution...).
Output:
Chapter One ~~ Introductory
The institution of a leisure class is found in its best development at the higher stages of the barbarian culture; as, for instance, in feudal...
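The fix described above can be sketched as a small function. This is a minimal sketch rather than the answerer's exact code: join_wrapped_lines is a hypothetical helper name, and the sample list assumes the original file has a blank line between the chapter heading and the body text.

```python
def join_wrapped_lines(lines):
    """Join non-blank lines with spaces; turn each blank line into a line break."""
    pieces = []
    for line in lines:
        if line.strip():
            # Non-blank line: drop its newline and add a space instead.
            pieces.append(line.rstrip('\n') + ' ')
        else:
            # Blank line (just '\n'): keep it as a line break.
            pieces.append('\n')
    return ''.join(pieces)


# Assumed sample: heading, blank line, then two wrapped lines of body text.
sample = [
    "Chapter One ~~ Introductory\n",
    "\n",
    "The institution of a leisure class is found in its best development at\n",
    "the higher stages of the barbarian culture; as, for instance, in feudal...\n",
]
result = join_wrapped_lines(sample)
print(result)
```

Because an open file is itself an iterable of lines, the same function could be used as dest.write(join_wrapped_lines(src)).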

Related

Get string between two identifiers on multiple lines with a line by line read

I have a huge text file which I need to read line by line for memory optimization.
I would like to get the string within two identifiers, as an example here between the identifiers '{' and '}':
input:
"
not this line
not this line
Pattern 'pattern' {
get this line
get this line
}
not this line
not this line
"
the output would be a string "get this line get this line "
There can be some other identifiers ('{', '}', '[', ...) inside the string, but I need the matching ones. For example, Pattern { something else {...} } would give something else {...} (the enclosed {...} is part of the string).
I have written a simple counter like this, but it is quite slow. I am looking for a faster way of doing this.
currentString = ""
counter = 0
def GetStringBetweenIdentifiers(string, identifierA, identifierB):
    global currentString, counter
    for i in string:
        if (i == identifierB):
            counter -= 1
        if(counter > 0):
            currentString += i
        if(i == identifierA):
            counter += 1
    if(counter==0):
        string = currentString
        currentString = ""
        return string
    return ""
with open(filePath) as read_obj:
    for num, line in enumerate(read_obj, 1):
        String = GetStringBetweenIdentifiers(line, '{', '}')
        if (String != ""):
            "Do something with the string"
To add some examples, there can be identifiers in the middle of the line, for example:
input:
"
not this line
not this line
Pattern 'pattern' { I want this
get this line { something here }
get this line
also this part } not this part
not this line
not this line
"
the output would be a string " I want this get this line { something here } get this line also this part"
Thank you for reading!

This kind of thing can be very tricky due to ambiguous sequences. For example... Let's say that the start of a sequence of interest is '{' and the end is '}'. Now imagine that you've observed a start sentinel then, before you see an end marker, you see another start marker. What do you do then?
Anyway, here's something that will work in the perfect world (which doesn't really exist but it might give some ideas).
My input file looks like this:
not this line
not this line
Pattern 'pattern' { I want this
get this line { something here }
get this line
also this part } not this part
not this line
not this line
...and the code like this...
START = '{'
END = '}'
capture = 0
data = []
section = []
with open('foo.txt') as txt:
    while (c := txt.read(1)):
        if c == START:
            if (capture := capture + 1) > 1:
                section.append(c)
        elif c == END:
            if (capture := capture - 1) < 0:
                print('ERROR: unable to process (too many end tags)')
                break
            if capture:
                section.append(c)
            elif section:
                data.append(section)
                section = []
        elif capture and c not in '\r\n':
            section.append(c)
for section in data:
    print(''.join(section))
...and this output....
I want this get this line { something here }get this line also this part
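Since the question asks for a line-by-line read, the same depth-counting idea can also be sketched as a generator that keeps its state locally instead of in globals. This is only an illustration under stated assumptions: extract_blocks is a hypothetical name, line breaks inside a block are turned into single spaces, and no claim is made about its speed on very large files.

```python
def extract_blocks(lines, start='{', end='}'):
    """Yield the text of each top-level start...end block, reading line by line."""
    depth = 0
    parts = []
    for line in lines:
        for ch in line.rstrip('\r\n'):
            if ch == start:
                depth += 1
                if depth == 1:
                    continue              # skip the outermost opening marker
            elif ch == end:
                depth -= 1
                if depth < 0:
                    raise ValueError('unbalanced end marker')
                if depth == 0:
                    yield ''.join(parts)  # outermost block is complete
                    parts = []
                    continue
            if depth > 0:
                parts.append(ch)          # nested markers are kept as text
        if depth > 0:
            parts.append(' ')             # line break inside a block becomes a space


sample = [
    "not this line\n",
    "Pattern 'pattern' { I want this\n",
    "get this line { something here }\n",
    "get this line\n",
    "also this part } not this part\n",
    "not this line\n",
]
blocks = list(extract_blocks(sample))
print(blocks)
```

An open file object can be passed in place of the sample list, so only one line is held in memory at a time, plus the text of the block currently being collected.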

Welcome to the world of regex. It's quirky, but highly effective. This works for your situation if the text you read contains only one capturable sequence, which may itself contain subsequences that could also be captured, as in your example. It will fail if there are independent sequences within the same input string, because it captures the outermost span it finds, from the first { to the last }. It would be a little more work to have it handle this case. (As they say, an exercise left to the interested reader.)
There is lots of good info in the Python docs, and this website is key for testing.
Aside: you may also want to look into the grep terminal command (not a Python solution). grep is highly effective at processing massive files and pulling out matches, and it works seamlessly with regex as well.
Anyhow:
import re
with open('dummy_text.txt', 'r') as src:
    lines = src.readlines()
composite_string = ''.join(lines)
print('loaded and working with:\n')
print(composite_string)
print()
pattern = r'{((?s:.*))}'
results = re.search(pattern, composite_string)
print(f'I found: {results.group(1)}')
Produces:
loaded and working with:
not this line
not this line
Pattern 'pattern' {
get this line
get {this} line
}
not this line
not this line
I found:
get this line
get {this} line
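The limitation mentioned above (independent sequences in the same input) can be seen in a short experiment. This is an illustrative sketch with made-up sample strings; it compares the greedy pattern from the answer with a non-greedy variant, which is an alternative introduced here for comparison and not part of the original answer.

```python
import re

text = "a {one} b {two} c"

# Greedy: spans from the first '{' to the last '}' in the text.
greedy = re.search(r'{((?s:.*))}', text).group(1)

# Non-greedy: stops at the nearest '}', so each flat block is found separately.
lazy = re.findall(r'{((?s:.*?))}', text)

nested = "x {a {b} c} y"

# Greedy handles a single nested block correctly...
greedy_nested = re.search(r'{((?s:.*))}', nested).group(1)

# ...while non-greedy cuts the nested block short at the first '}'.
lazy_nested = re.findall(r'{((?s:.*?))}', nested)

print(greedy, lazy, greedy_nested, lazy_nested, sep='\n')
```

Neither variant handles both nesting and several independent blocks, which is why a depth counter is the more general approach.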

Obtain tsv from text with a specific pattern

I'm a biologist and I need to extract information from a text file.
I have a file with plain text like this:
12018411
Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum.
  TAI|timed artificial insemination|0.999808
Two estrus-synchronization programs were compared and factors influencing their success over a year were evaluated. All cows received a setup injection of PGF2alpha at 39 +/- 3 d postpartum. Fourteen days later they received GnRH, followed in 7 d by a second injection of PGF2alpha. Cows (n = 523) assigned to treatment 1 (modified targeted breeding) were inseminated based on visual signs of estrus at 24, 48, or 72 h after the second PGF2alpha injection. Any cow not observed in estrus was inseminated at 72 h. Cows (n = 440) assigned to treatment 2 received a second GnRH injection 48 h after the second PGF2alpha, and all were inseminated 24 h later. Treatment, season of calving, multiple birth, estrual status at insemination, number of occurrences of estrus before second PGF2alpha, prophylactic use of PGF2alpha, retained fetal membranes, and occurrence of estrus following the setup PGF2alpha influenced success. Conception rate was 31.2% (treatment 1) and 29.1% (treatment 2). A significant interaction occurred between protocol and estrual status at insemination. Cows in estrus at insemination had a 45.8% (treatment 1) or 35.4% (treatment 2) conception rate. The conception rate for cows not expressing estrus at insemination was 19.2% (treatment 1) and 27.7% (treatment 2). Provided good estrous detection exists, modified targeted breeding can be as successful as other timed artificial insemination programs. Nutritional, environmental, and management strategies to reduce postpartum disorders and to minimize the duration of postpartum anestrus are critical if synchronization schemes are used to program first insemination after the voluntary waiting period.
8406022
Deletion of the beta-turn/alpha-helix motif at the exon 2/3 boundary of human c-Myc leads to the loss of its immortalizing function.
The protein product (c-Myc) of the human c-myc proto-oncogene carries a beta-turn/alpha-helix motif at the exon2/exon3 boundary. The amino acid (aa) sequence and secondary structure of this motif are highly conserved among several nuclearly localized oncogene products, c-Myc, N-Myc, c-Fos, SV40 large T and adenovirus (Ad) Ela. Removal of this region from Ad E1a results in the loss of the transforming properties of the virus without destroying its known transregulatory functions. In order to analyse whether deletion of the above-mentioned region from c-Myc has a similar effect on its transformation activity, we constructed a deletion mutant (c-myc delta) lacking the respective aa at the exon2/exon3 boundary. In contrast to the c-myc wild-type gene product, constitutive expression of c-myc delta does not lead to the immortalization of primary mouse embryo fibroblast cells (MEF cells). This result indicates that c-Myc and Ad El a share a common domain which is involved in the transformation process by both oncogenes.
  aa|amino acid|0.99818
  Ad|adenovirus|0.96935
  MEF cells|mouse embryo fibroblast cells|0.994648
The first line is the ID, the second line is the title, the third line is usually the abstract (sometimes the abbreviations come first), and the last lines (if there are any) are abbreviations. Each abbreviation line starts with two spaces, followed by the abbreviation, its meaning, and a number, separated by |. For example:
  GA|general anesthesia|0.99818
Then there is a blank line and the pattern starts again: ID, Title, Abstract, Abbreviations or ID, Title, Abbreviations, Abstract.
I need to take this data and convert it to a TSV file like this:
12018411 TAI timed artificial insemination
8406022 aa amino acid
8406022 Ad adenovirus
... ... ...
The first column is the ID, the second column is the abbreviation, and the third column is the meaning of the abbreviation.
I tried to load the text into a DataFrame first and then convert it to TSV, but I don't know how to extract the information from the text in the structure I need.
I also tried this code:
from collections import namedtuple
import pandas as pd
Item= namedtuple('Item', 'ID')
items = []
with open("identify_abbr-out.txt", "r", encoding='UTF-8') as f:
    lines= f.readlines()
    for line in lines:
        if line== '\n':
            ID= ¿nextline?
        if line.startswith("  "):
            Abbreviation = line
            items.append(Item(ID, Abbreviation))
df = pd.DataFrame.from_records(items, columns=['ID', 'Abbreviation'])
But I don't know how to read the next line, and the code does not work because there are sometimes blank lines between the title and the abstract.
I'm using Python 3.8.
Thank you very much in advance.

Assuming test.txt has your input data, I used simple file read functions to process the data:
with open('test.txt', 'r') as file1:
    Lines = file1.readlines()
outputlines = []
outputline = ""
counter = 0  # position of the current line within its record
for l in Lines:
    if l.strip() == "":
        # blank line: the next non-blank line starts a new record
        outputline = ""
        counter = 0
    elif counter == 0:
        # first line of a record is the ID
        outputline = outputline + l.strip() + "|"
        counter = counter + 1
    elif counter == 1:
        # second line is the title, which is skipped
        counter = counter + 1
    else:
        # abbreviation rows start with two spaces and have 3 "|"-separated fields
        if len(l.split("|")) == 3 and l[0:2] == "  ":
            outputlines.append(outputline + l.strip() + "\n")
        counter = counter + 1
with open('myfile.txt', 'w') as file1:
    file1.writelines(outputlines)
Here the file is read line by line. A counter is kept and reset whenever there is a blank line, and the ID is read from the line right after it. If a row has 3 fields separated by "|" and starts with two spaces, it is exported together with the ID.
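To get the three tab-separated columns the question asks for (ID, abbreviation, meaning), a variation on the same idea could look like the sketch below. It rests on two assumptions that are not guaranteed by the question: every ID is a line consisting only of digits, and every abbreviation line starts with two spaces and has exactly three "|"-separated fields. The name abbreviation_rows and the shortened sample records are made up for illustration.

```python
import csv
import io


def abbreviation_rows(lines):
    """Yield (record_id, abbreviation, meaning) for each abbreviation line."""
    record_id = None
    for line in lines:
        text = line.strip()
        if text.isdigit():
            # Assumption: a line made only of digits is a record ID.
            record_id = text
        elif record_id is not None and line.startswith('  ') and text.count('|') == 2:
            abbreviation, meaning, _score = text.split('|')
            yield record_id, abbreviation, meaning


# Shortened sample records; titles and abstracts are abbreviated here.
sample = [
    "12018411\n",
    "Comparison of two timed artificial insemination (TAI) protocols.\n",
    "  TAI|timed artificial insemination|0.999808\n",
    "Two estrus-synchronization programs were compared.\n",
    "\n",
    "8406022\n",
    "Deletion of the beta-turn/alpha-helix motif.\n",
    "The protein product (c-Myc) carries a beta-turn/alpha-helix motif.\n",
    "  aa|amino acid|0.99818\n",
    "  Ad|adenovirus|0.96935\n",
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerows(abbreviation_rows(sample))
tsv_text = buffer.getvalue()
print(tsv_text)
```

Because the ID is recognized by its content and not by its position, blank lines between the title and the abstract do not disturb the parse. To write a real file, an open file object can take the place of the StringIO buffer.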

Python Code to Write to file with Left and Right Margins and Fixed Line Length

I am writing a Python program that reads a file and then writes its contents to another one, with added margins. The margins are user-input and the line length must be at most 80 characters.
I wrote a recursive function to handle this. For the most part, it is working. However, the 2 lines before any new paragraph display the indentation that was input for the right side, instead of keeping the left indentation.
Any clues on why this happens?
Here's the code:
left_Margin = 4
right_Margin = 5
# create variable to hold the number of characters to withhold from line_Size
avoid = right_Margin
num_chars = left_Margin
def insertNewlines(i, line_Size):
    string_length = len(i) + avoid + right_Margin
    if len(i) <= 80 + avoid + left_Margin:
        return i.rjust(string_length)
    else:
        i = i.rjust(len(i)+left_Margin)
        return i[:line_Size] + '\n' + ' ' * left_Margin + insertNewlines(i[line_Size:], line_Size)
with open("inputfile.txt", "r") as inputfile:
    with open("outputfile.txt", "w") as outputfile:
        for line in inputfile:
            num_chars += len(line)
            string_length = len(line) + left_Margin
            line = line.rjust(string_length)
            words = line.split()
            # check if num of characters is enough
            outputfile.write(insertNewlines(line, 80 - avoid - left_Margin))
For input of left_Margin=4 and right_Margin = 5, I expect this:
____Poetry is a form of literature that uses aesthetic and rhythmic
____qualities of language—such as phonaesthetics, sound symbolism, and
____metre—to evoke meanings in addition to, or in place of, the prosai
____c ostensible meaning.
____Poetry has a very long history, dating back to prehistorical ti
____mes with the creation of hunting poetry in Africa, and panegyric an
____d elegiac court poetry was developed extensively throughout the his
____tory of the empires of the Nile, Niger and Volta river valleys.
But the result is:
____Poetry is a form of literature that uses aesthetic and rhythmic
______qualities of language—such as phonaesthetics, sound symbolism, and
______metre—to evoke meanings in addition to, or in place of, the prosai
________c ostensible meaning.
_____Poetry has a very long history, dating back to prehistorical ti
_____mes with the creation of hunting poetry in Africa, and panegyric an
_____d elegiac court poetry was developed extensively throughout the his
_____tory of the empires of the Nile, Niger and Volta river valleys.

This isn't really a good fit for a recursive solution in Python. Below is an imperative/iterative solution of the formatting part of your question (I'm assuming you can take this and write it to a file instead). The code assumes that paragraphs are indicated by two consecutive newlines ('\n\n').
txt = """
Poetry is a form of literature that uses aesthetic and rhythmic qualities of language—such as phonaesthetics, sound symbolism, and metre—to evoke meanings in addition to, or in place of, the prosaic ostensible meaning.

Poetry has a very long history, dating back to prehistorical times with the creation of hunting poetry in Africa, and panegyric and elegiac court poetry was developed extensively throughout the history of the empires of the Nile, Niger and Volta river valleys.
"""
def format_paragraph(paragraph, length, left, right):
    """Format ``paragraph`` so the line length is at most ``length``
    with ``left`` as the number of characters for the left margin,
    and similarly for ``right``.
    """
    words = paragraph.split()
    lines = []
    curline = ' ' * (left - 1)  # we add a space before the first word
    while words:
        word = words.pop(0)  # process the next word
        # +1 in the next line is for the space.
        if len(curline) + 1 + len(word) > length - right:
            # line would have been too long, start a new line
            lines.append(curline)
            curline = ' ' * (left - 1)
        curline += " " + word
    lines.append(curline)
    return '\n'.join(lines)
# we need to work on one paragraph at a time
paragraphs = txt.split('\n\n')
print('0123456789' * 8)  # print a ruler..
for paragraph in paragraphs:
    print(format_paragraph(paragraph, 80, left=4, right=5))
    print()  # next paragraph
The output of the above is:
01234567890123456789012345678901234567890123456789012345678901234567890123456789
    Poetry is a form of literature that uses aesthetic and rhythmic
    qualities of language—such as phonaesthetics, sound symbolism, and
    metre—to evoke meanings in addition to, or in place of, the prosaic
    ostensible meaning.

    Poetry has a very long history, dating back to prehistorical times with
    the creation of hunting poetry in Africa, and panegyric and elegiac
    court poetry was developed extensively throughout the history of the
    empires of the Nile, Niger and Volta river valleys.
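The standard library's textwrap module can do the same word-based wrapping. The sketch below is an alternative to the hand-written loop, not the answerer's method; wrap_with_margins is a hypothetical name, and like the answer it assumes paragraphs are separated by a blank line. The sample text is shortened from the question.

```python
import textwrap


def wrap_with_margins(text, width=80, left=4, right=5):
    """Wrap each blank-line-separated paragraph, leaving left/right margins."""
    indent = ' ' * left
    wrapped = []
    for paragraph in text.split('\n\n'):
        words = ' '.join(paragraph.split())   # collapse newlines and extra spaces
        if not words:
            continue                          # skip empty paragraphs
        wrapped.append(textwrap.fill(
            words,
            width=width - right,              # keep the right margin free
            initial_indent=indent,
            subsequent_indent=indent,
        ))
    return '\n\n'.join(wrapped)


sample = (
    "Poetry is a form of literature that uses aesthetic and rhythmic "
    "qualities of language to evoke meanings in addition to, or in place of, "
    "the prosaic ostensible meaning.\n\n"
    "Poetry has a very long history, dating back to prehistorical times."
)
result = wrap_with_margins(sample)
print(result)
```

Passing width - right as the wrap width means no line extends into the right margin, while the two indent arguments supply the left margin on every line.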

Remove quotes holding 2 words and remove comma between them

Following up on "Python to replace a symbol between 2 words in a quote".
Extended input and expected output:
I am trying to replace the comma between the 2 words Durango and PC in the second line with & and then remove the quotes " as well. The same goes for the third line with Orbis and PC. The 4th line has 2 word combos in quotes that I would like to process: "AAA - Character Tech, SOF - UPIs","Durango, Orbis, PC"
I would like to retain the rest of the lines as they are, using Python.
INPUT
2,SIN-Rendering,Core Tech - Rendering,PC,147,Reopened
2,Kenny Chong,Core Tech - Rendering,"Durango, PC",55,Reopened
3,SIN-Audio,AAA - Audio,"Orbis, PC",13,Open
LTY-168499,[PC][PS4][XB1] Missing textures from Fort Capture NPC face,3,CTU-CharacterTechBacklog,"AAA - Character Tech, SOF - UPIs","Durango, Orbis, PC",29,Waiting For
...
...
...
There can be 100 lines like these in my sample. The expected output is:
2,SIN-Rendering,Core Tech - Rendering,PC,147,Reopened
2,Kenny Chong,Core Tech - Rendering, Durango & PC,55,Reopened
3,SIN-Audio,AAA - Audio, Orbis & PC,13,Open
LTY-168499,[PC][PS4][XB1] Missing textures from Fort Capture NPC face,3,CTU-CharacterTechBacklog,AAA - Character Tech & SOF - UPIs,Durango, Orbis & PC,29,Waiting For
...
...
...
So far, I could think of reading line by line and, if the line contains a quote, replacing the quote with nothing, but I am stuck on replacing the symbol inside the quotes.
Here is what I have right now:
for line in lines:
    expr2 = re.findall('"(.*?)"', line)
    if len(expr2)!=0:
        expr3 = re.split('"',line)
        expr4 = expr3[0]+expr3[1].replace(","," &")+expr3[2]
        print >>k, expr4
    else:
        print >>k, line
But it does not handle the case in the 4th line. There can be more than 3 combos as well, e.g.
3,SIN-Audio,"AAA - Audio, xxxx, yyyy","Orbis, PC","13, 22",Open
and wish to make this
3,SIN-Audio,AAA - Audio & xxxx & yyyy, Orbis & PC, 13 & 22,Open
How to achieve this, any suggestion? Learning Python.

By treating the input file as a .csv, we can easily turn the lines into something easy to work with.
For example,
2,Kenny Chong,Core Tech - Rendering,"Durango, PC",55,Reopened
is read as:
['2', 'Kenny Chong', 'Core Tech - Rendering', 'Durango, PC', '55', 'Reopened']
Then, by replacing every , with " &" (a space followed by &), we get:
['2', 'Kenny Chong', 'Core Tech - Rendering', 'Durango & PC', '55', 'Reopened']
This also replaces multiple commas within a field, and because the fields are joined back together by hand when writing, the original double quotes are gone.
Here is the code, given that in.txt is your input file; it writes to out.txt.
import csv
with open('in.txt') as infile:
    reader = csv.reader(infile)
    with open('out.txt', 'w') as outfile:
        for line in reader:
            line = list(map(lambda s: s.replace(',', ' &'), line))
            outfile.write(','.join(line) + '\n')
The fourth line is outputted as:
LTY-168499,[PC][PS4][XB1] Missing textures from Fort Capture NPC face,3,CTU-CharacterTechBacklog,AAA - Character Tech & SOF - UPIs,Durango & Orbis & PC,29,Waiting For

Please check this: I could not find a single expression that does this, so I did it in a slightly more elaborate way. I will update if I find a better way (Python 3).
import re
st = "3,SIN-Audio,\"AAA - Audio, xxxx, yyyy\",\"Orbis, PC\",\"13, 22\",Open"
found = re.findall(r'\"(.*)\"',st)[0].split("\",\"")
final = ""
for word in found:
    final = final + (" &").join(word.split(","))+","
result = re.sub(r'\"(.*)\"',final[:-1],st)
print(result)
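One way to do this in a single substitution is to pass a function to re.sub, which is called once per quoted group. This is a sketch of an alternative, not the answerer's code; merge_quoted is a hypothetical name, and it assumes the quoted fields never contain escaped double quotes.

```python
import re


def merge_quoted(line):
    """Replace commas inside each "quoted" field with ' &' and drop the quotes."""
    return re.sub(
        r'"([^"]*)"',                                  # one quoted field at a time
        lambda match: match.group(1).replace(',', ' &'),
        line,
    )


st = '3,SIN-Audio,"AAA - Audio, xxxx, yyyy","Orbis, PC","13, 22",Open'
merged = merge_quoted(st)
print(merged)
```

Lines without any quotes pass through unchanged, so the function can be applied to every line of the file.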

Reading a file using Python and seeing if a particular string is in the file

I have a file in the following format
Summary;None;Description;Emails\nDarlene\nGregory Murphy\nDr. Ingram\n;DateStart;20100615T111500;DateEnd;20100615T121500;Time;20100805T084547Z
Summary;Presence tech in smart energy management;Description;;DateStart;20100628T130000;DateEnd;20100628T133000;Time;20100628T055408Z
Summary;meeting;Description;None;DateStart;20100629T110000;DateEnd;20100629T120000;Time;20100805T084547Z
Summary;meeting;Description;None;DateStart;20100630T090000;DateEnd;20100630T100000;Time;20100805T084547Z
Summary;Balaji Viswanath: Meeting;Description;None;DateStart;20100712T140000;DateEnd;20100712T143000;Time;20100805T084547Z
Summary;Government Industry Training: How Smart is Your City - The Smarter City Assessment Tool\nUS Call-In Information: 1-866-803-2143\, International Number: 1-210-795-1098\, International Toll-free Numbers: See below\, Passcode: 6785765\nPresentation Link - Copy and paste URL into web browser: http://w3.tap.ibm.com/medialibrary/media_view?id=87408;Description;International Toll-free Numbers link - Copy and paste this URL into your web browser:\n\nhttps://w3-03.sso.ibm.com/sales/support/ShowDoc.wss?docid=NS010BBUN-7P4TZU&infotype=SK&infosubtype=N0&node=clientset\,IA%7Cindustries\,Y&ftext=&sort=date&showDetails=false&hitsize=25&offset=0&campaign=#International_Call-in_Numbers;DateStart;20100811T203000;DateEnd;20100811T213000;Time;20100805T084547Z
Now I need to create a function that does the following:
The function argument specifies which line to read, and let's say I have already done line.split(';').
Check whether "meeting" or "call in number" appears anywhere in line[1], and whether "meeting" or "call in number" appears anywhere in line[2]. If either or both of these are true, the function should return "call-in meeting". Otherwise it should return "None Inferred".
Thanks in advance

Use the in operator to see if there is a match:
for line in open("file"):
    if "string" in line :
        ....

vlad003 is right: if you have newline characters in the lines, they will be new lines! In this case, I would split on "Summary" instead:
from itertools import islice
def chunks( filePath ):
    """Since you have newline characters in each section,
    you can't read each line in turn. This function reads
    lines of the file and splits them into chunks, restarting
    each time 'Summary' starts a line."""
    with open( filePath ) as theFile:
        chunk = [ ]
        for line in theFile:
            if line.startswith( "Summary" ):
                if chunk: yield chunk
                chunk = [ line ]
            else:
                chunk.append( line )
        if chunk: yield chunk
def nth(iterable, n, default=None):
    "Gets the nth element of an iterator."
    return next(islice(iterable, n, None), default)
def getStatus( filePath, chunkNum ):
    "Get the nth chunk of the file, split it by ';', and return the result."
    chunk = nth( chunks( filePath ), chunkNum )
    if chunk is None:
        # could not get the right chunk
        raise IndexError( "no chunk number %d" % chunkNum )
    # a chunk is a list of lines, so join it before splitting on ';'
    fields = "".join( chunk ).split( ";" )
    if "meeting" in fields[ 1 ].lower() or "call in number" in fields[ 1 ].lower():
        return "call-in meeting"
    else:
        return "None Inferred"
Note that this is silly if you plan to read all the chunks of the file, since it opens the file and reads through it once per query. If you plan to do this often, it would be worth parsing it into a better data format (e.g. an array of statuses). This would require one pass through the file, and give you much better lookups.
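The single-pass idea mentioned above could look like the following sketch. It makes two assumptions that the question does not confirm: a record starts at every line beginning with "Summary", and the values to search are fields 1 and 3 after splitting on ";" (the Summary and Description values). The name infer_statuses and the shortened sample lines are made up for illustration.

```python
def infer_statuses(lines):
    """Return one status per record, reading the lines only once."""
    records = []
    for line in lines:
        if line.startswith('Summary') or not records:
            records.append(line)          # a new record starts here
        else:
            records[-1] += line           # continuation of the previous record

    keywords = ('meeting', 'call in number')
    statuses = []
    for record in records:
        fields = record.split(';')
        # Assumption: fields[1] is the Summary value, fields[3] the Description value.
        searched = ' '.join(fields[1:4:2]).lower()
        if any(keyword in searched for keyword in keywords):
            statuses.append('call-in meeting')
        else:
            statuses.append('None Inferred')
    return statuses


sample = [
    "Summary;None;Description;Emails\n",
    "Darlene\n",
    ";DateStart;20100615T111500;DateEnd;20100615T121500;Time;20100805T084547Z\n",
    "Summary;meeting;Description;None;DateStart;20100629T110000;DateEnd;20100629T120000;Time;20100805T084547Z\n",
]
statuses = infer_statuses(sample)
print(statuses)
```

After this one pass, looking up the status of record n is just statuses[n], with no need to reopen the file.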

A build on ghostdog74's answer:
def finder(line):
    '''Takes line number as argument. First line is number 0.'''
    with open('/home/vlad/Desktop/file.txt') as f:
        lines = f.read().split('Summary')[1:]
    searchLine = lines[line]
    if 'meeting' in searchLine.lower() or 'call in number' in searchLine.lower():
        return 'call-in meeting'
    else:
        return 'None Inferred'
I don't quite understand what you meant by line[1] and line[2] so this is the best I could do.
EDIT: Fixed the problem with the \n's. I figure since you're searching for the meeting and call in number you don't need the Summary so I used it to split the lines.
