(Python) Parsing tab-delimited strings with newline characters

I am trying to read a tab-delimited file whose fields may contain newline characters, and I would like to keep each such field intact, newlines included. My current implementation starts a new field at every "\n".
I have tried the csv module and plain splitting on "\t", but neither gave me what I'm looking for. The following is a sample line from one such file:
Field_1 \t Field_2 \t Field_3 \n Additional Text \n More text \t Field_4
I would like to generate a list of 4 elements from the data above:
["Field_1", "Field_2", "Field_3 \n Additional Text \n More text", "Field_4"]
Any thoughts or suggestions would be helpful.

Did you try splitting on the tab like this?
data = 'Field_1 \t Field_2 \t Field_3 \n Additional Text \n More text \t Field_4'
print(data.split('\t'))

Replacing fileName with the path to the file you're reading from:
inFile = open(fileName, "r")
rawData = inFile.read() # Entire file's contents as one multiline string (if there's a line break)
data = rawData.split("\t")
inFile.close()
There is also the option (generally recommended) of using the with statement for file I/O:
with open(fileName, "r") as inFile:
    rawData = inFile.read() # Entire file's contents as one string, line breaks included
data = rawData.split("\t")
# No inFile.close() call is needed here.
With the with statement, the file is closed automatically when the block ends, even if an error occurs at runtime, although how it works is less obvious to people who are still learning file I/O.
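As a minimal sketch of the splitting approach above, the snippet below splits the question's sample string on tabs and then trims only the spaces around each field, so the embedded newlines survive. The helper name split_fields is hypothetical, and the sketch assumes the input holds a single record; a file with several records would also need a rule for telling a record-ending newline from one inside a field.

```python
def split_fields(record):
    """Split one tab-delimited record, keeping newlines inside fields."""
    # strip(" ") removes only spaces, so any "\n" inside a field is left alone
    return [field.strip(" ") for field in record.split("\t")]


data = 'Field_1 \t Field_2 \t Field_3 \n Additional Text \n More text \t Field_4'
fields = split_fields(data)
print(fields)
```

Trimming with strip(" ") rather than a bare strip() is deliberate here: a bare strip() would also remove a newline sitting at the very start or end of a field.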

Related

File object variable is not printing newline for \n in the string

I have a text file that contains "b'Fenced Ports :\t\t\tNone\nFenced circuits :\t\tN/A\n'" as text. I am opening this text file in read mode, creating a file object, reading the file content using read() function and storing it in a variable x. This variable x is of type str.
But when I print this str variable x, it doesn't convert \t and \n into a tab and a newline respectively.
fo=open("hcc.txt", "r")
data = fo.read()
print(data)
fo.close()
Output:
b'Fenced Ports :\t\t\tNone\nFenced circuits :\t\tN/A\n'
I need it like below:
Fenced Ports : None
Fenced circuits : N/A
The code you have provided should give the correct output. open uses text mode by default, so tabs and newlines should not appear as \t and \n when you print it. Based on the output you've given, it looks like you're opening the file in binary mode (hence the b at the start of the output), which causes tabs and newlines to be printed verbatim. Ensure that you are running the same code you provided in your question; make sure you do not have "b" in the second argument to open.
Assuming that every line in your file starts with "b' and ends with '":
with open("text.txt", "r") as f:
    print(
        "".join(
            [line.replace("\\t", "\t").replace("\\n", "\n")[3:-3] for line in f]
        ).strip()
    )
Fenced Ports : None
Fenced circuits : N/A
It seems that Python substitutes the newline and tab characters when we assign the text to a variable in source code, as in:
var="Hello\nWorld!"
but not when we read the same text from a file. In that case it is as though we had written
var="Hello\\nWorld!"
If there is an actual newline or tab character in the file, it will be read correctly.
You could run a string substitution to replace the two-character sequences \n and \t with "\n" and "\t", which are then treated as a real newline and tab.
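If the file really holds the printed form of a bytes object, one alternative to manual substitution is to let Python parse that text back into bytes with ast.literal_eval and then decode it. This is only a sketch: the helper name decode_bytes_repr is hypothetical, and it assumes the whole text is one well-formed b'...' literal, optionally wrapped in double quotes, encoded as UTF-8.

```python
import ast


def decode_bytes_repr(text):
    """Turn text such as b'a\\tb\\n' (a printed bytes object) into a real str."""
    # Assumption: the text is one bytes literal, possibly wrapped in double quotes
    literal = text.strip().strip('"')
    value = ast.literal_eval(literal)
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    return value


# Raw string: the backslashes here are literal characters, as they are in the file
sample = r"b'Fenced Ports :\t\t\tNone\nFenced circuits :\t\tN/A\n'"
decoded = decode_bytes_repr(sample)
print(decoded)
```

In practice the sample string would come from fo.read() rather than from a literal in the script.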

How to remove erroneous tabs/new line from .vcf file?

I am working with a vcf file. I try to extract information from this file, but the file has errors in the format.
In this file there is a column that contains long character strings. The error is that a number of tabs and a newline character are erroneously placed within some rows of this column. So when I try to read in this tab-delimited file, all columns are messed up.
I have an idea how to solve this, but don't know how to execute it in code. The string is DNA, so always has ATCG. Basically, if one could look for a number of tabs and a newline within characters ATCG and remove them, then the file is fixed:
ACTGCTGA\t\t\t\t\nCTGATCGA would become:
ACTGCTGACTGATCGA
So one would need to look into this file, look for [ACTG] followed by tabs or newlines, followed by more [ACTG], and then replace this with nothing. Any idea how to do this?
with open('file.vcf', 'r') as f:
    lines = [l for l in f if not l.startswith('##')]
Here's one way with regex:
First read the file in:
import re
with open('file.vcf', 'r') as file:
    dnafile = file.read()
Then write a new file with the changes:
with open('fileNew.vcf', 'w') as file:
    file.write(re.sub("(?<=[ACTG]{2})((\\t)*(\\n)*)(?=[ACTG]{2})", "", dnafile))
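To check the idea on the example from the question before touching a real file, a small variation of the pattern can be tried on an in-memory string. This is a sketch, not the exact pattern above: the name join_split_sequences is hypothetical, and the character class [\t\n]+ accepts tabs and newlines in any order. Like the original, it requires two bases on each side, so a one-base column such as A followed by a tab and T is left alone, but two adjacent columns that each hold two or more bases would still be merged.

```python
import re

# Tabs/newlines that sit between two runs of at least two bases
GAP = re.compile(r"(?<=[ACGT]{2})[\t\n]+(?=[ACGT]{2})")


def join_split_sequences(text):
    """Remove tabs and newlines that interrupt a DNA string."""
    return GAP.sub("", text)


broken = "ACTGCTGA\t\t\t\t\nCTGATCGA"
fixed = join_split_sequences(broken)
print(fixed)

# A short row with one-base columns keeps its tab separators
untouched = join_split_sequences("chr1\tA\tT\n")
```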

Write to file does not go to the new line when '\n' in string

I have a .txt file which contains only one line of text. For example:
command1;\ncommand2, output;\ncommand3\ncommand4, output;\n (but much longer). Since it is hard to read, I want to change this file to some more readable version. I want to remove all ';' and replace '\n' with a new line.
I have a few working solutions for this problem:
For example, I could remove all '\n' and use the print function. Or replace \\n with \n:
def clean_file(file):
    # read file
    with open(file) as f:
        content = f.readline()
    # get rid of ';' and '\n'
    content = content.split(';')
    for ind, val in enumerate(content):
        content[ind] = val.replace('\\n', '\n') # it can be also replace(r'\n', '\n')
    # write to file
    with open(file, 'w') as f:
        for line in content:
            f.write(line)
OUT:
command1
command2, output
command3
command4, output
And in this scenario, it works properly!
But I have no idea why it is not working when I remove replace part:
def clean_file(file):
    # read file
    with open(file) as f:
        content = f.readline()
    # get rid of ';'
    content = content.split(';')
    # write to file
    with open(file, 'w') as f:
        for line in content:
            f.write(line)
OUT:
command1\ncommand2, output\ncommand3\ncommand4, output\n
This will print everything in one line.
Can someone explain to me why I have to replace '\n' with the same value?
The file was created, and I am opening it on windows, but the script I am running on Linux.
Most editors in the Windows world (starting with Notepad) require \r\n to display an end of line correctly and ignore a lone \n. On Linux, on the other hand, a single \n is enough for an end of line. If you run a Python script on Windows and open the file in text mode, Python automatically replaces every '\n' with \r\n at write time and, symmetrically, replaces \r\n read from a file with a single \n. None of that happens on Linux.
Long story short, text files have different line endings on Linux and Windows, and text files containing \r\n are known as DOS text files on Linux.
You have probably been caught by that, and the only way to be sure is to open the file in binary mode and display the byte values (in hex, to be more readable for people used to ASCII codes).
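A minimal sketch of that byte-level check follows. The helper name hex_dump is hypothetical; with a real file, the bytes would come from open(path, "rb").read() instead of the hard-coded sample used here.

```python
def hex_dump(data):
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{byte:02x}" for byte in data)


# 'a', a literal backslash, 'n', then a real CR LF pair
sample = b"a\\n\r\n"
dump = hex_dump(sample)
print(dump)
```

In the dump, 5c 6e is a backslash followed by the letter n, 0a is a real line feed, and 0d is a carriage return, so the three cases are easy to tell apart.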
You are not replacing the same value: you are replacing the two characters \ and n with a single newline character. In string literals a backslash usually introduces a special character (such as newline \n, tab \t, etc.), but sometimes you want an actual backslash. To get one in Python, write \\.
So in your first example Python writes a real newline wherever the file had a backslash followed by n. In your second example nothing is replaced, so the file still contains a backslash followed by an ordinary n, and both are written out exactly as they are.
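The difference can be seen directly in a short sketch that uses a string in place of the file contents (the variable names are illustrative only):

```python
# What the file holds: a backslash followed by the letter n (two characters)
literal = r"command1;\ncommand2, output;\n"

# A real newline is a single character
newline_length = len("\n")
backslash_n_length = len(r"\n")

# Drop the semicolons, then turn each backslash + n pair into a real newline
cleaned = literal.replace(";", "").replace("\\n", "\n")
print(cleaned)
```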

How can I remove carriage return from a text file with Python?

The things I've googled haven't worked, so I'm turning to experts!
I have some text in a tab-delimited text file that has some sort of carriage return in it (when I open it in Notepad++ and use "show all characters", I see [CR][LF] at the end of the line). I need to remove this carriage return (or whatever it is), but I can't seem to figure it out. Here's a snippet of the text file showing a line with the carriage return:
firstcolumn secondcolumn third fourth fifth sixth seventh
moreoftheseventh 8th 9th 10th 11th 12th 13th
Here's the code I'm trying to use to replace it, but it's not finding the return:
with open(infile, "r") as f:
    for line in f:
        if "\n" in line:
            line = line.replace("\n", " ")
My script just doesn't find the carriage return. Am I doing something wrong or making an incorrect assumption about this carriage return? I could just remove it manually in a text editor, but there are about 5000 records in the text file that may also contain this issue.
Further information:
The goal here is select two columns from the text file, so I split on \t characters and refer to the values as parts of an array. It works on any line without the returns, but fails on the lines with the returns because, for example, there is no element 9 in those lines.
vals = line.split("\t")
print(vals[0] + " " + vals[9])
So, for the line of text above, this code fails because there is no index 9 in that particular array. For lines of text that don't have the [CR][LF], it works as expected.
Depending on the type of file (and the OS it comes from, etc.), your line ending might be '\r', '\n', or '\r\n'. The simplest way to get rid of them regardless of which one they are is to use line.rstrip().
with open(infile, "r") as f:
    for line in f:
        line = line.rstrip() # strip out all trailing whitespace
If you want to get rid of ONLY the line-ending characters and not any other whitespace that might be at the end, you can supply the optional argument to rstrip:
with open(infile, "r") as f:
    for line in f:
        line = line.rstrip('\r\n') # strip out only trailing carriage returns and newlines
Hope this helps
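The choice between the two calls matters for tab-delimited data, because a bare rstrip() also removes trailing tabs and therefore trailing empty columns. A small sketch with a made-up line:

```python
# Made-up line: two values, one empty column, one column holding a space, then CR LF
line = "first\tsecond\t\t \r\n"

stripped_all = line.rstrip()          # removes tabs and spaces as well
stripped_eol = line.rstrip("\r\n")    # removes only the line ending

print(stripped_all.split("\t"))
print(stripped_eol.split("\t"))
```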
Here's how to remove carriage returns without using a temporary file:
with open(file_name, 'r') as file:
    content = file.read()
with open(file_name, 'w', newline='\n') as file:
    file.write(content)
Python opens files in so-called universal newline mode, so newlines are always \n.
Python is usually built with universal newlines support; supplying 'U'
opens the file as a text file, but lines may be terminated by any of
the following: the Unix end-of-line convention '\n', the Macintosh
convention '\r', or the Windows convention '\r\n'. All of these
external representations are seen as '\n' by the Python program.
You iterate through the file line by line, so by the time your loop sees the stray line break it has already split the record into two separate lines. Each line still ends with its own \n, but replacing it in one line does not join that line to the next one.
Instead, you can read the whole file with f.read() and then replace \n in the result.
with open(infile, "r") as f:
    content = f.read()
content = content.replace('\n', ' ')
# do something with content
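Replacing every \n joins all records into one long string, so when the records themselves must stay separate, another option is to glue lines together only until a record has the expected number of columns. This is a sketch under assumptions: the function name merge_broken_records and its expected_fields parameter are hypothetical, every complete record is assumed to have the same number of tab-separated fields, and a stray break is assumed never to fall exactly on a column boundary.

```python
def merge_broken_records(lines, expected_fields):
    """Join physical lines until each record has the expected number of fields."""
    records = []
    pending = ""
    for line in lines:
        piece = line.rstrip("\r\n")
        # Continue the unfinished record, separating the pieces with a space
        pending = piece if not pending else pending + " " + piece
        if pending.count("\t") + 1 >= expected_fields:
            records.append(pending.split("\t"))
            pending = ""
    if pending:
        # Leftover text that never reached the expected field count
        records.append(pending.split("\t"))
    return records


sample_lines = ["a\tb\tc\r\n", "d\te\r\n", "f\tg\r\n"]
records = merge_broken_records(sample_lines, expected_fields=3)
print(records)
```

With a real file, the lines argument can be the open file object itself.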
Technically, there is an answer!
with open(filetoread, "rb") as inf:
    # On Python 3, bytes read in binary mode must also be written in binary mode
    with open(filetowrite, "wb") as fixed:
        for line in inf:
            fixed.write(line)
The b in open(filetoread, "rb") apparently opens the file in such a way that I can access those line breaks and remove them. This answer actually came from Stack Overflow user Kenneth Reitz off the site.
Thanks everyone!
I've written some code to do it, and it works:
end1 = r'C:\...\file1.txt'
end2 = r'C:\...\file2.txt'
with open(end1, "rb") as inf:
    with open(end2, "wb") as fixed:
        for line in inf:
            # In binary mode each line is a bytes object, so use bytes patterns
            line = line.replace(b"\n", b"")
            line = line.replace(b"\r", b"")
            fixed.write(line)

Write a list to file containing text and hex values. How?

I need to write a list of values to a text file. When I write a line feed character, Windows uses \r\n while other systems use \n.
It occurred to me that maybe I should write to the file in binary.
How do I create a list like the following example and write it to a file in binary?
output = ['my first line', hex_character_for_line_feed_here, 'my_second_line']
How come the following does not work?
output = ['my first line', '\x0a', 'my second line']
Don't. Open the file in text mode and just let Python handle the newlines for you.
When you use the open() function you can set how Python should handle newlines with the newline keyword parameter:
When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
So the default behaviour is to write the correct line separator for your platform:
with open(outputfilename, 'w') as outputfile:
    outputfile.write('\n'.join(output))
This does the right thing: on Windows, \r\n is saved in place of each \n.
If you specifically want to write \n only and not have Python translate it for you, use newline='':
with open(outputfilename, 'w', newline='') as outputfile:
    outputfile.write('\n'.join(output))
Note that '\x0a' is exactly the same character as \n; \r is \x0d:
>>> '\x0a'
'\n'
>>> '\x0d'
'\r'
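To see what the newline parameter does on any platform, the file can be written in text mode and read back in binary mode. The sketch below uses a temporary file; the helper name write_and_read_bytes is hypothetical.

```python
import os
import tempfile

output = ['my first line', 'my second line']


def write_and_read_bytes(newline):
    """Write the joined lines with the given newline setting and return the raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, 'w', newline=newline) as outputfile:
            outputfile.write('\n'.join(output))
        with open(path, 'rb') as raw:
            return raw.read()
    finally:
        os.remove(path)


unix_bytes = write_and_read_bytes('')          # no translation
windows_bytes = write_and_read_bytes('\r\n')   # every '\n' becomes '\r\n'
print(unix_bytes)
print(windows_bytes)
```

Passing newline=None (the default) would give one result or the other depending on the platform, which is why the sketch sets the value explicitly.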
Create a text file, "myTextFile.txt", in the same directory as your Python script. Then write something like:
# wb opens the file in "Write Binary" mode
with open("myTextFile.txt", 'wb') as myTextFile:
    output = ['my first line', '369as3', 'my_second_line']
    for member in output:
        # Binary files accept only bytes, so encode each line before writing
        myTextFile.write((member + "\n").encode("utf-8")) # Or whatever encoding you like =)
This outputs a binary text file that looks like:
my first line
369as3
my_second_line
Edit: Updated for Python 3
