How do I "unfold" my text file? - python

I had a data table that I converted to a text file, with a tilde ~ at the end of each line. The tilde marks the end of each line; it is not a field delimiter.
I then used Linux to fold the data into a file wrapped at 80 bytes per line, with a line feed added at the end of each wrapped line.
Example (if I did this at 10 bytes per line):
Original file or table:
abcdefghigklmnop~
1234567890~
New file:
abcdefghig
klmnop~123
4567890~
Linux/Unix, Perl, or even Python responses would help and be appreciated.
I need the new file to look exactly like the original. Sometimes line lengths will be over 80 characters, which is OK.

If your original data was delimited by ~\n (tildes at the end of a line), and the new format removed those newlines and inserted new ones every 80 bytes, then the reverse is to remove the newlines and replace ~ with ~\n again:
with open(inputfile, 'r') as inp, open(outputfile, 'w') as outp:
    data = inp.read().replace('\n', '')
    outp.write(data.replace('~', '~\n'))
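As a quick sanity check, applying the same two replacements to the 10-byte example from the question reproduces the original lines:
folded = "abcdefghig\nklmnop~123\n4567890~\n"
unfolded = folded.replace('\n', '').replace('~', '~\n')
print(unfolded)
# abcdefghigklmnop~
# 1234567890~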

Related

How to remove erroneous tabs/newlines from a .vcf file?

I am working with a VCF file. I am trying to extract information from it, but the file has errors in its format.
The file has a column that contains long character strings. The error is that a number of tabs and a newline character are erroneously placed within some rows of this column, so when I try to read in this tab-delimited file, all the columns are messed up.
I have an idea of how to solve this, but don't know how to express it in code. The string is DNA, so it only ever contains the characters ATCG. Basically, if one could look for a run of tabs and a newline between ATCG characters and remove it, the file would be fixed:
ACTGCTGA\t\t\t\t\nCTGATCGA would become:
ACTGCTGACTGATCGA
So one would need to look into this file, look for [ACTG] followed by tabs or newlines, followed by more [ACTG], and then replace this with nothing. Any idea how to do this?
with open('file.vcf', 'r') as f:
    lines = [l for l in f if not l.startswith('##')]
Here's one way with regex:
First read the file in:
import re
with open('file.vcf', 'r') as file:
    dnafile = file.read()
Then write a new file with the changes:
with open('fileNew.vcf', 'w') as file:
    file.write(re.sub("(?<=[ACTG]{2})((\\t)*(\\n)*)(?=[ACTG]{2})", "", dnafile))
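As a rough check, the same substitution applied to the example string from the question strips the stray tabs and newline (the pattern is unchanged from the answer above):
import re

sample = "ACTGCTGA\t\t\t\t\nCTGATCGA"
print(re.sub("(?<=[ACTG]{2})((\\t)*(\\n)*)(?=[ACTG]{2})", "", sample))
# ACTGCTGACTGATCGA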

Converting tsv to tsv in python

I have a tsv-file (tab-separated) and would like to filter out a lot of data using Python before I import it into a PostgreSQL database.
My problem is that I can't find a way to keep the format of the original file, which is mandatory because otherwise the import process won't work.
The web suggested that I should use the csv library, but no matter what delimiter I use, I always end up with files in a different format than the original, e.g. files that contain a comma after every character, files that contain a tab after every character, or files that have all the data in one row.
Here is my code:
import csv
import glob

# create a list of all tsv-files in one directory
liste = glob.glob("/some_directory/*.tsv")
# go through all the files
for item in liste:
    # open the tsv-file for reading and a file for writing
    with open(item, 'r') as tsvin, open('/some_directory/new.tsv', 'w') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t')
        # I am not sure if I have to enter a delimiter here for the outfile. If I enter
        # "delimiter='\t'" like for the in-file, the outfile ends up with a tab after every character
        writer = csv.writer(csvout)
        # go through all lines of the input tsv
        for row in tsvin:
            # do some filtering
            if 'some_substring1' in row[4] or 'some_substring2' in row[4]:
                # do some more filtering
                if 'some_substring1' in str(row[9]) or 'some_substring1' in str(row[9]):
                    # now I get lost...
                    writer.writerow(row)
Do you have any idea what I am doing wrong? The final file has to have a tab between every field and some kind of line break at the end.
Somehow you are passing a string to writer.writerow(), not a list as expected.
Remember that strings are iterable; each iteration returns a single character from the string. writerow() simply iterates over its argument, writing each item separated by the delimiter character (by default a comma). So if you pass a string to writerow(), it will write each character of the string separated by the delimiter.
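A quick illustration of that behaviour, writing to an in-memory buffer instead of a file:
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow("abc")  # a string, so each character becomes a separate field
print(buf.getvalue())            # a,b,c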
How is it that row is a string? It could be that the delimiter for the input file is incorrect - perhaps the file does not use tabs but has fixed field widths using runs of spaces as the delimiter.
You can check whether the reader is correctly parsing your file by printing out the value of row:
for row in tsvin:
    print(row)
    ...
If the file is being correctly parsed, expect to see that row is a list, and that each element of the list corresponds to a column/field from the file.
If it is not parsing correctly then you might see that row is a string, or that it's a list but the fields are empty and/or out of place.
It would be helpful if you added a sample of your input file to the question.
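If the rows do parse correctly, the piece missing from the code in the question is the delimiter on the writer; here is a minimal sketch (the file names and column index are placeholders following the question):
import csv

with open('/some_directory/old.tsv', 'r', newline='') as tsvin, \
     open('/some_directory/new.tsv', 'w', newline='') as tsvout:
    reader = csv.reader(tsvin, delimiter='\t')
    writer = csv.writer(tsvout, delimiter='\t')  # write tab-separated output
    for row in reader:
        # same kind of filtering as in the question
        if 'some_substring1' in row[4] or 'some_substring2' in row[4]:
            writer.writerow(row)  # row is a list, so the fields stay tab-separated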

How does the code know when to split into a line?

So I was learning how to download files from the web using Python but got a bit thrown by one part of the code.
Here is the code:
from urllib import request

def download_stock_data(csv_url):
    response = request.urlopen(csv_url)
    csv = response.read()
    csv_str = str(csv)
    lines = csv_str.split("\\n")
    dest_url = r"stock.csv"
    fx = open(dest_url, "w")
    for line in lines:
        fx.write(line + "\n")
    fx.close()
I don't quite understand the line that creates the variable lines. How does it know where to split a csv file into new lines?
A csv file is essentially just a text file with comma-separated data, but it also contains newlines (via the newline ASCII character).
If the csv file consisted of a single long comma-separated line, for line in lines: would only see that one line.
You can open it up in Notepad++ or something similar to see the raw .csv file. Excel will put data separated by commas into cells, and data on a new line into a new row.
"\n" is where the instruction to create a new line comes from.
In the code you have presented, you are telling Python to split the string you received on "\n", so you get a list of strings split into lines.
When you write to fx, you are inserting a newline character onto every line you write by appending "\n". If you didn't do this, then you would just get one very long line.
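For comparison, here is a sketch of a more conventional version that decodes the response to text rather than calling str() on the raw bytes (this assumes the server sends UTF-8):
from urllib import request

def download_stock_data(csv_url, dest_path="stock.csv"):
    response = request.urlopen(csv_url)
    text = response.read().decode("utf-8")  # bytes -> str, so "\n" is a real newline
    with open(dest_path, "w") as fx:
        for line in text.splitlines():       # splits on any newline convention
            fx.write(line + "\n")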

Reading input files with ASCII 215 as delimiter

I am trying to read from a file which has word pairs delimited by ASCII value 215. When I run the following code:
f = open('file.i', 'r')
for line in f.read().split('×'):
    print line
I get a string that looks like garbage. Here is a sample of my input:
abashedness×N
abashment×N
abash×t
abasia×N
abasic×A
abasing×t
Abas×N
abatable×A
abatage×N
abated×V
abatement×N
abater×N
Abate×N
abate×Vti
abating×V
abatis×N
abatjours×p
abatjour×N
abator×N
abattage×N
abattoir×N
abaxial×A
and here is my output after the code above is run:
z?Nlner?N?NANus?A?hion?hk?hhn?he?hanoconiosis?N
My goal is to eventually read this into either a list of tuples or something of that nature, but I'm having trouble just getting the data to print.
Thanks for all help.
Well, two things:
Your source could be Unicode! Use an escape and be safe.
Read in binary mode.
with open("file.i", "rb") as f:
for line in f.read().split(b"\xd7"):
print(line)
The character is delimiting the word and the part of speech, but each word is still on its own line:
with open('file.i', 'rb') as handle:
    for line in handle:
        word, pos = line.strip().split('×')
        print word, pos
Your code was splitting the whole file, so you were ending up with words like N\nabatable, N\nAbate, Vti\nabating.
To interpret bytes from a file as text, you need to know its character encoding. There Ain't No Such Thing As Plain Text. You could use the codecs module to read the text:
import codecs

with codecs.open('file.i', 'r', encoding='utf-8') as file:
    for line in file:
        word, sep, suffix = line.partition(u'\u00d7')
        if sep:
            print word
Put the actual character encoding of the file in place of the utf-8 placeholder, e.g. cp1252.
Non-ASCII characters in string literals would require a source character encoding declaration at the top of the script, so I've used the Unicode escape u'\u00d7' instead.
Thanks to both of your answers, I was able to hack together this bit of code that returns a list of lists holding what I'm looking for.
with open("mobyposi.i", "rb") as f:
content = f.readlines()
f.close()
content = content[0].split()
for item in content:
item.split("\xd7")
It was indeed in Unicode! However, the implementation described above discarded the text after the Unicode value and before the newline.
EDIT: Able to reduce to:
with open("mobyposi.i", "rb") as f:
for item in f.read().split():
item.split("\xd7")
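Since the stated goal was a list of tuples, here is a sketch of that last step; it assumes one word/part-of-speech pair per whitespace-separated item and that the bytes decode as Latin-1 (where 0xd7 is the × character):
pairs = []
with open("mobyposi.i", "rb") as f:
    for item in f.read().split():
        word, sep, pos = item.decode("latin-1").partition(u"\u00d7")
        if sep:
            pairs.append((word, pos))
# pairs would then look like [('abashedness', 'N'), ('abashment', 'N'), ...]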

cropping off characters in python

I am new to Python. I have a .txt file containing numbers, which I read into an array with the code below:
numberInput = []
with open('input.txt') as file:
    numberInput = file.readlines()
print numberInput
Unfortunately, the output looks like this:
['54044\r\n', '14108\r\n', '79294\r\n', '29649\r\n', '25260\r\n', '60660\r\n', '2995\r\n', '53777\r\n', '49689\r\n', '9083\r\n', '16122\r\n', '90436\r\n', '4615\r\n', '40660\r\n', '25675\r\n', '58943\r\n', '92904\r\n', '9900\r\n', '95588\r\n', '46120']
How do I crop off the \r\n characters attached to each number in the array?
The \r\n you're seeing at the end of the strings is the newline indicator (a carriage return character followed by a newline character). You can easily remove it using str.strip:
numberInput = [line.strip() for line in file]
This is a list comprehension that iterates over your file (one line at a time) and strips off any whitespace found at either end of the line.
If you're wanting to use the numbers from the file as integers though, you can actually avoid stripping the lines, since the int constructor will ignore any whitespace. Here's how it would look if you did the conversion directly:
numberInput = [int(line) for line in file]
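Put together with the open() call from the question, that would look like this (same input.txt):
with open('input.txt') as file:
    numberInput = [int(line) for line in file]
print(numberInput)  # [54044, 14108, 79294, ...]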
You should use str.splitlines() instead of readlines():
numberInput = []
with open('input.txt') as file:
    numberInput = file.read().splitlines()
print numberInput
This reads the whole file and splits it on "universal newlines", so you get the same list without the \r\n.
See this question:
Best method for reading newline delimited files in Python and discarding the newlines?
