How to remove erroneous tabs/new lines from a .vcf file? - python

I am working with a VCF file. I am trying to extract information from it, but the file has formatting errors.
One column in this file contains long character strings. The error is that a number of tabs and a newline character are erroneously placed within some rows of this column, so when I try to read in this tab-delimited file, all the columns are messed up.
I have an idea how to solve this, but don't know how to write it in code. The string is DNA, so it only ever contains A, T, C and G. Basically, if one could look for a run of tabs and a newline sitting between ACTG characters and remove them, the file would be fixed:
ACTGCTGA\t\t\t\t\nCTGATCGA would become:
ACTGCTGACTGATCGA
So one would need to go through the file, look for [ACTG] followed by tabs or newlines, followed by more [ACTG], and then remove the tabs and newlines in between. Any idea how to do this?
with open('file.vcf', 'r') as f:
    lines = [l for l in f if not l.startswith('##')]

Here's one way with regex:
First read the file in:
import re
with open('file.vcf', 'r') as file:
    dnafile = file.read()
Then write a new file with the changes:
with open('fileNew.vcf', 'w') as file:
    file.write(re.sub("(?<=[ACTG]{2})((\\t)*(\\n)*)(?=[ACTG]{2})", "", dnafile))
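The same idea can be tried on a small string before touching the real file. This is a minimal sketch, not the answer's exact pattern: it uses a single character class for the tabs and newlines, and the names BREAKS_INSIDE_DNA and join_broken_sequences are made up for illustration. One caveat to check against real data: a tab that legitimately separates two multi-base columns (REF and ALT sit next to each other in a VCF) would be removed as well.

```python
import re

# Tabs/newlines that sit directly between two stretches of DNA letters.
# Assumption: at least two bases on each side, as in the answer's pattern.
BREAKS_INSIDE_DNA = re.compile(r"(?<=[ACGT]{2})[\t\n]+(?=[ACGT]{2})")


def join_broken_sequences(text):
    """Remove runs of tabs/newlines that interrupt a DNA string."""
    return BREAKS_INSIDE_DNA.sub("", text)


# Hypothetical row: an id, a broken sequence, and a numeric column.
sample = "id1\tACTGCTGA\t\t\t\t\nCTGATCGA\t50\n"
print(repr(join_broken_sequences(sample)))
```

The tabs around the id and the trailing number are left alone because they are not flanked by two bases on both sides.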

Related

Converting .tsv file to .txt creates unintended characters, possible fix?

I need to process a .tsv file that has 1 million lines and then save it as a .txt file. I am able to do that successfully this way:
import csv
with open("data.tsv") as fd, open('pre_processed_data.txt', 'wb') as csvout:
    rd = csv.reader(fd, delimiter="\t", quotechar='"')
    csvout = csv.writer(csvout,delimiter='\t')
    for row in rd:
        csvout.writerow([row[1],row[2],row[3]])
However, beyond a certain point, unintended special characters creep in alongside the tabs, like this:
As you can see the first column expects only numeric values between 0 and 1. However special characters are seen in between.
What could be causing this, and how can I resolve it effectively?
These extra characters exist in the input file. As you have no control over the file, the easiest thing to do is to remove them as you process the data. The re module's sub function can do this:
>>> import re
>>> s = '1#'
>>> re.sub(r'\D+', '', s)
'1'
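The r'\D+' pattern matches any run of non-digit characters and removes it from the provided string. Note that a decimal point is also a non-digit, so a value such as 0.5 would lose its point; a pattern like r'[^\d.]+' keeps it.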
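Since the first column is described as holding values between 0 and 1, here is a small sketch of a cleaner that keeps the decimal point. It assumes plain decimals (no signs or exponents); NOISE and clean_number are illustrative names, not part of the original answer.

```python
import re

# Anything that is not a digit or a decimal point is treated as noise.
# Assumption: values are plain decimals such as 0.75 (no signs, no exponents).
NOISE = re.compile(r"[^\d.]+")


def clean_number(cell):
    """Strip stray characters from a numeric cell, keeping the decimal point."""
    return NOISE.sub("", cell)


print(clean_number("0.7#5"))          # decimal point survives
print(re.sub(r"\D+", "", "0.7#5"))    # decimal point is removed too
```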

Extracting a substring from string in Python and putting it to a file

I have a file in the following format
name#company.com, information
name#company2.com, information
....
What I need to do is read in the file and output only the email addresses to a file. I have written the following code so far:
with open('n-emails.txt') as f:
    lines = f.readlines()
    print(lines)
Can someone please show me how to get only the email part of each line and how to write it to a file? This is all being done on a Mac.
Two different ways of doing it:
Without the csv module: read each line, split it on the comma, strip the blanks, and print the first token:
with open('n-emails.txt') as f:
    for line in f:
        toks = line.split(",")
        if toks:
            print(toks[0].strip())
With the csv module: wrap the opened file in a csv reader, iterate over the rows, and print the first (stripped) field of each row:
import csv
with open('n-emails.txt') as f:
    cr = csv.reader(f, delimiter=",")
    for row in cr:
        if row:  # skip blank lines
            print(row[0].strip())
The second method has the advantage of being robust to commas contained in cells, quotes, and so on, which is why I recommend it.
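Both snippets print the addresses; the question also asks for them to be written to a file. A minimal sketch of that step, assuming Python 3 and the csv approach, could look like this (extract_first_column and the output file name are hypothetical):

```python
import csv


def extract_first_column(in_path, out_path):
    """Copy the first comma-separated field of every row to out_path."""
    with open(in_path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.reader(src, delimiter=","):
            if row:  # skip blank lines
                dst.write(row[0].strip() + "\n")


# Example call (file names are placeholders):
# extract_first_column("n-emails.txt", "emails-only.txt")
```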

Replacing cell, not string

I have the following code.
import fileinput
import sys
map_dict = {'*':'999999999', '**':'999999999'}
for line in fileinput.FileInput("test.txt",inplace=1):
    for old, new in map_dict.items():
        line = line.replace(old, new)
    sys.stdout.write(line)
I have a txt file
1\tab*
*1\tab**
Then running the python code generates
1\tab999999999
9999999991\tab999999999
However, I want to replace a whole "cell" (sorry if this is not standard terminology in Python; I am using the terminology of Excel), not a substring.
The second cell is
*
So I want to replace it.
The third cell is
1*
This is not *. So I don't want to replace it.
My desired output is
1\tab999999999
*1\tab999999999
How should I do this? The user will tell the program which delimiter is being used, but the program should replace only whole cells, not substrings.
Also, how can I write to a separate output txt file rather than overwriting the input?
Open a file for writing, and write to it.
Since you want to replace only exact, complete values (and, for example, not touch 1*), do not use replace. Instead, split each line on the tab character ('\t') and compare each value as a whole.
You must also remove end of line characters (as they may prevent matching last cells in a row).
Which gives
MAPS = (('*','999999999'),('**','999999999'))
with open('output.txt','w') as out_file, open("test.txt",'r') as in_file:
    for line in in_file:
        out_list = []
        # Strip the newline so the last cell on the line can match exactly
        for inp_cell in line.rstrip('\n').split('\t'):
            out_cell = inp_cell
            for old, new in MAPS:
                if out_cell == old:
                    out_cell = new
            out_list.append(out_cell)
        out_file.write( "\t".join(out_list) + "\n" )
There are more condensed/compact/optimized ways to do it, but I detailed each step on purpose, so that you may adapt to your needs (I was not sure this is exactly what you ask for).
The csv module can help:
#!python3
import csv
map_dict = {'*':'999999999','**':'999999999'}
with open('test.txt',newline='') as inf, open('test2.txt','w',newline='') as outf:
    w = csv.writer(outf,delimiter='\t')
    for line in csv.reader(inf,delimiter='\t'):
        line = [map_dict[item] if item in map_dict else item for item in line]
        w.writerow(line)
Notes:
with will automatically close files.
csv.reader parses and splits lines on a delimiter.
A list comprehension translates line items in the dictionary into a new line.
csv.writer writes the line back out.
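The question also mentions that the user supplies the delimiter. A sketch of the csv approach wrapped in a function that takes the delimiter as a parameter, assuming Python 3 (replace_cells and its parameter names are illustrative):

```python
import csv


def replace_cells(in_path, out_path, mapping, delimiter="\t"):
    """Replace cells that exactly equal a key in mapping; leave others alone."""
    with open(in_path, newline="") as inf, \
         open(out_path, "w", newline="") as outf:
        writer = csv.writer(outf, delimiter=delimiter, lineterminator="\n")
        for row in csv.reader(inf, delimiter=delimiter):
            # dict.get falls back to the original cell when there is no match
            writer.writerow([mapping.get(cell, cell) for cell in row])


# Example call (file names are placeholders):
# replace_cells("test.txt", "test2.txt", {"*": "999999999", "**": "999999999"})
```

Because the comparison is a dictionary lookup on the whole cell, a value such as *1 is never touched.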

How to write clean data to a file in python in tabulated format

Issue: Remove the hyperlinks, numbers and signs like ^&*$ etc. from Twitter text. The tweet file is in a tabulated CSV format, as shown below:
s.No. username tweetText
1. #abc This is a test #abc example.com
2. #bcd This is another test #bcd example.com
Being a novice at Python, I searched around and strung together the following code, thanks to the code given here:
import re
fileName="path-to-file//tweetfile.csv"
fileout=open("Output.txt","w")
with open(fileName,'r') as myfile:
    data=myfile.read().lower() # read the file and convert all text to lowercase
    clean_data=' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",data).split()) # regular expression to strip the html out of the text
    fileout.write(clean_data+'\n') # write the cleaned data to a file
fileout.close()
myfile.close()
print "All done"
It does the data stripping, but the output file format is not as I desire. The output text file is in a single line like
s.no username tweetText 1 abc This is a cleaned tweet 2 bcd This is another cleaned tweet 3 efg This is yet another cleaned tweet
How can I fix this code to give me an output like given below:
s.No. username tweetText
1 abc This is a test
2 bcd This is another test
3 efg This is yet another test
I think something needs to be added in the regular expression code but I don't know what it could be. Any pointers or suggestions will be helpful.
You can read the line, clean it, and write it out in one loop. You can also use the CSV module to help you build out your result file.
import csv
import re
exp = r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
def cleaner(row):
    return [re.sub(exp, " ", item.lower()) for item in row]
with open('input.csv', 'r') as i, open('output.csv', 'wb') as o:
    reader = csv.reader(i, delimiter=',') # Comma is the default
    writer = csv.writer(o, delimiter=',')
    # Take the first row from the input file (the header)
    # and write it to the output file
    writer.writerow(next(reader))
    for row in reader:
        writer.writerow(cleaner(row))
The csv module knows how to add separators between items correctly, as long as you pass it a collection of items.
So, what the cleaner function does is take each item (column) in the row from the input file, apply the substitution to the lowercase version of the item, and return the results as a list.
The rest of the code simply opens the files and configures the csv module with the separators you want for the input and output files (in the example code, the separator for both files is a comma, but you can change either one, for example to '\t' for tab-separated data).
Next, the first row of the input file is read and written out to the output file. No transformation is done on this row (which is why it is not in the loop).
Reading the row from the input file automatically puts the file pointer on the next row, so we then simply loop through the remaining input rows (in reader), apply the cleaner function to each row (which returns a list), and write that list to the output file with writer.writerow().
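The code above opens the output file in 'wb' mode, which is the Python 2 idiom; under Python 3 the csv module expects text-mode files opened with newline=''. A Python 3 sketch of the same row-by-row approach follows. It assumes a tab-delimited input, as the question describes, and additionally collapses the runs of spaces left behind by the substitution; clean_cell and clean_file are illustrative names.

```python
import csv
import re

# Pattern taken from the answer above: hashtags, non-alphanumerics, URLs.
EXP = re.compile(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")


def clean_cell(cell):
    """Lowercase a cell, blank out unwanted text, collapse extra spaces."""
    return " ".join(EXP.sub(" ", cell.lower()).split())


def clean_file(in_path, out_path, delimiter="\t"):
    """Clean every data row, copying the header row through unchanged."""
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
        writer.writerow(next(reader))  # header row
        for row in reader:
            writer.writerow([clean_cell(cell) for cell in row])
```

Each input row produces exactly one output row, which is what keeps the result from collapsing into a single line.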
Instead of applying the re.sub() and the .lower() expressions to the entire file at once, try iterating over each line in the CSV file like this:
for line in myfile:
    line = line.lower()
    line = re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",line)
    fileout.write(line+'\n')
Also, when you use the with <file> as myfile expression, there is no need to close the file at the end of your program; this is done automatically when the with block exits.
Try this regex:
clean_data=' '.join(re.sub("[#\^&\*\$]|#\S+|\S+[a-z0-9]\.(com|net|org)"," ",data).split()) # regular expression to strip the html out of the text
Explanation:
[#\^&\*\$] matches on the characters, you want to replace
#\S+ matches on hash tags
\S+[a-z0-9]\.(com|net|org) matches on domain names
If the URLs can't be identified by https?, you'll have to complete the list of potential TLDs.
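To see what this pattern does while keeping one output line per input line, it can be applied line by line. The sketch below uses the pattern unchanged; clean_line and clean_text are illustrative names. One thing worth knowing: alternatives are tried left to right, so at a # the single-character class matches first and the #\S+ branch never fires, which means only the # itself is removed and the tag text stays.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(r"[#\^&\*\$]|#\S+|\S+[a-z0-9]\.(com|net|org)")


def clean_line(line):
    """Blank out matches, then collapse the leftover whitespace."""
    return " ".join(PATTERN.sub(" ", line).split())


def clean_text(data):
    """Clean each line separately so the line structure is preserved."""
    return "\n".join(clean_line(line) for line in data.splitlines())


print(clean_line("this is a test #abc example.com"))
```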

How does the code know when to split into a line?

So I was learning how to download files from the web using Python, but got a bit thrown by one part of the code.
Here is the code:
from urllib import request
def download_stock_data(csv_url):
    response = request.urlopen(csv_url)
    csv = response.read()
    csv_str = str(csv)
    lines = csv_str.split("\\n")
    dest_url = r"stock.csv"
    fx = open(dest_url, "w")
    for line in lines:
        fx.write(line + "\n")
    fx.close()
I don't quite understand the code in the variable lines. How does it know when to split into a new line in a csv file?
A csv file is essentially just a text file with comma-separated data, but it also contains new lines (via the newline ASCII character).
If there were a csv file consisting of one long comma-separated line, for line in lines: would only see that single line.
You can open it in Notepad++ or something similar to see the raw .csv file. Excel puts data separated by commas into cells, and data on a new line into a new row.
"\n" is where the instruction to create a new line comes from.
In the code you have presented, you are telling Python to split the string you received on "\\n". That is the two-character sequence backslash, n: calling str() on the raw bytes produces their printable representation, in which every newline appears as those two characters. So you get a list of strings split into lines.
When you write to fx, you are inserting a newline character onto every line you write by appending "\n". If you didn't do this, then you would just get one very long line.
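The difference between the two approaches can be seen without any download, using a small bytes literal in place of response.read(). This is a sketch assuming Python 3 and UTF-8 encoded data; split_csv_bytes is an illustrative helper, not part of the original code.

```python
def split_csv_bytes(raw, encoding="utf-8"):
    """Decode downloaded bytes and split them into lines."""
    # Assumption: the server sends UTF-8 (or plain ASCII) text.
    return raw.decode(encoding).splitlines()


# Stand-in for the bytes returned by response.read()
raw = b"Date,Close\n2015-01-02,10.5\n"

# What the original code does: split the printable representation of the bytes
print(str(raw).split("\\n"))

# Decoding first gives clean lines without the b'...' wrapper
print(split_csv_bytes(raw))
```

The first result carries the leading b' and a trailing ' as leftovers of the representation, which is why decoding the bytes is usually the cleaner route.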
