Converting .tsv file to .txt creates unintended characters, possible fix? - python

I need to process a .tsv file that has 1 million lines and then save it as a .txt file. I am able to do that this way:
import csv

with open("data.tsv") as fd, open('pre_processed_data.txt', 'wb') as csvout:
    rd = csv.reader(fd, delimiter="\t", quotechar='"')
    csvout = csv.writer(csvout, delimiter='\t')
    for row in rd:
        csvout.writerow([row[1], row[2], row[3]])
However, beyond a certain point, unintended special characters crawl in along with the tabs. The first column should contain only numeric values between 0 and 1, yet special characters appear in between.
What could possibly be causing this, and how can I effectively resolve it?

These extra characters exist in the input file. As you have no control over the file, the easiest thing to do is to remove them as you process the data. The re module's sub function can do this:
>>> import re
>>> s = '1#'
>>> re.sub(r'\D+', '', s)
'1'
The r'\D+' pattern matches any run of non-digit characters and removes it from the provided string.
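Folded into the loop from the question, that might look like this (a minimal sketch, assuming Python 3 file handling; note that \D would also strip a decimal point, so if the first column holds values like 0.5 you would want a pattern such as r'[^\d.]+' instead):

import csv
import re

with open("data.tsv", newline='') as fd, open('pre_processed_data.txt', 'w', newline='') as out:
    rd = csv.reader(fd, delimiter="\t", quotechar='"')
    wr = csv.writer(out, delimiter='\t')
    for row in rd:
        # strip anything that is not a digit or a decimal point from each field
        cleaned = [re.sub(r'[^\d.]+', '', field) for field in (row[1], row[2], row[3])]
        wr.writerow(cleaned)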

Related

How to remove erroneous tabs/new line from .vcf file?

I am working with a vcf file. I try to extract information from this file, but the file has errors in its format.
In this file there is a column that contains long character strings. The error is that a number of tabs and a newline character are erroneously placed within some rows of this column. So when I try to read in this tab-delimited file, all the columns are messed up.
I have an idea how to solve this, but don't know how to execute it in code. The string is DNA, so it always consists of the characters ATCG. Basically, if one could look for a number of tabs and a newline within the characters ATCG and remove them, then the file would be fixed:
ACTGCTGA\t\t\t\t\nCTGATCGA would become:
ACTGCTGACTGATCGA
So one would need to look into this file, look for [ACTG] followed by tabs or newlines, followed by more [ACTG], and then replace this with nothing. Any idea how to do this?
with open('file.vcf', 'r') as f:
    lines = [l for l in f if not l.startswith('##')]
Here's one way with regex:
First read the file in:
import re
with open('file.vcf', 'r') as file:
    dnafile = file.read()
Then write a new file with the changes:
with open('fileNew.vcf', 'w') as file:
    file.write(re.sub("(?<=[ACTG]{2})((\\t)*(\\n)*)(?=[ACTG]{2})", "", dnafile))
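To check the pattern against the example from the question (a quick interactive demo; the pattern is written as a raw string here):

>>> import re
>>> s = 'ACTGCTGA\t\t\t\t\nCTGATCGA'
>>> re.sub(r'(?<=[ACTG]{2})((\t)*(\n)*)(?=[ACTG]{2})', '', s)
'ACTGCTGACTGATCGA'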

Writing a string to CSV using line escapes in python 3

Working in Python 3.7.
I'm currently pulling data from an API (Qualys's API, fetching a report, to be specific). It returns a string with all the report data in a CSV format, with each new line designated by a '\r\n' escape.
(i.e. 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n')
The problem I'm having is writing this string properly to a CSV file. Every iteration of code I've tried writes the data cell by cell (when viewed in Excel) all on one row, with the \r\n appended wherever it was in the string, rather than starting a new line.
(i.e |foo|bar|stuff\r\n|more stuff|data|report\r\n|etc|etc|etc\r\n|)
I'm just making the switch from 2 to 3, so I'm almost positive it's a syntax error or a gap in my understanding of how Python 3 handles newline delimiters, but even after reviewing the documentation, this site, and blog posts, I either can't get my head around it or I'm consistently missing something.
current code:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res))  # returns string
    # input('pause')
    f_csv = open(title, 'w', newline='\r\n')
    f_csv.write(res)
    f_csv.close()
but I've also tried:
with open(title, 'w', newline='\r\n') as f:
    writer = csv.writer(f, <tried encoding here, no luck>)
    writer.writerows(res)
    # anyone else looking at this: this didn't work because of the difference
    # between writerow() and writerows()
and I've also tried various ways to declare newline, such as:
newline=''
newline='\n'
etc...
and various other iterations along these lines. Any suggestions or guidance or... anything at this point would be awesome.
Edit:
OK, I've continued to work on it, and this kinda works:
def dl_report(id, title):
    data = {'action': 'fetch', 'id': id}
    res = a.request('/api/2.0/fo/report/', data=data)
    print(type(res))  # returns string
    reader = csv.reader(res.split(r'\r\n'), delimiter=',')
    with open(title, 'w') as outfile:
        writer = csv.writer(outfile, delimiter='\n')
        writer.writerow(reader)
But it's ugly, and it creates errors in the output CSV (some rows, fewer than 1%, don't parse as a CSV row, probably a formatting error somewhere), and more concerning is that it behaves oddly when a "\" appears in the data.
I would really be interested in a solution that works... better? More Pythonic? More consistent would be nice...
Any ideas?
Based on your comments, the data you're being served doesn't actually include carriage returns or newlines, it includes the text representing the escapes for carriage returns and newlines (so it really has a backslash, r, backslash, n in the data). It's otherwise already in the form you want, so you don't need to involve the csv module at all, just interpret the escapes to their correct value, then write the data directly.
This is relatively simple using the unicode-escape codec (which also handles ASCII escapes):
import codecs # Needed for text->text decoding
# ... retrieve data here, store to res ...
# Converts backslash followed by r to carriage return, by n to newline,
# and so on for other escapes
decoded = codecs.decode(res, 'unicode-escape')
# newline='' means don't perform line ending conversions, so you keep \r\n
# on all systems, no adding, no removing characters
# You may want to explicitly specify an encoding like UTF-8, rather than
# relying on the system default, so your code is portable across locales
with open(title, 'w', newline='') as f:
    f.write(decoded)
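For example, with a string containing literal backslash escapes like the one in the question (a quick demo):

>>> import codecs
>>> raw = 'foo,bar,stuff\\r\\n,more stuff,data,report\\r\\n'  # literal backslash, r, backslash, n
>>> codecs.decode(raw, 'unicode-escape')
'foo,bar,stuff\r\n,more stuff,data,report\r\n'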
If the strings you receive are actually wrapped in quotes (so print(repr(s)) includes quotes on either end), it's possible they're intended to be interpreted as JSON strings. In that case, just replace the import and creation of decoded with:
import json
decoded = json.loads(res)
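For instance, with a quoted payload (another quick demo):

>>> import json
>>> json.loads('"foo\\r\\nbar"')
'foo\r\nbar'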
If I understand your question correctly, can't you just replace the string?
with open(title, 'w') as f: f.write(res.replace("\r\n", "\n"))
Check out this answer:
Python csv string to array
According to the csv module's documentation, \r\n is the line terminator it expects by default. Your string should work fine with it. If you load the string into a csv.reader object, you should then be able to export it in the standard way.
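For example, splitting the string into rows first, as the linked answer suggests (a small illustrative demo):

>>> import csv
>>> res = 'foo,bar,stuff\r\n,more stuff,data,report\r\n'
>>> list(csv.reader(res.splitlines()))
[['foo', 'bar', 'stuff'], ['', 'more stuff', 'data', 'report']]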
Python strings use the single \n newline character. Normally, a \r\n is converted to \n when a file is read,
and the newline is converted to \n or \r\n on write, depending on your system default and the newline= parameter.
In your case, the \r wasn't removed when you read the data from the web interface. When you opened the file with newline='\r\n', Python expanded each \n as it was supposed to, but the \r passed through untouched, and now your newline is \r\r\n. You can see that by rereading the text file in binary mode:
>>> res = 'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
>>> open('test', 'w', newline='\r\n').write(res)
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\r\n,more stuff,data,report\r\r\n,etc,etc,etc\r\r\n'
Since you already have the line endings you want, just write in binary mode and skip the conversions:
>>> open('test', 'wb').write(res.encode())
54
>>> open('test', 'rb').read()
b'foo,bar,stuff\r\n,more stuff,data,report\r\n,etc,etc,etc\r\n'
Notice I used the system default encoding, but you likely want to standardize on an encoding.
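For example, pinning the encoding down explicitly (UTF-8 chosen here as an assumption about what your consumers expect):

>>> open('test', 'wb').write(res.encode('utf-8'))
54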

Converting tsv to tsv in python

I have a tsv file (tab-separated) and would like to filter out a lot of data using Python before I import it into a PostgreSQL database.
My problem is that I can't find a way to keep the format of the original file, which is mandatory because otherwise the import process won't work.
The web suggested that I should use the csv library, but no matter what delimiter I use I always end up with files in a different format than the original, e.g. files that contain a comma after every character, files that contain a tab after every character, or files that have all the data in one row.
Here is my code:
import csv
import glob

# create a list of all tsv-files in one directory
liste = glob.glob("/some_directory/*.tsv")
# go through all the files
for item in liste:
    # open the tsv-file for reading and a file for writing
    with open(item, 'r') as tsvin, open('/some_directory/new.tsv', 'w') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t')
        # I am not sure if I have to enter a delimiter here for the outfile. If I enter
        # "delimiter='\t'" like for the in-file, the outfile ends up with a tab after every character
        writer = csv.writer(csvout)
        # go through all lines of the input tsv
        for row in tsvin:
            # do some filtering
            if 'some_substring1' in row[4] or 'some_substring2' in row[4]:
                # do some more filtering
                if 'some_substring1' in str(row[9]) or 'some_substring1' in str(row[9]):
                    # now I get lost...
                    writer.writerow(row)
Do you have any idea what I am doing wrong? The final file has to have a tab between every field and some kind of line break at the end.
Somehow you are passing a string to w.writerow(), not a list as expected.
Remember that strings are iterable; each iteration returns a single character from the string. writerow() simply iterates over its argument writing each item separated by the delimiter character (by default a comma). So if you pass a string to writerow() it will write each character from the string separated by the delimiter.
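You can see this in an interactive session (a small Python 3 demo; note that csv.writer's default line terminator is \r\n):

>>> import csv, io
>>> buf = io.StringIO()
>>> csv.writer(buf).writerow("abc")  # a string, not a list
7
>>> buf.getvalue()
'a,b,c\r\n'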
How is it that row is a string? It could be that the delimiter for the input file is incorrect - perhaps the file does not use tabs but has fixed field widths using runs of spaces as the delimiter.
You can check whether the reader is correctly parsing your file by printing out the value of row:
for row in tsvin:
    print(row)
    ...
If the file is being correctly parsed, expect to see that row is a list, and that each element of the list corresponds to a column/field from the file.
If it is not parsing correctly then you might see that row is a string, or that it's a list but the fields are empty and/or out of place.
It would be helpful if you added a sample of your input file to the question.
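As an aside, if the input really is tab-delimited, remember that the writer needs its own delimiter too, or it will fall back to commas; a minimal sketch of the write side (file names here are placeholders):

import csv

with open('input.tsv', 'r') as tsvin, open('output.tsv', 'w') as tsvout:
    reader = csv.reader(tsvin, delimiter='\t')
    # without delimiter='\t' the writer would emit comma-separated rows
    writer = csv.writer(tsvout, delimiter='\t')
    for row in reader:
        writer.writerow(row)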

How to write clean data to a file in python in tabulated format

Issue: Remove the hyperlinks, numbers and signs like ^&*$ etc. from twitter text. The tweet file is in CSV tabulated format as shown below:
s.No. username tweetText
1. #abc This is a test #abc example.com
2. #bcd This is another test #bcd example.com
Being a novice at Python, I searched and strung together the following code, thanks to the code given here:
import re

fileName = "path-to-file//tweetfile.csv"
fileout = open("Output.txt", "w")
with open(fileName, 'r') as myfile:
    data = myfile.read().lower()  # read the file and convert all text to lowercase
    clean_data = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", data).split())  # regular expression to strip the html out of the text
    fileout.write(clean_data + '\n')  # write the cleaned data to a file
fileout.close()
myfile.close()
print "All done"
It does the data stripping, but the output file format is not what I desire. The output text file is all on a single line, like:
s.no username tweetText 1 abc This is a cleaned tweet 2 bcd This is another cleaned tweet 3 efg This is yet another cleaned tweet
How can I fix this code to give me an output like given below:
s.No. username tweetText
1 abc This is a test
2 bcd This is another test
3 efg This is yet another test
I think something needs to be added in the regular expression code but I don't know what it could be. Any pointers or suggestions will be helpful.
You can read the line, clean it, and write it out in one loop. You can also use the CSV module to help you build out your result file.
import csv
import re

exp = r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def cleaner(row):
    return [re.sub(exp, " ", item.lower()) for item in row]

with open('input.csv', 'r') as i, open('output.csv', 'wb') as o:
    reader = csv.reader(i, delimiter=',')  # Comma is the default
    writer = csv.writer(o, delimiter=',')
    # Take the first row from the input file (the header)
    # and write it to the output file
    writer.writerow(next(reader))
    for row in reader:
        writer.writerow(cleaner(row))
The csv module knows how to correctly add separators between items, as long as you pass it a collection of items.
So, what the cleaner function does is take each item (column) in the row from the input file, apply the substitution to the lowercase version of the item, and then return the results as a list.
The rest of the code simply opens the files and configures the csv module with the separators you want for the input and output files (in the example code, the separator for both files is a comma, but you can change the output separator).
Next, the first row of the input file is read and written out to the output file. No transformation is done on this row (which is why it is not in the loop).
Reading the row from the input file automatically puts the file pointer on the next row - so then we simply loop through the input rows (in reader), for each row apply the cleaner function - this will return a list - and then write that list back to the output file with writer.writerow().
Instead of applying the re.sub() and .lower() expressions to the entire file at once, try iterating over each line in the CSV file like this:
for line in myfile:
    line = line.lower()
    line = re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", line)
    fileout.write(line + '\n')
Also, when you use the with <file> as myfile expression, there is no need to close the file at the end of your program; this is done automatically when you use with.
Try this regex:
clean_data = ' '.join(re.sub("[#\^&\*\$]|#\S+|\S+[a-z0-9]\.(com|net|org)", " ", data).split())  # regular expression to strip the html out of the text
Explanation:
[#\^&\*\$] matches on the characters, you want to replace
#\S+ matches on hashtags
\S+[a-z0-9]\.(com|net|org) matches on domain names
If the URLs can't be identified by https?, you'll have to complete the list of potential TLDs.

How to convert tab separated, pipe separated to CSV file format in Python

I have a text file (.txt) which could be in tab-separated or pipe-separated format, and I need to convert it into CSV file format. I am using Python 2.6. Can anyone suggest how to identify the delimiter in a text file, read the data, and then convert it into a comma-separated file?
Thanks in advance
I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is that, quoting ESR:
the Microsoft version of CSV is a textbook example of how not to design a textual file format.
The delimiter needs to be escaped in some way if it can appear in fields. Without knowing, how the escaping is done, automatically identifying it is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes which then must be escaped, too. This is not a trivial task.
So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers or some variant.
Edit:
Python provides csv.Sniffer that can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):
a|b|c
"a|b"|c|d
foo|"bar|baz"|qux
You can do this:
import csv

csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
    # => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}
# write records using other dialect
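That last commented step might look like this (a sketch in the same Python 2 style; the csv module's default writer dialect is comma-separated):

import csv

csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect=dialect)

out = open("out.csv", "wb")
writer = csv.writer(out)  # default dialect writes comma-separated rows
for row in reader:
    writer.writerow(row)
out.close()
csvfile.close()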
Your strategy could be the following:
parse the file with BOTH a tab-separated csv reader and a pipe-separated csv reader
calculate some statistics on the resulting rows to decide which result set is the one you want to write. One idea could be counting the total number of fields in the two record sets (expecting that tabs and pipes are not so common). Another (if your data is strongly structured and you expect the same number of fields in each line) could be measuring the standard deviation of the number of fields per line and taking the record set with the smallest standard deviation.
In the following example you'll find the simpler statistic (the total number of fields):
import csv

piperows = []
tabrows = []

# parsing | delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()

# parsing TAB delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:
    tabrows.append(row)
f.close()

# in this example, we use the total number of fields as the indicator
# (but it's not guaranteed to work! it depends on the nature of your data)
# count total fields
totfieldspipe = reduce(lambda x, y: x + y, [len(f) for f in piperows])
totfieldstab = reduce(lambda x, y: x + y, [len(f) for f in tabrows])

if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows

# the var yourrows contains the rows; now just write them in any format you like
Like this:
from __future__ import with_statement
import csv
import re

with open(input, "r") as source:
    with open(output, "wb") as destination:
        writer = csv.writer(destination)
        for line in source:
            writer.writerow(re.split('[\t|]', line.rstrip('\n')))
I would suggest taking some of the example code from the existing answers (or, perhaps better, using Python's csv module) and changing it to first assume tab-separated, then pipe-separated, producing two comma-separated output files. Then you visually examine both files, determine which one you want, and pick that.
If you actually have lots of files, then you need to find a way to detect which file is which.
One of the examples has this:
if "|" in line:
This may be enough: if the first line of a file contains a pipe, then maybe the whole file is pipe separated, else assume a tab separated file.
Alternatively fix the file to contain a key field in the first line which is easily identified - or maybe the first line contains column headers which can be detected.
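A minimal sketch of that first-line heuristic (Python 2 style to match the question; convert_file is a hypothetical helper name, and it assumes the delimiter actually appears in the first line):

import csv

def convert_file(inpath, outpath):
    f = open(inpath, "rb")
    first_line = f.readline()
    # guess the delimiter from the first line
    delim = "|" if "|" in first_line else "\t"
    f.seek(0)
    reader = csv.reader(f, delimiter=delim)
    out = open(outpath, "wb")
    writer = csv.writer(out)  # comma-separated output
    for row in reader:
        writer.writerow(row)
    out.close()
    f.close()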
for line in open("file"):
    line = line.strip()
    if "|" in line:
        print ','.join(line.split("|"))
    else:
        print ','.join(line.split("\t"))
