Problem reading a CSV file in Python

I am trying to read a very simple but somewhat large (800 MB) CSV file using the csv library in Python. The delimiter is a single tab and each line consists of some numbers.
Each line is a record, and I have 20681 rows in my file. I had some problems during my calculations using this file: it always stops at a certain row. I got suspicious about the number of rows in the file, so I used the code below to count them:
tfdf_Reader = csv.reader(open('v2-host_tfdf_en.txt'), delimiter=' ')
c = 0
for row in tfdf_Reader:
    c = c + 1
print c
To my surprise c is printed with the value of 61722!!! Why is this happening? What am I doing wrong?

800 million bytes in the file and 20681 rows means that the average row size is over 38 THOUSAND bytes. Are you sure? How many numbers do you expect in each line? How do you know that you have 20681 rows? That the file is 800 Mb?
61722 rows is almost exactly 3 times 20681 -- is the number 3 of any significance e.g. 3 logical sub-sections of each record?
To find out what you really have in your file, don't rely on what it looks like. Python's repr() function is your friend.
Are you on Windows? Even if not, always open(filename, 'rb').
If the fields are tab-separated, then don't pass delimiter=" " (whatever is between those quotes appears not to be a tab). Pass delimiter="\t".
Try putting some debug statements in your code, like this:
import csv

DEBUG = True
f = open('v2-host_tfdf_en.txt', 'rb')
if DEBUG:
    rawdata = f.read(200)
    f.seek(0)
    print 'rawdata', repr(rawdata)
    # what is the delimiter between fields? between rows?
tfdf_Reader = csv.reader(f, delimiter='\t')
c = 0
for row in tfdf_Reader:
    c = c + 1
    if DEBUG and c <= 10:
        print "row", c, repr(row)
        # Are you getting rows like you expect?
print "rowcount", c
Note: if you are getting Error: field larger than field limit (131072), that means your file has 128 KB of data with no delimiters.
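That error usually means the delimiter is wrong, but if a legitimate field really can be that large, a common workaround is to raise the csv module's per-field cap; a minimal sketch:
import sys
import csv

# Raise the csv module's per-field size cap (default is 131072 bytes).
# Passing sys.maxsize is the usual idiom for "no practical limit".
csv.field_size_limit(sys.maxsize)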
I'd suspect that:
(a) your file has random junk or a big chunk of binary zeroes appended to it; this should be obvious in a hex editor, and it also should be obvious in a TEXT editor. Print all the rows that you do get, to help identify where the trouble starts.
or (b) the delimiter is a string of one or more whitespace characters (space, tab), the first few rows have tabs, and the remaining rows have spaces. If so, this should be obvious in a hex editor (or in Notepad++, especially if you do View / Show Symbol / Show all characters). If this is the case, you can't use csv; you'd need something simple like:
f = open('v2-host_tfdf_en.txt', 'r') # NOT 'rb'
rows = [line.split() for line in f]

My first guess would be the delimiter. How are you ensuring the delimiter is a tab?
What is the value you are actually passing? (The code you pasted lists a space, but I'm sure you intended to pass something else.)
If your file is tab-separated, then look specifically for '\t' as your delimiter. Looking for a space would mess up situations where there is a space in your data that is not a column separator.
Also, if your file is Excel tab-delimited, then there is a special "dialect" (excel-tab) for that.
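For example, assuming the file really is Excel-style tab-delimited, the dialect is selected like this:
import csv

# 'excel-tab' is a built-in csv dialect: Excel conventions, tab-delimited
with open('v2-host_tfdf_en.txt', 'rb') as f:
    reader = csv.reader(f, dialect='excel-tab')
    for row in reader:
        print row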

Related

script to cat every other (even) line in a set of files together while leaving the odd lines unchanged

I have a set of three .fasta files in a standardized format. Each record begins with a string that acts as a header on line 1, followed by a long string of nucleotides on line 2, where the header string denotes the animal that the nucleotide sequence came from. There are 14 such records altogether, for a total of 28 lines per file, and each of the three files has the headers in the same order. A snippet of one of the files is included below as an example, with the sequences shortened for clarity.
anas-crecca-crecca_KSW4951-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
anas-crecca-crecca_KGM021-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
anas-crecca-crecca_KGM020-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
What I would like to do is write a script or program that cats each of the strings of nucleotides together, but keeps them in the same position. My knowledge, however, is limited to rudimentary python, and I'd appreciate any help or tips someone could give me.
Try this:
data = ""
with open('filename.fasta') as f:
i = 0
for line in f:
i=i+1
if (i%2 == 0):
data = data + line[:-1]
# Copy and paste above block for each file,
# replacing filename with the actual name.
print(data)
Remember to replace "filename.fasta" with your actual file name!
How it works
The variable i acts as a line counter; when it is even, i % 2 is zero and the line is concatenated onto the data string. This way, the odd (header) lines are ignored.
The [:-1] on the data line removes the trailing line break, allowing all the sequences to be joined on a single line.
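If you want to keep the 14 records separate instead (sequence i from each file joined into one long sequence under its original header), here is a minimal sketch using zip; the file names are placeholders for your actual files:
# Hypothetical file names; replace with your real .fasta files
filenames = ['file1.fasta', 'file2.fasta', 'file3.fasta']

# Read each file into a list of (header, sequence) pairs
records = []
for name in filenames:
    with open(name) as f:
        lines = [line.rstrip('\n') for line in f]
    # pair line 1 with line 2, line 3 with line 4, ...
    records.append(list(zip(lines[0::2], lines[1::2])))

# The headers are in the same order in all three files, so zip aligns them
for triple in zip(*records):
    header = triple[0][0]  # same header in each file
    sequence = ''.join(seq for _, seq in triple)
    print(header)
    print(sequence)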

Adding a comma to the end of every row in python

I have a list of sample codes which I input into a website to get information about each of them (they are codes for stars, but it doesn't matter what the codes are; they are just long strings of characters). All these numbers are in one column, one number per row. The website I need to input this file into expects the numbers to still be in a column, but with a comma after each number. This is an example:
Instead of:
164891738509173
184818483848283
18483943491u385
It's supposed to look like this:
164891738509173,
184818483848283,
18483943491u385,
I wanted to write a quick Python script to do that automatically for each number in the entire column. How do I do that? I could theoretically manage it by hand if the number of stars I'm dealing with were small, but unfortunately I need to input something like 60000 stars (so 60000 of these numbers) into the website, so doing it manually is suicide.
Very simple:
open('output.txt', 'w').writelines(  # open 'output.txt' for writing and write multiple lines
    line.rstrip('\n') + ',\n'        # append a comma to each line
    for line in open('input.txt')    # read lines with numbers from 'input.txt'
)
You could do it more idiomatically and use a with block, but that's probably overkill for such a small task:
with open('input.txt') as In, open('output.txt', 'w') as Out:
    for line in In:
        Out.write(line.rstrip('\n') + ',\n')
Is this what you want?
If you want to add comma at end the every entry during printing, you can do this:
>>> codes = ['164891738509173', '184818483848283', '18483943491u385']
>>> for code in codes:
...     print(code, end=',\n')
...
164891738509173,
184818483848283,
18483943491u385,
To add a comma to every item within the list,
>>> end_comma = [f"{code}," for code in codes]
>>> end_comma
['164891738509173,', '184818483848283,', '18483943491u385,']

Python script to import a comma separated csv that has fixed length fields

I have a .csv file with comma-separated fields. I am receiving this file from a 3rd party and the content cannot change. I need to import the file to a database, but there are commas in some of the "comma" separated fields. The comma-separated fields are also fixed length: when I straight up print the fields as per the lines in the insert_line_csv function below, they are spaced at a fixed length.
I essentially need an efficient method of collecting fields that could have commas included in them. I was hoping to combine the two methods. Not sure if that would be efficient.
I am using Python 3 and am willing to use any libraries that make the job efficient and easy.
Currently I have the following:
with open(FileName, 'r') as f:
    for count, line in enumerate(f):
        insert_line_csv(count, line)
with the insert_line_csv function looking like:
def insert_line_csv(line_no, line):
    line = line.split(",")
    field0 = line[0]
    field1 = line[1]
    ......
I am importing the line_no, as that is also being entered into the db.
Any insight would be appreciated.
A sample dataset:
text ,2000.00 ,2018-07-07,textwithoutcomma ,text ,1
text ,3000.00 ,2018-07-08,textwith,comma ,text ,7
text ,1000.00 ,2018-07-07,textwithoutcomma ,text ,4
If the comma-separated fields are all fixed length, you should be able to just slice them off by character count instead of splitting on commas; see Split string by count of characters.
As mock-up code, you have something like the sketch below; the field widths are hypothetical guesses from your sample, so adjust them to your actual layout:
widths = [5, 8, 10, 17, 5, 1]  # hypothetical width of each field

def split_fixed(line):
    fields = []
    pos = 0
    for w in widths:
        fields.append(line[pos:pos + w])  # cut the next chunk off the line
        pos += w + 1                      # skip the comma separator
    return fields                         # write these to the db
That should work imho
Edit:
Upon seeing your sample dataset: can there be only one field with commas inside it? If so, you could split on commas, read off the first three fields, then the last two. Whatever is left over you concatenate again, because it is the value of the 4th field. (If it had commas, you'll actually need to re-join there; if not, it's already the value.) A sketch of this follows below.
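A minimal sketch of that idea, assuming only the 4th field can contain embedded commas:
def parse_line(line):
    parts = line.rstrip('\n').split(',')
    first_three = parts[:3]
    last_two = parts[-2:]
    # everything in between belongs to the 4th field;
    # re-join it in case it contained commas
    middle = ','.join(parts[3:-2])
    return first_three + [middle] + last_two

# Example with the sample row that contains an embedded comma:
print(parse_line('text ,3000.00 ,2018-07-08,textwith,comma ,text ,7'))
# ['text ', '3000.00 ', '2018-07-08', 'textwith,comma ', 'text ', '7']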

Excel delimited file

I have an Excel file that contains data in multiple columns of varying width that I need to work with on my PC. However, the file contains SOH and STX characters as delimiters, since it came from TextEdit on a Mac. The SOH is the field delimiter and the STX is the row delimiter. On my PC, both of these characters are shown as rectangles. I can't use the fixed-width delimited option since I would lose data. I tried writing a Python script, but Python doesn't recognize the SOH and STX either; it just displays them as rectangles too. How do I split these records appropriately? I would appreciate any possible method.
Thanks!
This should work
SOH = '\x01'
STX = '\x02'

# As written, this function returns the values as strings, not as integers
def read_lines(filename):
    rawdata = open(filename, "rb").read()
    for l in rawdata.split(SOH + STX):
        if not l:
            continue
        yield l.split(SOH)

# rows is a list; each element in the list is a row of values
# (either a list or a tuple, for example)
def write_lines(filename, rows):
    with open(filename, "wb") as f:
        for row in rows:
            f.write(SOH.join([str(x) for x in row]) + SOH + STX)
Edit: Example use...
for row in read_lines("myfile.csv"):
    print ", ".join(row)

Python: read a file and replace it line by line with a certain condition

I have a file like the one below.
0 0 0
0.00254 0.00047 0.00089
0.54230 0.87300 0.74500
0 0 0
I want to modify this file. If a value is less than 0.05, it should become 1; otherwise it should become 0.
After the Python script runs, the file should look like:
1 1 1
1 1 1
0 0 0
1 1 1
Would you please help me?
OK, since you're new to StackOverflow (welcome!) I'll walk you through this. I'm assuming your file is called test.txt.
with open("test.txt") as infile, open("new.txt", "w") as outfile:
opens the files we need, our input file and a new output file. The with statement ensures that the files will be closed after the block is exited.
for line in infile:
loops through the file line by line.
values = [float(value) for value in line.split()]
Now this is more complicated. Every line contains space-separated values. These can be split into a list of strings using line.split(). But they are still strings, so they must be converted to floats first. All this is done with a list comprehension. The result is that, for example, after the second line has been processed this way, values is now the following list: [0.00254, 0.00047, 0.00089].
results = ["1" if value < 0.05 else "0" for value in values]
Now we're creating a new list called results. Each element corresponds to an element of values, and it's going to be a "1" if that value < 0.05, or a "0" if it isn't.
outfile.write(" ".join(results))
converts the list of "integer strings" back into a single string, with the entries separated by spaces.
outfile.write("\n")
adds a newline. Done.
The two list comprehensions could be combined into one, if you don't mind the extra complexity:
results = ["1" if float(value) < 0.05 else "0" for value in line.split()]
If you can use libraries, I'd suggest numpy:
import numpy as np

myarray = np.genfromtxt("my_path_to_text_file.txt")
out_array = np.where(myarray < 0.05, 1, 0)
np.savetxt("my_output_file.txt", out_array, fmt="%d")  # pick any output path
You can add formatting as arguments to the savetxt function. The docstring of the function is pretty self-explanatory.
If you are stuck with pure Python:
with open("my_path_to_text_file") as my_file:
    list_of_lines = my_file.readlines()
list_of_lines = [[int(float(x) < 0.05) for x in line.split()] for line in list_of_lines]
then write that list to file as you see fit.
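For instance, a minimal way to write that list back out (the output file name is arbitrary):
with open("my_output_file.txt", "w") as out_file:
    for line_values in list_of_lines:
        out_file.write(" ".join(str(v) for v in line_values) + "\n")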
You can use this code:
f_in = open("file_in.txt", "r")  # opens the file in reading mode
in_lines = f_in.readlines()      # reads it line by line
out = []
for line in in_lines:
    list_values = line.split()   # split on whitespace, returning the numbers as strings
    for i in range(len(list_values)):
        list_values[i] = float(list_values[i])  # convert them to floats
        if list_values[i] < 0.05:  # your condition
            list_values[i] = 1
        else:
            list_values[i] = 0
    out.append(list_values)  # store the numbers in a list, one list per line
f_in.close()  # closes the file

f_out = open("file_out.txt", "w")  # opens a new file in writing mode
for cur_list in out:
    for i in cur_list:
        f_out.write(str(i) + "\t")  # writes each number, plus a tab
    f_out.write("\n")  # writes a newline
f_out.close()  # closes the file
The following code performs the replacements in place: for that, the file is opened in 'rb+' mode. It's absolutely mandatory to open it in binary mode (the b). The + in 'rb+' means that it's possible both to write and to read in the file. Note that the mode can also be written 'r+b'.
But using 'rb+' is awkward:
If you read with for line in f, the file is read in chunks, and several lines are kept in a buffer where they are really read one after the other, until another chunk of data is read and loaded into the buffer. That makes it harder to perform transformations, because one must track the position of the file pointer with tell() and move the pointer with seek(); in fact, I've not completely understood how it must be done.
Happily, there's a solution with readline(): I don't know why, but as far as I can observe, when readline() reads a line, the file pointer doesn't go further on disk than the end of the line (that is to say, it stops at the newline). Now it's easy to know and move the position of the file pointer.
To write after reading, it's necessary to call seek() first, even if only seek(0, 1), meaning a move of 0 characters from the current position. That resets the state of the file pointer, something like that.
Well, for your problem, the code is as follows:
import re
from os import fsync
from os.path import getsize

reg = re.compile(r'[\d.]+')

def ripl(m):
    g = m.group()
    # pad with spaces so the replacement is exactly as long as the original
    return ('1' if float(g) < 0.05 else '0').ljust(len(g))

path = '...........'
print 'length of file before : %d' % getsize(path)
with open(path, 'rb+') as f:
    line = 'go'
    while line:
        line = f.readline()
        lg = len(line)
        f.seek(-lg, 1)
        f.write(reg.sub(ripl, line))
        f.flush()
        fsync(f.fileno())
print 'length of file after : %d' % getsize(path)
flush() and fsync() must be executed to ensure that the instruction f.write(reg.sub(ripl, line)) effectively writes at the moment it is ordered to.
Note that I've never handled a file with a Unicode encoding like this. It would certainly be more difficult, since every Unicode character can be encoded over several bytes (and in the case of UTF-8, a variable number of bytes depending on the character).
