I'm pretty new to this, so please move this topic if it's in the wrong place.
Problem (quick note: this is all in Python): I am trying to go through these 100 or so files, each with the same number of columns, take certain columns of the input (the same ones for each file), and write them to a new file. However, these 100 files don't necessarily all have the same number of rows. In the code below, filec is inside a loop and changes as I work through the 100 files. I am trying to get the columns I want by looking at the number of rows in each txt file, looping that many times, and taking the numbers I want.
filec = open(string,'r').read().split(',')
x = len(filec.readlines())
I realize the issue is that filec has become a list after using the split function and was originally a string when I used .read(). How would one go about finding the number of lines, so I can loop through the number of rows and get the positions in each row that I want?
Thank you!
You could do it like this:
filec = open(filename, 'r')
lines = filec.readlines()
for line in lines:
    words = line.split(',')
    # Your code here
Excuse me if there are any errors, I'm doing this on mobile.
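For instance, here is a fuller sketch of that loop that collects specific columns from every file and appends them to one output file. The column indices, the file-name list and output.txt are placeholders for whatever you actually use:
wanted_columns = [0, 2]                    # placeholder: whichever column indices you need
file_names = ['file1.txt', 'file2.txt']    # placeholder: your ~100 file names

with open('output.txt', 'w') as out:       # placeholder output file
    for name in file_names:
        filec = open(name, 'r')
        for line in filec.readlines():     # one iteration per row, however many rows the file has
            words = line.rstrip('\n').split(',')
            out.write(','.join(words[i] for i in wanted_columns) + '\n')
        filec.close()
Iterating over the lines this way means you never need to count the rows up front; the loop simply runs once per row.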
As you are just looking for the count of rows, how about this:
t = tuple(open(r'filepath\filename.txt', 'r'))
print(len(t))
I tried to keep the code clear; it is certainly possible to do this with fewer lines. It takes a list of file names and gives back a dictionary mapping each filename to the column you wanted (as a list).
def read_col_from_files(file_names, column_number):
    ret = {}
    for file_name in file_names:
        with open(file_name) as fp:
            column_for_file = []
            for line in fp:
                columns = line.split('\t')
                column_for_file.append(columns[column_number])
            ret[file_name] = column_for_file
    return ret
I have assumed you have tab delimited columns. Call it like this:
data = read_col_from_files(["file_1.txt", "/tmp/file_t.txt"], 5)
Here is a sensible shortening of the code using a list comprehension
def read_col_from_files(file_names, column_number):
    ret = {}
    for file_name in file_names:
        with open(file_name) as fp:
            ret[file_name] = [line.split('\t')[column_number] for line in fp]
    return ret
And here is how to do it on the command line:
cat FILENAMES | awk '{print $3}'
Related
I have been trying to transpose my table of 2000000+ rows and 300+ columns on a cluster, but it seems that my Python script is getting killed due to lack of memory. I would just like to know if anyone has any suggestions on a more efficient way to store my table data other than using the array, as shown in my code below?
import sys

Seperator = "\t"
m = []
f = open(sys.argv[1], 'r')
data = f.read()
lines = data.split("\n")[:-1]
for line in lines:
    m.append(line.strip().split("\t"))

for i in zip(*m):
    for j in range(len(i)):
        if j != len(i):
            print(i[j] + Seperator)
        else:
            print(i[j])
    print("\n")
Thanks very much.
The first thing to note is that you've been careless with your variables. You're loading a large file into memory as a single string, then a list of strings, then a list of lists of strings, before finally transposing said list. This will result in you storing all the data in the file three times before you even begin to transpose it.
If each individual string in the file is only about 10 characters long then you're going to need 18 GB of memory just to store that (2e6 rows * 300 columns * 10 bytes * 3 duplicates). This is before you factor in all the overhead of Python objects (~27 bytes per string object).
You have a couple of options here:
Create each new transposed row incrementally by re-reading the input file once per output row (i.e., once per original column) and writing each new row as you go (sacrifices time efficiency); a rough sketch of this follows below.
Create one file for each new row and combine these row files at the end (sacrifices disk-space efficiency, and is possibly problematic if the initial file has a lot of columns due to the limit on the number of files a process may have open at once).
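Here is a rough sketch of the first option, re-reading the input once per output row; the comma delimiter and the file names simply mirror the snippet below and are assumptions, not requirements:
delimiter = ','
input_filename = 'file.csv'
output_filename = 'out.csv'

# number of columns in the original file = number of rows in the transposed file
with open(input_filename) as f:
    old_cols = f.readline().count(delimiter) + 1

with open(output_filename, 'w') as out:
    for col in range(old_cols):
        # re-read the whole input for every output row; only one output row is held in memory
        with open(input_filename) as f:
            new_row = [line.rstrip('\n').split(delimiter)[col] for line in f]
        out.write(delimiter.join(new_row) + '\n')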
Transposing with a limited number of open files
delimiter = ','
input_filename = 'file.csv'
output_filename = 'out.csv'

# find out the number of columns in the file
with open(input_filename) as input:
    old_cols = input.readline().count(delimiter) + 1

temp_files = [
    'temp-file-{}.csv'.format(i)
    for i in range(old_cols)
]

# create temp files
for temp_filename in temp_files:
    with open(temp_filename, 'w') as output:
        output.truncate()

with open(input_filename) as input:
    for line in input:
        parts = line.rstrip().split(delimiter)
        assert len(parts) == len(temp_files), 'not enough or too many columns'
        for temp_filename, cell in zip(temp_files, parts):
            with open(temp_filename, 'a') as output:
                output.write(cell)
                output.write(',')

# combine temp files
with open(output_filename, 'w') as output:
    for temp_filename in temp_files:
        with open(temp_filename) as input:
            line = input.read().rstrip()[:-1] + '\n'
            output.write(line)
As the number of columns is far smaller than the number of rows, I would consider writing each column to a separate file and then combining them at the end.
import sys

Separator = "\t"
f = open(sys.argv[1], 'r')
for line in f:
    for i, c in enumerate(line.strip().split("\t")):
        dest = column_file[i]  # you should open 300+ file handles, one for each column
        dest.write(c)
        dest.write(Separator)
# all you need to do after that is combine the content of your "column" files
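A fuller sketch of that idea, with the pieces the comments gloss over filled in; the col_*.txt temp-file names and combined.txt output are my placeholders, not part of the original suggestion:
import sys

Separator = "\t"

with open(sys.argv[1], 'r') as f:
    # work out the column count from the first line, then rewind
    n_cols = len(f.readline().strip().split(Separator))
    f.seek(0)

    # one open handle per column (beware the OS limit on simultaneously open files)
    column_file = [open("col_{}.txt".format(i), 'w') for i in range(n_cols)]
    for line in f:
        for i, c in enumerate(line.strip().split(Separator)):
            column_file[i].write(c)
            column_file[i].write(Separator)
    for handle in column_file:
        handle.close()

# combine the per-column files: each one becomes a row of the transposed table
with open("combined.txt", 'w') as out:
    for i in range(n_cols):
        with open("col_{}.txt".format(i)) as col:
            out.write(col.read().rstrip(Separator) + "\n")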
If you cannot store all of your file into memory, you can read it n times:
import sys

column_number = 4  # if necessary, read the first line of the file to calculate it
separator = '\t'
filename = sys.argv[1]

def get_nth_column(filename, n):
    with open(filename, 'r') as file:
        for line in file:
            if line:  # skip empty lines
                yield line.strip().split('\t')[n]

for column in range(column_number):
    print(separator.join(get_nth_column(filename, column)))
Note that an exception will be raised if the file does not have the right format. You could catch it if necessary.
When reading files, use the with construct to ensure that your file will be closed, and iterate directly over the file instead of reading the whole content first. It is more readable and more memory-efficient.
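For example, a minimal illustration of both points ('data.txt' is just a placeholder file name):
with open('data.txt') as f:        # the file is closed automatically when the block ends
    for line in f:                 # one line at a time; no need to read() the whole file first
        print(line.rstrip('\n'))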
My problem is this. I have one file with 3000 lines and 8 columns (space delimited). The important thing is that the first column is a number ranging from 1 to 22. So, in the spirit of divide-and-conquer, I split the original file into 22 subfiles depending on the first column's value.
I also have some result files: 15 sources, each containing one result file. But the result files are too big, so I applied divide-and-conquer once more to split each of the 15 results into 22 subfiles.
The file structure is as follows:
Original_file:
    split_1, split_2, split_3, ..., split_22
Studies:
    study1:  split_1, split_2, ...
    study2:  split_1, split_2, ...
    ...
    study15: split_1, split_2, ...
So by doing this we pay a slight overhead in the beginning, but all of these split files will be deleted at the end, so it doesn't really matter.
I need my final output to be the original file with some values from the studies appended to it.
So, my take so far is this:
Algorithm:
for i in range(1, 23):
    for j in range(1, 16):
        compare (split_i of the original file) with the jth study's split_i
        if one value in a specific column matches:
            create a list with the needed columns from both files, join it with ' '.join(list) and write the result to the outfile
Is there a better way to go about this problem? The study files range from 300 MB to 1.5 GB in size.
And here's my Python code so far:
folders = ['study1', 'study2', ..., 'study15']

with open("Effects_final.txt", "w") as outfile:
    for i in range(1, 23):
        chr = i
        small_file = "split_" + str(chr) + ".txt"
        with open(small_file, 'r') as sf:
            for sline in sf:  # small files
                sf_parts = sline.split(' ')
                for f in folders:
                    file_to_compare_with = f + "split_" + str(chr) + ".txt"
                    with open(file_to_compare_with, 'r') as cf:  # comparison files
                        for cline in cf:
                            cf_parts = cline.split(' ')
                            if cf_parts[0] == sf_parts[1]:
                                to_write = ' '.join(cf_parts + sf_parts)
                                outfile.write(to_write)
But this code uses four nested loops, which is overkill; still, you have to do it, since you need to read the lines from the two files being compared at the same time. This is my concern...
I found one solution that seems to run in a reasonable amount of time. The code is the following:
with open("output_file", 'w') as outfile:
for i in range(1,23):
dict1 = {} # use a dictionary to map values from the inital file
with open("split_i", 'r') as split:
next(split) #skip the header
line_list = line.split(delimiter)
for line in split:
dict1[line_list[whatever_key_u_use_as_id]] = line_list
compare_dict = {}
for f in folders:
with open("each folder", 'r') as comp:
next(comp) #skip the header
for cline in comp:
cparts = cline.split('delimiter')
compare_dict[cparts[whatever_key_u_use_as_id]] = cparts
for key in dict1:
if key in compare_dict:
outfile.write("write your data")
outfile.close()
With this approach, I'm able to process this dataset in ~10 minutes. Surely there are ways to improve it. One idea is to take the time to sort the datasets; that way, later searches will be quicker and we might save time overall!
I need to crop a large text file of over 10000 lines of numbers into smaller files; each file also needs a header with the format (number_of_lines, number_difference, "Sam").
Number_difference is the difference between the first and last number.
For example, if the file looks like this:
10
12
13.5
17
20
Then, the header should be:
5 10 Sam
The problem is that the flags do not stop the header from being written more than once, and the big file's header carries over to the first small file.
The headers will never be the same for each file.
How do I add a changing header to each file?
def TextCropper():
    lines_per_file = 1000
    smallfile = None
    with open(inputFileName) as bigfile:
        for lineno, line in enumerate(bigfile):
            if lineno % lines_per_file == 0:
                if smallfile:
                    smallfile.close()
                small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
                smallfile = open(small_filename, "w")
                if (flags[counter] == False):
                    smallfile.write(lines_per_file)
                    flags[counter] = True
            smallfile.write(line)
        if smallfile:
            smallfile.close()

TextCropper()
You're reading and writing the lines one at a time, which is inefficient. By doing that, you also don't know what the last line will be, so you can't write your header in advance.
Just read up to N lines, if available. islice() will do exactly that for you. If the list comes back empty, there were no lines left to read, otherwise you can proceed to write the current chunk into a file.
Since each line is read as a number with a trailing newline ('\n'), strip that, convert the first and last numbers into floats and calculate the difference. Writing the actual numbers to the file is straightforward by joining the elements of the list.
To make the function reusable, include the variables that are likely to change as arguments. That way you can name any big file, any output small file and any number of lines you want without changing hardcoded values.
from itertools import islice

def number_difference(iterable):
    return float(iterable[-1].strip('\n')) - float(iterable[0].strip('\n'))

def file_crop(big_fname, chunk_fname, no_lines):
    with open(big_fname, 'r') as big_file:
        ifile = 0
        while True:
            data = list(islice(big_file, no_lines))
            if not data:
                break
            with open('{}_{}.txt'.format(chunk_fname, ifile), 'w') as small_file:
                small_file.write('{} {} Sam\n'.format(len(data), number_difference(data)))
                small_file.write(''.join(data))
            ifile += 1
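Called like this, for example (big_file.txt and the small_file prefix are placeholder names):
# Produces small_file_0.txt, small_file_1.txt, ... with up to 1000 lines each,
# each preceded by its "line_count number_difference Sam" header.
file_crop('big_file.txt', 'small_file', 1000)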
I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
The data file (refT.txt) is a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The second file is opened:
cont_list = open('dataT.txt','r')
It is a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple: (contig, genome line):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        if p[0].strip() in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.remove(p[0].strip())
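Putting the two pieces together with the pair() generator above, a sketch that writes the matches straight to a file; refT.txt comes from the question, while matched_contigs.txt is my placeholder for the output:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('refT.txt') as fin, open('matched_contigs.txt', 'w') as out:
    for name, seq in pair(fin):
        if not wanted:
            break                          # everything found, stop reading early
        if name.strip() in wanted:
            out.write(name)                # contig name line
            out.write(seq)                 # its sequence line
            wanted.remove(name.strip())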
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import izip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = next(cont_list).strip()
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('refT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches one of the contigs you want, it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters.
From there, writing the grabbed sequences to an output file is pretty straightforward; there is a sketch after the example output below.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
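And a minimal sketch of that write step (sequences_out.txt is a placeholder name):
valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
sequences = get_sequences('refT.txt', valid_contigs)
with open('sequences_out.txt', 'w') as out:
    for seq in sequences:
        out.write(seq + '\n')              # one sequence per line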
Basically, I want to be able to count the number of characters in a txt file (with user input of file name). I can get it to display how many lines are in the file, but not how many characters. I am not using the len function and this is what I have:
def length(n):
    value = 0
    for char in n:
        value += 1
    return value

filename = input('Enter the name of the file: ')
f = open(filename)
for data in f:
    data = length(f)
print(data)
All you need to do is sum the number of characters in each line (data):
total = 0
for line in f:
    data = length(line)
    total += data
print(total)
There are two problems.
First, for each line in the file, you're passing f itself—that is, a sequence of lines—to length. That's why it's printing the number of lines in the file. The length of that sequence of lines is the number of lines in the file.
To fix this, you want to pass each line, data—that is, a sequence of characters. So:
for data in f:
    print(length(data))
Next, while that will properly calculate the length of each line, you have to add them all up to get the length of the whole file. So:
total_length = 0
for data in f:
    total_length += length(data)
print(total_length)
However, there's another way to tackle this that's a lot simpler. If you read() the file, you will get one giant string, instead of a sequence of separate lines. So you can just call length once:
data = f.read()
print(length(data))
The problem with this is that you have to have enough memory to store the whole file at once. Sometimes that's not appropriate. But sometimes it is.
When you iterate over a file (opened in text mode) you are iterating over its lines.
for data in f: could be rewritten as for line in f: and it is easier to see what it is doing.
Your length function looks like it should work but you are sending the open file to it instead of each line.
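Putting those points together, a corrected sketch of the original script (still without using len()):
def length(n):
    value = 0
    for char in n:
        value += 1
    return value

filename = input('Enter the name of the file: ')
total = 0
with open(filename) as f:       # 'with' closes the file for you
    for line in f:              # iterate over the lines, not over the open file object
        total += length(line)   # add up the characters in each line (including the newline)
print(total)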