I am trying to read a data file into a 2 dimensional array.
For example:
file.dat:
1 2 3 a
4 5 6 b
7 8 9 c
I tried something like:
file=open("file.dat","r")
var = [[]]
var.append([j for j in i.split()] for i in file)
but that didn't work.
I need the data in the form of a two-dimensional array, since I need to do operations on each element afterwards, e.g.
for k in range(3):
    newval(k) = var[k,1]
Any idea how to do that?
file = open("file.dat", "r")             # open file for reading
var = []                                 # initialize empty array
for line in file:
    var.append(line.strip().split(' '))  # split each line on the <space>, and turn it into an array,
                                         # thus creating an array of arrays
file.close()                             # close file.
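Once var is built this way it is a plain list of lists, so elements are accessed with two separate indexes rather than the var[k,1] style from the question. A minimal usage sketch (newval here is just a hypothetical list for collecting results):
var = [['1', '2', '3', 'a'], ['4', '5', '6', 'b'], ['7', '8', '9', 'c']]
newval = []
for k in range(3):
    newval.append(var[k][1])   # second element of row k; values are still strings
print(newval)                  # ['2', '5', '8']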
This worked for me:
with open("/path/to/file", 'r') as f:
lines = [[float(n) for n in line.strip().split(' ')] for line in f]
It's odd that people in the comments said there is no one-line solution for this; it took very little testing to get this working.
v = []
with open("data.txt", 'r') as file:
    for line in file:
        if line.split():
            line = [float(x) for x in line.split()]
            v.append(line)
print(v)
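One caveat with the float-converting versions above: in the sample file.dat each row ends with a letter, so float() would raise a ValueError on that column. A hedged variant that converts only the tokens that parse as numbers and keeps the rest as strings (the helper name to_value is made up for illustration):
def to_value(tok):
    # convert numeric tokens, leave everything else (e.g. 'a', 'b', 'c') as a string
    try:
        return float(tok)
    except ValueError:
        return tok

with open("file.dat", "r") as f:
    var = [[to_value(tok) for tok in line.split()] for line in f if line.strip()]
print(var)   # [[1.0, 2.0, 3.0, 'a'], [4.0, 5.0, 6.0, 'b'], [7.0, 8.0, 9.0, 'c']]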
def input_sentence():
    sppc = BetterICP(grammar2)
    with open("output.txt","w") as op:
        with open("input.txt", "r") as ins:
            array = []
            temp = []
            for line in ins:
                array.append(line)
            for a in array:
                op.write(str(sppc.parse(a.split()))
I need to write the output I get from str(sppc.parse(a.split())) to the file, but I am not able to write it.
You're missing a ')' in op.write(str(sppc.parse(a.split()))
It seems sppc.parse returns an iterable object. Try to use:
for a in array:
    for b in sppc.parse(a.split()):
        op.write(str(b))
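Putting the two answers together, a hedged sketch of the corrected function (BetterICP and grammar2 come from the question as-is; writing a trailing "\n" is an assumption so that each parse lands on its own line):
def input_sentence():
    sppc = BetterICP(grammar2)
    with open("output.txt", "w") as op, open("input.txt", "r") as ins:
        for line in ins:
            for parse in sppc.parse(line.split()):
                op.write(str(parse) + "\n")   # one parse per line (assumed formatting)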
I'm looking for a way to read from two large files simultaneously without bringing the whole data into memory. I want to parse M lines from the first file with N lines from the second file. Is there any wise and memory efficient solution for it?
So far I know how to do it with reading two files at the same time line by line. But I don't know if it would be possible to extend this code to read for example 4 lines from the first file, and 1 line from the second file.
from itertools import izip  # Python 2; on Python 3, use the built-in zip instead
with open("textfile1") as textfile1, open("textfile2") as textfile2:
    for x, y in izip(textfile1, textfile2):
        x = x.strip()
        y = y.strip()
        print("{0}\t{1}".format(x, y))
from here, Read two textfile line by line simultaneously -python
Just open the files and use e.g. line = textfile1.readline() to read a line from one of the files.
line will contain a trailing newline; you know you have reached the end of the file when an empty string is returned.
The following would read the next n lines from file1 and then the next m lines from file2, inside whatever other code you have:
def nextlines(number, file):
    n_items = []
    index = number
    while index > 0:
        try:
            n_items += [next(file).strip()]
            index -= 1
        except StopIteration:
            break
    return n_items

n = 5
m = 7
l1 = []
l2 = []
with open('file.dat', 'r') as file1:
    with open('file.dat', 'r') as file2:
        #some code
        l1 = nextlines(n, file1)
        l2 = nextlines(m, file2)
        #some other code

print(l1)
print(l2)
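An alternative sketch for the same M-from-one-file, N-from-the-other pattern using itertools.islice, which reads lazily and never loads a whole file (the 4/1 split is the example from the question; the filenames are placeholders):
from itertools import islice

with open("textfile1") as f1, open("textfile2") as f2:
    while True:
        chunk1 = [line.rstrip("\n") for line in islice(f1, 4)]  # next 4 lines of file 1
        chunk2 = [line.rstrip("\n") for line in islice(f2, 1)]  # next 1 line of file 2
        if not chunk1 and not chunk2:
            break
        # process chunk1 and chunk2 here
        print(chunk1, chunk2)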
Originally posted here: How to read and delete first n lines from file in Python - Elegant Solution
I have a pretty big file, ~1 MB in size, and I would like to be able to read the first N lines, save them into a list (newlist) for later use, and then delete them from the file.
My original code was:
import os
n = 3  # the number of lines to be read and deleted

with open("bigFile.txt") as f:
    mylist = f.read().splitlines()

newlist = mylist[:n]
os.remove("bigFile.txt")

thefile = open('bigFile.txt', 'w')
del mylist[:n]
for item in mylist:
    thefile.write("%s\n" % item)
Based on Jean-François Fabre's code that was posted and later deleted here, I am able to run the following code:
import shutil
n = 3
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    for _ in range(n):
        next(f)
    f2.writelines(f)
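The shutil import suggests the original answer finished by moving the new file back over the old one; a hedged guess at that missing final step:
# assumed final step: replace the original file with the trimmed copy
shutil.move("bigFile2.txt", "bigFile.txt")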
This works great for deleting the first n lines and "updating" the bigFile.txt but when I try to store the first n values into a list so I can later use them like this:
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
mylist = f.read().splitlines()
newlist = mylist[:n]
for _ in range(n):
next(f)
f2.writelines(f)
I get a StopIteration error.
In your sample code you are reading the entire file to find the first n lines:
# this consumes the entire file
mylist = f.read().splitlines()
This leaves nothing left to read for the subsequent code. Instead simply do:
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
# read the first n lines into newlist
newlist = [f.readline() for _ in range(n)]
f2.writelines(f)
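An equivalent way to grab the first n lines, sketched with itertools.islice (it simply stops early if the file has fewer than n lines):
from itertools import islice

with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    newlist = list(islice(f, n))  # the first n lines (or fewer near end of file)
    f2.writelines(f)              # the remaining lines go to the new file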
I would proceed as follows:
n = 3
yourlist = []
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    i = 0
    for line in f:
        i += 1
        if i <= n:                # the first n lines go to the list
            yourlist.append(line)
        else:
            f2.write(line)        # the rest go to the new file
Does anyone have a clue how to increase the speed of this part of my Python code?
It was designed to deal with small files (with just a few lines, for which it is very fast), but I want to run it on big files (~50 GB, with millions of lines).
The main goal of this code is to get strings from a file (.txt) and search for them in an input file, printing the number of times they occur to the output file.
Here is the code; infile, seqList and out are set by optparse as options at the beginning of the code (not shown):
def novo (infile, seqList, out) :
    uDic = dict()
    rDic = dict()
    nmDic = dict()
    with open(infile, 'r') as infile, open(seqList, 'r') as RADlist :
        samples = [line.strip() for line in RADlist]
        lines = [line.strip() for line in infile]

    #Create dictionaries with all the samples
    for i in samples:
        uDic[i.replace(" ","")] = 0
        rDic[i.replace(" ","")] = 0
        nmDic[i.replace(" ","")] = 0

    for k in lines:
        l1 = k.split("\t")
        l2 = l1[0].split(";")
        l3 = l2[0].replace(">","")
        if len(l1) < 2:
            continue
        if l1[4] == "U":
            for k in uDic.keys():
                if k == l3:
                    uDic[k] += 1
        if l1[4] == "R":
            for j in rDic.keys():
                if j == l3:
                    rDic[j] += 1
        if l1[4] == "NM":
            for h in nmDic.keys():
                if h == l3:
                    nmDic[h] += 1

    f = open(out, "w")
    f.write("Sample"+"\t"+"R"+"\t"+"U"+"\t"+"NM"+"\t"+"TOTAL"+"\t"+"%R"+"\t"+"%U"+"\t"+"%NM"+"\n")
    for i in samples:
        U = int()
        R = int()
        NM = int()
        for k, j in uDic.items():
            if k == i:
                U = j
        for o, p in rDic.items():
            if o == i:
                R = p
        for y, u in nmDic.items():
            if y == i:
                NM = u
        TOTAL = int(U + R + NM)
        try:
            f.write(i+"\t"+str(R)+"\t"+str(U)+"\t"+str(NM)+"\t"+str(TOTAL)+"\t"+str(float(R) / TOTAL)+"\t"+str(float(U) / TOTAL)+"\t"+str(float(NM) / TOTAL)+"\n")
        except:
            continue
    f.close()
With 50 GB files, the question is not how to make it faster, but how to make it runnable at all.
The main problem is that you will run out of memory. You should modify the code so that it processes the files without keeping them entirely in memory, holding only the one line that is currently needed.
The following code from your question reads all the lines from both files:
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist :
    samples = [line.strip() for line in RADlist]
    lines = [line.strip() for line in infile]
    # at this moment you are likely to run out of memory already

#Create dictionaries with all the samples
for i in samples:
    uDic[i.replace(" ","")] = 0
    rDic[i.replace(" ","")] = 0
    nmDic[i.replace(" ","")] = 0
#similar loop over `lines` comes later on
You should defer reading the lines until the latest possible moment, like this:
#Create dictionaries with all the samples
with open(seqList, 'r') as RADlist:
    for sampleline in RADlist:
        sample = sampleline.strip().replace(" ", "")
        uDic[sample] = 0
        rDic[sample] = 0
        nmDic[sample] = 0
Note: did you want to use line.strip() or line.split()?
This way, you do not have to keep all the content in memory.
There are many more options for optimization, but this one will let you take off and run.
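In the same spirit, a hedged sketch of how the main loop could stream over infile line by line instead of building the lines list first (the counting logic and column layout are taken from the question; inf is just an illustrative handle name):
with open(infile, 'r') as inf:
    for rawline in inf:
        l1 = rawline.strip().split("\t")
        if len(l1) < 2:
            continue
        l3 = l1[0].split(";")[0].replace(">", "")
        if l1[4] == "U" and l3 in uDic:
            uDic[l3] += 1
        elif l1[4] == "R" and l3 in rDic:
            rDic[l3] += 1
        elif l1[4] == "NM" and l3 in nmDic:
            nmDic[l3] += 1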
It would make this much easier if you provided some sample inputs. Because you haven't, I haven't tested this, but the idea is simple: iterate through each file only once, using iterators rather than reading the whole file into memory, and use the efficient collections.Counter object to handle the counting and minimise the inner looping:
def novo (infile, seqList, out):
    from collections import Counter
    import csv

    # Count
    counts = Counter()
    with open(infile, 'r') as infile:
        for line in infile:
            l1 = line.strip().split("\t")
            l2 = l1[0].split(";")
            l3 = l2[0].replace(">","")
            if len(l1) < 2:
                continue
            counts[(l1[4], l3)] += 1

    # Produce output
    types = ['R', 'U', 'NM']
    with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
        f = csv.writer(outfile, delimiter='\t')
        f.writerow(['Sample'] + types + ['TOTAL'] + ['%' + t for t in types])
        for sample in RADlist:
            sample = sample.strip()
            countrow = [counts[(t, sample)] for t in types]  # Counter lookup, not a call
            total = sum(countrow)
            f.writerow([sample] + countrow + [total] + [c/total for c in countrow])
If you convert your script into functions (it makes profiling easier), you can then see where the time goes when you profile the code. I suggest using runsnake: runsnakerun.
I would try replacing your loops with list & dictionary comprehensions:
For example, instead of
for i in samples:
    uDic[i.replace(" ","")] = 0
Try:
uDic = {i.replace(" ",""): 0 for i in samples}
and similarly for the other dicts
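For completeness, a hedged sketch of building all three dictionaries this way; dict.fromkeys avoids repeating the comprehension (names follow the question):
keys = [i.replace(" ", "") for i in samples]
uDic = dict.fromkeys(keys, 0)
rDic = dict.fromkeys(keys, 0)
nmDic = dict.fromkeys(keys, 0)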
I don't really follow what's going on in your "for k in lines" loop, but you only need l3 (and l2) when you have certain values for l1[4]. Why not check for those values before splitting and replacing?
Lastly, instead of looping through all the keys of a dict to see if a given element is in that dict, try:
if x in myDict:
    myDict[x] = ....
for example:
for k in uDic.keys():
    if k == l3:
        uDic[k] += 1
can be replaced with:
if l3 in uDic:
    uDic[l3] += 1
Other than that, try profiling.
1) Look into profilers and adjust the code that is taking the most time.
2) You could try optimizing some methods with Cython; use the data from the profiler to modify the right thing.
3) It looks like you can use a Counter instead of a dict for the output file, and a set for the input file; look into them (see the sketch after this list).
my_set = set()  # a set; don't shadow the built-in name set
from collections import Counter
counter = Counter()  # essentially a specialised dict, optimised for counting...
                     # like counting occurrences of strings in a text file
4) If you are reading a 50 GB file you won't be able to store it all in RAM (I'm assuming, since who knows what kind of computer you have), so generators should save you memory and time.
#change list comprehension to generators
samples = (line.strip() for line in RADlist)
lines = (line.strip() for line in infile)
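As referenced in point 3, a minimal sketch of counting occurrences of strings in a text file with a Counter (infile is the input path from the question; the rest is illustrative):
from collections import Counter

counts = Counter()
with open(infile) as inf:
    for line in inf:
        counts[line.strip()] += 1      # count identical lines
print(counts.most_common(5))           # the five most frequent lines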
I have a file F containing huge numbers, e.g. F = [1,2,3,4,5,6,7,8,9,...]. I want to loop over the file F and delete all lines that contain any of the numbers in another file, say f = [1,2,4,7,...].
F = open(file)
f = [1,2,4,7,...]
for line in F:
    if line.contains(any number in f):
        delete the line in F
You cannot delete lines from a file in place, so you have to create a new file and write the remaining lines to it. That is what chonws' example does.
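A minimal sketch of that write-to-a-new-file pattern, then swapping the new file over the original with os.replace (the filenames and the exclude values are placeholders based on the question):
import os

exclude = {1, 2, 4, 7}                        # numbers whose lines should be dropped
with open("numbers.txt") as src, open("numbers.txt.tmp", "w") as dst:
    for line in src:
        nums = {int(x) for x in line.split(",") if x.strip()}
        if not nums & exclude:                # keep lines sharing no number with exclude
            dst.write(line)
os.replace("numbers.txt.tmp", "numbers.txt")  # swap the filtered file in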
It's not clear to me what the form of the file you are trying to modify is. I'm going to assume it looks like this:
1,2,3
4,5,7,19
6,2,57
7,8,9
128
Something like this might work for you:
filter = set([2, 9])
lines = open("data.txt").readlines()
outlines = []
for line in lines:
    line_numbers = set(int(num) for num in line.split(","))
    if not filter & line_numbers:
        outlines.append(line)
if len(outlines) < len(lines):
    open("data.txt", "w").writelines(outlines)
I've never been clear on what the implications of doing an open() as a one-off are, but I use it pretty regularly, and it doesn't seem to cause any problems.
exclude = set((2, 4, 8)) # is faster to find items in a set
out = open('filtered.txt', 'w')
with open('numbers.txt') as i: # iterates over the lines of a file
    for l in i:
        if not any((int(x) in exclude for x in l.split(','))):
            out.write(l)
out.close()
I'm assuming the file contains only integer numbers separated by ,
Something like this?:
nums = [1, 2]
f = open("file", "r")
source = f.read()
f.close()
out = open("file", "w")
for line in source.splitlines():
    found = False
    for n in nums:
        if line.find(str(n)) > -1:
            found = True
            break
    if found:
        continue
    out.write(line+"\n")
out.close()
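One caveat with the find()-based version: it matches substrings, so looking for 1 would also drop a line containing 12. A hedged variant that compares whole numbers instead (assuming comma-separated integers, as above):
nums = {1, 2}
kept = []
with open("file") as f:
    for line in f:
        values = {int(x) for x in line.split(",") if x.strip()}
        if not values & nums:      # keep only lines with none of the unwanted numbers
            kept.append(line)
with open("file", "w") as out:
    out.writelines(kept)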