I'm pretty new to Python - just wondering if there is a library function or an easy way to truncate a file to its first 100 lines (or fewer)?
with open("my.file", "r+") as f:
[f.readline() for x in range(100)]
f.truncate()
EDIT: A 5% speed increase can be had by instead using the xrange iterator (Python 2) and not storing the entire list:
with open("my.file", "r+") as f:
for x in xrange(100):
f.readline()
f.truncate()
Use one of the solutions here: Iterate over the lines of a string and just grab the first hundred, i.e.
import itertools
with open("my.file") as f:
    lines = list(itertools.islice(f, 100))  # first 100 lines
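To tie that back to the truncation question, here is a minimal sketch assuming Python 3. Note that text-mode files disable tell() (and therefore no-argument truncate()) once you iterate them directly with for/next(), so this reads through readline() via iter()'s sentinel form instead:
import itertools

with open("my.file", "r+") as f:
    # read (and discard) up to the first 100 lines; iter(f.readline, '')
    # stops at EOF without touching the file's own iterator
    for _ in itertools.islice(iter(f.readline, ''), 100):
        pass
    f.truncate()  # cut the file at the current read position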
Related
I have a train_file.txt which has 3 columns on each row.
For example:
1 10 1
1 12 1
2 64 2
6 17 1
...
I am reading this txt file with
train_data = open("train_file.txt", 'r').readlines()
Then I am trying to get each value with a for loop:
for eachline in train_data:
    uid, lid, x = eachline.strip().split()
Question: The train data is a huge file; that's why I want to get just the first 1000 rows.
I was trying to execute the following code, but I am getting an error ('list' object cannot be interpreted as an integer):
for eachline in range(train_data, 1000):
    uid, lid, x = eachline.strip().split()
It is not necessary to read the entire file at all. You could use enumerate on the file directly and break early or use itertools.islice:
from itertools import islice
train_data = list(islice(open("train_file.txt", 'r'), 1000))
You can also keep using the same file handle to read more data later:
f = open("train_file.txt", 'r')
train_data = list(islice(f, 1000)) # reads first 1000
test_data = list(islice(f, 100)) # reads next 100
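Since both snippets above leave the file handle open, the same pattern can also be wrapped in a with block so the handle is closed automatically; a minimal sketch:
from itertools import islice

with open("train_file.txt", 'r') as f:
    train_data = list(islice(f, 1000))  # first 1000 lines
    test_data = list(islice(f, 100))    # next 100 lines
# the file is closed automatically when the with block exits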
Maybe try changing this line:
train_data = open("train_file.txt", 'r').readlines()
To:
train_data = open("train_file.txt", 'r').readlines()[:1000]
train_data is a list, use slicing:
for eachline in train_data[:1000]:
As the file is "huge", in your words, a better approach is to read just the first 1000 rows (readlines() will read the whole file into memory):
with open("train_file.txt", 'r') as f:
    train_data = []
    for idx, line in enumerate(f, start=1):
        train_data.append(line.strip().split())
        if idx == 1000:
            break
Note that the values will be str, not int. You probably want to convert them to int.
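For instance, the append line above could convert on the fly; a one-line sketch, assuming every row really has only integer columns:
# inside the loop above, instead of appending the raw strings:
train_data.append([int(v) for v in line.split()])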
You could use enumerate and a break:
for k, line in enumerate(lines):
if k > 1000:
break # exit the loop
# do stuff on the line
I would recommend using the built-in csv library since the data is csv-like (or pandas if you're using it), and using with. So something like this:
import csv
from itertools import islice

with open('./test.csv', 'r') as input_file:
    csv_reader = csv.reader(input_file, delimiter=' ')
    rows = list(islice(csv_reader, 1000))
    # Use rows
    print(rows)
You don't need it right now but it will make escaped characters or multiline entries way easier to parse. Also, if there are headers you can use csv.DictReader to include them.
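For this particular file, which has no header row, a csv.DictReader sketch might pass the fieldnames explicitly (the names 'uid', 'lid', 'x' are assumptions taken from the question's for loop, not from the file):
import csv
from itertools import islice

with open('./test.csv', 'r') as input_file:
    # fieldnames are supplied by hand because the sample data has no header
    csv_reader = csv.DictReader(input_file, fieldnames=['uid', 'lid', 'x'],
                                delimiter=' ')
    rows = list(islice(csv_reader, 1000))
    print(rows[0])  # e.g. {'uid': '1', 'lid': '10', 'x': '1'}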
Regarding your original code:
The call to readlines() will read all lines at that point, so any filtering you do afterwards won't save memory.
If you did read it that way, to get the first 1000 lines your for loop should be:
for eachline in train_data[:1000]:
...
Hi, I tried several solutions found on SO but I am missing some info.
I want to read 4 lines at once until I hit EOF. I know how to do it in other languages, but what is the best approach in Python 3?
This is what I have; lines is always the first 4 lines, and the code stops afterwards (I know why: the slice only gives me the first 4 elements of all_lines). I could use some kind of counter and break and so on, but that seems rather cheap to me.
if os.path.isfile(myfile):
    with open(myfile, 'r') as fo:
        all_lines = fo.readlines()
        for lines in all_lines[:4]:
            print(lines)
I want to handle 4 lines at once until I hit EOF. The file I am working with is rather short, maybe about 100 lines max.
If you want to iterate the lines in chunks of 4, you can do something like this:
if os.path.isfile(myfile):
    with open(myfile, 'r') as fo:
        all_lines = fo.readlines()
        for i in range(0, len(all_lines), 4):
            print(all_lines[i:i+4])
Instead of reading in the whole file and then looping over the lines four at a time, you can simply read them in four at a time. Consider
def fun(myfile):
    if not os.path.isfile(myfile):
        return
    with open(myfile, 'r') as fo:
        while True:
            for line in (fo.readline() for _ in range(4)):
                if not line:
                    return
                print(line)
Here, a generator expression is used to read four lines at a time. It is embedded in an "infinite" loop that stops (via return) when line is falsy (the empty str ''), which only happens once we have reached EOF.
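Another common idiom for fixed-size chunks is to zip the same file iterator with itself; a sketch (note that a final partial chunk of fewer than 4 lines is silently dropped; itertools.zip_longest(*[fo] * 4, fillvalue='') would keep it):
import os

myfile = 'my.file'  # hypothetical path

if os.path.isfile(myfile):
    with open(myfile, 'r') as fo:
        # zip four references to the same iterator: each tuple gets 4 lines
        for four_lines in zip(*[fo] * 4):
            print(four_lines)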
I would like to know if it's possible to find out how many lines my text file contains without using an approach like:
with open('test.txt') as f:
    text = f.readlines()
    size = len(text)
My file is huge, so it's difficult to use this kind of approach...
As a Pythonic approach, you can count the number of lines using a generator expression within the sum function, as follows:
with open('test.txt') as f:
    count = sum(1 for _ in f)
Note that here the file object f is an iterator over the file's lines.
Slight modification to your approach:
with open('test.txt') as f:
    line_count = 0
    for line in f:
        line_count += 1
print(line_count)
Notes: Here you go through the file line by line and do not load the complete file into memory.
with open('test.txt') as f:
    size = len([0 for _ in f])
The number of lines of a file is not stored in the metadata, so you actually have to run through the whole file to figure it out. You can make it a bit more memory efficient, though:
lines = 0
with open('test.txt') as f:
    for line in f:
        lines = lines + 1
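If raw speed matters on a huge file, a common trick is to count newline bytes in fixed-size binary chunks; a sketch (the 1 MiB buffer size is an arbitrary assumption):
def count_lines(path, bufsize=1024 * 1024):
    count = 0
    with open(path, 'rb') as f:
        chunk = f.read(bufsize)
        while chunk:
            count += chunk.count(b'\n')
            chunk = f.read(bufsize)
    # caveat: a final line without a trailing newline is not counted
    return count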
I have an input file containing a list of strings.
I am iterating through every fourth line starting on line two.
From each of these lines I make a new string from the first and last 6 characters and put this in an output file only if that new string is unique.
The code I wrote to do this works, but I am working with very large deep sequencing files, and it has been running for a day without making much progress. So I'm looking for any suggestions to make this much faster, if possible. Thanks.
def method():
    target = open(output_file, 'w')
    with open(input_file, 'r') as f:
        lineCharsList = []
        for line in f:
            # Make string from first and last 6 characters of a line
            lineChars = line[0:6] + line[145:151]
            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)
                target.write(lineChars + '\n')  # If string is unique, write to output file
            for skip in range(3):  # Used to step through four lines at a time
                try:
                    check = line  # Check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()
Try defining lineCharsList as a set instead of a list:
lineCharsList = set()
...
lineCharsList.add(lineChars)
That'll improve the performance of the in operator. Also, if memory isn't a problem at all, you might want to accumulate all the output in a list and write it all at the end, instead of performing multiple write() operations.
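A sketch of that write-once variant, keeping the question's input_file/output_file names and 4-line stepping (and assuming the collected output fits in memory):
def method():
    seen = set()
    out_lines = []
    with open(input_file, 'r') as f:
        for line in f:
            lineChars = line[0:6] + line[145:151]
            if lineChars not in seen:
                seen.add(lineChars)
                out_lines.append(lineChars + '\n')
            for _ in range(3):  # skip the next three lines
                if next(f, None) is None:
                    break
    with open(output_file, 'w') as target:
        target.writelines(out_lines)  # single buffered write at the end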
You can use https://docs.python.org/2/library/itertools.html#itertools.islice:
import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        for line in itertools.islice(inf, None, None, 4):
            # strip the newline first so line[-6:] really is the
            # last 6 characters, matching the original line[145:151]
            line = line.rstrip('\n')
            s = line[:6] + line[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))
Besides using set as Oscar suggested, you can also use islice to skip lines rather than using a for loop.
As stated in this post, islice advances the iterator in C, so it should be much faster than a plain vanilla Python for loop.
Try replacing
lineChars = line[0:6] + line[145:151]
with
lineChars = ''.join([line[0:6], line[145:151]])
as it can be more efficient, depending on the circumstances.
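If in doubt, measure it; a quick timeit sketch comparing the two styles on a hypothetical 151-character line (absolute numbers will vary by platform and Python version):
import timeit

line = "x" * 151  # stand-in for one fixed-width input line

t_concat = timeit.timeit(lambda: line[0:6] + line[145:151], number=1000000)
t_join = timeit.timeit(lambda: ''.join([line[0:6], line[145:151]]), number=1000000)
print("concat: {:.3f}s  join: {:.3f}s".format(t_concat, t_join))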
Python: what is the quickest way to split a file into two files, each having half the number of lines of the original, such that the lines in each of the two files are in random order?
For example, if the file is:
1
2
3
4
5
6
7
8
9
10
it could be split into:
3
2
10
9
1
4
6
8
5
7
This sort of operation is often called "partition". Although there isn't a built-in partition function, I found this article: Partition in Python.
Given that definition, you can do this:
import random

def partition(l, pred):
    yes, no = [], []
    for e in l:
        if pred(e):
            yes.append(e)
        else:
            no.append(e)
    return yes, no

lines = open("file.txt").readlines()
lines1, lines2 = partition(lines, lambda x: random.random() < 0.5)
Note that this won't necessarily exactly split the file in two, but it will on average.
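If an exact 50/50 split is required, one alternative sketch samples exactly half of the line indices instead of flipping a coin per line:
import random

lines = open("file.txt").readlines()
half = len(lines) // 2
# pick exactly half of the indices at random for the first file
chosen = set(random.sample(range(len(lines)), half))
lines1 = [line for i, line in enumerate(lines) if i in chosen]
lines2 = [line for i, line in enumerate(lines) if i not in chosen]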
You can just load the file, call random.shuffle on the resulting list, and then split it into two files (untested code):
def shuffle_split(infilename, outfilename1, outfilename2):
    from random import shuffle

    with open(infilename, 'r') as f:
        lines = f.readlines()

    # append a newline in case the last line didn't end with one
    lines[-1] = lines[-1].rstrip('\n') + '\n'

    shuffle(lines)

    with open(outfilename1, 'w') as f:
        f.writelines(lines[:len(lines) // 2])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[len(lines) // 2:])
random.shuffle shuffles lines in-place, and pretty much does all the work here. Python's sequence indexing system (e.g. lines[len(lines) // 2:]) makes things really convenient.
I'm assuming that the file isn't huge, i.e. that it will fit comfortably in memory. If that's not the case, you'll need to do something a bit more fancy, probably using the linecache module to read random line numbers from your input file. I think probably you would want to generate two lists of line numbers, using a similar technique to what's shown above.
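As one alternative to the linecache idea for files too big for memory, here is a streaming sketch that never holds more than one line at a time; it randomly partitions the lines but does not shuffle their relative order:
import random

def stream_split(infilename, outfilename1, outfilename2):
    with open(infilename) as src, \
         open(outfilename1, 'w') as out1, \
         open(outfilename2, 'w') as out2:
        for line in src:
            # coin flip per line; the halves are only equal on average
            (out1 if random.random() < 0.5 else out2).write(line)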
update: changed / to // to avoid issues when __future__.division is enabled.
import random

data = open("file").readlines()
random.shuffle(data)
c = 1
f = open("test." + str(c), "w")
for n, i in enumerate(data):
    if n == len(data) // 2:
        c += 1
        f.close()
        f = open("test." + str(c), "w")
    f.write(i)
f.close()
Other version:
from random import shuffle

def shuffle_split(infilename, outfilename1, outfilename2):
    with open(infilename, 'r') as f:
        lines = f.read().splitlines()

    shuffle(lines)
    half_lines = len(lines) // 2

    with open(outfilename1, 'w') as f:
        f.write('\n'.join(lines.pop() for count in range(half_lines)))
    with open(outfilename2, 'w') as f:
        f.write('\n'.join(lines))