I have a very large (~8 GB) text file that has very long lines. I would like to pull out lines in selected ranges of this file and put them in another text file. In fact my question is very similar to this and this, but I keep getting stuck when I try to select a range of lines instead of a single line.
So far this is the only approach I have gotten to work:
lines = readin.readlines()
out1.write(str(lines[5:67]))
out2.write(str(lines[89:111]))
However, this gives me a list, and I would like the output file to have a format identical to the input file (one line per row).
You can call join on the ranges.
lines = readin.readlines()
out1.write(''.join(lines[5:67]))
out2.write(''.join(lines[89:111]))
Might I suggest not storing the entire file (since it is large), as per one of your links?
f = open('file')
n = open('newfile', 'w')
for i, text in enumerate(f):
    if 4 < i < 67:       # equivalent to lines[5:67]
        n.write(text)
    elif 88 < i < 111:   # equivalent to lines[89:111]
        n.write(text)
    else:
        pass
I'd also recommend using 'with' instead of opening and closing the file, but I unfortunately am not allowed to upgrade to a new enough version of Python for that here :(.
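For reference, a minimal sketch of the same filtering loop using with, assuming a Python version that supports multiple context managers in one statement (file names mirror the snippet above):

# Same approach, but "with" closes both files automatically, even if an
# exception is raised. Range bounds mirror lines[5:67] and lines[89:111].
with open('file') as f, open('newfile', 'w') as n:
    for i, text in enumerate(f):
        if 4 < i < 67 or 88 < i < 111:
            n.write(text)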
The first thing you should think of when facing a problem like this is to avoid reading the entire file into memory at once. readlines() will do exactly that, so that specific method should be avoided.
Luckily, we have an excellent standard library in Python, itertools. itertools has a lot of useful functions, and one of them is islice. islice iterates over an iterable (such as lists, generators, file-like objects etc.) and returns an iterator over the specified range:
itertools.islice(iterable, start, stop[, step])
Make an iterator that returns selected elements from the iterable. If start is non-zero,
then elements from the iterable are skipped until start is reached.
Afterward, elements are returned consecutively unless step is set
higher than one which results in items being skipped. If stop is None,
then iteration continues until the iterator is exhausted, if at all;
otherwise, it stops at the specified position. Unlike regular slicing,
islice() does not support negative values for start, stop, or step.
Can be used to extract related fields from data where the internal
structure has been flattened (for example, a multi-line report may
list a name field on every third line)
Using this information, together with the str.join method, you can e.g. extract lines 10-19 (zero-based indices) by using this simple code:
from itertools import islice
with open('huge_data_file.txt', 'r') as data_file:
    txt = ''.join(islice(data_file, 10, 20))
Note that when looping over the file object, each line keeps its trailing newline, so joining with an empty string preserves the original one-line-per-row formatting.
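If you need the two separate ranges from the question in one pass, keep in mind that a second islice on the same file handle counts from the current position, not from the start of the file. A sketch (out1.txt and out2.txt are assumed names):

from itertools import islice

# Sketch: extract the question's two ranges (lines[5:67] and lines[89:111])
# in a single pass. The first islice consumes lines 0-66, so the second
# range is given relative to the current position: 89-67=22 and 111-67=44.
with open('huge_data_file.txt') as data_file, \
        open('out1.txt', 'w') as out1, \
        open('out2.txt', 'w') as out2:
    out1.writelines(islice(data_file, 5, 67))
    out2.writelines(islice(data_file, 22, 44))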
(Partial answer) In order to make your current approach work, you'll have to write line by line. For instance:
lines = readin.readlines()
for each in lines[5:67]:
    out1.write(each)
for each in lines[89:111]:
    out2.write(each)
path = "c:\\someplace\\"

# Open two text files: one for reading and one for writing
f_in = open(path + "temp.txt", 'r')
f_out = open(path + output_name, 'w')

# Go through each line of the input file
for line in f_in:
    if i_want_to_write_this_line:
        f_out.write(line)

# Close the files when done
f_in.close()
f_out.close()
Related
I have this code for opening a big file:
fr = open('X1','r')
text = fr.read()
print(text)
fr.close()
When I open it with gedit, the file is shown with a number in front of each row,
but in the terminal it is shown without any row numbers,
so it is difficult to distinguish between different rows.
How can I print the row numbers in my Python script?
If you just want to show the number of each line, wrap the line iterator in enumerate; it will return an iterator of tuples containing the index (zero-based) and the line.
Like this:
with open('X1', 'r') as fr:
    for index, line in enumerate(fr):
        print(f'{index}: {line}', end='')  # the line already ends with '\n'
Also, when working with files, using the with context manager is better. It ensures the file is properly closed and its buffers flushed even if an exception is raised.
EDIT:
As a bonus, the example I gave uses only iterators: the file object is an iterator of lines, and enumerate also returns an iterator that builds the tuples.
This means that this script only holds one line at a time in memory (plus the buffers defined by your platform), not the whole file.
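If you want the numbering to start at 1, matching what gedit displays, enumerate takes a start argument:

with open('X1', 'r') as fr:
    for index, line in enumerate(fr, start=1):  # 1-based line numbers
        print(f'{index}: {line}', end='')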
fr = open('X1', 'r')
rowcount = 0
for i in fr:
    print(rowcount, end="")
    print(":", end="")
    print(i, end="")  # the line already contains its own newline
    rowcount += 1
print(rowcount)
fr.close()
We just read the file line by line, printing the counter before each line, and print the total row count at the end.
I have a csv file which contains text like
AAABBBBCCCDDDDDDD
EEEFFFRRRTTTHHHYY
When I run the code below:
rows = csv.reader(csvfile)
for row in rows:
    print(" ".join('%s' % row for row in rows))
it prints the following:
['AAABBBBCCCDDDDDDD']
['EEEFFFRRRTTTHHHYY']
But I want it to display as one continuous series, like below:
AAABBBBCCCDDDDDDDEEEFFFRRRTTTHHHYY
Is there anything wrong with the code?
Your example looks like you simply need
with open(csvfile) as inputfile:  # misnomer; not really proper CSV
    for row in inputfile:
        print(row.rstrip('\n'), end='')
The example you provided doesn't look like a csv file; it looks like a simple text file. Then you could have something as simple as:
Input.txt
AAABBBBCCCDDDDDDD
EEEFFFRRRTTTHHHYY
Solution.py
input_filename = "Input.txt"
with open(input_filename) as input_file:
    print("".join(x.rstrip('\n') for x in input_file))
This is taking advantage of:
A file object can be iterated on, giving you a new line from each iteration.
Every line received from the file will have a newline character at its end. Since you seem not to want it, we use the method .rstrip() to remove it.
The .join() method can accept any iterable, even a...
Generator expression, which helps us create an iterable that will be accepted by .join(), using .rstrip() to format every line coming from the input file.
EDIT: OK, let's decompose my answer further:
When you open a file you can iterate over it. In the simplest terms, it means you can loop over it (for line in input_file: ...).
But not only that: from an iterator you can create another iterator by transforming each element. This is what a list comprehension or, in the case I have chosen, a generator expression does. So the expression (x.rstrip() for x in input_file) is an iterator that takes every element of input_file and applies .rstrip() to it.
The string method .join() will glue together the elements provided by an iterator, using that string as a separator. Since I use an empty string here, there won't be a separator. I have used the iterator defined above for this.
I then print() the string provided by the .join() operation explained before.
I made a minor correction to my answer: there is the edge case that if there are space or tab characters at the end of a line in the input file, they would have been removed had I used x.rstrip() instead of x.rstrip('\n').
You could start with an empty string, and for every row read from the csv file, remove the newline at the end and add the contents to the empty string.
joined = ""
with open(csvfile) as f:
    for row in f:
        joined = joined + row.replace("\n", "")
print(joined)
Output:
>> AAABBBBCCCDDDDDDDEEEFFFRRRTTTHHHYY
I was trying to do some csv processing using csv.reader and got stuck on an issue where I have to iterate twice over the lines read by the csv reader. On the second iteration it returns nothing, since all the lines have already been consumed. Is there any way to reset the iterator so it starts from scratch again?
Code:
import csv

desc = open("example.csv", "r")
Reader1 = csv.reader(desc)
for lines in Reader1:
    (Some code)
for lines in Reader1:
    (some code)
What I precisely want to do is read a csv file in the format below
id,price,name
x,y,z
a,b,c
and rearrange it in the format below
id:x a
price: y b
name: z c
without using the pandas library.
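For the rearranging part, a minimal sketch, assuming the small layout shown above with the header in the first row (the file name example.csv is an assumption): read all rows once, then transpose them with zip.

import csv

# Sketch: read the whole file once, then transpose rows to columns with zip.
# Assumes the file fits in memory and the header is in the first row.
with open("example.csv") as desc:
    rows = list(csv.reader(desc))

for column in zip(*rows):
    # column is e.g. ('id', 'x', 'a') -> prints "id: x a"
    print(f"{column[0]}: {' '.join(column[1:])}")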
Reset the underlying file object with seek, adding the following before the second loop:
desc.seek(0)
# Apparently, csv.reader will not refresh if the file is seeked to 0,
# so recreate it
Reader1 = csv.reader(desc)
Mind you, if memory is not a concern, it would typically be faster to read the input into a list and then iterate the list twice. Alternatively, you could use itertools.tee to make two iterators from the initial iterator (it requires similar memory to slurping into a list if you iterate one iterator completely before starting the other, but it allows you to begin iterating immediately instead of waiting for the whole file to be read before you can process any of it). Either approach avoids the additional system calls that iterating the file twice would entail. The tee approach goes right after the line where you create Reader1:
# It's not safe to reuse the argument to tee, so we replace it with one of
# the results of tee
Reader1, Reader2 = itertools.tee(Reader1)
for line in Reader1:
    ...
for line in Reader2:
    ...
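For completeness, a sketch of the simpler read-into-a-list alternative mentioned above (fine when the file fits comfortably in memory):

import csv

with open("example.csv") as desc:
    rows = list(csv.reader(desc))  # read everything once into a list

for line in rows:
    ...  # first pass
for line in rows:
    ...  # second pass; no reset needed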
In Python, suppose I have a file data.txt which has 6 lines of data. I want to calculate the number of lines, which I am planning to do by going through each character and counting the occurrences of '\n' in the file. How do I read one character at a time from the file? readline takes the whole line.
I think the method you're looking for is readlines, as in
lines = open("inputfilex.txt", "r").readlines()
This will give you a list of the lines in the file. To find out how many lines there are, you can just do:
len(lines)
And then access it using indexes, like lines[3] or lines[-1] as you would any normal Python list.
You can use read(1) to read a single byte. help(file) says:
read(size) -> read at most size bytes, returned as a string.
If the size argument is negative or omitted, read until EOF is reached.
Notice that when in non-blocking mode, less data than what was requested
may be returned, even if no size parameter was given.
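If the goal is only to count lines, a sketch that reads the file in fixed-size chunks and counts '\n' characters (the chunk size here is an arbitrary choice):

def count_lines(path, chunk_size=1 << 20):
    """Count newline characters by reading the file in chunks.

    Note: a final line without a trailing newline is not counted.
    """
    count = 0
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count('\n')
    return count

print(count_lines("data.txt"))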
Note that reading a file a byte at a time is quite un-"Pythonic". This is par for the course in C, but Python can do a lot more work with far less code. For example, you can read the entire file into an array in one line of code:
lines = f.readlines()
You could then access by line number with a simple lines[lineNumber] lookup.
Or if you don't want to store the entire file in memory at once, you can iterate over it line-by-line:
for line in f:
    pass  # Do whatever you want with the line.
That is much more readable and idiomatic.
It seems the simplest answer for you would be to do:
lines = 0
for line in file:
    lines += 1
    # do whatever else you need to do for each line
Or the equivalent construction explicitly using readline(). I'm not sure why you want to look at every character when you said above that readline() is correctly reading each line in its entirety.
To access a file based on its lines, make a list of its lines.
with open('myfile') as f:
    lines = list(f)
then simply access lines[3] to get the fourth line and so forth. (Note that this will not strip the newline characters.)
The linecache module can also be useful for this.
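For example, linecache.getline fetches a single line by its 1-based number (and returns an empty string past the end of the file):

import linecache

fourth_line = linecache.getline('myfile', 4)  # 1-based line numbers
print(fourth_line, end='')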
I am working with a very large text file (tsv) of around 200 million entries. One of the columns is a date, and the records are sorted by date. Now I want to start reading records from a given date. Currently I just read from the start, which is very slow since I need to read almost 100-150 million records just to reach that record. I was thinking that if I could use binary search to speed it up, I could get away with at most about 28 extra record reads (log2 of 200 million). Does Python allow reading the nth line without reading or caching the lines before it?
If the lines are not of fixed length, you are out of luck: some function will have to read through the file. If they are fixed length, you can open the file and use file.seek(line * linesize), then read the file from there.
If the file to read is big, and you don't want to read the whole file in memory at once:
fp = open("file")
for i, line in enumerate(fp):
    if i == 25:
        pass  # 26th line
    elif i == 29:
        pass  # 30th line
    elif i > 29:
        break
fp.close()
Note that i == n-1 for the nth line.
You can use the method fileObject.seek(offset[, whence])
#offset -- This is the position of the read/write pointer within the file.
#whence -- This is optional and defaults to 0 which means absolute file positioning, other values are 1 which means seek relative to the current position and 2 means seek relative to the file's end.
file = open("test.txt", "r")
line_size = 8  # 6 digits plus a 2-character "\r\n" line ending; use 7 if lines end with "\n"
line_number = 5
file.seek(line_number * line_size, 0)
for i in range(5):
    print(file.readline())
file.close()
This code uses the following input file:
100101
101102
102103
103104
104105
105106
106107
107108
108109
109110
110111
Python has no way to skip "lines" in a file. The best way that I know of is to employ a generator that yields lines based on a certain condition, e.g. date > 'YYYY-MM-DD'. At least this way you reduce memory usage and time spent on I/O.
example:
# using python 3.4 syntax (parameter type annotation)
from datetime import datetime

def yield_right_dates(filepath: str, mydate: datetime):
    with open(filepath, 'r') as myfile:
        for line in myfile:
            # assume:
            #   the file is tab separated (because .tsv is the extension)
            #   the date column has column-index == 0
            #   the date format is '%Y-%m-%d'
            line_splt = line.split('\t')
            if datetime.strptime(line_splt[0], '%Y-%m-%d') > mydate:
                yield line_splt

my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015, 1, 1))
# then you can do whatever processing you need on the stream, or put it in one giant list
desired_lines = [line for line in my_file_gen]
But this is still limiting you to one processor :(
Assuming you're on a unix-like system and bash is your shell, I would split the file using the shell utility split, then use multiprocessing and the generator defined above.
I don't have a large file to test with right now, but I'll update this answer later with a benchmark on iterating it whole, vs. splitting and then iterating it with the generator and multiprocessing module.
With greater knowledge of the file (e.g. if all the desired dates are clustered at the beginning, center, or end), you might be able to optimize the read further.
As others have commented, Python doesn't support this, as it doesn't know where lines start and end (unless they're of fixed length). If you're doing this repeatedly, I'd recommend either padding the lines out to a constant length (if practical) or, failing that, reading them into some kind of basic database. You'll take a bit of a hit to storage size, but unless you're only indexing once in a blue moon it will probably be worth it.
If space is a big concern and padding isn't possible, you could also add a (line number) tag at the start of each line. You would have to guess the size of your jumps and then parse a sampled line to check where you landed, but that would let you build a search algorithm that finds the right line quickly for a cost of only around 10 extra characters per line.
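Going back to the original binary-search idea: since the file is sorted by date, a related option that needs no padding is a binary search over byte offsets: seek to the middle of the remaining range, discard the partial line, read the next full line, and compare its date. A minimal sketch, assuming a tab-separated file with an ISO date (YYYY-MM-DD) in the first column; the file name, column index, and date format are assumptions here.

import os

def find_start_offset(path, target_date):
    """Return a byte offset at or just before the first line whose date >= target_date.

    Assumes lines are sorted by an ISO date (YYYY-MM-DD) in the first
    tab-separated column, so dates compare correctly as byte strings.
    """
    lo, hi = 0, os.path.getsize(path)
    with open(path, 'rb') as f:
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()              # discard the (probably partial) line
            line = f.readline()
            if not line:              # past the last full line
                hi = mid
                continue
            if line.split(b'\t', 1)[0] < target_date:
                lo = mid + 1
            else:
                hi = mid
    return lo

# Usage sketch: jump near the target date, re-align to a line boundary,
# then stream forward, skipping any lines still before the target date.
target = b'2015-01-01'
offset = find_start_offset('huge.tsv', target)
with open('huge.tsv', 'rb') as f:
    f.seek(offset)
    if offset:
        f.readline()                  # re-align to the next full line
    for raw in f:
        if raw.split(b'\t', 1)[0] < target:
            continue                  # the seek is approximate; skip stragglers
        pass                          # process records from the target date onward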