Skip first rows from CSV with Python without reading the file

I need to skip the first few lines of a CSV file and save the rest to another file.
The code I currently use to accomplish this is:
import pandas as pd
df = pd.read_csv('users.csv', skiprows=2)
df.to_csv("usersOutput.csv", index=False)
and it works without issues. The only thing is: this code reads the whole file before saving. Now my problem is: I have to deal with a 4 GB file, and I think this code will be very time consuming.
Is there a way to skip the first few lines and save the file without reading the whole thing first?

You don't need to use pandas just to filter lines from a file:
with open('users.csv') as users, open('usersOutput.csv', 'w') as output:
    for lineno, line in enumerate(users):
        if lineno > 1:
            output.write(line)

The most efficient way is to use shutil.copyfileobj(fsrc, fdst[, length]):
from shutil import copyfileobj
from itertools import islice

with open('users.csv') as f_old, open('usersOutput.csv', 'w') as f_new:
    list(islice(f_old, 2))  # skip the first 2 lines
    copyfileobj(f_old, f_new)
From the docs:
... if the current file position of the fsrc object is not 0, only
the contents from the current file position to the end of the file
will be copied.
i.e. the new file will contain the same content except the first 2 lines.

Related

Most efficient way of inserting new data between lines of a file

In Python 2.6 is there a more efficient way of searching a file line by line (for a string) and after finding it, inserting some lines into that file? So the output file would just be the same as the input file with a few lines added in between. Also, I'd rather not read these files into a buffer because the files can be very large.
Right now, I'm reading the file line by line and writing it into a temp file until I find the line I'm looking for, then inserting the extra data into the temp file, and then writing the rest of the data into the temp file. After I'm done processing the file, I overwrite the old file with the new temp file.
Something like this:
with open(file_in_read, 'r') as inFile:
    if os.path.exists(file_in_write):
        os.remove(file_in_write)
    with open(file_in_write, 'a') as outFile:
        for line in inFile:
            if re.search(r'<search_string>', line):
                write_some_data(outFile)
                outFile.write(line)
            else:
                outFile.write(line)
os.rename(src, dst)
I was just wondering if I can speed it up somehow.
It looks like using the fileinput module in the standard library is the way to go. You can simplify your code to:
import fileinput
import re
import sys

regex = re.compile(r'<pattern>')
for line in fileinput.input(file_in_read, inplace=True):
    sys.stdout.write(line)
    if regex.search(line):
        sys.stdout.write(additional_lines)
You can seek to some point of the file with file.seek and write there, but then the data will sit at a fixed offset in the file, which is generally not what you want.
If the data needs to go after some other data that has no fixed offset and size, then there is no way around it: you need to read the file to find out that data's offset and size.
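For illustration, a minimal sketch of writing at a fixed offset (the file name and offset are placeholders); note that this overwrites the bytes already at that position rather than inserting new ones:
with open('data.bin', 'r+b') as f:
    f.seek(10)            # jump to byte offset 10
    f.write(b'NEWDATA')   # overwrites the 7 bytes starting at offset 10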
You may have an XY problem: you think you can solve X by doing Y, so you ask for help with Y instead of asking for help with X. If you share what you are actually trying to achieve with these files, other people may be able to suggest better solutions.

File I/O in Python

I'm attempting to read a CSV file and then write the read CSV into another CSV file.
Here is my code so far:
import csv

with open ("mastertable.csv") as file:
    for row in file:
        print row

with open("table.csv", "w") as f:
    f.write(file)
I eventually want to read a CSV file write to a new CSV with appended data.
I get this error when I try to run it.
Traceback (most recent call last):
  File "readlines.py", line 8, in <module>
    f.write(file)
TypeError: expected a character buffer object
From what I understood, it seems that I have to close the file, but I thought with closed it automatically?
I'm not sure why I can write a string to a text file but I can't simply write a CSV to another CSV, almost like just making a copy by iterating over it.
To read in a CSV and write to a different one, you might do something like this:
with open("table.csv", "w") as f:
with open ("mastertable.csv") as file:
for row in file:
f.write(row)
But I would only do that if the rows needed to be edited while transcribed. For the described use case, you can simply copy the file with shutil beforehand and then open the copy to append to it. That method will be much faster, not to mention far more readable.
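For example, a minimal sketch of that copy-then-append approach (the appended row is just a placeholder):
import shutil

# copy the source file byte for byte, then append to the copy
shutil.copyfile("mastertable.csv", "table.csv")
with open("table.csv", "a") as f:
    f.write("some,appended,data\n")  # hypothetical extra row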
The with statement will handle closing the file for you, and will close it when you leave that block of code (given by the indentation level).
It looks like you intend to make use of the Python csv module. The following should be a good starting point for what you are trying to achieve:
import csv

with open("mastertable.csv", "r") as file_input, open("table.csv", "wb") as file_output:
    csv_input = csv.reader(file_input)
    csv_output = csv.writer(file_output)
    for cols in csv_input:
        cols.append("more data")
        csv_output.writerow(cols)
This will read the mastertable.csv file one line at a time as a list of columns. I append an extra column and then write each line to table.csv.
Note, when you leave the scope of a with statement, the file is automatically closed.
The file variable is not the actual file data; it is a reference (a file object) used to read the data. When you do the following:
with open ("mastertable.csv") as file:
for row in file:
print row
the file object gets closed automatically. The write method expects a character buffer or a string as its input, not a file object.
If you just want to copy data, you can do something like this:
data = ""
with open ("mastertable.csv","r") as file:
data = file.read()
with open ("table.csv","a") as file:
file.write(data)`

Sort a big file with Python heapq.merge

I'm trying to complete the following job but have run into difficulty:
I have a huge file of texts. Each line is of the format "AGTCCCGGAT filename" where the first part is a DNA thing.
The professor suggests that we break this huge file into many temporary files and use heapq.merge() to sort them. The goal is to have one file at the end which contains every line of the original file and is sorted.
My first try was to break each line into a separate temporary file. The problem is that heapq.merge() reports there are too many files to sort.
My second try was to break it into temporary files of 50000 lines each. The problem is that it seems it does not sort by line, but by file. For example, we get something like:
ACGTACGT filename
CGTACGTA filename
ACGTCCGT filename
CGTAAAAA filename
where the first two lines are from one temp file and the last two lines are from the second file.
My code to sort them is as follows:
for line in heapq.merge(*[open('/var/tmp/L._Ipsum-strain01.fa_dir/'+str(f),'r') for f in os.listdir('/var/tmp/L._Ipsum-strain01.fa_dir')]):
    result.write(line)
result.close()
Your solution is almost correct; however, each partial file must be sorted before you write it to disk. Here's a two-pass algorithm that demonstrates this: first, iterate over the file in 50k-line chunks, sort the lines in each chunk, and write the sorted chunk to its own file. In the second pass, open all these chunk files and merge them into the output file.
from heapq import merge
from itertools import count, islice
from contextlib import ExitStack  # not available on Python 2;
                                  # need to take care of closing the files otherwise

chunk_names = []

# chunk and sort
with open('input.txt') as input_file:
    for chunk_number in count(1):
        # read in the next 50k lines and sort them
        sorted_chunk = sorted(islice(input_file, 50000))
        if not sorted_chunk:
            # end of input
            break
        chunk_name = 'chunk_{}.chk'.format(chunk_number)
        chunk_names.append(chunk_name)
        with open(chunk_name, 'w') as chunk_file:
            chunk_file.writelines(sorted_chunk)

with ExitStack() as stack, open('output.txt', 'w') as output_file:
    files = [stack.enter_context(open(chunk)) for chunk in chunk_names]
    output_file.writelines(merge(*files))

Open a file for input and output in Python

I have the following code which is intended to remove specific lines of a file. When I run it, it prints the two filenames that live in the directory, then deletes all information in them. What am I doing wrong? I'm using Python 3.2 under Windows.
import os
files = [file for file in os.listdir() if file.split(".")[-1] == "txt"]
for file in files:
    print(file)
    input = open(file,"r")
    output = open(file,"w")
    for line in input:
        print(line)
        # if line is good, write it to output
    input.close()
    output.close()
open(file, 'w') wipes the file. To prevent that, open it in r+ mode (read+write/don't wipe), then read it all at once, filter the lines, and write them back out again. Something like
with open(file, "r+") as f:
lines = f.readlines() # read entire file into memory
f.seek(0) # go back to the beginning of the file
f.writelines(filter(good, lines)) # dump the filtered lines back
f.truncate() # wipe the remains of the old file
I've assumed that good is a function telling whether a line should be kept.
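For example, good could be as simple as the following predicate (the filter condition here is just a placeholder):
def good(line):
    # keep every line that does not contain the placeholder marker
    return "REMOVE ME" not in line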
If your file fits in memory, the easiest solution is to open the file for reading, read its contents to memory, close the file, open it for writing and write the filtered output back:
with open(file_name) as f:
    lines = list(f)

# filter lines

with open(file_name, "w") as f:  # This removes the file contents
    f.writelines(lines)
Since you are not intermingling read and write operations, advanced file modes like "r+" are unnecessary here and only complicate things.
If the file does not fit into memory, the usual approach is to write the output to a new, temporary file, and move it back to the original file name after processing is finished.
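A minimal sketch of that temp-file approach, assuming a hypothetical file name and filter condition:
import os
import tempfile

file_name = "data.txt"  # hypothetical input file
dir_name = os.path.dirname(os.path.abspath(file_name))

# write the filtered lines to a temporary file in the same directory,
# then move it over the original in one step
with open(file_name) as src, tempfile.NamedTemporaryFile("w", dir=dir_name, delete=False) as tmp:
    for line in src:
        if "z" not in line:  # your filter condition here
            tmp.write(line)
os.replace(tmp.name, file_name)  # atomic rename on the same filesystem (Python 3.3+)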
One way is to use the fileinput stdlib module. Then you don't have to worry about opening/closing files, file modes, etc.
import fileinput
from contextlib import closing
import os

fnames = [fname for fname in os.listdir() if fname.split(".")[-1] == "txt"]  # use splitext
with closing(fileinput.input(fnames, inplace=True)) as fin:
    for line in fin:
        # some condition
        if 'z' not in line:  # your condition here
            print line,  # trailing comma suppresses the newline; for Python 3 use print(line, end='')
When using inplace=True, fileinput redirects stdout to the file currently being processed. A backup of the original file (with a '.bak' extension by default) is created while the file is rewritten, and it is kept if you pass an explicit backup extension, which may come in useful if needed.
jon@minerva:~$ cat testtext.txt
one
two
three
four
five
six
seven
eight
nine
ten
After running the above with a condition of not line.startswith('t'):
jon@minerva:~$ cat testtext.txt
one
four
five
six
seven
eight
nine
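For instance, to keep the backup file while filtering in place (Python 3 syntax; the file name matches the demo above):
import fileinput

# rewrites testtext.txt in place, keeping the original as testtext.txt.bak
for line in fileinput.input("testtext.txt", inplace=True, backup=".bak"):
    if not line.startswith("t"):
        print(line, end="")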
You're deleting everything when you open the file to write to it. You can't have the same file open for reading and writing at the same time. Use open(file, "r+") instead, and then save all the lines to another variable before writing anything.
You should not open the same file for reading and writing at the same time.
"w" means create a empty for writing. If the file already exists, its data will be deleted.
So you can use a different file name for writing.

Replace a word in a file

I am new to Python programming...
I have a .txt file... It looks like:
0,Salary,14000
0,Bonus,5000
0,gift,6000
I want to replace the first '0' value with '1' in each line. How can I do this? Can anyone help me, with sample code?
Thanks in advance.
Nimmyliji
I know that you're asking about Python, but forgive me for suggesting that perhaps a different tool is better for the job. :) It's a one-liner via sed:
sed 's/^0,/1,/' yourtextfile.txt > output.txt
This applies the regex /^0,/ (which matches any 0, that occurs at the beginning of a line) to each line and replaces the matched text with 1, instead. The output is directed into the file output.txt specified.
inFile = open("old.txt", "r")
outFile = open("new.txt", "w")
for line in inFile:
    outFile.write(",".join(["1"] + (line.split(","))[1:]))
inFile.close()
outFile.close()
If you would like something more general, take a look at the Python csv module. It contains utilities for processing comma-separated values (abbreviated as csv) in files. But it can work with an arbitrary delimiter, not only the comma. Since your sample is obviously a csv file, you can use it as follows:
import csv
reader = csv.reader(open("old.txt"))
writer = csv.writer(open("new.txt", "w"))
writer.writerows(["1"] + line[1:] for line in reader)
To overwrite original file with new one:
import os
os.remove("old.txt")
os.rename("new.txt", "old.txt")
I think that writing to a new file and then renaming it is more fault-tolerant and less likely to corrupt your data than directly overwriting the source file. Imagine that your program raised an exception while the source file had already been read into memory and reopened for writing: you would lose the original data, and your new data wouldn't be saved because of the crash. In my case, I only lose the new data while preserving the original.
o = open("output.txt", "w")
for line in open("file"):
    s = line.split(",")
    s[0] = "1"
    o.write(','.join(s))
o.close()
Or you can use fileinput with in-place editing:
import fileinput
for line in fileinput.FileInput("file", inplace=1):
    s = line.split(",")
    s[0] = "1"
    print ','.join(s),  # trailing comma avoids adding a second newline
f = open(filepath,'r')
data = f.readlines()
f.close()
edited = []
for line in data:
    edited.append( '1'+line[1:] )
f = open(filepath,'w')
f.writelines(edited)
f.flush()
f.close()
Or in Python 2.5+:
with open(filepath,'r') as f:
    data = f.readlines()
with open(outfilepath, 'w') as f:
    for line in data:
        f.write( '1' + line[1:] )
This should do it. I wouldn't recommend it for a truly big file though ;-)
What is going on (ex 1):
1: Open the file in read mode
2,3: Read all the lines into a list (each line is a separate index) and close the file.
4,5,6: Iterate over the list, constructing a new list where each line has the first character replaced by a 1. The line[1:] slices the string from index 1 onward; we concatenate the '1' with the truncated string.
7,8,9: Reopen the file in write mode, write the list to the file (overwrite), flush the buffer, and close the file handle.
In Ex. 2:
I use the with statement, which handles closing the files for me, but do essentially the same thing.
