Issue with large files in Python 2.7

I am currently experiencing an issue while reading big files with Python 2.7 [GCC 4.9] on Ubuntu 14.04 LTS, 32-bit. I read other posts on the same topic, such as Reading a large file in python, and tried to follow their advice, but I still get MemoryErrors.
The file I am attempting to read is not that big (~425MB), so first I tried a naive block of code like:
import sys

data = []
isFirstLine = True
lineNumber = 0
print "Reading input file \"" + sys.argv[1] + "\"..."
with open(sys.argv[1], 'r') as fp:
    for x in fp:
        print "Now reading line #" + str(lineNumber) + "..."
        if isFirstLine:
            keys = [y.replace('\"', '') for y in x.rstrip().split(',')]
            isFirstLine = False
        else:
            data.append(x.rstrip().split(','))
        lineNumber += 1
The code above crashes around line #3202 (of 3228), with output:
Now reading line #3200...
Now reading line #3201...
Now reading line #3202...
Segmentation fault (core dumped)
I tried invoking gc.collect() after reading every line, but I got the same error (and the code became slower). Then, following some suggestions I found here on Stack Overflow, I tried numpy.loadtxt():
data = numpy.loadtxt(sys.argv[1], skiprows=1, delimiter=',')
This time, I got a slightly more verbose error:
Traceback (most recent call last):
  File "plot-memory-efficient.py", line 360, in <module>
    if __name__ == "__main__" : main()
  File "plot-memory-efficient.py", line 40, in main
    data = numpy.loadtxt(sys.argv[1], skiprows=1, delimiter=',')
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 856, in loadtxt
    X = np.array(X, dtype)
MemoryError
So, I am under the impression that something is not right. What am I missing? Thanks in advance for your help!
UPDATE
Following hd1's answer below, I tried the csv module, and it worked. However, I think there's something important that I might have overlooked: I was parsing each line and actually storing the values as strings. Using csv like this still causes some errors:
with open(sys.argv[1], 'r') as fp:
    reader = csv.reader(fp)
    # get the header
    keys = reader.next()
    for line in reader:
        print "Now reading line #" + str(lineNumber) + "..."
        data.append(line)
        lineNumber += 1
But storing the values as floats solves the issue!
with open(sys.argv[1], 'r') as fp:
    reader = csv.reader(fp)
    # get the header
    keys = reader.next()
    for line in reader:
        print "Now reading line #" + str(lineNumber) + "..."
        floatLine = [float(x) for x in line]
        data.append(floatLine)
        lineNumber += 1
So, another issue might be connected with the data structures.
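A rough way to see why floats help (my own sketch, Python 2.7; exact sizes vary by build, especially 32-bit vs 64-bit):
import sys

# made-up sample row: each field kept as a string carries per-object
# overhead on top of its characters, while a float is a small fixed-size object
row = "3.14159,2.71828,1.41421".split(',')
print "as strings:", sum(sys.getsizeof(s) for s in row), "bytes"
print "as floats: ", sum(sys.getsizeof(float(s)) for s in row), "bytes"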

numpy's loadtxt method is known to be memory-inefficient. That may address your first problem. As for the second, why not use the csv module:
import csv
import sys

data = []
isFirstLine = True
lineNumber = 0
print "Reading input file \"" + sys.argv[1] + "\"..."
with open(sys.argv[1], 'r') as fp:
    reader = csv.reader(fp)
    reader.next()  # skip the header row
    for line in reader:
        # line is a list of the comma-delimited fields in the file
        pass
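If the end goal is a NumPy array anyway, a possibly more memory-friendly route (my sketch, not part of the answer; it assumes every field past the header is numeric) is to stream the fields through a generator into numpy.fromiter, avoiding loadtxt's intermediate lists:
import csv
import numpy as np

def load_csv_as_floats(path):
    # stream fields through a generator; fromiter builds a flat float
    # array without materializing a list of lists first
    with open(path, 'r') as fp:
        reader = csv.reader(fp)
        keys = reader.next()  # header row (Python 2 csv reader)
        flat = np.fromiter((float(field) for row in reader for field in row),
                           dtype=np.float64)
    return keys, flat.reshape(-1, len(keys))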


How can we write a text file from a variable using Python?

I am working on an NLP project and have extracted the text from a PDF using PyPDF2. Then I removed the blank lines. Now, my output is shown on the console, but I want to populate the text file with the same data that is stored in my variable (file).
Below is the code that removes the blank lines from the text file.
for line in open('resume1.txt'):
    line = line.rstrip()
    if line != '':
        file = line
        print(file)
Output on Console:
Eclipse,
Visual Studio 2012,
Arduino IDE,
Java
,
HTML,
CSS
2013
Excel
.
Now, I want the same data in my text file (resume1.txt). I have tried three methods, but all of them leave just a single dot in my resume1.txt file. If I look at the end of the text file, there is a dot that is being printed.
Method 1:
with open("resume1.txt", "w") as out_file:
out_file.write(file)
Method 2:
print(file, file=open("resume1.txt", 'w'))
Method 3:
pathlib.Path('resume1.txt').write_text(file)
Could you please assist me in populating the text file? Thank you for your cooperation.
First of all, note that you are writing back to the same file, losing the old data; I don't know if you want to do that. Other than that, every time you write using those methods, you overwrite the data you previously wrote to the output file. So, if you want to use these methods, you must write just once (write all the data in one go).
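A tiny standalone demonstration of that overwriting behavior (the file name demo.txt is made up):
with open('demo.txt', 'w') as f:
    f.write('first')
with open('demo.txt', 'w') as f:   # 'w' truncates: 'first' is gone
    f.write('second')
print(open('demo.txt').read())     # prints: second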
SOLUTIONS
Using method 1:
to_file = []
for line in open('resume1.txt'):
    line = line.rstrip()
    if line != '':
        file = line
        print(file)
        to_file.append(file)
to_save = '\n'.join(to_file)
with open("resume1.txt", "w") as out_file:
    out_file.write(to_save)
Using method 2:
to_file = []
for line in open('resume1.txt'):
    line = line.rstrip()
    if line != '':
        file = line
        print(file)
        to_file.append(file)
to_save = '\n'.join(to_file)
print(to_save, file=open("resume1.txt", 'w'))
Using method 3:
import pathlib

to_file = []
for line in open('resume1.txt'):
    line = line.rstrip()
    if line != '':
        file = line
        print(file)
        to_file.append(file)
to_save = '\n'.join(to_file)
pathlib.Path('resume1.txt').write_text(to_save)
In these 3 methods, I have used to_save = '\n'.join(to_file) because I'm assuming you want to separate the lines with an EOL, but if I'm wrong you can use ''.join(to_file) for no separator at all, or ' '.join(to_file) to put all the lines on a single one.
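For reference, here is what the three separators produce on a made-up sample:
lines = ['Eclipse,', 'Java', 'Excel']
print('\n'.join(lines))  # one item per line
print(''.join(lines))    # 'Eclipse,JavaExcel' (no separator)
print(' '.join(lines))   # 'Eclipse, Java Excel' (single line)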
Other method
You can do this by using another file, let's say 'output.txt'.
out_file = open('output.txt', 'w')
for line in open('resume1.txt'):
    line = line.rstrip()
    if line != '':
        file = line
        print(file)
        out_file.write(file)
        out_file.write('\n')  # EOL
out_file.close()
Also, you can do this (I prefer this):
with open('output.txt', 'w') as out_file:
    for line in open('resume1.txt'):
        line = line.rstrip()
        if line != '':
            file = line
            print(file)
            out_file.write(file)
            out_file.write('\n')  # EOL
First post on Stack, so excuse the format.
new_line = ""
for line in open('resume1.txt', "r"):
for char in line:
if char != " ":
new_line += char
print(new_line)
with open('resume1.txt', "w") as f:
f.write(new_line)

Reading and writing a CSV file in Python

I just started learning Python, and I am trying to do the following:
- Read a .csv file
- Write the filtered data to a new file, keeping only the rows where column 7 is not blank/empty
When I print my results, the Python shell shows the right output, but when I check the data in the .csv it is not correct (it differs from what the print function shows).
Any suggestions about my code?
Thank you in advance.
file = open("station.csv", "r")
writeFile = open("stations-filtered.csv", "w")
for line in file:
line2 = line.split(",")
if line2[7] != "":
print(line)
writeFile.write(line)
I agree with @user513093 that you can use csv, like:
file = open("station.csv", "r")
writeFile = open("stations-filtered.csv", "w")
writer = csv.writer(writeFile, delimiter=',')
for line in file:
line2 = line.split(",")
if line2[7] != "":
print(line)
writer.writerow(line)
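As a side note, a variant that lets csv do the parsing as well (my sketch, not part of the original answer) would also cope with quoted fields that contain commas:
import csv

with open("station.csv", "r") as src, open("stations-filtered.csv", "w") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # keep rows whose eighth column (index 7) is non-empty
        if len(row) > 7 and row[7] != "":
            writer.writerow(row)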
But still, pandas is good:
import pandas as pd

file = pd.read_csv("station.csv", sep=",", header=None)
# read_csv loads blank fields as NaN, so filter with notnull() rather than != ""
file = file[file[7].notnull()]
file.to_csv("stations-filtered.csv", index=False)

Making a loop to write new lines to a txt file using Python

I'm trying to get the script to read a text file of Congress members in which each line is formatted like this:
Darrell Issa (R-Calif)
I want it to print a line to a different file that's formatted like this (notice the added comma):
Darrell Issa,(R-Calif)
For some reason the script below works but it only does it for the first line. How do I get it to execute the loop for each line?
basicfile = open('membersofcongress.txt', 'r')
for line in basicfile:
    partyst = line.find('(')
    partyend = line.find(')')
    party = line[partyst:partyend+1]
    name = line[0:partyst-1]
    outfile = open('memberswcomma.txt', 'w')
    outfile.write(name)
    outfile.write(",")
    outfile.write(party)
    outfile.close()
basicfile.close()
print "All Done"
Thank you in advance for your help.
According to the documentation:
'w' opens for writing only (an existing file with the same name will be erased)
When you open your output file with 'w' inside the loop, every iteration truncates the file and writes it anew, so only the last line survives. Using 'a' would be better.
basicfile = open('membersofcongress.txt', 'r')
for line in basicfile:
    partyst = line.find('(')
    partyend = line.find(')')
    party = line[partyst:partyend+1]
    name = line[0:partyst-1]
    outfile = open('memberswcomma.txt', 'a')
    outp = name + "," + party + "\n"
    outfile.write(outp)
    outfile.close()
basicfile.close()
EDIT:
A much better solution would be to open your output file once, before the loop, instead of inside it.
basicfile = open('membersofcongress.txt', 'r')
outfile = open('memberswcomma.txt', 'w')
for line in basicfile:
    partyst = line.find('(')
    partyend = line.find(')')
    party = line[partyst:partyend+1]
    name = line[0:partyst-1]
    outp = name + "," + party + "\n"
    outfile.write(outp)
outfile.close()
basicfile.close()
OK, a few things to fix this: open your outfile in 'a' mode just before the loop, and close it after the loop, not inside it.
Something like this should work (I tested it):
basicfile = open('membersofcongress.txt', 'r')
outfile = open('memberswcomma.txt', 'a')
for line in basicfile:
    partyst = line.find('(')
    partyend = line.find(')')
    party = line[partyst:partyend+1]
    name = line[0:partyst-1]
    outfile.write(name)
    outfile.write(",")
    outfile.write(party)
    outfile.write("\n")  # newline so each member ends up on its own line
outfile.close()
basicfile.close()
print "All Done"

Subset the data and count the lines in each file

I am trying to subset my data from a single file to two separate files and count the lines in each file separately.
ID,MARK1,MARK2
sire1,AA,BB
dam2,AB,AA
sire3,AB,-
dam1,AA,BB
IND4,BB,AB
IND5,BB,AA
One file would be:
ID,MARK1,MARK2
sire1,AA,BB
dam2,AB,AA
sire3,AB,-
dam1,AA,BB
The other would be:
ID,MARK1,MARK2
IND4,BB,AB
IND5,BB,AA
Here is my code:
import re

def file_len(filename):
    with open(filename, mode='r', buffering=1) as f:
        for i, line in enumerate(f):
            pass
    return i

inputfile = open("test.txt", 'r')
outputfile_f1 = open("f1.txt", 'w')
outputfile_f2 = open("f2.txt", 'w')
matchlines = inputfile.readlines()
outputfile_f1.write(matchlines[0])  # add the header to "f1.txt"
for line in matchlines:
    if re.match("sire*", line):
        outputfile_f1.write(line)
    elif re.match("dam*", line):
        outputfile_f1.write(line)
    else:
        outputfile_f2.write(line)
print 'the number of individuals in f1 is:', file_len(outputfile_f1)
print 'the number of individuals in f2 is:', file_len(outputfile_f2)
inputfile.close()
outputfile_f1.close()
outputfile_f2.close()
The code can subset the files just fine, but I particularly don't like the way I add the header to the new file; I am wondering if there is a better way to do it. Also, the function to count the lines looks fine, but when I ran it, it gave me an error:
"Traceback (most recent call last):
File "./subset_individuals_based_on_ID.py", line 28, in <module>
print 'the number of individuals in f1 is:', file_len(outputfile_f1)
File "./subset_individuals_based_on_ID.py", line 7, in file_len
with open(filename, mode = 'r', buffering = 1) as f:
TypeError: coercing to Unicode: need string or buffer, file found
"
So I googled this site and added buffering = 1 (it was not originally in the code), but that still doesn't solve the problem.
Thank you very much for helping improve the code and clear up the error.
You can also use itertools.tee to split the input into multiple streams and process them individually.
import itertools

def write_file(match, source, out_file):
    count = -1
    with open(out_file, 'w') as output:
        for line in source:
            if count < 0 or match(line):
                output.write(line)
                count += 1
    print('Wrote {0} lines to {1}'.format(count, out_file))

with open('test.txt', 'r') as f:
    first, second = itertools.tee(f.readlines())
    write_file(lambda x: not x.startswith('IND'), first, 'f1.txt')
    write_file(lambda x: x.startswith('IND'), second, 'f2.txt')
EDIT - removed redundant elif
I might be misreading you, but I believe you are just trying to do this:
>>> with open('test', 'r') as infile:
... with open('test_out1', 'w') as out1, open('test_out2', 'w') as out2:
... header, *lines = infile.readlines()
... out1.write(header)
... out2.write(header)
... for line in lines:
... if line.startswith('sir') or line.startswith('dam'):
... out1.write(line)
... else:
... out2.write(line)
Contents of test before:
ID,MARK1,MARK2
sire1,AA,BB
dam2,AB,AA
sire3,AB,-
dam1,AA,BB
IND4,BB,AB
IND5,BB,AA
Contents of test_out1 after:
ID,MARK1,MARK2
sire1,AA,BB
dam2,AB,AA
sire3,AB,-
dam1,AA,BB
Contents of test_out2 after:
ID,MARK1,MARK2
IND4,BB,AB
IND5,BB,AA
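Incidentally, the TypeError in the question has a simple cause: file_len() is given the open file object, while open() expects a path string. A minimal sketch of the fix (file names taken from the question):
def file_len(filename):
    # pass a path string such as 'f1.txt', not the file object
    count = 0
    with open(filename, 'r') as f:
        for count, _ in enumerate(f, 1):
            pass
    return count

print 'the number of individuals in f1 is:', file_len('f1.txt') - 1  # exclude the header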

Python text reading

datafile = open("temp.txt", "r")
record = datafile.readline()
while record != '':
    d1 = datafile.strip("\n").split(",")
    print d1[0], float(d1[1])
    record = datafile.readline()
datafile.close()
The temp file contains
a,12.7
b,13.7
c,18.12
I can't get output. Please help.
The correct code should be:
with open('temp.txt') as f:
    for line in f:
        after_split = line.strip("\n").split(",")
        print after_split[0], float(after_split[1])
The main reason you're not getting output in your code is that datafile doesn't have a strip() method, and I'm surprised you're not getting exceptions.
I highly suggest you read the Python tutorial - it looks like you're trying to write Python in another language, and that is not A Good Thing.
You want to call strip and split on the line, not the file.
Replace
d1 = datafile.strip("\n").split(",")
With
d1 = record.strip("\n").split(",")
You are operating on the file handle, but you should work on the line, like this: d1 = record.strip("\n").split(",")
datafile = open("temp.txt", "r")
record = datafile.readline()
while record != '':
    d1 = record.strip("\n").split(",")
    print d1[0], float(d1[1])
    record = datafile.readline()
datafile.close()
Perhaps the following will work better for you (comments as explanation):
# open the file this way so that it automatically closes upon any errors
with open("temp.txt", "r") as f:
    data = f.readlines()
for line in data:
    # only process non-empty lines
    if line.strip():
        d1 = line.strip("\n").split(",")
        print d1[0], float(d1[1])
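Equivalently, the csv module can do the splitting (my variant, not from the answers above):
import csv

with open("temp.txt", "r") as f:
    for row in csv.reader(f):
        if row:  # csv.reader yields an empty list for blank lines
            print row[0], float(row[1])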
