Read a large text file and write to another file with Python - python

I am trying to convert a large text file (5 GB+) but got a MemoryError.
From this post, I managed to convert the encoding of a text file into a readable format with this:
import os

path = 'path/to/file'
des_path = 'path/to/store/file'
for filename in os.listdir(path):
    with open('{}/{}'.format(path, filename), 'r+', encoding='iso-8859-11') as f:
        t = open('{}/{}'.format(des_path, filename), 'w')
        string = f.read()
        t.write(string)
        t.close()
The problem is that when I try to convert a text file of this size (5 GB+), I get this error:
Traceback (most recent call last):
File "Desktop/convertfile.py", line 12, in <module>
string = f.read()
File "/usr/lib/python3.6/encodings/iso8859_11.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
MemoryError
which I understand means the file is too large to read into memory in one go. And I found from several links that I can avoid this by reading line by line.
So, how can I adapt my code to read line by line? What I understand by reading line by line here is that I need to read a line from f and write it to t until the end of the file, right?

You can iterate over the lines of an open file.
for filename in os.listdir(path):
    inp, out = open_files(filename)
    for line in inp:
        out.write(line)
    inp.close()
    out.close()
Note that I've hidden the complexity of the different paths, encodings and modes in a function that I suggest you actually write...
Regarding buffering, i.e. reading/writing larger chunks of the text: Python does its own buffering under the hood, so this shouldn't be much slower than a more complex solution.
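As a concrete sketch of the answer above: here is one way the suggested `open_files` helper could look. The function name, the directory arguments, and the utf-8 output encoding are assumptions, not part of the original code.

```python
import os

def open_files(filename, src, dst):
    # Source encoding is taken from the question; writing utf-8 is an
    # assumption about the desired "readable" output format.
    inp = open(os.path.join(src, filename), 'r', encoding='iso-8859-11')
    out = open(os.path.join(dst, filename), 'w', encoding='utf-8')
    return inp, out

def convert_dir(src, dst):
    for filename in os.listdir(src):
        inp, out = open_files(filename, src, dst)
        for line in inp:   # one line at a time: memory use stays small
            out.write(line)
        inp.close()
        out.close()
```

Because each line is written out as soon as it is read, memory use depends on the longest line, not on the 5 GB file size.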

Related

Download images from url that is stored in .txt file?

I'm using Python 3.6 on Windows 10. I want to download images whose URLs are stored in a .txt file.
This is my code:
import requests
import shutil
file_image_url = open("test.txt","r")
while True:
    image_url = file_image_url.readline()
    filename = image_url.split("/")[-1]
    r = requests.get(image_url, stream = True)
    r.raw.decode_content = True
    with open(filename,'wb') as f:
        shutil.copyfileobj(r.raw, f)
but when I run the code above it gives me this error:
Traceback (most recent call last):
File "download_pictures.py", line 10, in <module>
with open(filename,'wb') as f:
OSError: [Errno 22] Invalid argument: '03.jpg\n'
test.txt contains:
https://mysite/images/03.jpg
https://mysite/images/26.jpg
https://mysite/images/34.jpg
When I tried to put just one single URL on test.txt, it works and downloaded the picture,
but I need to download several images.
f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline.
You are passing this filename (with the \n) to the open function, hence the OSError. So you need to call strip() on filename before passing it to open.
Your filename has the newline character (\n) in it; remove it when you're parsing for the filename and that should fix your issue. It works when you only have one URL in the txt file because there is only one line.
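Putting the two answers together, a minimal corrected version of the loop might look like this. Iterating over the file object replaces the `while True` + `readline()` pattern (which never terminates), and `strip()` removes the trailing newline that caused the OSError. The URL list filename is the one from the question.

```python
import shutil
import requests

def filename_from_url(line):
    # strip() removes the trailing newline that readline() leaves behind
    return line.strip().split("/")[-1]

def download_all(list_path="test.txt"):
    with open(list_path, "r") as file_image_url:
        # Iterating the file object stops cleanly at end of file,
        # unlike the `while True` + readline() loop.
        for line in file_image_url:
            image_url = line.strip()
            if not image_url:
                continue  # skip blank lines
            r = requests.get(image_url, stream=True)
            r.raw.decode_content = True
            with open(filename_from_url(line), "wb") as f:
                shutil.copyfileobj(r.raw, f)
```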

How to turn a comma seperated value TXT into a CSV for machine learning

How do I turn this format of TXT file into a CSV file?
Date,Open,high,low,close
1/1/2017,1,2,1,2
1/2/2017,2,3,2,3
1/3/2017,3,4,3,4
I am sure you can understand? It already has the comma-separated values.
I tried using numpy.
>>> import numpy as np
>>> table = np.genfromtxt("171028 A.txt", comments="%")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\npyio.py", line 1551, in genfromtxt
fhd = iter(np.lib._datasource.open(fname, 'rb'))
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\_datasource.py", line 151, in open
return ds.open(path, mode)
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\_datasource.py", line 501, in open
raise IOError("%s not found." % path)
OSError: 171028 A.txt not found.
I have (S&P) 500 txt files to do this with.
You can use the csv module. You can find more information here.
import csv

txt_file = 'mytext.txt'
csv_file = 'mycsv.csv'

with open(txt_file, 'r', newline='') as fin, open(csv_file, 'w', newline='') as fout:
    in_txt = csv.reader(fin, delimiter=',')
    out_csv = csv.writer(fout)
    out_csv.writerows(in_txt)
Per @dclarke's comment, check the directory from which you run the code. As you coded the call, the file must be in that directory. When I have it there, the code runs without error (although the resulting table is a single line with four nan values). When I move the file elsewhere, I reproduce your error quite nicely.
Either move the file to be local, add a local link to the file, or change the file name in your program to use the proper path to the file (either relative or absolute).
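Since the question mentions roughly 500 txt files, here is a hedged sketch that loops over a directory and converts each .txt into a .csv alongside it. The function names and the assumption that every file lives in one directory are mine, not from the original answer.

```python
import csv
import os

def txt_to_csv(txt_path, csv_path):
    # Copies the comma-separated rows from the .txt into a .csv unchanged;
    # newline='' is what the csv module expects for its file handles.
    with open(txt_path, 'r', newline='') as fin, \
         open(csv_path, 'w', newline='') as fout:
        csv.writer(fout).writerows(csv.reader(fin))

def convert_directory(src_dir):
    for name in os.listdir(src_dir):
        if name.endswith('.txt'):
            txt_to_csv(os.path.join(src_dir, name),
                       os.path.join(src_dir, name[:-4] + '.csv'))
```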

Parsing hl7 message line by line using python

I want to read my HL7 messages from a file line by line and parse them using Python. I'm able to read, but my problem is with parsing. It parses only the first line of the file and prints up to the second line, but does not parse further because it decides my second line is not HL7. The error shown is
h=hl7.parse(line)
File "C:\Python27\lib\site-packages\hl7\parser.py", line 45, in parse
plan = create_parse_plan(strmsg, factory)
File "C:\Python27\lib\site-packages\hl7\parser.py", line 88, in create_parse_plan
assert strmsg[:3] in ('MSH')
AssertionError
for the code:
with open('example.txt','r') as f:
    for line in f:
        print line
        print hl7.isfile(line)
        h = hl7.parse(line)
So how do I make my file a valid one? This is the example.txt file:
MSH|^~\&|AcmeMed|Lab|Main HIS|St.Micheals|20130408031655||ADT^A01|6306E85542000679F11EEA93EE38C18813E1C63CB09673815639B8AD55D6775|P|2.6|
EVN||20050622101634||||20110505110517|
PID|||231331||Garland^Tracy||19010201|F||EU|147 Yonge St.^^LA^CA^58818|||||||28-457-773|291-697-644|
NK1|1|Smith^Sabrina|Second Cousin|
NK1|2|Fitzgerald^Sabrina|Second Cousin|
NK1|3|WHITE^Tracy|Second Cousin|
OBX|||WT^WEIGHT||78|pounds|
OBX|||HT^HEIGHT||57|cm|
I had a similar issue and came up with this solution that works for me.
In short, put all your lines into an object and then parse that object.
(Obviously you can clean up the way I checked to see if the object is made yet or not, but I was going for an easy to read example.)
a = 0
with open('example.txt','r') as f:
    for line in f:
        if a == 0:
            message = line
            a = 1
        else:
            message += line
h = hl7.parse(message)
Now you will have to clean up some \r\n, depending on how the file encodes its end-of-line values. But it takes the message as valid, and you can parse to your heart's content.
for line in h:
    print(line)
MSH|^~\&|AcmeMed|Lab|Main HIS|St.Micheals|20130408031655||ADT^A01|6306E85542000679F11EEA93EE38C18813E1C63CB09673815639B8AD55D6775|P|2.6|
EVN||20050622101634||||20110505110517|
PID|||231331||Garland^Tracy||19010201|F||EU|147 Yonge St.^^LA^CA^58818|||||||28-457-773|291-697-644|
NK1|1|Smith^Sabrina|Second Cousin|
NK1|2|Fitzgerald^Sabrina|Second Cousin|
NK1|3|WHITE^Tracy|Second Cousin|
OBX|||WT^WEIGHT||78|pounds|
OBX|||HT^HEIGHT||57|cm|
Tagging onto @jtweeder's answer, the following code worked for prepping my HL7 data.
In Notepad++, I noticed that each line ended with LF but did not have CR. It seems as though this hl7 library requires \r, not \n.
filename = "TEST.dat"
lines = open(filepath + filename, "r").readlines()
h = '\r'.join(line.rstrip('\n') for line in lines)
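The two answers above can be combined into one small helper: read all segment lines, strip the LF line endings, and join on \r so the whole file becomes a single message string. The function name and path are placeholders; the final parse call needs the third-party hl7 package, so it is shown only as a comment.

```python
def read_hl7_message(path):
    """Join the segment lines of a file into one \r-delimited string,
    which is the form an HL7 message parser expects."""
    with open(path, 'r') as f:
        # rstrip drops the \n (and any stray \r) each line ends with
        segments = [line.rstrip('\r\n') for line in f if line.strip()]
    return '\r'.join(segments)

# message = read_hl7_message('example.txt')
# h = hl7.parse(message)   # requires the third-party `hl7` package
```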

Reading gzipped text file line-by-line for processing in python 3.2.6

I'm a complete newbie when it comes to python, but I've been tasked with trying to get a piece of code running on a machine which has a different version of python (3.2.6) than that which the code was originally built for.
I've come across an issue with reading a gzipped text file line by line (and processing it depending on the first character). The code (which was evidently written for a newer Python than 3.2.6) is
for line in gzip.open(input[0], 'rt'):
    if line[:1] != '>':
        out.write(line)
        continue
    chromname = match2chrom(line[1:-1])
    seqname = line[1:].split()[0]
    print('>{}'.format(chromname), file=out)
    print('{}\t{}'.format(seqname, chromname), file=mappingout)
(for those who know, this strips gzipped FASTA genome files into headers (with ">" at start) and sequences, and processes the lines into two different files depending on this)
I have found https://bugs.python.org/issue13989, which states that mode 'rt' cannot be used for gzip.open in python-3.2 and to use something along the lines of:
import io
with io.TextIOWrapper(gzip.open(input[0], "r")) as fin:
    for line in fin:
        if line[:1] != '>':
            out.write(line)
            continue
        chromname = match2chrom(line[1:-1])
        seqname = line[1:].split()[0]
        print('>{}'.format(chromname), file=out)
        print('{}\t{}'.format(seqname, chromname), file=mappingout)
but the above code does not work:
UnsupportedOperation in line <4> of /path/to/python_file.py:
read1
How can I rewrite this routine to give out exactly what I want - reading the gzip file line-by-line into the variable "line" and processing based on the first character?
EDIT: traceback from the first version of this routine is (python 3.2.6):
Mode rt not supported
File "/path/to/python_file.py", line 79, in __process_genome_sequences
File "/opt/python-3.2.6/lib/python3.2/gzip.py", line 46, in open
File "/opt/python-3.2.6/lib/python3.2/gzip.py", line 157, in __init__
Traceback from the second version is:
UnsupportedOperation in line 81 of /path/to/python_file.py:
read1
File "/path/to/python_file.py", line 81, in __process_genome_sequences
with no further traceback (the extra two lines in the line count are the import io and with io.TextIOWrapper(gzip.open(input[0], "r")) as fin: lines).
I appear to have solved the problem.
In the end I had to use shell("gunzip {input[0]}") to ensure that the gunzipped file could be read in text mode, and then read the resulting file using
for line in open(' *< resulting file >* ','r'):
    if line[:1] != '>':
        out.write(line)
        continue
    chromname = match2chrom(line[1:-1])
    seqname = line[1:].split()[0]
    print('>{}'.format(chromname), file=out)
    print('{}\t{}'.format(seqname, chromname), file=mappingout)
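An alternative that avoids shelling out to gunzip: the "UnsupportedOperation: read1" error typically means TextIOWrapper was handed a stream that lacks a read1() method, and inserting an io.BufferedReader between the gzip stream and the TextIOWrapper supplies it. This is a sketch I have not been able to test on Python 3.2.6 itself; the helper name is mine.

```python
import gzip
import io

def open_gzip_text(path):
    """Open a gzip file for line-by-line text reading on old Pythons
    where gzip.open() lacks 'rt' mode: BufferedReader provides the
    read1() method that TextIOWrapper requires."""
    raw = gzip.open(path, 'rb')
    return io.TextIOWrapper(io.BufferedReader(raw))
```

With this helper, the original loop can stay as `for line in open_gzip_text(input[0]):`.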

How to load a formatted txt file into Python to be searched

I have a file that is formatted with different indentation and is several hundred lines long. I have tried various methods to load it into Python as a file and a variable, but have not been successful. What would be an efficient way to load the file? My end goal is to load the file and search it for a specific line of text.
with open('''C:\Users\Samuel\Desktop\raw.txt''') as f:
    for line in f:
        if line == 'media_url':
            print line
        else:
            print "void"
Error: Traceback (most recent call last): File "<pyshell#35>", line 1, in <module> with open('''C:\Users\Samuel\Desktop\raw''') as f: IOError: [Errno 22] invalid mode ('r') or filename: 'C:\\Users\\Samuel\\Desktop\raw
If you're trying to search for a specific line, then it's much better to avoid loading the whole file in:
with open('filename.txt') as f:
    for line in f:
        if line == 'search string': # or perhaps: if 'search string' in line:
            # do something
If you're trying to search for the presence of a specific line while ignoring indentation, you'll want to use
if line.strip() == 'search string'.strip():
in order to strip off the leading (and trailing) whitespace before comparing.
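A runnable version of the search described above might look like the following; the function name and return convention are mine, not from the original answer.

```python
def find_line(path, target):
    """Return the first line whose stripped text equals `target`,
    or None if no line matches."""
    with open(path) as f:
        for line in f:
            # strip() ignores indentation and the trailing newline
            if line.strip() == target.strip():
                return line
    return None
```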
The following is the standard way of reading a file's contents into a variable:
with open("filename.txt", "r") as f:
    contents = f.read()
Use the following if you want a list of lines instead of the whole file in one string:
with open("filename.txt", "r") as f:
    contents = f.readlines()
You can then search for text with
if any("search string" in line for line in contents):
    print 'line found'
Python uses backslash to mean "escape". For Windows paths, this means giving the path as a "raw string" -- 'r' prefix.
Lines have newlines attached; to compare, strip them.
with open(r'C:\Users\Samuel\Desktop\raw.txt') as f:
    for line in f:
        if line.rstrip() == 'media_url':
            print line
        else:
            print "void"
