UPD: I have opened the file and found this format. How can I decode 00000000....
I need to open an .ebc file in Python. The file is approximately 12 GB in size.
I have tried a large number of tools and Python libraries, but I am obviously doing something wrong: I can't find a suitable encoding.
I tried to read the file line by line because of its size.
Python ships several code pages for EBCDIC, among them "cp037" (English), "cp424" (Hebrew) and "cp500" (Western scripts).
Use one like this:
with open(path, encoding='cp500') as f:
    for line in f:
        ...  # process one line of text
Note: since the file is 12 GB in size, avoid calling f.read() or f.readlines(), as both read the entire file into memory.
On a laptop, this is likely to freeze your system.
Instead, iterate over the contents line by line using Python's default line iteration.
If you just want to re-encode the file with a modern encoding, e.g. the very popular UTF-8, use the following pattern:
with open(in_path, encoding='cp500') as src, open(out_path, 'w', encoding='utf8') as dest:
    dest.writelines(src)
This should re-encode the file with a low memory footprint, as it reads, converts and writes the contents line by line.
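If the file turns out to contain few or no line breaks (an assumption here; fixed-length record dumps are often like that), a single "line" could still be huge. A minimal sketch that copies fixed-size chunks instead. The function name reencode and the chunk size are illustrative, and cp500 is assumed to be the right code page for this file:

```python
def reencode(in_path, out_path, src_encoding="cp500", chunk_chars=1 << 20):
    """Copy in_path to out_path as UTF-8, decoding from src_encoding.

    Reads at most chunk_chars characters at a time, so memory use stays
    bounded even if the file contains no line breaks at all.
    """
    with open(in_path, encoding=src_encoding, newline="") as src, \
         open(out_path, "w", encoding="utf8", newline="") as dest:
        while True:
            chunk = src.read(chunk_chars)  # text mode: size counts characters
            if not chunk:                  # empty string means end of file
                break
            dest.write(chunk)
```

It would be called as reencode(in_path, out_path); newline="" is passed so that line endings are copied through unchanged.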
Related
I am trying to sift through a big database that is compressed as a .zst file. I am aware that I can simply decompress it and then work on the resulting file, but that uses up a lot of space on my SSD and takes 2+ hours, so I would like to avoid that if possible.
Often when I work with large files I stream them line by line with code like
with open(filename) as f:
    for line in f.readlines():
        do_something(line)
I know gzip has this
with gzip.open(filename,'rt') as f:
    for line in f:
        do_something(line)
but it doesn't seem to work with .zst, so I am wondering if there are any libraries that can decompress and stream the decompressed data in a similar way. For example:
with zstlib.open(filename) as f:
    for line in f.zstreadlines():
        do_something(line)
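One way this could look, assuming the third-party zstandard package is installed (pip install zstandard). The helper names below are illustrative, not part of any library. The package's stream_reader() returns a binary file-like object, which io.TextIOWrapper can turn into a text stream that is iterated line by line:

```python
import io


def iter_text_lines(binary_stream, encoding="utf-8"):
    """Yield decoded lines from any binary file-like object, lazily."""
    with io.TextIOWrapper(binary_stream, encoding=encoding) as text:
        for line in text:
            yield line


def iter_zst_lines(path, encoding="utf-8"):
    """Yield lines from a .zst file without decompressing it to disk."""
    # Third-party dependency, imported lazily so that iter_text_lines()
    # stays usable when zstandard is not installed.
    import zstandard

    with open(path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        yield from iter_text_lines(reader, encoding)
```

Usage would then mirror the gzip version: for line in iter_zst_lines(filename): do_something(line).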
The code below is meant to find any xls or csv file used in a process. The .log file contains full paths with extensions and definitely contains multiple values with "xls" or "csv". However, Python can't find anything...Any idea? The weird thing is when I copy the content of the log file and paste it to another notepad file and save it as log, it works then...
infile = r"C:\Users\me\Desktop\test.log"
important = []
keep_words = ["xls", "csv"]
with open(infile, 'r') as f:
    for line in f:
        for word in keep_words:
            if word in line:
                important.append(line)
print(important)
I was able to figure it out: it was an encoding issue. Opening the file as UTF-16 works:
with io.open(infile,encoding='utf16') as f:
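When it is not known in advance whether a log is UTF-16, one option is to look for a byte order mark at the start of the file. This is only a sketch: the function name sniff_bom_encoding is made up for illustration, and a file without a BOM simply falls back to the given default, which may still be wrong.

```python
import codecs


def sniff_bom_encoding(path, default="utf-8"):
    """Guess a text encoding from the byte order mark, if there is one."""
    with open(path, "rb") as f:
        head = f.read(4)
    # Check UTF-32 first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default
```

The result can be passed straight to open(infile, encoding=sniff_bom_encoding(infile)).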
You could try changing the line
for line in f:
to
for line in f.readlines():
Both forms iterate over the decoded lines of the file, though (readlines() just builds the whole list in memory first), so this alone does not change what the search sees; the file has to be opened with the right encoding.
I hope this helps.
I am trying to open Notepad using Popen and write something into it. I can't get my head around it. I can open Notepad using the command:
notepadprocess=subprocess.Popen('notepad.exe')
I am trying to figure out how I can write anything into the text file using Python. Any help is appreciated.
You can first write something into a text file (e.g. foo.txt) and then open it with Notepad:
import os
f = open('foo.txt','w')
f.write('Hello world!')
f.close()
os.system("notepad.exe foo.txt")
You may be confusing the concept of a (text) file with the processes that manipulate it.
Notepad is a program, of which you can create a process. A file, on the other hand, is just a structure on your hard drive.
From a programming standpoint, Notepad doesn't edit files. It:
reads a file into computer memory
modifies the content of that memory
writes that memory back into a file (which may have the same name or a different one; the latter is known as the "Save as" operation).
Your program, just as any other program, can manipulate files, just as notepad does. In particular, you can perform exactly the same sequence as Notepad:
my_file = "myfile.txt"           # the name/path of the file
with open(my_file, "r") as f:    # open the file for reading
    content = f.read()           # read the file into memory
content += "mytext"              # change the memory
with open(my_file, "w") as f:    # open the file for writing
    f.write(content)             # write the memory into the file
Found the exact solution from Alex K's comment. I used pywinauto to perform this task.
In Python's os module there is a function to open a file and a function to read a file.
The docs for the open function say:
Open the file file and set various flags according to flags and
possibly its mode according to mode. The default mode is 0777 (octal),
and the current umask value is first masked out. Return the file
descriptor for the newly opened file.
The docs for the read function say:
Read at most n bytes from file descriptor fd. Return a string
containing the bytes read. If the end of the file referred to by fd
has been reached, an empty string is returned.
I understand what it means to read n bytes from a file. But how does this differ from open?
"Opening" a file doesn't actually bring any of the data from the file into your program. It just prepares the file for reading (or writing), so when your program is ready to read the contents of the file it can do so right away.
Opening a file allows you to read or write to it (depending on the flag you pass as the second argument), whereas reading it actually pulls the data out of the file, typically to be saved in a variable for processing or printed as output.
You do not always read from a file once it is opened. Opening also allows you to write to a file, either by overwriting all the contents or appending to the contents.
To read from a file:
>>> myfile = open('foo.txt', 'r')
>>> myfile.read()
First you open the file with read permission (r)
Then you read() from the file
To write to a file:
>>> myfile = open('foo.txt', 'w')
>>> myfile.write('I am writing to foo.txt')
The only thing that is being done in line 1 of each of these examples is opening the file. It is not until we actually call read() or write() that any data is read from or written to the file.
open gets you an fd (file descriptor); you can read from that fd later.
One may also open a file for other purposes, such as writing to it.
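A small sketch of that split between opening and reading, using the os-level functions from the question. The scratch file and its content are made up for the example. Note that the quoted docs are from Python 2; in Python 3, os.read() returns a bytes object rather than a string.

```python
import os
import tempfile

# Create a small scratch file to work with (hypothetical content).
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "foo.txt")
with open(path, "w") as f:
    f.write("Hello world!")

fd = os.open(path, os.O_RDONLY)   # opening: returns an int, no data read yet
first = os.read(fd, 5)            # reading: at most 5 bytes
rest = os.read(fd, 100)           # continues where the previous read stopped
at_end = os.read(fd, 100)         # end of file: an empty bytes object
os.close(fd)

print(type(fd).__name__, first, rest, at_end)
# int b'Hello' b' world!' b''
```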
It seems to me you can read lines from the file handle without invoking the read method, but I guess read() is what puts the whole content into a variable. In my course we seem to be printing lines, counting lines, and adding numbers from lines without using read().
The rstrip() method needs to be used, however, because each line obtained from the file handle in a for ... in loop still ends with the invisible line break character, and print() adds another one of its own.
From Python for Everybody by Charles Severance, this is the starter code.
"""
7.2
Write a program that prompts for a file name,
then opens that file and reads through the file,
looking for lines of the form:
X-DSPAM-Confidence: 0.8475
Count these lines and extract the floating point
values from each of the lines and compute the
average of those values and produce an output as
shown below. Do not use the sum() function or a
variable named sum in your solution.
You can download the sample data at
http://www.py4e.com/code3/mbox-short.txt when you
are testing below enter mbox-short.txt as the file name.
"""
# Use the file name mbox-short.txt as the file name
fname = input("Enter file name: ")
fh = open(fname)
for line in fh:
    if not line.startswith("X-DSPAM-Confidence:"):
        continue
    print(line)
print("Done")
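The counting and averaging part could be sketched as below. This is one possible approach, not the book's solution; the function name average_confidence is made up, and it takes any iterable of lines, so an open file handle can be passed directly without calling read().

```python
def average_confidence(lines, prefix="X-DSPAM-Confidence:"):
    """Return the mean of the numbers following prefix, or None if no line matches."""
    count = 0
    total = 0.0
    for line in lines:
        if not line.startswith(prefix):
            continue
        # rstrip() drops the trailing line break before the text is converted
        value = float(line[len(prefix):].rstrip())
        count += 1
        total += value
    if count == 0:
        return None
    return total / count
```

With the starter code above, average_confidence(fh) would consume the file handle line by line.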
I seem to remember that the Python gzip module previously allowed you to read non-gzipped files transparently. This was really useful, as it allowed to read an input file whether or not it was gzipped. You simply didn't have to worry about it.
Now, I get an IOError exception (in Python 2.7.5):
Traceback (most recent call last):
  File "tst.py", line 14, in <module>
    rec = fd.readline()
  File "/sw/lib/python2.7/gzip.py", line 455, in readline
    c = self.read(readsize)
  File "/sw/lib/python2.7/gzip.py", line 261, in read
    self._read(readsize)
  File "/sw/lib/python2.7/gzip.py", line 296, in _read
    self._read_gzip_header()
  File "/sw/lib/python2.7/gzip.py", line 190, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
If anyone has a neat trick, I'd like to hear about it. Yes, I know how to catch the exception, but I find it rather clunky to first read a line, then close the file and open it again.
The best solution for this would be to use something like https://github.com/ahupp/python-magic with libmagic. You simply cannot avoid at least reading a header to identify a file (unless you implicitly trust file extensions).
If you're feeling spartan, the magic number for identifying gzip(1) files is the first two bytes being 0x1f 0x8b.
In [1]: f = open('foo.html.gz')
In [2]: print `f.read(2)`
'\x1f\x8b'
gzip.open is just a wrapper around GzipFile; you could have a function like this that returns the correct type of object depending on what the source is, without having to open the file twice:
#!/usr/bin/python
import gzip

def opener(filename):
    f = open(filename, 'rb')
    if f.read(2) == b'\x1f\x8b':  # gzip magic number
        f.seek(0)
        return gzip.GzipFile(fileobj=f)
    else:
        f.seek(0)
        return f
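On Python 3 the same idea can be extended to hand back text in both cases. This is a sketch under the assumption that the files are UTF-8 text; the name open_maybe_gzip is made up for illustration.

```python
import gzip
import io


def open_maybe_gzip(filename, encoding="utf-8"):
    """Open filename as text, decompressing on the fly if it is gzipped."""
    f = open(filename, "rb")
    magic = f.read(2)
    f.seek(0)
    if magic == b"\x1f\x8b":  # gzip magic number
        # GzipFile.close() does not close the file passed as fileobj;
        # that handle is released when it is garbage-collected.
        f = gzip.GzipFile(fileobj=f)
    return io.TextIOWrapper(f, encoding=encoding)
```

The returned object works in a with statement and yields str lines whether or not the input was compressed.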
Maybe you're thinking of zless or zgrep, which will open compressed or uncompressed files without complaining.
Can you trust that the file name ends in .gz?
if file_name.endswith('.gz'):
    opener = gzip.open
else:
    opener = open
with opener(file_name, 'r') as f:
    ...
Read the first four bytes. If the first three are 0x1f, 0x8b, 0x08, and if the high three bits of the fourth byte are zeros, then fire up the gzip decompression starting with those four bytes. Otherwise write out the four bytes and continue to read transparently.
You should still keep the clunky solution as a backup, so that if the gzip read fails nevertheless, you can rewind and read transparently. It should be quite unlikely, though, for the first four bytes to mimic a gzip file so well without the file being one.
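The four-byte check described above could be written like this. The function name looks_like_gzip is illustrative; it only inspects the header and, as noted, a match is not proof that the rest of the file is valid gzip data.

```python
def looks_like_gzip(header):
    """Check the first four bytes of a file against the gzip header layout."""
    if len(header) < 4:
        return False
    return (
        header[0] == 0x1F                # magic byte 1
        and header[1] == 0x8B            # magic byte 2
        and header[2] == 0x08            # compression method: deflate
        and (header[3] & 0xE0) == 0      # high three flag bits must be zero
    )
```

It expects a bytes object, for example the result of f.read(4) on a file opened in binary mode.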
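You can iterate over files transparently using fileinput.input(files, openhook=fileinput.hook_compressed). Note that hook_compressed picks the opener from the file extension (.gz or .bz2), not from the file contents.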
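A minimal sketch of the fileinput approach on Python 3. The wrapper name iter_lines and the UTF-8 default are assumptions. Binary mode is used so that compressed and plain files both yield bytes, and the decoding happens in one place.

```python
import fileinput


def iter_lines(files, encoding="utf-8"):
    """Yield text lines from plain, .gz and .bz2 files alike."""
    # In binary mode every file yields bytes, whichever opener
    # hook_compressed picks based on the file extension.
    with fileinput.input(files=files, mode="rb",
                         openhook=fileinput.hook_compressed) as stream:
        for raw_line in stream:
            yield raw_line.decode(encoding)
```

It accepts a single file name or a list of them, in any mix of compressed and uncompressed files.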