I am trying a simple example of retrieving data from a file and printing only one line of the output, but I get an error around encoded and 'r'.
import gzip
data = gzip.open('pagecounts-20130601-000000.gz', 'r')
encoded=data.read()
print encoded[2]
It gives this error:
Traceback (most recent call last):
File "filter_articles.scpt", line 4, in <module> encoded=data.read()
File "/usr/lib/python2.7/gzip.py", line 249, in read self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 308, in _read self._add_read_data( uncompress )
File "/usr/lib/python2.7/gzip.py", line 326, in _add_read_data self.extrabuf = self.extrabuf[offset:] + data MemoryError
I guess this is because the file is huge and it could not read the whole content into memory. What would be a better way to print just a few lines of the file?
I am assuming that:
You meant to have quotes around the file name in your script.
You actually want the third line (as your post suggests) and not the third character (as your script suggests).
In this case the following should work:
import gzip
data = gzip.open('pagecounts-20130601-000000.gz', 'r')
data.readline()
data.readline()
print data.readline()
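If you need more than a couple of lines, here is a small sketch (not from the original answer) that streams the file line by line with itertools.islice, so the whole archive is never held in memory:
import gzip
import itertools

# Iterate over the gzip file lazily and stop after the first three lines.
with gzip.open('pagecounts-20130601-000000.gz', 'r') as data:
    for line in itertools.islice(data, 3):
        print(line)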
I can't find a way of reading Minecraft world files in a way that I could use in Python.
I've looked around the internet, but I can find no tutorials, and the few libraries that claim they can do this never actually work.
from nbt import *
nbtfile = nbt.NBTFile("r.0.0.mca",'rb')
I expected this to work, but instead I got errors about the file not being compressed, or something of the sort.
Full error:
Traceback (most recent call last):
File "C:\Users\rober\Desktop\MinePy\MinecraftWorldReader.py", line 2, in <module>
nbtfile = nbt.NBTFile("r.0.0.mca",'rb')
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 628, in __init__
self.parse_file()
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 652, in parse_file
type = TAG_Byte(buffer=self.file)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 99, in __init__
self._parse_buffer(buffer)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 105, in _parse_buffer
self.value = self.fmt.unpack(buffer.read(self.fmt.size))[0]
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 276, in read
return self._buffer.read(size)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 463, in read
if not self._read_gzip_header():
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 411, in _read_gzip_header
raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'\x00\x00')
Use the anvil-parser package (install it with pip install anvil-parser).
Reading
import anvil
region = anvil.Region.from_file('r.0.0.mca')
# You can also provide the region file name instead of the object
chunk = anvil.Chunk.from_region(region, 0, 0)
# If `section` is not provided, will get it from the y coords
# and assume it's global
block = chunk.get_block(0, 0, 0)
print(block) # <Block(minecraft:air)>
print(block.id) # air
print(block.properties) # {}
https://pypi.org/project/anvil-parser/
According to this page, a .mca file is not simply an NBT file. It begins with an 8 KiB header that holds the offsets of the chunks within the region file and the timestamps of their last updates.
I recommend looking at the official announcement and this page for more information.
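As a rough illustration of that layout (not from the original answer, and assuming the standard Anvil region format: 1024 four-byte location entries followed by 1024 four-byte big-endian timestamps), the 8 KiB header can be read directly:
import struct

# Read the 8 KiB region header of an Anvil file.
with open('r.0.0.mca', 'rb') as f:
    header = f.read(8192)

# First 4096 bytes: per-chunk location entries (3-byte sector offset + 1-byte sector count).
locations = []
for i in range(1024):
    offset = int.from_bytes(header[i * 4:i * 4 + 3], 'big')
    sector_count = header[i * 4 + 3]
    locations.append((offset, sector_count))

# Next 4096 bytes: per-chunk last-update timestamps.
timestamps = struct.unpack('>1024i', header[4096:8192])

# Chunks with a zero offset have not been generated yet.
print(sum(1 for off, _ in locations if off), 'chunks present in this region')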
How do I turn this format of TXT file into a CSV file?
Date,Open,high,low,close
1/1/2017,1,2,1,2
1/2/2017,2,3,2,3
1/3/2017,3,4,3,4
I am sure you can understand it; it already has comma-separated values.
I tried using numpy.
>>> import numpy as np
>>> table = np.genfromtxt("171028 A.txt", comments="%")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\npyio.py", line 1551, in genfromtxt
fhd = iter(np.lib._datasource.open(fname, 'rb'))
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\_datasource.py", line 151, in open
return ds.open(path, mode)
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\_datasource.py", line 501, in open
raise IOError("%s not found." % path)
OSError: 171028 A.txt not found.
I have (S&P) 500 txt files to do this with.
You can use the csv module. You can find more information here.
import csv

txt_file = 'mytext.txt'
csv_file = 'mycsv.csv'

# Read the comma-separated text file and write its rows back out as CSV.
with open(txt_file, 'r') as f_in, open(csv_file, 'w', newline='') as f_out:
    in_txt = csv.reader(f_in, delimiter=',')
    out_csv = csv.writer(f_out)
    out_csv.writerows(in_txt)
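Since you mention having around 500 of these files, here is a sketch (assuming they all sit in the current directory and the output name just swaps the extension) that applies the same conversion in a loop:
import csv
import glob
import os

# Convert every .txt file in the current directory to a .csv alongside it.
for txt_file in glob.glob('*.txt'):
    csv_file = os.path.splitext(txt_file)[0] + '.csv'
    with open(txt_file, 'r') as f_in, open(csv_file, 'w', newline='') as f_out:
        csv.writer(f_out).writerows(csv.reader(f_in, delimiter=','))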
Per @dclarke's comment, check the directory from which you run the code. As you coded the call, the file must be in that directory. When I have it there, the code runs without error (although the resulting table is a single line with four nan values). When I move the file elsewhere, I reproduce your error quite nicely.
Either move the file to be local, add a local link to the file, or change the file name in your program to use the proper path to the file (either relative or absolute).
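If you would rather stay with numpy, here is a sketch of reading the file once it is actually found (assuming the header row and comma delimiter shown in the question):
import numpy as np

# delimiter="," splits the columns; names=True consumes the "Date,Open,high,low,close" header.
# The Date column will come back as nan unless a converter or dtype=None is supplied.
table = np.genfromtxt("171028 A.txt", delimiter=",", names=True)
print(table["Open"])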
I have a series of .json files. Each file contains tweets based on a different keyword. Each line in every file is a JSON object. I read the files using the following code:
import json

# Get tweets out of the JSON file (one JSON object per line)
tweetsFromJSON = []
with open(json_file) as f:
    for line in f:
        json_object = json.loads(line)
        tweet_text = json_object["text"]
        tweetsFromJSON.append(tweet_text)
For every other JSON file I have, this works flawlessly. But this particular file gives me the following error:
Traceback (most recent call last):
File "C:/Users/alexandros/Dropbox/Development/Sentiment Analysis/lda_analysis.py", line 119, in <module>
lda_analysis('precision_medicine.json', 'precision medicine')
File "C:/Users/alexandros/Dropbox/Development/Sentiment Analysis/lda_analysis.py", line 46, in lda_analysis
json_object = json.loads(line)
File "C:\Users\alexandros\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Users\alexandros\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 5287 (char 5286)
So I tried removing the first line to see what happens. The error persists, and again it's at the exact same position (line 1 column 5287 (char 5286)). I removed another line and it's the same. I'm racking my brain trying to figure out what's wrong. What am I missing?
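One way to see what sits at that position, and to cope with a line that actually holds more than one JSON object, is json.JSONDecoder.raw_decode, which reports how far it parsed; a sketch (not from the original post):
import json

decoder = json.JSONDecoder()

def objects_in_line(line):
    # Yield every JSON object found in the line, even if several are concatenated.
    line = line.strip()
    idx = 0
    while idx < len(line):
        obj, end = decoder.raw_decode(line, idx)  # end = index just past the parsed object
        yield obj
        idx = end
        while idx < len(line) and line[idx].isspace():
            idx += 1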
I am trying to read a password-protected Word document in Python using zipfile.
The following code works with a non-password-protected document, but gives an error when used with a password-protected file.
try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile

psw = "1234"
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'

def get_docx_text(path):
    document = zipfile.ZipFile(path, "r")
    document.setpassword(psw)
    document.extractall()
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)
    paragraphs = []
    for paragraph in tree.getiterator(PARA):
        texts = [node.text
                 for node in paragraph.getiterator(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))
    return '\n\n'.join(paragraphs)
When running get_docx_text() with a password protected file, I received the following error:
Traceback (most recent call last):
File "<ipython-input-15-d2783899bfe5>", line 1, in <module>
runfile('/Users/username/Workspace/Python/docx2txt.py', wdir='/Users/username/Workspace/Python')
File "/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/spyderlib/widgets/externalshell/sitecustomize.py", line 680, in runfile
execfile(filename, namespace)
File "/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/spyderlib/widgets/externalshell/sitecustomize.py", line 78, in execfile
builtins.execfile(filename, *where)
File "/Users/username/Workspace/Python/docx2txt.py", line 41, in <module>
x = get_docx_text("/Users/username/Desktop/file.docx")
File "/Users/username/Workspace/Python/docx2txt.py", line 23, in get_docx_text
document = zipfile.ZipFile(path, "r")
File "zipfile.pyc", line 770, in __init__
File "zipfile.pyc", line 811, in _RealGetContents
BadZipfile: File is not a zip file
Does anyone have any advice to get this code to work?
I don't think this is an encryption problem, for two reasons:
Decryption is not attempted when the ZipFile object is created. Methods like ZipFile.extractall, extract, open, and read take an optional pwd parameter containing the password (see the sketch after this list), but the object constructor / initializer does not.
Your stack trace indicates that the BadZipFile is being raised when you create the ZipFile object, before you call setpassword:
document = zipfile.ZipFile(path, "r")
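For reference, a minimal sketch of where the password actually comes in once ZipFile has accepted the file (the archive name and password here are placeholders, and in Python 3 the password must be bytes):
import zipfile

# Opening the archive only reads the central directory; nothing is decrypted yet.
with zipfile.ZipFile('archive.zip', 'r') as zf:
    zf.setpassword(b'1234')                  # default password for later reads
    data = zf.read('word/document.xml')      # or: zf.read(name, pwd=b'1234')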
I'd look carefully for other differences between the two files you're testing: ownership, permissions, security context (if you have that on your OS), ... even filename differences can cause a framework to "not see" the file you're working on.
Also --- the obvious one --- try opening the encrypted zip file with your zip-compatible command of choice. See if it really is a zip file.
I tested this by opening an encrypted zip file in Python 3.1, while "forgetting" to provide a password. I could create the ZipFile object (the variable zfile below) without any error, but got a RuntimeError --- not a BadZipFile exception --- when I tried to read a file without providing a password:
Traceback (most recent call last):
File "./zf.py", line 35, in <module>
main()
File "./zf.py", line 29, in main
print_checksums(zipfile_name)
File "./zf.py", line 22, in print_checksums
for checksum in checksum_contents(zipfile_name):
File "./zf.py", line 13, in checksum_contents
inner_file = zfile.open(inner_filename, "r")
File "/usr/lib64/python3.1/zipfile.py", line 903, in open
"password required for extraction" % name)
RuntimeError: File apache.log is encrypted, password required for extraction
I was also able to raise a BadZipfile exception, once by trying to open an empty file and once by trying to open some random logfile text that I'd renamed to a ".zip" extension. The two test files produced identical stack traces, down to the line numbers.
Traceback (most recent call last):
File "./zf.py", line 35, in <module>
main()
File "./zf.py", line 29, in main
print_checksums(zipfile_name)
File "./zf.py", line 22, in print_checksums
for checksum in checksum_contents(zipfile_name):
File "./zf.py", line 10, in checksum_contents
zfile = zipfile.ZipFile(zipfile_name, "r")
File "/usr/lib64/python3.1/zipfile.py", line 706, in __init__
self._GetContents()
File "/usr/lib64/python3.1/zipfile.py", line 726, in _GetContents
self._RealGetContents()
File "/usr/lib64/python3.1/zipfile.py", line 738, in _RealGetContents
raise BadZipfile("File is not a zip file")
zipfile.BadZipfile: File is not a zip file
This stack trace isn't exactly the same as yours (mine has a call to _GetContents, and the pre-3.2 "small f" spelling of BadZipfile), but they're close enough that I think this is the kind of problem you're dealing with.
I am trying to read a list of files from a text file. I am using the following code to do that:
import os
import numpy as np

filelist = input("Please Enter the filelist: ")
flist = open(os.path.normpath(filelist), "r")

fname = []
for curline in flist:
    # check if it's a comment - do comment parsing in this if block
    if curline.startswith('#'):
        continue
    fname.append(os.path.normpath(curline))
flist.close()  # close the list file

# read the slave files 100 MB at a time to generate Stokes vectors
tmp = fname[0].rstrip()
t = np.fromfile(tmp, dtype='float', count=100*1000)
This works perfectly fine and I get the following array:
'H:\\Shaunak\\TerraSAR_X- Sep2012-Glacier_Velocity_Gangotri\\NEST_oregistration\\Glacier_coreg_Cnv\\i_HH_mst_08Oct2012.bin\n'
'H:\\Shaunak\\TerraSAR_X- Sep2012-Glacier_Velocity_Gangotri\\NEST_oregistration\\Glacier_coreg_Cnv\\i_HH_mst_08Oct2012.bin\n'
'H:\\Shaunak\\TerraSAR_X- Sep2012-Glacier_Velocity_Gangotri\\NEST_oregistration\\Glacier_coreg_Cnv\\q_HH_slv3_08Oct2012.bin\n'
'H:\\Shaunak\\TerraSAR_X- Sep2012-Glacier_Velocity_Gangotri\\NEST_oregistration\\Glacier_coreg_Cnv\\q_VV_slv3_08Oct2012.bin'
The problem is that the '\' character appears escaped and there is a trailing '\n' in the strings. I used str.rstrip() to get rid of the '\n'; this works, but leaves the problem of the two backslashes.
I have used the following approaches to try getting rid of these:
I used codecs.unicode_escape_decode(), but I get this error:
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 56-57: malformed \N character escape
Clearly this is not the right approach, because I just want to decode the backslashes, not the rest of the string.
This does not work either: tmp = fname[0].rstrip().replace(r'\\','\\');
Is there no way to make readline() read a raw string?
UPDATE:
Basically I have a text file with 4 file names that I would like to open and read data from in Python. The text file contains:
H:\Shaunak\TerraSAR_X-Sep2012-Glacier_Velocity_Gangotri\NEST_oregistration\Glacier_coreg_Cnv\i_HH_mst_08Oct2012.bin
H:\Shaunak\TerraSAR_X-Sep2012-Glacier_Velocity_Gangotri\NEST_oregistration\Glacier_coreg_Cnv\i_HH_mst_08Oct2012.bin
H:\Shaunak\TerraSAR_X-Sep2012-Glacier_Velocity_Gangotri\NEST_oregistration\Glacier_coreg_Cnv\q_HH_slv3_08Oct2012.bin
H:\Shaunak\TerraSAR_X-Sep2012-Glacier_Velocity_Gangotri\NEST_oregistration\Glacier_coreg_Cnv\q_VV_slv3_08Oct2012.bin
I would like to open each file one by one and read 100MBs of data from them.
When I use this command: np.fromfile(flist[0], dtype='float', count=100), I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'H:\\Shaunak\\TerraSAR_X-Sep2012-Glacier_Velocity_Gangotri\\NEST_oregistration\\Glacier_coreg_Cnv\\i_HH_mst_08Oct2012.bin'
Update
Full Traceback:
Please Enter the filelist: H:/Shaunak/TerraSAR_X- Sep2012-Glacier_Velocity_Gangotri/NEST_oregistration/Glacier_coreg_Cnv/filelist.txt
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "G:\WinPython-32bit-3.3.2.3\python-3.3.2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 581, in runfile
execfile(filename, namespace)
File "G:\WinPython-32bit-3.3.2.3\python-3.3.2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 41, in execfile
exec(compile(open(filename).read(), filename, 'exec'), namespace)
File "H:/Shaunak/Programs/Arnab_glacier_vel/Stokes_generation_2.py", line 28, in <module>
t = np.fromfile(tmp,dtype='float',count=100*1000)
FileNotFoundError: [Errno 2] No such file or directory: 'H:\\Shaunak\\TerraSAR_X-Sep2012-Glacier_Velocity_Gangotri\\NEST_oregistration\\Glacier_coreg_Cnv\\i_HH_mst_08Oct2012.bin'
>>>
As @volcano stated, the double backslash is only Python's internal representation. If you print the string, the doubling is gone. The same goes when you write it to a file: there will be only one '\'.
>>> string_with_double_backslash = "Here is a double backslash: \\"
>>> print(string_with_double_backslash)
Here is a double backslash: \
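Applied to the loop in the question, here is a sketch of how it could strip the newline up front and leave the backslashes alone (flist is the open file-list handle from the question; whether np.fromfile then succeeds still depends on the path actually existing):
import os

fname = []
for curline in flist:
    if curline.startswith('#'):  # skip comment lines
        continue
    # strip() removes the trailing '\n'; the doubled backslashes need no treatment at all.
    fname.append(os.path.normpath(curline.strip()))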
Try this:
import codecs

a_escaped = 'attachment; filename="Nuovo Cinema Paradiso 1988 Director\\\'s Cut"'
a_unescaped = codecs.getdecoder("unicode_escape")(a_escaped)[0]
yielding:
'attachment; filename="Nuovo Cinema Paradiso 1988 Director\'s Cut"'