How to read a specific file from a tar file using Windows?

How to read a specific file from a tar file using Windows? - python

I have a tar file with several files compressed in it. I need to read one specific file (it is in csv format) using pandas. I tried to use the following code:
import tarfile
tar = tarfile.open('my_files.tar', 'r:gz')
f = tar.extractfile('some_files/need_to_be_read.csv')
import pandas as pd
df = pd.read_csv(f.read())
but it throws up the following error:
OSError: Expected file path name or file-like object, got <class 'bytes'> type
on the last line of the code. How do I go about this to read this file?

When you call pandas.read_csv(), you need to give it a filename or file-like object. tar.extractfile() returns a file-like object. Instead of reading the file into memory, pass the file to Pandas.
So remove the .read() part:
import tarfile
tar = tarfile.open('my_files.tar', 'r:gz')
f = tar.extractfile('some_files/need_to_be_read.csv')
import pandas as pd
df = pd.read_csv(f)

Related

Compress excel file in python

Right now my final output is in excel format. I wanted to compressed my excel file using gzip. Is there a way to do it ?
import pandas as pd
import gzip
import re
def renaming_ad_unit():
with gzip.open('weekly_direct_house.xlsx.gz') as f:
df = pd.read_excel(f)
result = df['Ad unit'].to_list()
for index, a_string in enumerate(result):
modified_string = re.sub(r"\([^()]*\)", "", a_string)
df.at[index,'Ad unit'] = modified_string
return df.to_excel('weekly_direct_house.xlsx',index=False)

Yes, this is possible.
To create a gzip file, you can open the file like this:
with gzip.open('filename.xlsx.gz', 'wb') as f:
...
Unfortunately, when I tried this, I found that I get the error OSError: Negative seek in write mode. This is because the Pandas excel writer moves backwards in the file when writing, and uses multiple passes to write the file. This is not allowed by the gzip module.
To fix this, I created a temporary file, and wrote the excel file there. Then, I read the file back, and write it to the compressed archive.
I wrote a short program to demonstrate this. It reads an excel file from a gzip archive, prints it out, and writes it back to another gzip file.
import pandas as pd
import gzip
import tempfile
def main():
with gzip.open('apportionment-2020-table02.xlsx.gz') as f:
df = pd.read_excel(f)
print(df)
with tempfile.TemporaryFile() as excel_f:
df.to_excel(excel_f, index=False)
with gzip.open('output.xlsx.gz', 'wb') as gzip_f:
excel_f.seek(0)
gzip_f.write(excel_f.read())
if __name__ == '__main__':
main()
Here's the file I'm using to demonstrate this: Link

You could also use io.BytesIO to create file in memory and write excel in this file and next write this file as gzip on disk.
I used link to excel file from Nick ODell answer.
import pandas as pd
import gzip
import io
df = pd.read_excel('https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment-2020-table02.xlsx')
buf = io.BytesIO()
df.to_excel(buf)
buf.seek(0) # move to the beginning of file
with gzip.open('output.xlsx.gz', 'wb') as f:
f.write(buf.read())
Similar to Nick ODell answer.
import pandas as pd
import gzip
import io
df = pd.read_excel('https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment-2020-table02.xlsx')
with io.BytesIO() as buf:
df.to_excel(buf)
buf.seek(0) # move to the beginning of file
with gzip.open('output.xlsx.gz', 'wb') as f:
f.write(buf.read())
Tested on Linux

convert csv data to dict without writing file to disk

Here is my scenario: I have a zip file that I am downloading with requests into memory rather than writing a file. I am unzipping the data into an object called myzipfile. Inside the zip file is a csv file. I would like to convert each row of the csv data into a dictionary. Here is what I have so far.
import csv
from io import BytesIO
import requests
# other imports etc.
r = requests.get(url=fileurl, headers=headers, stream=True)
filebytes = BytesIO(r.content)
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
mycsv = myzipfile.open(name).read()
for row in csv.DictReader(mycsv): # it fails here.
print(row)
errors:
Traceback (most recent call last):
File "/usr/lib64/python3.7/csv.py", line 98, in fieldnames
self._fieldnames = next(self.reader)
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)
Looks like csv.DictReader(mycsv) expects a file object instead of raw data. How do I convert the rows in the mycsv object data (<class 'bytes'>) to a list of dictionaries? I'm trying to accomplish this without writing a file to disk and working directly from csv objects in memory.

The DictReader expects a file or file-like object: we can satisfy this expectation by loading the zipped file into an io.StringIO instance.
Note that StringIO expects its argument to be a str, but reading a file from the zipfile returns bytes, so the data must be decoded. This example assumes that the csv was originally encoded with the local system's default encoding. If that is not the case the correct encoding must be passed to decode().
for name in myzipfile.namelist():
data = myzipfile.open(name).read().decode()
mycsv = io.StringIO(data)
reader = csv.DictReader(mycsv)
for row in reader:
print(row)

dict_list = [] # a list
reader = csv.DictReader(open('yourfile.csv', 'rb'))
for line in reader: # since we used DictReader, each line will be saved as a dictionary
dict_list.append(line)

Reading ARFF from ZIP with zipfile and scipy.io.arff

I want to process quite big ARFF files in scikit-learn. The files are in a zip archive and I do not want to unpack the archive to a folder before processing. Hence, I use the zipfile module of Python 3.6:
from zipfile import ZipFile
from scipy.io.arff import loadarff
archive = ZipFile( 'archive.zip', 'r' )
datafile = archive.open( 'datafile.arff' )
data = loadarff( datafile )
# …
datafile.close()
archive.close()
However, this yields the following error:
Traceback (most recent call last):
File "./m.py", line 6, in <module>
data = loadarff( datafile )
File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 541, in loadarff
return _loadarff(ofile)
File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 550, in _loadarff
rel, attr = read_header(ofile)
File "/usr/lib64/python3.6/site-packages/scipy/io/arff/arffread.py", line 323, in read_header
while r_comment.match(i):
TypeError: cannot use a string pattern on a bytes-like object
According to loadarff documentation, loadarff requires a file-like object.
According to zipfile documentation, open returns a file-like ZipExtFile.
Hence, my question is how to use what ZipFile.open returns as the ARFF input to loadarff.
Note: If I unzip manually and load the ARFF directly with data = loadarff( 'datafile.arff' ), all is fine.

from io import BytesIO, TextIOWrapper
from zipfile import ZipFile
from scipy.io.arff import loadarff
zfile = ZipFile('archive.zip', 'r')
in_mem_fo = TextIOWrapper(BytesIO(zfile.read('datafile.arff')), encoding='utf-8')
data = loadarff(in_mem_fo)
Read zfile into a in-memory BytesIO object. Use TextIOWrapper with encoding='utf-8'. Use this in-memory buffered text object in loadarff.
Edit: Turnsout zfile.open() returns a file-like object so the above can be accomplished by :
zfile = ZipFile('archive.zip', 'r')
in_mem_fo = TextIOWrapper(zfile.open('datafile.arff'), encoding='ascii')
data = loadarff(in_mem_fo)
Thanks #Bernhard

Iterating through excel files and capturing a specific cell value in each file

I have a directory of participation forms (as excel files) from clients, and I want to write a script that will grab all of the relevant cells from the participation form and write them to an excel doc where each client is on its own row. When I try and iterate through the directory using the following code:
import os
import xlrd
import xlwt
from xlrd import open_workbook
from xlwt import easyxf
import pandas as pd
from pandas import np
import csv
for i in os.listdir("filepath"):
book=xlrd.open_workbook("filepath",i)
print book
sheet=book.sheet_by_index(0)
a1=sheet.cell_value(rowx=8, colx=3)
print a1
I get the error: IOError: [Errno 13] Permission denied: 'filepath'
EDIT Here is the Full Traceback after making edits suggested by Steven Rumbalski:
Traceback (most recent call last):
File "C:\Users\Me\Desktop\participation_form.py", line 11, in <module>
book=xlrd.open_workbook(("Y:/Directory1/Directory2/Signup/", i))
File "c:\python27\lib\site-packages\xlrd\__init__.py", line 394, in open_workbook
f = open(filename, "rb")
TypeError: coercing to Unicode: need string or buffer, tuple found

xlrd.open_workbook expects its first argument to be a full path to a file. You are trying to open the folder and not the file. You need to join the filepath and the filename. Do
book = xlrd.open_workbook(os.path.join("filepath", i))
You also my want to guard against trying to open things that are not excel files. You could add this as the first line of your loop:
if not i.endswith((".xls", ".xlsx")): continue

You can simplify all of this with the glob module and the .read_excel() method in pandas (which you already seem to be importing). The following iterates over all the files in some directory that match "*.xlsx", parses them into data frames, and prints out the contents of the appropriate cell.
from glob import glob
for f in glob("/my/path/to/files/*.xlsx"):
print pd.read_excel(f).ix[8,3]

Downloading and unzipping a .zip file without writing to disk

I have managed to get my first python script to work which downloads a list of .ZIP files from a URL and then proceeds to extract the ZIP files and writes them to disk.
I am now at a loss to achieve the next step.
My primary goal is to download and extract the zip file and pass the contents (CSV data) via a TCP stream. I would prefer not to actually write any of the zip or extracted files to disk if I could get away with it.
Here is my current script which works but unfortunately has to write the files to disk.
import urllib, urllister
import zipfile
import urllib2
import os
import time
import pickle
# check for extraction directories existence
if not os.path.isdir('downloaded'):
os.makedirs('downloaded')
if not os.path.isdir('extracted'):
os.makedirs('extracted')
# open logfile for downloaded data and save to local variable
if os.path.isfile('downloaded.pickle'):
downloadedLog = pickle.load(open('downloaded.pickle'))
else:
downloadedLog = {'key':'value'}
# remove entries older than 5 days (to maintain speed)
# path of zip files
zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/zip/files"
# retrieve list of URLs from the webservers
usock = urllib.urlopen(zipFileURL)
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
# only parse urls
for url in parser.urls:
if "PUBLIC_P5MIN" in url:
# download the file
downloadURL = zipFileURL + url
outputFilename = "downloaded/" + url
# check if file already exists on disk
if url in downloadedLog or os.path.isfile(outputFilename):
print "Skipping " + downloadURL
continue
print "Downloading ",downloadURL
response = urllib2.urlopen(downloadURL)
zippedData = response.read()
# save data to disk
print "Saving to ",outputFilename
output = open(outputFilename,'wb')
output.write(zippedData)
output.close()
# extract the data
zfobj = zipfile.ZipFile(outputFilename)
for name in zfobj.namelist():
uncompressed = zfobj.read(name)
# save uncompressed data to disk
outputFilename = "extracted/" + name
print "Saving extracted file to ",outputFilename
output = open(outputFilename,'wb')
output.write(uncompressed)
output.close()
# send data via tcp stream
# file successfully downloaded and extracted store into local log and filesystem log
downloadedLog[url] = time.time();
pickle.dump(downloadedLog, open('downloaded.pickle', "wb" ))

Below is a code snippet I used to fetch zipped csv file, please have a look:
Python 2:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(StringIO(resp.read()))
for line in myzip.open(file).readlines():
print line
Python 3:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
# or: requests.get(url).content
resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(BytesIO(resp.read()))
for line in myzip.open(file).readlines():
print(line.decode('utf-8'))
Here file is a string. To get the actual string that you want to pass, you can use zipfile.namelist(). For instance,
resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
myzip = ZipFile(BytesIO(resp.read()))
myzip.namelist()
# ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']

My suggestion would be to use a StringIO object. They emulate files, but reside in memory. So you could do something like this:
# get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'
import zipfile
from StringIO import StringIO
zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()
# output: "hey, foo"
Or more simply (apologies to Vishal):
myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
[ ... ]
In Python 3 use BytesIO instead of StringIO:
import zipfile
from io import BytesIO
filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
[ ... ]

I'd like to offer an updated Python 3 version of Vishal's excellent answer, which was using Python 2, along with some explanation of the adaptations / changes, which may have been already mentioned.
from io import BytesIO
from zipfile import ZipFile
import urllib.request
url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")
with ZipFile(BytesIO(url.read())) as my_zip_file:
for contained_file in my_zip_file.namelist():
# with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
for line in my_zip_file.open(contained_file).readlines():
print(line)
# output.write(line)
Necessary changes:
There's no StringIO module in Python 3 (it's been moved to io.StringIO). Instead, I use io.BytesIO]2, because we will be handling a bytestream -- Docs, also this thread.
urlopen:
"The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen.", Docs and this thread.
Note:
In Python 3, the printed output lines will look like so: b'some text'. This is expected, as they aren't strings - remember, we're reading a bytestream. Have a look at Dan04's excellent answer.
A few minor changes I made:
I use with ... as instead of zipfile = ... according to the Docs.
The script now uses .namelist() to cycle through all the files in the zip and print their contents.
I moved the creation of the ZipFile object into the with statement, although I'm not sure if that's better.
I added (and commented out) an option to write the bytestream to file (per file in the zip), in response to NumenorForLife's comment; it adds "unzipped_and_read_" to the beginning of the filename and a ".file" extension (I prefer not to use ".txt" for files with bytestrings). The indenting of the code will, of course, need to be adjusted if you want to use it.
Need to be careful here -- because we have a byte string, we use binary mode, so "wb"; I have a feeling that writing binary opens a can of worms anyway...
I am using an example file, the UN/LOCODE text archive:
What I didn't do:
NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by it -- downloading the zip file? That's a different task; see Oleh Prypin's excellent answer.
Here's a way:
import urllib.request
import shutil
with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'w') as out_file:
shutil.copyfileobj(response, out_file)

I'd like to add my Python3 answer for completeness:
from io import BytesIO
from zipfile import ZipFile
import requests
def get_zip(file_url):
url = requests.get(file_url)
zipfile = ZipFile(BytesIO(url.content))
files = [zipfile.open(file_name) for file_name in zipfile.namelist()]
return files.pop() if len(files) == 1 else files

write to a temporary file which resides in RAM
it turns out the tempfile module ( http://docs.python.org/library/tempfile.html ) has just the thing:
tempfile.SpooledTemporaryFile([max_size=0[,
mode='w+b'[, bufsize=-1[, suffix=''[,
prefix='tmp'[, dir=None]]]]]])
This
function operates exactly as
TemporaryFile() does, except that data
is spooled in memory until the file
size exceeds max_size, or until the
file’s fileno() method is called, at
which point the contents are written
to disk and operation proceeds as with
TemporaryFile().
The resulting file has one additional
method, rollover(), which causes the
file to roll over to an on-disk file
regardless of its size.
The returned object is a file-like
object whose _file attribute is either
a StringIO object or a true file
object, depending on whether
rollover() has been called. This
file-like object can be used in a with
statement, just like a normal file.
New in version 2.6.
or if you're lazy and you have a tmpfs-mounted /tmp on Linux, you can just make a file there, but you have to delete it yourself and deal with naming

Adding on to the other answers using requests:
# download from web
import requests
url = 'http://mlg.ucd.ie/files/datasets/bbc.zip'
content = requests.get(url)
# unzip the content
from io import BytesIO
from zipfile import ZipFile
f = ZipFile(BytesIO(content.content))
print(f.namelist())
# outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
Use help(f) to get more functions details for e.g. extractall() which extracts the contents in zip file which later can be used with with open.

All of these answers appear too bulky and long. Use requests to shorten the code, e.g.:
import requests, zipfile, io
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("/path/to/directory")

Vishal's example, however great, confuses when it comes to the file name, and I do not see the merit of redefing 'zipfile'.
Here is my example that downloads a zip that contains some files, one of which is a csv file that I subsequently read into a pandas DataFrame:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
import pandas
url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.zip")
zf = ZipFile(StringIO(url.read()))
for item in zf.namelist():
print("File in zip: "+ item)
# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall de ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
(Note, I use Python 2.7.13)
This is the exact solution that worked for me. I just tweaked it a little bit for Python 3 version by removing StringIO and adding IO library
Python 3 Version
from io import BytesIO
from zipfile import ZipFile
import pandas
import requests
url = "https://www.nseindia.com/content/indices/mcwb_jun19.zip"
content = requests.get(url)
zf = ZipFile(BytesIO(content.content))
for item in zf.namelist():
print("File in zip: "+ item)
# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall de ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])

It wasn't obvious in Vishal's answer what the file name was supposed to be in cases where there is no file on disk. I've modified his answer to work without modification for most needs.
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
def unzip_string(zipped_string):
unzipped_string = ''
zipfile = ZipFile(StringIO(zipped_string))
for name in zipfile.namelist():
unzipped_string += zipfile.open(name).read()
return unzipped_string

Use the zipfile module. To extract a file from a URL, you'll need to wrap the result of a urlopen call in a BytesIO object. This is because the result of a web request returned by urlopen doesn't support seeking:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
zip_url = 'http://example.com/my_file.zip'
with urlopen(zip_url) as f:
with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
foofile = myzipfile.open('foo.txt')
print(foofile.read())
If you already have the file downloaded locally, you don't need BytesIO, just open it in binary mode and pass to ZipFile directly:
from zipfile import ZipFile
zip_filename = 'my_file.zip'
with open(zip_filename, 'rb') as f:
with ZipFile(f) as myzipfile:
foofile = myzipfile.open('foo.txt')
print(foofile.read().decode('utf-8'))
Again, note that you have to open the file in binary ('rb') mode, not as text or you'll get a zipfile.BadZipFile: File is not a zip file error.
It's good practice to use all these things as context managers with the with statement, so that they'll be closed properly.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to read a specific file from a tar file using Windows? - python

Related

Compress excel file in python

convert csv data to dict without writing file to disk

Reading ARFF from ZIP with zipfile and scipy.io.arff

Iterating through excel files and capturing a specific cell value in each file

Downloading and unzipping a .zip file without writing to disk

Categories

Resources