Python - Pandas "BadGzipFile" Error When Reading in ".json.gz" File

I am trying to read in data from a ".json.gz" file as a dataframe. I keep getting an error indicating that it is a "BadGzipFile". However, when I unzip the file manually (i.e., just double-clicking it in Finder), I am able to successfully open the json file. This leads me to believe that the file is fine, but when I run the below code in Python, I receive the "BadGzipFile" error.
I am very new to .gzip files and have done a fair bit of research trying to figure out what the issue is. So far, I have been unsuccessful. Any help would be greatly appreciated!
Here is my code:
import os
import json
import gzip
import pandas as pd

file_path = '/data/data_0_0_0.json.gz'

with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, compression='gzip', lines=True)
And here is the error I am receiving:
BadGzipFile: Not a gzipped file (b'{"')

What's happening with your code here:
with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, compression='gzip', lines=True)
is that you're opening a gzip file at file_path, and then telling Pandas that the object you opened (f) is itself another gzip file. It isn't; it's the decompressed JSON stream. The BadGzipFile error with that opening bracket (b'{"') is telling you that what Pandas tried to decompress starts with a JSON bracket instead of the gzip magic number.
You should either open the file with gzip and let Pandas read the decompressed stream directly, or hand the path straight to Pandas and let it handle the decompression.
The first would be:
with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, lines=True)
The second is actually easier. Because pd.read_json infers the compression format from the file name, and your file ends with .gz, you can just write:
df = pd.read_json(file_path, lines=True)
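If the error persists even with the corrected code, it can be worth confirming that the file on disk is actually gzip-compressed. A minimal sketch (the path is the one from the question; the magic-number check itself is standard):
import gzip

file_path = '/data/data_0_0_0.json.gz'

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
with open(file_path, 'rb') as f:
    magic = f.read(2)

if magic == b'\x1f\x8b':
    print("Looks like a real gzip file")
else:
    # Some download tools decompress transparently but keep the .gz name,
    # in which case the file is plain JSON despite its extension.
    print("Not gzip-compressed; first bytes:", magic)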

Related

Read in-memory csv file with open() in python

I have a dataframe that I read from an Excel file, and I want to write it out as a csv without saving it to disk. StringIO() looked like the clear solution. I then wanted to open the file-like object from memory with open(), but was getting a type error. How do I get around this type error and read the in-memory csv file with open()? Or, actually, is it wrong to assume that open() will work?
TypeError: expected str, bytes or os.PathLike object, not StringIO
The error points at the line below, from the code further down.
f = open(writer_file)
To get a proper example, I had to open the file "pandas_example.xlsx" after creating it and remove a row; the code below then runs until it hits an empty row.
import io
import csv
import pandas as pd
from pandas import util

df = util.testing.makeDataFrame()
df.to_excel('pandas_example.xlsx')
df_1 = pd.read_excel('pandas_example.xlsx')
writer_file = io.StringIO()
write_to_the_in_mem_file = csv.writer(writer_file, dialect='excel', delimiter=',')
write_to_the_in_mem_file.writerow(df_1)
f = open(writer_file)
while f.readline() not in (',,,,,,,,,,,,,,,,,,\n', '\n'):
    pass
final_df = pd.read_csv(f, header=None)
f.close()
Think of writer_file as what is returned from open(); you don't need to open it again.
For example:
import pandas as pd
from pandas import util
import io

# Create test file
df = util.testing.makeDataFrame()
df.to_excel('pandas_example.xlsx')
df_1 = pd.read_excel('pandas_example.xlsx')

writer_file = io.StringIO()
df_1.to_csv(writer_file)
writer_file.seek(0)  # Rewind back to the start
for line in writer_file:
    print(line.strip())
The to_csv() call writes the dataframe into your memory file in CSV format.
After writer_file = io.StringIO(), writer_file is already a file-like object. It already has a readline method, as well as read, write, seek, etc. See io.TextIOBase, which io.StringIO inherits from.
In other words, open(writer_file) is unnecessary or, rather, it results in a type error (as you already observed).
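That also means you can hand the buffer straight to pandas instead of open(). A minimal sketch reusing the question's variable names (the sample dataframe is made up for illustration):
import io
import pandas as pd

writer_file = io.StringIO()
pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).to_csv(writer_file, index=False)

writer_file.seek(0)  # rewind before reading
final_df = pd.read_csv(writer_file)  # read_csv accepts any file-like object
print(final_df)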

How to resolve an encoding issue?

I need to read the content of a csv file using Python. However when I run this code:
with open(self.path, 'r') as csv_file:
    csv_reader = csv.reader(csv_file, dialect=csv.excel, delimiter=';')
    self.data = [[cell for cell in row] for row in csv_reader]
I get this error:
File "C:\Python36\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1137: character maps to <undefined>
My understanding is that this file was not encoded in cp-1252, and that I need to find out what encoding was used. I tried a bunch of things, but nothing has worked so far.
About the file:
It is sent by an external company, I can't have more information about it.
It comes with other similar files, with which I don't have any issue when I run the same code
It has an .xls extension, but is really a csv file delimited with semicolons
When I open it with Excel it opens in Compatibility mode, but I don't see any sort of encoding issue: everything displays correctly.
What I already tried:
Saving it under a different file format to get rid of the compatibility mode
Passing an encoding to open() in the first line of my code (I tried, more or less at random, some encodings I know of):
with(open(self.path, 'r', encoding = 'utf8')) as csv_file:
Copy-pasting the content of the file into a new file, or deleting the whole content of the file. Still does not work. This one really bugs me because I feel like it means the problem is not in the content of the file, but in the file itself.
Searching extensively for how to solve this kind of issue.
I recommend using the pandas library (along with numpy); it is very handy when it comes to data manipulation. The function below imports data from an xlsx or csv file.
/!\ Change dataPath according to your needs /!\
import os
import pandas as pd

def GetData(directory, dataUse, format):
    dataPath = os.getcwd() + "\\Data\\" + directory + "\\" + dataUse + "Set." + format
    if format == "xlsx":
        dataSet = pd.read_excel(dataPath, sheet_name='Sheet1')
    elif format == "csv":
        dataSet = pd.read_csv(dataPath)
    return dataSet
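A hypothetical call, assuming a Data\Sales folder containing a TrainSet.csv (both names made up for illustration):
# Resolves to <cwd>\Data\Sales\TrainSet.csv under the path scheme above.
dataSet = GetData("Sales", "Train", "csv")
print(dataSet.head())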
I finally found some sort of solution:
Open the file with Excel
Display the file properly using the "Text to Columns" feature
Save the file to csv format
Run the code
This does not quite satisfy me, but it works.
I still don't understand what the problem actually is, and why this solved it, so I am interested in any additional information!
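If anyone hits the same wall, one common way to narrow it down is simply to try a few candidate encodings until one decodes the whole file. A minimal sketch (the file name and the candidate list are assumptions; extend them as needed):
candidates = ['utf-8', 'utf-8-sig', 'cp1252', 'latin-1', 'utf-16']

for enc in candidates:
    try:
        with open('suspect_file.xls', 'r', encoding=enc) as f:
            f.read()
        print('Decoded cleanly as', enc)
        break
    except UnicodeDecodeError as e:
        # Note: latin-1 never raises, since it maps every byte,
        # so treat a latin-1 "success" with suspicion.
        print(enc, 'failed:', e)
The chardet (or charset-normalizer) package can also guess an encoding from the raw bytes, which is handy when trial-and-error stalls.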

File I/O in Python

I'm attempting to read a CSV file and then write the read CSV into another CSV file.
Here is my code so far:
import csv

with open("mastertable.csv") as file:
    for row in file:
        print row

with open("table.csv", "w") as f:
    f.write(file)
I eventually want to read a CSV file and write it to a new CSV with appended data.
I get this error when I try to run it.
Traceback (most recent call last):
  File "readlines.py", line 8, in <module>
    f.write(file)
TypeError: expected a character buffer object
From what I understood it seems that I have to close the file, but I thought with automatically closed it?
I'm not sure why I can write a string to a text file but can't simply write one CSV to another, almost like making a copy by iterating over it.
To read in a CSV and write to a different one, you might do something like this:
with open("table.csv", "w") as f:
with open ("mastertable.csv") as file:
for row in file:
f.write(row)
But I would only do that if the rows needed to be edited while transcribed. For the described use case, you can simply copy the file with shutil beforehand and then open it to append to it, as shown below. This will be much faster, not to mention far more readable.
The with statement handles file closing for you: the file is closed when you leave its block of code (given by the indentation level).
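A minimal sketch of the copy-then-append approach just described (file names taken from the question; the appended line is a made-up example):
import shutil

shutil.copyfile("mastertable.csv", "table.csv")  # byte-for-byte copy

with open("table.csv", "a") as f:  # "a" appends instead of overwriting
    f.write("extra,row,of,data\n")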
It looks like you intend to make use of the Python csv module. The following should be a good starting point for what you are trying to achieve:
import csv

with open("mastertable.csv", "r") as file_input, open("table.csv", "wb") as file_output:
    csv_input = csv.reader(file_input)
    csv_output = csv.writer(file_output)
    for cols in csv_input:
        cols.append("more data")
        csv_output.writerow(cols)
This reads the mastertable.csv file a line at a time as a list of columns, appends an extra column, and then writes each line to table.csv.
Note, when you leave the scope of a with statement, the file is automatically closed.
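Note that the "wb" output mode is a Python 2 idiom. Under Python 3 the csv module wants text-mode files opened with newline='', so the equivalent would be (a sketch, same file names):
import csv

with open("mastertable.csv", "r", newline="") as file_input, \
     open("table.csv", "w", newline="") as file_output:
    csv_output = csv.writer(file_output)
    for cols in csv.reader(file_input):
        cols.append("more data")
        csv_output.writerow(cols)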
The file variable is not really the actual file data; it is a reference (a file object) used to read the data. When you do the following:
with open("mastertable.csv") as file:
    for row in file:
        print row
the file object gets closed automatically when the block ends. The write method expects a character buffer or a string as input, not a file object.
If you just want to copy data, you can do something like this:
data = ""
with open("mastertable.csv", "r") as file:
    data = file.read()
with open("table.csv", "a") as file:
    file.write(data)

Trying to download data from URL with CSV File

I'm slightly new to Python and have a question as to why the following code doesn't produce any output in the csv file. The code is as follows:
import csv
import urllib2

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)

for row in cr:
    with open("AusCentralbank.csv", "wb") as f:
        writer = csv.writer(f)
        writer.writerows(row)
Cheers.
Edit:
Brien and Albert solved the initial issue I had. However, I now have one further question. The CSV file listed above comes from "http://www.rba.gov.au/statistics/tables/#interest-rates", under Zero-coupon "Interest Rates - Analytical Series - 2009 to Current - F17" (the F17 Yields CSV). It has 5 workbooks, and I actually just want to gather the data in the 5th workbook. Is there a way I could do this? Cheers.
I could only test my code using Python 3. However, the only difference should be urllib2; here I am using urllib.request for opening the desired url.
The variable html has type bytes and can be written directly to a file in binary mode. Additionally, your source is a csv file already, so there should be no need to convert it somehow:
#!/usr/bin/env python3
# coding: utf-8
import urllib.request

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib.request.urlopen(url)
html = response.read()

with open('output.csv', 'wb') as f:
    f.write(html)
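For large downloads you could also stream straight to disk instead of holding the whole body in memory; a sketch using the standard library's shutil.copyfileobj (same url as above):
import shutil
import urllib.request

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'

with urllib.request.urlopen(url) as response, open('output.csv', 'wb') as f:
    shutil.copyfileobj(response, f)  # copies in chunks, not all at once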
It is probably because of your opening mode.
According to the documentation:
'w' for only writing (an existing file with the same name will be erased)
You should use append ('a') mode to append to the end of the file instead:
'a' opens the file for appending; any data written to the file is automatically added to the end.
Also, since the file you are trying to download is a csv file, you don't need to convert it.
@albert had a great answer. I've gone ahead and converted it to the equivalent Python 2.x code. You were doing a bit too much work in your original program; since the file was already a csv, you didn't need to do any special work to turn it into one.
import urllib2

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib2.urlopen(url)
html = response.read()

with open('AusCentralbank.csv', 'wb') as f:
    f.write(html)
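On the follow-up question: a plain csv file has no workbooks, so the five "workbooks" are presumably just sections or header blocks inside the one file. If the part you want starts at a known row, pandas can skip everything before it; a sketch (the skiprows value of 10 is a made-up placeholder; inspect the downloaded file to find the real offset):
import pandas as pd

# skiprows=10 is a placeholder; count the metadata rows in the actual file.
df = pd.read_csv('AusCentralbank.csv', skiprows=10)
print(df.head())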

python clear csv file

How can I clear a complete csv file with Python? Most forum entries that cover the issue of deleting rows/columns basically say: write the stuff you want to keep into a new file. I need to completely clear a file - how can I do that?
Basically you want to truncate the file; this works on any file. In this case it's a csv file, so:
filename = "filewithcontents.csv"
# opening the file with w+ mode truncates the file
f = open(filename, "w+")
f.close()
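The same thing reads a little more idiomatically with a with statement, which closes the file even if something goes wrong in between:
filename = "filewithcontents.csv"

# Opening in "w" mode truncates the file to zero length on open.
with open(filename, "w"):
    pass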
Your question is rather strange, but I'll interpret it literally. Clearing a file is not the same as deleting it.
You want to open a file object to the CSV file, and then truncate the file, bringing it to zero length.
f = open("filename.csv", "w")
f.truncate()
f.close()
If you want to delete it instead, that's just an os filesystem call:
import os
os.remove("filename.csv")
The Python csv module is only for reading and writing whole CSV files, not for manipulating them in place. If you need to filter data from a file, then you have to read it, create a new csv file, and write the filtered rows back to the new file.
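A sketch of that read-filter-write pattern (the file names and the filter condition are made up for illustration):
import csv

with open("input.csv", newline="") as src, open("filtered.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row and row[0] != "skip-me":  # keep everything except flagged rows
            writer.writerow(row)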
