Filtered input to csv.reader in Python

My program reads a CSV file, but recently the input file was changed to be base64-encoded. Currently the read code is:
with open(uploadFile, 'rb') as csvfile:
    spreadSheet = csv.reader(csvfile, delimiter=',')
I know csvfile is a file object and this can't be done directly, but I want to do something like:
import base64

with open(uploadFile, 'rb') as csvfile:
    spreadSheet = csv.reader(base64.decode(csvfile), delimiter=',')
That is, the file input would be base64-decoded as though in a pipe and then parsed as a CSV file.
I could read the file, decode it, write the result back into another file, and then read that file with the csv reader, but it seems as though there should be a way to do it as a pipe sequence.

Try the following:
import base64
import csv

with open(uploadFile, 'rb') as csvfile:
    decoded = base64.standard_b64decode(csvfile.read()).decode('utf-8')
    spreadSheet = csv.reader(decoded.splitlines(), delimiter=',')
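If you prefer handing csv.reader a file-like object instead of a list of lines, io.StringIO does the same job. A minimal sketch, still decoding the whole file in memory and assuming UTF-8:
import base64
import csv
import io

with open(uploadFile, 'rb') as csvfile:
    # Decode the base64 payload, then wrap the text in an in-memory stream
    text = base64.standard_b64decode(csvfile.read()).decode('utf-8')
    spreadSheet = csv.reader(io.StringIO(text), delimiter=',')
    for row in spreadSheet:
        print(row)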

Related

How to read the headers of a csv file using csv module in "rb" mode?

I am currently reading the csv file in "rb" mode and uploading the file to an s3 bucket.
with open(csv_file, 'rb') as DATA:
    s3_put_response = requests.put(s3_presigned_url, data=DATA, headers=headers)
All of this is working fine but now I have to validate the headers in the csv file before making the put call.
When I try to run the code below, I get an error.
with open(csv_file, 'rb') as DATA:
    csvreader = csv.reader(DATA)
    columns = next(csvreader)
    # run-some-validations
    s3_put_response = requests.put(s3_presigned_url, data=DATA, headers=headers)
This throws
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
As a workaround, I have created a new function which opens the file in "r" mode and validates the csv headers, and this works OK.
def check_csv_headers():
    with open(csv_file, 'r') as file:
        csvreader = csv.reader(file)
        columns = next(csvreader)
I do not want to read the same file twice. Once for header validation and once for uploading to s3. The upload part also doesn't work if I do it in "r" mode.
Is there a way I can achieve this while reading the file only once in "rb" mode ? I have to make this work using the csv module and not the pandas library.
Doing what you want is possible but not very efficient. Simply opening a file isn't that expensive, and the CSV reader only reads one line at a time, not the entire file.
To do what you want you have to:
1. Read the first line as bytes
2. Decode it into a string (using the correct encoding)
3. Convert it to a list of strings
4. Parse it with csv.reader
5. Finally, seek to the start of the stream.
Otherwise you'll end up uploading only the data without the headers:
with open(csv_file, 'rb') as DATA:
    header = DATA.readline()
    lines = [header.decode()]
    csvreader = csv.reader(lines)
    columns = next(csvreader)
    # run-some-validations
    DATA.seek(0)
    s3_put_response = requests.put(s3_presigned_url, data=DATA, headers=headers)
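If you do have to keep the single 'rb' handle, one more option (a sketch, not from the answer above): io.TextIOWrapper can lend a temporary text view of the binary stream just long enough to parse the header. s3_presigned_url and headers are assumed from the question.
import csv
import io

with open(csv_file, 'rb') as DATA:
    wrapper = io.TextIOWrapper(DATA, encoding='utf-8', newline='')
    columns = next(csv.reader(wrapper))
    # run-some-validations
    wrapper.detach()  # release the binary stream without closing it
    DATA.seek(0)      # rewind; the wrapper may have buffered ahead
    s3_put_response = requests.put(s3_presigned_url, data=DATA, headers=headers)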
Opening the file as text is not only simpler, it allows you to separate the validation logic from the upload code.
To ensure only one line is read at a time you can use buffering=1:
def check_csv_headers():
    with open(csv_file, 'r', buffering=1) as file:
        csvreader = csv.reader(file)
        columns = next(csvreader)
        # run-some-validations

with open(csv_file, 'rb') as DATA:
    s3_put_response = requests.put(s3_presigned_url, data=DATA, headers=headers)
Or
def check_csv_headers(filePath):
    with open(filePath, 'r', buffering=1) as file:
        csvreader = csv.reader(file)
        columns = next(csvreader)
        # run-some-validations
        # If successful
        return True

def upload_csv(filePath):
    if check_csv_headers(filePath):
        with open(filePath, 'rb') as DATA:
            s3_put_response = requests.put(s3_presigned_url, data=DATA, headers=headers)

Gzipping a CSV file truncates it

I have a script that writes data to a CSV file and then gzips it.
The bizarre thing is the gzipped file is truncated by a few lines (total file size is over 18million lines).
I've manually gzipped the CSV file produced by the script and the file is not truncated. However, when I use Python to gzip the file (I've tried gzip, os, and subprocess), it is truncated. I can't figure out why this might be happening.
Code snippet below:
import csv
import os

# Remove quotes from file
with open(localFile, "r") as csvfile:
    csvreader = csv.reader(csvfile, skipinitialspace=True)
    # Skip the header row
    next(csvreader)
    writer = csv.writer(open(outputFile, "w"), quoting=csv.QUOTE_NONE)
    for row in csvreader:
        writer.writerow(row)

# Zip file
zipCommand = f"gzip {outputFile}"
exit_code = os.system(zipCommand)
"total file size is over 18 million lines"
I assume that holding all of this in RAM is not an option. You might give csv.writer gzip's file handle to avoid that; this also sidesteps the likely cause of the truncation, namely that open(outputFile, "w") is never closed, so the last buffered rows are not flushed to disk before gzip runs. Consider the following simple example:
import csv, gzip

with gzip.open("file.csv.gz", "wt") as gf:
    writer = csv.writer(gf, quoting=csv.QUOTE_NONE)
    writer.writerow([1, 2, 3])
    writer.writerow([4, 5, 6])
    writer.writerow([7, 8, 9])
This will create file.csv.gz; after running gunzip file.csv.gz you will get a file with the following content:
1,2,3
4,5,6
7,8,9
Note: use wt (write-text) mode for usage with csv.writer, which emits text.
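Applied to the snippet from the question, a sketch (localFile and outputFile as in the question) that streams rows straight into the gzipped file, so everything is flushed when the with block closes:
import csv
import gzip

with open(localFile, "r", newline="") as csvfile, \
        gzip.open(outputFile + ".gz", "wt", newline="") as gzfile:
    csvreader = csv.reader(csvfile, skipinitialspace=True)
    next(csvreader)  # skip the header row
    writer = csv.writer(gzfile, quoting=csv.QUOTE_NONE)
    writer.writerows(csvreader)  # leaving the with block flushes and closes both files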

How to open a json.gz file and return to dictionary in Python

I have downloaded a compressed json file and want to open it as a dictionary.
I used json.load but the data type still gives me a string.
I want to extract a keyword list from the json file. Is there a way I can do it even though my data is a string?
Here is my code:
import gzip
import json

with gzip.open("19.04_association_data.json.gz", "r") as f:
    data = f.read()

with open('association.json', 'w') as json_file:
    json.dump(data.decode('utf-8'), json_file)

with open("association.json", "r") as read_it:
    association_data = json.load(read_it)

print(type(association_data))
# The actual output is 'str' but I expect it to be 'dict'
In the first with block you already got the uncompressed string; there is no need to open it a second time. (Dumping the string with json.dump just writes a JSON string literal, which is why json.load gives you back a str.)
import gzip
import json

with gzip.open("19.04_association_data.json.gz", "r") as f:
    data = f.read()

j = json.loads(data.decode('utf-8'))
print(type(j))
Open the file using the gzip package from the standard library (docs), then read it directly into json.loads():
import gzip
import json

with gzip.open("19.04_association_data.json.gz", "rb") as f:
    # json.loads accepts UTF-8 bytes directly; its old encoding= argument was removed in Python 3.9
    data = json.loads(f.read())
To read from a json.gz, you can use the following snippet:
import json
import gzip

with gzip.open("file_path_to_read", "rt") as f:
    expected_dict = json.load(f)
The result is of type dict.
If you want to write to a json.gz, you can use the following snippet:
import json
import gzip

with gzip.open("file_path_to_write", "wt") as f:
    json.dump(expected_dict, f)
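Putting the two snippets together, a quick round trip; the file name and dict here are made up for illustration:
import gzip
import json

expected_dict = {"keywords": ["a", "b", "c"]}  # hypothetical data

with gzip.open("example.json.gz", "wt") as f:
    json.dump(expected_dict, f)

with gzip.open("example.json.gz", "rt") as f:
    assert json.load(f) == expected_dict  # comes back as a dict, not a str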

How to change .csv.gz encoding to utf-8

I want to use either R or Python to convert a .csv.gz file to utf-8 encoding. How can I do this directly? I am not able to find any comprehensive guide on how to do this.
My best attempt was to read the .csv.gz file with csv.reader in Python:
csvFile = gzip.open('pracodawcy_20190611_5.csv.gz', 'rt', newline='')
reader = csv.reader(csvFile)
But then how do I save it as a CSV in UTF-8?
Very easily; this reads the file into a list:
import gzip

### assuming the file is separated as you said
with gzip.open('input_file.csv.gz', 'rt', newline='\n') as f:
    content = f.readlines()

### to print the list content
for v in content:
    print(v)

### to write to .csv.gz
with gzip.open('output.csv.gz', 'wb') as f:
    for v in content:
        f.write(v.encode('utf-8'))
You can also read it lazily, line by line, if the file is too big, using read(size) or a for loop; a sketch follows. There are a lot of examples here and on the web.
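For example, a line-by-line sketch that re-encodes without loading everything into memory. The source encoding here ('cp1250') is a placeholder assumption; substitute whatever the file actually uses:
import gzip

# 'cp1250' is an assumed source encoding, not given in the question
with gzip.open('pracodawcy_20190611_5.csv.gz', 'rt', encoding='cp1250', newline='') as src, \
        gzip.open('output_utf8.csv.gz', 'wt', encoding='utf-8', newline='') as dst:
    for line in src:
        dst.write(line)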

Converting a .csv.gz to .csv in Python 2.7

I have read the documentation and a few additional posts on SO and other various places, but I can't quite figure out this concept:
When you call csvFilename = gzip.open(filename, 'rb') and then reader = csv.reader(open(csvFilename)), is that reader not a valid csv file?
I am trying to solve the problem outlined below, and am getting a "coercing to Unicode: need string or buffer, GzipFile found" error on lines 41 and 7 (highlighted below), leading me to believe that gzip.open and csv.reader do not work as I had previously thought.
Problem I am trying to solve
I am trying to take a results.csv.gz and convert it to a results.csv so that I can turn the results.csv into a python dictionary and then combine it with another python dictionary.
File 1:
alertFile = payload.get('results_file')
alertDataCSV = rh.dataToDict(alertFile) # LINE 41
alertDataTotal = rh.mergeTwoDicts(splunkParams, alertDataCSV)
Calls File 2:
import gzip
import csv

def dataToDict(filename):
    csvFilename = gzip.open(filename, 'rb')
    reader = csv.reader(open(csvFilename))  # LINE 7
    alertData = {}
    for row in reader:
        alertData[row[0]] = row[1:]
    return alertData

def mergeTwoDicts(dictA, dictB):
    dictC = dictA.copy()
    dictC.update(dictB)
    return dictC
*edit: also forgive my non-PEP style of naming in Python
gzip.open returns a file-like object (the same kind of thing plain open returns), not the name of a decompressed file. Simply pass the result directly to csv.reader and it will work; csv.reader will receive the decompressed lines. csv does expect text though, so on Python 3 you need to open it in text mode (on Python 2 'rb' is fine; the gzip module doesn't deal with encodings, but then, neither does the csv module). Simply change:
csvFilename = gzip.open(filename, 'rb')
reader = csv.reader(open(csvFilename))
to:
# Python 2
csvFile = gzip.open(filename, 'rb')
reader = csv.reader(csvFile) # No reopening involved
# Python 3
csvFile = gzip.open(filename, 'rt', newline='') # Open in text mode, not binary, no line ending translation
reader = csv.reader(csvFile) # No reopening involved
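Not part of the answer above, but for completeness, a Python 3 sketch of the asker's dataToDict rewritten along those lines:
import csv
import gzip

def dataToDict(filename):
    # Open in text mode so csv.reader gets strings, not bytes
    with gzip.open(filename, 'rt', newline='') as csvFile:
        return {row[0]: row[1:] for row in csv.reader(csvFile)}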
The following worked for me for python==3.7.9:
import gzip

my_filename = 'my_compressed_file.csv.gz'

with gzip.open(my_filename, 'rt') as gz_file:
    data = gz_file.read()  # read decompressed data

with open(my_filename[:-3], 'wt') as out_file:
    out_file.write(data)  # write decompressed data
my_filename[:-3] strips the '.gz' suffix so the decompressed copy keeps the original file name.
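If the file is too large to read in one go, a streaming variant (same hypothetical my_filename) can copy the decompressed bytes without holding them all in memory:
import gzip
import shutil

my_filename = 'my_compressed_file.csv.gz'  # hypothetical name, as above

with gzip.open(my_filename, 'rb') as gz_file, \
        open(my_filename[:-3], 'wb') as out_file:
    shutil.copyfileobj(gz_file, out_file)  # stream decompressed data in chunks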
