Hash a column in CSV and output in Base64 - python

Still getting my feet wet with Python, but my goal is to read a CSV file and hash a specific column using SHA256 then output in Base64.
Here is an example of the conversion that needs to take place; the calculator I am matching can be found at https://www.liavaag.org/English/SHA-Generator/
Here is the code I have currently
import hashlib
import csv
import base64
with open('File1.csv') as csvfile:
    with open('File2.csv', 'w') as newfile:
        reader = csv.DictReader(csvfile)
        for i, r in enumerate(reader):
            # writing csv headers
            if i == 0:
                newfile.write(','.join(r) + '\n')
            # hashing the 'CardNumber' column
            r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()
            # writing the new row to the file with hashed 'CardNumber'
            newfile.write(','.join(r.values()) + '\n')
The error I receive is
r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()
TypeError: Strings must be encoded before hashing

You are on the right track; just take it a step at a time before doing it all at once, to see how the pieces fit together:
import hashlib
import base64
text = "1234567890"
encoded = text.encode('utf-8')
encoded = hashlib.sha256(encoded).digest()
encoded = base64.b64encode(encoded)
print(text, str(encoded, encoding="utf-8"))
That should give you:
1234567890 x3Xnt1ft5jDNCqERO9ECZhqziCnKUqZCKreChi8mhkY=
As a "one-liner":
r['consumer_id'] = str(base64.b64encode(hashlib.sha256(r['consumer_id'].encode('utf-8')).digest()), encoding="utf-8")
As you can see, your current attempt is close; the parentheses are just misplaced, so .encode('utf-8') ends up being called on the hash object instead of on the string.
If you wanted to use this in a loop, say when iterating over a list of words or the rows of a csv you might do this:
import hashlib
import base64
def encode_text(text):
    encoded = text.encode('utf-8')
    encoded = hashlib.sha256(encoded).digest()
    encoded = base64.b64encode(encoded)
    return str(encoded, encoding="utf-8")

words = "1234567890 Hello World".split()
for word in words:
    print(word, encode_text(word))
Giving you:
1234567890 x3Xnt1ft5jDNCqERO9ECZhqziCnKUqZCKreChi8mhkY=
Hello GF+NsyJx/iX1Yab8k4suJkMG7DBO2lGAB9F2SCY4GWk=
World eK5kfcVUTSJxMKBoKlHjC8d3f7ttio8XAHRjo+zR1SQ=
Assuming the rest of your code works as you like, then:
import hashlib
import csv
import base64
def encode_text(text):
    encoded = text.encode('utf-8')
    encoded = hashlib.sha256(encoded).digest()
    encoded = base64.b64encode(encoded)
    return str(encoded, encoding="utf-8")

with open('File1.csv') as csvfile:
    with open('File2.csv', 'w') as newfile:
        reader = csv.DictReader(csvfile)
        for i, r in enumerate(reader):
            # writing csv headers
            if i == 0:
                newfile.write(','.join(r) + '\n')
            # hashing the 'CardNumber' column
            r['consumer_id'] = encode_text(r['consumer_id'])
            # writing the new row to the file with hashed 'CardNumber'
            newfile.write(','.join(r.values()) + '\n')

In addition to JonSG's answer about getting the hashing/encoding correct, I'd like to comment on how you're reading and writing the CSV files.
It took me a minute to understand how you're dealing with the header vs the body of the CSV here:
with open("File1.csv") as csvfile:
with open("File2.csv", "w") as newfile:
reader = csv.DictReader(csvfile)
for i, r in enumerate(reader):
print(i, r)
if i == 0:
newfile.write(",".join(r) + "\n") # writing csv headers
newfile.write(",".join(r.values()) + "\n")
At first, I didn't realize that calling join() on a dict would just give back the keys; then you move on to join the values. That's clever!
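In case that trips anyone else up: iterating over a dict yields its keys, which is why join() on the row dict produces the header line. A tiny sketch with made-up values:

row = {"ID": "1234", "Phone": "123-456-7890"}
print(",".join(row))           # ID,Phone  (iterating a dict gives its keys)
print(",".join(row.values()))  # 1234,123-456-7890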
I think it'd be clearer, and easier, to use the complementary DictWriter.
For clarity, I'm going to separate the reading, processing, and writing:
with open("File1.csv", newline="") as f_in:
reader = csv.DictReader(f_in, skipinitialspace=True)
rows = list(reader)
for row in rows:
row["ID"] = encode_text(row["ID"])
print(row)
with open("File2.csv", "w", newline="") as f_out:
writer = csv.DictWriter(f_out, fieldnames=rows[0])
writer.writeheader()
writer.writerows(rows)
In your case, you'll create your writer and need to give it the fieldnames. I just passed in the first row, and the DictWriter() constructor used the keys from that dict to establish the header values. You need to explicitly call the writeheader() method; then you can write your (processed) rows.
I started with this File1.csv:
ID, Phone, Email
1234680000000000, 123-456-7890, johnsmith@test.com
and ended up with this File2.csv:
ID,Phone,Email
tO2Knao73NzQP/rnBR5t8Hsm/XIQVnsrPKQlsXmpkb8=,123-456-7890,johnsmith@test.com
That organization means all your rows are read into memory first. You mentioned having "thousands of entries", but for those 3 fields of data that'll only be a few hundred KB, maybe a MB, of RAM.
If you do want to "stream" the data through, you'll want something like:
reader = csv.DictReader(f_in, skipinitialspace=True)
writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
writer.writeheader()

for row in reader:
    row["ID"] = encode_text(row["ID"])
    writer.writerow(row)
In this example, I passed reader.fieldnames to the fieldnames= param of the DictWriter constructor.
For dealing with multiple files, I'll just open and close them myself, because multiple with open(...) as x blocks can look cluttered to me:
f_in = open("File1.csv", newline="")
f_out = open("File2.csv", "w", newline="")
...
f_in.close()
f_out.close()
I don't see much real benefit to the context managers for simple utility scripts like these: if the program fails, Python will close the files for you anyway on exit.
But the conventional wisdom is to use the with open(...) as x context managers, like you were. You could nest them as you did, separate them with a comma, or, if you have Python 3.10+, use grouping parentheses for a cleaner look.
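For illustration, here's a minimal sketch of the 3.10+ parenthesized form, reusing the file names from the examples above:

with (
    open("File1.csv", newline="") as f_in,
    open("File2.csv", "w", newline="") as f_out,
):
    ...  # read, process, and write as shown earlier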

Related

how do I remove commas within columns from data retrieved from a CSV file

I have several CSV files that I need to process. Within the columns of each, there might be commas in the fields. Strings might also be sitting within double quotes. I managed to come up with something, but I am working with CSV files that are sometimes between 200 and 400 MB, and with my current code an 11 MB file takes 4 minutes to process.
What can I do here to make it run faster, or perhaps to process the data all at once instead of running through the code field by field?
import csv

def rem_lrspaces(data):
    data = data.lstrip()
    data = data.rstrip()
    data = data.strip()
    return data

def strip_bs(data):
    data = data.replace(",", " ")
    return data

def rem_comma(tmp1, tmp2):
    with open(tmp2, "w") as f:
        f.write("")
        f.close()
    file = open(tmp1, "r")
    reader = csv.reader(file, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True)
    for line in reader:
        for field in line:
            if "," in field:
                field = rem_lrspaces(strip_bs(field))
            with open(tmp2, "a") as myfile:
                myfile.write(field + ",")
        with open(tmp2, "a") as myfile:
            myfile.write("\n")
pdfsource=r"C:\automation\cutoff\test2"
csvsource=pdfsource
ofn = "T3296N17"
file_in = r"C:\automation\cutoff\test2"+chr(92)+ofn+".CSV"
file_out = r"C:\automation\cutoff\test2"+chr(92)+ofn+".TSV"
rem_comma(file_in,file_out)
A few low-hanging fruit:
strip_bs is too simple to justify the overhead of calling the function.
rem_lrspaces is redundantly stripping whitespace; one call to data.strip() is all you need, in which case it too is too simple to justify a separate function.
You are also spending a lot of time repeatedly opening the output file.
Also, it's better to pass already-open file handles to rem_comma, as it makes testing easier by allowing in-memory file-like objects to be passed as arguments (see the quick check after the code below).
This code simply builds a new list of fields from each line, then uses csv.writer to write the new fields back to the output file.
import csv

def rem_comma(f_in, f_out):
    reader = csv.reader(f_in, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True)
    writer = csv.writer(f_out)
    for line in reader:
        new_line = [field.replace(",", " ").strip() for field in line]
        writer.writerow(new_line)

ofn = "T3296N17"
file_in = r"C:\automation\cutoff\test2" + chr(92) + ofn + ".CSV"
file_out = r"C:\automation\cutoff\test2" + chr(92) + ofn + ".TSV"

with open(file_in, newline="") as f1, open(file_out, "w", newline="") as f2:
    rem_comma(f1, f2)
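On the testability point: because rem_comma now takes open file handles, you can exercise it entirely in memory with io.StringIO. A quick hypothetical check, not part of the original script:

import io

f_in = io.StringIO('"a,b", c ,d\n')
f_out = io.StringIO()
rem_comma(f_in, f_out)
print(f_out.getvalue())  # a b,c,d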

Adding custom delimiters back to a csv?

Currently, I take in a csv file using custom delimiters, "|". I then read it in and modify it using the code below:
import csv
ChangedDate = '2018-10-31'
firstfile = open('example.csv',"r")
firstReader = csv.reader(firstfile, delimiter='|')
firstData = list(firstReader)
outputFile = open("output.csv","w")
iteration = 0
for row in firstData:
    firstData[iteration][25] = ChangedDate
    iteration += 1
outputwriter = csv.writer(open("output.csv","w"))
outputwriter.writerows(firstData)
outputFile.close()
However, when I write the rows to my output file, they are comma separated. This is a problem because I am dealing with large financial data, where commas appear naturally (such as $8,000.00), hence the "|" delimiters of the original file. Is there a way to "re-delimit" my list before I write it to an output file?
You can provide the delimiter to the csv.writer:
with open("output.csv", "w") as f:
outputwriter = csv.writer(f, delimiter='|')
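For completeness, a minimal sketch of the whole read-modify-write cycle with the delimiter preserved (keeping your column index 25 and ChangedDate; untested against your actual data):

import csv

ChangedDate = '2018-10-31'

with open('example.csv', newline='') as f_in, open('output.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in, delimiter='|')
    writer = csv.writer(f_out, delimiter='|')  # same delimiter on the way out
    for row in reader:
        row[25] = ChangedDate  # update the date column in place
        writer.writerow(row)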

Read CSV with comma as linebreak

I have a file saved as .csv
"400":0.1,"401":0.2,"402":0.3
Ultimately I want to save the data in a proper format in a csv file for further processing. The problem is that there are no line breaks in the file.
pathname = r"C:\pathtofile\file.csv"
with open(pathname, newline='') as file:
    reader = file.read().replace(',', '\n')
    print(reader)

with open(r"C:\pathtofile\filenew.csv", 'w') as new_file:
    csv_writer = csv.writer(new_file)
    csv_writer.writerow(reader)
The printed output of reader looks exactly how I want it (or at least it's a format I can process further).
"400":0.1
"401":0.2
"402":0.3
And now I want to save that to a new csv file. However the output looks like
"""",4,0,0,"""",:,0,.,1,"
","""",4,0,1,"""",:,0,.,2,"
","""",4,0,2,"""",:,0,.,3
I'm sure it would be intelligent to convert the format to
400,0.1
401,0.2
402,0.3
at this stage instead of doing later with another script.
The main problem is that my current code
with open(pathname, newline='') as file:
    reader = file.read().replace(',', '\n')
    reader = csv.reader(reader, delimiter=':')
    x = []
    y = []
    print(reader)
    for row in reader:
        x.append(float(row[0]))
        y.append(float(row[1]))
    print(x)
    print(y)
works fine for the type of csv files I currently have, but doesn't work for these mentioned above:
y.append( float(row[1]) )
IndexError: list index out of range
So I'm trying to find a way to work with them too. I think I'm missing something obvious as I imagine that it can't be too hard to properly define the linebreak character and delimiter of a file.
with open(pathname, newline=',') as file:
yields
ValueError: illegal newline value: ,
The right way, with the csv module and without the replace() workaround or float casting:
import csv
with open('file.csv', 'r') as f, open('filenew.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quotechar=None)
    for r in reader:
        for i in r:
            writer.writerow(i.split(':'))
The resulting filenew.csv contents (according to your "intelligent" condition):
400,0.1
401,0.2
402,0.3
Nuances:
csv.reader and csv.writer objects treat comma , as default delimiter (no need to file.read().replace(',', '\n'))
quotechar=None is specified for csv.writer object to eliminate double quotes around the values being saved
You need to split the values to form a list to represent a row. Presently the code is splitting the string into individual characters to represent the row.
pathname = r"C:\pathtofile\file.csv"
with open(pathname) as old_file:
with open(r"C:\pathtofile\filenew.csv", 'w') as new_file:
csv_writer = csv.writer(new_file, delimiter=',')
text_rows = old_file.read().split(",")
for row in text_rows:
items = row.split(":")
csv_writer.writerow([int(items[0]), items[1])
If you look at the documentation for writerow, it says:
Write the row parameter to the writer’s file
object, formatted according to the current dialect.
But, you are writing an entire string in your code
csv_writer.writerow(reader)
because reader is a string at this point.
Now, the format you want to use in your CSV file is not clearly mentioned in the question. But as you said, if you can do some preprocessing to create a list of lists and pass each sublist to writerow(), you should be able to produce the required file format.
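For example, one way to do that preprocessing (a sketch assuming the "400":0.1,... layout shown in the question; the quote-stripping is my addition):

import csv

pathname = r"C:\pathtofile\file.csv"

with open(pathname, newline='') as f:
    raw = f.read().strip()

# '"400":0.1,"401":0.2' -> [['400', '0.1'], ['401', '0.2']]
rows = [pair.split(':') for pair in raw.split(',')]
rows = [[key.strip('"'), value] for key, value in rows]

with open(r"C:\pathtofile\filenew.csv", 'w', newline='') as new_file:
    csv.writer(new_file).writerows(rows)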

Pipe delimiter file, but no pipe inside data

Problem
I need to re-format a text from comma (,) separated values to pipe (|) separated values. Pipe characters within the values of the original (comma separated) text shall be replaced by a space for representation in the (pipe separated) result text.
The pipe separated result text shall be written back to the same file from which the original comma separated text has been read.
I am using python 2.6
Possible Solution
I could read the file first and replace all pipes with spaces, and then replace each comma (,) with a pipe (|).
Is there a better way to achieve this?
Don't reinvent the value-separated file parsing wheel. Use the csv module to do the parsing and the writing for you.
The csv module will add "..." quotes around values that contain the separator, so in principle you don't need to replace the | pipe symbols in the values. To replace the original file, write to a new (temporary) outputfile then move that back into place.
import csv
import os
outputfile = inputfile + '.tmp'
with open(inputfile, 'rb') as inf, open(outputfile, 'wb') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='|')
    writer.writerows(reader)

os.remove(inputfile)
os.rename(outputfile, inputfile)
For an input file containing:
foo,bar|baz,spam
this produces
foo|"bar|baz"|spam
Note that the middle column is wrapped in quotes.
If you do need to replace the | characters in the values, you can do so as you copy the rows:
outputfile = inputfile + '.tmp'
with open(inputfile, 'rb') as inf, open(outputfile, 'wb') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='|')
    for row in reader:
        writer.writerow([col.replace('|', ' ') for col in row])

os.remove(inputfile)
os.rename(outputfile, inputfile)
Now the output for my example becomes:
foo|bar baz|spam
Sounds like you're trying to work with a variation of CSV - in that case, Python's CSV library might as well be what you need. You can use it with custom delimiters and it will auto-handle escaping for you (this example was yanked from the manual and modified):
import csv
with open('eggs.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='|')
    spamwriter.writerow(['One', 'Two', 'Three'])
There are also ways to modify quoting and escaping and other options. Reading works similarly.
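Reading back works the same way; for example, a sketch mirroring the manual's reader example (with the same 'rb' mode the Python 2 code above uses):

import csv

with open('eggs.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter='|')
    for row in spamreader:
        print(row)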
You can create a temporary file from the original that has the pipe characters replaced, and then replace the original file with it when the processing is done:
import csv
import tempfile
import os

filepath = 'C:/Path/InputFile.csv'

with open(filepath, 'rb') as fin:
    reader = csv.DictReader(fin)
    fout = tempfile.NamedTemporaryFile(dir=os.path.dirname(filepath),
                                       delete=False)
    temp_filepath = fout.name
    writer = csv.DictWriter(fout, reader.fieldnames, delimiter='|')
    # writer.writeheader()  # requires Python 2.7
    header = dict(zip(reader.fieldnames, reader.fieldnames))
    writer.writerow(header)
    for row in reader:
        for k, v in row.items():
            row[k] = v.replace('|', ' ')
        writer.writerow(row)
    fout.close()

os.remove(filepath)
os.rename(temp_filepath, filepath)

Python Hash not being updated in csv file output

I have working code that takes a directory of csv files and hashes one column of each line, then aggregates all the files together. The issue is that the output shows the same hash value on every line, as if the hash is not being re-run for each line. Here is the code:
import glob
import hashlib

files = glob.glob('*.csv')
output = "combined.csv"

with open(output, 'w') as result:
    for thefile in files:
        f = open(thefile)
        m = f.readlines()
        for line in m[1:]:
            fields = line.split()
            hash_object = hashlib.md5(b'(fields[2])')
            newline = fields[0], fields[1], hash_object.hexdigest(), fields[3]
            joined_line = ','.join(newline)
            result.write(joined_line + '\n')
        f.close()
You are creating a hash of a fixed bytestring b'(fields[2])'. That value has no relationship to your CSV data, even though it uses the same characters as are used in your row variable name.
You need to pass in bytes from your actual row:
hash_object = hashlib.md5(fields[2].encode('utf8'))
I am assuming your fields[2] column is a string, so you'd need to encode it first to get bytes. The UTF-8 encoding can handle all codepoints that could possibly be contained in a string.
You also appear to be re-inventing the CSV reading and writing wheel; you probably should use the csv module instead:
import csv
# ...
with open(output, 'w', newline='') as result:
    writer = csv.writer(result)
    for thefile in files:
        with open(thefile, newline='') as f:
            reader = csv.reader(f)
            next(reader, None)  # skip first row
            for fields in reader:
                hash_object = hashlib.md5(fields[2].encode('utf8'))
                newrow = fields[:2] + [hash_object.hexdigest()] + fields[3:]
                writer.writerow(newrow)
