I have working code that takes a directory of csv files and hashes one column of each line, then aggregates all files together. The issue is that the output repeats the first hash value on every line instead of re-running the hash for each line. Here is the code:
import glob
import hashlib

files = glob.glob('*.csv')
output = "combined.csv"

with open(output, 'w') as result:
    for thefile in files:
        f = open(thefile)
        m = f.readlines()
        for line in m[1:]:
            fields = line.split()
            hash_object = hashlib.md5(b'(fields[2])')
            newline = fields[0], fields[1], hash_object.hexdigest(), fields[3]
            joined_line = ','.join(newline)
            result.write(joined_line + '\n')
        f.close()
You are creating a hash of the fixed bytestring b'(fields[2])'. That value has no relationship to your CSV data; it merely happens to spell out the name of your row variable.
You need to pass in bytes from your actual row:
hash_object = hashlib.md5(fields[2].encode('utf8'))
I am assuming your fields[2] column is a string, so you'd need to encode it first to get bytes. The UTF-8 encoding can handle all codepoints that could possibly be contained in a string.
You also appear to be re-inventing the CSV reading and writing wheel; you probably should use the csv module instead:
import csv

# ...

with open(output, 'w', newline='') as result:
    writer = csv.writer(result)
    for thefile in files:
        with open(thefile, newline='') as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row
            for fields in reader:
                hash_object = hashlib.md5(fields[2].encode('utf8'))
                newrow = fields[:2] + [hash_object.hexdigest()] + fields[3:]
                writer.writerow(newrow)
Still getting my feet wet with Python, but my goal is to read a CSV file and hash a specific column using SHA256 then output in Base64.
Here is an example of the conversion that needs to take place
This calculator can be found at https://www.liavaag.org/English/SHA-Generator/
Here is the code I have currently
import hashlib
import csv
import base64

with open('File1.csv') as csvfile:
    with open('File2.csv', 'w') as newfile:
        reader = csv.DictReader(csvfile)
        for i, r in enumerate(reader):
            # writing csv headers
            if i == 0:
                newfile.write(','.join(r) + '\n')
            # hashing the 'CardNumber' column
            r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()
            # writing the new row to the file with hashed 'CardNumber'
            newfile.write(','.join(r.values()) + '\n')
The error I receive is
r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()
TypeError: Strings must be encoded before hashing
You are on the right track; you just need to take it a step at a time before doing it all at once, to see how it pieces together:
import hashlib
import base64
text = "1234567890"
encoded = text.encode('utf-8')
encoded = hashlib.sha256(encoded).digest()
encoded = base64.b64encode(encoded)
print(text, str(encoded, encoding="utf-8"))
That should give you:
1234567890 x3Xnt1ft5jDNCqERO9ECZhqziCnKUqZCKreChi8mhkY=
As a "one-liner":
r['consumer_id'] = str(base64.b64encode(hashlib.sha256(r['consumer_id'].encode('utf-8')).digest()), encoding="utf-8")
As you can see, your current use is close; the parentheses are just in the wrong places.
If you wanted to use this in a loop, say when iterating over a list of words or the rows of a csv you might do this:
import hashlib
import base64

def encode_text(text):
    encoded = text.encode('utf-8')
    encoded = hashlib.sha256(encoded).digest()
    encoded = base64.b64encode(encoded)
    return str(encoded, encoding="utf-8")

words = "1234567890 Hello World".split()
for word in words:
    print(word, encode_text(word))
Giving you:
1234567890 x3Xnt1ft5jDNCqERO9ECZhqziCnKUqZCKreChi8mhkY=
Hello GF+NsyJx/iX1Yab8k4suJkMG7DBO2lGAB9F2SCY4GWk=
World eK5kfcVUTSJxMKBoKlHjC8d3f7ttio8XAHRjo+zR1SQ=
Assuming the rest of your code works as you like, then:
import hashlib
import csv
import base64

def encode_text(text):
    encoded = text.encode('utf-8')
    encoded = hashlib.sha256(encoded).digest()
    encoded = base64.b64encode(encoded)
    return str(encoded, encoding="utf-8")

with open('File1.csv') as csvfile:
    with open('File2.csv', 'w') as newfile:
        reader = csv.DictReader(csvfile)
        for i, r in enumerate(reader):
            # writing csv headers
            if i == 0:
                newfile.write(','.join(r) + '\n')
            # hashing the 'CardNumber' column
            r['consumer_id'] = encode_text(r['consumer_id'])
            # writing the new row to the file with hashed 'CardNumber'
            newfile.write(','.join(r.values()) + '\n')
In addition to JonSG's answer about getting the hashing/encoding correct, I'd like to comment on how you're reading and writing the CSV files.
It took me a minute to understand how you're dealing with the header vs the body of the CSV here:
with open("File1.csv") as csvfile:
    with open("File2.csv", "w") as newfile:
        reader = csv.DictReader(csvfile)
        for i, r in enumerate(reader):
            print(i, r)
            if i == 0:
                newfile.write(",".join(r) + "\n")  # writing csv headers
            newfile.write(",".join(r.values()) + "\n")
At first, I didn't realize that calling join() on a dict would just give back the keys; then you move on to join the values. That's clever!
I think it'd be clearer, and easier, to use the complementary DictWriter.
For clarity, I'm going to separate the reading, processing, and writing:
with open("File1.csv", newline="") as f_in:
    reader = csv.DictReader(f_in, skipinitialspace=True)
    rows = list(reader)

for row in rows:
    row["ID"] = encode_text(row["ID"])
    print(row)

with open("File2.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=rows[0])
    writer.writeheader()
    writer.writerows(rows)
In your case, you'll create your writer and need to give it the fieldnames. I just passed in the first row and the DictWriter() constructor used the keys from that dict to establish the header values. You need to explicitly call the writeheader() method, then you can write your (processed) rows.
I started with this File1.csv:
ID, Phone, Email
1234680000000000, 123-456-7890, johnsmith#test.com
and ended up with this File2.csv:
ID,Phone,Email
tO2Knao73NzQP/rnBR5t8Hsm/XIQVnsrPKQlsXmpkb8=,123-456-7890,johnsmith#test.com
That organization means all your rows are read into memory first. You mentioned having "thousands of entries", but for those 3 fields of data that'll only be a few hundred KB of RAM, maybe a MB of RAM.
If you do want to "stream" the data through, you'll want something like:
reader = csv.DictReader(f_in, skipinitialspace=True)
writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)

writer.writeheader()
for row in reader:
    row["ID"] = encode_text(row["ID"])
    writer.writerow(row)
In this example, I passed reader.fieldnames to the fieldnames= param of the DictWriter constructor.
For dealing with multiple files, I'll just open and close them myself, because the multiple with open(...) as x can look cluttered to me:
f_in = open("File1.csv", newline="")
f_out = open("File2.csv", "w", newline="")
...
f_in.close()
f_out.close()
I don't see any real benefit to the context managers for these simple utility scripts: if the program fails, Python will close the files as it exits anyway.
But the conventional wisdom is to use the with open(...) as x context managers, like you were. You could nest them, like you were, separate them with a comma, or, if you have Python 3.10+, use grouping parentheses for a cleaner look (also covered in that Q/A).
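For example, a minimal sketch of the Python 3.10+ parenthesized form (using the same file names as above):

with (
    open("File1.csv", newline="") as f_in,
    open("File2.csv", "w", newline="") as f_out,
):
    ...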
I am attempting to remove special characters from a specific column within my csv file. But I can't figure out a way to specify the column I would like to change. Here is what I have:
import csv

input_file = open('src/list.csv', 'r')
output_file = open('src/list_new.csv', 'w')
data = csv.reader(input_file)
writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)  # dialect='excel')
specials = '#'

for line in data:
    line = str(line)
    new_line = str.replace(line, specials, '')
    writer.writerow(new_line.split(','))

input_file.close()
output_file.close()
Instead of searching through the whole file how can I specify the column ("Names") I would like to remove the special characters from?
Maybe use csv.DictReader? Then you can refer to the column by name.
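For example, a minimal sketch, assuming your file has a header row with a "Names" column (the paths and the '#' character are taken from your code):

import csv

with open('src/list.csv', 'r') as input_file, open('src/list_new.csv', 'w') as output_file:
    reader = csv.DictReader(input_file)
    writer = csv.DictWriter(output_file, fieldnames=reader.fieldnames, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for row in reader:
        # strip the special character from just the "Names" column
        row['Names'] = row['Names'].replace('#', '')
        writer.writerow(row)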
I am very new to Python. I am trying to read a CSV file and write the result to another CSV file. What I want to do is write selective rows from the input file to the output file. The code below reads every single row from the input file, 1.csv, and writes it to the output file, out.csv. How can I tweak it so that the output file contains only rows that start with READ in column 8 and are not equal to 0000 in column 10? Both conditions need to be met. Also, this block of code handles a single csv file; how can I do the same for, say, 10000 csv files? Finally, when I execute the code I see blank lines between rows in out.csv. How can I remove those?
import csv

f1 = open("1.csv", "r")
reader = csv.reader(f1)
header = reader.next()
f2 = open("out.csv", "w")
writer = csv.writer(f2)
writer.writerow(header)
for row in reader:
    writer.writerow(row)
f1.close()
f2.close()
Something like:
import os
import csv
import glob

class CSVReadWriter(object):
    def munge(self, filename, suffix):
        # insert the suffix between the base name and the extension
        name, ext = os.path.splitext(filename)
        return '{0}{1}{2}'.format(name, suffix, ext)

    def is_valid(self, row):
        # keep rows that start with READ in column 8 and are not 0000 in column 10
        return row[8].startswith('READ') and row[10] != '0000'

    def filter_csv(self, fin, fout):
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(reader.next())  # header
        for row in reader:
            if self.is_valid(row):
                writer.writerow(row)

    def read_write(self, iname, suffix):
        with open(iname, 'rb') as fin:
            oname = self.munge(iname, suffix)
            with open(oname, 'wb') as fout:
                self.filter_csv(fin, fout)

work_directory = r"C:\Temp\Data"
for filename in glob.glob(os.path.join(work_directory, '*.csv')):
    csvrw = CSVReadWriter()
    csvrw.read_write(filename, '_out')
I've made it a class so that you can override the munge and is_valid methods to suit different cases. Being a class also means that you can store state better, for example if you wanted to output lines between certain criteria.
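For instance, a minimal sketch of such an override (the column index and the 'ERROR' value are hypothetical, just to show the shape):

class KeepErrors(CSVReadWriter):
    def is_valid(self, row):
        # hypothetical filter: keep only rows whose 4th column reads ERROR
        return row[3] == 'ERROR'

KeepErrors().read_write('1.csv', '_errors')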
The extra spaces between lines that you mention are to do with \r\n carriage return and line feed line endings. Using open with 'wb' might resolve it.
Problem
I need to re-format a text from comma (,) separated values to pipe (|) separated values. Pipe characters within the values of the original (comma separated) text shall be replaced by a space for representation in the (pipe separated) result text.
The pipe separated result text shall be written back to the same file from which the original comma separated text has been read.
I am using python 2.6
Possible Solution
I could read the file first, replace all pipes with spaces, and then replace every comma (,) with a pipe (|).
Is there a better way to achieve this?
Don't reinvent the value-separated file parsing wheel. Use the csv module to do the parsing and the writing for you.
The csv module will add "..." quotes around values that contain the separator, so in principle you don't need to replace the | pipe symbols in the values. To replace the original file, write to a new (temporary) outputfile then move that back into place.
import csv
import os

outputfile = inputfile + '.tmp'

with open(inputfile, 'rb') as inf, open(outputfile, 'wb') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='|')
    writer.writerows(reader)

os.remove(inputfile)
os.rename(outputfile, inputfile)
For an input file containing:
foo,bar|baz,spam
this produces
foo|"bar|baz"|spam
Note that the middle column is wrapped in quotes.
If you do need to replace the | characters in the values, you can do so as you copy the rows:
outputfile = inputfile + '.tmp'

with open(inputfile, 'rb') as inf, open(outputfile, 'wb') as outf:
    reader = csv.reader(inf)
    writer = csv.writer(outf, delimiter='|')
    for row in reader:
        writer.writerow([col.replace('|', ' ') for col in row])

os.remove(inputfile)
os.rename(outputfile, inputfile)
Now the output for my example becomes:
foo|bar baz|spam
Sounds like you're trying to work with a variation of CSV - in that case, Python's CSV library might be just what you need. You can use it with custom delimiters and it will auto-handle escaping for you (this example was yanked from the manual and modified):
import csv

with open('eggs.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='|')
    spamwriter.writerow(['One', 'Two', 'Three'])
There are also ways to modify quoting and escaping and other options. Reading works similarly.
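For instance, a minimal reading sketch under the same assumptions (the pipe-delimited eggs.csv written above):

import csv

with open('eggs.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter='|')
    for row in spamreader:
        print(row)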
You can create a temporary file from the original that has the pipe characters replaced, and then replace the original file with it when the processing is done:
import csv
import tempfile
import os

filepath = 'C:/Path/InputFile.csv'

with open(filepath, 'rb') as fin:
    reader = csv.DictReader(fin)
    fout = tempfile.NamedTemporaryFile(dir=os.path.dirname(filepath),
                                       delete=False)
    temp_filepath = fout.name
    writer = csv.DictWriter(fout, reader.fieldnames, delimiter='|')
    # writer.writeheader() # requires Python 2.7
    header = dict(zip(reader.fieldnames, reader.fieldnames))
    writer.writerow(header)
    for row in reader:
        for k, v in row.items():
            row[k] = v.replace('|', ' ')
        writer.writerow(row)
    fout.close()

os.remove(filepath)
os.rename(temp_filepath, filepath)
I have a .csv with 3000 rows of data in 2 columns like this:
uc007ayl.1 ENSMUSG00000041439
uc009mkn.1 ENSMUSG00000031708
uc009mkn.1 ENSMUSG00000035491
In another folder I have graphs with names like this:
uc007csg.1_nt_counts.txt
uc007gjg.1_nt_counts.txt
You should notice that those graphs have names in the same format as my 1st column.
I am trying to use python to identify the rows that have a graph and print the name in the 2nd column to a new .txt file.
These are the codes I have
import csv

with open("C:/*my dir*/UCSC to Ensembl.csv", "r") as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print row[0]
But this is as far as I can get, and I am stuck.
You're almost there:
import csv
import os.path

with open("C:/*my dir*/UCSC to Ensembl.csv", "rb") as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        graph_filename = os.path.join("C:/folder", row[0] + "_nt_counts.txt")
        if os.path.exists(graph_filename):
            print(row[1])
Note that the repeated calls to os.path.exists may slow down the process, especially if the directory lies on a remote filesystem and does not contain significantly more files than the number of lines in the CSV file. You may want to use os.listdir instead:
import csv
import os

graphs = set(os.listdir("C:/graph folder"))

with open("C:/*my dir*/UCSC to Ensembl.csv", "rb") as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        if row[0] + "_nt_counts.txt" in graphs:
            print(row[1])
First, try to see if print row[0] really gives the correct file identifier.
Second, concatenate the path to the files with row[0] and check whether that full path exists (that is, whether the file exists) with os.path.exists(path) (see http://docs.python.org/library/os.path.html#os.path.exists ).
If it exists, you can write row[1] (the second column) to a new file with f2.write("%s\n" % row[1]) (first you have to open f2 for writing, of course).
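Putting those steps together, a minimal sketch (the graph directory and the output file name are hypothetical):

import csv
import os.path

with open("C:/*my dir*/UCSC to Ensembl.csv", "r") as f, open("out.txt", "w") as f2:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        graph = os.path.join("C:/graph folder", row[0] + "_nt_counts.txt")
        if os.path.exists(graph):
            f2.write("%s\n" % row[1])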
Well, the next step would be to check if the file exists. There are a few ways, but I like the EAFP approach.
try:
    with open(os.path.join(the_dir, row[0])) as f:
        pass
except IOError:
    print 'Oops no file'
the_dir is the directory where the files are.
result = open('result.txt', 'w')

for line in open('C:/*my dir*/UCSC to Ensembl.csv', 'r'):
    line = line.split(',')
    try:
        open('/path/to/dir/' + line[0] + '_nt_counts.txt', 'r')
    except:
        continue
    else:
        result.write(line[1] + '\n')

result.close()
import csv
import os

# get prefixes of all graphs in another directory
suff = '_nt_counts.txt'
graphs = set(fn[:-len(suff)] for fn in os.listdir('another dir') if fn.endswith(suff))

with open(r'c:\path to\file.csv', 'rb') as f:
    # extract 2nd column if the 1st one is a known graph prefix
    names = (row[1] for row in csv.reader(f, delimiter='\t') if row[0] in graphs)
    # write one name per line
    with open('output.txt', 'w') as output_file:
        for name in names:
            print >>output_file, name