Data clean-up in Python using large (1.7 GB) CSV files - python

I'm trying to do some data clean-up using Python. I have some large (1-2 GB) CSV files that I want to sort by some attribute (e.g. date, time) and then write out to another CSV file, so that the result can be used in Excel.
As I iterate through the rows, I run into some big memory issues. Initially I was using 32-bit IDLE, which wouldn't run my code at all, and then switched to 64-bit Spyder. Now the code runs, but stalls (it appears to process, and memory is consumed, but I haven't seen it move on in the last half hour) at the first iterative line.
My code is as follows. The process halts at the highlighted line. I'm pretty new to Python, so I'm sure my code is very primitive, but it's the best I can do! Thanks for your help in advance :)
import csv

def file_reader(filename):
    """function takes string of file name and returns a list of lists"""
    global master_list
    with open(filename, 'rt') as csvfile:
        rows = []
        master_list = []
        rowreader = csv.reader(csvfile, delimiter=',', quotechar='|')
        for row in rowreader:  # <-- the process halts here
            rows.append(','.join(row))
        for i in rows:
            master_list.append(i.replace(' ', '').replace('/2013', ',').split(","))
        return master_list
def trip_dateroute(date, route):
    dateroute_list = []
    for i in master_list:
        if str(i[1]) == date and str(i[3]) == route:
            dateroute_list.append(i)
    return dateroute_list
def output_csv(filename, listname):
    with open(filename, "w") as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quotechar='|', lineterminator='\n')
        for i in listname:
            writer.writerow(i)

If you don't need to hold the whole file content in memory, you can just process each line and immediately write it to the output file. Also, in your example you parse the CSV and then generate CSV again, but you don't seem to make use of the parsed data. If that is correct, you could simply do this:
def file_converter(infilename, outfilename):
    with open(infilename, 'rt') as infile, open(outfilename, "w") as outfile:
        for line in infile:
            # str.replace returns a new string, so assign the result back
            line = line.replace(' ', '').replace('/2013', ',')
            outfile.write(line)
If the function trip_dateroute() is used to filter the lines that should actually be written out, you can add that, too, but then you'd actually have to parse CSV:
import csv

def filter_row(row, date, route):
    return str(row[1]) == date and str(row[3]) == route

def cleanup(field):
    return field.replace(' ', '').replace('/2013', ',')

def file_converter(infilename, outfilename, date, route):
    with open(infilename, 'rt') as infile, open(outfilename, "w") as outfile:
        reader = csv.reader(infile, delimiter=',', quotechar='|')
        writer = csv.writer(outfile, delimiter=',', quotechar='|', lineterminator='\n')
        for row in reader:
            if filter_row(row, date, route):  # test the whole row once
                writer.writerow([cleanup(field) for field in row])
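To make this concrete, here is a runnable sketch of driving such a converter end to end, filtering whole rows before cleaning fields. The sample data, file paths, and the simplified cleanup() (which only strips spaces, so the output stays easy to read) are my own inventions, not from the question:

```python
import csv
import os
import tempfile

def filter_row(row, date, route):
    # keep only rows whose date (2nd field) and route (4th field) match
    return len(row) > 3 and row[1] == date and row[3] == route

def cleanup(field):
    return field.replace(' ', '')  # simplified: strip spaces only

def file_converter(infilename, outfilename, date, route):
    with open(infilename, 'rt', newline='') as infile, \
         open(outfilename, 'w', newline='') as outfile:
        reader = csv.reader(infile, delimiter=',', quotechar='|')
        writer = csv.writer(outfile, delimiter=',', quotechar='|',
                            lineterminator='\n')
        for row in reader:
            if filter_row(row, date, route):            # decide per row...
                writer.writerow([cleanup(f) for f in row])  # ...then clean fields

# demo with throwaway files (contents are made up)
tmp = tempfile.mkdtemp()
src, dst = os.path.join(tmp, 'in.csv'), os.path.join(tmp, 'out.csv')
with open(src, 'w', newline='') as f:
    f.write('1,04/11, x ,12\n2,05/11,y,13\n')
file_converter(src, dst, '04/11', '12')
with open(dst, newline='') as f:
    result = f.read()
print(result)
```

Only the first row survives the filter, and its fields come out with the spaces removed. Since one row is processed at a time, memory use stays flat no matter how large the input file is.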

Related

Hash a column in CSV and output in Base64

Still getting my feet wet with Python, but my goal is to read a CSV file and hash a specific column using SHA256 then output in Base64.
Here is an example of the conversion that needs to take place
This calculator can be found at https://www.liavaag.org/English/SHA-Generator/
Here is the code I have currently
import hashlib
import csv
import base64

with open('File1.csv') as csvfile:
    with open('File2.csv', 'w') as newfile:
        reader = csv.DictReader(csvfile)
        for i, r in enumerate(reader):
            # writing csv headers
            if i == 0:
                newfile.write(','.join(r) + '\n')
            # hashing the 'CardNumber' column
            r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()
            # writing the new row to the file with hashed 'CardNumber'
            newfile.write(','.join(r.values()) + '\n')
The error I receive is
r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()
TypeError: Strings must be encoded before hashing
You are on the right track; you just need to take it a step at a time before doing it all at once, to see how it pieces together:
import hashlib
import base64
text = "1234567890"
encoded = text.encode('utf-8')
encoded = hashlib.sha256(encoded).digest()
encoded = base64.b64encode(encoded)
print(text, str(encoded, encoding="utf-8"))
That should give you:
1234567890 x3Xnt1ft5jDNCqERO9ECZhqziCnKUqZCKreChi8mhkY=
As a "one-liner":
r['consumer_id'] = str(base64.b64encode(hashlib.sha256(r['consumer_id'].encode('utf-8')).digest()), encoding="utf-8")
As you can see, your current attempt is close; the parentheses are just in the wrong places.
If you wanted to use this in a loop, say when iterating over a list of words or the rows of a csv you might do this:
import hashlib
import base64

def encode_text(text):
    encoded = text.encode('utf-8')
    encoded = hashlib.sha256(encoded).digest()
    encoded = base64.b64encode(encoded)
    return str(encoded, encoding="utf-8")

words = "1234567890 Hello World".split()
for word in words:
    print(word, encode_text(word))
Giving you:
1234567890 x3Xnt1ft5jDNCqERO9ECZhqziCnKUqZCKreChi8mhkY=
Hello GF+NsyJx/iX1Yab8k4suJkMG7DBO2lGAB9F2SCY4GWk=
World eK5kfcVUTSJxMKBoKlHjC8d3f7ttio8XAHRjo+zR1SQ=
Assuming the rest of your code works as you like, then:
import hashlib
import csv
import base64

def encode_text(text):
    encoded = text.encode('utf-8')
    encoded = hashlib.sha256(encoded).digest()
    encoded = base64.b64encode(encoded)
    return str(encoded, encoding="utf-8")

with open('File1.csv') as csvfile:
    with open('File2.csv', 'w') as newfile:
        reader = csv.DictReader(csvfile)
        for i, r in enumerate(reader):
            # writing csv headers
            if i == 0:
                newfile.write(','.join(r) + '\n')
            # hashing the 'CardNumber' column
            r['consumer_id'] = encode_text(r['consumer_id'])
            # writing the new row to the file with hashed 'CardNumber'
            newfile.write(','.join(r.values()) + '\n')
In addition to JonSG's answer about getting the hashing/encoding correct, I'd like to comment on how you're reading and writing the CSV files.
It took me a minute to understand how you're dealing with the header vs the body of the CSV here:
with open("File1.csv") as csvfile:
    with open("File2.csv", "w") as newfile:
        reader = csv.DictReader(csvfile)
        for i, r in enumerate(reader):
            print(i, r)
            if i == 0:
                newfile.write(",".join(r) + "\n")  # writing csv headers
            newfile.write(",".join(r.values()) + "\n")
At first, I didn't realize that calling join() on a dict would just give back the keys; then you move on to join the values. That's clever!
I think it'd be clearer, and easier, to use the complementary DictWriter.
For clarity, I'm going to separate the reading, processing, and writing:
with open("File1.csv", newline="") as f_in:
    reader = csv.DictReader(f_in, skipinitialspace=True)
    rows = list(reader)

for row in rows:
    row["ID"] = encode_text(row["ID"])
    print(row)

with open("File2.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=rows[0])
    writer.writeheader()
    writer.writerows(rows)
In your case, you'll create your writer and need to give it the fieldnames. I just passed in the first row and the DictWriter() constructor used the keys from that dict to establish the header values. You need to explicitly call the writeheader() method, then you can write your (processed) rows.
I started with this File1.csv:
ID, Phone, Email
1234680000000000, 123-456-7890, johnsmith#test.com
and ended up with this File2.csv:
ID,Phone,Email
tO2Knao73NzQP/rnBR5t8Hsm/XIQVnsrPKQlsXmpkb8=,123-456-7890,johnsmith#test.com
That organization means all your rows are read into memory first. You mentioned having "thousands of entries", but for those 3 fields of data that'll only be a few hundred KB of RAM, maybe a MB of RAM.
If you do want to "stream" the data through, you'll want something like:
reader = csv.DictReader(f_in, skipinitialspace=True)
writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
writer.writeheader()

for row in reader:
    row["ID"] = encode_text(row["ID"])
    writer.writerow(row)
In this example, I passed reader.fieldnames to the fieldnames= param of the DictWriter constructor.
For dealing with multiple files, I'll just open and close them myself, because the multiple with open(...) as x can look cluttered to me:
f_in = open("File1.csv", newline="")
f_out = open("File2.csv", "w", newline="")
...
f_in.close()
f_out.close()
I don't see much real benefit to the context managers in simple utility scripts like these: when the program exits, even with an error, the files are closed anyway.
But the conventional wisdom is to use the with open(...) as x context managers, like you were. You could nest them, like you were, separate them with a comma, or, if you have Python 3.10+, use grouping parentheses for a cleaner look (also in that Q/A).
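For reference, a minimal runnable sketch of the Python 3.10+ grouped form; the sample file contents and paths here are made up for the demo:

```python
import csv
import os
import tempfile

d = tempfile.mkdtemp()
in_path = os.path.join(d, "File1.csv")
out_path = os.path.join(d, "File2.csv")
with open(in_path, "w", newline="") as f:
    f.write("ID, Phone\n1, 555-0100\n")

# Python 3.10+: parentheses group several context managers
# without backslash continuations or nesting
with (
    open(in_path, newline="") as f_in,
    open(out_path, "w", newline="") as f_out,
):
    reader = csv.DictReader(f_in, skipinitialspace=True)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(row)

with open(out_path, newline="") as f:
    result = f.read()
print(result)
```

Both files are closed together when the single with block exits, and the body stays at one indentation level.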

Write list to csv with new line without duplicates in iteration

The main problem is that I can't write a list to CSV properly; I got this result - https://i.imgur.com/Y9PzO7y.png
It spreads everything across a lot of columns, but I need only one column, and I don't want any duplicates.
What should I do?
import csv

matched_dynamic_pattern = []
matched_static_pattern = []
not_matched = []

with open('processes.csv', 'r') as t1, open('static_patterns.csv', 'r') as t2:
    commands = set()
    reader = csv.DictReader(t1, dialect='excel', delimiter=',')
    for row in reader:
        commands.add(row['Command Line'])
    static_patterns = set(t2.read().splitlines())

with open('results1.csv', 'w', newline='') as outFile:
    writer = csv.writer(outFile)
    for command in commands:
        if command.startswith('"C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe" --type') or command.startswith('"C:\\Users\\test\\AppData\\Local\\Microsoft\\Teams\\current\\Teams.exe" --type'):
            matched_dynamic_pattern.append(command)
        else:
            if command not in static_patterns:
                not_matched.append(command)
                writer.writerows([not_matched])  # THE PROBLEM LINE OF CODE
            if command in static_patterns:
                matched_static_pattern.append(command)

all_processes = len(commands)
exclude_dynamic = len(matched_dynamic_pattern)
exclude_static = len(matched_static_pattern)
print(all_processes - exclude_dynamic - exclude_static)
print('Results —', len(not_matched))
print(type(not_matched))
UPD:
I found a new solution:
with open('results1.csv', 'w', newline='') as outFile:
    for r in not_matched:
        outFile.write(r + "\n")
But the problem is: https://i.imgur.com/0j7DFfv.png
Your code accumulates non-matching commands in the not_matched list. However, the entire contents of that list are dumped to the CSV file each time a non-matching command is identified. To fix it, replace:
writer.writerows([not_matched]) # THE PROBLEM LINE OF CODE
with
writer.writerow([command]) # N.B. writerow, not writerows
Alternatively write the not_matched list to the CSV file in one go after the for loop terminates:
with open('results1.csv', 'w', newline='') as outFile:
    writer = csv.writer(outFile)
    for command in commands:
        if command.startswith('"C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe" --type') or command.startswith('"C:\\Users\\test\\AppData\\Local\\Microsoft\\Teams\\current\\Teams.exe" --type'):
            matched_dynamic_pattern.append(command)
        else:
            if command not in static_patterns:
                not_matched.append([command])
            else:
                matched_static_pattern.append(command)
    writer.writerows(not_matched)  # write all non-matched commands in one go
Notice also that if command in static_patterns: can simply be replaced with else:.
This solution is working for me!
with open('results1.csv', 'w', newline='') as outFile:
    writer = csv.writer(outFile)
    rows = zip(not_matched)
    for row in rows:
        writer.writerow(row)
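For what it's worth, the zip() call is only there to wrap each string in a 1-tuple so that writerow() treats it as a single field; a generator expression does the same thing more directly, and dict.fromkeys() handles the no-duplicates requirement. The sample commands below are made up:

```python
import csv
import io

not_matched = ["cmd1.exe", "cmd2.exe", "cmd1.exe"]

out = io.StringIO()
writer = csv.writer(out)
# dict.fromkeys() removes duplicates while keeping first-seen order;
# wrapping each command in a one-element list makes it a one-column row
writer.writerows([c] for c in dict.fromkeys(not_matched))
result = out.getvalue()
print(result)
```

This yields one command per row, in a single column, with the duplicate dropped.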

how do I remove commas within columns from data retrieved from a CSV file

I have several CSV files that I need to process. Within the columns of each, there might be commas in the fields. Strings might also be sitting within double quotes. I managed to come up with something that works, but I am dealing with CSV files that are sometimes between 200 - 400 MB, and with my current code an 11 MB file takes 4 minutes to process.
What can I do to make it run faster, or perhaps to process the entire data at once instead of running through the code field by field?
import csv

def rem_lrspaces(data):
    data = data.lstrip()
    data = data.rstrip()
    data = data.strip()
    return data

def strip_bs(data):
    data = data.replace(",", " ")
    return data

def rem_comma(tmp1, tmp2):
    with open(tmp2, "w") as f:
        f.write("")
        f.close()
    file = open(tmp1, "r")
    reader = csv.reader(file, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True)
    for line in reader:
        for field in line:
            if "," in field:
                field = rem_lrspaces(strip_bs(field))
            with open(tmp2, "a") as myfile:
                myfile.write(field + ",")
        with open(tmp2, "a") as myfile:
            myfile.write("\n")

pdfsource = r"C:\automation\cutoff\test2"
csvsource = pdfsource
ofn = "T3296N17"
file_in = r"C:\automation\cutoff\test2" + chr(92) + ofn + ".CSV"
file_out = r"C:\automation\cutoff\test2" + chr(92) + ofn + ".TSV"
rem_comma(file_in, file_out)
A few low-hanging fruit:
strip_bs is too simple to justify the overhead of calling the function.
rem_lrspaces is redundantly stripping whitespace; one call to data.strip() is all you need, in which case it too is too simple to justify a separate function.
You are also spending a lot of time repeatedly opening the output file.
Also, it's better to pass already-open file handles to rem_comma, as it makes testing easier by allowing in-memory file-like objects to be passed as arguments.
This code simply builds a new list of fields from each line, then uses csv.writer to write the new fields back to the output file.
import csv

def rem_comma(f_in, f_out):
    reader = csv.reader(f_in, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True)
    writer = csv.writer(f_out)
    for line in reader:
        new_line = [field.replace(",", " ").strip() for field in line]
        writer.writerow(new_line)

ofn = "T3296N17"
file_in = r"C:\automation\cutoff\test2" + chr(92) + ofn + ".CSV"
file_out = r"C:\automation\cutoff\test2" + chr(92) + ofn + ".TSV"
with open(file_in, newline='') as f1, open(file_out, 'w', newline='') as f2:
    rem_comma(f1, f2)
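As mentioned above, passing open file handles makes the function easy to test with in-memory streams. A quick sketch, repeating the function here so it runs on its own (the sample CSV content is made up):

```python
import csv
import io

def rem_comma(f_in, f_out):
    reader = csv.reader(f_in, quotechar='"', delimiter=',',
                        quoting=csv.QUOTE_ALL, skipinitialspace=True)
    writer = csv.writer(f_out)
    for line in reader:
        # replace in-field commas with spaces and trim whitespace
        writer.writerow([field.replace(",", " ").strip() for field in line])

# io.StringIO stands in for real files, so no disk I/O is needed to test
f_in = io.StringIO('"a,b",c\n"d,e",f\n')
f_out = io.StringIO()
rem_comma(f_in, f_out)
result = f_out.getvalue()
print(result)
```

The quoted fields come out with their commas turned into spaces, so the output rows no longer need quoting at all.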

How to read csv data, strip spaces/tabs and write to new csv file?

I have a large (1.6 million rows+) .csv file that has some data with leading spaces, tabs, and trailing spaces, and maybe even trailing tabs. I need to read the data in, strip all of that whitespace, and then spit the rows back out into a new .csv file, preferably with the most efficient code possible, using only built-in modules in Python 3.7.
Here is what I have that is currently working, except it only spits out the header over and over, and doesn't seem to take care of trailing tabs (not a huge deal, though):
def new_stripper(self, input_filename: str, output_filename: str):
    """
    new_stripper(self, filename: str):
    :param self: no idea what this does
    :param filename: name of file to be stripped, must have .csv at end of file
    :return: for now, it doesn't return anything...
    - still doesn't remove trailing tabs?? But it can remove trailing spaces
    - removes leading tabs and spaces
    - still needs to write to new .csv file
    """
    import csv
    csv.register_dialect('strip', skipinitialspace=True)
    reader = csv.DictReader(open(input_filename), dialect='strip')
    reader = (dict((k, v.strip()) for k, v in row.items() if v) for row in reader)
    for row in reader:
        with open(output_filename, 'w', newline='') as out_file:
            writer = csv.writer(out_file, delimiter=',')
            writer.writerow(row)

input_filename = 'testFile.csv'
output_filename = 'output_testFile.csv'
new_stripper(self='', input_filename=input_filename, output_filename=output_filename)
As written above, the code just prints the headers over and over in a single line. I've played around with the arrangement and indenting of the last four lines of the def with some different results, but the closest I've gotten is getting it to print the header row again and again on new lines each time:
...
# headers and headers for days
with open(output_filename, 'w', newline='') as out_file:
    writer = csv.writer(out_file, delimiter=',')
    for row in reader:
        writer.writerow(row)
EDIT1: Here's the result from the non-stripping correctly thing. Some of them have leading spaces that weren't stripped, some have trailing spaces that weren't stripped. It seems like the left-most column was properly stripped of leading spaces, but not trailing spaces; same with header row.
Update: Here's the solution I was looking for:
def get_data(self, input_filename: str, output_filename: str):
    import csv
    with open(input_filename, 'r', newline='') as in_file, open(output_filename, 'w', newline='') as out_file:
        r = csv.reader(in_file, delimiter=',')
        w = csv.writer(out_file, delimiter=',')
        for line in r:
            trim = (field.strip() for field in line)
            w.writerow(trim)

input_filename = 'testFile.csv'
output_filename = 'output_testFile.csv'
get_data(self='', input_filename=input_filename, output_filename=output_filename)
Don't make life complicated for yourself, "CSV" files are simple plain text files, and can be handled in a generic way:
with open('input.csv', 'r') as inf, open('output.csv', 'w') as of:
    for line in inf:
        trim = (field.strip() for field in line.split(','))
        of.write(','.join(trim) + '\n')
Alternatively, using the csv module:
import csv
import csv

with open('input.csv', 'r') as inf, open('output.csv', 'w') as of:
    r = csv.reader(inf, delimiter=',')
    w = csv.writer(of, delimiter=',')
    for line in r:
        trim = (field.strip() for field in line)
        w.writerow(trim)
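The difference between the two approaches shows up as soon as a field contains a quoted comma. A small illustration with made-up data:

```python
import csv
import io

line = '"Doe, John",42\n'

# naive split breaks the quoted field in two
naive = [f.strip() for f in line.split(',')]
print(naive)

# the csv module keeps the quoted field intact
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```

The naive version produces three fields, with the quote characters left inside them; the csv.reader version correctly yields two. If your data is guaranteed never to contain quoted commas, the plain-text version is fine and slightly faster; otherwise, stick with the csv module.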
Unfortunately I cannot comment, but I believe you might want to strip the whitespace from every entry in the csv (not just each line). If that is the case then, based on Jan's answer, this might do the trick:
with open('file.csv', 'r') as inf, open('output.csv', 'w') as of:
    for line in inf:
        of.write(','.join(list(map(str.strip, line.split(',')))) + '\n')
It splits each line by comma into a list of values, strips the whitespace from every element, then joins them back up and saves to the output file.
Your final reader variable is a generator of dicts, but csv.writer.writerow() expects a sequence of field values.
You can either use csv.DictWriter (writing the headers with writer.writeheader()), or extract the processed values into a list first and then write that to the csv.

Get length of csv file without ruining reader?

I am trying to do the following:
reader = csv.DictReader(open(self.file_path), delimiter='|')
reader_length = sum([_ for item in reader])
for line in reader:
    print line
However, the reader_length line exhausts the reader, making it unreadable afterwards. Note that I do not want to call list() on the reader, as the file is too big to read entirely into memory on my machine.
Use enumerate with a start value of 1; when you get to the end of the file, you will have the line count:
for count, line in enumerate(reader, 1):
    # do work
print count
Or, if you need the count at the start for some reason, sum using a generator expression and seek back to the start of the file:
with open(self.file_path) as f:
    reader = csv.DictReader(f, delimiter='|')
    count = sum(1 for _ in reader)
    f.seek(0)
    reader = csv.DictReader(f, delimiter='|')
    for line in reader:
        print(line)
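To make the rewind approach concrete, here is a self-contained version using an in-memory stream in place of the file (the sample data is made up):

```python
import csv
import io

data = "a|b\n1|2\n3|4\n"
f = io.StringIO(data)

reader = csv.DictReader(f, delimiter='|')
count = sum(1 for _ in reader)          # consumes the reader entirely
f.seek(0)                               # rewind the underlying stream
reader = csv.DictReader(f, delimiter='|')  # fresh reader re-reads the header
rows = list(reader)
print(count, rows)
```

The first pass counts two data rows (the header is consumed separately by DictReader), and after the seek the second reader yields the same two rows again.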
reader = list(csv.DictReader(open(self.file_path), delimiter='|'))
print len(reader)
is one way to do this, I suppose.
Another way to do it would be:
reader = csv.DictReader(open(self.file_path), delimiter='|')
for i, row in enumerate(reader):
    ...
num_rows = i + 1
