Using Python to automate data cleanup, some issues

Using Python to automate data cleanup, some issues - python

I have to process ~16,000 rows of data. Each row is a transaction record, with several parts. For example:
row= [ID, thing, widget]
What I would like to do is kind of simple- for each row, compare it to the rest of the rows one by one. If row A has a unique ID and unique widget, I want to write it to an outfile. Otherwise, I don't need it. (This program basically automates data cleanup for me.) Here's what I have so far:
try:
infile=open(file1, 'r')
for line in infile:
line_wk=line.split(",")
outfile=open(file2, 'r')
for line in outfile:
line_wk2=line.split(",")
if line_wk[0]==line_wk2[0]:
if line_wk[2]!=line_wk2[2]: #ID is not unique, but the widget is
to_write=','.join(line_wk) #queued to write later
else:
to_write=','.join(line_wk) #queued to write later
if len(to_write)>0:
outfile.close()
outfile=open(file2, 'a')
outfile.write(to_write)
outfile.close()
outfile=open(file2, 'r')
infile.close()
outfile.close()
except:
print("Something went wrong.")
Running this on a small test set, it stays within the 'try' block but otherwise just writes everything, not only the ones with a unique ID and widget. I assume there is an infinitely simpler way to do this. Any help is appreciated!

What you want to do is create a dictionary where the key is a tuple of (ID, widget) and the value is thing. Dictionary keys are guaranteed unique. So, your code would look something like this.
uniques = {}
with open("yourfile.txt") as infile:
for line in infile:
ID, thing, widget = line.strip().split(',')
uniques[(ID, widget)] = thing
with open("output.txt", "w") as outfile:
for k, v in uniques.iteritems():
outfile.write("%s,%s,%s\n" % (k[0], v, k[1]))
If preserving their original order is important then you can use OrderedDict from the collections package
You can also clean up how the outfile.write line is written, but it should work as is.
Lastly, since it appears you are reading/writing csv (comma separated values) format, you can make use of the csv module.
To test this I wrote a script
import random
import string
IDS = range(1, 100)
widgets = ['ITEM_%s' % (i, ) for i in range(10)]
thing_chars = list(string.uppercase + string.lowercase)
def get_thing():
return "".join(random.sample(thing_chars, 10))
with open("yourfile.txt", "w") as out:
for i in xrange(0, 16000):
ID = random.choice(IDS)
widget = random.choice(widgets)
thing = get_thing()
out.write("%s,%s,%s\n" % (ID, thing, widget))
It appears to run with the correct results.

Related

Correlate data from two CSVs and write the data to the first CSV using Python

I'm having trouble figuring out where to dive in on this personal project and I was hoping this community could help me create a Python script to deal with this data.
I have a CSV file that contains a list of meals fed to dogs at an animal rescue, associated by with the kennel number:
Source CSV - mealsandtreats.csv
blank_column,Kennel_Number,Species,Food,Meal_ID
,1,Dog,Meal,11.2
,5,Dog,Meal,45.2
,3,Dog,Meal,21.4
,4,Dog,Meal,17
,2,Dog,Meal,11.2
,4,Dog,Meal,21.4
,6,Dog,Meal,17
,2,Dog,Meal,45.2
I have a second CSV file that provides a key which maps the meals to what treats come with the meal:
Meal to Treat Key - MealsToTreatsKey.csv
Meals_fed,Treats_fed
10.1,2.4
11.2,2.4
13.5,3
15.6,3.2
17,3.2
20.1,5.1
21.4,5.2
35.7,7.7
45.2,7.9
I need to take every meal type (eg; drop duplicate entries) that was delivered from table 1, find the associated treat type, and then create an individual entry for every time a treat was served to a specific kennel. The final result should look something like this:
Result CSV - mealsandtreats.csv
blank_column,Kennel_Number,Species,Food,Meal_ID
,1,Dog,Meal,11.2
,5,Dog,Meal,45.2
,3,Dog,Meal,21.4
,4,Dog,Meal,17
,2,Dog,Meal,11.2
,4,Dog,Meal,21.4
,6,Dog,Meal,17
,2,Dog,Meal,45.2
,1,Dog,Treat,2.4
,5,Dog,Treat,7.9
,3,Dog,Treat,5.2
,4,Dog,Treat,3.2
,1,Dog,Treat,2.4
,4,Dog,Treat,5.2
Would prefer to do this with the csv module and not Pandas, but I'm open to using Pandas if necessary.
I have a bit of code so far just opening the CSVs, but I'm really stuck on where to go next:
import csv
with open('./meals/results/foodToTreats.csv', 'r') as t1,
open('./results/food.csv', 'r') as t2:
key = t1.readlines()
map = t2.readlines()
with open('./results/food.csv', 'w') as outFileF:
for line in map:
if line not in key:
outFileF.write(line)
with open('./results/foodandtreats.csv', 'w') as outFileFT:
for line in map:
if line not in key:
outFileFT.write(line)
So basically I just need to take every treat entry in the 2nd sheet, search for matching associated food entries in the 1st sheet, look up the kennel number associated with that entry and then write it to the 1st sheet.
Giving it my best shot in pseudo code, something like:
for x in column 0,y:
y,1 = Z
food = x
treat = y
kennel_number = z
when x,z:
writerows('', {'kennel_number"}, 'species', '{food/treat}',
{'meal_id"})
Update: Here is the exact code I'm using, thanks to #wwii. Seeing a minor bug:
import csv
import collections
treats = {}
with open('mealsToTreatsKey.csv') as f2:
for line in f2:
meal,treat = line.strip().split(',')
treats[meal] = treat
new_items = set()
Treat = collections.namedtuple('Treat', ['blank_column','Kennel_Number','Species','Food','Meal_ID'])
with open('foodandtreats.csv') as f1:
reader = csv.DictReader(f1)
for row in reader:
row['Food'] = 'Treat'
row['Meal_ID'] = treats[row['Meal_ID']]
new_items.add(Treat(**row))
fieldnames = reader.fieldnames
with open('foodandtreats.csv', 'a') as f1:
writer = csv.DictWriter(f1, fieldnames)
for row in new_items:
writer.writerow(row._asdict())
This works perfectly except for one small bug. The first new row written isn't starting on its own line:
enter image description here

Make a dictionary mapping meals to treats
treats = {}
with open(treatfile) as f2:
for line in f2:
meal,treat = line.strip().split(',')
treats[meal] = treat
Iterate over the meal file and create set of new entries. Use namedtuples for the new items.
import collections
new_items = set()
Treat = collections.namedtuple('Treat', ['blank_column','Kennel_Number','Species','Food','Meal_ID'])
with open(mealfile) as f1:
reader = csv.DictReader(f1)
for row in reader:
row['Food'] = 'Treat'
row['Meal_ID'] = treats[row['Meal_ID']]
new_items.add(Treat(**row))
fieldnames = reader.fieldnames
Open the meal file (again) for appending and write the new entries
with open(mealfile, 'a') as f1:
writer = csv.DictWriter(f1, fieldnames)
for row in new_items:
writer.writerow(row._asdict())
If the meals file does not end with a newline character, you will need to add one before writing the new treat lines. Since you have control of the files you should just make sure it always ends in a blank line.

Python 2 - iterating through csv with determinating specific lines as dicitonary

I generated csv from multiple dictionaries (to be readable and editable too) with help of this question. Output is simple
//Dictionary
key,value
key2,value2
//Dictionary2
key4, value4
key5, value5
i want double backslash to be separator to create new dictionary, but every calling csv.reader(open("input.csv")) evaluates through lines so i have no use of:
import csv
dict = {}
for key, val in csv.reader(open("input.csv")):
dict[key] = val
Thanks for helping me out..
Edit: i made this piece of.. well "code".. I'll be glad if you can check it out and review:
#! /usr/bin/python
import csv
# list of dictionaries
l = []
# evalute throught csv
for row in csv.reader(open("test.csv")):
if row[0].startswith("//"):
# stripped "//" line is name for dictionary
n = row[0][2:]
# append stripped "//" line as name for dictionary
#debug
print n
l.append(n)
#debug print l[:]
elif len(row) == 2:
# debug
print "len(row) %s" % len(row)
# debug
print "row[:] %s" % row[:]
for key, val in row:
# print key,val
l[-1] = dic
dic = {}
dic[key] = val
# debug
for d in l:
print l
for key, value in d:
print key, value
unfortunately i got this Error:
DictName
len(row) 2
row[:] ['key', ' value']
Traceback (most recent call last):
File "reader.py", line 31, in <module>
for key, val in row:
ValueError: too many values to unpack

Consider not using CSV
First of all, your overall strategy to the data problem is probably not optimal. The less tabular your data looks, the less sense it makes to keep it in a CSV file (though your needs aren't too far out of the realm).
For example, it would be really easy to solve this problem using json:
import json
# First the data
data = dict(dict1=dict(key1="value1", key2="value2"),
dict2=dict(key3="value3", key4="value4"))
# Convert and write
js = json.dumps(data)
f = file("data.json", 'w')
f.write(js)
f.close()
# Now read back
f = file("data.json", 'r')
data = json.load(f)
print data
Answering the question as written
However, if you are really set on this strategy, you can do something along the lines suggested by jonrsharpe. You can't just use the csv module to do all the work for you, but actually have to go through and filter out (and split by) the "//" lines.
import csv
import re
def header_matcher(line):
"Returns something truthy if the line looks like a dict separator"
return re.match("//", line)
# Open the file and ...
f = open("data.csv")
# create some containers we can populate as we iterate
data = []
d = {}
for line in f:
if not header_matcher(line):
# We have a non-header row, so we make a new entry in our draft dictionary
key, val = line.strip().split(',')
d[key] = val
else:
# We've hit a new header, so we should throw our draft dictionary in our data list
if d:
# ... but only if we actually have had data since the last header
data.append(d)
d = {}
# The very last chunk will need to be captured as well
if d:
data.append(d)
# And we're done...
print data
This is quite a bit messier, and if there is any chance of needed to escape commas, it will get messier still. If you needed, you could probably find a clever way of chunking up the file into generators that you read with CSV readers, but it won't be particularly clean/easy (I started an approach like this but it looked like pain...). This is all a testament to your approach likely being the wrong way to store this data.
An alternative if you're set on CSV
Another way to go if you really want CSV but aren't stuck on the exact data format you specify: Add a column in the CSV file corresponding to the dictionary the data should go into. Imagine a file (data2.csv) that looks like this:
dict1,key1,value1
dict1,key2,value2
dict2,key3,value3
dict2,key4,value4
Now we can do something cleaner, like the following:
import csv
data = dict()
for chunk, key, val in csv.reader(file('test2.csv')):
try:
# If we already have a dict for the given chunk id, this should add the key/value pair
data[chunk][key] = val
except KeyError:
# Otherwise, we catch the exception and add a fresh dictionary with the key/value pair
data[chunk] = {key: val}
print data
Much nicer...
The only good argument for doing something closer to what you have in mind over this is if there is LOTS of data, and space is a concern. But that is not very likely to be case in most situations.
And pandas
Oh yes... one more possible solution is pandas. I haven't used it much yet, so I'm not as much help, but there is something along the lines of a group_by function it provides, which would let you group by the first column if you end up structuring the data as in the the 3-column CSV approach.

I decided to use json instead
Reading this is easier for the program and there's no need to filter text. For generating the data inside database in external file.json will serve python program.
#! /usr/bin/python
import json
category1 = {"server name1":"ip address1","server name2":"ip address2"}
category2 = {"server name1":"ip address1","server name1":"ip address1"}
servers = { "category Alias1":category1,"category Alias2":category2}
js = json.dumps(servers)
f = file("servers.json", "w")
f.write(js)
f.close()
# Now read back
f = file("servers.json", "r")
data = json.load(f)
print data
So the output is dictionary containing keys for categories and as values are another dictionaries. Exactly as i wanted.

CSV parsing in Python

I want to parse a csv file which is in the following format:
Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3
Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3
Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3
Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3
and would like to turn this into tab seperated format like in the following:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
Number of TestAttributes vary from test to test. For some tests there are only 3 values, for some others 7, etc. Also as in TestName4 example, some tests are executed more than once and hence each execution has its own TestAttributeValue line. (in the example testname4 is executed 3 times, hence we have 3 value lines)
I am new to python and do not have much knowledge but would like to parse the csv file with python. I checked 'csv' library of python and could not be sure whether it will be enough for me or shall I write my own string parser? Could you please help me?
Best

I'd use a solution using the itertools.groupby function and the csv module. Please have a close look at the documentation of itertools -- you can use it more often than you think!
I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:
import csv
from itertools import groupby
with open('my_data.csv') as ifile, open('my_out_data.csv', 'wb') as ofile:
# Use the csv module to handle reading and writing of delimited files.
reader = csv.reader(ifile)
writer = csv.writer(ofile, delimiter='\t')
# Skip info line
next(reader)
# Group datasets by the condition if len(row) > 0 or not, then filter
# out all empty lines
for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
test_data = list(group)
# Write header
writer.writerow([test_data[0][1]])
# Write transposed data
writer.writerows(zip(*test_data[1:]))
# Write blank line
writer.writerow([])
Output, given that the supplied data is stored in my_data.csv:
TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3

The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:
import csv
def print_section(section, f_out):
if len(section) > 0:
# find maximum column length
max_len = max([len(col) for col in section])
# build and print each row
for i in xrange(max_len):
f_out.write('\t'.join([col[i] if len(col) > i else '' for col in section]) + '\n')
f_out.write('\n')
with csv.reader(open(in_path, 'r')) as f_in, open(out_path, 'w') as f_out:
line = f_in.next()
section = []
for line in f_in:
# test for new "Test" section
if len(line) == 3 and line[0] == 'Test' and line[2] == '':
# write previous section data
print_section(section, f_out)
# reset section
section = []
# write new section header
f_out.write(line[1] + '\n')
else:
# add line to section
section.append(line)
# print the last section
print_section(section, f_out)
Note that you'll want to change 'Test' in the line[0] == 'Test' statement to the correct word for indicating the header line.
The basic idea here is that we import the file into a list of lists, then write that list of lists back out using an array comprehension to transpose it (as well as adding in blank elements when the columns are uneven).

Trouble with Python order of operations/loop

I have some code that is meant to convert CSV files into tab delimited files. My problem is that I cannot figure out how to write the correct values in the correct order. Here is my code:
for file in import_dir:
data = csv.reader(open(file))
fields = data.next()
new_file = export_dir+os.path.basename(file)
tab_file = open(export_dir+os.path.basename(file), 'a+')
for row in data:
items = zip(fields, row)
item = {}
for (name, value) in items:
item[name] = value.strip()
tab_file.write(item['name']+'\t'+item['order_num']...)
tab_file.write('\n'+item['amt_due']+'\t'+item['due_date']...)
Now, since both my write statements are in the for row in data loop, my headers are being written multiple times over. If I outdent the first write statement, I'll have an obvious formatting error. If I move the second write statement above the first and then outdent, my data will be out of order. What can I do to make sure that the first write statement gets written once as a header, and the second gets written for each line in the CSV file? How do I extract the first 'write' statement outside of the loop without breaking the dictionary? Thanks!

The csv module contains methods for writing as well as reading, making this pretty trivial:
import csv
with open("test.csv") as file, open("test_tab.csv", "w") as out:
reader = csv.reader(file)
writer = csv.writer(out, dialect=csv.excel_tab)
for row in reader:
writer.writerow(row)
No need to do it all yourself. Note my use of the with statement, which should always be used when working with files in Python.
Edit: Naturally, if you want to select specific values, you can do that easily enough. You appear to be making your own dictionary to select the values - again, the csv module provides DictReader to do that for you:
import csv
with open("test.csv") as file, open("test_tab.csv", "w") as out:
reader = csv.DictReader(file)
writer = csv.writer(out, dialect=csv.excel_tab)
for row in reader:
writer.writerow([row["name"], row["order_num"], ...])
As kirelagin points out in the commends, csv.writerows() could also be used, here with a generator expression:
writer.writerows([row["name"], row["order_num"], ...] for row in reader)

Extract the code that writes the headers outside the main loop, in such a way that it only gets written exactly once at the beginning.
Also, consider using the CSV module for writing CSV files (not just for reading), don't reinvent the wheel!

Ok, so I figured it out, but it's not the most elegant solutions. Basically, I just ran the first loop, wrote to the file, then ran it a second time and appended the results. See my code below. I would love any input on a better way to accomplish what I've done here. Thanks!
for file in import_dir:
data = csv.reader(open(file))
fields = data.next()
new_file = export_dir+os.path.basename(file)
tab_file = open(export_dir+os.path.basename(file), 'a+')
for row in data:
items = zip(fields, row)
item = {}
for (name, value) in items:
item[name] = value.strip()
tab_file.write(item['name']+'\t'+item['order_num']...)
tab_file.close()
for file in import_dir:
data = csv.reader(open(file))
fields = data.next()
new_file = export_dir+os.path.basename(file)
tab_file = open(export_dir+os.path.basename(file), 'a+')
for row in data:
items = zip(fields, row)
item = {}
for (name, value) in items:
item[name] = value.strip()
tab_file.write('\n'+item['amt_due']+'\t'+item['due_date']...)
tab_file.close()

Python csvreader separate lines

I am using the csv module for Python. I have had a good look at the CSV File Reading and Writing guide. I want to write a loop that runs through each row in the CSV file and assigns each row do a different variable. Does anyone have any ideas on this?
I am aware that there is a .next() and .line_num, I didn't think that these would be suitable in this case although I might be wrong.
Currently I have the following code, which print out the whole CSV file:
print_csv = csv.reader(open(csv_name, 'rb'), delimiter=' ', quotechar='|')
for row in print_csv:
print ', '.join(row)
[EDIT]
I am now aware, from this question thread, that the best way to do this will depend on what the first line is going to be used for.
What I want to do with the first line of the CSV file is to check whether it is in the correct format. This would involve:
checking to see whether it has the expected number of columns
checking to see whether the column headers have the correct name
checking to see whether to columns are in the correct order.

1.- Fast Answer
Instead of setting different independent variables you could do:
mydict = {}
for idx, item in enumerate(reader):
mydict['var%i' %idx] = item
then you call your var like:
mydict['var0']
Or still shorter in py3k:
mydict = {'var%i' %idx : item for idx, item in enumerate(reader)}
But this doesnt have much sense applied this way
As a commenter said this is not different than doing directly:
mylist = list(reader)
and then
mylist[0] # instead of 'var0'
and this option is much better.
The dictionary strategy is best suited when you extract the dictionary key from the very same reader line. For example, if it were at pos pos 0,:
mydict = {item[0] : item for item in reader}
2.- The Proper Answer
But if what you want is simply to check the format of the first line (maybe to calculate the space you need for printing), the method could be:
line = reader.next()
like_this = check_how_is_my(line)
if like_this == 'something_long':
spaces = 23
else:
spaces = 0
while True:
try:
print_with_spaces(line, spaces)
reader.next()
except StopIteration:
break

Well, you can obviously do:
var1 = reader.next()
var2 = reader.next()
var3 = reader.next()
var4 = reader.next()
var5 = reader.next()
or any variation thereof. This is not my favorite coding style, but it works.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Using Python to automate data cleanup, some issues - python

Related

Correlate data from two CSVs and write the data to the first CSV using Python

Python 2 - iterating through csv with determinating specific lines as dicitonary

CSV parsing in Python

Trouble with Python order of operations/loop

Python csvreader separate lines

Categories

Resources