Comparing data in two CSV files - python

I have two CSV files that contain all the products in the database (approx. 130,000 rows in each file). Currently the files are compared using Excel formulas, which is a long process.
I have written a script in Python which works well with small sample data; however, it isn't practical in the real world.
CSV layout is:
ID, Product Title, Cost, Price1, Price2, Price3, Status
import csv

data_old = []
data_new = []

with open(file_path_old) as f1:
    data = csv.reader(f1, delimiter=",")
    next(data)  # skip the header row
    for row in data:
        data_old.append(row)

with open(file_path_new) as f2:
    data = csv.reader(f2, delimiter=",")
    next(data)  # skip the header row
    for row in data:
        data_new.append(row)

for d1 in data_new:
    for d2 in data_old:
        if d2[0] == d1[0]:
            # IDs match, so check the rest of the data in the same row
            if d2[1] != d1[1]:
                ...
            if d2[2] != d1[2]:
                ...
The issue with the above is that, being a nested for loop, it goes through every row of the old data for each of the 130,000 rows of the new data (slow is an understatement).
What I'm trying to achieve is a list of all the products that have had a change in the title, cost, any of the 3 prices, or status, together with a boolean flag showing which data has changed from the previous week's data.
Desired Output CSV Format:
ID, Old Title, New Title, Changed, Old Cost, New Cost, Changed....
123, ABC, ABC, False, £12, £13, True....
SOLUTION:
import pandas as pd

# Read CSVs
old = pd.read_csv(old_file, sep=",")
new = pd.read_csv(new_file, sep=",")

# Join data together in a single data table
# ('PARTNO' is the ID column in the real files)
df_join = pd.concat([old.set_index('PARTNO'), new.set_index('PARTNO')],
                    axis='columns', keys=['Old', 'New'])

# Display the old and new data side by side
df_swap = df_join.swaplevel(axis='columns')[old.columns[1:]]

# Output to CSV
df_swap.to_csv(output_file)
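The join above puts old and new values side by side but doesn't include the boolean Changed flags from the desired output. A minimal sketch of one way to add them before the final to_csv call (not part of the original solution; note that products present in only one file will compare as changed because NaN != NaN):
# Sketch: add a "Changed" flag next to each column's Old/New pair.
df_swap = df_swap.copy()  # avoid SettingWithCopyWarning when adding columns
for col in old.columns[1:]:
    df_swap[(col, 'Changed')] = df_swap[(col, 'Old')] != df_swap[(col, 'New')]

# Group each column's Old/New/Changed triple together before writing out
# (within each group the order becomes alphabetical: Changed, New, Old).
df_swap = df_swap.sort_index(axis='columns', level=0)
df_swap.to_csv(output_file)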

Just use pandas
import pandas as pd
old = pd.read_csv(file_path_old, sep=',')
new = pd.read_csv(file_path_new, sep=',')
Then you can do whatever you need (just read the docs). For example, to compare the titles:
old['Title'] == new['Title'] gives you an array of booleans, one for every row in your file.
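Note that a direct element-wise comparison like this lines rows up by position, so it assumes both files list the same products in the same order. If they might not, merging on the ID column first is safer; a rough sketch (column names taken from the layout in the question):
merged = old.merge(new, on='ID', suffixes=('_old', '_new'))
title_changed = merged['Product Title_old'] != merged['Product Title_new']
print(merged.loc[title_changed, ['ID', 'Product Title_old', 'Product Title_new']])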

Do you care about new and removed products? If not, then you can get O(n) performance by using a dictionary.
Pick one CSV file and shove it into a dictionary keyed by id. Use lookups into the dictionary to find products that changed.
Note that I simplified your data down to one column for brevity.
data_old = [
    (1, 'alpha'),
    (2, 'bravo'),
    (3, 'delta'),
    (5, 'echo'),
]
data_new = [
    (1, 'alpha'),
    (2, 'zulu'),
    (4, 'foxtrot'),
    (6, 'mike'),
    (7, 'lima'),
]

changed_products = []
new_product_map = {id: product for (id, product) in data_new}
for id, old_product in data_old:
    if id in new_product_map and new_product_map[id] != old_product:
        changed_products.append(id)

print('Changed products: ', changed_products)
You can shorten this even more using a list comprehension:
new_product_map = {id: product for (id, product) in data_new}
changed_products = [id for (id, old_product) in data_old
                    if id in new_product_map and new_product_map[id] != old_product]
print('Changed products: ', changed_products)
The diff algorithm below can also track insertions and deletions. You can use it if your CSV files are sorted by id. If the files have no sensible order, you can sort the data in O(n log n) time after loading it and then run the diff (a one-line sort sketch follows the code below).
Either way, this will be faster than the O(n^2) loops in your original post:
data_old = # same setup as before
data_new = # ditto

old_index = 0
new_index = 0
new_products = []
deleted_products = []
changed_products = []

while old_index < len(data_old) and new_index < len(data_new):
    (old_id, old_product) = data_old[old_index]
    (new_id, new_product) = data_new[new_index]
    if old_id < new_id:
        print('Product removed : %d' % old_id)
        deleted_products.append(old_id)
        old_index += 1
    elif new_id < old_id:
        print('Product added : %d' % new_id)
        new_products.append(new_id)
        new_index += 1
    else:
        if old_product != new_product:
            print('Product %d changed from %s to %s' % (old_id, old_product, new_product))
            changed_products.append(old_id)
        else:
            print('Product %d did not change' % old_id)
        old_index += 1
        new_index += 1

if old_index != len(data_old):
    num_deleted = len(data_old) - old_index
    print('The last %d old items were deleted' % num_deleted)
    deleted_products += [id for (id, _) in data_old[old_index:]]
elif new_index != len(data_new):
    num_added = len(data_new) - new_index
    print('The last %d new items were completely new' % num_added)
    new_products += [id for (id, _) in data_new[new_index:]]

print('New products: ', new_products)
print('Changed products: ', changed_products)
print('Deleted products: ', deleted_products)
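If the lists aren't already ordered by id, the sort mentioned above is a one-liner per list (assuming each row is a tuple whose first element is the id):
# Sort both lists by id before running the diff loop above.
data_old.sort(key=lambda row: row[0])
data_new.sort(key=lambda row: row[0])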
PS: The suggestion to use pandas is a great one. Use it if possible.

Related

Looking for a more elegant and sophisticated solution when multiple if and for-loop are used

I am a beginner/intermediate user working with Python, and when I write elaborate code (at least for me), I always try to rewrite it to reduce the number of lines where possible.
Here is the code I have written.
It basically reads all values of one data frame looking for a specific string; if the string is found, it saves the index and value in a dictionary and drops the rows where that string was found. Then it does the same with the next string...
##### Reading CSV file values and looking for variant IDs ######

# Find Variant ID (rs000000) in CSV
# \d+ is necessary in case the line finds "rs" plus something else; rs\d+ looks for "rs" plus numbers
rs = df_draft[df_draft.apply(lambda x: x.str.contains(r"rs\d+"))].dropna(how='all').dropna(axis=1, how='all')
# Now we save the results found in a dict: key=index and value=variant ID
if rs.empty == False:
    ind = rs.index.to_list()
    vals = list(rs.stack().values)
    row2rs = dict(zip(ind, vals))
    print(row2rs)
    # We need to remove the rows where rs has been found, because if more than one
    # ID variant is found in the same row (i.e. rs# and NM_#) this code is going to
    # pick up the same variant more than once.
    for index, rs in row2rs.items():
        # Rows where the substring 'rs' has been found need to be deleted to avoid repetition
        # This will be done in df_draft
        df_draft = df_draft.drop(index)

## Same thing with the other ID variants
# Here with Variant ID (NM_0000000) in CSV
NM = df_draft[df_draft.apply(lambda x: x.str.contains(r"NM_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if NM.empty == False:
    ind = NM.index.to_list()
    vals = list(NM.stack().values)
    row2NM = dict(zip(ind, vals))
    print(row2NM)
    for index, NM in row2NM.items():
        df_draft = df_draft.drop(index)

# Here with Variant ID (NP_0000000) in CSV
NP = df_draft[df_draft.apply(lambda x: x.str.contains(r"NP_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if NP.empty == False:
    ind = NP.index.to_list()
    vals = list(NP.stack().values)
    row2NP = dict(zip(ind, vals))
    print(row2NP)
    for index, NP in row2NP.items():
        df_draft = df_draft.drop(index)

# Here with ClinVar field (RCV#) in CSV
RCV = df_draft[df_draft.apply(lambda x: x.str.contains(r"RCV\d+"))].dropna(how='all').dropna(axis=1, how='all')
if RCV.empty == False:
    ind = RCV.index.to_list()
    vals = list(RCV.stack().values)
    row2RCV = dict(zip(ind, vals))
    print(row2RCV)
    for index, RCV in row2RCV.items():
        df_draft = df_draft.drop(index)
I was wondering about a more elegant way of writing this simple but long code.
I have been thinking of sa
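For what it's worth, one way the four near-identical blocks above could be collapsed is to loop over the patterns; this is only a rough sketch (the patterns list and the results dict are made-up names, not from the original code):
# Rough sketch: loop over the ID patterns instead of repeating the block four times.
patterns = [r"rs\d+", r"NM_\d+", r"NP_\d+", r"RCV\d+"]
results = {}
for pattern in patterns:
    found = df_draft[df_draft.apply(lambda x: x.str.contains(pattern))]
    found = found.dropna(how='all').dropna(axis=1, how='all')
    if not found.empty:
        # key=index, value=variant ID, exactly as in the per-pattern blocks above
        results[pattern] = dict(zip(found.index.to_list(), list(found.stack().values)))
        # Drop the matched rows so later patterns don't pick up the same rows again.
        df_draft = df_draft.drop(found.index)
print(results)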

Fast distinct list of elements in an array python

I need to speed up counting the distinct elements in this code, and I'm not really sure how to make the count faster.
def process_columns(columns):
    with open(columns, 'r') as src:
        data = csv.reader(src, delimiter='\t', skipinitialspace=False)
        category = []
        group = columns.split("/")
        group = group[-1].split(".")
        if group[0] in ["data_1", "data_2"]:
            for row in data:
                if row[0] not in category:
                    category.append(row[0])
            message = "\t%d distinct elements from %ss" % (len(category), group[0])
            print message
A quick way to count the distinct elements in a Python list is:
array = [1,1,2,3,3,4,5,6,6]
n_elts = len(set(array))
print(n_elts)
Output:
6
Without much knowledge on your data, here's a quick way to maintain a set of unique words for your groups, using collections.defaultdict.
from collections import defaultdict

def process_columns(columns):
    categories = defaultdict(set)  # initialises a default dict with values as sets
    with open(columns, 'r') as src:
        data = csv.reader(src, delimiter='\t', skipinitialspace=False)
        group = columns.split("/")[-1].split('.')
        for row in data:
            categories[group[0]].add(row[0])  # add() the whole value; update() would add each character separately
    for k in categories:
        message = "\t%d distinct elements from %ss" % (len(categories[k]), k)
        print message
Initialise category as a set, and replace the inner if block that adds data into category with category.add:
category = set()
group = columns.split("/")
group = group[-1].split(".")
if group[0] in ["data_1", "data_2"]:
    for row in data:
        category.add(row[0])
Hope this is clear

Need to get the predicted values into a csv or Sframe or Dataframe in Python

Prerequisites
Dataset I'm working with is MovieLens 100k
Python packages I'm using are Surprise, io and Pandas
Agenda is to test the recommendation system using KNN (+ K-Fold) on Algorithms: Vector cosine & Pearson, for both User based CF & Item based CF
Briefing
So far, I have coded for both UBCF & IBCF as below
Q1. IBCF generates data as per the input given to it; I need it to export a CSV file since I need to find out the predicted values.
Q2. UBCF needs each entry to be given separately and doesn't work even with the code immediately below:
csvfile = 'pred_matrix.csv'
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    # algo.predict(user_id, item_id, estimated_ratings)
    for val in algo.predict(str(range(1, 943)), range(1, 1683), 1):
        writer.writerow([val])
Clearly it throws an error about the lists, as they cannot be comma separated.
Q3 Getting Precision & Recall on Evaluated and Recommended values
CODE
STARTS WITH
if ip == 1:
    one = 'cosine'
else:
    one = 'pearson'

choice = raw_input("Filtering Method: \n1.User based \n2.Item based \n Choice:")
if choice == '1':
    user_based_cf(one)
elif choice == '2':
    item_based_cf(one)
else:
    sim_op = {}
    exit(0)
UBCF:
def user_based_cf(co_pe):
    # INITIALIZE REQUIRED PARAMETERS
    path = '/home/mister-t/Projects/PycharmProjects/RecommendationSys/ml-100k/u.user'
    prnt = "USER"
    sim_op = {'name': co_pe, 'user_based': True}
    algo = KNNBasic(sim_options=sim_op)

    # RESPONSIBLE TO EXECUTE DATA SPLITS Mentioned in STEP 4
    perf = evaluate(algo, df, measures=['RMSE', 'MAE'])
    print_perf(perf)
    print type(perf)

    # START TRAINING
    trainset = df.build_full_trainset()

    # APPLYING ALGORITHM KNN Basic
    res = algo.train(trainset)
    print "\t\t >>>TRAINED SET<<<<\n\n", res

    # PEEKING PREDICTED VALUES
    search_key = raw_input("Enter User ID:")
    item_id = raw_input("Enter Item ID:")
    actual_rating = input("Enter actual Rating:")
    print algo.predict(str(search_key), item_id, actual_rating)
IBCF
def item_based_cf(co_pe):
    # INITIALIZE REQUIRED PARAMETERS
    path = '/location/ml-100k/u.item'
    prnt = "ITEM"
    sim_op = {'name': co_pe, 'user_based': False}
    algo = KNNBasic(sim_options=sim_op)

    # RESPONSIBLE TO EXECUTE DATA SPLITS = 2
    perf = evaluate(algo, df, measures=['RMSE', 'MAE'])
    print_perf(perf)
    print type(perf)

    # START TRAINING
    trainset = df.build_full_trainset()

    # APPLYING ALGORITHM KNN Basic
    res = algo.train(trainset)
    print "\t\t >>>TRAINED SET<<<<\n\n", res

    # Read the mappings raw id <-> movie name
    rid_to_name, name_to_rid = read_item_names(path)

    search_key = raw_input("ID:")
    print "ALGORITHM USED : ", one

    toy_story_raw_id = name_to_rid[search_key]
    toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

    # Retrieve inner ids of the nearest neighbors of Toy Story.
    k = 5
    toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=k)

    # Convert inner ids of the neighbors into names.
    toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                           for inner_id in toy_story_neighbors)
    toy_story_neighbors = (rid_to_name[rid]
                           for rid in toy_story_neighbors)

    print 'The ', k, ' nearest neighbors of ', search_key, ' are:'
    for movie in toy_story_neighbors:
        print(movie)
Q1. IBCF generates data as per the input given to it; I need it to export a CSV file since I need to find out the predicted values.
The easiest way to dump anything to a CSV would be to use the csv module!
import csv

res = [x, y, z, ....]
csvfile = "<path to output csv or txt>"

# Assuming res is a flat list
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in res:
        writer.writerow([val])

# Assuming res is a list of lists
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerows(res)
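For Q2, a rough sketch of what the per-pair loop could look like, assuming Surprise's predict(uid, iid) is called once per (user, item) combination and that pred.est holds the estimated rating (the ranges simply mirror the 943 users and 1682 items of MovieLens 100k):
import csv

csvfile = 'pred_matrix.csv'
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerow(['user_id', 'item_id', 'estimated_rating'])
    for user_id in range(1, 944):        # 943 users in MovieLens 100k
        for item_id in range(1, 1683):   # 1682 items
            # Predict one (user, item) pair at a time instead of passing whole lists.
            pred = algo.predict(str(user_id), str(item_id))
            writer.writerow([pred.uid, pred.iid, pred.est])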

Google chart input data

I have a python script to build inputs for a Google chart. It correctly creates column headers and the correct number of rows, but repeats the data for the last row in every row. I tried explicitly setting the row indices rather than using a loop (which wouldn't work in practice, but should have worked in testing). It still gives me the same values for each entry. I also had it working when I had this code on the same page as the HTML user form.
end1 = number of rows in the data table
end2 = number of columns in the data table represented by a list of column headers
viewData = data stored in database
c = connections['default'].cursor()
c.execute("SELECT * FROM {0}.\"{1}\"".format(analysis_schema, viewName))
viewData=c.fetchall()
curDesc = c.description
end1 = len(viewData)
end2 = len(curDesc)
Creates column headers:
colOrder = [curDesc[2][0]]
if activityOrCommodity == "activity":
    tableDescription = {curDesc[2][0]: ("string", "Activity")}
elif (activityOrCommodity == "commodity") or (activityOrCommodity == "aa_commodity"):
    tableDescription = {curDesc[2][0]: ("string", "Commodity")}

for i in range(3, end2):
    attValue = curDesc[i][0]
    tableDescription[curDesc[i][0]] = ("number", attValue)
    colOrder.append(curDesc[i][0])
Creates row data:
data = []
values = {}
for i in range(0, end1):
    for j in range(2, end2):
        if j == 2:
            values[curDesc[j][0]] = viewData[i][j].encode("utf-8")
        else:
            values[curDesc[j][0]] = viewData[i][j]
    data.append(values)

dataTable = gviz_api.DataTable(tableDescription)
dataTable.LoadData(data)
return dataTable.ToJSon(columns_order=colOrder)
An example javascript output:
var dt = new google.visualization.DataTable({cols:[{id:'activity',label:'Activity',type:'string'},{id:'size',label:'size',type:'number'},{id:'compositeutility',label:'compositeutility',type:'number'}],rows:[{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]}]}, 0.6);
It seems you're appending values to the data, but values is not being reset after each iteration...
I assume this is not intended, right? If so, just move values inside the first for loop in your row-setting code.
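In other words, a sketch of the suggested change to the row-building loop (values is now created fresh on each pass of the outer loop):
data = []
for i in range(0, end1):
    values = {}  # fresh dict per row instead of one shared dict for all rows
    for j in range(2, end2):
        if j == 2:
            values[curDesc[j][0]] = viewData[i][j].encode("utf-8")
        else:
            values[curDesc[j][0]] = viewData[i][j]
    data.append(values)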

Pandas: Efficiently splitting entries

I have a Pandas dataframe with columns as such:
event_id, obj_0_type, obj_0_foo, obj_0_bar, obj_1_type, obj_1_foo, obj_1_bar, obj_n_type, obj_n_foo, obj_n_bar, ....
For example:
import numpy as np
from pandas import DataFrame

col_idx = ['event_id']
for d in range(5):
    col_idx.extend(('obj_%d_id' % d, 'obj_%d_foo' % d, 'obj_%d_bar' % d))

event_id = np.array(range(0, 5))
data = np.random.rand(15, 5)
data = np.vstack((event_id, data))
df = DataFrame(data.T, index=range(5), columns=col_idx)
I would like to split each individual row of the dataframe so that I'd have a single entry per object, as such:
event_id, obj_type, obj_foo, obj_bar
Where event_id would be shared among all the objects of a given event.
There are ways of doing it by iterating over the dataframe rows and creating new Series objects, but those are atrociously slow and obviously unpythonic. Is there a simpler way I am missing?
With some suggestions from some people in #pydata on freenode, this is what I came up with:
from pandas import concat

data = []
for d in range(5):
    temp = df.loc[:, ['event_id', 'obj_%d_id' % d, 'obj_%d_foo' % d, 'obj_%d_bar' % d]]
    # Giving columns unique names.
    temp.columns = ['event_id', 'obj_id', 'obj_foo', 'obj_bar']
    # Creating a unique index.
    temp.index = temp['event_id'] * 10 + d
    data.append(temp)

concat(data)
This works and is reasonably fast!
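For reference, a fully vectorized alternative (not from the original thread) is to turn the obj_N_field column names into a two-level column index and stack the object level:
import pandas as pd

df2 = df.set_index('event_id')
# Split 'obj_0_foo' -> ('0', 'foo') to build a two-level column index.
df2.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')[1:]) for c in df2.columns], names=['obj_num', 'field'])
# Move the object number into the row index, giving one row per (event, object).
long_df = df2.stack(level='obj_num').reset_index()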
