Python Duplicate Removal

I have a question about removing duplicates in Python. I've read a bunch of posts but have not yet been able to solve it. I have the following csv file:
EDIT
Input:
ID, Source, 1.A, 1.B, 1.C, 1.D
1, ESPN, 5,7,,,M
1, NY Times,,10,12,W
1, ESPN, 10,,Q,,M
Output should be:
ID, Source, 1.A, 1.B, 1.C, 1.D, duplicate_flag
1, ESPN, 5,7,,,M, duplicate
1, NY Times,,10,12,W, duplicate
1, ESPN, 10,,Q,,M, duplicate
1, NY Times, 5 (or 10 doesn't matter which one),7, 10, 12, W, not_duplicate
In words: if the ID is the same, take values from the row with source "NY Times"; if the "NY Times" row has a blank cell and the duplicate "ESPN" row has a value for that cell, take the value from the "ESPN" row. For output, flag the original two lines as duplicates and create a third line.
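For example, for one pair of rows the rule amounts to this (a hypothetical two-row sketch, not the actual script):
# Prefer the NY Times cell; fall back to the ESPN cell when it is blank.
ny_row = ['1', 'NY Times', '', '10', '12', 'W']
espn_row = ['1', 'ESPN', '5', '7', '', 'M']
merged = [ny if ny != '' else espn for ny, espn in zip(ny_row, espn_row)]
print(merged)  # ['1', 'NY Times', '5', '10', '12', 'W']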
To clarify a bit further, since I need to run this script on many different csv files with different column headers, I can't do something like:
def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("ID", "Source", "1.A", "1.B", "1.C", "1.D")
        reader = csv.DictReader(infile, fieldnames=input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("ID", "Source", "1.A", "1.B", "1.C", "1.D", "d_flag")
            writer = csv.DictWriter(outfile, fieldnames=output_fields)
            writer.writerow(dict((h, h) for h in output_fields))
            next(reader)
            first_row = next(reader)
            for next_row in reader:
                # stuff
Because I want the program to run on the first two columns independently of whatever other columns are in the table. In other words, "ID" and "Source" will be in every input file, but the rest of the columns will change depending on the file.
Would greatly appreciate any help you can provide! FYI, "Source" can only be: NY Times, ESPN, or Wall Street Journal and the order of priority for duplicates is: take NY Times if available, otherwise take ESPN, otherwise take Wall Street Journal. This holds for every input file.

The below code reads all of the records into a big dictionary whose keys are their identifiers and whose values are dictionaries mapping source names to entire data rows. Then it iterates through the dictionary and gives you the output you asked for.
import csv

header = None
idfld = None
sourcefld = None
record_table = {}

with open('input.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        row = [x.strip() for x in row]
        if header is None:
            header = row
            for i, fld in enumerate(header):
                if fld == 'ID':
                    idfld = i
                elif fld == 'Source':
                    sourcefld = i
            continue
        key = row[idfld]
        sourcename = row[sourcefld]
        if key not in record_table:
            record_table[key] = {sourcename: row, "all_rows": [row]}
        else:
            if sourcename in record_table[key]:
                # Same ID and same source seen before: fill in any blank cells.
                cur_row = record_table[key][sourcename]
                for i, fld in enumerate(row):
                    if cur_row[i] == '':
                        record_table[key][sourcename][i] = fld
            else:
                record_table[key][sourcename] = row
            record_table[key]["all_rows"].append(row)

print ', '.join(header) + ', duplicate_flag'
for recordid in record_table:
    rowdict = record_table[recordid]
    final_row = [''] * len(header)
    # Count the raw rows for this ID; len(rowdict) would also count the
    # "all_rows" key and mislabel unique records as duplicates.
    rowcount = len(rowdict["all_rows"])
    for sourcetype in ['NY Times', 'ESPN', 'Wall Street Journal']:
        if sourcetype in rowdict:
            row = rowdict[sourcetype]
            for i, fld in enumerate(row):
                if final_row[i] != '':
                    continue
                if fld != '':
                    final_row[i] = fld
    if rowcount > 1:
        for row in rowdict["all_rows"]:
            print ', '.join(row) + ', duplicate'
    print ', '.join(final_row) + ', not_duplicate'
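For reference, a minimal Python 3 sketch of the same approach (text-mode files and print() as a function instead of the Python 2 idioms above; it assumes the same input.csv layout and the priority order stated in the question, and guards against ragged rows):
import csv

PRIORITY = ['NY Times', 'ESPN', 'Wall Street Journal']

with open('input.csv', newline='') as f:
    rows = [[cell.strip() for cell in row] for row in csv.reader(f)]

header, data = rows[0], rows[1:]
id_col, source_col = header.index('ID'), header.index('Source')

# Group the raw rows by ID.
by_id = {}
for row in data:
    by_id.setdefault(row[id_col], []).append(row)

print(', '.join(header) + ', duplicate_flag')
for rows_for_id in by_id.values():
    merged = [''] * len(header)
    # Walk the sources in priority order, filling blank cells only.
    for source in PRIORITY:
        for row in rows_for_id:
            if row[source_col] == source:
                for i, cell in enumerate(row[:len(header)]):  # guard against ragged rows
                    if merged[i] == '' and cell != '':
                        merged[i] = cell
    if len(rows_for_id) > 1:
        for row in rows_for_id:
            print(', '.join(row) + ', duplicate')
    print(', '.join(merged) + ', not_duplicate')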

Related

extracting row from CSV file with Python / Django

Hey, I'm trying to extract a certain row from a CSV file with content in this form:
POS,Transaction id,Product,Quantity,Customer,Date
1,E100,TV,1,Test Customer,2022-09-19
2,E100,Laptop,3,Test Customer,2022-09-20
3,E200,TV,1,Test Customer,2022-09-21
4,E300,Smartphone,2,Test Customer,2022-09-22
5,E300,Laptop,5,New Customer,2022-09-23
6,E300,TV,1,New Customer,2022-09-23
7,E400,TV,2,ABC,2022-09-24
8,E500,Smartwatch,4,ABC,2022-09-25
The code I wrote is the following:
def csv_upload_view(request):
    print('file is being uploaded')
    if request.method == 'POST':
        csv_file = request.FILES.get('file')
        obj = CSV.objects.create(file_name=csv_file)
        with open(obj.file_name.path, 'r') as f:
            reader = csv.reader(f)
            reader.__next__()
            for row in reader:
                data = "".join(row)
                data = data.split(";")
                #data.pop()
                print(data[0], type(data))
                transaction_id = data[0]
                product = data[1]
                quantity = int(data[2])
                customer = data[3]
                date = parse_date(data[4])
In the console then I get the following output:
Quit the server with CONTROL-C.
[22/Sep/2022 15:16:28] "GET /reports/from-file/ HTTP/1.1" 200 11719
file is being uploaded
1E100TV1Test Customer2022-09-19 <class 'list'>
So I get the correct row, but everything is concatenated. If I instead put a space in the " ".join(row) I get the entire row separated by spaces. What I would like to do is access this row with
transaction_id = data[0]
product = data[1]
quantity = int(data[2])
customer = data[3]
date = parse_date(data[4])
but I always get an
IndexError: list index out of range
I also tried with data.replace(" ",";") but this gives me another error and the data type becomes a string instead of a list:
ValueError: invalid literal for int() with base 10: 'E'
Can someone please show me what I'm missing here?
I'm not sure why you are joining and re-splitting the row. And do you realize your split is using a semicolon, while the file is comma-separated?
I would expect something like this:
import csv
from collections import namedtuple

Transaction = namedtuple('Transaction', ['id', 'product', 'qty', 'customer', 'date'])

f_name = 'data.csv'

transactions = []  # to hold the result

with open(f_name, 'r') as src:
    src.readline()  # burn the header row
    reader = csv.reader(src)  # if you want to use csv reader
    for data in reader:
        # print(data) <-- to see what the csv reader gives you...
        t = Transaction(data[1], data[2], int(data[3]), data[4], data[5])
        transactions.append(t)

for t in transactions:
    print(t)
The above "catches" results with a namedtuple, which is obviously optional. You could put them in lists, etc.
Also csv.reader will do the splitting (by comma) by default. I edited my previous answer.
As far as your question goes... You mention extracting a "certain row" but you gave no indication of how you would find such a row. If you know the row index/number, you could burn lines with readline or just keep a counter while you read (see the sketch below). If you are looking for a keyword in the data, just add a conditional statement either before or after splitting up the line.
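A minimal sketch of the counter approach (the target row number 3 is made up for illustration):
import csv

target = 3  # hypothetical: the 1-based data row we want, not counting the header

with open('data.csv', 'r') as src:
    reader = csv.reader(src)
    next(reader)  # burn the header row
    for i, row in enumerate(reader, start=1):
        if i == target:
            print(row)  # ['3', 'E200', 'TV', '1', 'Test Customer', '2022-09-21']
            break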
This way you can split the rows (and find the row you want based on some provided value):
import csv

with open('data.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        # Line 0 is the header
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            line_count += 1
            # Here you can check whether the row value equals what you're looking for
            # row[0] = POS
            # row[1] = Transaction id
            # row[2] = Product
            # row[3] = Quantity
            # row[4] = Customer
            # row[5] = Date
            if row[2] == "TV":
                # If you want to join all values into a single string:
                data = ",".join(row)
                # Or make each field its own variable:
                transaction_id = row[1]
                product = row[2]
                quantity = row[3]
                customer = row[4]
                date = row[5]

CSV without unique headers to list of dicts with unique keys

I have a CSV file with headers on row 0. The headers are usually unique, but sometimes they are not; in this example the header for each of several comment columns is "Comment".
The problem with my function that makes dicts from CSVs is that it only keeps the last Comment column.
def csv_to_list_with_dicts(csvfile):
    with open(csvfile) as f:
        list_of_issues = [{k: v for k, v in row.items()}
                          for row in csv.DictReader(f, skipinitialspace=True)]
    return list_of_issues
My CSV file columns are like this:
User;ID;Comment;Comment;Comment
If one of the headers repeats, I need to add an index in the dict to make the key unique (like Comment1;Comment2, without changing the CSV); otherwise all comments end up under just Comment.
This returned just the way I wanted. Just tweaked yours a small bit, Happy Ahmad. HUGE THANKS!!! <3
def csv_to_list_with_dicts(csvfile):
    with open(csvfile, "r") as file:
        keys = file.readline().strip().split(",")  # strip() drops the trailing newline
        alteredKeys = []
        for eachKey in keys:
            counter = 0
            while eachKey in alteredKeys:
                counter += 1
                eachKey = eachKey[:len(eachKey) - (0 if counter == 1 else 1)] + str(counter)
            alteredKeys.append(eachKey)
        list_of_issues = []
        reader = csv.reader(file, delimiter=',', skipinitialspace=True)
        for eachLine in reader:
            eachIssue = dict()
            columnIndex = 0
            for eachColumn in eachLine:
                if columnIndex < len(alteredKeys):
                    eachIssue[alteredKeys[columnIndex]] = eachColumn
                columnIndex += 1
            list_of_issues.append(eachIssue)
    return list_of_issues
In this solution, I use an alteredKeys list that renames any repeated key in the header by adding an index at its end. Then I iterate over the remaining lines of the CSV file and make a dictionary from each one.
def csv_to_list_with_dicts(csvfile):
    with open(csvfile, "r") as file:
        keys = file.readline().strip().split(";")  # strip() drops the trailing newline
        alteredKeys = []
        for eachKey in keys:
            counter = 0
            while eachKey in alteredKeys:
                counter += 1
                eachKey = eachKey[:len(eachKey) - (0 if counter == 1 else 1)] + str(counter)
            alteredKeys.append(eachKey)
        list_of_issues = []
        for eachLine in file:
            eachIssue = dict()
            columnIndex = 0
            for eachColumn in eachLine.strip().split(";"):
                if columnIndex < len(alteredKeys):
                    eachIssue[alteredKeys[columnIndex]] = eachColumn
                columnIndex += 1
            list_of_issues.append(eachIssue)
    return list_of_issues
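For the header User;ID;Comment;Comment;Comment, the renaming loop in both versions produces the keys User, ID, Comment, Comment1 and Comment2, so every comment column survives into the dictionaries.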
It would be fairly easy to write code that automatically generates unique keys for you by keeping track of those already seen and producing a unique name for any that conflicts with an earlier one. Checking for that is quick if the names already seen are kept in a set, which features fast membership testing.
For example, assume this was in a CSV file named non-unique.csv:
User;ID;Comment;Comment;Comment
Jose;1138;something1;something2;something3
Gene;2907;abc;def;ghi
Guido;6450;jkl;mno;pqr
Code:
import csv

def csv_to_list_with_dicts(csv_filename):
    # Read the first row of the csv file.
    with open(csv_filename, encoding='utf-8', newline='') as csv_file:
        reader = csv.reader(csv_file, delimiter=';', skipinitialspace=True)
        names = next(reader)  # Header row.
    # Create a list of unique fieldnames from the names in the header row.
    seen = set()
    fieldnames = []
    for i, name in enumerate(names):
        if name in seen:
            name = f'_{i}'
        else:
            seen.add(name)
        fieldnames.append(name)
    # Read the entire file and make each row a dictionary with keys based on the fieldnames.
    with open(csv_filename, encoding='utf-8', newline='') as csv_file:
        reader = csv.DictReader(csv_file, fieldnames=fieldnames, delimiter=';',
                                skipinitialspace=True)
        next(reader)  # Ignore header row.
        return list(reader)

results = csv_to_list_with_dicts('non-unique.csv')

from pprint import pprint
pprint(results, sort_dicts=False, width=120)
Results:
[{'User': 'Jose', 'ID': '1138', 'Comment': 'something1', '_3': 'something2', '_4': 'something3'},
{'User': 'Gene', 'ID': '2907', 'Comment': 'abc', '_3': 'def', '_4': 'ghi'},
{'User': 'Guido', 'ID': '6450', 'Comment': 'jkl', '_3': 'mno', '_4': 'pqr'}]

Calculating row and column totals from CSV files

I have the following CSV file about family expenses:
Family, Medical, Travel, Education
Smith, 346, 566, 45
Taylor, 56,837,848
I want to be able to calculate the row totals and column totals. For example:
Smith = 346+566+45
Taylor = 56+837+848
Medical = 346+56
Travel = 566+837
Education = 45+848
I have the following so far:
import csv

file = open('Family expenses.csv', newline='')
reader = csv.reader(file)
header = next(reader)
data = [row for row in header]
ndata = []
x = 0
for x in range(0, 3):
    for i in data[x]:
        i.split(',')
        x += 1
        ndata.append(i)
rdata = [int(s) if s.isdecimal() else s for s in ndata]
There's no need for pandas for this; using DictReader makes it easy:
import csv

file = open("Family expenses.csv", newline="")
reader = csv.DictReader(file, skipinitialspace=True)

results = {}
for row in reader:
    results[row["Family"]] = 0  # initialize result for each family name
    for key, value in row.items():
        if key == "Family":
            continue
        if key not in results:  # initialize result for each category
            results[key] = 0
        results[key] += float(value)  # add value for category
        results[row["Family"]] += float(value)  # add value for family name

for key, result in results.items():
    print(key, result)
I used skipinitialspace because there were some whitespaces in your CSV data.
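For the sample data above, that should print something close to:
Smith 957.0
Medical 402.0
Travel 1403.0
Education 893.0
Taylor 1741.0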
Using a list in Python, here you go:
import csv

file = open('Family expenses.csv', newline='')
reader = csv.reader(file)
header = next(reader)  # read the first row and skip it (header)
header.pop(0)  # drop the first column name so header holds only the category headings
num_of_cols = len(header)  # count the columns
sum_col = [0] * num_of_cols  # a list for the column-wise sums
for row in reader:
    sum_row = 0
    print(row[0])
    for i in range(1, len(row)):
        sum_row += int(row[i])
        sum_col[i - 1] += int(row[i])
    print(sum_row)
print(header)
print(sum_col)

Pivot a CSV string using python without using pandas or any similar library

You may think of this as yet another redundant question, but I have tried to go through all the similar questions asked, with no luck so far. In my specific use-case, I can't use pandas or any other similar library for this operation.
This is what my input looks like
AttributeName,Value
Name,John
Gender,M
PlaceofBirth,Texas
Name,Alexa
Gender,F
SurName,Garden
This is my expected output
Name,Gender,Surname,PlaceofBirth
John,M,,Texas
Alexa,F,Garden,
So far, I have tried to store my input in a dictionary and then write it out as a CSV string. But it is failing, as I am not sure how to handle the missing-column conditions. Here is my code so far:
reader = csv.reader(csvstring.split('\n'), delimiter=',')
csvdata = {}
csvfile = ''
for row in reader:
    if row[0] != '' and row[0] in csvdata and row[1] != '':
        csvdata[row[0]].append(row[1])
    elif row[0] != '' and row[0] in csvdata and row[1] == '':
        csvdata[row[0]].append(' ')
    elif row[0] != '' and row[1] != '':
        csvdata[row[0]] = [row[1]]
    elif row[0] != '' and row[1] == '':
        csvdata[row[0]] = [' ']
for key, value in csvdata.items():
    if value == ' ':
        csvdata[key] = []
csvfile += ','.join(csvdata.keys()) + '\n'
for row in zip(*csvdata.values()):
    csvfile += ','.join(row) + '\n'
For the above code as well, I took some help here. Thanks in advance for any suggestions/advice.
Edit #1: Updated the code to make clear that I am processing a CSV string instead of a CSV file.
What you need is something like this:
import csv

with open("in.csv") as infile:
    buffer = []
    item = {}
    lines = csv.reader(infile)
    for line in lines:
        if line[0] == 'Name':
            buffer.append(item.copy())
            item = {'Name': line[1]}
        else:
            item[line[0]] = line[1]
    buffer.append(item.copy())

for item in buffer[1:]:
    print(item)
If none of the attributes is mandatory, I think @framontb's solution needs to be rearranged in order to also work when the Name field is not given.
This is an import-free solution, and it's not super elegant.
I assume you already have the lines in this form, with these columns:
lines = [
    "Name,John",
    "Gender,M",
    "PlaceofBirth,Texas",
    "Gender,F",
    "Name,Alexa",
    "Surname,Garden"  # modified typo here: SurName -> Surname
]
cols = ["Name", "Gender", "Surname", "PlaceofBirth"]
We need to distinguish one record from another, and without mandatory fields the best I can do is start considering a new record when an attribute has already been seen.
To do this, I use a temporary list of attributes, tempcols, from which I remove elements until an error is raised, i.e., a new record begins.
Code:
csvdata = {k: [] for k in cols}
tempcols = list(cols)
for line in lines:
    attr, value = line.split(",")
    try:
        csvdata[attr].append(value)
        tempcols.remove(attr)
    except ValueError:
        for c in tempcols:  # now tempcols has only "missing" attributes
            csvdata[c].append("")
        tempcols = [c for c in cols if c != attr]
for c in tempcols:
    csvdata[c].append("")

# write csv string with the code you provided
csvfile = ""
csvfile += ",".join(csvdata.keys()) + "\n"
for row in zip(*csvdata.values()):
    csvfile += ",".join(row) + "\n"
>>> print(csvfile)
Name,PlaceofBirth,Surname,Gender
John,Texas,,M
Alexa,,Garden,F
Or, if you want to order the columns according to your desired output:
csvfile = ""
csvfile += ",".join(cols) + "\n"
for row in zip(*[csvdata[k] for k in cols]):
    csvfile += ",".join(row) + "\n"
>>> print(csvfile)
Name,Gender,Surname,PlaceofBirth
John,M,,Texas
Alexa,F,Garden,
This works for me:
import csv

with open("in.csv") as infile, open("out.csv", "w") as outfile:
    incsv, outcsv = csv.reader(infile), csv.writer(outfile)
    next(incsv)  # Skip 1st row
    outcsv.writerows(zip(*incsv))
Update: For input and output as strings:
import csv, io

# indata is assumed to hold the CSV text as a string
with io.StringIO(indata) as infile, io.StringIO() as outfile:
    incsv, outcsv = csv.reader(infile), csv.writer(outfile)
    next(incsv)  # Skip 1st row
    outcsv.writerows(zip(*incsv))
    print(outfile.getvalue())

csv.DictReader delimiter inside a csv field with multiple quotes

I have the following example CSV, which I am reading with:
f = StringIO(response.encode('utf-8'))
reader = csv.DictReader(f, quotechar='"', delimiter=';', quoting=csv.QUOTE_ALL, skipinitialspace=True)
example csv:
id;name;community;owner;owns;description;uuid
3c;NP;NoProb;NoP;Text;text_with_no_issues;
3c;NP;NoProb;NoP;TextText;text_with_no_issues2;
1A;fooo;barr;Bar;TEXT1;"\"text\"\"None\"\";text\"\"TEXT\"\"text\"";
1A;fooo;barr;Bar;TEXT2;"\"text\"\"None\"\";text\"\"TEXT\"\"text\"";
2B;BAR;foo;Bar;TEXT3;"\"text\"\"None\"\";text\"\"TEXT\"\"text\";text\"\"TEXT\"\"text\"";
2B;BAR;foo;Bar;TEXT4;"\"text\"\"None\"\";text\"\"TEXT\"\"text\";text\"\"TEXT\"\"text\"";
the uuid column is empty in all cases.
within the "reader" there are multiple entries with the same 'name' and 'id' which I am "merging", but in lines like the last four (1A, 2B) I am hitting an issue because of the ";" delimiter inside the description.
Even with quotechar='"' and quoting=csv.QUOTE_ALL, the description column gets split at the delimiter and spills into the next column (uuid) and into a "None" column, which corrupts my data.
Any idea how to solve this one?
P.S. for the merge logic I am using two variants:
##############################################################
name_index = []
result = []
for line in reader:
    idx = line["name"]
    if idx not in name_index:
        name_index.append(idx)
        result.append(line)
    else:
        idx_curr_dict = result[name_index.index(idx)]
        merge_entries = [idx_curr_dict, line]
        placeholder = {}
        for key in idx_curr_dict:
            placeholder[key] = ", ".join(list(set(d[key] for d in merge_entries if d[key] != "" and d[key])))
        result[name_index.index(idx)] = placeholder
##############################################################
and a bit slower one, but not that complicated:
##############################################################
data = [line for line in reader]  # Deplete the iterator
unique_names = set([line['name'] for line in data])  # List of unique names
column_names = [key for key in data[0] if key != 'name' and key != 'uuid']  # all other useful columns
result = []
for name in unique_names:
    same_named_lines = [line for line in data if line['name'] == name]
    unique_line = {'name': name}
    for column in column_names:
        value = ", ".join(set([line[column] for line in same_named_lines]))
        unique_line[column] = value
    result.append(unique_line)
##############################################################
Thanks a lot in advance!
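One direction worth trying, offered only as a sketch rather than a confirmed fix: the sample rows escape quotes with backslashes (\"), and the csv module ignores backslashes unless escapechar is set. Declaring the backslash as the escape character and turning off the default quote doubling may keep the embedded ";" inside the description field (reconstructed one-row sample, not verified against the real feed):
import csv
from io import StringIO

# Reconstructed one-line sample from the question; the backslashes below are
# escape characters in the data itself, not Python string noise.
response = 'id;name;community;owner;owns;description;uuid\n' \
           '1A;fooo;barr;Bar;TEXT1;"\\"text\\"\\"None\\"\\";text\\"\\"TEXT\\"\\"text\\"";\n'

f = StringIO(response)
reader = csv.DictReader(f, delimiter=';', quotechar='"',
                        doublequote=False, escapechar='\\')
for row in reader:
    print(row['description'])  # the ';' stays inside the description field
    print(row['uuid'])         # '' - the uuid column is still empty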
