Shorten CSV file based on rules - Python

I am stuck writing the following program.
I have a CSV file:
"SNo","Column1","Column2"
"A1","X","Y"
"A2","A","B"
"A1","X","Z"
"A3","M","N"
"A1","D","E"
I want to shorten this CSV according to the following rules:
a.) If the SNo occurs more than once in the file,
combine all column1 and column2 entries of that serial number
b.) If the same Column1 or Column2 entry occurs more than once for a serial number,
do not include it twice.
Therefore the output of the above should be:
"SNo","Column1","Column2"
"A1","X,D","Y,Z,E"
"A2","A","B"
"A3","M","N"
So far I am reading the CSV file and iterating over the rows, checking whether the SNo of the current row is the same as that of the previous row. What's the best way to combine them?
import csv

temp = "A1"
col1 = ""
col2 = ""
col3 = ""
with open("C:\\file\\file1.csv", "rb") as f:
    reader = csv.reader(f)
    for row in reader:
        if row[0] == temp:
            continue
        col1 = col1 + row[1]
        col2 = col2 + row[2]
        col3 = col3 + row[3]
        temp = row[0]
        print row[0] + ";" + col1 + ";" + col2 + ";" + col3
        col1 = ""
        col2 = ""
        col3 = ""
Please let me know a good way to do this.
Thanks

The simplest approach is to maintain a dictionary with keys as serial numbers and sets to contain the columns. Then you could do something like the following:
my_dict = {}
for row in reader:
    if row[0] not in my_dict:
        my_dict[row[0]] = [set(), set()]
    my_dict[row[0]][0].add(row[1])
    my_dict[row[0]][1].add(row[2])
Writing the file out (to a file opened as file_out) would be as simple as iterating through the dictionary using a join command:
for k in my_dict:
    file_out.write('{0},"{1}","{2}"\n'.format(
        k,
        ','.join(my_dict[k][0]),
        ','.join(my_dict[k][1])
    ))
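For completeness, here is a minimal end-to-end sketch of this approach (Python 3.7+, so the dict keeps insertion order). It uses lists instead of sets so that values keep their order of first appearance, which matches the expected output; the file names are placeholders.

import csv

merged = {}  # SNo -> [column1 values, column2 values]
with open("file1.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # keep the header row for the output file
    for sno, c1, c2 in reader:
        cols = merged.setdefault(sno, [[], []])
        if c1 not in cols[0]:  # rule b: never combine the same entry twice
            cols[0].append(c1)
        if c2 not in cols[1]:
            cols[1].append(c2)

with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(header)
    for sno, (col1, col2) in merged.items():
        writer.writerow([sno, ",".join(col1), ",".join(col2)])

Run against the sample input, this writes "A1","X,D","Y,Z,E" followed by the untouched A2 and A3 rows.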

Related

Comparing Data Inside CSV Files

I am brand new to Python, so go easy on me!
I am simply trying to compare the values in two lists and get an output for them in yes or no form.
(The question included an image of the CSV file here; it shows two columns of integers.)
My code looks like this:
import csv

f = open("datatest.csv")
reader = csv.reader(f)
dataListed = [row for row in reader]
rc = csv.writer(f)

column1 = []
for row in dataListed:
    column1.append(row[0])

column2 = []
for row in dataListed:
    column2.append(row[1])

for row in dataListed:
    if (column1) > (column2):
        print("yes,")
    else:
        print("no,")
Currently, the output is just no, no, no, no, no ..., when it should not look like that based on the values!
I will show below what I have attempted; any help would be huge.
for row in dataListed:
    if (column1) > (column2):
        print("yes,")
    else:
        print("no,")
This loop is comparing the same column1 and column2 variables each time through the loop. Those variables never change.
The code does not magically know that column1 and column2 actually mean "the first column in the current row" and "the second column in the current row".
Presumably you meant something like this instead:
if row[0] > row[1]:
... because this actually does use the current value of row for each loop iteration.
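A minimal fix of the original loop would therefore be something like this (a sketch, assuming both columns hold integers and the first row is a header):

for row in dataListed[1:]:  # skip the header row
    if int(row[0]) > int(row[1]):
        print("yes,")
    else:
        print("no,")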
You can simplify your code and obtain the expected result with something like this:
import csv
from pathlib import Path

dataset = Path('dataset.csv')
with dataset.open() as f:
    reader = csv.reader(f)
    headers = next(reader)
    for col1, col2 in reader:
        print(col1, col2, 'yes' if int(col1) > int(col2) else 'no', sep=',')
For the sample CSV you posted in the image, the output would be the following:
1,5,no
7,12,no
11,6,yes
89,100,no
99,87,yes
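Note the int() conversions: csv.reader yields strings, and comparing strings is lexicographic, which gives surprising answers for numbers:

>>> "7" > "12"   # string comparison: "7" sorts after "1"
True
>>> 7 > 12       # integer comparison
False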
Here is a simple alternative for your program.
f = open("sample.csv")
for each in f:
    row = each.split(",")
    try:
        column1 = int(row[0])
        column2 = int(row[1])
        if column1 > column2:
            print(f"{column1}\t{column2}\tYes")
        else:
            print(f"{column1}\t{column2}\tNo")
    except ValueError:
        # the header row cannot be converted to int, so print it here
        header = each.split(",")
        head = f"{header[0]}\t{header[1].strip()}\tresult"
        print(head)
        print((len(head) + 5) * "-")

Read Column Values Between Two Rows CSV Python

I have a CSV which is in the format:
Name1,Value1
,Value2
,Value3
Name2,Value40
,Value50
,Value60
Name3,Value5
,Value10
,Value15
There is not a set number of "values" per "name".
There is no pattern to the names.
I want to read the Values for Each Name into a dict such as:
Name1 : [Value1,Value2,Value3]
Name2 : [Value40,Value50,Value60]
etc.
My current code is this:
import csv

CSVFile = open("GroupsCSV.csv")
Reader = csv.reader(CSVFile)
for row in Reader:
    if row[0] and row[2]:
        objlist = []
        objlist.append(row[2])
        for row in Reader:
            if not row[0] and row[2]:
                objlist.append(row[2])
            else:
                break
        print(objlist)
This half-works.
It only picks up Name1, Name3, Name5, Name7, etc.
I can't seem to find a way to stop it skipping.
I would prefer to do this without the use of something like lambda (as it's not something I fully understand yet!).
EDIT: The question included an image of the example CSV (the real data has another, unneeded column, hence the row[2] in the code).
Try pandas:
import pandas as pd

df = pd.read_csv('your_file.csv', header=None)
(df.ffill()          # fill the blanks with the previous name
   .groupby([0])[1]  # collect rows with the same name
   .apply(list)      # put the values in a list
   .to_dict()        # make a dictionary
)
Output:
{'Name1': ['Value1', 'Value2', 'Value3'],
'Name2': ['Value40', 'Value50', 'Value60'],
'Name3': ['Value5', 'Value10', 'Value15']}
Update: a pure Python 3 solution:
with open('your_file.csv') as f:
    lines = f.readlines()

d = {}
for line in lines:
    row = line.strip().split(',')
    if row[0] != '':
        key = row[0]
        d[key] = []
    d[key].append(row[1])
print(d)
I think the issue you are facing is because of your nested loop. Both loops are pointing to the same iterator. You are starting the second loop after it finds Name1 and breaking it when it finds Name2. By the time the outer loop continues after the break, you have already skipped Name2.
You could have both conditions in the same loop:
# with open("GroupsCSV.csv") as csv_file:
#     reader = csv.reader(csv_file)
reader = [[1, 2, 3], [None, 5, 6]]  # mocking the csv input

objlist = []
for row in reader:
    if row[0] and row[2]:
        objlist.clear()
        objlist.append(row[2])
    elif not row[0] and row[2]:
        objlist.append(row[2])
    print(objlist)
EDIT: I have updated the code to provide a testable output.
The printed output looks as follows:
[3]
[3, 6]
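The same single-pass pattern can also build the dictionary the question asks for. A sketch, assuming (per the EDIT) that the name is in row[0] and the value in row[2], and that the file starts with a named row:

import csv

result = {}
with open("GroupsCSV.csv") as csv_file:
    for row in csv.reader(csv_file):
        if row[0]:  # a non-empty first column starts a new name
            current = result.setdefault(row[0], [])
        if row[2]:  # collect this row's value under the current name
            current.append(row[2])
print(result)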

Loop within loop when comparing csv files in Python

I have two CSV files. I am trying to look up each value in the first column of one file (file 1) in the first column of the other file (file 2). If they match, print the row from file 2.
Pseudo code:
read file1.csv
read file2.csv
loop through file1
    compare each row with each row of file 2 in turn
    if file1[0] == file2[0]:
        print row of file 2
file1:
45,John
46,Fred
47,Bill
File2:
46,Roger
48,Pete
49,Bob
I want it to print :
46 Roger
EDIT: these are examples; the actual files are much bigger (5,000 rows, 7 columns).
I have the following:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv1)
However I am getting no output.
I am aware there are other ways of doing it (with a dict, or pandas) but I am keen to know why my approach is not working.
EDIT: I now see that it is only iterating through the first row of file 1 and then stopping, but I am unclear how to prevent that (I also understand that this is not the best way to do it).
You open csv2reader = csv.reader(csvfile2), then iterate all the way through it against the first row of csv1reader; it has now reached end of file and will not produce any more data.
So for the second through last rows of csv1reader you are comparing against an exhausted iterator, i.e. no comparison takes place.
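If you just want the nested loops to work, the direct fix is to rewind file 2 before each inner pass; a sketch (note it prints the row from file 2, which is what the question asks for):

import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    for rowcsv1 in csv.reader(csvfile1):
        csvfile2.seek(0)  # rewind file 2 so a fresh reader starts from the top
        for rowcsv2 in csv.reader(csvfile2):
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv2)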
In any case, this is a very inefficient method; unless you are working on very large files, it would be much better to do
import csv

# load second file as a lookup table
data = {}
with open("csv2file.csv") as inf2:
    for row in csv.reader(inf2):
        data[row[0]] = row

# now process first file against it
with open("csv1file.csv") as inf1:
    for row in csv.reader(inf1):
        if row[0] in data:
            print(data[row[0]])
See Hugh Bothwell's answer for why your code isn't working. For a fast way of doing what you stated you want to do in your question, try this:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

duplicates = {a[0] for a in csv1} & {a[0] for a in csv2}
for row in csv2:
    if row[0] in duplicates:
        print(row)
It gets the duplicate numbers from the two CSV files, then loops through the second CSV file, printing the row if the number at index 0 is also in the first CSV file. This is a much faster algorithm than what you were attempting.
If order matters, as @hugh-bothwell mentioned in @will-da-silva's answer, you could do:
import csv
from collections import OrderedDict

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

d = {row[0]: row for row in csv2}
keys = OrderedDict.fromkeys([a[0] for a in csv1]).keys()
duplicate_keys = [key for key in keys if key in d]
for k in duplicate_keys:
    print(d[k])
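As an aside, on Python 3.7+ a plain dict preserves insertion order, so the OrderedDict import can be dropped. A sketch reusing csv1 and csv2 from the snippet above:

d = {row[0]: row for row in csv2}
for key in dict.fromkeys(a[0] for a in csv1):  # deduplicate, keeping csv1 order
    if key in d:
        print(d[key])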
I'm pretty sure there's a better way to do this, but try out this solution; it should work.
import csv

counter = 0
with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[counter] == rowcsv2[counter]:
                print(rowcsv1)
        counter += 1  # increment it out of the IF statement

How do I make sure that all of the items that I read from a CSV are parsed into a dictionary?

I have a large CSV file from which I am reading some data and adding that data into a dictionary. My CSV file has approximately 360000 rows and my dictionary has only a length of 5700. I know my CSV has a lot of duplicates but I expect about 50000 unique rows. I know that Python dictionaries have no limits to size. My code reads all the 360000 entries in the file, writes it to another file and terminates. All this processing finishes in about 2 seconds without any exceptions. How do I know for sure that all of the items in the CSV that I process are actually being added to the dictionary?
The code that I am using is as follows:
with open("read.csv", 'rb') as input1:
with open("write.csv", 'wb') as output1:
reader = csv.reader(input1, delimiter="\t")
writer = csv.writer(output1, delimiter="\t")
#Just testing if my program reads the whole CSV file
for row in reader:
count += 1
print count # Gives 360000
input1.seek(0)
for row in reader:
#print row[1] + "\t" + row[2]
dict1.update({row[1] : [ row[2], row[0] ]})
print len(dict1) # Gives 5700
for key in dict1:
ext_id = key
list1 = dict1[key]
name = list1[0]
url = list1[1]
writer.writerow([ext_id, name, url])
EDIT
I am not sure people understand what I am trying to do and why it is relevant, so I'll explain.
My CSV file has 3 columns for each row. Their format is as follows:
URL+unique value | unique value | some name
However, the rows are duplicated in the CSV and I want another CSV which just has rows without any duplicates.
The keys in your dictionary are row[1]. The size of the dictionary will depend on how many different values of this field are in the input. It does not matter if the rest of the row (row[2], row[0]) differs between rows.
Example:
a,foo,1
b,bar,2
c,foo,3
d,baz,4
Will result in a dictionary of length 3 if the second field (index 1) is used as a key. The result will be:
{'foo':['3', 'c'],
'bar':['2', 'b'],
'baz':['4', 'd']}
The value from the first line will be overwritten by the third. Of course the 'order' can be different, since a dictionary has no order.
EDIT: if you're just checking for uniqueness, there's no need to put this into a dictionary (which are designed for fast lookup and grouping). Use a set here instead.
out_ = set()
for row in reader:
    # csv.reader yields lists, which are unhashable, so cast each row to a
    # tuple before adding it to the set
    out_.add(tuple(row))

for row in out_:
    writer.writerow([row[1], row[0], row[2]])
Here's the quickest check I can think of.
set_headers = {row[1] for row in reader}
This is a set containing all the 2nd columns (that is to say, row[1]) of all the rows in your CSV. As you probably know, sets cannot contain duplicates, so this gives you how many UNIQUE values you have in your header column of each row.
Since dict.update REPLACES values, this is exactly what you're going to see with len(dict1); in fact len(set_headers) == len(dict1). Each time you iterate through a row, dict.update CHANGES THE VALUE of the key row[1] to [row[2], row[0]]. That is probably just fine if you don't care about the earlier values, but somehow I don't think that's true.
Instead, do this:
for row in reader:
    dict1.setdefault(row[1], []).append((row[0], row[2]))
This will end up with something like:
dict1 = {"foo": [(row1_col0,row1_col2),(row3_col0,row3_col2)],
"baz": [(row2_col0,row2_col2)]}
from input of:
row1_col0, foo, row1_col2
row2_col0, baz, row2_col2
row3_col0, foo, row3_col2
Now you can do, for instance:
for header in dict1:
    for row in dict1[header]:
        print("{}\t{}".format(row[0], row[1]))
The simplest way to make sure you are getting everything is to add two test lists. Add test1, test2 = [], []; then, right after you update your dictionary, add test1.append(row[1]) and if row[1] not in test2: test2.append(row[1]). You can then print the length of both lists: test2 should have the same length as your dictionary, and test1 should have the same length as your input CSV.
with open("read.csv", 'rb') as input1:
with open("write.csv", 'wb') as output1:
reader = csv.reader(input1, delimiter="\t")
writer = csv.writer(output1, delimiter="\t")
#Just testing if my program reads the whole CSV file
test1,test2=[],[]
for row in reader:
count += 1
print count # Gives 360000
input1.seek(0)
for row in reader:
#print row[1] + "\t" + row[2]
dict1.update({row[1] : [ row[2], row[0] ]})
test1.append(row[1])
if row[1] not in test2: test2.append(row[1])
print 'dictionary length:', len(dict1) # Gives 5700
print 'test 1 length (total values):',len(test1)
print 'test2 length (unique key values):',len(test2)
for key in dict1:
ext_id = key
list1 = dict1[key]
name = list1[0]
url = list1[1]
writer.writerow([ext_id, name, url])

Python: General CSV file parsing and manipulation

The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
def getCSV(fpath):
    with open(fpath, "rb") as f:
        csvfile = csv.reader(f)
        for row in csvfile:
            allRows.append(row)
    allCols = map(list, zip(*allRows))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?
Are you sure you want to be keeping all rows around? This creates a list with matching values only... fname could also come from glob.glob() or os.listdir() or whatever other data source you so choose. Just to note, you mention the 20th column, but row[20] will be the 21st column...
import csv

matching20 = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin)  # <--- if you want to skip the header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row)  # or do something with it here
You only want csv.DictReader if you have a header row and want to access your columns by name.
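For illustration, a minimal DictReader sketch (the file name and the 'status' column are hypothetical; DictReader takes the keys from the header row):

import csv

with open('file1.csv', newline='') as fin:
    for row in csv.DictReader(fin):
        # each row is a dict keyed by the header names
        if row['status'] == 'value':
            print(row)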
This should work; you don't need to make another list to have access to the columns.
import csv
import sys

def getCSV(fpath):
    with open(fpath) as ifile:
        csvfile = csv.reader(ifile)
        rows = list(csvfile)
    value_20 = [x for x in rows if x[20] == 'value']
    return value_20
If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos), then:

pos = 20
testlist = [row for row in allRows if row[pos] == 'value']

(I haven't tested this, but let me know if that works.)
