python3 csv with duplicate keys + python defaultdict - python

I have a CSV file containing a lot of serial numbers and material numbers, for example as shown below (I need only the first two columns, i.e. serial and chassis; the rest are not required).
serial chassis type date
ZX34215 Test XX YY
ZX34215 final-001 XX YY
AB30000 Used XX YY
ZX34215 final-002 XX YY
The snippet below reads all the serial and material numbers into a dictionary, but duplicate keys are eliminated and only the latest value for each serial key is kept.
Working code
import sys
import csv
with open('file1.csv', mode='r') as infile:
    reader = csv.reader(infile)
    mydict1 = {rows[0]:rows[1] for rows in reader}
print(mydict1)
I also need to capture the duplicate keys with their respective values, but my attempt failed. I used Python's defaultdict and it looks like I missed something.
Not working
from collections import defaultdict
with open('file1.csv',mode='r') as infile:
    data=defaultdict(dict)
    reader=csv.reader(infile)
    list_res = list(reader)
    for row in reader:
        result=data[row[0]].append(row[1])
        print(result)
Can someone correct me so that duplicate keys are captured in the dictionary?

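You need to pass list, not dict, to defaultdict:
data=defaultdict(list)
You also don't need to convert the reader object to a list in order to iterate over it. In fact, list(reader) consumes the reader, so the for loop that follows it never runs. And you shouldn't assign the result of append to a variable in each iteration, because list.append always returns None: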
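import csv
from collections import defaultdict
data=defaultdict(list)
with open('file1.csv') as infile:
    reader=csv.reader(infile)
    for row in reader:
        try:
            data[row[0]].append(row[1])
        except IndexError:
            # skip blank rows and rows with fewer than two columns
            pass
print(data)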
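To make the behaviour easy to check without a file on disk, here is a minimal sketch of the same grouping run against the sample rows from the question. It assumes the file is comma-delimited and has a header row; the names SAMPLE and group_chassis_by_serial are made up for this illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory stand-in for file1.csv.
# Assumption: the real file is comma-delimited and starts with a header row.
SAMPLE = """serial,chassis,type,date
ZX34215,Test,XX,YY
ZX34215,final-001,XX,YY
AB30000,Used,XX,YY
ZX34215,final-002,XX,YY
"""


def group_chassis_by_serial(lines):
    """Map each serial to the list of chassis values seen for it, in file order."""
    grouped = defaultdict(list)
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or short rows instead of catching IndexError
        grouped[row[0]].append(row[1])
    return dict(grouped)


chassis_by_serial = group_chassis_by_serial(io.StringIO(SAMPLE))
print(chassis_by_serial)
```

Every value is a list, so a serial that appears once (AB30000 here) still maps to a one-element list rather than a bare string.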

Related

how to select a specific column of a csv file in python

I am a beginner in Python and would like your opinion.
I wrote this code, which reads the only column of a file on my PC and puts it in a list.
I am having difficulty understanding how to modify the same code for a file that has multiple columns, so that it selects only the column I am interested in.
Can you help me?
list = []
with open(r'C:\Users\Desktop\mydoc.csv') as file:
    for line in file:
        item = int(line)
        list.append(item)
results = []
for i in range(0,1086):
    a = list[i-1]
    b = list[i]
    c = list[i+1]
    results.append(b)
print(results)
You can use pandas.read_csv() method very simply like this:
import pandas as pd
my_data_frame = pd.read_csv('path/to/your/data')
results = my_data_frame['name_of_your_wanted_column'].values.tolist()
A useful module for the kind of work you are doing is the imaginatively named csv module.
Many csv files have a "header" line at the top, which by convention labels the columns of the file. Assuming you can insert a line at the top of your csv file with comma-delimited field names, you could replace your program with something like:
import csv
with open(r'C:\Users\Desktop\mydoc.csv') as myfile:
    csv_reader = csv.DictReader(myfile)
    for row in csv_reader:
        print(row['column_name_of_interest'])
The above will print to the terminal all the values that match your specific 'column_name_of_interest' after you edit it to match your particular file.
It's normal to work with lots of columns at once, so that dictionary method of packing a whole row into a single object, addressable by column-name can be very convenient later on.
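As a rough sketch of the DictReader approach, the following collects one named column into a list and converts it to integers, as the question's code does with int(line). The sample data, the header names id and price, and the helper read_column are all invented for the illustration.

```python
import csv
import io

# Hypothetical sample; the header names "id" and "price" are made up.
SAMPLE = """id,price
1,10
2,25
3,40
"""


def read_column(lines, column_name):
    """Return every value of one named column as a list of strings."""
    reader = csv.DictReader(lines)  # the first line supplies the field names
    return [row[column_name] for row in reader]


# csv always yields strings, so the conversion to int is explicit.
prices = [int(value) for value in read_column(io.StringIO(SAMPLE), "price")]
print(prices)
```

To read a real file, pass an open file object instead of the StringIO, for example one opened with open(path, newline='').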
For a pure-Python implementation, use the csv module.
data.csv
Project1,folder1/file1,data
Project1,folder1/file2,data
Project1,folder1/file3,data
Project1,folder1/file4,data
Project1,folder2/file11,data
Project1,folder2/file42a,data
Project1,folder2/file42b,data
Project1,folder2/file42c,data
Project1,folder2/file42d,data
Project1,folder3/filec,data
Project1,folder3/fileb,data
Project1,folder3/filea,data
Your Python program should read it line by line:
import csv
a = []
with open('data.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        print(row)
        # first row printed: ['Project1', 'folder1/file1', 'data']
Printing a row shows that each one is a list like this:
['Project1', 'folder1/file1', 'data']
To collect every value from the second column (index 1), append that element to the list inside the loop:
a.append(row[1])
The list a will then contain:
['folder1/file1', 'folder1/file2', 'folder1/file3', 'folder1/file4', 'folder2/file11', 'folder2/file42a', 'folder2/file42b', 'folder2/file42c', 'folder2/file42d', 'folder3/filec', 'folder3/fileb', 'folder3/filea']
Here is the complete code:
import csv
a = []
with open('data.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        a.append(row[1])

Python (3.7) CSV Sort/Sum by Field Value

I have a csv file (of indefinite size) that I would like to read and do some work with.
Here is the structure of the csv file:
User, Value
CN,500.00
CN,-250.00
CN,360.00
PT,200.00
PT,230.00
...
I would like to read the file and sum the values of all rows that share the same first field.
I have been trying the following just to try and identify a value for the first field:
with open("Data.csv", newline='') as data:
    reader = csv.reader(data)
    for row in reader:
        if row.startswith('CN'):
            print("heres one")
This fails because startswith does not work on a list object. I have also tried using readlines().
EDIT 1:
I can currently print the following dataframe object with the sorted sums:
          Value
User
CN   3587881.89
D       1000.00
KC   1767783.99
REC    12000.00
SB     25000.00
SC   1443039.12
SS         0.00
T    9966998.93
TH   2640009.32
ls       500.00
I get this output using this code:
mydata=pd.read_csv('Data.csv')
out = mydata.groupby(['user']).sum()
print(out)
I'd now like to be able to write if statements for this object. Something like:
if out contains User 'CN'
    varX = Value for 'CN'
Because this is now a dataframe type, I am having trouble assigning the Value for a specific user to a variable.
You can do the following (note that the pandas method is groupby, not group_by):
import pandas as pd
my_data= pd.read_csv('Data.csv')
my_data.groupby('user').sum()
You can test the first element of each row instead of the row itself:
import csv
with open("Data.csv", newline='') as data:
    reader = csv.reader(data)
    for row in reader:
        if row[0].startswith('CN'):
            print("heres one")
Using collections.defaultdict
Ex:
import csv
from collections import defaultdict
result = defaultdict(int)
with open(filename, newline='') as data:
    reader = csv.reader(data)
    next(reader)
    for row in reader:
        result[row[0]] += float(row[1])
print(result)
Output
defaultdict(<class 'int'>, {'CN': 610.0, 'PT': 430.0})
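The question's follow-up asks how to pull the total for one specific user into a variable. With a plain dictionary of sums that is an ordinary membership test and lookup. The sketch below uses the sample rows from the question; the names sum_by_user, var_x and the user "ZZ" are assumptions made for the example, not part of the original code.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question, held in memory instead of Data.csv.
SAMPLE = """User, Value
CN,500.00
CN,-250.00
CN,360.00
PT,200.00
PT,230.00
"""


def sum_by_user(lines):
    """Return {user: sum of values} for a two-column CSV with a header row."""
    totals = defaultdict(float)
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for user, value in reader:
        totals[user] += float(value)
    return dict(totals)


totals = sum_by_user(io.StringIO(SAMPLE))

# Equivalent of: if out contains User 'CN', varX = Value for 'CN'
if "CN" in totals:
    var_x = totals["CN"]
    print(var_x)

# dict.get supplies a fallback for users that never appear ("ZZ" is hypothetical).
missing_total = totals.get("ZZ", 0.0)
```

Converting the defaultdict to a plain dict before returning means a lookup of an unknown user raises KeyError instead of silently creating a zero entry.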

Error in forming dictionary from a csv file in python

I have a csv file whose structure is like this:
Year-Sem,Course,Studentid,Score
201001,CS301,100,363
201001,CS301,101,283
201001,CS301,102,332
201001,CS301,103,254
201002,CS302,101,466
201002,CS302,102,500
Here each year is divided into two semesters, 01 (for fall) and 02 (for spring), and the data covers the years 2008 to 2014 (a total of 14 semesters). What I want to do is form a dictionary where course and studentid become the key and their respective scores, ordered by year-sem, become the values. So the output should be something like this for each student:
[(studentid,course):(year-sem1 score,year-sem2 score,...)]
I first tried to make a dictionary of [(studentid,course):(score)] using this code, but I get the error IndexError: list index out of range:
with open('file1.csv', mode='rU') as infile:
    reader = csv.reader(infile,dialect=csv.excel_tab)
    with open('file2.csv', mode='w') as outfile:
        writer = csv.writer(outfile)
        mydict = {(rows[2],rows[1]): rows[3] for rows in reader}
        writer.writerows(mydict)
When I was not using dialect=csv.excel_tab and rU, I was getting the error _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?.
How can I resolve this error and form the dictionary with the structure [(studentid,course):(year-sem1 score,year-sem2 score,...)] that I mentioned above?
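The dialect you've chosen seems to be wrong. csv.excel_tab uses the tab character as the delimiter, so each comma-separated line is read as a single field and indexing rows[2] raises the IndexError. For your data, the default dialect should work.
You got the error message about newlines earlier because of the missing U in the rU mode.
import csv
with open(r"test.csv", "rU") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
This example seems to work for me (Python 3). Note that the U mode flag was removed in Python 3.11; on current versions, open the file with newline='' instead, as the csv documentation recommends.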
If you have repeating keys, you need to store the values in some container. If you want the data ordered, use an OrderedDict:
import csv
from collections import OrderedDict
with open("in.csv") as infile, open('file2.csv', mode='w') as outfile:
    d = OrderedDict()
    reader, writer = csv.reader(infile), csv.writer(outfile)
    header = next(reader) # skip header
    # choose whatever column names you want
    writer.writerow(["id-crse","score"])
    # unpack the values from each row
    for yr, cre, stid, scr in reader:
        # use id and course as keys and append scores
        d.setdefault("{} {}".format(stid, cre),[]).append(scr)
    # iterate over the dict keys and values and write each new row
    for k,v in d.items():
        writer.writerow([k] + v)
Which will give you something like:
id-crse,score
100 CS301,363
101 CS301,283
102 CS301,332
103 CS301,254
101 CS302,466
102 CS302,500
Your own code stores only the last value for each key. It also writes only the keys, because writer.writerows(mydict) iterates over the dict's keys, not its keys and values. If the data is not already in chronological order, call sorted on the reader object using operator.itemgetter (this needs import operator), keyed on the Year-Sem column at index 0:
for yr, cre, stid, scr in sorted(reader,key=operator.itemgetter(0)):
    ............
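For the structure the question actually asks for, {(studentid, course): (scores ordered by year-sem)}, a minimal sketch could look like the following. It uses the rows from the question plus one hypothetical extra row, placed first, so that the sort has something to reorder. The function name and the conversion of scores to int are choices made for this example.

```python
import csv
import io
from collections import defaultdict

# Rows from the question plus one hypothetical row (201002,CS301,100,400),
# placed first so the input is deliberately out of chronological order.
SAMPLE = """Year-Sem,Course,Studentid,Score
201002,CS301,100,400
201001,CS301,100,363
201001,CS301,101,283
201001,CS301,102,332
201001,CS301,103,254
201002,CS302,101,466
201002,CS302,102,500
"""


def scores_by_student_course(lines):
    """Return {(studentid, course): tuple of scores ordered by Year-Sem}."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    grouped = defaultdict(list)
    # Year-Sem is column 0; its YYYYSS strings sort chronologically as text.
    for year_sem, course, student_id, score in sorted(reader, key=lambda row: row[0]):
        grouped[(student_id, course)].append(int(score))
    return {key: tuple(values) for key, values in grouped.items()}


scores = scores_by_student_course(io.StringIO(SAMPLE))
print(scores[("100", "CS301")])
```

Using a tuple as the dictionary key avoids having to build and later split a combined string such as "100 CS301".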

Write dictionary of lists (varying length) to csv in Python

I am currently struggling with dictionaries of lists.
Given a dictionary like this:
GO_list = {'Seq_A': ['GO:1234', 'GO:2345', 'GO:3456'],
           'Seq_B': ['GO:7777', 'GO:8888']}
Now I want to write this dictionary to a CSV file as follows:
EDIT: I have added the whole function to give more information.
def map_GI2GO(gilist, mapped, gi_to_go):
    with open(gilist) as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]:rows[1] for rows in read_gi} # read GI list into dictionary
        GO_list = defaultdict(list) # set up GO list as empty dictionary of lists
    infile.close()
    with open(gi_to_go) as mapping:
        read_go = csv.reader(mapping, delimiter=',')
        for k, v in GI_list.items(): # iterate over GI list and mapping file
            for row in read_go:
                if len(set(row[0]).intersection(v)) > 0 :
                    GO_list[k].append(row[1]) # write found GOs into dictionary
                    break
    mapping.close()
    with open(mapped, 'wb') as outfile: # save mapped SeqIDs plus GOs
        looked_up_go = csv.writer(outfile, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
        for key, val in GO_list.iteritems():
            looked_up_go.writerow([key] + val)
    outfile.close()
However this gives me the following output:
Seq_A,GO:1234;GO2345;GO:3456
Seq_B,GO:7777;GO:8888
I would prefer to have the list entries in separate columns, separated by a defined delimiter. I am having a hard time getting rid of the ; characters, which apparently separate the list entries.
Any ideas are welcome.
If I were you I would try out itertools izip_longest to match up columns of varying length...
from csv import writer
from itertools import izip_longest
GO_list = {'Seq_A': ['GO:1234', 'GO:2345', 'GO:3456'],
           'Seq_B': ['GO:7777', 'GO:8888']}
with open("test.csv","wb") as csvfile:
    wr = writer(csvfile)
    wr.writerow(GO_list.keys())#writes title row
    for each in izip_longest(*GO_list.values()):
        wr.writerow(each)
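The code in this thread is Python 2. In Python 3, izip_longest is named zip_longest, and csv files are opened in text mode (with newline='') rather than 'wb'. The sketch below is a Python 3 illustration of the two possible layouts, one key per row and one key per column, writing to an in-memory buffer so the result can be inspected. The function names rows_layout and columns_layout are invented for the example.

```python
import csv
import io
from itertools import zip_longest

GO_list = {'Seq_A': ['GO:1234', 'GO:2345', 'GO:3456'],
           'Seq_B': ['GO:7777', 'GO:8888']}


def rows_layout(mapping, delimiter='\t'):
    """One line per key: the key followed by its values in separate columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator='\n')
    for key, values in mapping.items():
        writer.writerow([key] + values)
    return buffer.getvalue()


def columns_layout(mapping):
    """One column per key: keys in the title row, shorter columns padded with ''."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(mapping.keys())
    for column_values in zip_longest(*mapping.values(), fillvalue=''):
        writer.writerow(column_values)
    return buffer.getvalue()


print(rows_layout(GO_list))
print(columns_layout(GO_list))
```

To write to disk instead, replace the StringIO buffer with a file opened as open(path, 'w', newline='').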

Convert a csv to a dictionary with multiple values?

I have a csv file like this:
pos,place
6696,266835
6698,266835
938,176299
940,176299
941,176299
947,176299
948,176299
949,176299
950,176299
951,176299
770,272944
2751,190650
2752,190650
2753,190650
I want to convert it to a dictionary like the following:
{266835:[6696,6698],176299:[938,940,941,947,948,949,950,951],190650:[2751,2752,2753]}
And then, fill the missing numbers in the range in the values:
{266835:[6696,6697,6698],176299:[938,939,940,941,942,943,944,945,946,947,948,949,950,951],190650:[2751,2752,2753]}
Right now I have tried to build the dictionary using the solution suggested here, but it overwrites the old value with the new one.
Any help would be great.
Here is the function that I wrote for converting the CSV to a dict:
def csv2dict(filename):
    """
    reads in a two column csv file, and then converts it into a dictionary
    """
    import csv
    with open(filename) as f:
        f.readline()#ignore first line
        reader=csv.reader(f,delimiter=',')
        mydict=dict((rows[1],rows[0]) for rows in reader)
    return mydict
The easiest way is to use collections.defaultdict() with a list:
import csv
from collections import defaultdict
data = defaultdict(list)
with open(inputfilename, newline='') as infh:  # text mode, as Python 3's csv module expects
    reader = csv.reader(infh)
    next(reader, None) # skip the header
    for col1, col2 in reader:
        data[col2].append(int(col1))
        if len(data[col2]) > 1:
            # list() is needed on Python 3, where range() does not return a list
            data[col2] = list(range(min(data[col2]), max(data[col2]) + 1))
This also expands the ranges on the fly as you read the data.
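An alternative to expanding on the fly is to do it in two passes: first group the positions, then fill each group's range once at the end. This keeps the intermediate dictionary in the exact form the question asks for. The sketch below runs on the sample data from the question and converts both columns to int; the function names are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Sample data copied from the question.
SAMPLE = """pos,place
6696,266835
6698,266835
938,176299
940,176299
941,176299
947,176299
948,176299
949,176299
950,176299
951,176299
770,272944
2751,190650
2752,190650
2753,190650
"""


def positions_by_place(lines):
    """Step 1: map each place to the list of positions found for it."""
    grouped = defaultdict(list)
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for pos, place in reader:
        grouped[int(place)].append(int(pos))
    return dict(grouped)


def fill_gaps(grouped):
    """Step 2: replace each list with every integer from its min to its max."""
    return {place: list(range(min(values), max(values) + 1))
            for place, values in grouped.items()}


grouped = positions_by_place(io.StringIO(SAMPLE))
filled = fill_gaps(grouped)
print(filled[266835])
```

A place with a single position, such as 272944 here, simply keeps its one-element list.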
Based on what you have tried:
import csv
from collections import defaultdict
# open archive reader
myFile = open("myfile.csv", newline="")
archive = csv.reader(myFile, delimiter=',')
arch_dict = defaultdict(list)
for row in archive:
    arch_dict[row[1]].append(row[0])
myFile.close()
print(arch_dict)
