I need to write a multi-nested dictionary to an Excel file. The dictionary is structured as {key1: {key2: {key3: value}}}. My third write loop is raising a KeyError: '' even though there shouldn't be any blank keys.
I attempted to use this abhorrent monster, as it has worked wonderfully for smaller dictionaries, but 1) there has to be a better way, and 2) I'm unable to scale it for this dictionary...
import xlsxwriter
workbook = xlsxwriter.Workbook('datatest.xlsx')
worksheet = workbook.add_worksheet('test1')
row = 0
col = 0
for key in sam_bps.keys():
    row += 1
    worksheet.write(row, col, key)
    for key in sam_bps[sam].keys():
        row, col = 0, 1
        worksheet.write(row, col, key)
        row += 1
        for key in sam_bps[sam][bp].keys():
            row, col = 0, 2
            worksheet.write(row, col, key)
            row += 1
            for key in sam_bps[sam][bp][mpn].keys():
                row, col = 0, 3
                worksheet.write(row, col, key)
                row += 1
                for item in sam_bps[sam][bp][mpn].keys():
                    row, col = 0, 4
                    worksheet.write(row, col, item)
                    row += 1
workbook.close()
I've also considered converting the dictionary to a list of tuples or a list of lists, but that doesn't output what I need, and it would probably cost more time to split those apart afterwards anyway.
Here's the code that builds the dictionary:
sam_bps = {}
sam_bps_header = ['SAM', 'BP', 'MPN', 'PLM_Rate']
for row in plmdata:
    sam, mpn, bp, doc = row[24], row[5], row[7], row[2]
    if sam == '':
        sam = 'Unknown'
    if doc == 'Requirement':
        if sam not in sam_bps:
            sam_bps[sam] = {bp: {mpn: heatscores[mpn]}}
        elif bp not in sam_bps[sam]:
            sam_bps[sam][bp] = {mpn: heatscores[mpn]}
        elif mpn not in sam_bps[sam][bp]:
            sam_bps[sam][bp][mpn] = heatscores[mpn]

print(sam_bps['Dan Reiser'])
EDIT: Added print statement to show output per feedback
{'Fortress Solutions': {'MSM5118160F60JS': 45}, 'Benchmark Electronics': {'LT1963AES8': 15}, 'Axxcss Wireless Solutions Inc': {'MGA62563TR1G': 405}}
I'd like to see this output written to an Excel file, with the first column of course being the first key of sam_bps.
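For reference, here is a minimal sketch of one way to do this (not the original poster's code): flatten the three-level dictionary into (SAM, BP, MPN, rate) rows and write them with xlsxwriter. It assumes the {sam: {bp: {mpn: rate}}} structure shown above; the file and sheet names are just placeholders.

import xlsxwriter

workbook = xlsxwriter.Workbook('datatest.xlsx')
worksheet = workbook.add_worksheet('test1')

# Header row
for col, name in enumerate(['SAM', 'BP', 'MPN', 'PLM_Rate']):
    worksheet.write(0, col, name)

# Walk the nested dictionary and write one row per innermost value
row = 1
for sam, bps in sam_bps.items():
    for bp, mpns in bps.items():
        for mpn, rate in mpns.items():
            worksheet.write(row, 0, sam)
            worksheet.write(row, 1, bp)
            worksheet.write(row, 2, mpn)
            worksheet.write(row, 3, rate)
            row += 1

workbook.close()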
Your question would be easier to answer if you provided an example of the dictionary you are trying to save.
Have you considered just serializing/deserializing the dictionary to JSON format?
You can save and re-load the data with minimal code:
import json

data = {'test': {'test2': {'test3': 2}}}

with open('data.json', 'w') as outfile:
    json.dump(data, outfile)

with open('data.json') as data_file:
    data = json.load(data_file)
Let's say I have the following example CSV file:
a,b
100,200
400,500
How would I make it into a dictionary like the one below:
{a:[100,400],b:[200,500]}
I am having trouble figuring out how to do it manually before I use a package, so that I understand it. Can anyone help?
Some code I tried:
with open("fake.csv") as f:
index= 0
dictionary = {}
for line in f:
words = line.strip()
words = words.split(",")
if index >= 1:
for x in range(len(headers_list)):
dictionary[headers_list[i]] = words[i]
# only returns the last element which makes sense
else:
headers_list = words
index += 1
At the very least, you should be using the built-in csv package for reading csv files without having to bother with parsing. That said, this first approach is still applicable to your .strip and .split technique:
1. Initialize a dictionary with the column names as keys and empty lists as values
2. Read a line from the csv reader
3. Zip the line's contents with the column names you got in step 1
4. For each key:value pair in the zip, update the dictionary by appending
with open("test.csv", "r") as file:
reader = csv.reader(file)
column_names = next(reader) # Reads the first line, which contains the header
data = {col: [] for col in column_names}
for row in reader:
for key, value in zip(column_names, row):
data[key].append(value)
Your issue was that you were using the assignment operator =, which overwrites the value stored at each key on every iteration. This is why you either want to pre-initialize the dictionary as above, or first use a membership check to test whether the key already exists in the dictionary, adding it if not:
key = headers_list[i]
if key not in dictionary:
    dictionary[key] = []
dictionary[key].append(words[i])
An even cleaner shortcut is to take advantage of dict.get:
key = headers_list[i]
dictionary[key] = dictionary.get(key, []) + [words[i]]
Another approach would be to take advantage of the csv package by reading each row of the csv file as a dictionary itself:
with open("test.csv", "r") as file:
reader = csv.DictReader(file)
data = {}
for row_dict in reader:
for key, value in row_dict.items():
data[key] = data.get(key, []) + [value]
Another standard library package you could use to clean this up further is collections, with defaultdict(list), where you can directly append to the dictionary at a given key without worrying about initializing with an empty list if the key wasn't already there.
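A minimal sketch of that defaultdict variant, under the same assumptions as the snippets above (a test.csv file with a header row):

import csv
from collections import defaultdict

with open("test.csv", "r") as file:
    reader = csv.DictReader(file)
    data = defaultdict(list)  # missing keys start out as empty lists automatically
    for row_dict in reader:
        for key, value in row_dict.items():
            data[key].append(value)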
To do that, just keep the column names and the data separate, then iterate over the columns and, for each one, add the value at the corresponding index of every data row. I'm not sure whether this works with empty values.
However, I'm fairly sure that going through pandas would be far easier; it's a widely used library for working with data from external files (a short pandas sketch follows the code below).
import csv

datas = []
with open('fake.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            cols = row
            line_count += 1
        else:
            datas.append(row)
            line_count += 1

dict = {}
for index, col in enumerate(cols):  # iterate over the columns with their indices
    dict[col] = []
    for data in datas:  # append, under the current column key, the value from each data row
        dict[col].append(data[index])
print(dict)
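For comparison, here is a minimal sketch of the pandas approach mentioned above, assuming the same fake.csv with a header row:

import pandas as pd

df = pd.read_csv('fake.csv')        # the header row becomes the column names
result = df.to_dict(orient='list')  # e.g. {'a': [100, 400], 'b': [200, 500]}
print(result)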
I have two CSV files and need Python code to do a VLOOKUP-style merge: match rows on a common value, take only the needed column, and create a new CSV file. I know it can be done with pandas, but I need to do this without pandas or any third-party tools.
INPUT 1 csv file
ID NAME SUBJECT
1 Raj CS
2 Allen PS
3 Bradly DP
4 Tim FS
INPUT 2 csv file
ID COUNTRY TIME
2 USA 1:00
4 JAPAN 14:00
1 ENGLAND 5:00
3 CHINA 0.00
OUTPUT csv file
ID NAME SUBJECT COUNTRY
1 Raj CS ENGLAND
2 Allen PS USA
3 Bradly DP CHINA
4 Tim FS JAPAN
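For reference, a minimal standard-library sketch of that lookup, assuming the two inputs are comma-separated files named input1.csv and input2.csv with the headers shown above:

import csv

# Build a lookup table from the second file: ID -> COUNTRY
with open('input2.csv', newline='') as f2:
    country_by_id = {row['ID']: row['COUNTRY'] for row in csv.DictReader(f2)}

# Read the first file, append the matching COUNTRY, and write the result
with open('input1.csv', newline='') as f1, open('output.csv', 'w', newline='') as out:
    reader = csv.DictReader(f1)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ['COUNTRY'])
    writer.writeheader()
    for row in reader:
        row['COUNTRY'] = country_by_id.get(row['ID'], 'N/A')
        writer.writerow(row)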
There's probably a more efficient way to do it, but basically: create a nested dictionary (using the ID as the key) with the other column names and their values stored under that ID. Then, as you iterate through each file, it updates the dictionary on the ID key.
Finally, put the rows together into a list and write them to a file:
input_files = ['C:/test/input_1.csv', 'C:/test/input_2.csv']
lookup_column_name = 'ID'

output_dict = {}
for file in input_files:
    file = open(file, 'r')
    header = {}
    # Read each line in the csv
    for idx, line in enumerate(file.readlines()):
        # If it's the first line, store it as the header
        if idx == 0:
            header = line.split(',')
            # Get the index value of the lookup column from the list of headers
            header_dict = {idx: x.strip() for idx, x in enumerate(header)}
            lookup_column_idx = dict((v, k) for k, v in header_dict.items())[lookup_column_name]
            continue
        line_split = line.split(',')
        # Initialize the dictionary entry for this lookup value
        if line_split[lookup_column_idx] not in output_dict.keys():
            output_dict[line_split[lookup_column_idx]] = {}
        # If not the lookup column, add the other column and its data to the dictionary
        for idx, value in enumerate(line_split):
            if idx != lookup_column_idx:
                output_dict[line_split[lookup_column_idx]].update({header_dict[idx]: value})

# Create a list of the rows that will be written to file under the correct columns
rows = []
for k, v in output_dict.items():
    header = [lookup_column_name] + list(v.keys())
    row = [k] + [output_dict[k][x].strip() for x in header if x != lookup_column_name]
    row = ','.join(row) + '\n'
    rows.append(row)

# Final list of rows, beginning with the header
output_lines = [','.join(header) + '\n'] + rows

# Writing to file
output = open('C:/test/output.csv', 'w')
output.writelines(output_lines)
output.close()
To do this without pandas (and assuming you know the structure of your data and that it fits in memory), you can iterate through the CSV file and store the results in a dictionary, filling entries so that the ID maps to the other information you want to keep.
You can do this for both CSV files and join them manually afterwards by iterating over the keys of the dictionary.
import csv

input1 = './file1.csv'
input2 = './file2.csv'
with open(input1, 'r', encoding='utf-8-sig') as inputlist:
    with open(input2, 'r', encoding='utf-8-sig') as inputlist1:
        with open('./output.csv', 'w', newline='', encoding='utf-8-sig') as output:
            reader = csv.reader(inputlist)
            reader2 = csv.reader(inputlist1)
            writer = csv.writer(output)
            # Build the lookup from the second file: ID -> COUNTRY
            dict1 = {}
            for xl in reader2:
                dict1[xl[0]] = xl[1]
            # Append the matching COUNTRY (or N/A) to each row of the first file
            for i in reader:
                if i[0] in dict1:  # ID is the first column in the sample data
                    i.append(dict1[i[0]])
                    writer.writerow(i)
                else:
                    i.append("N/A")
                    writer.writerow(i)
I have a CSV which is in the format:
Name1,Value1
,Value2
,Value3
Name2,Value40
,Value50
,Value60
Name3,Value5
,Value10
,Value15
There is not a set number of "values" per "name".
There is no pattern to the names.
I want to read the Values for Each Name into a dict such as:
Name1 : [Value1,Value2,Value3]
Name2 : [Value40,Value50,Value60]
etc.
My current code is this:
CSVFile = open("GroupsCSV.csv")
Reader = csv.reader(CSVFile)
for row in Reader:
    if row[0] and row[2]:
        objlist = []
        objlist.append(row[2])
        for row in Reader:
            if not row[0] and row[2]:
                objlist.append(row[2])
            else:
                break
        print(objlist)
This half-works.
It will only do Name1, Name3, Name5, Name7, etc.
I can't seem to find a way to stop it skipping.
I'd prefer to do this without the use of something like lambda (as it's not something I fully understand yet!).
EDIT: An image of the example CSV was attached here; the real data has another, unnecessary column, hence the row[2] in the code.
Try pandas:
import pandas as pd
df = pd.read_csv('your_file.csv', header=None)
(df.ffill() # fill the blank with the previous Name
.groupby([0])[1] # collect those with same name
.apply(list) # put those in a list
.to_dict() # make a dictionary
)
Output:
{'Name1': ['Value1', 'Value2', 'Value3'],
'Name2': ['Value40', 'Value50', 'Value60'],
'Name3': ['Value5', 'Value10', 'Value15']}
Update: the pure Python 3 solution:
with open('your_file.csv') as f:
    lines = f.readlines()

d = {}
for line in lines:
    row = line.split(',')
    if row[0] != '':
        key = row[0]
        d[key] = []
    d[key].append(row[1].strip())  # strip the trailing newline
d
I think the issue you are facing is because of your nested loop. Both loops are pointing to the same iterator. You start the second loop after it finds Name1 and break out of it when it finds Name2. By the time the outer loop continues after the break, you have already consumed Name2, so it gets skipped.
You could have both conditions in the same loop:
# with open("GroupsCSV.csv") as csv_file:
# reader = csv.reader(csv_file)
reader = [[1,2,3],[None,5,6]] # Mocking the csv input
objlist = []
for row in reader:
if row[0] and row[2]:
objlist.clear()
objlist.append(row[2])
elif not row[0] and row[2]:
objlist.append(row[2])
print(objlist)
EDIT: I have updated the code to provide a testable output.
The printed output looks as follows:
[3]
[3, 6]
I am new to Python and I prepared a script that will modify the following CSV file accordingly:
1) Each row that contains multiple Gene entries separated by the /// such as:
C16orf52 /// LOC102725138 1.00551
should be transformed to:
C16orf52 1.00551
LOC102725138 1.00551
2) The same gene may have different ratio values
AASDHPPT 0.860705
AASDHPPT 0.983691
and we want to keep only the pair with the highest ratio value (delete the pair AASDHPPT 0.860705)
Here is the script I wrote but it does not assign the correct ratio values to the genes:
import csv
import pandas as pd

with open('2column.csv', 'rb') as f:
    reader = csv.reader(f)
    a = list(reader)

gene = []
ratio = []
for t in range(len(a)):
    if '///' in a[t][0]:
        s = a[t][0].split('///')
        gene.append(s[0])
        gene.append(s[1])
        ratio.append(a[t][1])
        ratio.append(a[t][1])
    else:
        gene.append(a[t][0])
        ratio.append(a[t][1])
    gene[t] = gene[t].strip()

newgene = []
newratio = []
for i in range(len(gene)):
    g = gene[i]
    r = ratio[i]
    if g not in newgene:
        newgene.append(g)
        for j in range(i + 1, len(gene)):
            if g == gene[j]:
                if ratio[j] > r:
                    r = ratio[j]
        newratio.append(r)

for i in range(len(newgene)):
    print newgene[i] + '\t' + newratio[i]

if len(newgene) > len(set(newgene)):
    print 'missionfailed'
Thank you very much for any help or suggestion.
Try this:
with open('2column.csv') as f:
    lines = f.read().splitlines()

new_lines = {}
for line in lines:
    cols = line.split(',')
    for part in cols[0].split('///'):
        part = part.strip()
        if part not in new_lines:
            new_lines[part] = cols[1]
        else:
            if float(cols[1]) > float(new_lines[part]):
                new_lines[part] = cols[1]

import csv
with open('clean_2column.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ',
                        quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for k, v in new_lines.items():
        writer.writerow([k, v])
First of all, if you're importing pandas, know that it has I/O tools for reading CSV files.
So first, let's load the file that way:
df = pd.read_csv('2column.csv')
Then, you can extract the indexes where you have your '///' pattern:
l = list(df[df['Gene Symbol'].str.contains('///')].index)
Then, you can create your new rows:
for i in l:
    for sub in df['Gene Symbol'][i].split('///'):
        # note: DataFrame.append was removed in pandas 2.0; newer code would use pd.concat
        df = df.append(pd.DataFrame([[sub, df['Ratio(ifna vs. ctrl)'][i]]], columns=df.columns))
Then, drop the old ones:
df = df.drop(df.index[l])
Then, I'll do a little trick to remove your lowest duplicate values. First, I'll sort them by 'Ratio(ifna vs. ctrl)', then I'll drop all the duplicates but the first one:
df = df.sort('Ratio(ifna vs. ctrl)', ascending=False).drop_duplicates('Gene Symbol', keep='first')  # df.sort() is df.sort_values() in newer pandas
If you want to keep your sorting by Gene Symbol and reset indexes to have simpler ones, simply do:
df = df.sort('Gene Symbol').reset_index(drop=True)
If you want to re-export your modified data to your csv, do:
df.to_csv('2column.csv')
EDIT: I edited my answer to correct syntax errors; I've tested this solution with your CSV and it worked perfectly :)
This should work.
It uses the dictionary suggestion of Peter.
import csv

with open('2column.csv', 'r') as f:
    reader = csv.reader(f)
    original_file = list(reader)

# gets rid of the header
original_file = original_file[1:]

# create an empty dictionary
genes_ratio = {}

# loop over every row in the original file
for row in original_file:
    gene_name = row[0]
    gene_ratio = row[1]

    # check if /// is in the string; if so, split the string
    if '///' in gene_name:
        gene_names = gene_name.split('///')

        # loop over all the resulting components
        for gene in gene_names:
            # check if the component is in the dictionary;
            # if not, set its value to gene_ratio
            if gene not in genes_ratio:
                genes_ratio[gene] = gene_ratio
            # if it is in the dictionary, compare the stored value to gene_ratio
            # and overwrite it if the stored value is smaller
            elif genes_ratio[gene] < gene_ratio:
                genes_ratio[gene] = gene_ratio
    else:
        if gene_name not in genes_ratio:
            genes_ratio[gene_name] = gene_ratio
        elif genes_ratio[gene_name] < gene_ratio:
            genes_ratio[gene_name] = gene_ratio

# loop over the dictionary and print gene names and their ratio values
for key in genes_ratio:
    print key, genes_ratio[key]
I have a large CSV file from which I am reading some data and adding that data to a dictionary. My CSV file has approximately 360000 rows, but my dictionary ends up with a length of only 5700. I know my CSV has a lot of duplicates, but I expect about 50000 unique rows. I know that Python dictionaries have no limit on size. My code reads all 360000 entries in the file, writes them to another file, and terminates. All this processing finishes in about 2 seconds without any exceptions. How do I know for sure that all of the items in the CSV that I process are actually being added to the dictionary?
The code that I am using is as follows:
with open("read.csv", 'rb') as input1:
with open("write.csv", 'wb') as output1:
reader = csv.reader(input1, delimiter="\t")
writer = csv.writer(output1, delimiter="\t")
#Just testing if my program reads the whole CSV file
for row in reader:
count += 1
print count # Gives 360000
input1.seek(0)
for row in reader:
#print row[1] + "\t" + row[2]
dict1.update({row[1] : [ row[2], row[0] ]})
print len(dict1) # Gives 5700
for key in dict1:
ext_id = key
list1 = dict1[key]
name = list1[0]
url = list1[1]
writer.writerow([ext_id, name, url])
EDIT
I am not sure people are understanding what I am trying to do and how it is relevant, but still, I'll explain.
My CSV file has 3 columns for each row. Their format is as follows:
URL+unique value | unique value | some name
However, the rows are duplicated in the CSV and I want another CSV which just has rows without any duplicates.
The keys in your dictionary are row[1]. The size of the dictionary will depend on how many different values of this field are in the input. It does not matter if the rest of the row (row[2], row[0]) differs between rows.
Example:
a,foo,1
b,bar,2
c,foo,3
d,baz,4
This will result in a dictionary of length 3 if the second field (index 1) is used as the key. The result will be:
{'foo':['3', 'c'],
'bar':['2', 'b'],
'baz':['4', 'd']}
The entry from the first line will be overwritten by the third. Of course, the 'order' can be different, since a dictionary has no guaranteed order.
EDIT: if you're just checking for uniqueness, there's no need to put this into a dictionary (dictionaries are designed for fast lookup and grouping). Use a set here instead.
out_ = set()
for row in reader:
    # csv.reader yields lists, which aren't hashable, so cast each row
    # to a tuple before adding it to the set
    out_.add(tuple(row))

for row in out_:
    writer.writerow([row[1], row[0], row[2]])
Here's the quickest check I can think of.
set_headers = {row[1] for row in reader}
This is a set containing the 2nd column (that is to say, row[1]) of every row in your CSV. As you probably know, sets cannot contain duplicates, so this gives you the number of UNIQUE values in that column.
Since dict.update REPLACES values, this is exactly what you're going to see with len(dict1); in fact, len(set_headers) == len(dict1). Each time you iterate through a row, dict.update CHANGES THE VALUE stored under the key row[1]. That is probably fine if you don't care about the earlier values, but somehow I don't think that's the case.
Instead, do this:
for row in reader:
    dict1.setdefault(row[1], []).append((row[0], row[2]))
This will end up with something like:
dict1 = {"foo": [(row1_col0,row1_col2),(row3_col0,row3_col2)],
"baz": [(row2_col0,row2_col2)]}
from input of:
row1_col0, foo, row1_col2
row2_col0, baz, row2_col2
row3_col0, foo, row3_col2
Now you can do, for instance:
for header in dict1:
    for row in dict1[header]:
        print("{}\t{}".format(row[0], row[1]))
The simplest way to make sure you are getting everything is to add two test lists. Add test1, test2 = [], []; then, right after you update your dictionary, add test1.append(row[1]) and if row[1] not in test2: test2.append(row[1]). You can then print the length of both lists and check that the length of test2 matches your dictionary and the length of test1 matches the number of rows in your input CSV.
with open("read.csv", 'rb') as input1:
with open("write.csv", 'wb') as output1:
reader = csv.reader(input1, delimiter="\t")
writer = csv.writer(output1, delimiter="\t")
#Just testing if my program reads the whole CSV file
test1,test2=[],[]
for row in reader:
count += 1
print count # Gives 360000
input1.seek(0)
for row in reader:
#print row[1] + "\t" + row[2]
dict1.update({row[1] : [ row[2], row[0] ]})
test1.append(row[1])
if row[1] not in test2: test2.append(row[1])
print 'dictionary length:', len(dict1) # Gives 5700
print 'test 1 length (total values):',len(test1)
print 'test2 length (unique key values):',len(test2)
for key in dict1:
ext_id = key
list1 = dict1[key]
name = list1[0]
url = list1[1]
writer.writerow([ext_id, name, url])