I have a csv that looks like this:
HA-MASTER,CategoryID
38231-S04-A00,14
39790-S10-A03,14
38231-S04-A00,15
39790-S10-A03,15
38231-S04-A00,16
39790-S10-A03,16
38231-S04-A00,17
39790-S10-A03,17
38231-S04-A00,18
39790-S10-A03,18
38231-S04-A00,19
39795-ST7-000,75
57019-SN7-000,75
38251-SV4-911,75
57119-SN7-003,75
57017-SV4-A02,75
39795-ST7-000,76
57019-SN7-000,76
38251-SV4-911,76
57119-SN7-003,76
57017-SV4-A02,76
What I would like to do is reformat this data so that there is only one line for each CategoryID, for example:
14,38231-S04-A00,39790-S10-A03
76,39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02
I have not found a way to do this programmatically in Excel. I have over 100,000 lines. Is there a way to do something like this using Python's csv reader and writer?
Yes there is a way:
import csv
myDict = dict()
def addRowToDict(row):
    key = row[1]
    if key in myDict:
        # append values if entry already exists
        myDict[key].append(row[0])
    else:
        # create entry
        myDict[key] = [row[1], row[0]]
inFile = 'C:/Users/xxx/Desktop/pythons/test.csv'
outFile = 'C:/Users/xxx/Desktop/pythons/testOut.csv'
with open(inFile, 'r') as f:
    reader = csv.reader(f)
    ignore = True
    for row in reader:
        if ignore:
            # ignore first row
            ignore = False
        else:
            # add entry to dict
            addRowToDict(row)
with open(outFile, 'w') as f:
    writer = csv.writer(f)
    # write everything to file
    writer.writerows(myDict.values())
Just edit inFile and outFile
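For Python 3, the same dictionary-of-lists idea can be sketched as follows. This is only a sketch: the function name group_by_category and the in-memory sample are assumptions made for illustration, not part of the answer above.

```python
import csv
import io


def group_by_category(lines):
    """Map each CategoryID to the list of HA-MASTER values, in input order."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    groups = {}
    for ha_master, category_id in reader:
        groups.setdefault(category_id, []).append(ha_master)
    return groups


# Hypothetical sample; with a real file, pass open(path, newline='') instead.
sample = io.StringIO(
    "HA-MASTER,CategoryID\n"
    "38231-S04-A00,14\n"
    "39790-S10-A03,14\n"
    "38231-S04-A00,19\n"
)
groups = group_by_category(sample)

# Write one line per category: the ID first, then one column per value.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for category_id, masters in groups.items():
    writer.writerow([category_id] + masters)
result = out.getvalue()
```

Passing the values as separate list items to writerow keeps each HA-MASTER value in its own column.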
This is pretty trivial using a dictionary of lists (Python 2.7 solution):
#!/usr/bin/env python
import fileinput
categories={}
for line in fileinput.input():
    # Skip the first line in the file (assuming it is a header).
    if fileinput.isfirstline():
        continue
    # Split the input line into two fields.
    ha_master, cat_id = line.strip().split(',')
    # If the given category ID is NOT already in the dictionary,
    # add a new empty list.
    if cat_id not in categories:
        categories[cat_id]=[]
    # Append a new value to the category.
    categories[cat_id].append(ha_master)
# Iterate over all category IDs and lists. Use ','.join() to
# output a comma-separated line from a Python list.
for k,v in categories.iteritems():
    print '%s,%s' %(k,','.join(v))
I would read in the entire file, create a dictionary where the key is the ID and the value is a list of the other data.
data = {}
with open("test.csv", "r") as f:
    for line in f:
        temp = line.rstrip().split(',')
        if len(temp[0].split('-')) == 3: # => specific format that ignores the header...
            if temp[1] in data:
                data[temp[1]].append(temp[0])
            else:
                data[temp[1]] = [temp[0]]
with open("output.csv", "w+") as f:
    # items() works in both Python 2 and Python 3
    for id, datum in data.items():
        f.write("{},{}\n".format(id, ','.join(datum)))
Use pandas!
import pandas
csv_data = pandas.read_csv('path/to/csv/file')
use_this = csv_data.groupby('CategoryID')['HA-MASTER'].apply(list)
You will get a Series that maps each CategoryID to the list of its HA-MASTER values; now you just have to format it.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Cheers.
I see many beautiful answers have come up while I was trying it, but I'll post mine as well.
import re
cat = dict()
with open('your csv file','r') as csvIN:
    for line in csvIN:
        line = line.rstrip()
        # skip the header (any line that does not start with a digit)
        if not re.search('^[0-9]+',line): continue
        ham, cid = line.split(',')
        if cat.get(cid,False):
            cat[cid] = cat[cid] + ',' + ham
        else:
            cat[cid] = ham
with open('out.csv','w') as csvOUT:
    for i in sorted(cat):
        csvOUT.write(i + ',' + cat[i] + '\n')
Pandas approach:
import pandas as pd
df = pd.read_csv('data.csv')
#new = df.groupby('CategoryID')['HA-MASTER'].apply(lambda row: '%s' % ','.join(row))
new = df.groupby('CategoryID')['HA-MASTER'].agg(','.join)
new.to_csv('out.csv')
out.csv:
14,"38231-S04-A00,39790-S10-A03"
15,"38231-S04-A00,39790-S10-A03"
16,"38231-S04-A00,39790-S10-A03"
17,"38231-S04-A00,39790-S10-A03"
18,"38231-S04-A00,39790-S10-A03"
19,38231-S04-A00
75,"39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02"
76,"39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02"
This was an interesting question. My solution was to append each new item for a given key to a single string in the value, along with a comma to delimit the columns.
set_vals = {}
with open('Input01.csv') as input_file:
    file_lines = [item.strip() for item in input_file.readlines()]
# file_lines[1:] skips the header row
for item in [i.split(',') for i in file_lines[1:]]:
    if item[1] in set_vals:
        set_vals[item[1]] = set_vals[item[1]] + ',' + item[0]
    else:
        set_vals[item[1]] = item[0]
with open('Results01.csv','w') as output_file:
    for i in sorted(set_vals.keys()):
        output_file.write('{},{}\n'.format(i, set_vals[i]))
MaxU's implementation, using pandas, has good potential and looks really elegant, but all the values are placed into one cell, because each of the strings is double-quoted. For example, the line corresponding to the code '18'—"38231-S04-A00,39790-S10-A03"—would place both values in the second column.
import csv
from collections import defaultdict
inpath = '' # Path to input CSV
outpath = '' # Path to output CSV
output = defaultdict(list) # To hold {category: [serial_numbers]}
with open(inpath) as f:
    for row in csv.DictReader(f):
        output[row['CategoryID']].append(row['HA-MASTER'])
with open(outpath, 'w') as f:
    f.write('CategoryID,HA-MASTER\n')
    for category, serial_numbers in output.items():
        # Join the list so each serial number lands in its own column
        row = '%s,%s\n' % (category, ','.join(serial_numbers))
        f.write(row)
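The double-quoting described above comes from the CSV rules themselves: a field that contains the delimiter has to be quoted. A small sketch with the standard csv module (the variable names are illustrative only) shows the difference between writing the joined string as one field and writing each value as its own field:

```python
import csv
import io

values = ["38231-S04-A00", "39790-S10-A03"]

# One field that itself contains a comma: the writer must quote it.
joined = io.StringIO()
csv.writer(joined, lineterminator="\n").writerow(["18", ",".join(values)])

# One field per value: no quoting is needed.
separate = io.StringIO()
csv.writer(separate, lineterminator="\n").writerow(["18"] + values)

print(joined.getvalue(), end="")    # 18,"38231-S04-A00,39790-S10-A03"
print(separate.getvalue(), end="")  # 18,38231-S04-A00,39790-S10-A03
```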
I have a csv file that looks like this:
I have a column called “Inventory”. I pulled the data in that column from another source, which put it in a dictionary format, as you can see.
What I need to do is iterate through the 1000+ lines: if the keywords comforter, sheets and pillow all exist in a row, write “bedding” to the “Location” column for that row; otherwise write “home-fashions”.
I have gotten as far as an if statement that tells me whether a row goes into “bedding” or “home-fashions”; I just do not know how to write the corresponding result to the “Location” field for that line.
In my script I'm only printing to see my results, but in the end I want to write to the same CSV file.
from csv import DictReader
with open('test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for line in csv_dict_reader:
        if 'comforter' in line['Inventory'] and 'sheets' in line['Inventory'] and 'pillow' in line['Inventory']:
            print('Bedding')
            print(line['Inventory'])
        else:
            print('home-fashions')
            print(line['Inventory'])
The last column of your csv contains commas. You cannot read it using DictReader.
import re
data = []
with open('test.csv', 'r') as f:
    # Get the header row
    header = next(f).strip().split(',')
    for line in f:
        # Parse 4 columns
        row = re.findall('([^,]*),([^,]*),([^,]*),(.*)', line)[0]
        # Create a dictionary of one row
        item = {header[0]: row[0], header[1]: row[1], header[2]: row[2],
                header[3]: row[3]}
        # Add each row to the list
        data.append(item)
After preparing your data, you can check with your conditions.
for item in data:
    if all([x in item['Inventory'] for x in ['comforter', 'sheets', 'pillow']]):
        item['Location'] = 'Bedding'
    else:
        item['Location'] = 'home-fashions'
Write output to a file.
import csv
# newline='' stops the csv module from writing blank lines on Windows
with open('output.csv', 'w', newline='') as f:
    dict_writer = csv.DictWriter(f, data[0].keys())
    dict_writer.writeheader()
    dict_writer.writerows(data)
csv.DictReader returns a dict, so just assign the new value to the column:
if 'comforter' in line['Inventory'] and ...:
    line['Location'] = 'Bedding'
else:
    line['Location'] = 'home-fashions'
print(line['Inventory'])
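Putting the assignment together with the write step, a minimal sketch could look like the one below. It assumes the Inventory field is quoted in the file so that DictReader can parse it (if it is not, the row has to be split by hand first). The helper name tag_locations, the Name column and the two sample rows are made up for illustration.

```python
import csv
import io

KEYWORDS = ("comforter", "sheets", "pillow")


def tag_locations(rows):
    """Set 'Location' on each row dict based on the 'Inventory' text."""
    tagged = []
    for row in rows:
        has_all = all(word in row["Inventory"] for word in KEYWORDS)
        row["Location"] = "bedding" if has_all else "home-fashions"
        tagged.append(row)
    return tagged


# Hypothetical two-row sample; the real file's other columns are unknown.
source = io.StringIO('''Name,Location,Inventory
set1,,"{'comforter': 1, 'sheets': 2, 'pillow': 2}"
set2,,"{'curtain': 1}"
''')
reader = csv.DictReader(source)
rows = tag_locations(reader)

# Write the header and the updated rows back out.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
```

To overwrite the same CSV file, read all rows into a list first, close the file, and only then reopen it for writing.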
I am new to Python and I prepared a script that will modify the following csv file
accordingly:
1) Each row that contains multiple Gene entries separated by the /// such as:
C16orf52 /// LOC102725138 1.00551
should be transformed to:
C16orf52 1.00551
LOC102725138 1.00551
2) The same gene may have different ratio values
AASDHPPT 0.860705
AASDHPPT 0.983691
and we want to keep only the pair with the highest ratio value (delete the pair AASDHPPT 0.860705)
Here is the script I wrote but it does not assign the correct ratio values to the genes:
import csv
import pandas as pd
with open('2column.csv','rb') as f:
    reader = csv.reader(f)
    a = list(reader)
gene = []
ratio = []
for t in range(len(a)):
    if '///' in a[t][0]:
        s = a[t][0].split('///')
        gene.append(s[0])
        gene.append(s[1])
        ratio.append(a[t][1])
        ratio.append(a[t][1])
    else:
        gene.append(a[t][0])
        ratio.append(a[t][1])
    gene[t] = gene[t].strip()
newgene = []
newratio = []
for i in range(len(gene)):
    g = gene[i]
    r = ratio[i]
    if g not in newgene:
        newgene.append(g)
        for j in range(i+1,len(gene)):
            if g==gene[j]:
                if ratio[j]>r:
                    r = ratio[j]
        newratio.append(r)
for i in range(len(newgene)):
    print newgene[i] + '\t' + newratio[i]
if len(newgene) > len(set(newgene)):
    print 'missionfailed'
Thank you very much for any help or suggestion.
Try this:
with open('2column.csv') as f:
    lines = f.read().splitlines()
new_lines = {}
for line in lines:
    cols = line.split(',')
    for part in cols[0].split('///'):
        part = part.strip()
        if not part in new_lines:
            new_lines[part] = cols[1]
        else:
            if float(cols[1]) > float(new_lines[part]):
                new_lines[part] = cols[1]
import csv
with open('clean_2column.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ',
                        quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for k, v in new_lines.items():
        writer.writerow([k, v])
First of all, if you're importing pandas anyway, note that it has its own I/O tools for reading CSV files.
So first, let's read the file that way:
df = pd.read_csv('2column.csv')
Then, you can extract the indexes where you have your '///' pattern:
l = list(df[df['Gene Symbol'].str.contains('///')].index)
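Then, you can create your new rows:
for i in l:
    for sub in df['Gene Symbol'][i].split('///'):
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        new_row = pd.DataFrame([[sub.strip(), df['Ratio(ifna vs. ctrl)'][i]]], columns=df.columns)
        df = pd.concat([df, new_row], ignore_index=True)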
Then, drop the old ones :
df=df.drop(df.index[l])
Then, I'll do a little trick to remove your lowest duplicate values. First, I'll sort them by 'Ratio(ifna vs. ctrl)' in descending order, then I'll drop all the duplicates but the first one:
df = df.sort_values('Ratio(ifna vs. ctrl)', ascending=False).drop_duplicates('Gene Symbol', keep='first')
If you want to keep your sorting by Gene Symbol and reset the indexes to simpler ones, simply do:
df = df.sort_values('Gene Symbol').reset_index(drop=True)
If you want to re-export your modified data to your csv, do :
df.to_csv('2column.csv')
EDIT : I edited my answer to correct syntax errors, I've tested this solution with your csv and it worked perfectly :)
This should work.
It uses the dictionary suggestion of Peter.
import csv
with open('2column.csv','r') as f:
    reader = csv.reader(f)
    original_file = list(reader)
# gets rid of the header
original_file = original_file[1:]
# create an empty dictionary
genes_ratio = {}
# loop over every row in the original file
for row in original_file:
    gene_name = row[0]
    gene_ratio = row[1]
    # split on /// (a name without /// yields a single component)
    for gene in gene_name.split('///'):
        # remove the spaces left around ///
        gene = gene.strip()
        # if the component is not in the dictionary, or the stored value is
        # smaller, store gene_ratio (compare as floats, not as strings)
        if gene not in genes_ratio or float(genes_ratio[gene]) < float(gene_ratio):
            genes_ratio[gene] = gene_ratio
# loop over dictionary and print gene names and their ratio values
for key in genes_ratio:
    print('%s\t%s' % (key, genes_ratio[key]))
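The split-then-keep-the-maximum logic can also be packaged as a small function. The sketch below targets Python 3; the name max_ratio_per_gene and the in-memory sample are assumptions for illustration, while the column headers are taken from the pandas answer above.

```python
import csv
import io


def max_ratio_per_gene(rows):
    """Return {gene: highest ratio} after splitting '///'-joined symbols."""
    best = {}
    for symbols, ratio_text in rows:
        ratio = float(ratio_text)  # compare numerically, not as strings
        for gene in symbols.split("///"):
            gene = gene.strip()
            if gene not in best or ratio > best[gene]:
                best[gene] = ratio
    return best


# Hypothetical sample built from the rows quoted in the question.
sample = io.StringIO(
    "Gene Symbol,Ratio(ifna vs. ctrl)\n"
    "C16orf52 /// LOC102725138,1.00551\n"
    "AASDHPPT,0.860705\n"
    "AASDHPPT,0.983691\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
best = max_ratio_per_gene(reader)
```

Converting to float before comparing matters: as strings, '10.5' sorts before '9.1'.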
I want to add up lines in a csv file (it's a BOM) if they are identical and in the same part, but not if they are of a specific type.
Here is an example to make it clearer:
LevelName,Type,Amount
Part_1,a,1
Part_1,a,1
Part_1,b,1
Part_1,c,1
Part_1,d,1
Part_1,f,1
Part_2,a,1
Part_2,c,1
Part_2,d,1
Part_2,a,1
Part_2,a,1
Part_2,d,1
Part_2,d,1
So I need to sum up all Types within a Part, but not if the type is 'd'.
The result should look like this:
LevelName,Type,Amount
Part_1,a,2
Part_1,b,1
Part_1,c,1
Part_1,d,1
Part_1,f,1
Part_2,a,3
Part_2,c,1
Part_2,d,1
Part_2,d,1
Part_2,d,1
Unfortunately I cannot use any external library, so pandas is not an option here.
This is how far I got:
import csv
map = {}
with open('infile.csv', 'rt') as f:
    reader = csv.reader(f, delimiter = ',')
    with open('outfile.csv', 'w', newline='') as fout:
        writer = csv.writer(fout, delimiter=';', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(next(reader))
        for row in reader:
            (level, type, count) = row
            if not type=='d':
Well, here I just can't get any further...
Thanks for any hint!
OK, sorry about using pandas. Then first read the file, saving the results in a defaultdict:
from collections import defaultdict
grouped = defaultdict(int)
if not type=='d':
    grouped[(level, type)] += int(count)
Then you can save the result of that dict to a file
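Note that the grouped dict above drops the 'd' rows, while the expected result keeps each of them. A minimal standard-library sketch that sums the other types and passes 'd' rows through unchanged might look like this; merge_bom and keep_separate are hypothetical names chosen for illustration.

```python
import csv
import io


def merge_bom(rows, keep_separate=("d",)):
    """Sum amounts per (level, type); rows of a kept-separate type stay as-is."""
    merged = []    # output rows as [level, type, amount], in first-seen order
    position = {}  # (level, type) -> index into merged
    for level, type_, amount in rows:
        key = (level, type_)
        if type_ in keep_separate or key not in position:
            if type_ not in keep_separate:
                position[key] = len(merged)
            merged.append([level, type_, int(amount)])
        else:
            merged[position[key]][2] += int(amount)
    return merged


# The Part_2 rows from the question, as an in-memory sample.
sample = io.StringIO(
    "LevelName,Type,Amount\n"
    "Part_2,a,1\nPart_2,c,1\nPart_2,d,1\n"
    "Part_2,a,1\nPart_2,a,1\nPart_2,d,1\nPart_2,d,1\n"
)
reader = csv.reader(sample)
header = next(reader)
merged = merge_bom(reader)

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(merged)
```

Keeping a list plus an index dict preserves the order in which each (part, type) pair first appears, which matches the expected output in the question.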
import csv
import os
cwd = os.getcwd()
master = {}
file = csv.DictReader(open(cwd+'\\infile.csv', 'rb'), delimiter=',')
data = [row for row in file]
for row in data:
    master.setdefault(row['LevelName'], {})
    if row['Type'] != 'd':
        master[row['LevelName']].setdefault(row['Type'], 0)
        master[row['LevelName']][row['Type']] += int(row['Amount'])
print (master)
Not as simple as the solution above, but this shows how to iterate over the data.
Or I suppose you could concatenate the 'LevelName' and the 'Type' so that you have one less line of code. It depends on what you want.
for row in data:
    if row['Type'] != 'd':
        master.setdefault(row['LevelName'] + row['Type'], 0)
        master[row['LevelName'] + row['Type']] += int(row['Amount'])
print (master)
EDIT
To write back to the original format, something like:
out = open(cwd+'\\outfile.csv', 'wb')
out.write('LevelName,Type,Amount\n')
for k,v in master.iteritems():
    for z in v:
        out.write('%s,%s,%s\n' % (k, z, str(v[z])))
out.close()
I want to write from a csv file to a dictionary.
But:
The first word from a row should be the key
All other words in the row should be separate values for this key.
My code so far:
def coordinates(text):
    import csv
    reader = csv.reader(open(text))
    d = {}
    for row in reader:
        key = row[0]
        d[key] = row[1:]
    print(d)
coordinates('luchthavens2.csv')
With this code, the whole row ends up as the key in my dictionary.
Who can help?
EDIT:
Input file looks like this:
BIN,"Bamiyan","Bamiyan","Afghanistan","AF",34.800000,67.816667,701,"Afghanistan",\N,\N,1149361
BST,"Bost","Bost","Afghanistan","AF",31.550000,64.366667,701,"Afghanistan",\N,1134720,1149361
CCN,"Chakcharan","Chakcharan","Afghanistan","AF",34.533333,65.266667,701,"Afghanistan",\N,\N,1149361
All from an excel file called luchthavens2.csv, the positions of the text are A1-A2-A3-etc.
You should find it here: https://expirebox.com/download/bb8cb3a39f9be041743a8b86db89093b.html
Output:
{
'CCN,"Chakcharan","Chakcharan","Afghanistan","AF",34.533333,65.266667,701,"Afghanistan",\\N,\\N,1149361': [],
'BST,"Bost","Bost","Afghanistan","AF",31.550000,64.366667,701,"Afghanistan",\\N,1134720,1149361': [],
'BIN,"Bamiyan","Bamiyan","Afghanistan","AF",34.800000,67.816667,701,"Afghanistan",\\N,\\N,1149361': []
}
EDIT:
I've changed my input file to a text file, then back again to a csv file. Strangely enough this worked, I can read it without any problems.
If you run the following,
import pprint
import csv
def coordinates(text):
    ret = {}
    with open(text, 'r') as fp:
        reader = csv.reader(fp)
        for row in reader:
            key = row.pop(0)
            ret[key] = row
    return ret
data = coordinates('data.csv')
pprint.pprint(data)
On the following file,
$ cat data.csv
AAA,"Blub",25.25
BBB,"Blob",27.27
Then you will get,
$ python stackoverflow.py
{'AAA': ['Blub', '25.25'], 'BBB': ['Blob', '27.27']}
In your input file, the quotation marks are messed up.
"BIN,""Bamiyan"",""Bamiyan"",""Afghanistan"",""AF"",34.800000,67.816667,701,""Afghanistan"",\N,\N,1149361"
"BST,""Bost"",""Bost"",""Afghanistan"",""AF"",31.550000,64.366667,701,""Afghanistan"",\N,1134720,1149361"
"CCN,""Chakcharan"",""Chakcharan"",""Afghanistan"",""AF"",34.533333,65.266667,701,""Afghanistan"",\N,\N,1149361"
should be
"BIN","Bamiyan","Bamiyan","Afghanistan","AF",34.800000,67.816667,701,"Afghanistan",\N,\N,1149361
"BST","Bost","Bost","Afghanistan","AF",31.550000,64.366667,701,"Afghanistan",\N,1134720,1149361
"CCN","Chakcharan","Chakcharan","Afghanistan","AF",34.533333,65.266667,701,"Afghanistan",\N,\N,1149361
This should give you desired output.
Your data is not in proper CSV format
import csv
reader = csv.reader(open('luchthavens2.csv'))
d = {}
for row in reader:
    row = row[0].split(',')
    key = row[0]
    d[key] = row[1:]
Alternatively, you don't need the csv module at all for this, because your data is not proper CSV.
The code below solves your problem without the csv module:
d = dict()
with open('luchthavens2.csv') as fh:
    for row in fh:
        # drop the newline and the quotes wrapping the whole line
        row = row.strip().strip('"').split(',')
        key = row[0]
        d[key] = row[1:]
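Another option, sketched here under the assumption that every line of the file is wrapped in one pair of quotes with the inner quotes doubled (as in the lines quoted earlier), is to run the csv parser twice: once to unwrap the outer quoting and once on the recovered record. The sample line is shortened for illustration.

```python
import csv

# One line as it appears in the damaged file: the whole record is wrapped in
# quotes and the inner quotes are doubled (shortened sample).
damaged = ['"BIN,""Bamiyan"",""Bamiyan"",""Afghanistan"",""AF"",34.800000"']

d = {}
for outer in csv.reader(damaged):
    # outer has a single field holding the real record; parse that field again.
    inner = next(csv.reader([outer[0]]))
    d[inner[0]] = inner[1:]
```

Compared with a plain split(','), the second pass also removes the quotes around each value and would cope with commas inside quoted values.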
I have a csv file like this:
pos,place
6696,266835
6698,266835
938,176299
940,176299
941,176299
947,176299
948,176299
949,176299
950,176299
951,176299
770,272944
2751,190650
2752,190650
2753,190650
I want to convert it to a dictionary like the following:
{266835:[6696,6698],176299:[938,940,941,947,948,949,950,951],190650:[2751,2752,2753]}
And then, fill the missing numbers in the range in the values:
{266835:[6696,6697,6698],176299:[938,939,940,941,942,943,944,945,946,947,948,949,950,951],190650:[2751,2752,2753]}
Right now I have tried to build the dictionary using the solution suggested here, but it overwrites the old value with the new one.
Any help would be great.
Here is a function that I wrote for converting a csv to a dict:
def csv2dict(filename):
    """
    reads in a two-column csv file and converts it into a dictionary
    """
    import csv
    with open(filename) as f:
        f.readline()#ignore first line
        reader=csv.reader(f,delimiter=',')
        mydict=dict((rows[1],rows[0]) for rows in reader)
    return mydict
Easiest is to use collections.defaultdict() with a list:
import csv
from collections import defaultdict
data = defaultdict(list)
with open(inputfilename, 'rb') as infh:
    reader = csv.reader(infh)
    next(reader, None) # skip the header
    for col1, col2 in reader:
        data[col2].append(int(col1))
        if len(data[col2]) > 1:
            data[col2] = range(min(data[col2]), max(data[col2]) + 1)
This also expands the ranges on the fly as you read the data.
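The snippet above relies on Python 2, where range() returns a list that can be appended to. A Python 3 sketch of the same idea, which groups first and fills the gaps afterwards, could look like this; fill_ranges and the sample data are illustrative assumptions.

```python
import csv
import io
from collections import defaultdict


def fill_ranges(lines):
    """Group 'pos' by 'place', then fill every gap between min and max."""
    positions = defaultdict(list)
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for pos, place in reader:
        positions[int(place)].append(int(pos))
    return {
        place: list(range(min(found), max(found) + 1))
        for place, found in positions.items()
    }


# Hypothetical sample taken from the first rows of the question's file.
sample = io.StringIO("pos,place\n6696,266835\n6698,266835\n770,272944\n")
filled = fill_ranges(sample)
```

Filling after grouping avoids rebuilding the range on every row, at the cost of holding the raw positions in memory first.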
Based on what you have tried:
import csv
from collections import defaultdict
# open archive reader
myFile = open("myfile.csv","rb")
archive = csv.reader(myFile, delimiter=',')
next(archive, None) # skip the header
arch_dict = defaultdict(list)
for row in archive:
    arch_dict[row[1]].append(row[0])
print arch_dict