How can I add records from a CSV file to a dictionary in a function whose input argument is the path to that CSV file?
Please help me complete this unfinished function:
def csv_file(p):
    dictionary = {}
    file = csv.reader(p)
    for rows in file:
        dictionary......(rows)
    return dictionary
You need to open the file first:
def csv_file(p):
    dictionary = {}
    with open(p, "rb") as infile:  # Assuming Python 2
        file = csv.reader(infile)  # Possibly DictReader might be more suitable,
        for row in file:           # but that...
            dictionary......(row)  # depends on what you want to do.
    return dictionary
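As a minimal sketch of the DictReader idea mentioned in the comment above, assuming Python 3 and assuming you want a dictionary keyed by one of the columns: the function name csv_to_dict and the key_field parameter are hypothetical and chosen only for illustration.

```python
import csv


def csv_to_dict(path, key_field):
    """Map each row's key_field value to the whole row (itself a dict).

    Hypothetical helper: the name and the key_field parameter are
    assumptions made for this sketch, not part of the original question.
    """
    records = {}
    # newline="" is what the csv documentation recommends for Python 3
    with open(path, newline="") as infile:
        for row in csv.DictReader(infile):
            # Rows sharing the same key overwrite earlier ones
            records[row[key_field]] = row
    return records
```

If the file has no column with unique values, a list of row dictionaries is probably the better container.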
It seems you haven't opened the file; you need to call open() for that.
Try the following code:
import csv
from pprint import pprint
INFO_LIST = []
with open('sample.csv') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    for i, row in enumerate(reader):
        if i == 0:
            TITLE_LIST = [var for var in row]
            continue
        INFO_LIST.append({title: info for title, info in zip(TITLE_LIST, row)})
pprint(INFO_LIST)
I use the following csv file as an example:
"REVIEW_DATE","AUTHOR","ISBN","DISCOUNTED_PRICE"
"1985/01/21","Douglas Adams",0345391802,5.95
"1990/01/12","Douglas Hofstadter",0465026567,9.95
"1998/07/15","Timothy ""The Parser"" Campbell",0968411304,18.99
"1999/12/03","Richard Friedman",0060630353,5.95
"2001/09/19","Karen Armstrong",0345384563,9.95
"2002/06/23","David Jones",0198504691,9.95
"2002/06/23","Julian Jaynes",0618057072,12.50
"2003/09/30","Scott Adams",0740721909,4.95
"2004/10/04","Benjamin Radcliff",0804818088,4.95
"2004/10/04","Randel Helms",0879755725,4.50
You can put all that logic into a function like so:
import csv

def csv_file(file_path):
    # Check that the file path is a string; if not, return None
    if not isinstance(file_path, str):
        return None
    # Create the list that will hold one dictionary per data row
    _info_list = []
    with open(file_path) as f:
        # Set the delimiter and quotechar
        reader = csv.reader(f, delimiter=',', quotechar='"')
        # We use enumerate here because we know the first row contains the column headings
        for i, row in enumerate(reader):
            # The first row contains the headings
            if i == 0:
                # Create a list from the first row
                title_list = [var for var in row]
                continue
            # Zip title_list and row together so that a dictionary comprehension is possible
            _info_list.append({title: info for title, info in zip(title_list, row)})
    return _info_list
APPENDIX
open()
zip
Dictionary Comprehension
Delimiter: the character that separates values, in this case a comma (,).
Quotechar: the character that encloses values in a CSV, in this case a double quote (").
Related
I have an ASCII file with the following columns :
ID, val1, val2, val3
where ID is a row number, but the rows are not sorted by it. I want to write a new ASCII file with the same columns, sorted by ID (from smallest to largest).
How could I do that in Python?
In fact, this file has been produced by the concatenation of 2 ascii files using the following code:
import os.path
maindir1="/home/d01/"
maindir2="/home/d02/"
outdir="/home/final/"
pols=[ "F1","F2","F3" ]
months=["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
for ipol in pols:
    for imonth in months:
        for kk in range(1, 7):
            template_args = {"ipol": ipol, "imonth": imonth, "kk": kk}
            filename = "{ipol}_{imonth}_0{kk}_L1.txt".format(ipol=ipol, imonth=imonth, kk=kk)
            out_name = os.path.join(outdir, filename)
            in_names = [os.path.join(maindir1, filename), os.path.join(maindir2, filename)]
            with open(out_name, "w") as out_file:
                for in_name in in_names:
                    with open(in_name, "r") as in_file:
                        out_file.write(in_file.read())
How could I change the above code so that the final file is written sorted by the first column?
Assuming Comma Separated Values
I think you're talking about a Comma Separated Values (CSV) file. The character encoding is probably ASCII. If this is true, you'll have an input like this:
id,val1,val2,val3
3,a,b,c
1,a,b,c
2,a,b,c
Python has a good standard library for this: csv.
import csv
with open("in.csv") as f:
    reader = csv.reader(f)
We import the csv library first, then open the file using a context manager. Basically, it's a nice way to open the file, do stuff (in the with block) and then have it closed automatically.
The csv.reader function takes the file object f as an argument. The reader can be iterated over and yields the rows of your file. If you convert it to a list, you get a list of lists. The first item in that list is the header, which you want to keep, and the rest is the contents:
contents = list(reader)
header = contents[0]
rows = contents[1:]
You then want to sort the rows. But sorting a list of lists might not do what you expect. You need to write a function that helps you find the key to use to perform the sorting:
lambda line: line[0]
This means for every line (which we expect to be a list), the key is equal to the first member of the list. If you prefer not to use lambdas, you can also define a function:
def get_key(line):
    return line[0]
get_key is identical to the lambda.
Combine this all together to get:
new_file = sorted(rows, key=lambda line: line[0])
If you didn't use the lambda, that's:
new_file = sorted(rows, key=get_key)
To write it to a file, you can use the csv library again. Remember to first write the header then the rest of the contents:
with open("out.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(new_file)
All together, the code looks like this:
import csv
with open("in.txt") as f:
    reader = csv.reader(f)
    contents = list(reader)
header = contents[0]
rows = contents[1:]
new_file = sorted(rows, key=lambda line: line[0])
with open("out.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(new_file)
Assuming Custom
If the file is in a custom format and definitely has spaces in the header as you described (almost like a CSV), or you don't want to use the csv library, you can extract rows like this:
contents = [row.strip().replace(" ", "").split(",") for row in f]
If, for instance, it's space-delimited instead of comma-delimited, you would use this:
contents = [row.split() for row in f]
You can write rows like this:
with open("out.csv", "w") as f:
    f.write(", ".join(header) + "\n")
    for row in new_file:
        f.write(", ".join(row) + "\n")
Putting it together:
with open("in.txt") as f:
    contents = [row.strip().replace(" ", "").split(",") for row in f]
header = contents[0]
rows = contents[1:]
new_file = sorted(rows, key=lambda line: line[0])
with open("out.csv", "w") as f:
    f.write(", ".join(header) + "\n")
    for row in new_file:
        f.write(", ".join(row) + "\n")
Hope that helps!
EDIT: This would perform a lexicographical sort on the first column, which is probably not what you want. If you can guarantee that all of the first column (aside from the header) are integers, you can just cast them from a str:
lambda line: line[0]
...becomes:
lambda line: int(line[0])
...with full code:
import csv
with open("in.txt") as f:
    reader = csv.reader(f)
    contents = list(reader)
header = contents[0]
rows = contents[1:]
new_file = sorted(rows, key=lambda line: int(line[0]))
with open("out.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(new_file)
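To make the difference between the two sort keys concrete, here is a small self-contained illustration. The sample rows are made up for this sketch and stand in for the parsed contents of a file.

```python
# Made-up sample rows; the first column holds the IDs as strings,
# which is how csv.reader delivers them.
rows = [["10", "a"], ["9", "b"], ["2", "c"]]

# String comparison goes character by character, so "10" sorts before "2".
lexicographic = sorted(rows, key=lambda line: line[0])

# Converting to int compares the IDs by numeric value instead.
numeric = sorted(rows, key=lambda line: int(line[0]))

print([row[0] for row in lexicographic])  # ['10', '2', '9']
print([row[0] for row in numeric])        # ['2', '9', '10']
```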
So you need to sort your CSV data in ascending order by ID.
You can use this function to do it:
def Sort(sub_li):
    sub_li.sort(key = lambda x: x[0])
    return sub_li
x[0] sorts by ID, i.e. the first column; change the index to suit your use case.
I took this as the input:
a = [["1a", 122323,1000,0],
["6a", 12323213,24,2],
["3a", 1233,1,3]]
So, using the above function I got the output as
[['1a', 122323, 1000, 0],
['3a', 1233, 1, 3],
['6a', 12323213, 24, 2]]
I hope this will help.
I am new to python and I'm trying to create a csv parsing script.
I pass rows from the csv to a list but what currently troubles me is that I need to add the first header line as a dictionary in each item.
def parse_csv(datafile):
    data = []
    with open(datafile, "r") as f:
        next(f) #skip headerline
        for line in f:
            splitLine = line.strip(',')
            rowL = splitLine.rstrip('\n') #remove the newline char
            data.append(rowL)
    pprint(data)
    return data
If the first header line holds the dictionary keys (e.g. Title, Name, etc.), how do I pair them with each stripped element?
e.g. {'Dict1': 'data1', 'Dict2': 'data2'}
This may be considered a duplicate, but I tried various approaches from similar posts and none worked properly in my case.
I strongly recommend using the csv module from the standard library. It will save you a lot of time and effort. Here is what you want to do:
import csv
data = []
with open(datafile, 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)
        print(row['Title'], row['Name'])
In this example each row is actually a python dictionary.
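A self-contained way to see what DictReader produces, without needing a file on disk, is to feed it an in-memory buffer. The column names and values below are invented for the illustration, and Python 3 is assumed.

```python
import csv
import io

# Invented sample data standing in for the contents of a real file
sample = io.StringIO("Title,Name\nDr,Alice\nMr,Bob\n")

# The first line supplies the keys; every later line becomes one mapping.
# dict(row) normalises the row type across Python 3 versions.
data = [dict(row) for row in csv.DictReader(sample)]

print(data)
```

Each element of data maps the header names to that row's values, which is the structure asked for in the question.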
@GeorgiDimitrov is certainly right that the proper approach is to use the csv module from the standard library, but if you're doing this only for self-instruction purposes, then...:
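from pprint import pprint

def parse_csv(datafile):
    data = []
    with open(datafile, "r") as f:
        # Strip the trailing newline so the last header name stays clean
        headers = next(f).rstrip('\n').split(',')
        for line in f:
            splitLine = line.rstrip('\n').split(',')
            dd = dict(zip(headers, splitLine))
            data.append(dd)
    pprint(data)
    return data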
This will not properly deal with quoted/escaped commas, &c -- all subtleties that are definitely best left to the csv module:-).
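The quoted-comma problem can be shown with a single line. The sample text is made up for this sketch, and Python 3 is assumed.

```python
import csv
import io

# Made-up line whose first field contains a comma inside quotes
line = '"Adams, Douglas",42\n'

# Plain splitting cuts the quoted field in two and keeps the quote characters
naive = line.rstrip("\n").split(",")

# csv.reader respects the quoting and returns two clean fields
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['"Adams', ' Douglas"', '42']
print(parsed)  # ['Adams, Douglas', '42']
```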
I have a dataset which looks like this:
id,created_at,username
1,2006-10-09T18:21:51Z,hey
2,2007-10-09T18:30:28Z,bob
3,2008-10-09T18:40:33Z,bob
4,2009-10-09T18:47:42Z,john
5,2010-10-09T18:51:04Z,brad
...
It contains 1M+ lines.
I'd like to extract the list of usernames without duplicates from it using Python. So far my code looks like this:
import csv
file1 = file("sample.csv", 'r')
file2 = file("users.csv", 'w')
reader = csv.reader(file1)
writer = csv.writer(file2)
rownum = 0
L = []
for row in reader:
    if not rownum == 0:
        if not row[2] in L:
            L.append(row[2])
            writer.writerow(row[2])
    rownum += 1
I have several questions:
1 - my output in users.csv looks like this:
h,e,y
b,o,b
j,o,h,n
b,r,a,d
How do I remove the commas between each letter?
2 - My code is not very elegant, is there any way to import the csv file as a matrix to select the last row and then to use an elegant library like underscore.js in javascript to remove the duplicates?
Many thanks
You can use a set here; it provides O(1) average-case item lookup, compared with O(N) for lists.
seen = set()
add_ = seen.add
next(reader) #skip header
writer.writerows([row[-1]] for row in reader if row[-1] not in seen
                 and not add_(row[-1]))
And always use the with statement for handling files; it will automatically close the file for you:
with file("sample.csv", 'r') as file1, file("users.csv", 'w') as file2:
    #Do stuff with file1 and file2 here
Change
writer.writerow(row[2])
to
writer.writerow([row[2]])
Also, checking for membership in lists is computationally expensive [O(n)]. If you will be checking for membership in a large collection of items, and doing it often, use a set [O(1)]:
L = set()
reader.next() # Skip the header
for row in reader:
    if row[2] not in L:
        L.add(row[2])
        writer.writerow([row[2]])
Alternatively
If you're okay with using a few megabytes of memory, just do this:
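with open("sample.csv", "rb") as infile:
    reader = csv.reader(infile)
    reader.next()  # skip the header
    # Collect only the username column; whole rows are all distinct because of the id
    no_duplicates = set(row[2] for row in reader)
with open("users.csv", "wb") as outfile:
    csv.writer(outfile).writerows([name] for name in no_duplicates)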
If order is important, use an OrderedDict instead of a set:
from collections import OrderedDict
with open("sample.csv", "rb") as infile:
    reader = csv.reader(infile)
    reader.next()  # skip the header
    no_duplicates = OrderedDict.fromkeys(row[2] for row in reader)
with open("users.csv", "wb") as outfile:
    csv.writer(outfile).writerows([name] for name in no_duplicates)
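The snippets in this thread target Python 2 (file(), reader.next(), binary file modes). For Python 3, the set-based idea might be sketched as follows; the function name write_unique_usernames and its column parameter are hypothetical, and the username is assumed to sit in the third column as in the sample data.

```python
import csv


def write_unique_usernames(src_path, dst_path, column=2):
    """Copy each distinct value of one column to dst_path, in first-seen order.

    Hypothetical helper for illustration; column=2 assumes the
    id,created_at,username layout from the question.
    """
    seen = set()
    with open(src_path, newline="") as infile, \
            open(dst_path, "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        next(reader, None)  # skip the header; the default avoids an error on an empty file
        for row in reader:
            value = row[column]
            if value not in seen:
                seen.add(value)
                # Wrap in a list so the string is one field, not one field per letter
                writer.writerow([value])
```

Because rows are streamed one at a time, only the set of distinct usernames is held in memory, which matters for a file with more than a million lines.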
Easy and short!
for line in reader:
    string = str(line)
    split = string.split("," , 2)
    username = split[2][2:-2]
This is a function that takes a tab-separated file as input and returns the first row as a list:
def firstline_to_list(fvar):
    """
    Input tab separated file.
    Output first row as list.
    """
    import csv
    lineno = 0
    with open(fvar, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter='\t')
        for row in tabreader:
            lineno += 1
            if lineno == 1:
                return row
                break
Is there a better way to do it than this clunky code of mine?
Just replace your for loop with a single call of next on the iterator tabreader. In Python 2.7 this is tabreader.next(), and in Python 3 it's next(tabreader). You might also want to wrap the call in a try/except block for the StopIteration exception, just in case the file is empty.
So putting everything together, here's a version that's compatible with Python 2 and 3:
def firstline_to_list(fvar):
    """
    Input tab separated file.
    Output first row as list.
    """
    import csv, sys
    with open(fvar, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter='\t')
        try:
            # Compare the numeric major version rather than the version string
            if sys.version_info[0] >= 3:
                result = next(tabreader)
            else:
                result = tabreader.next()
        except StopIteration:
            result = None
    return result
The absolute minimum modification to your code would be this:
def firstline_to_list(fvar):
    """
    Input tab separated files.
    Output first row as list.
    """
    import csv
    with open(fvar, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter='\t')
        for row in tabreader:
            return row
A better way would be to use Reader.next() as documented here: https://docs.python.org/2/library/csv.html
def firstline_to_list(fvar):
    """
    Input tab separated files.
    Output first row as list.
    """
    import csv
    with open(fvar, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter='\t')
        return tabreader.next()
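In Python 3, where the reader no longer has a .next() method, the same idea can be sketched with the built-in next() and a default value. The default replaces the try/except for an empty file; returning None in that case is a choice made for this sketch.

```python
import csv


def firstline_to_list(fvar):
    """
    Input tab separated file.
    Output first row as list, or None if the file is empty.
    """
    with open(fvar, newline="") as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter="\t")
        # The second argument is returned instead of raising StopIteration
        return next(tabreader, None)
```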
How about:
import pandas
return list(pandas.read_csv(fvar,sep='\t',nrows=1))
I have some code that is meant to convert CSV files into tab delimited files. My problem is that I cannot figure out how to write the correct values in the correct order. Here is my code:
for file in import_dir:
    data = csv.reader(open(file))
    fields = data.next()
    new_file = export_dir+os.path.basename(file)
    tab_file = open(export_dir+os.path.basename(file), 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write(item['name']+'\t'+item['order_num']...)
        tab_file.write('\n'+item['amt_due']+'\t'+item['due_date']...)
Now, since both my write statements are in the for row in data loop, my headers are being written multiple times over. If I outdent the first write statement, I'll have an obvious formatting error. If I move the second write statement above the first and then outdent, my data will be out of order. What can I do to make sure that the first write statement gets written once as a header, and the second gets written for each line in the CSV file? How do I extract the first 'write' statement outside of the loop without breaking the dictionary? Thanks!
The csv module contains methods for writing as well as reading, making this pretty trivial:
import csv
with open("test.csv") as file, open("test_tab.csv", "w") as out:
    reader = csv.reader(file)
    writer = csv.writer(out, dialect=csv.excel_tab)
    for row in reader:
        writer.writerow(row)
No need to do it all yourself. Note my use of the with statement, which should always be used when working with files in Python.
Edit: Naturally, if you want to select specific values, you can do that easily enough. You appear to be making your own dictionary to select the values - again, the csv module provides DictReader to do that for you:
import csv
with open("test.csv") as file, open("test_tab.csv", "w") as out:
    reader = csv.DictReader(file)
    writer = csv.writer(out, dialect=csv.excel_tab)
    for row in reader:
        writer.writerow([row["name"], row["order_num"], ...])
As kirelagin points out in the comments, writer.writerows() could also be used, here with a generator expression:
writer.writerows([row["name"], row["order_num"], ...] for row in reader)
Extract the code that writes the headers outside the main loop, in such a way that it only gets written exactly once at the beginning.
Also, consider using the CSV module for writing CSV files (not just for reading), don't reinvent the wheel!
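One way to sketch "write the header once, then one line per row" with the csv module, assuming Python 3: the function name csv_to_tsv and its fields parameter are hypothetical, and the field names used in the usage note are only the ones visible in the question.

```python
import csv


def csv_to_tsv(src, dst, fields):
    """Copy the chosen columns from an open CSV file to an open tab-delimited file.

    Hypothetical helper for illustration: src and dst are open text files,
    fields is the list of column names to keep, in output order.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=fields, delimiter="\t",
                            lineterminator="\n")
    # Written exactly once, before the loop over the data rows
    writer.writeheader()
    for row in reader:
        # Keep only the requested columns and strip surrounding whitespace
        writer.writerow({name: row[name].strip() for name in fields})
```

A call might look like csv_to_tsv(infile, outfile, ["name", "order_num", "amt_due", "due_date"]), with both files opened in a with statement.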
OK, so I figured it out, but it's not the most elegant solution. Basically, I just ran the first loop, wrote to the file, then ran it a second time and appended the results. See my code below. I would love any input on a better way to accomplish what I've done here. Thanks!
for file in import_dir:
    data = csv.reader(open(file))
    fields = data.next()
    new_file = export_dir+os.path.basename(file)
    tab_file = open(export_dir+os.path.basename(file), 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write(item['name']+'\t'+item['order_num']...)
    tab_file.close()
for file in import_dir:
    data = csv.reader(open(file))
    fields = data.next()
    new_file = export_dir+os.path.basename(file)
    tab_file = open(export_dir+os.path.basename(file), 'a+')
    for row in data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        tab_file.write('\n'+item['amt_due']+'\t'+item['due_date']...)
    tab_file.close()