Get length of a CSV file without ruining the reader? - Python

I am trying to do the following:

    reader = csv.DictReader(open(self.file_path), delimiter='|')
    reader_length = sum(1 for _ in reader)
    for line in reader:
        print(line)

However, the reader_length line exhausts the reader, so the loop that follows never sees any rows. Note that I do not want to call list() on the reader, as the file is too big to read into memory entirely on my machine.

Use enumerate with a start value of 1; when you reach the end of the file, you will have the line count:

    for count, line in enumerate(reader, 1):
        # do work
    print(count)
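One caveat: if the file has no rows, count is never assigned and print(count) raises a NameError. Initializing it first guards against that:

    count = 0
    for count, line in enumerate(reader, 1):
        # do work
    print(count)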
Or, if you need the count up front for some reason, sum over a generator expression and seek back to the start of the file:

    with open(self.file_path) as f:
        reader = csv.DictReader(f, delimiter='|')
        count = sum(1 for _ in reader)
        f.seek(0)
        reader = csv.DictReader(f, delimiter='|')
        for line in reader:
            print(line)

Note that this reads the file twice: once to count and once for the real pass.

    reader = list(csv.DictReader(open(self.file_path), delimiter='|'))
    print(len(reader))

is one way to do this, I suppose, though it loads the whole file into memory. Another way to do it would be:

    reader = csv.DictReader(open(self.file_path), delimiter='|')
    for i, row in enumerate(reader):
        ...
    num_rows = i + 1
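If an approximate count is enough, a cheaper option is to count raw lines rather than parsed records. A minimal sketch, assuming no quoted field contains an embedded newline (otherwise physical lines and CSV records are not one-to-one):

    import csv

    with open(self.file_path) as f:
        count = sum(1 for _ in f) - 1  # physical lines, minus the header row
        f.seek(0)
        reader = csv.DictReader(f, delimiter='|')
        for line in reader:
            print(line)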

Related

Why is using .reader() skipping the first line and .readlines() isn't?

I am attempting to read in all the data from a .csv file. First I tried csv.reader(), but it skipped the first line of my file. I was able to remedy this using .readlines(), but I am wondering why this happens with .reader() and how to make it read my first line.
    import glob
    import csv

    new_cards = []
    path = 'C:\\Users\\zrc\\Desktop\\GCData2\\*.asc'
    files = glob.glob(path)

    # First method
    for name in files:
        with open(name) as f:
            for line in f:
                reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
                for row in reader:
                    new_cards.append(row)
    print(len(new_cards))

    # Second method
    for name in files:
        with open(name) as f:
            m = f.readlines()
            for line in m:
                new_cards.append(line)
    print(len(new_cards))
In your first method you don't need for line in f:. That loop consumes the first line of the file, so the reader starts from the second.

The correct way would be:

    for name in files:
        with open(name) as f:
            reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
            for row in reader:
                new_cards.append(row)
    print(len(new_cards))

You don't need to iterate over each line yourself, because for row in reader: already does that for you.
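The underlying point is that csv.reader(f) and the file object share a single iterator, so any line read from f directly is never seen by the reader. A small sketch illustrating this, with a made-up file name:

    import csv

    with open('cards.asc') as f:   # hypothetical file
        first = next(f)            # consumes line 1 from the shared iterator
        reader = csv.reader(f)
        rows = list(reader)        # starts at line 2; line 1 is gone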

data clean-up in python using large (1.7gig) csv files

I'm trying to do some data clean-up using Python. I have some large (1-2 GB) CSV files that I want to sort by some attribute (e.g. date, time), and then output another CSV file with this info, with the purpose of making it usable in Excel.
As I iterate through the rows, I run into some big memory issues. Initially I was using 32-bit IDLE, which wouldn't run my code; I then switched to 64-bit Spyder. Now the code runs, but halts (it appears to process and memory is consumed, but I haven't seen it move on in the last half hour) at the first iterative line.
My code is as follows. The process halts at the line marked below. I'm pretty new to Python, so I'm sure my code is very primitive, but it's the best I can do! Thanks for your help in advance :)
    import csv

    def file_reader(filename):
        """Takes a string of a file name and returns a list of lists."""
        global master_list
        with open(filename, 'rt') as csvfile:
            rows = []
            master_list = []
            rowreader = csv.reader(csvfile, delimiter=',', quotechar='|')
            for row in rowreader:  # <-- the process halts here
                rows.append(','.join(row))
            for i in rows:
                master_list.append(i.replace(' ', '').replace('/2013', ',').split(","))
        return master_list

    def trip_dateroute(date, route):
        dateroute_list = []
        for i in master_list:
            if str(i[1]) == date and str(i[3]) == route:
                dateroute_list.append(i)
        return dateroute_list

    def output_csv(filename, listname):
        with open(filename, "w") as csvfile:
            writer = csv.writer(csvfile, delimiter=',', quotechar='|', lineterminator='\n')
            for i in listname:
                writer.writerow(i)
If you don't need to hold the whole file content in memory, you can just process each line and immediately write it to the output file. Also, in your example you parse the CSV and then generate CSV again, but you don't seem to make use of the parsed data. If that is correct, you could simply do this:

    def file_converter(infilename, outfilename):
        with open(infilename, 'rt') as infile, open(outfilename, 'w') as outfile:
            for line in infile:
                # str.replace returns a new string, so write the result
                outfile.write(line.replace(' ', '').replace('/2013', ','))
If the function trip_dateroute() is used to filter the lines that should actually be written out, you can add that too, but then you actually have to parse the CSV:

    import csv

    def filter_row(row, date, route):
        return str(row[1]) == date and str(row[3]) == route

    def cleanup(field):
        return field.replace(' ', '').replace('/2013', ',')

    def file_converter(infilename, outfilename, date, route):
        with open(infilename, 'rt') as infile, open(outfilename, 'w') as outfile:
            reader = csv.reader(infile, delimiter=',', quotechar='|')
            writer = csv.writer(outfile, delimiter=',', quotechar='|', lineterminator='\n')
            for row in reader:
                # Filter once per row, then clean every field of the matching rows
                if filter_row(row, date, route):
                    writer.writerow([cleanup(field) for field in row])
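A hypothetical invocation; the file names, date, and route values here are made up for illustration:

    file_converter('trips_2013.csv', 'trips_clean.csv', '21/05/2013', '42')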

Reading a specific line from CSV file

I have a ten-line CSV file. From this file, I only want the, say, fourth line. What's the quickest way to do this? I'm looking for something like:

    with open(file, 'r') as my_file:
        reader = csv.reader(my_file)
        print(reader[3])

where reader[3] is obviously incorrect syntax for what I want to achieve. How do I move the reader to line 4 and get its content?
If all you have is 10 lines, you can load the whole file into a list:

    with open(file, 'r') as my_file:
        reader = csv.reader(my_file)
        rows = list(reader)
        print(rows[3])
For a larger file, use itertools.islice(), which skips rows lazily instead of loading the whole file:

    from itertools import islice

    with open(file, 'r') as my_file:
        reader = csv.reader(my_file)
        print(next(islice(reader, 3, 4)))
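Wrapped into a small helper (the nth_row name and data.csv file are hypothetical); passing a default to next() makes a too-short file return None instead of raising StopIteration:

    import csv
    from itertools import islice

    def nth_row(path, n):
        """Return row n (0-based) of a CSV file, or None if the file is shorter."""
        with open(path, newline='') as f:
            return next(islice(csv.reader(f), n, n + 1), None)

    print(nth_row('data.csv', 3))  # the fourth row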

Extract a list without duplicates from a CSV file

I have a dataset which looks like this:

    id,created_at,username
    1,2006-10-09T18:21:51Z,hey
    2,2007-10-09T18:30:28Z,bob
    3,2008-10-09T18:40:33Z,bob
    4,2009-10-09T18:47:42Z,john
    5,2010-10-09T18:51:04Z,brad
    ...

It contains 1M+ lines.
I'd like to extract the list of usernames, without duplicates, from it using Python. So far my code looks like this:
    import csv

    file1 = open("sample.csv", 'r')
    file2 = open("users.csv", 'w')
    reader = csv.reader(file1)
    writer = csv.writer(file2)
    rownum = 0
    L = []
    for row in reader:
        if not rownum == 0:
            if row[2] not in L:
                L.append(row[2])
                writer.writerow(row[2])
        rownum += 1
I have several questions:

1 - My output in users.csv looks like this:

    h,e,y
    b,o,b
    j,o,h,n
    b,r,a,d

How do I remove the commas between each letter?

2 - My code is not very elegant. Is there any way to import the CSV file as a matrix to select the last column, and then use an elegant library, like underscore.js in JavaScript, to remove the duplicates?

Many thanks
You can use a set here; it provides O(1) membership lookup compared to O(N) for lists:

    seen = set()
    add_ = seen.add
    next(reader)  # skip the header
    writer.writerows([row[-1]] for row in reader
                     if row[-1] not in seen and not add_(row[-1]))

(set.add returns None, so not add_(...) is always true; its only job is to record the username in seen as a side effect.)

And always use the with statement for handling files; it will automatically close them for you:

    with open("sample.csv", 'r') as file1, open("users.csv", 'w') as file2:
        # do stuff with file1 and file2 here
Change

    writer.writerow(row[2])

to

    writer.writerow([row[2]])

writerow expects a sequence of fields; given a bare string, it treats each character as a separate field, which is where the commas between the letters come from.

Also, checking for membership in a list is computationally expensive [O(n)]. If you will be checking for membership in a large collection of items, and doing it often, use a set [O(1)]:

    L = set()
    next(reader)  # skip the header
    for row in reader:
        if row[2] not in L:
            L.add(row[2])
            writer.writerow([row[2]])
Alternatively

If you're okay with using a few megabytes of memory, just do this:

    with open("sample.csv", newline="") as infile:
        reader = csv.reader(infile)
        next(reader)  # skip the header
        no_duplicates = set(tuple(row) for row in reader)
    with open("users.csv", "w", newline="") as outfile:
        csv.writer(outfile).writerows(no_duplicates)

If order is important, use an OrderedDict instead of a set:

    from collections import OrderedDict

    with open("sample.csv", newline="") as infile:
        reader = csv.reader(infile)
        next(reader)  # skip the header
        no_duplicates = OrderedDict.fromkeys(tuple(row) for row in reader)
    with open("users.csv", "w", newline="") as outfile:
        csv.writer(outfile).writerows(no_duplicates.keys())
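On Python 3.7 and later, a plain dict preserves insertion order as well, so the OrderedDict import can be dropped:

    no_duplicates = dict.fromkeys(tuple(row) for row in reader)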
Easy and short!

    for line in reader:
        string = str(line)            # repr of the parsed row, e.g. "['1', '2006-...', 'hey']"
        split = string.split(",", 2)  # split on the first two commas only
        username = split[2][2:-2]     # strip the surrounding " '" and "']"
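A less fragile equivalent skips the string round-trip and indexes the parsed row directly:

    for line in reader:
        username = line[-1]  # the last column is the username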

How do I put lines into a list from CSV using python

I am new to Python (coming from a PHP background) and I'm having a hard time figuring out how to put each line of a CSV file into a list. I wrote this:
    import csv

    data = []
    reader = csv.reader(open("file.csv", "r"), delimiter=',')
    for line in reader:
        if "DEFAULT" not in line:
            data += line
    print(data)

But when I print out data, I see that it's treated as one string. I want a list. I want to be able to loop and append every line that does not contain "DEFAULT", then write the result to a new file.
Your data += line extends the list with each row's individual fields; data.append(line) would keep whole rows. How about this?

    import csv

    reader = csv.reader(open("file.csv", "r"), delimiter=',')
    print([line for line in reader if 'DEFAULT' not in line])

or, if it's easier to understand:

    import csv

    reader = csv.reader(open("file.csv", "r"), delimiter=',')
    data = [line for line in reader if 'DEFAULT' not in line]
    print(data)

and of course the ultimate one-liner:

    import csv

    print([l for l in csv.reader(open("file.csv"), delimiter=',') if 'DEFAULT' not in l])
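The question also asks about writing the filtered rows to a new file; a minimal sketch (the filtered.csv name is made up):

    import csv

    with open("file.csv", newline="") as src, open("filtered.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerows(row for row in csv.reader(src) if "DEFAULT" not in row)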
