I was a little curious: when I add a single line to my code that counts the number of rows in the CSV file, the for loop stops working and just skips everything inside it.
My code, shown below, works now, but if I uncomment the row_count lines it stops working, so my question is why?
with open(r"C:\Users\heltbork\Desktop\test\ICM.csv", newline='') as csvfile:
sensor = csv.reader(csvfile, delimiter=',', quotechar='|')
#row_count = sum(1 for row in sensor)
#print(row_count)
for row in sensor:
#alot of stuff here
The reader is an iterator (see the iterator protocol):
... One notable exception is code which attempts multiple iteration passes. A container object (such as a list) produces a fresh new iterator each time you pass it to the iter() function or use it in a for loop. Attempting this with an iterator will just return the same exhausted iterator object used in the previous iteration pass, making it appear like an empty container.
The iterator is consumed as you iterate it. It is not a concrete data structure:
sensor = csv.reader(...)              # creates an iterator
row_count = sum(1 for row in sensor)  # *consumes* the iterator
for row in sensor:                    # nothing left in the iterator, consumed by `sum`
    # a lot of stuff here
You should count while you iterate (inside for row in sensor:), because once you consume the iterator you can't iterate over it again.
Alternatives are using list to make the data concrete, or, if you need an iterable interface, itertools.tee (if you don't have a lot of data; see the sketch below). You can also use enumerate and keep the last index.
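Here is a minimal sketch of the tee approach. Note that tee buffers every row that one copy has seen and the other hasn't, so counting first this way still ends up holding the whole file in memory:

import csv
import itertools

with open(r"C:\Users\heltbork\Desktop\test\ICM.csv", newline='') as csvfile:
    sensor = csv.reader(csvfile, delimiter=',', quotechar='|')
    count_it, row_it = itertools.tee(sensor)  # two independent iterators over the same rows
    row_count = sum(1 for row in count_it)    # consumes only the first copy...
    print(row_count)
    for row in row_it:                        # ...the second copy is still intact
        pass  # a lot of stuff here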
Example with enumerate:
sensor = csv.reader(...)  # creates an iterator
count = 0
for idx, row in enumerate(sensor):
    # a lot of stuff here
    # ...
    count = idx + 1  # idx is zero-based, so the count so far is idx + 1
print(count)
Or:
count = 0
for row in sensor:
    # a lot of stuff here
    # ...
    count += 1
print(count)
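If you really need two full passes over the data and it fits in memory, you can make it concrete with list first (a sketch of the list alternative mentioned above):

import csv

with open(r"C:\Users\heltbork\Desktop\test\ICM.csv", newline='') as csvfile:
    sensor = list(csv.reader(csvfile, delimiter=',', quotechar='|'))  # concrete list
    row_count = len(sensor)  # len() does not consume anything
    print(row_count)
    for row in sensor:       # iterating a list produces a fresh iterator each time
        pass  # a lot of stuff here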
I'm interested in finding the FASTEST way to iterate through a list of lists and replace a character in the innermost list. I am generating the list of lists from a CSV file in Python.
Bing Ads API sends me a giant report but any percentage is represented as "20.00%" as opposed to "20.00". This means I can't insert each row as is to my database because "20.00%" doesn't convert to a numeric on SQL Server.
My solution thus far has been to use a list comprehension inside a list comprehension. I wrote a small script to test how fast this runs compared to just getting the list, and it's doing OK (about 2x the runtime), but I am curious to know if there is a faster way.
Note: Every record in the report has a rate and therefore a percent, so every record has to be visited once, and every rate has to be visited once (is that the cause of the 2x slowdown?)
Anyway I would love a faster solution as the size of these reports continue to grow!
import time
import csv

def getRecords1():
    with open('report.csv', 'rU', encoding='utf-8-sig') as records:
        reader = csv.reader(records)
        # Skip all lines in the header (the row containing 'GregorianDate'
        # holds the column headers, so it is the last one to skip)
        while next(reader)[0] != 'GregorianDate':
            next(reader)
        recordList = list(reader)
    return recordList

def getRecords2():
    with open('report.csv', 'rU', encoding='utf-8-sig') as records:
        reader = csv.reader(records)
        # Skip all lines in the header, as in getRecords1
        while next(reader)[0] != 'GregorianDate':
            next(reader)
        recordList = list(reader)
        data = [[field.replace('%', '') for field in record] for record in recordList]
    return data

def getRecords3():
    data = []
    with open('c:\\Users\\sflynn\\Documents\\Google API Project\\Bing\\uploadBing\\reports\\report.csv', 'rU', encoding='utf-8-sig') as records:
        reader = csv.reader(records)
        # Skip all lines in the header, as in getRecords1
        while next(reader)[0] != 'GregorianDate':
            next(reader)
        for row in reader:
            row[10] = row[10].replace('%', '')
            data += [row]
    return data
def main():
    t0 = time.time()
    for i in range(2000):
        getRecords1()
    t1 = time.time()
    print("Get records normally takes " + str(t1-t0))

    t0 = time.time()
    for i in range(2000):
        getRecords2()
    t1 = time.time()
    print("Using nested list comprehension takes " + str(t1-t0))

    t0 = time.time()
    for i in range(2000):
        getRecords3()
    t1 = time.time()
    print("Modifying row as it's read takes " + str(t1-t0))

main()
Edit: I have added a third function getRecords3() which is the fastest implementation I have seen yet. The output of running the program is as follows:
Get records normally takes 30.61197066307068
Using nested list comprehension takes 60.81756520271301
Modifying row as it's read takes 43.761850357055664
This means we have taken it down from a 2x slower algorithm to approximately 1.5x slower. Thank you everyone!
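One further variant that might be worth timing (a sketch, untested; the name getRecords4 is hypothetical, and it assumes '%' can only ever appear in the rate column): strip the '%' from each raw line before the csv module parses it. csv.reader accepts any iterable of strings, so you can wrap the file handle in a generator expression:

def getRecords4():
    with open('report.csv', 'rU', encoding='utf-8-sig') as records:
        # pre-clean each raw line before csv parsing; one str.replace per
        # line instead of one per field
        reader = csv.reader(line.replace('%', '') for line in records)
        # Skip all lines in the header, as in getRecords1
        while next(reader)[0] != 'GregorianDate':
            next(reader)
        return list(reader)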
You could potentially check whether modifying the inner lists in place is faster than creating a new list of lists with the list comprehension.
So, something like:

# records is the list of rows parsed from the csv
for record in records:
    for index in range(len(record)):
        record[index] = record[index].replace('%', '')

We can't modify a string in place since strings are immutable, so we replace each list element instead.
This type of question has been asked many times, so apologies; I have searched hard for an answer but have not found anything close enough to my needs (and, as a total newbie, I am not sufficiently advanced to customize an existing answer). So thanks in advance for any help.
Here's my query:
I have 30 or so csv files and each contains between 500 and 15,000 rows.
Within each of them (in the 1st column) - are rows of alphabetical IDs (some contain underscores and some also have numbers).
I don't care about the unique IDs - but I would like to identify the duplicate IDs and the number of times they appear in all the different csv files.
Ideally I'd like the output for each duped ID to appear in a new csv file and be listed in 2 columns ("ID", "times_seen")
It may be that I need to compile just 1 csv with all the IDs for your code to run properly - so please let me know if I need to do that
I am using python 2.7 (a crawling script that I run needs this version, apparently).
Thanks again
It seems the easiest way to achieve what you want would be to make use of dictionaries.
import csv
import os

# Assuming all your csv files are in a single directory, we will iterate on
# the files in this directory, selecting only those ending with .csv.
# To list files in the directory we will use the walk function in the
# os module. os.walk(path_to_dir) returns a generator (a lazy iterator);
# this generator yields tuples of the form (root_directory,
# list_of_directories, list_of_files).

# So: declare the generator
file_generator = os.walk("/path/to/csv/dir")
# Get the first value; as we won't recurse into subdirectories, we
# only need this one
root_dir, list_of_dir, list_of_files = file_generator.next()

# Now, we only keep the files ending with .csv. Let me break that down
csv_list = []
for f in list_of_files:
    if f.endswith(".csv"):
        csv_list.append(f)
# That's what was contained in the line
# csv_list = [f for f in list_of_files if f.endswith(".csv")]

# The dictionary (key-value map) that will contain the id counts
ref_count = {}

# We loop on all the csv filenames...
for csv_file in csv_list:
    # open the file in read mode (joining with the directory, since
    # list_of_files holds bare filenames)
    with open(os.path.join(root_dir, csv_file), "r") as _:
        # build a csv reader around the file
        csv_reader = csv.reader(_)
        # loop on all the lines of the file, transformed into lists by the
        # csv reader
        for row in csv_reader:
            # If we haven't encountered this id yet, create
            # the corresponding entry in the dictionary
            if not row[0] in ref_count:
                ref_count[row[0]] = 0
            # increment the number of occurrences associated with
            # this id
            ref_count[row[0]] += 1

# now write the csv output
with open("youroutput.csv", "w") as _:
    writer = csv.writer(_)
    for k, v in ref_count.iteritems():
        # as requested we only keep duplicates
        if v > 1:
            # use the writer to write the list to the file;
            # the delimiters will be added by it
            writer.writerow([k, v])
You may need to tweak the csv reader and writer options a little to fit your needs, but this should do the trick. You'll find the documentation here: https://docs.python.org/2/library/csv.html. I haven't tested it though; correcting the little mistakes that may have slipped in is left as a practice exercise :).
That's rather easy to achieve. It would look something like:

import os

# Set to whatever separator you have; '\t' for TAB
delimiter = ','
# Dictionary to keep count of ids
ids = {}

# Iterate over files in the current dir
for in_file in os.listdir(os.curdir):
    # Check whether it is a csv file (a dummy check, but it shall work for you)
    if in_file.endswith('.csv'):
        with open(in_file, 'r') as ifile:
            for line in ifile:
                my_id = line.strip().split(delimiter)[0]
                # If the id does not exist in the dict, set its count to 0
                if my_id not in ids:
                    ids[my_id] = 0
                # Increment the count
                ids[my_id] += 1

# save ids and counts to a file
with open('ids_counts.csv', 'w') as ofile:
    for key, val in ids.iteritems():
        # write down counts to a file using the same column delimiter
        ofile.write('{}{}{}\n'.format(key, delimiter, val))
Check out the pandas package. You can read and write csv files quite easily with it.
http://pandas.pydata.org/pandas-docs/stable/10min.html#csv
Then, once you have the csv content as a DataFrame, you can convert it with the as_matrix function.
Use the answers to this question to get the duplicates as a list.
Find and list duplicates in a list?
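A minimal sketch of that idea, assuming the IDs sit in the first column of each file, there are no header rows, and the files live in the current directory (the glob pattern and column names are illustrative):

import glob
import pandas as pd

# Read the first column of every csv into one long Series of IDs
ids = pd.concat(
    pd.read_csv(path, usecols=[0], header=None, names=['ID'])['ID']
    for path in glob.glob('*.csv')
)

# value_counts() counts occurrences per ID; keep only the duplicated ones
counts = ids.value_counts()
dupes = counts[counts > 1].rename_axis('ID').reset_index(name='times_seen')
dupes.to_csv('duplicates.csv', index=False)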
I hope this helps
As you are a newbie, I'll try to give some directions instead of posting an answer, mainly because this is not a "code this for me" platform.
Python has a library called csv that allows you to read data from CSV files (Boom! Surprised?). This library allows you to read the file. Start by reading the file (preferably an example file that you create with just 10 or so rows, and then increase the number of rows or use a for loop to iterate over different files). The examples at the bottom of the page I linked will help you print this info.
As you will see, the output you get from this library is a list with all the elements of each row. Your next step should be extracting just the ID that you are interested in.
The next logical step is counting the number of appearances. There is a class in the standard library's collections module called Counter. It has a method called update that you can use as follows:
from collections import Counter
c = Counter()
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 1})
c['safddsfasdf'] # 1
c.update(['safddsfasdf'])
c # Counter({'safddsfasdf': 2})
c['safddsfasdf'] # 2
c.update(['fdf'])
c # Counter({'safddsfasdf': 2, 'fdf': 1})
c['fdf'] # 1
So basically you will have to pass it a list with the elements you want to count (you could have more than one id in the list, for example reading 10 IDs before inserting them, for improved efficiency; just remember not to construct a list of thousands of elements if you care about memory behaviour).
If you try this and get into some trouble come back and we will help further.
Edit
Spoiler alert: I decided to give a full answer to the problem; please skip it if you want to find your own solution and learn Python in the process.
# The csv module will help us read and write to the files
from csv import reader, writer
# The collections module has a useful type called Counter that fulfills our needs
from collections import Counter

# Getting the names/paths of the files is not this question's goal,
# so I'll just have them in a list
files = [
    "file_1.csv",
    "file_2.csv",
]
# The output file name/path will also be stored in a variable
output = "output.csv"

# We create the item that is gonna count for us
appearances = Counter()

# Now we will loop over each file
for file in files:
    # We open the file in reading mode and get a handle
    with open(file, "r") as file_h:
        # We create a csv parser from the handle
        file_reader = reader(file_h)
        # Here you may need to do something if your first row is a header
        # We loop over all the rows
        for row in file_reader:
            # We insert the id into the counter;
            # row[:1] gets explained afterwards, it is the first
            # column of the row in list form
            appearances.update(row[:1])

# Now we will open/create the output file and get a handle
with open(output, "w") as file_h:
    # We create a csv parser for the handle, this time to write
    file_writer = writer(file_h)
    # If you want to insert a header into the output file, this is the place
    # We loop through our Counter object to write the entries out;
    # here we have different options: if you want them sorted
    # by number of appearances, Counter.most_common() is your friend;
    # if you don't care about the order, you can use the Counter object
    # as if it were a normal dict

    # Option 1: ordered
    for id_and_times in appearances.most_common():
        # id_and_times is a tuple with the id and the times it appears,
        # so we check the second element (indices start at 0)
        if id_and_times[1] == 1:
            # As they are ordered, we can stop the loop when we reach
            # the first 1, to finish as early as possible
            break
        # As we have ended the loop once a count of 1 appears,
        # only duplicate IDs reach this point
        file_writer.writerow(id_and_times)

    # Option 2: unordered (pick one of the two options, not both)
    for id_and_times in appearances.iteritems():
        # This time we cannot stop the loop early as they are unordered,
        # so we must check them all
        if id_and_times[1] > 1:
            file_writer.writerow(id_and_times)
I offered 2 options: printing them ordered (based on the Counter.most_common() docs) or unordered (based on the normal dict method dict.iteritems()). Choose one. From a speed point of view I'm not sure which would be faster, as the first needs to sort the Counter but stops looping at the first non-duplicated element, while the second doesn't need to sort the elements but must loop over every ID. The speed will probably depend on your data.
About the row[:1] thingy:
row is a list.
You can get a subset of a list by giving the initial and final positions.
In this case the initial position is omitted, so it defaults to the start.
The final position is 1, so just the first element gets selected.
So the output is another list with just the first element.
row[:1] == [row[0]]: they have the same output; slicing a sublist containing only the first element is the same as constructing a new list with only the first element.
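For instance (the values here are made up for illustration):

row = ['id_1', '2017-01-01', '20.00%']
row[:1]   # ['id_1']
[row[0]]  # ['id_1']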
My issue is from a much larger program, but I shrank and dramatically simplified the specific problem for the purpose of this question.
I've used the DictReader class to create a reader from a csv file. I want to loop through it, printing out its contents, which I can, but I want to do this multiple times.
The content of test.csv is simply one column with the numbers 1-3 and a header row called Number.
GetData is a class I wrote with a method called create_dict() that creates and returns a reader for test.csv using csv.DictReader.
My code is as follows:
import csv

class GetData:
    def __init__(self, file):
        self._name = file

    def create_dict(self):
        data = csv.DictReader(open(self._name, 'r'), delimiter=",")
        return data

dictionary = GetData('test.csv').create_dict()

for i in range(5):
    print("outer loop")
    for row in dictionary:
        print(row['Number'])
The output is as follows:
outer loop
1
2
3
outer loop
outer loop
outer loop
outer loop
My desired output is:
outer loop
1
2
3
outer loop
1
2
3
outer loop
1
2
3
outer loop
1
2
3
outer loop
1
2
3
Does anyone know why this happens in Python?
Since you're using a file object, it's reading from the cursor position. This isn't a problem the first time through because the position is at the beginning of the file. After that, it's reading from the end of the file to the, well, end of the file.
I'm not sure how GetData works, but see if it has a seek command in which case:
for i in range(5):
    print('outer loop')
    dictionary.seek(0)
    for row in dictionary:
        print(row['Number'])
As g.d.d.c points out in a comment, it may also be a generator instead of a file object, in which case this approach is flawed. A generator will only run once, so you may have to list() it. It all depends on how GetData.create_dict works!
As per your comment that GetData.create_dict gives you a csv.DictReader, your options are somewhat limited. Remember that the DictReader is essentially an iterator of dicts, so you may be able to get away with snapshotting it:
list_of_dicts = [row for row in dictionary]
then you can iterate through the list_of_dicts
for i in range(5):
    print('outer loop')
    for row in list_of_dicts:
        print(row['Number'])
csv.DictReader is an iterator over the associated open file. After one loop over the file, you're at the end (EOF).
To loop over it again, seek back to the beginning of the file with your_filehandle.seek(0). Note that the reader has already consumed and cached the header row, so the simplest safe approach is to also create a fresh DictReader after seeking.
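A minimal sketch of that idea, keeping the file handle around rather than opening it inside create_dict:

import csv

with open('test.csv', 'r') as f:
    for i in range(5):
        print('outer loop')
        f.seek(0)                    # rewind the underlying file
        reader = csv.DictReader(f)   # fresh reader, so the header row is re-consumed
        for row in reader:
            print(row['Number'])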
Imagine I'm reading in a csv file of numbers that looks like this:
1,6.2,10
5.4,5,11
17,1.5,5
...
And it's really really long.
I'm going to iterate through this file with a csv reader like this:
import csv
reader = csv.reader(open('numbers.csv'))  # csv.reader needs a file object (or any iterable of lines), not a filename
Now assume I have some function that can take an iterator like max:
max((float(rec[0]) for rec in reader))
This finds the max of the first column and doesn't need to read the whole file into memory.
But what if I want to run max on each column of the csv file, still without reading the whole file into memory?
If max were rewritten like this:
def max(iterator):
themax = float('-inf')
for i in iterator:
themax = i if i > themax else themax
yield
yield themax
I could then do some fancy work (and have) to make this happen.
But what if I constrain the problem and don't allow max to be rewritten? Is this possible?
Thanks!
If you're comfortable with a more functional approach you can use functools.reduce to iterate through the file, pulling only two rows into memory at once, and accumulating the column-maximums as it goes.
import csv
from functools import reduce

def column_max(row1, row2):
    # zip contiguous rows and apply max to each of the column pairs
    return [max(float(c1), float(c2)) for (c1, c2) in zip(row1, row2)]

reader = csv.reader(open('numbers.csv'))
# calling `next` on reader advances its state by one row
first_row = next(reader)
column_maxes = reduce(column_max, reader, first_row)


# another way to write this code is to unpack the reduction into explicit iteration
reader = csv.reader(open('numbers.csv'))  # fresh reader for the second variant
column_maxes = next(reader)  # advances `reader` to its second row
for row in reader:
    column_maxes = [max(float(c1), float(c2)) for (c1, c2) in zip(column_maxes, row)]
I would just move away from using a function that you pass the iterator to, and instead iterate over the reader yourself:

maxes = []
for row in reader:
    for i in range(len(row)):
        if i >= len(maxes):  # first row: nothing recorded yet for this column
            maxes.append(float(row[i]))
        else:
            maxes[i] = max(maxes[i], float(row[i]))

At the end, the list maxes will contain the maximum value of each column, without having the whole file in memory.
def col_max(x0, x1):
    """x0 is the list of accumulated maxes so far,
    x1 is a row from the file."""
    return [max(a, float(b)) for a, b in zip(x0, x1)]  # convert the csv strings before comparing

Now functools.reduce(col_max, reader, initializer) will return just what you want. You will have to supply initializer as a list of -inf's of the correct length.
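If the column count isn't known up front, one way to build that initializer (a sketch, assuming the file has at least one row) is to seed the reduction with the first row instead of a list of -inf's:

import csv
from functools import reduce

with open('numbers.csv') as f:
    reader = csv.reader(f)
    initializer = [float(x) for x in next(reader)]  # the first row seeds the maxes
    column_maxes = reduce(col_max, reader, initializer)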
I'm writing a Python script that reads a CSV file and creates a list of deques. If I print out exactly what gets appended to the list before it gets added, it looks like what I want, but when I print out the list itself I can see that append is overwriting all of the elements in the list with the newest one.
import csv
from collections import deque

# Window is a list containing many instances
def slideWindow(window, nextInstance, num_attributes):
    attribute = nextInstance.pop(0)
    window.popleft()
    for i in range(num_attributes):
        window.pop()
    window.extendleft(reversed(nextInstance))
    window.appendleft(attribute)
    return window

def convertDataFormat(filename, window_size):
    with open(filename, 'rU') as f:
        reader = csv.reader(f)
        window = deque()
        alldata = deque()
        i = 0
        for row in reader:
            if i < (window_size-1):
                window.extendleft(reversed(row[1:]))
                i += 1
            else:
                window.extendleft(reversed(row))
                break
        alldata.append(window)
        for row in reader:
            window = slideWindow(window, row, NUM_ATTRIBUTES)
            alldata.append(window)
        # print alldata
    return alldata
It is really difficult to track exactly what you want from this code, but I suspect the problem lies in the following:
alldata.append(window)
for row in reader:
    window = slideWindow(window, row, NUM_ATTRIBUTES)
    alldata.append(window)
Notice that in your slideWindow function, you modify the input deque (window), and then return the modified deque. So, you're putting a deque into the first element of your list, then you modify that object (inside slideWindow) and append another reference to the same object onto your list.
Is that what you intend to do?
The simple fix is to copy the window input in slideWindow and modify/return the copy.
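A minimal sketch of that fix, assuming the rest of the function stays the same:

from collections import deque

def slideWindow(window, nextInstance, num_attributes):
    window = deque(window)  # shallow copy, so snapshots already stored in alldata stay intact
    attribute = nextInstance.pop(0)
    window.popleft()
    for i in range(num_attributes):
        window.pop()
    window.extendleft(reversed(nextInstance))
    window.appendleft(attribute)
    return window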
I don't know for sure, but I suspect it might be similar to this problem: http://forums.devshed.com/python-programming-11/appending-object-to-list-overwrites-previous-842713.html