Imagine I'm reading in a csv file of numbers that looks like this:
1,6.2,10
5.4,5,11
17,1.5,5
...
And it's really really long.
I'm going to iterate through this file with a csv reader like this:
import csv
reader = csv.reader(open('numbers.csv'))  # csv.reader takes a file object (or iterable of lines), not a filename
Now assume I have some function, like max, that can take an iterator:
max((float(rec[0]) for rec in reader))
This finds the max of the first column and doesn't need to read the whole file into memory.
But what if I want to run max on each column of the csv file, still without reading the whole file into memory?
If max were rewritten like this:
def max(iterator):
    themax = float('-inf')
    for i in iterator:
        themax = i if i > themax else themax
        yield
    yield themax
I could then do some fancy work (and have) to make this happen.
But what if I constrain the problem and don't allow max to be rewritten? Is this possible?
Thanks!
If you're comfortable with a more functional approach, you can use functools.reduce to iterate through the file, pulling only two rows into memory at once and accumulating the column maximums as it goes.
import csv
from functools import reduce

def column_max(row1, row2):
    # zip contiguous rows and apply max to each of the column pairs
    return [max(float(c1), float(c2)) for (c1, c2) in zip(row1, row2)]

reader = csv.reader(open('numbers.csv'))

# calling `next` on reader advances its state by one row
first_row = next(reader)
column_maxes = reduce(column_max, reader, first_row)

# another way to write this is to unpack the reduction into explicit iteration
column_maxes = next(reader)  # consumes the first row; `reader` now points at the second
for row in reader:
    column_maxes = [max(float(c1), float(c2)) for (c1, c2) in zip(column_maxes, row)]
I would move away from a function that you pass the iterator to, and instead iterate over the reader yourself:
maxes = []
for row in reader:
    for i in range(len(row)):
        if i >= len(maxes):
            maxes.append(float(row[i]))
        else:
            maxes[i] = max(maxes[i], float(row[i]))
At the end you will have the list maxes, which contains the maximum value of each column, without ever having the whole file in memory.
def col_max(x0, x1):
    """x0 is a list of the accumulated maxes so far,
    x1 is a line from the file."""
    return [max(a, float(b)) for a, b in zip(x0, x1)]
Now functools.reduce(col_max, reader, initializer) will return just what you want. You will have to supply initializer as a list of -inf's of the correct length.
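For instance, a minimal sketch of wiring this together, reusing col_max from above (NUM_COLUMNS is an assumed constant; it just needs to match the width of your file):
import csv
import functools

NUM_COLUMNS = 3  # assumption: the sample file shown above has three columns

with open('numbers.csv', newline='') as f:
    reader = csv.reader(f)
    initializer = [float('-inf')] * NUM_COLUMNS
    column_maxes = functools.reduce(col_max, reader, initializer)

print(column_maxes)  # [17.0, 6.2, 11.0] for the three sample rows shown above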
I have one CSV with SKUs and URLs. I break them into two lists with:
def myskus():
    myskus = []
    with open('websupplies2.csv', 'r') as csvf:
        reader = csv.reader(csvf, delimiter=";")
        for row in reader:
            myskus.append(row[0])  # add each SKU to the list myskus
    return myskus

def mycontents():
    contents = []
    with open('websupplies2.csv', 'r') as csvf:
        reader = csv.reader(csvf, delimiter=";")
        for row in reader:
            contents.append(row[1])  # add each URL to the list contents
    return contents
Then I multiprocess my URLs, but I want to join in the corresponding SKU:
if __name__ == "__main__":
    with Pool(4) as p:
        records = p.map(parse, web_links)
    if len(records) > 0:
        with open('output_websupplies.csv', 'a') as f:
            f.write('\n'.join(records))
Can I put the following?
records = p.map(parse, skus, web_links)
Because it is not working.
My desired output format would be:
sku    price  availability
bkk11  10,00  available
How can I achieve this?
minor refactor
I recommend naming your pair of functions def get_skus() and def get_urls(), to match your problem definition.
data structure
Having a pair of lists, skus and urls, does not seem like a good fit for your high level problem.
Keep them together, as a list of (sku, url) tuples, or as a sku_to_url dict.
That is, delete one of your two functions, so you're reading the CSV once, and keeping the related details together.
Then your parse() routine would have more information available to it.
The list of tuples boils down to Monty's starmap() suggestion.
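A rough sketch of that shape (the delimiter, the column positions, and the two-argument parse(sku, url) signature are assumptions based on the code in the question; the parse body here is just a placeholder):
import csv
from multiprocessing import Pool

def get_sku_url_pairs(path):
    # read the CSV once and keep each SKU together with its URL
    with open(path, newline='') as csvf:
        reader = csv.reader(csvf, delimiter=";")
        return [(row[0], row[1]) for row in reader]

def parse(sku, url):
    # placeholder: fetch and parse the page, then return one structured record
    return (sku, '10,00', 'available')

if __name__ == "__main__":
    pairs = get_sku_url_pairs('websupplies2.csv')
    with Pool(4) as p:
        records = p.starmap(parse, pairs)  # each (sku, url) pair is unpacked into parse's arguments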
writing results
You're using this:
if len(records) > 0:
    with open('output_websupplies.csv', 'a') as f:
        f.write('\n'.join(records))
Firstly, testing for at least one record is probably superfluous; it's not the end of the world to open for append and then write zero records.
If you care about the timestamp on the file then perhaps it's a useful optimization.
More importantly, the write() seems Bad.
One day an unfortunate character may creep into one of your records.
Much better to feed your structured records to a csv.writer, to ensure appropriate quoting.
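For example, a minimal sketch, assuming each record is a (sku, price, availability) tuple or list rather than a pre-joined string:
import csv

with open('output_websupplies.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(records)  # quoting and escaping are handled field by field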
I was a little curious: when I add a single line to my code that counts the number of rows in the CSV file, the for loop stops working and just skips everything inside it.
My code, shown below, works now, but if I uncomment the row_count lines it stops working, so my question is why?
with open(r"C:\Users\heltbork\Desktop\test\ICM.csv", newline='') as csvfile:
sensor = csv.reader(csvfile, delimiter=',', quotechar='|')
#row_count = sum(1 for row in sensor)
#print(row_count)
for row in sensor:
#alot of stuff here
The reader is an iterator (see the iterator protocol):
... One notable exception is code which attempts multiple iteration passes. A container object (such as a list) produces a fresh new iterator each time you pass it to the iter() function or use it in a for loop. Attempting this with an iterator will just return the same exhausted iterator object used in the previous iteration pass, making it appear like an empty container.
The iterator is consumed as you iterate over it; it is not a concrete data structure:
sensor = csv.reader(...)              # creates an iterator
row_count = sum(1 for row in sensor)  # *consumes* the iterator
for row in sensor:                    # nothing left in the iterator; it was consumed by `sum`
    # a lot of stuff here
You should count while you iterate (inside for row in sensor:), because once you have iterated over it and consumed it, you can't iterate again.
Alternatives are using list to make the data concrete, or itertools.tee if you need the iterator interface and don't have a lot of data. You can also use enumerate and keep the last index.
Example:
sensor = csv.reader(...)  # creates an iterator
count = 0
for idx, row in enumerate(sensor, 1):  # enumerate from 1 so `count` ends up as the row count
    # a lot of stuff here
    # ...
    count = idx
print(count)
Or:
count = 0
for row in sensor:
    # a lot of stuff here
    # ...
    count += 1
print(count)
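And if you really do need the count up front, here is a sketch with itertools.tee, as mentioned above (only reasonable when the file is small, because the rows read by the counting pass are buffered in memory until the main loop catches up; the path is shortened for the sketch):
import csv
import itertools

with open('ICM.csv', newline='') as csvfile:
    sensor = csv.reader(csvfile, delimiter=',', quotechar='|')
    sensor, counter = itertools.tee(sensor)  # two independent iterators over the same rows
    row_count = sum(1 for row in counter)    # consumes the copy, not `sensor`
    print(row_count)
    for row in sensor:
        ...  # a lot of stuff here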
I'm interested in finding the FASTEST way to iterate through a list of lists and replace a character in the innermost list. I am generating the list of lists from a CSV file in Python.
Bing Ads API sends me a giant report but any percentage is represented as "20.00%" as opposed to "20.00". This means I can't insert each row as is to my database because "20.00%" doesn't convert to a numeric on SQL Server.
My solution thus far has been to use a list comprehension inside a list comprehension. I wrote a small script to test how fast this runs compared to just getting the list and it's doing ok (about 2x the runtime) but I am curious to know if there is a faster way.
Note: every record in the report has a rate and therefore a percent, so every record has to be visited once, and every rate has to be visited once (is that the cause of the 2x slowdown?).
Anyway, I would love a faster solution, as the size of these reports continues to grow!
import time
import csv
def getRecords1():
    with open('report.csv', 'rU', encoding='utf-8-sig') as records:
        reader = csv.reader(records)
        # skip all lines in the header; the row containing 'GregorianDate' is the last to skip
        while next(reader)[0] != 'GregorianDate':
            next(reader)
        recordList = list(reader)
    return recordList
def getRecords2():
    with open('report.csv', 'rU', encoding='utf-8-sig') as records:
        reader = csv.reader(records)
        # skip all lines in the header; the row containing 'GregorianDate' is the last to skip
        while next(reader)[0] != 'GregorianDate':
            next(reader)
        recordList = list(reader)
        data = [[field.replace('%', '') for field in record] for record in recordList]
    return data
def getRecords3():
    data = []
    with open('c:\\Users\\sflynn\\Documents\\Google API Project\\Bing\\uploadBing\\reports\\report.csv', 'rU', encoding='utf-8-sig') as records:
        reader = csv.reader(records)
        # skip all lines in the header; the row containing 'GregorianDate' is the last to skip
        while next(reader)[0] != 'GregorianDate':
            next(reader)
        for row in reader:
            row[10] = row[10].replace('%', '')
            data += [row]
    return data
def main():
    t0 = time.time()
    for i in range(2000):
        getRecords1()
    t1 = time.time()
    print("Get records normally takes " + str(t1 - t0))

    t0 = time.time()
    for i in range(2000):
        getRecords2()
    t1 = time.time()
    print("Using nested list comprehension takes " + str(t1 - t0))

    t0 = time.time()
    for i in range(2000):
        getRecords3()
    t1 = time.time()
    print("Modifying row as it's read takes " + str(t1 - t0))

main()
Edit: I have added a third function getRecords3() which is the fastest implementation I have seen yet. The output of running the program is as follows:
Get records normally takes 30.61197066307068
Using nested list comprehension takes 60.81756520271301
Modifying row as it's read takes 43.761850357055664
This means we have taken it down from a 2x slower algorithm to approximately 1.5x slower. Thank you everyone!
You could check whether in-place modification of the inner lists is faster than building a new list of lists with a list comprehension.
So, something like:
for record in records:
    for index in range(len(record)):
        record[index] = record[index].replace('%', '')
We can't really modify the string in-place since strings are immutable.
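If you want to time that idea against the versions above, here is a sketch in the same shape as getRecords1 (getRecords4 is a hypothetical name; the column index 10 is taken from getRecords3 in the question):
def getRecords4():
    with open('report.csv', 'rU', encoding='utf-8-sig') as records:
        reader = csv.reader(records)
        while next(reader)[0] != 'GregorianDate':  # same header-skipping logic as the question
            next(reader)
        recordList = list(reader)
    for record in recordList:
        # rebinds the list element to a new string; no second list of lists is built
        record[10] = record[10].replace('%', '')
    return recordList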
I created a simple application that loads up multiple csv's and stores them into lists.
import csv
import collections
list1=[]
list2=[]
list3=[]
l = open("file1.csv")
n = open("file2.csv")
m = open("file3.csv")
csv_l = csv.reader(l)
csv_n = csv.reader(n)
csv_p = csv.reader(m)
for row in csv_l:
    list1.append(row)
for row in csv_n:
    list2.append(row)
for row in csv_p:
    list3.append(row)
l.close()
n.close()
m.close()
I wanted to create a function responsible for this, to avoid repetition and clean up the code, so I was thinking of something like this:
def read(filename):
    x = open(filename)
    y = csv.reader(x)
    for row in y:
        list1.append(row)
    x.close()
However, it gets tough for me when I get to the for loop that appends to the list. This would work for appending to one list, but if I pass another file name into the function it will append to the same list. I'm not sure of the best way to go about this.
You just need to create a new list each time, and return it from your function:
def read(filename):
    rows = []
    x = open(filename)
    y = csv.reader(x)
    for row in y:
        rows.append(row)
    x.close()
    return rows
Then call it as follows
list1 = read("file1.csv")
Another option is to pass the list in as an argument to your function - then you can choose whether to create a new list each time, or append multiple CSVs to the same list:
def read(filename, rows):
    x = open(filename)
    y = csv.reader(x)
    for row in y:
        rows.append(row)
    x.close()
    return rows
# One list per file:
list1 = []
read("file1.csv", list1)
# Multiple files combined into one list:
listCombined = []
read("file2.csv", listCombined)
read("file3.csv", listCombined)
I have used your original code in my answer, but see also Malik Brahimi's answer for a better way to write the function body itself using with and list(), and DogWeather's comments - there are lots of different choices here!
You can make a single function, but use a with statement to condense even further:
def parse_csv(path):
    with open(path) as csv_file:
        return list(csv.reader(csv_file))
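Called the same way as before, one list per file:
list1 = parse_csv("file1.csv")
list2 = parse_csv("file2.csv")
list3 = parse_csv("file3.csv")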
I like #DNA's approach. But consider a purely functional style. This can be framed as a map operation which converts
["file1.csv", "file2.csv", "file3.csv"]
to...
[list_of_rows, list_of_rows, list_of_rows]
This function would be invoked like this:
l, n, m = map_to_csv(["file1.csv", "file2.csv", "file3.csv"])
And map_to_csv could be implemented something like this:
def map_to_csv(filenames):
    return [list(csv.reader(open(filename))) for filename in filenames]
The functional solution is shorter and doesn't need temporary variables.
I'm writing a Python script that reads a CSV file and creates a list of deques. If I print out exactly what gets appended to the list before it gets added, it looks like what I want, but when I print out the list itself I can see that append is overwriting all of the elements in the list with the newest one.
import csv
from collections import deque

# Window is a list containing many instances
def slideWindow(window, nextInstance, num_attributes):
    attribute = nextInstance.pop(0)
    window.popleft()
    for i in range(num_attributes):
        window.pop()
    window.extendleft(reversed(nextInstance))
    window.appendleft(attribute)
    return window

def convertDataFormat(filename, window_size):
    with open(filename, 'rU') as f:
        reader = csv.reader(f)
        window = deque()
        alldata = deque()
        i = 0
        for row in reader:
            if i < (window_size - 1):
                window.extendleft(reversed(row[1:]))
                i += 1
            else:
                window.extendleft(reversed(row))
                break
        alldata.append(window)
        for row in reader:
            window = slideWindow(window, row, NUM_ATTRIBUTES)
            alldata.append(window)
        # print alldata
    f.close()
    return alldata
It is really difficult to track exactly what you want from this code, but I suspect the problem lies in the following:
alldata.append(window)
for row in reader:
    window = slideWindow(window, row, NUM_ATTRIBUTES)
    alldata.append(window)
Notice that in your slideWindow function, you modify the input deque (window), and then return the modified deque. So, you're putting a deque into the first element of your list, then you modify that object (inside slideWindow) and append another reference to the same object onto your list.
Is that what you intend to do?
The simple fix is to copy the window input in slideWindow and modify/return the copy.
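A minimal sketch of that fix, keeping the rest of slideWindow as in the question (copying the deque up front means every entry appended to alldata is an independent snapshot):
from collections import deque

def slideWindow(window, nextInstance, num_attributes):
    window = deque(window)  # work on a copy so earlier snapshots in alldata are never mutated
    attribute = nextInstance.pop(0)
    window.popleft()
    for i in range(num_attributes):
        window.pop()
    window.extendleft(reversed(nextInstance))
    window.appendleft(attribute)
    return window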
I don't know for sure, but I'm suspicious it might be similar to this problem http://forums.devshed.com/python-programming-11/appending-object-to-list-overwrites-previous-842713.html.