I'm writing a blog post about generators in the context of screen scraping, or making lots of requests to an API based on the contents of a large-ish text file, and after reading this nifty comic by Julia Evans, I want to check something.
Assume I'm on Linux or OS X.
Let's say I'm making a screen scraper with scrapy (it's not so important to know scrapy for this question, but it might be useful context).
I have an open file like so, and I want to be able to yield a scrapy.Request for every line I pull out of a largeish CSV file.
with open('top-50.csv') as csvfile:
    urls = gen_urls(csvfile)
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
gen_urls is a function that looks like this.
def gen_urls(file_object):
    while True:
        # Read a line from the file, by seeking til you hit something like '\n'
        line = file_object.readline()
        # Drop out if there are no lines left to iterate through
        if not line:
            break
        # turn '1,google.com\n' to just 'google.com'
        domain = line.split(',')[1]
        trimmed_domain = domain.rstrip()
        yield "http://domain/api/{}".format(trimmed_domain)
This works, but I want to understand what's happening under the hood.
When I pass the csvfile to gen_urls() like so:
urls = gen_urls(csvfile)
In gen_urls my understanding is that it works by pulling out a line at a time in the while loop with file_object.readline(), then yielding with yield "http://domain/api/{}".format(trimmed_domain).
Under the hood, I think file_object holds a reference to some file descriptor, and readline() is essentially seeking forwards through the file until it finds the next newline \n character; the yield basically pauses this function until the next call to __next__() or the builtin next(), at which point it resumes the loop. This next is called implicitly by the for loop in the snippet that looks like:
for url in urls:
    yield scrapy.Request(url=url, callback=self.parse)
Because we're only pulling a line at a time from the file descriptor and then 'pausing' the function with yield, we don't end up with loads of stuff in memory. Because scrapy uses an evented model, you can make a bunch of scrapy.Request objects without them all immediately sending off a bajillion HTTP requests and saturating your network. This way, scrapy is also able to do useful things like throttle how quickly they're sent, how many are sent concurrently, and so on.
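To check that mental model concretely, here's a tiny standalone sketch of how I picture the laziness working (no scrapy, and the CSV contents are made up):
import io

def gen_urls(file_object):
    while True:
        line = file_object.readline()
        if not line:
            break
        domain = line.split(',')[1].rstrip()
        yield "http://domain/api/{}".format(domain)

fake_csv = io.StringIO("1,google.com\n2,example.com\n")
urls = gen_urls(fake_csv)   # nothing is read yet; we just get a generator object
print(next(urls))           # reads exactly one line, then pauses at the yield
print(fake_csv.tell())      # the stream position has only advanced past the first line so far
print(next(urls))           # resumes the while loop and reads the next line
# a third next(urls) would raise StopIteration, which a for loop handles for us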
This about right?
I'm mainly looking for a mental model that helps me think about using generators in Python and explain them to other people, rather than all the gory details; I've been using them for ages without thinking through what's happening, and I figured asking here might shed some light.
Related
I have a file in my Python folder called data.txt, and I have another file, read.py, that tries to read text from data.txt. But when I change something in data.txt, my read doesn't show anything new that I put in.
Something else I tried wasn't working, and I found something that did read the file, but when I changed the contents to something actually meaningful it didn't print the new text.
Can someone explain why it doesn't refresh, or what I need to do to fix it?
with open("data.txt") as f:
file_content = f.read().rstrip("\n")
print(file_content)
First and foremost, strings are immutable in Python - once you use file.read(), that returned object cannot change.
That being said, you must re-read the file at any point the contents may have changed.
For example
read.py
def get_contents(filepath):
    with open(filepath) as f:
        return f.read().rstrip("\n")
main.py
from read import get_contents
import time
print(get_contents("data.txt"))
time.sleep(30)
# .. change file somehow
print(get_contents("data.txt"))
Now, you could set up an infinite loop that watches the file's last modification timestamp from the OS and so always have the latest changes, but that seems like a waste of resources unless you have a specific need for it (e.g. tailing a log file), and even then there are arguably better tools for that.
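If you did want that, a rough sketch of the watch-the-timestamp approach (the polling interval, the file name, and simply re-printing the whole file are all just placeholders) might look like:
import os
import time

def watch_file(filepath, interval=1.0):
    last_mtime = None
    while True:
        mtime = os.path.getmtime(filepath)
        if mtime != last_mtime:          # the file was modified since we last looked
            last_mtime = mtime
            with open(filepath) as f:
                print(f.read().rstrip("\n"))
        time.sleep(interval)

watch_file("data.txt")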
It was unclear from your question whether you read the file once or multiple times, so here are the steps to take:
Make sure you call the read function repeatedly at a certain interval
Check that you actually save the file after modification
Make sure there are no file usage conflicts
So here is a description of each step:
When you read a file the way you shared, it gets closed afterwards, meaning it is read only once. You need to read it multiple times if you want to see changes, so do it at some kind of interval, in another thread or with async or whatever suits your application best (see the sketch after the read_file function below).
This step is obvious: remember to actually save the file (hit ctrl+s) after modifying it.
It may happen that a single file is being accessed by multiple processes, for example your editor and the script. To prevent errors from that, try the following code:
def read_file(file_name: str):
    while True:
        try:
            with open(file_name) as f:
                return f.read().rstrip("\n")
        except IOError:
            # the file is busy (e.g. another process is writing to it), so retry
            pass
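Coming back to step 1, a minimal sketch of re-reading the file on an interval in a background timer could look like this (the 5-second interval and the data.txt name are just examples, and it reuses the read_file function above):
import threading

def poll_file(file_name, interval=5.0):
    print(read_file(file_name))  # re-read and show the latest contents
    # schedule the next read so the file is re-read every `interval` seconds
    threading.Timer(interval, poll_file, args=(file_name, interval)).start()

poll_file("data.txt")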
I'm building some spiders to do some web scraping and am trying to figure out if my code is ok as written before I start building them out. The spiders will run via crontab at the same time, though they each write to a separate file.
with open(item['store_name']+'price_list2.csv', mode='a', newline='') as price_list2:
    savepriceurl2 = csv.writer(price_list2, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    savepriceurl2.writerow([item['url']]+item['price'])
I'm not sure how the 'open as price_list2' or 'savepriceurl2 = csv.writer' parts of the code work, and will the spiders get mixed up if they all use the same names, even for a different csv file, if they are all running at the same time?
From the minimal code posted it is difficult to say for certain whether there will be an issue when two spiders run at once. Assuming that the code you posted runs in each spider instance, each one will be writing to a file for whatever store it is scraping (defined by your item['store_name']).
Regarding your questions about the code, open(...) as price_list2 returns an io.TextIOWrapper object (details here) which is stored in the variable price_list2. You could achieve the same by writing price_list2 = open(...), however then you must close the file yourself in order not to leak memory/data. By writing it as with open(...) as file: you do not have to call file.close(), and the file is guaranteed to be closed after use.
The other line you asked about, savepriceurl2 = csv.writer(...), creates an object that simplifies writing to the actual file. You can then simply use its writerow() method to write a row to the desired file. More information on that can be found here.
So basically what your code is doing is this:
Open an object that represents a file. In your case you have also specified that you will append to the file if it exists (due to the mode being 'a')
Create a csv writer instance that will write to the file object price_list2 with the delimiter ',' (and some other options, check the link for details)
Tell the csv writer to write a row to the file which is the concatenation of the value of item['url'] and item['price']
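To make that breakdown concrete, here is a rough sketch of the two equivalent forms mentioned above (the file name and row values are made up):
import csv

# explicit form: you are responsible for closing the file yourself
price_list2 = open('storeprice_list2.csv', mode='a', newline='')
savepriceurl2 = csv.writer(price_list2, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
savepriceurl2.writerow(['http://example.com/item'] + ['9.99'])
price_list2.close()

# with-statement form: the file is closed automatically, even if an error occurs
with open('storeprice_list2.csv', mode='a', newline='') as price_list2:
    savepriceurl2 = csv.writer(price_list2, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    savepriceurl2.writerow(['http://example.com/item'] + ['9.99'])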
For your last question, given there is no information on your actual design and setup, I am assuming that each spider is an instance of the class that holds this code. As long as each spider is going to different sites (meaning that one spider will never have the same item['store_name'] as another), they will be writing to different files and it should be fine (I'm not aware of issues with writing to two different files 'at the same time' in Python). If this is not the case, you will run into problems if your spiders try to write to the same file at the same time.
As a tip, googling the functions will often get you the description and clarification on functions quicker than a post here and will have a lot more information.
I hope this helps and clarifies things for you.
I have a scrapy Python scraper. In this project I have always used the with statement for file handling, just like this:
with open('file2.json', 'r', encoding="utf8") as file_data:
    datas = json.load(file_data)
But when I want to delete this file, I get this error:
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'file2.json'
The code that is supposed to delete this file is:
filename = 'file2.json'
if os.path.exists(filename):
    os.remove(filename)
I tried some methods to solve this, but they didn't help. The first was running this code before deleting:
os.chmod(filename, 0o777)
The second was opening and closing the file before deleting it:
fn = open(filename, 'r')
fn.close()
None of these ways work and I'm still getting a permission error when deleting this file. Is there a way to close all open files via Python's garbage collector? How can I solve this issue?
I know this post is old, but there may be other people with this problem. This is how I managed to deal with it.
This problem of the scraper having the file handler opened after finishing, in my case, happens when my spider doesn't yield values or I try to close the spider through a CloseSpider exception.
So, instead of interrupting the spider or preventing it from yielding values, what I did was yield a single trash value which I could track later:
class Scraper(scrapy.Spider):
    # your spider's attributes (name, domains, start urls, etc)
    scrape = True
    trashYielded = False

    def parse(self, response):
        for href in response.css('my selector'):
            if href == 'http://foo.bar':
                self.scrape = False
            if self.scrape:
                # Here you yield your values as you would normally
                yield {'url': href}
            else:
                if not self.trashYielded:
                    yield {'trashKey': 'trashValue'}
                    self.trashYielded = True
I know this is a mess and there must be better ways for doing this, but no one has provided one (at least I wasn't able to find any after hours).
The scrape variable tells if your spider must keep scraping or not, and the trashYielded tells if you have thrown the trash value (this way, we only throw one trash value).
In my example, I want to stop my scraping when I find a link to certain page, and when I find it I set the scrape variable to False (meaning I don't want to continue scraping).
Next, I'll only yield values if scrape is True; otherwise I check whether the spider has already thrown a trash value (and throw one if it hasn't).
When you process your data, you should just check whether there is a 'trashKey' among your items and drop it.
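For example, if you later load the scraped items from a JSON feed export, that clean-up could be a small sketch like this (the items.json name is just an example):
import json

with open('items.json', encoding='utf8') as f:
    items = json.load(f)

# drop the placeholder record(s) the spider yielded on purpose
items = [item for item in items if 'trashKey' not in item]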
Hope this helps anyone (or attracts someone who could bring a better way) ^^
I'm having a difficult time understanding what the second 'with open' statement does here.
So, in the first 'with open' part, we've essentially said out = open(save_as_file, 'wb+'), right? (Still new to using 'with open'.) We later write to it and then 'with open' automatically closes the 'out' file. That part I get: we're writing this response object from Requests as binary to the specified save_as_file location, in chunks of 81920 bytes, aka our buffer size.
What's going on in the second 'with open'? Breaking it down the same way as above, it's pretty much fp = open(save_as_file, 'r'), right? What does that make fp, which was already assigned the request response object earlier? We're just opening save_as_file to use it for reading, but not reading or extracting anything from it, so I don't see the reason for it. If someone could explain in English just what's taking place and the purpose of the second 'with open' part, that would be much appreciated.
(don't worry about the load_from_file function at the end, that's just another function under the class)
def load_from_url(self, url, save_as_file=None):
    fp = requests.get(url, stream=True,
                      headers={'Accept-Encoding': None}).raw
    if save_as_file is None:
        return self.load_from_file(fp)
    else:
        with open(save_as_file, 'wb+') as out:
            while True:
                buffer = fp.read(81920)
                if not buffer:
                    break
                out.write(buffer)
        with open(save_as_file) as fp:
            return self.load_from_file(fp)
I'm the original author of the code that you're referring to; I agree it's a bit unclear.
If we hit the particular code at the else statement, this means that we want to save the data that we originally get from calling the URL to a file. Here, fp is actually the response text from the URL call.
We'll hit that else statement if, when run from the command line, we pass in --cpi-file=foobar.txt and that file doesn't actually exist yet; it acts as a target file, as mentioned here. If you don't pass in --cpi-file=foobar.txt, then the program will not write to a file; it will just go straight to reading the response data (from fp) via load_from_file.
So then, if that file does not exist but we did pass it in the command line, we will grab data from the URL (fp), and write that data to the target file (save_as_file). It now exists for our reference (it will be on your file system), if we want to use it again in this script.
Then, we will open that exact file again and call load_from_file to actually read and parse the data that we originally got from the response (fp).
Now, if we run this script two times, both with --cpi-file=foobar.txt, and foobar.txt doesn't exist yet, then the first time the script runs it will create the file and save the CPI data. The second time the script runs, it will avoid calling the CPI URL to re-download the data, and just go straight to parsing the CPI data from the file.
load_from_file is a bit of a misleading name, it should probably be load_from_stream as it could be reading the response data from our api call or from a file.
Hopefully that makes sense. In the next release of newcoder.io, I'll be sure to clear this language & code up a bit.
You are correct that the second with statement opens the file for reading.
What happens is this:
Load the response from the URL
If save_as_file is None:
Call load_from_file on the response and return the result
Else:
Store the contents of the response to save_as_file
Call load_from_file on the contents of the file and return the result
So essentially, if save_as_file is set it stores the response body in a file, processes it and then returns the processed result. Otherwise it just processes the response body and returns the result.
The way it is implemented here is likely because load_from_file expects a file-like object and the easiest way the programmer saw of obtaining that was to read the file back.
It could be done by keeping the response body in memory and using Python 3's io module or Python 2's StringIO to provide a file-like object that uses the response body from memory, thereby avoiding the need to read the file again.
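A rough sketch of that in-memory variant, reusing the names from the code above and assuming the response body is UTF-8 text and that load_from_file only needs a file-like object to read from, might look like:
import io
import requests

def load_from_url(self, url, save_as_file=None):
    fp = requests.get(url, stream=True,
                      headers={'Accept-Encoding': None}).raw
    if save_as_file is None:
        return self.load_from_file(fp)
    body = fp.read()                        # keep the whole body in memory
    with open(save_as_file, 'wb+') as out:  # still save a copy to disk
        out.write(body)
    # hand load_from_file an in-memory file-like object instead of re-opening the file
    return self.load_from_file(io.StringIO(body.decode('utf-8')))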
fp is reassigned in the second with statement in the same way as any other variable would be if you assigned it another value.
I tried the code below to simulate your case:
fp = open("/Users/example1.py",'wb+')
print "first fp",fp
with open("/Users/example2.py") as fp:
    print "second fp",fp
The output is:
first fp <open file '/Users/example1.py', mode 'wb+' at 0x10b200390>
second fp <open file '/Users/example2.py', mode 'r' at 0x10b200420>
So the second fp is just the same name rebound to a different file object inside the with block.
Your code seems to want to first read data from the URL and write it to save_as_file, and then read the data from save_as_file again and do something with it via load_from_file, like validating the content.
Here is a piece of code that describes how with works:
with provides a block that "cleans up" when exited
Can handle exceptions that occur within the block
Can also execute code when entered
class MyClass(object):
    def __enter__(self):
        print("entering the myclass %s" % (id(self)))
        return self

    def __exit__(self, type, value, traceback):
        print("Exit instance %s" % (id(self)))
        print("error type {0}".format(type))
        print("error value {0}".format(value))
        print("error traceback {0}".format(traceback))
        print("exiting the myclass")

    def sayhi(self):
        print("Sayhi instance %s" % (id(self)))

with MyClass() as cc:
    cc.sayhi()
print("after the block ends")
I want to do something like:
import csv
import tornado.web
class MainHandler(tornado.web.RequestHandler):
    def post(self):
        uploaded_csv_file = self.request.files['file'][0]
        with uploaded_csv_file as csv_file:
            for row in csv.reader(csv_file):
                self.write(' , '.join(row))
But, uploaded_csv_file is not of type file.
What is the best practice here?
Sources:
http://docs.python.org/2/library/csv.html
http://docs.python.org/2/library/functions.html#open
https://stackoverflow.com/a/11911972/242933
As the documentation explains:
csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called — file objects and list objects are both suitable.
So, if you have something which is not a file, but is an iterator over lines, that's fine. If it's not even an iterator over lines, just wrap it in one. For a trivial example, if it's something with a read_all() method, you can always do this:
uploaded_csv_file = self.request.files['file'][0]
contents = uploaded_csv_file.read_all()
lines = contents.splitlines()
for row in csv.reader(lines):
    # ...
(Obviously you can merge steps together to make it briefer; I just wrote each step as a separate line to make it simpler to understand.)
Of course if the CSV files are large, and especially if they take a while to arrive and you've got a nice streaming interface, you probably don't want to read the whole thing at once this way. Most network server frameworks offer nice protocol adapters to, e.g., take a stream of bytes and give you a stream of lines. (For that matter, even socket.makefile() in the stdlib sort of does that…) But in your case, I don't think that's an issue.
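Back to the Tornado case specifically: the uploaded file there is a dict-like object whose 'body' key holds the raw bytes of the upload, so a minimal sketch (assuming Python 3 and a UTF-8 encoded upload) could look like:
import csv
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def post(self):
        uploaded_csv_file = self.request.files['file'][0]
        # decode the raw bytes of the upload and split them into lines for csv.reader
        lines = uploaded_csv_file['body'].decode('utf-8').splitlines()
        for row in csv.reader(lines):
            self.write(' , '.join(row))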