WinError 32 permission error on deleting a file with scrapy - python

I have a scrapy Python scraper. In this project I have always used the with statement for file handling, like this:
with open('file2.json', 'r', encoding="utf8") as file_data:
    datas = json.load(file_data)
But when I want to delete this file, I get this error:
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'file2.json'
The code that is supposed to delete this file is:
filename = 'file2.json'
if os.path.exists(filename):
    os.remove(filename)
I tried a few things to solve this, but nothing helped. The first was running this code before deleting:
os.chmod(filename, 0o777)
The second was opening and closing the file before deleting it:
fn = open(filename, 'r')
fn.close()
Neither of these approaches worked, and I'm still getting the permission error when deleting this file. Is there a way to close all open files via Python's garbage collector? How can I solve this issue?

I know this post is old, but there may be other people with this problem. This is how I managed to deal with it.
In my case, this problem of the scraper keeping the file handle open after finishing happens when my spider doesn't yield any values, or when I try to close the spider through a CloseSpider exception.
So, instead of interrupting the spider or preventing it from yielding values, what I did was yield a single 'trash' value which I could track later:
class Scraper(scrapy.Spider):
    # your spider's attributes (name, domains, start urls, etc)
    scrape = True
    trashYielded = False

    def parse(self, response):
        for href in response.css('my selector'):
            if href == 'http://foo.bar':
                self.scrape = False
            if self.scrape:
                # Here you yield your values as you would normally
                yield {'url': href}
            else:
                if not self.trashYielded:
                    yield {'trashKey': 'trashValue'}
                    self.trashYielded = True
I know this is a mess and there must be better ways of doing it, but no one has provided one (at least I wasn't able to find any after hours of searching).
The scrape variable tells whether your spider should keep scraping or not, and trashYielded tells whether you have already yielded the trash value (this way, we only yield one trash value).
In my example, I want to stop scraping when I find a link to a certain page, and when I find it I set the scrape variable to False (meaning I don't want to continue scraping).
Next, I only yield values if scrape is True; otherwise I check whether the spider has already yielded a trash value (and yield it if it hasn't).
When you process your data, you should just check whether there is a 'trashKey' among your records and drop it, as in the sketch below.
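For instance, if the feed is exported as a JSON array, the post-processing could look roughly like this (a hedged sketch; the file name and key are just the ones used above, adapt them to your own pipeline):
import json

# Load the exported feed and drop the single placeholder item.
with open('file2.json', 'r', encoding='utf8') as file_data:
    datas = json.load(file_data)

# Keep every record that is not the {'trashKey': 'trashValue'} placeholder.
datas = [record for record in datas if 'trashKey' not in record]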
Hope this helps anyone (or attracts someone who could bring a better way) ^^

Related

Refresh variable when reading from a txt file

I have a file in my Python folder called data.txt, and another file read.py that tries to read the text from data.txt. But when I change something in data.txt, my read doesn't show anything new that I put in.
Something else I tried wasn't working either: I found something that did read the file, but when I changed the contents to something actually meaningful it didn't print the new text.
Can someone explain why it doesn't refresh, or what I need to do to fix it?
with open("data.txt") as f:
file_content = f.read().rstrip("\n")
print(file_content)
First and foremost, strings are immutable in Python - once you use file.read(), that returned object cannot change.
That being said, you must re-read the file at any point where its contents may have changed.
For example
read.py
def get_contents(filepath):
    with open(filepath) as f:
        return f.read().rstrip("\n")
main.py
from read import get_contents
import time
print(get_contents("data.txt"))
time.sleep(30)
# .. change file somehow
print(get_contents("data.txt"))
Now, you could set up an infinite loop that watches the file's last modification timestamp from the OS and so always has the latest changes, but that seems like a waste of resources unless you have a specific need for it (e.g. tailing a log file), and even then there are arguably better tools for that.
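If you do need that, a minimal sketch of polling os.path.getmtime could look like this (the file name and interval are illustrative, and get_contents is the helper from read.py above):
import os
import time

from read import get_contents

FILENAME = "data.txt"
last_mtime = os.path.getmtime(FILENAME)

while True:
    time.sleep(1)  # poll once per second; tune to your needs
    mtime = os.path.getmtime(FILENAME)
    if mtime != last_mtime:
        # The file changed on disk, so re-read it.
        last_mtime = mtime
        print(get_contents(FILENAME))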
It was unclear from your question whether you read the file once or multiple times, so here are the steps to take:
Make sure you call the read function repeatedly at a certain interval
Check that you actually save the file after modifying it
Make sure there are no file usage conflicts
So here is a description of each step:
When you read a file the way you showed, it gets closed afterwards, meaning it is read only once. You need to read it multiple times if you want to see changes, so do it at some interval in another thread, with async, or whatever suits your application best.
This step is obvious: remember to hit Ctrl+S to save the file after editing it.
It may happen that a single file is being accessed by multiple processes at once, for example your editor and the script. To prevent errors from that, try the following code:
def read_file(file_name: str):
    while True:
        try:
            with open(file_name) as f:
                return f.read().rstrip("\n")
        except IOError:
            pass
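A hedged usage sketch of step 1, calling the read_file function above at an interval (the interval and file name are illustrative):
import time

previous = None
while True:
    content = read_file("data.txt")
    if content != previous:
        # Only react when the contents actually changed.
        print(content)
        previous = content
    time.sleep(2)  # re-read every two seconds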

Python Win 32 error while trying to rename a file

I have a folder with several CSV files in it. I have to rename every file using a string that I find inside the file, so I tried the script below. It looks like it works until I try to rename the file.
What I tried:
First I didn't have the file.close() line in the program, but adding it didn't fix the problem
I added a line print(file.closed) to see if the file was actually closed
I tried to move the os.rename out of the indented with block, but I keep getting the same error
I tried to move the os.rename out of any block, but then I get a WinError 123, which says that the filename, directory name, etc. is incorrect.
I also read the questions WindowsError 32 while trying to os.rename and Windows Error: 32 when trying to rename file in python.
I understood that maybe I had to close the file with f.close, since this is the handler, but that didn't work either.
The code that I tried:
for f in glob.glob("/path/*.csv"):
    with open(f, "r") as file:
        # read the lines in the csv-file
        data = file.read()
        # search the lines that have been read for a pattern and save that in "search"
        search = re.findall("some_pattern", data)
        # The result was a list. With this line I tried to change it into a string
        file.close()
        Listtostring = ''.join([str(elem) for elem in search])
        # I only want to use a part of the match in the new file name
        name = Listtostring.replace("part_of_string", "")
        os.rename(f, f + name)
I hope somebody can give me some tips and explain what I am doing wrong. I'm pretty new to Python, so any insight into my mistakes is appreciated!
Thank you for your comments and time. It turned out that one of the opened files was still in use by some process, and that is why the code didn't work. I first closed all the applications that were running, but that didn't help. After that I restarted the computer and the script worked fine!
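For anyone who lands here later, a hedged sketch of the same logic with the file handle guaranteed to be closed before the rename (illustrative only; in the asker's case the real culprit was another process holding the file):
import glob
import os
import re

for f in glob.glob("/path/*.csv"):
    # Read and search while the file is open...
    with open(f, "r") as file:
        data = file.read()
    # ...and only rename after the with block has closed the handle.
    search = re.findall("some_pattern", data)
    list_to_string = ''.join(str(elem) for elem in search)
    name = list_to_string.replace("part_of_string", "")
    os.rename(f, f + name)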

Multiple Scripts/Spiders writing to different CSV files. Will this code cause any problems?

I'm building some spiders to do some web scraping and am trying to figure out if my code is ok as written before I start building them out. The spiders will run via crontab at the same time, though they each write to a separate file.
with open(item['store_name'] + 'price_list2.csv', mode='a', newline='') as price_list2:
    savepriceurl2 = csv.writer(price_list2, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    savepriceurl2.writerow([item['url']] + item['price'])
I'm not sure how the 'open ... as price_list2' or 'savepriceurl2 = csv.writer' parts of the code work, and will the spiders get mixed up if they all use the same variable names, even though each writes to a different csv file, when they are all running at the same time?
From the minimal code posted it is difficult to say whether there will be an issue with two spiders running at the same time. Assuming the code you posted runs in each spider instance, each will be writing to the file for whatever store it is scraping (defined by your item['store_name']).
Regarding your questions about the code: open(...) as price_list2 returns an io.TextIOWrapper object (see the io module documentation), which is stored in the variable price_list2. You could achieve the same by writing price_list2 = open(...), but then you must close the file yourself in order not to leak resources. Writing it as with open(...) as price_list2: means you do not have to call price_list2.close(), and ensures the file is always closed after use.
The other line you asked about, savepriceurl2 = csv.writer(...), creates an object that simplifies writing to the actual file. You can simply use its writerow() method to write a row to the desired file. More information can be found in the csv module documentation.
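Roughly speaking (a simplified sketch of the idea, not the exact mechanics, reusing the names from your snippet), the with version is equivalent to:
import csv

# 'item' comes from your spider, as in your snippet.
price_list2 = open(item['store_name'] + 'price_list2.csv', mode='a', newline='')
try:
    savepriceurl2 = csv.writer(price_list2, delimiter=',', quotechar='"',
                               quoting=csv.QUOTE_MINIMAL)
    savepriceurl2.writerow([item['url']] + item['price'])
finally:
    # The with statement guarantees this close runs even if writerow raises.
    price_list2.close()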
So basically what your code is doing is this:
Open an object that represents a file. In your case you have also specified that you will append to the file if it exists (because the mode is 'a')
Create a csv writer instance that will write to the file object price_list2 with the delimiter ',' (and some other options; check the csv documentation for details)
Tell the csv writer to write a row to the file which is the concatenation of the value of item['url'] and item['price']
For your last question, given there is no information on your actual design and setup, I am assuming that each spider is an instance of the class that holds this code. As long as each spider goes to different sites (meaning one spider will never have the same item['store_name'] as another), they will be writing to different files, and that should be fine (I'm not aware of issues with writing to two different files 'at the same time' in Python). If that is not the case, you will run into issues when your spiders try to write to the same file at the same time.
As a tip, googling the functions will often get you descriptions and clarifications quicker than a post here, and with a lot more information.
I hope this helps and clarifies things for you.

Checking my assumption of how generators work in python 3

I'm writing a blog post about generators in the context of screen scraping, or making lots of requests to an API based on the contents of a large-ish text file, and after reading this nifty comic by Julia Evans, I want to check something.
Assume I'm on linux or OS X.
Let's say I'm making a screen scraper with scrapy (it's not so important to know scrapy for this question, but it might be useful context).
If I have an open file like so, I want to be able to yield a scrapy.Request for every line I pull out of a large-ish csv file:
with open('top-50.csv') as csvfile:
    urls = gen_urls(csvfile)
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
gen_urls is a function that looks like this:
def gen_urls(file_object):
    while True:
        # Read a line from the file, by seeking until you hit something like '\n'
        line = file_object.readline()
        # Drop out if there are no lines left to iterate through
        if not line:
            break
        # turn '1,google.com\n' into just 'google.com'
        domain = line.split(',')[1]
        trimmed_domain = domain.rstrip()
        yield "http://domain/api/{}".format(trimmed_domain)
This works, but I want to understand what's happening under the hood.
When I pass csvfile to gen_urls() like so:
urls = gen_urls(csvfile)
In gen_urls my understanding is that it works by pulling out a line at a time in the while loop with file_object.readline(), then yielding with yield "http://domain/api/{}".format(trimmed_domain).
Under the hood, I think the file object is a reference to some file descriptor, and readline() is essentially seeking forwards through the file until it finds the next newline \n character. The yield basically pauses this function until the next call to __next__() or the builtin next(), at which point it resumes the loop. This next is called implicitly by the for loop in the snippet that looks like:
for url in urls:
    yield scrapy.Request(url=url, callback=self.parse)
Because we're only pulling one line at a time from the file descriptor and then 'pausing' the function with yield, we don't end up with loads of stuff in memory. And because scrapy uses an evented model, you can make a bunch of scrapy.Request objects without them all immediately sending off a bajillion HTTP requests and saturating your network. This way, scrapy is also able to do useful things like throttle how quickly they're sent, how many are sent concurrently, and so on.
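To make the 'pausing' concrete, here's a small sketch of driving the generator by hand with next(), which is what the for loop does implicitly (using the gen_urls above; the CSV path is just the one from my example):
with open('top-50.csv') as csvfile:
    urls = gen_urls(csvfile)   # nothing is read yet; this just creates a generator object
    first = next(urls)         # runs the body up to the first yield, reading one line
    second = next(urls)        # resumes after the yield, reading the next line
    print(first, second)
# Each next() call reads exactly one more line, so the whole file is never
# loaded into memory at once.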
This about right?
I'm mainly looking for a mental model that helps me think about using generators in python and explain them to other people, rather than all the gory details, as I've been using them for ages, without thinking through what's happening, and I figured asking here might shed some light.

Python "With open(file) as f" is not saving the content if intrupt in a loop

I mostly use with open('file.txt', 'w') as f: for writing (and reading as well). Today I noticed something weird.
I was crawling a site and there was normal pagination.
while True:
    # visit url, get/scrape data
    # save data in text file
    # find next link (pagination)
    # loop till next url is available
For saving the data, I first used with:
with open('data.txt', 'w') as f:
    while True:
        # visit url, get/scrape data
        f.write(some_scraped_data)
        # find next link (pagination)
        # loop till next url is available
But when I run this script and some exception occurs, the loop gets terminated and no data is saved in the data.txt file.
However, when I do f = open('data.txt', 'w'), whatever data was crawled is saved (up until the exception occurred), even though I didn't call f.close():
f = open('data.txt', 'w')
while True:
    # visit url, get/scrape data
    f.write(some_scraped_data)
    # find next link (pagination) till next url is available
My question is: how can we achieve the same thing with with? I'm just curious, because everybody recommends with for file handling, yet it doesn't seem to support this behaviour.
PS: I'm not very experienced in Python, so if you find this question silly, I'm sorry.
According to the documentation, the use of 'with' will close the file correctly if an exception occurs, so that is a good approach.
However, you could try f.flush() to get the buffer written to disk. More on flush here: What exactly is Python's file.flush() doing?
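A minimal sketch of combining the two suggestions: keep the with block and call flush() after every write (the loop here is a runnable stand-in for the crawl/pagination logic in the question):
pages = ['first page data\n', 'second page data\n']  # stand-in for scraped pages

with open('data.txt', 'w') as f:
    for some_scraped_data in pages:  # stand-in for the while-True crawl loop
        f.write(some_scraped_data)
        # Push the buffer to the OS immediately, so everything written so far
        # survives even if a later iteration raises an exception.
        f.flush()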
