I need my spider to run every minute. For scheduling tasks I usually use celery-beat; however, from what I've read, Celery isn't typically used with Scrapy.
What's the best approach to scheduling Scrapy spiders?
Zyte's Scrapy Cloud works seamlessly with Scrapy and has a feature for scheduling periodic jobs.
Another option, if you're on any Unix-like system, would be cron:
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday)
* * * * * scrapy crawl my_spider
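Note that cron runs jobs with a minimal environment and from your home directory, so the bare scrapy command may not be found and relative paths won't resolve. In practice you usually need to cd into the Scrapy project and use the full path to the scrapy executable (the paths below are placeholders for your own setup):
* * * * * cd /path/to/scrapy_project && /usr/local/bin/scrapy crawl my_spider >> /tmp/my_spider.log 2>&1
Redirecting stdout and stderr to a log file also makes failures much easier to diagnose.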
I was hoping someone could help me figure out an odd "dependency" problem. I have a fairly large Python project, with a slimmed-down structure that looks like:
Sitka
│   DataTickers.py
│   example.csv
│   FinDates.py
│   SitkaMongo.py
│   tickers_csv.csv
│   __init__.py
│
├───Fin
│   │   main.py
│   │   md_provider_control.py
│   │   Tofino.py
│   │   __init__.py
│   │
│   ├───Instruments
│   │       market_standard_instruments.py
│   │       __init__.py
│   │
│   ├───Env
│   │       CurveClass.py
│   │
│   └───Utils
│           charting.py
│           exchange_identifier_mapper.py
│           fin_mapper.py
│           md_provider_simulation.py
│           __init__.py
Tofino.py:
from .Env.CurveClass import CurveData as _CurveData

class Tofino():
    def __init__(self, mdp, VAL_ENV=None):
        mdp.tofino = self  # link Tofino
        # Public VE reference
        self.val_env = VAL_ENV
        self.ir_config = VAL_ENV.market
market_standard_instruments.py:
# Standard Imports
import Sitka.FinDates as fdate
import datetime as dt
import re
from itertools import product
# bunch of functions after this.
CurveClass.py:
import pandas as pd
import datetime as dt
from dateutil.relativedelta import relativedelta

class CurveData():
    def __init__(self):
        self.do_stuff = self._stuff()
main.py:
from Sitka.FinDates import getMainDates

# Sitka - custom imports
from .md_provider_control import MD_ProviderV3
from .Tofino import Tofino
import Sitka.Fin.Instruments.market_standard_instruments as mkt_std

def main() -> Tofino:
    # < ---- do a bunch of stuff ---- >
    return Tofino(mdp=mdp, VAL_ENV=ve.GLOBAL_VALN_ENV)
And lastly, Sitka/Fin/__init__.py:
import logging
import traceback

# Run Valuation Environment Startup
from .main import main

# Global Variables:
from .Tofino import Tofino as _Tofino

tofino: _Tofino
tofino = None
try:
    tofino = main()  # I was trying some stuff out here, hence the weird traceback in try
except:
    print(traceback.format_exc())
My issue, after all that, is that when I run import Sitka.Fin as fin, this line in main.py:
import Sitka.Fin.Instruments.market_standard_instruments as mkt_std
fires off the Sitka.Fin.__init__ process again before we even get to the try block (so the init basically runs 2x).
Any help is appreciated!
P.S. Basically I'm just including the subfolder __init__'s because it's the only way I know how to get IntelliSense/autocomplete in the IDE to work nicely... I would love to know how to make my code 'cleaner' in that sense.
Edit:
A simpler way to look at the problem: let's say I open a new IPython console and only do:
import Sitka.Fin.Instruments.market_standard_instruments as mkt_std
Simply doing this kicks off the entire Sitka.Fin.__init__ procedure [which I wouldn't have expected].
It seems you only want some of the code in main.py to run when the file itself is executed. Try using:
if __name__ == "__main__":
    # All Sitka imports
    from Sitka.FinDates import getMainDates
    from .md_provider_control import MD_ProviderV3
    from .Tofino import Tofino
    import Sitka.Fin.Instruments.market_standard_instruments as mkt_std
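For what it's worth, the behaviour in your edit is standard Python, not a quirk of your project: importing Sitka.Fin.Instruments.market_standard_instruments must first import every parent package (Sitka, then Sitka.Fin, then Sitka.Fin.Instruments), executing each one's __init__.py along the way. A minimal sketch with a hypothetical package pkg shows the effect:
# pkg/__init__.py
print("initializing pkg")

# pkg/sub/__init__.py
print("initializing pkg.sub")

# pkg/sub/mod.py
print("initializing pkg.sub.mod")

# In a fresh console:
import pkg.sub.mod
# prints:
#   initializing pkg
#   initializing pkg.sub
#   initializing pkg.sub.mod
Because Sitka/Fin/__init__.py calls main() at import time, any import that touches the Sitka.Fin package triggers the whole valuation-environment startup; moving that call into an explicit function (e.g. a startup() the caller invokes once) avoids the surprise.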
I have run into a rather well-known issue while importing my Python modules inside a project. This code replicates the situation.
multiply_func.py:
def multiplier(num_1, num_2):
    return num_1 * num_2
power_func.py:
from math_tools import multiplier

def pow(num_1, num_2):
    result = num_1
    for _ in range(num_2 - 1):
        result = multiplier(num_1, result)
    return result
The project structure:
project/
│   main.py
│
└───tools/
    │   __init__.py
    │   power_func.py
    │
    └───math_tools/
            __init__.py
            multiply_func.py
I've added these lines to the __init__ files to make importing easier:
__init__.py (math_tools):
from .multiply_func import multiplier
__init__.py (tools):
from .power_func import pow
from .math_tools.multiply_func import multiplier
Here is my main file.
main.py:
from tools import pow
print(pow(2, 3))
Whenever I run it, I get this error:
>>> ModuleNotFoundError: No module named 'math_tools'
I tried manipulating sys.path, but had no luck eliminating this puzzling issue. I'd appreciate your kind help. Thank you in advance!
The problem is in the power_func.py file.
You have to use a leading . before math_tools to refer to a module in the current package.
Update power_func.py as below, and it works perfectly:
from .math_tools import multiplier
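For completeness, power_func.py then reads:
from .math_tools import multiplier

def pow(num_1, num_2):
    result = num_1
    for _ in range(num_2 - 1):
        result = multiplier(num_1, result)
    return result
The leading dot makes the import relative to the containing package (tools), so it resolves regardless of the directory main.py is run from, and it picks up the multiplier that tools/math_tools/__init__.py re-exports.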
In this tutorial, one line of the code reads:
from schemas.tokens import Token
Which package do I need to install? I can't find it via Google.
Further down the tutorial we read:
We need a schema to verify that we are returning an access_token and token_type as defined in our response_model. Let's put this code in schemas > tokens.py
So it's a package created in the tutorial itself, i.e. a custom package, not from some library.
Yeah, that's the problem.
If you read the entire tutorial, you'll see this tree structure:
backend/
├─.env
├─apis/
│ └─general_pages/
│ └─route_homepage.py
├─core/
│ └─config.py
├─db/
│ ├─base.py
│ ├─base_class.py
│ ├─models/
│ │ ├─jobs.py
│ │ └─users.py
│ └─session.py
├─main.py
├─requirements.txt
├─schemas/ # <---------------- HERE
│ ├─jobs.py
│ └─users.py
├─static/
│ └─images/
│ └─logo.png
└─templates/
├─components/
│ └─navbar.html
├─general_pages/
│ └─homepage.html
└─shared/
└─base.html
where schemas is a package inside the project root.
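Based on the tutorial's description (a schema verifying that access_token and token_type are returned), schemas/tokens.py is presumably just a small Pydantic model along these lines (a sketch of what the tutorial defines, not code from any library):
from pydantic import BaseModel

class Token(BaseModel):
    access_token: str
    token_type: str
You create the file yourself; there is nothing to pip install.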
I have this Python script that I wrote in Anaconda and downloaded to my local workspace as a .py file:
#!/usr/bin/env python
# coding: utf-8
# In[33]:
#!/usr/bin/env python
#
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""This example downloads a criteria performance report as a string with AWQL.
To get report fields, run get_report_fields.py.
The LoadFromStorage method is pulling credentials and properties from a
"googleads.yaml" file. By default, it looks for this file in your home
directory. For more information, see the "Caching authentication information"
section of our README.
"""
from googleads import adwords
import io
import pandas as pd
adwords_client = adwords.AdWordsClient.LoadFromStorage()
# Initialize appropriate service.
report_downloader = adwords_client.GetReportDownloader(version='v201809')
# Create report query.
report_query = (adwords.ReportQueryBuilder()
                .Select('CampaignId', 'AdGroupId', 'Id', 'Criteria',
                        'CriteriaType', 'FinalUrls', 'Impressions', 'Clicks',
                        'Cost')
                .From('CRITERIA_PERFORMANCE_REPORT')
                .Where('Status').In('ENABLED', 'PAUSED')
                .During('LAST_7_DAYS')
                .Build())
output = io.StringIO()
report_downloader.DownloadReportWithAwql(
    report_query, 'CSV', output, skip_report_header=True,
    skip_column_header=False, skip_report_summary=True,
    include_zero_impressions=True)
output.seek(0)
df = pd.read_csv(output)
print(df.head())
# In[44]:
df.to_csv("/Users/ezerivarola/Desktop/Google_ADS_API/report1.csv",index=False)
# In[ ]:
and I am trying to schedule it with crontab with the following command:
* * * * * /usr/local/bin/python3 /Users/ezerivarola/Desktop/Google_ADS_API/Report1_DF.py
But although I don't get any error, and the mail shows that it is running, the CSV file the script should generate never appears.
Does anyone have an idea of what could be wrong?
You need to change the five *'s at the beginning to match the time period at which you want it to run:
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │ 7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * * command to execute
The below will run it every hour, on the hour:
0 * * * * /usr/local/bin/python3 /Users/ezerivarola/Desktop/Google_ADS_API/Report1_DF.py
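While debugging, it also helps to redirect the job's output to a file so errors don't vanish into local mail (the log path here is just an example):
0 * * * * /usr/local/bin/python3 /Users/ezerivarola/Desktop/Google_ADS_API/Report1_DF.py >> /tmp/report1.log 2>&1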
I already solved my problem. Crontab was not executing the Python script because it didn't have access to the disk. Here is a link with more details on the solution: https://blog.bejarano.io/fixing-cron-jobs-in-mojave/
Thank you all!
I'm getting an error message that I'm unable to tackle. I don't understand what the issue is with the multiprocessing library, and I don't understand why it says it is impossible to import the build_database module when, at the same time, it executes a function from that module perfectly.
Could somebody tell me if they see something? Thank you.
(Two child processes print the same traceback at once, so the output is interleaved; untangled, each one reads:)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\multiprocessing\forking.py", line 380, in main
    prepare(preparation_data)
  File "C:\Python27\lib\multiprocessing\forking.py", line 495, in prepare
    '__parents_main__', file, path_name, etc
  File "C:\Users\Comp3\Desktop\User\Data\main.py", line 4, in <module>
    import database.build_database
ImportError: No module named build_database
This is what I have in my load_bigquery.py file:
# Send CSV to Cloud Storage
def load_send_csv(table):
    job = multiprocessing.current_process().name
    print '[' + table + '] : job starting (' + job + ')'
    bigquery.send_csv(table)

#timer.print_timing
def send_csv(tables):
    jobs = []
    build_csv(tables)
    for t in tables:
        if t not in csv_targets:
            continue
        print ">>>> Starting " + t
        # Load CSV in BigQuery, as parallel jobs
        j = multiprocessing.Process(target=load_send_csv, args=(t,))
        jobs.append(j)
        j.start()
    # Wait for jobs to complete
    for j in jobs:
        j.join()
And I call it like this from my main.py:
bigquery.load_bigquery.send_csv(tables)
My folder structure is like this:
src
│   main.py
│
├───bigquery
│       bigquery.py
│       bigquery2.dat
│       client_secrets.json
│       herokudb.py
│       herokudb.pyc
│       distimo.py
│       flurry.py
│       load_bigquery.py
│       load_bigquery.pyc
│       timer.py
│       __init__.py
│       __init__.pyc
│
└───database
        build_database.py
        build_database.pyc
        build_database2.py
        postgresql.py
        timer.py
        __init__.py
        __init__.pyc
That function works perfectly if I execute load_bigquery.py alone, but if I import it into main.py it fails with the errors given above.
UPDATE:
Here are my imports; maybe they'll help:
main.py
import database.build_database
import bigquery.load_bigquery
import views.build_analytics
import argparse
import getopt
import sys
import os
load_bigquery.py
import sys
import os
import subprocess
import time
import timer
import distimo
import flurry
import herokudb
import bigquery
import multiprocessing
import httplib2
bigquery.py
import sys
import os
import subprocess
import json
import time
import timer
import httplib2
from pprint import pprint
from apiclient.discovery import build
from oauth2client.file import Storage
from oauth2client.client import AccessTokenRefreshError
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.client import flow_from_clientsecrets
from oauth2client.tools import run
from apiclient.errors import HttpError
Maybe the issue is that load_bigquery.py imports multiprocessing and then main.py imports load_bigquery.py?
You are probably missing the __init__.py inside src/bigquery/. So your source folders should be:
> src/main.py
> src/bigquery/__init__.py
> src/bigquery/load_bigquery.py
> src/bigquery/bigquery.py
The __init__.py just needs to be empty and is only there so that Python knows that bigquery is a Python package.
UPDATED: Apparently the __init__.py file is present. The actual error message is about something different: it cannot import database.build_database.
My suggestion is to look into that. It is not mentioned as being in the src folder...
UPDATE 2: I think you have a clash with your imports. Python 2 has a slightly fuzzy relative import, which sometimes catches people out. You have both a package at the same level of main.py called database and one inside bigquery called database. I think somehow you are ending up with the one inside bigquery, which doesn't have build_database. Try renaming one of them.
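One more thing worth checking, given the traceback: on Windows, multiprocessing starts each child process by re-importing the parent's __main__ module, so every module-level statement in main.py (including import database.build_database) runs again in each child, possibly with a different working directory and sys.path. That is exactly why the traceback goes through forking.py's prepare(). The usual defence is to guard the entry point; a sketch of the pattern (the table list is hypothetical):
# main.py
import bigquery.load_bigquery

if __name__ == '__main__':
    # Only the parent process executes this block; the children
    # spawned by multiprocessing re-import main.py but skip it.
    tables = ['distimo', 'flurry']  # hypothetical table list
    bigquery.load_bigquery.send_csv(tables)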