I have multiple spiders that I run in a bash script like so:
pipenv run scrapy runspider -o output-a.json a.py
pipenv run scrapy runspider -o output-b.json b.py
Since they will run for a long time, I'd like a simple way of monitoring their success rate; my plan is to ping https://healthchecks.io when both scrapers run successfully (i.e. without any error messages). I've sprinkled some assert statements over the code to be reasonably confident about this.
pipenv run scrapy runspider -o output-a.json a.py
result_a=$?
pipenv run scrapy runspider -o output-b.json b.py
result_b=$?
if [ "$result_a" -eq 0 ] && [ "$result_b" -eq 0 ]; then
curl $url
fi
My problem is that each scrapy runspider command always returns 0 no matter what, so I can't really check whether they have been successful.
Is there a way to influence this behavior? Some command line flag I haven't found? If not, how would I run the two spiders from a python script and save their output to a defined location? I found this link but it doesn't mention how to handle the returned items.
The way I eventually solved this was by capturing the log output in a variable and grepping it for ERROR: Spider error processing. Scrapy has the very nice behavior of not failing unnecessarily early; if I had exited the Python script myself I would have lost that. This way I could run one scraper after the other and handle the errors at the end, collecting as much as possible while still being notified when something doesn't run 100% smoothly.
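For reference, a rough sketch of that approach in the original bash wrapper (2>&1 is needed because Scrapy writes its log to stderr; $url is the healthchecks.io ping URL from the question):
log_a=$(pipenv run scrapy runspider -o output-a.json a.py 2>&1)
log_b=$(pipenv run scrapy runspider -o output-b.json b.py 2>&1)
# Ping only if neither run logged a spider error.
if ! printf '%s\n%s\n' "$log_a" "$log_b" | grep -q "ERROR: Spider error processing"; then
    curl "$url"
fi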
(I'm fairly new to Docker here.) I am trying to Dockerize a Scrapy application. As a first step, I'm trying to start a project in the container - which creates and populates a directory structure - and attach a volume to the project directory for editing purposes.
First I need to call scrapy startproject myScraper; then I'd like to call commands like scrapy shell or scrapy crawl myCrawler in the container to run web crawls.
Since all Scrapy commands begin by calling scrapy, I wrote this Dockerfile:
FROM python:3
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
ENTRYPOINT scrapy #or so I thought was right ...
where requirements.txt just contains Scrapy. Now I have a couple of problems. The first is that the ENTRYPOINT does not seem to work - specifically, when I run
docker build -t scraper .
docker run -it -v $PWD:/scraper --name Scraper scraper [SOME-COMMAND]
I just get back the scrapy usage help menu. (For example, if SOME-COMMAND is shell or startproject scraper.) I've tried a few variations with no success. Second, if the container stops, I'm not sure how to start it again (e.g., I can't pass a command to docker start -ai Scraper).
Part of the reason I'm trying to do these commands here, rather than as RUN and VOLUME in the Dockerfile, is that if the volume is created during the build process, it obscures the project directory rather than copying its contents from container to host volume. (That is, I get a copy of my empty host directory in the container instead of the populated directory set up by scrapy startproject volumeDirectory.)
I've looked my issue up and know I may be off-track with proper Docker, but it really feels like what I'm asking should be possible here.
My recommendation would be to delete the ENTRYPOINT line; you can make it the default CMD if you'd like. Then you can run
docker run -it -v $PWD:/scraper --name Scraper scraper scrapy ...
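As a sketch of that (the WORKDIR is my assumption, matching the /scraper mount point you're already using):
FROM python:3
WORKDIR /scraper
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# No ENTRYPOINT; CMD is just an optional default, overridden by whatever you pass to docker run
CMD ["scrapy"]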
Your actual problem here is that if you use ENTRYPOINT (or CMD or RUN) with a bare string like you show, it gets automatically wrapped in sh -c. Then the command you pass on the command line is combined with the entrypoint and what you ultimately get as the main container command is
/bin/sh -c 'scrapy' '[some-command]'
and so the shell runs scrapy with no arguments, but if that string happened to contain $1 or similar positional parameters, they could get filled in from command parameters.
If you use the explicit JSON-array (exec) syntax, Docker won't add the sh -c wrapper and your proposed syntax will work:
ENTRYPOINT ["scrapy"]
but a number of other common tasks won't work. For example, you can't easily get a debugging shell:
# Runs "scrapy /bin/bash"
docker run --rm -it scraper /bin/bash
and using --entrypoint to override it results in awkward command lines:
docker run --rm --entrypoint /bin/ls scraper -lrt /scraper
Hello, how can I run the following commands in sequence (when the first one finishes, the next one gets executed) on a virtual machine (Ubuntu terminal)?
nohup scrapy crawl test -o fed.csv &
nohup scrapy crawl test -o feder.csv &
nohup scrapy crawl fullrun -o dez.csv &
I wasn't sure whether to ask this question on askubuntu.com or here; I hope this is the correct place to ask.
If you want to run the commands one by one, each starting only when the previous one has completed successfully, you can use:
command1 && command2
If your commands are independent of each other (the next one runs regardless of whether the previous succeeded), you can use:
command1; command2
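Applied to your three commands, dropping nohup and the trailing & (backgrounding them with & would start all three at once instead of in sequence), that would look roughly like:
scrapy crawl test -o fed.csv && scrapy crawl test -o feder.csv && scrapy crawl fullrun -o dez.csv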
I am writing a Python script and I want to execute some code only if the script is being run directly from the terminal and not from another script.
How can I do this in Ubuntu without using any extra command-line arguments?
The answer here does NOT work:
Determine if the program is called from a script in Python
Here's my directory structure
home
|-testpython.py
|-script.sh
script.sh contains
./testpython.py
When I run ./script.sh, I want one thing to happen.
When I run ./testpython.py directly from the terminal, without going through script.sh, I want something else to happen.
How do I detect the difference between these two ways of calling the script? Getting the parent process name returns "bash" in both cases.
I recommend using command-line arguments.
script.sh
./testpython.py --from-script
testpython.py
import sys
if "--from-script" in sys.argv:
    pass  # From script
else:
    pass  # Not from script
You should probably be using command-line arguments instead, but this is doable. Simply check if the current process is the process group leader:
$ sh -c 'echo shell $$; python3 -c "import os; print(os.getpid.__name__, os.getpid()); print(os.getpgid.__name__, os.getpgid(0)); print(os.getsid.__name__, os.getsid(0))"'
shell 17873
getpid 17874
getpgid 17873
getsid 17122
Here, sh is the process group leader, and python3 is a process in that group because it is forked from sh.
Note that all processes in a pipeline are in the same process group and the leftmost is the leader.
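The check itself boils down to comparing os.getpid() with os.getpgid(0); for example (a sketch you could adapt inside testpython.py):
# Prints True when run directly from an interactive shell prompt,
# False when launched from another script such as script.sh.
python3 -c 'import os; print(os.getpid() == os.getpgid(0))'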
I have followed a few posts on here trying to run either a Python or shell script on my EC2 instance after every boot, not just the first boot.
I have tried:
adding [scripts-user, always] to the /etc/cloud/cloud.cfg file
adding the script to the ./scripts/per-boot folder
and
adding the script to /etc/rc.local
Yes, the permissions were changed to 755 for /etc/rc.local.
I am attempting to redirect the script's output into a file in the /home/ubuntu/ directory, but the file does not contain anything after boot.
If I run the scripts (.sh or .py) manually, they work.
Any suggestions, or requests for additional info that would help?
So the current solution turned out to be a method I wrote off in my initial question post, as I may not have performed the setup exactly as outlined in the link below...
This link -->
How do I make cloud-init startup scripts run every time my EC2 instance boots?
The link shows how to modify the /etc/cloud/cloud.cfg file to update scripts-user to [scripts-user, always].
It also says to add your *.sh file to the /var/lib/cloud/scripts/per-boot directory.
Once you reboot your system, your script should have executed; you can verify this in: sudo cat /var/log/cloud-init.log
If your script still fails to execute, try erasing the cached instance state of your server with the following command: sudo rm -rf /var/lib/cloud/instance/*
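For example (the filename and log path are just illustrative), a minimal per-boot script would be an executable file along these lines:
#!/bin/bash
# saved as /var/lib/cloud/scripts/per-boot/report-boot.sh and chmod 755
echo "booted at $(date)" >> /home/ubuntu/boot.log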
--NOTE:--
It appears that print output from a Python script does not get redirected (>>) as expected, but echo output redirects fine.
Fails to redirect:
sudo python test.py >> log.txt
Redirects successfully:
echo "HI" >> log.txt
Is this something along the lines of what you want?
It copies the script to the instance, connects to the instance, and runs the script right away.
ec2 scp ~/path_to_script.py : instance_name -y && ec2 ssh instance_name -yc "python script_name.py" 1>/dev/null
I read that the use of rc.local is getting deprecated. One thing to try is a line in /etc/crontab like this:
@reboot full-path-of-script
If there's a specific user you want to run the script as, list it right after @reboot (in /etc/crontab this user field is required, unlike in a per-user crontab).
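For instance (the script path, user, and log file are placeholders), the /etc/crontab line could look like:
@reboot ubuntu /home/ubuntu/myscript.sh >> /home/ubuntu/reboot.log 2>&1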
I have a start.sh bash script that is run through a cron job on an Ubuntu server.
start.sh contains the lines of code below.
The path of start.sh is /home/ubuntu/folder1/folder2/start.sh
#!/bin/bash
crawlers(){
    nohup scrapy crawl first &
    nohup scrapy crawl 2nd &
    wait $!
    nohup scrapy crawl 3rd &
    nohup scrapy crawl 4th &
    wait
}
cd /home/ubuntu/folder1/folder2/
PATH=$PATH:/usr/local/bin
export PATH
python init.py &
wait $!
crawlers
python final.py
My issue is that if I run start.sh myself on the command line, it writes its output to a nohup.out file,
but when the bash file is executed through the cron job (although the scripts run fine), it does not produce nohup.out.
How can I get the output of this cron job into nohup.out?
Why are you using nohup? nohup is a command that makes the program it runs ignore the hangup signal (HUP). cron, however, never sends a hangup signal, because it is not linked to a terminal session; and nohup only creates nohup.out when the command's output would otherwise go to a terminal, which is never the case under cron.
In this case, instead of:
nohup scrapy crawl first &
You probably want:
scrapy crawl first > first.txt &
The last example also works in a terminal, but when you close the terminal, the hangup signal (hup) is sent, which ends the program.
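Applied to the crawlers function in start.sh, that could look roughly like this (the filenames are placeholders; 2>&1 also captures Scrapy's log, which goes to stderr):
crawlers(){
    scrapy crawl first > first.txt 2>&1 &
    scrapy crawl 2nd > 2nd.txt 2>&1 &
    wait $!
    scrapy crawl 3rd > 3rd.txt 2>&1 &
    scrapy crawl 4th > 4th.txt 2>&1 &
    wait
}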