Exchanging data between Python and React in real time with Electron JS - python

I have to implement a web scraping tool, and I chose to build it with Electron JS, React, and Python.
I was able to integrate Python and React using python-shell in Electron as follows.
React Code
import React from 'react';
var path = require("path");
const {PythonShell} = require("python-shell");

const city = 'XYZ';
var options = {
    scriptPath: path.join(__dirname, '../../python/'),
    args: [city]
};

class App extends React.Component {
    constructor(props) {
        super(props);
        this.state = { test: '' }; // initialize state; render() reads this.state.test
    }
    componentDidMount() {
        var shell = new PythonShell('main.py', options); // executes the Python script with python3
        shell.on('message', function(message) {
            console.log('message', message);
        });
    }
    render() {
        return (
            <div className="header">
                <h1>Hello, World, {this.state.test}</h1>
            </div>
        );
    }
}

export default App;
Python Code
import sys
import requests
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

city = sys.argv[1]

class MultiThreadScraper:

    def __init__(self, base_url):
        self.base_url = base_url
        self.root_url = '{}://{}'.format(urlparse(self.base_url).scheme, urlparse(self.base_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=5)
        self.scraped_pages = set([])
        self.to_crawl = Queue()
        self.to_crawl.put(self.base_url)

    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all('a', href=True)
        for link in links:
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.to_crawl.put(url)

    def scrape_info(self, html):
        return

    def post_scrape_callback(self, res):
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)

    def scrape_page(self, url):
        try:
            res = requests.get(url, timeout=(3, 30))
            return res
        except requests.RequestException:
            return

    def run_scraper(self):
        while True:
            try:
                target_url = self.to_crawl.get(timeout=60)
                if target_url not in self.scraped_pages:
                    print("Scraping URL: {}".format(target_url))
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue

if __name__ == '__main__':
    s = MultiThreadScraper("http://websosite.com")
    s.run_scraper()
I can get all the scraped URLs after the Python shell finishes executing, but I want to receive the URLs in real time in the React front end.
The following React code executes the Python script and gives the final result:
var shell = new PythonShell('main.py', options); // executes the Python script with python3
This React code receives messages sent from the Python script by a simple 'print' statement:
shell.on('message', function (message) {
    console.log(message);
});
Is there any way to get the results in real time while the Python code is executing?

Use sys.stdout.flush() after each print statement in Python.
Ref: usage of sys.stdout.flush()
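A minimal sketch of the fix, assuming Python 3 on the script side: stdout is block-buffered when attached to a pipe, so without a flush the output only arrives when the process exits; flushing after each print makes every line reach python-shell's 'message' handler immediately. The URL pattern below is illustrative only.

```python
import sys
import time

def emit(message):
    """Print one line and flush stdout so python-shell receives it right away."""
    print(message)
    sys.stdout.flush()  # without this, output can sit in the pipe buffer until exit

for i in range(3):
    emit("Scraping URL: http://example.com/page{}".format(i))  # illustrative URL
    time.sleep(0.1)
```

On Python 3, `print(message, flush=True)` is an equivalent one-liner.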

Related

Can I execute python code on a list simultaneous instead of sequential?

First of all, thank you for taking the time to read through this post. I'd like to begin by saying that I'm very new to programming in general and that I'm seeking advice to solve a problem.
I'm trying to create a script that checks whether the content of an HTML page has changed, in order to monitor certain website pages for changes. I found a script and made some alterations so that it goes through a list of URLs, checking whether each page has changed. The problem is that it checks the pages sequentially: it goes through the list one URL at a time, while I want the script to check the URLs in parallel. I'm also using a while loop to keep checking the pages, because even after a change takes place the page still has to be monitored. I could write a thousand more words explaining what I'm trying to do, so instead have a look at the code:
import requests
import time
import smtplib
from email.message import EmailMessage
import hashlib
from urllib.request import urlopen

url = ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]
i = 0
response = urlopen(url[i]).read()
currentHash = hashlib.sha224(response).hexdigest()

while True:
    try:
        response = urlopen(url[i]).read()
        currentHash = hashlib.sha224(response).hexdigest()
        print('checking')
        time.sleep(10)
        response = urlopen(url[i]).read()
        newHash = hashlib.sha224(response).hexdigest()
        i += 1
        if newHash == currentHash:
            continue
        else:
            print('Change detected')
            print(url[i])
            time.sleep(10)
            continue
    except Exception as e:
        i = 0
        print('resetting increment')
        continue
What you want to do is called multi-threading.
Conceptually this is how it works:
import hashlib
import time
from urllib.request import urlopen
import threading

# Define a function for the thread
def f(url):
    initialHash = None
    while True:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        if not initialHash:
            initialHash = currentHash
        if currentHash != initialHash:
            print('Change detected')
            print(url)
        time.sleep(10)

# Create one thread per URL
for url in ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]:
    t = threading.Thread(target=f, args=(url,))
    t.start()
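If the main program should also be able to wait for, or cleanly stop, these monitor threads, one common pattern is to keep the Thread handles and signal them with an Event. A minimal, network-free sketch; the URLs and the `results` list here are stand-ins, not part of the original answer:

```python
import threading

def monitor(url, stop, results):
    """Loop until the stop event is set, recording one entry per pass."""
    while not stop.is_set():
        results.append(url)   # stand-in for fetching and hashing the page
        stop.wait(0.01)       # like time.sleep(), but wakes early when stop is set

stop = threading.Event()
results = []
threads = [threading.Thread(target=monitor, args=(u, stop, results))
           for u in ["https://a.example", "https://b.example"]]
for t in threads:
    t.start()

stop.wait(0.05)   # let the monitors run briefly
stop.set()        # ask all threads to finish
for t in threads:
    t.join()      # wait for them to exit
```

Using `stop.wait()` instead of `time.sleep()` inside the loop means a shutdown request is honoured immediately rather than after the full sleep interval.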
Running example of the OP's code using a thread pool executor
Code
import concurrent.futures
import time
import requests
import hashlib
from urllib.request import urlopen

def check_change(url):
    '''
    Checks for a change in web page contents by comparing current to previous hash
    '''
    try:
        response = urlopen(url).read()
        currentHash = hashlib.sha224(response).hexdigest()
        time.sleep(10)
        response = urlopen(url).read()
        newHash = hashlib.sha224(response).hexdigest()
        if newHash != currentHash:
            return "Change to:", url
        else:
            return None
    except Exception as e:
        return "Error", e, url

page_urls = ["https://www.youtube.be", "https://www.google.com", "https://www.google.be"]

while True:
    # We can use a thread pool executor to ensure threads are cleaned up properly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(check_change, url): url for url in page_urls}
        for future in concurrent.futures.as_completed(future_to_url):
            # Output the result of each thread upon its completion
            url = future_to_url[future]
            try:
                status = future.result()
                if status:
                    print(*status)
                else:
                    print(f'No change to: {url}')
            except Exception as exc:
                print('Site %r generated an exception: %s' % (url, exc))
    time.sleep(10)  # Wait 10 seconds before rechecking sites
Output
Change to: https://www.google.com
Change to: https://www.google.be
Change to: https://www.youtube.be
Change to: https://www.google.be
Change to: https://www.google.com
Change to: https://www.youtube.be
Change to: https://www.google.be
Change to: https://www.google.com
...
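The detection logic in both answers boils down to comparing SHA-224 digests of two page snapshots. A minimal, network-free sketch of just that step, with made-up page bodies:

```python
import hashlib

def page_changed(old_body: bytes, new_body: bytes) -> bool:
    """Return True when the two page snapshots hash differently."""
    old_hash = hashlib.sha224(old_body).hexdigest()
    new_hash = hashlib.sha224(new_body).hexdigest()
    return new_hash != old_hash

print(page_changed(b"<html>v1</html>", b"<html>v1</html>"))  # False: identical content
print(page_changed(b"<html>v1</html>", b"<html>v2</html>"))  # True: content changed
```

Note that any byte-level difference (ads, timestamps, CSRF tokens) flips the hash, so on real pages this approach can report spurious changes.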

Scrapy Splash cannot get the data of a React site

I need to scrape this site.
It looks like it is made in React. I tried to extract the data with scrapy-splash; I need, for example, the "a" element with the class shelf-product-name, and I used the wait argument with about 5 seconds.
But I only get an empty array.
def start_requests(self):
    yield SplashRequest(
        url='https://www.jumbo.cl/lacteos-y-bebidas-vegetales/leches-blancas?page=6',
        callback=self.parse,
        args={'wait': 5}
    )

def parse(self, response):
    print(response.css("a.shelf-product-name"))
Actually there is no need to use Scrapy Splash, because all the required data is stored inside a <script> tag of the raw HTML response as JSON-formatted data:
import scrapy
from scrapy.crawler import CrawlerProcess
import json

class JumboCLSpider(scrapy.Spider):
    name = "JumboCl"
    start_urls = ["https://www.jumbo.cl/lacteos-y-bebidas-vegetales/leches-blancas?page=6"]

    def parse(self, response):
        script = [script for script in response.css("script::text") if "window.__renderData" in script.extract()]
        if script:
            script = script[0]
            data = script.extract().split("window.__renderData = ")[-1]
            json_data = json.loads(data[:-1])
            for plp in json_data["plp"]["plp_products"]:
                for product in plp["data"]:
                    # yield {"productName": product["productName"]}  # data from css: a.shelf-product-name
                    yield product

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(JumboCLSpider)
    c.start()
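The core trick above, slicing the JSON literal out of an inline script, can be exercised without the site. A minimal sketch on a made-up page snippet: the `window.__renderData` name comes from the answer, but the HTML, the product name, and the trailing-brace trimming are invented for illustration (the answer's own code trims with `data[:-1]` instead, assuming the script text ends in a single `;`):

```python
import json

# A made-up HTML snippet standing in for the real page's inline script tag.
html = '<script>window.__renderData = {"plp": {"plp_products": [{"data": [{"productName": "Leche entera"}]}]}};</script>'

# Take everything after the assignment, then trim to the last closing brace
# to drop the trailing ";</script>".
payload = html.split("window.__renderData = ")[-1]
payload = payload[:payload.rindex("}") + 1]

data = json.loads(payload)
names = [p["productName"]
         for plp in data["plp"]["plp_products"]
         for p in plp["data"]]
print(names)  # ['Leche entera']
```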

Using scrapy-splash clicking a button

I am trying to use Scrapy-splash to click a button on a page that I'm being redirected to.
I have tested by manually clicking on the page, and I am redirected to the correct page after clicking the button that gives my consent. I have written a small script to click the button when I am redirected to that page, but it is not working.
I have included a snippet of my spider below; am I missing something in my code?
from sys import path
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
path.append(dir_path)
import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
    splash:wait(1)
    splash:runjs('document.querySelector("form.consent-form").submit()')
    splash:wait(1)
    return {
        html = splash:html(),
    }
end
"""

class FoobarSpider(scrapy.Spider):
    name = "foobar"

    def start_requests(self):
        urls = ['https://uk.finance.yahoo.com/quote/ANTO.L?p=ANTO.L']
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 3},
                                meta={'yahoo_url': url})

    def parse(self, response):
        url = response.url
        with open('temp.html', 'wb') as f:
            f.write(response.body)
        if 'https://guce.' in url:
            print('About to attempt to authenticate ...')
            yield SplashRequest(
                url,
                callback=self.get_price,
                endpoint='execute',
                args={'lua_source': script, 'timeout': 5},
                meta=response.meta
            )
        else:
            self.get_price(response)

    def get_price(self, response):
        print("Get price called!")
        yahoo_price = None
        try:
            # Get price ...
            temp1 = response.css('div.D\(ib\).Mend\(20px\)')
            if temp1 and len(temp1) > 1:
                temp2 = temp1[1].css('span')
                if len(temp2) > 0:
                    yahoo_price = temp2[0].xpath('.//text()').extract_first().replace(',', '')
            if not yahoo_price:
                val = response.css('span.Trsdu\(0\.3s\).Trsdu\(0\.3s\).Fw\(b\).Fz\(36px\).Mb\(-4px\).D\(b\)').xpath('.//text()').extract_first().replace(',', '')
                yahoo_price = val
        except Exception as err:
            pass
        print("Price is: {0}".format(yahoo_price))

    def handle_error(self, failure):
        pass
How do I fix this so that I can correctly give consent, so I'm directed to the page I want?
Rather than clicking the button, try submitting the form:
document.querySelector("form.consent-form").submit()
I tried running the JavaScript command document.querySelector("input.btn.btn-primary.agree").click() in my console and got an error message "Oops, Something went Wrong", but the page loads when using the above code to submit the form.
Because I'm not in Europe I can't fully recreate your setup, but I believe that should get you past the issue. My guess is that a script on the page is interfering with the .click() method.

python program call with tornado post request block session till end of the python program

This program takes multiple URLs as input and is called via the URL localhost:8888/api/v1/crawler.
The program takes over an hour to run, which is fine, but it blocks the other APIs: while it is running, no other API works until the existing call ends. I want to run this program asynchronously; how can I achieve that with the same program?
@tornado.web.asynchronous
@gen.coroutine
@use_args(OrgTypeSchema)
def post(self, args):
    print "Enter In Crawler Match Script POST"
    print "Argsssss........"
    print args
    data = tornado.escape.json_decode(self.request.body)
    print "Data................"
    import json
    print json.dumps(data.get('urls'))
    from urllib import urlopen
    from bs4 import BeautifulSoup
    try:
        urls = json.dumps(data.get('urls'))
        urls = urls.split()
        import sys
        list = []
        # orig_stdout = sys.stdout
        # f = open('out.txt', 'w')
        # sys.stdout = f
        for url in urls:
            # print "FOFOFOFOFFOFO"
            # print url
            url = url.replace('"', " ")
            url = url.replace('[', " ")
            url = url.replace(']', " ")
            url = url.replace(',', " ")
            print "Final Url "
            print url
            try:
                site = urlopen(url) ..............
Your post method is 100% synchronous. You should make the site = urlopen(url) call asynchronous; there is an async HTTP client in Tornado for that. There is also a good example here.
You are using urllib, which is the reason for the blocking.
Tornado provides a non-blocking client called AsyncHTTPClient, which is what you should be using.
Use it like this:
from tornado.httpclient import AsyncHTTPClient

@gen.coroutine
@use_args(OrgTypeSchema)
def post(self, args):
    ...
    http_client = AsyncHTTPClient()
    site = yield http_client.fetch(url)
    ...
Another thing I'd like to point out: don't import modules from inside a function. Although it's not the reason for the blocking, it is still slower than putting all your imports at the top of the file. Read this question.

Is there way to ignore 302 Moved Temporarily redirection or find what it is caused by?

I am writing a parsing script and need access to many web pages like this one.
Whenever I try to get this page with urlopen and then read(), I get redirected to this page.
When I launch the same links from Google Chrome, the redirect happens only rarely, mostly when I open a URL some other way than by clicking it in the site's menus.
Is there a way to dodge that redirect, or to simulate the jump to the URL from the website's menus, with Python 3?
Example code:
def getItemsFromPage(url):
    with urlopen(url) as page:
        html_doc = str(page.read())
    return re.findall('(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', html_doc)

url = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
items_urls = getItemsFromPage(url)
with urlopen(items_urls[0]) as item_page:
    print(item_page.read().decode('utf-8'))  # Here I get search.advanced instead of the item page
In fact, it's a weird problem with ampersands in the raw HTML data. When you visit the web page and click a link, the ampersand entities (&amp;) are interpreted by the browser as "&" and it works. However, Python reads the data as it is, that is, as raw data. So:
import urllib.request as net
from html.parser import HTMLParser
import re

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0",
}

def unescape(items):
    html = HTMLParser()
    unescaped = []
    for i in items:
        unescaped.append(html.unescape(i))
    return unescaped

def getItemsFromPage(url):
    request = net.Request(url, headers=headers)
    response = str(net.urlopen(request).read())
    # --------------------------
    # FIX AMPERSANDS - unescape
    # --------------------------
    links = re.findall('(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', response)
    unescaped_links = unescape(links)
    return unescaped_links

url = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
item_urls = getItemsFromPage(url)
request = net.Request(item_urls[0], headers=headers)
print(item_urls)
response = net.urlopen(request)

# DEBUG RESPONSE
print(response.url)
print(80 * '-')
print("<title>Charity Navigator Rating - 10,000 Degrees</title>" in (response.read().decode('utf-8')))
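The fix above hinges on unescaping turning HTML entities back into literal characters. A small self-contained check of just that step, using the stdlib `html.unescape` (available since Python 3.4, equivalent to the deprecated `HTMLParser().unescape` used above); the orgid value is made up:

```python
from html import unescape

raw_link = "http://www.charitynavigator.org/index.cfm?bay=search.summary&amp;orgid=12345"
fixed = unescape(raw_link)
print(fixed)
# The &amp; entity becomes a literal &, which the server actually expects.
```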
Your problem is in not replacing &amp; with & in the URL string. I rewrote your code using urllib3 as below and got the expected web pages.
import re
import urllib3

def getItemsFromPage(url):
    # create connection pool object (urllib3-specific)
    localpool = urllib3.PoolManager()
    with localpool.request('get', url) as page:
        html_doc = page.data.decode('utf-8')
    return re.findall('(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', html_doc)

# the master webpage
url_master = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'

# name and store the downloaded contents for testing purposes
file_folder = "R:"
file_mainname = "test"

# parse the master webpage
items_urls = getItemsFromPage(url_master)

# create pool
mypool = urllib3.PoolManager()

i = 0
for url in items_urls:
    # file name to be saved
    file_name = file_folder + "\\" + file_mainname + str(i) + ".htm"
    # replace '&amp;' with '&'
    url_OK = re.sub(r'&amp;', r'&', url)
    # print revised url
    print(url_OK)
    ### the urllib3-pythonic way of web page retrieval ###
    with mypool.request('get', url_OK) as page, open(file_name, 'w') as f:
        f.write(page.data.decode('utf-8'))
    i += 1
(verified on python 3.4 eclipse PyDev win7 x64)
