Running Python script with scrapy import from Node child process - python

I'm attempting to get a simple scraper up and running to gather data and would like to use Python Scrapy. The rest of the app will be through Nodejs/Express, so I would like to call this script on demand when I need fresh/new data.
The Python code runs fine locally through PyCharm, but I am seeing issues when it is run as a script.
1. Through Node, when I run the server locally and hit /name, it fails with "No module named 'scrapy'".
2. When I run the server through the Anaconda prompt, this works fine and scrapy is imported with no error.
I have installed scrapy via conda at the location the Express server is run from, for both 1 and 2.
From what I've read this may have to do with Scrapy's need for the Twisted reactor, but as I'm new to Python it's not clear to me what the Anaconda terminal is doing differently, and what I would need to do on the Node side in order to use Scrapy properly.
Nodejs:
// Express setup (implied by the app.get / app.listen calls below)
const express = require('express');
const app = express();

app.get('/name', callName);

function callName(req, res) {
    console.log("test");
    var spawn = require('child_process').spawn;
    const pyProg = spawn('python', ['pythonscript.py']);

    pyProg.stdout.on('data', function(data) {
        console.log(data.toString());
        res.write(data);
        res.end('end');
    });
}

// Print URL for accessing server
console.log('Server running at http://127.0.0.1:8000/');
app.listen(process.env.PORT || 8000, () => console.log("Listening on " + (process.env.PORT || 8000)));
Python script:
try:
    import sys
    import scrapy

    data = "python starting"
    print(data)
    sys.stdout.flush()
except Exception as exception:
    print(exception, False)
    # Python 3 exceptions have no .message attribute; use str() instead
    print(exception.__class__.__name__ + ": " + str(exception))
Update:
When running import scrapy from the Anaconda interpreter I get the traceback below (the other interpreter from the comments resulted in "no module found"):
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "\Anaconda3\lib\site-packages\scrapy\__init__.py", line 34, in <module>
    from scrapy.spiders import Spider
  File "\Anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 10, in <module>
    from scrapy.http import Request
  File "\Anaconda3\lib\site-packages\scrapy\http\__init__.py", line 11, in <module>
    from scrapy.http.request.form import FormRequest
  File "\Anaconda3\lib\site-packages\scrapy\http\request\form.py", line 11, in <module>
    import lxml.html
  File "\Anaconda3\lib\site-packages\lxml\html\__init__.py", line 54, in <module>
    from .. import etree
ImportError: DLL load failed: The specified module could not be found.
So this looks to be not just interpreter related, but perhaps also something to do with the environment variables Anaconda sets up for its terminal?
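One way to narrow down which interpreter and environment the Node child process actually gets (a diagnostic sketch of my own, not part of the original post) is to have the spawned script print its own executable and module search path before attempting the import:

import sys

# Which python is actually running? When spawned from Node this is often
# the system interpreter rather than the Anaconda one (assumption to verify).
print("executable:", sys.executable)
for entry in sys.path:
    print("path entry:", entry)

try:
    import scrapy
    print("scrapy imported from:", scrapy.__file__)
except ImportError as exc:
    print("import failed:", exc)

If the executable shown is not the Anaconda one, the usual fix is to point spawn at the full path of the Anaconda environment's python.exe, or to start Node from an activated Anaconda prompt, so that scrapy and its DLL dependencies are on the path.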

Related

Flink Python Datastream API Kafka Consumer

I'm new to PyFlink. I'm trying to write a Python program to read data from a Kafka topic and print it to stdout. I followed the link Flink Python Datastream API Kafka Producer Sink Serializaion. But I keep seeing NoSuchMethodError due to a version mismatch. I have added the flink-sql-kafka-connector available at https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.11/1.13.0/flink-sql-connector-kafka_2.11-1.13.0.jar. Can someone help me with a proper example to do this? Following is my code:
import json
import os

from pyflink.common import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer
from pyflink.common.typeinfo import Types


def my_map(obj):
    json_obj = json.loads(json.loads(obj))
    return json.dumps(json_obj["name"])


def kafkaread():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.add_jars("file:///automation/flink/flink-sql-connector-kafka_2.11-1.10.1.jar")

    deserialization_schema = SimpleStringSchema()
    kafkaSource = FlinkKafkaConsumer(
        topics='test',
        deserialization_schema=deserialization_schema,
        properties={'bootstrap.servers': '10.234.175.22:9092', 'group.id': 'test'}
    )

    ds = env.add_source(kafkaSource).print()
    env.execute('kafkaread')


if __name__ == '__main__':
    kafkaread()
But Python doesn't recognise the jar file and throws the following error:
Traceback (most recent call last):
File "flinkKafka.py", line 31, in <module>
kafkaread()
File "flinkKafka.py", line 20, in kafkaread
kafkaSource = FlinkKafkaConsumer(
File "/automation/flink/venv/lib/python3.8/site-packages/pyflink/datastream/connectors.py", line 186, in __init__
j_flink_kafka_consumer = _get_kafka_consumer(topics, properties, deserialization_schema,
File "/automation/flink/venv/lib/python3.8/site-packages/pyflink/datastream/connectors.py", line 336, in _get_kafka_consumer
j_flink_kafka_consumer = j_consumer_clz(topics,
File "/automation/flink/venv/lib/python3.8/site-packages/pyflink/util/exceptions.py", line 185, in wrapped_call
raise TypeError(
TypeError: Could not found the Java class 'org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer'. The Java dependencies could be specified via command line argument '--jarfile' or the config option 'pipeline.jars'
What is the correct location to add the jar file?
I see that you downloaded flink-sql-connector-kafka_2.11-1.13.0.jar, but the code loads flink-sql-connector-kafka_2.11-1.10.1.jar.
Maybe you can have a check.
You just need to check the path to the flink-sql-connector jar.
You should add the jar file of flink-sql-connector-kafka; it depends on your PyFlink and Scala version. If the versions are correct, check the path in your add_jars call to make sure the jar package is actually there.
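In other words, the filename passed to add_jars has to match the jar that is actually on disk. A minimal sketch of the corrected call, assuming the downloaded 1.13.0 jar was placed in the same /automation/flink/ directory (adjust the path to wherever the jar really lives):

# load the jar that was actually downloaded (1.13.0), not the 1.10.1 name
env.add_jars("file:///automation/flink/flink-sql-connector-kafka_2.11-1.13.0.jar")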

Python 3.7: Elasticsearch module not found

I'm new to Python and trying to manipulate data using 'elasticsearch'. Initially I am just trying to connect using their standard example.
I successfully installed the package using
pip install elasticsearch
When running the code:
from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()

doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime.now(),
}
res = es.index(index="test-index", doc_type='tweet', id=1, body=doc)
print(res['result'])

res = es.get(index="test-index", doc_type='tweet', id=1)
print(res['_source'])

es.indices.refresh(index="test-index")

res = es.search(index="test-index", body={"query": {"match_all": {}}})
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])
I get the following error:
[Running] python -u "c:\Users\USERNAME\Dropbox\Stuff\Python\Elasticsearch.py"
Traceback (most recent call last):
File "c:\Users\USERNAME\Dropbox\Stuff\Python\Elasticsearch.py", line 10, in <module>
from elasticsearch import Elasticsearch
ModuleNotFoundError: No module named 'elasticsearch'
I looked around and heard something about a .bash_profile, but I do not understand what that means.
My environment variable for PYTHONPATH contains
C:\Users\USERNAME\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Python 3.7
Your PYTHONPATH probably only contains a Start Menu shortcut (alias) to the Python executable, but Python is trying to import packages from there. You should open C:\Users\USERNAME\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Python 3.7 and get the Target from right-click > Properties on the Python shortcut.
Set PYTHONPATH to whatever that path is. It will probably look like C:\Users\USERNAME\AppData\Local\Programs\Python\Python37.
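A quick way to confirm which interpreter the script actually runs under, and whether that interpreter can see the package (my own check, not part of the answer above):

import sys

# the interpreter actually executing the script
print("executable:", sys.executable)

try:
    import elasticsearch
    print("elasticsearch loaded from:", elasticsearch.__file__)
except ImportError:
    # if this fails, install into this exact interpreter, e.g.:
    #   <path shown above> -m pip install elasticsearch
    print("elasticsearch is not visible to this interpreter")

If pip installed the package into a different Python than the one running the script, the import will fail even though pip install elasticsearch reported success.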

python scrapy conversion to exe file using pyinstaller

I am trying to convert a scrapy script to an exe file.
The main.py file looks like this:
from scrapy.crawler import CrawlerProcess
from amazon.spiders.amazon_scraper import Spider

spider = Spider()
process = CrawlerProcess({
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'data.csv',
    'DOWNLOAD_DELAY': 3,
    'RANDOMIZE_DOWNLOAD_DELAY': True,
    'ROTATING_PROXY_LIST_PATH': 'proxies.txt',
    'USER_AGENT_LIST': 'useragents.txt',
    'DOWNLOADER_MIDDLEWARES': {
        'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
        'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'random_useragent.RandomUserAgentMiddleware': 400
    }
})
process.crawl(spider)
process.start()  # the script will block here until the crawling is finished
The scrapy script looks like any other. I am using pyinstaller.exe --onefile main.py to convert it to an exe file. When I try to open the main.exe file inside the dist folder it starts outputting errors:
FileNotFoundError: [Errno 2] No such file or directory: '...\\scrapy\\VERSION'
I can fix it by creating a scrapy folder inside the dist folder and copying the VERSION file from lib/site-packages/scrapy into it.
After that, many other errors occur, but I can fix them by copying in some more scrapy files.
In the end it starts outputting the error:
ModuleNotFoundError: No module named 'email.mime'
I don't even know what it means. I have never seen it before.
I am using:
Python 3.6.5
Scrapy 1.5.0
pyinstaller 3.3.1
I had the same situation.
Instead of trying to make pyinstaller include this file (all my attempts to do so failed), I decided to check and change some parts of the scrapy code in order to avoid this error.
I noticed that there is only one place where the \scrapy\VERSION file is used: \scrapy\__init__.py.
So I decided to hardcode that value from \scrapy\VERSION by changing \scrapy\__init__.py:
#import pkgutil
__version__ = "1.5.0"  # pkgutil.get_data(__package__, 'VERSION').decode('ascii').strip()
version_info = tuple(int(v) if v.isdigit() else v
                     for v in __version__.split('.'))
#del pkgutil
After this change there is no need to store the version in an external file.
As there is no longer a reference to the \scrapy\VERSION file, that error will not occur.
After that I had the same FileNotFoundError: [Errno 2] with the \scrapy\mime.types file.
It is the same situation: \scrapy\mime.types is used only in \scrapy\responsetypes.py:
...
#from pkgutil import get_data
...
    def __init__(self):
        self.classes = {}
        self.mimetypes = MimeTypes()
        #mimedata = get_data('scrapy', 'mime.types').decode('utf8')
        mimedata = """
Copy-paste all 750 lines of \scrapy\mime.types here
        """
        self.mimetypes.readfp(StringIO(mimedata))
        for mimetype, cls in six.iteritems(self.CLASSES):
            self.classes[mimetype] = load_object(cls)
This change resolved FileNotFoundError: [Errno 2] with \scrapy\mime.types file.
I agree that hardcoding 750 lines of text into Python code is not the best decision.
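(As an aside: the data files being hardcoded here can in principle be bundled instead with PyInstaller's --add-data option; the attempts mentioned above failed, so treat this as an untested alternative, with <site-packages> standing in for the real site-packages path.)

pyinstaller --onefile ^
    --add-data "<site-packages>\scrapy\VERSION;scrapy" ^
    --add-data "<site-packages>\scrapy\mime.types;scrapy" ^
    main.py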
After that I started to receive ModuleNotFoundError: No module named scrapy.spiderloader. I added "scrapy.spiderloader" to the hidden imports parameter of pyinstaller.
The next issue was ModuleNotFoundError: No module named scrapy.statscollectors.
The final version of the pyinstaller command for my scrapy script consists of 46 hidden imports - after that I received a working .exe file.
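For reference, hidden imports are passed on the pyinstaller command line; a trimmed-down sketch with just the two modules named above (the full command reportedly had 46 such flags) could look like:

pyinstaller --onefile ^
    --hidden-import scrapy.spiderloader ^
    --hidden-import scrapy.statscollectors ^
    main.py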

Python: Access data from Solr using Pysolr

I am using a simple Python script to fetch example data from Solr using Pysolr. First I created my core using the following:
[user@user solr-7.1.0]$ ./bin/solr create -c json_db
WARNING: Using _default configset. Data driven schema functionality is enabled by default, which is
NOT RECOMMENDED for production use.
To turn it off:
curl http://localhost:8983/solr/json_db/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'
Created new core 'json_db'
[user@user solr-7.1.0]$ ./bin/post -c json_db example/exampledocs/*.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/json_db/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/json_db/update...
Time spent: 0:00:00.398
After creating the core, I ran a simple Python script to fetch the data:
from pysolr import Solr
conn = Solr('http://localhost:8983/solr/json_db/')
results = conn.search('*:*')
I am getting this error
Traceback (most recent call last):
File "/home/user/PycharmProjects/APP/application/solr_test.py", line 4, in <module>
results = conn.search({'*:*'})
File "/home/user/PycharmProjects/APP/venv/lib/python3.5/site-packages/pysolr.py", line 723, in search
response = self._select(params, handler=search_handler)
File "/home/user/PycharmProjects/APP/venv/lib/python3.5/site-packages/pysolr.py", line 421, in _select
return self._send_request('get', path)
File "/home/user/PycharmProjects/APP/venv/lib/python3.5/site-packages/pysolr.py", line 396, in _send_request
raise SolrError(error_message % (resp.status_code, solr_message))
pysolr.SolrError: Solr responded with an error (HTTP 404): [Reason: Error 404 Not Found]
But when I run the query directly against Solr, I do get results.
Can somebody tell me what I am doing wrong here? Thanks.
You can just run the script below to fetch the results without using the pysolr library:
#!/usr/bin/python
import urllib.request
import json as simplejson
import pprint

# full select URL, e.g. http://localhost:8983/solr/json_db/select?q=*:*&wt=json
url = 'give the url here'
wt = "wt=json"

connection = urllib.request.urlopen(url)
if wt == "wt=json":
    response = simplejson.loads(connection.read().decode('utf-8'))
else:
    response = eval(connection.read())

print("Number of hits: " + str(response['response']['numFound']))
pprint.pprint(response['response']['docs'])
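For completeness, a minimal pysolr sketch of the same query (my own sketch, assuming Solr is reachable and the core really is named json_db). The HTTP 404 in the traceback means the URL pysolr built did not resolve to a valid core or handler, so the core name and base URL are the first things to check:

from pysolr import Solr

# assumption: Solr running locally with a core named json_db
solr = Solr('http://localhost:8983/solr/json_db', timeout=10)

results = solr.search('*:*', rows=10)   # pass a plain query string, not a set
print("Documents found:", results.hits)
for doc in results:
    print(doc)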

How to interact with pynessus

I am using http://code.google.com/p/pynessus/ so that I can interact with Nessus using Python, but I run into problems trying to connect to the server. I am not sure what I need to set pynessus to.
I try connecting to the server using the following syntax, as directed by the documentation on the site, but I receive the following error:
n = pynessus.NessusServer(localhost, 8834, root, password123)
Error:
root@bt:~/Desktop# ./nessus.py
Traceback (most recent call last):
File "./nessus.py", line 634, in
n = pynessus.NessusServer(localhost, 8834, root, password123)
NameError: name 'pynessus' is not defined
The problem is that you didn't import the pynessus module. To solve this problem, simply place the downloaded pynessus.py in the same folder as your Python script and add the line
import pynessus
at the top of that script. You can reference the pynessus library in your script only after that line.
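Beyond the missing import, note that localhost, root and password123 in the question are bare names; assuming they were meant to be literals, the corrected call would look roughly like this (a sketch based only on the snippet in the question):

import pynessus

# host, login and password passed as strings, port as an int
n = pynessus.NessusServer('localhost', 8834, 'root', 'password123')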
