I'm invoking a Python script from Node.js. The Python script retrieves data from a REST API, stores it in a DataFrame, and then there's a search function based on user input. I'm confused as to what variable type Python sends the data to Node.js in. I've tried converting it to a string, but in Node.js it says it is an unresolved variable type. Here's the code:
import sys
import json
import requests
from pandas import json_normalize

# url for the REST API is assumed to be defined elsewhere
r = requests.get(url)
data = r.json()
nested = json.loads(r.text)
nested_full = json_normalize(nested)
req_data = json_normalize(nested, record_path='items')
# the column filtered on below must be selected here
search = req_data.get(["name", "id", "requestor_name"])
#search.head(10)
filter = sys.argv[1:]
print(filter)
input = filter[0]
print(input)
result = search[search["requestor_name"].str.contains(input)]
result = result.to_string(index=False)
response = '```' + str(result) + '```'
print(response)
sys.stdout.flush()
Here's the Node.js program that invokes the above Python script. How do I store the output in a format that I can pass to another function in Node?
var input = 'robert';
var childProcess = require("child_process").spawn('python', ['./search.py', input], {stdio: 'inherit'})
const stream = require('stream');
const format = require('string-format')

childProcess.on('data', function(data){
    process.stdout.write("python script output", data)
    result += String(data);
    console.log("Here it is", data);
});

childProcess.on('close', function(code) {
    if (code === 1){
        process.stderr.write("error occured", code);
        process.exit(1);
    }
    else {
        process.stdout.write('done');
    }
});
According to the docs:
childProcess.stdout.on('data', (data) => {
    console.log(`stdout: ${data}`);
});
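To answer the type question directly: Python doesn't hand Node a typed object at all. Whatever the script prints crosses the process boundary as raw bytes on stdout, which Node receives as Buffer chunks on that 'data' event (also note that stdio: 'inherit' must be dropped, or childProcess.stdout will be null). A minimal sketch of the Python side, assuming you want Node to be able to JSON.parse the collected output (to_json and the "records" orient are standard pandas options):

import json
import sys

# Emit the filtered DataFrame as JSON text on stdout; Node can JSON.parse the collected chunks
print(result.to_json(orient="records"))
sys.stdout.flush()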
When I print the response, everything seems to be correct, and the type is also correct.
Assertion: True
Response type: <class 'scrape_pb2.ScrapeResponse'>
But on Postman I get "13 INTERNAL" with no additional information:
I can't figure out what the issue is, and I can't find out how to log or print the error from the server side.
Relevant proto parts:
syntax = "proto3";
service ScrapeService {
rpc ScrapeSearch(ScrapeRequest) returns (stream ScrapeResponse) {};
}
message ScrapeRequest {
string url = 1;
string keyword = 2;
}
message ScrapeResponse {
oneof result {
ScrapeSearchProgress search_progress = 1;
ScrapeProductsProgress products_progress = 2;
FoundProducts found_products = 3;
}
}
message ScrapeSearchProgress {
int32 page = 1;
int32 total_products = 2;
repeated string product_links = 3;
}
scraper.py
def get_all_search_products(search_url: str, class_keyword: str):
    search_driver = webdriver.Firefox(options=options, service=service)
    search_driver.maximize_window()
    search_driver.get(search_url)

    # scrape first page
    product_links = scrape_search(driver=search_driver, class_keyword=class_keyword)
    page = 1
    search_progress = ScrapeSearchProgress(page=page, total_products=len(product_links), product_links=[])
    search_progress.product_links[:] = product_links

    # scrape next pages
    while go_to_next_page(search_driver):
        page += 1
        print(f'Scraping page=>{page}')
        product_links.extend(scrape_search(driver=search_driver, class_keyword=class_keyword))
        print(f'Number of products scraped=>{len(product_links)}')
        search_progress.product_links.extend(product_links)

        # TODO: remove this line
        if page == 6:
            break

        search_progress_response = ScrapeResponse(search_progress=search_progress)
        yield search_progress_response
Server:
class ScrapeService(ScrapeService):
    def ScrapeSearch(self, request, context):
        print(f"Request received: {request}")
        scrape_responses = get_all_search_products(search_url=request.url, class_keyword=request.keyword)
        for response in scrape_responses:
            print(f"Assertion: {response.HasField('search_progress')}")
            print(f"Response type: {type(response)}")
            yield response
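As an aside on the "can't log or print the error from the server side" part: one hedged option (not from the original post) is to wrap the servicer body in a try/except and report failures through the context. grpc.StatusCode, set_code and set_details are standard grpcio calls, while the ScrapeServiceServicer base-class name is assumed from the usual codegen convention:

import logging
import traceback
import grpc

class ScrapeService(ScrapeServiceServicer):  # assumed generated base class
    def ScrapeSearch(self, request, context):
        try:
            for response in get_all_search_products(search_url=request.url, class_keyword=request.keyword):
                yield response
        except Exception:
            # Log the full traceback on the server and surface a clean status to the client
            logging.error(traceback.format_exc())
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details('scrape failed; see server logs')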
Turns out it was just an issue with Postman. I set up a Python client and it worked.
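For reference, a minimal sketch of such a Python client, assuming the generated modules are named scrape_pb2 / scrape_pb2_grpc and the server listens on localhost:50051 (both assumptions), with placeholder request values:

import grpc
import scrape_pb2
import scrape_pb2_grpc

# Open a channel, build the stub, and iterate over the streamed responses
channel = grpc.insecure_channel('localhost:50051')
stub = scrape_pb2_grpc.ScrapeServiceStub(channel)
request = scrape_pb2.ScrapeRequest(url='https://example.com/search', keyword='product')
for response in stub.ScrapeSearch(request):
    # WhichOneof tells you which field of the oneof result is populated
    print(response.WhichOneof('result'))
    print(response)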
I have an HTML document with a PDF embedded in base64-encoded format. I'd like to extract the string and save it as a PDF file, using the code below.
But on opening it in Adobe Reader, it says the format is invalid, and I'm looking to fix this issue.
I think the PDF was encoded using the JavaScript encodeURIComponent function and needs to be decoded accordingly in Python.
Sample embed tag:
<embed type="application/pdf" src="data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9D">
Code
import base64
from io import BytesIO
from pathlib import Path

def decode_b64():
    b64 = "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9D"
    buffer = BytesIO()
    content = base64.b64decode(b64)
    buffer.write(content)
    with open(Path(Path.home(), 'Downloads', 'mytest.pdf'), "wb") as f:
        f.write(buffer.getvalue())

if __name__ == "__main__":
    decode_b64()
=== Update 1:
I found a way to convert it using JavaScript. It would be nice if we could port this code to Python.
const {readFileSync, writeFile, promises: fsPromises} = require('fs');
var data = readFileSync("pdf-file.html", 'utf-8');
var DOMParser = require('xmldom').DOMParser;
var parser = new DOMParser();
const virtualDoc = parser.parseFromString(data, 'text/html');
var elem = virtualDoc.getElementsByTagName('embed')[0];

for (var i = 0; i < elem.attributes.length; i++) {
    var attrib = elem.attributes[i];
    if (attrib.specified) {
        if (attrib.name == "src") {
            var result = attrib.value;
            result = result.replace('data:application/pdf;base64,', '');
            let buff = Buffer.from(decodeURIComponent(result), 'base64');
            writeFile('pdf-file.pdf', buff, err => {
                if (err) {
                    console.error(err);
                }
            });
        }
    }
}
This is a situation that you should have been able to chase down yourself. I wasn't 100% sure how JavaScript encoded those two characters, so I wrote up a simple HTML page:
<script>
var s = "abcde++defgh//";
alert(encodeURIComponent(s));
</script>
When I ran that page, the result was "abcde%2B%2Bdefgh%2F%2F", and that is all the information you need to fix up those strings.
import base64
from pathlib import Path

def decode_b64():
    b64 = "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9D......"
    b64 = b64.replace('%2B', '+').replace('%2F', '/')
    content = base64.b64decode(b64)
    with open(Path(Path.home(), 'Downloads', 'mytest.pdf'), "wb") as f:
        f.write(content)
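To address Update 1: a rough Python port of that JavaScript, assuming the HTML file is named pdf-file.html as in the snippet and that BeautifulSoup is available; urllib.parse.unquote plays the role of decodeURIComponent and handles all the percent escapes, not just %2B and %2F:

import base64
from pathlib import Path
from urllib.parse import unquote
from bs4 import BeautifulSoup

# Parse the HTML, find the first <embed> tag, and read its src attribute
html = Path('pdf-file.html').read_text(encoding='utf-8')
src = BeautifulSoup(html, 'html.parser').find('embed')['src']

# Strip the data-URL prefix, undo encodeURIComponent, then base64-decode and save
b64 = unquote(src.replace('data:application/pdf;base64,', ''))
Path('pdf-file.pdf').write_bytes(base64.b64decode(b64))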
I'm trying to retrieve data programmatically through WebSockets and am failing due to my limited knowledge in this area. On visiting the site at https://www.tradingview.com/chart/?symbol=ASX:RIO I notice one of the WebSocket messages being sent out is ~m~60~m~{"m":"quote_fast_symbols","p":["qs_p089dyse9tcu","ASX:RIO"]}
My code is as follows:
from websocket import create_connection
import json

ws = create_connection("wss://data.tradingview.com/socket.io/websocket?from=chart%2Fg0l68xay%2F&date=2019_05_27-12_19")
ws.send(json.dumps({"m": "quote_fast_symbols", "p": ["qs_p089dyse9tcu", "ASX:RIO"]}))
result = ws.recv()
print(result)
ws.close()
Result of the print:
~m~302~m~{"session_id":"<0.25981.2547>_nyc2-charts-3-webchart-5#nyc2-compute-3_x","timestamp":1558976872,"release":"registry:5000/tvbs_release/webchart:release_201-106","studies_metadata_hash":"888cd442d24cef23a176f3b4584ebf48285fc1cd","protocol":"json","javastudies":"javastudies-3.44_955","auth_scheme_vsn":2}
I get this result no matter what message I send, out of the multitude of messages that seem to go back and forth. I was hoping one of the messages sent back would be the price info for the lows and highs for RIO. Are there other steps I should include to get this data? I understand there might be some form of authorisation needed, but I don't know the workflow.
Yes, there is much more to set up, and it needs to be done in order. The following example, written in Node.js, will subscribe to the BINANCE:BTCUSDT real-time data and fetch 5000 historical bars on the daily chart.
Ensure you have a proper value for the Origin field set in the header section before connecting; otherwise your connection request will be rejected by the proxy. In the most common ws package there is no way to do this, so use faye-websocket instead:
const WebSocket = require('faye-websocket')
const ws = new WebSocket.Client('wss://data.tradingview.com/socket.io/websocket', [], {
    headers: { 'Origin': 'https://data.tradingview.com' }
});
After connecting you need to set up your data stream. I don't know if all of these commands need to be performed; this can probably be shrunk even more, but it works. Basically, you create new quote and chart sessions, and within these sessions you request a stream of data for the previously resolved symbol.
ws.on('open', () => {
    const quote_session = 'qs_' + getRandomToken()
    const chart_session = 'cs_' + getRandomToken()
    const symbol = 'BINANCE:BTCUSDT'
    const timeframe = '1D'
    const bars = 5000

    sendMsg(ws, "set_auth_token", ["unauthorized_user_token"])
    sendMsg(ws, "chart_create_session", [chart_session, ""])
    sendMsg(ws, "quote_create_session", [quote_session])
    sendMsg(ws, "quote_set_fields", [quote_session,"ch","chp","current_session","description","local_description","language","exchange","fractional","is_tradable","lp","lp_time","minmov","minmove2","original_name","pricescale","pro_name","short_name","type","update_mode","volume","currency_code","rchp","rtc"])
    sendMsg(ws, "quote_add_symbols", [quote_session, symbol, {"flags":['force_permission']}])
    sendMsg(ws, "quote_fast_symbols", [quote_session, symbol])
    sendMsg(ws, "resolve_symbol", [chart_session,"symbol_1","={\"symbol\":\""+symbol+"\",\"adjustment\":\"splits\",\"session\":\"extended\"}"])
    sendMsg(ws, "create_series", [chart_session, "s1", "s1", "symbol_1", timeframe, bars])
});
ws.on('message', (msg) => { console.log(`RX: ${msg.data}`) })
And finally, the implementation of the helper methods:
const getRandomToken = (stringLength = 12) => {
    const characters = 'abcdefghijklmnopqrstuvwxyz0123456789'
    const charactersLength = characters.length;
    let result = ''
    for (var i = 0; i < stringLength; i++) {
        result += characters.charAt(Math.floor(Math.random() * charactersLength))
    }
    return result
}

const createMsg = (msg_name, paramsList) => {
    const msg_str = JSON.stringify({ m: msg_name, p: paramsList })
    return `~m~${msg_str.length}~m~${msg_str}`
}

const sendMsg = (ws, msg_name, paramsList) => {
    const msg = createMsg(msg_name, paramsList)
    console.log(`TX: ${msg}`)
    ws.send(msg)
}
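Since the question itself uses the Python websocket-client package, here is a rough port of the same setup sequence back to Python. It assumes websocket-client's origin option is enough to satisfy the proxy, and it only covers the quote stream, not the historical series:

import json
import random
import string
from websocket import create_connection

def create_msg(msg_name, params):
    # TradingView frames each JSON payload as ~m~<length>~m~<payload>
    payload = json.dumps({"m": msg_name, "p": params}, separators=(',', ':'))
    return f"~m~{len(payload)}~m~{payload}"

def send_msg(ws, msg_name, params):
    ws.send(create_msg(msg_name, params))

def random_token(length=12):
    return ''.join(random.choices(string.ascii_lowercase + string.digits, k=length))

symbol = 'BINANCE:BTCUSDT'
quote_session = 'qs_' + random_token()

ws = create_connection("wss://data.tradingview.com/socket.io/websocket",
                       origin="https://data.tradingview.com")
send_msg(ws, "set_auth_token", ["unauthorized_user_token"])
send_msg(ws, "quote_create_session", [quote_session])
send_msg(ws, "quote_add_symbols", [quote_session, symbol, {"flags": ["force_permission"]}])
send_msg(ws, "quote_fast_symbols", [quote_session, symbol])

# Print whatever frames come back; quote updates for the symbol should appear here
while True:
    print(ws.recv())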
I'm trying to program a web crawler.
I have server.js / crawling.js / dataCrawler.py.
When I call crawlData (defined in crawling.js) from server.js, the method defined in crawling.js that uses spawn to execute dataCrawler.py gets called.
I need the data in server.js, but executing dataCrawler.py takes a while, so I don't get proper data, only null or undefined.
Do you have any solution? Or has anyone had the same issue?
My code is below. (It isn't complete; it's just a reference for the structure.)
//server.js
var crawler = require("./crawling")
var resultArr = crawler.crawlData();
console.log('nodeserver:', resultArr)
//crawling.js
exports.crawlData = () => {
    var dataArr = [];
    var temp;
    var py = spawn('python', ['dataCrawler.py']);
    var data = [totalUrl, gubun];
    var dataFromPy = null;

    py.stdout.on('data', function(result){
        var dataArr = encoding.convert(result, 'utf-8')
        dataArr = JSON.parse(encoding.convert(result, 'utf-8'));
        py.stdout.on('end', function(){
            temp = dataArr
        });
    });

    py.stdin.write(JSON.stringify(data));
    py.stdin.end();
    return temp;
}
//dataCrawler.py
import sys
import json

def crawling(url, gubun, page_count):
    idx = 0
    result = []
    jsonData = {}
    for i in range(1, page_count + 1):
        ....
        crawling code
        ....
    return result

def main():
    lines = sys.stdin.readlines()
    paraFromServer = json.loads(lines[0])
    url = paraFromServer[0]
    gubun = paraFromServer[1]
    # page_count is assumed to be defined elsewhere
    result = crawling(url, gubun, page_count)
    print(json.dumps(result))  # emit JSON so the Node side can JSON.parse it

main()
You didn't account for the asynchronous nature of JavaScript. What you have to do is pass a callback to the crawlData method, which will be invoked once scraping is done.
exports.crawlData = (cb) => {
    ....
    py.stdout.on('data', function(result){
        var dataArr = encoding.convert(result, 'utf-8')
        dataArr = JSON.parse(encoding.convert(result, 'utf-8'));
        py.stdout.on('end', function(){
            cb(dataArr); // ideally the pattern is cb(error, data)
        });
    });
    ...
So server.js becomes:
var crawler = require("./crawling")
crawler.crawlData((data) => {
    console.log(data);
    // Do whatever you want to do with the data.
});
Callbacks can lead to callback hell; try exploring promises or async/await.
Alternatively, you can use spawnSync if running in parallel isn't a concern:
exports.crawlData = () => {
    const result = spawnSync('python', ['dataCrawler.py'], {
        input: JSON.stringify([totalUrl, gubun])
    });
    return JSON.parse(encoding.convert(result.stdout, 'utf-8'));
}
Trying to collect data on book price fluctuations for a school project. I'm using Python to scrape from a book buyback aggregator (in this case, bookscouter), but since the site has to load in the data, grabbing the source code through the urllib2 package gives me the source from before the data is loaded. How do I pull the page after the data has loaded?
Example: http://bookscouter.com/prices.php?isbn=9788498383621&searchbutton=Sell
You cannot do this with Python alone. You need a JavaScript engine API like PhantomJS.
With Phantom, it is very easy to set up scraping of all the page contents, both static and dynamic JavaScript content (like the Ajax call results in your case). In fact, you can register page event handlers with your page parser like this (this is a Node.js + PhantomJS example):
/*
 * Register Page Handlers as functions
 * {
 *   onLoadStarted : onLoadStarted,
 *   onLoadFinished: onLoadFinished,
 *   onError : onError,
 *   onResourceRequested : onResourceRequested,
 *   onResourceReceived : onResourceReceived,
 *   onNavigationRequested : onNavigationRequested,
 *   onResourceError : onResourceError
 * }
 */
registerHandlers : function(page, handlers) {
    if (handlers.onLoadStarted) page.set('onLoadStarted', handlers.onLoadStarted)
    if (handlers.onLoadFinished) page.set('onLoadFinished', handlers.onLoadFinished)
    if (handlers.onResourceError) page.set('onResourceError', handlers.onResourceError)
    if (handlers.onResourceRequested) page.set('onResourceRequested', handlers.onResourceRequested)
    if (handlers.onResourceReceived) page.set('onResourceReceived', handlers.onResourceReceived)
    if (handlers.onNavigationRequested) page.set('onNavigationRequested', handlers.onNavigationRequested)
    if (handlers.onError) page.set('onError', handlers.onError)
}
At this point you have full control of what is going on, and when, in the page you have to download. For example:
var onResourceError = function(resourceError) {
    var errorReason = resourceError.errorString;
    var errorPageUrl = resourceError.url;
}

var onResourceRequested = function(request) {
    var msg = ' request: ' + JSON.stringify(request, undefined, 4);
};

var onResourceReceived = function(response) {
    var msg = ' id: ' + response.id + ', stage: "' + response.stage + '", response: ' + JSON.stringify(response);
};

var onNavigationRequested = function(url, type, willNavigate, main) {
    var msg = ' destination_url: ' + url;
    msg += ' type (cause): ' + type;
    msg += ' will navigate: ' + willNavigate;
    msg += ' from page\'s main frame: ' + main;
};

page.onResourceRequested(
    function(requestData, request) {
        //request.abort()
        //request.changeUrl(url)
        //request.setHeader(key,value)
        var msg = ' request: ' + JSON.stringify(request, undefined, 4);
        //console.log( msg )
    },
    function(requestData) {
        //console.log(requestData.url)
    })

PageHelper.registerHandlers(page, {
    onLoadStarted : onLoadStarted,
    onLoadFinished: onLoadFinished,
    onError : null, // onError THIS HANDLER CRASHES PHANTOM-NODE
    onResourceRequested : null, // MUST BE ON PAGE OBJECT
    onResourceReceived : onResourceReceived,
    onNavigationRequested : onNavigationRequested,
    onResourceError : onResourceError
});
As you can see, you can define your page handlers and take control of the flow, and thus of the resources loaded on that page. So you can be sure that all the data is ready and set before you take the whole page source, like:
var Parser = {
    parse : function(page) {
        var onSuccess = function(page) { // page loaded
            var pageContents = page.evaluate(function() {
                return document.body.innerText;
            });
        }
        var onError = function(page, elapsed) { // error
        }
        page.evaluate(function(func) {
            return func(document);
        }, function(dom) {
            return true;
        });
    }
} // Parser
Here you can see the whole page contents loaded in the onSuccess callback:
var pageContents = page.evaluate(function() {
    return document.body.innerText;
});
The page comes from PhantomJS directly, as in the following snippet:
phantom.create(function (ph) {
    ph.createPage(function (page) {
        Parser.parse(page)
    })
}, options)
Of course, this is to give you an idea of what you can do with Node.js + PhantomJS, which are super powerful when combined.
You can run PhantomJS in a Python environment, calling it like:
try:
    output = ''
    for result in runProcess([self.runProcess,
                              self.runScript,
                              self.jobId,
                              self.protocol,
                              self.hostname,
                              self.queryString]):
        output += '' + result
    print(output)
except Exception as e:
    print(e)
    print(traceback.format_exc())
where you use subprocess Popen to execute the binary:
def runProcess(exe):
    p = subprocess.Popen(exe, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    while True:
        retcode = p.poll()  # returns None while subprocess is running
        line = p.stdout.readline()
        yield line
        if retcode is not None:
            break
Of course, the process to run in this case is Node.js:
self.runProcess = 'node'
with the args you need as params.
The challenge is reading the data once it's been rendered by a web browser, which requires some extra tricks. First, see if the site has a pre-rendered version* or an API.
This article (linked from the Web Archive) has a pretty good breakdown of what you'll need to do. It can be summed up as:
Pick a good Python-WebKit renderer (in the case of the article, PyQt)
Use a windowing widget to fetch and render the page
Fetch the rendered HTML from the widget
Parse this HTML as normal using a library like lxml or BeautifulSoup (a rough sketch follows below)
* Minor rant - the idea of having to hope for a pre-rendered version of what should be a static webpage angers me.
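For completeness, a rough present-day sketch of the same idea, swapping in Selenium (which also drives a real browser and already appears earlier on this page) for the PyQt/PhantomJS renderers; the URL is the example from the question:

from selenium import webdriver
from bs4 import BeautifulSoup

# Let a real browser engine run the page's JavaScript, then read the rendered DOM
driver = webdriver.Firefox()
driver.implicitly_wait(10)  # give asynchronously loaded content time to appear
driver.get('http://bookscouter.com/prices.php?isbn=9788498383621&searchbutton=Sell')
html = driver.page_source
driver.quit()

# Parse the rendered HTML as usual
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)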