I have created a webhook for Dialogflow using Node.js.
Now I need to call a Python file to summarize text; I use python-shell for this. As an argument, I pass the text that I want to shorten.
The problem is that the text about defenestration, for example, is sent correctly except for "v roce": Python receives it as "v\xa0roce". The rest of the text is fine. Other texts may contain more than one occurrence of this problem.
When I called Python with the same text as an argument from the command line, this problem did not occur.
Code sample in Node.js:
var options = {
    mode: 'text',
    args: result
};

await PythonShell.run('summary.py', options, function (err, results) {
    if (err) {
        console.log(err);
        callback(result);
    } else {
        // results is an array consisting of messages collected during execution
        result = results.toString();
        let output = {
            fulfillmentText: result,
            outputContexts: []
        };
        result = JSON.stringify(output);
        callback(result);
    }
});
If I call print(sys.argv[1]) at the beginning of the Python code, I get this:
Původ slova je odvozován od pražské defenestrace v roce 1618, kdy
nespokojení protestanští stavové vyhodili z okna dva královské
místodržící a k tomuto činu sepsali obsáhlou apologii.
This looks good, but after running this:
article = filedata.split(". ")
sentences = []
for sentence in article:
    sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
sentences.pop()
print(sentences)
return sentences
it returns the following:
"[['Defenestrace', 'označuje', 'násilné', 'vyhození', 'z', 'okna'],
['Původ', 'slova', 'je', 'odvozován', 'od', 'pražské', 'defenestrace',
'v\xa0roce', '1618,', 'kdy', 'nespokojení', 'protestanští',
'stavové', 'vyhodili', 'z', 'okna', 'dva', 'královské', 'místodržící',
'a', 'k', 'tomuto', 'činu', 'sepsali', 'obsáhlou', 'apologii'],
['Ovšem', 've', 'středověku', 'a', 'v', 'raném', 'novověku', 'se',
'defenestrace', 'konaly', 'poměrně', 'často', 'a', 'toto', 'konání',
'neslo', 'prvky', 'lynče,', 'ordálu', 'a', 'společně', 'spáchané',
'vraždy'], ['Ve', 'středověké', 'a', 'raně', 'novověké',
'společnosti,', 'která', 'je', 'výrazně', 'horizontálně', 'členěna,',
'má', 'defenestrace', 'charakter', 'symbolického', 'trestu']]"
I call this Python code using generate_summary(sys.argv[1], 2).
If I enter "TEXT ABOUT DEFENESTRATION" as a literal instead of sys.argv[1], this problem does not occur and "v roce" is displayed correctly.
Could someone help me?
Thank you in advance
Regular expressions are not supported by str.replace; you need to use the re module instead, like this:
re.sub("[^a-zA-Z]", " ", your_string)
And then split the result into words.
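Put together, a minimal sketch of the cleaning step could look like this (using split() without an argument so runs of spaces don't produce empty words; note that [^a-zA-Z] also strips digits and accented Czech characters, and that it conveniently turns the troublesome non-breaking space \xa0 into a regular space):

```python
import re

def split_into_words(filedata):
    # Split the text into sentences, then replace every non-letter
    # character (including the non-breaking space \xa0) with a space
    # before splitting each sentence into words.
    sentences = []
    for sentence in filedata.split(". "):
        cleaned = re.sub("[^a-zA-Z]", " ", sentence)
        sentences.append(cleaned.split())
    return sentences

print(split_into_words("Hello\xa0world. Second sentence"))
# [['Hello', 'world'], ['Second', 'sentence']]
```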
I'm trying to create a domain name generator with this code.
import random
import requests

Website = open("dictionaries/Website.txt", 'w', encoding="utf-8")

with open("dictionaries/WebAll.txt") as i:
    Web = [line.rstrip() for line in i]
with open("dictionaries/domain.txt") as i:
    WebD = [line.rstrip() for line in i]

delimiters = ['', '-', '.']
suffix = ['info', 'com', 'odoo', 'xyz', 'de', 'fans', 'blog', 'io', 'site', 'online']
output = []

for i in range(100):
    for subdomain_count in [1, 2, 3, 4]:
        webd = random.choice(WebD)
        data = [webd] + random.sample(Web, k=subdomain_count)
        random.shuffle(data)
        delims = (random.choices(delimiters, k=subdomain_count) +
                  ['.' + random.choice(suffix)])
        address = ''.join([a + b for a, b in zip(data, delims)])
        weburl = 'http://' + address
        output.append(weburl)
I made this code to find phishing sites from a bunch of words that I put inside WebAll.txt and domain.txt. The words inside domain.txt are variants of the main domain, for example 'minecraft', 'm1n3craft', 'minecr4ft' and so on. Since I have set the subdomain count between 1 and 4, let's pick 2: the generated site would look like m1n3craft.play.com, where the word play comes from WebAll.txt. Each generated website is then checked with a request:
try:
    request = requests.get('http://m1necr4ft.play.com')
    if request.status_code == 200:
        print('web exists')
except requests.exceptions.RequestException:
    print('does not exist')
If I were to do this, the chance of finding a legitimate site would be lower than finding a parked or non-existing one. Is there a better approach, where I just need to enter a single keyword, for example minecraft, and it would search every possible URL that includes the name minecraft and output it?
Thank you!
This question already has answers here:
What exactly do "u" and "r" string prefixes do, and what are raw string literals?
(7 answers)
Closed 2 years ago.
My current Cloud Run URL returns a long string, matching the exact format as described here.
When I run the following code in Google Apps Script, I get a log output of '1'. What happens is that the entire string is put in the [0][0] position of the data array instead of actually being parsed.
function myFunction() {
    const token = ScriptApp.getIdentityToken();
    const options = {
        headers: {'Authorization': 'Bearer ' + token}
    };
    var responseString = UrlFetchApp.fetch("https://*myproject*.a.run.app", options).getContentText();
    var data = Utilities.parseCsv(responseString, '\t');
    Logger.log(data.length);
}
My expected output is a 2D array as described in the aforementioned link, with a logged output length of 18.
I have confirmed the output of my response by:
Logging the responseString
Copying the output log into a separate var -> var temp = "copied-output"
Changing the parseCsv line to -> var data = Utilities.parseCsv(temp, '\t')
Saving and running the new code. This then outputs a successful 2D array with a length of 18.
So why is it, that my current code doesn't work?
Happy to try anything because I am out of ideas.
Edit: More information below.
Python script code
@app.route("/")
def hello_world():
    # Navigate to webpage and get page source
    driver.get("https://www.asxlistedcompanies.com/")
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # ##########################################################################
    # Used by Google Apps Script to create Arrays
    # This creates a two-dimensional array of the format [[a, b, c], [d, e, f]]
    # var csvString = "a\tb\tc\nd\te\tf";
    # var data = Utilities.parseCsv(csvString, '\t');
    # ##########################################################################
    long_string = ""
    limit = 1
    for row in soup.select('tr'):
        if limit == 20:
            break
        else:
            tds = [td.a.get_text(strip=True) if td.a else td.get_text(strip=True) for td in row.select('td')]
            count = 0
            for column in tds:
                if count == 4:
                    linetext = column + r"\n"
                    long_string = long_string + linetext
                else:
                    text = column + r"\t"
                    long_string = long_string + text
                count = count + 1
            limit = limit + 1
    return long_string
GAS Code edited:
function myFunction() {
    const token = ScriptApp.getIdentityToken();
    const options = {
        headers: {'Authorization': 'Bearer ' + token}
    };
    var responseString = UrlFetchApp.fetch("https://*myfunction*.a.run.app", options).getContentText();
    Logger.log("The responseString: " + responseString);
    Logger.log("responseString length: " + responseString.length);
    Logger.log("responseString type: " + typeof(responseString));
    var data = Utilities.parseCsv(responseString, '\t');
    Logger.log(data.length);
}
GAS logs/output as requested:
6:17:11 AM Notice Execution started
6:17:22 AM Info The responseString: 14D\t1414 Degrees Ltd\tIndustrials\t21,133,400\t0.001\n1ST\t1ST Group Ltd\tHealth Care\t12,738,500\t0.001\n3PL\t3P Learning Ltd\tConsumer Discretionary\t104,613,000\t0.005\n4DS\t4DS Memory Ltd\tInformation Technology\t58,091,300\t0.003\n5GN\t5G Networks Ltd\t\t82,746,600\t0.004\n88E\t88 Energy Ltd\tEnergy\t42,657,800\t0.002\n8CO\t8COMMON Ltd\tInformation Technology\t11,157,900\t0.001\n8IH\t8I Holdings Ltd\tFinancials\t35,814,200\t0.002\n8EC\t8IP Emerging Companies Ltd\t\t3,199,410\t0\n8VI\t8VIC Holdings Ltd\tConsumer Discretionary\t13,073,200\t0.001\n9SP\t9 Spokes International Ltd\tInformation Technology\t21,880,100\t0.001\nACB\tA-Cap Energy Ltd\tEnergy\t7,846,960\t0\nA2B\tA2B Australia Ltd\tIndustrials\t95,140,200\t0.005\nABP\tAbacus Property Group\tReal Estate\t1,679,500,000\t0.082\nABL\tAbilene Oil and Gas Ltd\tEnergy\t397,614\t0\nAEG\tAbsolute Equity Performance Fund Ltd\t\t107,297,000\t0.005\nABT\tAbundant Produce Ltd\tConsumer Staples\t1,355,970\t0\nACS\tAccent Resources NL\tMaterials\t905,001\t0\n
6:17:22 AM Info responseString length: 1020
6:17:22 AM Info responseString type: string
6:17:22 AM Info 1.0
6:17:22 AM Notice Execution completed
Issue:
Using the r'' raw-string flag makes \n and \t a literal backslash followed by n/t, respectively, rather than a newline or a tab character. This explains why you were able to copy the "displayed" logs into a variable and parse them successfully.
Solution:
Don't use the r flag.
Snippet:
if count == 4:
    linetext = column + "\n"   # no raw-string flag
    long_string = long_string + linetext
else:
    text = column + "\t"       # no raw-string flag
    long_string = long_string + text
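A quick demo of the difference, independent of the Cloud Run code above:

```python
raw = r"\n"          # raw string: a backslash followed by the letter n
real = "\n"          # ordinary string: a single newline character

print(len(raw))      # 2
print(len(real))     # 1
print(raw == "\\n")  # True

# Splitting only works on real control characters, which is why parseCsv
# saw one giant field when the response contained r"\t" / r"\n":
print("a\tb\nc\td".split("\n"))   # ['a\tb', 'c\td']
print(r"a\tb\nc\td".split("\n"))  # ['a\\tb\\nc\\td'] - nothing to split on
```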
Sorry about the title - I wasn't sure how to word it. Anyway, I'm writing a markup language compiler in Python that compiles into HTML. Example:
-(a){href:"http://www.google.com"}["Click me!"]
Compiles into:
<a href="http://www.google.com">Click me!</a>
It's all working great until nested tags are introduced:
-(head)[
-(title)[Nested tags]
]
-(body)[
-(div)[
-(h1)[Nested tags]
-(p)[Paragraph]
]
]
Which produces:
<title>Nested tags</title>
<h1>Nested tags</h1>
<p>Paragraph</p>
Because the program splits commands using code.split("-"), it is also splitting the commands inside the bodies of other commands. When I added a print statement to the compiler code, I got:
(head)[
(title)[Nested tags]
]
(body)[
(div)[
(h1)[Nested tags]
(p)[Paragraph]
]
]
Each line is interpreted as a different command, so my regexps (\((.+)\)\[(.+)\] and \((.+)\)\{(.+)\}\[(.+)\]) do not match something like (head)[. I thought the best solution was to have it split on '-' unless it is inside the body of a command, making the above produce:
(head)[-(title)[Nested tags]
(body)[-(div)[-(h1)[Nested tags]-(p)[Paragraph]]]
Then run the same code for each command within each block.
TL;DR: Make input:
"-abc-def[-ignore-me]-ghi"
Produce:
["abc", "def", "[-ignore-me]", "ghi"]
Any help would be greatly appreciated, thanks.
I believe this dodgy code does roughly what you want:
import re

re_name = re.compile(r'-\(([^)]+)\)')
re_args = re.compile(r'{([^}]+)}')

def parse(chars, history=[]):
    body = ''
    while chars:
        char = chars.pop(0)
        if char == '[':
            name = re_name.search(body).group(1)
            args = re_args.search(body)
            start = '<' + name
            if args:
                for arg in args.group(1).split(','):
                    start += ' ' + arg.strip().replace(':', '=')
            start += '>'
            end = '</' + name + '>'
            history.append(start)
            history.append(parse(chars))
            history.append(end)
            body = ''
            continue
        if char == ']':
            return body.strip()
        body += char
    return history
code = '''
-(head)[
-(title)[Nested tags]
-(random)[Stuff]
]
-(body){class:"nav", other:stuff, random:tests}[
-(div)[
-(h1)[Nested tags]
-(p)[Paragraph]
]
]
'''
parsed = parse(list(code))
print(''.join(parsed))
I have two files, one with a list of keywords/strings:
blue fox
the
lazy dog
orange
of
file
Another, with text:
The blue fox jumped
over the lazy dog
this file has nothing important
lines repeat
this line does not match
I want to take the list of strings in the first file and find lines from second file that match any of the strings from the first. So I wrote a Pig script with a Python UDF:
register match.py using jython as match;
A = LOAD 'words.txt' AS (word:chararray);
B = LOAD 'text.txt' AS (line:chararray);
C = GROUP A ALL;
D = FOREACH B generate match.match(C.$1,line);
dump D;
# match.py
@outputSchema("str:chararray")
def match(wordlist, line):
    linestr = str(line)
    for word in wordlist:
        wordstr = str(word)
        if re.search(wordstr, linestr):
            return line
Ends in error:
"2014-04-01 06:22:34,775 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function"
Detailed Error log:
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
at o
Pig Stack Trace
---------------
ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias D. Backend error : Error executing function
at org.apache.pig.PigServer.openIterator(PigServer.java:828)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:538)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
================================================================================
I suspect the "re" module isn't available to Jython in my CDH4.x cluster. I did not spend much time on the Python UDF; I solved it by writing a Java UDF instead. Pardon my Java, since I am a n00b - this may not be the most efficient or prettiest Java code (and there are some bugs in there, I am sure):
package pigext;

import java.io.IOException;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pig.EvalFunc;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;

public class matchList extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        try {
            String line = (String) input.get(0);
            DataBag bag = (DataBag) input.get(1);
            Iterator it = bag.iterator();
            String output = "";
            while (it.hasNext()) {
                Tuple t = (Tuple) it.next();
                if (t != null && t.size() > 0 && t.get(0) != null && line != null) {
                    String cmd = t.get(0).toString();
                    if (line.toLowerCase().matches(cmd.toLowerCase())) {
                        return (line + "," + cmd);
                    }
                }
            }
            return output;
        } catch (Exception e) {
            throw new IOException("Failed to process row", e);
        }
    }
}
The way to use it is to have a file filled with regexes, one per line, that you want to search for, plus your target text file. So a regex file, wordstest.txt, contains:
.*?this +blah.*?
And, your text file,text.txt, is:
this blah starts with blah
this blah has way too many spaces
that won't match
thisblahshouldnotmatch
thisblah should not match either
the line here is this blah
line here has this blah in the middle
line here has this blah with extra spaces
only has blah
only has this
The pig script would be:
REGISTER pigext.jar;
A = LOAD 'wordstest.txt' AS (cmd:chararray);
B = LOAD 'text.txt' AS (line:chararray);
C = GROUP A ALL;
D = FOREACH B generate pigext.matchList(line,C.$1);
dump D;
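For the record, Jython does ship the re module on most setups; a simpler explanation for "Error executing function" is that the original UDF never does import re, so re.search raises a NameError at runtime. A hedged sketch of a corrected Python UDF (the Pig-specific @outputSchema("str:chararray") decorator is shown as a comment so the snippet runs standalone, and the bag-of-tuples unwrapping is my assumption about how C.$1 arrives):

```python
import re

# In the real UDF file this function would carry Pig's
# @outputSchema("str:chararray") decorator.
def match(wordlist, line):
    linestr = str(line)
    for word in wordlist:
        # A Pig bag yields tuples; unwrap the first field if needed.
        wordstr = str(word[0]) if isinstance(word, tuple) else str(word)
        if re.search(wordstr, linestr):
            return line
    return None  # no pattern matched this line

print(match([("blue fox",), ("lazy dog",)], "over the lazy dog")) # over the lazy dog
```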
I'm trying to write a function that can parse a file of defined messages for a set of replies, but I'm at a loss on how to do so.
For example, the config file would look like:
[Message 1]
1: Hey
How are you?
2: Good, today is a good day.
3: What do you have planned?
Anything special?
4: I am busy working, so nothing in particular.
My calendar is full.
Each new line without a number preceding it is considered part of the reply - just another message in the conversation, sent without waiting for a response.
Thanks
Edit: The config file will contain multiple messages, and I would like the ability to randomly select from them all. Maybe store each reply from a conversation as a list; then the replies with extra messages can keep the newline, and I can just split on it. I'm not really sure what the best approach would be.
Update:
I've got for the most part this coded up so far:
def parseMessages(filename):
    messages = {}
    begin_message = lambda x: re.match(r'^(\d)\: (.+)', x)
    with open(filename) as f:
        for line in f:
            m = re.match(r'^\[(.+)\]$', line)
            if m:
                index = m.group(1)
            elif begin_message(line):
                begin = begin_message(line).group(2)
            else:
                cont = line.strip()
                # ??
    return messages
But now I am stuck on how to store them in the dict the way I'd like.
How would I get this to store a dict like:
{'Message 1':
    {'1': 'Hey\nHow are you?',
     '2': 'Good, today is a good day.',
     '3': 'What do you have planned?\nAnything special?',
     '4': 'I am busy working, so nothing in particular.\nMy calendar is full.'
    }
}
Or if anyone has a better idea, I'm open for suggestions.
Once again, thanks.
Update Two
Here is my final code:
import re

def parseMessages(filename):
    all_messages = {}
    num = None
    begin_message = lambda x: re.match(r'^(\d)\: (.+)', x)
    with open(filename) as f:
        messages = {}
        message = []
        for line in f:
            m = re.match(r'^\[(.+)\]$', line)
            if m:
                index = m.group(1)
            elif begin_message(line):
                if num:
                    messages.update({num: '\n'.join(message)})
                    all_messages.update({index: messages})
                del message[:]
                num = int(begin_message(line).group(1))
                begin = begin_message(line).group(2)
                message.append(begin)
            else:
                cont = line.strip()
                if cont:
                    message.append(cont)
        # flush the final reply, which the loop above never stores
        if num:
            messages.update({num: '\n'.join(message)})
            all_messages.update({index: messages})
    return all_messages
Doesn't sound too difficult. Almost-Python pseudocode:
for line in configFile:
    strip comments from line
    if line looks like a section separator:
        section = matched section
    elif line looks like the beginning of a reply:
        append line to replies[section]
    else:
        append line to last reply in replies[section][-1]
You may want to use the re module for the "looks like" operation. :)
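Fleshed out into runnable Python (a sketch; the '#' comment syntax and the exact header/reply regexes are assumptions based on the question's example format):

```python
import re

def parse_replies(lines):
    replies = {}
    section = None
    for line in lines:
        line = line.split('#', 1)[0].rstrip()      # strip comments
        if not line:
            continue
        header = re.match(r'^\[(.+)\]$', line)     # section separator?
        start = re.match(r'^(\d+):\s*(.+)', line)  # beginning of a reply?
        if header:
            section = header.group(1)
            replies[section] = []
        elif start:
            replies[section].append(start.group(2))
        else:                                      # continuation line
            replies[section][-1] += '\n' + line.strip()
    return replies

config = ["[Message 1]", "1: Hey", "How are you?", "2: Good, today is a good day."]
print(parse_replies(config))
# {'Message 1': ['Hey\nHow are you?', 'Good, today is a good day.']}
```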
If you have a relatively small number of strings, why not just supply them as string literals in a dict?
{'How are you?' : 'Good, today is a good day.'}
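For example, assuming exact-match prompts are enough (dict.get supplies a fallback for unknown prompts):

```python
replies = {
    'How are you?': 'Good, today is a good day.',
    'What do you have planned?': 'I am busy working, so nothing in particular.',
}

prompt = 'How are you?'
print(replies.get(prompt, "Sorry, I don't have a reply for that."))
# Good, today is a good day.
```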