indexing content of weburl into elasticsearch/kibana


I have scraped 500+ links and sublinks of a website using Beautiful Soup and Python. Now I want to index all the content/text of these URLs in Elasticsearch. Is there any tool that can help me index directly into the Elasticsearch/Kibana stack?
Please help me with pointers. I searched on Google and found Logstash, but it seems to work only for a single URL.

For reference on Logstash please see: https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html
Otherwise, as an example: if you put your crawler output into a file with one line per URL, you could use the Logstash config below. In this example, Logstash reads each line as a message and sends it to the Elasticsearch servers on host1 and host2.
input {
  file {
    path => "/an/absolute/path" # The path has to be absolute
    start_position => "beginning"
  }
}
output {
  elasticsearch {
    hosts => ["host1:port1", "host2:port2"] # most of the time the host is a DNS name (localhost in the most basic case) and the port is 9200
    index => "my_crawler_urls"
    workers => 4 # to define depending on your available resources/expected performance
  }
}
Of course, you might want to filter or post-process the output of your crawler, and Logstash lets you do that with codecs and/or filters.
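As a sketch of the "one line per URL" file described above, the crawler output could be written as one JSON object per line. The `pages` mapping, the field names `url` and `text`, and the function name are assumptions made for illustration, not part of the original crawler. A file like this would typically be paired with `codec => "json"` on the Logstash file input so that each line arrives as a structured event.

```python
import json
from pathlib import Path


def write_pages_as_json_lines(pages, path):
    """Write one JSON document per line and return the number of lines written.

    `pages` is assumed to map each scraped URL to the text extracted from it.
    """
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for url, text in pages.items():
            # Collapse runs of whitespace left over from the HTML extraction.
            record = {"url": url, "text": " ".join(text.split())}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The `path` argument would then be the same absolute path that the Logstash `file` input watches.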

Related

Nginx: How to Send to Client Result of a Python Script?

I have a simple Python script that generates an x-icon from a hex colour given to it and returns a valid byte stream (BytesIO).
I want to get something like this (please do not laugh, I have been using Nginx for only about two days):
location ~^/icons/(?<colour>[a-fA-F0-9]{6})\.ico$ {
send 200 (./favicon.py colour); # System call to `favicon.py` with `colour` argument.
}
Is it possible at all?

The following config should do the job:
location ~^/icons/(?<colour>[a-fA-F0-9]{6})\.ico$ {
    content_by_lua '
        -- Pass the captured colour to the script; the regex above only admits six hex digits
        local command = "./favicon.py " .. ngx.var.colour
        local handle = io.popen(command)
        local content = handle:read("*a")
        handle:close()
        ngx.print(content)
    ';
}
Basically, it uses Lua to run the script and send its output as the response body.
NOTE: your nginx must be compiled with the Lua module for this solution to work.
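For the Lua approach to work, the script has to write the icon bytes to standard output, because that is what `io.popen` reads. The asker's `favicon.py` is not shown, so the following is only a hypothetical sketch of such a script: the name `make_ico`, the 16x16 size and the assumption that the colour arrives as the first command-line argument are all illustrative.

```python
import re
import struct
import sys


def make_ico(colour, size=16):
    """Return the bytes of a size x size, single-colour, 32-bit .ico file."""
    if not re.fullmatch(r"[0-9a-fA-F]{6}", colour):
        raise ValueError("expected six hex digits, got %r" % colour)
    red, green, blue = bytes.fromhex(colour)
    # Pixels are stored as BGRA; alpha 255 makes every pixel opaque.
    pixels = bytes([blue, green, red, 255]) * (size * size)
    # 1-bit transparency mask, each row padded to a multiple of 4 bytes.
    mask_row_bytes = ((size + 31) // 32) * 4
    mask = bytes(mask_row_bytes * size)
    # BITMAPINFOHEADER; the height is doubled because it covers image + mask.
    dib_header = struct.pack("<IiiHHIIiiII", 40, size, size * 2, 1, 32,
                             0, len(pixels) + len(mask), 0, 0, 0, 0)
    image = dib_header + pixels + mask
    # ICONDIR (6 bytes) followed by one ICONDIRENTRY (16 bytes).
    icon_dir = struct.pack("<HHH", 0, 1, 1)
    entry = struct.pack("<BBBBHHII", size, size, 0, 0, 1, 32, len(image), 6 + 16)
    return icon_dir + entry + image


if __name__ == "__main__" and len(sys.argv) == 2:
    # Binary output goes to stdout, which is what io.popen reads on the Lua side.
    sys.stdout.buffer.write(make_ico(sys.argv[1]))
```

Validating the colour inside the script as well as in the nginx regex keeps the script safe if it is ever called from somewhere else.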

Monaco editor: Highlight syntax errors in Python

I want to highlight Python syntax errors in browser.
I discovered there is LSP implementation for Python and LSP client for Monaco editor.
Is there any way to connect them together?

There is a way to connect them together!
It is all about the Language Server Protocol.
First step: the Language Server
The first thing you need is a running server that will provide language-specific logic (such as autocompletion, validation, etc.).
As mentioned in your question, you can use palantir's python-language-server.
You can also find a list of existing language server implementations by language on langserver.org.
In this setup, the client and the server communicate with JSON-RPC messages sent over a WebSocket.
You can use python-jsonrpc-server and execute this python script to launch a python language server on your device:
python langserver_ext.py
This will host the language server on ws://localhost:3000/python.
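To illustrate what travels over that socket when a file contains a syntax error, here is a minimal sketch that builds a `textDocument/publishDiagnostics` notification, the LSP message that editors turn into error markers. It uses Python's built-in `compile()` as a stand-in checker, and the function name `syntax_diagnostics` is hypothetical. This is not how python-language-server produces its diagnostics, only an example of the message shape.

```python
import json


def syntax_diagnostics(uri, source):
    """Build a textDocument/publishDiagnostics notification for `source`."""
    diagnostics = []
    try:
        compile(source, uri, "exec")
    except SyntaxError as exc:
        # LSP positions are zero-based; SyntaxError reports them one-based.
        line = max((exc.lineno or 1) - 1, 0)
        column = max((exc.offset or 1) - 1, 0)
        diagnostics.append({
            "range": {
                "start": {"line": line, "character": column},
                "end": {"line": line, "character": column + 1},
            },
            "severity": 1,  # 1 means Error in the LSP specification
            "source": "compile()",
            "message": exc.msg,
        })
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "textDocument/publishDiagnostics",
        "params": {"uri": uri, "diagnostics": diagnostics},
    })
```

An empty `diagnostics` list is still sent for a valid file, since that is how a server tells the client to clear markers it reported earlier.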
Second step: the Language Client
Monaco is initially part of VSCode. Most existing LSP client parts for Monaco are thus initially meant for VSCode, so you will need to use a bit of nodejs and npm.
There exist a lot of modules to link monaco and an LSP client, some with vscode, some not - and it becomes very time-consuming to get this sorted out.
Here is a list of modules I used and finally got to work:
@codingame/monaco-languageclient
@codingame/monaco-jsonrpc
normalize-url
reconnecting-websocket
Using server-side javascript on a browser
Now, the neat part: node modules are server-side javascript. Which means, you can't use them within a browser directly (see It is not possible to use RequireJS on the client (browser) to access files from node_modules.).
You need to use a build tool, like browserify to transpile your node modules to be usable by a browser:
.../node_modules/@codingame/monaco-languageclient/lib$ browserify monaco-language-client.js monaco-services.js connection.js ../../monaco-jsonrpc/lib/connection.js -r ./vscode-compatibility.js:vscode > monaco-jsonrpc-languageclient.js
This will create a file named monaco-jsonrpc-languageclient.js, which we will use as a bundle for both monaco-languageclient and monaco-jsonrpc.
Notes:
The -r ./vscode-compatibility.js:vscode tells browserify to use the vscode-compatibility.js file for every dependency to the vscode module (see Unresolved dependencies to 'vscode' when wrapping monaco-languageclient).
We browserify these modules as a single bundle, to avoid multiple inclusions of some dependencies.
Now that you have a browser-compatible javascript file, you need to make needed components visible (ie. export them as window properties).
In monaco-jsonrpc-languageclient.js, search for the places where MonacoLanguageClient, createConnection, MonacoServices, listen, ErrorAction, and CloseAction are exported. There, add a line to globally export them:
(...)
exports.MonacoLanguageClient = MonacoLanguageClient;
window.MonacoLanguageClient = MonacoLanguageClient; // Add this line
(...)
exports.createConnection = createConnection;
window.createConnection = createConnection; // Add this line
(...)
(MonacoServices = exports.MonacoServices || (exports.MonacoServices = {}));
window.MonacoServices = MonacoServices; // Add this line
(...)
etc.
Do the same operation for normalize-url:
.../node_modules/normalize-url/lib$ browserify index.js > normalizeurl.js
In normalizeurl.js, search for the place where normalizeUrl is exported. There (or, as default, at the end of the file), add a line to globally export it:
window.normalizeUrl = normalizeUrl;
And you can do the same operation for reconnecting-websocket, or use the amd version that is shipped with the module.
Include monaco-jsonrpc-languageclient.js, normalizeurl.js and the browserified or AMD reconnecting-websocket module on your page.
For faster loading time, you can also minify them with a minifying tool (like uglify-js).
Finally, we can create and connect the client:
// From https://github.com/TypeFox/monaco-languageclient/blob/master/example/src/client.ts
/* --------------------------------------------------------------------------------------------
* Copyright (c) 2018 TypeFox GmbH (http://www.typefox.io). All rights reserved.
* Licensed under the MIT License. See License.txt in the project root for license information.
* ------------------------------------------------------------------------------------------ */
MonacoServices.install(monaco); // This is what links everything with your monaco editors.
var url = 'ws://localhost:3000/python';
// Create the web socket.
var webSocket = new ReconnectingWebSocket(normalizeUrl(url), [], {
    maxReconnectionDelay: 10000,
    minReconnectionDelay: 1000,
    reconnectionDelayGrowFactor: 1.3,
    connectionTimeout: 10000,
    maxRetries: Infinity,
    debug: false
});
// Listen when the web socket is opened.
listen({
    webSocket,
    onConnection: function(connection) {
        // create and start the language client
        var languageClient = new MonacoLanguageClient({
            name: 'Python Language Client',
            clientOptions: {
                // use a language id as a document selector
                documentSelector: ['python'],
                // disable the default error handler
                errorHandler: {
                    error: () => ErrorAction.Continue,
                    closed: () => CloseAction.DoNotRestart
                }
            },
            // create a language client connection from the JSON RPC connection on demand
            connectionProvider: {
                get: (errorHandler, closeHandler) => {
                    return Promise.resolve(createConnection(connection, errorHandler, closeHandler));
                }
            }
        });
        var disposable = languageClient.start();
        connection.onClose(() => disposable.dispose());
    }
});

Flask: How to avoid generate any kind of answer for a specific URL

I am programming a home web server for home automation. Several times I have seen 'bots' scanning the ports of my server. To avoid giving any sign of activity to undesired scans, I'm trying not to generate any kind of answer for specific URLs, like '/', i.e. to configure a silent mode for the typically scanned URLs.
I've tried empty .route decorators, error handlers and empty pages, but all of them generated some kind of response.
Is that possible in Flask with Python?
Any workaround?
Thanks

What I would suggest is to return a custom error code for the URLs that are being scanned, like HTTP_410_GONE.
From: http://www.flaskapi.org/api-guide/status-codes/
from flask_api import status

@app.route('/')
def empty_view():
    content = {'please move along': 'nothing to see here'}
    return content, status.HTTP_410_GONE
Put nginx in front of your flask app and use a fail2ban config to watch for this error code and start banning ips that are constantly hitting these urls.
From: https://github.com/mikechau/fail2ban-configs/blob/master/filter.d/nginx-404.conf
# Fail2Ban configuration file
[Definition]
failregex = <HOST> - - \[.*\] "(GET|POST).*HTTP.* 410
ignoreregex =
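Before relying on the filter, it can help to check the pattern against a few access-log lines. The sketch below does that in Python under two assumptions: fail2ban's `<HOST>` placeholder is approximated by a plain non-space group (fail2ban expands it to its own address pattern), and the sample log lines in the usage note are made up in nginx's default combined format.

```python
import re

FAILREGEX = r'<HOST> - - \[.*\] "(GET|POST).*HTTP.* 410'


def banned_host(log_line):
    """Return the client address if the line matches the filter, else None."""
    # fail2ban expands <HOST> itself; a non-space group stands in for it here.
    pattern = FAILREGEX.replace("<HOST>", r"(?P<host>\S+)")
    match = re.match(pattern, log_line)
    return match.group("host") if match else None
```

With the made-up line `203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 410 0 "-" "curl/8.0"` the function returns the client address, and it returns None for the same request answered with status 200. Note that the pattern only looks for " 410" somewhere after "HTTP", so a 200 response whose byte count happens to be 410 would match as well. Anchoring the status code right after the closing quote of the request would tighten it.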

I had exactly the same need as you and decided to just return from the 404 and 500 handlers:
@staticmethod
@bottle.error(404)
def error404(error):
    log.warning("404 Not Found from {0} @ {1}".format(bottle.request.remote_addr, bottle.request.url))

@staticmethod
@bottle.error(500)
def error500(error):
    log.warning("500 Internal Error from {0} @ {1}".format(bottle.request.remote_addr, bottle.request.url))
The example is for bottle, but you can adapt it to Flask.
Besides catching the obvious 404, I decided to do the same with 500, so that no traceback information is provided if an unexpected call from the scanners crashes my script.

Restricting access to private file downloads in Django

I have multiple FileFields in my django app, which can belong to different users.
I am looking for a good way to restrict access to files for user who aren't the owner of the file.
What is the best way to achieve this? Any ideas?

Unfortunately, @Mikko's solution cannot actually work in a production environment, since Django is not designed to serve files. In a production environment, files need to be served by your HTTP server (e.g. Apache, nginx) and not by your application/Django server (e.g. uWSGI, Gunicorn, mod_wsgi).
That's why restricting file access is not very easy: you need a way for your HTTP server to ask the application server whether it is OK to serve a file to the specific user requesting it. As you can understand, this requires modifications to both your application and your HTTP server.
The best solution to the above problem is django-sendfile (https://github.com/johnsensible/django-sendfile), which uses the X-SendFile mechanism to implement the above. I'm copying from the project's description:
This is a wrapper around web-server specific methods for sending files to web clients. This is useful when Django needs to check permissions associated files, but does not want to serve the actual bytes of the file itself. i.e. as serving large files is not what Django is made for.
To understand more about the sendfile mechanism, please read this answer: Django - Understanding X-Sendfile
2018 Update: Please notice that django-sendfile does not seem to be maintained anymore. It should probably still work, but if you want a more modern package with similar functionality, take a look at https://github.com/edoburu/django-private-storage, as commenter @surfer190 proposes. In particular, make sure that you implement the "Optimizing large file transfers" section; you actually need this for all transfers, not only for large files.
2021 Update: I'm returning to this answer to point out that although it hasn't been updated for like 4 years, the django-sendfile project still works great with the current Django version (3.2) and I'm actually using it for all my projects that require that particular functionality! There is also an actively-maintained fork now, django-sendfile2, which has improved Python 3 support and more extensive documentation.

If you need just moderate security, my approach would be the following:
1) When the user uploads the file, generate a hard-to-guess path for it. For example, you can create a folder with a randomly generated name for each uploaded file in your /static folder. You can do this pretty simply using this sample code:
file_path = "/static/" + os.urandom(32).hex() + "/" + file_name
In this way it will be very hard to guess where other users' files are stored.
2) In the database link the owner to the file. An example schema can be:
uploads(id, user_id, file_path)
3) Use a property for your FileFields in the model to restrict access to the file, in this way:
class YourModel(models.Model):
    _secret_file = models.FileField()

    def get_secret_file(self):
        # check in db if the user owns the file
        if True:
            return self._secret_file
        else:
            return None  # or something meaningful depending on your app

    secret_file = property(get_secret_file)

This is best handled by the server, e.g. nginx secure link module (nginx must be compiled with --with-http_secure_link_module)
Example from the documentation:
location /some-url/ {
    secure_link $arg_md5,$arg_expires;
    secure_link_md5 "$secure_link_expires$uri$remote_addr some-secret";
    if ($secure_link = "") {
        return 403;
    }
    if ($secure_link = "0") {
        return 410;
    }
    if ($secure_link = "1") {
        # authorised...
    }
}
The file would be accessed like:
/some-url/some-file?md5=_e4Nc3iduzkWRm01TBBNYw&expires=2147483647
(This would be both time-limited and bound to the user at that IP address).
Generating the token to pass to the user would use something like:
echo -n 'timestamp/some-url/some-file127.0.0.1 some-secret' | \
openssl md5 -binary | openssl base64 | tr +/ -_ | tr -d =
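Since the token is normally generated inside the Django application rather than in a shell, here is a sketch of the same computation in Python. It assumes the `secure_link_md5` expression from the config above (expiry, URI, client address, a space, then the secret). The function names are illustrative.

```python
import base64
import hashlib


def secure_link_token(expires, uri, remote_addr, secret):
    """Mirror the shell pipeline: MD5, then URL-safe base64 without '=' padding."""
    raw = "{0}{1}{2} {3}".format(expires, uri, remote_addr, secret)
    digest = hashlib.md5(raw.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def secure_link_url(expires, uri, remote_addr, secret):
    """Build the link handed to the user, with md5 and expires arguments."""
    token = secure_link_token(expires, uri, remote_addr, secret)
    return "{0}?md5={1}&expires={2}".format(uri, token, expires)
```

Here `expires` would be a Unix timestamp such as `int(time.time()) + 3600`, and `remote_addr` the address nginx will see for that user. If the two differ, for example behind a proxy, the check fails.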

Generally, you should not serve private files through the normal static file serving of Apache, Nginx or whatever web server you are using. Instead, write a custom Django view which handles the permission checking and then returns the file as a streaming download.
Make sure the files are in a special private folder and not exposed through Django's MEDIA_URL or STATIC_URL.
Write a view which will:
Check that the user has access to the file in your view logic
Open the file with Python's open()
Return HTTP response which gets the file's handle as the parameter http.HttpResponse(_file, content_type="text/plain")
For example see download() here.

For those who use Nginx as the web server that serves the file, 'X-Accel-Redirect' is a good choice.
First, the request for the file comes to Django. After authentication and authorization, Django redirects internally to Nginx with 'X-Accel-Redirect'. More about this header: X-Accel-Redirect
The request comes to Django and will be checked like below:
if user_has_right_permission:
    response = HttpResponse()
    # Let nginx guess the correct file MIME type by leaving the
    # header below empty; otherwise all files are served as
    # plain text
    response['Content-Type'] = ''
    response['X-Accel-Redirect'] = path_to_file
    return response
else:
    raise PermissionDenied()
If the user has the right permission, it redirects to Nginx to serve the file.
The Nginx config is like this:
server {
    listen 81;
    listen [::]:81;
    ...
    location /media/ {
        internal;  # can be accessed only internally
        alias /app/media/;
    }
    ...
}
Note: path_to_file must start with "/media/" so that Nginx serves it through the location above.

Google Glass callbackUrl POST from Mirror API is empty?

Apologies, because the only web development I know is of the django/python kind and I am probably guilty of mixing my code idioms (REST vs. the django URL dispatch workflow).
I have a URL handler which serves as the callbackUrl of a subscription for my Glassware. I am getting a POST to the handler, but the request object seems empty.
I am sure I am understanding this wrong, but can someone point me in the direction of getting the "REPLY" information from a POST notification to a callbackUrl?
My URL Handler is
class A600Handler(webapp2.RequestHandler):
    def post(self):
        """Process the value of A600 received and return a plot"""
        # I am seeing this in my logs proving that I am getting a POST when glass replies
        logging.info("Received POST to logA600")
        # This is returning None
        my_collection = self.request.get("collection")
        logging.info(my_collection)
        # I also tried this but self.request.POST is empty '[]' and of type UnicodeMultiDict
        # json_request_data = json.loads(self.request.POST)

    @util.auth_required
    def get(self):
        """Process the value of A600 received and return a plot"""
        logging.info("Received GET to this logA600")
I have the following URL Handler defined and can verify that the post function is getting a "ping" when the user hits reply by looking at the app-engine logs.
MAIN_ROUTES = [
('/', MainHandler),('/logA600',A600Handler),
]
How do I extract the payload, i.e. the voice-transcribed text sent by the user? I do not understand the "parse_notification" example given in the docs.

Did you try request.body? The docs for request.POST state
"If you need to access raw or non-form data posted in the request, access this through the HttpRequest.body attribute instead."
If the API isn't using form data in its post, you'll likely find the contents in request.body. The docs to which you linked indicate that the content will be placed as JSON in the body instead of form data ("containing a JSON request body"). I would try json.loads(request.body).
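As a minimal sketch of that suggestion, the function below decodes a JSON request body and pulls out a few fields. In the handler from the question it would be called with `self.request.body`. The field names (`collection`, `itemId`, `userActions`) follow the notification format as I understand it from the Mirror API documentation and should be treated as assumptions. The function name and the returned dictionary are purely illustrative.

```python
import json


def parse_notification(body):
    """Decode a notification request body into the fields a handler needs."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    payload = json.loads(body)
    # Each user action is expected to look like {"type": "REPLY"}.
    actions = [action.get("type") for action in payload.get("userActions", [])]
    return {
        "collection": payload.get("collection"),
        "item_id": payload.get("itemId"),
        "is_reply": "REPLY" in actions,
    }
```

As far as the documentation describes it, the notification carries only the ID of the timeline item, so the transcribed text would then have to be fetched from the timeline item identified by `item_id`.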

I am also having this issue: the Mirror API calls my application for notifications, and those notifications are empty. My app runs on Tomcat, so it's a Java stack. All the samples process the notification like this:
BufferedReader notificationReader = new BufferedReader(
        new InputStreamReader(request.getInputStream()));
String notificationString = "";
// Count the lines as a very basic way to prevent Denial of Service
// attacks
int lines = 0;
while (notificationReader.ready()) {
    notificationString += notificationReader.readLine();
    lines++;
    // No notification would ever be this long. Something is very wrong.
    if (lines > 1000) {
        throw new IOException(
                "Attempted to parse notification payload that was unexpectedly long.");
    }
}
log.info("got raw notification " + notificationString);
For me this always logs an empty string. Since a notification URL must be https, and for testing I could not use an IP address, I have set up a dyndns service to point to my service running on localhost:8080. This all seems to work, but I suspect that dyndns does some type of forward or redirect in which the POST data is removed.
How can I work around this for local development?
Updated:
Solved for me.
I found that closing the response before reading the request caused the issue: request.inputStream was already closed. Moving this
response.setContentType("text/html");
Writer writer = response.getWriter();
writer.append("OK");
writer.close();
to after I had fully read the request notification into a String solved the issue.
