Elasticsearch Data Insertion with Python

I'm brand new to using the Elastic Stack so excuse my lack of knowledge on the subject. I'm running the Elastic Stack on a Windows 10, corporate work computer. I have Git Bash installed for a bash cli, and I can successfully launch the entire Elastic Stack. My task is to take log data that is stored in one of our databases and display it on a Kibana dashboard.
From what my team and I have reasoned, I don't need to use Logstash, because the database that the logs are sent to is effectively our 'log stash', so using the Logstash service would be redundant. I found this nifty diagram
on freeCodeCamp, and from what I gather, Logstash is just the intermediary for log retrieval from different services. So instead of using Logstash, since the log data is already in a database, I could just do something like this:
USER ---> KIBANA <---> ELASTICSEARCH <--- My Python Script <--- [DATABASE]
My Python script successfully calls our database and retrieves the data, and I have a function that molds the data into a dict object (as I understand it, Elasticsearch takes data in a JSON format).
Now I want to insert all of that data into Elasticsearch - I've been reading the Elastic docs, and there's a lot of talk about indexing that isn't really indexing, and I haven't found any API calls I can use to plug the data right into Elasticsearch. All of the documentation I've found so far concerns the use of Logstash, but since I'm not using Logstash, I'm kind of at a loss here.
If there's anyone who can help me out and point me in the right direction I'd appreciate it. Thanks
-Dan

You ingest data into Elasticsearch using the Index API; it is basically a request using the PUT method.
To do that with Python you can use elasticsearch-py, the official Python client for Elasticsearch.
But sometimes what you need is easier to do with Logstash, since it can extract the data from your database, transform it using its many filters, and send it to Elasticsearch.
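As a minimal sketch of feeding those dicts to the Index/Bulk API, assuming the rows are already dict objects (the index name "app-logs" and the field names are illustrative, not from the question):

```python
def to_bulk_actions(rows, index_name="app-logs"):
    """Yield bulk-API actions (one per database row) for elasticsearch.helpers.bulk."""
    for row in rows:
        yield {"_index": index_name, "_source": row}

# Stand-ins for rows returned by the database query
rows = [
    {"level": "INFO", "message": "service started"},
    {"level": "ERROR", "message": "db timeout"},
]
actions = list(to_bulk_actions(rows))

# With a running cluster you would then send everything in one call:
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("http://localhost:9200")
#   helpers.bulk(es, to_bulk_actions(rows))
```

Using the bulk helper avoids one HTTP round-trip per document, which matters once you are loading a whole log table.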

Related

I want to get a stream object from Azure Inheritance Iterator ItemPaged - ItemPaged[TableEntity] to Stream (Python). Is it possible?

https://learn.microsoft.com/en-us/python/api/azure-core/azure.core.paging.itempaged?view=azure-python
Updated 11.08.2021
I have a working process to back up Azure Tables to Azure Blob - Current process to backup Azure Tables. But I want to improve this process and I am considering different options. I am trying to get a stream from Azure Tables to use with create_blob_from_stream.
I assume you want to stream bytes from the HTTP response, and not use the iterator of objects you receive.
Each API in the SDK supports a keyword argument called raw_response_hook that gives you access to the HTTP response object and lets you use a stream download API if you want to. Note that since the payload is considered to represent objects, it will be pre-loaded in memory no matter what, but you can still use a stream syntax nonetheless.
The callback takes a single parameter:
def response_callback(response):
    # Do something with the response
    requests_response = response.internal_response
    # Use the "requests" API now
    for chunk in requests_response.iter_content():
        work_with_chunk(chunk)
Note that this is pretty advanced; you may encounter difficulties, and it might not fit what you want precisely. We are working on a new pattern in the SDK to simplify complex scenarios like this, but it has not shipped yet. You would be able to send and receive raw requests using a send_request method, which gives you absolute control over every aspect of the query, such as asking only for streaming (no pre-load into memory) or disabling deserialization by default.
Feel free to open an issue on the Azure SDK for Python repo if you have additional questions or clarification: https://github.com/Azure/azure-sdk-for-python/issues
Edit with new suggestions: TableEntity is a dict-like class, so you can use json.dumps to get a string, or json.dump to write to a stream, while iterating the ItemPaged[TableEntity]. If the JSON dump raises an exception, you can try our JSON encoder in azure.core.serialization.AzureJSONEncoder: https://github.com/Azure/azure-sdk-for-python/blob/1ffb583d57347257159638ae5f71fa85d14c2366/sdk/core/azure-core/tests/test_serialization.py#L83
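The json.dump suggestion can be sketched with plain dicts standing in for TableEntity (which is dict-like); the entity fields below are illustrative:

```python
import io
import json

def dump_entities(entities, stream):
    """Write each entity to the stream as one JSON line, never building one giant string."""
    for entity in entities:
        json.dump(entity, stream)  # works because TableEntity behaves like a dict
        stream.write("\n")

buf = io.StringIO()
dump_entities([{"PartitionKey": "p1", "RowKey": "r1", "value": 42}], buf)
```

In real code you would pass the ItemPaged[TableEntity] iterator directly as `entities`, and fall back to `cls=AzureJSONEncoder` in json.dump if plain serialization raises.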
(I work at MS in the Azure SDK for Python team.)
Ref:
https://docs.python-requests.org/en/master/api/#requests.Response.iter_content
https://azuresdkdocs.blob.core.windows.net/$web/python/azure-core/1.17.0/azure.core.pipeline.policies.html#azure.core.pipeline.policies.CustomHookPolicy

Python vs. Node.js Event Payloads in Firebase Cloud Functions

I am in the process of writing a Cloud Function for Firebase via the Python option. I am interested in Firebase Realtime Database triggers; in other words, I want to listen to events that happen in my Realtime Database.
The Python environment provides the following signature for handling Realtime Database triggers:
def handleEvent(data, context):
    # Triggered by a change to a Firebase RTDB reference.
    # Args:
    #   data (dict): The event payload.
    #   context (google.cloud.functions.Context): Metadata for the event.
This is looking good. The data parameter contains two dictionaries: 'data' holding the value before the change and 'delta' holding the changed bits.
The confusion kicks in when comparing this signature with the Node.js environment. Here is a similar signature from the Node.js world:
exports.handleEvent = functions.database.ref('/path/{objectId}/').onWrite((change, context) => {});
In this signature, the change parameter is pretty powerful and it seems to be of type firebase.database.DataSnapshot. It has nice helper methods such as hasChild() or numChildren() that provide information about the changed object.
The question is: Does Python environment have a similar DataSnapshot object? With Python, do I have to query the database to get the number of children for example? It really isn't clear what Python environment can and can't do.
Related API/Reference/Documentation:
Firebase Realtime DB Triggers: https://cloud.google.com/functions/docs/calling/realtime-database
DataSnapshot Reference: https://firebase.google.com/docs/reference/js/firebase.database.DataSnapshot
The python runtime currently doesn't have a similar object structure. The firebase-functions SDK is actually doing a lot of work for you in creating objects that are easy to consume. Nothing similar is happening in the python environment. You are essentially getting a pretty raw view at the payload of data contained by the event that triggered your function.
If you write Realtime Database triggers for node, not using the Firebase SDK, it will be a similar situation. You'll get a really basic object with properties similar to the python dictionary.
This is the reason why use of firebase-functions along with the Firebase SDK is the preferred environment for writing triggers from Firebase products. The developer experience is superior: it does a bunch of convenient work for you. The downside is that you have to pay for the cost of the Firebase Admin SDK to load and initialize on cold start.
Note that it might be possible for you to parse the event and create your own convenience objects using the Firebase Admin SDK for Python.
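Since there is no DataSnapshot in the Python runtime, things like child counts come from inspecting the raw payload dict yourself. A hedged sketch (the 'data'/'delta' keys follow the RTDB trigger docs; the counting logic is an illustrative stand-in for numChildren()):

```python
def handle_event(data, context):
    """Background RTDB trigger: data is the raw event payload dict."""
    before = data.get("data")   # value before the write
    after = data.get("delta")   # the changed portion
    # No hasChild()/numChildren() helpers here; inspect the dict directly.
    num_children = len(after) if isinstance(after, dict) else 0
    return num_children
```

Anything the Node.js change object computes for you has to be re-derived this way from the plain payload.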

How to retrieve all previous builds for a Jenkins job through the API?

I'm building a python script to pull build history data for Jenkins jobs. I've been successful with this using the Requests library to retrieve the json output, feed into a dataframe, and report on.
I'm noticing it's only pulling the last 100 builds, which looks like the default. I'm testing with a basic curl call, which works fine for retrieving the last 100, to see how I can retrieve all builds. I've been searching Google and found one post that said to add fetch_all_builds=True, but that still only pulls 100.
Does anyone know how I can request all the builds from a job through an API call?
Thanks
Adding tree=allBuilds will give you what you want. The API call URL looks like this:
<JENKINS URL>/job/<Job Name>/api/json?tree=allBuilds[*]&depth=2
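With the Requests library already used in the question, the call can be sketched like this (the host, job name, and selected build fields are placeholders, not values from the question):

```python
JENKINS_URL = "http://jenkins.example.com"
JOB = "my-job"

# tree=allBuilds[...] bypasses the default 100-build cap of the json API.
api_url = f"{JENKINS_URL}/job/{JOB}/api/json"
params = {"tree": "allBuilds[number,result,timestamp,duration]"}

# With credentials in hand you would then fetch:
#   import requests
#   builds = requests.get(api_url, params=params, auth=(user, token)).json()["allBuilds"]
```

Naming explicit fields inside allBuilds[...] instead of [*] keeps the response small when a job has thousands of builds.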

How to expose Python api for elasticsearch

I have inserted a large amount of data (1 million documents) into Elasticsearch. Now I want to create a REST API to fetch the data from Elasticsearch.
I want to use curl commands
(eg: curl -i http://localhost:5000/todo/api/v1.0/tasks/2)
to get the JSON document with _id=2.
I found the following blog https://blog.miguelgrinberg.com/post/designing-a-restful-api-with-python-and-flask
that helped me on how to create a REST API, but I am not able to understand how to extend this for Elasticsearch.
The Elasticsearch Python API is very convenient for any kind of operation (inserting or fetching). You can find the docs here:
https://elasticsearch-py.readthedocs.io/en/master/
Just one hint: in my experience the Python API tended to be slower than issuing direct curl requests from the command line. Anyhow, it is very convenient to work with. A lookup is as easy as the following snippet (note that es.index() inserts a document; use es.get() or es.search() to fetch one):
from elasticsearch import Elasticsearch

es = Elasticsearch()
res = es.get(index="index-logstash", id=2)
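A hedged sketch of the Flask side, following the URL scheme from the question's curl example; the index name "tasks" is an assumption, and the Elasticsearch lookup is stubbed out so the route runs standalone:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def fetch_task(task_id):
    # In a real app this would query Elasticsearch, e.g.:
    #   from elasticsearch import Elasticsearch
    #   es = Elasticsearch("http://localhost:9200")
    #   return es.get(index="tasks", id=task_id)["_source"]
    return {"id": task_id, "title": "stub task"}  # stand-in document

@app.route("/todo/api/v1.0/tasks/<task_id>")
def get_task(task_id):
    # The URL segment becomes the Elasticsearch _id to look up.
    return jsonify(fetch_task(task_id))
```

Once the stub is replaced with the real es.get() call, curl -i http://localhost:5000/todo/api/v1.0/tasks/2 returns the document with _id=2.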

A Realtime Dashboard For Logs

We have a number of Python services, many of which use Nginx as a reverse proxy. Right now, we examine requests in real time by tailing the logs found in /var/log/nginx/access.log. I want to make these logs publicly readable in aggregate on a webserver so people don't have to SSH into individual machines.
Our current infrastructure has fluentd (a tool similar to logstash I'm told) sending logs up to a centralized stats server, which has Elasticsearch and kibana installed, with the idea being that kibana would serve as the frontend for viewing our logs.
I know nothing about these services. If I wanted to view our logs in realtime, would this stack even be feasible? Can Elasticsearch provide realtime data with a mere second's delay? Does Kibana have out-of-the-box functionality for automatically updating the page as new log data comes in (i.e., does it have a socket connection with Elasticsearch)? Or am I falling into the wrong toolset?
Kibana is just an interface on top of Elasticsearch. It talks directly to Elasticsearch, so the data in it is as realtime as the data you are feeding into Elasticsearch. In other words, it's only as good as your collectors (fluentd in your case).
It works by having you define time series, which it uses to query data from Elasticsearch; you can then have it continuously search for keywords and visualize that data.
If by "realtime" you mean that you want the graphs to move/animate, this is also possible (it's called "streaming dashboards"); but that's not the real power of Kibana. The real power is a very rich query engine: drill down into time series, do calculations (top X over period Y).
If all you want is a nice visual/moving thing to put on the wall TV, this is possible with Kibana, but keep in mind you'll have to store everything in Elasticsearch, so unless you plan on doing some other analysis, you'll have to adjust your configuration. For example, set a really short TTL for the messages so that once they are visualized they are no longer available, or filter fluentd to only send across those events that you want to plot. Otherwise you'll have a disk space problem.
If that is all you want, it would be easier to grab some JavaScript charting library and use that in your portal.
I have the "access.log (or other logs) - logstash (or other ES indexer) - Kibana" pipeline set up for a number of services and logs, and it works well. In our case it has more than a second of delay, but that's because of buffering in the logs or the ES indexer, not because of Kibana/ES itself.
You can setup Kibana to show only the last X minutes of data and refresh every Y seconds, which gives a decent real-time feel - and looks good on TVs ;)
Keep in mind that Kibana can sometimes issue pretty bad queries which can crash your ES cluster (although this seems to have vastly improved in more recent ES and Kibana versions), so do not rely on this as a permanent data store for your logs, and do not share the ES cluster you use for Kibana with apps that have stronger HA requirements.
As Burhan Khalid pointed out, this setup also gives us the ability to drill down and study specific patterns in details, which is super useful ("What's this spike on this graph?" - zoom in, add a couple filters, look at a few example log lines, filter again - mystery solved). I think saving us from having to dig somewhere else to get more details when we see something weird is actually the best part of this solution.
