I have three separate indexes in Elasticsearch, each storing data every second. I would like to combine the data from all three indexes at @timestamp T and return a single document.
How do I achieve this? Is there a query I can write?
I have been reading about denormalization. Do I have to write a script, in something like Python, and create a new document in a new index with the combined data for every unique @timestamp? If so, how does this script get executed to ensure the index is always up to date? Manually? Cron? Triggered by Elasticsearch when a condition is met?
I am new to Elasticsearch, so any help and sample code is greatly appreciated.
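One way to approach this without denormalizing first is to query all three indexes in a single search and merge the hits in Python. A minimal sketch, assuming the field is @timestamp, the indexes are named index-a/index-b/index-c, and the official elasticsearch-py client is available (names and URL are placeholders, not taken from your setup):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
T = "2023-01-01T00:00:00Z"   # the timestamp you want to combine on

# One search across all three indexes, filtered to the exact timestamp T.
resp = es.search(
    index="index-a,index-b,index-c",
    body={"query": {"term": {"@timestamp": T}}},
)

# Merge the fields of every hit (ideally one per index) into a single document.
combined = {}
for hit in resp["hits"]["hits"]:
    combined.update(hit["_source"])

print(combined)

If you do need a persisted combined index, a small script like this could run from cron and write combined into a new index, but for ad-hoc reads the merge-at-query-time approach avoids keeping a second index in sync.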
I am new to DynamoDB and want to compare the values of a Python list with an attribute value in a DynamoDB table.
I am able to compare a single value by using a query with an index key:
response = dynamotable.query(
    IndexName='Classicmovies',
    KeyConditionExpression=Key('DDT').eq('BBB-rrr-jjj-mq'))
but I want to compare an entire list, which would go into .eq as follows:
movies = ['ddd-dddss-gdgdg', 'kkdf-dfdfd-www', 'dfw-gddf-gssg']
I have searched a lot and am not able to figure out the right way.
Hard to say what you are trying to do. A query will only retrieve a bunch of records belonging to a single item collection. Maybe what you need is a scan, but please avoid heavy use of scans unless it's for maintenance purposes.
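A key condition can only test one value at a time, so one workaround is to issue one query per value in the list and collect the results. A rough sketch, assuming the same dynamotable resource and 'Classicmovies' index from the question (a scan with a FilterExpression such as Attr('DDT').is_in(movies) is the alternative, with the usual scan-cost caveats):

from boto3.dynamodb.conditions import Key

movies = ['ddd-dddss-gdgdg', 'kkdf-dfdfd-www', 'dfw-gddf-gssg']

items = []
for value in movies:
    # One query per list element against the secondary index.
    response = dynamotable.query(
        IndexName='Classicmovies',
        KeyConditionExpression=Key('DDT').eq(value))
    items.extend(response['Items'])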
I am new to Python and, of course, MySQL. I recently created a Python function that generates a list of values that I want to insert into a table (2 columns) in MySQL based on their specification.
Is it possible to create a procedure that can take a list of values that I'm sending through Python, check if these values are already in one of my two columns, and:
if they are already in the second one, don't return them;
if they are in the first one, return all that are contained there;
if they are in neither of them, return them with some kind of flag so I can handle them through Python and insert them into the correct table.
EXTRA EXPLANATION
Let me try to explain what I want to achieve so maybe you can give me a push and help me out. First, I get a list of CPE items like this ("cpe:/a:apache:iotdb:0.9.0") in Python, and my goal is to save them into a database where the CPEs related to IoT are differentiated from the generic ones and saved in different tables or columns. The distinction will be made by user input for each and every item, but only once per item, so after parsing all items in Python I want to first check in the database whether they already exist in one of the tables or columns.
So for each and every list item that I pass, I want to query MySQL and (see the sketch after this list):
if it exists in the non-IoT column already, don't return anything;
if it exists in the IoT column already, return the item;
if it does not exist anywhere, also return the item, so I can get user input in Python to verify whether it is an IoT item or not and insert it into the database after that.
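To make the idea concrete, here is a rough sketch of the lookup done from the Python side rather than in a stored procedure. The table names iot_cpe and non_iot_cpe, the cpe column, and the connection details are assumptions, not your actual schema:

import MySQLdb  # assumption: the MySQLdb driver; any DB-API driver works the same way

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="cpedb")
cur = conn.cursor()

def classify(cpe_items):
    already_iot = []   # found in the IoT table: return these
    to_ask_user = []   # found nowhere: flag these for user input
    for cpe in cpe_items:
        cur.execute("SELECT 1 FROM non_iot_cpe WHERE cpe = %s", (cpe,))
        if cur.fetchone():
            continue                      # known non-IoT: return nothing
        cur.execute("SELECT 1 FROM iot_cpe WHERE cpe = %s", (cpe,))
        if cur.fetchone():
            already_iot.append(cpe)       # known IoT: return the item
        else:
            to_ask_user.append(cpe)       # unknown: ask the user, then insert
    return already_iot, to_ask_user

A stored procedure could wrap the same two SELECTs, but doing it in Python keeps the flag handling in one place.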
I think you could use a library called pandas.
I don't know if it is the best solution, but it could work.
Export what you have in SQL into pandas, or just query the SQL database using pandas.
Check out this library, it's really helpful for exploring data sets.
https://pandas.pydata.org/
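As a concrete starting point, pandas can pull a table straight into a DataFrame. A minimal sketch, assuming a SQLAlchemy connection string and the hypothetical iot_cpe table from above:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details and table; replace with your own.
engine = create_engine("mysql+pymysql://user:secret@localhost/cpedb")
df = pd.read_sql("SELECT cpe FROM iot_cpe", engine)

# Membership checks then become simple vectorised operations.
print(df["cpe"].isin(["cpe:/a:apache:iotdb:0.9.0"]).any())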
I am pretty new to Elasticsearch, so please forgive me if I am asking a very simple question.
In my workplace we have a proper ELK setup.
Due to the very large volume of data, we only store 14 days of it, and my question is: how can I read the data into Python and later store my analysis in some NoSQL database?
As of now, my primary goal is to read the raw data from the Elasticsearch cluster into Python as a DataFrame or in any other format.
I want to get it for different time intervals like 1 day, 1 week, 1 month, etc.
I have been struggling with this for the last week.
You can use the code below to achieve that:
# Create a DataFrame object
from pandasticsearch import DataFrame
df = DataFrame.from_es(url='http://localhost:9200', index='indexname')
To get the schema of your index:
df.print_schema()
After that, you can perform general DataFrame operations on the df.
If you want to parse the raw search result yourself, do the following:
from elasticsearch import Elasticsearch
es = Elasticsearch('http://localhost:9200')
result_dict = es.search(index="indexname", body={"query": {"match_all": {}}})
and then finally load everything into your final DataFrame:
from pandasticsearch import Select
pandas_df = Select.from_dict(result_dict).to_pandas()
I hope it helps.
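Since you want specific time windows rather than everything, you can swap the match_all above for a range query. A sketch, assuming the documents carry the usual @timestamp field:

from elasticsearch import Elasticsearch
from pandasticsearch import Select

es = Elasticsearch('http://localhost:9200')

# Last 1 day; use "now-7d/d" or "now-30d/d" for a week or a month.
body = {"query": {"range": {"@timestamp": {"gte": "now-1d/d", "lte": "now"}}},
        "size": 10000}
result_dict = es.search(index="indexname", body=body)
pandas_df = Select.from_dict(result_dict).to_pandas()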
It depends on how you want to read the data from Elasticsearch. Is it incremental reading, i.e. reading the new data that comes in every day, or is it a one-off bulk read? For the latter, you need to use the bulk API of Elasticsearch in Python, and for the former you can restrict yourself to a simple range query.
Schematic code for reading bulk data: https://gist.github.com/dpkshrma/04be6092eda6ae108bfc1ed820621130
How to use bulk API of ES:
How to use Bulk API to store the keywords in ES by using Python
https://elasticsearch-py.readthedocs.io/en/master/helpers.html#elasticsearch.helpers.bulk
How to use the range query for incremental inserts:
https://martinapugliese.github.io/python-for-(some)-elasticsearch-queries/
How to have Range and Match query in one elastic search query using python?
Since you want your data for different intervals, you will need to perform date aggregations as well.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
How to perform multiple aggregation on an object in Elasticsearch using Python?
Once you issue your Elasticsearch query, your data will be collected in a temporary variable; you can then use a Python library for a NoSQL database, such as PyMongo, to insert the Elasticsearch data into it.
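Putting those pieces together, here is a rough sketch that streams the matching documents out of Elasticsearch with the scan helper and writes them into MongoDB. The index name, the @timestamp field, and the Mongo database/collection names are assumptions:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
from pymongo import MongoClient

es = Elasticsearch('http://localhost:9200')
mongo = MongoClient('mongodb://localhost:27017')
target = mongo['analysis_db']['raw_logs']   # hypothetical database/collection

query = {"query": {"range": {"@timestamp": {"gte": "now-1d/d", "lte": "now"}}}}

batch = []
for hit in scan(es, index="indexname", query=query):
    batch.append(hit['_source'])
    if len(batch) >= 1000:                   # insert in chunks to limit memory use
        target.insert_many(batch)
        batch = []
if batch:
    target.insert_many(batch)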
I have an HTML file on the network which updates almost every minute with new rows in a table. At any point, the file contains close to 15,000 rows. I want to create a MySQL table with all the data in that table, plus some more columns that I compute from the available data.
The said HTML table contains, say, rows from the last 3 days. I want to store all of them in my MySQL table and update the table every hour or so (can this be done via cron?).
For connecting to the DB, I'm using MySQLdb, which works fine. However, I'm not sure what the best practices are. I can scrape the data using bs4 and connect to the table using MySQLdb. But how should I update the table? What scraping logic would use the fewest resources?
I am not fetching any results, just scraping and writing.
Any pointers, please?
My suggestion is: instead of updating values row by row, try a bulk insert into a temporary table and then move the data into the actual table based on some timing key. If you have a key column, that will make it easy to read the most recent rows as you add them.
You can adopt the following approach:
For the purpose of this discussion, let master be the final destination of the scraped data.
Then we can adopt the following steps:
1. Scrape the data from the web page.
2. Store this scraped data in a temporary table within MySQL, say temp.
3. Perform an EXCEPT operation to pull out only those rows which exist within temp but not in master.
4. Persist the rows obtained in step 3 in the master table.
Please refer to this link for understanding how to perform SET operations in MySQL. Also, it would be advisable to place all this logic within a stored procedure and pass it the set of data to be processed (not sure if this part is possible in MySQL).
Adding one more step to the approach: based on the discussion below, we can use a timestamp-based column to determine the newest rows that need to be placed into the table. The above approach with SET-based operations works well in case there are no timestamp-based columns.
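If your MySQL version does not support EXCEPT, step 3 can be expressed as an INSERT ... SELECT with a LEFT JOIN anti-join instead. A sketch from the Python side with MySQLdb; the table names temp and master, the columns, and the unique row_key are assumptions:

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="scraper")
cur = conn.cursor()

# Placeholder rows; in practice these come from the bs4-parsed HTML table.
scraped_rows = [("key-1", "value-a", "value-b")]

# 1. Load the freshly scraped rows into the temporary table.
cur.executemany("INSERT INTO temp (row_key, col_a, col_b) VALUES (%s, %s, %s)",
                scraped_rows)

# 2. Copy over only the rows that are not yet in master.
cur.execute("""
    INSERT INTO master (row_key, col_a, col_b)
    SELECT t.row_key, t.col_a, t.col_b
    FROM temp t
    LEFT JOIN master m ON m.row_key = t.row_key
    WHERE m.row_key IS NULL
""")

# 3. Clear the temporary table for the next hourly run (e.g. from cron).
cur.execute("TRUNCATE TABLE temp")
conn.commit()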
My JSON documents (called "i") have subdocuments (called "elements").
I am looping through these subdocuments and updating them one at a time. However, to do so (once the value I need is computed), I have Mongo scan through all the documents in the database, then through all the subdocuments, to find the subdocument it needs to update.
I am having major time issues, as I have ~3000 documents and this is taking about 4 minutes.
I would like to know if there is a quicker way to do this, without Mongo having to scan all the documents, but by doing it within the loop.
Here is the code:
for i in db.stuff.find():
    for element in i['counts']:
        computed_value = element[a] + element[b]
        db.stuff.update({'id': i['id'], 'counts.timestamp': element['timestamp']},
                        {'$set': {'counts.$.total': computed_value}})
I am identifying the overall document by "id" and then the subdocument by its timestamp (which is unique to each subdocument). I need to find a quicker way than this. Thank you for your help.
What indexes do you have on your collection? This could probably be sped up by creating an index on your embedded documents. You can do this using dot notation -- there's a good explanation and example here.
In your case, since the update filters on id and counts.timestamp, you'd do something like
db.stuff.ensureIndex({ "id": 1, "counts.timestamp": 1 });
This will make your searches through embedded documents run much faster.
Your update is based on id (and I assume it is different from Mongo's default _id).
Put an index on your id field.
Do you want to set the new field for all documents in the collection, or only for documents matching given criteria? If only for matching documents, use a query filter (with an index if possible).
Don't fetch the full document; fetch only those fields which are being used.
What is your average document size? Use explain and mongostat to understand what the actual bottleneck is.
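Beyond indexing, the per-subdocument round trips are themselves a large part of the cost. A sketch of the same loop batched with PyMongo's bulk_write, assuming the same db.stuff collection and the id / counts.timestamp fields from the question:

from pymongo import UpdateOne

requests = []
for i in db.stuff.find({}, {'id': 1, 'counts': 1}):        # fetch only the needed fields
    for element in i['counts']:
        computed_value = element[a] + element[b]           # a and b as in the original snippet
        requests.append(UpdateOne(
            {'id': i['id'], 'counts.timestamp': element['timestamp']},
            {'$set': {'counts.$.total': computed_value}}))
        if len(requests) >= 500:                           # flush in batches
            db.stuff.bulk_write(requests, ordered=False)
            requests = []
if requests:
    db.stuff.bulk_write(requests, ordered=False)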