Reading Elasticsearch cluster data into a Python DataFrame - python

I am pretty new to Elasticsearch, so please forgive me if I am asking a very simple question.
At my workplace we have a proper ELK setup.
Due to the very large volume of data we only store 14 days of it, and my question is how I can read that data into Python and later store my analysis in some NoSQL database.
As of now my primary goal is to read the raw data from the Elasticsearch cluster into Python, as a DataFrame or in any other format.
I want to get it for different time intervals, like 1 day, 1 week, 1 month, etc.
I have been struggling with this for the last week.

You can use the code below to achieve that:
# Create a DataFrame object
from pandasticsearch import DataFrame
df = DataFrame.from_es(url='http://localhost:9200', index='indexname')
To get the schema of your index:
df.print_schema()
After that you can perform general DataFrame operations on df.
If you want to parse the raw result yourself, do the following:
from elasticsearch import Elasticsearch
es = Elasticsearch('http://localhost:9200')
result_dict = es.search(index="indexname", body={"query": {"match_all": {}}})
and then finally load everything into your final DataFrame:
from pandasticsearch import Select
pandas_df = Select.from_dict(result_dict).to_pandas()
I hope this helps.

It depends on how you want to read the data from Elasticsearch: incrementally, i.e. reading the new data that arrives every day, or as a one-off bulk read. For the latter, you need to use the bulk API of Elasticsearch in Python; for the former, you can restrict yourself to a simple range query.
Schematic code for reading bulk data: https://gist.github.com/dpkshrma/04be6092eda6ae108bfc1ed820621130
How to use bulk API of ES:
How to use Bulk API to store the keywords in ES by using Python
https://elasticsearch-py.readthedocs.io/en/master/helpers.html#elasticsearch.helpers.bulk
How to use the range query for incremental inserts:
https://martinapugliese.github.io/python-for-(some)-elasticsearch-queries/
How to have Range and Match query in one elastic search query using python?
Since you want your data for different time intervals, you will also need to perform date aggregations.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
How to perform multiple aggregation on an object in Elasticsearch using Python?
Once you issue your Elasticsearch query, the results will be collected in a temporary variable; you can then use a Python client for your NoSQL database, such as PyMongo for MongoDB, to insert the Elasticsearch data into it.
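A minimal sketch that ties these pieces together, assuming an index with a @timestamp field, a local Elasticsearch node and a local MongoDB instance (the index name, field names and connection URLs are placeholders):
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
from pymongo import MongoClient
import pandas as pd

es = Elasticsearch('http://localhost:9200')

# Range query for one time interval (here the last day); change "now-1d" to
# "now-7d" or "now-30d" for weekly or monthly pulls.
query = {"query": {"range": {"@timestamp": {"gte": "now-1d", "lte": "now"}}}}

# scan() pages through every matching document, avoiding the default 10k result cap
docs = [hit["_source"] for hit in scan(es, query=query, index="indexname")]
df = pd.DataFrame(docs)

# ... run your analysis on df ...

# Store the analysed records in a NoSQL database (MongoDB via PyMongo)
if not df.empty:
    MongoClient("mongodb://localhost:27017")["analysis_db"]["results"].insert_many(df.to_dict("records"))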

Related

Best Practice when Writing PySpark DataFrames to MySQL

I am trying to develop a few data pipelines using Apache Airflow with scheduled Spark jobs.
For one of these pipelines, I am trying to write data from a PySpark DataFrame to MySQL and I keep running into a few problems. This is what my code looks like for now, though I do want to add more transformations to it in the future:
df_tsv = spark.read.csv(tsv_file, sep=r'\t', header=True)
df_tsv.write.jdbc(url=mysql_url, table=mysql_table, mode="append", properties={"user":mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver" })
This is the exception that keeps getting raised,
java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
The first thing that I want to know is how I can solve the above issue.
Secondly, I would like to know what the best practice is when writing data from Spark to databases like MySQL. For instance, is there an option to make it so that data from a given column in the DataFrame is stored in a specified column in the table? Or should the column names of the table be the same as those of the DataFrame?
The other option that I can think of here is to convert the DataFrame to say, a list of tuples and then use something like the mysql-python-connector to load the data into the database,
rdd = df.rdd
b = rdd.map(tuple)
data = b.collect()
# write data to database using mysql-python-connector
What is the more efficient option here? Are there any other options that I do not know of?
java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
The first thing that I want to know is how I can solve the above issue.
You need to pass the MySQL JDBC connector when starting your Spark session; see https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html.
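For example, a minimal sketch of starting the session with the connector pulled from Maven (the connector version here is an assumption; match it to your MySQL server, or pass a local JAR via spark.jars / --jars instead):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tsv-to-mysql")
    # downloads the MySQL JDBC driver and ships it to the driver and executors
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33")
    .getOrCreate()
)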
Secondly, I would like to know what the best practice is when writing data from Spark to databases like MySQL. For instance, is there an option to make it so that data from a given column in the DataFrame is stored in a specified column in the table? Or should the column names of the table be the same as those of the DataFrame?
Yes, the JDBC writer matches DataFrame column names to table column names, so they should be the same; if they differ, rename the DataFrame columns before writing.
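If the names differ, a small sketch of renaming before the write (the column names here are hypothetical):
from pyspark.sql import functions as F

df_out = df_tsv.select(
    F.col("amount").alias("total_amount"),      # DataFrame column -> table column
    F.col("purchased").alias("purchase_date"),
)
df_out.write.jdbc(url=mysql_url, table=mysql_table, mode="append",
                  properties={"user": mysql_user, "password": mysql_password,
                              "driver": "com.mysql.cj.jdbc.Driver"})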
The other option that I can think of here is to convert the DataFrame to say, a list of tuples and then use something like the mysql-python-connector to load the data into the database,
rdd = df.rdd
b = rdd.map(tuple)
data = b.collect()
# write data to database using mysql-python-connector
No, never do this: it defeats the whole purpose of using Spark (distributed computation), because collect() pulls the entire dataset onto the driver. Check out the link above; you will find some good advice on where to start and how to read from and write to a JDBC data source.

Selecting data from a CSV and entering that data into a table in SQLite

I am trying to figure out how to iterate through the rows of a .csv file and enter that data into a table in SQLite, but only if the data in that row meets certain criteria.
I am trying to build a database of my personal spending. I have used Python to categorise my spending data, and I now want to enter that data into a database with each category as a different table. This means I need to sort the data and enter it into different tables based on the category of spend.
I have looked for quite a long time. Can anyone help?
You need to read the CSV file using pandas and store it in a pandas DataFrame. Then (if you have not already created a database) use the SQLAlchemy library (here is the documentation) to create an engine: engine = sqlalchemy.create_engine('sqlite:///file.db').
Afterwards, convert the DataFrame to an SQL table using pandas' to_sql function (documentation): df.to_sql('file_name', engine, index=False). I used index=False to avoid creating a column for the index of the DataFrame.
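A minimal sketch that also covers the per-category split asked about in the question, assuming the CSV has a 'category' column (the file, table and column names are placeholders):
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///spending.db')
df = pd.read_csv('spending.csv')

# One table per spending category; if_exists='append' lets repeated runs add new rows
for category, rows in df.groupby('category'):
    rows.to_sql(category, engine, index=False, if_exists='append')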

What is the best way of updating BigQuery table from a pandas Dataframe with many rows

I have a dataset in BigQuery with 100 thousand+ rows and 10 columns. I am also continuously adding new data to the dataset. I want to fetch the data that has not been processed yet, process it, and write it back to my table. Currently, I'm fetching it into a pandas dataframe using the BigQuery Python library and processing it with pandas.
Now, I want to update the table with the new pre-processed data. One way of doing this is with an SQL statement, calling the query function of the bigquery.Client() class, or using a job like here.
bqclient = bigquery.Client(
    credentials=credentials,
    project=project_id,
)
query_string = """UPDATE `dataset.table` SET field_1 = '3' WHERE field_2 = '1'"""
bqclient.query(query_string)
But it doesn't make sense to create an UPDATE statement for each row.
Another way I found is using the to_gbq function of the pandas-gbq package. The disadvantage of this is that it rewrites the whole table.
Question: what is the best way of updating a BigQuery table from a pandas dataframe?
Google BigQuery is mainly used for data analysis when your data is static and you don't have to update values, since the architecture is basically designed for that kind of workload. Therefore, if you want to update the data, there are some options, but they are all quite heavy:
1. The one you mentioned: a query that updates the rows one by one.
2. Recreating the table using only the new values.
3. Appending the new data with a different timestamp.
4. Using partitioned tables [1] and, if possible, clustered tables [2]. This way, when you want to update the table, you can use the partitioned and clustered columns in the update and the query will be less heavy. Also, you can append the new data into a new partition, say the current day's.
If you are using the data for analytical reasons, maybe the best options are 2 and 3, but I always recommend having [1] and [2].
[1] https://cloud.google.com/bigquery/docs/querying-partitioned-tables
[2] https://cloud.google.com/bigquery/docs/clustered-tables
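As an illustration of options 3 and 4, a rough sketch that appends pre-processed rows to a table partitioned on a timestamp column (the table name, the processed_at column and the processed_df dataframe are assumptions):
from google.cloud import bigquery

bqclient = bigquery.Client(credentials=credentials, project=project_id)

job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_APPEND",
    # the table is partitioned on the processing timestamp, so existing partitions stay untouched
    time_partitioning=bigquery.TimePartitioning(field="processed_at"),
)
load_job = bqclient.load_table_from_dataframe(
    processed_df, "dataset.table", job_config=job_config
)
load_job.result()  # wait for the load job to finish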

Python: How to update (overwrite) Google BigQuery table using pandas dataframe

I have a table in Google BigQuery (GBQ) with almost 3 million records (rows) so far, created from data coming from a MySQL db every day. This data is inserted into the GBQ table using a Python pandas dataframe (.to_gbq()).
What is the optimal way to sync changes from MySQL to GBQ, in this direction, with Python?
Several different ways to import data from MySQL to BigQuery that might suit your needs are described in this article. For example, binlog replication:
This approach (sometimes referred to as change data capture - CDC) utilizes MySQL’s binlog. MySQL’s binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation, as well as Data Definition Language (DDL) data that was performed by the database. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.
Seems to be exactly what you are searching for.

Export from Oracle to MongoDB using python

I know there are various ETL tools available to export data from Oracle to MongoDB, but I wish to use Python as an intermediary to perform this. Can anyone please guide me on how to proceed with this?
Requirement:
Initially I want to add all the records from Oracle to MongoDB, and after that I want to insert only the newly inserted records from Oracle into MongoDB.
Appreciate any kind of help.
To answer your question directly:
1. Connect to Oracle
2. Fetch all the delta data by timestamp or id (the first time, this is all records)
3. Transform the data to JSON
4. Write the JSON to Mongo with pymongo
5. Save the maximum timestamp / id for the next iteration
Keep in mind that you should think about data model considerations: a relational DB (like Oracle) and a document DB (like Mongo) will usually have different data models.
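A minimal sketch of the steps above, assuming a hypothetical ORDERS table with an UPDATED_AT column used to track the delta (connection details, table and column names are placeholders):
import cx_Oracle
from pymongo import MongoClient

oracle_conn = cx_Oracle.connect("user", "password", "host:1521/service")
mongo_coll = MongoClient("mongodb://localhost:27017")["mydb"]["orders"]

def sync_delta(last_ts):
    # 1-2: connect and fetch the delta (on the first run, pass a very old timestamp)
    cursor = oracle_conn.cursor()
    cursor.execute(
        "SELECT * FROM orders WHERE updated_at > :ts ORDER BY updated_at",
        ts=last_ts,
    )
    columns = [col[0].lower() for col in cursor.description]
    # 3: transform rows into JSON-like dicts
    docs = [dict(zip(columns, row)) for row in cursor]
    if docs:
        # 4: write to Mongo with pymongo
        mongo_coll.insert_many(docs)
        # 5: remember the maximum timestamp for the next iteration
        last_ts = max(doc["updated_at"] for doc in docs)
    return last_ts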
