I know there are various ETL tools available to export data from Oracle to MongoDB, but I wish to use Python as the intermediate layer to do this. Can anyone please guide me on how to proceed?
Requirement:
Initially I want to copy all the records from Oracle to MongoDB, and after that I want to insert only newly added records from Oracle into MongoDB.
Appreciate any kind of help.
To answer your question directly:
1. Connect to Oracle
2. Fetch all the delta data by timestamp or id (first time is all records)
3. Transform the data to json
4. Write the json to mongo with pymongo
5. Save the maximum timestamp / id for next iteration
Keep in mind the data model: a relational DB (like Oracle) and a document DB (like MongoDB) will usually have different data models, so think about how your rows should map to documents.
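A minimal sketch of the steps above, assuming cx_Oracle and pymongo; the table name, column names and connection strings are placeholders, not taken from your schema:

import cx_Oracle
import pymongo

# Placeholder connection details -- replace with your own.
ora = cx_Oracle.connect("user/password@dbhost:1521/service")
coll = pymongo.MongoClient("mongodb://localhost:27017")["mydb"]["my_table"]

def sync(last_ts=None):
    """First call (last_ts=None) loads everything; later calls load only the delta."""
    cur = ora.cursor()
    if last_ts is None:
        cur.execute("SELECT id, name, updated_at FROM my_table")
    else:
        cur.execute("SELECT id, name, updated_at FROM my_table WHERE updated_at > :ts",
                    ts=last_ts)
    cols = [d[0].lower() for d in cur.description]
    docs = [dict(zip(cols, row)) for row in cur]      # rows -> JSON-like dicts
    if docs:
        coll.insert_many(docs)
        last_ts = max(d["updated_at"] for d in docs)  # high-water mark for the next run
    return last_ts

Persist the returned timestamp (file, table, etc.) so each run only picks up the new rows.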
Related
I have an application which is using Cassandra as its database. I need to create some reports from the Cassandra DB data, but the data is not modelled for the report queries, so one report may need data scattered across multiple tables. As Cassandra doesn't allow joins like an RDBMS, this is not simple to do.
So I am thinking of a solution that gets the required tables' data into some other DB (RDBMS or Mongo) in real time and then generates the reports from there. Is there any standard way to get data from Cassandra into another DB (Mongo or RDBMS) in real time, i.e. whenever an insert/update/delete happens in Cassandra, the same is applied to the destination DB? Any example program or code would be very helpful.
You would be better off using the Spark + Spark Cassandra Connector combination for this task. With Spark you can do the joins in memory and write the data back to Cassandra or to any text file.
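For example, a rough PySpark sketch; the keyspace, table names and host are placeholders, and it assumes the spark-cassandra-connector package is on the Spark classpath:

from pyspark.sql import SparkSession

# Started with e.g.: spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 report.py
spark = (SparkSession.builder
         .appName("cassandra-report")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

def cassandra_table(keyspace, table):
    return (spark.read.format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table).load())

orders = cassandra_table("shop", "orders")          # placeholder keyspace/tables
customers = cassandra_table("shop", "customers")

# The join Cassandra itself cannot do, performed in memory by Spark
report = orders.join(customers, "customer_id")
report.write.mode("overwrite").csv("/tmp/report")   # or write back to Cassandra / another store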
I'm trying to read from a Cosmos DB collection (MachineCollection) with a large amount of data (58 GB of data; index size 9 GB). Throughput is set to 1000 RU/s. The collection is partitioned by serial number, with read locations (WestEurope, NorthEurope) and a write location (WestEurope). In parallel with my read attempts, MachineCollection is fed with new data every 20 seconds.
The problem is that I cannot query any data via Python. If I execute the query in the Cosmos DB Data Explorer I get results in no time (e.g. querying for a certain serial number).
For troubleshooting purposes, I have created a new database (TestDB) and a TestCollection. In this TestCollection there are 10 datasets from MachineCollection. If I try to read from this TestCollection via Python it succeeds and I am able to save the data to CSV.
This makes me wonder why I am not able to query data from MachineCollection when TestDB and TestCollection are configured with the exact same properties.
What I have already tried for querying via Python:
options['enableCrossPartitionQuery'] = True
Querying using PartitionKey: options['partitionKey'] = 'certainSerialnumber'
Same as always. Works with TestCollection, but not with MachineCollection.
Any ideas on how to resolve this issue are highly appreciated!
Firstly, what you need to know is that DocumentDB imposes limits on response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?
Secondly, if you want to query large amounts of data from DocumentDB, you have to consider query performance; please refer to this article: Tuning query performance with Azure Cosmos DB.
By looking at the DocumentDB REST API, you can observe several important parameters which have a significant impact on query operations: x-ms-max-item-count and x-ms-continuation.
As far as I know, the Azure portal doesn't automatically optimize your SQL, so you need to handle this in the SDK or REST API.
You could set the value of Max Item Count and paginate your data using the continuation token. The DocumentDB SDK supports reading paginated data seamlessly. You could refer to the snippet of Python code below:
q = client.QueryDocuments(collection_link, query, {'maxItemCount': 10})
results_1 = q._fetch_function({'maxItemCount': 10})
# results_1[1] holds the response headers; the continuation token is a string representing a JSON object
token = results_1[1]['x-ms-continuation']
results_2 = q._fetch_function({'maxItemCount': 10, 'continuation': token})
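A slightly fuller sketch of the same private _fetch_function pattern used above, looping until the continuation token runs out; the endpoint and key are placeholders for your own values, and this assumes the older pydocumentdb SDK:

from pydocumentdb import document_client

client = document_client.DocumentClient('https://<your-account>.documents.azure.com:443/',
                                         {'masterKey': '<your-primary-key>'})
collection_link = 'dbs/TestDB/colls/TestCollection'
query = {'query': 'SELECT * FROM c'}
options = {'maxItemCount': 100, 'enableCrossPartitionQuery': True}

q = client.QueryDocuments(collection_link, query, options)
docs, token = [], None
while True:
    page_options = dict(options)
    if token:
        page_options['continuation'] = token
    page, headers = q._fetch_function(page_options)   # one page of documents + response headers
    docs.extend(page)
    token = headers.get('x-ms-continuation')          # absent/empty when there are no more pages
    if not token:
        break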
Another case you could refer to: How do I set continuation tokens for Cosmos DB queries sent by document_client objects in Python?
I am pretty new to Elasticsearch, so please forgive me if I am asking a very simple question.
In my workplace we have a proper setup of ELK.
Due to the very large volume of data we store only 14 days of data, and my question is how I can read that data in Python and later store my analysis in some NoSQL database.
As of now my primary goal is to read the raw data from the Elasticsearch cluster into Python, as a data frame or in any other format.
I want to get it for different time intervals like 1 day, 1 week, 1 month, etc.
I have been struggling with this for the last week.
You can use the code below to achieve that:
# Create a DataFrame object
from pandasticsearch import DataFrame
df = DataFrame.from_es(url='http://localhost:9200', index='indexname')
To get the schema of your index:
df.print_schema()
After that you can perform general dataframe operation on the df.
If you want to parse the result yourself, do the following:
from elasticsearch import Elasticsearch
es = Elasticsearch('http://localhost:9200')
result_dict = es.search(index="indexname", body={"query": {"match_all": {}}})
and then finally load everything into your final dataframe:
from pandasticsearch import Select
pandas_df = Select.from_dict(result_dict).to_pandas()
I hope it helps.
It depends on how you want to read the data from Elasticsearch. Is it incremental reading, i.e. reading new data that arrives every day, or is it a bulk read? For the latter, you need to use the bulk API of Elasticsearch in Python, and for the former you can restrict yourself to a simple range query.
Schematic code for reading bulk data: https://gist.github.com/dpkshrma/04be6092eda6ae108bfc1ed820621130
How to use bulk API of ES:
How to use Bulk API to store the keywords in ES by using Python
https://elasticsearch-py.readthedocs.io/en/master/helpers.html#elasticsearch.helpers.bulk
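For pulling everything out of an index in one go, the scan helper from elasticsearch.helpers is the tool usually used for bulk reads (as opposed to the bulk helper, which is geared towards writes); a minimal sketch, with the index name as a placeholder:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('http://localhost:9200')
# scan() streams all matching documents in batches without deep pagination
all_docs = [hit['_source'] for hit in helpers.scan(es, index='indexname',
                                                   query={'query': {'match_all': {}}})]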
How to use the range query for incremental inserts:
https://martinapugliese.github.io/python-for-(some)-elasticsearch-queries/
How to have Range and Match query in one elastic search query using python?
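For incremental reads, a range query on a timestamp field might look like this; the field name @timestamp and the index name are assumptions, so adjust them to your mapping:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
# only documents from the last day; use now-7d/d or now-1M/M for week/month windows
body = {'query': {'range': {'@timestamp': {'gte': 'now-1d/d'}}}}
result_dict = es.search(index='indexname', body=body, size=1000)
hits = [hit['_source'] for hit in result_dict['hits']['hits']]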
Since you want to pull your data for different time intervals, you will also need to perform date aggregations.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
How to perform multiple aggregation on an object in Elasticsearch using Python?
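A rough sketch of a daily date_histogram aggregation; again, @timestamp and the index name are placeholders, and on older Elasticsearch versions the parameter is interval rather than calendar_interval:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
agg_body = {
    'size': 0,
    'aggs': {
        'per_day': {
            'date_histogram': {'field': '@timestamp', 'calendar_interval': 'day'}
        }
    }
}
buckets = es.search(index='indexname', body=agg_body)['aggregations']['per_day']['buckets']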
Once you issue your Elasticsearch query, the results are collected in a local variable; you can then use a Python client for your NoSQL database, such as PyMongo, to insert the Elasticsearch data into it.
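For that last step, a minimal PyMongo sketch; the connection string, database and collection names are placeholders:

import pymongo

# 'hits' stands for the list of _source dicts returned by your Elasticsearch query
hits = [{'field': 'value'}]                      # placeholder example document
mongo_coll = pymongo.MongoClient('mongodb://localhost:27017')['analysisdb']['es_data']
mongo_coll.insert_many(hits)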
I have a table in Google BigQuery (GBQ) with almost 3 million records (rows) so far, created from data coming from a MySQL DB every day. The data is inserted into the GBQ table using a Python pandas DataFrame (.to_gbq()).
What is the optimal way to sync changes from MySQL to GBQ, in this direction, with Python?
Several different ways to import data from MySQL to BigQuery that might suit your needs are described in this article. For example, binlog replication:
This approach (sometimes referred to as change data capture - CDC) utilizes MySQL’s binlog. MySQL’s binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation, as well as Data Definition Language (DDL) data that was performed by the database. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.
Seems to be exactly what you are searching for.
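If you want to roll this yourself in Python, a hedged sketch using the python-mysql-replication library and the BigQuery client could look like this; the connection details, schema/table names and BigQuery table id are all placeholders, and the MySQL server must run with binlog_format=ROW:

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import WriteRowsEvent
from google.cloud import bigquery

MYSQL_SETTINGS = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}
bq = bigquery.Client()
TABLE_ID = "my-project.my_dataset.my_table"

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,                     # any id not already used by a replica
    only_schemas=["my_db"],
    only_tables=["my_table"],
    only_events=[WriteRowsEvent],      # inserts only; add Update/DeleteRowsEvent for full CDC
    blocking=True,                     # keep streaming as new binlog events arrive
    resume_stream=True,
)

for event in stream:
    rows = [row["values"] for row in event.rows]   # one dict per inserted row, keyed by column
    errors = bq.insert_rows_json(TABLE_ID, rows)   # streaming insert into BigQuery
    if errors:
        print("BigQuery insert errors:", errors)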
Is there a way to perform an SQL query that joins a MySQL table with a dict-like structure that is not in the database but instead provided in the query?
In particular, I regularly need to post-process data I extract from a database with the respective exchange rates. Exchange rates are not stored in the database but retrieved on the fly and stored temporarily in a Python dict.
So, I have a dict: exchange_rates = {'EUR': 1.10, 'GBP': 1.31, ...}.
Let's say some query returns something like: id, amount, currency_code.
Would it be possible to add the dict to the query so I can return: id, amount, currency_code, usd_amount? This would remove the need to post-process in Python.
This solution doesn't use a 'join', but it does combine the data from Python into SQL via a CASE statement: you can generate the SQL you want in Python (as a string) so that it includes these values in a giant CASE expression.
You give no details and don't say which version of Python, so it's hard to provide useful code, but this works with Python 2.7 and assumes you have some connection to the MySQL db in Python:
exchange_rates = {'EUR': 1.10, 'GBP': 1.31, ...}
# create a long set of CASE conditions as a string (each one needs a leading "when")
er_case_statement = "\n".join("when mytable.currency = \"{0}\" then {1}".format(k, v) for (k, v) in exchange_rates.iteritems())
# build the sql with these case statements
sql = """select <some stuff>,
case {0}
end as exchange_rate,
other columns
from tables etc
where etc
""".format(er_case_statement)
Then send this SQL to MySQL
I don't like this solution; you end up with a very large SQL statement, which can hit the maximum query size ( What is maximum query size for mysql? ).
Another idea is to use temporary tables in MySQL. Again assuming you are connecting to the db in Python: with Python, create the SQL that creates a temporary table and inserts the exchange rates, send that to MySQL, then build a query that joins your data to that temporary table.
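A minimal sketch of that temporary-table approach, assuming a PyMySQL connection; the table and column names (mytable, currency_code, amount) and the connection details are placeholders:

import pymysql

exchange_rates = {'EUR': 1.10, 'GBP': 1.31}          # your dict, built on the fly
conn = pymysql.connect(host='localhost', user='user', password='pw', database='mydb')

with conn.cursor() as cur:
    # temp table lives only for this connection
    cur.execute("""CREATE TEMPORARY TABLE tmp_rates (
                       currency_code CHAR(3) PRIMARY KEY,
                       usd_rate DECIMAL(12,6))""")
    cur.executemany("INSERT INTO tmp_rates (currency_code, usd_rate) VALUES (%s, %s)",
                    list(exchange_rates.items()))
    # the join happens entirely in MySQL, no Python post-processing
    cur.execute("""SELECT t.id, t.amount, t.currency_code,
                          t.amount * r.usd_rate AS usd_amount
                   FROM mytable t
                   JOIN tmp_rates r ON r.currency_code = t.currency_code""")
    rows = cur.fetchall()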
Finally, you say you don't want to post-process in Python, but you have a dict from somewhere, so I don't know which environment you are using. But if you can get these exchange rates from the web, say with curl, then you could use the shell to insert those values into a MySQL temp table as well, and do the join there.
Sorry this is general and not specific, but the question could use more detail. I hope it helps someone else give a more targeted answer.