Dask DataFrame.map_partitions() to write to db table - python

I have a dask dataframe that contains some data after some transformations. I want to write that data back to a MySQL table. I have implemented a function that takes a dataframe and a db url and writes the dataframe back to the database. Because I need to make some final edits on the data of the dataframe, I use pandas df.to_dict('records') to handle the write.
The function looks like this:
def store_partition_to_db(df, db_url):
    from sqlalchemy import create_engine
    from mymodels import DBTableBaseModel

    records_dict = df.to_dict('records')
    records_to_db = []
    for record in records_dict:
        transformed_record = transform_record_some_how(record)  # transformed_record is a dictionary
        records_to_db.append(transformed_record)

    engine = create_engine(db_url)
    engine.execute(DBTableBaseModel.__table__.insert(), records_to_db)
    return records_to_db
In my dask code:
from functools import partial

partial_store_partition_to_db = partial(store_partition_to_db, db_url=url)
dask_dataframe = dask_dataframe_data.map_partitions(partial_store_partition_to_db)
all_records = dask_dataframe.compute()
print(len([record_dict for record_list in all_records for record_dict in record_list]))  # Gives me 7700
But when I go to the respective table in MySQL I get 7702 rows, including rows with the same value, 1, in every column. When I try to filter all_records for that value, no dictionary is returned. Has anyone met this situation before? How do you handle db writes from partitions with dask?
PS: I use LocalCluster and dask distributed

The problem was that I didn't provide meta information to the map_partitions method, and because of that dask created a dummy dataframe with foo values, which in turn were written to the db.
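For reference, a minimal sketch of that fix, assuming the same store_partition_to_db function as above: passing meta tells dask the output type up front, so it no longer runs the function against a dummy partition (filled with placeholder values such as 1 and 'foo') just to infer the schema, which is what produced the extra rows.

from functools import partial

partial_store_partition_to_db = partial(store_partition_to_db, db_url=url)

# (None, "object") declares an unnamed Series of Python objects, since each
# partition returns a plain list of record dictionaries. The exact meta value
# to use depends on your dask version and on what the function returns.
results = dask_dataframe_data.map_partitions(
    partial_store_partition_to_db,
    meta=(None, "object"),
)
all_records = results.compute()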

Related

Does fetching data from Azure Table Storage with Python take too long? The data has around 1000 rows per hour and I am fetching it hour by hour

import os, uuid
import json
import pandas as pd
from azure.data.tables import TableClient
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity, EntityProperty

def queryAzureTable(azureTableName, filterQuery):
    table_service = TableService(account_name='accountname', account_key='accountkey')
    tasks = table_service.query_entities(azureTableName, filter=filterQuery)
    return tasks

filterQuery = f"PartitionKey eq '{key}' and Timestamp ge datetime'2022-06-15T09:00:00' and Timestamp lt datetime'2022-06-15T10:00:00'"
entities = queryAzureTable("TableName", filterQuery)

for i in entities:
    print(i)
OR
df = pd.DataFrame(entities)
Above is the code that I am using. The Azure table only has around 1000 entries, which should not take long, but extracting them takes more than an hour with this code.
Both iterating with a 'for' loop and converting the entities directly to a DataFrame take too long.
Could anyone let me know why it is taking this long, or whether it generally takes that much time?
If that's the case, is there any alternate way that does not take more than 10-15 minutes to process it, without increasing the number of clusters already in use?
I read that multithreading might resolve it and tried that too, but it doesn't seem to help; maybe I am writing it wrong. Could anyone help me with the code using multithreading, or with any alternate way?
I tried to list all the rows in my table storage. By default, an Azure Table Storage query returns at most 1,000 rows (entities) per request; larger result sets are paged.
Also, there are a few limitations on the partition and row keys, which should not exceed 1 KiB each. Unfortunately, the type of storage account also matters for the latency of your output. As you're trying to query 1000 rows at once:
Make sure your table storage is in a region near you.
Check the scalability targets and limitations for Azure Table Storage here: https://learn.microsoft.com/en-us/azure/storage/tables/scalability-targets#scale-targets-for-table-storage
Also, AFAIK, in your code you can directly make use of the list_entities method to list all the entities in the table instead of writing such a complex query.
I tried the code below and was able to retrieve all the table entities successfully within a few seconds with a standard general-purpose V2 storage account.
Code:
from azure.data.tables import TableClient

table_client = TableClient.from_connection_string(conn_str="DefaultEndpointsProtocol=https;<connection-string>;EndpointSuffix=core.windows.net", table_name="myTable")

# Query the entities in the table
entities = list(table_client.list_entities())

for i, entity in enumerate(entities):
    print("Entity #{}: {}".format(i, entity))
Result: (output screenshot omitted)
With pandas:
import pandas as pd
from azure.cosmosdb.table.tableservice import TableService

CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=siliconstrg45;AccountKey=<connection-string>==;EndpointSuffix=core.windows.net"
SOURCE_TABLE = "myTable"

def set_table_service():
    """ Set the Azure Table Storage service """
    return TableService(connection_string=CONNECTION_STRING)

def get_dataframe_from_table_storage_table(table_service):
    """ Create a dataframe from table storage data """
    return pd.DataFrame(get_data_from_table_storage_table(table_service))

def get_data_from_table_storage_table(table_service):
    """ Retrieve data from Table Storage """
    for record in table_service.query_entities(SOURCE_TABLE):
        yield record

ts = set_table_service()
df = get_dataframe_from_table_storage_table(table_service=ts)
print(df)
Result: (output screenshot omitted)
If you have your table storage scalability targets in place, you can consider a few points from this document to increase the IOPS of your table storage:
Refer here: https://learn.microsoft.com/en-us/azure/storage/tables/storage-performance-checklist
Also, storage quotas and limits vary by Azure subscription type too!

SQLAlchemy getting column data types of view results in MySQL

I've seen a few posts on this but nothing that works for me, unfortunately.
Basically I'm trying to get the SQLAlchemy (or pandas) column data types from a list of views in a MySQL database.
import sqlalchemy as sa
view = "myView"
engine = "..."
meta = sa.MetaData(engine, True)
This errors:
tb_data = meta.tables["tb_data"]
# KeyError: 'tb_data'
And I don't know what I'm supposed to do with this:
sa.Table(view, meta).columns
# <sqlalchemy.sql.base.ImmutableColumnCollection at 0x7f9cb264d4a0>
Saw this somewhere but not sure how I'm supposed to use it:
str_columns = filter(lambda column: isinstance(column.type, sa.TEXT), columns)
# <filter at 0x7f9caafab640>
Eventually what I'm trying to achieve is a list or dict of data types for a view that I can then use to load to a PostgreSQL database. Happy to consider alternatives outside of sqlalchemy and/or pandas if they exist (and are relatively trivial to implement).
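One approach that may work here (a sketch only, with the view name and connection URL as placeholders) is to reflect the view explicitly with autoload_with and then read each column's type from the reflected Table object:

import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://user:password@host/dbname")  # placeholder URL
meta = sa.MetaData()

# Reflect the view by name; SQLAlchemy reflects views much like tables
# when they are requested explicitly.
view_table = sa.Table("myView", meta, autoload_with=engine)

# Map column name -> SQLAlchemy type object (str() gives a readable form).
dtypes = {column.name: column.type for column in view_table.columns}
print(dtypes)

The resulting dict of SQLAlchemy type objects can then be passed along (for example via the dtype argument of pandas.DataFrame.to_sql) when loading into PostgreSQL.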

Retrieve data from python script and save to SQLite database

I have a Python script which retrieves data through an API. The data returned is a dictionary. I want to save the data in a sqlite3 database. There are two main columns ('scan', 'tests'). I'm only interested in the data inside these two columns, e.g. 'grade': 'D+', 'likelihood_indicator': 'MEDIUM'.
Any help is appreciated.
import pandas as pd
from httpobs.scanner.local import scan
import sqlite3

website_to_scan = 'digitalnz.org'
scan_site = scan(website_to_scan)
df = pd.DataFrame(scan_site)

print(scan_site)
print(df)
Results of print(scan_site) and print(df): (output screenshots omitted)
This depends on how you have set up your table in sqlite, but essentially you would write an INSERT INTO SQL clause and use the connection.execute() function in Python, passing your SQL string as an argument.
It's difficult to give a more precise answer for your question (i.e. code) because you haven't declared the connection variable. Let's imagine you already have your sqlite DB set up with the connection:
connection_variable.execute("""INSERT INTO table_name
    (column_name1, column_name2) VALUES (value1, value2);""")
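For the specific fields mentioned in the question, a minimal sketch might look like the following. The table and column names are made up, and it assumes 'grade' and 'likelihood_indicator' sit under the 'scan' key of the result, as described above:

import sqlite3

# Hypothetical table and column names; adjust to your own schema.
conn = sqlite3.connect("scans.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS scan_results (
           website TEXT,
           grade TEXT,
           likelihood_indicator TEXT
       )"""
)

# scan_site is the dictionary returned by scan(); per the question, the
# values of interest live under the 'scan' key.
scan_data = scan_site["scan"]
conn.execute(
    "INSERT INTO scan_results (website, grade, likelihood_indicator) VALUES (?, ?, ?)",
    (website_to_scan, scan_data.get("grade"), scan_data.get("likelihood_indicator")),
)
conn.commit()
conn.close()

Using ? placeholders rather than string formatting keeps the insert safe and lets sqlite3 handle quoting.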

Example of using the 'callable' method in pandas.to_sql()?

I'm trying to make a specific insert statement that has an ON CONFLICT argument (I'm uploading to a Postgres database); will the df.to_sql(method='callable') allow that? Or is it intended for another purpose? I've read through the documentation, but I wasn't able to grasp the concept. I looked around on this website and others for similar questions, but I haven't found one yet. If possible I would love to see an example of how to use the 'callable' method in practice. Any other ideas on how to effectively load large numbers of rows from pandas using ON CONFLICT logic would be much appreciated as well. Thanks in advance for the help!
Here's an example of how to use Postgres's ON CONFLICT DO NOTHING with to_sql:
# import postgres specific insert
from sqlalchemy.dialects.postgresql import insert

def to_sql_on_conflict_do_nothing(pd_table, conn, keys, data_iter):
    # This is very similar to the default to_sql insert function in pandas;
    # only the conn.execute line is changed
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(insert(pd_table.table).on_conflict_do_nothing(), data)

conn = engine.connect()
df.to_sql("some_table", conn, if_exists="append", index=False, method=to_sql_on_conflict_do_nothing)
I have just had a similar problem, and following this answer I came up with a solution for how to send a df to PostgreSQL with ON CONFLICT:
1. Send some initial data to the database to create the table
from sqlalchemy import create_engine

engine = create_engine(connection_string)
df.to_sql(table_name, engine)
2. Add a primary key
ALTER TABLE table_name ADD COLUMN id SERIAL PRIMARY KEY;
3. Prepare a unique index on the column (or columns) you want to check for uniqueness
CREATE UNIQUE INDEX review_id ON test(review_id);
4. Map the sql table with sqlalchemy
from sqlalchemy.ext.automap import automap_base

ABase = automap_base()
ABase.prepare(engine, reflect=True)  # reflect the existing table so automap can map it
Table = ABase.classes.table_name
Table.__tablename__ = 'table_name'
5. Do your insert on conflict with:
from sqlalchemy.dialects.postgresql import insert

insrt_vals = df.to_dict(orient='records')
insrt_stmnt = insert(Table).values(insrt_vals)
do_nothing_stmt = insrt_stmnt.on_conflict_do_nothing(index_elements=['review_id'])
results = engine.execute(do_nothing_stmt)
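If conflicting rows should be updated rather than skipped, the same insert construct also exposes on_conflict_do_update. A rough sketch in the same 1.x style as the steps above, with the 'id' and 'review_id' column names taken from that example:

from sqlalchemy.dialects.postgresql import insert

insrt_vals = df.to_dict(orient='records')
insrt_stmnt = insert(Table).values(insrt_vals)

# On conflict, overwrite every non-key column with the incoming ("excluded")
# values; skip the primary key and the conflict column themselves.
upsert_stmt = insrt_stmnt.on_conflict_do_update(
    index_elements=['review_id'],
    set_={c.name: c for c in insrt_stmnt.excluded if c.name not in ('id', 'review_id')},
)
results = engine.execute(upsert_stmt)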

Is there any way to automatically load column data types into SQLite using SQLAlchemy?

I have a large csv file with nearly 100 columns of varying data types that I would like to load into a SQLite database using sqlalchemy. This will be an ongoing thing where I will periodically load new data as a new table in the database. This seems like it should be trivial, but I cannot get anything to work.
All the solutions I've looked at so far define the columns explicitly when creating the tables.
Here is a minimal example (with far fewer columns) of what I have at the moment.
from sqlalchemy import *
import pandas as pd

values_list = []
url = r"https://raw.githubusercontent.com/amanthedorkknight/fifa18-all-player-statistics/master/2019/data.csv"
df = pd.read_csv(url, sep=",")
df = df.to_dict()

metadata = MetaData()
engine = create_engine("sqlite:///" + r"C:\Users\...\example.db")
connection = engine.connect()

# I would like to define just the primary key column and have the others loaded automatically...
t1 = Table('t1', metadata, Column('ID', Integer, primary_key=True))
metadata.create_all(engine)

stmt = insert(t1).values()
values_list.append(df)
results = connection.execute(stmt, values_list)
values_list = []
connection.close()
Thanks for the suggestions. After some time searching, a decent solution is the sqlathanor package. It has a function called generate_model_from_csv (with similar helpers for dicts, JSON, etc.) which lets you read in a csv and build a SQLAlchemy model directly. It is imperfect at datatype recognition, but it will certainly save you some time if you have a lot of columns.
https://sqlathanor.readthedocs.io/en/latest/api.html#generate-model-from-csv
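As a simpler alternative to the sqlathanor route described above, pandas itself can create the table and infer SQLite column types from the frame's dtypes via DataFrame.to_sql. A minimal sketch, with the database path and table name as placeholders:

import pandas as pd
from sqlalchemy import create_engine

url = "https://raw.githubusercontent.com/amanthedorkknight/fifa18-all-player-statistics/master/2019/data.csv"
df = pd.read_csv(url)

engine = create_engine("sqlite:///example.db")  # placeholder path

# to_sql creates the table if it does not exist and maps pandas dtypes
# (int64, float64, object, ...) to SQLite column types automatically.
df.to_sql("players_2019", engine, if_exists="replace", index=False)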
