Importing data from postgresql with Dask - python

So I have a large (7 GB) dataset stored in Postgres that I'm trying to import into Dask. I'm using the read_sql_table function, but I keep getting ArgumentErrors.
My info in postgres is the following:
database is "my_database"
schema is "public"
data table is "table"
username is "fred"
password is "my_pass"
index in postgres is 'idx'
I am trying to get this piece of code to work:
df = dd.read_sql_table('public.table', 'jdbc:postgresql://localhost/my_database?user=fred&password=my_pass', index_col='idx')
Am I formatting something incorrectly?

I was finally able to figure it out by using psycopg2. The answer is below:
df = dd.read_sql_table('table', 'postgresql+psycopg2://fred:my_pass@localhost/my_database', index_col='idx')
Additionally, I had to create a different index in the Postgres table: the index needs to exist as a separate, numeric column (dd.read_sql_table partitions on it), so the original index would not do. I added one with the following line in Postgres:
alter table "table" add column idx serial;
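For completeness, a hedged sketch of how the same call can also control partitioning; the npartitions value is an illustrative choice I'm adding here, not something from the original answer:
import dask.dataframe as dd
# read the Postgres table in parallel, partitioned on the numeric idx column
df = dd.read_sql_table(
    'table',
    'postgresql+psycopg2://fred:my_pass@localhost/my_database',
    index_col='idx',
    npartitions=16,  # illustrative; choose based on table size and memory
)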

Related

Problems while inserting df values with python into Oracle db

I am having trouble inserting data from a df into an Oracle database table; this is the error: DatabaseError: ORA-01036: illegal variable name/number
These are the steps I did:
This is the dataframe I imported from the yfinance package and processed so that the data types of my df are respected.
I transformed my df into a list; this is my data in the list:
this is the table where I want to insert my data:
This is the code:
sql_insert_temp = "INSERT INTO TEMPO('GIORNO','MESE','ANNO') VALUES(:2,:3,:4)"
index = 0
for i in df.iterrows():
cursor.execute(sql_insert_temp,df_list[index])
index += 1
connection.commit()
I have tried a single insert in the SQL Developer worksheet, using the data you can see in the list, and it worked, so I guess I have made some mistake in the code. I have seen other discussions, but I couldn't find any solution to my problem. Do you have any idea how I can solve this, or is it maybe possible to do it in another way?
I have tried printing the iterated queries, and that is the result; that's why it's not inserting my data:
If you already have a pandas DataFrame, then you should be able to use the to_sql() method provided by the pandas library.
import cx_Oracle
import sqlalchemy
import pandas as pd
DATABASE = 'DB'
SCHEMA = 'DEV'
PASSWORD = 'password'
connection_string = f'oracle://{SCHEMA}:{PASSWORD}@{DATABASE}'
db_connection = sqlalchemy.create_engine(connection_string)
df_to_insert = df[['GIORNO', 'MESE', 'ANNO']]  # creates a dataframe with only the columns you want to insert
df_to_insert.to_sql(name='TEMPO', con=db_connection, if_exists='append')
name is the name of the table
con is the connection object
if_exists='append' will add the rows to the end of the table. The other options are 'fail' and 'replace' (drop and re-create the table).
Other parameters can be found in the pandas documentation for DataFrame.to_sql().
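As an aside, the ORA-01036 in the original loop is likely related to the single-quoted column names and the :2/:3/:4 bind numbering; a minimal sketch of the direct cx_Oracle route, assuming the same TEMPO table and that df_list holds one (GIORNO, MESE, ANNO) sequence per row:
# unquoted column identifiers, bind placeholders starting at :1
sql_insert_temp = "INSERT INTO TEMPO (GIORNO, MESE, ANNO) VALUES (:1, :2, :3)"
cursor.executemany(sql_insert_temp, df_list)  # bind all rows in one call
connection.commit()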

How to map data types from python pandas to postgres table?

I have a pandas.DataFrame with columns having different data types like object, int64 , etc.
I have a postgresql table created with the appropriate data types. I want to insert all the dataframe data into the postgresql table. How should I manage to do this?
Note: The data in pandas is coming from another source, so the data types are not specified manually by me.
The easiest way is to use sqlalchemy:
from sqlalchemy import create_engine
engine = create_engine('postgresql://abc:def@localhost:5432/database')
df.to_sql('table_name', engine, if_exists='replace')
If the table exists, you can choose what you want to do with the if_exists option:
if_exists {‘fail’, ‘replace’, ‘append’}, default ‘fail’
If the table does not exist, it will create a new table with the corresponding datatypes.
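If you need more control than the automatic mapping, a minimal sketch using the dtype argument of to_sql; the column names below are hypothetical placeholders:
from sqlalchemy import create_engine
from sqlalchemy.types import Text, BigInteger, Numeric, Boolean, DateTime
engine = create_engine('postgresql://abc:def@localhost:5432/database')
# pin each column to an explicit Postgres-side type instead of relying on inference
df.to_sql('table_name', engine, if_exists='replace',
          dtype={'name': Text(),
                 'count': BigInteger(),
                 'price': Numeric(),
                 'active': Boolean(),
                 'created_at': DateTime()})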
Maybe you have the problem I had: I wanted to create new columns on an existing table, so replacing or appending the table did not work for me. In short, it looks like this for me (I guess there is no general solution for converting the datatypes, so you should adapt it to your needs):
lg.debug('table gets extended with the columns: ' + ",".join(dataframe.columns))
# check whether we have to add a field
df_postgres = {'object': 'text', 'int64': 'bigint', 'float64': 'numeric',
               'bool': 'boolean', 'datetime64': 'timestamp', 'timedelta': 'interval'}
for col in dataframe.columns:
    # convert the pandas dtype to the corresponding Postgres type:
    if str(dataframe.dtypes[col]) in df_postgres:
        dbo.table_column_if_not_exists(self.table_name, col,
                                       df_postgres[str(dataframe.dtypes[col])],
                                       original_endpoint)
    else:
        lg.error('Fieldtype ' + str(dataframe.dtypes[col]) + ' is not configured')
and the function to create the columns:
def table_column_if_not_exists(self, table, name, dtype, original_endpoint=''):
    self.query(query='ALTER TABLE ' + table + ' ADD COLUMN IF NOT EXISTS ' + name + ' ' + dtype)
    # add a comment when we know which source created this column
    if original_endpoint != '':
        self.query(query='comment on column ' + table + '.' + name + " IS '" + original_endpoint + "'")

cannot get sqlalchemy and pandas (to_sql) to write a dataframe index date into a MySQL DB

I read from an API the following data into a pandas dataframe:
Now, I want to write this data into a MySQL-DB-table, using pandas to_sql:
In MySQL, the column is set up correctly, but the values have not been written:
Then I looked in the debugger to show me the dataframe:
I thought it might be a formatting issue, and added the following lines:
In the debugger, it looks now fine:
But now, in the database, it wants to write the index column as text
... and interrupts the execution with an error:
Is there a way to get this going, i.e. to write df index data as a date into a MySQL DB using pandas to_sql in connection with a sqlalchemy engine?
Edit:
Table schema:
DataFrame Header:
It seems you are using the Date column as the primary key. I would suggest not using that alone as the primary key; instead, use Date + Ticker as a composite primary key.
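As for the to_sql side of the question, a hedged sketch (table name and connection details are assumptions, not from the original post): the index can be written as a proper DATE column via index_label and the dtype argument:
import pandas as pd
import sqlalchemy
from sqlalchemy.types import Date
engine = sqlalchemy.create_engine('mysql+pymysql://user:password@localhost/mydb')  # hypothetical
df.index = pd.to_datetime(df.index)  # make sure the index really is datetime-like
df.to_sql('prices',                # hypothetical table name
          engine,
          if_exists='append',
          index=True,
          index_label='Date',      # column name for the index in MySQL
          dtype={'Date': Date()})  # store it as DATE instead of TEXT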

Python PyTd teradata Query Into Pandas DataFrame

I'm using the PyTd teradata module to query data from Teradata and want to read it into a Pandas DataFrame
import teradata
import pandas as pd
# teradata connection
udaExec = teradata.UdaExec(appName="Example", version="1.0",
                           logConsole=False)
session = udaExec.connect(method="odbc", system="", username="", password="")
# Create empty dataframe with column names
query = session.execute("SELECT TOP 1 * FROM table")
cols = [str(d[0]) for d in query.description]
df = pd.DataFrame(columns=cols)
# Read data into dataframe
for row in session.execute("SELECT * FROM table"):
    print(type(row))
    df.append(row)
row is of teradata.util.Row class and can't be appended to the dataframe. I tried converting it to a list but the format gets messed up.
How can I read my data into a dataframe from Teradata using the teradata module? I'm not able to use the pyodbc module for this.
Is there a better way to create the empty dataframe with column names matching those in the database?
You can use pandas.read_sql :)
import teradata
import pandas as pd
# teradata connection
udaExec = teradata.UdaExec(appName="Example", version="1.0",
                           logConsole=False)
with udaExec.connect(method="odbc", system="", username="", password="") as session:
    query = "SELECT * FROM table"
    df = pd.read_sql(query, session)
Using 'with' ensures the session is closed after the query. I hope that helped :)
I know it's a little late, but I'm putting a note here nevertheless.
There are a few questions here.
How can I read my data into a dataframe from Teradata using the teradata module?
At the end of the day, a teradata.util.Row is simply a list, so a simple list operation should get things out of a Row, something like:
','.join(str(item) for item in row)
Pushing that into a pandas dataframe is then a list-to-DataFrame conversion exercise, as sketched below.
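A minimal sketch of that conversion, reusing session and cols from the question's own code and relying on Row being iterable like a list:
# collect each Row as a plain list, then build the DataFrame once at the end
rows = [list(row) for row in session.execute("SELECT * FROM table")]
df = pd.DataFrame(rows, columns=cols)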
I'm not able to use the pyodbc module for this.
I used teradata's Python module to do LDAP auth and it all worked fine, so I never had this requirement. Sorry.
Is there a better way to create the empty dataframe with column names matching those in the database?
I assume that, given a table name, you can query its schema (column names), convert that to a list, and create your pandas df from it?
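Perhaps something like this minimal sketch, which just reuses the pattern already in the question (TOP 1 keeps the probe query cheap):
# read only the cursor description to get the column names
probe = session.execute("SELECT TOP 1 * FROM table")
cols = [str(d[0]) for d in probe.description]
empty_df = pd.DataFrame(columns=cols)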
I know this is very late. You can use read_sql() from the pandas module; it returns a pandas dataframe.
Here is the reference:
http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_sql.html

Converting JSON into Python Dict with Postgresql data imported with SQLAlchemy

I've got a little bit of a tricky question here regarding converting JSON strings into Python data dictionaries for analysis in Pandas. I've read a bunch of other questions on this but none seem to work for my case.
Previously, I was simply using CSVs (and Pandas' read_csv function) to perform my analysis, but now I've moved to pulling data directly from PostgreSQL.
I have no problem using SQLAlchemy to connect to my engine and run my queries. My whole script runs the same as it did when I was pulling the data from CSVs. That is, until it gets to the part where I'm trying to convert one of the columns (namely, the 'config' column in the sample text below) from JSON into a Python dictionary. The ultimate goal of converting it into a dict is to be able to count the number of responses under the "options" field within the "config" column.
df = pd.read_sql_query('SELECT questions.id, config from questions ', engine)
df = df['config'].apply(json.loads)
df = pd.DataFrame(df.tolist())
df['num_options'] = np.array([len(row) for row in df.options])
When I run this, I get the error "TypeError: expected string or buffer". I tried converting the data in the 'config' column to string from object, but that didn't do the trick (I get another error, something like "ValueError: Expecting property name...").
If it helps, here's a snippet of data from one cell in the 'config' column (the code should return the result '6' for this snippet since there are 6 options):
{"graph_by":"series","options":["Strongbow Case Card/Price Card","Strongbow Case Stacker","Strongbow Pole Topper","Strongbow Base wrap","Other Strongbow POS","None"]}
My guess is that SQLAlchemy does something weird to JSON strings when it pulls them from the database? Something that doesn't happen when I'm just pulling CSVs from the database?
In recent Psycopg versions the PostgreSQL json(b) adaptation to Python is transparent, and Psycopg is the default SQLAlchemy driver for PostgreSQL, so the 'config' column already comes back as Python dicts and json.loads is not needed:
df['num_options'] = df['config'].apply(lambda cfg: len(cfg['options']))
From the Psycopg manual:
Psycopg can adapt Python objects to and from the PostgreSQL json and jsonb types. With PostgreSQL 9.2 and following versions adaptation is available out-of-the-box. To use JSON data with previous database versions (either with the 9.1 json extension, but even if you want to convert text fields to JSON) you can use the register_json() function.
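For those older setups, a minimal sketch of the register_json() route mentioned above (the DSN is a hypothetical example):
import psycopg2
from psycopg2.extras import register_json
conn = psycopg2.connect("dbname=my_database user=fred")  # hypothetical connection details
register_json(conn)  # json values on this connection now come back as Python objects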
Just sqlalchemy query:
q = session.query(
    Question.id,
    func.jsonb_array_length(Question.config["options"]).label("len")
)
Pure sql and pandas' read_sql_query:
sql = """\
SELECT questions.id,
jsonb_array_length(questions.config -> 'options') as len
FROM questions
"""
df = pd.read_sql_query(sql, engine)
Combine both (my favourite):
# take `q` from the above
df = pd.read_sql(q.statement, q.session.bind)
