I've got a little bit of a tricky question here regarding converting JSON strings into Python data dictionaries for analysis in Pandas. I've read a bunch of other questions on this but none seem to work for my case.
Previously, I was simply using CSVs (and Pandas' read_csv function) to perform my analysis, but now I've moved to pulling data directly from PostgreSQL.
I have no problem using SQLAlchemy to connect to my engine and run my queries. My whole script runs the same as it did when I was pulling the data from CSVs. That is, until it gets to the part where I'm trying to convert one of the columns (namely, the 'config' column in the sample text below) from JSON into a Python dictionary. The ultimate goal of converting it into a dict is to be able to count the number of responses under the "options" field within the "config" column.
df = pd.read_sql_query('SELECT questions.id, config from questions ', engine)
df = df['config'].apply(json.loads)
df = pd.DataFrame(df.tolist())
df['num_options'] = np.array([len(row) for row in df.options])
When I run this, I get the error "TypeError: expected string or buffer". I tried converting the data in the 'config' column to string from object, but that didn't do the trick (I get another error, something like "ValueError: Expecting property name...").
If it helps, here's a snippet of data from one cell in the 'config' column (the code should return the result '6' for this snippet since there are 6 options):
{"graph_by":"series","options":["Strongbow Case Card/Price Card","Strongbow Case Stacker","Strongbow Pole Topper","Strongbow Base wrap","Other Strongbow POS","None"]}
My guess is that SQLAlchemy does something weird to the JSON strings when it pulls them from the database, something that didn't happen when I was reading the same data from CSVs?
In recent Psycopg versions the PostgreSQL json(b) adaptation to Python is transparent, and Psycopg is the default SQLAlchemy driver for PostgreSQL. That means the values in the 'config' column already arrive as Python dicts, which is exactly why json.loads raises "TypeError: expected string or buffer". You can count the options directly:
df['num_options'] = df['config'].apply(lambda cfg: len(cfg['options']))
From the Psycopg manual:
Psycopg can adapt Python objects to and from the PostgreSQL json and jsonb types. With PostgreSQL 9.2 and following versions adaptation is available out-of-the-box. To use JSON data with previous database versions (either with the 9.1 json extension, but even if you want to convert text fields to JSON) you can use the register_json() function.
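A quick way to confirm this, as a minimal sketch assuming the engine and questions table from the question:
import pandas as pd

df = pd.read_sql_query('SELECT questions.id, config FROM questions', engine)

# psycopg2 has already deserialised the json column, which is why json.loads
# failed with "expected string or buffer"
print(type(df['config'].iloc[0]))  # <class 'dict'>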
Just sqlalchemy query:
q = session.query(
    Question.id,
    func.jsonb_array_length(Question.config["options"]).label("len")
)
Pure sql and pandas' read_sql_query:
sql = """\
SELECT questions.id,
jsonb_array_length(questions.config -> 'options') as len
FROM questions
"""
df = pd.read_sql_query(sql, engine)
Combine both (my favourite):
# take `q` from the above
df = pd.read_sql(q.statement, q.session.bind)
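For completeness, a sketch of the pieces the ORM query in the first approach assumes but does not show: a declarative Question model mapped to the questions table and a session bound to the engine (the connection URL below is a placeholder):
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Question(Base):
    __tablename__ = "questions"
    id = Column(Integer, primary_key=True)
    config = Column(JSONB)  # jsonb column holding the "options" array

engine = create_engine("postgresql+psycopg2://user:password@host/dbname")  # placeholder URL
session = Session(engine)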
Related
I've seen a few posts on this but nothing that works for me unfortunately.
Basically trying to get the SQLAlchemy (or pandas) column data types from a list of views in a MySQL database.
import sqlalchemy as sa
view = "myView"
engine = "..."
meta = sa.MetaData(engine, True)
This errors:
tb_data = meta.tables["tb_data"]
# KeyError: 'tb_data'
And I don't know what I'm supposed to do with this:
sa.Table(view, meta).columns
# <sqlalchemy.sql.base.ImmutableColumnCollection at 0x7f9cb264d4a0>
Saw this somewhere but not sure how I'm supposed to use it:
str_columns = filter(lambda column: isinstance(column.type, sa.TEXT), columns)
# <filter at 0x7f9caafab640>
Eventually what I'm trying to achieve is a list or dict of data types for a view that I can then use to load to a PostgreSQL database. Happy to consider alternatives outside of sqlalchemy and/or pandas if they exist (and are relatively trivial to implement).
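One possible route (a sketch under the assumption that engine points at the MySQL database, not a confirmed solution): reflect the view with SQLAlchemy's inspector or Table and collect the column types into a dict. The resulting dict can then be handed to pandas' to_sql(dtype=...) when loading into PostgreSQL.
import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://user:password@host/db")  # placeholder URL

# Inspector route: get_columns returns a list of dicts with 'name', 'type', etc.
inspector = sa.inspect(engine)
dtypes = {col["name"]: col["type"] for col in inspector.get_columns("myView")}

# Table-reflection route: gives Column objects you can filter by type
meta = sa.MetaData()
view = sa.Table("myView", meta, autoload_with=engine)
str_columns = [c.name for c in view.columns if isinstance(c.type, sa.TEXT)]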
Problem
I have a pandas dataframe and I'm trying to use pandas' DataFrame.to_sql() function to write it to an Oracle database. My Oracle database is 19c (19.3). Seems easy enough, right? Why won't it work??
I saw in a few other Stack Overflow posts that I should be using SQLAlchemy datatypes. Okay. Links:
Pandas and SQL Alchemy: Specify Column Data Types
Pandas to_sql changing datatype in database table
https://docs.sqlalchemy.org/en/14/dialects/oracle.html#oracle-data-types
from sqlalchemy.types import Integer, String
from sqlalchemy.dialects.oracle import NUMBER, VARCHAR2, DATE

oracle_dtypes = {
    'id': NUMBER(38, 0),
    'counts': Integer,
    'name': VARCHAR2(50),
    'swear_words': String(9999),
    'date': DATE()
}
df_upload.to_sql(
    "oracle_table",
    db.engine,
    schema="SA_COVID",
    if_exists="replace",
    index=False,
    dtype=oracle_dtypes
)
It keeps converting random groups of columns to CLOB or some other unexpected datatypes. What should I do?
Things I've tried
Here's what I've tried that didn't work:
truncating the table first (sending a SQL statement to the db from Python), then if_exists="append"
using if_exists="replace"
using only the Oracle-specific SQLAlchemy dialect datatypes
using only the generic SQLAlchemy datatypes
using a mix of both, just because I'm frustrated
Maybe it's an Oracle-specific issue?
Things I haven't tried:
Dropping the table and just recreating it before the insert
Running to_sql ad hoc and then sending a series of ALTER TABLE tbl_name MODIFY col_name statements
Related Links:
Changing the data type of a column in Oracle
Turns out I needed to double-check the incoming datatypes from the API into my pandas dataframe (I made a dumb assumption that the data was clean)... The API was yielding all strings, and using df.info() really helped.
I needed to convert all the integer, numeric, and date columns to the appropriate datatypes in Python (that was the main issue), and from there I could re-map the database datatypes. In short...
API (all strings) --> Python (set datatypes) --> Database (map datatypes using sqlalchemy)
I used the pd.Int64Dtype() for integer columns with null values, and 'datetime64[ns]' for datetimes.
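For reference, a hedged sketch of that conversion step, reusing the column names from the dtype dict above (the actual columns coming from the API may differ):
import pandas as pd

# the API delivered everything as strings, so coerce before mapping db types
df_upload["id"] = pd.to_numeric(df_upload["id"]).astype(pd.Int64Dtype())  # nullable int
df_upload["counts"] = pd.to_numeric(df_upload["counts"]).astype(pd.Int64Dtype())
df_upload["date"] = pd.to_datetime(df_upload["date"])  # datetime64[ns]

df_upload.info()  # confirm the dtypes before calling to_sql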
I faced a similar issue when I was using df.to_sql
import sqlalchemy as sa

df_upload.to_sql(
    "oracle_table",
    db.engine,
    schema="SA_COVID",
    if_exists="replace",
    index=False,
    dtype=oracle_dtypes
)
Change your dtypes like this:
from sqlalchemy.dialects import oracle

oracle_dtypes = {
    'id': oracle.NUMBER(38, 0),      # NUMBER and VARCHAR2 live in the Oracle dialect,
    'counts': sa.types.Integer,      # the generic types live in sa.types
    'name': oracle.VARCHAR2(50),
    'swear_words': sa.types.String(9999),
    'date': sa.types.DATE()
}
So I have several tables, one per product per year, with names like:
2020product5, 2019product5, 2018product6 and so on. I have added two custom parameters in Google Data Studio named year and product_id, but I could not use them in the table names themselves. I have used parameterized queries before, but only in conditions like where product_id = @product_id, and that setup only works if all of the data is in the same table, which is not the case here. In Python I would use string formatting like f"{year}product{product_id}", but that obviously does not work in this case...
Using BigQuery's default CONCAT and FORMAT functions does not help, as both throw the following validation error: Table-valued function not found: CONCAT at [1:15]
So how do I get around this and query BigQuery tables in Google Data Studio with Python-like string formatting in the table names, based on custom parameters?
After much research I (kinda) sorted it out. It turns out that querying schema-level entities (e.g. table names) dynamically is a database-level feature. BigQuery does not support formatting within a table name, so tables named as in the question (2020product5, 2019product5, 2018product6) cannot be queried directly. However, it does have the _TABLE_SUFFIX pseudo-column, which lets you address tables dynamically, provided the varying part of the name is at the end of the table name. (This feature also allows for date-wise partitioning, and many tools that use BigQuery as a data sink rely on it, so if you are using BigQuery as a data sink there is a good chance your original data source already does this.) Thus, table names like product52020, product52019, product62018 can be accessed dynamically, and of course from Data Studio too, using the following:
SELECT * FROM `project_salsa_101.dashboards.product*` WHERE _TABLE_SUFFIX = CONCAT(@product_id, @year)
P.S.: I used Python to write a quick-and-dirty script that looped through the products and years, copied each table into a new one with the suffix at the end, and dropped the old one. Adding the script (with the formatted strings) below in case it is useful for anyone with a similar setup:
import itertools

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'project_salsa_101-bq-admin.json')
project_id = 'project_salsa_101'
schema = 'dashboards'
client = bigquery.Client(credentials=credentials, project=project_id)

# product_ids and years are the lists of product ids and years to migrate
for product_id, year in itertools.product(product_ids, years):
    # read the old-style table into a dataframe
    df = client.query(f"""
        SELECT * FROM `{project_id}.{schema}.{year}product{product_id}`
    """).result().to_dataframe()
    # write it back with the varying part at the end of the table name
    df.to_gbq(project_id=project_id,
              destination_table=f'{schema}.product{product_id}{year}',
              credentials=service_account.Credentials.from_service_account_file(
                  'credentials.json'),
              if_exists='replace')
    # drop the old table
    client.query(f"""
        DROP TABLE `{project_id}.{schema}.{year}product{product_id}`""").result()
I have a Pandas dataframe that I'm writing out to Snowflake using SQLAlchemy engine and the to_sql function. It works fine, but I have to use the chunksize option because of some Snowflake limit. This is also fine for smaller dataframes. However, some dataframes are 500k+ rows, and at a 15k records per chunk, it takes forever to complete writing to Snowflake.
I did some research and came across the pd_writer method provided by Snowflake, which apparently loads the dataframe much faster. My Python script does complete faster and I see it creates a table with all the right columns and the right row count, but every single column's value in every single row is NULL.
I thought it was a NaN to NULL issue and tried everything possible to replace the NaNs with None, and while it does the replacement within the dataframe, by the time it gets to the table, everything becomes NULL.
How can I use pd_writer to get these huge dataframes written properly into Snowflake? Are there any viable alternatives?
EDIT: Following Chris' answer, I decided to try with the official example. Here's my code and the result set:
import os
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer


def create_db_engine(db_name, schema_name):
    return create_engine(
        URL(
            account=os.environ.get("DB_ACCOUNT"),
            user=os.environ.get("DB_USERNAME"),
            password=os.environ.get("DB_PASSWORD"),
            database=db_name,
            schema=schema_name,
            warehouse=os.environ.get("DB_WAREHOUSE"),
            role=os.environ.get("DB_ROLE"),
        )
    )


def create_table(out_df, table_name, idx=False):
    engine = create_db_engine("dummy_db", "dummy_schema")
    connection = engine.connect()
    try:
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )
    except ConnectionError:
        print("Unable to connect to database!")
    finally:
        connection.close()
        engine.dispose()
    return True


df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])
print(df.head())
create_table(df, "dummy_demo_table")
The code works fine with no hitches, but when I look at the table, which gets created, it's all NULLs. Again.
Turns out, the documentation (arguably, Snowflake's weakest point) is out of sync with reality. This is the real issue: https://github.com/snowflakedb/snowflake-connector-python/issues/329. All it needs is a single character in the column name to be upper case and it works perfectly.
My workaround is to simply do: df.columns = map(str.upper, df.columns) before invoking to_sql.
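Applied to the example from the question, a minimal sketch of the workaround:
import pandas as pd

df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])
df.columns = map(str.upper, df.columns)  # NAME, BALANCE
create_table(df, "dummy_demo_table")     # values now land instead of NULLs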
I have had this exact same issue; don't despair, there is a solution in sight. When you create a table in Snowflake from the Snowflake worksheet or Snowflake environment, it names the object and all columns and constraints in uppercase. However, when you create the table from Python using the dataframe, the object gets created in the exact case that you specified in your dataframe, in your case columns=['name', 'balance']. So when the insert happens, it looks for all-uppercase column names in Snowflake and cannot find them; it still does the insert but sets your two columns to NULL, as the columns are created as nullable.
The best way to get past this issue is to create your columns in uppercase in the dataframe: columns=['NAME', 'BALANCE'].
I do think this is something that Snowflake should address and fix, as it is not expected behavior.
Even if you try to select those columns by name from your table full of NULLs, you will get an error, e.g.:
select name, balance from dummy_demo_table
You would probably get an error like the following,
SQL compilation error: error line 1 at position 7 invalid identifier 'name'
BUT the following will work
SELECT * from dummy_demo_table
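A related detail: quoting the lowercase identifiers also avoids the compilation error, since quoted identifiers in Snowflake are case-sensitive (the values will still be NULL until the column-case issue is fixed). A minimal sketch, assuming the connection from the question:
import pandas as pd

# quoted identifiers preserve the lowercase names the dataframe created
pd.read_sql('select "name", "balance" from dummy_demo_table', connection)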
I'm writing a UDF in Python for a Hive query on Hadoop. My table has several bigint fields, and several string fields.
My UDF modifies the bigint fields, subtracts the modified versions into a new column (should also be numeric), and leaves the string fields as is.
When I run my UDF in a query, the results are all string columns.
How can I preserve or specify types in my UDF output?
More details:
My Python UDF:
import sys

for line in sys.stdin:
    # pre-process row
    line = line.strip()
    inputs = line.split('\t')
    # modify numeric fields, calculate new field
    inputs[0], inputs[1], new_field = process(int(inputs[0]), int(inputs[1]))
    # leave rest of inputs as is; they are string fields.
    # output row
    outputs = [new_field]
    outputs.extend(inputs)
    print '\t'.join([str(i) for i in outputs])  # doesn't preserve types!
I saved this UDF as myudf.py and added it to Hive.
My Hive query:
CREATE TABLE calculated_tbl AS
SELECT TRANSFORM(bigintfield1, bigintfield2, stringfield1, stringfield2)
USING 'python myudf.py'
AS (calculated_int, modified_bif1, modified_bif2, stringfield1, stringfield2)
FROM original_tbl;
Streaming sends everything through stdout; it is really just a wrapper on top of Hadoop Streaming under the hood. All types get converted to strings, which you handled accordingly in your Python UDF, and they come back into Hive as strings. A Python transform in Hive will never return anything but strings. You could try doing the transform in a subquery, and then cast the results to a type:
SELECT cast(calculated_int as bigint)
,cast( modified_bif1 as bigint)
,cast( modified_bif2 as bigint)
,stringfield1
,stringfield2
FROM (
SELECT TRANSFORM(bigintfield1, bigintfield2, stringfield1, stringfield2)
USING 'python myudf.py'
AS (calculated_int, modified_bif1, modified_bif2, stringfield1, stringfield2)
FROM original_tbl) A ;
Hive might let you get away with this; if it does not, you will need to save the results to a table, and then you can convert (cast) them to a different type in another query.
The final option is to just use a Java UDF. Map-only UDFs are not too bad, and they allow you to specify return types.
Update (from asker):
The above answer works really well. A more elegant solution I found reading the "Programming Hive" O'Reilly book a few weeks later is this:
CREATE TABLE calculated_tbl AS
SELECT TRANSFORM(bigintfield1, bigintfield2, stringfield1, stringfield2)
USING 'python myudf.py'
AS (calculated_int BIGINT, modified_bif1 BIGINT, modified_bif2 BIGINT, stringfield1 STRING, stringfield2 STRING)
FROM original_tbl;
Rather than casting, you can specify types right in the AS(...) line.