Problem with a to_timestamp transformation in SQL on Databricks (Python)

I am trying to implement a transformation in SQL on Databricks (from a Python notebook). I have tried several approaches without success; could someone validate the following, please?
%sql
SELECT aa.AccountID__c as AccountID__c_2,
aa.LastModifiedDate,
to_timestamp(aa.LastModifiedDate, "yyyy-MM-dd HH:mm:ss.SSS") as test
FROM EVENTS aa
The output shows that the conversion is not correct: the query still executes on the engine, but the test column comes back as null.
I have also tried taking a substring of the LastModifiedDate field from position 1 to 19, but without success.

The date format you provided does not match the actual format of that column, which is why you got null. That said, for standard date formats like the one you have, there is no need to provide a format string at all; simply using to_timestamp gives the correct result.
%sql
SELECT aa.AccountID__c as AccountID__c_2,
aa.LastModifiedDate,
to_timestamp(aa.LastModifiedDate) as test
FROM EVENTS aa
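For completeness, the same fix can be written with the DataFrame API instead of %sql. This is a minimal sketch, assuming the EVENTS table is registered in the metastore and the notebook's spark session is available:
from pyspark.sql import functions as F

events = spark.table("EVENTS")

result = events.select(
    F.col("AccountID__c").alias("AccountID__c_2"),
    F.col("LastModifiedDate"),
    F.to_timestamp("LastModifiedDate").alias("test"),  # format inferred for standard timestamps
)
result.show(truncate=False)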

Related

SAS/Python: How to make table name dynamic using date parameter?

I have connected SAS to Python and am trying to extract data. The date is part of the table name, and I am not allowed to change the naming format (total.gross_data_20211201). The table name should change dynamically with the salary date. I tried the method below, but it is not working; I expect user_date to be applied to make the table name dynamic. Please suggest.
import datetime as dt
import pandas as pd
import saspy

sas = saspy.SASsession()  # SAS connection, as already set up
salary_date = pd.to_datetime('01-Dec-2021').strftime('%d-%b-%Y').upper()
user_date = dt.datetime.strptime(salary_date, '%d-%b-%Y').strftime('%Y%m%d')
sas.symput('user_date', user_date)
xyz = sas.submit("""proc sql;
    create table data_extract as
    select ID
    FROM total.gross_data_20211201
    ;quit; """)
You can change this in either SAS or Python, but given that you're submitting it from Python, it makes sense to do it there.
In that case, you're just passing a string to sas.submit, so build it using any of the normal ways of manipulating strings in Python. A formatted string (f-string) would be the most typical way of doing that, as sketched below.
But you could also use &user_date in the SAS code, so
FROM total.gross_data_&user_date.
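A minimal sketch pulling the two options together, assuming a configured saspy session like the one in the question (table and column names copied from there):
import datetime as dt
import pandas as pd
import saspy

sas = saspy.SASsession()
salary_date = pd.to_datetime('01-Dec-2021').strftime('%d-%b-%Y').upper()
user_date = dt.datetime.strptime(salary_date, '%d-%b-%Y').strftime('%Y%m%d')

# Option 1: build the table name in Python with an f-string
xyz = sas.submit(f"""proc sql;
    create table data_extract as
    select ID
    from total.gross_data_{user_date}
    ;quit;""")

# Option 2: pass the value to SAS as a macro variable and resolve it in the SAS code
sas.symput('user_date', user_date)
xyz = sas.submit("""proc sql;
    create table data_extract as
    select ID
    from total.gross_data_&user_date.
    ;quit;""")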

Databricks not updating in SQL query

I am trying to replace special characters in a table column using a SQL query. However, I get the following error. Can anyone tell me what I did wrong or how I should approach this?
SQL QUERY
UPDATE wine SET description = REPLACE(description, '%', '')
ERROR
Error in SQL statement: AnalysisException: UPDATE destination only supports Delta sources.
Databricks only supports UPDATE on Delta (Delta Lake) tables. The error message indicates that you are trying the update on a non-Delta table, so you would have to convert your data source to Delta. For Parquet it is very simple:
CONVERT TO DELTA parquet.`path/to/table` [NO STATISTICS]
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]
See the Documentation for more details.
-- for a partitioned table, try this:
CONVERT TO DELTA parquet.`s3://path/to/table`
PARTITIONED BY (column_name INT);
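If you prefer to stay in Python, the same steps can be run through spark.sql from a Databricks notebook. A minimal sketch, assuming wine is a Parquet-backed metastore table (adjust if your source format differs):
# Convert the existing Parquet table in place; it keeps its name and location.
spark.sql("CONVERT TO DELTA wine")

# Once the table is Delta, the UPDATE from the question is accepted.
spark.sql("UPDATE wine SET description = REPLACE(description, '%', '')")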

PySpark sql compare records on each day and report the differences

The problem I have is with a dataset that shows which businesses are trading on which days. What I want to achieve is a report of which businesses were added on which day.
I managed to tidy up all the records using this SQL:
select [Date]
,Mnemonic
,securityDesc
,sum(cast(TradedVolume as money)) as TradedVolumSum
FROM SomeTable
group by [Date],Mnemonic,securityDesc
but I don't know how to compare each day's records with the previous day's and export the records that did not exist on the earlier day to another table. I tried SQL window functions (OVER ... PARTITION BY), but that makes it complex. I can use either SQL or a PySpark SQL/Python combination.
Could you let me know how I can resolve this problem?
Below is the DataFrame operation for your question. You might need to tweak it a little, since I don't have your sample data and wrote the code from the screenshot alone; please let me know if it solves your problem:
import pyspark.sql.functions as F
from pyspark.sql import Window

# For each security, order its rows by date; the first date in the window
# is the day that business first appears.
some_win = Window.partitionBy("securityDesc").orderBy(F.col("Date").asc())

some_table.withColumn(
    "business_added_day",
    F.first(F.col("Date")).over(some_win)
).select(
    "business_added_day",
    "securityDesc",
    "TradedVolumSum",
    "Mnemonic"
).distinct().orderBy("business_added_day").show()
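If only the first trading day per business is needed, a plain aggregation may read more directly. This is an untested sketch against the same DataFrame and column names used above:
import pyspark.sql.functions as F

first_seen = (
    some_table
    .groupBy("securityDesc", "Mnemonic")
    .agg(F.min("Date").alias("business_added_day"))
    .orderBy("business_added_day")
)
first_seen.show()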

Converting JSON into Python Dict with Postgresql data imported with SQLAlchemy

I've got a little bit of a tricky question here regarding converting JSON strings into Python data dictionaries for analysis in Pandas. I've read a bunch of other questions on this but none seem to work for my case.
Previously, I was simply using CSVs (and Pandas' read_csv function) to perform my analysis, but now I've moved to pulling data directly from PostgreSQL.
I have no problem using SQLAlchemy to connect to my engine and run my queries. My whole script runs the same as it did when I was pulling the data from CSVs. That is, until it gets to the part where I'm trying to convert one of the columns (namely, the 'config' column in the sample text below) from JSON into a Python dictionary. The ultimate goal of converting it into a dict is to be able to count the number of responses under the "options" field within the "config" column.
import json
import numpy as np
import pandas as pd

df = pd.read_sql_query('SELECT questions.id, config FROM questions', engine)
df = df['config'].apply(json.loads)
df = pd.DataFrame(df.tolist())
df['num_options'] = np.array([len(row) for row in df.options])
When I run this, I get the error "TypeError: expected string or buffer". I tried converting the data in the 'config' column to string from object, but that didn't do the trick (I get another error, something like "ValueError: Expecting property name...").
If it helps, here's a snippet of data from one cell of the 'config' column (the code should return the result '6' for this snippet since there are 6 options):
{"graph_by":"series","options":["Strongbow Case Card/Price Card","Strongbow Case Stacker","Strongbow Pole Topper","Strongbow Base wrap","Other Strongbow POS","None"]}
My guess is that SQLAlchemy does something weird to JSON strings when it pulls them from the database? Something that doesn't happen when I'm just pulling CSVs from the database?
In recent Psycopg versions the PostgreSQL json(b) adaptation to Python is transparent, and Psycopg is the default SQLAlchemy driver for PostgreSQL. The 'config' values therefore already arrive as Python dicts, which is why json.loads raises "TypeError: expected string or buffer"; you can work with the column directly:
options = df['config'].apply(lambda c: c['options'])
From the Psycopg manual:
Psycopg can adapt Python objects to and from the PostgreSQL json and jsonb types. With PostgreSQL 9.2 and following versions adaptation is available out-of-the-box. To use JSON data with previous database versions (either with the 9.1 json extension, but even if you want to convert text fields to JSON) you can use the register_json() function.
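Putting it together, counting the options needs no json.loads at all. A minimal sketch reusing the engine and the questions table from the question:
import pandas as pd

df = pd.read_sql_query("SELECT questions.id, config FROM questions", engine)
print(type(df.loc[0, "config"]))  # dict, not str, so no JSON parsing is needed
df["num_options"] = df["config"].apply(lambda c: len(c["options"]))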
Using just a SQLAlchemy query (this assumes an ORM model Question mapped to the questions table):
from sqlalchemy import func

q = session.query(
    Question.id,
    func.jsonb_array_length(Question.config["options"]).label("len"),
)
Pure SQL and pandas' read_sql_query:
sql = """\
SELECT questions.id,
       jsonb_array_length(questions.config -> 'options') AS len
FROM questions
"""
df = pd.read_sql_query(sql, engine)
Combine both (my favourite):
# take `q` from the above
df = pd.read_sql(q.statement, q.session.bind)

Preserving column data types in Hadoop UDF output (Streaming)

I'm writing a UDF in Python for a Hive query on Hadoop. My table has several bigint fields, and several string fields.
My UDF modifies the bigint fields, computes a new column from the modified versions by subtraction (this should also be numeric), and leaves the string fields as they are.
When I run my UDF in a query, the results are all string columns.
How can I preserve or specify types in my UDF output?
More details:
My Python UDF:
import sys

for line in sys.stdin:
    # pre-process row
    line = line.strip()
    inputs = line.split('\t')
    # modify numeric fields, calculate new field (process() is defined elsewhere in the script)
    inputs[0], inputs[1], new_field = process(int(inputs[0]), int(inputs[1]))
    # leave rest of inputs as is; they are string fields
    # output row
    outputs = [new_field]
    outputs.extend(inputs)
    print('\t'.join([str(i) for i in outputs]))  # doesn't preserve types!
I saved this UDF as myudf.py and added it to Hive.
My Hive query:
CREATE TABLE calculated_tbl AS
SELECT TRANSFORM(bigintfield1, bigintfield2, stringfield1, stringfield2)
USING 'python myudf.py'
AS (calculated_int, modified_bif1, modified_bif2, stringfield1, stringfield2)
FROM original_tbl;
Streaming sends everything through stdout; it is really just a wrapper on top of Hadoop Streaming under the hood. All types get converted to strings, which you handled accordingly in your Python UDF, and they come back into Hive as strings. A Python TRANSFORM in Hive will never return anything but strings. You could try doing the transform in a subquery and then casting the results to a type:
SELECT cast(calculated_int AS bigint)
      ,cast(modified_bif1 AS bigint)
      ,cast(modified_bif2 AS bigint)
      ,stringfield1
      ,stringfield2
FROM (
    SELECT TRANSFORM(bigintfield1, bigintfield2, stringfield1, stringfield2)
    USING 'python myudf.py'
    AS (calculated_int, modified_bif1, modified_bif2, stringfield1, stringfield2)
    FROM original_tbl) A;
Hive might let you get away with this; if it does not, you will need to save the results to a table and then cast them to a different type in another query.
The final option is to just use a Java UDF. Map-only UDFs are not too bad, and they allow you to specify return types.
Update (from asker):
The above answer works really well. A more elegant solution I found reading the "Programming Hive" O'Reilly book a few weeks later is this:
CREATE TABLE calculated_tbl AS
SELECT TRANSFORM(bigintfield1, bigintfield2, stringfield1, stringfield2)
USING 'python myudf.py'
AS (calculated_int BIGINT, modified_bif1 BIGINT, modified_bif2 BIGINT, stringfield1 STRING, stringfield2 STRING)
FROM original_tbl;
Rather than casting, you can specify types right in the AS(...) line.
