I have a table on Redshift to which I want to insert some data using a pyspark dataframe. The Redshift table has this schema:
CREATE TABLE admin.audit_of_all_tables
(
wh_table_name varchar,
wh_schema_name varchar,
wh_population_method integer,
wh_audit_date timestamp without time zone,
wh_percent_change numeric(15,5),
wh_s3_path varchar
)
DISTSTYLE AUTO;
In my dataframe, I want to keep values for only the first 4 columns and write that dataframe's data to this table.
My dataframe is something like this:
Now, I want to do df.write.format to my table on Redshift, but I need to somehow specify that I want to insert data to only the first four columns and pass no value for the last 2 columns (keeping them null by default).
Any idea how to specify this using dataframe.write.format (or any other method)?
Thanks for reading.
You can use selectExpr to select the first four columns plus two additional columns with null that have been cast to the required type:
df2 = df.selectExpr("table_name as wh_table_name",
                    "schema_name as wh_schema_name",
                    "population_method as wh_population_method",
                    "audit_date as wh_audit_date",
                    "cast(null as double) as wh_percent_change",
                    "cast(null as string) as wh_s3_path")
df2.write....
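For context, the final write might look something like the sketch below, using Spark's generic JDBC writer. The endpoint, credentials, and driver class are placeholders for your own connection details; for large loads the dedicated spark-redshift connector (which stages data in S3) is usually preferred.

# Minimal sketch of the write step; URL, user, and password are placeholders.
(df2.write
    .format("jdbc")
    .option("url", "jdbc:redshift://<host>:5439/<database>")  # placeholder endpoint
    .option("dbtable", "admin.audit_of_all_tables")
    .option("user", "<user>")                                 # placeholder credentials
    .option("password", "<password>")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")    # Redshift JDBC driver on the classpath
    .mode("append")                                           # append so existing rows are kept
    .save())

Because wh_percent_change and wh_s3_path are written as NULLs, they simply stay empty in the target table.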
Related
Hi, I'm recording live stock data to a DB (sqlite3) and, by mistake, unwanted data got into my DB.
For example,
date      name       price
20220107  A_company  10000
20220107  A_company  9000
20220107  B_company  500
20220107  B_company  400
20220107  B_company  200
In this table, rows 1 and 2 and rows 3, 4, and 5 are the same in [date, name] but different in [price].
I want to keep only the 'first' row of each such group.
date      name       price
20220107  A_company  10000
20220107  B_company  500
What I have done before is read the whole DB into Python and use pandas' drop_duplicates function:
import pandas as pd
import sqlite3
conn = sqlite3.connect("TRrecord.db")
query = pd.read_sql_query("SELECT * FROM TR_INFO", conn)
df = pd.DataFrame(query)
df.drop_duplicates(inplace=True, subset=['date', 'name'], ignore_index=True, keep='first')
However, as DB grows larger, I think this method won't be efficient in the long run.
How can I do this efficiently by using SQL?
There is no implicit 'first' concept in SQL: the database manager can store the records in any order, so the order has to be specified in SQL. If it is not specified (by ORDER BY), the order is determined by the database manager (SQLite in your case) and is not guaranteed; the same data and the same query can return rows in a different order at different times or on different installations.
Having said that, if you are OK with deleting any of the duplicates and retaining just one, you can use SQLite's rowid for ordering:
delete from MyTbl
where exists (select 1
              from MyTbl b
              where MyTbl.date = b.date
                and MyTbl.name = b.name
                and MyTbl.rowid > b.rowid);
This would delete, from your table, any row for which there is another with a smaller rowid (but the same date and name).
If, by 'first', you meant to keep the record that was inserted first, then you need a column to indicate when the record was inserted (an insert_date_time, or an autoincrementing number column, etc.), and use that, instead of rowid.
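If you want to run that cleanup from Python rather than a SQL shell, here is a minimal sketch; it assumes the table is called TR_INFO with columns date and name, as in the snippet above, and keeps the row with the smallest rowid per (date, name):

import sqlite3

conn = sqlite3.connect("TRrecord.db")
with conn:  # the connection context manager commits on success
    conn.execute("""
        DELETE FROM TR_INFO
        WHERE EXISTS (SELECT 1
                      FROM TR_INFO b
                      WHERE TR_INFO.date = b.date
                        AND TR_INFO.name = b.name
                        AND TR_INFO.rowid > b.rowid)
    """)
conn.close()

If you later add an insert timestamp or autoincrementing column, replace rowid in the comparison with that column.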
I have the following tables:
Table A
listData = {'id':[1,2,3],'date':['06-05-2021','07-05-2021','17-05-2021']}
tableA = pd.DataFrame(listData, columns=['id','date'])
Table B
detailData = {'code':['D123','F268','A291','D123','F268','A291'],'id':['1','1','1','2','2','2'],'stock':[5,5,2,10,11,8]}
tableB = pd.DataFrame(detailData, columns=['code','id','stock'])
OUTPUT TABLE
output = {'code':['D123','F268','A291'],'06-05-2021':[5,5,2],'07-05-2021':[10,11,8]}
pd.DataFrame(output,columns=['code','06-05-2021','07-05-2021'])
Note: the code above is hard-coded just to show the output; I need to generate the output table from Table A and Table B.
Here is a brief explanation of how the output table is generated, in case it is not self-explanatory.
The id column needs to be cross-referenced from Table A to Table B, so that each row of Table B gets its date.
Then all the unique dates should become columns, and the corresponding stock values need to be moved into the newly created date columns.
I am not sure where to start. I am new to pandas and have only ever used it for simple data manipulation. If anyone can suggest where to get started, it would be of great help.
Try:
tableA['id'] = tableA['id'].astype(str)
tableB.merge(tableA, on='id').pivot(index='code', columns='date', values='stock')
Output:
date  06-05-2021  07-05-2021
code
A291           2           8
D123           5          10
F268           5          11
Details:
First, merge on id; this is like doing a SQL join. For the merge, the dtypes must match, hence the astype conversion to str.
Next, reshape the dataframe using pivot to get code by date.
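Put together, a self-contained version of the above (the frames are built from the dictionaries in the question, and reset_index is only there to turn code back into a regular column):

import pandas as pd

tableA = pd.DataFrame({'id': [1, 2, 3],
                       'date': ['06-05-2021', '07-05-2021', '17-05-2021']})
tableB = pd.DataFrame({'code': ['D123', 'F268', 'A291', 'D123', 'F268', 'A291'],
                       'id': ['1', '1', '1', '2', '2', '2'],
                       'stock': [5, 5, 2, 10, 11, 8]})

tableA['id'] = tableA['id'].astype(str)   # dtypes must match before merging
result = (tableB.merge(tableA, on='id')   # SQL-style join on id
                .pivot(index='code', columns='date', values='stock')
                .reset_index())           # make 'code' a regular column again
print(result)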
I have two data sets in Excel, like below (Table 1 and Table 2). I am trying to get a result in Table 1 as Yes/No if the date matches for the corresponding ID in Table 2 (see the Result column in Table 1). Can you please let me know how this can be achieved using Excel formulas? Thanks.
Table 1 (image)
Table 2 (image)
You could try this:
The formula I've used is:
=IF(COUNTIFS($G$3:$G$6;A3;$H$3:$H$6;B3)=0;"No";"Yes")
I have a table which has 27 million ids.
I plan to update the average and count from another table, which is taking very long to complete.
Below is the update query (database: MySQL; I am using Python to connect to the database):
UPDATE dna_statistics
SET chorus_count =
    (SELECT count(*)
     FROM dna B
     WHERE B.music_id = <music_id>
       AND B.label = 'Chorus')
WHERE music_id = 916094
As scaisEdge already said, you need to check if there are indices on the two tables.
I would like to add to scaisEdge's answer that the order of the columns in the composite index should match the order in which you compare them.
You used
WHERE B.music_id = <music_id>
AND B.label = 'Chorus')
So your index should consist of the columns in order (music_id, label) and not (label, music_id).
I would have added this as comment, but I'm still 1 reputation point away from commenting.
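For reference, a hedged sketch of creating the indexes discussed above from Python, using the (music_id, label) ordering described in this answer. The question mentions connecting via Python; mysql-connector-python is assumed here, and the connection parameters and index names are placeholders.

import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(host="<host>", user="<user>",
                               password="<password>", database="<database>")
cur = conn.cursor()

# Composite index matching the WHERE clause: music_id first, then label
cur.execute("CREATE INDEX idx_dna_music_label ON dna (music_id, label)")

# Index on the column used to locate the row being updated
cur.execute("CREATE INDEX idx_stats_music ON dna_statistics (music_id)")

conn.commit()
conn.close()

Building indexes on a table this size can take a while, so consider running it during a quiet period.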
An UPDATE statement isn't a good solution for 27 million ids.
use EXCHANGE PARTITION instead
https://dev.mysql.com/doc/refman/5.7/en/partitioning-management-exchange.html
Be sure you have a composite index on table dna, columns (label, music_id), and an index on table dna_statistics, column (music_id).
I have a big dataset with 4.5 million rows and 150 columns. I want to create a table in my database, and I want to create an index for it.
There isn't an id column, and I would like to know if there is an easy way to find a column or combination of columns that is unique, so I can base my index on those.
I am using Python and pandas.
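One rough way to hunt for a candidate key with pandas is to check how many distinct values each column has, and then test whether a promising combination is duplicate-free. A minimal sketch follows; the tiny frame and the column names are stand-ins for your real data:

import pandas as pd

# Tiny stand-in for the real 4.5M-row frame; load your table here instead,
# e.g. with pd.read_csv(...) or pd.read_sql(...)
df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': ['x', 'y', 'x', 'y'],
                   'c': [10, 10, 20, 30]})

n_rows = len(df)
distinct = df.nunique()

# Columns whose number of distinct values equals the row count are unique on their own
print(distinct[distinct == n_rows])

# Check whether a combination of columns is duplicate-free (a candidate key)
candidate = ['a', 'b']  # hypothetical column names
is_unique = not df.duplicated(subset=candidate).any()
print(candidate, 'is unique' if is_unique else 'has duplicates')

Note that this check happens in memory, so with 4.5 million rows and 150 columns you may want to load only a subset of columns at a time.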