I am fetching data from MySQL with PySpark, but only for one table at a time. I want to fetch all tables from the MySQL database without setting up the JDBC connection again and again. See the code below.
Is it possible to simplify my code? Thank you in advance.
url = "jdbc:mysql://localhost:3306/dbname"
table_df=sqlContext.read.format("jdbc").option("url",url).option("dbtable","table_name").option("user","root").option("password", "root").load()
sqlContext.registerDataFrameAsTable(table_df, "table1")
table_df_1=sqlContext.read.format("jdbc").option("url",url).option("dbtable","table_name_1").option("user","root").option("password", "root").load()
sqlContext.registerDataFrameAsTable(table_df_1, "table2")
You need some way to get the list of tables in your MySQL database.
Either you find an SQL command that returns it, or you manually create a file that lists every table.
Then, assuming you have a Python list of table names called tablename_list, you can simply loop over it like this:
url = "jdbc:mysql://localhost:3306/dbname"
reader = (
    sqlContext.read.format("jdbc")
    .option("url", url)
    .option("user", "root")
    .option("password", "root")
)
for tablename in tablename_list:
    reader.option("dbtable", tablename).load().createTempView(tablename)
This will create a temporary view with the same name as the table. If you want a different name, you can replace tablename_list with a list of tuples (tablename_in_mysql, tablename_in_spark).
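If the Spark view names should differ from the MySQL table names, the tuple idea above could look roughly like this. It is only a sketch: the helper name register_tables and the example pairs are made up for illustration, and it uses createOrReplaceTempView instead of createTempView so that re-running it does not fail on an existing view.

```python
def register_tables(reader, name_pairs):
    """Load each MySQL table and register it under its Spark-side name."""
    for mysql_name, spark_name in name_pairs:
        df = reader.option("dbtable", mysql_name).load()
        df.createOrReplaceTempView(spark_name)


# Hypothetical mapping: (table name in MySQL, view name in Spark).
name_pairs = [("table_name", "table1"), ("table_name_1", "table2")]

# With the `reader` built above:
# register_tables(reader, name_pairs)
```

Passing the reader in as an argument keeps the connection options in one place, which is the point of the original question.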
@Steven already gave a perfect answer. To get the Python list of table names he mentions, you can use:
#list of the tables in the server
table_names_list = spark.read.format('jdbc'). \
    options(
        url='jdbc:postgresql://localhost:5432/', # database url (local, remote)
        dbtable='information_schema.tables',
        user='YOUR_USERNAME',
        password='YOUR_PASSWORD',
        driver='org.postgresql.Driver'). \
    load().\
    filter("table_schema = 'public'").select("table_name")
#DataFrame[table_name: string]
# table_names_list.collect()
# [Row(table_name='employee'), Row(table_name='bonus')]
table_names_list = [row.table_name for row in table_names_list.collect()]
print(table_names_list)
# ['employee', 'bonus']
Note that this example is for PostgreSQL. For MySQL, change the url and driver arguments, and filter table_schema on your database name instead of 'public'.
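For MySQL, as in the original question, a variant might look like the sketch below. It assumes the MySQL Connector/J 8.x driver (class com.mysql.cj.jdbc.Driver) is on the Spark classpath and that the database name is a trusted value, since it is interpolated into the query. The helper name mysql_table_list_options is made up for illustration.

```python
def mysql_table_list_options(url, database, user, password):
    """Build JDBC options that select the table names of one MySQL database."""
    # In MySQL, information_schema.tables.table_schema holds the database name.
    # The alias keeps the result column named exactly `table_name`.
    query = (
        "(SELECT table_name AS table_name "
        "FROM information_schema.tables "
        f"WHERE table_schema = '{database}') AS t"
    )
    return {
        "url": url,
        "dbtable": query,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",  # assumed: Connector/J 8.x
    }


options = mysql_table_list_options(
    "jdbc:mysql://localhost:3306/dbname", "dbname", "root", "root"
)

# With an active SparkSession named `spark`:
# rows = spark.read.format("jdbc").options(**options).load().collect()
# tablename_list = [row.table_name for row in rows]
```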
I have around 70 tables in one S3 bucket and I would like to move them to Redshift using Glue. I could move only a few tables. The rest have data type issues, because Redshift does not accept some of the data types. I resolved the issue with code that moves the tables one by one:
table1 = glueContext.create_dynamic_frame.from_catalog(
database="db1_g", table_name="table1"
)
table1 = table1.resolveChoice(
specs=[
("column1", "cast:char"),
("column2", "cast:varchar"),
("column3", "cast:varchar"),
]
)
table1 = glueContext.write_dynamic_frame.from_jdbc_conf(
frame=table1,
catalog_connection="redshift",
connection_options={"dbtable": "schema1.table1", "database": "db1"},
redshift_tmp_dir=args["TempDir"],
transformation_ctx="table1",
)
The same script is used for all other tables that have the data type issue.
But as I would like to automate this, I wrote a loop script that iterates through all the tables and writes them to Redshift. I have 2 issues with this script:
I am unable to move the tables to their respective schemas in Redshift.
I am unable to add an if condition in the loop for the tables that need a data type change.
client = boto3.client("glue", region_name="us-east-1")
databaseName = "db1_g"
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables["TableList"]
for table in tableList:
    tableName = table["Name"]
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx="datasource0"
    )
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="redshift",
        connection_options={
            "dbtable": tableName,
            "database": "schema1.db1",
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="datasink4",
    )
job.commit()
Specifying the Redshift schema name along with tableName, like schema1.tableName, throws an error saying that schema1 is not defined.
Can anybody help with changing the data type, inside the loop script itself, for all tables that require it?
So the first problem is fixed rather easily. The schema belongs in the dbtable attribute and not in database, like this:
connection_options={
    "dbtable": f"schema1.{tableName}",
    "database": "db1",
}
Your second problem is that you want to call resolveChoice inside of the for Loop, correct? What kind of error occurs there? Why doesn't it work?
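Pending an answer to those questions, one possible shape for the conditional cast is a lookup table of cast specs keyed by table name. This is only a sketch: CAST_SPECS and apply_casts are hypothetical names, and the specs for table1 are copied from the single-table script in the question.

```python
# Hypothetical per-table cast specs; only tables listed here get a type fix.
CAST_SPECS = {
    "table1": [
        ("column1", "cast:char"),
        ("column2", "cast:varchar"),
        ("column3", "cast:varchar"),
    ],
}


def apply_casts(frame, table_name, cast_specs=CAST_SPECS):
    """Return the frame with resolveChoice applied when the table has specs."""
    specs = cast_specs.get(table_name)
    if specs:
        return frame.resolveChoice(specs=specs)
    return frame


# Inside the loop, before writing to Redshift:
# datasource0 = apply_casts(datasource0, tableName)
```

Tables without an entry pass through unchanged, so the loop body stays the same for all 70 tables.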
I am trying to insert a GTiff-file into a specific PostGIS table using the raster2pgsql-command. So far I managed inserting the GTiff-file into the PostGIS database I am connected to. But this creates a new table with the file-name of the GTiff-file. I could also move the raster-data to the target table afterwards, but I suppose there is a more efficient way.
Here is an example:
import psycopg2
import os
tif_path = 'test.tif'
conn = psycopg2.connect(
host = 'localhost',
port = 5432,
user = 'postgres',
dbname = 'gisdb'
)
curs = conn.cursor()
curs.execute("SET postgis.gdal_enabled_drivers = 'ENABLE_ALL';")
os.system('raster2pgsql "%s" > temp.sql'%tif_path)
curs.execute(open('temp.sql','r').read())
Is there a way to insert the raster-data directly into an existing table?
I know I can use -a to append the raster to an existing table and specify the column name by using -f. But there doesn't seem to be a way to specify the name of the table.
If you want to specify the table yourself, pass its name as the last argument of the command, like this:
raster2pgsql -s 4326 -I -C -M C:\temp\test_1.tif -t 100x100 myschema.mytable > out.sql
If you want to add the raster to an existing table, you are right: you must use the "-a" option.
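Applied to the Python script in the question, the command could be built as follows. This is a sketch under assumptions: the raster column is assumed to be called rast, the SRID and tile size are taken from the command above, and both helper names are made up for illustration.

```python
import subprocess


def raster2pgsql_append_cmd(tif_path, table, column="rast", srid=4326, tile="100x100"):
    """Build a raster2pgsql call that appends a raster to an existing table."""
    return [
        "raster2pgsql",
        "-a",            # append to an existing table instead of creating one
        "-f", column,    # name of the raster column in that table (assumed)
        "-s", str(srid),
        "-t", tile,
        tif_path,
        table,           # target table, optionally schema-qualified
    ]


def write_append_sql(tif_path, table, out_path="temp.sql"):
    """Run raster2pgsql and store the generated SQL in out_path."""
    with open(out_path, "w") as out:
        subprocess.run(raster2pgsql_append_cmd(tif_path, table), stdout=out, check=True)


cmd = raster2pgsql_append_cmd("test.tif", "myschema.mytable")
```

Passing the arguments as a list to subprocess.run avoids the shell quoting that os.system needs for paths with spaces.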
I can create a BQ view by calling client.create_table but I could not find a way to update the SQL of the view.
To create:
table = bigquery.Table(table_ref)
table.view_query = view_query
client.create_table(table)
To update? (does not work)
table = client.get_table(table_ref)
table.view_query = view_query
client.update_table(table, [])
Thoughts?
The second argument to update_table is a list of fields to update in the API. By passing an empty list you are saying: don't update anything. Instead, pass in ['view_query'] as the update properties list.
table = client.get_table(table_ref)
table.view_query = view_query
client.update_table(table, ['view_query'])
Or as Elliot suggested in the comments, you can use DDL to do this operation.
I used CREATE OR REPLACE VIEW statement.
job = client.query('CREATE OR REPLACE VIEW `{}.{}.{}` AS {}'.format(client.project, dataset, view_name, view_query))
job.result()
I have the following query
INSERT INTO `min01_aggregated_data_800` (`datenum`,`Timestamp`,`QFlag_R6_WYaw`) VALUES ('734970.002777778','2012-04-11 00:04:00.000','989898') ON DUPLICATE KEY UPDATE `datenum`=VALUES(`datenum`);
INSERT INTO `min01_aggregated_data_100` (`datenum`,`Timestamp`,`QFlag_R6_WYaw`) VALUES ('734970.002777778','2012-04-11 00:04:00.000','989898') ON DUPLICATE KEY UPDATE `datenum`=VALUES(`datenum`);
INSERT INTO `min01_aggregated_data_300` (`datenum`,`Timestamp`,`QFlag_R6_WYaw`) VALUES ('734970.002777778','2012-04-11 00:04:00.000','989898') ON DUPLICATE KEY UPDATE `datenum`=VALUES(`datenum`);
I'm using the mysql.connector package to insert the data into MySQL:
self.db = mysql.connector.Connect( host = self.m_host, user = self.m_user, password = self.m_passwd, \
database = self.m_db, port = int( self.m_port ) )
self.con = self.db.cursor( cursor )
self.con.execute( query )
self.db.commit()
self.db.close()
self.con.close()
But I'm getting the following error: Use multi=True when executing multiple statements
I tried multi=True. In that case I don't get any exception, but the data is not inserted into MySQL. How can I insert multiple rows?
I see three options:
Send every query to the DB separately:
[...]
self.con.execute(query1)
self.con.execute(query2)
self.con.execute(query3)
[...]
[removed as it didn't apply here]
I am not very familiar with multi=True; however, there might be a solution that calls self.con.nextset() repeatedly. According to the docs, this is only for multiple result sets, but perhaps it is needed for a multi-query request as well.
You have three separate queries, so each one should be run separately, i.e.:
self.con.execute(query1)
self.con.execute(query2)
self.con.execute(query3)
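If the statements arrive as one string, a small helper can split them and issue one execute call per statement. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server; the helper name execute_each and the table names are made up, and the split on ";" is naive (it assumes no semicolon appears inside a string literal).

```python
import sqlite3


def execute_each(cursor, script):
    """Execute every non-empty statement in script with its own execute call."""
    # Naive split: assumes no ';' occurs inside string literals.
    statements = [part.strip() for part in script.split(";") if part.strip()]
    for statement in statements:
        cursor.execute(statement)
    return len(statements)


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t100 (datenum REAL, qflag INTEGER)")
cur.execute("CREATE TABLE t300 (datenum REAL, qflag INTEGER)")

script = (
    "INSERT INTO t100 (datenum, qflag) VALUES (734970.002777778, 989898);"
    "INSERT INTO t300 (datenum, qflag) VALUES (734970.002777778, 989898);"
)
executed = execute_each(cur, script)
conn.commit()
```

With mysql.connector the same helper should work on self.con. As far as I know, in connector versions that accept multi=True, execute returns an iterator that has to be consumed before the statements actually run, which may be why nothing was inserted in the question.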
I am migrating some data from other databases, so I am using raw SQL queries to insert data into the database. But I don't know how to get the last inserted id from raw SQL queries in Django. I have tried this:
affected_count1=cursor2.execute("table')")
and
SELECT IDENT_CURRENT(‘MyTable’)
but it gives me the error "(1305, 'FUNCTION pydev.SCOPE_IDENTITY does not exist')".
So please tell me how I can get the last inserted id in raw SQL queries in Django.
You can get the most recently created object like this:
obj = Foo.objects.latest('id')
more info here
Try this
LastInsertId = (TableName.objects.last()).id
In Django 1.6
obj = Foo.objects.latest('id')
obj = Foo.objects.earliest('id')
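Since the question is about raw SQL rather than the ORM, another option worth checking is the cursor's lastrowid attribute, which DB-API drivers commonly provide. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs on its own; my_table is a made-up table. In Django the cursor would come from django.db.connection.cursor(), and I am assuming its wrapper exposes the driver cursor's lastrowid, so verify that on your backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)

# lastrowid reports the id generated by the most recent INSERT on this cursor.
cursor.execute("INSERT INTO my_table (name) VALUES (?)", ("first",))
first_id = cursor.lastrowid

cursor.execute("INSERT INTO my_table (name) VALUES (?)", ("second",))
second_id = cursor.lastrowid

conn.commit()
print(first_id, second_id)
```

Unlike latest('id'), this reads the id from the same cursor that performed the insert, so a row inserted concurrently by another connection does not affect it.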