I found an SQLite database but I'm totally new to SQL and not sure where to start...
As you can see in the screenshot, I managed to open the data in a pandas df, and the daily flows are split across columns named "FLOW1", "FLOW2", ... "FLOW31".
I want to extract the full daily flow history and stack it into a separate column for each STATION_NUMBER (columns = station, row/index = datetime), e.g.:
date STATION#1 STATION#2 ...
1-1-1969 value value
2-1-1969 value value
Here is the small piece of code I used to get there:
import sqlite3
import pandas as pd

conn = sqlite3.connect(r"E:\Python\Data\hydrology_forcasting\Caravan\timeseries\csv\hydat" + r"\Hydat.sqlite3")

# list the tables in the database
for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(row)

# get the daily flow df shown in the screenshot
flow = pd.read_sql_query("SELECT * FROM DLY_FLOWS", conn)
My way would be to write a for loop that picks the values one at a time and copies them into a new df... but I'm sure this is not the most efficient way, and SQL must have a built-in method or something to do that...?
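If DLY_FLOWS follows the usual HYDAT layout (STATION_NUMBER, YEAR and MONTH columns alongside FLOW1..FLOW31, one row per station and month; that layout is an assumption here), pandas can do the reshaping without a Python loop: melt the day columns into rows, build a proper date, then pivot the stations into columns. A minimal sketch:
import sqlite3
import pandas as pd

conn = sqlite3.connect(r"E:\Python\Data\hydrology_forcasting\Caravan\timeseries\csv\hydat" + r"\Hydat.sqlite3")
flow = pd.read_sql_query("SELECT * FROM DLY_FLOWS", conn)
conn.close()

# melt FLOW1..FLOW31 into long format: one row per station, year, month and day
flow_cols = [c for c in flow.columns if c.startswith("FLOW") and c[4:].isdigit()]
long_df = flow.melt(id_vars=["STATION_NUMBER", "YEAR", "MONTH"],
                    value_vars=flow_cols, var_name="DAY", value_name="FLOW")
long_df["DAY"] = long_df["DAY"].str[4:].astype(int)

# build a datetime; impossible dates (e.g. Feb 30) become NaT and are dropped
long_df["date"] = pd.to_datetime(dict(year=long_df["YEAR"], month=long_df["MONTH"], day=long_df["DAY"]), errors="coerce")
long_df = long_df.dropna(subset=["date"])

# pivot: one column per STATION_NUMBER, one row per date
result = long_df.pivot_table(index="date", columns="STATION_NUMBER", values="FLOW")
pivot_table is used instead of pivot so that any duplicate station/date combinations are averaged rather than raising an error.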
I have a table in an sqlite3 database to which I am trying to add a list of values from Python as a new column. I can only find how to add a new column without values or how to change specific rows; could somebody help me with this?
This is probably the sort of thing you can google.
I can't find any way to add data to a column at creation time, but you can add a default value (ALTER TABLE table_name ADD COLUMN column_name NOT NULL DEFAULT default_value) if that helps at all. Afterwards you are going to have to add the data separately; a sketch of that two-step approach follows the linked questions below. These questions might be relevant:
Populate Sqlite3 column with data from Python list using for loop
Adding column in SQLite3, then filling it
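A minimal sketch of that two-step approach, assuming a hypothetical table my_table, a hypothetical column new_column, and that my_list is ordered the same way as the table's rowid:
import sqlite3

my_list = [10, 20, 30]  # one value per existing row

conn = sqlite3.connect("my_data.db")
cur = conn.cursor()

# 1. add the empty column (SQLite allows a column without an explicit type)
cur.execute("ALTER TABLE my_table ADD COLUMN new_column")

# 2. fill it row by row, pairing each list value with a rowid
rowids = [r[0] for r in cur.execute("SELECT rowid FROM my_table ORDER BY rowid")]
cur.executemany("UPDATE my_table SET new_column = ? WHERE rowid = ?",
                zip(my_list, rowids))

conn.commit()
conn.close()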
You can read the table into a pandas dataframe, add the list as a column to that dataframe, and then replace the original table from the dataframe:
import sqlite3
import pandas as pd

conn = sqlite3.connect("my_data.db")
df = pd.read_sql_query("SELECT * FROM my_table", conn)
conn.close()

# attach the Python list as a new column (lengths must match)
df['new_column'] = my_list

conn = sqlite3.connect("my_data.db")
# index=False keeps the dataframe index from being written as an extra column
df.to_sql(name='my_table', if_exists='replace', con=conn, index=False)
conn.close()
I have an MS Access db I've connected to with the following (ignore the ... in the driver name, it's working):
import pyodbc

driver = 'DRIVER={...'
con = pyodbc.connect(driver)
cursor = con.cursor()
I have a pandas dataframe which is exactly the same as a table in the db except that it has an additional column. Basically I pulled the table with pyodbc, merged it with external Excel data to add this additional column, and now I want to push the data back to the MS Access table with the new column. The column of merged_df containing the new information is merged_df['Item'].
Trying things like the below does not work; I've had a variety of errors.
cursor.execute("insert into ToolingData(Item) values (?)", merged_df['Item'])
con.commit()
How can I push the new column to the original table? Can I just write over the entire table instead? Would that be easier, since merged_df is literally the same thing with the addition of one new column?
If the target MS Access table does not already contain a field to house the data held within the additional column, you'll first need to execute an alter table statement to add the new field.
For example, the following will add a 255-character text field called item to the table ToolingData:
alter table ToolingData add column item text(255)
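Once the field exists, the values still have to be written row by row. One way (a sketch, not a tested Access recipe) is a parameterised UPDATE through pyodbc's executemany, assuming ToolingData has some key column (the ID used here is purely an assumption) that also exists in merged_df:
# reusing the pyodbc connection (con) and cursor opened above
# "ID" is a hypothetical key column; use whatever uniquely identifies a row
params = list(zip(merged_df['Item'].tolist(), merged_df['ID'].tolist()))
cursor.executemany("UPDATE ToolingData SET Item = ? WHERE ID = ?", params)
con.commit()
Rewriting the entire table is also possible, but an UPDATE keyed on an identifier leaves the other columns untouched.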
I have monthly Revenue data for the last 5 years and I am storing the DataFrames for the respective months in parquet format in append mode, partitioned by the month column. Here is the pseudo-code -
def Revenue(filename):
    df = spark.read.load(filename)
    .
    .
    df.write.format('parquet').mode('append').partitionBy('month').save('/path/Revenue')
Revenue('Revenue_201501.csv')
Revenue('Revenue_201502.csv')
Revenue('Revenue_201503.csv')
Revenue('Revenue_201504.csv')
Revenue('Revenue_201505.csv')
The df gets stored in parquet format on a monthly basis, one partition folder per month.
Question: How can I delete the parquet folder corresponding to a particular month?
One way would be to load all these parquet files into a big df, use a .where() clause to filter out that particular month, and then save it back in parquet format, partitioned by month, in overwrite mode, like this -
from pyspark.sql.functions import col, lit

# If we want to remove data from Feb, 2015
df = spark.read.format('parquet').load('Revenue.parquet')
df = df.where(col('month') != lit('2015-02-01'))
df.write.format('parquet').mode('overwrite').partitionBy('month').save('/path/Revenue')
But, this approach is quite cumbersome.
The other way is to directly delete the folder of that particular month, but I am not sure if that's the right way to approach things, lest we alter the metadata in an unforeseeable way.
What would be the right way to delete the parquet data for a particular month?
Spark supports deleting partitions, both data and metadata.
Quoting the Scala code comment:
/**
* Drop Partition in ALTER TABLE: to drop a particular partition for a table.
*
* This removes the data and metadata for this partition.
* The data is actually moved to the .Trash/Current directory if Trash is configured,
* unless 'purge' is true, but the metadata is completely lost.
* An error message will be issued if the partition does not exist, unless 'ifExists' is true.
* Note: purge is always false when the target is a view.
*
* The syntax of this command is:
* {{{
* ALTER TABLE table DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...] [PURGE];
* }}}
*/
In your case, there is no backing table.
We could register the dataframe as a temp table and use the above syntax (temp table documentation).
From pyspark, we could run the SQL using the syntax in this link
Sample:
df = spark.read.format('parquet').load('Revenue.parquet')
df.registerTempTable("tmp")
spark.sql("ALTER TABLE tmp DROP IF EXISTS PARTITION (month='2015-02-01') PURGE")
The statement below will only delete the metadata related to the partition information.
ALTER TABLE db.yourtable DROP IF EXISTS PARTITION(loaded_date="2019-08-22");
You need to set the tblproperties for your Hive external table to FALSE if you want to delete the data as well. This will turn your Hive table into a managed table.
alter table db.yourtable set TBLPROPERTIES('EXTERNAL'='FALSE');
You can then set it back to an external table:
alter table db.yourtable set TBLPROPERTIES('EXTERNAL'='TRUE');
I tried setting the given properties using the Spark session but was facing some issues:
spark.sql("""alter table db.test_external set tblproperties ("EXTERNAL"="TRUE")""")
pyspark.sql.utils.AnalysisException: u"Cannot set or change the preserved property key: 'EXTERNAL';"
I am sure there must be some way to do this. I ended up using Python: I defined the function below in pyspark and it did the job.
query=""" hive -e 'alter table db.yourtable set tblproperties ("EXTERNAL"="FALSE");ALTER TABLE db.yourtable DROP IF EXISTS PARTITION(loaded_date="2019-08-22");' """
def delete_partition():
print("I am here")
import subprocess
import sys
p=subprocess.Popen(query,shell=True,stderr=subprocess.PIPE)
stdout,stderr = p.communicate()
if p.returncode != 0:
print stderr
sys.exit(1)
>>> delete_partition()
This will delete both the metadata and the data.
Note: I have tested this with a Hive ORC external partitioned table, which is partitioned on loaded_date.
# Partition Information
# col_name data_type comment
loaded_date string
Update:
Basically, your data is lying at the HDFS location in subdirectories named as follows:
/Revenue/month=2015-02-01
/Revenue/month=2015-03-01
and so on
import subprocess

def delete_partition(month_delete):
    print("I am here")
    hdfs_path = "/some_hdfs_location/Revenue/month="
    final_path = hdfs_path + month_delete
    subprocess.call(["hadoop", "fs", "-rm", "-r", final_path])
    print("got deleted")

delete_partition("2015-02-01")
I have created a lookup table (in Excel) which has the table names for the various tables and the column names under these tables, along with all the SQL queries to be run on those fields. Below is an example table.
Results from all the SQL queries are in the format Total_Count and Fail_Count. I want to output these results, along with all the information in the current version of the lookup table and the date of execution, into a separate table.
Sample result Table:
Below is the code I used to get the results together in the same lookup table, but I am having trouble storing those results in a separate result_set table with separate columns for the total and fail counts.
from pandas import DataFrame

df['Results'] = ''
for index, row in df.iterrows():
    cur.execute(row["SQL_Query"])
    df.loc[index, 'Results'] = cur.fetchall()
It might be easier to load the queries into a DataFrame directly using the read_sql method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html.
The one caveat is that you need to use a SQLAlchemy engine for the connection. I also find itertuples easier to work with.
Once you have those, your code is merely:
df['Results'] = ''
for row in df.itertuples():
    df_result = pd.read_sql(row.SQL_Query, engine)  # engine: the SQLAlchemy engine
    df.loc[row.Table_Name, 'Total_count'] = df_result.total_count.iloc[0]
    df.loc[row.Table_Name, 'Fail_count'] = df_result.fail_count.iloc[0]
Your main problem above is that you're passing two columns from the result query to one column in df. You need to pass each column separately.
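To also get those results into a separate result_set table together with the lookup information and the execution date, one option is to collect one row per query and write the combined frame with to_sql. A sketch, assuming the same SQLAlchemy engine and that each query returns a single row with total_count and fail_count columns (those names are assumptions):
import datetime
import pandas as pd

rows = []
run_date = datetime.date.today()

for row in df.itertuples():
    df_result = pd.read_sql(row.SQL_Query, engine)
    rows.append({'Table_Name': row.Table_Name,
                 'SQL_Query': row.SQL_Query,
                 'Total_count': df_result.total_count.iloc[0],
                 'Fail_count': df_result.fail_count.iloc[0],
                 'Run_Date': run_date})

result_set = pd.DataFrame(rows)
result_set.to_sql('result_set', engine, if_exists='append', index=False)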
I'm trying to create a pandas DataFrame using the Snowflake packages in Python.
I run some query:
sf_cur = get_sf_connector()
sf_cur.execute("USE WAREHOUSE Warehouse;")
sf_cur.execute("""select Query"""
)
print('done')
The output is roughly 21k rows. Then using
df = pd.DataFrame(sf_cur.fetchall())
takes forever, even on a limited sample of only 100 rows. Is there a way to optimize this? Ideally the bigger query would be run in a loop, so being able to handle even bigger data sets would be ideal.
As fetchall() copies the whole result into memory, you should try to iterate over the cursor object directly and build the data frame inside the for block:
cursor.execute(query)
for row in cursor:
    # build the data frame
Another example, just to illustrate:
query = "Select ID from Users"
cursor.execute(query)
for row in cursor:
list_ids.append(row["ID"])
Use df = cur.fetch_pandas_all() to build a pandas DataFrame directly from the results.
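If memory is the real bottleneck, recent versions of the Snowflake connector (with the pandas extras installed) can also stream the result in chunks via fetch_pandas_batches(). A minimal sketch, reusing sf_cur from the question:
import pandas as pd

sf_cur.execute("""select Query""")

# each batch is already a pandas DataFrame holding one chunk of the result
batches = [batch for batch in sf_cur.fetch_pandas_batches()]
df = pd.concat(batches, ignore_index=True)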