I created my own database and loaded my CSV data into it via Python (using the sqlite3 package), so queries work, though I run them from Python. In short, the table holds: the ID of a user, the type of action performed, and the ID of the object the action was performed on.
df1 and df2 are the DataFrames.
import pandas as pd
import sqlite3

connection = sqlite3.connect("test.db")
c = connection.cursor()

df1.to_sql('tab1', con=connection, if_exists='replace', chunksize=1000)
df2.to_sql('tab2', con=connection, if_exists='replace', chunksize=1000)
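Note that by default to_sql also writes the DataFrame index as an extra column; if that's not wanted, index=False skips it, e.g.:

df1.to_sql('tab1', con=connection, if_exists='replace', chunksize=1000, index=False)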
In this SQL command I need to select the columns, join one table, and select only data from a certain time range. I include it here for completeness, but I won't show all the columns in the example table since they're only used for filtering. Note that the subquery works on its own, without the pivot.
sql_command = """
SELECT * FROM
(SELECT tab1.action,
tab1.iduser,
tab2.idobject
FROM tab1,
tab2
LEFT JOIN tab2 ON tab1.objectid = tab2.objectid
WHERE tab2.Datetime BETWEEN '2020-01-29' AND '2020-02-04'
) AS source
PIVOT(
COUNT(tab1.iduser)
FOR tab1.action IN (
[a],
[b],
[c])
) AS pivot_table;"""
# execute query
c.execute(sql_command)
This is essentially the table I get from the subquery:
iduser   action   idobject
--------------------------------------
1        a        Obj1
1        b        Obj2
1        c        Obj3
1        a        Obj2
2        a        Obj1
2        c        Obj2
2        a        Obj2
2        b        Obj1
3        a        Obj1
3        c        Obj3
Now I want to create the following table, i.e. count how many times each type of action was performed on each object.
action   idobject   Count
--------------------------------------
a        Obj1       3
         Obj2       2
         Obj3       0
b        Obj1       1
         Obj2       1
         Obj3       0
c        Obj1       0
         Obj2       1
         Obj3       2
If I run the code above with the PIVOT, I get the error "Error: Incorrect syntax near "PIVOT"".
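For what it's worth, SQLite has no PIVOT operator at all (that syntax belongs to SQL Server), so the usual workaround is plain aggregation. A minimal sketch, assuming the column names from the question, with a cross join of the distinct actions and objects so that the zero counts also show up:

sql_command = """
SELECT acts.action,
       objs.idobject,
       COUNT(src.iduser) AS Count
FROM (SELECT DISTINCT action FROM tab1) AS acts
CROSS JOIN (SELECT DISTINCT idobject FROM tab2) AS objs
LEFT JOIN (SELECT tab1.action,
                  tab1.iduser,
                  tab2.idobject
           FROM tab1
           LEFT JOIN tab2 ON tab1.objectid = tab2.objectid
           WHERE tab2.Datetime BETWEEN '2020-01-29' AND '2020-02-04'
          ) AS src
       ON src.action = acts.action AND src.idobject = objs.idobject
GROUP BY acts.action, objs.idobject;"""
c.execute(sql_command)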
I am using pandasql to update or replace values in columns using an ID from another table.
I have two tables; in one of them I am trying to update or replace the values.
Table A
id start_destination end_destination
=======================================
1 3431 3010
2 3521 3431
3 3010 3521
Table B
destination_id destination_name
=======================================
3010 NameA
3431 NameB
3521 NameC
I am trying to write an SQL query to create the following output
id start_destination end_destination
=======================================
1 NameB NameA
2 NameC NameB
3 NameA NameC
I tried
update TableA
set start_destination = TableB.destination_name
from TableB
where TableB.destination_id = TableA.start_destination
But I was getting an error
(sqlite3.OperationalError) near "from": syntax error
In the real dataset there are more than 10 columns in Table A, all of which I need to keep.
Also, if there is a destination_id that cannot be matched with either start or end_destination, it should be null.
You may try joining table B to table A, twice:
SELECT
a.id,
b1.destination_name AS start_destination,
b2.destination_name AS end_destination
FROM TableA a
LEFT JOIN TableB b1
ON b1.destination_id = a.start_destination
LEFT JOIN TableB b2
ON b2.destination_id = a.end_destination
ORDER BY
a.id;
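As an aside, the UPDATE ... FROM syntax attempted in the question is valid in SQLite 3.33.0 and newer, so on a recent SQLite the original statement works as written:

UPDATE TableA
SET start_destination = TableB.destination_name
FROM TableB
WHERE TableB.destination_id = TableA.start_destination;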
I have two tables. TableA:
primary key   join_id   distance   value
========================================
1             1         50         A
2             1         100        B
3             1         150        C
4             2         50         AA
5             2         100        BB
6             2         150        CC
Table B:
join_id   other_value
=====================
1         X
2         Y
And I want to form Table C:
join_id   other_value   dist_50   dist_100   dist_150
=====================================================
1         X             A         B          C
2         Y             AA        BB         CC
What is the simplest strategy for this? I'm working with pandas DataFrames, which I convert into an sqlite3 database so I can use pd.read_sql() for queries.
So far, I've tried writing a loop to create a new table for each distance, using a WHERE filter.
distances = [50, 100, 150]
for dist in distances:
    int_tableA_df = pd.read_sql('SELECT * FROM tableA '
                                'WHERE distance == {}'.format(dist), mydb)
    int_tableA_df.to_sql('int_tableA{}'.format(dist), mydb)
This does work, but I can't imagine it's the optimal solution, particularly when scaled up (it yields a lot of tables). I then tried to loop in a similar way to join with each of those intermediate tables (assume tableB_df is previously defined).
for dist in distances:
    tableB_df.to_sql('tableB{}'.format(dist), mydb)
    tableB_df = pd.read_sql('SELECT tableB{0}.*, tableA{0}.distance AS dist{0} '
                            'FROM tableB{0} '
                            'LEFT JOIN tableA{0} '
                            'ON tableB{0}.result_id = tableA{0}.result_id'.format(dist), mydb)
This gives me a duplicate column level 0 error. I'm not sure I understand why I would have duplicate columns when I'm creating a new table on each loop iteration and just adding one column (with a distinct name) to it. Any insight into where I'm going wrong would be appreciated!
Also, there has to be a cleaner way to solve this, right? I know I could just write out a series of left joins for this small example, but that won't scale well with a lot more distances.
Thank you!
Consider conditional aggregation for a single SQL query pull:
sql = '''SELECT a.join_id,
                b.other_value,
                MAX(CASE WHEN a.distance = 50 THEN a.value END) AS dist_50,
                MAX(CASE WHEN a.distance = 100 THEN a.value END) AS dist_100,
                MAX(CASE WHEN a.distance = 150 THEN a.value END) AS dist_150
         FROM tableA a
         LEFT JOIN tableB b ON a.join_id = b.join_id
         GROUP BY a.join_id, b.other_value
      '''
dist_df = pd.read_sql(sql, mydb)
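Alternatively, since the data already lives in pandas, the same reshaping can be done without SQL at all. A sketch, assuming tableA_df and tableB_df are the DataFrames backing tableA and tableB:

# Pivot distance values into columns, then attach other_value from table B.
wide = (tableA_df.pivot(index='join_id', columns='distance', values='value')
                 .add_prefix('dist_')
                 .reset_index())
table_c = tableB_df.merge(wide, on='join_id', how='left')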
I have a SQL Server stored procedure that returns 3 separate tables.
How can I store each of these tables in a different DataFrame using pandas?
Something like:
df1 - first table
df2 - second table
df3 - third table
Where should I start looking?
Thank you
import pandas as pd
import pyodbc
from datetime import datetime
param = datetime(year=2019,month=7,day=31)
query = """EXECUTE [dbo].PythonTest_USIC_TreatyYear_ReportingPackage #AsOFDate = '{0}'""".format(param)
conn = pyodbc.connect('DRIVER={SQL Server};server=myserver;DATABASE=mydatabase;Trusted_Connection=yes;')
df = pd.read_sql_query(query, conn)
print(df.head())
You should be able to just iterate through the result sets, convert them to DataFrames, and append those DataFrames to a list. For example, given the stored procedure
CREATE PROCEDURE dbo.MultiResultSP
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    SELECT 1 AS [UserID], N'Gord' AS [UserName]
    UNION ALL
    SELECT 2 AS [UserID], N'Elaine' AS [UserName];

    SELECT N'pi' AS [Constant], 3.14 AS [Value]
    UNION ALL
    SELECT N'sqrt_2' AS [Constant], 1.41 AS [Value]
END
the Python code would look something like this:
data_frames = []

crsr = cnxn.cursor()
crsr.execute("EXEC dbo.MultiResultSP")
result = crsr.fetchall()
while result:
    col_names = [x[0] for x in crsr.description]
    data = [tuple(x) for x in result]  # convert pyodbc.Row objects to tuples
    data_frames.append(pd.DataFrame(data, columns=col_names))
    if crsr.nextset():
        result = crsr.fetchall()
    else:
        result = None

# check results
for df in data_frames:
    print(df)
    print()
""" console output:
UserID UserName
0 1 Gord
1 2 Elaine
Constant Value
0 pi 3.14
1 sqrt_2 1.41
"""
I'm fairly new to SQLAlchemy and I'm having some issues generating the SQL code I'm looking for.
Ultimately, I'm trying to join two different subsets of table2 to table1 with the following SQL query:
SELECT table1.date, a1.id AS name1_id, a2.id AS name2_id
FROM table1
LEFT JOIN table2 AS a1
    ON table1.name1 = a1.label AND a1.lookup_id = 1000
LEFT JOIN table2 AS a2
    ON table1.name2 = a2.label AND a2.lookup_id = 2000
Here's what I have so far using sqlalchemy:
q_generate = (
select([table1.c.date,
a1.id.label('name1_id'),
a2.id.label('name2_id')])
.select_from(table1
.outerjoin(table2.alias(name='a1'),
and_(
table2.c.lookup_id == 1000,
table1.c.name1 == table2.c.label
))
.outerjoin(table2.alias(name='a2'),
and_(
table2.c.lookup_id == 2000,
table1.c.name2 == table2.c.label
))
)
)
which produces the following error:
NameError: name 'a1' is not defined
Is there a special way that aliased table names must be referenced? What am I missing here? I think the error has something to do with these lines, but I can't figure out exactly how to get this to work:
...
a1.id.label('name1_id'),
a2.id.label('name2_id')])
...
Thank you!
Yes, do this:
a1 = table2.alias(name='a1')
a2 = table2.alias(name='a2')
q_generate = (
select([table1.c.date,
a1.c.id.label('name1_id'),
a2.c.id.label('name2_id')])
.select_from(table1.outerjoin(a1, ...).outerjoin(a2, ...)))
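Spelled out with the same join conditions as in the question (note that the conditions now reference the aliases through a1.c and a2.c rather than table2.c):

a1 = table2.alias(name='a1')
a2 = table2.alias(name='a2')

q_generate = (
    select([table1.c.date,
            a1.c.id.label('name1_id'),
            a2.c.id.label('name2_id')])
    .select_from(
        table1
        .outerjoin(a1, and_(a1.c.lookup_id == 1000,
                            table1.c.name1 == a1.c.label))
        .outerjoin(a2, and_(a2.c.lookup_id == 2000,
                            table1.c.name2 == a2.c.label))
    )
)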
I want to execute a SQL query with a JOIN where I can propagate the table aliases into the resulting dictionary keys. For example, I may have a query
query = """
SELECT t1.col1,t2.col1,t2.col2
FROM table1 t1 JOIN table2 t2
ON t1.col0=t2.col0
"""
and I want the output to maintain the t1, t2 aliases, since I have duplicate column names (col1). I would run
con = sqlite3.connect(dbpath, isolation_level=None, detect_types=sqlite3.PARSE_DECLTYPES)

def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d

cur = con.cursor()
cur.row_factory = dict_factory
result = cur.execute(query).fetchone()
But this overwrites the col1 value. How can I have it return, say,
{'t1.col1':123, 't2.col1':234, 't2.col2':345}
Thanks!
The documentation says:
The name of a result column is the value of the "AS" clause for that column, if there is an AS clause. If there is no AS clause then the name of the column is unspecified and may change from one release of SQLite to the next.
So you have to do:
SELECT t1.col1 AS "t1.col1", t2.col1 AS "t2.col1", ...
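Applied to the query from the question, that would look something like this (the dict_factory above then picks up the aliased names as keys):

query = """
    SELECT t1.col1 AS "t1.col1",
           t2.col1 AS "t2.col1",
           t2.col2 AS "t2.col2"
    FROM table1 t1 JOIN table2 t2
    ON t1.col0 = t2.col0
"""
result = cur.execute(query).fetchone()
# e.g. {'t1.col1': 123, 't2.col1': 234, 't2.col2': 345}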