I am trying to use a pandas DataFrame as the intermediate result set between two consecutive SQL queries.
I imagine it would look something like this:
import pandas.io.sql as pisql
import pyodbc
SQL_command1 = """
select * from tab_A
"""
result = pisql.read_frame(SQL_command1)
SQL_command2 = """
select *
from ? A
inner join B
on A.id = B.id
"""
pyodbc.cursor.execute(SQL_command2, result)
SQL_command2 above is just pseudocode: the ? is meant to take result as its input and give it the alias A.
This is my first time using pandas, so I'm not confident whether my idea is feasible or efficient. Can anyone enlighten me, please?
Many thanks.
The equivalent pandas code would look like this:
import pandas as pd
df_a = pd.read_csv('tab_a.csv')  # or read_sql or another read engine
df_b = pd.read_csv('tab_b.csv')
result = pd.merge(left=df_a,
                  right=df_b,
                  how='inner',
                  on='id')  # assuming 'id' is in both tables
And to select columns of a pandas DataFrame, it would be something like df_a[['col1','col2','col3']].
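For completeness, here is a minimal sketch of the read_sql variant the question describes; the connection string and table names are assumptions:

import pandas as pd
import pyodbc

# Placeholder connection details; substitute your own DSN/credentials.
conn = pyodbc.connect('DSN=my_dsn;UID=user;PWD=password')

# Pull both tables into DataFrames, then do the join in pandas
# instead of handing the intermediate result back to the database.
df_a = pd.read_sql('select * from tab_A', conn)
df_b = pd.read_sql('select * from tab_B', conn)
result = df_a.merge(df_b, how='inner', on='id')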
Related
I want to perform a MINUS operation, like the code below, on two tables.
SELECT
column_list_1
FROM
T1
MINUS
SELECT
column_list_2
FROM
T2;
This is after a migration has happened. I have these two databases that I have connected like this:
import cx_Oracle
import pandas as pd
import pypyodbc
source = cx_Oracle.connect(user, password, name)
df = pd.read_sql(""" SELECT * from some_table """, source)
target = pypyodbc.connect(blah, blah, db)
df_2 = pd.read_sql(""" SELECT * from some_table """, target)
How can I run a minus operation on the source and target databases in python using a query?
Choose either one:
Use Python in order to perform a "manual" MINUS operation between the two result sets.
Use Oracle by means of a dblink. In this case, you won't need to open two connections from Python.
If you have a DB link then you can do a MINUS, or you can use merge from pandas.
df = pd.read_sql(""" SELECT * from some_table """, source)
df_2 = pd.read_sql(""" SELECT * from some_table """, target)
df_combine = df.merge(df_2.drop_duplicates(), how='right', indicator=True)
print(df_combine)
There will be a new column _merge created in df_combine which will contain the values both (row present in both data frames) and right_only (row present only in df_2).
In the same way you can do a left merge to find the rows that exist only in df, as in the sketch below.
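A minimal sketch of the MINUS emulation (rows present in df but not in df_2), assuming both frames share the same columns:

# Emulate SQL MINUS: keep rows that appear in df but not in df_2.
minus = (df.merge(df_2.drop_duplicates(), how='left', indicator=True)
           .query("_merge == 'left_only'")
           .drop(columns='_merge'))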
I would like to do the following in pandas which I would do in SQL:
SELECT * FROM table WHERE field = value
I was thinking I could use something similar to an apply or map with a similar interface. Something like:
def filter_func(row):
    if row['name'] == 'Bob':
        return True
    else:
        return False

df.filter(filter_func, axis=1)
Similar to how I can do:
df['new_col'] = df.apply(apply_func, axis=1)
Is there a way to do something similar so that it only returns the rows where name='Bob' ?
The strangest thing is the pandas filter function says:
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.
That seems to me like quite a useless way to make use of a filter?
Use boolean indexing:
df_filter = df[df['name'] == 'Bob']
For the SQL IN operation we have isin:
# SELECT * FROM table WHERE field IN ('A','B')
df_filter = df[df['name'].isin(['A', 'B'])]
filter is badly named; it filters on the column (or index) labels, and it also appears as the groupby filter. To combine several conditions in a boolean mask, see the sketch below.
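A minimal sketch (the column names are assumptions) of combining conditions in a boolean mask; each comparison needs its own parentheses:

# AND is &, OR is |, NOT is ~ (bitwise operators, not and/or/not).
df_filter = df[(df['name'] == 'Bob') & (df['value'] > 10)]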
I have the following situation:
I have multiple tables that look like this:
table1 = pd.DataFrame([[0,1],[0,1],[0,1],[0,1],[0,1]], columns=['v1','v2'])
I have one DataFrame in which each element refers to one of these tables, something like this:
df = pd.DataFrame([table1, table2, table3, table4], columns=['tablename'])
I need to create a new column in df that contains, for each table, the values that I get from np.polyfit(table1['v1'],table1['v2'],1)
I have tried to do the following
for x in df['tablename']:
    df.loc[:, 'fit_result'] = np.polyfit(x['v1'], x['v2'], 1)
but it returns
TypeError: string indices must be integers
Is there a way to do it? Or am I writing something that makes no sense?
Note: in fact, these tables are HUGE and contain more than two columns.
You can try something like this:
import numpy as np
import pandas as pd
table1 = pd.DataFrame([[0.0,0.0],[1.0,0.8],[2.0,0.9],[3.0,0.1],[4.0,-0.8],[5.0,-1.0]], columns=['table1_v1','table1_v2'])
df = pd.DataFrame([['some','random'],['values','here']], columns=['example_1','example_2'])
def fit_result(v1, v2):
    return np.polyfit(v1, v2, 1)
df['fit_result'] = df.apply(lambda row: fit_result(table1['table1_v1'].values,table1['table1_v2'].values), axis=1)
df.head()
Output
example_1 example_2 fit_result
0 some random [-0.3028571428571428, 0.7571428571428572]
1 values here [-0.3028571428571428, 0.7571428571428572]
You only need to do this over all your DataFrames and concat all of them at the end; a sketch of that loop follows below.
df_col = pd.concat([df1,df2], axis=1) (https://www.datacamp.com/community/tutorials/joining-dataframes-pandas)
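A minimal sketch of that loop, assuming the tables are kept in a dict and each one has columns 'v1' and 'v2' as in the question:

import numpy as np
import pandas as pd

# Assumed layout: every table has columns 'v1' and 'v2'.
tables = {
    'table1': pd.DataFrame({'v1': [0.0, 1.0, 2.0, 3.0],
                            'v2': [0.0, 0.8, 0.9, 0.1]}),
    # 'table2': ..., 'table3': ...
}

# One degree-1 polyfit per table, collected into a new DataFrame.
fits = pd.DataFrame(
    [(name, np.polyfit(t['v1'], t['v2'], 1)) for name, t in tables.items()],
    columns=['tablename', 'fit_result'],
)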
I was trying to merge two dataframes using a less-than operator. But I ended up using pandasql.
Is it possible to do the same query below using pandas functions?
(Records may be duplicated, but that is fine as I'm looking for something similar to cumulative total later)
sql = '''select A.Name,A.Code,B.edate from df1 A
inner join df2 B on A.Name = B.Name
and A.Code=B.Code
where A.edate < B.edate '''
df4 = sqldf(sql)
The suggested duplicate seems similar, but I couldn't get the expected result from it. Also, the answer below looks very crisp.
Use:
df = df1.merge(df2, on=['Name','Code']).query('edate_x < edate_y')[['Name','Code','edate_y']]
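A minimal sketch with made-up data to show the merge-then-filter in action; the sample values are assumptions:

import pandas as pd

df1 = pd.DataFrame({'Name': ['a', 'a'], 'Code': [1, 1],
                    'edate': pd.to_datetime(['2020-01-01', '2020-09-01'])})
df2 = pd.DataFrame({'Name': ['a'], 'Code': [1],
                    'edate': pd.to_datetime(['2020-06-01'])})

# merge suffixes the clashing 'edate' columns as edate_x/edate_y,
# then query keeps only the rows satisfying the inequality.
out = (df1.merge(df2, on=['Name', 'Code'])
          .query('edate_x < edate_y')[['Name', 'Code', 'edate_y']])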
I have a pandas dataframe as follows...
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([['abc', 11], ['xyz', 21],['pqr',31]]),columns=['member','value'])
What I need is to collapse the column 'member' into a single string so that the output is an SQL query as follows...
"select * from table1 where member in ('abc','xyz','pqr')"
In my original data I have a large number of values. I couldn't figure out a way to collapse it from searching previous questions. Is there a way to do this without using a loop? Thanks.
You can use the tolist method of the column of interest, then convert it to a tuple and then to a string:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([['abc', 11], ['xyz', 21],['pqr',31]]),columns=['member','value'])
sql_string = "select * from table1 where member in "
members = tuple(df.member.tolist())
query = sql_string + str(members)
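One caveat with this sketch: a one-element tuple renders as ('abc',), and the trailing comma is not valid SQL. Joining the quoted values yourself avoids that edge case (this assumes the member values contain no quote characters):

# Build the IN (...) list by hand.
members = ", ".join("'{}'".format(m) for m in df['member'])
query = "select * from table1 where member in ({})".format(members)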