Here is my code:
import pandas as pd
from sqlalchemy import create_engine
db_username = "my_username"
db_pw = "my_password"
db_to_use = "my_database"
#####
engine = create_engine(
    "postgresql://" +
    db_username + ":" +
    db_pw +
    "@localhost:5432/" +
    db_to_use
)
#####
connection = engine.connect()
fac_id_list = connection.execute(
    """
    select distinct
        a.name,
        replace(regexp_replace(regexp_replace(a.logo_url, '.*/logo', '', 'i'), 'production.*', '', 'i'), '/', '') as new_logo
    from
        sync_locations as a
    inner join
        public.vw_locations as b
    on
        a.name = b.location_name
    order by
        new_logo
    """
)
I want to put the results of fac_id_list into two separate lists: one list containing all of the values from a.name, and the other all of the values from new_logo.
How can I do this?
sql_results = []
for row in fac_id_list:
    sql_results.append(row)
This puts every column in my SQL query into a list, but I want them separated.
When you loop over the results, you can unpack each row into separate variables and append them to the corresponding lists:
names = []
logos = []
for name, logo in fac_id_list:
    names.append(name)
    logos.append(logo)
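Alternatively, if you are happy to materialize the whole result set first, here is a minimal sketch (assuming fac_id_list is the result object from the execute call above) that fetches everything and splits the columns with list comprehensions:
rows = fac_id_list.fetchall()      # list of (name, new_logo) rows
names = [row[0] for row in rows]   # all a.name values
logos = [row[1] for row in rows]   # all new_logo values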
I need to query rows where a column matches my list of ~60K IDs from a table that contains millions of IDs. Normally you would load the IDs into a temporary table and join on it, but I can't modify this database. I'm currently doing it with a loop in Python, as shown below. It works, but is there a better way?
import pyodbc
import pandas as pd
# connect to the database using windows authentication
conn = pyodbc.connect('DRIVER={SQL Server Native Client 11.0};SERVER=my_fav_server;DATABASE=my_fav_db;Trusted_Connection=yes;')
cursor = conn.cursor()
# read in all the ids
ids_list = [...60K ids in here..]
# query in 10K chunks to prevent memory error
def chunks(l, n):
    # split the list into chunks of (at most) n items each
    n = max(1, n)
    return [l[i:i+n] for i in range(0, len(l), n)]
chunked_ids_lists = chunks(ids_list, 10000)
# loop through the chunks, retrieving all columns for each
for chunk_num, chunked_ids_list in enumerate(chunked_ids_lists):
    temp_ids_string = "('" + "','".join(chunked_ids_list) + "')"
    temp_sql = f"SELECT * FROM dbo.my_fav_table WHERE ID IN {temp_ids_string};"
    temp_data = pd.read_sql_query(temp_sql, conn)
    temp_path = f"temp_chunk_{chunk_num}.txt"
    temp_data.to_csv(temp_path, sep='\t', index=False)
# read the query chunks back in and combine them
all_data_list = []
for chunk_num in range(len(chunked_ids_lists)):
    temp_path = f"temp_chunk_{chunk_num}.txt"
    temp_data = pd.read_csv(temp_path, sep='\t')
    all_data_list.append(temp_data)
all_data = pd.concat(all_data_list)
Another way is to use Psycopg's cursor.
import psycopg2
# Connect to an existing database
conn = psycopg2.connect("dbname=test user=postgres")
# Open a cursor to perform database operations
cur = conn.cursor()
# get the data from the query
# no need to build the IN (...) filter string by hand; pass a tuple and psycopg2 adapts it
cur.execute("SELECT * FROM dbo.my_fav_table WHERE ID IN %(filter)s;", {"filter": tuple(ids_list)})
# loop over the returned rows
for record in cur:
    # process one record at a time
    print(record)  # or apply any other processing
Use parameters rather than concatenating strings.
I don't see the need for the CSV files, if you're just going to read them all into Python in the next loop. Just put everything into all_data_list during the query loop.
all_data_list = []
for chunk in chunked_ids_lists:
    params = ','.join(['?'] * len(chunk))
    sql = f"SELECT * FROM dbo.my_fav_table WHERE ID IN ({params});"
    cursor.execute(sql, chunk)
    rows = cursor.fetchall()
    all_data_list.extend(rows)
all_data = pd.DataFrame(all_data_list)
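If you want the resulting dataframe to carry proper column names without handling rows yourself, here is a minimal sketch (assuming the same pyodbc connection and table as above) that lets pandas read each chunk and concatenates at the end:
frames = []
for chunk in chunked_ids_lists:
    placeholders = ','.join(['?'] * len(chunk))  # one ? placeholder per ID in this chunk
    sql = f"SELECT * FROM dbo.my_fav_table WHERE ID IN ({placeholders});"
    frames.append(pd.read_sql_query(sql, conn, params=chunk))  # pandas builds a DataFrame per chunk
all_data = pd.concat(frames, ignore_index=True)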
I'm using pd.read_sql to generate dataframes, with parameters from lists to generate separate dataframes based on filters. When I run the same query directly against the database, it works.
Run directly in the SQL database, the query always returns data for months 1, 2, 3 and/or projects x, y. But from Python it sometimes returns nothing, and I don't know why; it produces an empty dataframe.
If I pass the months like [1,'2',3], the WHERE condition sometimes works, but in my database the field is varchar. I don't understand why the data comes back when I put an int in the list, or why the filter stops working depending on the type.
server = 'xxxxx'
database = 'mydatabase'
username = 'user'
password = '*************'
driver = '{SQL Server Native Client 11.0}'
turbodbc_options = 'options'
timeout = '1'
project = ['x','y']
months = ['1','2','3']
def generatefile():
    for pr in project:
        for index in months:
            print(type(index))
            print(index)
            print(pr)
            db = pd.read_sql('''select * from table WHERE monthnumber = (?) and pro = (?)''', conn, params={str(index), (pr)})
            print(db)
            print("generate...")
            db.to_excel('C:\\localfile\\' + pr + '\\' + pr + tp + str(index) + 'file.xlsx', index=False, engine='xlsxwriter')
generatefile()
Consider the difference between:
1.
sql = '''
select * from table WHERE monthnumber = '1' and pro = 'x'
'''
2.
sql = '''
select * from table WHERE monthnumber = 1 and pro = 'x'
'''
So, if the field is varchar, the SQL should be something like this:
db = pd.read_sql(f'''
select * from table WHERE monthnumber = '{index}' and pro = '{pr}'
''', conn)
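If you'd rather keep parameter binding instead of interpolating values into the SQL string, here is a hedged sketch (assuming the driver accepts ? placeholders, as in the question): pass the parameters as an ordered list so the month is sent as a string matching the varchar column.
db = pd.read_sql(
    'select * from table WHERE monthnumber = ? and pro = ?',
    conn,
    params=[str(index), pr],  # a list keeps the order; the set used in the question does not
)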
I have a folder called 'testfolder' that contains two files, 'Sigurdlogfile' and '2004ADlogfile'. Each file yields a list of strings called entries. I need to run my code on both of them and am using glob to do this. For each file, my code builds a dictionary whose keys come from commonterms below and whose values are extracted with regex, then inserts each dictionary into a MySQL table. It does all of this successfully, but my second SQL statement is not inserting what it should (per file).
import glob
import re
files = glob.glob('/home/user/testfolder/*logfile*')
commonterms = (["freq", "\s?(\d+e?\d*)\s?"],
["tx", "#txpattern"],
["rx", "#rxpattern"], ...)
terms = [commonterms[i][0] for i in range(len(commonterms))]
patterns = [commonterms[i][1] for i in range(len(commonterms))]
def getTerms(entry):
    for i in range(len(terms)):
        term = re.search(patterns[i], entry)
        if term:
            term = term.groups()[0] if term.groups()[0] is not None else term.groups()[1]
        else:
            term = 'NULL'
        d[terms[i]] += [term]
    return d
for filename in files:
    # code to create 'entries'
    objkey = re.match(r'/home/user/testfolder/(.+?)logfile', filename).group(1)
    d = {t: [] for t in terms}
    for entry in entries:
        d = getTerms(entry)
    import MySQLdb
    db = MySQLdb.connect(host='', user='', passwd='', db='')
    cursor = db.cursor()
    cols = d.keys()
    vals = d.values()
    for i in range(len(entries)):
        lst = [item[i] for item in vals]
        csv = "'{}'".format("','".join(lst))
        sql1 = "INSERT INTO table (%s) VALUES (%s);" % (','.join(cols), csv.replace("'NULL'", "NULL"))
        cursor.execute(sql1)
    # now in my 2nd sql statement I need to update the table with data from an old table, which is where I have the problem...
    sql2 = ("UPDATE table, oldtable SET table.key1 = oldtable.key1, "
            "table.key2 = oldtable.key2 WHERE oldtable.obj = %s;" % repr(objkey))
    cursor.execute(sql2)
    db.commit()
    db.close()
The problem is that the second SQL statement ends up inserting data from only one of the objkeys into all of the table's columns, but I need it to insert different data depending on which file the code is currently running on. I can't figure out why, since I've defined objkey inside my for filename in files loop. How can I fix this?
Instead of doing separate INSERT and UPDATE, do them together to incorporate the fields from the old table.
for i in range(len(entries)):
    lst = [item[i] for item in vals]
    csv = "'{}'".format("','".join(lst))
    sql1 = """INSERT INTO table (key1, key2, %s)
              SELECT o.key1, o.key2, a.*
              FROM (SELECT %s) AS a
              LEFT JOIN oldtable AS o ON o.obj = %s""" % (','.join(cols), csv.replace("'NULL'", "NULL"), repr(objkey))
    cursor.execute(sql1)
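As suggested elsewhere in this thread, binding the key instead of interpolating repr(objkey) is safer. A hedged sketch of the same statement with a MySQLdb placeholder for objkey (the column list and values are still built dynamically, as in the original):
sql1 = ("INSERT INTO table (key1, key2, %s) "
        "SELECT o.key1, o.key2, a.* "
        "FROM (SELECT %s) AS a "
        "LEFT JOIN oldtable AS o ON o.obj = %%s") % (','.join(cols), csv.replace("'NULL'", "NULL"))
cursor.execute(sql1, (objkey,))  # %%s survives the string formatting and becomes the %s placeholder MySQLdb binds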
My second dataframe is not loading values when I create it. Can anyone help me see why it is not working? When I turn my cursor into a list, it has plenty of values in it, but for whatever reason, when I try to do a normal dataframe load with pandas a second time, it does not work.
My code:
import pyodbc
import pandas as pd

conn = pyodbc.connect(constr, autocommit=True)
cursor = conn.cursor()
secondCheckList = []
checkCount = 0
maxValue = 0
strsql = "SELECT * FROM CRMCSVFILE"
cursor = cursor.execute(strsql)
cols = []
SQLupdateNewIdField = "UPDATE CRMCSVFILE SET NEW_ID = ? WHERE Email_Address_Txt = ? OR TELEPHONE_NUM = ? OR DRIVER_LICENSE_NUM = ?"
for row in cursor.description:
    cols.append(row[0])
df = pd.DataFrame.from_records(cursor)
df.columns = cols
newIdInt = 1
for row in range(len(df['Email_Address_Txt'])):
    # run an initial search to figure out the max number of records; look for email, phone, and driver's license, since names have a chance of not being unique
    SQLrecordCheck = "SELECT * FROM CRMCSVFILE WHERE Email_Address_Txt = '" + str(df['Email_Address_Txt'][row]) + "' OR TELEPHONE_NUM = '" + str(df['Telephone_Num'][row]) + "' OR DRIVER_LICENSE_NUM = '" + str(df['Driver_License_Num'][row]) + "'"
    ## print(SQLrecordCheck)
    cursor = cursor.execute(SQLrecordCheck)
    ## maxValue is indeed a list filled with records
    maxValue = list(cursor)
    ## THIS IS WHERE THE PROBLEM OCCURS
    tempdf = pd.DataFrame.from_records(cursor)
Why not just use pd.read_sql_query("your_query", conn)? It returns the result of the query as a dataframe and requires less code. Also, you set cursor to cursor.execute(strsql) at the top and then try to call execute on cursor again inside your for loop; since you can no longer call execute on that cursor, you will have to set cursor = conn.cursor() again.
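A minimal sketch of that suggestion (using the same assumed connection and table names as in the question, with a hypothetical first row for the per-record check):
# load the whole source table as a dataframe in one call
df = pd.read_sql_query("SELECT * FROM CRMCSVFILE", conn)
# per-row check as a parameterized query; read_sql_query returns the matching rows as a dataframe
check_sql = ("SELECT * FROM CRMCSVFILE WHERE Email_Address_Txt = ? "
             "OR TELEPHONE_NUM = ? OR DRIVER_LICENSE_NUM = ?")
tempdf = pd.read_sql_query(check_sql, conn,
                           params=[str(df['Email_Address_Txt'][0]),
                                   str(df['Telephone_Num'][0]),
                                   str(df['Driver_License_Num'][0])])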
I'm writing a script to clean up some data. It's super unoptimized, but this cursor is returning the number of results of the LIKE query rather than the rows. What am I doing wrong?
#!/usr/bin/python
import re
import MySQLdb
import collections
db = MySQLdb.connect(host="localhost",  # your host, usually localhost
                     user="admin",      # your username
                     passwd="",         # your password
                     db="test")         # name of the database
# you must create a Cursor object. It will let
# you execute all the query you need
cur = db.cursor()
# Use all the SQL you like
cur.execute("SELECT * FROM vendor")
seen = []
# collect every word from the second column of every row
for row in cur.fetchall():
    for word in row[1].split(' '):
        seen.append(word)
_digits = re.compile(r'\d')
def contains_digits(d):
    return bool(_digits.search(d))
count_word = collections.Counter(seen)
found_multi = [i for i in count_word if count_word[i] > 1 and not contains_digits(i) and len(i) > 1]
unique_multiples = list(found_multi)
groups = dict()
for word in unique_multiples:
    like_str = '%' + word + '%'
    res = cur.execute("""SELECT * FROM vendor where name like %s""", like_str)
You are storing the result of cur.execute(), which is the number of rows. You are never actually fetching any of the results.
Use .fetchall() to get all result rows or iterate over the cursor after executing:
for word in unique_multiples:
    like_str = '%' + word + '%'
    cur.execute("""SELECT * FROM vendor where name like %s""", (like_str,))
    for row in cur:
        print(row)
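Equivalently, with .fetchall() (a small sketch of the same loop):
for word in unique_multiples:
    like_str = '%' + word + '%'
    cur.execute("""SELECT * FROM vendor where name like %s""", (like_str,))
    rows = cur.fetchall()  # grab every matching row at once
    for row in rows:
        print(row)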