Dynamically build dictionary from columns specified at runtime - python

I have been told I need to use a mapping table instead of the hardcoded dictionary I made. Is there a better method than the code below for getting a 3-column table into a dictionary?
Mapping table
AgentGrp, Property, TimeZone #Headers
HollyWoodAgent Sunset PST
UnionSquareAgent UnionSquare PST
Turns into the following dictionary:
{'HollyWoodAgent': ['Sunset', 'PST'], 'UnionSquareAgent': ['UnionSquare', 'PST']}
Code:
import pandas as pd
import pyodbc
import datetime
import sys
import csv
VipAgent = "{"
finalSql = "SELECT agentgrp, property, timezone FROM sandbox_dev.agentgrp_map;"
colcnt = 0
try:
    conn = pyodbc.connect("DSN=Dev")
    cursor = conn.cursor()
    cursor.execute(finalSql)
    for row in cursor.fetchall():
        VipAgent += "'" + row.agentgrp + "':['" + row.property + "','" + row.timezone + "'],"
        colcnt = colcnt + 1
        if colcnt == 3:
            VipAgent = VipAgent + "\n"
            colcnt = 0
except pyodbc.Error as e:
    print(e)
VipAgent = VipAgent[:-1] + "}"
Dict = eval(VipAgent)
print(Dict)
I do get the values as expected. There has to be a better python way out there.

We'll take it as given that you read the "mapping table" from a file into a Python list similar to this one
item_map = ['AgentGrp', 'Property', 'TimeZone']
Once you've executed your SELECT query
cursor.execute(finalSql)
then you can build your dict like so:
result_dict = {}
while True:
    row = cursor.fetchone()
    if row:
        result_dict[row.__getattribute__(item_map[0])] = \
            [row.__getattribute__(x) for x in item_map[1:]]
    else:
        break
print(result_dict)
# {'HollyWoodAgent': ['Sunset', 'PST'], 'UnionSquareAgent': ['UnionSquare', 'PST']}
The trick is to use row.__getattribute__, e.g., row.__getattribute__('column_name'), instead of hard-coding row.column_name. The built-in getattr(row, 'column_name') is an equivalent, slightly more idiomatic spelling.
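As a compact variant, here is a sketch of the same technique written as a dict comprehension; it assumes cursor has already executed the SELECT and that item_map lists the key column first:
# sketch: build the dict in one pass over fetchall(); getattr(row, name)
# does the same lookup as row.__getattribute__(name)
result_dict = {
    getattr(row, item_map[0]): [getattr(row, col) for col in item_map[1:]]
    for row in cursor.fetchall()
}
print(result_dict)
# {'HollyWoodAgent': ['Sunset', 'PST'], 'UnionSquareAgent': ['UnionSquare', 'PST']}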

SQLAlchemy Retrieving Different Columns

Here is my code:
import pandas as pd
from sqlalchemy import create_engine
db_username = "my_username"
db_pw = "my_password"
db_to_use = "my_database"
#####
engine = create_engine(
    "postgresql://" +
    db_username + ":" +
    db_pw +
    "@localhost:5432/" +
    db_to_use
)
#####
connection = engine.connect()
fac_id_list = connection.execute(
"""
select distinct
a.name,
replace(regexp_replace(regexp_replace(a.logo_url,'.*/logo','','i'),'production.*','','i'),'/','') as new_logo
from
sync_locations as a
inner join
public.vw_locations as b
on
a.name = b.location_name
order by
new_logo
"""
)
I want to put the results of fac_id_list into two separate lists. One list will contain all of the values from a.name and the other new_logo.
How can I do this?
sql_results = []
for row in fac_id_list:
    sql_results.append(row)
This puts every column in my SQL query into a list, but I want them separated.
When you loop over the results, you can unpack each row into separate variables and append them to the corresponding lists:
names = []
logos = []
for name, logo in fac_id_list:
    names.append(name)
    logos.append(logo)
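An alternative sketch, assuming the result set is small enough to hold in memory, is to fetch all rows first and transpose them with zip:
# fetch everything, then split the two columns apart
rows = fac_id_list.fetchall()
names, logos = map(list, zip(*rows)) if rows else ([], [])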

Python - Normalize data with Regex

I am trying to use regex cleaning steps in Python to test whether a pattern matches and, if so, normalize the value to the specified carrier.
For instance, if re.match(r"\bA\.?X\.?A\.?\b", Carrier): Carrier = CarrierMatch
I've tried this by running a for loop over the raw carrier fields followed by another for loop over all of the match descriptions (just printing for now), and it takes FOREVER to run. Hoping someone out there has a better method (see the sketch after the code below).
Ideally I would like to compile all of the match descriptions for Carrier that I have in SQL (~2,000), pull out the regex pattern(s) that match, and use them to append the carrier field.
For reference the SQL data fields are [raw_pattern], [Carrier]
import sys
import re
import pyodbc
import os
import pandas as pd
from datetime import datetime
import time

regexlist = list()
carrierlist = list()
rpt_id = 1234
#rpt_id = sys.argv[1]
plan_typs = list()
try:
    conn = pyodbc.connect('Driver={SQL Server};'
                          'Server=xxxxxxxxx;'
                          'Database=xxxxxxxxx;'
                          'Trusted_Connection=xxxxx;')
except:
    print('Connection Failed')
    sys.exit()
cursor = conn.cursor()
sql = "delete from [dbo].[python_test1] where rpt_id = '" + str(rpt_id) + "'"
cursor.execute(sql)
conn.commit()
cursor = conn.cursor()
sql = "insert into [dbo].[python_test1](rpt_id, raw_carr_nm) select distinct rpt_id, raw_carr_nm from [dbo].[wrk_data] where rpt_id = '" + str(rpt_id) + "'"
cursor.execute(sql)
conn.commit()
sql = "SELECT [raw_pattern], [Carrier] FROM [dbo].[ref_regex_t]"
regex1 = pd.read_sql(sql, conn)
sql = "select * from [dbo].[python_test1] where rpt_id = '" + str(rpt_id) + "'"
carriers = pd.read_sql(sql, conn)
for index, row in regex1.iterrows():
    regexlist.append(row['raw_pattern'])
for index, row in carriers.iterrows():
    carrierlist.append(row['Carrier'])
for i in carrierlist:
    print('"' + i + '"')
for i in regexlist:
    print('"' + i + '"')
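A sketch of the precompiled-pattern idea described above, using the regex1 and carriers DataFrames from the code; re.search (rather than re.match) and the clean_carr_nm column are my assumptions, not part of the original question:
# compile each SQL-stored pattern once, so the loop over raw names does not
# re-parse ~2,000 patterns on every iteration
compiled = [(re.compile(p, re.IGNORECASE), carrier)
            for p, carrier in zip(regex1['raw_pattern'], regex1['Carrier'])]

def clean_carrier(raw_name):
    for pattern, carrier in compiled:
        if pattern.search(str(raw_name)):
            return carrier
    return raw_name  # no pattern matched; keep the raw value

carriers['clean_carr_nm'] = carriers['raw_carr_nm'].map(clean_carrier)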

Date-time as an identification and insert to the sql server

I would like to use the date as an identifier that links two different tables together. I've searched a lot and found a few alternative solutions, but I get error messages like this:
pyodbc.Error: ('21S01', '[21S01] [Microsoft][ODBC SQL Server Driver][SQL Server]There are more columns in the INSERT statement than values specified in the VALUES clause. The number of values in the VALUES clause must match the number of columns specified in the INSERT statement. (109) (SQLExecDirectW)')
This is the code I work with:
from src.server.connectToDB import get_sql_conn
import pandas as pd
from datetime import datetime

if __name__ == '__main__':
    cursor = get_sql_conn().cursor()
    localFile = 'C:\\Users\\dersimw\\Source\\Repos\\nordpoolAnalyse\\data\\2011-3.xlsx'
    excelFile = pd.ExcelFile(localFile)
    rowsID = []
    a = ["01"]
    for sheets in a:
        df = excelFile.parse(sheets).head(5)
        df.dropna(axis=1, how='all', inplace=True)
        df.fillna(0, inplace=True)
        print(df)
        now = datetime.now()
        DateDateTime = now.strftime('%Y-%m-%d %H:%M:%S')
        for key, rows in df.items():
            print("## Column: ", key, "\n")
            columnInsertSql = "INSERT INTO Table11 (DateDateTime, AcceptedBlockBuy, AcceptedBlockSell, RejectedBlockBuy, RejectedBlockSell, NetImports) VALUES("
            columnCounter = 1
            columnHasData = False
            for key, column in rows.items():
                if isinstance(column, int) or isinstance(column, float):
                    columnHasData = True
                    columnInsertSql += str(column)
                    if columnCounter != len(list(rows.items())):
                        columnInsertSql += ", "
                columnCounter += 1
            columnInsertSql += ")"
            if columnHasData == True:
                cursor.execute(columnInsertSql)
                cursor.commit()
This is what I have:
Id A.BlockBuy A.BlockSell R.BlockBuy R.BlockSell NetImports
1 112 1 14 655 65
2 123 1 54 3 654
3 122 1 65 43 43
.
.
122 21 12 54 54 54
This is what I want:
Id DateDate A.BlockBuy A.BlockSell R.BlockBuy R.BlockSell NetImports
1 2018-08-1 112 1 14 655 65
2 2018-08-1 123 1 54 3 654
3 2018-08-1 122 1 65 43 43
.
.
122 2018-08-01 21 12 54 54 54
The way you are trying to do this is not a good way to do ETL. I have built my own package for one of my projects using Postgres and Python, and the procedure should be exactly the same for SQL Server. You should add a datetime column (etl_run_time) to your data. I always add it to the dataframe before uploading to the database; then I can do a bulk insert.
The main point is that loading the data into Python and inserting it into the database should be separate tasks, followed by an update task if needed. I could not manage the time to replicate your task exactly, but you can read about it in more detail on this blog: https://datapsycho.github.io/PsychoBlog/dataparrot-18-01
# import datetime
import time
# from dateutil.relativedelta import relativedelta
import json
# that function has the username and password for the db connection;
# you can create your own, which will be used as the cursor
from auths.auth import db_connection
import os
import pandas as pd


class DataLoader():
    # function to process survey sent data
    @classmethod
    def process_survey_sent_data(cls):
        # my file path is very long, so I split it into 3 different parts
        input_path_1 = r'path\to\your\file'
        input_path_2 = r'\path\to\your\file'
        input_path_3 = r'\path\to\your\file'
        file_path = input_path_1 + input_path_2 + input_path_3
        file_list = os.listdir(os.path.join(file_path))
        file_list = [file_name for file_name in file_list if '.txt' in file_name]
        field_names = ['country', 'ticket_id']
        pd_list = []
        for file_name in file_list:
            # collect the file name to use it as a column
            date_ = file_name.replace(" ", "-")[:-4]
            file_path_ = file_path + '\\' + file_name
            df_ = pd.read_csv(os.path.join(file_path_), sep='\t', usecols=field_names).assign(sent_date=date_)
            df_['sent_date'] = pd.to_datetime(df_['sent_date'])
            df_['sent_date'] = df_['sent_date'].values.astype('datetime64[M]')
            df_['sent_date'] = df_['sent_date'].astype(str)
            pd_list.append(df_)
        df_ = pd.concat(pd_list)
        # do a few more cleaning steps
        # create a unique ID
        df_ = df_[['country', 'sent_date', 'ticket_id']].groupby(['country', 'sent_date']).agg('count').reset_index()
        df_['sent_id'] = df_['country'] + '_' + df_['sent_date']
        df_ = df_.drop_duplicates(keep='first', subset='sent_id')
        print(df_.head())
        output_path_1 = r'\\long\output\path1'
        output_path_2 = r'\lont\output\path2'
        output_path = output_path_1 + output_path_2
        # put the file name
        survey_sent_file = 'survey_sent.json'
        # add the etl run time
        df_['etl_run_time'] = pd.to_datetime('today').strftime('%Y-%m-%d')
        # write the file to json
        df_.to_json(os.path.join(output_path, survey_sent_file), orient='records')
        return print('Survey Sent data stored as json dump')

    # function to create a database insert query
    @classmethod
    def insert_string(cls, column_list, table_name):
        first_part = 'INSERT INTO {} VALUES ('.format(table_name)
        second_part = ', '.join(['%({})s'.format(col) for col in column_list])
        return first_part + second_part + ') ;'

    # function to execute a database query
    @classmethod
    def empty_table(cls, table_name):
        conn = db_connection()
        cursor = conn.cursor()
        cursor.execute("delete from {} ;".format(table_name))
        conn.commit()
        conn.close()

    # # function to run post_sql code after the data load
    # @classmethod
    # def run_post_sql(cls):
    #     # create a database query which can run after the insertion of data
    #     post_sql = """
    #     INSERT INTO schema.target_table -- target
    #     select * from schema.extract_table -- extract
    #     WHERE
    #         schema.extract_table.sent_id -- primary key of extract
    #         NOT IN (SELECT DISTINCT sent_id FROM schema.target_table) -- not in target
    #     """
    #     conn = db_connection()
    #     cursor = conn.cursor()
    #     cursor.execute(post_sql)
    #     conn.commit()
    #     conn.close()
    #     return print("Post SQL has run for Survey Sent.")

    # function to insert data into the server
    @classmethod
    def insert_survey_sent_data(cls):
        output_path_1 = r'new\json\file\path1'
        output_path_2 = r'\new\json\file\path2'
        output_path = output_path_1 + output_path_2
        # create the file name
        output_survey_file = 'survey_sent.json'
        full_path = os.path.join(output_path, output_survey_file)
        # column names from the json file
        table_def = ['sent_id', 'country', 'ticket_id', 'sent_date', 'etl_run_time']
        # load the data as json and partition it into chunks
        with open(full_path, 'r') as file:
            chunk = 60
            json_data = json.loads(file.read())
            json_data = [json_data[i * chunk:(i + 1) * chunk] for i in range((len(json_data) + chunk - 1) // chunk)]
        # create the connection, delete existing data and insert the new data
        table_name = 'schema.extract_table'
        cls.empty_table(table_name)
        print('{} total chunks will be inserted, each chunk has {} rows.'.format(len(json_data), chunk))
        for iteration, chunk_ in enumerate(json_data, 1):
            conn = db_connection()
            cursor = conn.cursor()
            insert_statement = cls.insert_string(table_def, table_name)
            start_time = time.time()
            cursor.executemany(insert_statement, chunk_)
            conn.commit()
            conn.close()
            print(iteration, " %s seconds" % round((time.time() - start_time), 2))
        return print('Insert happened for survey sent.')


if __name__ == "__main__":
    DataLoader.process_survey_sent_data()
    DataLoader.insert_survey_sent_data()
    # DataLoader.run_post_sql()
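For the SQL Server case in the question, a minimal sketch of the same idea with pyodbc might look like the following; the connection string, file path, and the assumption that the Excel columns have already been renamed to match Table11 are mine, not the asker's:
import pandas as pd
import pyodbc

# assumed connection details; adjust to your environment
conn = pyodbc.connect('Driver={SQL Server};Server=myserver;'
                      'Database=mydb;Trusted_Connection=yes;')

df = pd.read_excel(r'C:\path\to\2011-3.xlsx', sheet_name='01')
df.dropna(axis=1, how='all', inplace=True)
df.fillna(0, inplace=True)

# add the timestamp once, as a column, instead of splicing it into each INSERT string
df['DateDateTime'] = pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')

# assumes the Excel columns have been renamed to these Table11 names
cols = ['DateDateTime', 'AcceptedBlockBuy', 'AcceptedBlockSell',
        'RejectedBlockBuy', 'RejectedBlockSell', 'NetImports']
placeholders = ', '.join('?' * len(cols))
sql = 'INSERT INTO Table11 ({}) VALUES ({})'.format(', '.join(cols), placeholders)

cursor = conn.cursor()
cursor.executemany(sql, df[cols].values.tolist())
conn.commit()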

Unable to create a second dataframe python pandas

My second data frame does not load any values when I create it. Any idea why it is not working? When I turn my cursor into a list, it has plenty of values in it, but for whatever reason, when I try to do a normal data frame load with pandas a second time, it does not work.
My code:
conn = pyodbc.connect(constr, autocommit=True)
cursor = conn.cursor()
secondCheckList = []
checkCount = 0
maxValue = 0
strsql = "SELECT * FROM CRMCSVFILE"
cursor = cursor.execute(strsql)
cols = []
SQLupdateNewIdField = "UPDATE CRMCSVFILE SET NEW_ID = ? WHERE Email_Address_Txt = ? OR TELEPHONE_NUM = ? OR DRIVER_LICENSE_NUM = ?"
for row in cursor.description:
    cols.append(row[0])
df = pd.DataFrame.from_records(cursor)
df.columns = cols
newIdInt = 1
for row in range(len(df['Email_Address_Txt'])):
    # run an initial search to figure out the max number of records. Look for email,
    # phone, and driver's license; names have a chance of not being unique
    SQLrecordCheck = "SELECT * FROM CRMCSVFILE WHERE Email_Address_Txt = '" + str(df['Email_Address_Txt'][row]) + "' OR TELEPHONE_NUM = '" + str(df['Telephone_Num'][row]) + "' OR DRIVER_LICENSE_NUM = '" + str(df['Driver_License_Num'][row]) + "'"
    ## print(SQLrecordCheck)
    cursor = cursor.execute(SQLrecordCheck)
    ## maxValue is indeed a list filled with records
    maxValue = list(cursor)
    ## THIS IS WHERE THE PROBLEM OCCURS
    tempdf = pd.DataFrame.from_records(cursor)
Why not just use pd.read_sql_query("your_query", conn)? It returns the result of the query as a dataframe and requires less code. Also, you set cursor to cursor.execute(strsql) at the top and then try to call execute on cursor again inside your for loop; at that point you can no longer call execute on cursor, so you would have to set cursor = conn.cursor() again.
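For illustration, a sketch of the read_sql_query approach, assuming an open pyodbc connection conn and the table and columns from the question; the parameterized lookup is my addition:
import pandas as pd

# load the whole table once
df = pd.read_sql_query("SELECT * FROM CRMCSVFILE", conn)

# per-row lookups can also go through read_sql_query with a parameterized
# query, so the original cursor never has to be reused
tempdf = pd.read_sql_query(
    "SELECT * FROM CRMCSVFILE WHERE Email_Address_Txt = ? "
    "OR TELEPHONE_NUM = ? OR DRIVER_LICENSE_NUM = ?",
    conn,
    params=[str(df['Email_Address_Txt'][0]),
            str(df['Telephone_Num'][0]),
            str(df['Driver_License_Num'][0])],
)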

is it possible to change ms access table name with python

I have several ms access databases that each have a table named PlotStatus-name-3/13/12.
I need to export each of these tables to a .csv file. If I manually change the name of the tables to PlotStatus_name_3_13_12, this code works. Does anyone know how to change the table names using Python?
# connect to the access database
for filename in os.listdir(prog_rep_local):
    if filename[-6:] == ".accdb":
        DBtable = os.path.join(prog_rep_local, filename)
        conn = pyodbc.connect(r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=' + DBtable)
        cursor = conn.cursor()
        ct = cursor.tables
        for row in ct():
            rtn = row.table_name
            if rtn[:10] == "PlotStatus":
                # this does not work:
                # Oldpath = os.path.join(prog_rep_local, filename, rtn)
                # print(Oldpath)
                # fpr = Oldpath.replace('-', '_')  # .replace("/","_")
                # print(fpr)
                # newname = os.rename(Oldpath, fpr)  # this does not work
                # print(newname)
                # sqlaccdb = "SELECT * FROM " + newname
                # this works if I manually change the table names in advance
                sqlaccdb = "SELECT * FROM " + rtn
                print(sqlaccdb)
                cursor.execute(sqlaccdb)
                rows = cursor.fetchall()
An easier solution would be to just add brackets around the table name so that the /s don't throw off the SQL command interpreter.
sqlaccdb = "SELECT * FROM [" + rtn + "]"
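For completeness, a sketch of dumping each matched table to CSV with the bracketed name, assuming the cursor and rtn from the loop above; the csv module and the output file name are my additions:
import csv

sqlaccdb = "SELECT * FROM [" + rtn + "]"
cursor.execute(sqlaccdb)
# build a filesystem-safe file name from the table name
out_name = rtn.replace("/", "_").replace("-", "_") + ".csv"
with open(out_name, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor.fetchall())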
