Python - Normalize data with Regex

I am trying to use regex cleaning steps in Python: test whether a raw carrier name matches a pattern and, if it does, replace it with the specified carrier.
For instance: if re.match(r"\bA\.?X\.?A\.?\b", Carrier): Carrier = CarrierMatch
I've tried this with a for loop over the raw carrier fields and a nested for loop over all of the match descriptions (just printing for now), and it takes forever to run. I'm hoping someone has a better method.
Ideally, I would like to load all of the match descriptions for Carrier that I have in SQL (~2,000), compile their regex patterns, and use them to populate the carrier field.
For reference, the SQL data fields are [raw_pattern], [Carrier]
import sys
import re
import pyodbc
import os
import pandas as pd
from datetime import datetime
import time
regexlist = list()
carrierlist = list()
rpt_id = 1234
#rpt_id = sys.argv[1]
plan_typs = list()
try:
    conn = pyodbc.connect('Driver={SQL Server};'
                          'Server=xxxxxxxxx;'
                          'Database=xxxxxxxxx;'
                          'Trusted_Connection=xxxxx;')
except pyodbc.Error:
    print('Connection Failed')
    sys.exit()
cursor = conn.cursor()
# rpt_id is passed as a query parameter (pyodbc uses ? placeholders) rather than concatenated into the SQL
sql = "delete from [dbo].[python_test1] where rpt_id = ?"
cursor.execute(sql, str(rpt_id))
conn.commit()
cursor = conn.cursor()
sql = "insert into [dbo].[python_test1](rpt_id, raw_carr_nm) select distinct rpt_id, raw_carr_nm from [dbo].[wrk_data] where rpt_id = ?"
cursor.execute(sql, str(rpt_id))
conn.commit()
sql = "SELECT [raw_pattern], [Carrier] FROM [dbo].[ref_regex_t]"
regex1 = pd.read_sql(sql, conn)
sql = "select * from [dbo].[python_test1] where rpt_id = ?"
carriers = pd.read_sql(sql, conn, params=[str(rpt_id)])
for index, row in regex1.iterrows():
    regexlist.append(row['raw_pattern'])
for index, row in carriers.iterrows():
    # python_test1 holds rpt_id and raw_carr_nm (see the insert above); it has no Carrier column
    carrierlist.append(row['raw_carr_nm'])
for i in carrierlist:
    print('"' + i + '"')
for i in regexlist:
    print('"' + i + '"')
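
One possible way to speed this up, sketched under assumptions: compile each pattern once up front and look up each distinct raw carrier name only once, instead of re-evaluating every pattern for every row. The names REFERENCE_ROWS, compile_rules, and normalize are hypothetical, and the two patterns only stand in for the rows of [ref_regex_t]. Using re.search with re.IGNORECASE is also an assumption about the intended matching rules.

```python
import re

# Hypothetical stand-in for the (raw_pattern, Carrier) rows in [dbo].[ref_regex_t].
REFERENCE_ROWS = [
    (r"\bA\.?X\.?A\.?\b", "AXA"),
    (r"\bMET\s?LIFE\b", "MetLife"),
]


def compile_rules(rows):
    """Compile every pattern once so the matching loop does no recompilation."""
    return [(re.compile(pattern, re.IGNORECASE), carrier) for pattern, carrier in rows]


def normalize(raw_names, rules):
    """Map each distinct raw name to the first matching carrier (None if no match)."""
    cleaned = {}
    for name in set(raw_names):
        cleaned[name] = next(
            (carrier for regex, carrier in rules if regex.search(name)), None
        )
    return cleaned


rules = compile_rules(REFERENCE_ROWS)
result = normalize(["A.X.A. Insurance", "axa", "Met Life Inc", "Unknown Co"], rules)
```

The work is still proportional to the number of patterns times the number of distinct names, but duplicates are skipped and no pattern is compiled more than once. The resulting dictionary could then be used to fill the carrier field in one pass.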

Related

Mysql Python eof then return after fetch more data

How can I keep reading rows after the SELECT has reached the end of its result set (EOF) when more data is inserted afterwards?
import MySQLdb
import MySQLdb.cursors
cnx = MySQLdb.connect(user="user",
                      passwd="password",
                      db="mydb",
                      cursorclass=MySQLdb.cursors.SSCursor
                      )
cursor = cnx.cursor()
cursor.execute("SELECT * FROM individual_data")
while True:  # loop forever, hoping for more rows
    row = cursor.fetchone()
    if row is not None:
        print(row)

Just run the query bounded by a time window:
cursor.execute("SELECT * FROM individual_data where " +
               "datetimestamp >= TIMESTAMP '" + str(TOP_DT1) + "' and datetimestamp < TIMESTAMP '" + str(TOP_DT11) + "' order by datetimestamp, segment_id, user_id, id")
and execute it again with the next window to fetch more data. Like this:
# fetch_one, read_individual_data, and emit_individual_data are helpers not shown here
row1 = fetch_one(cur1)
while row1 is not None:
    b = read_individual_data(row1)
    row = emit_individual_data(a, b)
    TOP_DT1 = TOP_DT11
    row1 = fetch_one(cur1)
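
The idea of re-executing the query for rows newer than the last one seen can be sketched as follows. This is a minimal illustration, not the answerer's code: it uses the standard-library sqlite3 module as a stand-in for MySQL, and the table layout, the fetch_new_rows function, and the integer id watermark are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE individual_data (id INTEGER PRIMARY KEY, payload TEXT)")


def fetch_new_rows(connection, last_id):
    """Return rows with id greater than last_id, plus the new high-water mark."""
    rows = connection.execute(
        "SELECT id, payload FROM individual_data WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    if rows:
        last_id = rows[-1][0]
    return rows, last_id


conn.executemany("INSERT INTO individual_data (payload) VALUES (?)", [("a",), ("b",)])
first_batch, watermark = fetch_new_rows(conn, 0)

# Rows inserted after the first query reached EOF are picked up by the next call.
conn.execute("INSERT INTO individual_data (payload) VALUES (?)", ("c",))
second_batch, watermark = fetch_new_rows(conn, watermark)

# Nothing new: the query returns an empty list instead of blocking.
third_batch, watermark = fetch_new_rows(conn, watermark)
```

In a long-running script, fetch_new_rows would typically be called in a loop with a short time.sleep between calls.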

Sqlite3 naming db file with a variable in python

How can I use the current date to name my db file, so that each run creates a db file named after the current date? This is what I have so far:
import sqlite3
import time
timedbname = time.strftime("%d/%m/%Y")
# Connecting to the database file
conn = sqlite3.connect(???)
I get this error, and it is the same with '/', '-', or '.' as the separator in "%d/%m/%Y":
conn = sqlite3.connect(timedbname, '.db')
TypeError: a float is required
27.01.2016

Try using:
time.strftime("%d-%m-%Y")
I guess it doesn't work because of the slashes in the generated date.

You can't have dashes in your table name. Also it can't start with a digit.
import sqlite3
from datetime import date
timedbname = '_' + str(date.today()).replace('-','_')
# Connecting to the database file
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''CREATE TABLE %s (col1 int, col2 int)''' % (timedbname))
cursor.execute('''INSERT INTO %s VALUES (1, 2)''' % (timedbname))
cursor.execute('''SELECT * FROM %s'''%timedbname).fetchall()

This worked:
import sqlite3
import time
timedbname = time.strftime("_" + "%d.%m.%Y")
conn = sqlite3.connect(timedbname + '.db')
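
The TypeError in the question comes from passing '.db' as a second positional argument: sqlite3.connect treats its second argument as the timeout, which must be a number. Building the complete file name as one string first avoids that. A minimal sketch, in which the temporary directory and the names db_name and db_path are assumptions for the example:

```python
import os
import sqlite3
import tempfile
import time

# Use dots or dashes in the date: a slash would be read as a directory separator.
db_name = time.strftime("%d.%m.%Y") + ".db"

# A temporary directory keeps the example from littering the working directory.
db_path = os.path.join(tempfile.mkdtemp(), db_name)

conn = sqlite3.connect(db_path)  # a single string argument: the file to open
conn.execute("CREATE TABLE example (col1 INTEGER, col2 INTEGER)")
conn.commit()
conn.close()

created = os.path.exists(db_path)
```

The leading underscore is not needed for a file name; the restriction on dashes and leading digits applies to table names.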

Python cursor is returning number of rows instead of rows

I'm writing a script to clean up some data. It is far from optimized, but my problem is that this cursor returns the number of results of the LIKE query rather than the rows. What am I doing wrong?
#!/usr/bin/python
import re
import MySQLdb
import collections
db = MySQLdb.connect(host="localhost",  # your host, usually localhost
                     user="admin",  # your username
                     passwd="",  # your password
                     db="test")  # name of the data base
# you must create a Cursor object. It will let
# you execute all the query you need
cur = db.cursor()
# Use all the SQL you like
cur.execute("SELECT * FROM vendor")
seen = []
# print all the first cell of all the rows
for row in cur.fetchall():
    for word in row[1].split(' '):
        seen.append(word)
_digits = re.compile(r'\d')
def contains_digits(d):
    return bool(_digits.search(d))
count_word = collections.Counter(seen)
found_multi = [i for i in count_word if count_word[i] > 1 and not contains_digits(i) and len(i) > 1]
unique_multiples = list(found_multi)
groups = dict()
for word in unique_multiples:
    like_str = '%' + word + '%'
    res = cur.execute("""SELECT * FROM vendor where name like %s""", like_str)

You are storing the result of cur.execute(), which is the number of rows. You are never actually fetching any of the results.
Use .fetchall() to get all result rows or iterate over the cursor after executing:
for word in unique_multiples:
    like_str = '%' + word + '%'
    # pass the parameters as a tuple
    cur.execute("""SELECT * FROM vendor where name like %s""", (like_str,))
    for row in cur:
        print(row)
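
The execute-then-fetch pattern can be tried without a MySQL server. The sketch below uses the standard-library sqlite3 module as a stand-in, so two details differ from MySQLdb: the placeholder is ? rather than %s, and sqlite3's execute returns the cursor itself rather than a row count. The vendor rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vendor (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO vendor (name) VALUES (?)",
    [("Acme Tools",), ("Acme Paint",), ("Globex",)],
)

cur = conn.cursor()
like_str = "%" + "Acme" + "%"

# execute() only runs the statement; the rows are read from the cursor afterwards.
cur.execute("SELECT name FROM vendor WHERE name LIKE ? ORDER BY id", (like_str,))
rows = cur.fetchall()

names = [row[0] for row in rows]
```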

searching for occurrences in access database

I am trying to search an Access database for certain occurrences, but my code misses some of them.
It misses the second occurrence once it has found the first one.
Example: if I am looking for T300 and have this structure:
T200
T300
T300
it will catch the first T300 and skip the second T300.
import csv
import pyodbc
from xml.dom import minidom
# *************************************
def DBAccess (Term):
    MDB = 'c:/test/mydb.mdb'
    DRV = '{Microsoft Access Driver (*.mdb)}'
    PWD = ''
    conn = pyodbc.connect('DRIVER=%s;DBQ=%s;PWD=%s' % (DRV,MDB,PWD))
    curs = conn.cursor()
    curs.execute("select * from gdo_segment")
    rows = curs.fetchall()
    for row in rows:
        T = 'T' + str(row.troncon) + '_' + row.noeud1 + '-' + row.noeud2
    if (T == Term ):
        print(T)
    curs.close()
    conn.close()
#*************************************
def findTerminal():
    xmldoc = minidom.parse(r'c:\test\mydoc.xml')
    #printing the number of blocs in my xml file
    itemlist = xmldoc.getElementsByTagName('ACLineSegment')
    for item in itemlist:
        found = False
        for child in item.childNodes:
            if child.nodeName == 'Terminal':
                found = True
        if not found:
            Term = item.getAttribute('Name')
            DBAccess (Term)
#***********************************
findTerminal()

I assume it is finding only the last item, and this would be because of your code's indentation. Correct indentation is essential in Python; see the docs.
Currently, your if statement runs only after all the looping has completed, so it checks only the last value of T. Indent it so that it sits inside the for loop:
def DBAccess (Term):
    MDB = 'c:/test/gdomt.mdb'
    DRV = '{Microsoft Access Driver (*.mdb)}'
    PWD = ''
    conn = pyodbc.connect('DRIVER=%s;DBQ=%s;PWD=%s' % (DRV,MDB,PWD))
    curs = conn.cursor()
    curs.execute("select * from gdo_segment")
    rows = curs.fetchall()
    for row in rows:
        T = 'T' + str(row.troncon) + '_' + row.noeud1 + '-' + row.noeud2
        if (T == Term ):
            print(T)
    curs.close()
    conn.close()
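
The effect of the indentation can be checked without a database. In this minimal sketch, the Segment tuples are a hypothetical stand-in for the rows fetched from gdo_segment, and find_matches is an illustrative name. Because the comparison is indented under the for loop, it runs once per row and both matching rows are collected.

```python
from collections import namedtuple

# Hypothetical stand-in for rows fetched from gdo_segment.
Segment = namedtuple("Segment", ["troncon", "noeud1", "noeud2"])
rows = [
    Segment(200, "A", "B"),
    Segment(300, "C", "D"),
    Segment(300, "C", "D"),
]


def find_matches(rows, term):
    """Compare every row against term inside the loop and collect all hits."""
    matches = []
    for row in rows:
        label = "T" + str(row.troncon) + "_" + row.noeud1 + "-" + row.noeud2
        if label == term:  # indented under the for loop, so it runs once per row
            matches.append(label)
    return matches


hits = find_matches(rows, "T300_C-D")
```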

creating a pandas dataframe from a database query that uses bind variables

I'm working with an Oracle database. I can do this much:
import pandas as pd
import pandas.io.sql as psql
import cx_Oracle as odb
conn = odb.connect(_user +'/'+ _pass +'@'+ _dbenv)
sqlStr = "SELECT * FROM customers"
df = psql.frame_query(sqlStr, conn)
But I don't know how to handle bind variables, like so:
sqlStr = """SELECT * FROM customers
WHERE id BETWEEN :v1 AND :v2
"""
I've tried these variations:
params = (1234, 5678)
params2 = {"v1":1234, "v2":5678}
df = psql.frame_query((sqlStr,params), conn)
df = psql.frame_query((sqlStr,params2), conn)
df = psql.frame_query(sqlStr,params, conn)
df = psql.frame_query(sqlStr,params2, conn)
The following works:
curs = conn.cursor()
curs.execute(sqlStr, params)
df = pd.DataFrame(curs.fetchall())
df.columns = [rec[0] for rec in curs.description]
but this solution is just... inelegant. If I can, I'd like to do this without creating the cursor object. Is there a way to do the whole thing using just pandas?

Try using pandas.io.sql.read_sql_query. I used it with pandas version 0.20.1 and it worked:
import pandas as pd
import pandas.io.sql as psql
import cx_Oracle as odb
conn = odb.connect(_user +'/'+ _pass +'@'+ _dbenv)
sqlStr = """SELECT * FROM customers
WHERE id BETWEEN :v1 AND :v2
"""
pars = {"v1":1234, "v2":5678}
df = psql.read_sql_query(sqlStr, conn, params=pars)
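
The same call can be tried without an Oracle instance. The sketch below is an illustration under assumptions: it swaps in the standard-library sqlite3 module, which also accepts :name style bind variables, and the customers rows are made up for the example.

```python
import sqlite3

import pandas as pd

# sqlite3 stands in for Oracle here; both accept :name style bind variables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1000, "below range"), (1234, "first"), (5678, "last"), (9999, "above range")],
)

sql_str = """SELECT * FROM customers
WHERE id BETWEEN :v1 AND :v2
ORDER BY id
"""

# The params dict is handed to the driver, which binds the values itself.
df = pd.read_sql_query(sql_str, conn, params={"v1": 1234, "v2": 5678})

ids = df["id"].tolist()
columns = list(df.columns)
```

No cursor object is created by hand, and the column names come from the query result.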

As far as I can tell, pandas expects that the SQL string be completely formed prior to passing it along. With that in mind, I would (and always do) use string interpolation:
params = (1234, 5678)
sqlStr = """
SELECT * FROM customers
WHERE id BETWEEN %d AND %d
""" % params
print(sqlStr)
which gives
SELECT * FROM customers
WHERE id BETWEEN 1234 AND 5678
So that should feed into psql.frame_query just fine (it does in my experience with Postgres, MySQL, and SQL Server).
