[Edit 2: More information and debugging in answer below...]
I'm writing a python script to export MS Access databases into a series of text files to allow for more meaningful version control (I know - why Access? Why aren't I using existing solutions? Let's just say the restrictions aren't of a technical nature).
I've successfully exported the full contents and structure of the database using ADO and ADOX via the comtypes library, but I'm getting a problem re-importing the data.
I'm exporting the contents of each table into a text file with a list on each line, like so:
[-9, u'No reply']
[1, u'My home is as clean and comfortable as I want']
[2, u'My home could be more clean or comfortable than it is']
[3, u'My home is not at all clean or comfortable']
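The export code isn't shown here, but each line is just the repr() of a Python list of field values. A minimal sketch of the export side (assuming an open ADO recordset rs and a writable text file out; not the exact code used) looks like this:
# Sketch: write one Python-list repr per line so the import side can
# rebuild each row with eval().
while not rs.EOF:
    row = [rs.Fields[i].Value for i in range(rs.Fields.Count)]
    out.write(repr(row) + "\n")
    rs.MoveNext()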
And I use the following function to import those files:
import os
import sys
import datetime
import comtypes.client as client
from ADOconsts import *
from access_consts import *
class Db:
    def create_table_contents(self, verbosity=0):
        conn = client.CreateObject("ADODB.Connection")
        rs = client.CreateObject("ADODB.Recordset")
        conn.ConnectionString = self.new_con_string
        conn.Open()
        for fname in os.listdir(self.file_path):
            if fname.startswith("Table_"):
                tname = fname[6:-4]
                if verbosity > 0:
                    print "Filling table %s." % tname
                conn.Execute("DELETE * FROM [%s];" % tname)
                rs.Open("SELECT * FROM [%s];" % tname, conn,
                        adOpenDynamic, adLockOptimistic)
                f = open(self.file_path + os.path.sep + fname, "r")
                data = f.readline()
                print repr(data)
                while data != '':
                    data = eval(data.strip())
                    print data[0]
                    print rs.Fields.Count
                    rs.AddNew()
                    for i in range(rs.Fields.Count):
                        if verbosity > 1:
                            print "Into field %s (type %s) insert value %s." % (
                                rs.Fields[i].Name, str(rs.Fields[i].Type),
                                data[i])
                        rs.Fields[i].Value = data[i]
                    data = f.readline()
                    print repr(data)
                    rs.Update()
                rs.Close()
        conn.Close()
Everything works fine except that numerical values (double and int) are being inserted as zeros. Any ideas on whether the problem is with my code, eval, comtypes, or ADO?
Edit: I've fixed the problem with inserting numbers - casting them as strings(!) seems to solve the problem for both double and integer fields.
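For reference, the workaround amounts to something like this inside the field loop (a sketch of the idea, not the exact code used):
# Workaround sketch: hand numeric values to ADO as strings and let the
# Jet/ACE engine coerce them back to the column's type.
value = data[i]
if isinstance(value, (int, float)):
    value = str(value)
rs.Fields[i].Value = value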
However, I now have a different issue that had previously been obscured by the above: the first field in every row is being set to 0 regardless of data type... Any ideas?
And found an answer.
rs = client.CreateObject("ADODB.Recordset")
Needs to be:
rs = client.CreateObject("ADODB.Recordset", dynamic=True)
Now I just need to look into why. Just hope this question saves someone else a few hours...
Is data[i] being treated as a string? What happens if you specifically cast it as a int/double when you set rs.Fields[i].Value?
Also, what happens when you print out the contents of rs.Fields[i].Value after it is set?
Not a complete answer yet, but it appears to be a problem during the update. I've added some further debugging code in the insertion process which generates the following (example of a single row being updated):
Inserted into field ID (type 3) insert value 1, field value now 1.
Inserted into field TextField (type 202) insert value u'Blah', field value now Blah.
Inserted into field Numbers (type 5) insert value 55.0, field value now 55.0.
After update: [0, u'Blah', 55.0]
The last value in each "Inserted..." line is the result of calling rs.Fields[i].Value before calling rs.Update(). The "After..." line shows the results of calling rs.Fields[i].Value after calling rs.Update().
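The debugging code itself isn't shown; reconstructed from the output above, it was roughly of this shape (a sketch, not the exact code):
# Rough shape of the extra debugging: report each field as it is set,
# then re-read every field after Update() to see what actually got committed.
for i in range(rs.Fields.Count):
    rs.Fields[i].Value = data[i]
    print "Inserted into field %s (type %s) insert value %r, field value now %s." % (
        rs.Fields[i].Name, rs.Fields[i].Type, data[i], rs.Fields[i].Value)
rs.Update()
print "After update:", [rs.Fields[i].Value for i in range(rs.Fields.Count)]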
What's even more annoying is that it's not reliably failing. Rerunning the exact same code on the same records a few minutes later generated:
Inserted into field ID (type 3) insert value 1, field value now 1.
Inserted into field TextField (type 202) insert value u'Blah', field value now Blah.
Inserted into field Numbers (type 5) insert value 55.0, field value now 55.0.
After update: [1, u'Blah', 2.0]
As you can see, results are reliable until you commit them, then... not.
I wrote code in Python to manipulate a table I have in my database, using SQLAlchemy. Basically I have Table 1 with 2,500,000 entries and another Table 2 with 200,000 entries. What I am trying to do is compare the source ip and dest ip in Table 1 with the source ip and dest ip in Table 2. If there is a match, I replace the ip source and ip dest in Table 1 with the data that matches in Table 2 and add the entry to Table 3. My code also checks whether the entry is already in the new table; if so, it skips it and goes on to the next row.
My problem is that it's extremely slow. I launched my script yesterday and in 24 hours it only went through 47,000 entries out of 2,500,000. I am wondering if there is any way I can speed up the process. It's a Postgres db and I can't tell whether the script taking this much time is reasonable or whether something is up. If anyone has had a similar experience with something like this, how much time did it take before completion?
Many thanks.
session = Session()
i = 0
start_id = 1
flows = session.query(Table1).filter(Table1.id >= start_id).all()
result_number = len(flows)
vlan_list = {"['0050']", "['0130']", "['0120']", "['0011']", "['0110']"}
while i < result_number:
    for flow in flows:
        if flow.vlan_destination in vlan_list:
            usage = session.query(Table2).filter(
                Table2.ip == str(flow.ip_destination)).all()
            if len(usage) > 0:
                usage = usage[0].usage
            else:
                usage = str(flow.ip_destination)
            usage_ip_src = session.query(Table2).filter(
                Table2.ip == str(flow.ip_source)).all()
            if len(usage_ip_src) > 0:
                usage_ip_src = usage_ip_src[0].usage
            else:
                usage_ip_src = str(flow.ip_source)
            if flow.protocol == "17":
                protocol = func.REPLACE(flow.protocol, "17", 'UDP')
            elif flow.protocol == "1":
                protocol = func.REPLACE(flow.protocol, "1", 'ICMP')
            elif flow.protocol == "6":
                protocol = func.REPLACE(flow.protocol, "6", 'TCP')
            else:
                protocol = flow.protocol
            is_in_db = session.query(Table3).filter(Table3.protocol == protocol)\
                .filter(Table3.application == flow.application)\
                .filter(Table3.destination_port == flow.destination_port)\
                .filter(Table3.vlan_destination == flow.vlan_destination)\
                .filter(Table3.usage_source == usage_ip_src)\
                .filter(Table3.state == flow.state)\
                .filter(Table3.usage_destination == usage).count()
            if is_in_db == 0:
                to_add = Table3(usage_ip_src, usage, protocol, flow.application,
                                flow.destination_port, flow.vlan_destination, flow.state)
                session.add(to_add)
                session.flush()
                session.commit()
                print("added " + str(i))
            else:
                print("usage already in DB")
            i = i + 1
session.close()
EDIT As requested, here are more details: Table 1 has 11 columns; the two we are interested in are source ip and dest ip (screenshot: Table 1).
Table 2 has an IP and a Usage column (screenshot: Table 2). What my script does is take the source ip and dest ip from Table 1 and look up whether there is a match in Table 2. If so, it replaces the ip address with the usage value and adds it, along with some of the columns of Table 1, to Table 3 (screenshot: Table 3).
While doing this, when adding the protocol column to Table 3, it writes the protocol name instead of the number, just to make it more readable.
EDIT 2 I am trying to think about this differently, so I made a diagram of my problem (the X problem).
What I am trying to figure out is whether my code (the Y solution) is working as intended. I've only been coding in Python for a month and I feel like I am messing something up. My code is supposed to take every row from Table 1, compare it to Table 2, and add data to Table 3. Table 1 has over 2 million entries, and it's understandable that it should take a while, but it's too slow. For example, when I had to load the data from the API into the db, it went faster than the comparisons I'm trying to do with everything that is already in the db. I am running my code on a virtual machine that has sufficient memory, so I am sure it's my code that is lacking and I need direction as to what can be improved. Screenshots of my tables:
Table 2
Table 3
Table 1
EDIT 3: PostgreSQL query
SELECT
coalesce(table2_1.usage, table1.ip_source) AS coalesce_1,
coalesce(table2_2.usage, table1.ip_destination) AS coalesce_2,
CASE table1.protocol WHEN %(param_1)s THEN %(param_2)s WHEN %(param_3)s THEN %(param_4)s WHEN %(param_5)s THEN %(param_6)s ELSE table1.protocol END AS anon_1,
table1.application AS table1_application,
table1.destination_port AS table1_destination_port,
table1.vlan_destination AS table1_vlan_destination,
table1.state AS table1_state
FROM
table1
LEFT OUTER JOIN table2 AS table2_2 ON table2_2.ip = table1.ip_destination
LEFT OUTER JOIN table2 AS table2_1 ON table2_1.ip = table1.ip_source
WHERE
table1.vlan_destination IN (
%(vlan_destination_1)s,
%(vlan_destination_2)s,
%(vlan_destination_3)s,
%(vlan_destination_4)s,
%(vlan_destination_5)s
)
AND NOT (
EXISTS (
SELECT
1
FROM
table3
WHERE
table3.usage_source = coalesce(table2_1.usage, table1.ip_source)
AND table3.usage_destination = coalesce(table2_2.usage, table1.ip_destination)
AND table3.protocol = CASE table1.protocol WHEN %(param_1)s THEN %(param_2)s WHEN %(param_3)s THEN %(param_4)s WHEN %(param_5)s THEN %(param_6)s ELSE table1.protocol END
AND table3.application = table1.application
AND table3.destination_port = table1.destination_port
AND table3.vlan_destination = table1.vlan_destination
AND table3.state = table1.state
)
)
Given the current question, I think this at least comes close to what you might be after. The idea is to perform the entire operation in the database, instead of fetching everything – the whole 2,500,000 rows – and filtering in Python etc.:
from sqlalchemy import func, case, insert
from sqlalchemy.orm import aliased


def newhotness(session, vlan_list):
    # The query needs to join Table2 twice, so it has to be aliased
    dst = aliased(Table2)
    src = aliased(Table2)

    # Prepare required SQL expressions
    usage = func.coalesce(dst.usage, Table1.ip_destination)
    usage_ip_src = func.coalesce(src.usage, Table1.ip_source)
    protocol = case({"17": "UDP",
                     "1": "ICMP",
                     "6": "TCP"},
                    value=Table1.protocol,
                    else_=Table1.protocol)

    # Form a query producing the data to insert to Table3
    flows = session.query(
        usage_ip_src,
        usage,
        protocol,
        Table1.application,
        Table1.destination_port,
        Table1.vlan_destination,
        Table1.state).\
        outerjoin(dst, dst.ip == Table1.ip_destination).\
        outerjoin(src, src.ip == Table1.ip_source).\
        filter(Table1.vlan_destination.in_(vlan_list),
               ~session.query(Table3).
               filter_by(usage_source=usage_ip_src,
                         usage_destination=usage,
                         protocol=protocol,
                         application=Table1.application,
                         destination_port=Table1.destination_port,
                         vlan_destination=Table1.vlan_destination,
                         state=Table1.state).
               exists())

    stmt = insert(Table3).from_select(
        ["usage_source", "usage_destination", "protocol", "application",
         "destination_port", "vlan_destination", "state"],
        flows)

    return session.execute(stmt)
If the vlan_list is selective, in other words filters out most rows, this will perform far fewer operations in the database. Depending on the size of Table2 you may benefit from indexing Table2.ip, but do test first. If it is relatively small, I would guess that PostgreSQL will perform a hash or nested loop join there. If some column of the ones used to filter out duplicates in Table3 is unique, you could perform an INSERT ... ON CONFLICT ... DO NOTHING instead of removing duplicates in the SELECT using the NOT EXISTS subquery expression (which PostgreSQL will perform as an antijoin). If there is a possibility that the flows query may produce duplicates, add a call to Query.distinct() to it.
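As a sketch of that ON CONFLICT variant (assuming Table3 has a unique constraint or index covering the de-duplication columns; the function name is just illustrative):
from sqlalchemy.dialects.postgresql import insert as pg_insert

def insert_flows_on_conflict(session, flows):
    # Sketch only: ON CONFLICT DO NOTHING needs a unique constraint/index on
    # Table3 covering these columns, otherwise duplicates are still inserted.
    stmt = pg_insert(Table3).from_select(
        ["usage_source", "usage_destination", "protocol", "application",
         "destination_port", "vlan_destination", "state"],
        flows).on_conflict_do_nothing()
    return session.execute(stmt)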
I'm trying to port to Python a SAP table download script that already works in Excel VBA, but I want a command-line version, and I would prefer to avoid VBScript for a number of reasons that go beyond the goal of this post.
I'm stuck at the point where I need to fill in the values of a table:
from win32com.client import Dispatch
Functions = Dispatch("SAP.Functions")
Functions.Connection.Client = "400"
Functions.Connection.ApplicationServer = "myserver"
Functions.Connection.Language = "EN"
Functions.Connection.User = "myuser"
Functions.Connection.Password = "mypwd"
Functions.Connection.SystemNumber = "00"
Functions.Connection.UseSAPLogonIni = False
if Functions.Connection.Logon(0, True) == True:
    print("Logon OK")
RFC = Functions.Add("RFC_READ_TABLE")
RFC.exports("QUERY_TABLE").Value = "USR02"
RFC.exports("DELIMITER").Value = "~"
#RFC.exports("ROWSKIPS").Value = 2000
#RFC.exports("ROWCOUNT").Value = 10
tblOptions = RFC.Tables("OPTIONS")
#RETURNED DATA
tblData = RFC.Tables("DATA")
tblFields = RFC.Tables("FIELDS")
tblFields.AppendRow ()
print(tblFields.RowCount)
print(tblFields(1,"FIELDNAME"))
# the 2 lines above print 1 and an empty string, so the row in the table exists
Up to here it is basically copied from VBA, adapting the syntax.
In VBA at this point I'm able to do
tblFields(1,"FIELDNAME") = "BNAME"
If I do the same in Python I get an error, because the left-hand side is a function call and, written that way, it returns a string, so there is nothing to assign to. In VBA it is probably a two-dimensional array.
I unsuccessfully tried various approaches like
tblFields.setValue([{"FIELDNAME":"BNAME"}])
tblFields(1,"FIELDNAME").Value = "BNAME"
tblFields(1,"FIELDNAME").setValue("BNAME")
tblFields.FieldName = "BNAME" ##kinda desperate
The script works, without setting the FIELDS table, for outputs that produce rows shorter than 500 chars. This is a SAP limit in the function.
I know that this is not the best way, but I can't use the SAPNWRFC library and I can't use librfc32.dll.
I must be able to solve this way, or revert to the VB version.
Thanks to anyone who can provide a hint.
After a lot of trial and error, I found a solution.
Instead of adding row by row to the "OPTIONS" or "FIELDS" tables, you can just submit a prefilled table.
This should work:
tblFields.Data = (('VBELN', '000000', '000000', '', ''),
('POSNR', '000000', '000000', '', ''))
And the same works here:
tblOptions.Data = (("VBELN EQ '2557788'",),)
I have created a Python module to generate weather data (Latitude, Longitude, Elevation and other details) by taking a particular location as input.
I updated it as per the standards, and the "pycodestyle" package for checking PEP 8 compliance does not throw any errors or warnings.
My code is given below:
import requests


def fetch_location_info(input_list, err_file):
    # URL which gives us Latitude, Longitude values
    LatLong_URL = (
        'http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address='
    )
    # URL which gives us Elevation values
    Elevation_URL = (
        'https://maps.googleapis.com/maps/api/elevation/json?locations='
    )

    # Initializing Error Logs with relevant title for writing error records
    err_line_header = "Logging Location Data Errors"
    print(err_line_header, file=err_file)
    # Insert a new line in the error file after the Error Header
    print("\n", file=err_file)

    # Fetch and Extract Location details from google maps
    input_info = []
    for location in input_list:
        temp_info = {'Location': location}
        latlong_response = requests.get(LatLong_URL + location).json()
        if latlong_response.get('results'):
            for latlong_results in latlong_response.get('results'):
                latlong = (
                    latlong_results
                    .get('geometry', '0')
                    .get('location', '0')
                )
                temp_info['Latitude'] = latlong.get('lat', '0')
                temp_info['Longitude'] = latlong.get('lng', '0')
                elevation_response = requests.get(
                    Elevation_URL
                    + str(temp_info['Latitude'])
                    + ','
                    + str(temp_info['Longitude'])
                ).json()
                if elevation_response.get('results'):
                    for elevation_results in elevation_response.get('results'):
                        temp_info['Elevation'] = (
                            elevation_results.get('elevation', '0'))
                        input_info.append(temp_info)
                        break
                else:
                    print("Elevation_URL is not fetching values for {}"
                          .format(location),
                          file=err_file)
                break
        else:
            print("LatLong_URL is not fetching values for {}"
                  .format(location),
                  file=err_file)
        print("\n", file=err_file)
    return input_info
Now as a next step, I am trying to do unit testing using doctest. I chose to keep the test cases in a separate file, so I created the following .txt file and kept it in the same directory as the code.
This is a doctest based regression suite for Test_Weather.py
Each '>>>' line is run as if in a python shell, and counts as a test.
The next line, if not '>>>', is the expected output of the previous line.
If anything doesn't match exactly (including trailing spaces), the test fails.
>>> from Test_Weather import fetch_location_info
>>> fetch_location_info(["Sydney,Australia"], open('data/error_log.txt', 'w'))
print(input_info)
As seen above, the expected output should be the contents of the list / dataframe / variable created within the function being tested. As a first try I simply tried to print the contents of the list, but my doctest run throws an error like the one below, since the expected value and the got value do not match:
PS C:\Users\JKC> python -m doctest testcases.txt
**********************************************************************
File "testcases.txt", line 7, in testcases.txt
Failed example:
    fetch_location_info(["Sydney,Australia"], open('data/error_log.txt', 'w'))
Expected:
    print(input_info)
Got:
    [{'Location': 'Sydney,Australia', 'Latitude': -33.8688197, 'Longitude': 151.2092955, 'Elevation': 24.5399284362793}]
So here, as you can see, the function call itself worked fine, but because I am not able to print the contents of the list, the test case fails.
My question is: how can I display the contents of the list in the expected section of the test case?
If I am not wrong, do I need to literally write the output value in the expected section of the test case?
Any inputs will be helpful
You need to have the doctest look exactly as if you ran it at the Python REPL:
>>> from Test_Weather import fetch_location_info
>>> fetch_location_info(["Sydney,Australia"], open('data/error_log.txt', 'w'))
[{'Location': 'Sydney,Australia', 'Latitude': -33.8688197, 'Longitude': 151.2092955, 'Elevation': 24.5399284362793}]
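If matching the float values exactly ever becomes brittle, standard doctest also offers the ELLIPSIS directive (a general doctest feature, not something specific to this code), which lets you elide parts of the expected output:
>>> fetch_location_info(["Sydney,Australia"], open('data/error_log.txt', 'w'))  # doctest: +ELLIPSIS
[{'Location': 'Sydney,Australia', 'Latitude': ..., 'Longitude': ..., 'Elevation': ...}]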
I am reading a pandas dataframe and trying to write it to a file with json.dump.
This throws an error -
TypeError(repr(o) + " is not json serializable"
TypeError: 5 is not JSON serializable
(link to error screenshot: http://postimg.org/image/lyr539w5p/)
The code block where the error occurs is shown below; line 3 in this block raises the error:
def write_df_to_json_file(user_array, output_filename):
    try:
        fh = open(output_filename, "w")
        json.dump(user_array, fh)
    except IOError:
        util.logit(3, "Error: can't find file or read data")
    else:
        util.logit(2, str(output_filename) + " file written successful.")
        fh.close()
Taking the output of "user_array", I get the data below:
user_array = {'344': {'216': 4, '215': 3, '213': 4, '297': 4, '684': 4}, '346': {'216': 3, '215': 3, '213': 3, '669': 1, '211': 4, '218': 3, '219': 2, '133': 5, '132': 4, '496': 5, '693': 4, '210': 4, '22': 5, '29': 4, '161': 3, '358': 4}, '347': {'216': 3, '378': 5, '417': 5, '435': 4}}
The code ran fine on OS X with Anaconda, but on Windows it throws the error. Three other members of my team have tried this code on their Windows PCs and they all got the same error. They also have Anaconda with Python 2.7 (same as me).
To do some troubleshooting on a Windows PC, we copied the data (user_array) into a new file (hard-coded it in the main file rather than reading it from the source file) and tried a json.dump. The code runs without any error! The file was successfully created with the json dump.
I have searched online (in stackoverflow as well as other sites) and though people have faced this error, they normally have a dictionary object that couldn't be json dumped. In my case, even if this is a dictionary object, it runs on OS-X. so I assume that there is no problem with the object as such on OS-X.
Is there a particular difference between OS-X and Windows that causes this error? How do I fix this on Windows? Since, my team cannot help with the development if they cannot run the code on Windows.
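One thing worth checking on both platforms (just a debugging sketch, not a confirmed cause): values pulled out of a pandas row are typically NumPy scalar types rather than plain Python int/float, so printing their types may show whether the two machines differ there. json.dump's default= hook is the generic escape hatch for values it cannot serialize:
# Debugging sketch: report the type of each rating before dumping.
for user, movies in user_array.items():
    for movie, rating in movies.items():
        print user, movie, type(rating)

# Generic fallback: let json call int() on anything it cannot serialize.
json.dump(user_array, fh, default=int)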
--------------- Additional info ------------
Adding the blocks that end up calling this function block.
read_csv_to_json is the first function called, and it in turn calls the remaining ones (in the order shown), ending with write_df_to_json_file where I get the error.
At one point, early on, we were doing a simple json.dump on the input file right after reading it from csv, and it ran fine on both Mac and Windows. After a week, when we had all these parsing functions, this error showed up. Not sure if that's helpful info though.
Read csv file
def read_csv_to_json(filename, separator, colnames):
    df_ratings = read_csv(filename, separator, colnames)
    holdout_split(df_ratings)


def read_csv(filename, separator, colnames):
    # read the full u data set containing 100000 ratings by 943 users on 1682 items
    data_ratings = pd.read_csv(filename, sep=separator, header=None, names=colnames)
    return data_ratings


def holdout_split(df_ratings):
    train, test = train_test_split(df_ratings, test_size=0.2)
    parse_df_to_usercf_json(train, file_holdout_utrain)
    parse_df_to_usercf_json(test, file_holdout_utest)
    parse_df_to_itemcf_json(train, file_holdout_itrain)
    parse_df_to_itemcf_json(test, file_holdout_itest)


def parse_df_to_usercf_json(data_ratings, output_filename):
    # sort the data_ratings using the sort_colname
    data_ratings_sorted = data_ratings.sort_values(user_sort_colname)
    # get distinct_userId sorted by user_id
    distinct_userId = sorted(pd.unique(data_ratings_sorted.user_id.ravel()))
    # Setting up counters to slice the dataframe into subgroups with the same userid
    user_marker = 0
    rowCounter = 0
    rowcount = len(data_ratings_sorted.index)
    user_array = {}
    # Slicing the dataframe based on rows with the same userid and storing as dict objects
    for user in distinct_userId:  # for each distinct user
        movie_details = {}
        for index_j, row_j in data_ratings_sorted[user_marker:rowcount].iterrows():
            # dataframe is sliced using a user_marker that is set to the
            # beginning of the current distinct user
            rowCounter += 1  # point rowcounter to the next row
            if user == row_j['user_id']:
                # while the user in the current row matches the current distinct user,
                # create a key value pair {movie id: rating}
                movie_details[str(row_j['movie_id'])] = row_j['rating']
            else:
                # the userid in the current row doesn't match the distinct user
                rowCounter = rowCounter - 1  # step the rowcounter back by one
                user_marker = rowCounter  # set user_marker to the next distinct user id
                # store the userid and movie details as {user_id1: {movie_id: rating}}
                user_array[str(user)] = movie_details
                break  # skip to the next distinct user (back to the top loop)
    # write the final dictionary object/array to json file
    write_df_to_json_file(user_array, output_filename)
I have the following code to query a database using an ADO COM object in Python. This connects to a time series database (OSIPI), and this is the only way we've been able to get Python connected to the database.
from win32com.client import Dispatch

oConn = Dispatch('ADODB.Connection')
oRS = Dispatch('ADODB.RecordSet')

oConn.ConnectionString = <my connection string>
oConn.Open()
oRS.ActiveConnection = oConn

if oConn.State == adStateOpen:
    print "Connected to DB"
else:
    raise SystemError('Database Connection Failed')

cmd = """SELECT tag, dataowner FROM pipoint WHERE tag LIKE 'TEST_TAG1%'"""
oRS.Open(cmd)

result = oRS.GetRows(1)
print result

result2 = oRS.GetRows(2)
print result2

if oConn.State == adStateOpen:
    oConn.Close()
    oConn = None
This code returns the following two lines as results to the query:
result = ((u'TEST_TAG1.QTY.BLACK',), (u'piadmin',))
result2 = ((u'TEST_TAG1.QTY.BLACK', u'TEST_TAG1.QTY.PINK'), (u'piadmin', u'piuser'))
This is not the expected format. In this case, I was expecting something like this:
result = ((u'TEST_TAG1.QTY.BLACK',u'piadmin'))
result2 = ((u'TEST_TAG1.QTY.BLACK',u'piadmin'),
(u'TEST_TAG1.QTY.PINK',u'piuser'))
Is there a way to adjust the results of an ADO query so everything related to row 1 is in the same tuple and everything in row 2 is in the same tuple?
What you're seeing is not really a Python thing but the output of GetRows(), which returns a two-dimensional array organized by field and then by row.
Fortunately, Python has the zip() function that will make the relevant change for you. Try changing your code from:
result = oRS.GetRows(1)
to:
result = zip(*oRS.GetRows(1))
etc.
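Applied to the data from the question, the row-major result would look like this (note that on Python 3, zip() returns an iterator, so wrap it in list() if you need to index it):
rows = zip(*oRS.GetRows(2))
print rows
# [(u'TEST_TAG1.QTY.BLACK', u'piadmin'), (u'TEST_TAG1.QTY.PINK', u'piuser')]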