I'm new at this blog although I've used several tips on publications done here.
I have a code that reads the GPS position, and other data from another device. After this, I do some calculations and store the data in a database and a text file. My problem is that after a while it stops storing the data, but the code keeps running. Any idea why?
My code is a little extensive, but basically it creates two serial variables so I can read two different devices.
Then in a infinite loop I do the following:
open database
manipulate and analyze data
store data in txt file
store data in mysql
close connection
All of this with a try catch so i can save any exception in another text file.
The code I use to store ina text file is:
with open("datos.txt", "a") as mydata:
mydata.write("(" + str(hora)+","+str(latitud)+","+str(longitud)+",#"+str(color)+","+str(latitudReal)+","+str(longitudReal)+","+str(libre)+","+espectro+")" + ", " + cc1 + "\n")
mydata.flush()
mydata.close()
The code to store in the mysql database is:
try:
sql2 = "INSERT INTO pichincha101(hora, latitud, longitud, color, latitudReal, longitudReal, porcentaje, espectro) VALUES ('"+str(hora)+"','"+str(latitud)+"','"+str(longitud)+"','#"+str(color)+"','"+str(latitudReal)+"','"+str(longitudReal)+"','"+str(libre)+"','"+espectro+"')"
cur.execute(sql2)
db.commit()
except Exception as e:
print "error INSERT mysql"
print e
I'm also deleting some arrays and cleaning up the ram before doing it again with this code:
del res1
del res2
del ampp
del amp1
os.system("echo 3 > /proc/sys/vm/drop_caches")
It's the same problem even if I dont include the last code section I just published.
In advanced, thank you for your help.
Perhaps you have a timeout in your mysql (SQL_TIMEOUT) or your server settings? For example, when I work in Apache, I have to adjust the max_execution_time variable. Cheers!
The solution was empting the buffers from my serial devices. Hope it helps someone else.
Related
I think I've lost my mind. I have created a python script to read a temp table in SQL SSMS. While testing, we found out that python is able to query and read the table even when it's not there/queryable in SSMS. I believe the DF is storing in cache or something but let me break down the problem into steps:
Starting point, the temp table is present in SSMS, MAIN_DF = python.read_sql('SELECT Statement') and stored in DF and saved to excel file (using ExcelWriter)
We delete the temp table in SQL, then run the python script again. To make sure, we use THE SAME 'SELECT' statement found in the python script in SSMS and it displays 'Invalid object name' which is correct because the table has been dropped. BUT when I run the python script again, it is able to query the table and get the same results it had before! It should be throwing the same error as SSMS because the table isn't there! Why isn't python starting from scratch when I run it? It seems to be holding information over from the initial run. How do I ensure I am starting from scratch every time I run it?
I have tried many things including starting the script with blank DF's so they should not have anything held over. 'MAIN_DF = pd.DataFrame()'. I have tried deleting the DF's at the end as well. 'del MAIN_DF'
I don't understand what is happening..
try:
conn = pyodbc.connect(r'Driver={SQL Server};Server=GenericServername;Database=testdb;Trusted_Connection=yes;')
print('Connected to SQL: ' + str(datetime.now()))
MAIN_DF = pd.read_sql('SELECT statement',conn)
print('Queried Main DF: ' + str(datetime.now()))
It's because I didn't close the connection conn.close() so it was cached in the memory and SQL didn't perform / close it
Question 1 of 2
I'm trying to import data from CSV file to Vertica using Python, using Uber's vertica-python package. The problem is that whitespace-only data elements are being loaded into Vertica as NULLs; I want only empty data elements to be loaded in as NULLs, and non-empty whitespace data elements to be loaded in as whitespace instead.
For example, the following two rows of a CSV file are both loaded into the database as ('1','abc',NULL,NULL), whereas I want the second one to be loaded as ('1','abc',' ',NULL).
1,abc,,^M
1,abc, ,^M
Here is the code:
# import vertica-python package by Uber
# source: https://github.com/uber/vertica-python
import vertica_python
# write CSV file
filename = 'temp.csv'
data = <list of lists, e.g. [[1,'abc',None,'def'],[2,'b','c','d']]>
with open(filename, 'w', newline='', encoding='utf-8') as f:
writer = csv.writer(f, escapechar='\\', doublequote=False)
writer.writerows(data)
# define query
q = "copy <table_name> (<column_names>) from stdin "\
"delimiter ',' "\
"enclosed by '\"' "\
"record terminator E'\\r' "
# copy data
conn = vertica_python.connect( host=<host>,
port=<port>,
user=<user>,
password=<password>,
database=<database>,
charset='utf8' )
cur = conn.cursor()
with open(filename, 'rb') as f:
cur.copy(q, f)
conn.close()
Question 2 of 2
Are there any other issues (e.g. character encoding) I have to watch out for using this method of loading data into Vertica? Are there any other mistakes in the code? I'm not 100% convinced it will work on all platforms (currently running on Linux; there may be record terminator issues on other platforms, for example). Any recommendations to make this code more robust would be greatly appreciated.
In addition, are there alternative methods of bulk inserting data into Vertica from Python, such as loading objects directly from Python instead of having to write them to CSV files first, without sacrificing speed? The data volume is large and the insert job as is takes a couple of hours to run.
Thank you in advance for any help you can provide!
The copy statement you have should perform the way you want with regards to the spaces. I tested it using a very similar COPY.
Edit: I missed what you were really asking with the copy, I'll leave this part in because it might still be useful for some people:
To fix the whitespace, you can change your copy statement:
copy <table_name> (FIELD1, FIELD2, MYFIELD3 AS FILLER VARCHAR(50), FIELD4, FIELD3 AS NVL(MYFIELD3,'') ) from stdin
By using filler, it will parse that into something like a variable which you can then assign to your actual table field using AS later in the copy.
As for any gotchas... I do what you have on Solaris often. The only one thing I noticed is you are setting the record terminator, not sure if this is really something you need to do depending on environment or not. I've never had to do it switching between linux, windows and solaris.
Also, one hint, this will return a resultset that will tell you how many rows were loaded. Do a fetchone() and print it out and you'll see it.
The only other thing I can recommend might be to use reject tables in case any rows reject.
You mentioned that it is a large job. You may need to increase your read timeout by adding 'read_timeout': 7200, to your connection or more. I'm not sure if None would disable the read timeout or not.
As for a faster way... if the file is accessible directly on the vertica node itself, you could just reference the file directly in the copy instead of doing a copy from stdin and have the daemon load it directly. It's much faster and has a number of optimizations that you can do. You could then use apportioned load, and if you have multiple files to load you can just reference them all together in a list of files.
It's kind of a long topic, though. If you have any specific questions let me know.
I have (what I would consider) a massive set of plain text files, around 400GB, that are being imported into a MySQL database (InnoDB engine). The .txt files range from 2GB to 26GB in size, and each file represents a table in the database. I was given a Python script which parses the .txt files and builds SQL statements. I have a machine specifically dedicated to this task with the following specs:
OS - Windows 10
32GB RAM
4TB hard drive
i7 3.40 GHz processor
I want to optimize this import to be as quick and dirty as possible. I've changed the following config settings in the MySQL my.ini file based on stack O questions, the MySQL docs, and other sources:
max_allowed_packet=1073741824;
autocommit=0;
net_buffer_length=0;
foreign_key_check=0;
unique_checks=0;
innodb_buffer_pool_size=8G; (this made a big difference in speed when I increased from the default of 128M)
Are there other settings in the config file that I missed (maybe around logging or caching) that would direct MySQL to use a significant portion of the machine's resources? Could there be another bottleneck I'm missing?
(Side note: not sure if this is related - when I start the import, the mysqld process spins up to use about 13-15% of the system's memory, but then never seems to purge it when I stop the Python script from continuing the import. I'm wondering if this is a result of messing with the logging and flush settings. Thanks in advance for any help.)
(EDIT)
Here is the relevant part of the Python script that populates the tables. It appears the script is connecting, committing and closing the connection for every 50,000 records. Could I remove the conn.commit() at the end of the function and let MySQL handle the committing? The comments below the while (true) are from the authors of the script, and I've adjusted that number so that it won't exceed max_allowed_packet size.
conn = self.connect()
while (True):
#By default, we concatenate 200 inserts into a single INSERT statement.
#a large batch size per insert improves performance, until you start hitting max_packet_size issues.
#If you increase MySQL server's max_packet_size, you may get increased performance by increasing maxNum
records = self.parser.nextRecords(maxNum=50000)
if (not records):
break
escapedRecords = self._escapeRecords(records) #This will sanitize the records
stringList = ["(%s)" % (", ".join(aRecord)) for aRecord in escapedRecords]
cur = conn.cursor()
colVals = unicode(", ".join(stringList), 'utf-8')
exStr = exStrTemplate % (commandString, ignoreString, tableName, colNamesStr, colVals)
#unquote NULLs
exStr = exStr.replace("'NULL'", "NULL")
exStr = exStr.replace("'null'", "NULL")
try:
cur.execute(exStr)
except MySQLdb.Warning, e:
LOGGER.warning(str(e))
except MySQLdb.IntegrityError, e:
#This is likely a primary key constraint violation; should only be hit if skipKeyViolators is False
LOGGER.error("Error %d: %s", e.args[0], e.args[1])
self.lastRecordIngested = self.parser.latestRecordNum
recCheck = self._checkProgress()
if recCheck:
LOGGER.info("...at record %i...", recCheck)
conn.commit()
conn.close()
I'll try to give a brief background here. I recently received a large amount of data that was all digitized from paper maps. Each map was saved as an individual file that contains a number of records (polygons mostly). My goal is to merge all of these files into one shapefile or geodatabase, which is an easy enough task. However, other than spatial information, the records in the file do not have any distinguishing information so I would like to add a field and populate it with the original file name to track its provenance. For example, in the file "505_dmg.shp" I would like each record to have a "505_dmg" id in a column in the attribute table labeled "map_name". I am trying to automate this using Python and feel like I am very close. Here is the code I'm using:
# Import system module
import arcpy
from arcpy import env
from arcpy.sa import *
# Set overwrite on/off
arcpy.env.overwriteOutput = "TRUE"
# Define workspace
mywspace = "K:/Research/DATA/ADS_data/Historic/R2_ADS_Historical_Maps/Digitized Data/Arapahoe/test"
print mywspace
# Set the workspace for the ListFeatureClass function
arcpy.env.workspace = mywspace
try:
for shp in arcpy.ListFeatureClasses("","POLYGON",""):
print shp
map_name = shp[0:-4]
print map_name
arcpy.AddField_management(shp, "map_name", "TEXT","","","20")
arcpy.CalculateField_management(shp, "map_name","map_name", "PYTHON")
except:
print "Fubar, It's not working"
print arcpy.GetMessages()
else:
print "You're a genius Aaron"
The output I receive from running this script:
>>>
K:/Research/DATA/ADS_data/Historic/R2_ADS_Historical_Maps/Digitized Data/Arapahoe/test
505_dmg.shp
505_dmg
506_dmg.shp
506_dmg
You're a genius Aaron
Appears successful, right? Well, it has been...almost: a field was added and populated for both files, and it is perfect for 505_dmg.shp file. Problem is, 506_dmg.shp has also been labeled "505_dmg" in the "map_name" column. Though the loop appears to be working partially, the map_name variable does not seem to be updating. Any thoughts or suggestions much appreciated.
Thanks,
Aaron
I received a solution from the ESRI discussion board:
https://geonet.esri.com/thread/114520
Basically, a small edit in the Calculate field function did the trick. Here is the new code that worked:
arcpy.CalculateField_management(shp, "map_name","\"" + map_name + "\"", "PYTHON")
I tried researching the answer but was not able to find a good solution. I have files with strange extensions .res. I was told that they are MS Access files. Not sure if they are the same as .mdb but I was able to open them in MS Access. How can I open those files, extract necessary data, sort that data and produce .csv file? I tried using this script: http://mazamascience.com/WorkingWithData/?p=168 and mdb tools on Linux. I got some output with errors in terminal but all the files produced were blank. It could be due to encoding. I am not sure. The file is in ASCII encoding I think.
Error: Table fo_Table
Smart_Battery_Data_Table
MCell_Aci_Data_Table
Aux_Global_Data_Table
Smart_Battery_Clock_Stretch_Table
does not exist in this database.
On Windows I have no idea how to do it. My first step for now is just to dump the necessary table from that database file into .csv. But ideally I need the script to take the file, sort it, extract necessary data, do some calculations (like data in one column divided by data in another column) and save all that stuff into nice .csv.
Thanks a lot. I am not an experienced programmer so please have mercy.
Using the generic pyodbc library should do it. Looks like it has already an embedded MS access driver. This question can probably help you out.
I dont have any MS Access database files with me (It has been ages that I dont have to work with them), but following the examples your code should be something like this:
import pyodbc
db_file = r'''/path/to/the/file.res'''
user = 'admin'
password = 'password'
odbc_conn_str = 'DRIVER={Microsoft Access Driver (*.mdb)};DBQ=%s;UID=%s;PWD=%s' % (db_file, user, password)
conn = pyodbc.connect(odbc_conn_str)
cursor = conn.cursor()
cursor.execute("select * from table order by some_column")
for row in cursor.fetchall():
print ", ".join((row.column1, row.column2, row.columnN))