I am trying to change the format of the date and time values I am receiving from the sensor. I initially receive them as a string, convert them into a datetime, and then try to apply strftime. When I do this in a Jupyter notebook on a set of values it works fine, but when I implement it in my code it breaks. Here is my code:
import json
import socket
from pandas.io.json import json_normalize
from sqlalchemy import create_engine
import pandas as pd
import datetime

# Establish connection with database
engine = create_engine('sqlite:///Production.db', echo=False)

# Establish connection with Spider
server_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server_socket.bind(('192.168.130.35', 8089))

# Receive data while sensor is live
while True:
    message, address = server_socket.recvfrom(1024)
    # Create empty list to hold data of interest
    objs_json = []
    # Record only data where tracked_objects exist within the json stream
    if b'tracked_objects' in message:
        stream = json.loads(message)
        if not stream:
            break
        # Append all data into the list and process it through the parser
        objs_json += stream
        print("Recording Tracked Object")
        # Parsing the json with json_normalize
        objs_df = json_normalize(objs_json, record_path='tracked_objects',
                                 meta=[['metadata', 'serial_number'], 'timestamp'])
        # Renaming columns
        objs_df = objs_df.rename(
            columns={"id": "object_id", "position.x": "x_pos", "position.y": "y_pos",
                     "person_data.height": "height",
                     "metadata.serial_number": "serial_number", "timestamp": "timestamp"})
        # Selecting columns of interest
        objs_df = objs_df.loc[:, ["timestamp", "serial_number", "object_id", "x_pos", "y_pos", "height"]]
        # Converting the datetime into the requested format
        objs_df["timestamp"] = pd.to_datetime(objs_df["timestamp"])
        objs_df["timestamp"].apply(lambda x: x.strftime("%d-%m-%Y %Hh:%Mm:%Ss.%f")[:-3])
        # Writing the data into the SQLite db
        objs_df.to_sql('data_object', con=engine, if_exists='append', index=False)
    # In case there are no tracks, print a message in the console
    else:
        print("No Object Tracked")
        # Empty the list and prepare it for the next capture
        objs_json = []
Here is the error message I am getting:
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Program Files (x86)\Python37-32\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "C:\Users\slavi\PycharmProjects\ProRail_FInal_POC\pythreads\runner.py", line 15, in run
self.function(*self.args, **self.kwargs)
File "C:\Users\slavi\PycharmProjects\ProRail_FInal_POC\ObjectStream.py", line 46, in objectstream
objs_df["timestamp"].apply(lambda x: x.strftime("%d-%m-%Y %Hh:%Mm:%Ss.%f")[:-3])
File "C:\Users\slavi\PycharmProjects\ProRail_FInal_POC\venv\lib\site-packages\pandas\core\series.py", line 4049, in apply
return self._constructor(mapped, index=self.index).__finalize__(self)
File "C:\Users\slavi\PycharmProjects\ProRail_FInal_POC\venv\lib\site-packages\pandas\core\series.py", line 299, in __init__
"index implies {ind}".format(val=len(data), ind=len(index))
ValueError: Length of passed values is 0, index implies 1
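For context, here is a minimal, self-contained version of just the formatting step, which is what runs fine for me in Jupyter (the sample timestamp strings below are made up for illustration):
import pandas as pd

# made-up sample values standing in for the sensor timestamps
sample = pd.DataFrame({"timestamp": ["2020-01-15 09:30:05.123456",
                                     "2020-01-15 09:30:06.654321"]})
sample["timestamp"] = pd.to_datetime(sample["timestamp"])
# note: apply() returns a new Series, so the result is assigned back
sample["timestamp"] = sample["timestamp"].apply(
    lambda x: x.strftime("%d-%m-%Y %Hh:%Mm:%Ss.%f")[:-3])
print(sample)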
Any idea how do I resolve this error?
I'm working on automating some query extractions using Python and pyodbc, then converting the results to Parquet format and sending them to AWS S3.
My script solution is working fine so far, but I have run into a problem. I have a schema, let us call it SCHEMA_A, and inside it several tables, TABLE_1, TABLE_2 .... TABLE_N.
All those tables inside that schema are accessible by using the same credentials.
So I'm using a script like this one to automate the task.
import sys
import time

import pandas as pd
import pyodbc

def get_stream(cursor, batch_size=100000):
    while True:
        row = cursor.fetchmany(batch_size)
        if row is None or not row:
            break
        yield row

cnxn = pyodbc.connect(driver='pyodbc driver here',
                      host='host name',
                      database='schema name',
                      user='user name',
                      password='password')
print('Connection established ...')

cursor = cnxn.cursor()
print('Initializing cursor ...')

if len(sys.argv) > 1:
    table_name = sys.argv[1]
    cursor.execute('SELECT * FROM {}'.format(table_name))
else:
    exit()

print('Query fetched ...')
row_batch = get_stream(cursor)
print('Getting iterator ...')

cols = cursor.description
cols = [col[0] for col in cols]

print('Initializing batch data frame ...')
df = pd.DataFrame(columns=cols)

start_time = time.time()
for rows in row_batch:
    tmp = pd.DataFrame.from_records(rows, columns=cols)
    df = df.append(tmp, ignore_index=True)
    tmp = None
    print("--- Batch inserted in %s seconds ---" % (time.time() - start_time))
    start_time = time.time()
I run code similar to that inside Airflow tasks, and it works just fine for all other tables. But I have two tables, let's call them TABLE_I and TABLE_II, that yield the following error when I execute cursor.fetchmany(batch_size):
ERROR - ('ODBC SQL type -151 is not yet supported. column-index=16 type=-151', 'HY106')
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1285, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1310, in _execute_task
result = task_copy.execute(context=context)
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 117, in execute
return_value = self.execute_callable()
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 128, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/ubuntu/prea-ninja-airflow/jobs/plugins/extract/fetch.py", line 58, in fetch_data
for rows in row_batch:
File "/home/ubuntu/prea-ninja-airflow/jobs/plugins/extract/fetch.py", line 27, in stream
row = cursor.fetchmany(batch_size)
Inspecting those tables with SQLElectron and querying the first few lines, I realized that both TABLE_I and TABLE_II have a column called 'Geolocalizacao'. When I use SQL Server to find the DATA TYPE of that column with:
SELECT DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_NAME = 'TABLE_I' AND
COLUMN_NAME = 'Geolocalizacao';
It yields:
DATA_TYPE
geography
Searching here on Stack Overflow I found this solution: python pyodbc SQL Server Native Client 11.0 cannot return geometry column
According to that user's description, it seems to work fine by adding:
def unpack_geometry(raw_bytes):
    # adapted from SSCLRT information at
    # https://learn.microsoft.com/en-us/openspecs/sql_server_protocols/ms-ssclrt/dc988cb6-4812-4ec6-91cd-cce329f6ecda
    tup = struct.unpack('<i2b3d', raw_bytes)
    # tup contains: (unknown, Version, Serialization_Properties, X, Y, SRID)
    return tup[3], tup[4], tup[5]
and then:
cnxn.add_output_converter(-151, unpack_geometry)
after creating the connection. But it's not working for the GEOGRAPHY data type; when I use this code (adding import struct to the Python script), it gives me the following error:
Traceback (most recent call last):
File "benchmark.py", line 79, in <module>
for rows in row_batch:
File "benchmark.py", line 39, in get_stream
row = cursor.fetchmany(batch_size)
File "benchmark.py", line 47, in unpack_geometry
tup = struct.unpack('<i2b3d', raw_bytes)
struct.error: unpack requires a buffer of 30 bytes
An example of the values that this column holds follows this template:
{"srid":4326,"version":1,"points":[{}],"figures":[{"attribute":1,"pointOffset":0}],"shapes":[{"parentOffset":-1,"figureOffset":0,"type":1}],"segments":[]}
I honestly don't know how to adapt the code for this given structure. Can someone help me? It's been working fine for all other tables, but these two tables with this column are giving me a lot of headache.
Hi, this is what I have done:
from binascii import hexlify
def _handle_geometry(geometry_value):
    return f"0x{hexlify(geometry_value).decode().upper()}"
and then on connection:
cnxn.add_output_converter(-151, _handle_geometry)
This will return the value the same way SSMS does.
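For completeness, this is roughly how the converter plugs into the fetch code from the question (a sketch only; the connection arguments are the same placeholders used above, and TABLE_I stands in for the real table name):
import pyodbc
from binascii import hexlify

def _handle_geometry(geometry_value):
    # return the raw geography/geometry bytes as a hex literal, the way SSMS shows them
    return f"0x{hexlify(geometry_value).decode().upper()}"

cnxn = pyodbc.connect(driver='pyodbc driver here',
                      host='host name',
                      database='schema name',
                      user='user name',
                      password='password')
# -151 is the ODBC type code reported in the original error for this column
cnxn.add_output_converter(-151, _handle_geometry)

cursor = cnxn.cursor()
cursor.execute('SELECT TOP 5 * FROM TABLE_I')
for row in cursor.fetchall():
    print(row)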
I have been working with the Alpha Vantage Python API for a while now, but I have only needed to pull daily and intraday time series data. I am trying to pull extended intraday data, but am not having any luck getting it to work. Running the following code:
from alpha_vantage.timeseries import TimeSeries
apiKey = 'MY API KEY'
ts = TimeSeries(key = apiKey, output_format = 'pandas')
totalData, _ = ts.get_intraday_extended(symbol = 'NIO', interval = '15min', slice = 'year1month1')
print(totalData)
gives me the following error:
Traceback (most recent call last):
File "/home/pi/Desktop/test.py", line 9, in <module>
totalData, _ = ts.get_intraday_extended(symbol = 'NIO', interval = '15min', slice = 'year1month1')
File "/home/pi/.local/lib/python3.7/site-packages/alpha_vantage/alphavantage.py", line 219, in _format_wrapper
self, *args, **kwargs)
File "/home/pi/.local/lib/python3.7/site-packages/alpha_vantage/alphavantage.py", line 160, in _call_wrapper
return self._handle_api_call(url), data_key, meta_data_key
File "/home/pi/.local/lib/python3.7/site-packages/alpha_vantage/alphavantage.py", line 354, in _handle_api_call
json_response = response.json()
File "/usr/lib/python3/dist-packages/requests/models.py", line 889, in json
self.content.decode(encoding), **kwargs
File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
What is interesting is that if you look at the TimeSeries class, it states that extended intraday data is returned as a "time series in one csv_reader object", whereas everything else, which works for me, is returned as "two json objects". I am 99% sure this has something to do with the issue, but I'm not entirely sure, because I would think that calling the intraday extended function would at least return SOMETHING (despite it being in a different format), but instead it just gives me an error.
Another interesting note is that the function refuses to take "adjusted = True" (or False) as an input despite it being in the documentation... likely unrelated, but it might help with the diagnosis.
Seems like TIME_SERIES_INTRADAY_EXTENDED can return only CSV format, but the alpha_vantage wrapper applies JSON methods, which results in the error.
My workaround:
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
apiKey = 'MY API KEY'
ts = TimeSeries(key = apiKey, output_format = 'csv')
#download the csv
totalData = ts.get_intraday_extended(symbol = 'NIO', interval = '15min', slice = 'year1month1')
#csv --> dataframe
df = pd.DataFrame(list(totalData[0]))
#setup of column and index
header_row=0
df.columns = df.iloc[header_row]
df = df.drop(header_row)
df.set_index('time', inplace=True)
#show output
print(df)
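Since the csv reader returns everything as strings, it may also be worth casting the index and the price columns afterwards (optional, and only a sketch):
# optional: the csv reader yields strings, so cast the index and the columns
df.index = pd.to_datetime(df.index)
df = df.apply(pd.to_numeric, errors='coerce')
print(df.dtypes)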
This is an easy way to do it.
import pandas as pd

ticker = 'IBM'
date = 'year1month2'
apiKey = 'MY API KEY'
df = pd.read_csv('https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol='+ticker+'&interval=15min&slice='+date+'&apikey='+apiKey+'&datatype=csv&outputsize=full')
#Show output
print(df)
import pandas as pd
symbol = 'AAPL'
interval = '15min'
slice = 'year1month1'
api_key = ''
adjusted = '&adjusted=true&'
csv_url = 'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol='+symbol+'&interval='+interval+'&slice='+slice+adjusted+'&apikey='+api_key
data = pd.read_csv(csv_url)
print(data.head())
I am trying to count the number of duplicate rows in a 7 GB file, "train.db". My laptop has 8 GB of RAM. Below is the code I have used to obtain the results. When I run the code, I get the error below:
Traceback (most recent call last):
File "C:/Users/tahir/PycharmProjects/stopwordsremovefile/stopwordsrem.py", line 13, in <module>
df_no_dup = pd.read_sql_query('SELECT Title, Body, Tags, COUNT(*) as cnt_dup FROM trainingdata GROUP by Title, Body, Tags', con)
File "C:\Users\tahir\PycharmProjects\stopwordsremovefile\venv\lib\site-packages\pandas\io\sql.py", line 332, in read_sql_query
chunksize=chunksize,
File "C:\Users\tahir\PycharmProjects\stopwordsremovefile\venv\lib\site-packages\pandas\io\sql.py", line 1658, in read_query
data = self._fetchall_as_list(cursor)
File "C:\Users\tahir\PycharmProjects\stopwordsremovefile\venv\lib\site-packages\pandas\io\sql.py", line 1671, in _fetchall_as_list
result = cur.fetchall()
MemoryError
Process finished with exit code 1
Following is the code I am using:
import os
import sqlite3
import pandas as pd
from datetime import datetime
from pandas import DataFrame
if os.path.isfile('train.db'):
    start = datetime.now()
    con = sqlite3.connect('train.db')
    con.text_factory = lambda x: str(x, 'iso-8859-1')
    df_no_dup = pd.read_sql_query('SELECT Title, Body, Tags, COUNT(*) as cnt_dup FROM trainingdata GROUP by Title, Body, Tags', con)
    con.close()
    print("Time taken to run this cell:", datetime.now() - start)
else:
    print("Please download train.db file")
If I create a schema with only the first field, Spark SQL proceeds without issue, but if I uncomment any additional fields (e.g. I've uncommented s_store_id below) in both the mapper and the schema, I get a TypeError saying that IntegerType was not expecting a string.
from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
import pyspark.sql.types as types
from datetime import datetime
from decimal import *
sc = SparkContext()
spark = SparkSession(sc)
#sample raw data for this table
data = [
'1|AAAAAAAABAAAAAAA|1997-03-13||2451189|ought|245|5250760|8AM-4PM|William Ward|2|Unknown|Enough high areas stop expectations. Elaborate, local is|Charles Bartley|1|Unknown|1|Unknown|767|Spring |Wy|Suite 250|Midway|Williamson County|TN|31904|United States|-5|0.03|',
'2|AAAAAAAACAAAAAAA|1997-03-13|2000-03-12||able|236|5285950|8AM-4PM|Scott Smith|8|Unknown|Parliamentary candidates wait then heavy, keen mil|David Lamontagne|1|Unknown|1|Unknown|255|Sycamore |Dr.|Suite 410|Midway|Williamson County|TN|31904|UnitedStates|-5|0.03|',
'3|AAAAAAAACAAAAAAA|2000-03-13|||able|236|7557959|8AM-4PM|Scott Smith|7|Unknown|Impossible, true arms can treat constant, complete w|David Lamontagne|1|Unknown|1|Unknown|877|Park Laurel|Road|Suite T|Midway|Williamson County|TN|31904|United States|-5|0.03|',
'4|AAAAAAAAEAAAAAAA|1997-03-13|1999-03-13|2451044|ese|218|9341467|8AM-4PM|Edwin Adams|4|Unknown|Events would achieve other, eastern hours. Mechanisms must not eat other, new org|Thomas Pollack|1|Unknown|1|Unknown|27|Lake |Ln|Suite 260|Midway|Williamson County|TN|31904|United States|-5|0.03|',
'5|AAAAAAAAEAAAAAAA|1999-03-14|2001-03-12|2450910|anti|288|9078805|8AM-4PM|Edwin Adams|8|Unknown|Events would achieve other, eastern hours. Mechanisms must not eat other, new org|Thomas Pollack|1|Unknown|1|Unknown|27|Lee 6th|Court|Suite 80|Fairview|Williamson County|TN|35709|United States|-5|0.03|'
]
# Import the above sample data into RDD
lines = sc.parallelize(data)
# map the data, return data as a Row, and cast data types of some fields
def mapper(lines):
    r = lines.split("|")
    return Row(
        s_store_sk=int(r[0]),
        s_store_id=r[1],
        # s_rec_start_date=None if r[2]=='' else datetime.strptime(r[2],'%Y-%m-%d').date(),
        # s_rec_end_date=None if r[3]=='' else datetime.strptime(r[3],'%Y-%m-%d').date(),
        # s_closed_date_sk=None if r[4]=='' else int(r[4]),
        # s_store_name=r[5],
        # s_number_employees=None if r[6]=='' else int(r[6]),
        # s_floor_space=None if r[7]=='' else int(r[7]),
        # s_hours=r[8],
        # s_manager=r[9],
        # s_market_id=None if r[10]=='' else int(r[10]),
        # s_geography_class=r[11],
        # s_market_desc=r[12],
        # s_market_manager=r[13],
        # s_division_id=None if r[14]=='' else int(r[14]),
        # s_division_name=r[15],
        # s_company_id=None if r[16]=='' else int(r[16]),
        # s_company_name=r[17],
        # s_street_number=r[18],
        # s_street_name=r[19],
        # s_street_type=r[20],
        # s_suite_number=r[21],
        # s_city=r[22],
        # s_county=r[23],
        # s_state=r[24],
        # s_zip=r[25],
        # s_country=r[26],
        # s_gmt_offset=None if r[27]=='' else Decimal(r[27]),
        # s_tax_precentage=None if r[28]=='' else Decimal(r[28])
    )
# build strict schema for the table closely based on the original sql schema
schema = types.StructType([
    types.StructField('s_store_sk', types.IntegerType())
    ,types.StructField('s_store_id', types.StringType())
    # ,types.StructField('s_rec_start_date', types.DateType())
    # ,types.StructField('s_rec_end_date', types.DateType())
    # ,types.StructField('s_closed_date_sk', types.IntegerType())
    # ,types.StructField('s_store_name', types.StringType())
    # ,types.StructField('s_number_employees', types.IntegerType())
    # ,types.StructField('s_floor_space', types.IntegerType())
    # ,types.StructField('s_hours', types.StringType())
    # ,types.StructField('s_manager', types.StringType())
    # ,types.StructField('s_market_id', types.IntegerType())
    # ,types.StructField('s_geography_class', types.StringType())
    # ,types.StructField('s_market_desc', types.StringType())
    # ,types.StructField('s_market_manager', types.StringType())
    # ,types.StructField('s_division_id', types.IntegerType())
    # ,types.StructField('s_division_name', types.StringType())
    # ,types.StructField('s_company_id', types.IntegerType())
    # ,types.StructField('s_company_name', types.StringType())
    # ,types.StructField('s_street_number', types.StringType())
    # ,types.StructField('s_street_name', types.StringType())
    # ,types.StructField('s_street_type', types.StringType())
    # ,types.StructField('s_suite_number', types.StringType())
    # ,types.StructField('s_city', types.StringType())
    # ,types.StructField('s_county', types.StringType())
    # ,types.StructField('s_state', types.StringType())
    # ,types.StructField('s_zip', types.StringType())
    # ,types.StructField('s_country', types.StringType())
    # ,types.StructField('s_gmt_offset', types.DecimalType())
    # ,types.StructField('s_tax_precentage', types.DecimalType())
])
rows = lines.map(mapper)
# create data frame by passing in the mapped data AND its strict schema
store = spark.createDataFrame(rows,schema)
# create temp table name of the new table
store.createOrReplaceTempView("store")
# run basic SQL query against the table
results = spark.sql("SELECT * FROM store")
# show 20 results from query
results.show()
# end the spark application
spark.stop()
And this throws this error:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/tools/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/tools/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/tools/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/tools/spark2/python/lib/pyspark.zip/pyspark/sql/session.py", line 505, in prepare
File "/tools/spark2/python/lib/pyspark.zip/pyspark/sql/types.py", line 1349, in _verify_type
_verify_type(v, f.dataType, f.nullable)
File "/tools/spark2/python/lib/pyspark.zip/pyspark/sql/types.py", line 1321, in _verify_type
raise TypeError("%s can not accept object %r in type %s" % (dataType, obj, type(obj)))
TypeError: IntegerType can not accept object 'AAAAAAAABAAAAAAA' in type <type 'str'>
Is there something wrong with the way I'm handling these types?
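One thing I am not certain about: as far as I can tell, a Row built from keyword arguments sorts its fields alphabetically in older Spark versions, so the positional order the schema sees may not be the order written in the mapper. A sketch of a mapper that returns a plain tuple in the exact order of the schema (only the two uncommented fields) would be:
# sketch: return a plain tuple whose positions match the StructType exactly
def tuple_mapper(line):
    r = line.split("|")
    return (
        int(r[0]),  # s_store_sk -> IntegerType
        r[1],       # s_store_id -> StringType
    )

store = spark.createDataFrame(lines.map(tuple_mapper), schema)
store.show()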
I am developing an application on App Engine and Python. This app is meant to create routes to several points in town. To create these routes, I send a request to an ArcGIS service. Once that is done, I need to check the status of the request and get a JSON with the results. I check these results with the following method:
def store_route(job_id, token):
    import requests, json

    # Process stops result and store it as json in stops_response
    stops_url = "https://logistics.arcgis.com/arcgis/rest/services/World/VehicleRoutingProblem/GPServer/SolveVehicleRoutingProblem/jobs/"
    stops_url = stops_url + str(job_id) + "/results/out_stops?token=" + str(token) + "&f=json"
    stops_r = requests.get(stops_url)
    stops_response = json.loads(stops_r.text)

    # Process routes result and store it as json in routes_response
    routes_url = "https://logistics.arcgis.com/arcgis/rest/services/World/VehicleRoutingProblem/GPServer/SolveVehicleRoutingProblem/jobs/"
    routes_url = routes_url + str(job_id) + "/results/out_routes?token=" + str(token) + "&f=json"
    routes_r = requests.get(routes_url)
    routes_response = json.loads(routes_r.text)

    from routing.models import ArcGisJob, DeliveryRoute

    # Process each route from the response
    processed_routes = []
    for route_info in routes_response['value']['features']:
        print route_info
        route_name = route_info['attributes']['Name']
        coordinates = route_info['geometry']['paths']
        coordinates_json = {"coordinates": coordinates}

        # Process stops from each route
        stops = []
        for route_stops in stops_response['value']['features']:
            if route_name == route_stops['attributes']['RouteName']:
                stops.append({"Name": route_stops['attributes']['Name'],
                              "Sequence": route_stops['attributes']['Sequence']})
        stops_json = {"content": stops}

        # Create new DeliveryRoute object
        processed_routes.append(DeliveryRoute(name=route_name,
                                              route_coordinates=coordinates_json,
                                              stops=stops_json))

    # Insert a new job table entry with all processed routes
    new_job = ArcGisJob(job_id=str(job_id), routes=processed_routes)
    new_job.put()
As you can see, my code basically walks the JSON returned by the service and parses out the content that interests me. The problem is that I get the following output:
{u'attributes': {
u'Name': u'ruta_4855443348258816',
...
u'StartTime': 1427356800000},
u'geometry': {u'paths': [[[-100.37766063699996, 25.67669987000005],
...
[-100.37716999999998, 25.67715000000004],
[-100.37766063699996, 25.67669987000005]]]}}
ERROR 2015-03-26 19:02:58,405 handlers.py:73] 'geometry'
Traceback (most recent call last):
File "/Users/Vercetti/Dropbox/Logyt/Quaker Routing/logytrouting/routing/handlers.py", line 68, in get
arc_gis.store_route(job_id, token)
File "/Users/Vercetti/Dropbox/Logyt/Quaker Routing/logytrouting/libs/arc_gis.py", line 150, in store_route
coordinates = route_info['geometry']['paths']
KeyError: 'geometry'
ERROR 2015-03-26 19:02:58,412 BaseRequestHandler.py:51] Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/Users/Vercetti/Dropbox/Logyt/Quaker Routing/logytrouting/routing/handlers.py", line 68, in get
arc_gis.store_route(job_id, token)
File "/Users/Vercetti/Dropbox/Logyt/Quaker Routing/logytrouting/libs/arc_gis.py", line 150, in store_route
coordinates = route_info['geometry']['paths']
KeyError: 'geometry'
The actual JSON returned has a lot more info, but I just pasted a small portion of it so you can see that there IS a 'geometry' key. Any idea why I get this error?
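For illustration, a defensive version of the loop body would look roughly like this (a sketch that only logs and skips any feature that comes back without a 'geometry' key, so one malformed route does not abort the whole job):
for route_info in routes_response['value']['features']:
    route_name = route_info['attributes']['Name']
    geometry = route_info.get('geometry')
    if geometry is None:
        # some features apparently come back without a geometry; log and skip
        print('No geometry for route %s, skipping' % route_name)
        continue
    coordinates = geometry['paths']
    coordinates_json = {"coordinates": coordinates}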