.apply(lambda ...) with strftime produces None - python

I am trying to change the format of the date and time values I receive from the sensor. I initially receive them as strings, convert them to datetime, and then try to apply strftime. When I do this in a Jupyter notebook on a set of values it works fine, but when I implement it in my code it breaks. Here is my code:
import json
import socket
from pandas.io.json import json_normalize
from sqlalchemy import create_engine
import pandas as pd
import datetime

# Establish connection with Database
engine = create_engine('sqlite:///Production.db', echo=False)
# Establish connection with Spider
server_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server_socket.bind(('192.168.130.35', 8089))
# Receive data while sensor is live
while True:
    message, address = server_socket.recvfrom(1024)
    # Create empty list to hold data of interest
    objs_json = []
    # Record only data where tracked_objects exist within json stream
    if b'tracked_objects' in message:
        stream = json.loads(message)
        if not stream:
            break
        # Append all data into list and process through parser
        objs_json += stream
        print("Recording Tracked Object")
        # Parsing json file with json_normalize object
        objs_df = json_normalize(objs_json, record_path='tracked_objects',
                                 meta=[['metadata', 'serial_number'], 'timestamp'])
        # Renaming columns
        objs_df = objs_df.rename(
            columns={"id": "object_id", "position.x": "x_pos", "position.y": "y_pos",
                     "person_data.height": "height",
                     "metadata.serial_number": "serial_number", "timestamp": "timestamp"})
        # Selecting columns of interest
        objs_df = objs_df.loc[:, ["timestamp", "serial_number", "object_id", "x_pos", "y_pos", "height"]]
        # Converting datetime into requested format
        objs_df["timestamp"] = pd.to_datetime(objs_df["timestamp"])
        objs_df["timestamp"].apply(lambda x: x.strftime("%d-%m-%Y %Hh:%Mm:%Ss.%f")[:-3])
        # Writing the data into SQLite db
        objs_df.to_sql('data_object', con=engine, if_exists='append', index=False)
    # In case there are no tracks, print a message in the console.
    else:
        print("No Object Tracked")
        # Empty the list and prepare it for next capture
        objs_json = []
Here is the error message I am getting:
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Program Files (x86)\Python37-32\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "C:\Users\slavi\PycharmProjects\ProRail_FInal_POC\pythreads\runner.py", line 15, in run
self.function(*self.args, **self.kwargs)
File "C:\Users\slavi\PycharmProjects\ProRail_FInal_POC\ObjectStream.py", line 46, in objectstream
objs_df["timestamp"].apply(lambda x: x.strftime("%d-%m-%Y %Hh:%Mm:%Ss.%f")[:-3])
File "C:\Users\slavi\PycharmProjects\ProRail_FInal_POC\venv\lib\site-packages\pandas\core\series.py", line 4049, in apply
return self._constructor(mapped, index=self.index).__finalize__(self)
File "C:\Users\slavi\PycharmProjects\ProRail_FInal_POC\venv\lib\site-packages\pandas\core\series.py", line 299, in __init__
"index implies {ind}".format(val=len(data), ind=len(index))
ValueError: Length of passed values is 0, index implies 1
Any idea how I can resolve this error?
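One thing that stands out (a hedged suggestion rather than a confirmed fix): the result of apply is never assigned back to the column, and the length mismatch suggests the frame the lambda runs over in the live code is not what the notebook test contained. A minimal sketch that guards against empty frames and formats the column with the vectorized .dt.strftime accessor instead of an element-wise apply:

if not objs_df.empty:
    objs_df["timestamp"] = pd.to_datetime(objs_df["timestamp"])
    # .dt.strftime formats the whole column at once; assign the result back
    objs_df["timestamp"] = objs_df["timestamp"].dt.strftime("%d-%m-%Y %Hh:%Mm:%Ss.%f").str[:-3]
    objs_df.to_sql('data_object', con=engine, if_exists='append', index=False)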

Related

pyodbc ERROR - ('ODBC SQL type -151 is not yet supported. column-index=16 type=-151', 'HY106')

I'm working on automating some query extraction using Python and pyodbc, then converting to parquet format and sending to AWS S3.
My script solution is working fine so far, but I have faced a problem. I have a schema, let us call it SCHEMA_A, and inside of it several tables: TABLE_1, TABLE_2, ..., TABLE_N.
All those tables inside that schema are accessible by using the same credentials.
So I'm using a script like this one to automate the task.
def get_stream(cursor, batch_size=100000):
    while True:
        row = cursor.fetchmany(batch_size)
        if row is None or not row:
            break
        yield row

cnxn = pyodbc.connect(driver='pyodbc driver here',
                      host='host name',
                      database='schema name',
                      user='user name',
                      password='password')
print('Connection established ...')
cursor = cnxn.cursor()
print('Initializing cursor ...')
if len(sys.argv) > 1:
    table_name = sys.argv[1]
    cursor.execute('SELECT * FROM {}'.format(table_name))
else:
    exit()
print('Query fetched ...')
row_batch = get_stream(cursor)
print('Getting Iterator ...')
cols = cursor.description
cols = [col[0] for col in cols]
print('Initializing batch data frame ...')
df = pd.DataFrame(columns=cols)
start_time = time.time()
for rows in row_batch:
    tmp = pd.DataFrame.from_records(rows, columns=cols)
    df = df.append(tmp, ignore_index=True)
    tmp = None
    print("--- Batch inserted in %s seconds ---" % (time.time() - start_time))
    start_time = time.time()
I run code similar to that inside Airflow tasks, and it works just fine for all other tables. But then I have two tables, let's call them TABLE_I and TABLE_II, that yield the following error when I execute cursor.fetchmany(batch_size):
ERROR - ('ODBC SQL type -151 is not yet supported. column-index=16 type=-151', 'HY106')
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1285, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1310, in _execute_task
result = task_copy.execute(context=context)
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 117, in execute
return_value = self.execute_callable()
File "/home/ubuntu/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 128, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/ubuntu/prea-ninja-airflow/jobs/plugins/extract/fetch.py", line 58, in fetch_data
for rows in row_batch:
File "/home/ubuntu/prea-ninja-airflow/jobs/plugins/extract/fetch.py", line 27, in stream
row = cursor.fetchmany(batch_size)
Inspecting those tables with SQLElectron and querying the first few lines, I realized that both TABLE_I and TABLE_II have a column called 'Geolocalizacao'. When I use SQL Server to find the DATA TYPE of that column with:
SELECT DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_NAME = 'TABLE_I' AND
COLUMN_NAME = 'Geolocalizacao';
It yields:
DATA_TYPE
geography
Searching here on Stack Overflow I found this solution: python pyodbc SQL Server Native Client 11.0 cannot return geometry column
According to that user's description, it seems to work fine after adding:
def unpack_geometry(raw_bytes):
    # adapted from SSCLRT information at
    # https://learn.microsoft.com/en-us/openspecs/sql_server_protocols/ms-ssclrt/dc988cb6-4812-4ec6-91cd-cce329f6ecda
    tup = struct.unpack('<i2b3d', raw_bytes)
    # tup contains: (unknown, Version, Serialization_Properties, X, Y, SRID)
    return tup[3], tup[4], tup[5]
and then:
cnxn.add_output_converter(-151, unpack_geometry)
after creating the connection. But it's not working for the GEOGRAPHY data type; when I use this code (adding import struct to the Python script), it gives me the following error:
Traceback (most recent call last):
File "benchmark.py", line 79, in <module>
for rows in row_batch:
File "benchmark.py", line 39, in get_stream
row = cursor.fetchmany(batch_size)
File "benchmark.py", line 47, in unpack_geometry
tup = struct.unpack('<i2b3d', raw_bytes)
struct.error: unpack requires a buffer of 30 bytes
An example of values that this column have, follows the given template:
{"srid":4326,"version":1,"points":[{}],"figures":[{"attribute":1,"pointOffset":0}],"shapes":[{"parentOffset":-1,"figureOffset":0,"type":1}],"segments":[]}
I honestly don't know how to adapt the code for this structure. Can someone help me? It's been working fine for all other tables, but those two tables with this column are giving me a lot of headache.
Hi, this is what I have done:
from binascii import hexlify

def _handle_geometry(geometry_value):
    return f"0x{hexlify(geometry_value).decode().upper()}"
and then on connection:
cnxn.add_output_converter(-151, _handle_geometry)
This will return the value the same way SSMS does.
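In the context of the question's streaming code, a minimal usage sketch (connection parameters and the table name are the same placeholders the question uses): register the converter right after pyodbc.connect, before creating the cursor, so fetchmany never sees the unsupported -151 type:

import pyodbc
from binascii import hexlify

def _handle_geometry(geometry_value):
    # return the raw geography value as a hex string, the way SSMS displays it
    return f"0x{hexlify(geometry_value).decode().upper()}"

cnxn = pyodbc.connect(driver='pyodbc driver here',  # placeholder, as in the question
                      host='host name', database='schema name',
                      user='user name', password='password')
# -151 is the ODBC type code reported in the error (geography/geometry)
cnxn.add_output_converter(-151, _handle_geometry)

cursor = cnxn.cursor()
cursor.execute('SELECT * FROM TABLE_I')  # example table from the question
print(cursor.fetchmany(5))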

How do you use the python alpha_vantage API to return extended intraday data?

I have been working with the alpha vantage python API for a while now, but I have only needed to pull daily and intraday timeseries data. I am trying to pull extended intraday data, but am not having any luck getting it to work. Trying to run the following code:
from alpha_vantage.timeseries import TimeSeries
apiKey = 'MY API KEY'
ts = TimeSeries(key = apiKey, output_format = 'pandas')
totalData, _ = ts.get_intraday_extended(symbol = 'NIO', interval = '15min', slice = 'year1month1')
print(totalData)
gives me the following error:
Traceback (most recent call last):
File "/home/pi/Desktop/test.py", line 9, in <module>
totalData, _ = ts.get_intraday_extended(symbol = 'NIO', interval = '15min', slice = 'year1month1')
File "/home/pi/.local/lib/python3.7/site-packages/alpha_vantage/alphavantage.py", line 219, in _format_wrapper
self, *args, **kwargs)
File "/home/pi/.local/lib/python3.7/site-packages/alpha_vantage/alphavantage.py", line 160, in _call_wrapper
return self._handle_api_call(url), data_key, meta_data_key
File "/home/pi/.local/lib/python3.7/site-packages/alpha_vantage/alphavantage.py", line 354, in _handle_api_call
json_response = response.json()
File "/usr/lib/python3/dist-packages/requests/models.py", line 889, in json
self.content.decode(encoding), **kwargs
File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
What is interesting is that if you look at the TimeSeries class, it states that extended intraday is returned as a "time series in one csv_reader object" whereas everything else, which works for me, is returned as "two json objects". I am 99% sure this has something to do with the issue, but I'm not entirely sure because I would think that calling intraday extended function would at least return SOMETHING (despite it being in a different format), but instead just gives me an error.
Another interesting little note is that the function refuses to take "adjusted = True" (or False) as an input despite it being in the documentation... likely unrelated, but it might help diagnose the issue.
Seems like TIME_SERIES_INTRADAY_EXTENDED can return only CSV format, but the alpha_vantage wrapper applies JSON methods, which results in the error.
My workaround:
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
apiKey = 'MY API KEY'
ts = TimeSeries(key = apiKey, output_format = 'csv')
#download the csv
totalData = ts.get_intraday_extended(symbol = 'NIO', interval = '15min', slice = 'year1month1')
#csv --> dataframe
df = pd.DataFrame(list(totalData[0]))
#setup of column and index
header_row=0
df.columns = df.iloc[header_row]
df = df.drop(header_row)
df.set_index('time', inplace=True)
#show output
print(df)
This is an easy way to do it.
ticker = 'IBM'
date= 'year1month2'
apiKey = 'MY API KEY'
df = pd.read_csv('https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol='+ticker+'&interval=15min&slice='+date+'&apikey='+apiKey+'&datatype=csv&outputsize=full')
#Show output
print(df)
import pandas as pd
symbol = 'AAPL'
interval = '15min'
slice = 'year1month1'
api_key = ''
adjusted = '&adjusted=true&'
csv_url = 'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol='+symbol+'&interval='+interval+'&slice='+slice+adjusted+'&apikey='+api_key
data = pd.read_csv(csv_url)
print(data.head())
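If you need the full two years of extended history rather than a single slice, a hedged sketch along the same lines (the slice names follow the pattern shown above; the sleep assumes the free-tier limit of roughly 5 requests per minute):

import time
import pandas as pd

api_key = 'MY API KEY'
symbol = 'NIO'
interval = '15min'

frames = []
for year in (1, 2):
    for month in range(1, 13):
        slice_name = 'year{}month{}'.format(year, month)
        url = ('https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED'
               '&symbol=' + symbol + '&interval=' + interval +
               '&slice=' + slice_name + '&apikey=' + api_key)
        frames.append(pd.read_csv(url))
        time.sleep(15)  # assumption: stay under the free-tier requests-per-minute limit

full_history = pd.concat(frames, ignore_index=True)
print(full_history.head())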

Pycharm Memory Error when I read a 7GB sqlite3 File with pandas

I am trying to count the number of duplicate rows in a 7 GB file, "train.db". My laptop has 8 GB of RAM. Below is the code I have used to obtain the results. When I run the code, I get the error below:
Traceback (most recent call last):
File "C:/Users/tahir/PycharmProjects/stopwordsremovefile/stopwordsrem.py", line 13, in <module>
df_no_dup = pd.read_sql_query('SELECT Title, Body, Tags, COUNT(*) as cnt_dup FROM trainingdata GROUP by Title, Body, Tags', con)
File "C:\Users\tahir\PycharmProjects\stopwordsremovefile\venv\lib\site-packages\pandas\io\sql.py", line 332, in read_sql_query
chunksize=chunksize,
File "C:\Users\tahir\PycharmProjects\stopwordsremovefile\venv\lib\site-packages\pandas\io\sql.py", line 1658, in read_query
data = self._fetchall_as_list(cursor)
File "C:\Users\tahir\PycharmProjects\stopwordsremovefile\venv\lib\site-packages\pandas\io\sql.py", line 1671, in _fetchall_as_list
result = cur.fetchall()
MemoryError
Process finished with exit code 1
Following is the code I am using:
import os
import sqlite3
import pandas as pd
from datetime import datetime
from pandas import DataFrame

if os.path.isfile('train.db'):
    start = datetime.now()
    con = sqlite3.connect('train.db')
    con.text_factory = lambda x: str(x, 'iso-8859-1')
    df_no_dup = pd.read_sql_query('SELECT Title, Body, Tags, COUNT(*) as cnt_dup FROM trainingdata GROUP BY Title, Body, Tags', con)
    con.close()
    print("Time taken to run this cell:", datetime.now() - start)
else:
    print("Please download train.db file")

Spark DF schema works successfully for single field, then throws error when another field is added

If I create a schema with only the first field, Spark SQL proceeds without issue, but if I uncomment any additional fields (i.e. I've uncommented s_store_id below) in both the mapper and the schema, I get a type error saying that the integer designation was not expecting a string.
from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
import pyspark.sql.types as types
from datetime import datetime
from decimal import *
sc = SparkContext()
spark = SparkSession(sc)
#sample raw data for this table
data = [
'1|AAAAAAAABAAAAAAA|1997-03-13||2451189|ought|245|5250760|8AM-4PM|William Ward|2|Unknown|Enough high areas stop expectations. Elaborate, local is|Charles Bartley|1|Unknown|1|Unknown|767|Spring |Wy|Suite 250|Midway|Williamson County|TN|31904|United States|-5|0.03|',
'2|AAAAAAAACAAAAAAA|1997-03-13|2000-03-12||able|236|5285950|8AM-4PM|Scott Smith|8|Unknown|Parliamentary candidates wait then heavy, keen mil|David Lamontagne|1|Unknown|1|Unknown|255|Sycamore |Dr.|Suite 410|Midway|Williamson County|TN|31904|UnitedStates|-5|0.03|',
'3|AAAAAAAACAAAAAAA|2000-03-13|||able|236|7557959|8AM-4PM|Scott Smith|7|Unknown|Impossible, true arms can treat constant, complete w|David Lamontagne|1|Unknown|1|Unknown|877|Park Laurel|Road|Suite T|Midway|Williamson County|TN|31904|United States|-5|0.03|',
'4|AAAAAAAAEAAAAAAA|1997-03-13|1999-03-13|2451044|ese|218|9341467|8AM-4PM|Edwin Adams|4|Unknown|Events would achieve other, eastern hours. Mechanisms must not eat other, new org|Thomas Pollack|1|Unknown|1|Unknown|27|Lake |Ln|Suite 260|Midway|Williamson County|TN|31904|United States|-5|0.03|',
'5|AAAAAAAAEAAAAAAA|1999-03-14|2001-03-12|2450910|anti|288|9078805|8AM-4PM|Edwin Adams|8|Unknown|Events would achieve other, eastern hours. Mechanisms must not eat other, new org|Thomas Pollack|1|Unknown|1|Unknown|27|Lee 6th|Court|Suite 80|Fairview|Williamson County|TN|35709|United States|-5|0.03|'
]
# Import the above sample data into RDD
lines = sc.parallelize(data)
# map the data, return data as a Row, and cast data types of some fields
def mapper(lines):
    r = lines.split("|")
    return Row(
        s_store_sk=int(r[0]),
        s_store_id=r[1],
        # s_rec_start_date=None if r[2]=='' else datetime.strptime(r[2],'%Y-%m-%d').date(),
        # s_rec_end_date=None if r[3]=='' else datetime.strptime(r[3],'%Y-%m-%d').date(),
        # s_closed_date_sk=None if r[4]=='' else int(r[4]),
        # s_store_name=r[5],
        # s_number_employees=None if r[6]=='' else int(r[6]),
        # s_floor_space=None if r[7]=='' else int(r[7]),
        # s_hours=r[8],
        # s_manager=r[9],
        # s_market_id=None if r[10]=='' else int(r[10]),
        # s_geography_class=r[11],
        # s_market_desc=r[12],
        # s_market_manager=r[13],
        # s_division_id=None if r[14]=='' else int(r[14]),
        # s_division_name=r[15],
        # s_company_id=None if r[16]=='' else int(r[16]),
        # s_company_name=r[17],
        # s_street_number=r[18],
        # s_street_name=r[19],
        # s_street_type=r[20],
        # s_suite_number=r[21],
        # s_city=r[22],
        # s_county=r[23],
        # s_state=r[24],
        # s_zip=r[25],
        # s_country=r[26],
        # s_gmt_offset=None if r[27]=='' else Decimal(r[27]),
        # s_tax_precentage=None if r[28]=='' else Decimal(r[28])
    )
#build strict schema for the table closely based on the original sql schema
schema = types.StructType([
types.StructField('s_store_sk',types.IntegerType())
,types.StructField('s_store_id',types.StringType())
# ,types.StructField('s_rec_start_date',types.DateType())
# ,types.StructField('s_rec_end_date',types.DateType())
# ,types.StructField('s_closed_date_sk',types.IntegerType())
# ,types.StructField('s_store_name',types.StringType())
# ,types.StructField('s_number_employees',types.IntegerType())
# ,types.StructField('s_floor_space',types.IntegerType())
# ,types.StructField('s_hours',types.StringType())
# ,types.StructField('s_manager',types.StringType())
# ,types.StructField('s_market_id',types.IntegerType())
# ,types.StructField('s_geography_class',types.StringType())
# ,types.StructField('s_market_desc',types.StringType())
# ,types.StructField('s_market_manager',types.StringType())
# ,types.StructField('s_division_id',types.IntegerType())
# ,types.StructField('s_division_name',types.StringType())
# ,types.StructField('s_company_id',types.IntegerType())
# ,types.StructField('s_company_name',types.StringType())
# ,types.StructField('s_street_number',types.StringType())
# ,types.StructField('s_street_name',types.StringType())
# ,types.StructField('s_street_type',types.StringType())
# ,types.StructField('s_suite_number',types.StringType())
# ,types.StructField('s_city',types.StringType())
# ,types.StructField('s_county',types.StringType())
# ,types.StructField('s_state',types.StringType())
# ,types.StructField('s_zip',types.StringType())
# ,types.StructField('s_country',types.StringType())
# ,types.StructField('s_gmt_offset',types.DecimalType())
# ,types.StructField('s_tax_precentage',types.DecimalType())
])
rows = lines.map(mapper)
# create data frame by passing in the mapped data AND its strict schema
store = spark.createDataFrame(rows,schema)
# create temp table name of the new table
store.createOrReplaceTempView("store")
# run basic SQL query against the table
results = spark.sql("SELECT * FROM store")
# show 20 results from query
results.show()
# end the spark application
spark.stop()
And this throws the following error:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/tools/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/tools/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/tools/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/tools/spark2/python/lib/pyspark.zip/pyspark/sql/session.py", line 505, in prepare
File "/tools/spark2/python/lib/pyspark.zip/pyspark/sql/types.py", line 1349, in _verify_type
_verify_type(v, f.dataType, f.nullable)
File "/tools/spark2/python/lib/pyspark.zip/pyspark/sql/types.py", line 1321, in _verify_type
raise TypeError("%s can not accept object %r in type %s" % (dataType, obj, type(obj)))
TypeError: IntegerType can not accept object 'AAAAAAAABAAAAAAA' in type <type 'str'>
Is there something wrong with the way I'm handling these types?
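A likely explanation (hedged, since it depends on the Spark version, but the traceback shows Spark 2 on Python 2): when Row is built from keyword arguments it sorts its fields alphabetically, so the mapper actually produces (s_store_id, s_store_sk) while the schema declares (s_store_sk, s_store_id), and the IntegerType slot ends up receiving the string ID. One way around it is to return a plain tuple in exactly the schema's order, for example:

# sketch: control the field order explicitly instead of relying on Row(**kwargs)
def mapper(line):
    r = line.split("|")
    # order must match the StructType: s_store_sk, s_store_id, ...
    return (int(r[0]), r[1])

rows = lines.map(mapper)
store = spark.createDataFrame(rows, schema)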

Key Error on python & App Engine

I am developing an application on App Engine and Python. This app is meant to create routes to several points in town. To create these routes, I send a request to an ArcGIS service. Once that is done, I need to check the status of the request and get a JSON with the results. I check these results with the following method:
def store_route(job_id, token):
    import requests, json
    # Process stops result and store it as json in stops_response
    stops_url = "https://logistics.arcgis.com/arcgis/rest/services/World/VehicleRoutingProblem/GPServer/SolveVehicleRoutingProblem/jobs/"
    stops_url = stops_url+str(job_id)+"/results/out_stops?token="+str(token)+"&f=json"
    stops_r = requests.get(stops_url)
    stops_response = json.loads(stops_r.text)
    # Process routes result and store it as json in routes_response
    routes_url = "https://logistics.arcgis.com/arcgis/rest/services/World/VehicleRoutingProblem/GPServer/SolveVehicleRoutingProblem/jobs/"
    routes_url = routes_url+str(job_id)+"/results/out_routes?token="+str(token)+"&f=json"
    routes_r = requests.get(routes_url)
    routes_response = json.loads(routes_r.text)
    from routing.models import ArcGisJob, DeliveryRoute
    # Process each route from response
    processed_routes = []
    for route_info in routes_response['value']['features']:
        print route_info
        route_name = route_info['attributes']['Name']
        coordinates = route_info['geometry']['paths']
        coordinates_json = {"coordinates": coordinates}
        # Process stops from each route
        stops = []
        for route_stops in stops_response['value']['features']:
            if route_name == route_stops['attributes']['RouteName']:
                stops.append({"Name": route_stops['attributes']['Name'],
                              "Sequence": route_stops['attributes']['Sequence']})
        stops_json = {"content": stops}
        # Create new Delivery Route object
        processed_routes.append(DeliveryRoute(name=route_name, route_coordinates=coordinates_json, stops=stops_json))
    # insert a new Job table entry with all processed routes
    new_job = ArcGisJob(job_id=str(job_id), routes=processed_routes)
    new_job.put()
As you can see, what my code does is essentially walk the JSON returned by the service and parse it for the content that interests me. The problem is that I get the following output:
{u'attributes': {
u'Name': u'ruta_4855443348258816',
...
u'StartTime': 1427356800000},
u'geometry': {u'paths': [[[-100.37766063699996, 25.67669987000005],
...
[-100.37716999999998, 25.67715000000004],
[-100.37766063699996, 25.67669987000005]]]}}
ERROR 2015-03-26 19:02:58,405 handlers.py:73] 'geometry'
Traceback (most recent call last):
File "/Users/Vercetti/Dropbox/Logyt/Quaker Routing/logytrouting/routing/handlers.py", line 68, in get
arc_gis.store_route(job_id, token)
File "/Users/Vercetti/Dropbox/Logyt/Quaker Routing/logytrouting/libs/arc_gis.py", line 150, in store_route
coordinates = route_info['geometry']['paths']
KeyError: 'geometry'
ERROR 2015-03-26 19:02:58,412 BaseRequestHandler.py:51] Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/Users/Vercetti/Dropbox/Logyt/Quaker Routing/logytrouting/routing/handlers.py", line 68, in get
arc_gis.store_route(job_id, token)
File "/Users/Vercetti/Dropbox/Logyt/Quaker Routing/logytrouting/libs/arc_gis.py", line 150, in store_route
coordinates = route_info['geometry']['paths']
KeyError: 'geometry'
The actual JSON returned has a lot more info, but I just included a small portion of it so you can see that there IS a 'geometry' key. Any idea why I get this error?
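One thing worth checking (a hedged guess, since only part of the response is shown): the feature that gets printed clearly has a 'geometry' key, but the KeyError suggests that a later entry in routes_response['value']['features'] does not, for example a route the solver could not build. A defensive sketch that skips such features instead of crashing:

for route_info in routes_response['value']['features']:
    attributes = route_info.get('attributes', {})
    geometry = route_info.get('geometry')
    if geometry is None:
        # some features can come back without geometry; log and skip them
        print 'Skipping route without geometry: %s' % attributes.get('Name')
        continue
    route_name = attributes['Name']
    coordinates = geometry['paths']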
