I'm working on a 'controller' for a database which continuously accrues data, but only uses recent data, defined as less than 3 days old. As soon as data becomes more than 3 days old, I'd like to dump it to a JSON file and remove it from the database.
To simulate this, I've done the following. The 'controller' program rethinkdb_monitor.py is
import json
import rethinkdb as r
import pytz
from datetime import datetime, timedelta
# The database and table are assumed to have been previously created
database_name = "sensor_db"
table_name = "sensor_data"
port_offset = 1 # To avoid interference of this testing program with the main program, all ports are initialized at an offset of 1 from the default ports using "rethinkdb --port_offset 1" at the command line.
conn = r.connect("localhost", 28015 + port_offset)
current_time = datetime.utcnow().replace(tzinfo=pytz.utc) # Current time include timezone (assumed UTC)
retention_period = timedelta(days=3) # Period of time during which data is retained on the main server
expiry_time = current_time - retention_period # Age of data which is removed from the main server
data_to_archive = r.db(database_name).table(table_name).filter(r.row['timestamp'] < expiry_time)
output_file = "archived_sensor_data.json"
with open(output_file, 'a') as f:
    for change in data_to_archive.changes().run(conn, time_format="raw"): # The time_format="raw" option is passed to prevent a "RqlTzinfo object is not JSON serializable" error when dumping
        print change
        json.dump(change['new_val'], f) # Since the main database we are reading from is append-only, the 'old_val' of the change is always None and we are interested in the 'new_val' only
        f.write("\n") # Separate entries by a new line
Prior to running this program, I started up RethinkDB using
rethinkdb --port_offset 1
at the command line, and used the web interface at localhost:8081 to create a database called sensor_db with a table called sensor_data (see below).
Once rethinkdb_monitor.py is running and waiting for changes, I run a script rethinkdb_add_data.py which generates synthetic data:
import random
import faker
from datetime import datetime, timedelta
import pytz
import rethinkdb as r
class RandomData(object):
    def __init__(self, seed=None):
        self._seed = seed
        self._random = random.Random()
        self._random.seed(seed)
        self.fake = faker.Faker()
        self.fake.random.seed(seed)

    def __getattr__(self, x):
        return getattr(self._random, x)

    def name(self):
        return self.fake.name()

    def datetime(self, start=None, end=None):
        if start is None:
            start = datetime(2000, 1, 1, tzinfo=pytz.utc) # Jan 1st 2000
        if end is None:
            end = datetime.utcnow().replace(tzinfo=pytz.utc)
        if isinstance(end, datetime):
            dt = end - start
        elif isinstance(end, timedelta):
            dt = end
        assert isinstance(dt, timedelta)
        random_dt = timedelta(microseconds=self._random.randrange(int(dt.total_seconds() * (10 ** 6))))
        return start + random_dt
# Rethinkdb has been started at a port offset of 1 using the "--port_offset 1" argument.
port_offset = 1
conn = r.connect("localhost", 28015 + port_offset).repl()
rd = RandomData(seed=0) # Instantiate and seed a random data generator
# The database and table have been previously created (e.g. through the web interface at localhost:8081)
database_name = "sensor_db"
table_name = "sensor_data"
# Generate random data with timestamps uniformly distributed over the past 6 days
random_data_time_interval = timedelta(days=6)
start_random_data = datetime.utcnow().replace(tzinfo=pytz.utc) - random_data_time_interval
for _ in range(5):
    entry = {"name": rd.name(), "timestamp": rd.datetime(start=start_random_data)}
    r.db(database_name).table(table_name).insert(entry).run()
After interrupting rethinkdb_monitor.py with Ctrl+C, the archived_sensor_data.json file contains the data to be archived:
{"timestamp": {"timezone": "+00:00", "$reql_type$": "TIME", "epoch_time": 1475963599.347}, "id": "be2b5fd7-28df-48ee-b744-99856643265a", "name": "Elizabeth Woods"}
{"timestamp": {"timezone": "+00:00", "$reql_type$": "TIME", "epoch_time": 1475879797.486}, "id": "36d69236-f710-481b-82b6-4a62a1aae36c", "name": "Susan Wagner"}
What I am still struggling with, however, is how to subsequently remove this data from the DB. The command syntax of delete seems to be such that it can be called on a table or selection, but the change obtained through the changefeed is simply a dictionary.
How could I use the changefeed to continuously delete data from the database?
I used the fact that each change contains the ID of the corresponding document in the database, and created a selection using get with this ID:
with open(output_file, 'a') as f:
    for change in data_to_archive.changes().run(conn, time_format="raw"): # The time_format="raw" option is passed to prevent a "RqlTzinfo object is not JSON serializable" error when dumping
        print change
        if change['new_val'] is not None: # If the change is not a deletion
            json.dump(change['new_val'], f) # Since the main database we are reading from is append-only, the 'old_val' of the change is always None and we are interested in the 'new_val' only
            f.write("\n") # Separate entries by a new line
            ID_to_delete = change['new_val']['id'] # Get the ID of the data to be deleted from the database
            r.db(database_name).table(table_name).get(ID_to_delete).delete().run(conn)
The deletions will themselves be registered as changes, but I've used the if change['new_val'] is not None statement to filter these out.
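Since each get(...).delete() call is a separate database round trip, a batched variant can be more efficient when many documents expire at once. A minimal sketch (the batch size of 100 is arbitrary), using RethinkDB's get_all with r.args:

with open(output_file, 'a') as f:
    ids_to_delete = []
    for change in data_to_archive.changes().run(conn, time_format="raw"):
        if change['new_val'] is not None: # Skip the changes generated by our own deletions
            json.dump(change['new_val'], f)
            f.write("\n")
            ids_to_delete.append(change['new_val']['id'])
            if len(ids_to_delete) >= 100: # Flush the accumulated IDs in a single delete query
                r.db(database_name).table(table_name).get_all(r.args(ids_to_delete)).delete().run(conn)
                ids_to_delete = []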
I wrote a script in Python that works perfectly fine if executed as-is. What I am trying to do is break this script into meaningful functions and create a main.py to execute it as a proper Python application.
Here is my LiveStream.py code, which collects data from the sensor at the beginning of every minute, sends it to the MySQL database, and also posts it to the URL. As mentioned, this works perfectly fine if I execute: python3 LiveStream.py
# Import Dependencies
import board
import pandas as pd
from busio import I2C
import adafruit_bme680
from datetime import datetime, timedelta
import time
import requests
import mysql.connector
import json
import sqlalchemy
# read database config file
with open("config.json") as config:
param = json.load(config)
# Create library object using Bus I2C port
i2c = I2C(board.SCL, board.SDA)
bme680 = adafruit_bme680.Adafruit_BME680_I2C(i2c, debug=False)
# change this to match the location's pressure (hPa) at sea level
bme680.sea_level_pressure = 1013.25
# Read data from sensors
while True:
    # Create the now variable to capture the current moment
    TimeStamp = datetime.now()
    Temperature = round((bme680.temperature * 9/5) + 32, 2)
    Gas = round(bme680.gas, 2)
    Humidity = round(bme680.humidity, 2)
    Pressure = round(bme680.pressure, 2)
    Altitude = round(bme680.altitude, 2)
    now = datetime.strftime(TimeStamp, "%Y-%m-%dT%H:%M:%S")

    # Adding collected measurements into dataframe
    data = pd.DataFrame([
        {
            "TimeStamp": now,
            "Temperature": Temperature,
            "Gas": Gas,
            "Humidity": Humidity,
            "Pressure": Pressure,
            "Altitude": Altitude
        }
    ])

    # Try establishing connection with database
    try:
        engine = sqlalchemy.create_engine('mysql+mysqlconnector://{0}:{1}@{2}/{3}'.
                                          format(param['MyDemoServer'][0]['user'],
                                                 param['MyDemoServer'][0]['password'],
                                                 param['MyDemoServer'][0]['host'],
                                                 param['MyDemoServer'][0]['database']), echo=False)
        # Cleaning the data from existing tables MetricValues and Metrics
        db_con = engine.connect()
        if db_con.connect():
            try:
                data.to_sql('sensordata', con=db_con, if_exists='append', index=False)
                db_con.close()
                # Dispose the engine
                engine.dispose()
            except OSError as e:
                print(e)
    except OSError as e:
        print(e)

    # Power BI API
    # BI Address to push the data to
    url = 'https://api.powerbi.com/beta/94cd2fa9-eb6a-490b-af36-53bf7f5ef485/datasets/2a7a2529-dbfd-4c32-9513-7d5857b61137/rows?noSignUpCheck=1&key=nS3bP1Mo4qN9%2Fp6XJcTBgHBUV%2FcOZb0edYrK%2BtVWDg6iWwzRtY16HWUGSqB9YsqF3GHMNO2fe3r5ltB7NhVIvw%3D%3D'
    # post/push data to the streaming API
    headers = {
        "Content-Type": "application/json"
    }
    response = requests.request(
        method="POST",
        url=url,
        headers=headers,
        data=json.dumps(data.to_json())
    )
    data = pd.DataFrame()

    # Re-run the script at the beginning of every new minute.
    dt = datetime.now() + timedelta(minutes=1)
    dt = dt.replace(second=1)
    while datetime.now() < dt:
        time.sleep(1)
Here is what I have tried so far... I created a lib folder containing an etl.py file. In this file I tried creating functions such as:
def sensorsreading():
    # Create library object using Bus I2C port
    i2c = I2C(board.SCL, board.SDA)
    bme680 = adafruit_bme680.Adafruit_BME680_I2C(i2c, debug=False)

    # change this to match the location's pressure (hPa) at sea level
    bme680.sea_level_pressure = 1013.25

    # Read data from sensors
    while True:
        # Create the now variable to capture the current moment
        TimeStamp = datetime.now()
        Temperature = round((bme680.temperature * 9 / 5) + 32, 2)
        Gas = round(bme680.gas, 2)
        Humidity = round(bme680.humidity, 2)
        Pressure = round(bme680.pressure, 2)
        Altitude = round(bme680.altitude, 2)
        now = datetime.strftime(TimeStamp, "%Y-%m-%dT%H:%M:%S")

        # Adding collected measurements into dataframe
        data = pd.DataFrame([
            {
                "TimeStamp": now,
                "Temperature": Temperature,
                "Gas": Gas,
                "Humidity": Humidity,
                "Pressure": Pressure,
                "Altitude": Altitude
            }
        ])
        return data
And also this function:
def dataload(data):
    # Try establishing connection with database
    try:
        engine = sqlalchemy.create_engine('mysql+mysqlconnector://{0}:{1}@{2}/{3}'.
                                          format(param['MyDemoServer'][0]['user'],
                                                 param['MyDemoServer'][0]['password'],
                                                 param['MyDemoServer'][0]['host'],
                                                 param['MyDemoServer'][0]['database']), echo=False)
        # Cleaning the data from existing tables MetricValues and Metrics
        db_con = engine.connect()
        if db_con.connect():
            try:
                data.to_sql('sensordata', con=db_con, if_exists='append', index=False)
                db_con.close()
                # Dispose the engine
                engine.dispose()
            except OSError as e:
                print(e)
    except OSError as e:
        print(e)
And my main.py looks like this:
import pandas as pd
from datetime import datetime, timedelta
import time
from lib.etl import *
def etl(name):
    data = sensorsreading()
    dataload(data)

# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    etl('PyCharm')

    # Re-run the script at the beginning of every new minute.
    dt = datetime.now() + timedelta(minutes=1)
    dt = dt.replace(second=1)
    while datetime.now() < dt:
        time.sleep(1)
When I run main.py it seems that I am not passing the data frame from sensorsreading() to dataload() function.
Any idea what I am doing wrong here?
To address your original question: you were using yield instead of return. yield is used in generators, as you can read more here: https://www.geeksforgeeks.org/use-yield-keyword-instead-return-keyword-python/
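To illustrate the difference: calling a function whose body contains yield returns a generator object rather than a value, which is why dataload() would have received a generator instead of the expected DataFrame. A minimal example (hypothetical function names):

def with_return():
    while True:
        return 42  # runs once and hands the value back

def with_yield():
    while True:
        yield 42  # makes this a generator function

print(with_return())  # 42
print(with_yield())   # <generator object with_yield at 0x...>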
In case you don't need precise execution, this will call the function every 60 seconds. Anyway, I'd suggest using a scheduler like systemd timers or cron.
import time
while True:
    etl('PyCharm')
    time.sleep(60)
If you want something more precise you could use:
import time
starttime = time.time()
starttime = time.time()
while True:
    etl('PyCharm')
    time.sleep(60.0 - ((time.time() - starttime) % 60.0))
as explained in What is the best way to repeatedly execute a function every x seconds?
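If you instead go the cron route suggested above, an entry along these lines in the user's crontab would launch the script once a minute (the path to main.py is hypothetical):

* * * * * /usr/bin/python3 /path/to/main.py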
I have a setup where I need to extract data from Elasticsearch and store it on an Azure Blob. Now to get the data I am using Elasticsearch's _search and _scroll API. The indexes are pretty well designed and are formatted something like game1.*, game2.*, game3.* etc.
I've created a worker.py file, which I stored in a folder called shared_code as Microsoft suggests, and I have several Timer Trigger Functions which import and call worker.py. Due to the way ES was set up on our side, I had to create a VNET and a static outbound IP address, which we've then whitelisted on ES. Conversely, the data is only available to be extracted from ES on port 9200. So I've created an Azure Function App which has the connection set up, and I am trying to create multiple Functions (game1-worker, game2-worker, game3-worker) to pull the data from ES, running in parallel on minute 5. I've noticed that if I add the FUNCTIONS_WORKER_PROCESS_COUNT = 1 setting, the functions will wait until the first triggered one finishes its task before the second one triggers. If I don't add this app setting or increase the number, then once a function has stopped because it finished working, it will try to start again, and I then get an OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted error. Is there a way I can make these run in parallel but not have the mentioned error?
Here is the code for the worker.py:
#!/usr/bin/env python
# coding: utf-8

# # Elasticsearch to Azure Microservice
import json, datetime, gzip, importlib, os, re, logging
from elasticsearch import Elasticsearch
import azure.storage.blob as azsb
import azure.identity as azi
import tempfile


def batch(game_name, env='prod'):
    # #### Global Variables
    env = env.lower()
    connection_string = os.getenv('conn_storage')
    lowerFormat = game_name.lower().replace(" ", "_")
    azFormat = re.sub(r'[^0-9a-zA-Z]+', '-', game_name).lower()
    storageContainerName = azFormat
    stateStorageContainerName = "azure-webjobs-state"
    minutesOffset = 5
    tempFilePath = tempfile.gettempdir()
    curFileName = f"{lowerFormat}_cursor.py"
    curTempFilePath = os.path.join(tempFilePath, curFileName)
    curBlobFilePath = f"cursors/{curFileName}"
    esUrl = os.getenv('esUrl')

    # #### Connections
    es = Elasticsearch(
        esUrl,
        port=9200,
        timeout=300)

    def uploadJsonGzipBlob(filePathAndName, jsonBody):
        blob = azsb.BlobClient.from_connection_string(
            conn_str=connection_string,
            container_name=storageContainerName,
            blob_name=filePathAndName
        )
        blob.upload_blob(gzip.compress(bytes(json.dumps(jsonBody), encoding='utf-8')))

    def getAndLoadCursor(filePathAndName):
        # Get cursor from blob
        blob = azsb.BlobClient.from_connection_string(
            conn_str=os.getenv('AzureWebJobsStorage'),
            container_name=stateStorageContainerName,
            blob_name=filePathAndName
        )
        # Stream it to Temp file
        with open(curTempFilePath, "wb") as f:
            data = blob.download_blob()
            data.readinto(f)
        # Load it by path
        spec = importlib.util.spec_from_file_location("cursor", curTempFilePath)
        cur = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(cur)
        return cur

    def writeCursor(filePathAndName, body):
        blob = azsb.BlobClient.from_connection_string(
            conn_str=os.getenv('AzureWebJobsStorage'),
            container_name=stateStorageContainerName,
            blob_name=filePathAndName
        )
        blob.upload_blob(body, overwrite=True)

    # Parameter and state settings
    if os.getenv(f"{lowerFormat}_maxSizeMB") is None:
        maxSizeMB = 10  # Default to 10 MB
    else:
        maxSizeMB = int(os.getenv(f"{lowerFormat}_maxSizeMB"))
    if os.getenv(f"{lowerFormat}_maxProcessTimeSeconds") is None:
        maxProcessTimeSeconds = 300  # Default to 300 seconds
    else:
        maxProcessTimeSeconds = int(os.getenv(f"{lowerFormat}_maxProcessTimeSeconds"))

    try:
        cur = getAndLoadCursor(curBlobFilePath)
    except Exception as e:
        dtStr = f"{datetime.datetime.utcnow():%Y/%m/%d %H:%M:00}"
        writeCursor(curBlobFilePath, f"# Please use format YYYY/MM/DD HH24:MI:SS\nlastPolled = '{dtStr}'")
        logging.info(f"No cursor file. Generated {curFileName} file with date {dtStr}")
        return 0

    # # Scrolling and Batching Engine
    lastRowDateOffset = cur.lastPolled
    nrFilesThisInstance = 0
    while 1:
        # Offset the current time by -5 minutes to account for the 2-3 min delay in Elasticsearch
        initTime = datetime.datetime.utcnow()
        ## Filter lt (less than) endDate to avoid infinite loops.
        ## Filter lt manually when compiling historical based on
        endDate = initTime - datetime.timedelta(minutes=minutesOffset)
        endDate = f"{endDate:%Y/%m/%d %H:%M:%S}"
        doc = {
            "query": {
                "range": {
                    "baseCtx.date": {
                        "gt": lastRowDateOffset,
                        "lt": endDate
                    }
                }
            }
        }
        Index = lowerFormat + ".*"
        if env == 'dev': Index = 'dev.' + Index
        if nrFilesThisInstance == 0:
            page = es.search(
                index=Index,
                sort="baseCtx.date:asc",
                scroll="2m",
                size=10000,
                body=doc
            )
        else:
            page = es.scroll(scroll_id=sid, scroll="10m")
        pageSize = len(page["hits"]["hits"])
        data = page["hits"]["hits"]
        sid = page["_scroll_id"]
        totalSize = page["hits"]["total"]
        print(f"Total Size: {totalSize}")
        cnt = 0
        # totalSize might be flawed as it returns at times an integer > 0 but array is empty
        # To overcome this, I've added the below check for the array size instead
        if pageSize == 0: break
        while 1:
            cnt += 1
            page = es.scroll(scroll_id=sid, scroll="10m")
            pageSize = len(page["hits"]["hits"])
            sid = page["_scroll_id"]
            data += page["hits"]["hits"]
            sizeMB = len(gzip.compress(bytes(json.dumps(data), encoding='utf-8'))) / (1024**2)
            loopTime = datetime.datetime.utcnow()
            processTimeSeconds = (loopTime - initTime).seconds
            print(f"{cnt} Results pulled: {pageSize} -- Cumulative Results: {len(data)} -- Gzip Size MB: {sizeMB} -- processTimeSeconds: {processTimeSeconds} -- pageSize: {pageSize} -- startDate: {lastRowDateOffset} -- endDate: {endDate}")
            if sizeMB > maxSizeMB: break
            if processTimeSeconds > maxProcessTimeSeconds: break
            if pageSize < 10000: break
        lastRowDateOffset = max([x['_source']['baseCtx']['date'] for x in data])
        lastRowDateOffsetDT = datetime.datetime.strptime(lastRowDateOffset, '%Y/%m/%d %H:%M:%S')
        outFile = f"elasticsearch/live/{lastRowDateOffsetDT:%Y/%m/%d/%H}/{lowerFormat}_live_{lastRowDateOffsetDT:%Y%m%d%H%M%S}.json.gz"
        uploadJsonGzipBlob(outFile, data)
        writeCursor(curBlobFilePath, f"# Please use format YYYY/MM/DD HH24:MI:SS\nlastPolled = '{lastRowDateOffset}'")
        nrFilesThisInstance += 1
        logging.info(f"File compiled: {outFile} -- {sizeMB} MB\n")
        # If the while loop ran for more than maxProcessTimeSeconds then end it
        if processTimeSeconds > maxProcessTimeSeconds: break
        if pageSize < 10000: break
    logging.info(f"Closing Connection to {esUrl}")
    es.close()
    return 0
And these are 2 of the timer triggers I am calling:
game1-worker
import logging
import datetime
import azure.functions as func
#from shared_code import worker
import importlib
def main(mytimer: func.TimerRequest) -> None:
    utc_timestamp = datetime.datetime.utcnow().replace(
        tzinfo=datetime.timezone.utc).isoformat()

    if mytimer.past_due:
        logging.info('The timer is past due!')

    # Load a new instance of worker.py
    spec = importlib.util.spec_from_file_location("worker", "shared_code/worker.py")
    worker = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(worker)
    worker.batch('game1name')

    logging.info('Python timer trigger function ran at %s', utc_timestamp)
game2-worker
import logging
import datetime
import azure.functions as func
#from shared_code import worker
import importlib
def main(mytimer: func.TimerRequest) -> None:
    utc_timestamp = datetime.datetime.utcnow().replace(
        tzinfo=datetime.timezone.utc).isoformat()

    if mytimer.past_due:
        logging.info('The timer is past due!')

    # Load a new instance of worker.py
    spec = importlib.util.spec_from_file_location("worker", "shared_code/worker.py")
    worker = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(worker)
    worker.batch('game2name')

    logging.info('Python timer trigger function ran at %s', utc_timestamp)
TL;DR
Based on what you described, multiple worker-processes share the underlying runtime's resources (sockets).
For your use case, just leave FUNCTIONS_WORKER_PROCESS_COUNT at 1. The default value is supposed to be 1, so not specifying it should mean the same as setting it to 1.
You need to understand how Azure Functions scale. It is very unnatural/confusing.
This assumes the Consumption Plan.
Coding: You write Functions. Say F1 and F2. How you organize them is up to you.
Provisioning:
You create a Function App.
You deploy F1 and F2 to this App.
You start the App (not the individual functions).
Runtime:
At start
Azure spawns one Function Host. Think of this as a container/OS.
Inside the Host, one worker-process is created. This worker-process will host one instance of the App.
If you change FUNCTIONS_WORKER_PROCESS_COUNT to, say, 10, then the Host will spawn 10 worker-processes and run your App inside each of them.
When a Function is triggered (a function could be triggered by a timer, a REST call, a message in a queue, ...):
Each worker-process can service one request at a time, be it a request for F1 or F2. One at a time.
Each Host can service one request per worker-process in it.
If the backlog of requests grows, the Azure load balancer triggers scale-out and creates new Function Hosts.
Based on the limited info, it seems like bad design to create 3 functions. You could instead create a single timer-triggered function which sends out 3 messages to a queue (a Storage Queue should be more than plenty for such minuscule traffic), which in turn triggers your actual Function/implementation (a Storage-Queue-triggered Function). The message would be something like {"game_name": "game1"}.
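As a sketch of that design (the queue name, connection setting, and encoding details are assumptions, not from the original post), the timer-triggered function could enqueue one message per game using the azure-storage-queue SDK:

import json
import os
import azure.functions as func
from azure.storage.queue import QueueClient

GAMES = ["game1", "game2", "game3"]

def main(mytimer: func.TimerRequest) -> None:
    # Reuse the Function App's storage account for a hypothetical "game-batches" queue
    queue = QueueClient.from_connection_string(
        os.environ["AzureWebJobsStorage"],
        queue_name="game-batches",
    )
    for game in GAMES:
        # Note: queue-triggered Functions expect Base64-encoded messages by default,
        # so the queue's message encoding policy may need to be configured accordingly
        queue.send_message(json.dumps({"game_name": game}))

A separate queue-triggered function would then read {"game_name": ...} from the message and call worker.batch(game_name).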
I have a very interesting case. I have built a file management system in Python which moves files from source to destination or archive every time I run it. Now I want to make 2 tables in MySQL (using Python) which actually monitor the file management system.
The first table monitors the last time the file management system ran. So just a small table with 1 column and 1 row which contains the following information --> Last run: 1-1-2020 10:30.
The second table has to give me all the content of the last file or files which were moved from source to destination, in table form.
Every time I run my Python script, 2 things need to happen: 1. the files are moved, and 2. the MySQL monitoring tables are updated. Does anyone know how this needs to be done? Please note I'm using MySQL Workbench 8.0. Thank you indeed.
Here is the code I have right now for moving the files.
import os
import time
from datetime import datetime
import pathlib

SOURCE = r'C:\Users\AM\Desktop\Source'
DESTINATION = r'C:\Users\AM\Desktop\Destination'
ARCHIVE = r'C:\Users\AM\Desktop\Archive'


def get_time_difference(date, time_string):
    """
    You may want to modify this logic to change the way the time difference is calculated.
    """
    time_difference = datetime.now() - datetime.strptime(f"{date} {time_string}", "%d-%m-%Y %H:%M")
    hours = time_difference.total_seconds() // 3600
    minutes = (time_difference.total_seconds() % 3600) // 60
    return f"{int(hours)}:{int(minutes)}"


def move_and_transform_file(file_path, dst_path, delimiter="\t"):
    """
    Reads the data from the old file, writes it into the new file and then
    deletes the old file.
    """
    with open(file_path, "r") as input_file, open(dst_path, "w") as output_file:
        data = {
            "Date": None,
            "Time": None,
            "Power": None,
        }
        time_difference_seen = False
        for line in input_file:
            (line_id, item, line_type, value) = line.strip().split()
            if item in data:
                data[item] = value
            if not time_difference_seen and data["Date"] is not None and data["Time"] is not None:
                time_difference = get_time_difference(data["Date"], data["Time"])
                time_difference_seen = True
                print(delimiter.join([line_id, "TimeDif", line_type, time_difference]), file=output_file)
            if item == "Power":
                value = str(int(value) * 10)
            print(delimiter.join((line_id, item, line_type, value)), file=output_file)
    os.remove(file_path)


def process_files(all_file_paths, newest_file_path, subdir):
    """
    For each file, decide where to send it, then perform the transformation.
    """
    for file_path in all_file_paths:
        if file_path == newest_file_path and os.path.getctime(newest_file_path) < time.time() - 120:
            dst_root = DESTINATION
        else:
            dst_root = ARCHIVE
        dst_path = os.path.join(dst_root, subdir, os.path.basename(file_path))
        move_and_transform_file(file_path, dst_path)


def main():
    """
    Gather the files from the directories and then process them.
    """
    for subdir in os.listdir(SOURCE):
        subdir_path = os.path.join(SOURCE, subdir)
        if not os.path.isdir(subdir_path):
            continue
        all_file_paths = [
            os.path.join(subdir_path, p)
            for p in os.listdir(subdir_path)
            if os.path.isfile(os.path.join(subdir_path, p))
        ]
        if all_file_paths:
            newest_path = max(all_file_paths, key=os.path.getctime)
            process_files(all_file_paths, newest_path, subdir)


if __name__ == "__main__":
    main()
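A minimal sketch of the monitoring updates described above, using mysql.connector (the table names, columns, and credentials are assumptions; adjust to your schema):

import mysql.connector
from datetime import datetime

def update_monitoring_tables(moved_files):
    # moved_files: list of file names moved to DESTINATION during this run
    conn = mysql.connector.connect(
        host="localhost", user="user", password="password", database="filemgmt"
    )
    cur = conn.cursor()
    # Single-row table recording the last run; the fixed id=1 row is overwritten each run
    cur.execute(
        "INSERT INTO last_run (id, ran_at) VALUES (1, %s) "
        "ON DUPLICATE KEY UPDATE ran_at = VALUES(ran_at)",
        (datetime.now(),),
    )
    # One row per file moved in this run
    cur.executemany(
        "INSERT INTO moved_files (file_name, moved_at) VALUES (%s, %s)",
        [(name, datetime.now()) for name in moved_files],
    )
    conn.commit()
    cur.close()
    conn.close()

main() could collect the destination paths it moves and call this function once processing finishes.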
I'm writing a program which periodically dumps old data from a RethinkDB database into a file and removes it from the database. Currently, the data is dumped into a single file which grows without limit. I'd like to change this so that the maximum file size is, say, 250 Mb, and the program starts to write to a new output file just before this size is exceeded.
It seems like Python's RotatingFileHandler class for loggers does approximately what I want; however, I'm not sure whether logging can be applied to any JSON-dumpable object or just to strings.
Another possible approach would be to use (a variant of) Mike Pennington's
RotatingFile class (see python: outfile to another text file if exceed certain file size).
Which of these approaches is likely to be the most fruitful?
For reference, my current program is as follows:
import os
import sys
import json
import rethinkdb as r
import pytz
from datetime import datetime, timedelta
import schedule
import time
import functools
from iclib import RethinkDB
import msgpack

''' The purpose of the Controller is to periodically archive data from the "sensor_data" table so that it does not grow without limit.'''

class Controller(RethinkDB):
    def __init__(self, db_address=(os.environ['DB_ADDR'], int(os.environ['DB_PORT'])), db_name=os.environ['DB_NAME']):
        super(Controller, self).__init__(db_address=db_address, db_name=db_name) # Initialize the IperCronComponent with the default logger name (in this case, "Controller")
        self.db_table = RethinkDB.SENSOR_DATA_TABLE # The table name is "sensor_data" and is stored as a class variable in RethinkDBMixIn

    def generate_archiving_query(self, retention_period=timedelta(days=3)):
        expiry_time = r.now() - retention_period.total_seconds() # Timestamp before which data is to be archived
        if "timestamp" in r.table(self.db_table).index_list().run(self.db): # If "timestamp" is a secondary index
            beginning_of_time = r.time(1400, 1, 1, 'Z') # The minimum time of a ReQL time object (i.e., the year 1400 in the UTC timezone)
            data_to_archive = r.table(self.db_table).between(beginning_of_time, expiry_time, index="timestamp") # Generate query using "between" (faster)
        else:
            data_to_archive = r.table(self.db_table).filter(r.row['timestamp'] < expiry_time) # Generate the same query using "filter" (slower, but does not require "timestamp" to be a secondary index)
        return data_to_archive

    def archiving_job(self, data_to_archive=None, output_file="archived_sensor_data.json"):
        if data_to_archive is None:
            data_to_archive = self.generate_archiving_query() # By default, call the "generate_archiving_query" method to generate the query
        old_data = data_to_archive.run(self.db, time_format="raw") # Without time_format="raw" the output does not dump to JSON
        with open(output_file, 'a') as f:
            ids_to_delete = []
            for item in old_data:
                print item
                # msgpack.dump(item, f)
                json.dump(item, f)
                f.write('\n') # Separate each document by a new line
                ids_to_delete.append(item['id'])
        r.table(self.db_table).get_all(r.args(ids_to_delete)).delete().run(self.db) # Delete based on ID. It is preferred to delete the entire batch in a single operation rather than to delete them one by one in the for loop.


def test_job_1():
    db_name = "ipercron"
    table_name = "sensor_data"
    port_offset = 1 # To avoid interference of this testing program with the main program, all ports are initialized at an offset of 1 from the default ports using "rethinkdb --port_offset 1" at the command line.
    conn = r.connect("localhost", 28015 + port_offset)
    r.db(db_name).table(table_name).delete().run(conn)
    import rethinkdb_add_data
    controller = Controller(db_address=("localhost", 28015 + port_offset))
    archiving_job = functools.partial(controller.archiving_job, data_to_archive=controller.generate_archiving_query())
    return archiving_job


if __name__ == "__main__":
    archiving_job = test_job_1()
    schedule.every(0.1).minutes.do(archiving_job)
    while True:
        schedule.run_pending()
It is not completely 'runnable' from the part shown, but the key point is that I would like to replace the line
json.dump(item, f)
with a similar line in which f is a rotating, and not fixed, file object.
Following Stanislav Ivanov, I used json.dumps to convert each RethinkDB document to a string and wrote this to a RotatingFileHandler:
import os
import sys
import json
import rethinkdb as r
import pytz
from datetime import datetime, timedelta
import schedule
import time
import functools
from iclib import RethinkDB
import msgpack
import logging
from logging.handlers import RotatingFileHandler
from random_data_generator import RandomDataGenerator

''' The purpose of the Controller is to periodically archive data from the "sensor_data" table so that it does not grow without limit.'''

os.environ['DB_ADDR'] = 'localhost'
os.environ['DB_PORT'] = '28015'
os.environ['DB_NAME'] = 'ipercron'

class Controller(RethinkDB):
    def __init__(self, db_address=None, db_name=None):
        if db_address is None:
            db_address = (os.environ['DB_ADDR'], int(os.environ['DB_PORT'])) # The default host ("rethinkdb") and port (28015) are stored as environment variables
        if db_name is None:
            db_name = os.environ['DB_NAME'] # The default database is "ipercron" and is stored as an environment variable
        super(Controller, self).__init__(db_address=db_address, db_name=db_name) # Initialize the instance of the RethinkDB class. IperCronComponent will be initialized with its default logger name (in this case, "Controller")
        self.db_name = db_name
        self.db_table = RethinkDB.SENSOR_DATA_TABLE # The table name is "sensor_data" and is stored as a class variable of RethinkDBMixIn
        self.table = r.db(self.db_name).table(self.db_table)
        self.archiving_logger = logging.getLogger("archiving_logger")
        self.archiving_logger.setLevel(logging.DEBUG)
        self.archiving_handler = RotatingFileHandler("archived_sensor_data.log", maxBytes=2000, backupCount=10)
        self.archiving_logger.addHandler(self.archiving_handler)

    def generate_archiving_query(self, retention_period=timedelta(days=3)):
        expiry_time = r.now() - retention_period.total_seconds() # Timestamp before which data is to be archived
        if "timestamp" in self.table.index_list().run(self.db):
            beginning_of_time = r.time(1400, 1, 1, 'Z') # The minimum time of a ReQL time object (namely, the year 1400 in UTC)
            data_to_archive = self.table.between(beginning_of_time, expiry_time, index="timestamp") # Generate query using "between" (faster, requires "timestamp" to be a secondary index)
        else:
            data_to_archive = self.table.filter(r.row['timestamp'] < expiry_time) # Generate query using "filter" (slower, but does not require "timestamp" to be a secondary index)
        return data_to_archive

    def archiving_job(self, data_to_archive=None):
        if data_to_archive is None:
            data_to_archive = self.generate_archiving_query() # By default, call the "generate_archiving_query" method to generate the query
        old_data = data_to_archive.run(self.db, time_format="raw") # Without time_format="raw" the output does not dump to JSON or msgpack
        ids_to_delete = []
        for item in old_data:
            print item
            self.dump(item)
            ids_to_delete.append(item['id'])
        self.table.get_all(r.args(ids_to_delete)).delete().run(self.db) # Delete based on ID. It is preferred to delete the entire batch in a single operation rather than to delete them one by one in the for-loop.

    def dump(self, item, mode='json'):
        if mode == 'json':
            dump_string = json.dumps(item)
        elif mode == 'msgpack':
            dump_string = msgpack.packb(item)
        self.archiving_logger.debug(dump_string)


def populate_database(db_name, table_name, conn):
    if db_name not in r.db_list().run(conn):
        r.db_create(db_name).run(conn) # Create the database if it does not yet exist
    if table_name not in r.db(db_name).table_list().run(conn):
        r.db(db_name).table_create(table_name).run(conn) # Create the table if it does not yet exist
    r.db(db_name).table(table_name).delete().run(conn) # Empty the table to start with a clean slate
    # Generate random data with timestamps uniformly distributed over the past 6 days
    random_data_time_interval = timedelta(days=6)
    start_random_data = datetime.utcnow().replace(tzinfo=pytz.utc) - random_data_time_interval
    random_generator = RandomDataGenerator(seed=0)
    packets = random_generator.packets(N=100, start=start_random_data)
    # print packets
    print "Adding data to the database..."
    r.db(db_name).table(table_name).insert(packets).run(conn)


if __name__ == "__main__":
    db_name = "ipercron"
    table_name = "sensor_data"
    port_offset = 1 # To avoid interference of this testing program with the main program, all ports are initialized at an offset of 1 from the default ports using "rethinkdb --port_offset 1" at the command line.
    host = "localhost"
    port = 28015 + port_offset
    conn = r.connect(host, port) # RethinkDB connection object
    populate_database(db_name, table_name, conn)
    # import rethinkdb_add_data
    controller = Controller(db_address=(host, port))
    archiving_job = functools.partial(controller.archiving_job, data_to_archive=controller.generate_archiving_query()) # This ensures that the query is only generated once. (This is sufficient since r.now() is re-evaluated every time a connection is made.)
    schedule.every(0.1).minutes.do(archiving_job)
    while True:
        schedule.run_pending()
In this context the RethinkDB class does little other than define the class variable SENSOR_DATA_TABLE and the RethinkDB connection, self.db = r.connect(self.address[0], self.address[1]). This is run together with a module for generating fake data, random_data_generator.py:
import random
import faker
from datetime import datetime, timedelta
import pytz
import rethinkdb as r


class RandomDataGenerator(object):
    def __init__(self, seed=None):
        self._seed = seed
        self._random = random.Random()
        self._random.seed(seed)
        self.fake = faker.Faker()
        self.fake.random.seed(seed)

    def __getattr__(self, x):
        return getattr(self._random, x)

    def name(self):
        return self.fake.name()

    def datetime(self, start=None, end=None):
        if start is None:
            start = datetime(2000, 1, 1, tzinfo=pytz.utc) # Jan 1st 2000
        if end is None:
            end = datetime.utcnow().replace(tzinfo=pytz.utc)
        if isinstance(end, datetime):
            dt = end - start
        elif isinstance(end, timedelta):
            dt = end
        assert isinstance(dt, timedelta)
        random_dt = timedelta(microseconds=self._random.randrange(int(dt.total_seconds() * (10 ** 6))))
        return start + random_dt

    def packets(self, N=1, start=None, end=None):
        return [{'name': self.name(), 'timestamp': self.datetime(start=start, end=end)} for _ in range(N)]
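Since iclib is not publicly available, here is a minimal stand-in for the RethinkDB base class as described above (a sketch; the real class presumably does more):

import rethinkdb as r

class RethinkDB(object):
    SENSOR_DATA_TABLE = "sensor_data"  # The table name used by the Controller

    def __init__(self, db_address=None, db_name=None):
        self.address = db_address
        self.db_name = db_name
        self.db = r.connect(self.address[0], self.address[1])  # Connection object used as 'self.db' above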
When I run the controller it produces several rolled-over output logs, each at most 2 kB in size, as expected.
Working on a script to collect users' browser history with timestamps (educational setting).
Firefox 3 history is kept in a sqlite file, and stamps are in UNIX epoch time... getting them and converting to a readable format via a SQL command in Python is pretty straightforward:
sql_select = """ SELECT datetime(moz_historyvisits.visit_date/1000000,'unixepoch','localtime'),
moz_places.url
FROM moz_places, moz_historyvisits
WHERE moz_places.id = moz_historyvisits.place_id
"""
get_hist = list(cursor.execute (sql_select))
Chrome also stores history in a sqlite file... but its history timestamps are apparently formatted as the number of microseconds since midnight UTC on 1 January 1601.
How can this timestamp be converted to a readable format, as in the Firefox example (like 2010-01-23 11:22:09)? I am writing the script with Python 2.5.x (the version on OS X 10.5), and importing the sqlite3 module...
Try this:
sql_select = """ SELECT datetime(last_visit_time/1000000-11644473600,'unixepoch','localtime'),
url
FROM urls
ORDER BY last_visit_time DESC
"""
get_hist = list(cursor.execute (sql_select))
Or something along those lines; it seems to be working for me. (The constant 11644473600 is the number of seconds between the 1601 epoch that Chrome uses and the 1970 Unix epoch.)
This is a more pythonic and memory-friendly way to do what you described (by the way, thanks for the initial code!):
#!/usr/bin/env python
import os
import datetime
import sqlite3
import opster
from itertools import izip

SQL_TIME = 'SELECT time FROM info'
SQL_URL = 'SELECT c0url FROM pages_content'


def date_from_webkit(webkit_timestamp):
    epoch_start = datetime.datetime(1601, 1, 1)
    delta = datetime.timedelta(microseconds=int(webkit_timestamp))
    return epoch_start + delta


@opster.command()
def import_history(*paths):
    for path in paths:
        assert os.path.exists(path)
        c = sqlite3.connect(path)
        times = (row[0] for row in c.execute(SQL_TIME))
        urls = (row[0] for row in c.execute(SQL_URL))
        for timestamp, url in izip(times, urls):
            date_time = date_from_webkit(timestamp)
            print date_time, url
        c.close()


if __name__ == '__main__':
    opster.dispatch()
The script can be used this way:
$ ./chrome-tools.py import-history ~/.config/chromium/Default/History* > history.txt
Of course Opster can be thrown out but seems handy to me :-)
The sqlite module returns datetime objects for datetime fields, which have a strftime method for formatting them as readable strings.
You can do something like this once you have the recordset:
for record in get_hist:
    date_string = record[0].strftime("%Y-%m-%d %H:%M:%S")
    url = record[1]
This may not be the most Pythonic code in the world, but here's a solution. I cheated by adjusting for the time zone (EST here) by doing this:
utctime = datetime.datetime(1601,1,1) + datetime.timedelta(microseconds = ms, hours =-5)
Here's the function. It assumes that the Chrome history file has been copied from another account into /Users/someuser/Documents/tmp/Chrome/History:
import os
import shutil
import sqlite3
import datetime


def getcr():
    connection = sqlite3.connect('/Users/someuser/Documents/tmp/Chrome/History')
    cursor = connection.cursor()
    get_time = list(cursor.execute("""SELECT last_visit_time FROM urls"""))
    get_url = list(cursor.execute("""SELECT url from urls"""))
    stripped_time = []
    crf = open('/Users/someuser/Documents/tmp/cr/cr_hist.txt', 'w')
    itr = iter(get_time)
    itr2 = iter(get_url)
    while True:
        try:
            newdate = str(itr.next())
            stripped1 = newdate.strip(' (),L')
            ms = int(stripped1)
            utctime = datetime.datetime(1601, 1, 1) + datetime.timedelta(microseconds=ms, hours=-5)
            stripped_time.append(str(utctime))
            newurl = str(itr2.next())
            stripped_url = newurl.strip(' ()')
            stripped_time.append(str(stripped_url))
            crf.write('\n')
            crf.write(str(utctime))
            crf.write('\n')
            crf.write(str(newurl))
            crf.write('\n')
            crf.write('\n')
            crf.write('********* Next Entry *********')
            crf.write('\n')
        except StopIteration:
            break
    crf.close()
    shutil.copy('/Users/someuser/Documents/tmp/cr/cr_hist.txt', '/Users/parent/Documents/Chrome_History_Logs')
    # formatdate is defined elsewhere in the original script
    os.rename('/Users/someuser/Documents/Chrome_History_Logs/cr_hist.txt', '/Users/someuser/Documents/Chrome_History_Logs/%s.txt' % formatdate)