Algorithm to determine the closest date to some date input - python

I have a Python program that uses historical data from a database and lets the user choose the input date. However, not all possible dates are available in the database, since these are financial data: in other words, if the user enters "02/03/2014" (which is a Sunday), he won't find any record in the database, because the stock exchange was closed.
This causes SQL problems: when the record is not found, the SQL statement fails, and the user has to adjust the date until he finds an existing record. To avoid this, I would like to build an algorithm that adjusts the date input itself, choosing the date closest to the original input. For example, if the user enters "02/03/2014", the closest would be "03/03/2014".
I have thought about something like this, where the table MyDates contains date values only (I'm still working out the proper syntax, but this shows the idea):
import sqlite3 as lite
from datetime import datetime

con = lite.connect('C:/.../MyDatabase.db')
cur = con.cursor()
cur.execute('SELECT * FROM MyDates')
rowsD = cur.fetchall()

data = []
for row in rowsD:
    data.append(row[0])  # each row is a 1-tuple holding the date string

>>> data
['01/01/2010', '02/01/2010', .... '31/12/2013']

inputDate = datetime.strptime('07/01/2010', '%d/%m/%Y')
differences = []
for d in data:
    # parse each stored string so the subtraction yields a timedelta
    differences.append(abs(datetime.strptime(d, '%d/%m/%Y') - inputDate))
After that, I was thinking about:
getting the minimum value from the vector of differences: mV = min(differences)
getting the corresponding date value from the list data
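In code, those two steps would be roughly:

mV = min(differences)                  # smallest time difference
closest = data[differences.index(mV)]  # the date that produced it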
However, this costs me two things:
I need to load the whole database, which is huge;
I have to iterate many times (once to build the list data, then again to build the list of differences, etc.)
Does anyone have a better idea to build this, or knows a different approach to the problem?

Query the database for the dates that are smaller than the input date and take the maximum of these. This will give you the closest date before.
Symmetrically, you can query the minimum of the larger dates to get the closest date after, and keep whichever of the two is nearer.
These should be efficient queries.
SELECT MAX(Date)
FROM MyDates
WHERE Date <= InputDate;
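Putting the two queries together from Python might look like the sketch below. This assumes sqlite3, the table and column names from the query above, and dates stored in a sortable format such as yyyy-mm-dd:

from datetime import date

def closest_date(cur, input_date):
    # closest date at or before the input
    cur.execute('SELECT MAX(Date) FROM MyDates WHERE Date <= ?', (input_date,))
    before = cur.fetchone()[0]
    # closest date at or after the input
    cur.execute('SELECT MIN(Date) FROM MyDates WHERE Date >= ?', (input_date,))
    after = cur.fetchone()[0]
    # keep whichever of the two exists and is nearer
    candidates = [c for c in (before, after) if c is not None]
    target = date.fromisoformat(input_date)
    return min(candidates, key=lambda c: abs(date.fromisoformat(c) - target))

With an index on Date, both statements are single index lookups, so nothing but the two candidate rows ever leaves the database.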

I would try getting the record with the maximum date smaller than the given one directly from the database (this can be done in SQL). If you put an index on the date column, this can be done in O(log(n)). That's of course not quite the same as "being closest", but if you combine it with "the minimum date bigger than the given one", you will achieve it.
Also, if you know more or less the distribution of your data, for example that any 7 consecutive days contain some data, then you can restrict the search to a small range like [-3 days, +3 days].
Combining both of these solutions should give you quite nice performance.
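For example, creating the index in SQLite is a single statement, and the distribution assumption turns into a bounded range scan (lower_bound and upper_bound here are hypothetical values computed as the input date ± 3 days):

cur.execute('CREATE INDEX IF NOT EXISTS idx_mydates_date ON MyDates(Date)')
# with the ±3 days assumption, only a handful of rows are ever scanned:
cur.execute('SELECT Date FROM MyDates WHERE Date BETWEEN ? AND ?',
            (lower_bound, upper_bound))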

Related

Mongodb get new values from collection without timestamp

I want to fetch newly added values from a MongoDB collection that has no timestamp value. I guess my only option is the ObjectId field. I'm using a test dataset from GitHub: "https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json"
For example, if I add new data to this collection, how do I fetch or find these new values?
Some MongoDB collections have a timestamp value, and I use that timestamp to get new values. But I don't know how to find them without a timestamp.
Example dataset: [image of sample documents]
I want a filter like this, but it doesn't work:
{_id: {$gt: '622e04d69edb39455e06d4af'}}
If you don't want to create a new field in the document, you can keep the most recent IDs in memory:
SomeGlobalObj = []  // holds at most 10 ObjectIds
// you will need Redis or other outside storage if you run multiple servers
SomeGlobalObj.unshift(newDocumentId)        // prepend the newest ID
SomeGlobalObj = SomeGlobalObj.slice(0, 10)  // keep only the latest 10 IDs
Now, if you want to retrieve the latest documents, you can use this array.
If the new items should disappear once they have been checked, you can remove their IDs from this array after the query.
In the comments you mentioned that you want to do this using Python, so I shall answer from that perspective.
In Mongo, an ObjectId is composed of 3 sections:
a 4-byte timestamp value, representing the ObjectId's creation, measured in seconds since the Unix epoch
a 5-byte random value generated once per process. This random value is unique to the machine and process.
a 3-byte incrementing counter, initialized to a random value
Because of this, we can use the ObjectId to sort or filter by created timestamp. To construct an ObjectId for a specific date, we can use the following code:
import datetime
from bson.objectid import ObjectId

gen_time = datetime.datetime(2010, 1, 1)
dummy_id = ObjectId.from_datetime(gen_time)
result = collection.find({"_id": {"$lt": dummy_id}})
Source: objectid - Tools for working with MongoDB ObjectIds
This example will find all documents created before 2010/01/01. Substituting $gt would allow this query to function as you desire.
If you need to get the timestamp from an ObjectId, you can use the following code:
id = myObjectId.generation_time
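Putting the pieces together for the original question, a polling helper in PyMongo might look like this sketch (fetch_new and last_seen_id are names I'm inventing here; note that ObjectId ordering is only approximate for documents created within the same second on different machines):

def fetch_new(collection, last_seen_id):
    # documents inserted after the last one we saw, oldest first
    new_docs = list(collection.find({"_id": {"$gt": last_seen_id}}).sort("_id", 1))
    if new_docs:
        last_seen_id = new_docs[-1]["_id"]  # remember the newest ID seen so far
    return new_docs, last_seen_id

This is also why the filter in the question doesn't behave as expected: the value compared against _id must be an ObjectId, not its hex string.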

Diagonal Query in Postgresql

I am dealing with a table that has roughly 50k rows, each of which contains a timestamp and an array of smallints of length 25920. What I am trying to do is pull a single value from each array based on a list of timestamps that I pass in. For example, I would pass 25920 timestamps and I would want the first array element for the first timestamp, the second element for the second timestamp, and so on. By now I have tunnel vision and can't seem to find a solution to what is probably a trivial problem.
I either end up pulling the full 25920 rows, which consumes too much memory, or executing 25920 queries, which take way too long for obvious reasons.
I am using Python 3.8 with the psycopg2 module.
Thanks in advance!
You need to generate an index into the array for every row you extract with your query. In this specific case (diagonal) you want an index based on the row number. Something along the lines of:
SELECT ts, val[row_number() over (order by ts)] FROM ... ORDER BY ts
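From Python with psycopg2, one way to wire this up is to send the timestamp list as an array parameter and let unnest(...) WITH ORDINALITY number it, so each timestamp indexes a different array position. A sketch, assuming a hypothetical table readings(ts, vals) where vals is the smallint array:

import psycopg2

query = """
    SELECT t.ts, r.vals[t.idx::int] AS diagonal_value
    FROM unnest(%s::timestamptz[]) WITH ORDINALITY AS t(ts, idx)
    JOIN readings r ON r.ts = t.ts
    ORDER BY t.idx
"""
with psycopg2.connect("dbname=mydb") as conn:
    with conn.cursor() as cur:
        cur.execute(query, (timestamps,))  # timestamps: the 25920 Python datetimes
        diagonal = cur.fetchall()          # one (ts, value) row per timestamp

This fetches exactly one value per row instead of the full arrays, which keeps memory flat.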

Order_By custom date in peewee for SQLite

I made a huge mistake building up a database, but it works perfectly except for one feature. Changing the program in all the places that would need to change for that feature to work would be a titanic job of weeks, so let's hope this workaround is possible.
The issue: I've stored dates in a SQLite database as a TextField in "dd/mm/yyyy" format instead of using a DateField.
The need: I need to sort by date on a union query, to get the last number of records in that union following my custom date format. They come from different tables, so I can't just use rowid or similar to get the last ones; I need to do it by date. And I can't change the data already stored in the database, because there are already invoices created with that format ("dd/mm/yyyy" is the default date format in my country).
This is the query that captures data:
records = []
limited_to = 25
union = (facturas | contado | albaranes | presupuestos)
for record in (union
               .select_from(union.c.idunica, union.c.fecha, union.c.codigo,
                            union.c.tipo, union.c.clienterazonsocial,
                            union.c.totalimporte, union.c.pagada,
                            union.c.contabilizar, union.c.concepto1,
                            union.c.cantidad1, union.c.precio1,
                            union.c.observaciones)
               .order_by(union.c.fecha.desc())  # TODO this is what I need to change.
               .limit(limited_to)
               .tuples()):
    records.append(record)
Now, to complicate things even more, the union is already built with a really complex where clause on each table before it's transformed into a union query.
So my only hope is: is there a way to make order_by follow a custom date format instead?
To clarify, this is the simple transformation that I'd need the order_by clause to follow, because I assume SQLite wouldn't have issues sorting if this were the date format:
def reverse_date(date: str) -> str:
    """Rearrange a dd/mm/yyyy date into yyyy-mm-dd."""
    dd, mm, yyyy = date.split("/")
    return f"{yyyy}-{mm}-{dd}"
Note: I've left a lot of code out because I think it's unnecessary. This is the minimum amount of code needed to understand the problem. Let me know if you need more data.
Update: trying this workaround, it seems to work fine. It needs more testing, but it's promising. If someone ever faces the same issue, here you go:
.order_by(sqlfunc.Substr(union.c.fecha, 7)
          .concat('/')
          .concat(sqlfunc.Substr(union.c.fecha, 4, 2))
          .concat('/')
          .concat(sqlfunc.Substr(union.c.fecha, 1, 2))
          .desc())
Happy end of 2020!
As you pointed out, if you want the dates to sort properly, they need to be in yyyy-mm-dd format, which is the text format you should always use in SQLite (or something with the same year, month, day, order).
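A quick way to convince yourself, comparing the two text formats in plain Python:

>>> sorted(["31/12/2020", "01/01/2021"])   # dd/mm/yyyy: lexicographic order is wrong
['01/01/2021', '31/12/2020']
>>> sorted(["2020-12-31", "2021-01-01"])   # yyyy-mm-dd: lexicographic order is right
['2020-12-31', '2021-01-01']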
You might be able to do the rearrangement using re.sub, with one caveat: re.sub operates on Python strings, not SQL expressions, so the sort has to happen client-side after fetching the rows. For example, since fecha is the second column in the tuples above:
import re

records.sort(key=lambda r: re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\2-\1', r[1]),
             reverse=True)
Here the regex captures the day, month, and year components in separate capture groups, and the replacement reassembles them in an order that sorts correctly.

How to iterate through a Firebird Database without having to completely load it inside my program?

I've recently begun to work with database queries, having been asked to develop a program that reads the last month of data from a Firebird database with almost 100M rows.
After stumbling a bit, I finally managed to filter the database using Python (more specifically, the pandas library), but the code takes more than 8 hours just to filter the data, which makes it useless for a task that has to run at the required frequency.
The rest of the code runs really quickly, since I just need around the last 3000 rows of the dataset.
So far, my function responsible to execute the query is:
import time
import pandas as pd
import pyodbc

def read_query(access):
    start_time = time.time()
    conn = pyodbc.connect(access)
    df = pd.read_sql_query(r"SELECT * from TABLE where DAY >= DATEADD(MONTH,-1, CURRENT_TIMESTAMP(2)) AND DAY <= 'TODAY'", conn)
Or, isolating the query:
SELECT * from TABLE where DAY >= DATEADD(MONTH,-1, CURRENT_TIMESTAMP(2)) AND DAY <= 'TODAY'
Since I will only need an X number of rows from the bottom of the table (where X changes every day), I know I could optimize the code by reading only part of the database, starting from the last rows and iterating through them, without having to process the entire dataframe.
So my question is: how can this be done? And if it's not a good idea/approach, what else could I do to solve this issue?
I think chunksize is your way out; please check the documentation here:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html
and also the examples posted here:
http://shichaoji.com/2016/10/11/python-iterators-loading-data-in-chunks/#Loading-data-in-chunks
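For example, reusing the query string and connection from the question (the chunk size and the process_chunk handler are placeholders to adapt):

for chunk in pd.read_sql_query(query, conn, chunksize=50000):
    # each chunk is a DataFrame of at most 50,000 rows, so memory stays bounded
    process_chunk(chunk)  # hypothetical per-chunk handler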
Good luck!

Django aggregate count of records per day

I've got a django app that is doing some logging. My model looks like this:
class MessageLog(models.Model):
    logtime = models.DateTimeField(auto_now_add=True)
    user = models.CharField(max_length=50)
    message = models.CharField(max_length=512)
What I want to do is get the average number of messages logged per day of the week, so that I can see which days are the most active. I've managed to write a query that pulls the total number of messages per weekday:
for i in range(1, 8):
    MessageLog.objects.filter(logtime__week_day=i).count()
But I'm having trouble calculating the average in a query. What I have right now is:
for i in range(1, 8):
    MessageLog.objects.filter(logtime__week_day=i).annotate(num_msgs=Count('id')).aggregate(Avg('num_msgs'))
For some reason this is returning 1.0 for every day though. I looked at the SQL it is generating and it is:
SELECT AVG(num_msgs) FROM (
SELECT
`myapp_messagelog`.`id` AS `id`, `myapp_messagelog`.`logtime` AS `logtime`,
`myapp_messagelog`.`user` AS `user`, `myapp_messagelog`.`message` AS `message`,
COUNT(`myapp_messagelog`.`id`) AS `num_msgs`
FROM `myapp_messagelog`
WHERE DAYOFWEEK(`myapp_messagelog`.`logtime`) = 1
GROUP BY `myapp_messagelog`.`id` ORDER BY NULL
) subquery
I think the problem might be coming from the GROUP BY id but I'm not really sure. Anyone have any ideas or suggestions? Thanks in advance!
The reason your listed query always gives 1 is because you're not grouping by date. Basically, you've asked the database to take the MessageLog rows that fall on a given day of the week. For each such row, count how many ids it has (always 1). Then take the average of all those counts, which is of course also 1.
Normally, you would need to use a values clause to group your MessageLog rows prior to your annotate and aggregate parts. However, since your logtime field is a datetime rather than just a date, I am not sure you can express that directly with Django's ORM. You can definitely do it with an extra clause, as shown here. Or if you felt like it you could declare a view in your SQL with as much of the aggregating and average math as you liked and declare an unmanaged model for it, then just use the ORM normally.
So an extra field works to get the total number of records per actual day, but doesn't handle aggregating the average of the computed annotation. I think this may be sufficiently abstracted from the model that you'd have to use a raw SQL query, or at least I can't find anything that makes it work in one call.
That said, you already know how you can get the total number of records per weekday in a simple query as shown in your question.
And this query will tell you how many distinct date records there are on a given weekday:
MessageLog.objects.filter(logtime__week_day=i).dates('logtime', 'day').count()
So you could do the averaging math in Python instead, which might be simpler than trying get the SQL right.
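Concretely, the Python-side averaging could look like this sketch, built from the two queries above:

averages = {}
for i in range(1, 8):
    qs = MessageLog.objects.filter(logtime__week_day=i)
    total = qs.count()                          # total messages on this weekday
    days = qs.dates('logtime', 'day').count()   # distinct dates with that weekday
    averages[i] = float(total) / days if days else 0.0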
Alternately, this query will get you the raw number of messages for all weekdays in one query rather than a for loop:
MessageLog.objects.extra({'weekday': "dayofweek(logtime)"}).values('weekday').annotate(Count('id'))
But I haven't been able to get a nice query to give you the count of distinct dates for each weekday annotated to that - dates querysets lose the ability to handle annotate calls, and annotating over an extra value doesn't seem to work either.
This has been surprisingly tricky, given that it's not that hard a SQL expression.
I do something similar with a datetime field, but annotating over extra values does work for me. I have a Record model with a datetime field "created_at" and a "my_value" field I want to get the average for.
from django.db.models import Avg
qs = Record.objects.extra({'created_day': "date(created_at)"}).\
    values('created_day').\
    annotate(count=Avg('my_value'))
The above will group by the day of the datetime value in "created_at" field.
queryset.extra(select={'day': 'date(logtime)'}).values('day').order_by('-day').annotate(Count('id'))
