How to write a multi-range/overlap query in python sqlite3

I'm using SQLite in Python. I'm pretty new to SQL. My SQL table has two columns, start and end, which represent an interval. I have another "input" list of intervals (represented as a pandas DataFrame) and I'd like to find all the overlaps between the input and the db.
SELECT * FROM db WHERE
-- an overlap check against a single interval takes two conditions, like so:
db.start <= input.end AND db.end >= input.start
My issue is that the above only queries for overlaps with a single input interval; I'm not sure how to write a query for many input intervals at once. I'm also unsure how to write this effectively in Python. From the sqlite3 docs:
t = ('RHAT',)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)
print(c.fetchone())
This seems difficult because I need to pass in a range for my expression, or a list of ranges, and so a single ? probably won't cut it, right?
I'd appreciate either sql or python+sql or suggestions for how to do this entirely differently. Thanks!

Putting multiple interval values into a single query becomes cumbersome quickly.
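To see why, here is a hypothetical sketch of the inline approach, assuming the intervals sit in a pandas DataFrame with start and end columns; the SQL string has to be built dynamically, one pair of placeholders per interval:

import sqlite3
import pandas as pd

# Hypothetical stand-ins for the question's data and database file.
intervals = pd.DataFrame({'start': [1, 10], 'end': [5, 20]})
con = sqlite3.connect('mydata.db')

# One overlap clause per interval, OR-ed together ("end" is quoted
# because it is an SQL keyword)...
clause = ' OR '.join(['(db.start <= ? AND db."end" >= ?)'] * len(intervals))
# ...and a flat parameter list in the matching (end, start) order.
params = [int(v) for e, s in zip(intervals['end'], intervals['start'])
          for v in (e, s)]
rows = con.execute('SELECT * FROM db WHERE ' + clause, params).fetchall()

The statement and parameter list grow with the input, which gets unwieldy fast.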
Create a (temporary) table input for the intervals, and then search for any match in that table:
SELECT *
FROM db
WHERE EXISTS (SELECT *
              FROM input
              WHERE db.start <= input.end
                AND db.end >= input.start);
It's simpler to write this as a join, but then you get multiple output rows if multiple inputs match (OTOH, this might actually be what you want):
SELECT db.*
FROM db
JOIN input ON db.start <= input.end
          AND db.end >= input.start;
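On the Python side, a minimal sketch of wiring up the temporary table with sqlite3 and pandas might look like this; the file name, DataFrame contents, and db schema are assumptions based on the question:

import sqlite3
import pandas as pd

# Hypothetical stand-ins for the question's data and database file.
intervals = pd.DataFrame({'start': [1, 10], 'end': [5, 20]})
con = sqlite3.connect('mydata.db')

# Stage the intervals in a temporary table ("end" is quoted because
# it is an SQL keyword).
con.execute('CREATE TEMP TABLE input ("start" INTEGER, "end" INTEGER)')
con.executemany('INSERT INTO input VALUES (?, ?)',
                [(int(s), int(e)) for s, e
                 in zip(intervals['start'], intervals['end'])])

# Fetch every db row that overlaps at least one input interval.
rows = con.execute('''SELECT * FROM db
                      WHERE EXISTS (SELECT * FROM input
                                    WHERE db.start <= input."end"
                                      AND db."end" >= input.start)''').fetchall()
con.close()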

Related

cannot select a row based on certain string

I have been trying to select rows of a column in a dataset based on a string, named 'Gemeente', the Dutch word for municipality.
I have used the following code to select it:
select * from incomecbs where regioaanduiding = 'Gemeente'
In this case, regioaanduiding means 'region'.
Unfortunately I get no results when doing this.
Does anyone know what I am doing wrong?
There may be trailing spaces in your data. Try using like:
select * from incomecbs where regioaanduiding like 'Gemeente%'
Keep in mind that this may perform poorly on large tables compared to strict equality.
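If trailing spaces are indeed the culprit, another option is to trim before comparing; like the LIKE version this can't use a plain index on the column, but it avoids matching unrelated values that merely start with 'Gemeente':

select * from incomecbs where trim(regioaanduiding) = 'Gemeente'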

2 SQL queries interlaced on ID number

I have 2 queries that will be run repeatedly to feed a report and some charts, so I need to make sure the process is tight. The first query has 25 columns and will yield 25-50 rows from a massive table. My second query will result in another 25 columns (a couple of matching columns) of 25 to 50 rows from another massive table.
The desired end result is a single document in which Query 1 (Problem) and Query 2 (Problem tasks) match on a common column (Problem ID), so that row 1 is the problem, rows 2-4 are its tasks, row 5 is the next problem, rows 6-9 are its tasks, etc. Now I realize I could do this manually by running the 2 queries and just combining them in Excel by hand, but I'm looking for an elegant process that could be reused in my absence without too much overhead.
I was exploring inserts, union all, and cross join, but the 2 queries have different columns that contain different critical data elements to be returned. I'm also exploring setting up a Python job to do this by importing the CSVs and interlacing the results, but I am an early data science student and not yet much past creating charts from imported CSVs.
Any suggestions on how I might attack this challenge? Thanks for the help.
Picture of desired end result.
You can do it with something like
INSERT INTO target_table (<columns...>)
SELECT <your first query>
UNION
SELECT <your second query>
And then to retrieve data
SELECT * from target_table
WHERE <...>
ORDER BY problem_id, task_id
Just ensure both queries return the same columns, i.e. the columns you want to populate in target_table, probably using fixed default values (e.g. the first query may return a default task_id by including NULL as task_id in the column list)
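For instance, a sketch with hypothetical table and column names (problems, problem_tasks, opened) following the pattern above:

INSERT INTO target_table (problem_id, task_id, opened)
SELECT problem_id, NULL AS task_id, open_time FROM problems WHERE <...>
UNION
SELECT problem_id, task_id, date_opened FROM problem_tasks WHERE <...>;

In MySQL and SQLite, NULL sorts before non-NULL values in ascending order, so the ORDER BY problem_id, task_id retrieval above naturally places each problem row before its tasks.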
Thanks for the feedback @gimix. I ended up aliasing the columns that I was able to match up between the 2 tables (open_time vs date_opened, etc.) so they all lined up, and selected '' for the null values I needed. I unioned the 2 select statements as suggested, then I realized I can just insert my filtering queries twice as subqueries. It will now be nice and quickly repeatable for pulling and dropping into Excel 2x per week. Thank you!

shuffling values in sqlite database gives None

I tried to shuffle the values of a column in an SQLite database using SQLAlchemy, but with raw SQL queries.
I have a table mtr_cdr with a column idx. If I have
conn = engine.connect()
conn.execute('SELECT idx FROM mtr_cdr').fetchall()[:3]
I get the right values: [(3,),(3,),(3,)] in this case, which are the top values in the column (at this point, the idx column is ordered, and there are multiple rows that correspond to the same idx).
However, if I try:
conn.execute('SELECT idx FROM mtr_cdr ORDER BY RANDOM()').fetchall()[:3]
I get [(None,),(None,),(None,)].
I'm using sqlalchemy version 1.2.9 if that helps.
Which approach can give me the correct result while remaining workable on a large database? In the full database I expect around 100 million+ rows, and I've heard ORDER BY RANDOM() (if I can even get it to work) will be very slow...

How to query on a MySQL cursor?

I have fetched data into a MySQL cursor; now I want to sum up a column in the result.
Is there any built-in function or anything else I can do to make it work?
db = cursor.execute("""SELECT api_user.id, api_user.udid, api_user.ps, api_user.deid,
                              api_selldata.lid, api_selldata.sells
                       FROM api_user
                       INNER JOIN api_selldata
                           ON api_user.udid = api_selldata.udid
                           AND api_user.pc = 'com'""")
I suspect the answer is that you can't include aggregate functions such as SUM() in a query unless you can guarantee (usually by adding a GROUP BY clause) that the values of the non-aggregated columns are the same for all rows included in the SUM().
The aggregate functions effectively condense a column over many rows into a single value, which cannot be done for non-aggregated columns unless SQL knows that they are guaranteed to have the same value for all considered rows (which a GROUP BY will do, but this may not be what you want).
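For example, a sketch assuming the schema implied by the question, summing sells per user and grouping by the non-aggregated column:

SELECT api_user.id, SUM(api_selldata.sells) AS total_sells
FROM api_user
INNER JOIN api_selldata ON api_user.udid = api_selldata.udid
WHERE api_user.pc = 'com'
GROUP BY api_user.id;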

Algorithm to determine the closest date to some date input

I have a Python program that uses historical data from a database and allows the user to select the date inputs. However, not all possible dates are available in the database, since these are financial data: in other words, if the user inserts "02/03/2014" (which is a Sunday) he won't find any record in the database, because the stock exchange was closed.
This causes SQL problems, because when the record is not found the SQL statement fails, and the user needs to adjust the date until he finds an existing record. To avoid this I would like to build an algorithm which is able to adjust the date input itself, choosing the closest to the original input. For example, if the user inserts "02/03/2014", the closest would be "03/03/2014".
I have thought about something like this, where the table MyDates contains date values only (I'm still working out the proper syntax, but it's just to show the idea):
import sqlite3 as lite

con = lite.connect('C:/.../MyDatabase.db')
cur = con.cursor()
cur.execute('SELECT * FROM MyDates')
rowsD = cur.fetchall()
data = []
for row in rowsD:
    data.append(row[0])  # each row is a 1-tuple holding the date string
>>> data
['01/01/2010', '02/01/2010', ..., '31/12/2013']
from datetime import datetime

# the dates are stored as 'DD/MM/YYYY' strings, so parse before subtracting
fmt = '%d/%m/%Y'
inputDate = datetime.strptime('07/01/2010', fmt)
differences = []
for i in range(0, len(data)):
    differences.append(abs(datetime.strptime(data[i], fmt) - inputDate))
After that, I was thinking about:
getting the minimum value from the vector differences: mV = min(differences)
getting the corresponding date value from the list data
However, this costs me two things:
I need to load the whole database, which is huge;
I have to iterate many times (once to build the list data, then again for the list of differences, etc.).
Does anyone have a better idea to build this, or knows a different approach to the problem?
Query the database for the dates that are smaller than the input date and take the maximum of these. This will give you the closest date before.
Symmetrically, you can query the minimum of the larger dates to get the closest date after. Then keep the closer of the two.
These should be efficient queries.
SELECT MAX(Date)
FROM MyDates
WHERE Date <= InputDate;
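And symmetrically, for the closest date after:

SELECT MIN(Date)
FROM MyDates
WHERE Date >= InputDate;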
I would try getting the record with the maximum date smaller than the given one directly from the database (this can be done in SQL). If you put an index on the date column, this can be done in O(log(n)). That's of course not really the same as "being closest", but if you combine it with "the minimum date bigger than the given one", you will achieve it.
Also, if you know more or less the distribution of your data, for example that within any 7 consecutive days you have some data, then you can restrict the search to a smaller range like [-3 days, +3 days].
Combining both of these solutions should give you quite nice performance.
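Putting both queries together in Python, a minimal sketch (assuming the dates are stored as sortable ISO 'YYYY-MM-DD' strings; the 'DD/MM/YYYY' strings above would need converting first, since they don't sort chronologically):

import sqlite3
from datetime import datetime

def closest_date(con, input_date):
    """Return the stored date nearest to input_date ('YYYY-MM-DD')."""
    cur = con.cursor()
    # Closest date at or before the input; an index on Date makes this O(log n).
    before = cur.execute('SELECT MAX(Date) FROM MyDates WHERE Date <= ?',
                         (input_date,)).fetchone()[0]
    # Closest date at or after the input.
    after = cur.execute('SELECT MIN(Date) FROM MyDates WHERE Date >= ?',
                        (input_date,)).fetchone()[0]
    candidates = [d for d in (before, after) if d is not None]
    # Keep whichever side is nearer to the requested date.
    fmt = '%Y-%m-%d'
    target = datetime.strptime(input_date, fmt)
    return min(candidates, key=lambda d: abs(datetime.strptime(d, fmt) - target))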
