I am dealing with a table that has roughly 50k rows, each containing a timestamp and an array of smallints of length 25920. What I am trying to do is pull a single value from each array based on a list of timestamps that is passed in. For example, I would pass 25920 timestamps and want the first array element for the first timestamp, the second element for the second timestamp, and so on. By now I have tunnel vision and cannot seem to find a solution to what is probably a trivial problem.
I either end up pulling the full 25920 rows, which consumes too much memory, or executing 25920 queries, which take way too long for obvious reasons.
I am using Python 3.8 with the psycopg2 module.
Thanks in advance!
You need to generate an index into the array for every row you extract with your query. In this specific case (the diagonal) you want an index based on the row number. Something along the lines of:
SELECT ts, val[row_number() over (order by ts)] FROM ... ORDER BY ts
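Untested sketch of how that might be wired into psycopg2, assuming a table readings(ts, val) (the table name is a placeholder; ts and val follow the query above) and that the 25920 timestamps are passed in ascending order so the nth timestamp lines up with the nth array element:
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # assumed connection parameters
timestamps = load_timestamps()          # hypothetical helper: the 25920 timestamps, ascending

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT ts, val[row_number() OVER (ORDER BY ts)]
        FROM readings                -- assumed table name
        WHERE ts = ANY(%s)           -- psycopg2 adapts a Python list to a Postgres array
        ORDER BY ts
        """,
        (timestamps,),
    )
    rows = cur.fetchall()  # one (timestamp, smallint) pair per passed timestamp
Because row_number() is computed over the filtered rows only, each returned row carries a single array element instead of the full 25920-element array.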
Related
I have a very large data file (foo.sas7bdat) that I would like to filter rows from without loading the whole data file into memory. For example, I can print the first 20 rows of the dataset without loading the entire file into memory by doing the following:
import pandas
import itertools
with pandas.read_sas('foo.sas7bdat') as f:
    for row in itertools.islice(f, 20):
        print(row)
However, I am unclear on how to print (or preferably write to a new file) only the rows in which any column contains the number 123.1. How can I do this?
Pandas has the ability to pull dataframes one chunk at a time. Following the trail of read_sas() documentation to "chunksize" I came across this:
http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk
for chunk in pd.read_sas('foo.sas7bdat', iterator=True, chunksize=100000):
    print(chunk)
This would get chunks of 100,000 lines.
As for the filtering problem, you would need a query of some sort. I don't know the constraints of the problem, but if you build a DataFrame with all the columns you might still overflow your memory, so an efficient way would be to collect the matching indexes into a set, sort them, and use .iloc to pull those entries if you want to put them into a DataFrame.
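For the specific goal of writing every row in which any column contains 123.1 to a new file, a minimal chunk-wise sketch could look like this (untested; file names and chunk size are placeholders, and the comparison is an exact float match):
import pandas as pd

first_chunk = True
for chunk in pd.read_sas('foo.sas7bdat', iterator=True, chunksize=100000):
    # keep rows where any column equals 123.1 (exact comparison; add a tolerance if needed)
    matches = chunk[(chunk == 123.1).any(axis=1)]
    if not matches.empty:
        matches.to_csv('matching_rows.csv', mode='a', header=first_chunk, index=False)
        first_chunk = False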
You may need to use tools that take this into account. Dask is a good alternative for use on clusters.
I have 30 CSV files (saved as .txt files) ranging from 2GB to 11GB each on a server machine with 16 cores.
Each row of each CSV contains a date, a time, and an ID.
I need to construct a dense matrix of size datetime x ID (roughly 35,000 x 2000), where each cell is count of rows that had this datetime and ID (so each CSV row’s datetime and ID are used as matrix indices to update this matrix). Each file contains a unique range of datetimes, so this job is embarrassingly parallel across files.
Question: What is the fastest way to accomplish this and (possibly) parallelize it? I am partial to Python, but could work in C++ if there is a better solution there. Should I rewrite it with MapReduce or MPI? Look into Dask or Pandas? Compile my Python script somehow? Something else entirely?
My current approach (which I would happily discard for something faster):
Currently, I am doing this serially (one CSV at a time) in Python and saving the output matrix in h5 format. I stream a CSV line-by-line from the command line using:
cat one_csv.txt | my_script.py > outputfile.h5
And my python script works like:
# initialize matrix
...
for line in sys.stdin:
    # Split the line into data columns
    split = line.replace('\n', '').split(',')
    ...(extract & process datetime; extract ID)...
    # Update matrix
    matrix[datetime, ID] = matrix[datetime, ID] + 1
EDIT: Below are a few example lines from one of the CSVs. The only relevant columns are 'dateYMDD' (formatted so that '80101' means Jan. 1, 2008), 'time', and 'ID'. So, for example, the code should use the first row of the CSV below to add 1 to the matrix cell corresponding to (Jan_1_2008_00_00_00, 12).
Also: there are many more unique times than unique IDs, and the CSVs are time-sorted.
Type|Number|dateYMDD|time|ID
2|519275|80101|0:00:00|12
5|525491|80101|0:05:00|25
2|624094|80101|0:12:00|75
5|623044|80102|0:01:00|75
6|658787|80102|0:03:00|4
First of all, you should probably profile your script to make sure the bottleneck is actually where you think it is.
That said, Python's Global Interpreter Lock will make parallelizing it difficult, unless you use multiprocessing, and I expect it will be faster to simply process them separately and merge the results: feed each Python script one CSV and output to one table, then merge the tables. If the tables are much smaller than the CSVs (as one would expect if the cells have high values) then this should be relatively efficient.
I don't think that will get you the absolute top speed you are after, though. If that doesn't meet your expectations I would think about writing it in C++, Rust or Cython.
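To make the process-per-file-then-merge idea concrete, here is a rough, untested sketch using multiprocessing instead of separate shell invocations. The file locations, matrix dimensions and the datetime-to-index mapping are all assumptions to adapt:
import csv
import glob
from datetime import datetime
from multiprocessing import Pool

import numpy as np

N_DATETIMES = 35000  # assumed size of the datetime axis
N_IDS = 2000         # assumed size of the ID axis

def datetime_index(date_ymdd, time_str):
    # Hypothetical mapping: ('80101', '0:00:00') -> row index. Zero-pad to YYMMDD,
    # then count minutes from an assumed origin; adapt to your data's real resolution.
    dt = datetime.strptime(date_ymdd.zfill(6) + ' ' + time_str, '%y%m%d %H:%M:%S')
    return int((dt - datetime(2008, 1, 1)).total_seconds() // 60)

def count_one_file(path):
    counts = np.zeros((N_DATETIMES, N_IDS), dtype=np.int32)
    with open(path, newline='') as f:
        reader = csv.DictReader(f, delimiter='|')  # the sample rows above are pipe-delimited
        for row in reader:
            counts[datetime_index(row['dateYMDD'], row['time']), int(row['ID'])] += 1
    return counts

if __name__ == '__main__':
    files = glob.glob('csv_dir/*.txt')  # assumed location of the 30 files
    total = np.zeros((N_DATETIMES, N_IDS), dtype=np.int32)
    with Pool() as pool:                # one worker per core by default
        for counts in pool.imap_unordered(count_one_file, files):
            total += counts             # per-file datetime ranges don't overlap, so summing is safe
    np.save('counts.npy', total)
Each worker returns a full per-file matrix, so peak memory is roughly the matrix size times the number of workers in flight; a smaller dtype or fewer workers keeps that in check.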
Is it possible to get only a limited number of columns for a column family from a row? Let's say I just want to fetch the first 10 values for ['cf1': 'col1'] for a particular row.
This is the same question as https://github.com/wbolster/happybase/issues/93
The answer is:
I think the only way to do this is a scan with a server side filter. I think the one you're after is the ColumnCountGetFilter:
ColumnCountGetFilter - takes one argument, a limit. It returns the first limit number of columns in the table. Syntax: ColumnCountGetFilter (‘’) Example: ColumnCountGetFilter (4)
Source: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbase_filtering.html
With Happybase that would look like this (untested):
for row_key, data in table.scan(columns=['cf1'], filter='ColumnCountGetFilter(10)'):
    print(row_key, data)
Use limit to get a specific number of rows from an HBase scan:
table.scan(limit=int(limit))
I have a Python program that uses historical data coming from a database and allows the user to select the dates as input. However, not all possible dates are available in the database, since these are financial data: in other words, if the user inserts "02/03/2014" (which is a Sunday) he won't find any record in the database because the stock exchange was closed.
This causes SQL problems because when the record is not found, the SQL statement fails and the user needs to adjust the date until he finds an existing record. To avoid this I would like to build an algorithm that can adjust the date input itself, choosing the date closest to the original input. For example, if the user inserts "02/03/2014", the closest available date would be "03/03/2014".
I have thought about something like this, where the table MyDates contains date values only (I'm still in the process of working out the proper syntax, but it's just to show the idea):
import sqlite3 as lite

con = lite.connect('C:/.../MyDatabase.db')
cur = con.cursor()
cur.execute('SELECT * from MyDates')
rowsD = cur.fetchall()
data = []
for row in rowsD:
    data.append(row[0])  # fetchall() returns tuples; take the date value
>>>data
['01/01/2010', '02/01/2010', .... '31/12/2013']
inputDate = '07/01/2010'
differences = []
for i in range(0, len(data)):
    differences.append(abs(data[i] - inputDate))
After that, I was thinking about:
getting the minimum value from the vector differences: mV = min(differences)
getting the corresponding date value into the list data
However, this approach is costly in two ways:
I need to load the whole database, which is huge;
I have to iterate many times (once to build the list data, then again to build the list of differences, and so on).
Does anyone have a better idea to build this, or knows a different approach to the problem?
Query the database on the dates that are smaller than the input date and take the maximum of these. This will give you the closest date before.
Symmetrically, you can query the minimum of the larger dates to get the closest date after. And keep the preferred of the two.
These should be efficient queries.
SELECT MAX(Date)
FROM MyDates
WHERE Date <= InputDate;
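For example, with sqlite3 the two lookups can be combined like this (untested sketch; it assumes the dates are stored in a sortable format such as ISO 'YYYY-MM-DD', since DD/MM/YYYY strings would not order correctly under <= and >=):
import sqlite3 as lite
from datetime import date

def closest_date(con, input_date):
    """Return the date in MyDates closest to input_date (ISO 'YYYY-MM-DD')."""
    cur = con.cursor()
    cur.execute('SELECT MAX(Date) FROM MyDates WHERE Date <= ?', (input_date,))
    before = cur.fetchone()[0]
    cur.execute('SELECT MIN(Date) FROM MyDates WHERE Date >= ?', (input_date,))
    after = cur.fetchone()[0]
    candidates = [d for d in (before, after) if d is not None]
    return min(candidates,
               key=lambda d: abs((date.fromisoformat(d)
                                  - date.fromisoformat(input_date)).days))

con = lite.connect('MyDatabase.db')  # assumed database file
print(closest_date(con, '2010-01-07'))
With an index on the Date column, both queries are O(log n) and no table data is loaded into Python.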
I would try getting the record with the maximum date smaller than the given one directly from the database (this can be done in SQL). If you put an index on the date column, this lookup can be done in O(log(n)). That's of course not really the same as "being closest", but if you combine it with "the minimum date bigger than the given one" you will achieve it.
Also, if you know more or less the distribution of your data, for example that any 7 consecutive days contain some data, then you can restrict the search to a smaller range like [-3 days, +3 days].
Combining both of these ideas should give you quite nice performance.
Hey all,
I have two databases. One with 145,000 rows and approx. 12 columns, and another with around 40,000 rows and 5 columns. I am trying to compare them based on two column values. For example, if in CSV#1 column one says 100-199 and column two says Main St (meaning that this row is contained within the 100 block of Main Street), how would I go about comparing that with a similar two columns in CSV#2? I need to compare every row in CSV#1 to every single row in CSV#2. If there is a match, I need to append the 5 columns of each matching row to the end of the row of CSV#2. Thus CSV#2's number of columns will grow significantly and it will have repeat entries; it doesn't matter how the columns are ordered. Any advice on how to compare two columns with another two columns in a separate database and then iterate across all rows? I've been using Python and the csv module so far for the rest of the work, but this part of the problem has me stumped.
Thanks in advance
-John
A csv file is NOT a database. A csv file is just rows of text chunks; a proper database (like PostgreSQL or MySQL or SQL Server or SQLite or many others) gives you proper data types, table joins, indexes, row iteration, proper handling of multiple matches, and many other things which you really don't want to rewrite from scratch.
How is it supposed to know that Address("100-199")==Address("Main Street")? You will have to come up with some sort of knowledge-base which transforms each bit of text into a canonical address or address-range which you can then compare; see Where is a good Address Parser but be aware that it deals with singular addresses (not address ranges).
Edit:
Thanks to Sven; if you were using a real database, you could do something like
SELECT
User.firstname, User.lastname, User.account, Order.placed, Order.fulfilled
FROM
User
INNER JOIN Order ON
User.streetnumber=Order.streetnumber
AND User.streetname=Order.streetname
if streetnumber and streetname are exact matches; otherwise you still need the address canonicalization discussed in the second point above.
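If you want to stay in Python, a rough, untested sketch along those lines loads both CSVs into an in-memory SQLite database and joins on the two columns. All file, table, and column names below are placeholders, and the join is an exact match, so the canonicalization caveat above still applies:
import csv
import sqlite3

def load_csv(con, path, table):
    # Load a CSV into a fresh table named `table` with columns c0, c1, ... (no header assumed).
    with open(path, newline='') as f:
        rows = csv.reader(f)
        first = next(rows)
        con.execute('CREATE TABLE %s (%s)' % (table, ', '.join('c%d TEXT' % i for i in range(len(first)))))
        placeholders = ', '.join('?' * len(first))
        con.execute('INSERT INTO %s VALUES (%s)' % (table, placeholders), first)
        con.executemany('INSERT INTO %s VALUES (%s)' % (table, placeholders), rows)

con = sqlite3.connect(':memory:')
load_csv(con, 'csv1.txt', 'big')    # hypothetical file names: 145k rows, ~12 columns
load_csv(con, 'csv2.txt', 'small')  # 40k rows, 5 columns

# Assume the block range is column 0 and the street name is column 1 in both files.
con.execute('CREATE INDEX idx_big ON big (c0, c1)')  # the index keeps the join fast
matched = con.execute(
    'SELECT small.*, big.* FROM small JOIN big '
    'ON small.c0 = big.c0 AND small.c1 = big.c1')

with open('merged.csv', 'w', newline='') as out:
    csv.writer(out).writerows(matched)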