Scanning Bigtable by prefix with Python SDKs - python

I'm trying to perform searches by multiple prefixes in Google Cloud Bigtable with the Python SDK. I'm using read_rows, and I can't see a good way to search by prefix explicitly.
My first option is RowSet + RowRange. I'm testing three queries, and the times I'm getting are ~1.5s, ~3.5s and ~4.2s, which are an order of magnitude slower than the same searches with the Node SDK (which has a prefix filter option): ~0.19s, ~0.13s and ~0.46s.
The second option is using RowFilterChain + RowKeyRegexFilter. Performance is terrible for two of the queries: ~0.124s, ~72s and ~69s (the earlier measurements of ~3.1s, ~70s and ~75s were taken with a broken environment; see the update below). It looks like it's doing a full scan. This is the code section:
regex = f'^{prefix}.*'.encode()
filters.append(RowKeyRegexFilter(regex))
My third option is using the alternative Happybase-based SDK, which has prefix filtering. With that, I'm getting ~0.4s, ~0.1s and ~0.17s (the earlier measurements of ~36s, ~3s and ~1s were taken with the broken environment). The first query involves multiple prefixes, and there doesn't seem to be support for multiple filters in the same request, so I'm performing as many requests as there are prefixes and then concatenating the iterators. The other two queries seem to leverage the prefix filter.
UPDATE: The first set of times for each option was measured with a broken environment and should be disregarded. After measuring properly, the times are not bad for the range query, but there seems to be room for improvement, as the Happybase tests are still faster when they can leverage the prefix search.
I would appreciate help with using multiple prefix searches in Happybase, or with doing an actual prefix search in the main Python SDK.

The read_rows method has two parameters, start_key and end_key, that you can use to filter rows efficiently based on the row key (see the docs). Behind the scenes this method performs a scan, which is why it is probably the most efficient way to filter rows by their row keys.
For example, let's suppose you have the following row keys in your table:
a
aa
b
bb
bbb
and you want to retrieve all rows with a row key prefixed by a, you can run:
rows_with_prefix_a = my_table.read_rows(start_key="a", end_key="b")
This will only scan rows between a and b (b excluded), so it will return all rows with row key prefix a (a and aa in the previous example).
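To extend this to multiple prefixes in a single request, you can turn each prefix into a half-open key range and add all of the ranges to one RowSet. Below is a rough sketch of that idea with the google-cloud-bigtable client; the project, instance and table names are placeholders, the prefixes are just examples, and depending on the client version you may need to consume the read_rows stream explicitly:
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

def prefix_end_key(prefix: bytes) -> bytes:
    """Smallest key strictly greater than every key starting with `prefix`."""
    prefix = prefix.rstrip(b"\xff")     # drop trailing 0xFF bytes
    if not prefix:
        return b""                      # empty means "scan to the end"
    return prefix[:-1] + bytes([prefix[-1] + 1])

client = bigtable.Client(project="my-project")              # placeholder
table = client.instance("my-instance").table("my-table")    # placeholders

row_set = RowSet()
for prefix in (b"a", b"user#42#"):                          # example prefixes
    end = prefix_end_key(prefix)
    if end:
        row_set.add_row_range_from_keys(start_key=prefix, end_key=end)
    else:
        # Prefix was all 0xFF bytes: scan from the prefix to the end of table.
        row_set.add_row_range_from_keys(start_key=prefix)

for row in table.read_rows(row_set=row_set):
    print(row.row_key)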

Related

Is MapReduce a possible solution for two lists that have an id in common?

I have a list of 30m entries, containing a unique ID and 4 attributes for each entry. In addition to that I have a second list with 10m entries, again containing a unique ID and 2 other attributes.
The unique IDs in list 2 are a subset of the IDs in list 1.
I want to combine the two lists to do some analysis.
Example:
List 1:
ID|Age|Flag1|Flag2|Flag3
------------------------
ucab577|12|1|0|1
uhe4586|32|1|0|1
uhf4566|45|1|1|1
45e45tz|37|1|1|1
7ge4546|42|0|0|1
vdf4545|66|1|0|1
List 2:
ID|Country|Flag4|Flag5|Flag6
----------------------------
uhe4586|US|0|0|1
uhf4566|US|0|1|1
45e45tz|UK|1|1|0
7ge4546|ES|0|0|1
I want to do analysis like:
"How many at the age of 45 have Flag4=1?" Or "What is the age structure of all IDs in US?"
My current approach is to load the two list into separate tables of a relational database and then doing a join.
Does a MapReduce approach make sense in this case?
If yes, what would a MapReduce approach look like?
How can I combine the attributes of list 1 with list 2?
Will it bring any advantages? (Currently I need more than 12 hours for importing the data)
When the files are big, Hadoop's distributed processing helps (it is faster). Once you bring the data into HDFS you can use Hive or Pig for your queries. Both use Hadoop MapReduce for the processing, so you do not need to write separate code for it. Hive is almost SQL-like; from your query types I guess you can manage with Hive. If your queries are more complex you can consider Pig. If you use Hive, here are the sample steps:
Load both files into two separate folders in HDFS.
Create external tables for both of them and point their locations at those folders.
Perform the join and the query!
hive> create external table hiveint_r(id string, age int, Flag1 int, Flag2 int, Flag3 int)
> row format delimited
> fields terminated by '|'
> location '/user/root/data/hiveint_r'; -- this location is a folder in HDFS
The table will be populated with the data automatically; there is no need to load it.
Create the other table in a similar way, then run the join and the query:
select a.* from hiveint_l a full outer join hiveint_r b on (a.id=b.id) where b.age>=30 and a.flag4=1 ;
MapReduce might be overkill for just 30m entries. How you should work really depends on your data. Is it dynamic (e.g. will new entries be added)? If not, just stick with your database; the data is already in it. 30m entries shouldn't take 12 hours to import, it's more likely 12 minutes (you should be able to get about 30,000 inserts/second with a 20-byte data size), so your first step should be to fix your import. You might want to try bulk import, LOAD DATA INFILE, using transactions and/or generating the indexes afterwards, or trying another engine (InnoDB, MyISAM), ...
You can get everything into one big table (so you can get rid of the joins when you query, which will speed them up) with e.g.
Update List1 join List2 on List1.Id = List2.Id
set List1.Flag4 = List2.Flag4, List1.Flag5 = List2.Flag5, List1.Flag6 = List2.Flag6
after adding the columns to List1, of course, and after adding indexes on the ID columns; afterwards you should add indexes for all the columns you query on.
You can also combine your data before you import it into MySQL, e.g. by reading list 2 into a hash map (a HashMap in C/C++/Java, a dict in Python, an associative array in PHP) and then generating a new import file with the combined data. It should only take a few seconds to read the data. You can even do the evaluation here; it is just not as flexible as SQL, but if you only have some fixed queries, it might be the fastest approach if your data changes often.
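A minimal sketch of that pre-merge idea in Python, assuming both lists are plain pipe-delimited files laid out as in the example above (the file names are placeholders):
import csv

# Read list 2 (the smaller file) into a dict keyed by ID.
with open("list2.txt", newline="") as f2:
    reader = csv.reader(f2, delimiter="|")
    header2 = next(reader)                    # ID|Country|Flag4|Flag5|Flag6
    list2 = {row[0]: row[1:] for row in reader
             if row and not row[0].startswith("-")}

# Stream list 1 and write one combined file for a single bulk import.
with open("list1.txt", newline="") as f1, \
     open("combined.txt", "w", newline="") as out:
    reader = csv.reader(f1, delimiter="|")
    writer = csv.writer(out, delimiter="|")
    header1 = next(reader)                    # ID|Age|Flag1|Flag2|Flag3
    writer.writerow(header1 + header2[1:])
    missing = [""] * (len(header2) - 1)       # for IDs present only in list 1
    for row in reader:
        if not row or row[0].startswith("-"):
            continue                          # skip any separator lines
        writer.writerow(row + list2.get(row[0], missing))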
In MapReduce you can process the two files by using join techniques. There are two types of joins: map-side and reduce-side.
A map-side join can be used efficiently via the DistributedCache API, where one file is loaded into memory. In your case you can create a HashMap with key -> ID and value -> Flag4, and during the map phase join the data based on the ID. Note that this only works if the file is small enough to fit in memory.
If both files are large, go for a reduce-side join.
First try to load the second file into memory and create a map-side join.
Or you can go with Pig. Pig executes its statements as MapReduce jobs anyway, but hand-written MapReduce is fast compared to Pig and Hive.
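As a rough illustration of the map-side join idea, here is a hypothetical mapper in the Hadoop Streaming style. It assumes the small file has been shipped to each task as list2.txt (e.g. via the distributed cache / -files) and that list 1 records arrive on stdin; the pipe-delimited layout is taken from the example above, everything else is an assumption:
#!/usr/bin/env python
import sys

# Build the in-memory lookup table from the small side of the join.
small = {}
with open("list2.txt") as f:
    for line in f:
        parts = line.rstrip("\n").split("|")
        # Skip the header and separator lines from the example layout.
        if len(parts) > 1 and parts[0] != "ID" and not parts[0].startswith("-"):
            small[parts[0]] = parts[1:]   # ID -> [Country, Flag4, Flag5, Flag6]

# Join each list 1 record against the lookup table during the map phase.
for line in sys.stdin:
    parts = line.rstrip("\n").split("|")
    if len(parts) > 1 and parts[0] != "ID" and not parts[0].startswith("-"):
        print("|".join(parts + small.get(parts[0], [])))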

Is there a way to optimize MYSQL REPLACE function?

I need to develop a query to find MF001317-077944-01 in the database, but the string provided (which I must use to search) comes without the -.
So I am currently using:
select * from sims where replace(pack, "-", "") = "MF00131707794401";
sqlAlchemy equivalent:
s.query(Sims).filter(func.replace(Sims.pack, "-", "") == "MF00131707794401").all()
But it is taking too long: on average 1 min 22 s, and I need to get it well under 1 second.
I have considered using wildcards, but I do not know if that is the best way of approaching my problem.
Is there a way to optimize the replace query?
Or is there a better way of achieving what I want, i.e. manipulating the string in Python to get MF001317-077944-01?
Oh, I should also mention that the format is not always the same; for example, two different pack numbers might be XAN002-026-001 or CK10000579-020-3.
Any help would be appreciated :).
You must find a way to avoid a table scan.
Several Options:
1) Create an index on your "pack" column and put the "-" back into the search string before querying. This only works when you know where to put the "-" in the search string (e.g. when the dashes are always at the same positions). This is the easiest way.
2) Create an additional column "pack_search" and fill it with replace(pack, "-", ""). Create INSERT and UPDATE triggers to keep its value in sync when rows are inserted or updated. Create an index on that column and use it in your query (see the sketch after this list).
3) Nicer: create a view on the table with a modified pack column and an index on that view (I don't know whether that works on MySQL; Postgres can definitely do it). Use that view for your query. For a further speedup you could materialize that view if the table is read much more than it is written, or if a lag in the query results is acceptable (e.g. if the table is updated nightly and you query it from an online service).
4) Maybe it can be done by using a functional index.
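A variation of option 2, sketched with SQLAlchemy: instead of a database trigger, the normalized copy is kept in sync from the application via mapper events. The model and column names mirror the question but are otherwise assumptions, not the asker's actual schema:
from sqlalchemy import Column, Integer, String, event
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Sims(Base):
    __tablename__ = "sims"
    id = Column(Integer, primary_key=True)
    pack = Column(String(64), nullable=False)
    pack_search = Column(String(64), index=True)  # normalized, indexed copy

@event.listens_for(Sims, "before_insert")
@event.listens_for(Sims, "before_update")
def _sync_pack_search(mapper, connection, target):
    # Keep the normalized copy in sync whenever a row is written.
    target.pack_search = target.pack.replace("-", "") if target.pack else None

# The query then becomes a plain index lookup instead of a full-table REPLACE():
# s.query(Sims).filter(Sims.pack_search == "MF00131707794401").all()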

Converting a list of strings to a single pattern

I have a list of strings that follow a specific pattern. Here's an example
['ratelimiter:foobar:201401011157',
'ratelimiter:foobar:201401011158',
'ratelimiter:foobar:201401011159',
'ratelimiter:foobar:201401011200']
I'm trying to end up with a glob pattern that will represent this list, like the following:
'ratelimiter:foobar:201401011*'
I know the first two fields ahead of time. The third field is a timestamp, and I want to find the position at which the values start to differ across the strings.
In the example given, the timestamps range from 2014-01-01 11:57 to 2014-01-01 12:00, and the first differing position is the third-to-last character, where 1 changes to 2. If I can find that, then I can slice the string to [:-3] and append '*' (for this example).
Every time I try and tackle this problem I end up with loops everywhere. I just feel like there's a better way of doing this.
Or maybe someone knows a better way of doing this with Redis. I'm doing this because I'm trying to get keys from Redis, and I don't want to make a request for every key but rather make one batch request using the pattern parameter. Maybe there's a better way of doing this, but I haven't found anything yet.
Thanks
Staying with the pattern approach (though converting to timestamps is probably best), I would do this to find the longest common prefix:
items = ['ratelimiter:foobar:201401011157',
'ratelimiter:foobar:201401011158',
'ratelimiter:foobar:201401011159',
'ratelimiter:foobar:201401011200']
print(items[0][:[len(set(x)) == 1 for x in zip(*items)].index(False)] + '*')
# ratelimiter:foobar:201401011*
Which reads as: cut the first element of items at the first position where the nth characters of the items are no longer all equal.
[len(set(x)) == 1 for x in zip(*items)] returns a list of booleans whose ith entry is True if the characters at position i are equal across all items.
This is what I would do:
convert the timestamps to numbers
find the max and min (if your list is not ordered)
take the difference between max and min and convert it back to a pattern.
For example, in your case the difference between max and min is 43, and the min already ends in 57, so you can quickly deduce that if the min ends with ***157, the max should be ***200, and you know the pattern.
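A minimal sketch of that idea, using the list from the question; here os.path.commonprefix on the smallest and largest keys stands in for the numeric conversion, since the characters shared by the min and max are shared by every key in between:
import os

items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

# The smallest and largest keys bound the whole list, so their common
# prefix is common to every key in between.
low, high = min(items), max(items)
pattern = os.path.commonprefix([low, high]) + '*'
print(pattern)  # ratelimiter:foobar:201401011*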
You almost never want to use the '*' parameter in Redis in production because it is very slow -- much slower than making a request for each key individually in the vast majority of cases. Unless you're requesting so many keys that your bottleneck becomes the sheer amount of data you're transferring over the network (in which case you should really convert things to Lua and run the logic server-side), a pipeline is really what you want.
The reason you want a pipeline is that you're probably getting hit by the cost of transferring data back and forth to your Redis server in separate round trips right now. A pipeline, in contrast, queues up a bunch of commands to run against Redis, and then executes them all at once, when you're ready. Assuming you're using redis-py (if you're not, you really should be), and r is your connection to your Redis server, you can do this like so:
import redis

r = redis.Redis(...)
pipe = r.pipeline()
items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']
for item in items:
    pipe.get(item)
# All the values for each item you're getting from Redis will be here.
item_values = pipe.execute()
Note: this will only make one call to Redis and will be much faster than either getting each value individually or running a pattern selection.
All of the other answers so far are good Python answers, but you're dealing with a Redis problem. You need a Redis answer.

Storing an Inverted index in mysql

I am working on creating a very big inverted index of terms. Which method would you suggest?
First
termId -> docId
a doc2[locations],doc5[locations],doc12[locations]
b doc5[locations],doc7[locations],doc4[locations]
Second
termId -> docId
a doc2[locations]
a doc5[locations]
a doc12[locations]
b doc5[locations]
b doc7[locations]
b doc4[locations]
P.S. Lucene is not an option.
The right table design depends on how you plan on using the data. If you plan on using strings like "doc2[locations],doc5[locations],doc12[locations]" as is -- without any further postprocessing, then your First design is fine.
But if -- as your question tacitly suggests -- you may at times want to regard doc2[locations], doc5[locations], etc. as separate entities, then you should definitely use your Second design.
Here are some use cases which show why the Second design is better:
If you use First and ask for all docs with termID = a, then you get back a string like doc2[locations],doc5[locations],doc12[locations], which you then have to split.
If you use Second, you get each doc as a separate row. No splitting! The Second structure is more convenient.
Or, suppose at some point doc5[locations] changes and you need to update your table. If you use the First design, you'd have to use some relatively complicated MySQL string functions to find and replace the substring in all rows that contain it. (Note that MySQL does not come with regex substitution built in.)
If you use the Second design, updating is easy:
UPDATE table SET docId = "newdoc5[locations]" where docId = "doc5[locations]"
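The question is about MySQL, but a small sqlite3 sketch (Python standard library) makes the row-per-posting shape of the Second design concrete; the table and column names here are illustrative assumptions:
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE postings (
                   term      TEXT NOT NULL,
                   doc_id    TEXT NOT NULL,
                   locations TEXT,
                   PRIMARY KEY (term, doc_id))""")
con.executemany(
    "INSERT INTO postings VALUES (?, ?, ?)",
    [("a", "doc2", "3,17"), ("a", "doc5", "8"), ("a", "doc12", "1,4"),
     ("b", "doc5", "2"), ("b", "doc7", "9"), ("b", "doc4", "11")])

# One row per posting: no string splitting, and changing a single posting
# is an ordinary UPDATE instead of substring surgery.
for term, doc_id, locations in con.execute(
        "SELECT term, doc_id, locations FROM postings WHERE term = ?", ("a",)):
    print(term, doc_id, locations)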

Data Structure for storing a sorting field to efficiently allow modifications

I'm using Django and PostgreSQL, but I'm not absolutely tied to the Django ORM if there's a better way to do this with raw SQL or database specific operations.
I've got a model that needs sequential ordering. Lookup operations will generally retrieve the entire list in order. The most common operation on this data is to move a row to the bottom of a list, with a subset of the intervening items bubbling up to replace the previous item like this:
(operation on A, with subset B, C, E)
A -> B
B -> C
C -> E
D -> D
E -> A
Notice how D does not move.
In general, the subset of items will not be more than about 50 items, but the base list may grow to tens of thousands of entries.
The most obvious way of implementing this is with a simple integer order field. This seems suboptimal. It requires the compromise of making the position ordering column non-unique, where non-uniqueness is only required for the duration of a modification operation. To see this, imagine the minimal operation using A with subset B:
oldpos = B.pos
B.pos = A.pos
A.pos = oldpos
Even though you've stored the position, at the second line you've violated the uniqueness constraint. Additionally, this method makes atomicity problematic - your read operation has to happen before the write, during which time your records could change. Django's default transaction handling documentation doesn't address this, though I know it should be possible in the SQL using the "REPEATABLE READ" level of transaction locking.
I'm looking for alternate data structures that suit this use pattern more closely. I've looked at this question for ideas.
One proposal there is the Dewey decimal style solution, which makes insert operations occur numerically between existing values, so inserting A between B and C results in:
A=1 -> B=2
B=2 -> A=2.5
C=3 -> C=3
This solves the column uniqueness problem, but introduces the issue that the column must be a float of a specified number of decimals. Either I over-estimate, and store way more data than I need to, or the system becomes limited by whatever arbitrary decimal length I impose. Furthermore, I don't expect usage to be even across the table - some keys are going to be moved far more often than others, making this solution hit the limit sooner. I could solve this problem by periodically renumbering the database, but it seems that a good data structure should avoid needing this.
Another structure I've considered is the linked list (and variants). This has the advantage of making modification straightforward, but I'm not certain of its properties with respect to SQL - ordering such a list in the SQL query seems like it would be painful, and extracting a non-sequential subset of the list has terrible retrieval properties.
Beyond this, there are B-Trees, various Binary Trees, and so on. What do you recommend for this data structure? Is there a standard data structure for this solution in SQL? Is the initial idea of going with sequential integers really going to have scaling issues, or am I seeing problems where there are none?
Preferred solutions:
A linked list would be the usual way to achieve this. A query to return the items in order is trivial in Oracle, but I'm not sure how you would do it in PostgreSQL.
Another option would be to implement this using the ltree module for PostgreSQL.
Less graceful (and write-heavy) solution:
Start a transaction. "SELECT ... FOR UPDATE" within scope for row-level locks. Move the target record to position 0, update the target's succeeding records by +1 where their position is higher than the target's original position (or vice versa), and then update the target to its new position - a single additional write over what would be needed without a unique constraint. Commit :D
Simple (yet still write-heavy) solution if you can wait for PostgreSQL 8.5 (an alpha is available) :)
Wrap it in a transaction, SELECT FOR UPDATE in scope, and use a deferred constraint (PostgreSQL 8.5 has support for deferred unique constraints, like Oracle).
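A rough sketch of the write-heavy approach in Python, going through Django's raw connection. The table and column names (items, id, pos) are assumptions, position 0 is assumed to be unused, and it covers the simple case of moving one row to the bottom of a contiguous range rather than the sparse-subset rotation from the question; it also assumes the unique constraint on pos is deferred (or absent), as discussed above:
from django.db import connection, transaction

def move_to_bottom(item_id, bottom_pos):
    """Move one row to position `bottom_pos`, shifting the rows between its
    old position and `bottom_pos` up by one slot."""
    with transaction.atomic():
        with connection.cursor() as cur:
            # Lock the target row and read its current position.
            cur.execute("SELECT pos FROM items WHERE id = %s FOR UPDATE",
                        [item_id])
            (old_pos,) = cur.fetchone()
            # Park the target at the (assumed unused) position 0 so that
            # pos values stay unique while we shuffle the others.
            cur.execute("UPDATE items SET pos = 0 WHERE id = %s", [item_id])
            # Shift the intervening rows up by one position.
            cur.execute("UPDATE items SET pos = pos - 1 "
                        "WHERE pos > %s AND pos <= %s",
                        [old_pos, bottom_pos])
            # Drop the target into its new slot.
            cur.execute("UPDATE items SET pos = %s WHERE id = %s",
                        [bottom_pos, item_id])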
A temp table and a transaction should maintain atomicity and the unique constraint on sort order. Restating the problem, you want to go from:
A 10  ->  B 10
B 25  ->  C 25
C 26  ->  E 26
E 34  ->  A 34
Where there can be any number of items in between each row. So, first you read in the records and create a list [['A',10],['B',25],['C',26],['E',34]]. Through some pythonic magic you shift the identifiers around and insert them into a temp table:
create temporary table reorder (
id varchar(20), -- whatever
sort_order integer,
primary key (id));
Now for the update:
update table XYZ
set sort_order = (select sort_order from reorder where xyz.id = reorder.id)
where id in (select id from reorder)
I'm only assuming pgsql can handle that query. If it can, it will be atomic.
Optionally, create table REORDER as a permanent table and the transaction will ensure that attempts to reorder the same record twice will be serialized.
EDIT: There are some transaction issues. You might need to implement both of my ideas. If two processes both want to update item B (for example) there can be issues. So, assume all order values are even:
1) Begin the transaction.
2) Increment all the orders being used by 1. This puts row-level write locks on all the rows you are going to update.
3) Select the data you just updated. If any sort_order fields are even, some other process has added a record that matches your criteria. You can either abort the transaction and restart, or just drop that record and finish the operation using only the records that were updated in step 2. The "right" thing to do depends on what you need this code to accomplish.
4) Fill your temporary reorder table as above, using the proper even sort_orders.
5) Update the main table as above.
6) Drop the temporary table.
7) Commit the transaction.
Step 2 ensures that if two lists overlap, only the first one will have access to the rows in question until the transaction completes:
update XYZ set sort_order = sort_order + 1
where -- whatever your select criteria are
select * from XYZ
where -- same select criteria
order by sort_order
Alternatively, you can add a control field to the table to get the same effect, and then you don't need to play with the sort_order field. The benefit of using the sort_order field instead is that indexing a BIT field or a LOCK_BY_USERID field that is usually null tends to perform poorly, since the index is meaningless 99% of the time. SQL engines don't like indexes that spend most of their time empty.
It seems to me that your real problem is the need to lock a table for the duration of a transaction. I don't immediately see a good way to solve this problem in a single operation, hence the need for locking.
So the question is whether you can do this in a "Django way" as opposed to using straight SQL. Searching "django lock table" turned up some interesting links, including this snippet; there are many others that implement similar behavior.
A straight SQL linked-list style solution can be found in this Stack Overflow post; it appeared logical and succinct to me, but again it's two operations.
I'm very curious to hear how this turns out and what your final solution is, be sure to keep us updated!
Why not use a simple character field of some length, say a max of 16 (or 255), initially?
Start initially with labeling things aaa through zzz (that should be 17576 entries). (You could also add in 0-9, and the uppercase letters and symbols for an optimization.)
As items are added, they can go to the end, up to the maximum length you allow for the additional end items (zzza, zzzaa, zzzaaa, zzzaab, zzzaac, zzzaad, etc.).
This should be reasonably simple to program, and it's very similar to the Dewey Decimal system.
Yes, you will need to rebalance it occasionally, but that should be a simple operation. The simplest approach is two passes: pass 1 would be to set the new ordering tag to '0' (or any character earlier than the first character) followed by the new tag of the appropriate length, and pass 2 would be to remove the '0' from the front.
Obviously, you could do the same thing with floats and rebalance regularly; this is just a variation on that. The one advantage is that most databases will allow you to set a ridiculously large maximum size for the character field, large enough to make it very, very, very unlikely that you would run out of digits to do the ordering, and also make it unlikely that you would ever have to modify the schema, while not wasting a lot of space.
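As a toy sketch of the underlying idea (not this answer's exact scheme), here is a hypothetical helper that produces a lowercase label sorting strictly between two existing labels, so a row can be slotted between neighbours without renumbering; when two labels become adjacent and no room is left, that is when the occasional rebalancing described above kicks in:
import string

ALPHABET = string.ascii_lowercase  # 'a'..'z', as in the aaa..zzz labels above

def label_between(lo, hi):
    """Return a label that sorts strictly between lo and hi (lo < hi).
    Appends characters when the current length has no room left.
    Caveat: if hi is lo plus trailing 'a's there is nothing in between,
    which is when a rebalancing pass is needed."""
    label = ""
    i = 0
    while True:
        # Treat a missing character as "before 'a'" on the low side and
        # "after 'z'" on the high side.
        a = ALPHABET.index(lo[i]) if i < len(lo) else -1
        b = ALPHABET.index(hi[i]) if i < len(hi) else len(ALPHABET)
        if b - a > 1:
            return label + ALPHABET[(a + b) // 2]
        label += lo[i] if i < len(lo) else ALPHABET[0]
        i += 1

print(label_between("aaa", "aab"))  # 'aaam', which sorts between the two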
You can solve the renumbering issue by making the order column an integer that is always even. When you are moving the data, you change the order field to the new sort value + 1 and then do a quick update to convert all the odd order fields to even:
update table set sort_order = bitand(sort_order, '0xFFFFFFFE')
where sort_order <> bitand(sort_order, '0xFFFFFFFE')
Thus you can keep the uniqueness of sort_order as a constraint.
EDIT: Okay, looking at the question again, I've started a new answer.
