Efficient Intersection of pandas dataframe with remote mongodb?

I have a python pandas dataframe on my local machine, and have access to a remote mongodb server that has additional data that I can query via pymongo.
If my local dataframe is large, say 40k rows with 3 columns in each row, what's the most efficient way to check for the intersection of my local dataframe's features and a remote collection containing millions of documents?
I'm looking for general advice here. I thought I could just take a distinct list of values from each of the 3 features, and use each of these in an $or find statement, but if I have 90k distinct values for one of the 3 features it seems like a bad idea.
So any opinion would be very welcome. I don't have access to insert my local dataframe into the remote server, I only have select/find access.
thanks very much!

As you explained, you can't insert data, so the only option is to drive the query from the client side. First collect the unique values into a list with df['column_name'].unique(). Then pass that list to the $in operator in the .find() method. If a single query is too slow or too large, split the list into equal chunks, i.e. a list of lists [[id1, id2, id3], [id4, id5, id6], ...], and loop over it: for sublist in chunks: db.xyz.find({'key': {'$in': sublist}}, {'_id': 1}). Each iteration returns the _id of every document whose value exists in the collection. Append those to a list that starts out empty, and at the end you have the ids of all documents whose value also appears in your dataframe.
That's just the way I would do it, not necessarily the best possible way.
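The chunked $in loop described above can be sketched roughly as follows. This is only a sketch under some assumptions: `find_existing_values`, `chunked`, and the chunk size of 1000 are hypothetical names and values chosen for illustration, `collection` is expected to be a pymongo collection (only its `find(filter, projection)` method is used), and the queried field is assumed to hold scalar values. It returns the matching field values rather than the `_id`s, since that is what an intersection with the dataframe needs.

```python
from itertools import islice


def chunked(values, size):
    """Yield successive lists of at most `size` items from `values`."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def find_existing_values(collection, field, values, chunk_size=1000):
    """Return the subset of `values` that occurs in `field` of the collection.

    One $in query is issued per chunk, so no single query carries the
    whole list of values. Assumes `field` holds scalar (hashable) values.
    """
    found = set()
    for chunk in chunked(values, chunk_size):
        # Project only the queried field to keep the response small.
        cursor = collection.find({field: {"$in": chunk}}, {field: 1, "_id": 0})
        for doc in cursor:
            found.add(doc[field])
    return found
```

With a real connection this might be called as find_existing_values(db.xyz, 'key', df['column_name'].unique().tolist()). The .tolist() call matters because it converts numpy scalars to plain Python values, which pymongo can encode. An index on the queried field on the server side is what keeps each $in query cheap.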

Related

Relevance of creating an index on a field [duplicate]

Also, when is it appropriate to use one?

An index is used to speed up searching in the database. MySQL has some good documentation on the subject (which is relevant for other SQL servers as well):
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
An index can be used to efficiently find all rows matching some column in your query and then walk through only that subset of the table to find exact matches. If you don't have indexes on any column in the WHERE clause, the SQL server has to walk through the whole table and check every row to see if it matches, which may be a slow operation on big tables.
The index can also be a UNIQUE index, which means that you cannot have duplicate values in that column, or a PRIMARY KEY which in some storage engines defines where in the database file the value is stored.
In MySQL you can use EXPLAIN in front of your SELECT statement to see if your query will make use of any index. This is a good start for troubleshooting performance problems. Read more here:
http://dev.mysql.com/doc/refman/5.0/en/explain.html

A clustered index is like the contents of a phone book. You can open the book at 'Hilditch, David' and find all the information for all of the 'Hilditch's right next to each other. Here the keys for the clustered index are (lastname, firstname).
This makes clustered indexes great for retrieving lots of data based on range based queries since all the data is located next to each other.
Since the clustered index is actually related to how the data is stored, there is only one of them possible per table (although you can cheat to simulate multiple clustered indexes).
A non-clustered index is different in that you can have many of them and they then point at the data in the clustered index. You could have e.g. a non-clustered index at the back of a phone book which is keyed on (town, address)
Imagine if you had to search through the phone book for all the people who live in 'London' - with only the clustered index you would have to search every single item in the phone book since the key on the clustered index is on (lastname, firstname) and as a result the people living in London are scattered randomly throughout the index.
If you have a non-clustered index on (town) then these queries can be performed much more quickly.

An index is used to speed up the performance of queries. It does this by reducing the number of database data pages that have to be visited/scanned.
In SQL Server, a clustered index determines the physical order of data in a table. There can be only one clustered index per table (the clustered index IS the table). All other indexes on a table are termed non-clustered.
SQL Server Index Basics
SQL Server Indexes: The Basics
SQL Server Indexes
Index Basics
Index (wiki)

Indexes are all about finding data quickly.
Indexes in a database are analogous to indexes that you find in a book. If a book has an index, and I ask you to find a chapter in that book, you can quickly find that with the help of the index. On the other hand, if the book does not have an index, you will have to spend more time looking for the chapter by looking at every page from the start to the end of the book.
In a similar fashion, indexes in a database can help queries find data quickly. If you are new to indexes, the following videos can be very useful. In fact, I have learned a lot from them.
Index Basics
Clustered and Non-Clustered Indexes
Unique and Non-Unique Indexes
Advantages and disadvantages of indexes

In general, an index is a B-tree. There are two types of indexes: clustered and nonclustered.
A clustered index defines the physical order of rows. There can be only one per table, and in most cases it is also the primary key (in SQL Server, creating a primary key on a table also creates a clustered index on that table by default).
A nonclustered index is also a B-tree (not a binary tree), but it doesn't define the physical order of rows. Its leaf nodes contain the PK (if it exists) or a row identifier.
Indexes are used to speed up searches, because a lookup has O(log N) complexity. Indexes are a very large and interesting topic; creating indexes on a large database can sometimes be something of an art.

INDEXES - to find data easily
UNIQUE INDEX - duplicate values are not allowed
Syntax for INDEX
CREATE INDEX INDEX_NAME ON TABLE_NAME(COLUMN);
Syntax for UNIQUE INDEX
CREATE UNIQUE INDEX INDEX_NAME ON TABLE_NAME(COLUMN);
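As a small runnable illustration of the two statements above, here is a sketch using Python's built-in sqlite3 module as a stand-in database. The table and index names (`users`, `idx_users_city`, `idx_users_email`) are hypothetical, and other database engines may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, city TEXT)")

# Plain index: speeds up lookups by city, duplicates are allowed.
conn.execute("CREATE INDEX idx_users_city ON users(city)")
# Unique index: also rejects duplicate values in the email column.
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users(email)")

conn.execute("INSERT INTO users VALUES (1, 'a@example.com', 'London')")
conn.execute("INSERT INTO users VALUES (2, 'b@example.com', 'London')")  # same city is fine

try:
    # Same email as row 1, so the unique index refuses the insert.
    conn.execute("INSERT INTO users VALUES (3, 'a@example.com', 'Paris')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```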

First we need to understand how a normal query (without an index) runs. It traverses the rows one by one and returns the data when it finds it. Refer to the following image. (This image has been taken from this video.)
So if the query is to find 50, it has to read 49 records first, as in a linear search.
Refer to the following image. (This image has been taken from this video.)
When we apply indexing, the query finds the data quickly without reading every record, by eliminating about half of the data at each step, like a binary search. MySQL indexes are stored as B-trees, where all the data is in the leaf nodes.
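The difference between the two strategies can be made concrete by counting how many records each one reads. This is only an analogy: a sorted Python list searched by bisection stands in for the B-tree, and the function names and the 1000-record data set are made up for the illustration.

```python
def linear_search_steps(records, target):
    """Count records read when scanning one by one (no index)."""
    steps = 0
    for value in records:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(sorted_records, target):
    """Count records read when halving the search range each step."""
    low, high = 0, len(sorted_records) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_records[mid] == target:
            break
        if sorted_records[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps


records = list(range(1, 1001))  # 1000 sorted values: 1, 2, ..., 1000
linear_steps = linear_search_steps(records, 50)
binary_steps = binary_search_steps(records, 50)
```

The scan reads every record up to and including the match, while the halving search needs at most about log2(n) reads, which is roughly 10 for 1000 records.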

INDEX is a performance optimization technique that speeds up the data retrieval process. It is a persistent data structure that is associated with a Table (or View) in order to increase performance during retrieving the data from that table (or View).
Index-based search applies mainly when your queries include a WHERE filter. A query without a WHERE filter selects and processes the whole table. Searching the whole table without an index is called a table scan.
You will find clear and reliable information on SQL indexes at these links:
For a conceptual understanding:
http://dotnetauthorities.blogspot.in/2013/12/Microsoft-SQL-Server-Training-Online-Learning-Classes-INDEX-Overview-and-Optimizations.html
For an implementation-oriented understanding:
http://dotnetauthorities.blogspot.in/2013/12/Microsoft-SQL-Server-Training-Online-Learning-Classes-INDEX-Creation-Deletetion-Optimizations.html

If you're using SQL Server, one of the best resources is its own Books Online that comes with the install! It's the 1st place I would refer to for ANY SQL Server related topics.
If it's practical "how should I do this?" kind of questions, then StackOverflow would be a better place to ask.
Also, I haven't been back for a while but sqlservercentral.com used to be one of the top SQL Server related sites out there.

An index is used for several different reasons. The main reason is to speed up querying so that you can get rows or sort rows faster. Another reason is to define a primary key or unique index, which guarantees that no other rows have the same values in the indexed columns.

So, how does indexing actually work?
Well, first off, the database table does not reorder itself when we put an index on a column to optimize query performance.
An index is a data structure (most commonly a B-tree, which is a balanced tree, not a binary tree) that stores the values of a specific column in a table.
The major advantage of a B-tree is that the data in it is kept sorted. In addition, the B-tree is time efficient: operations such as search, insertion, and deletion run in logarithmic time.
So the index would look like this -
Here each column value is mapped to a database-internal identifier (pointer) that points to the exact location of the row. And now, if we run the same query:
Visual Representation of the Query execution
So, indexing cuts the time complexity from O(n) to O(log n).
A detailed info - https://pankajtanwar.in/blog/what-is-the-sorting-algorithm-behind-order-by-query-in-mysql

INDEX is not part of the SQL standard. An INDEX creates a balanced tree at the physical level to speed up data access.
SQL is a language that describes the conceptual-level schema and the external-level schema. SQL doesn't describe the physical-level schema.
The statement that creates an INDEX is defined by the DBMS, not by the SQL standard.

An index is an on-disk structure associated with a table or view that speeds retrieval of rows from the table or view. An index contains keys built from one or more columns in the table or view. These keys are stored in a structure (B-tree) that enables SQL Server to find the row or rows associated with the key values quickly and efficiently.
Indexes are automatically created when PRIMARY KEY and UNIQUE constraints are defined on table columns. For example, when you create a table with a UNIQUE constraint, Database Engine automatically creates a nonclustered index.
If you configure a PRIMARY KEY, Database Engine automatically creates a clustered index, unless a clustered index already exists. When you try to enforce a PRIMARY KEY constraint on an existing table and a clustered index already exists on that table, SQL Server enforces the primary key using a nonclustered index.
Please refer to this for more information about indexes (clustered and non clustered):
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-ver15
Hope this helps!

How to use mongodb query operation on a very large database (have 3 shards of around 260-300 million in each)

I have to find data within different date ranges in a sharded database that holds around 800 million documents in total. I am using this query:
cursordata=event.aggregate([{"$match":{}},{"$unwind":},{"$project":{}}])
However, when I convert it to a pandas dataframe
df=pd.DataFrame(cursordata)
It takes forever and doesn't work at all; it just gets stuck.
I have 2 choices:
Either keep querying for the different conditions directly from mongodb, or
After converting the data to a dataframe, perform the operations for the different conditions
Please suggest how to proceed.

Could we have a sample of documents?
I think you should look for an index matching the fields you're querying.
As a reminder, try to keep in mind the Equality, Sort, Range rule in MongoDB indexing.
Besides, since you're on a sharded cluster, you might want to include your shard key in your query; otherwise mongos will query all the shards.
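To make the Equality, Sort, Range ordering concrete, here is a small sketch of a helper that arranges the key list for a compound index in that order. The helper `esr_index_keys` and the field names are hypothetical, since the question does not show its documents. The resulting list has the (field, direction) shape that pymongo's create_index accepts, but creating the index requires write access to the cluster.

```python
ASCENDING, DESCENDING = 1, -1  # same values as pymongo.ASCENDING / pymongo.DESCENDING


def esr_index_keys(equality=(), sort=(), range_fields=()):
    """Order compound-index keys as Equality, then Sort, then Range.

    `equality` and `range_fields` are field names; `sort` is a sequence of
    (field, direction) pairs. A field already placed is not repeated.
    """
    keys = []
    seen = set()
    candidates = (
        [(field, ASCENDING) for field in equality]
        + list(sort)
        + [(field, ASCENDING) for field in range_fields]
    )
    for field, direction in candidates:
        if field not in seen:
            seen.add(field)
            keys.append((field, direction))
    return keys


# Hypothetical query shape: match one event type, sort by date, filter a numeric range.
index_keys = esr_index_keys(
    equality=["event_type"],
    sort=[("created_at", DESCENDING)],
    range_fields=["duration"],
)
```

Someone with the necessary privileges could then run event.create_index(index_keys). When the date field is both sorted on and range-filtered, the helper lists it once, in the sort position.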

SQLite indexing groups of rows?

I have a large set of city, state pairs I'm loading into a SQLite table. I will be querying the city and will know the state. Suppose I want to look for a particular city that I know is in Texas. The following query is roughly O(n) notwithstanding the limit, right?
SELECT * FROM cities WHERE state_abbr=? LIMIT 1
Is there some way of grouping the rows by state or creating a secondary index or something so that SQLite knows where to find the 'TX' rows and only search within them? I've considered creating separate tables for each state, and that's an option, but I'm hoping I can just do something within this single table to make the queries more efficient.
In the tutorials I've read, the query doesn't change after creating a composite index. Is SQLite just using the index under the hood on the same query?

Why not just have a composite index?
create index cities_state_abbr_city on cities(state_abbr, city);
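On the last part of the question: yes, the query text stays the same and SQLite's planner picks up the index on its own. One way to check this is EXPLAIN QUERY PLAN, sketched below with Python's sqlite3 module. The column names are assumptions (`state_abbr` comes from the question's query, `city` is a guess), and the exact wording of the plan text varies between SQLite versions, so the code only looks for the index name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (state_abbr TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("TX", "Austin"), ("TX", "Houston"), ("CA", "Fresno"), ("NY", "Albany")],
)

query = "SELECT * FROM cities WHERE state_abbr = ? AND city = ? LIMIT 1"


def query_plan(connection, sql, params):
    """Return the human-readable plan text for a query (last column of each row)."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Without an index the plan is a full scan of the table.
plan_before = query_plan(conn, query, ("TX", "Austin"))

conn.execute("CREATE INDEX cities_state_abbr_city ON cities(state_abbr, city)")

# Same query text, but the plan now names the composite index.
plan_after = query_plan(conn, query, ("TX", "Austin"))
```

Because state_abbr is the leading column, the same index also serves queries that filter on state_abbr alone.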

Selecting data from large MySQL database where value of one column is found in a large list of values

I generally use Pandas to extract data from MySQL into a dataframe. This works well and allows me to manipulate the data before analysis. This workflow works well for me.
I'm in a situation where I have a large MySQL database (multiple tables that will yield several million rows). I want to extract the data where one of the columns matches a value in a Pandas series. This series could be of variable length and may change frequently. How can I extract data from the MySQL database where one of the columns of data is found in the Pandas series? The two options I've explored are:
Extract all the data from MySQL into a Pandas dataframe (using pymysql, for example) and then keep only the rows I need (using df.isin()).
or
Query the MySQL database using a query with multiple WHERE ... OR ... OR statements (and load this into Pandas dataframe). This query could be generated using Python to join items of a list with ORs.
I guess both these methods would work but they both seem to have high overheads. Method 1 downloads a lot of unnecessary data (which could be slow and is, perhaps, a higher security risk) whilst method 2 downloads only the desired records but it requires an unwieldy query that contains potentially thousands of OR statements.
Is there a better alternative? If not, which of the two above would be preferred?

I am not familiar with pandas, but strictly from a database point of view you could insert your pandas values into a PANDA_VALUES table and then join that table with the table(s) you want to grab your data from.
Assuming you have indexes in place on both the PANDA_VALUES table and the table containing your column, the JOIN would be quite fast.
Of course, you will need a process in place to keep the PANDA_VALUES table updated as the business needs change.
Hope it helps.
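A rough sketch of the lookup-table idea follows, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The `measurements` table, its columns, and the sample values are invented for the illustration; with MySQL the same pattern would go through a driver such as pymysql and needs permission to create a (temporary) table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the large table that already exists on the server.
conn.execute("CREATE TABLE measurements (sample_id INTEGER, reading REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1000)],
)
conn.execute("CREATE INDEX idx_measurements_sample ON measurements(sample_id)")

# Values that would come from the pandas series, e.g. series.tolist().
wanted_ids = [3, 14, 15, 92, 5000]

# Load the values into a small keyed lookup table ...
conn.execute("CREATE TEMP TABLE panda_values (sample_id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO panda_values VALUES (?)", [(value,) for value in wanted_ids]
)

# ... and let the database do the matching with an indexed join.
matched = conn.execute(
    "SELECT m.sample_id, m.reading "
    "FROM measurements AS m "
    "JOIN panda_values AS p ON p.sample_id = m.sample_id "
    "ORDER BY m.sample_id"
).fetchall()
```

Only the matching rows leave the database, and the query text stays short however many values the series holds. The join result can be loaded into a dataframe in the usual way, for example with pd.read_sql.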

Python dictionary of sets in SQL

I have a dictionary in Python where the keys are integers and the values sets of integers. Considering the potential size (millions of key-value pairs, where a set can contain from 1 to several hundreds of integers), I would like to store it in a SQL (?) database, rather than serialize it with pickle to store it and load it back in whenever I need it.
From reading around I see two potential ways to do this, both with its downsides:
Serialize the sets and store them as BLOBs: So I would get an SQL with two columns, the first column are the keys of the dictionary as INTEGER PRIMARY KEY, the second column are the BLOBS, containing a set of integers.
Downside: Not able to alter sets anymore without loading the complete BLOB in, and after adding a value to it, serialize it back and insert it back to the database as a BLOB.
Add a unique key for each element of each set: I would get two columns, one with the keys (which are now key_dictionary + index element of set/list), one with one integer value in each row. I'd now be able to add values to a "set" without having to load the whole set into python. I would have to put more work in keeping track of all the keys.
In addition, once the database is complete, I will always need sets as a whole, so idea 1 seems to be faster? If I query for all in primary keys BETWEEN certain values, or LIKE certain values, to obtain my whole set in system 2, will the SQL database (sqlite) still work as a hashtable? Or will it linearly search for all values that fit my BETWEEN or LIKE search?
Overall, what's the best way to tackle this problem? Obviously, if there's a completely different 3rd way that solves my problems naturally, feel free to suggest it! (haven't found any other solution by searching around)
I'm kind of new to Python and especially to databases, so let me know if my question isn't clear. :)

Your second option is nearly what I would recommend. What I would do is have three columns:
Set ID
Key
Value
I would then create a composite primary key on the Set ID and Key which guarantees that the combination is unique:
CREATE TABLE something (
set_id INTEGER,
key INTEGER,
value INTEGER,
PRIMARY KEY (set_id, key)
);
You can now add a value straight into a particular set (Or update a key in a set) and select all keys in a set.
This being said, your first strategy would be more optimal for read-heavy workloads as the size of the indexes would be smaller.
will the SQL database (sqlite) still work as a hashtable?
SQL databases tend not to use hashtables. Nor do they usually do a sequential lookup. What they do is usually create an index (Which tends to be some kind of tree, e.g. a B-tree) which allows for range lookups (e.g. where you don't know exactly what keys you're looking for).
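For illustration, here is a sketch of the row-per-element approach with Python's sqlite3 module. It is a simplified two-column variant rather than the exact three-column table above: the set element itself is part of the primary key, so no separate per-element key has to be tracked. The table name `set_members` and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE set_members ("
    " set_id INTEGER NOT NULL,"
    " value INTEGER NOT NULL,"
    " PRIMARY KEY (set_id, value))"
)


def save_sets(connection, mapping):
    """Store a dict of {int: set of ints} as one row per (key, element)."""
    rows = [(key, value) for key, members in mapping.items() for value in members]
    connection.executemany("INSERT OR IGNORE INTO set_members VALUES (?, ?)", rows)


def add_member(connection, set_id, value):
    """Add one element to a set without loading the set into Python."""
    connection.execute(
        "INSERT OR IGNORE INTO set_members VALUES (?, ?)", (set_id, value)
    )


def load_set(connection, set_id):
    """Fetch a whole set; the primary-key index serves the equality lookup."""
    cursor = connection.execute(
        "SELECT value FROM set_members WHERE set_id = ?", (set_id,)
    )
    return {row[0] for row in cursor}


save_sets(conn, {1: {10, 20}, 2: {30}})
add_member(conn, 1, 40)
add_member(conn, 1, 10)  # already present, silently ignored
loaded = load_set(conn, 1)
```

Because set_id is the leading column of the primary key, fetching a whole set is an index lookup rather than a scan of the table, and the composite key gives set semantics (no duplicate elements) for free.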
