How do you get Athena/Presto to recognize parquet index - python

I have a 25k "row" parquet file (totaling 469.5kb) where each item in the parquet has a unique integer id. Knowing this, I've put an index on this column, but indexing the column doesn't appear to affect performance when using Athena (AWS service) / Presto (underlying engine). I'm trying a simple SELECT ... WHERE query where I want to pull one of the rows by its id:
SELECT *
FROM widgets w
WHERE w.id = 1
The id column is indexed, so once Presto finds this match it shouldn't do any further scanning. The column is also ordered, so it should be able to do a binary search to resolve the location instead of a dumb scan.
I can tell whether the index is being used properly since Athena returns the number of bytes scanned in the operation. With and without the index, Athena returns the byte size of the file itself as the scan size, meaning it scanned the entire file. Just to be sure, ordering so that the id was the very first row also didn't have an effect.
Is this not possible with the current version of Athena/Presto? I am using python, pandas, and pyarrow.

You did not specify how you created the index; I assume you are talking about a Hive index. According to 1 and 2, Presto does not support Hive indexes. According to 3, Hive itself dropped support for them in Hive 3.
That answers your question regarding why the presence of the index does not affect the way Presto executes the query. So what other ways are there to limit the amount of data that has to be processed?
Parquet metadata includes the min and max values per row group for each column. If you have multiple row groups in your table, only those row groups that could potentially match will be read (see the sketch after this list).
The upcoming PARQUET-1201 feature will add page-level indexes to the Parquet files themselves.
If you query specific columns, only those columns will be read.
If your table is partitioned, filtering for the "partition by" column will only read that partition.
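For example, you can inspect the per-row-group min/max statistics yourself with pyarrow. A minimal sketch, assuming a file named widgets.parquet whose first column is the id:

import pyarrow.parquet as pq

pf = pq.ParquetFile("widgets.parquet")  # hypothetical file name
for i in range(pf.num_row_groups):
    # statistics live on each column chunk inside the row group
    stats = pf.metadata.row_group(i).column(0).statistics  # column 0 assumed to be "id"
    if stats is not None and stats.has_min_max:
        print(f"row group {i}: id in [{stats.min}, {stats.max}]")

An engine that honors these statistics can skip any row group whose [min, max] range cannot contain the value in your WHERE clause.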
Please note, however, that all of these measures only make sense for data sizes several orders of magnitude larger than 500KB. In fact, Parquet itself is overkill for such small tables. The default size of a row group is 128MB and you are expected to have many row groups.

Related

Relevance of creating an index on a field [duplicate]

Also, when is it appropriate to use one?
An index is used to speed up searching in the database. MySQL has some good documentation on the subject (which is relevant for other SQL servers as well):
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
An index can be used to efficiently find all rows matching some column in your query and then walk through only that subset of the table to find exact matches. If you don't have indexes on any column in the WHERE clause, the SQL server has to walk through the whole table and check every row to see if it matches, which may be a slow operation on big tables.
The index can also be a UNIQUE index, which means that you cannot have duplicate values in that column, or a PRIMARY KEY which in some storage engines defines where in the database file the value is stored.
In MySQL you can use EXPLAIN in front of your SELECT statement to see if your query will make use of any index. This is a good start for troubleshooting performance problems. Read more here:
http://dev.mysql.com/doc/refman/5.0/en/explain.html
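As a quick illustration from Python, a minimal sketch of running EXPLAIN with pymysql (connection details, table, and column names are hypothetical):

import pymysql

conn = pymysql.connect(host="localhost", user="user", password="secret", database="shop")
with conn.cursor() as cur:
    # EXPLAIN reports which index (if any) the optimizer will use for this query
    cur.execute("EXPLAIN SELECT * FROM users WHERE email = %s", ("a@example.com",))
    for row in cur.fetchall():
        print(row)
conn.close()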
A clustered index is like the contents of a phone book. You can open the book at 'Hilditch, David' and find all the information for all of the 'Hilditch's right next to each other. Here the keys for the clustered index are (lastname, firstname).
This makes clustered indexes great for retrieving lots of data based on range based queries since all the data is located next to each other.
Since the clustered index is actually related to how the data is stored, there is only one of them possible per table (although you can cheat to simulate multiple clustered indexes).
A non-clustered index is different in that you can have many of them, and they then point at the data in the clustered index. You could have, for example, a non-clustered index at the back of a phone book which is keyed on (town, address).
Imagine if you had to search through the phone book for all the people who live in 'London' - with only the clustered index you would have to search every single item in the phone book since the key on the clustered index is on (lastname, firstname) and as a result the people living in London are scattered randomly throughout the index.
If you have a non-clustered index on (town) then these queries can be performed much more quickly.
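To make the analogy concrete, here is a toy Python sketch of a non-clustered index over the phone-book data (all names and values are invented for illustration):

phone_book = [
    ("Hilditch", "David", "London"),
    ("Smith", "Anna", "Leeds"),
    ("Smith", "John", "London"),
]  # stored sorted by (lastname, firstname), like the clustered index

# build the non-clustered index: town -> positions of matching entries
town_index = {}
for pos, (_, _, town) in enumerate(phone_book):
    town_index.setdefault(town, []).append(pos)

# "who lives in London?" now touches only the matching entries
londoners = [phone_book[i] for i in town_index.get("London", [])]
print(londoners)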
An index is used to speed up the performance of queries. It does this by reducing the number of database data pages that have to be visited/scanned.
In SQL Server, a clustered index determines the physical order of data in a table. There can be only one clustered index per table (the clustered index IS the table). All other indexes on a table are termed non-clustered.
SQL Server Index Basics
SQL Server Indexes: The Basics
SQL Server Indexes
Index Basics
Index (wiki)
Indexes are all about finding data quickly.
Indexes in a database are analogous to indexes that you find in a book. If a book has an index, and I ask you to find a chapter in that book, you can quickly find that with the help of the index. On the other hand, if the book does not have an index, you will have to spend more time looking for the chapter by looking at every page from the start to the end of the book.
In a similar fashion, indexes in a database can help queries find data quickly. If you are new to indexes, the following videos can be very useful. In fact, I have learned a lot from them.
Index Basics
Clustered and Non-Clustered Indexes
Unique and Non-Unique Indexes
Advantages and disadvantages of indexes
Well, in general an index is a B-tree. There are two types of indexes: clustered and nonclustered.
A clustered index creates a physical order of rows (there can be only one per table, and in most cases it is also the primary key; if you create a primary key on a table, you create a clustered index on that table as well).
A nonclustered index is also a B-tree, but it doesn't create a physical order of rows. Instead, the leaf nodes of a nonclustered index contain the PK (if it exists) or a row locator.
Indexes are used to increase the speed of searches, because the lookup complexity is O(log N). Indexes are a very large and interesting topic. I can say that creating indexes on a large database is sometimes a kind of art.
INDEXES - to find data easily
UNIQUE INDEX - duplicate values are not allowed
Syntax for INDEX
CREATE INDEX INDEX_NAME ON TABLE_NAME(COLUMN);
Syntax for UNIQUE INDEX
CREATE UNIQUE INDEX INDEX_NAME ON TABLE_NAME(COLUMN);
First we need to understand how a normal (non-indexed) query runs. It traverses the rows one by one and returns when it finds the data. So if the query is looking for the value 50, it has to read 49 records before it, as a linear search.
When we apply indexing, the query can find the data quickly without reading every row, eliminating half of the remaining data in each traversal, like a binary search. MySQL indexes are stored as a B-tree, where all the data is in the leaf nodes.
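A minimal Python sketch of the difference described above, using a sorted list as a stand-in for the B-tree's ordered keys:

import bisect

ids = list(range(1, 101))  # a sorted column of ids, 1..100

# without an index: linear scan, one comparison per row -> O(n)
steps = next(i + 1 for i, v in enumerate(ids) if v == 50)   # 50 comparisons

# with an index over sorted keys: binary search -> O(log n)
pos = bisect.bisect_left(ids, 50)                           # ~7 comparisons

print(steps, ids[pos])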
INDEX is a performance optimization technique that speeds up the data retrieval process. It is a persistent data structure associated with a table (or view) in order to increase performance when retrieving data from that table (or view).
Index-based search applies in particular when your queries include a WHERE filter. A query without a WHERE filter selects and processes the whole data set. Searching a whole table without an index is called a table scan.
You will find exact information on SQL indexes, presented in a clear and reliable way, by following these links:
For concept-wise understanding:
http://dotnetauthorities.blogspot.in/2013/12/Microsoft-SQL-Server-Training-Online-Learning-Classes-INDEX-Overview-and-Optimizations.html
For implementation-wise understanding:
http://dotnetauthorities.blogspot.in/2013/12/Microsoft-SQL-Server-Training-Online-Learning-Classes-INDEX-Creation-Deletetion-Optimizations.html
If you're using SQL Server, one of the best resources is its own Books Online that comes with the install! It's the 1st place I would refer to for ANY SQL Server related topics.
If it's practical "how should I do this?" kind of questions, then StackOverflow would be a better place to ask.
Also, I haven't been back for a while but sqlservercentral.com used to be one of the top SQL Server related sites out there.
An index is used for several different reasons. The main reason is to speed up querying so that you can fetch or sort rows faster. Another reason is to define a primary key or unique index, which guarantees that no other rows have the same values in that column.
So, how does indexing actually work?
Well, first off, the database table does not reorder itself when we put an index on a column to optimize query performance.
An index is a data structure (most commonly a B-tree, which is a balanced tree, not a binary tree) that stores the values of a specific column of a table.
The major advantage of a B-tree is that the data in it is kept sorted. In addition, the B-tree data structure is time-efficient: operations such as searching, insertion, and deletion can be done in logarithmic time.
In the index, each value is mapped to a database-internal identifier (a pointer) that points to the exact location of the corresponding row, so the same query can jump straight to the matching rows instead of scanning the table.
So indexing just cuts the time complexity down from O(n) to O(log n).
A detailed info - https://pankajtanwar.in/blog/what-is-the-sorting-algorithm-behind-order-by-query-in-mysql
INDEX is not part of the SQL standard. An INDEX creates a balanced tree at the physical level to accelerate CRUD operations.
SQL is a language that describes the conceptual-level schema and the external-level schema; it doesn't describe the physical-level schema.
The statement that creates an INDEX is defined by the DBMS, not by the SQL standard.
An index is an on-disk structure associated with a table or view that speeds retrieval of rows from the table or view. An index contains keys built from one or more columns in the table or view. These keys are stored in a structure (B-tree) that enables SQL Server to find the row or rows associated with the key values quickly and efficiently.
Indexes are automatically created when PRIMARY KEY and UNIQUE constraints are defined on table columns. For example, when you create a table with a UNIQUE constraint, Database Engine automatically creates a nonclustered index.
If you configure a PRIMARY KEY, Database Engine automatically creates a clustered index, unless a clustered index already exists. When you try to enforce a PRIMARY KEY constraint on an existing table and a clustered index already exists on that table, SQL Server enforces the primary key using a nonclustered index.
Please refer to this for more information about indexes (clustered and non clustered):
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-ver15
Hope this helps!

Connectorx - cx.read_sql() returns dataframe of all-zero rows intermittently or crashes when table is being concurrently updated

When using df = connectorx.read_sql(conn=cx_con, query=f"SELECT * FROM table;"), I occasionally get a DataFrame returned with the correct columns but all rows zeros, or it crashes with the message Process finished with exit code -1073741819 (0xC0000005). This only happens when the table is being updated at the same time with df.to_sql("table", con=con_in, if_exists="append").
My program reads a table from a local database that I am continuously updating in a concurrently running program. This issue does not occur when I try to read from the table using pandas.read_sql_query() (which is far slower). This indicates that there is some issue with the handling of the read/write traffic accessing the table when using connectorx that does not exist with the pandas read. Is this a bug with connectorx or is there something I can do to prevent this from happening?
I'm using PyCharm 2022.2.1, Windows 11, and Python 3.10.
Reason for the error
ConnectorX is designed more for OLAP scenarios where the data is static, with read-only queries. The reason for the zero rows / crashes is inconsistency between the multiple queries ConnectorX issues. In order to achieve maximum speedup, ConnectorX issues queries to fetch metadata before fetching the real query result, including:
limit 1 query to get result schema (column types and names)
count query to get the number of rows in the result
min/max query to get the min and max values of the partition column (if partitioning is enabled)
The first two are used to pre-allocate the destination pandas.DataFrame in advance. Allocating the dataframe at the beginning makes it possible for ConnectorX to stream the result values directly into the final destination, avoiding extra data copies and result concatenations.
If the data is updating, the result of the count query, X, may differ from the real number of rows the query returns, Y. In that case, the program may crash (if X < Y) or return some rows with all zeros (if X > Y).
Possible workarounds
Avoid COUNT query through arrow
One possible way to avoid the count query is to set return_type to arrow2 to get the result in Arrow format first. Since an Arrow table consists of multiple record batches, ConnectorX can allocate memory on demand without issuing the count query. After getting the Arrow result, you can then convert it to pandas using the efficient to_pandas API provided by pyarrow. Here is an example:
import connectorx as cx
table = cx.read_sql(conn, query, return_type="arrow2")
df = table.to_pandas(split_blocks=False, date_as_object=False)
However, one thing to note is that if you are using partitioning, the result might still be incorrect due to inconsistency between the min/max query and the multiple partitioned queries.
Add predicates for consistency
If your data is append-only in a certain way, for example with a monotonically increasing ID column, you can add a predicate like ID <= current max ID so that concurrently appended data is filtered out of both the count query and the fetch-result query. If you are using partitioning, you can also partition on this ID column so that the result is consistent.
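A minimal sketch of this workaround (connection string, table, and column names are hypothetical):

import connectorx as cx

conn = "mysql://user:pass@localhost:3306/mydb"  # hypothetical connection string

# pin the upper bound first, so every internal query sees the same data
max_id = cx.read_sql(conn, "SELECT MAX(id) AS max_id FROM my_table").iloc[0, 0]

df = cx.read_sql(
    conn,
    f"SELECT * FROM my_table WHERE id <= {max_id}",
    partition_on="id",   # partitioning on the same column keeps the result consistent
    partition_num=4,
)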

Partition Data By Year/Month Column without Adding Columns to Result - pyspark/databricks

I have a dataframe in pyspark (and databricks) with the following schema structure:
orders schema:
submitted_at:timestamp
submitted_yyyy_mm using the format "yyyy-MM"
order_id:string
customer_id:string
sales_rep_id:string
shipping_address_attention:string
shipping_address_address:string
shipping_address_city:string
shipping_address_state:string
shipping_address_zip:integer
ingest_file_name:string
ingested_at:timestamp
I need to capture the data in my table in delta lake format, with a partition for every month of the order history reflected in the data of the submitted_yyyy_mm column. I am capturing the data correctly with the exception of two problems. One, my technique is adding two columns (and corresponding data) to the schema (could not figure out how to do the partitioning without adding columns). Two, the partitions correctly capture all the year/months with data, but are missing the year/months without data (requirement is those need to be included also). Specifically, all the months of 2017-2019 should have their own partition (so 36 months). However, my technique only created partitions for those months that actually had orders (which turned out to be 18 of the 36 months of the years 2017-2019).
Here is the relevant area of my code:
from pyspark.sql import functions as F, types as T

# take the pristine order table and add these two extra columns you should not have in order to get the partition structure
df_with_year_and_month = (df_orders
    .withColumn("year", F.year(F.col("submitted_yyyy_mm").cast(T.TimestampType())))
    .withColumn("month", F.month(F.col("submitted_yyyy_mm").cast(T.TimestampType()))))

# capture the data to the orders table using the year/month partitioning
df_with_year_and_month.write.partitionBy("year", "month").mode("overwrite").format("delta").saveAsTable(orders_table)
I would be grateful to anyone who might be able to help me tweak my code to fix the two issues I have with the result. Thank you.
There's no issue here. That's just how it works.
You want to partition on year and month, so you need those values in your data; there is no way around it. You should also only partition on columns that you want to filter on, since this enables partition pruning and results in faster queries. It would make no sense to partition on a field without a related value.
Also, it's totally normal that partitions aren't created where there is no data for them. Once data is added, the corresponding partition is created if it doesn't exist yet. You don't need it any sooner than that.
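To see why the partitioning pays off, a minimal sketch of partition pruning at read time, assuming the Databricks-provided spark session and a saved table named "orders":

# filtering on the partition columns lets Spark skip all other partitions entirely
df_june_2018 = spark.table("orders").where("year = 2018 AND month = 6")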

How do you overwrite certain date partitions in BigQuery?

I have a partitioned table by column date.
Let's say I have 3 partitions for the following dates : 2019-04-01, 2019-04-02, 2019-04-03
At t+1, I have an input file containing data for the 2019-04-02, 2019-04-03, 2019-04-04.
What I want to do is to replace the current partitions for any overlapped dates, and leave unchanged the partition for 2019-04-01, 2019-04-04.
I've tried using WRITE_TRUNCATE but this ends up deleting the whole table on me. Can someone please assist.
I know partition decorator can be used such as table$20190404 but how exactly does this work? Is it working in conjunction with WRITE_TRUNCATE? How is it overwriting multiple date partitions if I can only provide the decorator with one date?
You may need to pre-process your input data for this use case and exclude the data that you don't want updated in the target table. Alternatively, you can load the input data into a new BQ table and then use a DML statement (such as MERGE) to update the target partitioned table.
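As for the decorator question: WRITE_TRUNCATE combined with a partition decorator such as table$20190402 truncates only that single partition, so you run one load per overlapping date. A minimal sketch with the google-cloud-bigquery client (project, dataset, table, and file names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.CSV,
)

# the $20190402 decorator scopes the truncate to that one partition;
# repeat the load once per overlapping date
with open("orders_2019-04-02.csv", "rb") as f:
    job = client.load_table_from_file(
        f, "my_project.my_dataset.my_table$20190402", job_config=job_config
    )
job.result()  # wait for the load job to finish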

Selecting data from large MySQL database where value of one column is found in a large list of values

I generally use Pandas to extract data from MySQL into a dataframe. This works well and allows me to manipulate the data before analysis. This workflow works well for me.
I'm in a situation where I have a large MySQL database (multiple tables that will yield several million rows). I want to extract the data where one of the columns matches a value in a Pandas series. This series could be of variable length and may change frequently. How can I extract data from the MySQL database where one of the columns of data is found in the Pandas series? The two options I've explored are:
Extract all the data from MySQL into a Pandas dataframe (using pymysql, for example) and then keep only the rows I need (using df.isin()).
or
Query the MySQL database using a query with multiple WHERE ... OR ... OR statements (and load this into Pandas dataframe). This query could be generated using Python to join items of a list with ORs.
I guess both these methods would work but they both seem to have high overheads. Method 1 downloads a lot of unnecessary data (which could be slow and is, perhaps, a higher security risk) whilst method 2 downloads only the desired records but it requires an unwieldy query that contains potentially thousands of OR statements.
Is there a better alternative? If not, which of the two above would be preferred?
I am not familiar with pandas, but strictly speaking from a database point of view, you could just insert your pandas values into a PANDA_VALUES table and then join that PANDA_VALUES table with the table(s) you want to grab your data from.
Assuming you have some indexes in place on both the PANDA_VALUES table and the table with your column, the JOIN would be quite fast.
Of course, you will have to have a process in place to keep the PANDA_VALUES table updated as business needs change.
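A minimal sketch of this approach with pandas and SQLAlchemy (connection string, table, and column names are hypothetical):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@localhost/mydb")

wanted = pd.Series([101, 205, 307], name="id")  # the values to match on

# 1. push the series into a small lookup table
wanted.to_frame().to_sql("panda_values", engine, if_exists="replace", index=False)

# 2. join it against the big table so only matching rows are transferred
query = """
    SELECT t.*
    FROM big_table AS t
    INNER JOIN panda_values AS p ON t.id = p.id
"""
df = pd.read_sql(query, engine)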
Hope it helps.
