I have a table partitioned on a date column.
Let's say I have 3 partitions for the following dates: 2019-04-01, 2019-04-02, 2019-04-03.
At t+1, I receive an input file containing data for 2019-04-02, 2019-04-03, and 2019-04-04.
What I want to do is replace the current partitions for any overlapping dates, and leave the partitions for 2019-04-01 and 2019-04-04 unchanged.
I've tried using WRITE_TRUNCATE, but that ends up wiping out the whole table. Can someone please assist?
I know a partition decorator can be used, such as table$20190404, but how exactly does this work? Does it work in conjunction with WRITE_TRUNCATE? And how can it overwrite multiple date partitions if I can only provide the decorator with one date?
You may need to pre-process your input data for this use case and exclude the data that you don't want updated in the target table. Alternatively, you can load the input data into a new BigQuery table and then use a DML statement (such as MERGE) to update the target partitioned table.
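To illustrate the decorator mechanics from the question, here is a minimal sketch using the google-cloud-bigquery Python client, assuming the input has already been pre-processed into one file per date (the project, dataset, table and bucket names are placeholders). A load with WRITE_TRUNCATE against table$YYYYMMDD replaces only that single partition, so you run one load per date rather than one load for the whole file:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # truncates only the addressed partition
)

# one load job per date present in the input, each targeting a single partition decorator
for date in ["20190402", "20190403", "20190404"]:
    uri = f"gs://my-bucket/orders_{date}.csv"             # hypothetical per-date split of the input
    table_id = f"my-project.my_dataset.my_table${date}"   # partition decorator: table$YYYYMMDD
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()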
I have a dataframe in pyspark (and databricks) with the following schema structure:
orders schema:
submitted_at:timestamp
submitted_yyyy_mm:string ("yyyy-MM" format)
order_id:string
customer_id:string
sales_rep_id:string
shipping_address_attention:string
shipping_address_address:string
shipping_address_city:string
shipping_address_state:string
shipping_address_zip:integer
ingest_file_name:string
ingested_at:timestamp
I need to capture the data in my table in Delta Lake format, with a partition for every month of the order history reflected in the data of the submitted_yyyy_mm column. I am capturing the data correctly, with the exception of two problems. One, my technique is adding two columns (and corresponding data) to the schema (I could not figure out how to do the partitioning without adding columns). Two, the partitions correctly capture all the year/months with data, but are missing the year/months without data (the requirement is that those need to be included as well). Specifically, all the months of 2017-2019 should have their own partition (so 36 months). However, my technique only created partitions for those months that actually had orders (which turned out to be 18 of the 36 months of 2017-2019).
Here is the relevant area of my code:
from pyspark.sql import functions as F, types as T

# take the pristine order table and add these two extra columns you should not have, in order to get the partition structure
df_with_year_and_month = (df_orders
    .withColumn("year", F.year(F.col("submitted_yyyy_mm").cast(T.TimestampType())))
    .withColumn("month", F.month(F.col("submitted_yyyy_mm").cast(T.TimestampType()))))

# capture the data to the orders table using the year/month partitioning
df_with_year_and_month.write.partitionBy("year", "month").mode("overwrite").format("delta").saveAsTable(orders_table)
I would be grateful to anyone who might be able to help me tweak my code to fix these two issues with the result. Thank you.
There's no issue here. That's just how it works.
You want to partition on year and month, so you need to have those values in your data; there's no way around it. You should also only partition on values that you want to filter on, since this causes partition pruning and results in faster queries. It would make no sense to partition on a field for which there is no corresponding value in the data.
Also, it's completely normal that partitions aren't created where there is no data for them. Once data is added, the corresponding partition is created if it doesn't exist yet. You don't need it any sooner than that.
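For what it's worth, if the only goal is to avoid the two derived columns, a minimal sketch (reusing the df_orders and orders_table names from the question) would be to partition directly on the existing submitted_yyyy_mm column; months without data would still only get a partition once data for them arrives:

# partition on the existing yyyy-MM column instead of derived year/month columns
(df_orders
    .write
    .partitionBy("submitted_yyyy_mm")
    .mode("overwrite")
    .format("delta")
    .saveAsTable(orders_table))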
This is my first post here and it's based upon an issue I've created and tried to solve at work. I'll try to precisely summarize my issue, as I'm having trouble wrapping my head around a preferred solution. Step 3 below is a real stumper for me.
1. Grab a large data file based on a parquet - no problem.
2. Select 5 columns from the parquet and create a dataframe - no problem:
import pandas as pd
df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])
3. But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Many of its values are unique, but many duplicate values of session_id also exist, each with multiple associated rows of data. I wish to iterate through the master dataframe and create a unique dataframe per session_id. Each of these unique (sub) dataframes would have a calculation done that simply gets the SUM of the "duration" column per session_id. Again, that SUM would be unique per session_id, so each sub dataframe would have its own SUM, with a row added listing that total along with the session_id. I'm thinking there is a nested loop formula that will work for me, but every effort to date has been a mess.
4. Ultimately, I'd like to have a final dataframe that is a collection of these unique sub dataframes. I guess I'd need to define this final dataframe and append each new sub dataframe to it as I iterate through the data. I should be able to do that simply.
5. Finally, write this final df to a new parquet file. That should be simple enough, so I won't need help with that.
But that is my challenge in a nutshell. The main design I'd need help with is step 3. I've played with itertuples and iterrows.
I think the groupby function will work:
df.groupby('session_id')['duration'].sum()
More info here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
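A minimal end-to-end sketch of that approach, using the path from your snippet and a hypothetical output path:

import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])

# one row per session_id with its summed duration -- no explicit loop over sub-dataframes needed
totals = df.groupby('session_id', as_index=False)['duration'].sum()

totals.to_parquet('/Users/marmicha/Downloads/session_totals.parquet', index=False)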
I'm new to the community and I only recently started to use Python and more specifically Pandas.
For the data set I have, I would like the columns to be the dates. For each date I would like a customer list that then breaks down into more specific row elements. Everything would be rolled up by order number, i.e. a distinct count of order numbers, because sometimes a client purchases more than one item. In Excel I create a pivot table and process it by distinct order. Then I sort each row element by the distinct count of the order number. I collapse each row down until I just have the client name. If I click to expand the cell, I then see each row element.
So my question: if I'm pulling in these huge data sets as a dataframe, can I pull the xlsx in as an array? I know it will strip the values, so I would have to set the datetime as a datetime64 element. I've been trying to reshape the array around the date being the column, and the rows I want, but so far I haven't had luck. I have tried to use pivot_table and groupby with some success, but I wasn't able to move the date to the columns.
Summary: Overall, what I'm looking to know is whether I'm going down the wrong rabbit hole altogether. I'm looking to basically create a collapsible pivot table with specific color parameters for the table as well, so that the current spreadsheet will look identical to the one I'm automating.
I really appreciate any help; as I said, I'm brand new to Pandas, so direction is key. I'd also like to know whether I'm on the "best" path for dealing with the export to Excel after I've imported and modified the spreadsheet. I get a single sheet of raw data kicked out in .xlsx form. Thanks again!
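Since there is no sample data in the question, the following is purely a sketch with hypothetical column names (client, order_date, order_number): a pivot with dates as columns and a distinct count of order numbers per client. The collapsible grouping and color formatting would still have to be applied on the Excel side (e.g. via openpyxl or xlsxwriter):

import pandas as pd

df = pd.read_excel('raw_data.xlsx')                     # hypothetical single sheet of raw data
df['order_date'] = pd.to_datetime(df['order_date'])

# distinct count of order numbers per client, one column per date
pivot = pd.pivot_table(df,
                       index='client',
                       columns='order_date',
                       values='order_number',
                       aggfunc=pd.Series.nunique,
                       fill_value=0)

pivot.to_excel('pivot_output.xlsx')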
I have a 25k "row" parquet file (totaling 469.5 KB) where each item in the parquet has a unique integer id. Knowing this, I've put an index on this column, but it doesn't appear that indexing the column actually affects performance when using Athena (the AWS service) / Presto (the underlying engine). I'm trying a simple SELECT ... WHERE where I want to pull one of the rows by its id:
SELECT *
FROM widgets w
WHERE w.id = 1
The id column is indexed, so once Presto finds the match it shouldn't do any further scanning. The column is also ordered, so it should be able to do a binary search to resolve the location instead of a dumb scan.
I can tell whether the index is being used properly, since Athena returns the number of bytes scanned in the operation. Both with and without the index, Athena reports the byte size of the file itself as the scan size, meaning it scanned the entire file. Just to be sure, ordering the data so that the id was in the very first row also didn't have an effect.
Is this not possible with the current version of Athena/Presto? I am using python, pandas, and pyarrow.
You did not specify how you created the index; I assume you are talking about a Hive index. According to [1] and [2], Presto does not support Hive indexes. According to [3], Hive itself has dropped support for them in Hive 3.
That answers your question regarding why the presence of the index does not affect the way Presto executes the query. So what other ways are there to limit the amount of data that has to be processed?
Parquet metadata includes the min and max values per row group for each column. If you have multiple row groups in your table, only those row groups that could potentially match will be read.
The upcoming PARQUET-1201 feature will add page-level indexes to the Parquet files themselves.
If you query specific columns, only those columns will be read.
If your table is partitioned, filtering for the "partition by" column will only read that partition.
Please note, however, that all of these measures only make sense for data sizes several orders of magnitude larger than 500 KB. In fact, Parquet itself is overkill for such small tables: the default size of a row group is 128 MB, and you are expected to have many row groups.
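To illustrate the row-group point with the tools you mention (pandas/pyarrow), here is a sketch: sorting by id and writing multiple row groups gives each group a tight min/max range for id, so an engine that uses Parquet statistics can skip groups that cannot contain the requested id (whether Athena actually prunes them depends on the engine version; the file names and row-group size are just for illustration):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_parquet('widgets.parquet')      # hypothetical local copy of the table
df = df.sort_values('id')                    # sorted ids -> tight, non-overlapping min/max per row group

table = pa.Table.from_pandas(df, preserve_index=False)
# unusually small row groups, only to get more than one group out of a 25k-row file
pq.write_table(table, 'widgets_rowgroups.parquet', row_group_size=5000)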
I generally use Pandas to extract data from MySQL into a dataframe. This works well and allows me to manipulate the data before analysis. This workflow works well for me.
I'm in a situation where I have a large MySQL database (multiple tables that will yield several million rows). I want to extract the data where one of the columns matches a value in a Pandas series. This series could be of variable length and may change frequently. How can I extract data from the MySQL database where one of the columns of data is found in the Pandas series? The two options I've explored are:
Extract all the data from MySQL into a Pandas dataframe (using pymysql, for example) and then keep only the rows I need (using df.isin()).
or
Query the MySQL database using a query with multiple WHERE ... OR ... OR statements (and load this into Pandas dataframe). This query could be generated using Python to join items of a list with ORs.
I guess both these methods would work but they both seem to have high overheads. Method 1 downloads a lot of unnecessary data (which could be slow and is, perhaps, a higher security risk) whilst method 2 downloads only the desired records but it requires an unwieldy query that contains potentially thousands of OR statements.
Is there a better alternative? If not, which of the two above would be preferred?
I am not familiar with pandas, but strictly speaking from a database point of view, you could just insert your pandas values into a PANDA_VALUES table and then join that PANDA_VALUES table with the table(s) you want to grab your data from.
Assuming you have some indexes in place on both the PANDA_VALUES table and the table with your column, the JOIN would be quite fast.
Of course, you will have to have a process in place to keep the PANDA_VALUES table updated as the business needs change.
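A sketch of that idea from the pandas side, assuming a SQLAlchemy engine and placeholder table/column names (orders, customer_id):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@host/dbname')   # placeholder connection string

# push the lookup values from the series into a small helper table
ids = pd.Series([101, 205, 333], name='customer_id')                  # hypothetical values
ids.to_frame().to_sql('PANDA_VALUES', engine, if_exists='replace', index=False)

# let MySQL do the filtering via a join and pull back only the matching rows
query = """
    SELECT o.*
    FROM orders o
    JOIN PANDA_VALUES p ON o.customer_id = p.customer_id
"""
df = pd.read_sql(query, engine)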
Hope it helps.