How to select columns that build on each other in SQL? - python

How can I read entries in SQL where the column entries build on each other?
The first column holds the material number. Each material in turn consists of other materials, which appear with their own material numbers in the same first column.
How can I find out, e.g. for material 123, the quantities of all its sub-materials?
I have loaded the data from an Excel table into a pandas DataFrame and then written it into an SQL database.
Should I use loops in Python to iterate over the table, or is there a smarter way?
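This is a classic bill-of-materials lookup, and a recursive query usually beats looping over rows in Python. Below is a minimal sketch using SQLite's WITH RECURSIVE, assuming a hypothetical table bom(parent_material, child_material, quantity) with one row per component; the file, table and column names are only illustrative.

import sqlite3

conn = sqlite3.connect("materials.db")  # hypothetical database file

query = """
WITH RECURSIVE parts(material, quantity) AS (
    -- direct components of the requested material
    SELECT child_material, quantity
    FROM bom
    WHERE parent_material = ?
    UNION ALL
    -- components of those components, multiplying quantities along the path
    SELECT bom.child_material, bom.quantity * parts.quantity
    FROM bom
    JOIN parts ON bom.parent_material = parts.material
)
SELECT material, SUM(quantity) AS total_quantity
FROM parts
GROUP BY material;
"""

# pass the material number with the same type as the column stores it
for material, total in conn.execute(query, ("123",)):
    print(material, total)

MySQL 8+ and PostgreSQL support the same WITH RECURSIVE syntax, so the whole traversal can stay in the database.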

Related

Strategy for creating pivot tables that collapse with large data sets

I'm new to the community and I only recently started to use Python and more specifically Pandas.
In my data set I would like the columns to be the dates. For each date I would like a customer list that then breaks down into more specific row elements. Everything would be rolled up by order number, i.e. a distinct count on the order number, because sometimes a client purchases more than one item. In Excel I create a pivot table, process it by distinct order, sort each row element by the distinct count of the order number, and collapse each row down until only the client name is visible; clicking a cell expands it to show each row element.
So my question: if I'm pulling in these huge data sets as a DataFrame, can I pull the xlsx in as an array? I know it will strip the values, so I would have to set the date as a datetime64 element. I've been trying to reshape the array so that the dates become the columns and the rest becomes the rows I want, but so far I haven't had luck. I have tried pivot_table and groupby with some success, but I wasn't able to move the date to the columns.
Summary: overall, what I'm looking to know is whether I'm going down the wrong rabbit hole altogether. I'm looking to create a collapsible pivot table with specific color parameters so that the generated spreadsheet looks identical to the one I'm automating.
I really appreciate any help; as I said, I'm brand new to Pandas, so direction is key. I'd also like to know the best way of dealing with the export back to Excel after I've imported and modified the spreadsheet. I get a single sheet of raw data kicked out in .xlsx form. Thanks again!
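For the reshaping part, pandas can get most of the way there with pivot_table. A minimal sketch, assuming hypothetical column names 'Date', 'Client' and 'Order Number' in the raw sheet (more row elements can be added to the index list):

import pandas as pd

df = pd.read_excel("raw_data.xlsx")       # the single sheet of raw data
df["Date"] = pd.to_datetime(df["Date"])   # ensure a datetime64 dtype

# Distinct count of order numbers per client, one column per date.
pivot = pd.pivot_table(
    df,
    index="Client",
    columns="Date",
    values="Order Number",
    aggfunc="nunique",   # distinct count
    fill_value=0,
)
pivot.to_excel("report.xlsx")

The collapsible outline and the colors are Excel features rather than pandas features; after to_excel they would typically be applied with openpyxl or XlsxWriter row grouping and formatting.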

SQLite indexing groups of rows?

I have a large set of city, state pairs I'm loading into a SQLite table. I will be querying the city and will know the state. Suppose I want to look for a particular city that I know is in Texas. The following query is roughly O(n) notwithstanding the limit, right?
SELECT * FROM cities WHERE state_abbr=? LIMIT 1
Is there some way of grouping the rows by state or creating a secondary index or something so that SQLite knows where to find the 'TX' rows and only searches within them? I've considered creating separate tables for each state (and that's an option), but I'm hoping I can just do something within this single table to make the queries more efficient.
In the tutorials I've read, the query doesn't change after creating a composite index. Is SQLite just using the index under the hood on the same query?
Why not just have a composite index?
create index cities_state_abbr_city on cities(state_abbr, city);
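With that index in place the query text indeed stays the same; SQLite's planner picks the index automatically, and EXPLAIN QUERY PLAN confirms it. A small sketch, assuming the table is called cities and the columns are state_abbr and city:

import sqlite3

conn = sqlite3.connect("cities.db")
conn.execute(
    "CREATE INDEX IF NOT EXISTS cities_state_abbr_city "
    "ON cities(state_abbr, city)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM cities WHERE state_abbr = ? AND city = ? LIMIT 1",
    ("TX", "Austin"),
).fetchall()
print(plan)  # should report SEARCH ... USING INDEX cities_state_abbr_city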

How do you get Athena/Presto to recognize a Parquet index

I have a 25k-row Parquet file (totaling 469.5 KB) where each item has a unique integer id. Knowing this, I've put an index on that column, but indexing the column doesn't appear to affect performance when using Athena (AWS service) / Presto (the underlying engine). I'm trying a simple SELECT with a WHERE clause, where I want to pull one of the rows by its id:
SELECT *
FROM widgets w
WHERE w.id = 1
The id column is indexed, so once Presto finds the match it shouldn't do any further scanning. The column is also ordered, so it should be able to do a binary search to resolve the location instead of a dumb scan.
I can tell whether the index is being used because Athena returns the number of bytes scanned in the operation. With and without the index, Athena reports the byte size of the whole file as the scan size, meaning it scanned the entire file. Just to be sure, ordering the data so that the matching id was in the very first row also didn't have an effect.
Is this not possible with the current version of Athena/Presto? I am using Python, pandas, and pyarrow.
You did not specify how you created the index, I assume you are talking about a Hive index. According to 1 and 2, Presto does not support Hive indexes. According to 3, Hive itself has dropped support for them in Hive 3.
That answers your question regarding why the presence of the index does not affect the way Presto executes the query. So what other ways are there to limit the amount of data that has to be processed?
Parquet metadata includes the min and max values per row group for each column. If you have multiple row groups in your table, only those that could potentially match will be read (see the sketch below).
The upcoming PARQUET-1201 feature will add page-level indexes to the Parquet files themselves.
If you query specific columns, only those columns will be read.
If your table is partitioned, filtering for the "partition by" column will only read that partition.
Please note, however, that all of these measures only make sense for data sizes several orders of magnitude larger than 500 KB. In fact, Parquet itself is overkill for such small tables. The default size of a row group is 128 MB and you are expected to have many row groups.
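If you do stay with Parquet, the row-group skipping mentioned above only helps once there is more than one row group. A sketch with pyarrow, using hypothetical file and column names, that writes several row groups sorted by id so that their min/max statistics are selective:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": range(25_000), "payload": "x"}).sort_values("id")
table = pa.Table.from_pandas(df, preserve_index=False)

# Force several row groups; sorted ids give each group a tight min/max range.
pq.write_table(table, "widgets.parquet", row_group_size=5_000)

meta = pq.ParquetFile("widgets.parquet").metadata
for i in range(meta.num_row_groups):
    stats = meta.row_group(i).column(0).statistics
    print(i, stats.min, stats.max)  # the values engines use for row-group skipping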

Create a table with a variable number of columns in SQL

I'm scraping a site.
There are several tables that represent attributes of a single observation.
I'm not sure whether it's useful to put images in this post because they are in Korean.
I have inserted an explanatory image.
There are many tables. I want to reshape them into one table, which would be one record with many fields.
But I have a problem.
A few tables have variable numbers of columns.
I'd like to store that data in SQL.
From what I know, an SQL table has a fixed number of fields.
Do you have a solution, or can you tell me what I should search for?
Here is the link. http://goodauction.land.naver.com/auction/ca_view.php?product_id=1698750&class1=5&ju_price1=&ju_price2=&bi_price1=&bi_price2=&num1=&num2=&lawsup=0&lesson=0&next_biddate1=&next_biddate2=&state=91&b_count1=0&b_count2=0&b_area1=&b_area2=&special=0&e_area1=&e_area2=&si=11&gu=0&dong=0&apt_no=0&order=&start=0&total_record_val=&detail_search=&detail_class=1&recieveCode=
The variables in the tables at this link indicate the winning bid, the number of floors in the apartment, the size of the area, the use of each floor, and so on.
Also, can you recommend any resources for learning how to scrape tables whose cells span multiple rows and columns using Python?
If you have a table apartment, you need a table floor related to apartment.
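In other words, turn the variable number of columns into a variable number of rows in a child table. A minimal sketch with sqlite3, using hypothetical table and column names:

import sqlite3

conn = sqlite3.connect("auctions.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS apartment (
    apartment_id INTEGER PRIMARY KEY,
    winning_bid  INTEGER,
    address      TEXT
);

-- One apartment can own any number of floor rows, so tables scraped with a
-- variable number of columns become a variable number of rows here.
CREATE TABLE IF NOT EXISTS floor (
    floor_id     INTEGER PRIMARY KEY,
    apartment_id INTEGER REFERENCES apartment(apartment_id),
    floor_number INTEGER,
    area         REAL,
    usage        TEXT
);
""")

Joining floor back to apartment reconstructs the original record; for scraping the HTML tables themselves, pandas.read_html is a common starting point.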

Selecting data from large MySQL database where value of one column is found in a large list of values

I generally use Pandas to extract data from MySQL into a dataframe. This works well and allows me to manipulate the data before analysis. This workflow works well for me.
I'm in a situation where I have a large MySQL database (multiple tables that will yield several million rows). I want to extract the data where one of the columns matches a value in a Pandas series. This series could be of variable length and may change frequently. How can I extract data from the MySQL database where one of the columns of data is found in the Pandas series? The two options I've explored are:
Extract all the data from MySQL into a Pandas dataframe (using pymysql, for example) and then keep only the rows I need (using df.isin()).
or
Query the MySQL database using a query with multiple WHERE ... OR ... OR statements (and load this into Pandas dataframe). This query could be generated using Python to join items of a list with ORs.
I guess both these methods would work but they both seem to have high overheads. Method 1 downloads a lot of unnecessary data (which could be slow and is, perhaps, a higher security risk) whilst method 2 downloads only the desired records but it requires an unwieldy query that contains potentially thousands of OR statements.
Is there a better alternative? If not, which of the two above would be preferred?
I am not familiar with pandas, but strictly from a database point of view you could insert your pandas values into a PANDA_VALUES table and then join that table with the table(s) you want to grab your data from.
Assuming you have some indexes in place on both the PANDA_VALUES table and the table with your column, the JOIN would be quite fast.
Of course you will have to have a process in place to keep the PANDA_VALUES table updated as the business needs change.
Hope it helps.
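A sketch of that approach with pandas and SQLAlchemy, using the hypothetical names panda_values, big_table and key_col:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@host/dbname")

values_series = pd.Series([101, 205, 333], name="key_col")  # the values to match

# Push the series into a small helper table, then let MySQL do the join.
values_series.to_frame().to_sql("panda_values", engine, if_exists="replace", index=False)

query = """
SELECT t.*
FROM big_table AS t
JOIN panda_values AS p ON t.key_col = p.key_col
"""
df = pd.read_sql(query, engine)

An index on big_table.key_col (and, for long lists, on panda_values.key_col) keeps the join fast, and only the matching rows ever leave the database.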
