I have two queries that will be run repeatedly to feed a report and some charts, so I need to make sure the process is tight. The first query has 25 columns and will return 25-50 rows from a massive table. The second query returns another 25 columns (a couple of matching columns) and 25-50 rows from another massive table.
The desired end result is a single document in which Query 1 (Problem) and Query 2 (Problem tasks) match on a common column (Problem ID), so that row 1 is the problem, rows 2-4 are its tasks, row 5 is the next problem, rows 6-9 are its tasks, etc. I realize I could do this manually by running the two queries and then combining them in Excel by hand, but I am looking for an elegant process that could be reused in my absence without too much overhead.
I was exploring inserts, UNION ALL, and CROSS JOIN, but the two queries have different columns that contain different critical data elements to be returned. I am also exploring setting up a Python job to do this by importing the CSVs and interleaving the results, but I am an early data science student and not yet much past creating charts from imported CSVs.
Any suggestions on how I might attack this challenge? Thanks for the help.
Picture of desired end result.
You can do it with something like
INSERT INTO target_table (<columns...>)
SELECT <your first query>
UNION
SELECT <your second query>
And then to retrieve data
SELECT * from target_table
WHERE <...>
ORDER BY problem_id, task_id
Just ensure both queries return the same columns, i.e. the columns you want to populate in target_table, probably using fixed default values (e.g. the first query may return a default task_id by including NULL AS task_id in its column list).
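For illustration, a rough sketch of how this could be wired up from Python for a repeatable Excel pull; the connection string, table names, and columns below are placeholders, not the actual queries:

import pandas as pd
import sqlalchemy

# hypothetical connection -- replace with your own database URL
engine = sqlalchemy.create_engine("dialect+driver://user:password@host/dbname")

sql = """
SELECT problem_id, NULL AS task_id, open_time AS date_opened, short_description
FROM   problem
UNION ALL
SELECT problem_id, task_id, date_opened, short_description
FROM   problem_task
ORDER BY problem_id, task_id  -- the problem row carries NULL as task_id; whether NULLs
                              -- sort before or after the tasks depends on your database
"""

df = pd.read_sql(sql, engine)
df.to_excel("problems_and_tasks.xlsx", index=False)  # ready to drop into Excel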
Thanks for the feedback @gimix. I ended up aliasing the columns I was able to match up from the two tables (open_time vs date_opened, etc.) so they all lined up, and selected '' for the values I needed to leave empty. I unioned the two SELECT statements as suggested, and then realized I can simply use my filtering queries twice as subqueries. It is now nice and quickly repeatable for pulling and dropping into Excel twice per week. Thank you!
When you have columns with two categories (Columns_A and Columns_B)
and you have two measures (Value1 and Value2) (from different tables, but that doesn't matter),
then normally the Table matrix shows it like this:
But what I need is to switch the columns with the values in the first two rows, like this:
In other words, I need the division into categories for every value.
All in one image (my dataset) :)
Do you have any ideas, please?
Maybe in Python? (I guess)
Thanks
Create a relationship between both tables on the Category column and then merge the two tables by following the steps in the screenshot (use a full outer join while merging),
and then perform an unpivot operation to see the result set below.
Now, in the Visualizations pane, select the Matrix visual and set it up as below.
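Since you asked about Python: the same merge + unpivot idea can be sketched in pandas, assuming two small hypothetical frames that each share a Category column as in the steps above:

import pandas as pd

# hypothetical sample data: one measure per table, joined on Category
table1 = pd.DataFrame({"Category": ["A", "B"], "Value1": [10, 20]})
table2 = pd.DataFrame({"Category": ["A", "C"], "Value2": [5, 7]})

# full outer join on Category (the "merge" step)
merged = table1.merge(table2, on="Category", how="outer")

# unpivot: Value1/Value2 become rows identified by a Measure column
unpivoted = merged.melt(id_vars="Category",
                        value_vars=["Value1", "Value2"],
                        var_name="Measure",
                        value_name="Value")

# pivot back into a matrix with the measures as rows and the categories as columns
matrix = unpivoted.pivot(index="Measure", columns="Category", values="Value")
print(matrix)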
I have a few (15) data frames. They contain values based on one map, but they are fragmentary.
The list of samples looks like: A1 - 3k records, A2 - 6k records, B1 - 12k records, B2 - 1k records, B3 - 3k records, C1... etc.
All files have the same format, and it looks like this:
name sample position position_ID
String1 String1 num1 num1
String2 String2 num2 num2
...
All files come from a variety of biological microarrays. Different companies have different matrices, hence the scatter in file sizes. But each of them is based on one common, complete database; just some of the data from the main database is selected in each file. Therefore, individual records can be repeated between files. I want to see if they are compatible.
What do I want to achieve in this task?
I want to check that all records with the same name have the same position and position_ID values in all files.
If a record with a given name differs in those values in any file, it must be written to error.csv.
If it is the same everywhere - to result.csv.
To be honest, I do not know how to approach this, so I am hoping someone here can point me in the right direction. I want to do it in Python.
I have two ideas.
Load all files into pandas as one data frame and try to write a function that filters the whole DF record by record (a for loop with if statements?).
Open all files separately with plain Python file reading and add unique rows to a new list; when the reading encounters the same record name again, check it against the previous one. If all the remaining values are the same, pass it without writing; if not, write the record to error.csv.
I am afraid, however, that these may not be the most efficient methods, hence my asking for advice and pointers to something better. I have read about NumPy; I have not studied it yet, but maybe it is worth using for this task? Maybe there is already a function for this that I do not know about?
Can someone suggest a more sensible (maybe easier) solution?
I think I have a rough idea of where you are going. This is how I would approach it:
import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
df1["filename"] = "file1.csv"   # remember which file each row came from
df2["filename"] = "file2.csv"

df_total = pd.concat([df1, df2], axis=0)  # stacks them vertically (axis=0, not axis=1)

# Drop rows that are identical in the data columns. The filename column has to be excluded
# here, otherwise the same record coming from two files never counts as a duplicate.
data_cols = [c for c in df_total.columns if c != "filename"]
df_total_no_dupes = df_total.drop_duplicates(subset=data_cols)

# this gives you the names that occur more than once, i.e. with conflicting values
name_counts = df_total_no_dupes.groupby("name").size().reset_index(name='counts')
names_which_appear_more_than_once = name_counts[name_counts["counts"] > 1]["name"].unique()
filter_condition = df_total_no_dupes["name"].isin(names_which_appear_more_than_once)

# this should be your dataframe where at least two rows share a name but differ in values
print(df_total_no_dupes[filter_condition].sort_values("name"))
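To get from there to the error.csv / result.csv split described in the question, one possible continuation (a sketch that reuses df_total_no_dupes and filter_condition from above; reading all 15 files instead of two is just a matter of looping over the file names, e.g. with glob.glob):

# rows whose name appears with more than one distinct set of values -> error.csv
df_total_no_dupes[filter_condition].sort_values("name").to_csv("error.csv", index=False)

# rows whose name is consistent everywhere -> result.csv
df_total_no_dupes[~filter_condition].sort_values("name").to_csv("result.csv", index=False)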
I need to make calculations on an XLS table so huge (~150 columns, ~1,000,000 rows) that Excel freezes. I've decided to migrate the data into an Oracle table.
The columns have all the basic data types: int, floats, chars, strings, dates
I have two options:
I create the table manually with the proper data types. This may be the practical solution in the short run, but it is tedious all the same.
For a long term solution (expecting many different types of excel tables in the future) I'd like to dynamically define the column types from the CSV.
I'm creating the table now with a default varchar type, using Python:
list_of_columns = parseFirstLine_CSV(r".\data.csv")  # my own helper that returns the CSV header fields
columns = ['"%s" varchar2(255)' % n for n in list_of_columns]
sql = "create table SomeTableName (%s)" % ",".join(columns)
cursor.execute(sql)
Let's look at this table, and let's assume that not every row is complete. The 'key' data are naturally available in the XLS; this is just an example.
ID Company Date Quality
144 Apple 2019.01.03 ""
"" IBM 2019.01.03 200
105591 Oracle 2019.01.03 9
10R91 Microsoft "" 113
10M99 "" 2019.01.03 3
1076a Walmart "" ""
10M95 Lorem Co. 2019.01.03 3
I will use Python, but that's not the point.
My theoretical question is: how do I determine the types if I'm not sure whether a given line in the CSV is complete for every column (so I can't just look at the second CSV line to get a list of types)? Should I iterate through the CSV rows until I have all the types, or is there a simpler algorithm for it?
I think it's cleaner to have the proper types than long varchars in my tables, so I can write cleaner queries against them.
EDIT: I will include the code in Python 3 after I'm done with implementation, I'm just interested in other opinions in the meantime.
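For what it's worth, one rough sketch of the "iterate through the rows" idea (not a finished implementation: the type rules are simplistic, the date format is taken from the sample rows, and the Oracle type mapping is an assumption). Note that stopping at the first non-empty value per column would misclassify mixed columns like the ID sample above (144 vs 10R91), so this sketch scans all rows and widens the guess instead:

import csv
from datetime import datetime

def guess_type(value):
    # classify a single non-empty cell
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y.%m.%d")  # date format taken from the sample rows
        return "date"
    except ValueError:
        return "str"

def widen(current, new):
    # combine the guess so far with the guess for a new cell
    if current is None or current == new:
        return new
    if {current, new} == {"int", "float"}:
        return "float"
    return "str"   # anything mixed with an incompatible type falls back to string

ORACLE_TYPES = {"int": "NUMBER(10)", "float": "NUMBER", "date": "DATE", "str": "VARCHAR2(255)"}

def infer_column_types(path):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        columns = next(reader)
        guesses = [None] * len(columns)
        for row in reader:
            for i, cell in enumerate(row):
                if cell.strip() in ("", '""'):
                    continue   # incomplete rows: skip empty cells entirely
                guesses[i] = widen(guesses[i], guess_type(cell))
        # a column that was empty in every row falls back to varchar
        return {c: ORACLE_TYPES[g or "str"] for c, g in zip(columns, guesses)}

col_types = infer_column_types(r".\data.csv")
sql = "create table SomeTableName (%s)" % ",".join('"%s" %s' % (n, t) for n, t in col_types.items())
print(sql)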
I'm trying to run VACUUM REINDEX for some huge tables in Redshift. When I run one of those vacuums in SQLWorkbenchJ, it never finishes and returns a "connection reset by peer" error after about 2 hours. The same thing happens in Python when I run the vacuums using something like this:
import sqlalchemy

conn_string = "postgresql+pg8000://%s:%s@%s:%d/%s" % (db_user, db_pass, host, port, schema)
conn = sqlalchemy.engine.create_engine(conn_string,
                                       execution_options={'autocommit': True},
                                       encoding='utf-8',
                                       connect_args={"keepalives": 1, "keepalives_idle": 60,
                                                     "keepalives_interval": 60},
                                       isolation_level="AUTOCOMMIT")
conn.execute(query)
Is there a way that either using Python or SQLWorkbenchJ I can run these queries? I expect them to last at least an hour each. Is this expected behavior?
Short Answer
You might need to add a mechanism to your Python script to retry when the reindexing fails, based on https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html:
If a VACUUM REINDEX operation terminates before it completes, the next VACUUM resumes the reindex operation before performing the full vacuum operation.
However...
A couple of things to note (I apologize if you already know this):
Tables in Redshift can have N sort keys (columns the data is sorted by), and Redshift supports only 2 sorting styles:
Compound: you are really sorting based on the first sort column, then on the second, ...
Interleaved: the table is sorted on all sort columns (https://en.wikipedia.org/wiki/Z-order_curve). Some people choose this style when they are not sure how the table will be used; however, it comes with a lot of issues of its own (more solid documentation here: https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-compound-and-interleaved-sort-keys/, where compound sorting is generally favored).
So how does this answer the question?
If your table uses compound sorting or no sorting at all, VACUUM REINDEX is not necessary at all; it brings no value.
If your table uses interleaved sorting, you first need to check whether you even need to re-index. Sample query:
SELECT tbl AS table_id,
(col + 1) AS column_num, -- Column in this view is zero indexed
interleaved_skew,
last_reindex
FROM svv_interleaved_columns
If the value of the skew is 1.0, you definitely don't need REINDEX.
Bringing it all together
You could have your Python script run the query listed in https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_INTERLEAVED_COLUMNS.html to find the tables that you need to re-index (maybe adding some business logic that works better for your situation, e.g. your own sort-skew threshold); see the sketch after this list
REINDEX takes the most restrictive type of lock, so try to schedule the script run during off hours if possible
Challenge the need for interleaved sorting and favor compound
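A minimal sketch of what such a script could look like; svv_interleaved_columns, svv_table_info, and VACUUM REINDEX are real Redshift features, but the connection string, skew threshold, and retry policy below are assumptions to adapt:

import time
import sqlalchemy

# hypothetical connection details; AUTOCOMMIT matters because VACUUM cannot run inside a transaction
engine = sqlalchemy.create_engine("postgresql+pg8000://user:password@host:5439/dbname",
                                  isolation_level="AUTOCOMMIT")

find_candidates = sqlalchemy.text("""
    SELECT DISTINCT ti."table" AS table_name
    FROM svv_interleaved_columns ic
    JOIN svv_table_info ti ON ti.table_id = ic.tbl
    WHERE ic.interleaved_skew > :threshold
""")

with engine.connect() as conn:
    tables = [row[0] for row in conn.execute(find_candidates, {"threshold": 1.4})]

    for table_name in tables:
        for attempt in range(3):      # an interrupted REINDEX resumes on the next run,
            try:                      # so retrying simply picks up where it stopped
                conn.execute(sqlalchemy.text('VACUUM REINDEX "%s"' % table_name))
                break
            except Exception as exc:
                print("Attempt %d on %s failed: %s" % (attempt + 1, table_name, exc))
                time.sleep(60)        # back off; a real script would also reconnect here
                                      # if the connection was dropped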
Hey all,
I have two databases. One has 145,000 rows and approx. 12 columns; the other has around 40,000 rows and 5 columns. I am trying to compare them based on the values of two columns. For example, if in CSV#1 column one says 100-199 and column two says Main St (meaning that this row is contained within the 100 block of Main Street), how would I go about comparing that with a similar pair of columns in CSV#2? I need to compare every row in CSV#1 to every single row in CSV#2. If there is a match, I need to append the 5 columns of each matching row to the end of the row of CSV#2; CSV#2's number of columns will therefore grow significantly and have repeat entries, and it doesn't matter how the columns are ordered. Any advice on how to compare two columns against another two columns in a separate database and then iterate across all rows? I've been using Python and the csv module so far for the rest of the work, but this part of the problem has me stumped.
Thanks in advance
-John
1. A CSV file is NOT a database. A CSV file is just rows of text chunks; a proper database (like PostgreSQL or MySQL or SQL Server or SQLite or many others) gives you proper data types, table joins, indexes, row iteration, proper handling of multiple matches, and many other things which you really don't want to rewrite from scratch.
2. How is the program supposed to know that Address("100-199") == Address("Main Street")? You will have to come up with some sort of knowledge base which transforms each bit of text into a canonical address or address range which you can then compare; see "Where is a good Address Parser", but be aware that it deals with single addresses (not address ranges).
Edit:
Thanks to Sven; if you were using a real database, you could do something like
SELECT
User.firstname, User.lastname, User.account, Order.placed, Order.fulfilled
FROM
User
INNER JOIN Order ON
User.streetnumber=Order.streetnumber
AND User.streetname=Order.streetname
if streetnumber and streetname are exact matches; otherwise you still need to consider point #2 above.
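Since you are already working in Python, here is a rough sketch of the "load into a real database, then join" idea using the built-in sqlite3 module; the file names, table names, and the streetnumber/streetname columns are hypothetical, and this still assumes the addresses have already been normalised into columns that match exactly (point #2 above):

import csv
import sqlite3

conn = sqlite3.connect(":memory:")   # or a file, e.g. "addresses.db"
cur = conn.cursor()

def load_csv(path, table):
    # create a typeless SQLite table from the CSV header and bulk-insert the rows
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join('"%s"' % h for h in header)
        placeholders = ", ".join("?" for _ in header)
        cur.execute('CREATE TABLE "%s" (%s)' % (table, cols))
        cur.executemany('INSERT INTO "%s" VALUES (%s)' % (table, placeholders), reader)

load_csv("csv1.csv", "csv1")   # ~145,000 rows, 12 columns
load_csv("csv2.csv", "csv2")   # ~40,000 rows, 5 columns

# index the join columns so the match is not a 145,000 x 40,000 scan
cur.execute('CREATE INDEX idx1 ON csv1 (streetnumber, streetname)')
cur.execute('CREATE INDEX idx2 ON csv2 (streetnumber, streetname)')

matches = cur.execute("""
    SELECT csv2.*, csv1.*
    FROM csv2
    INNER JOIN csv1
        ON  csv1.streetnumber = csv2.streetnumber
        AND csv1.streetname   = csv2.streetname
""")

with open("matched.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([d[0] for d in matches.description])   # header row
    writer.writerows(matches)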