How to create pairs from the table - python

I have a Hive table where each customer has one or more accounts, and the objective is to build intra-customer pairs of accounts.
A pair is formed when two accounts have the same year of birth or share the first 3 characters of their name, e.g. Sam and Samuel.
Ideally a self-pair such as AA or XX should not be created.
Also, AC and CA are the same pair, so only one entry of such pairs is needed. A pair can qualify on both the Name and the Year of Birth key, but in that case too only one entry is required (either one).
How should I approach this problem?
Test data for check -
create table customer_account(
customer INT NOT NULL,
accounts VARCHAR(100) NOT NULL,
name VARCHAR(40) NOT NULL,
yob INT
);
INSERT INTO
customer_account(customer,accounts,name,yob)
VALUES
(1,"A","John",2001),
(1,"X","Tom",1996),
(1,"C","Harry",2001),
(2,"D","Sam",1994),
(2,"F","Samuel",1995),
(3,"Z","Jake",)1994,
(3,"G","Drake",1998),
(3,"H","Arnold",1993),
(3,"K","Yang",1990)
;

You should be able to use substrings for your join in HiveQL. The logic should be sound, though you may need to tune it a bit for your needs.
What you're trying to do is a unary (or self) join. Below is an example of the kind of query you could run: you join the table to itself with an OR condition and test that condition with a CASE expression to label the "Pair_Key". I used an inner join, assuming you only want rows where matches occur.
SELECT
    t1.customer AS Customer1,
    t2.customer AS Customer2,
    t1.Accounts AS Accounts1,
    t2.Accounts AS Accounts2,
    CONCAT(t1.Accounts, t2.Accounts) AS Pair_No,
    t1.Name AS Name1,
    t2.Name AS Name2,
    t1.YOB AS YOB1,
    t2.YOB AS YOB2,
    CASE
        WHEN t1.YOB = t2.YOB THEN 'YOB'
        WHEN SUBSTR(t1.Name, 1, 3) = SUBSTR(t2.Name, 1, 3) THEN 'Name'
        ELSE 'Issue'
    END AS Pair_Key
FROM customer_account t1
INNER JOIN customer_account t2 -- instance 2 of the same table
    ON (SUBSTR(t1.Name, 1, 3) = SUBSTR(t2.Name, 1, 3) OR t1.YOB = t2.YOB)
Without test data or more details of how far along you are, this is a start.
If the customer number needs to be the same simply adjust to:
on (t1.Customer = t2.Customer) and (SUBSTR(t1.Name, 1, 3) = SUBSTR(t2.Name, 1, 3) OR t1.YOB = t2.YOB)
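If you also need the self-pair and mirror-pair rules from the question (no AA, and only one of AC/CA), one simple option, assuming the account keys are comparable strings, is to add an ordering condition to the join:

    AND t1.Accounts < t2.Accounts

The strict inequality keeps exactly one row per unordered pair and drops same-account matches.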

This does what you describe:
select t1.*, t2.name, t2.yob
from customer_account t1 join
     customer_account t2
     on t2.customer = t1.customer and
        (t2.yob = t1.yob or
         substr(t2.name, 1, 3) = substr(t1.name, 1, 3)
        ) and
        t2.accounts > t1.accounts;
There is no need to fetch customer twice. If you want "identical" pairs (such as AA) as well, change the last condition to >=.
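Since the question mentions Python, here is a minimal pandas sketch of the same pairing logic. The DataFrame construction and the expected output are mine, derived from the test data and the stated rules:

import pandas as pd
from itertools import combinations

# test rows from the question (yob stored as plain integer years)
df = pd.DataFrame(
    [(1, "A", "John", 2001), (1, "X", "Tom", 1996), (1, "C", "Harry", 2001),
     (2, "D", "Sam", 1994), (2, "F", "Samuel", 1995),
     (3, "Z", "Jake", 1994), (3, "G", "Drake", 1998),
     (3, "H", "Arnold", 1993), (3, "K", "Yang", 1990)],
    columns=["customer", "accounts", "name", "yob"])

pairs = []
for _, grp in df.groupby("customer"):
    rows = grp.to_dict("records")
    # combinations() yields each unordered pair exactly once, so
    # mirrored duplicates (AC/CA) and self-pairs (AA) never appear
    for a, b in combinations(rows, 2):
        if a["yob"] == b["yob"]:
            pairs.append((a["accounts"], b["accounts"], "YOB"))
        elif a["name"][:3] == b["name"][:3]:
            pairs.append((a["accounts"], b["accounts"], "Name"))

print(pairs)  # expected: [('A', 'C', 'YOB'), ('D', 'F', 'Name')]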

Related

Subtracting from two tables in mysql

I have two tables in MySQL that store amounts, say amounts0 and amounts1. Each table stores a name and an amount. In amounts0 we have "'James', 50" and "'Paul', 75". In the other table we have "'James', 25", "'Paul', 50" and "'James', 10". I want to produce a result like this in a text file:
What this does: it takes the amount from the first table and subtracts the sum of the matching amounts in the second table, grouped by name. For example: James: 50 - sum(25, 10) = 50 - 35 = 15.
Can someone help me create a query that does this with the 2 tables and then writes the results to the text file?
I think you can use this
SELECT t1.name,
       ABS(SUM(t1.amount) - t2.amount) AS amount
FROM amounts0 AS t1
JOIN (SELECT SUM(amount) AS amount,
             name AS t2name
      FROM amounts1
      GROUP BY name) AS t2
  ON t2.t2name = t1.name
GROUP BY t1.name, t2.amount
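The question also asks to write the result to a text file. A minimal Python sketch, assuming the mysql-connector-python package and placeholder connection details:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cur = conn.cursor()
cur.execute(query)  # the SELECT shown above, stored in a string
with open("amounts.txt", "w") as f:
    # each fetched row is a (name, amount) tuple
    for name, amount in cur:
        f.write("{} {}\n".format(name, amount))
cur.close()
conn.close()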

SQLite Find Name where datetime is between start and end

I have the following table in my database, which represents the shifts of a working day.
When a new product is added to another table 'Products' I want to assign a shift to it based on the start_timestamp.
So when I insert into Products, it should take start_timestamp, look in the table ProductionPlan for the row whose start and end timestamps enclose it, and return that shift's name (ProductionPlan.name).
That way I can assign a shift to the product.
I hope somebody can help me out with this!
Table ProductionPlan
name     start_timestamp      end_timestamp
shift 1  2021-05-10T07:00:00  2021-05-10T11:00:00
shift 2  2021-05-10T11:00:00  2021-05-10T15:00:00
shift 3  2021-05-10T15:00:00  2021-05-10T19:00:00
shift 1  2021-05-11T07:00:00  2021-05-11T11:00:00
shift 2  2021-05-11T11:00:00  2021-05-11T15:00:00
shift 3  2021-05-11T15:00:00  2021-05-11T19:00:00
Table Products
id  name     start_timestamp      end_timestamp        shift
1   Schroef  2021-05-10T08:09:05  2021-05-10T08:19:05
2   Bout     2021-05-10T08:20:08  2021-04-28T08:30:11
3   Schroef  2021-05-10T12:09:12  2021-04-28T12:30:15
I have the following code to insert into Products:
def insertNewProduct(self, log):
    """
    This function is used to insert a new product into the database.
    :param log: an object to log
    :return: the rowid of the inserted product
    """
    debug("Class: SQLite, function: insertNewProduct")
    self.__openDB()
    timestampStart = datetime.fromtimestamp(int(log.startTime)).isoformat()
    queryToExecute = "INSERT INTO Products (name, start_timestamp) VALUES('{0}','{1}')".format(log.summary,
                                                                                               timestampStart)
    self.cur.execute(queryToExecute)
    self.__closeDB()
    return self.cur.lastrowid
It's just a simple INSERT INTO but I want to add a query or even extend this query to fill in the column shift.
You can use a SELECT inside an INSERT.
queryToExecute = """INSERT INTO Products (name, start_timestamp, shift)
SELECT :1, :2, name FROM ProductionPlan pp
WHERE :2 BETWEEN pp.start_timestamp and pp.end_timestamp"""
self.cur.execute(queryToExecute, (log.summary, timestampStart))
In the code above I used a parameterized query (sqlite3's numbered ?NNN placeholders, bound from the tuple), because I hate interpolating parameters as strings inside a query: it has been the cause of too many SQL injection attacks...
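Folded back into the method from the question, the whole function might look like this (a sketch; __openDB/__closeDB and debug are assumed to behave as in the question, and lastrowid is read before the connection is closed):

def insertNewProduct(self, log):
    """
    Insert a new product and assign its shift in one statement.
    :param log: an object carrying summary and startTime
    :return: the rowid of the inserted product
    """
    debug("Class: SQLite, function: insertNewProduct")
    self.__openDB()
    timestampStart = datetime.fromtimestamp(int(log.startTime)).isoformat()
    queryToExecute = """INSERT INTO Products (name, start_timestamp, shift)
                        SELECT ?1, ?2, name FROM ProductionPlan pp
                        WHERE ?2 BETWEEN pp.start_timestamp AND pp.end_timestamp"""
    self.cur.execute(queryToExecute, (log.summary, timestampStart))
    rowid = self.cur.lastrowid  # grab this before closing the connection
    self.__closeDB()
    return rowid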

Need help to merge two SQL tables when one contains IDs and the second contains names associated with those IDs

I have 2 tables in MySQL.
One has transactions, where each row has a debit account ID and a credit account ID. The second table maps an account ID to an account name and an account number. I want an SQL query that takes the data from the transactions table and attaches the account name and account number from the second table.
I tried doing it with two queries: one to get the transactions, a second to get the account details, then iterating over the dataframe and assigning everything one by one, which doesn't seem like a good idea:
query = "SELECT tr_id, tr_date, description, dr_acc, cr_acc, amount, currency, currency_rate, document, comment FROM transactions WHERE " \
"company_id = {} {} and deleted = 0 {} LIMIT {}, {}".format(
company_id, filter, sort, sn, en)
df = ncon.getDF(query)
df.insert(4, 'dr_name', '')
df.insert(6, 'cr_name', '')
data = tuple(list(set(df['dr_acc'].values.tolist() + df['cr_acc'].values.tolist())))
query = "SELECT account_number, acc_id, account_name FROM tb_accounts WHERE company_id = {} and deleted = 0 and acc_id in {}".format(
company_id, data)
df_accs = ncon.getDF(query)
for index, row in df_accs.iterrows():
acc = str(row['acc_id'])
ac = row['account_number']
nm = row['account_name']
indx = df.index[df['dr_acc'] == acc].tolist()
df.at[indx, 'dr_acc'] = ac
df.at[indx, 'dr_name'] = nm
indx = df.index[df['cr_acc'] == acc].tolist()
df.at[indx, 'cr_acc'] = ac
df.at[indx, 'cr_name'] = nm
What you're looking for, I think, is a SQL JOIN statement.
Taking a crack at writing a query that might work based on your code:
query = '''
SELECT transactions.tr_id,
       transactions.tr_date,
       transactions.description,
       transactions.dr_acc,
       dr.account_name AS dr_name,
       transactions.cr_acc,
       cr.account_name AS cr_name,
       transactions.amount,
       transactions.currency,
       transactions.currency_rate,
       transactions.document,
       transactions.comment
FROM transactions
     INNER JOIN tb_accounts AS dr ON dr.acc_id = transactions.dr_acc
     INNER JOIN tb_accounts AS cr ON cr.acc_id = transactions.cr_acc
WHERE
    transactions.company_id = {} AND
    dr.company_id = {} AND
    cr.company_id = {} AND
    transactions.deleted = 0 AND
    dr.deleted = 0 AND
    cr.deleted = 0
ORDER BY transactions.tr_id
LIMIT 10;'''
The above query will, roughly, return one row per transaction, with the account name attached for both the debit and the credit account (each matched on acc_id, joining the accounts table twice under the aliases dr and cr).
NOTE: the query above will probably not have very good performance. SQL JOIN statements must be written with care, but I wrote it above in a way that's easy to understand, so as to illustrate the power of the JOIN.
You should, as a matter of habit, NEVER hand-roll this kind of row matching when you could use a join instead. As long as you take care to write a join properly so that it can be efficient, the MySQL engine will beat your Python code for performance almost every time.
Sort the two dataframes, then use merge to combine them:
df1 = df1.sort_values(['dr_acc'], ascending=True)
df2 = df2.sort_values(['acc_id'], ascending=True)
merge2df = pd.merge(df1, df2, how='outer',
                    left_on=['dr_acc'], right_on=['acc_id'])
I assumed df1 is the first query's data set and df2 is the second query's data set.
The SQL query:
'''SELECT tr_id, tr_date,
          description,
          dr_acc, cr_acc,
          amount, currency,
          currency_rate,
          document,
          comment,
          account_number, acc_id, account_name
   FROM transactions
   LEFT JOIN tb_accounts ON transactions.dr_acc = tb_accounts.acc_id'''
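Note that a single merge only resolves the debit side; to fill in both dr_name and cr_name, the accounts frame can be merged twice (a sketch, assuming df holds the transactions and df_accs the accounts, as in the question's code):

# resolve debit-side names
merged = df.merge(df_accs.rename(columns={'account_name': 'dr_name'}),
                  how='left', left_on='dr_acc', right_on='acc_id')
# resolve credit-side names against a second copy of the accounts frame;
# suffixes disambiguate the repeated acc_id/account_number columns
merged = merged.merge(df_accs.rename(columns={'account_name': 'cr_name'}),
                      how='left', left_on='cr_acc', right_on='acc_id',
                      suffixes=('_dr', '_cr'))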

Fast complex SQL queries on PostgreSQL database with Python

I have a large dataset with 50M+ records in a PostgreSQL database that requires massive calculations and an inner join.
Python is the tool of choice, with Psycopg2.
Running the process with fetchmany of 20,000 records at a time takes a couple of hours to finish.
The execution needs to take place sequentially: each record of the 50M needs to be fetched separately; then another query (the example below) needs to run before a result is returned and saved in a separate table.
Indexes are properly configured on each table (5 tables in total) and the complex query (which returns a calculated value; example below) takes around 240 ms to return results (when the database is not under load).
Celery is used to take care of database inserts of the calculated values in a separate table.
My question is about common strategies to reduce overall running time and produce results/calculations faster.
In other words, what is an effective way to go through all the records, one by one, calculate the value of a field via a second query then save the result.
UPDATE:
There is an important piece of information that I unintentionally missed mentioning while trying to obfuscate sensitive details. Sorry for that.
The original SELECT query calculates a value aggregated from different tables as follows:
SELECT CR.gg, (AX.b + BF.f)/CR.d AS calculated_field
FROM table_one CR
LEFT JOIN table_two AX ON AX.x = CR.x
LEFT JOIN table_three BF ON BF.x = CR.x
WHERE CR.gg = '123'
GROUP BY CR.gg;
PS: the SQL query was written by our experienced DBA, so I trust that it is optimised.
Don't loop over records, calling the DBMS repeatedly for every record.
Instead, let the DBMS process large chunks (preferably all) of the data,
and let it spit out all the results.
Below is a snippet of my twitter-sucker (with a rather complex, ugly query):
def fetch_referred_tweets(self):
    self.curs = self.conn.cursor()
    tups = ()
    selrefd = """SELECT twx.id, twx.in_reply_to_id, twx.seq, twx.created_at
                 FROM (
                     SELECT tw1.id, tw1.in_reply_to_id, tw1.seq, tw1.created_at
                     FROM tt_tweets tw1
                     WHERE 1=1
                       AND tw1.in_reply_to_id > 0
                       AND tw1.is_retweet = False
                       AND tw1.did_resolve = False
                       AND NOT EXISTS ( SELECT * FROM tweets nx
                                        WHERE nx.id = tw1.in_reply_to_id)
                       AND NOT EXISTS ( SELECT * FROM tt_tweets nx
                                        WHERE nx.id = tw1.in_reply_to_id)
                     UNION ALL
                     SELECT tw2.id, tw2.in_reply_to_id, tw2.seq, tw2.created_at
                     FROM tweets tw2
                     WHERE 1=1
                       AND tw2.in_reply_to_id > 0
                       AND tw2.is_retweet = False
                       AND tw2.did_resolve = False
                       AND NOT EXISTS ( SELECT * FROM tweets nx
                                        WHERE nx.id = tw2.in_reply_to_id)
                       AND NOT EXISTS ( SELECT * FROM tt_tweets nx
                                        WHERE nx.id = tw2.in_reply_to_id)
                     -- ORDER BY tw2.created_at DESC
                 ) twx
                 LIMIT %s;"""
    # -- AND tw.created_at < now() - '15 min':: interval
    # -- AND tw.created_at >= now() - '72 hour':: interval
    count = 0
    uniqs = 0
    self.curs.execute(selrefd, (quotum_referred_tweets, ))
    tups = self.curs.fetchmany(quotum_referred_tweets)
    for tup in tups:
        if tup is None:
            break
        print('%d -->> %d [seq=%d] datum=%s' % tup)
        self.resolve_list.append(tup[0])            # this tweet
        if tup[1] not in self.refetch_tweets:
            self.refetch_tweets[tup[1]] = [tup[0]]  # referred tweet
            uniqs += 1
        count += 1
    self.curs.close()
Note: your (original) query made no sense:
you only selected fields from the er table,
so the two LEFT JOINed tables could be omitted;
if ex and ef contain multiple matching rows, the result set could be larger than just all the rows selected from er, resulting in duplicated er records;
there is a GROUP BY present, but no aggregates in the select list.
select er.gg, er.z, er.y
from table_one er
where er.gg = '123'
-- or:
-- where er.gg >= '123'
--   and er.gg <= '456'
ORDER BY er.gg, er.z, er.y -- Or: some other ordering
;
Since you are doing a join in your query, the logical thing is to work around it by creating what's known as a summary table. This summary table, residing in the database, holds the final joined dataset, so your Python code just fetches/selects data from it.
Another way is to use a materialized view.
I took @wildplasser's advice and moved the calculation operation inside the database as a function.
The result has been impressively efficient, to say the least, and the total run time dropped to minutes/~1 hour.
To recap:
Database records are no longer fetched one by one in the sequence mentioned earlier.
Calculations happen inside the database, via a PostgreSQL function.
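For illustration, the per-record loop can usually be collapsed into a single set-based statement along these lines (a sketch with psycopg2; results_table and the DSN are placeholders, and the table/column names are the obfuscated ones from the question):

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
with conn, conn.cursor() as cur:
    # one INSERT ... SELECT instead of 50M separate round trips
    cur.execute("""
        INSERT INTO results_table (gg, calculated_field)
        SELECT CR.gg, (AX.b + BF.f) / CR.d
        FROM table_one CR
        LEFT JOIN table_two AX ON AX.x = CR.x
        LEFT JOIN table_three BF ON BF.x = CR.x
    """)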

peewee orm: bulk insert using a subquery but is based on python-side-data

peewee allows bulk inserts via insert_many() and insert_from(). insert_many() accepts a list of data to be inserted, but does not allow data computed from other parts of the database; insert_from() does allow data computed from other parts of the database, but does not allow any data to be sent from Python.
Example:
Assuming a model structure like so:
class BaseModel(Model):
    class Meta:
        database = db

class Person(BaseModel):
    name = CharField(max_length=100, unique=True)

class StatusUpdate(BaseModel):
    person = ForeignKeyField(Person, related_name='statuses')
    status = TextField()
    timestamp = DateTimeField(constraints=[SQL('DEFAULT CURRENT_TIMESTAMP')], index=True)
And some initial data:
Person.insert_many(rows=[{'name': 'Frank'}, {'name': 'Joe'}, {'name': 'Arnold'}]).execute()
print ('Person.select().count():',Person.select().count())
Output:
Person.select().count(): 3
Say we want to add a bunch new status updates, like the ones in this list:
new_status_updates = [ ('Frank', 'wat')
                     , ('Frank', 'nooo')
                     , ('Joe', 'noooo')
                     , ('Arnold', 'nooooo')]
We might try to use insert_many() like so:
StatusUpdate.insert_many( rows=[ {'person': 'Frank', 'status': 'wat'}
                               , {'person': 'Frank', 'status': 'nooo'}
                               , {'person': 'Joe', 'status': 'noooo'}
                               , {'person': 'Arnold', 'status': 'nooooo'}]).execute()
But this would fail: the person field expects a Person model or a Person.id, and we would have to make an extra query to retrieve those from the names.
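For illustration, that extra-query workaround would look something like this (a sketch):

# naive approach: one extra round trip to resolve names to Person ids
name_to_id = {p.name: p.id for p in Person.select()}
StatusUpdate.insert_many(
    rows=[{'person': name_to_id[name], 'status': status}
          for name, status in new_status_updates]).execute()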
We might be able to avoid this with insert_from(), which allows us to make subqueries, but insert_from() has no way of processing our lists or dictionaries. What to do?
One idea is to use the SQL VALUES clause as part of a SELECT statement.
If you are familiar with SQL, you may have seen the VALUES clause before, it is commonly used as part of an INSERT statement like so:
INSERT INTO statusupdate (person_id,status)
VALUES (1, 'my status'), (1, 'another status'), (2, 'his status');
This tells the database to insert three rows - AKA tuples - into the table statusupdate.
Another way of inserting things though is to do something like:
INSERT INTO statusupdate (person_id,status)
SELECT ..., ... FROM <elsewhere or subquery>;
This is equivalent to the insert_from() functionality that peewee provides.
But there is another less common thing you can do: you can use the VALUES clause in any select to provide literal values. Example:
SELECT *
FROM (VALUES (1,2,3), (4,5,6)) as my_literal_values;
This will return a result-set of two rows/tuples, each with 3 values.
So, if you can convert the "bulk" insert into a SELECT/FROM/VALUES statement, you can do whatever transformations you need (namely, convert Person.name values to the corresponding Person.id values) and then combine it with peewee's insert_from() functionality.
So let us see how this would look.
First let us begin constructing the VALUES clause itself. We want properly escaped values, so we will use question marks instead of the values for now, and put the actual values in later.
#this is gonna look like '(?,?), (?,?), (?,?)'
# or '(%s,%s), (%s,%s), (%s,%s)' depending on the database type
values_question_marks = ','.join(['(%s, %s)' % (db.interpolation,db.interpolation)]*len(new_status_updates))
The next step is to construct the values clause. Here is our first attempt:
--the %s here will be replaced by the question marks of the clause
--in postgres, you must have a name for every item in `FROM`
SELECT * FROM (VALUES %s) someanonymousname
OK, so now we have a result-set that looks like:
name | status
-----|-------
... | ...
Except! There are no column names. This will cause us a bit of heartache in a minute, so we have to figure out a way to give the result-set proper column names.
The postgres way would be to just alter the AS clause:
SELECT * FROM (VALUES %s) someanonymousname(name,status)
sqlite3 does not support that (grr).
So we are reduced to a kludge. Luckily stackoverflow provides: Is it possible to select sql server data using column ordinal position, and we can construct something like this:
SELECT NULL as name, NULL as status WHERE 1=0
UNION ALL
SELECT * FROM (VALUES %s) someanonymousname
This works by first creating an empty result-set with the proper column-names, and then concatenating the result-set from the VALUES clause to it. This will produce a result-set that has the proper column-names, will work in sqlite3, and in postgres.
Now to bring this back to peewee:
values_query = """
(
--a trick to make an empty query result with two named columns, to more portably name the resulting
--VALUES clause columns (grr sqlite)
SELECT NULL as name, NULL as status WHERE 1=0
UNION ALL
SELECT * FROM (VALUES %s) someanonymousname
)
"""
values_query %= (values_question_marks,)
#unroll the parameters into one large list
#this is gonna look like ['Frank', 'wat', 'Frank', 'nooo', 'Joe', 'noooo' ...]
values_query_params = [value for values in new_status_updates for value in values]
#turn it into peewee SQL
values_query = SQL(values_query,*values_query_params)
data_query = (Person
.select(Person.id, SQL('values_list.status').alias('status'))
.from_(Person,values_query.alias('values_list'))
.where(SQL('values_list.name') == Person.name))
insert_query = StatusUpdate.insert_from([StatusUpdate.person, StatusUpdate.status], data_query)
print (insert_query)
insert_query.execute()
print ('StatusUpdate.select().count():',StatusUpdate.select().count())
Output:
StatusUpdate.select().count(): 4
