Hash Join Algorithm from MySQL in Python

Assume I want to write the following SQL Query:
SELECT Employee.Name, Employee.ID, InvitedToParty.Name, InvitedToParty.FavoriteFood
FROM Employee, InvitedToParty
WHERE Employee.Name = InvitedToParty.Name
given the following tables:
Employee
ID Name Birthday
1 Heiny 01.01.2000
2 Peter 10.10.1990
3 Sabrina 12.10.2015
.
.
InvitedToParty
Name FavoriteFood
Michael Pizza
Heiny Pizza
Sabrina Burger
George Pasta
.
.
.
Assume I have this information as two lists in Python inside a dictionary:
tables['Employee'].id = [1, 2, 3, ...]
tables['Employee'].Name = ['Heiny', 'Peter', 'Sabrina', ...]
I hope you get the idea. These keys of the dictionary have attributes, because I created a class for each table.
How can I write this query in Python? My initial idea was (pseudocode):
match_counter = 0
for i, value in enumerate(table1.column):
    for j in range(len(table2.column)):
        if table2.column[j] == value:
            table2.column[j], table2.column[i] = table2.column[i], table2.column[j]
            match_counter += 1
And then remove everything after the first 'match_counter' rows. But I am sure there must be a better way. Moreover, I do not even know whether this would give me the correct result.

The rough equivalent of your query is:
results = []
for row1 in table1:
    for row2 in table2:
        if row1.Name == row2.Name:
            # list.append takes a single argument, so the joined row is stored as one tuple
            results.append((row1.Name, row1.ID, row2.FavoriteFood))
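The nested loop compares every row of one table with every row of the other. Since the title asks about a hash join, here is a minimal sketch of that idea: index one table by the join key in a dictionary (build phase), then look up each row of the other table (probe phase). The row layout (a list of dicts per table) and the function name hash_join are assumptions made for illustration, not the asker's actual classes.

```python
from collections import defaultdict

def hash_join(employees, invited):
    """Join two lists of dict rows on 'Name' using a hash table (a dict)."""
    # Build phase: index one table by the join key.
    by_name = defaultdict(list)
    for row in invited:
        by_name[row["Name"]].append(row)

    # Probe phase: one dictionary lookup per row of the other table.
    results = []
    for emp in employees:
        for inv in by_name.get(emp["Name"], []):
            results.append((emp["Name"], emp["ID"], inv["Name"], inv["FavoriteFood"]))
    return results

# Sample rows taken from the tables in the question.
employees = [
    {"ID": 1, "Name": "Heiny"},
    {"ID": 2, "Name": "Peter"},
    {"ID": 3, "Name": "Sabrina"},
]
invited = [
    {"Name": "Michael", "FavoriteFood": "Pizza"},
    {"Name": "Heiny", "FavoriteFood": "Pizza"},
    {"Name": "Sabrina", "FavoriteFood": "Burger"},
    {"Name": "George", "FavoriteFood": "Pasta"},
]

joined = hash_join(employees, invited)
```

With n and m rows the work is roughly n + m dictionary operations plus the number of matches, instead of n * m comparisons.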

Related

How can I make pairs of the partner_name values in the given dataset?

user_id partner_name order_sequence
2 Star Bucks 1
2 KFC 2
2 MCD 3
6 Coffee Store 1
6 MCD 2
9 KFC 1
I am trying to figure out which two-restaurant combinations occur most often. For instance, the user with user_id 2 went to Star Bucks, KFC, and MCD, so I want a two-dimensional array that contains [[Star Bucks, KFC], [KFC, MCD]].
However, whenever the user_id changes, for instance between rows 3 and 4, the code should skip that combination.
Also, if a user has only one entry in the table, for instance the user with user_id 9, then this user should not be added to the list because they did not visit two or more restaurants.
The final result I am expecting for this table is:
[[Star Bucks, KFC], [KFC, MCD], [Coffee Store, MCD]]
I have written the following lines of code, but so far I have been unable to crack it.
Requesting help!
arr1 = []
arr2 = []
for idx,x in enumerate(df['order_sequence']):
    if x!=1:
        arr1.append(df['partner_name'][idx])
        arr1.append(df['partner_name'][idx+1])
        arr2.append(arr1)

You could try to use .groupby() and zip():
res = [
    pair
    for _, sdf in df.groupby("user_id")
    for pair in zip(sdf["partner_name"], sdf["partner_name"].iloc[1:])
]
Result for the sample dataframe:
[('Star Bucks', 'KFC'), ('KFC', 'MCD'), ('Coffee Store', 'MCD')]
Or try
res = (
    df.groupby("user_id")["partner_name"].agg(list)
    .map(lambda l: list(zip(l, l[1:])))
    .sum()
)
with the same result.
You may need to sort the dataframe first:
df = df.sort_values(["user_id", "order_sequence"])
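For comparison, the same group-then-zip idea can be sketched without pandas, using only the standard library. This is an illustrative alternative, not the answer's method: the rows are assumed to be plain (user_id, partner_name, order_sequence) tuples, and consecutive_pairs is a made-up helper name. A Counter is added because the question ultimately asks which combinations occur most often.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (user_id, partner_name, order_sequence)
rows = [
    (2, "Star Bucks", 1),
    (2, "KFC", 2),
    (2, "MCD", 3),
    (6, "Coffee Store", 1),
    (6, "MCD", 2),
    (9, "KFC", 1),
]

def consecutive_pairs(rows):
    """Return (previous, next) restaurant pairs within each user's visits."""
    # Sort by user and visit order so groupby sees each user as one run.
    ordered = sorted(rows, key=itemgetter(0, 2))
    pairs = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        names = [name for _, name, _ in group]
        # zip stops at the shorter input, so single-visit users yield no pair.
        pairs.extend(zip(names, names[1:]))
    return pairs

pairs = consecutive_pairs(rows)
pair_counts = Counter(pairs)  # pair_counts.most_common() ranks the combinations
```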

How to get difference in flask SQLalchemy queries then order the queries

I am still trying to figure out queries. I was able to write two queries and now I am trying to work out how to combine them. In models.py I have a class Ingredient(db.Model). In views.py I have the queries:
i0 = db.session.query(Ingredient).filter(Ingredient.ROrder.isnot(None),
    Ingredient.recipe_id==0).order_by(Ingredient.ROrder)
ir = db.session.query(Ingredient).filter(Ingredient.ROrder.isnot(None),
    Ingredient.recipe_id==recipe_id).order_by(Ingredient.ROrder)
There is another column, source_id. If a source_id is present in ir, then I don't want the row with that source_id from i0 in the result. I hope this makes sense.
Edit: the table looks like this.
source_id name Descriptor tv ROrder recipe_id
3110 Baking Powder 61 0
C100 Flour All-purpose 1781 362 0
C100 Flour Bread 1346 207 1
3120 Baking Soda 83 0
B121 Brown Sugar Light 542 41 1
B121 Brown Sugar Dark 97 18 0
All records have source_id and recipe_id.

from sqlalchemy import or_  # the SQLAlchemy OR helper is or_, not _or
combined = db.session.query(Ingredient).filter(Ingredient.ROrder.isnot(None)).\
    filter(or_(Ingredient.recipe_id==0, Ingredient.recipe_id==recipe_id)).\
    order_by(Ingredient.ROrder)
It sounds like you should be using a single query. If this is not the solution you are looking for, please explain further which rows you are trying to select.
EDIT:
I didn't run this locally, so it may contain syntax errors, but I would likely do something like this. Save all the rows into a dictionary keyed by source_id, and when you encounter a source_id that already exists, overwrite its values if the row's recipe_id equals the requested recipe_id.
from sqlalchemy import or_
combined_objects = db.session.query(Ingredient).filter(Ingredient.ROrder.isnot(None)).\
    filter(or_(Ingredient.recipe_id==0, Ingredient.recipe_id==recipe_id)).\
    order_by(Ingredient.ROrder).all()
data_dict = {}
source_id_list = []
for combined_object in combined_objects:
    if combined_object.source_id not in data_dict:
        # add to dictionary
        temp_dict = {'name': combined_object.name, 'Descriptor': combined_object.Descriptor,
                     'tv': combined_object.tv, 'ROrder': combined_object.ROrder, 'recipe_id': combined_object.recipe_id}
        data_dict[combined_object.source_id] = temp_dict
        # keep track of the order in which source_ids were first seen
        source_id_list.append(combined_object.source_id)
    # if the key is already in the dictionary, overwrite it only with the recipe-specific row
    else:
        if combined_object.recipe_id == recipe_id:
            # overwrite the default row in the dictionary
            temp_dict = {'name': combined_object.name, 'Descriptor': combined_object.Descriptor,
                         'tv': combined_object.tv, 'ROrder': combined_object.ROrder, 'recipe_id': combined_object.recipe_id}
            data_dict[combined_object.source_id] = temp_dict
# show result
for source_id in source_id_list:
    print(data_dict[source_id])
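The override rule itself can be tried out without a database. The following is a minimal sketch on plain dictionaries, assuming recipe_id 0 marks the default rows and using the four complete Flour and Brown Sugar rows from the question's table; merge_with_defaults is a hypothetical helper name, and the requested recipe_id of 1 is chosen only for the example.

```python
def merge_with_defaults(rows, recipe_id):
    """Keep one row per source_id, preferring the recipe-specific row over the default."""
    chosen = {}
    for row in rows:
        # Same filter as the query: ROrder must be set, recipe must be default or requested.
        if row["ROrder"] is None or row["recipe_id"] not in (0, recipe_id):
            continue
        # First row seen for a source_id is kept; a recipe-specific row replaces it.
        if row["source_id"] not in chosen or row["recipe_id"] == recipe_id:
            chosen[row["source_id"]] = row
    # Order the surviving rows by their own ROrder.
    return sorted(chosen.values(), key=lambda r: r["ROrder"])

rows = [
    {"source_id": "C100", "name": "Flour", "Descriptor": "All-purpose", "ROrder": 362, "recipe_id": 0},
    {"source_id": "C100", "name": "Flour", "Descriptor": "Bread", "ROrder": 207, "recipe_id": 1},
    {"source_id": "B121", "name": "Brown Sugar", "Descriptor": "Light", "ROrder": 41, "recipe_id": 1},
    {"source_id": "B121", "name": "Brown Sugar", "Descriptor": "Dark", "ROrder": 18, "recipe_id": 0},
]

result = merge_with_defaults(rows, recipe_id=1)
```

Unlike the loop above, which prints rows in the order their source_id was first seen, this sketch re-sorts by the ROrder of the row that was finally kept; which of the two orderings is wanted depends on the application.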

simply selecting a column in a specific row

OK, my frustration has hit epic proportions. I am new to pandas and trying to use it on an Excel database I have; however, I cannot seem to figure out what should be a VERY simple action.
I have a dataframe as such:
ID UID NAME STATE
1 123 Bob NY
1 123 Bob PA
2 124 Jim NY
2 124 Jim PA
3 125 Sue NY
All I need is to be able to locate and print the ID of a record by the unique combination of UID and STATE.
The closest I can come up with is this:
temp_db = fd_db.loc[(fd_db['UID'] == "1") & (fd_db['STATE'] == "NY")]
but this still grabs all rows for the UID and not ONLY the one with the given STATE.
Then, when I try to print the result
temp_db.ID.values
it prints this:
['1', '1']
I need just the data and not the structure.
My end result needs to be just this printed to the screen: 1
Any help is much appreciated.

I think it's because your UID condition is wrong: the UID column is an integer and you are comparing it with a string.
For example, when I run this:
df.loc[(df['UID'] == "123") & (df['STATE'] == 'NY')]
the output is:
Empty DataFrame
Columns: [ID, UID, NAME, STATE]
Index: []
but when I treat UID as an integer:
df.loc[(df['UID'] == 123) & (df['STATE'] == 'NY')]
the output is:
ID UID NAME STATE
0 1 123 Bob NY
I hope that helps!

fd_db.loc[(fd_db['UID'] == 123) & (fd_db['STATE'] == 'NY')]['ID'].iloc[0]

Use contains to merge data frame

I have two separate files, one from our service providers and one internal (HR).
The service providers write the names of our employees in different ways: some use firstname lastname, others the first letter of the first name plus the last name, or lastname firstname, while the HR file stores the first and last name separately.
DF1
Full Name
0 B.pitt
1 Mr Nickolson Jacl
2 Johnny, Deep
3 Streep Meryl
DF2
First Last
0 Brad Pitt
1 Jack Nicklson
2 Johnny Deep
3 Streep Meryl
My idea is to use str.contains to look for the first letter of the first name and the last name. I've succeeded in doing it with static values using the following code:
df1[['Full Name']][df1['Full Name'].str.contains('B')
    & df1['Full Name'].str.contains('pitt')]
Which gives the following result:
Full Name
0 B.pitt
The challenge is comparing the two datasets. Any advice on that, please?
Regards

If you are just checking whether a name exists, this could be useful.
Because it is rare for two people to have exactly the same family name, I recommend comparing family names first; to be sure, you can then compare first names too.
You can do it with a simple for loop:
for last_name in df2['Last']:
    # case-insensitive literal search for the HR family name in the provider names
    if df1['Full Name'].str.contains(last_name, case=False, regex=False).any():
        print("yes")
If you need to compare other aspects, just let me know.

I suggest using the nameparser module for parsing names:
pip install nameparser
Then you can process your data frames:
from nameparser import HumanName
import pandas as pd
df1 = pd.DataFrame({'Full Name':['B.pitt','Mr Nickolson Jack','Johnny, Deep','Streep Meryl']})
df2 = pd.DataFrame({'First':['Brad', 'Jack','Johnny', 'Streep'],'Last':['Pitt','Nicklson','Deep','Meryl']})
names1 = [HumanName(name) for name in df1['Full Name']]
# build "First Last" from the named columns instead of positional indexing
names2 = [HumanName(row['First'] + " " + row['Last']) for i, row in df2.iterrows()]
After that you can compare the HumanName instances, which have parsed fields. An instance looks like this:
<HumanName : [
title: ''
first: 'Brad'
middle: ''
last: 'Pitt'
suffix: ''
nickname: ''
]>
I have used this approach for processing thousands of names and merging them with the same names from other documents, and the results were good.
More about the module can be found at https://nameparser.readthedocs.io/en/latest/

Hey, you could use fuzzy string matching with fuzzywuzzy.
First create a Full Name column for df2:
from fuzzywuzzy import fuzz
df2_ = df2[['First', 'Last']].agg(lambda a: a['First'] + ' ' + a['Last'], axis=1).rename('Full Name').to_frame()
Then merge the two dataframes by index (this assumes both frames list the same people in the same order):
merged_df = df2_.merge(df1, left_index=True, right_index=True)
Now you can apply fuzz.token_sort_ratio to get the similarity:
merged_df['similarity'] = merged_df[['Full Name_x', 'Full Name_y']].apply(lambda r: fuzz.token_sort_ratio(*r), axis=1)
This results in the following dataframe. You can now filter or sort it by similarity.
Full Name_x Full Name_y similarity
0 Brad Pitt B.pitt 80
1 Jack Nicklson Mr Nickolson Jacl 80
2 Johnny Deep Johnny, Deep 100
3 Streep Meryl Streep Meryl 100
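If the two files are not in the same row order, each HR name has to be compared against every provider name instead of merging by index. Below is a rough sketch of that using the standard library's difflib in place of fuzzywuzzy; this is a swapped-in technique, its scores will not equal token_sort_ratio values, and the helper names normalize and best_match are made up for illustration.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, keep only letter runs, and sort the tokens (order-insensitive)."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))

def best_match(name, candidates):
    """Return the candidate whose normalized form is most similar to name."""
    key = normalize(name)
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, key, normalize(c)).ratio(),
    )

# Names as they appear in the question's DF1 and DF2.
provider_names = ["B.pitt", "Mr Nickolson Jacl", "Johnny, Deep", "Streep Meryl"]
hr_names = ["Brad Pitt", "Jack Nicklson", "Johnny Deep", "Streep Meryl"]

matches = {hr: best_match(hr, provider_names) for hr in hr_names}
```

In practice a minimum similarity threshold would be needed as well, so that an HR name with no real counterpart is not forced onto its least-bad candidate.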

Can mysql query be simplified ... or is there a need for a function, procedure?

I am making a small program in Python with web.py as the web front end. For that code I have the SQL query below. The query is needed for a web slideshow that shows names and titles of scale models. All bronze medals are to be shown on one page, all silver medals on one page, and each gold medal is shown separately.
As I have it now, every click on a next button in the browser gives me the next index of the SQL query result. Say index 0 from the SQL query is:
Name Surname Title Results Class_ID outcome
John Doe Test1 Bronze 1 John Doe with: Test1 Jane Doe with: Test1
And index 1 from the sql query is:
Name Surname Title Results Class_ID outcome
Jane Doe Test2 Silver 1 Jane Doe with: Test2 John Doe with: Test2
Model Test1 from John and Jane has bronze medals, and the rows are concatenated with GROUP_CONCAT to get all identical results from that particular class on one 'line'. The same goes for the silver medals. All gold medals are to be shown separately. I start with Class 1 and the bronze medals; the next index shows the silver medals of Class 1, after that the gold medals, and all of that repeats for the 31 classes I have.
The question is:
Can the query below be simplified? All I have now is a query for one class. In total there are 31 different classes, so the query should run over all 31 classes to see whether medals were awarded and, if so, concatenate the bronze medals, then concatenate the silver medals, and show the gold medals separately. I tried to write a function in MySQL but could not get it to work.
SELECT naw.Name, naw.Surname, model.Title, model.Results, model.Class_ID,
GROUP_CONCAT(naw.Name, ' ',naw.Surname, ' with: ', model.Title SEPARATOR ' ') AS outcome
FROM naw
LEFT OUTER JOIN model ON model.User_ID = naw.User_ID
WHERE model.Results = 'Bronze' AND Class_ID = '1'
UNION
SELECT naw.Name, naw.Surname, model.Title, model.Results, model.Class_ID,
GROUP_CONCAT(naw.Name, ' ',naw.Surname, ' with: ', model.Title SEPARATOR ' ') AS outcome
FROM naw
LEFT OUTER JOIN model ON model.User_ID = naw.User_ID
WHERE model.Results = 'Silver' AND Class_ID = '1'
UNION
SELECT naw.Name, naw.Surname, model.Title, model.Results, model.Class_ID, model.ID
FROM naw
LEFT OUTER JOIN model ON model.User_ID = naw.User_ID
WHERE model.Results = 'Gold' AND Class_ID = '1'
Please feel free to ask if something is not explained correctly. This is my first time asking for help here.
Thanks and kind regards,
UPDATE - Problem solved
I was looking in completely the wrong direction. Below is the solution to my question.
SELECT naw.*, model.*,
    CASE
        WHEN Results = "Gold" THEN 4 + naw.User_ID
        WHEN Results = "Silver" THEN 2
        WHEN Results = "Bronze" THEN 3
        ELSE 4
    END AS pageCategory,
    GROUP_CONCAT(naw.Name, ' ',naw.Surname, ' with: ', model.Title SEPARATOR '<br>') AS categories
FROM naw
INNER JOIN model ON model.User_ID = naw.User_ID
INNER JOIN classes ON model.Class_ID = classes.Class_ID
WHERE
    Results <> 'None'
GROUP BY
    pageCategory, classes.Class_ID
ORDER BY
    classes.Class_ID, FIELD(Results, 'Bronze', 'Silver', 'Gold')
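Since the slideshow itself is written in Python, the same paging rule can also be sketched on the application side: one page per class for bronze, one per class for silver, and one page per gold medal. This is only an illustration of the grouping idea, assuming rows arrive as plain dictionaries; build_pages is a hypothetical helper, and the two gold rows (Test3, Test4) are invented sample data.

```python
MEDAL_ORDER = {"Bronze": 0, "Silver": 1, "Gold": 2}

def build_pages(rows):
    """Group result rows into slideshow pages, ordered by class and medal."""
    # Drop rows without a medal (the query's Results <> 'None' filter).
    medal_rows = [r for r in rows if r["Results"] in MEDAL_ORDER]
    medal_rows.sort(key=lambda r: (r["Class_ID"], MEDAL_ORDER[r["Results"]]))

    pages = {}
    for row in medal_rows:
        line = f'{row["Name"]} {row["Surname"]} with: {row["Title"]}'
        if row["Results"] == "Gold":
            # Each gold medal gets its own page, so the user is part of the key.
            key = (row["Class_ID"], "Gold", row["User_ID"])
        else:
            key = (row["Class_ID"], row["Results"])
        pages.setdefault(key, []).append(line)
    # dicts preserve insertion order, so pages come out already sorted.
    return list(pages.values())

rows = [
    {"User_ID": 1, "Name": "John", "Surname": "Doe", "Title": "Test1", "Results": "Bronze", "Class_ID": 1},
    {"User_ID": 2, "Name": "Jane", "Surname": "Doe", "Title": "Test1", "Results": "Bronze", "Class_ID": 1},
    {"User_ID": 2, "Name": "Jane", "Surname": "Doe", "Title": "Test2", "Results": "Silver", "Class_ID": 1},
    {"User_ID": 1, "Name": "John", "Surname": "Doe", "Title": "Test2", "Results": "Silver", "Class_ID": 1},
    {"User_ID": 1, "Name": "John", "Surname": "Doe", "Title": "Test3", "Results": "Gold", "Class_ID": 1},  # invented row
    {"User_ID": 2, "Name": "Jane", "Surname": "Doe", "Title": "Test4", "Results": "Gold", "Class_ID": 1},  # invented row
]

pages = build_pages(rows)
```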
