I am new to MySQL and just getting started with some basic concepts. I have been trying to solve this for a while now; any help is appreciated.
I have a list of users with two phone numbers. I want to compare the two phone-number columns and generate a new row if the values differ; otherwise retain the row and make no changes.
The processed data should look like the second table below.
Is there any way to achieve this in MySQL?
I also don't mind doing the transformation in a DataFrame and then loading it into a table.
id username primary_phone landline
1 John 222 222
2 Michael 123 121
3 lucy 456 456
4 Anderson 900 901
Expected output (the second table):
id username phone
1 John 222
2 Michael 123
2 Michael 121
3 lucy 456
4 Anderson 900
4 Anderson 901
Thanks!!!
Use DataFrame.melt, drop the helper variable column, and remove duplicates with DataFrame.drop_duplicates:
df = (df.melt(['id','username'], value_name='phone')
.drop('variable', axis=1)
.drop_duplicates()
.sort_values('id'))
print(df)
id username phone
0 1 John 222
1 2 Michael 123
5 2 Michael 121
2 3 lucy 456
3 4 Anderson 900
7 4 Anderson 901
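Since you mentioned you don't mind transforming in a DataFrame and loading the result into a table, here is a minimal sketch of that last step with pandas.DataFrame.to_sql and SQLAlchemy; the connection string and the table name users_phones are placeholders, not something from your setup:
from sqlalchemy import create_engine

# placeholder credentials - swap in your own user, password, host and database
engine = create_engine('mysql+pymysql://user:password@localhost/dbname')
# write the transformed frame to MySQL (users_phones is a hypothetical table name)
df.to_sql('users_phones', con=engine, index=False, if_exists='replace')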
If I have a data frame like below:
Name Id
Alex 123
John 222
Alex 123
Kendal 333
I want to add a column holding the count of each (Name, Id) pair, like this:
Name Id Subset Count
Alex 123 2
John 222 1
Alex 123 2
Kendal 333 1
I used the code below but didn't get the expected output:
df['Subset Count'] = df.value_counts(subset=['Name','Id'])
Try groupby() with transform():
df['Subset Count']=df.groupby(['Name','Id'])['Name'].transform('count')
Or via droplevel() and map():
df['Subset Count']=df['Name'].map(df.value_counts(subset=['Name','Id']).droplevel(1))
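For reference, a small self-contained run of the first variant on the sample data above (the droplevel()/map() variant produces the same column):
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'John', 'Alex', 'Kendal'],
                   'Id': [123, 222, 123, 333]})
# transform('count') broadcasts each group's size back onto its rows
df['Subset Count'] = df.groupby(['Name', 'Id'])['Name'].transform('count')
print(df)
     Name   Id  Subset Count
0    Alex  123             2
1    John  222             1
2    Alex  123             2
3  Kendal  333             1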
I want to calculate the Levenshtein distance between all pairs of values in a single column of a Pandas DataFrame. I am getting a MemoryError when I cross-join my DataFrame containing ~115,000 rows. In the end, I want to keep only those rows where the Levenshtein distance is either 1 or 2. Is there an optimized way to do this?
Here's my brute force approach:
import pandas as pd
from textdistance import levenshtein
# original df
df = pd.DataFrame({'Name':['John', 'Jon', 'Ron'], 'Phone':[123, 456, 789], 'State':['CA', 'GA', 'MA']})
# create another df containing all rows and a few columns needed for further checks
name = df['Name']
phone = df['Phone']
dic_ = {'Name_Match':name,'Phone_Match':phone}
df_match = pd.DataFrame(dic_, index=range(len(name)))
df['key'] = 1
df_match['key'] = 1
# cross join df containing all columns with another df containing some of its columns
df_merged = pd.merge(df, df_match, on='key').drop(columns='key')
# compute the Levenshtein distance for every pair of names (the filtering happens below)
df_merged['distance'] = df_merged.apply(lambda x: levenshtein.distance(x['Name'], x['Name_Match']), axis=1)
Original DataFrame:
Out[1]:
Name Phone State
0 John 123 CA
1 Jon 456 GA
2 Ron 789 MA
New DataFrame from original DataFrame:
df_match
Out[2]:
Name_Match Phone_Match
0 John 123
1 Jon 456
2 Ron 789
Cross-join:
df_merged
Out[3]:
Name Phone State Name_Match Phone_Match distance
0 John 123 CA John 123 0
1 John 123 CA Jon 456 1
2 John 123 CA Ron 789 2
3 Jon 456 GA John 123 1
4 Jon 456 GA Jon 456 0
5 Jon 456 GA Ron 789 1
6 Ron 789 MA John 123 2
7 Ron 789 MA Jon 456 1
8 Ron 789 MA Ron 789 0
Final output:
df_merged[df_merged.distance.isin([1, 2])]
Out[4]:
Name Phone State Name_Match Phone_Match distance
1 John 123 CA Jon 456 1
2 John 123 CA Ron 789 2
3 Jon 456 GA John 123 1
5 Jon 456 GA Ron 789 1
6 Ron 789 MA John 123 2
7 Ron 789 MA Jon 456 1
Your problem is not related to the Levenshtein distance; your main problem is that you are running out of device memory (RAM) while doing the operations (you can check this with Task Manager on Windows or the top/htop commands on Linux/macOS).
One solution is to partition your dataframe into smaller pieces, run the apply operation on one partition at a time, and delete the rows you don't need BEFORE processing the next partition, as sketched below.
If you are running this in the cloud, you can get a machine with more memory instead.
Bonus: I'd suggest parallelizing the apply operation with something like Pandarallel to make it much faster.
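A minimal sketch of that partitioned approach, using the sample df from the question; chunk_size is a made-up number you would tune to your RAM:
import pandas as pd
from textdistance import levenshtein

def close_matches(df, chunk_size=10_000):
    # lookup frame with just the columns needed for the comparison
    names = df[['Name', 'Phone']].rename(
        columns={'Name': 'Name_Match', 'Phone': 'Phone_Match'})
    kept = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        merged = chunk.merge(names, how='cross')  # needs pandas >= 1.2
        merged['distance'] = merged.apply(
            lambda r: levenshtein.distance(r['Name'], r['Name_Match']), axis=1)
        # keep only distance 1 or 2 before touching the next chunk,
        # so peak memory stays bounded by one chunk's cross-join
        kept.append(merged[merged['distance'].isin([1, 2])])
    return pd.concat(kept, ignore_index=True)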
I'm looking to combine multiple rows in a dataframe into a single row based on one column.
This is what my df looks like:
id Name score
0 1234 jim 34
1 5678 james 45
2 4321 Macy 56
3 1234 Jim 78
4 5678 James 80
I want to combine based on column "score" so the output would look like:
id Name score
0 1234 jim 34,78
1 5678 james 45,80
2 4321 Macy 56
Basically I want to do the reverse of the explode function. How can I achieve this with a pandas DataFrame?
Try agg with groupby:
out = df.groupby('id', as_index=False).agg({'Name': 'first', 'score': lambda x: ','.join(x.astype(str))})
Out[29]:
id Name score
0 1234 jim 34,78
1 4321 Macy 56
2 5678 james 45,80
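Note that groupby sorts by the grouping key by default, which is why the output order differs from the input; if first-appearance order matters, pass sort=False:
out = df.groupby('id', as_index=False, sort=False).agg(
    {'Name': 'first', 'score': lambda x: ','.join(x.astype(str))})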
TL;DR:
I have two dataframes of different sizes that share an 'id' column, which is supposed to act as the index. I need to merge them, group by 'sector' and 'gender', and count the entries in each group.
Long version:
I have a dataframe with 'id' and 'sector', among other columns, for company personnel, and another dataframe with 'id' and 'gender'. Examples below:
df1:
row* id sector other columns
1 0 Operational ...
2 0 Administrative ...
3 1 Sales ...
4 2 IT ...
5 3 Operational ...
6 3 IT ...
7 4 Sales ...
[...]
150 100 Operational ...
151 100 Sales ...
152 101 IT ...
*I don't really have a 'row' column; it's there just to make the problem easier to follow.
df2:
row* id gender
1 0 Male
2 1 Female
3 2 Female
4 3 Male
5 4 Male
[...]
101 100 Male
102 101 Female
As you can see, one person can be in more than one sector (which seems to make my problem more complicated).
I need to merge them together and then count how many males and females are in each sector.
FIRST PROBLEM
I decided to make a new df with only the columns 'id' and 'sector'.
df3 = df1[['id','sector']]
df3 = df3.merge(df2)
I get:
No common columns to perform merge on. Merge options: left_on=None,
right_on=None, left_index=False, right_index=False
Tried using .join() instead of .merge() and I get:
['id'] not in index"
Then I tried reset_index(), found in some of the answers around here, but it didn't really solve my issue.
df1 = df1.reset_index()
df3 = df1[['id','sector']]
df3 = df3.join(df2)
What I got was this:
row* id sector gender
1 0 Operational Male
2 0 Administrative Female
3 1 Sales Female
4 2 IT Male
5 3 Operational Male
6 3 IT ...
7 4 Sales ...
[...]
150 100 Operational NaN
151 100 Sales NaN
152 101 IT NaN
It didn't respect the 'id' and just concatenated the column to the side. Since df2 only has 102 rows, I got NaN in the remaining rows (103 to 152), aside from the fact that the 'gender' values were no longer accurate.
SECOND PROBLEM
I decided to power through that in order to get the rest of the work done, and tried this:
df3 = df3.groupby('sector','gender').size()
It raises:
No axis named gender for object type < class 'pandas.core.frame.DataFrame'>
That doesn't really make sense to me, because I can call df3.gender and get the (entire) expected series. If I remove 'gender' from the line above, it actually groups, but that alone doesn't work for me. I also tried passing the column names before groupby, to no avail.
Expected result should be something like this:
sector gender sum
operational male 20
operational female 5
administrative male 10
administrative female 17
sales male 12
sales female 13
IT male 1
IT female 11
Not sure if I can answer my own question, but I think I should since the issue is resolved.
The solutions were very simple, even though I don't understand some of the errors I got.
First problem: added on='id' to the merge
df3 = df1[['id','sector']].merge(df2, on='id')
Second problem: just missing a list, as pointed out by #DYZ
df3.groupby(['sector','gender']).size()
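For completeness, a minimal end-to-end sketch of the working version, with small made-up frames matching the examples above:
import pandas as pd

df1 = pd.DataFrame({'id': [0, 0, 1, 2, 3, 3, 4],
                    'sector': ['Operational', 'Administrative', 'Sales',
                               'IT', 'Operational', 'IT', 'Sales']})
df2 = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                    'gender': ['Male', 'Female', 'Female', 'Male', 'Male']})

# one row per (person, sector), with the gender attached via the shared 'id'
df3 = df1[['id', 'sector']].merge(df2, on='id')
# count entries per (sector, gender) pair
counts = df3.groupby(['sector', 'gender']).size().reset_index(name='sum')
print(counts)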
Feeling quite stupid right now... Must be tired. Thanks DYZ and sorry for the trouble.
This question already has answers here: replace column values in one dataframe by values of another dataframe (5 answers). Closed 5 years ago.
So i got my dataframe (df1):
Number Name Gender Hobby
122 John Male -
123 Patrick Male -
124 Rudy Male -
I want to add data to Hobby based on the Number column, assuming I've got my lists of hobbies, keyed by number, in different dataframes. For example (df2):
Number Hobby
124 Soccer
... ...
... ...
and df3 :
Number Hobby
122 Basketball
... ...
... ...
How can I achieve this dataframe:
Number Name Gender Hobby
122 John Male Basketball
123 Patrick Male -
124 Rudy Male Soccer
So far I've already tried the following solution:
Select rows from a DataFrame based on values in a column in pandas
but it only selects some data. How can I update the 'Hobby' column?
Thanks in advance.
You can use map; merge and join would also achieve it. Mapping from df2 alone:
df1['Hobby'] = df1['Number'].map(df2.set_index('Number')['Hobby'])
df1
Out[155]:
Number Name Gender Hobby
0 122 John Male NaN
1 123 Patrick Male NaN
2 124 Rudy Male Soccer
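Since the hobbies are split across df2 and df3, one way to get the exact table from the question is to concatenate the lookup frames first and fill the unmatched rows back with '-' (a sketch, assuming Number is unique across both lookups):
import pandas as pd

lookup = pd.concat([df2, df3]).set_index('Number')['Hobby']
df1['Hobby'] = df1['Number'].map(lookup).fillna('-')
print(df1)
   Number     Name Gender       Hobby
0     122     John   Male  Basketball
1     123  Patrick   Male           -
2     124     Rudy   Male      Soccer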