Suppose I have a DataFrame with data like the below:
Name Id
Alex 123
John 222
Alex 123
Kendal 333
I want to add a column so that the result looks like this:
Name Id Subset Count
Alex 123 2
John 222 1
Alex 123 2
Kendal 333 1
I used the code below, but it didn't produce the expected output:
df['Subset Count'] = df.value_counts(subset=['Name','Id'])
Try it via groupby() with transform():
df['Subset Count']=df.groupby(['Name','Id'])['Name'].transform('count')
Or via droplevel() and map():
df['Subset Count']=df['Name'].map(df.value_counts(subset=['Name','Id']).droplevel(1))
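A minimal runnable sketch of the transform() approach, using the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Name': ['Alex', 'John', 'Alex', 'Kendal'],
                   'Id': [123, 222, 123, 333]})

# Count how many times each (Name, Id) pair occurs and broadcast
# that count back onto every row of the group
df['Subset Count'] = df.groupby(['Name', 'Id'])['Name'].transform('count')
print(df)
```

transform('count') returns a Series aligned with the original index, which is why it can be assigned directly as a new column, unlike a plain groupby().count() aggregation.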
for example, there is a column in a dataframe, 'ID'.
One of the entries is for example, '13245993, 3004992'
I only want to get '13245993'.
That also applies for every row in column 'ID'.
How to change the data in each row in column 'ID'?
You can try it like this: apply slicing to the ID column to get the required result. I am using 3 characters as the slice length here:
import pandas as pd
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
'ID':[90877, 10909, 12223, 12334]}
df=pd.DataFrame(data)
print('Before change')
print(df)
df["ID"] = df["ID"].apply(lambda x: str(x)[:3])
print('After change')
print(df)
Output:
Before change
Name ID
0 Tom 90877
1 nick 10909
2 krish 12223
3 jack 12334
After change
Name ID
0 Tom 908
1 nick 109
2 krish 122
3 jack 123
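If the goal is to keep everything before the comma rather than a fixed number of characters, str.split can be used instead. A sketch, assuming the IDs are stored as comma-separated strings as in the question:

```python
import pandas as pd

# Hypothetical data matching the question's '13245993, 3004992' format
df = pd.DataFrame({'ID': ['13245993, 3004992', '111, 222']})

# Split each value on the comma and keep only the first part
df['ID'] = df['ID'].str.split(',').str[0]
print(df)
#          ID
# 0  13245993
# 1       111
```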
You could do something like
data[data['ID'] == '13245993']
this will give you the rows where ID is 13245993.
I hope this answers your question; if not, please let me know.
I have two df's, one for user names and another for real names. I'd like to know how I can check if I have a real name in my first df using the data of the other, and then replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545', since these names appear in my df2. I tried using Series.str.contains, but I can only verify whether a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!
You can use loc[row, column] (here you can see the documentation about the loc method) together with the Series.str.contains method to select the usernames you need to replace with the real names. In my opinion, this solution is clear in terms of readability.
for real_name in df2['realName'].to_list():
    df1.loc[df1['userName'].str.contains(real_name), 'userName'] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary
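One caveat: str.contains does substring matching anywhere in the string, so a real name like 'peter' would also match a hypothetical username like 'mrpeterson'. If the usernames are known to begin with the real name, a sketch of a stricter variant uses str.startswith instead:

```python
import pandas as pd

df1 = pd.DataFrame({'userName': ['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName': ['alice', 'peter', 'john', 'francis', 'joe', 'carol']})

for real_name in df2['realName'].to_list():
    # Replace only usernames that start with the real name
    df1.loc[df1['userName'].str.startswith(real_name), 'userName'] = real_name
print(df1)
```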
I want to calculate Levenshtein distance for all rows of a single column of a Pandas DataFrame. I am getting MemoryError when I cross-join my DataFrame containing ~115,000 rows. In the end, I want to keep only those rows where Levenshtein distance is either 1 or 2. Is there an optimized way to do the same?
Here's my brute force approach:
import pandas as pd
from textdistance import levenshtein
# original df
df = pd.DataFrame({'Name':['John', 'Jon', 'Ron'], 'Phone':[123, 456, 789], 'State':['CA', 'GA', 'MA']})
# create another df containing all rows and a few columns needed for further checks
name = df['Name']
phone = df['Phone']
dic_ = {'Name_Match':name,'Phone_Match':phone}
df_match = pd.DataFrame(dic_, index=range(len(name)))
df['key'] = 1
df_match['key'] = 1
# cross join df containing all columns with another df containing some of its columns
df_merged = pd.merge(df, df_match, on='key').drop("key", axis=1)
# compute the Levenshtein distance between Name and Name_Match for every pair
df_merged['distance'] = df_merged.apply(lambda x: levenshtein.distance(x['Name'], x['Name_Match']), axis=1)
Original DataFrame:
Out[1]:
Name Phone State
0 John 123 CA
1 Jon 456 GA
2 Ron 789 MA
New DataFrame from original DataFrame:
df_match
Out[2]:
Name_Match Phone_Match
0 John 123
1 Jon 456
2 Ron 789
Cross-join:
df_merged
Out[3]:
Name Phone State Name_Match Phone_Match distance
0 John 123 CA John 123 0
1 John 123 CA Jon 456 1
2 John 123 CA Ron 789 2
3 Jon 456 GA John 123 1
4 Jon 456 GA Jon 456 0
5 Jon 456 GA Ron 789 1
6 Ron 789 MA John 123 2
7 Ron 789 MA Jon 456 1
8 Ron 789 MA Ron 789 0
Final output:
df_merged[df_merged.distance.isin([1, 2])]
Out[4]:
Name Phone State Name_Match Phone_Match distance
1 John 123 CA Jon 456 1
2 John 123 CA Ron 789 2
3 Jon 456 GA John 123 1
5 Jon 456 GA Ron 789 1
6 Ron 789 MA John 123 2
7 Ron 789 MA Jon 456 1
Your problem is not related to Levenshtein distance; your main problem is that you are running out of device memory (RAM) while performing the operations. You can check this using Task Manager on Windows, or the top or htop commands on Linux/macOS.
One solution would be to split your DataFrame into smaller partitions before starting the apply operation, run it on each partition, and delete the rows you don't need BEFORE processing the next partition.
If you are running it on the cloud, you can get a machine with more memory instead.
Bonus: I'd suggest you parallelize the apply operation using something like Pandarallel to make it way faster.
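A minimal sketch of that partitioning idea, using the sample data from the question. The chunk size is an arbitrary placeholder to tune to available memory, and a small self-contained Levenshtein function stands in for textdistance.levenshtein:

```python
import pandas as pd

def levenshtein(a, b):
    # Small stand-in for textdistance.levenshtein, to keep the sketch self-contained
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

df = pd.DataFrame({'Name': ['John', 'Jon', 'Ron'],
                   'Phone': [123, 456, 789],
                   'State': ['CA', 'GA', 'MA']})
df_match = pd.DataFrame({'Name_Match': df['Name'], 'Phone_Match': df['Phone']})

chunk_size = 1000  # tune to available memory; with ~115k rows this must stay small
kept = []
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    # cross-join only this chunk against df_match (pandas >= 1.2)
    merged = chunk.merge(df_match, how='cross')
    merged['distance'] = merged.apply(
        lambda x: levenshtein(x['Name'], x['Name_Match']), axis=1)
    # filter down BEFORE processing the next chunk, so memory can be released
    kept.append(merged[merged['distance'].isin([1, 2])])

result = pd.concat(kept, ignore_index=True)
print(result[['Name', 'Name_Match', 'distance']])
```

This way only one chunk-sized cross-join is ever held in memory at a time, instead of the full 115,000 × 115,000 product.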
I'm looking to combine multiple rows in a DataFrame into a single row based on one column.
This is what my df looks like:
id Name score
0 1234 jim 34
1 5678 james 45
2 4321 Macy 56
3 1234 Jim 78
4 5678 James 80
I want to combine based on column "score" so the output would look like:
id Name score
0 1234 jim 34,78
1 5678 james 45,80
2 4321 Macy 56
Basically I want to do the reverse of the explode function. How can I achieve this with a pandas DataFrame?
Try agg() with groupby():
out = df.groupby('id',as_index=False).agg({'Name':'first','score':lambda x : ','.join(x.astype(str))})
Out[29]:
id Name score
0 1234 jim 34,78
1 4321 Macy 56
2 5678 james 45,80
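Note that groupby sorts by the key by default, which is why 4321 appears before 5678 in the output above. If the first-seen order should be preserved instead, sort=False can be passed. A sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1234, 5678, 4321, 1234, 5678],
                   'Name': ['jim', 'james', 'Macy', 'Jim', 'James'],
                   'score': [34, 45, 56, 78, 80]})

# sort=False keeps groups in order of first appearance
out = df.groupby('id', as_index=False, sort=False).agg(
    {'Name': 'first', 'score': lambda x: ','.join(x.astype(str))})
print(out)
```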
I am new to MySQL and just getting started with some basic concepts. I have been trying to solve this for a while now; any help is appreciated.
I have a list of users with two phone numbers. I want to compare the two phone-number columns and generate a new row if the data differs between them; otherwise, retain the row and make no changes.
The processed data would look like the second table.
Is there any way to achieve this in MySQL?
I also don't mind doing the transformation in a DataFrame and then loading it into a table.
id username primary_phone landline
1 John 222 222
2 Michael 123 121
3 lucy 456 456
4 Anderson 900 901
Thanks!!!
Use DataFrame.melt, drop the variable column, and apply DataFrame.drop_duplicates:
df = (df.melt(['id','username'], value_name='phone')
.drop('variable', axis=1)
.drop_duplicates()
.sort_values('id'))
print (df)
id username phone
0 1 John 222
1 2 Michael 123
5 2 Michael 121
2 3 lucy 456
3 4 Anderson 900
7 4 Anderson 901