KNN for name matching (unsupervised learning) - Python

I am working on a name matching problem using synthetic data such as
    alertname    custname
0   wlison       wilson
1   dais         said
2   4dams        adams
3   ad4ms        adams
4   ad48s        adams
5   smyth        smith
6   smythe       smith
7   gillan       gillan
8   gilen        gillan
9   scott-smith  scottsmith
10  scott smith  scottsmith
11  perrson      person
12  persson      person
Now I want to apply KNN to this task in an unsupervised way, since I do not have any explicit labels. I want to output a matching score for each row. I have already used fuzzy matching; now I just want to explore KNN for some automation. I would really appreciate it if someone could provide a starting point. That said, we do not have external labels here.
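One possible starting point (a sketch, assuming scikit-learn is available: vectorize the names as character n-grams with TfidfVectorizer, index the customer names with NearestNeighbors, and report one minus the cosine distance as a matching score — that score definition is one arbitrary choice among many; only the first few rows of the data are reproduced here):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

df = pd.DataFrame({
    "alertname": ["wlison", "dais", "4dams", "ad4ms", "smyth"],
    "custname":  ["wilson", "said", "adams", "adams", "smith"],
})

# Character 2-3-grams are robust to typos and transpositions.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vec.fit_transform(df["custname"])          # index the reference names

nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(X)
dist, idx = nn.kneighbors(vec.transform(df["alertname"]))

df["match"] = df["custname"].values[idx[:, 0]]
df["score"] = 1 - dist[:, 0]                   # cosine distance -> similarity
print(df)

Character n-grams make the vectors tolerant of the typos and transpositions in the alert names; n_neighbors can be raised above 1 if you want candidate lists rather than a single best match.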

Related

How to count paragraphs in each article in a dataframe?

I want to count the paragraphs in each article of a dataframe. However, my result ends up with zeros inside the list. Does anybody know how to fix it? Thank you so much.
Here is my code:
def count_paragraphs(df):
    paragraph_count = []
    linecount = 0
    for i in df.text:
        if i in ('\n', '\r\n'):
            if linecount == 0:
                paragraphcount = paragraphcount + 1
    return paragraph_count

count_paragraphs(df)
df.text
0 On Saturday, September 17 at 8:30 pm EST, an e...
1 Story highlights "This, though, is certain: to...
2 Critical Counties is a CNN series exploring 11...
3 McCain Criticized Trump for Arpaio’s Pardon… S...
4 Story highlights Obams reaffirms US commitment...
5 Obama weighs in on the debate\n\nPresident Bar...
6 Story highlights Ted Cruz refused to endorse T...
7 Last week I wrote an article titled “Donald Tr...
8 Story highlights Trump has 45%, Clinton 42% an...
9 Less than a day after protests over the police...
10 I woke up this morning to find a variation of ...
11 Thanks in part to the declassification of Defe...
12 The Democrats are using an intimidation tactic...
13 Dolly Kyle has written a scathing “tell all” b...
14 The Haitians in the audience have some newswor...
15 The man arrested Monday in connection with the...
16 Back when the news first broke about the pay-t...
17 Chicago Environmentalist Scumbags\n\nLeftists ...
18 Well THAT’S Weird. If the Birther movement is ...
19 Former President Bill Clinton and his Clinton ...
Name: text, dtype: object
Use Series.str.count:
def count_paragraphs(df):
    return df.text.str.count(r'\n\n').tolist()

count_paragraphs(df)
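Since Series.str.count interprets its pattern as a regular expression by default, a variation (my suggestion, not part of the original answer) can cover Windows-style line endings as well:

def count_paragraphs(df):
    # (?:\r\n|\n){2} matches a blank line under both Unix and Windows conventions
    return df.text.str.count(r'(?:\r\n|\n){2}').tolist()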
This is my answer, and it works!
def count_paragraphs(df):
    paragraph_count = []
    for i in range(len(df)):
        paragraph_count.append(df.text[i].count('\n\n'))
    return paragraph_count

count_paragraphs(df)

How is DIFF calculated on customer demographics in featuretools?

I have two tables: customer information and transaction info.
Customer information includes each person's quality of health (from 0 to 100)
e.g. if I extract just the Name and HealthQuality columns:
John: 70
Mary: 20
Paul: 40
etc etc.
After applying featuretools I noticed a new DIFF(HealthQuality) variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary and John's health quality in this instance?
I don't think this kind of feature synthesis really works for customer records e.g. CUM_SUM(emails_sent) for John. John's record is one row, and he has one value for the amount of emails we sent him.
For now I am using the ignore_variables=[all_customer_info] option to remove all of the customer data (except for the transactions table, of course).
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount)). What is the DIFF measured in this instance?
   id  MEAN(transactions.amount)  DIFF(MEAN(transactions.amount))
0   1                   21.950000                             NaN
1   2                   20.000000                       -1.950000
2   3                   35.604581                       15.604581
3   4                         NaN                             NaN
4   5                   22.782682                             NaN
5   6                   35.616306                       12.833624
6   7                   24.560536                      -11.055771
7   8                  331.316552                      306.756016
8   9                   60.565852                     -270.750700
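As far as I can tell, DIFF here behaves like pandas Series.diff() run down the rows of the feature matrix, i.e. each customer's value minus the previous customer's value. A quick reconstruction of the first rows above (a sketch, not featuretools itself):

import pandas as pd

# MEAN(transactions.amount) for ids 1-3, copied from the table above
mean_amount = pd.Series([21.950000, 20.000000, 35.604581], index=[1, 2, 3])
print(mean_amount.diff())
# 1          NaN
# 2    -1.950000
# 3    15.604581

On the customer table, the same logic would indeed difference Mary's health quality against John's, which supports ignoring those variables.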

Pandas read_html returned column with NaN values in Python

I am trying to parse the table located here using the Pandas read_html function. I was able to parse the table; however, the Capacity column returned with NaN values. I am not sure what the reason could be. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url='Above url'
df1=pd.read_html(wiki_url,index_col=0)
Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
.............................
.............................
To remove anything inside square brackets, use:
df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Hope this helps.
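A possible follow-up (my addition, not part of the original answer), if you also want Capacity as a number rather than a string: strip the footnote markers and thousands separators, then convert.

import pandas as pd

df["Capacity"] = (
    df["Capacity"].astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)   # drop footnote markers like [1]
    .str.replace(",", "", regex=False)         # drop thousands separators
    .pipe(pd.to_numeric, errors="coerce")      # non-numeric leftovers become NaN
)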
Pandas is only able to get the superscript (for whatever reason) rather than the actual value; if you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (where they have footnotes) and NaN otherwise.
You may want to look into alternative ways of fetching the data, or scrape it yourself using BeautifulSoup, since Pandas is picking up, and therefore returning, the wrong data.
The answer posted by #anky_91 was correct. I wanted to try another approach without using regex. Below is my solution without regex.
df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df4 = df4[0]
The solution was to take out the "r" prefix presented by #anky_91 in line 1 and line 4.
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object

Text analytics / text prediction approach in Python

This is my customer dataset
ID  Address                                             Location  Age
1   room no5, rose apt, hill street, nagpur-"500249"    Nagpur    38
2   block C-4, kirti park, Thane-"400356"               Thane     26
3   Dream villa, noor nagar, Colaba-"400008"                      46
5   Sita Apt, Room no 101, Kalyan-"421509"              Kalyan    55
7   Rose Apt, room no 20, Antop hill, Mumbai-"400007"   Mumbai    43
8   Dirk Villa, nouis park, Dadar-"400079"              Mumbai    50
9   Raj apt, room no-2, powai, "400076"                           33
As in the above case, I have Address column values and corresponding values for the Location column. Given the above training data (which has both Address and Location values present) and test data (which consists only of Address values), I want to predict which Address falls under which Location.
For this problem I want to know which approach in Python would suit best. Here are some references I came across, but I do not know which to follow: link_1, link_2.
Any suggestion will be much appreciated. Thanks.
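One candidate approach (a sketch, assuming scikit-learn: treat Location as a class label and Address as free text, and fit a simple text-classification pipeline; the sample rows are paraphrased from the table above):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = pd.DataFrame({
    "Address":  ["room no5, rose apt, hill street, nagpur",
                 "block C-4, kirti park, thane"],
    "Location": ["Nagpur", "Thane"],
})

# Character n-grams cope with the noisy, abbreviation-heavy address text.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(train["Address"], train["Location"])
print(model.predict(["rose apt, room no 20, antop hill"]))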

Reshaping Pandas dataframe for plotting

I have what is (to me at least) a complicated dataframe I'm trying to reshape so that I can more easily create visualizations. The data is from a survey, and each row is a complete survey with 247 columns. These columns are split by what sort of data they contain: some of it is identifying information (who took the survey, what product the survey is on, what the scores were on particular questions, and what comments they had about that particular product). Here is a simplification of the dataframe:
id Evaluator item Mar1 Mar1[Comments] Comf1 Comf1[Com..
1 001 11 3 "asf adfsfs.." 3 "text.."
2 001 14 2 "asf adfsfs.." 4 "text.."
3 002 11 4 "asf adfsfs.." 2 "text.."
4 002 14 3 "asf adfsfs.." 3 "text.."
5 002 34 0 "asf adfsfs.." 1 "text.."
6 003 11 2 "asf adfsfs.." 0 "text.."
....
It continues on from here, but in this case 'Mar1' and 'Comf1' are rated questions. I have another datatable that helps describe all the question and question types within the survey so I can perform data selections like the following...
df[df['ItemNum']==11][(qtable[(qtable['type'].str.contains("OtoU")==True)]).id]
Which pulls from qtable all the 'types' of 'OtoU' (all the rating questions) for the ItemNum 11. This is all well and good and gets me something like this...
Mar1 Mar2 Comf1 Comf2 Comf3 Interop1 Interop2 .....
1 2 3 1 3 4 4
2 3 3 2 4 2 2
2 1 1 4 4 1 2
1 3 2 2 2 1 1
3 4 1 2 3 3 3
I can't really do much with it in that form (at least I don't think I can). What I 'think' I need to do is flatten it out into a form that goes more like
Item Question Score Section Evaluator ...
11 Mar1 3 Maritime 001 ...
11 Comf1 2 Comfort 001 ...
11 Comf2 3 Comfort 001 ...
14 Mar1 1 Maritime 001 ...
But I'll be damned if I know how to do that. I tried to do it (the wrong way, I'm pretty sure) by iterating through the dataframe, but I quickly realized that it took quite some time to do, and the resulting data was of questionable integrity.
So, (very) long story short: how do I go about doing this sort of transform through the power of pandas? I would like to do a number of plots, including box plots by question for each 'item', as well as factorplots broken out by 'section', and multi-line charts plotting the mean of each question by item... if that helps you better understand where I am trying to go with this. Sorry for the long post; I just wanted to make sure I supplied enough information to get a solid answer.
Thanks,
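A sketch of the reshape with pandas.melt, using the simplified columns above (the Section mapping from question prefixes is an assumption for illustration):

import pandas as pd

df = pd.DataFrame({
    "Evaluator": ["001", "001", "002"],
    "item":      [11, 14, 11],
    "Mar1":      [3, 2, 4],
    "Comf1":     [3, 4, 2],
})

long = df.melt(id_vars=["Evaluator", "item"],
               value_vars=["Mar1", "Comf1"],
               var_name="Question", value_name="Score")

# Derive Section from the question prefix (assumed mapping, for illustration).
section_map = {"Mar": "Maritime", "Comf": "Comfort"}
long["Section"] = long["Question"].str.extract(r"^([A-Za-z]+)\d")[0].map(section_map)
print(long)

From this long form, the box plots and factorplots map directly onto the Question, Score, and Section columns.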
