I have this problem where I am using the hostnames of all the URLs I have in my dataset as features. I'm not able to figure out how to use TfidfVectorizer to extract hostnames only from the URLs and calculate their weights.
For instance, I have a dataframe df where the column 'url' has all the URLs I need. I thought I had to do something like:
from urllib.parse import urlparse
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(t):
    return urlparse(t).hostname

tfv = TfidfVectorizer(preprocessor=preprocess)
tfv.fit_transform([t for t in df['url']])
It doesn't seem to work this way, since it splits the hostnames instead of treating them as whole strings. I think it's to do with analyzer='word' (which it is by default), which splits the string into words.
Any help would be appreciated, thanks!
You are right. analyzer='word' creates a tokeniser that uses the default token pattern r'(?u)\b\w\w+\b'. If you want to tokenise the entire URL as a single token, you can change the token pattern:
vect = CountVectorizer(token_pattern=r'\S+')
This tokenises https://www.pythex.org hello hello.there as ['https://www.pythex.org', 'hello', 'hello.there']. You can then create an analyser to extract the hostname from URLs as shown in this question. You can either extend CountVectorizer to change its build_analyzer method or just monkey patch it:
def my_analyser(self):
    # magic is a function that extracts the hostname from a URL, among other things
    return lambda doc: magic(preprocess(self.decode(doc)))

vect = CountVectorizer(token_pattern=r'\S+')
vect.build_analyzer = my_analyser.__get__(vect)  # bind to the instance so `self` works
vect.fit_transform(...)
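Separately, to answer the original TfidfVectorizer question more directly, here is a minimal self-contained sketch (my own illustration, assuming each document is a single URL; on older scikit-learn versions use get_feature_names() instead of get_feature_names_out()):

# Minimal sketch: a callable analyzer that yields just the hostname of each URL,
# so each URL contributes exactly one feature.
from urllib.parse import urlparse
from sklearn.feature_extraction.text import TfidfVectorizer

def hostname_analyzer(doc):
    host = urlparse(doc).hostname
    return [host] if host else []

tfv = TfidfVectorizer(analyzer=hostname_analyzer)
X = tfv.fit_transform(["https://www.pythex.org/a",
                       "http://example.com/b",
                       "https://www.pythex.org/c"])
print(tfv.get_feature_names_out())  # ['example.com', 'www.pythex.org']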
Note: tokenisation is not as simple as it appears. The regex I've used has many limitations, e.g. it doesn't split the last token of a sentence and the first token of the next sentence if there isn't a space after the full stop. In general, regex tokenisers get very unwieldy very quickly. I recommend looking at nltk, which offers several different non-regex tokenisers.
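As a quick illustration of the nltk route (assuming nltk and its 'punkt' tokenizer data are installed):

from nltk.tokenize import word_tokenize  # needs nltk.download('punkt') once

print(word_tokenize("Hello there. How are you?"))
# ['Hello', 'there', '.', 'How', 'are', 'you', '?']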
This is my first time using spacy and I am trying to learn how to edit the tokenizer on one of the pretrained models (en_core_web_md) so that when tweets are tokenized, the entire hashtag becomes a single token (e.g. I want one token '#hashtagText', the default would be two tokens, '#' and 'hashtagText').
I know I am not the first person that has faced this issue. I have tried implementing the advice other places online but after using their methods the output remains the same (#hashtagText is two tokens). These articles show the methods I have tried.
https://the-fintech-guy.medium.com/spacy-handling-of-hashtags-and-dollartags-ed1e661f203c
https://towardsdatascience.com/pre-processing-should-extract-context-specific-features-4d01f6669a7e
Shown in the code below, my troubleshooting steps have been:
save the default pattern matching regex (default_token_matching_regex)
save the regex that nlp (the pretrained model) is using before any updates (nlp_token_matching_regex_pre_update)
Note: I originally suspected these would be the same, but they are not. See below for outputs.
Append the regex I need (#\w+) to the regex nlp is currently using, and save this combination as updated_token_matching_regex
Update the regex nlp is using with the variable created above (updated_token_matching_regex)
Save the new regex used by nlp to verify things were updated correctly (nlp_token_matching_regex_post_update).
See code below:
import spacy
import en_core_web_md
import re
nlp = en_core_web_md.load()
# Spacys default token matching regex.
default_token_matching_regex = spacy.tokenizer._get_regex_pattern(nlp.Defaults.token_match)
# Verify what regex nlp is using before changing anything.
nlp_token_matching_regex_pre_update = spacy.tokenizer._get_regex_pattern(nlp.tokenizer.token_match)
# Create a new regex that combines the default regex and a term to treat hashtags as a single token.
updated_token_matching_regex = f"({nlp_token_matching_regex_pre_update}|#\w+)"
# Update the token matching regex used by nlp with the regex created in the line above.
nlp.tokenizer.token_match = re.compile(updated_token_matching_regex).match
# Verify that nlp is now using the updated regex.
nlp_token_matching_regex_post_update = spacy.tokenizer._get_regex_pattern(nlp.tokenizer.token_match)
# Now let's try again
s = "2020 can't get any worse #ihate2020 #bestfriend <https://t.co>"
doc = nlp(s)
# Let's look at the lemmas and is stopword of each token
print(f"Token\t\tLemma\t\tStopword")
print("="*40)
for token in doc:
    print(f"{token}\t\t{token.lemma_}\t\t{token.is_stop}")
As you can see above, the tokenization behavior is not as it should be with the addition of '#\w+'. See below for printouts of all the troubleshooting variables.
Since I feel like I have proven to myself above that I did correctly update the regex nlp is using, the only possible issue I could think of is that the regex itself was wrong. I tested the regex by itself and it seems to behave as intended, see below:
Is anyone able to see the error that is causing nlp to tokenize #hashTagText as two tokens after its nlp.tokenizer.token_match regex was updated to do it as a single token?
Thank you!!
Not sure it is the best possible solution, but I did find a way to make it work. See below for what I did:
Spacy's documentation provides a chart (not reproduced here) showing the order in which the various rules are applied when tokenization is performed.
I was able to use the tokenizer.explain() method to see that the hashtags were being ripped off due to a prefix rule. Viewing the tokenizer.explain() output is as simple as running the code below, where "first_tweet" is any string.
tweet_doc = nlp.tokenizer.explain(first_tweet)
for token in tweet_doc:
    print(token)
Next, referencing that chart, we see that prefix rules are the first thing applied during the tokenization process.
This meant that even though I updated the token_match rules with a regular expression that allows for keeping "#Text" as a single token, it didn't matter because by the time the token_match rules were evaluated the prefix rule had already separated the '#' from the text.
Since this is a twitter project, I will never want "#" treated as a prefix. Therefore my solution was to remove "#" from the list of prefixes considered, this was accomplished with the code below:
default_prefixes = list(nlp.Defaults.prefixes)
default_prefixes.remove('#')
prefix_regex = spacy.util.compile_prefix_regex(default_prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search
and that's it! Hope this solution helps someone else.
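For completeness, a quick way to check the change is to tokenise the sample tweet from the question again (my own check; the single-token behaviour is what the steps above aim for):

doc = nlp("2020 can't get any worse #ihate2020 #bestfriend")
print([t.text for t in doc])
# '#ihate2020' and '#bestfriend' should now each appear as one token.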
Final thoughts:
Recently spacy was updated to version 3.0. I am curious whether prior versions of the spacy pretrained models did not include '#' in the prefix list. That is the only explanation I can come up with for why the code shown in the articles posted earlier no longer seems to work as intended. If anyone can explain in detail why my solution seems much more complicated than those posted in the articles I linked to earlier, I would certainly love to learn.
Cheers.
-Braden
The default token_match for English is None (as of v2.3.0, now that the URL pattern is in url_match), so you can just overwrite it with your new pattern:
import re
import spacy
nlp = spacy.blank("en")
nlp.tokenizer.token_match = re.compile(r"^#\w+$").match
assert [t.text for t in nlp("#asdf1234")] == ["#asdf1234"]
Your example in the question ends up with the pattern (None|#\w+), which isn't exactly what you want, but it seems to work fine for this given example with v2.3.5 and v3.0.5:
Token Lemma Stopword
========================================
2020 2020 False
ca ca True
n't n't True
get get True
any any True
worse bad False
#ihate2020 #ihate2020 False
#bestfriend #bestfriend False
< < False
https://t.co https://t.co False
> > False
Is it possible to use whoosh as a matcher without building an index?
My situation is that I have subscriptions pre-defined with strings, and documents coming through in a stream. I check each document matches the subscriptions and send them if so. I don't need to store the documents, or recall them later. Once they've been sent to the subscriptions, they can be discarded.
Currently just using simple matching, but as consumers ask for searches based on fields, and/or logic, etc, I'm wondering if it's possible to use a whoosh matcher and allow whoosh query syntax for this.
I could build an index for each document, query it, and then throw it away, but that seems very wasteful. Is it possible to directly construct a Matcher? I couldn't find any docs or questions online indicating a way to do this, and my attempts haven't worked.
Alternatively, is this just the wrong library for this task, and is there something better suited?
The short answer is no.
Search indices and matchers work quite differently. For example, when searching for the phrase "hello world", a matcher simply checks whether the document text contains the substring "hello world". A search index cannot work that way: it would have to scan every document, and that would be very slow.
As documents are added, every word in them is recorded in the index along with its position. So the entry for "hello" will say that document 1 matches at position 0, and the entry for "world" will say that document 1 matches at position 6. A search for "hello world" then looks up all document IDs under "hello", then all under "world", and checks whether any document has a "world" position 6 characters after its "hello" position.
So whoosh and a matcher take completely orthogonal approaches.
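To make the positional-index idea concrete, here is a toy sketch (not whoosh internals; it indexes word positions rather than character offsets, but the principle is the same):

from collections import defaultdict

# word -> {doc_id: [positions where the word occurs]}
index = defaultdict(lambda: defaultdict(list))

def add_document(doc_id, text):
    for pos, word in enumerate(text.lower().split()):
        index[word][doc_id].append(pos)

def phrase_search(first, second):
    # A document matches "first second" if `second` occurs right after `first`.
    hits = []
    for doc_id, positions in index[first].items():
        if any(pos + 1 in index[second].get(doc_id, []) for pos in positions):
            hits.append(doc_id)
    return hits

add_document(1, "hello world from whoosh")
add_document(2, "world says hello")
print(phrase_search("hello", "world"))  # [1]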
It is possible to do this with whoosh, using a new index for each document, like so:
from whoosh.filedb.filestore import RamStorage

# `schema`, `Document` and `Query` are assumed to be defined elsewhere.
def matches_subscription(doc: Document, q: Query) -> bool:
    with RamStorage() as store:
        ix = store.create_index(schema)
        writer = ix.writer()
        writer.add_document(
            title=doc.title,
            description=doc.description,
            keywords=doc.keywords
        )
        writer.commit()
        with ix.searcher() as searcher:
            results = searcher.search(q)
            return bool(results)
This takes about 800 milliseconds per check, which is quite slow.
A better solution is to build a parser with pyparsing, and then create your own nested query classes which can do the matching, better fitting your specific search queries. It's quite extensible that way, too. That can bring it down to ~40 microseconds, i.e. about 20,000 times faster.
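Here is a minimal sketch of that approach, assuming a tiny AND/OR query grammar; the class names (Term, And, Or) and the grammar are illustrative, not taken from a real implementation:

# Parse a small query language with pyparsing into nested query objects that can
# match plain text directly, with no index involved.
from dataclasses import dataclass
import pyparsing as pp

@dataclass
class Term:
    word: str
    def matches(self, text: str) -> bool:
        return self.word.lower() in text.lower()

@dataclass
class And:
    parts: list
    def matches(self, text: str) -> bool:
        return all(p.matches(text) for p in self.parts)

@dataclass
class Or:
    parts: list
    def matches(self, text: str) -> bool:
        return any(p.matches(text) for p in self.parts)

keyword = pp.CaselessKeyword("AND") | pp.CaselessKeyword("OR")
term = (~keyword + pp.Word(pp.alphanums)).setParseAction(lambda t: Term(t[0]))
query = pp.infixNotation(term, [
    (pp.CaselessKeyword("AND"), 2, pp.opAssoc.LEFT, lambda t: And(list(t[0])[0::2])),
    (pp.CaselessKeyword("OR"), 2, pp.opAssoc.LEFT, lambda t: Or(list(t[0])[0::2])),
])

q = query.parseString("hello AND (world OR there)", parseAll=True)[0]
print(q.matches("hello there"))  # True
print(q.matches("hello only"))   # False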
I'm setting up an integration between a Webflow store and Shippo to assist with creating labels and managing shipping. Webflow passes the address information as one huge object; however, to create a new order in Shippo, I need the information parsed and separated into individual line items. I have attempted to use Formatter, which allows one to extract text, split text, use regex to match data, and more.
import re
details = re.search(r'(?<=city:\s).*$', input_data['All Addresses'])
Regex in Python seems like my best option, yet the result will not find and/or display the data.
Any experts in Zapier integrations: I need assistance figuring out a way to parse the incoming data from Webflow and pass it to the 'create a order' action with Shippo.
Structure of Data:
addressee: string
city: string
country: string
more....
You can try this one:
Combine all the data in one whole string
import re
details = re.findall(r'(?<=city:\s).*$', all_addresses, re.MULTILINE)
return details
It will give you a list of all the matches in the text.
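Building on that, here is a hedged sketch of a 'Code by Zapier (Python)' step that pulls several fields at once; the 'All Addresses' key and the field list are assumptions taken from the structure in the question:

import re

all_addresses = input_data['All Addresses']  # whatever key you mapped in the Zap

fields = ['addressee', 'city', 'country']  # extend with the rest of the structure
parsed = {}
for field in fields:
    match = re.search(rf'{field}:\s*(.+)', all_addresses)
    parsed[field] = match.group(1).strip() if match else ''

return parsed  # each key becomes a separate output field for the Shippo action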
I am having data as follows,
data['url']
http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/e-f-g-h.gif https://www.aaa.com/
http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html
http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico
http://hostname.com/ddd/uploads/2013/11/r-t-y-u.aspx https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/
I want to find the formats such as .jpg, .gif, .png, .ico, .aspx, .html and .jpeg, and parse backwards from each one until a "/" is found. I also want to check for several occurrences throughout the string. My output should be,
data['parsed']
a-b-c-d
e-f-g-h
e-f-g-h a-a-a-a
w-e-r-t
r-t-y-u
t-r-w-q
Instead of writing individual commands for each of the formats, is there a way to write everything as a single command?
Can anybody help me write these commands? I am new to regex and any help would be appreciated.
This builds a list of (name, extension) pairs:
import re

results = []
for link in data['url']:
    matches = re.search(r'/(\w-\w-\w-\w)\.(\w{2,})\b', link)
    if matches:  # re.search only captures the first match in each line
        results.append((matches.group(1), matches.group(2)))
This pattern returns the file names. I have just used one of your urls to demonstrate; for more, you could simply append the matches to a list of results:
import re
url = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html"
p = r'((?:[a-z]-){3}[a-z]).'
matches = re.findall(p, url)
>>> print('\n'.join(matches))
e-f-g-h
a-a-a-a
There is the assumption that the urls all have the general form you provided.
You might try this:
data['parse'] = re.findall(r'[^/]+\.[a-z]+ ',data['url'])
That will pick out all of the file names with their extensions. If you want to remove the extensions, the code above returns a list which you can then process with list comprehension and re.sub like so:
[re.sub('\.[a-z]+$','',exp) for exp in data['parse']]
Use the .join function to create a string as demonstrated in Totem's answer
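Putting those pieces together, here is a hedged end-to-end sketch (assuming data is a pandas DataFrame with a 'url' column, as in the question):

import re
import pandas as pd

data = pd.DataFrame({'url': [
    "http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/",
    "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html",
]})

# Capture the segment between the last '/' and a known extension, for every URL
# in the string, then join multiple hits with a space.
pattern = r'/([^/]+)\.(?:jpg|jpeg|gif|png|ico|aspx|html)\b'
data['parsed'] = data['url'].apply(lambda u: ' '.join(re.findall(pattern, u)))
print(data['parsed'].tolist())  # ['a-b-c-d', 'e-f-g-h a-a-a-a']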
I am attempting to isolate TLDs from giant lists of FQDNs using regex, without importing 3rd-party modules, and am trying to determine if there is a more elegant way of doing this. My way works but is a bit cumbersome for my liking.
Sample code:
import re

domains = ['x.sample1.com', 'y.sample2.org', 'z.sample3.biz']
temp = []
for domain in domains:
    temp.append(re.findall(r'\.[a-z0-9]+', domain, re.I))
tlds = []
for item in temp:
    for tld in item:
        tlds.append(tld)
It is inconvenient that re.findall returns a list object, as it makes the iteration an entire level deeper than desired, but I am unsure how to get around this.
The "quick fix" is either to take the last item in each array:
domain.split('.')[-1]
Or, if you really don't care about the first matches, then don't capture them at all:
re.search(r'\.[a-z0-9]+$', domain, re.I)
(Note the use of $ to match the end of string.)
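Applied to the sample list from the question (a small sketch; every domain here is known to end in a TLD, so calling .group() directly is safe):

import re

domains = ['x.sample1.com', 'y.sample2.org', 'z.sample3.biz']
tlds = [re.search(r'\.[a-z0-9]+$', d, re.I).group() for d in domains]
print(tlds)  # ['.com', '.org', '.biz']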
HOWEVER, note that it's impossible to solve this problem properly with regex. For example, how can you know that the TLD for google.co.uk is co.uk, and not just uk?
The only full solution to this problem, unfortunately, is by using a library that implements the public suffix list - which is basically just a very long (manually updated) list of all TLDs. For example, in python: https://pypi.python.org/pypi/publicsuffix/
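To make the co.uk point concrete, here is a toy sketch in which a tiny hand-written set stands in for the real public suffix list (the real list is far longer and changes over time):

# Toy illustration of why a suffix *list* is needed rather than a regex.
KNOWN_SUFFIXES = {"com", "org", "biz", "uk", "co.uk"}

def tld_of(domain: str) -> str:
    labels = domain.lower().split(".")
    # Candidates shrink from the full name down, so the longest known suffix wins
    # and "co.uk" is preferred over plain "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            return candidate
    return labels[-1]

print(tld_of("x.sample1.com"))  # com
print(tld_of("google.co.uk"))   # co.uk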