Keywords:
Keywords={u'secondary': [u'sales growth', u'next generation store', u'Steps Down', u' Profit warning', u'Store Of The Future', u'groceries']}
Paragraph:
paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""
Is there any way to match the keywords in the paragraph (without using regex)?
Output:
Matched keywords: next generation store, groceries
No need to use NLTK for this. First of all you will have to clean the text in the paragraph, or change the values in the list for the 'secondary' key: '"next generation" store' and 'next generation store' are two different things.
After this you can iterate over the values of 'secondary', and check if any of those strings exist in your text.
match = [i for i in Keywords['secondary'] if i in paragraph]
EDIT: As I specified above, '"next generation" store' and 'next generation store' are two different strings, which is why you only get one match. If the paragraph contained 'next generation store' (without the quotation marks around "next generation"), you would get two matches - as the examples below and the cleaning sketch after them show.
INPUT:
paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""
OUTPUT:
['groceries']
INPUT:
paragraph="""HOUSTON -- Target has unveiled its first next generation store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""
OUTPUT:
['next generation store','groceries']
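A sketch of that cleaning step (strip the punctuation from the paragraph before the membership test, so '"next generation" store' becomes 'next generation store'):
import string
cleaned = ''.join(ch for ch in paragraph if ch not in string.punctuation)
match = [k for k in Keywords['secondary'] if k in cleaned]
print(match)  # ['next generation store', 'groceries']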
Firstly, you don't really need a dict if your keywords have only one key. Use a set() instead.
Keywords={u'secondary': [u'sales growth', u'next generation store',
u'Steps Down', u' Profit warning',
u'Store Of The Future', u'groceries']}
keywords = {u'sales growth', u'next generation store',
u'Steps Down', u' Profit warning',
u'Store Of The Future', u'groceries'}
paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""
Then a minor tweak of the approach from "Find multi-word terms in a tokenized text in Python":
from nltk.tokenize import MWETokenizer
from nltk import sent_tokenize, word_tokenize
mwe = MWETokenizer([k.lower().split() for k in keywords], separator='_')
# Clean out the punctuations in your sentence.
import string
puncts = list(string.punctuation)
cleaned_paragraph = ''.join([ch if ch not in puncts else '' for ch in paragraph.lower()])
tokenized_paragraph = [token for token in mwe.tokenize(word_tokenize(cleaned_paragraph))
                       if token.replace('_', ' ') in {k.lower() for k in keywords}]  # compare lowercased, like the tokens
print(tokenized_paragraph)
[out]:
>>> print(tokenized_paragraph)
['next_generation_store', 'groceries']
I am new to web scraping. I am trying to scrape the "When purchased online" text on Target's site, but I did not find it in the HTML.
Does anyone know how to locate the element in the HTML? Any help is appreciated. Thanks!
Product Url:
https://www.target.com/c/allergy-sinus-medicines-treatments-health/-/N-4y5ny?Nao=144
https://www.target.com/p/genexa-dextromethorphan-kids-39-cough-and-chest-congestion-suppressant-4-fl-oz/-/A-80130848#lnk=sametab
I have no idea which element you want to get, but the API sends JSON data, not HTML, and you can simply convert it to a dictionary/list and use keys/indexes to get the value.
You have to find the correct keys in the JSON data manually.
Or you can write a small script to search the JSON (using for-loops and recursion), as sketched below.
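For example, a small helper along these lines (a rough sketch, not part of the original code) walks the parsed JSON recursively and prints the path to every occurrence of a key:
# Sketch: recursively search a parsed JSON structure (nested dicts/lists)
# and print the path to every occurrence of `key`.
def find_key(data, key, path=''):
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                print(path + '/' + k, '->', v)
            find_key(v, key, path + '/' + k)
    elif isinstance(data, list):
        for i, item in enumerate(data):
            find_key(item, key, path + '/' + str(i))

# usage, after data = response.json() in the code below:
# find_key(data, 'current_retail')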
Minimal working code (I found the keys manually):
import requests
url = 'https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130848&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA' # JSON
response = requests.get(url)
data = response.json()
product = data['data']['product']
print('price:', product['price']['current_retail'])
print('title:', product['item']['product_description']['title'])
print('description:', product['item']['product_description']['downstream_description'])
print('------------')
for bullet in product['item']['product_description']['bullet_descriptions']:
    print(bullet)
print('------------')
print(product['item']['product_description']['soft_bullets']['title'])
for bullet in product['item']['product_description']['soft_bullets']['bullets']:
    print('-', bullet)
print('------------')
for attribute in product['item']['wellness_merchandise_attributes']:
    print('-', attribute['value_name'])
    print(' ', attribute['wellness_description'])
Result:
price: 13.99
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
description: Genexa Kids’ Cough & Chest Congestion is real medicine, made clean - a powerful cough suppressant and expectorant that helps control cough, relieves chest congestion and helps thin and loosen mucus. This liquid, non-drowsy medicine has the same active ingredients you need (dextromethorphan HBr and guaifenesin), but without the artificial ones you don’t (dyes, common allergens, parabens). We only use ingredients people deserve to make the first gluten-free, non-GMO, certified vegan medicines to help your little ones feel better. <br /><br />Genexa is the first clean medicine company. Founded by two dads who believe in putting People Over Everything, Genexa makes medicine with the same active ingredients people need, but without the artificial ones they don’t. It’s real medicine, made clean.
------------
<B>Suggested Age:</B> 4 Years and Up
<B>Product Form:</B> Liquid
<B>Primary Active Ingredient:</B> Dextromethorphan
<B>Package Quantity:</B> 1
<B>Net weight:</B> 4 fl oz (US)
------------
highlights
- This is an age restricted item and will require us to take a quick peek at your ID upon pick-up
- Helps relieve kids’ chest congestion and makes coughs more productive by thinning and loosening mucus
- Non-drowsy so your little ones (ages 4+) can get back to playing
- Our medicine is junk-free, with no artificial sweeteners or preservatives, no dyes, no parabens, and no common allergens
- Certified gluten-free, vegan, and non-GMO
- Flavored with real organic blueberries
- Gentle on little tummies
------------
- Dye-Free
A product that either makes an unqualified on-pack statement indicating that it does not contain dye, or carries an unqualified on-pack statement such as "no dyes" or "dye-free."
- Gluten Free
A product that has an unqualified independent third-party certification, or carries an on-pack statement relating to the finished product being gluten-free.
- Non-GMO
A product that has an independent third-party certification, or carries an unqualified on-pack statement relating to the final product being made without genetically engineered ingredients.
- Vegan
A product that carries an unqualified independent, third-party certification, or carries on-pack statement relating to the product being 100% vegan.
- HSA/FSA Eligible
Restrictions apply; contact your insurance provider about plan allowances and requirements
EDIT:
Information "When purchased online" (or "at Cedar Rapids South") are in different url.
For example
Product url:
https://www.target.com/p/genexa-kids-39-diphenhydramine-allergy-liquid-medicine-organic-agave-4-fl-oz/-/A-80130847
API product data:
https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130847&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA
API "at Cedar Rapids South":
https://redsky.target.com/redsky_aggregations/v1/web_platform/product_fulfillment_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&is_bot=false&tcin=80130847&store_id=1771&zip=52404&state=IA&latitude=41.9831&longitude=-91.6686&scheduled_delivery_store_id=1771&required_store_id=1771&has_required_store_id=true
But in some situations it probably uses other information from the product data to decide whether to show "When purchased online" instead of "at Cedar Rapids South" - and this logic may be hardcoded in the JavaScript. For example, the product which displays "When purchased online" has formatted_price "$13.99", but the product which displays "at Cedar Rapids South" has formatted_price "See price in cart".
import requests
url = 'https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&brand_id=q643lel65ir&channel=WEB&count=24&default_purchasability_filter=true&offset=0&page=%2Fb%2Fq643lel65ir&platform=desktop&pricing_store_id=1771&store_ids=1771%2C1768%2C1113%2C3374%2C1792&useragent=Mozilla%2F5.0+%28X11%3B+Linux+x86_64%3B+rv%3A101.0%29+Gecko%2F20100101+Firefox%2F101.0&visitor_id=01819D268B380201B177CA755BCE70CC' # JSON
response = requests.get(url)
data = response.json()
for product in data['data']['search']['products']:
    print('title:', product['item']['product_description']['title'])
    print('price:', product['price']['current_retail'])
    print('formatted:', product['price']['formatted_current_price'])
    print('---')
Result:
title: Genexa Kids' Diphenhydramine Allergy Liquid Medicine - Organic Agave - 4 fl oz
price: 7.99
formatted: See price in cart
---
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
price: 13.99
formatted: $13.99
---
Check the following text piece:
IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
R/CRIMINAL APPEAL NO. 251 of 2009
FOR APPROVAL AND SIGNATURE:
HONOURABLE MR.JUSTICE R.P.DHOLARIA
==========================================================
1 Whether Reporters of Local Papers may be allowed to see the judgment ?
2 To be referred to the Reporter or not ?
3 Whether their Lordships wish to see the fair copy of the judgment ?
4 Whether this case involves a substantial question of law as to the interpretation of the Constitution of India or any order made thereunder ?
========================================================== STATE OF GUJARAT,S M RAO,FOOD INSPECTOR,OFFICE OF THE Versus DHARMESHBHAI NARHARIBHAI GANDHI ========================================================== Appearance: MS HB PUNANI, APP (2) for the Appellant(s) No. 1 MR DK MODI(1317) for the Opponent(s)/Respondent(s) No. 1 ==========================================================
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
Date : 12/03/2019
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal under section 378(1)
(3) of the Code of Criminal Procedure, 1973
against the judgment and order of acquittal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge and Judicial Magistrate, First Class,
Nadiad in Food Case No.1 of 2007.
The short facts giving rise to the
present appeal are that on 10.11.2006 at about
18.00 hours, the complainant visited the place of
the respondent accused situated at Juna
Makhanpura, Rabarivad, Nadiad along with panch
witness and the respondent was found dealing in
provisional items. The complainant identified
himself as a Food Inspector and after giving
intimation in Form No.6 has purchased muddamal
sample of mustard seeds in the presence of the
panchas for the purpose of analysis. Thereafter,
the complainant Food Inspector has divided the
said sample in equal three parts and after
completing formalities of packing and sealing
obtained signatures of the vendor and panchas and
out of the said three parts, one part was sent to
the Public Analyst, Vadodara for analysis and
remaining two parts were sent to the Local Health
Authority, Gandhinagar. Thereafter, the Public
Analyst forwarded his report. In the said report,
it is stated that the muddamal sample of mustard
seeds is misbranded which is in breach of the
provisions of the Food Adulteration Act, 1954
(for short “the Act”) and the Rules framed
thereunder. It is alleged that, therefore, the
sample of mustard seeds was misbranded and,
thereby, the accused has committed the offence.
**Page 2 of 12
R/CR.A/251/2009* JUDGMENT*
Hence, the complaint came to be lodged against
the respondent accused.
I want to write a program that follows the constraints given below. Keep in mind that this is only a single file; I have around 40k files and the program should run on all of them. The files differ slightly, but the basic format of every file is the same.
Constraints:
It should start the text extraction after the "metadata". The metadata is the data about the file from the start of the file, i.e. "IN THE HIGH COURT OF GUJARAT", up to "ORAL JUDGMENT". In all the files I have, there are various numbered points after that string ends, and I need each of these points as a separate paragraph (the text above has 2 points; I need them in different paragraphs).
Check the lines in italics (e.g. "Page 1 of 12", "R/CR.A/251/2009 JUDGMENT"); these are the page markers in the text/PDF file. I need to remove these, as they add no meaning to the text content I want.
The files are available in both TEXT and PDF format, so I can use either. But I am new to Python, so I don't know how or where to start; I have only basic knowledge of Python.
This data is going to be made into a "corpus" for further processing in building a large expert system, so I hope it is clear what needs to be done.
Read the official Python docs!
Start with Python's basic str type and its methods. One of its methods, find, will find substrings in your text.
Use Python's slicing notation to extract the portion of text you need, e.g.
text = """YOUR TEXT HERE..."""
meta_start = 'In the high court of gujarat'
meta_end = 'ORAL JUDGMENT'
pos1 = text.find(meta_start)
pos2 = text.find(meta_end)
if pos2 > pos1 and pos1 > -1:
# text is found, extract it
text1 = text[meta_start + len(meta_start):meta_end - 1]
After that, you can go ahead and save your extracted text to a database.
Of course, a better and more complicated solution would be to use regular expressions, but that's another story -- try finding the right way for yourself!
As to italics and other text formatting, you won't ever be able to mark it out in plain text (unless you have some 'meta' markers, like e.g. [i] tags).
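If it helps, here is a rough sketch of the remaining steps the question asks for - dropping the page-marker lines and splitting the judgment into numbered points. It assumes the markers always contain "Page ... of ..." or the case number, and that each point starts with its number followed by a dot; adjust for your real files.
# Rough sketch, building on judgment_text from above.
lines = judgment_text.splitlines()

# Drop the page markers ("Page 1 of 12", "R/CR.A/251/2009 JUDGMENT") and empty lines.
content_lines = [ln.strip() for ln in lines
                 if ln.strip() and 'Page' not in ln and 'R/CR.A' not in ln]

# Group lines into paragraphs: a line starting with "1." / "2." begins a new point.
paragraphs = []
for ln in content_lines:
    first_word = ln.split(' ', 1)[0]
    if first_word.rstrip('.').isdigit():
        paragraphs.append(ln)           # new numbered point
    elif paragraphs:
        paragraphs[-1] += ' ' + ln      # continuation of the current point
    else:
        paragraphs.append(ln)           # text before the first numbered point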
Given an input file, e.g.
<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
<srcset>
The desired result is a nested dictionary that stores:
/setid
/docid
/segid
text
I've been using a defaultdict and reading the xml file with BeautifulSoup and nested loops, i.e.
from io import StringIO
from collections import defaultdict
from bs4 import BeautifulSoup
srcfile = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
<srcset>"""
#ntok = NISTTokenizer()
eval_docs = defaultdict(lambda: defaultdict(dict))
with StringIO(srcfile) as fin:
    bsoup = BeautifulSoup(fin.read(), 'html5lib')
    setid = bsoup.find('srcset')['setid']
    for doc in bsoup.find_all('doc'):
        docid = doc['docid']
        for seg in doc.find_all('seg'):
            segid = seg['id']
            eval_docs[setid][docid][segid] = seg.text
[out]:
>>> eval_docs
defaultdict(<function __main__.<lambda>>,
{'newstest2015': defaultdict(dict,
{'1012-bbc': {'1': 'India and Japan prime ministers meet in Tokyo',
'2': "India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.",
'3': 'Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.',
'4': 'High on the agenda are plans for greater nuclear co-operation.',
'5': 'India is also reportedly hoping for a deal on defence collaboration between the two nations.'},
'1018-lenta.ru': {'1': 'FANO Russia will hold a final Expert Session',
'2': 'The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.',
'3': 'The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.',
'4': 'At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.'}})})
Is there a simpler way to read the file and get the same eval_docs nested dictionary?
Can it be done easily without using BeautifulSoup?
Note that in the example, there's only one setid and one docid but the actual file has more than one of those.
Since what you have is HTML with an appearance like XML, you can't go for XML-based tools. In most cases your options are:
Implement a SAX parser
Use BS4 (which you are already doing)
Use lxml
In any case you will end up spending more time and effort and writing more code to handle this. What you have is really sleek and easy. I wouldn't look for another solution if I were you.
PS: What could be simpler than a 10-line solution!
I don't know if you'll find this simpler, but here's an alternative, using lxml as others have suggested.
Step 1: Convert the XML data into a normalized table (a list of lists)
from lxml import etree
tree = etree.parse('source.xml')
segs = tree.xpath('//seg')
normalized_list = []
for seg in segs:
    srcset = seg.getparent().getparent().getparent().attrib['setid']
    doc = seg.getparent().getparent().attrib['docid']
    normalized_list.append([srcset, doc, seg.attrib['id'], seg.text])
Step 2: Use defaultdict like you did in your original code
from collections import defaultdict

d = defaultdict(lambda: defaultdict(dict))
for i in normalized_list:
    d[i[0]][i[1]][i[2]] = i[3]
Depending on how you're keeping the source file, you'll have to use one of these methods to parse XML:
tree = etree.parse('source.xml'): when you want to parse a file directly - you won't need StringIO. File is closed automatically by etree.
tree = etree.fromstring(source): where source is a string object, like in your question.
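For completeness, if you keep the data as a string like in the question, the same xpath/getparent code works on the element returned by fromstring. A small sketch (assuming the markup is well-formed XML, i.e. it ends with a proper </srcset> closing tag):
from lxml import etree

root = etree.fromstring(srcfile)          # root is the <srcset> element
for seg in root.xpath('//seg'):
    docid = seg.getparent().getparent().attrib['docid']
    print(root.attrib['setid'], docid, seg.attrib['id'], seg.text[:40])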
I have a news dataset which contains almost 10,000 news items from the last 3 years.
I also have a list of companies (names of companies) which are registered on the NYSE. Now I want to check whether the company names in that list appear in the news dataset or not.
Example:
company Name: 'E.I. du Pont de Nemours and Company'
News: 'Monsanto and DuPont settle major disputes with broad patent-licensing deal, with DuPont agreeing to pay at least $1.75 billion over 10 years for rights to technology for herbicide-resistant soybeans.'
Now, I can find the news items that contain a company name when the exact company name is in the news, but as you can see from the above example that is not always the case.
I also tried another way, i.e. I took a distinctive word from the company's full name; in the above example 'Pont' is a word which should definitely be part of the text whenever this company is mentioned. This worked the majority of the time, but then a problem occurs in the following example:
Company Name: Ennis, Inc.
News: L D`ennis` Kozlowski, former chief executive convicted of looting nearly $100 million from Tyco International, has emerged into far more modest life after serving six-and-a-half year sentence and probation; Kozlowski, who became ultimate symbol of corporate greed in era that included scandals at Enron and WorldCom, describes his personal transformation and more humble pleasures that have replaced his once high-flying lifestyle.
Now you can see that Ennis matches D`ennis` in the text, so it gives irrelevant news results.
Can someone tell me the right way of doing this? Thanks.
Use a regex with word boundaries for exact matches. Whether you choose the full name or some partial piece you think is unique is up to you, but with word boundaries D`ennis` won't match Ennis:
companies = ["name1", "name2",...]
companies_re = re.compile(r"|".join([r"\b{}\b".format(name) for name in companies]))
Depending on how many matches per news item, you may want to use companies_re.search(article) or companies_re.findall(article).
Also for case insensitive matches pass re.I to compile.
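One thing to watch for (not shown above): real company names like "Ennis, Inc." or "E.I. du Pont" contain characters that are special in a regex, so it may be safer to escape them when building the pattern. A small sketch:
import re

companies = ["Ennis, Inc.", "E.I. du Pont de Nemours and Company"]
# re.escape makes the dots (and any other regex metacharacters) literal;
# the boundary is kept on the leading side only, because \b right after a
# trailing "." does not behave well.
pattern = r"|".join(r"\b" + re.escape(name) for name in companies)
companies_re = re.compile(pattern, re.I)

print(bool(companies_re.search("A deal with Ennis, Inc. was announced")))  # True
print(bool(companies_re.search("Dennis, Inc. announced a deal")))          # False - the leading \b blocks it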
If the only line you want to check is always the one starting with "company Name:", you can narrow down the search:
for line in all_lines:
    if line.startswith("company Name:"):
        name = companies_re.search(line)
        if name:
            ...
        break
It sounds like you need the Aho-Corasick algorithm. There is a nice and fast implementation for python here: https://pypi.python.org/pypi/pyahocorasick/
It will only do exact matching, so you would need to index both "Du pont" and "Dupont", for example. But that's not too hard, you can use the Wikidata to help you find aliases: for example, look at the aliases of Dupont's entry: it includes both "Dupont" and "Du pont".
Ok so let's assume you have the list of company names with their aliases:
import ahocorasick
A = ahocorasick.Automaton()
companies = ["google", "apple", "tesla", "dupont", "du pont"]
for idx, key in enumerate(companies):
    A.add_word(key, idx)
Next, make the automaton (see the link above for details on the algorithm):
A.make_automaton()
Great! Now you can simply search for all companies in some text:
your_text = """
I love my Apple iPhone. Do you know what a Googleplex is?
I ate some apples this morning.
"""
for end_index, idx in A.iter(your_text.lower()):
    print(end_index, companies[idx])
This is the output:
15 apple
49 google
74 apple
The numbers correspond to the index of the last character of the company name in the text.
Easy, right? And super fast, this algorithm is used by some variants of GNU grep.
Saving/loading the automaton
If there are a lot of company names, creating the automaton may take some time, so you may want to create it just once, save it to disk (using pickle), then load it every time you need it:
# create_company_automaton.py
# ... create the automaton (see above)
import pickle
pickle.dump(A, open('company_automaton.pickle', 'wb'))
In the program that will use this automaton, you start by loading the automaton:
# use_company_automaton.py
import ahocorasick
import pickle
A = pickle.load(open("company_automaton.pickle", "rb"))
# ... use the automaton
Hope this helps! :)
Bonus details
If you want to match "Apple" in "Apple releases a new iPhone" but not in "I ate an apple this morning", you are going to have a hard time. But it is doable: for example, you could gather a set of articles containing the word "apple" and about the company, and a set of articles not about the company, then identify words (or n-grams) that are more likely when it's about the company (e.g. "iPhone"). Unfortunately you would need to do this for every company whose name is ambiguous.
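A very rough sketch of that idea, with made-up inputs: count how often each word appears in articles known to be about the company versus the rest, and keep the words that are much more frequent in the former as disambiguation cues.
from collections import Counter

def disambiguation_cues(labeled_articles, min_ratio=3.0):
    # labeled_articles: iterable of (text, is_about_company) pairs - hypothetical input.
    company_counts, other_counts = Counter(), Counter()
    for text, is_about_company in labeled_articles:
        words = text.lower().split()
        (company_counts if is_about_company else other_counts).update(words)
    # Keep words that are several times more frequent in the company articles.
    return {w for w, c in company_counts.items()
            if (c + 1) / (other_counts[w] + 1) >= min_ratio}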
You can try difflib.get_close_matches with the full company name.
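For example (standard library, no extra installs; n caps the number of matches, cutoff controls how strict the fuzzy match is):
import difflib

companies = ["Du Pont", "Ennis", "Tyco International", "Monsanto"]
print(difflib.get_close_matches("Dupont", companies, n=3, cutoff=0.6))
# ['Du Pont']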
I am new to Python, and am wondering if anyone can help me with some file loading.
The situation is that I have some text files and I'm trying to do sentiment analysis. Here's the text file; each line is split into three categories: <department>, <user>, <review>
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to turn it into this:
<category> <user> <review>
I have 50k lines of this data.
I have tried to load it directly into numpy, but it gives an empty-separator error. I looked it up on Stack Overflow, but I couldn't find a case that deals with a varying number of delimiters. For instance, I will never know how many spaces there are in the data set that I have.
My biggest problem is: how do you count the number of delimiters and assign them to columns? Is there a way I can split each line into the three categories <department>, <user>, <review>? Bear in mind that the review data can contain arbitrary commas and spaces which I can't control, so the system must be smart enough to handle that.
Any ideas? Is there a way I can tell Python that after it reads the user data, everything that follows falls under review?
With data like this I'd just use split() with the maxsplit argument:
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
Example:
from StringIO import StringIO
s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")
for line in s:
    category, user, review = line.split(None, 2)
    print("category: {} - user: {} - review: '{}'".format(category,
                                                          user,
                                                          review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
https://docs.python.org/2/library/stdtypes.html#str.split
What about doing it sorta manually:
data = []
for line in input_data:
    tmp_split = line.split(" ")
    # Get the first part (dept)
    dept = tmp_split[0]
    # Get the 2nd part
    user = tmp_split[1]
    # Everything after is the review - put spaces in between each piece
    review = " ".join(tmp_split[2:])
    data.append([dept, user, review])
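Here input_data just needs to be an iterable of lines; for a file on disk that could simply be (the filename is a placeholder):
# 'reviews.txt' is a placeholder for your 50k-line data file.
with open('reviews.txt') as f:
    input_data = f.readlines()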