Extract data from a website into dictionaries in Python

Below is the code and the corresponding output for extracting the data of a particular job from Indeed.com. Along with the data I get a lot of junk, and I want to separate out the title, location, job description and other important features. How can I convert this into a dictionary?
from bs4 import BeautifulSoup
import urllib2
final_site = 'http://www.indeed.com/cmp/Pullskill-techonoligies/jobs/Data-Scientist-229a6b09c5eb6b44?q=%22data+scientist%22'
html = urllib2.urlopen(final_site).read()
soup = BeautifulSoup(html, "html.parser")
deep = soup.find("td", "snip")
deep.get("p", "ul")
deep.get_text(strip=True)
Output:
u'Title : Data ScientistLocation : Seattle WADuration : Fulltime / PermanentJob Responsibilities:Implement advanced and predictive analytics models usingJava,R, and Pythonetc.Develop deep expertise with Company\u2019s data warehouse, systems, product and other resources.Extract, collate and analyze data from a variety of sources to provide insights to customersCollaborate with the research team to incorporate qualitative insights into projects where appropriateKnowledge, Skills and Experience:Exceptional problem solving skillsExperience withJava,R, and PythonAdvanced data mining and predictive modeling (especially Machine learning techniques) skillsMust have statistics orientation (Theory and applied)3+ years of business experience in an advanced analytics roleStrong Python and R programming skills are required. SAS, MATLAB will be plusStrong SQL skills are looked for.Analytical and decisive strategic thinker, flexible problem solver, great team player;Able to effectively communicate to all levelsImpeccable attention to detail and very strong ability to convert complex data into insights and action planThanksNick ArthurLead Recruiternick(at)pullskill(dot)com201-497-1010 Ext: 106Salary: $120,000.00 /yearRequired experience:Java And Python And R And PHD Level Education: 4 years5 days ago-save jobwindow[\'result_229a6b09c5eb6b44\'] = {"showSource": false, "source": "Indeed", "loggedIn": false, "showMyJobsLinks": true,"undoAction": "unsave","relativeJobAge": "5 days ago","jobKey": "229a6b09c5eb6b44", "myIndeedAvailable": true, "tellAFriendEnabled": false, "showMoreActionsLink": false, "resultNumber": 0, "jobStateChangedToSaved": false, "searchState": "", "basicPermaLink": "http://www.indeed.com", "saveJobFailed": false, "removeJobFailed": false, "requestPending": false, "notesEnabled": true, "currentPage" : "viewjob", "sponsored" : false, "reportJobButtonEnabled": false};\xbbApply NowPlease review all application instructions before applying to Pullskill Technologies.(function(d, s, id){var js, iajs = d.getElementsByTagName(s)[0], iaqs = \'vjtk=1aa24enhqagvcdj7&hl=en_US&co=US\'; if (d.getElementById(id)){return;}js = d.createElement(s); js.id = id; js.async = true; js.src = \'https://apply.indeed.com/indeedapply/static/scripts/app/bootstrap.js\'; js.setAttribute(\'data-indeed-apply-qs\', iaqs); iajs.parentNode.insertBefore(js, iajs);}(document, \'script\', \'indeed-apply-js\'));Recommended JobsData Scientist, Energy AnalyticsRenew Financial-Oakland, CARenew Financial-5 days agoData ScientistePrize-Seattle, WAePrize-7 days agoData ScientistDocuSign-Seattle, WADocuSign-12 days agoEasily applyEngineer - US Citizen or Permanent ResidentVoxel Innovations-Raleigh, NCIndeed-8 days agoEasily applyData ScientistUnity Technologies-San Francisco, CAUnity Technologies-22 days agoEasily apply'

Find the job summary element, find all the b elements inside it, and split each b element's text on the " : " separator:
for elm in soup.find("span", id="job_summary").p.find_all("b"):
    label, text = elm.get_text().split(" : ")
    print(label.strip(), text.strip())
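Building on that, a minimal sketch that collects the labelled fields into a dictionary (this assumes every b tag holds a "Label : value" pair; anything without " : " is skipped):
job = {}
for elm in soup.find("span", id="job_summary").p.find_all("b"):
    parts = elm.get_text().split(" : ", 1)
    if len(parts) == 2:  # skip b tags that are not "Label : value"
        label, value = parts
        job[label.strip()] = value.strip()
print(job)  # e.g. {'Title': 'Data Scientist', 'Location': 'Seattle WA', ...}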

If your output always has the same structure you could use regex to create the dictionary.
import re

result = {}
title_match = re.search(r'Title : (.+?)(?=Location)', output)
result['Title'] = title_match.group(1)
location_match = re.search(r'Location : (.+?)(?=Duration)', output)
result['Location'] = location_match.group(1)
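A compact variant of the same idea, still relying on the fixed "Label : value" layout (output is assumed to hold the get_text() string from the question, and "Job Responsibilities" is used as the end marker for the last field):
import re

labels = ["Title", "Location", "Duration"]
end_markers = labels[1:] + ["Job Responsibilities"]
fields = {}
for label, nxt in zip(labels, end_markers):
    # Grab everything between this label and the next one
    m = re.search(r"{} : (.+?)(?={})".format(label, nxt), output)
    if m:
        fields[label] = m.group(1).strip()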
Of course this is a pretty fragile solution and it would probably serve you better to use BeautifulSoup's in-built parsing to get the results you want, as I guess they are probably surrounded by standard tags.

Related

How can I use the Google Cloud Natural Language API on a BigQuery table, or any other topic modelling resource?

As mentioned in the title, I have a BigQuery table with 18 million rows; nearly half of them are useless, and I am supposed to assign a topic/niche to each row based on an important column (which has details about a product or website). I have tested the NLP API on a sample of 10,000 rows and it did wonders, but my standard approach is too slow: I iterate over newarr (the important-details column obtained by querying my BigQuery table), sending only one cell at a time, awaiting the response from the API, and appending it to the results array.
Ideally I want to run this operation on all 18 million rows in the minimum time. My per-minute quota has been increased to 3,000 API requests, so that is the maximum I can make, but I can't figure out how to send a batch of 3,000 rows each minute, one batch after another (a rough sketch of one possible approach follows the code below).
i = 0
results = []
for x in newarr:
    i += 1
    results.append(sample_classify_text(x))
sample_classify_text is a function taken straight from the documentation:
# This function will return a category for the text
from google.cloud import language_v1

def sample_classify_text(text_content):
    """
    Classifying Content in a String

    Args:
      text_content The text content to analyze. Must include at least 20 words.
    """
    client = language_v1.LanguageServiceClient()

    # text_content = 'That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows.'

    # Available types: PLAIN_TEXT, HTML
    type_ = language_v1.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For a list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type_": type_, "language": language}

    response = client.classify_text(request={'document': document})
    # return response.categories

    # Loop through the classified categories returned from the API
    for category in response.categories:
        # Get the name of the category representing the document.
        # See the predefined taxonomy of categories:
        # https://cloud.google.com/natural-language/docs/categories
        # (The confidence, a number representing how certain the classifier is that
        # this category represents the provided text, is available as category.confidence.)
        x = format(category.name)
        return x
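Not a tested solution, just a rough sketch of how the batching could look: process newarr in chunks of 3,000, push each chunk through a thread pool (the classify calls are I/O bound), and sleep out the remainder of the minute before starting the next chunk. The chunk size, worker count, and lack of retry/error handling are all simplifying assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

def classify_all(rows, per_minute=3000, workers=50):
    results = []
    for start in range(0, len(rows), per_minute):
        chunk = rows[start:start + per_minute]
        t0 = time.time()
        # Run the per-row classify calls concurrently; map preserves the row order
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results.extend(pool.map(sample_classify_text, chunk))
        elapsed = time.time() - t0
        if elapsed < 60 and start + per_minute < len(rows):
            time.sleep(60 - elapsed)  # wait out the rest of the quota window
    return results

results = classify_all(newarr)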

2SLS with fixed effects in Python

I am trying to rebuild a paper (Dreher et al. 2020, Aid, China, and Growth: Evidence from a New Global Development Finance Dataset) in Python. The paper's calculations were performed in Stata. I have managed to rebuild everything up to this point, but now I am stuck.
The data for this project is panel data, and I have to include year-specific effects and country-specific effects:
time_specific_effects = True, entity_effects = True / other_effects = data4.code
(because the variable data4.code is the same as the entity here).
The plan is to use 2SLS:
[Images in the original post: the regression equation and the IV strategy]
The Stata code is given by the author:
xtivreg2 growth_pc (l2.OFn_all= l3.IV_reserves_OFn_all_1_ln l3.IV_factor1_OFn_all_1_ln) l.population_ln time* if code!="CHN", fe first savefprefix(first) cluster(code) endog(l2.OFn_all)
I rebuilt all the lagged variables in Python using shift() and it worked:
data4["l3IV_reserves_OFn_all_1_ln"] = data4["IV_reserves_OFn_all_1_ln"].shift(3)
data4["l3IV_factor1_OFn_all_1_ln"] = data4["IV_factor1_OFn_all_1_ln"].shift(3)
So the setup is the same as it is for the author.
As far as I know, there is no Python library that performs 2SLS with fixed effects directly, so I thought I would just use linearmodels' PanelOLS (which is suited for panel data with fixed effects) to perform the first stage and the second stage separately:
import statsmodels.api as sm
import linearmodels as lm

# First stage
dependentFS = data4.l2OFn_all
exog2 = sm.tools.add_constant(data4[["l1population_ln", "l3IV_reserves_OFn_all_1_ln", "l3IV_factor1_OFn_all_1_ln"]])
mod = lm.panel.PanelOLS(dependentFS, exog2, time_effects=True, entity_effects=True, drop_absorbed=True)
mod_new21c = mod.fit(cov_type='clustered', clusters=data4.code)

# Save the fitted values
fitted_c = mod_new21c.fitted_values
data4["fitted_values_c"] = fitted_c

# Second stage
dependentSS = data4.growth_pc
exog = sm.tools.add_constant(data4[["fitted_values_c", "l1population_ln"]])
mod = lm.panel.PanelOLS(dependentSS, exog, time_effects=True, entity_effects=True)
mod_new211c = mod.fit(cov_type='clustered', clusters=data4.code)
I tried several combinations of the fixed effects and of the covariance settings, but so far none of them delivered the results I need. Here is my output for the second stage:
[Image: results after the second stage]
and this is what they should look like:
[Image: table from the paper; the dependent variable is growth p.c., SE in brackets]
Where is my mistake? Do I have to adjust my data or the output of the first stage, since I am running the two stages separately? Is there a mistake, or a better method of estimating 2SLS in Python?
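One rough sketch, not the paper's exact specification: 2SLS can be run in a single call with linearmodels' IV2SLS, adding the fixed effects as dummies through the formula interface. This assumes code and year exist as plain columns in data4 (not index levels); note also that plugging first-stage fitted values into a second OLS by hand does not reproduce proper 2SLS standard errors, which may explain part of the discrepancy.
from linearmodels.iv import IV2SLS

# Column names follow the question; "year" is an assumption about the data
formula = (
    "growth_pc ~ 1 + l1population_ln + C(code) + C(year) "
    "+ [l2OFn_all ~ l3IV_reserves_OFn_all_1_ln + l3IV_factor1_OFn_all_1_ln]"
)
sub = data4[data4.code != "CHN"]  # the Stata command drops China
iv_res = IV2SLS.from_formula(formula, data=sub).fit(
    cov_type="clustered", clusters=sub.code
)
print(iv_res.summary)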

Search Engine - rank the output by a weighted mechanism

I am trying to build a semantic-search FAQ system using Elasticsearch 7.7.0 and Universal Sentence Encoder (USE4) embeddings. So far I have indexed a set of questions and answers, which I am able to search. I run two searches whenever there is an input:
keyword search on the indexed data in Elasticsearch
semantic search using the USE4 embeddings
Now I want to combine both to give a more robust output, because the results from the individual algorithms are sometimes off. Any good suggestions on how I can combine them? I would like to use a weighted mechanism that gives more weight to the semantic search, and/or to be able to match the results up again. The question is how to get the best of both. Please advise.
import time
import sys
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import csv
import tensorflow as tf
import tensorflow_hub as hub


def connect2ES():
    # Connect to ES on localhost on port 9200
    es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
    if es.ping():
        print('Connected to ES!')
    else:
        print('Could not connect!')
        sys.exit()
    print("*********************************************************************************")
    return es


def keywordSearch(es, q):
    # Search by keywords
    b = {
        'query': {
            'match': {
                "title": q
            }
        }
    }
    res = es.search(index='questions-index_quora2', body=b)
    print("Keyword Search:\n")
    for hit in res['hits']['hits']:
        print(str(hit['_score']) + "\t" + hit['_source']['title'])
    print("*********************************************************************************")
    return


# Search by vector similarity
def sentenceSimilaritybyNN(embed, es, sent):
    query_vector = tf.make_ndarray(tf.make_tensor_proto(embed([sent]))).tolist()[0]
    b = {
        "query": {
            "script_score": {
                "query": {
                    "match_all": {}
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'title_vector') + 1.0",
                    "params": {"query_vector": query_vector}
                }
            }
        }
    }
    # print(json.dumps(b, indent=4))
    res = es.search(index='questions-index_quora2', body=b)
    print("Semantic Similarity Search:\n")
    for hit in res['hits']['hits']:
        print(str(hit['_score']) + "\t" + hit['_source']['title'])
    print("*********************************************************************************")


if __name__ == "__main__":
    es = connect2ES()
    embed = hub.load("./data/USE4/")  # this is where my USE4 model is saved
    while True:
        query = input("Enter a Query:")
        start = time.time()
        if query == "END":
            break
        print("Query: " + query)
        keywordSearch(es, query)
        sentenceSimilaritybyNN(embed, es, query)
        end = time.time()
        print(end - start)
My output looks like this:
Enter a Query:what can i watch this weekend
Query: what can i watch this weekend
Keyword Search:
9.6698 Where can I watch gonulcelen with english subtitles?
7.114256 What are some good movies to watch?
6.3105774 What kind of animal did this?
6.2754908 What are some must watch TV shows before you die?
6.0294256 What is the painting on this image?
6.0294256 What the meaning of this all life?
6.0294256 What are your comments on this picture?
5.9638205 Which is better GTA5 or Watch Dogs?
5.9269657 Can somebody explain to me how to do this problem with steps?
*********************************************************************************
Semantic Similarity Search:
1.6078881 What are some good movies to watch?
1.5065247 What are some must watch TV shows before you die?
1.502714 What are some movies that everyone needs to watch at least once in life?
1.4787409 Where can I watch gonulcelen with english subtitles?
1.4713362 What are the best things to do on Halloween?
1.4669418 Which are the best movies of 2016?
1.4554278 What are some interesting things to do when bored?
1.4307204 How can I improve my skills?
1.4261798 What are the best films that take place in one room?
1.4175651 What are the best things to learn in life?
*********************************************************************************
0.05920886993408203
I want a single output based on both of these, so the results are more accurate and ranked accordingly. Please advise, or point me to some good practices around this. Thanks in advance.
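One possible starting point, as a minimal sketch of my own weighting scheme rather than an Elasticsearch feature: min-max normalise both score sets so they live on the same scale, then blend them with a configurable weight that favours the semantic score. This assumes keywordSearch and sentenceSimilaritybyNN are changed to return {title: score} dicts instead of printing.
def combine_scores(keyword_hits, semantic_hits, semantic_weight=0.7):
    def normalise(hits):
        if not hits:
            return {}
        lo, hi = min(hits.values()), max(hits.values())
        span = (hi - lo) or 1.0
        return {doc: (score - lo) / span for doc, score in hits.items()}

    kw = normalise(keyword_hits)
    sem = normalise(semantic_hits)
    combined = {}
    for doc in set(kw) | set(sem):
        combined[doc] = ((1 - semantic_weight) * kw.get(doc, 0.0)
                         + semantic_weight * sem.get(doc, 0.0))
    # Highest blended score first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
Tuning semantic_weight on a handful of labelled queries is the simplest way to pick the blend.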

Getting a list of URLs from Wikipedia

I have a list of names of Fortune 500 companies.
Here is an example: [Abbott Laboratories, Progressive, Arrow Electronics, Kraft Heinz, Plains GP Holdings, Gilead Sciences, Mondelez International, Northrop Grumman]
Now I want to get the complete Wikipedia URL for each element in the list.
For example, after searching the name on Google or Wikipedia, it should give me back a list of all the Wikipedia URLs, like:
https://en.wikipedia.org/wiki/Abbott_Laboratories (this is only one example)
The biggest problem is looking through the possible pages and selecting only the one that belongs to the company.
One somewhat crude way would be to just append the company name to the Wikipedia URL and hope that it works. That results in: a) it works (like Abbott Laboratories), b) it produces a page, but not the right one (Progressive, which should be Progressive_Corporation), or c) it produces no result at all.
companies = [
    "Abbott Laboratories", "Progressive", "Arrow Electronics", "Kraft Heinz Plains GP Holdings", "Gilead Sciences",
    "Mondelez International", "Northrop Grumman"
]

url = "https://en.wikipedia.org/wiki/%s"
for company in companies:
    print(url % company.replace(" ", "_"))
Another (way better) option would be to use the wikipedia package (https://pypi.org/project/wikipedia/) and its built-in search function. The problem of selecting the right page still remains, so you basically have to do this by hand, or build a good automatic selection, like searching for the word "company" (see the sketch after the snippet below).
import wikipedia

companies = [
    "Abbott Laboratories", "Progressive", "Arrow Electronics", "Kraft Heinz Plains GP Holdings", "Gilead Sciences",
    "Mondelez International", "Northrop Grumman"
]

for company in companies:
    options = wikipedia.search(company)
    print(company, options)
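A rough sketch of that "automatic selection" idea: take the first search result whose summary mentions "compan" (company/companies), otherwise return None. This is a heuristic only, and the disambiguation handling is deliberately simplistic.
import wikipedia

def best_company_page(company):
    # Return the URL of the first search hit that looks like a company article
    for title in wikipedia.search(company):
        try:
            page = wikipedia.page(title, auto_suggest=False)
        except (wikipedia.DisambiguationError, wikipedia.PageError):
            continue
        if "compan" in page.summary.lower():
            return page.url
    return None

for company in companies:
    print(company, best_company_page(company))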

NLTK title classifier

Apologies in advance if this has already been asked and answered, but I couldn't find any answer close to my problem. I am also somewhat of a noob when it comes to Python, so sorry for the long post.
I am trying to build a Python script that, based on a user-given Pubmed query (i.e., "cancer"), retrieves a file with N article titles, and evaluates their relevance to the subject in question.
I have successfully built the "pubmed search and save" part, having it return a .txt file containing titles of articles (each line corresponds to a different article title), for instance:
Feasibility of an ovarian cancer quality-of-life psychoeducational intervention.
A randomized trial to increase physical activity in breast cancer survivors.
Having this file, the idea is to feed it into a classifier and get it to answer whether the titles in the .txt file are relevant to a subject, for which I have a "gold standard" of titles that I know are relevant (i.e. I want to know the precision and recall of the queried set of titles against my gold standard). For example: title 1 has the word "neoplasm" X times and "study" N times, therefore it is considered relevant to "cancer" (Y/N).
For this, I have been using NLTK to (try to) classify my text. I have pursued 2 different approaches, both unsuccessful:
Approach 1
Loading the .txt file, preprocessing it (tokenization, lower-casing, removing stopwords), converting the text to NLTK text format, finding the N most-common words. All this runs without problems.
import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

f = open('SR_titles.txt')
raw = f.read()
tokens = word_tokenize(raw)
words = [w.lower() for w in tokens]
words = [w for w in words if w not in stopwords.words("english")]
text = nltk.Text(words)
fdist = FreqDist(text)
>>><FreqDist with 116 samples and 304 outcomes>
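For completeness, the word_features list that the feature extractor below relies on is built from this distribution; a minimal version of that step, following the NLTK book's pattern (the cutoff of 100 words is arbitrary):
# Use the most frequent words as the vocabulary the classifier will look at
word_features = [w for w, _ in fdist.most_common(100)]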
I am also able to find collocations/bigrams in the text, which is something that might be important later on.
text.collocations()
>>>randomized controlled; breast cancer; controlled trial; physical
>>>activity; metastatic breast; prostate cancer; randomised study; early
>>>breast; cancer patients; feasibility study; psychosocial support;
>>>group psychosocial; group intervention; randomized trial
Following NLTK's tutorial, I built a feature extractor so the classifier will know which aspects of the data it should pay attention to.
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
This would, for instance, return something like this:
{'contains(series)': False, 'contains(disorders)': False,
'contains(group)': True, 'contains(neurodegeneration)': False,
'contains(human)': False, 'contains(breast)': True}
The next thing would be to use the feature extractor to train a classifier to label new article titles, and following NLTK's example, I tried this:
featuresets = [(document_features(d), c) for (d,c) in text]
Which gives me the error:
ValueError: too many values to unpack
I quickly googled this and found that it has something to do with tuples, but I did not get how to solve it (like I said, I'm somewhat of a noob at this), other than by creating a categorized corpus (I would still like to understand how to solve this tuple problem).
Therefore, I tried approach 2, following Jacob Perkins's Text Processing with NLTK Cookbook:
I started by creating a corpus and assigning categories. This time I had 2 different .txt files, one for each subject of article titles.
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader('.', r'.*\.txt',
    cat_map={'hd_titles.txt': ['HD'], 'SR_titles.txt': ['Cancer']})
With "reader.raw()" I get something like this:
u"A pilot investigation of a multidisciplinary quality of life intervention for men with biochemical recurrence of prostate cancer.\nA randomized controlled pilot feasibility study of the physical and psychological effects of an integrated support programme in breast cancer.\n"
The categories for the corpus seem to be right:
reader.categories()
>>>['Cancer', 'HD']
Then, I try to construct a list of documents, labeled with the appropriate categories:
documents = [(list(reader.words(fileid)), category)
for category in reader.categories()
for fileid in reader.fileids(category)]
Which returns me something like this:
[([u'A', u'pilot', u'investigation', u'of', u'a', u'multidisciplinary',
u'quality', u'of', u'life', u'intervention', u'for', u'men', u'with',
u'biochemical', u'recurrence', u'of', u'prostate', u'cancer', u'.'],
'Cancer'),
([u'Trends', u'in', u'the', u'incidence', u'of', u'dementia', u':',
u'design', u'and', u'methods', u'in', u'the', u'Alzheimer', u'Cohorts',
u'Consortium', u'.'], 'HD')]
The next step would be creating a list of labeled feature sets, for which I used the following function, which takes a corpus and a feature_detector function (document_features, referred to above). It then constructs and returns a mapping of the form {label: [featureset]}.
import collections

def label_feats_from_corpus(corp, feature_detector=document_features):
    label_feats = collections.defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories=[label]):
            feats = feature_detector(corp.words(fileids=[fileid]))
            label_feats[label].append(feats)
    return label_feats

lfeats = label_feats_from_corpus(reader)
>>>defaultdict(<type 'list'>, {'HD': [{'contains(series)': True,
'contains(disorders)': True, 'contains(neurodegeneration)': True,
'contains(anilinoquinazoline)': True}], 'Cancer': [{'contains(cancer)':
True, 'contains(of)': True, 'contains(group)': True, 'contains(After)':
True, 'contains(breast)': True}]})
(the list is a lot bigger and everything is set as True).
Then I want to construct a list of labeled training instances and testing instances.
The split_label_feats() function takes a mapping returned from label_feats_from_corpus() and splits each list of feature sets into labeled training and testing instances.
def split_label_feats(lfeats, split=0.75):
    train_feats = []
    test_feats = []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend([(feat, label) for feat in feats[:cutoff]])
        test_feats.extend([(feat, label) for feat in feats[cutoff:]])
    return train_feats, test_feats

train_feats, test_feats = split_label_feats(lfeats, split=0.75)
len(train_feats)
>>>0
len(test_feats)
>>>2
print(test_feats)
>>>[({'contains(series)': True, 'contains(China)': True,
'contains(disorders)': True, 'contains(neurodegeneration)': True},
'HD'), ({'contains(cancer)': True, 'contains(of)': True,
'contains(group)': True, 'contains(After)': True, 'contains(breast)':
True}, 'Cancer')]
I should've ended up with a lot more labeled training instances and labeled testing instances, I guess.
This brings me to where I am now. I searched Stack Overflow, Biostars, etc., and could not find how to deal with either of these problems, so any help would be deeply appreciated.
TL;DR: Can't label a single .txt file to classify text, and can't get a corpus correctly labeled (again, to classify text).
If you've read this far, thank you as well.
You're getting an error on the following line:
featuresets = [(document_features(d), c) for (d,c) in text]
Here, you are supposed to convert each document (i.e. each title) to a dictionary of features. But to train on the results, the train() method needs both the feature dictionaries and the correct answer ("label"). So the normal workflow is to have a list of (document, label) pairs, which you transform into (features, label) pairs. It looks like your variable documents has the right structure, so if you just use it instead of text, this should work correctly:
featuresets = [(document_features(d), c) for (d,c) in documents]
As you go forward, get in the habit of inspecting your data carefully and figuring out what will (and should) happen to them. If text is a list of titles, it makes no sense to unpack each title to a pair (d, c). That should have pointed you in the right direction.
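A minimal follow-on sketch of that workflow, using documents and document_features from the question with an arbitrary 75/25 split (with only two documents in the corpus the split will be degenerate, so this illustrates the workflow rather than the numbers):
import random
import nltk

random.shuffle(documents)  # mix the Cancer and HD titles before splitting
featuresets = [(document_features(d), c) for (d, c) in documents]
cutoff = int(len(featuresets) * 0.75)
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)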
In featuresets = [(document_features(d), c) for (d,c) in text], I'm not sure what you expect to get from text. text seems to be an NLTK class that is essentially a wrapper around a generator. It gives you a single string on each iteration, which is why you are getting an error: you are asking for two items when it only has one to give.
