I have a list of names of Fortune 500 companies.
Here is an example: [Abbott Laboratories, Progressive, Arrow Electronics, Kraft Heinz, Plains GP Holdings, Gilead Sciences, Mondelez International, Northrop Grumman]
Now I want to get the complete Wikipedia URL for each element in the list.
For example, after searching the name on Google or Wikipedia, it should give me back a list of Wikipedia URLs like:
https://en.wikipedia.org/wiki/Abbott_Laboratories (this is only one example)
The biggest problem is searching through the possible pages and selecting only the one that belongs to the company.
One somewhat wrong way would be just appending the company name to the wiki URL and hoping that it works. That results in a) it works (like Abbott Laboratories), b) it produces a page, but not the right one (Progressive should be Progressive_Corporation), or c) it produces no result at all.
companies = [
    "Abbott Laboratories", "Progressive", "Arrow Electronics", "Kraft Heinz", "Plains GP Holdings",
    "Gilead Sciences", "Mondelez International", "Northrop Grumman"
]
url = "https://en.wikipedia.org/wiki/%s"
for company in companies:
    print(url % company.replace(" ", "_"))
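One way to at least detect case (c), sketched below, is to request each guessed URL and keep only the ones that resolve. This assumes the requests library is installed, and note that a 200 response can still be the wrong page (case b):
import requests

url = "https://en.wikipedia.org/wiki/%s"
for company in companies:
    guess = url % company.replace(" ", "_")
    resp = requests.get(guess)
    # 200 means some article exists; it may still be the wrong one (case b)
    if resp.status_code == 200:
        print(company, "->", resp.url)
    else:
        print(company, "-> no article found")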
Another (way better) option would be using the wikipedia package (https://pypi.org/project/wikipedia/) and its built-in search function. The problem of selecting the right page still remains, so you basically have to do this by hand, or build a reasonable automatic selection (for example, searching the candidate pages for the word "company"):
companies = [
    "Abbott Laboratories", "Progressive", "Arrow Electronics", "Kraft Heinz", "Plains GP Holdings",
    "Gilead Sciences", "Mondelez International", "Northrop Grumman"
]
import wikipedia
for company in companies:
    options = wikipedia.search(company)
    print(company, options)
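A rough automatic selection along those lines could check each candidate page's summary for company-related words. This is only a sketch, assuming the wikipedia package and assuming that matching "compan"/"corporation" in the summary is good enough for your data:
import wikipedia

def find_company_url(company):
    # Walk the search results and return the first page whose summary
    # mentions a company-related word.
    for title in wikipedia.search(company):
        try:
            page = wikipedia.page(title, auto_suggest=False)
        except (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError):
            continue
        summary = page.summary.lower()
        if "compan" in summary or "corporation" in summary:
            return page.url
    return None

print(find_company_url("Progressive"))  # ideally https://en.wikipedia.org/wiki/Progressive_Corporation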
As mentioned in the title, I have a BigQuery table with 18 million rows; nearly half of them are useless, and I am supposed to assign a topic/niche to each row based on an important column (which has details about a product or website). I have tested the NLP API on a sample of 10,000 rows and it did wonders, but my standard approach is to iterate over newarr (the important-details column obtained by querying my BigQuery table), sending only one cell at a time, awaiting the response from the API and appending it to the results array.
Ideally I want to run this over the 18 million rows in the minimum time. My per-minute quota has been increased to 3000 API requests, so that's the maximum I can make, but I can't figure out how to send a batch of 3000 rows one after another each minute.
for x in newarr:
    i += 1
    results.append(sample_classify_text(x))
sample_classify_text is a function taken straight from the documentation:
# this function will return the category for the text
from google.cloud import language_v1

def sample_classify_text(text_content):
    """
    Classifying Content in a String
    Args:
      text_content The text content to analyze. Must include at least 20 words.
    """
    client = language_v1.LanguageServiceClient()

    # text_content = 'That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows.'

    # Available types: PLAIN_TEXT, HTML
    type_ = language_v1.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type_": type_, "language": language}

    response = client.classify_text(request={'document': document})
    # return response.categories

    # Loop through classified categories returned from the API
    for category in response.categories:
        # Get the name of the category representing the document.
        # See the predefined taxonomy of categories:
        # https://cloud.google.com/natural-language/docs/categories
        x = format(category.name)
        return x
        # Get the confidence. Number representing how certain the classifier
        # is that this category represents the provided text.
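The actual batching question is not covered by the snippet above. Below is a minimal sketch of one way to respect the 3000-requests-per-minute quota, assuming sample_classify_text as defined above; the thread count and the pause logic are assumptions, not anything prescribed by the API:
import time
from concurrent.futures import ThreadPoolExecutor

def classify_all(rows, per_minute=3000, workers=50):
    # Send one batch per minute, with the calls inside a batch running in parallel
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(rows), per_minute):
            batch = rows[start:start + per_minute]
            t0 = time.time()
            # pool.map keeps the results in the same order as the input rows
            results.extend(pool.map(sample_classify_text, batch))
            elapsed = time.time() - t0
            if elapsed < 60 and start + per_minute < len(rows):
                time.sleep(60 - elapsed)
    return results

results = classify_all(newarr)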
I am currently in a data science Bootcamp and I am ahead of the curriculum for the moment, so I wanted to take the chance to apply some of the skills that I have learned in service of my first project. I am scraping movie information from Box Office Mojo and would like to eventually compile all of this information into a pandas dataframe. So far I have a pagination function that collects all of the links for the individual films:
def pagination_func(req_url):
    soup = bs(req_url.content, 'lxml')
    table = soup.find('table')
    links = [a['href'] for a in table.find_all('a', href=True)]
    pagination_list = []
    substring = '/release'
    for link in links:
        if substring in link:
            pagination_list.append(link)
    return pagination_list
I have sort of lazily implemented a hardwired URL to pass through this function to retrieve the requested data:
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
link_list_by_year = []
for count, year in tqdm(enumerate(years)):
    pagination_url = 'https://www.boxofficemojo.com/year/{}/?grossesOption=calendarGrosses'.format(year)
    pagination = requests.get(pagination_url)
    link_list_by_year.append(pagination_func(pagination))
This will give me incomplete URLs that I then convert into complete URLs with this for loop:
complete_links = []
for link in link_list_by_year:
    for url in link:
        complete_links.append('https://www.boxofficemojo.com{}'.format(url))
I have then used the lxml library to retrieve the elements from the page that I wanted with this function:
def scrape_page(req_page):
    tree = html.fromstring(req_page.content)
    title.append(tree.xpath('//*[@id="a-page"]/main/div/div[1]/div[1]/div/div/div[2]/h1/text()')[0])
    domestic.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[1]/span[2]/span/text()')[0].replace('$','').replace(',',''))
    international.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[2]/span[2]/a/span/text()')[0].replace('$','').replace(',',''))
    worldwide.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[3]/span[2]/a/span/text()')[0].replace('$','').replace(',',''))
    opening.append(tree.xpath(
        '//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')[0].replace('$','').replace(',',''))
    opening_theatres.append(tree.xpath(
        '/html/body/div[1]/main/div/div[3]/div[4]/div[2]/span[2]/text()')[0].replace('\n', '').split()[0])
    MPAA.append(tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[4]/span[2]/text()')[0])
    run_time.append(tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[5]/span[2]/text()')[0])
    genres.append(tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[6]/span[2]/text()')[0].replace('\n','').split())
I go on to initialize these lists, which I will spare you for the sake of avoiding a wall of text; they're all just standard var = [].
Finally, I have a for loop that will iterate over my list of completed links:
for link in tqdm(complete_links[:200]):
    movie = requests.get(link)
    scrape_page(movie)
So it is all pretty basic and not very optimized, but it has helped me understand a lot about the basics of Python. Unfortunately, when I run the loop to scrape the pages, after about a minute it throws an IndexError: list index out of range and gives the following traceback (or one of a similar nature concerning an operation within the scrape_page function):
IndexError Traceback (most recent call last)
<ipython-input-381-739b3dc267d8> in <module>
4 for link in tqdm(test_links[:200]):
5 movie = requests.get(link)
----> 6 scrape_page(movie)
7
8
<ipython-input-378-7c13bea848f6> in scrape_page(req_page)
14
15 opening.append(tree.xpath(
---> 16 '//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')[0].replace('$','').replace(',',''))
17
18 opening_theatres.append(tree.xpath(
IndexError: list index out of range
What I think is going wrong is that the particular page it hangs up on either lacks that particular element, has it tagged differently, or has some other oddity. I have searched for a way to handle this error, but I couldn't find anything relevant to my case. I have honestly been banging my head against this for the better part of two hours and have done everything (within my limited knowledge) short of checking every page by hand for some sort of issue.
Check if xpath() returned anything before trying to append the result to the list.
openings = tree.xpath('//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')
if openings:
    opening.append(openings[0].replace('$','').replace(',',''))
Since you should probably do this for all the lists, you may want to extract the pattern into a function:
def append_xpath(tree, lst, path):
    matches = tree.xpath(path)
    if matches:
        lst.append(matches[0].replace('$','').replace(',',''))
Then you would use it like this:
append_xpath(tree, opening, '//*[@id="a-page"]/main/div/div[3]/div[4]/div[2]/span[2]/span/text()')
append_xpath(tree, domestic, '//*[@id="a-page"]/main/div/div[3]/div[1]/div/div[1]/span[2]/span/text()')
...
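One side effect worth noting (my addition, not part of the original answer): if a missing value is simply skipped, the per-column lists drift out of alignment and the rows of the eventual dataframe no longer describe the same film. Appending a placeholder keeps every list the same length:
def append_xpath(tree, lst, path):
    matches = tree.xpath(path)
    # Append None when the element is missing so every list keeps the same length
    lst.append(matches[0].replace('$', '').replace(',', '') if matches else None)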
I have a list of Wikipedia pages related to some entities, and I want to select only geographical places and locations (cities, provinces, but also regions, mountains, rivers and so on).
I can easily select pages with coordinates, but this is not enough, since many places in Wikipedia are actually not associated with coordinates. I guess I should use labels from Wikidata, but I have never used them and I am a bit lost with the Python API. For example, if I use wptools:
import wptools
page = wptools.page('Indianapolis')
print(page.get_wikidata())
I obtain this:
www.wikidata.org (wikidata) Indianapolis
www.wikidata.org (labels) Q1000136|P1830|P421|Q1093829|P163|Q2579...
www.wikidata.org (labels) Q537853|P281|P949|Q2494513|Q3166162|Q18...
www.wikidata.org (labels) P1036|Q499547|P1997|P31|P17|P268|Q62049...
en.wikipedia.org (imageinfo) File:IndianapolisC12.png
Indianapolis (en) data
{
aliases: <list(10)> Circle City, Indy, Naptown, Crossroads of Am...
claims: <dict(61)> P1082, P227, P1151, P31, P17, P131, P163, P41...
description: <str(109)> city in and county seat of Marion County...
image: <list(1)> {'file': 'File:IndianapolisC12.png', 'kind': 'w...
label: Indianapolis
labels: <dict(145)> Q1000136, P1830, P421, Q1093829, P163, Q2579...
modified: <dict(1)> wikidata
requests: <list(5)> wikidata, labels, labels, labels, imageinfo
title: Indianapolis
what: county seat
wikibase: Q6346
wikidata: <dict(61)> population (P1082), GND ID (P227), topic's ...
wikidata_pageid: 7459
wikidata_url: https://www.wikidata.org/wiki/Q6346
}
How can I extract only the labels?
I suppose there exists something like a "this is a location" label, but how do I use it?
Thanks in advance
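Not a full answer, but a sketch of the direction I would try: after get_wikidata(), wptools puts the raw claims in page.data['claims'], and property P31 ("instance of") lists the Q-ids of the entity's types. Comparing those against a set of known "place" Q-ids can work as a first filter. The Q-ids below are ones I believe are correct (Q515 city, Q6256 country, Q8502 mountain, Q4022 river, Q486972 human settlement), but verify them; also, P31 often points to a more specific type, so in practice you may need to follow subclass-of (P279) chains, e.g. with a SPARQL query:
import wptools

# Example Q-ids that stand for "this is a kind of place" -- verify and extend as needed
PLACE_TYPES = {"Q515", "Q6256", "Q8502", "Q4022", "Q486972"}

def is_place(title):
    page = wptools.page(title, silent=True)
    page.get_wikidata()
    # P31 ("instance of") holds the Q-ids describing what this entity is
    instance_of = page.data.get('claims', {}).get('P31', [])
    return any(qid in PLACE_TYPES for qid in instance_of)

print(is_place('Indianapolis'))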
I am trying to scrape data from https://fortnitetracker.com/events/epicgames_S10_FNCS_Week5_NAE. Specifically, I am trying to get the placement and number of points earned by a specific player. I went to the website and found the instance where the specific player ("Nickmercs") was located in the HTML which looked like this:
[screenshot of the HTML snippet]
You can see the "rank" is shown above his name as 56, and the points are shown a few lines below his name, also 56. I then wrote the following Python 3 program to scrape the data from the website:
import requests

class tracker:
    url = "https://fortnitetracker.com/events/epicgames_S10_FNCS_Week5_NAE"

    def getReq(website):
        req = requests.get(website)
        if req:
            return req

    req = getReq(url)
    text = req.text
    index = text.find("nickmercs")
    split = text[index:index + 1000]
    print(split)
Running the program printed a large portion of the HTML code, but the instance of "Nickmercs" that it found was not the one I was looking for. The one shown in the picture of the HTML above is the actual first instance of the "Nickmercs" string on the page, but for some reason it was not in req.text / the response to my request. As a result, I went back and modified my code to print out where the first instance actually was, and found that the line was different from what was shown in the HTML picture. The line that was supposed to list the names "Nate Hill, Nickmercs, SypherPK" actually looked like this:
<span :style="{ 'color': '#' + metadata.primary_color }">{{ getPlayerNameList(entry.teamAccountIds, 4) }}</span>
I have little knowledge of how HTML works, so I am wondering if it is possible to fix this problem. It seems to be calling some (what I imagine is a) method called getPlayerNameList() which places the names in the correct spot, but makes it so I can't easily search the names / scrape the data. Is there a way to get around this? Any help is much appreciated!
The site is dynamic; thus, you need some way to access the data populated after the page originally loads. One such way is to use selenium:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://fortnitetracker.com/events/epicgames_S10_FNCS_Week5_NAE')
h, *r = [[i.text for i in b.find_all('th' if b.td is None else 'td')] for b in soup(d.page_source, 'html.parser').find('div', {'id':'leaderboard'}).table.find_all('tr')]
new_data = {tuple(b.split(', ')):dict(zip([h[0], *h[2:]], [a[1:-1], *c])) for a, b, *c in r}
Now, to look up a player by name:
data = [b for a, b in new_data.items() if 'Nickmercs' in a][0]
Output:
{'Rank': '56', 'Points': '56 Top 0.373%', 'Matches': '10', 'Wins': '0', 'K/D': '3.50', 'Avg Place': '16.10'}
For your specific target value (Rank):
rank = [b for a, b in new_data.items() if 'Nickmercs' in a][0]['Rank']
Output:
56
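One caveat worth adding to the selenium approach above (not part of the original answer): since the leaderboard is populated after the initial page load, it can help to wait explicitly for the table before reading page_source. A sketch, assuming the driver d from the snippet above:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 15 seconds for the leaderboard container before parsing the page source
WebDriverWait(d, 15).until(EC.presence_of_element_located((By.ID, 'leaderboard')))
html_source = d.page_source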
Data is dynamically rendered, but it is loaded from script tags in the page, so the content is present in the response. You can regex out the leaderboard/session info and the accounts info, and connect the two via account_id. You find the right account_id based on the player name of interest.
import requests, re, json
def get_json(pattern):
    p = re.compile(pattern, re.DOTALL)
    return p.findall(r.text)[0]
r = requests.get('https://fortnitetracker.com/events/epicgames_S10_FNCS_Week5_NAE')
player = 'Nickmercs'
session_info = json.loads(get_json('imp_leaderboard = (.*?);'))
player_info = json.loads(get_json('imp_accounts = (.*?);'))
account_id = [i['accountId'] for i in player_info if i['playerName'] == player][0]
team_info = [i for i in session_info['entries'] if account_id in i['teamId']]
print(team_info)
This gives you all the relevant info, including rank and points earned. Specific items:
print(team_info[0]['pointsEarned'])
print(team_info[0]['rank'])
You are scraping the raw HTML along with the JavaScript code, and it is not rendered.
For this task you could use computer vision to extract the table from the page.
Otherwise, you can use PhantomJS (https://phantomjs.org/) to scrape the table without using images, as it gives you the rendered page.
Following is the code and corresponding output to extract data about a particular job from Indeed.com. Along with the data I get a lot of junk, and I want to separate out the title, location, job description and other important features. How can I convert it into a dictionary?
from bs4 import BeautifulSoup
import urllib2
final_site = 'http://www.indeed.com/cmp/Pullskill-techonoligies/jobs/Data-Scientist-229a6b09c5eb6b44?q=%22data+scientist%22'
html = urllib2.urlopen(final_site).read()
soup = BeautifulSoup(html)
deep = soup.find("td","snip")
deep.get("p","ul")
deep.get_text(strip= True)
Output:
u'Title : Data ScientistLocation : Seattle WADuration : Fulltime / PermanentJob Responsibilities:Implement advanced and predictive analytics models usingJava,R, and Pythonetc.Develop deep expertise with Company\u2019s data warehouse, systems, product and other resources.Extract, collate and analyze data from a variety of sources to provide insights to customersCollaborate with the research team to incorporate qualitative insights into projects where appropriateKnowledge, Skills and Experience:Exceptional problem solving skillsExperience withJava,R, and PythonAdvanced data mining and predictive modeling (especially Machine learning techniques) skillsMust have statistics orientation (Theory and applied)3+ years of business experience in an advanced analytics roleStrong Python and R programming skills are required. SAS, MATLAB will be plusStrong SQL skills are looked for.Analytical and decisive strategic thinker, flexible problem solver, great team player;Able to effectively communicate to all levelsImpeccable attention to detail and very strong ability to convert complex data into insights and action planThanksNick ArthurLead Recruiternick(at)pullskill(dot)com201-497-1010 Ext: 106Salary: $120,000.00 /yearRequired experience:Java And Python And R And PHD Level Education: 4 years5 days ago-save jobwindow[\'result_229a6b09c5eb6b44\'] = {"showSource": false, "source": "Indeed", "loggedIn": false, "showMyJobsLinks": true,"undoAction": "unsave","relativeJobAge": "5 days ago","jobKey": "229a6b09c5eb6b44", "myIndeedAvailable": true, "tellAFriendEnabled": false, "showMoreActionsLink": false, "resultNumber": 0, "jobStateChangedToSaved": false, "searchState": "", "basicPermaLink": "http://www.indeed.com", "saveJobFailed": false, "removeJobFailed": false, "requestPending": false, "notesEnabled": true, "currentPage" : "viewjob", "sponsored" : false, "reportJobButtonEnabled": false};\xbbApply NowPlease review all application instructions before applying to Pullskill Technologies.(function(d, s, id){var js, iajs = d.getElementsByTagName(s)[0], iaqs = \'vjtk=1aa24enhqagvcdj7&hl=en_US&co=US\'; if (d.getElementById(id)){return;}js = d.createElement(s); js.id = id; js.async = true; js.src = \'https://apply.indeed.com/indeedapply/static/scripts/app/bootstrap.js\'; js.setAttribute(\'data-indeed-apply-qs\', iaqs); iajs.parentNode.insertBefore(js, iajs);}(document, \'script\', \'indeed-apply-js\'));Recommended JobsData Scientist, Energy AnalyticsRenew Financial-Oakland, CARenew Financial-5 days agoData ScientistePrize-Seattle, WAePrize-7 days agoData ScientistDocuSign-Seattle, WADocuSign-12 days agoEasily applyEngineer - US Citizen or Permanent ResidentVoxel Innovations-Raleigh, NCIndeed-8 days agoEasily applyData ScientistUnity Technologies-San Francisco, CAUnity Technologies-22 days agoEasily apply'
Find the job summary element, find all b elements inside, and split each b element's text on " : ":
for elm in soup.find("span", id="job_summary").p.find_all("b"):
    label, text = elm.get_text().split(" : ")
    print(label.strip(), text.strip())
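Since the end goal is a dictionary, the same loop can build one directly. A small variation on the answer above, assuming every b element's text really does contain " : ":
job_info = {}
for elm in soup.find("span", id="job_summary").p.find_all("b"):
    label, text = elm.get_text().split(" : ", 1)
    job_info[label.strip()] = text.strip()
print(job_info)  # e.g. {'Title': 'Data Scientist', 'Location': 'Seattle WA', ...}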
If your output always has the same structure you could use regex to create the dictionary.
import re

job_dict = {}
title_match = re.search(r'Title : (.+?)(?=Location)', output)
job_dict['Title'] = title_match.group(1)
location_match = re.search(r'Location : (.+?)(?=Duration)', output)
job_dict['Location'] = location_match.group(1)
Of course this is a pretty fragile solution and it would probably serve you better to use BeautifulSoup's in-built parsing to get the results you want, as I guess they are probably surrounded by standard tags.