I have a table of information below that contains 1000 such entries; the sum of the first column is approximately 90,000.
counts   genus species
14149    Marchantia polymorpha
9345     Physcomitrium patens
7744     Selaginella moellendorffii
5389     Picea sitchensis
For each of the 1000 entries I would like to look up the organism's Wikipedia page and extract its taxonomic grouping.
For example:
I look up Marchantia polymorpha on Wikipedia and find its page. On the right-hand side of the page is the scientific classification of the species. I would like to extract the value for Kingdom, i.e. Plantae in this case (amongst other ranks).
At the end of searching and extracting I would like a table that looks like this:
counts   genus species                kingdom
14149    Marchantia polymorpha        plantae
9345     Physcomitrium patens         plantae
7744     Selaginella moellendorffii   plantae
5389     Picea sitchensis             fungi
So that I can count the total number of entries belonging to each kingdom.
Collecting the URLs won't be difficult, since the base URL is the same and the target page simply appends the genus_species name. For example, the URLs for the list above would be:
https://en.wikipedia.org/wiki/Marchantia_polymorpha
https://en.wikipedia.org/wiki/Physcomitrium_patens
https://en.wikipedia.org/wiki/Selaginella_moellendorffii
https://en.wikipedia.org/wiki/Picea_sitchensis
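For instance, a minimal sketch of the URL construction, assuming the names sit in a list called species (a hypothetical name):

species = ['Marchantia polymorpha', 'Physcomitrium patens',
           'Selaginella moellendorffii', 'Picea sitchensis']
base_url = 'https://en.wikipedia.org/wiki/'
urls = [base_url + name.replace(' ', '_') for name in species]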
I am not 100% certain that a Wikipedia page exists for every genus/species, but the majority do, based on manual searches I've done previously.
You can use this, but I am not sure it will work on all Wikipedia pages:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Marchantia_polymorpha'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

# The kingdom is in a table row:
tr = soup.find_all('tr')
for t in tr:
    if 'kingdom' in t.text.lower():
        text = t.text
        break
text.replace('\n', '').split(':')[1]
>>> 'Plantae'

# Or, since both Kingdom and Plantae are in a <td>:
td = soup.find_all('td')
for i, t in enumerate(td):
    if 'kingdom' in t.text.lower():
        kingdom = td[i+1].text.rstrip()  # the cell after Kingdom holds Plantae (i+1)
print(kingdom)
>>> 'Plantae'
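Tying the pieces together, a rough sketch of the full loop might look like this; the entries list is a stand-in for the real 1000-row table, and the status-code check is an assumption for handling names without a page:

import requests
from bs4 import BeautifulSoup

def get_kingdom(genus_species):
    # build the page URL from the genus/species name
    url = 'https://en.wikipedia.org/wiki/' + genus_species.replace(' ', '_')
    res = requests.get(url)
    if res.status_code != 200:
        return None  # assume no Wikipedia page exists for this name
    soup = BeautifulSoup(res.content, 'html.parser')
    td = soup.find_all('td')
    for i, t in enumerate(td):
        if 'kingdom' in t.text.lower():
            return td[i + 1].text.strip()
    return None  # page exists but no classification box was found

# `entries` stands in for the real table of (count, genus species) pairs
entries = [(14149, 'Marchantia polymorpha'), (9345, 'Physcomitrium patens')]
rows = [(count, name, get_kingdom(name)) for count, name in entries]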
I need to collect data on the countries where artists are streamed most frequently on Spotify. To do that, I am using this source, which contains a list of 10,000 artists.
So the aim of my code is to create a table with two columns:
artist name;
country where the artist is streamed the most.
I wrote the code below that gets this information from each artist's personal page (here is an example for Drake). An artist's name is taken from the page title, and the country code from the table column heading that follows the column titled "Global". For some artists there is no column titled "Global", and I need to account for this condition. And here is where my problem comes in.
I am using the following if-condition:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[4].text
else:
Country = soup2.find_all('table')[0].find_all('th')[5].text
country.append(Country)
But only the first branch is executed, where the code extracts the text from the 4th column. Alternatively, I tried the reverse condition:
if "<th>Global</th>" in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[5].text
else:
Country = soup2.find_all('table')[0].find_all('th')[4].text
country.append(Country)
But the code still extracts the text from the 4th column, even though I want it to extract the text from the 5th column when the 4th column is titled "Global".
This reproducible code is run for a subset of artists, for some of whom there is a column titled "Global" (e.g. LANY) and for some of whom there is none (e.g. Henrique & Diego) (#391 to #395 as of June 16, 2019):
from time import sleep
from random import randint
from requests import get
from bs4 import BeautifulSoup as bs
import pandas as pd

# headers was not defined in the original snippet; any browser-like
# User-Agent works here
headers = {'User-Agent': 'Mozilla/5.0'}

response1 = get('https://kworb.net/spotify/artists.html', headers=headers)
soup1 = bs(response1.text, 'html.parser')
table = soup1.find_all('table')[0]
rows = table.find_all('tr')[391:396]  # selected subset of the 10,000 artists

artist = []
country = []

for row in rows:
    artist_url = row.find('a')['href']
    response2 = get('https://kworb.net/spotify/' + artist_url)
    sleep(randint(8, 15))
    soup2 = bs(response2.text, 'html.parser')
    Artist = soup2.find('title').text[:-24]
    artist.append(Artist)
    if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):  # problem suspected in this if-condition
        Country = soup2.find_all('table')[0].find_all('th')[4].text
    else:
        Country = soup2.find_all('table')[0].find_all('th')[5].text
    country.append(Country)

df = pd.DataFrame({'Artist': artist,
                   'Country': country})
print(df)
As a result, I get the following:
Artist Country
0 YNW Melly Global
1 Henrique & Diego BR
2 LANY Global
3 Parson James Global
4 ANAVITÃRIA BR
While the actual output, as of June 16, 2019, should be:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR
I suspect the if-condition for the variable Country is wrong. I would appreciate any help with that.
You are comparing bs4 Tag objects with a string. You first need to get the text from each found object and then compare it with the string:
replace:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
with:
# get text options from html
found_options = [item.text for item in soup2.find_all('table')[0].find_all('th')]
if "Global" not in found_options:
Output:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR
Apologies in advance for the long question. I am new to Python and I'm trying to be as explicit as I can about a fairly specific situation.
I am trying to identify specific data points from SEC filings on a routine basis; however, I want to automate this instead of having to manually search for a company's CIK ID and form filing. So far, I have been able to get to a point where I am downloading metadata about all filings received by the SEC in a given time period. It looks like this:
index cik conm type date path
0 0 1000045 NICHOLAS FINANCIAL INC 10-Q 2019-02-14 edgar/data/1000045/0001193125-19-039489.txt
1 1 1000045 NICHOLAS FINANCIAL INC 4 2019-01-15 edgar/data/1000045/0001357521-19-000001.txt
2 2 1000045 NICHOLAS FINANCIAL INC 4 2019-02-19 edgar/data/1000045/0001357521-19-000002.txt
3 3 1000045 NICHOLAS FINANCIAL INC 4 2019-03-15 edgar/data/1000045/0001357521-19-000003.txt
4 4 1000045 NICHOLAS FINANCIAL INC 8-K 2019-02-01 edgar/data/1000045/0001193125-19-024617.txt
Despite having all this information, as well as being able to download these text files and see the underlying data, I am unable to parse this data because it is in XBRL format and a bit out of my wheelhouse. Instead I came across this script (kindly provided on this site: https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python):
from bs4 import BeautifulSoup
import requests
import sys

# Access page
cik = '0000051143'
type = '10-K'
dateb = '20160101'

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text

# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text

# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text

# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholder's equity: " + tag.text)
Just running this script works exactly how I'd like it to. It returns the stockholders' equity for a given company (IBM in this case), and I can then take that value and write it to an Excel file.
My two-part question is this:
I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote them to a list of tuples (I think that's what it's called); it looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, replace the hard-coded values at the start of the script I found, and loop through it efficiently so I end up with a list of desired values for each company, filing, and date?
Is there generally a better way to do this? I would think there would be some sort of API or Python package for querying the data I'm interested in. I know there is some high-level information out there for Form 10-Ks and Form 10-Qs; however, I am working with Form Ds, which are somewhat obscure. I just want to make sure I am spending my time effectively on the best possible solution.
Thank you for the help!
You need to define a function, which can be essentially most of the code you have posted, and that function should take 3 keyword arguments (your 3 values). Then, rather than hard-coding the three values in your code, you just pass them in and return a result.
Then you take the list you created and write a simple for loop around it to call the function you defined with those three values, and then do something with the result.
def get_data(value1, value2, value3):
    # your main code here, but using the arguments above
    return content

for value1, value2, value3 in companies:
    content = get_data(value1, value2, value3)
    # do something with content
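Concretely, a sketch of that wrapper might look like this (the function only fetches the search page here; the parsing steps from the script above would replace the placeholder line):

import requests

def get_data(cik, form_type, dateb):
    # build the EDGAR search URL from the three values
    base_url = ("https://www.sec.gov/cgi-bin/browse-edgar"
                "?action=getcompany&CIK={}&type={}&dateb={}")
    edgar_resp = requests.get(base_url.format(cik, form_type, dateb))
    content = edgar_resp.text  # placeholder: the parsing steps go here
    return content

filings = [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206')]
results = [get_data(c, t, d) for c, t, d in filings]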
Assuming you have a dataframe sec with correctly named columns for your list of filings above, you first need to extract the relevant information from the dataframe into three lists:
cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)
Then you create your base_url, with the items inserted and get your data:
for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)
And go from there.
I am scraping company data from the Wikipedia infobox table, where I need to scrape some values that sit inside td elements, like Type, Traded as, Services, etc.
My code is:
response = requests.get(url, headers=headers)
html_soup = BeautifulSoup(response.text, 'lxml')
table_container = html_soup.find('table', class_='infobox')
hq_name = table_container.find("th", text=['Headquarters']).find_next_sibling("td")
It gives the headquarters value and works perfectly.
But when I try to fetch 'Traded as' or any th element that contains a hyperlink, the above code does not work; it returns None.
So how do I get the next sibling of 'Traded as' or 'Type'?
From your comment:
https://en.wikipedia.org/wiki/IBM This is the URL, and the expected output will be Trade as- NYSE: IBM DJIA Component S&P 100 Component S&P 500 Component
Use the a tags to separate the items, and select the required row from the table by nth-of-type. You can then join the first two items in the output list if required.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://en.wikipedia.org/wiki/IBM')
soup = bs(r.content, 'lxml')
items = [item.text.replace('\xa0',' ') for item in soup.select('.vcard tr:nth-of-type(4) a')]
print(items)
To get the output as shown (if indeed the first and second items are joined):
final = items[2:]
final.insert(0, '-'.join([items[0] , items[1]]))
final
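Alternatively, you can keep the find_next_sibling approach from the question and match the th by its rendered text; the original text=['Headquarters'] match fails for 'Traded as' because that th contains child tags (links). A sketch, changing only the matcher:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/IBM')
html_soup = BeautifulSoup(r.text, 'lxml')
table_container = html_soup.find('table', class_='infobox')

# match on the th's rendered text, which works even when the label
# is wrapped in links or other child tags
th = table_container.find(lambda tag: tag.name == 'th'
                          and 'Traded as' in tag.get_text())
traded_as = th.find_next_sibling('td').get_text(' ', strip=True)
print(traded_as)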
I am scraping from the following page: https://kenpom.com/index.php?y=2018
The page shows a list of every Division 1 college basketball team. Each row is for one team. I want to assign every team row to a variable called "teams". The problem is that after every 40 teams there are two header rows that I do not want to include. These rows are unique in that they have a class of "thead1" and "thead2". The rows that I want to grab have a class of None or "bold-bottom". So essentially I need to iterate through every tr element in that table and grab any that has a class of None or "bold-bottom". My attempt below does not work: it returns a count of 35 when it should be 353.
import requests
from bs4 import BeautifulSoup
url ='https://kenpom.com/index.php?y=2018'
r = requests.get(url).text
soup = BeautifulSoup(r, 'lxml')
table = soup.find('table',{'id':'ratings-table'}).tbody
teams = table.findAll('tr',attrs = {'class':(None or 'bold-bottom')})
print(len(teams))
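Note that (None or 'bold-bottom') evaluates to just 'bold-bottom' before BeautifulSoup ever sees it, which is why only the 35 bold-bottom rows come back. One way to express "class is None or 'bold-bottom'" is a callable filter; a sketch based on the code above:

teams = table.find_all(
    lambda tag: tag.name == 'tr'
    and (tag.get('class') is None or 'bold-bottom' in tag.get('class'))
)
print(len(teams))  # now skips only the thead1/thead2 rows; should be 353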
I want to fetch the stock price from the website http://www.bseindia.com/.
For example, the stock price appears as "S&P BSE: 25,489.57". I want to fetch the numeric part of it, as "25489.57".
This is the code I have written so far. It fetches the entire div in which this amount appears, but not the amount.
Below is the code:
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = "http://www.bseindia.com"
html_page = urlopen(page)
html_text = html_page.read()
soup = BeautifulSoup(html_text, "html.parser")
divtag = soup.find_all("div", {"class": "sensexquotearea"})
for oye in divtag:
    tdidTags = oye.find_all("div", {"class": "sensexvalue2"})
    for tag in tdidTags:
        tdTags = tag.find_all("div", {"class": "newsensexvaluearea"})
        for newtag in tdTags:
            tdnewtags = newtag.find_all("div", {"class": "sensextext"})
            for rakesh in tdnewtags:
                tdtdsp1 = rakesh.find_all("div", {"id": "tdsp"})
                for texts in tdtdsp1:
                    print(texts)
I had a look at what is going on when that page loads the information, and I was able to simulate what the JavaScript is doing in Python.
I found out it is referencing a page called IndexMovers.aspx?ln=en; check it out here.
It looks like this page is a comma-separated list of things. First comes the name, next comes the price, and then a couple of other things you don't care about.
To simulate this in Python, we request the page, split it by commas, then read every 6th value in the list, adding that value and the value right after it to a new list called stockInformation.
Now we can just loop through stockInformation and get the name with item[0] and the price with item[1]:
import requests

newUrl = "http://www.bseindia.com/Msource/IndexMovers.aspx?ln=en"
response = requests.get(newUrl).text
commaItems = response.split(",")

# create a list of stocks, each one containing information:
# index 0 is the name, index 1 is the price
# the last item is not included because for some reason it has no price info on the IndexMovers page
stockInformation = []
for i, item in enumerate(commaItems[:-1]):
    if i % 6 == 0:
        newList = [item, commaItems[i+1]]
        stockInformation.append(newList)

# print each item and its price from the list
for item in stockInformation:
    print(item[0], "has a price of", item[1])
This prints out:
S&P BSE SENSEX has a price of 25489.57
SENSEX#S&P BSE 100 has a price of 7944.50
BSE-100#S&P BSE 200 has a price of 3315.87
BSE-200#S&P BSE MidCap has a price of 11156.07
MIDCAP#S&P BSE SmallCap has a price of 11113.30
SMLCAP#S&P BSE 500 has a price of 10399.54
BSE-500#S&P BSE GREENEX has a price of 2234.30
GREENX#S&P BSE CARBONEX has a price of 1283.85
CARBON#S&P BSE India Infrastructure Index has a price of 152.35
INFRA#S&P BSE CPSE has a price of 1190.25
CPSE#S&P BSE IPO has a price of 3038.32
#and many more... (total of 40 items)
This is clearly equivalent to the values shown on the page.
So there you have it: you can simulate exactly what the JavaScript on that page is doing to load the information. In fact, you now have even more information than was shown to you on the page, and the request is faster because we are downloading just the data, not all that extraneous HTML.
If you look into the source code of your page (e.g. by storing it in a file and opening it with an editor), you will see that the actual stock price 25,489.57 does not show up directly. The price is not in the stored HTML code but is loaded in a different way.
You could use the linked page where the numbers show up:
http://www.bseindia.com/sensexview/indexview_new.aspx?index_Code=16&iname=BSE30
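A minimal sketch of that approach; the right selector depends on that page's markup, which would need inspecting first, so none is assumed here:

import requests
from bs4 import BeautifulSoup

url = ('http://www.bseindia.com/sensexview/'
       'indexview_new.aspx?index_Code=16&iname=BSE30')
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
# inspect the parsed page to locate the element holding the index value
print(soup.get_text()[:500])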