GOAL
(I need to repeat this search hundreds of times):
1. Search (e.g. "WP_000177210.1") in "https://www.ncbi.nlm.nih.gov/ipg/"
(i.e. https://www.ncbi.nlm.nih.gov/ipg/?term=WP_000177210.1)
2. Select the first record in the second column "CDS Region in Nucleotide" of the table
(i.e. " NC_011415.1 1997353-1998831 (-)", https://www.ncbi.nlm.nih.gov/nuccore/NC_011415.1?from=1997353&to=1998831&strand=2)
3. Select "FASTA" under the name of this sequence
4. Get the fasta sequence
(i.e. ">NC_011415.1:c1998831-1997353 Escherichia coli SE11, complete sequence
ATGACTTTATGGATTAACGGTGACTGGATAACGGGCCAGGGCGCATCGCGTGTGAAGCGTAATCCGGTAT
CGGGCGAG.....").
CODE
1. Search (e.g. "WP_000177210.1") in "https://www.ncbi.nlm.nih.gov/ipg/"
import requests
from bs4 import BeautifulSoup
url = "https://www.ncbi.nlm.nih.gov/ipg/"
r = requests.get(url, params={"term": "WP_000177210.1"})
if r.status_code == requests.codes.ok:
    soup = BeautifulSoup(r.text, "lxml")
2. Select the first record in the second column "CDS Region in Nucleotide" of the table (In this case " NC_011415.1 1997353-1998831 (-)") (i.e. https://www.ncbi.nlm.nih.gov/nuccore/NC_011415.1?from=1997353&to=1998831&strand=2)
# try 1 (wrong)
## I tried this first, but it seemed to only reach the top-level hrefs?!
for a in soup.find_all('a', href=True):
    if a['href'][:8] == "/nuccore":
        print("Found the URL:", a['href'])
# try 2 (not sure how to access nested href)
## According to the labels I saw in the Developer Tools, I think I need to get the href from the following nested structure. However, it didn't work.
soup.select("html div #maincontent div div div #ph-ipg div table tbody tr td a")
I am stuck at this step right now.
PS
It's my first time dealing with HTML, and it's also my first time asking a question here. I might not have phrased the problem very well. If there's anything wrong, please let me know.
Without using NCBI's REST API,
import time
from bs4 import BeautifulSoup
from selenium import webdriver
# Opens a Firefox web browser for scraping purposes
browser = webdriver.Firefox(executable_path=r'your\path\geckodriver.exe') # Put your own path here
# Allows you to load a page completely (with all of the JS)
browser.get('https://www.ncbi.nlm.nih.gov/ipg/?term=WP_000177210.1')
# Delay turning the page into a soup in order to collect the newly fetched data
time.sleep(3)
# Creates the soup
soup = BeautifulSoup(browser.page_source, "html.parser")
# Collect every href that contains '/nuccore', skipping the bare '/nuccore' link itself
links = [a['href'] for a in soup.find_all('a', href=True) if '/nuccore' in a['href'] and not a['href'] == '/nuccore']
Note:
You'll need the package selenium
You'll need to install GeckoDriver
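To finish steps 3-4 (getting the FASTA itself), the simplest route is NCBI's efetch endpoint rather than scraping the nuccore page, so this part does use the E-utilities REST API after all. A minimal sketch, continuing from the links list above and assuming the collected hrefs look like the example in the question (/nuccore/NC_011415.1?from=1997353&to=1998831&strand=2); double-check the efetch parameter names against the E-utilities documentation:
from urllib.parse import urlparse, parse_qs
import requests

first = links[0]                           # e.g. '/nuccore/NC_011415.1?from=1997353&to=1998831&strand=2'
parsed = urlparse(first)
accession = parsed.path.split('/')[-1]     # 'NC_011415.1'
qs = parse_qs(parsed.query)                # {'from': ['1997353'], 'to': ['1998831'], 'strand': ['2']}

fasta = requests.get(
    'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi',
    params={
        'db': 'nuccore',
        'id': accession,
        'rettype': 'fasta',
        'retmode': 'text',
        'seq_start': qs['from'][0],
        'seq_stop': qs['to'][0],
        'strand': qs['strand'][0],         # 2 = minus strand, matching the (-) in the IPG table
    },
).text
print(fasta[:100])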
Related
I am trying to scrape data from the AGMARKNET website. The tables are split into 11 pages, but all of the pages use the same url. I am very new to web scraping (and to Python in general), but AGMARKNET does not have a public API, so scraping the page seems to be my only option. I am currently using BeautifulSoup to parse the HTML and I am able to scrape the initial table, but that only contains the first 500 data points; I want the data from all 11 pages. I am stuck and frustrated. The link and my current code are below. Any direction would be helpful, thank you.
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
response = requests.get(url)
# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')
# grab the results table(s) from the page (adjust this selector to the actual data table)
stat_table = soup.find_all('table')
# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]
# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
    cells = []
    tds = tr.find_all('td')
    if len(tds) == 0:
        ths = tr.find_all('th')
        for th in ths:
            cells.append(th.text.strip())
    else:
        for td in tds:
            cells.append(td.text.strip())
    rows.append(cells)
# convert table to df
table = pd.DataFrame(rows)
The website you linked to seems to be using JavaScript to navigate to the next page. The requests library only fetches static HTML and BeautifulSoup only parses it, so neither can run JavaScript.
Instead of using them, you should try something like Selenium that actually simulates a full browser environment (including HTML, CSS, etc.). In fact, Selenium can even open a full browser window so you can see it in action as it navigates!
Here is a quick sample code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
# If you prefer Chrome to Firefox, there is a driver available
# for that as well
# Set the URL
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
# Start the browser
opts = Options()
driver = webdriver.Firefox(options=opts)
driver.get(url)
Now you can use functions like driver.find_element(...) and driver.find_elements(...) to extract the data you want from this page, the same way you did with BeautifulSoup.
For your given link, the page number navigators seem to be running a function of the form,
__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')
...replacing Page$2 with Page$3, Page$4, etc. depending on which page you want. So you can use Selenium to run that JavaScript function when you're ready to navigate.
driver.execute_script("__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')")
A more generic solution is to just select which button you want and then run that button's click() function. General example (not necessarily for the current website):
btn = driver.find_element('id', 'next-button')
btn.click()
A final note: after the button is clicked, you might want to time.sleep(...) for a little while to make sure the page is fully loaded before you start processing the next set of data.
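Putting that together, here is a rough end-to-end sketch (not tested against the live site; the GridView id and the 11-page count come from the question, and the index of the parsed table is an assumption you should verify by inspecting the page):
import time
import pandas as pd
from selenium import webdriver

url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
driver = webdriver.Firefox()
driver.get(url)

frames = []
for page in range(1, 12):                  # the data is split across 11 pages
    if page > 1:
        # fire the same postback the pager links use
        driver.execute_script("__doPostBack('ctl00$cphBody$GridViewBoth','Page$%d')" % page)
        time.sleep(3)                      # crude wait for the postback to finish loading
    # read_html parses every <table> in the page source; index 0 is an assumption,
    # pick whichever table turns out to hold the data when you inspect the result
    frames.append(pd.read_html(driver.page_source)[0])

full_table = pd.concat(frames, ignore_index=True)
print(full_table.shape)
driver.quit()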
So I'm stuck here. I'm a doctor so my programming background and skills are close to none and most likely that's the problem. I'm trying to learn some basics about Python and for me, the best way is by doing stuff.
The project:
scrape the cover images from several books
Some of the links used:
http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html
http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html
http://coleccaoargonauta.blogspot.com/2011/09/n-3-ultima-cidade-da-terra.html
http://coleccaoargonauta.blogspot.com/2011/09/n-4-nave-sideral.html
http://coleccaoargonauta.blogspot.com/2011/09/n-5-o-universo-vivo.html
That website structure is messed up.
The links are located inside a div with class "post-title entry-title", which in turn has two or more "separator" class divs that can have content or be empty.
What I can tell so far is that 95% of the time, what I want is the last two links in the first two "separator" class divs. And for this stage that's good enough.
My code so far is as follow:
#intro
import requests
from bs4 import BeautifulSoup

# url of the post to scrape (assumed: the second link above, which matches the example output below)
url = "http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
#select the first two 'separator' divs
separador = soup.select("div.separator")[:2]
#we need a title for each page - for debugging and later used to rename images
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print(m)
#the find all links loop
for div in separador:
    imagens = div.find_all('a')
    for link in imagens:
        print(link['href'], '\n')
What I can do right now:
I can print the right URLs, and I can then use wget to download and rename the files. However, I only want the last two links from the results, and that is the only thing missing in my google-fu. I think the problem is in the way BeautifulSoup returns results (a ResultSet) and my lack of knowledge of things such as lists. If the first "separator" has one link and the second has two links, I get a list with two items (and the second item is two links), hence not sliceable.
Example output
2-O Estranho Mundo de Kilsona.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg
http://4.bp.blogspot.com/-D0cUIP8PkEU/UPfbByjSuII/AAAAAAAAB0E/LP6kbIEJ_eI/s1600/Argonauta002.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg
But I wanted it to be
2-O Estranho Mundo de Kilsona.jpg
http://4.bp.blogspot.com/-D0cUIP8PkEU/UPfbByjSuII/AAAAAAAAB0E/LP6kbIEJ_eI/s1600/Argonauta002.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg
Can anyone shed some light on this?
The issue is due to the line imagens = div.find_all('a') being called within a loop. This creates a list of lists. As such, we need to find a way to flatten them into a single list. I do this below with merged_list = [] followed by [merged_list.extend(lst) for lst in imagens].
From here I create a new list with just the links and then dedupe it using set (a set is a useful data structure when you don't want duplicated data). I then turn it back into a list, and the rest is your code.
import requests
from bs4 import BeautifulSoup
link1 = "http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html"
link2 = "http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html"
link3 = "http://coleccaoargonauta.blogspot.com/2011/09/n-3-ultima-cidade-da-terra.html"
link4 = "http://coleccaoargonauta.blogspot.com/2011/09/n-4-nave-sideral.html"
link5 = "http://coleccaoargonauta.blogspot.com/2011/09/n-5-o-universo-vivo.html"
#intro
r=requests.get(link2)
soup = BeautifulSoup(r.content, 'lxml')
#select the first two 'separator' divs
separador = soup.select("div.separator")[:2]
#we need a title for each page - for debugging and later used to rename images
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)
imagens = [div.find_all('a') for div in separador]
merged_list = []
[merged_list.extend(lst) for lst in imagens]
link_list = [link['href'] for link in merged_list]
deduped_list = list(set(link_list))
for link in deduped_list:
    print(link, '\n')
You can use CSS selectors to extract the image links directly from the divs with class separator (see the BeautifulSoup documentation on CSS selectors).
I also use a list comprehension instead of a for loop.
Below is a working example for the first url from your list.
import requests
from bs4 import BeautifulSoup
#intro
url = "http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html"
r=requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
#we need a title for each page - for debugging and later used to rename images
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)
#find all links
links = [link['href'] for link in soup.select('.separator a')]
print(links)
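Since the question only wants the last two links (and the set() in the previous answer does not preserve order), here is a small hedged variation: restrict the selection to the first two separator divs as in the original code, flatten in document order, and slice off the last two hrefs, which matches the desired output shown above.
import requests
from bs4 import BeautifulSoup

url = "http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html"
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# first two 'separator' divs, as in the question, flattened in document order
links = [a['href'] for div in soup.select("div.separator")[:2] for a in div.select("a[href]")]
print(links[-2:])   # keep only the last two links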
I'm sure this may have been asked in the past, but I am attempting to parse a website (and hopefully to eventually automate parsing multiple websites at once) and it's not working properly. I may be having issues grabbing the appropriate tags or something, but essentially I want to go to this website, pull off all of the items from the lists (possibly with hrefs intact, or in a separate document), and stick them into a file that I can share in an easy-to-read format. So far this is my code:
url = "http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/" `
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
results = soup.find_all('div', class_"tab_content")
for element in results:
    title_elem = element.find('h1')
    h2_elem = element.find('h2')
    h3_elem = element.find('h3')
    href_elem = element.find('href')
    if None in (title_elem, h2_elem, h3_elem, href_elem):
        continue
    print(title_elem.text.strip())
    print(h2_elem.text.strip())
    print(h3_elem.text.strip())
    print(href_elem.text.strip())
    print()
I even attempted to write this for a table, but I get the same type of output, which is a bunch of empty elements:
for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        print(subtable)
Does anyone have any insight as to why this may be the case? I would also not be opposed to regex parsing, but the main goal here is to go into this site (and hopefully others like it) and write the entire table/lists/descriptions of the individual programs for each major into an easy-to-read file.
A similar approach, in that I also combine bs4 with pandas, but here I test for the presence of the hyperlink class.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('.sc_courselist'):
    tbl = pd.read_html(str(table))[0]
    links_column = ['http://catalog.apu.edu' + i.select_one('.bubblelink')['href'] if i.select_one('.bubblelink') is not None else '' for i in table.select('td:nth-of-type(1)')]
    tbl['Links'] = links_column
    print(tbl)
With BeautifulSoup, an alternative to find/find_all is select_one/select. The latter two apply CSS selectors, with select_one returning the first match for the CSS selector passed in and select returning a list of all matches. "." is a class selector, meaning it selects elements with the specified class, e.g. sc_courselist or bubblelink. bubblelink is the class of the elements with the desired hrefs. These sit within the first column of each table, which is selected using td:nth-of-type(1).
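To make that concrete, here is a tiny toy example (the HTML is invented, but the class names match the ones discussed above):
from bs4 import BeautifulSoup

html = """
<table class="sc_courselist">
  <tr><td><a class="bubblelink" href="/search/?P=MATH%20120">MATH 120</a></td><td>Calculus I</td></tr>
  <tr><td>Elective</td><td>Any approved course</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'lxml')

print(soup.select('.sc_courselist'))                          # every element with that class (a list)
print(soup.select_one('.bubblelink')['href'])                 # first match only: '/search/?P=MATH%20120'
print([td.text for td in soup.select('td:nth-of-type(1)')])   # first cell of each row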
Hello, I'm a beginner in Python and I'm confused about where to go from here.
How do I interact with/click a specific link after using bs4 to search for it? In my script I use bs4 to search through the clickable links on the webpage for specific keywords, so that I click on the right product.
import re
from bs4 import BeautifulSoup
from selenium import webdriver

onSale = False  # Set while loop variable
shoeName = 'moon'  # Set keyword to look for
browser = webdriver.Chrome()
browser.get(r'https://www.nike.com/launch/?s=upcoming')  # Get URL to scan
soup = BeautifulSoup(browser.page_source, 'html.parser')  # Convert the page source to a soup object
while onSale is False:  # Create loop to keep checking inventory
    for link in soup.find_all('a', class_=r'card-link d-sm-b'):
        shoeCode = str((link.get('href', None), link.get_text()))
        compareName = re.sub(r"[^\w]", " ", shoeCode.lower()).split()  # Turns the link into a list of lowercase words
        if shoeName in compareName:  # Checks to see if the keyword is used
            pass  # Interact/Click link (this is the part I'm stuck on)
        else:
            print(shoeCode)
            continue
Once the proper link has been found, how do I use it to interact with the website? Do I use selenium, urllib, and/or requests? Thanks!
You can use selenium to click the links (the selenium documentation shows how to click elements). Or, after fetching the page with requests (forget urllib) and extracting the urls with bs4, you can requests.get('your_example_url') and parse the results again.
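A minimal sketch of the selenium route (the href value here is hypothetical, standing in for whatever your bs4 loop matched):
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.nike.com/launch/?s=upcoming')

matched_href = '/launch/t/some-moon-shoe'   # hypothetical: the href your bs4 loop found

# Option 1: just navigate straight to the product page
browser.get('https://www.nike.com' + matched_href)

# Option 2: go back to the listing, locate the rendered <a> element and click it
browser.get('https://www.nike.com/launch/?s=upcoming')
link_element = browser.find_element(By.CSS_SELECTOR, 'a[href="%s"]' % matched_href)
link_element.click()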
I'm trying to parse the Taobao website and get information about goods (photo, text and link) with BeautifulSoup.find, but it doesn't find all classes.
import requests

url = 'https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'

def get_html(url):
    r = requests.get(url)
    return r.text
html=get_html(url)
soup=BeautifulSoup(html, 'lxml')
z=soup.find("div",{"class":"J_TItems"})
z is empty.
but for example:
z=soup.find("div",{"class":"skin-box-bd"})
len(z)
Out[196]: 3
works fine
Why doesn't this approach work? What should I do to get all the information about the goods? I am using Python 2.7.
So, it looks like the items you want to parse are being built dynamically by JavaScript; that's why soup.text.find("J_TItems") returns -1, i.e. there's no "J_TItems" at all in the text. What you can do is use selenium with a JS interpreter; for headless browsing you can use PhantomJS like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url='https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html5lib') # I'd also recommend using html5lib
JTitems = soup.find("div", attrs={"class":"J_TItems"})
Note the items you want are inside each row defined by <div class="item4line1">, and there are 5 of them (you may only want the first three, because the other two are not really inside the main search; filtering them out should not be difficult, a simple rows = rows[2:] does the trick):
rows = JTitems.findAll("div", attrs={"class":"item4line1"})
>>> len(rows)
5
Now notice each "Good" you mention in the question is inside a <dl class="item">, so you need to get them all in a for loop:
Goods = []
for row in rows:
    for item in row.findAll("dl", attrs={"class":"item"}):
        Goods.append(item)
All that's left to do is to get the "photo, text and link" you mentioned, and this can easily be done by accessing each item in the Goods list. By inspection you can work out how to get each piece of information; for example, for the picture url a simple one-liner would be:
>>> Goods[0].find("dt", class_='photo').a.img["src"]
'//img.alicdn.com/bao/uploaded/i3/TB19Fl1SpXXXXbsaXXXXXXXXXXX_!!0-item_pic.jpg_180x180.jpg'
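If you also want the text and link for each good, here is a hedged continuation (the exact markup is an assumption on my part; confirm the tag and class names in your browser's developer tools, since Taobao's markup changes often):
for item in Goods:
    photo_anchor = item.find("dt", attrs={"class": "photo"}).a
    photo = photo_anchor.img["src"]           # picture url, as in the example above
    link = photo_anchor.get("href")           # assumed: the photo anchor also links to the item page
    text = item.get_text(" ", strip=True)     # all visible text inside the <dl class="item">
    print(photo, link, text)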