Hello everyone, I am facing a situation where I need to locate and remove rows from a table on a website (https://jobassam.in/indian-army-recruitment-rally-mariani-jorhat/). Here is a screenshot for clarity:
How do I remove every <tr> below the text OUR SOCIAL CONNECTIONS, including that text itself? This is the HTML structure of the table:
The arrow in the second image points to where the text is present. Also, this is the code I wrote to locate these elements and delete them:
getDetails = soup2.find('div', class_='entry-content single-content')
getTables = getDetails.find_all('table')
for i in getTables:
    print(i.tr)
    i.decompose()
But the issue is that these elements are still there. How do I achieve this? Please guide. Thanks
You can simply find the index of the row with the text you want to remove and start decomposing from there.
So first let's get the table we're interested in:
working_table = next((table for table in getTables if table.text.find("OUR SOCIAL CONNECTIONS") != -1), None)
Then let's iterate over all the tr elements in it to find the index of the item to cut off at:
tables_itens = working_table.find_all('tr')
cut_line = [tables_itens.index(i) for i in tables_itens if i.text == "OUR SOCIAL CONNECTIONS"][0]
Then we start decomposing from there:
for tr in tables_itens:
    if tables_itens.index(tr) >= cut_line:
        tr.decompose()
We can simply check this with:
for i in tables_itens:
    print(i.text)
And it will output:
Download Official NotificationClick Here
Online Application LinkClick Here
Visit Official WebsiteClick Here
Full code:
getDetails = soup2.find('div', class_='entry-content single-content')
getTables = getDetails.find_all('table')
working_table = next((table for table in getTables if table.text.find("OUR SOCIAL CONNECTIONS") != -1), None)
tables_itens = working_table.find_all('tr')
cut_line = [tables_itens.index(i) for i in tables_itens if i.text == "OUR SOCIAL CONNECTIONS"][0]
for tr in tables_itens:
    if tables_itens.index(tr) >= cut_line:
        tr.decompose()
for i in tables_itens:
    print(i.text)
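As an aside, the same cut can be done without index bookkeeping by walking the siblings of the matching row. A minimal sketch, assuming the marker row's stripped text is exactly "OUR SOCIAL CONNECTIONS":
# find the marker row, then decompose it and every row after it
marker = working_table.find(
    lambda tag: tag.name == 'tr' and tag.get_text(strip=True) == 'OUR SOCIAL CONNECTIONS'
)
if marker is not None:
    for tr in marker.find_next_siblings('tr'):
        tr.decompose()
    marker.decompose()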
Hi, I am currently having issues using bs4 and regex to find information within the HTML, as the values are contained within : (JSON inside an attribute) and not = like I am used to.
<div data-react-cache-id="ListItemSale-0" data-react-class="ListItemSale" data-react-props='{"imageUrl":"https://laced.imgix.net/products/aa0ff81c-ec3b-4275-82b3-549c819d1404.jpg?w=196","title":{"label":"Air Jordan 1 Mid Madder Root GS","href":"/products/air-jordan-1-mid-madder-root-gs"},"contentCount":3,"info":"UK 4.5 | EU 37.5 | US 5","subInfo":"DM9077-108","hasStatus":true,"isBuyer":false,"status":"pending_shipment","statusOverride":null,"statusMessage":"Pending","statusMods":["red"],"price":"£125","priceAction":null,"subPrice":null,"actions":[{"label":"View","href":"/account/selling/M2RO1DNV"},{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label","options":{"disabled":false}},{"label":"View Postage","href":"/account/selling/M2RO1DNV/shipping-label.pdf","options":{"target":"_blank","disabled":false}}]}'></div>
I am trying to extract the href link in
{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label"
How do I do this? I've tried regex and find_all, but to no avail. Thanks
My code is below for reference; I've put # next to the solutions I have tried, among many others:
account_soup = bs(my_account.text, 'lxml')
links = account_soup.find_all('div', {'data-react-class': 'ListItemSale'})
# for links in download_link['actions']:
#     print(links['href'])
# for i in links:
#     link_main = i.find('title')
#     link = re.findall('^/account*shipping-label$', link_main)
#     print(link)
You need to fetch the data-react-props attribute of each div, then parse that as JSON. You can then iterate over the actions property and collect each href property that matches your pattern:
import json
import re

actions = []
for l in links:
    props = json.loads(l['data-react-props'])
    for a in props['actions']:
        m = re.match(r'^/account.*shipping-label$', a['href'])
        if m is not None:
            actions.append(m[0])
print(actions)
Output for your sample data:
['/account/selling/M2RO1DNV/shipping-label']
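For a quick self-contained check, here is the same approach run against a trimmed copy of the sample div from the question (only the actions key is kept; the other props are omitted for brevity):
import json
import re
from bs4 import BeautifulSoup

html = '''<div data-react-class="ListItemSale" data-react-props='{"actions":[{"label":"View","href":"/account/selling/M2RO1DNV"},{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label"}]}'></div>'''
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', {'data-react-class': 'ListItemSale'})

actions = []
for l in links:
    props = json.loads(l['data-react-props'])
    for a in props['actions']:
        m = re.match(r'^/account.*shipping-label$', a['href'])
        if m is not None:
            actions.append(m[0])
print(actions)  # ['/account/selling/M2RO1DNV/shipping-label']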
I am trying to scrape job ads from this website: https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging
I want to get all the visible data: job title, location, and the short description (such as: Full Stack; DBA, Big Data; Data Science, AI, ML and Embedded; Test, QA). My scraping code for this part is:
result = requests.get("https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging").text
soup = bs4.BeautifulSoup(result, "lxml")
jobs = soup.find_all('td', class_ = "offerslistRow")
for job in jobs:
    description = job.find_all('div', class_="card__subtitle mdc-typography mdc-typography--body2")
and it is the [0] part to be precise, as there are two types of short description with the same class name, but this is not the issue.
Some ads don't have a short description, and they also don't have the mentioned div at all (it is not empty; it simply doesn't exist).
Is there a way to get the description for such ads as well, as "N/A" for example, or something like that?
I'm assuming you want to scrape all job details as the question was a bit unclear. I have made a few other changes to your code as well, and handled all possible cases.
The following code should do the job:
import bs4
import requests

result = requests.get("https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging").text
soup = bs4.BeautifulSoup(result, "lxml")

# find all jobs
jobs = soup.find_all('td', class_="offerslistRow")
# list to store job titles
job_title = []
# list to store job locations
job_location = []
# list to store domains and skills
domain_and_skills = []

# loop through the jobs
for job in jobs:
    # this check is to remove the other two blocks aligned to the right
    if job.find('a', class_="card__title mdc-typography mdc-typography--headline6 text-overflow") is not None:
        # find and append the job name
        job_name = job.find('a', class_="card__title mdc-typography mdc-typography--headline6 text-overflow")
        job_title.append(job_name.text)
        # find and append the location and salary description
        location_salary_desc = job.find('span', class_='card__subtitle mdc-typography mdc-typography--body2 top-margin')
        if location_salary_desc is not None:
            job_location.append(location_salary_desc.text.strip())
        else:
            job_location.append('NA')
        # find the other two descriptions (skills and domains)
        description = job.find_all(class_="card__subtitle mdc-typography mdc-typography--body2")
        # if both are empty (len=0)
        if len(description) == 0:
            domain_and_skills.append('NA')
        # if len=1 (can either be skills or domain details)
        elif len(description) == 1:
            # check whether the domain is present and skills are empty
            if description[0].find('div') is None:
                domain_and_skills.append(description[0].text.strip())
            # the domain is empty and skills are present
            else:
                # list to store skills
                skills = []
                # find all images in the skills section and get the alt attribute, which contains the skill name
                images = description[0].find_all('img')
                # if there is no image and only text is present (for example, Shell Scripts is not an image but a text value)
                if len(images) == 0:
                    skills.append(description[0].text.strip())
                # both images and text are present
                else:
                    # for each image, append the skill name to the list
                    for image in images:
                        skills.append(image['alt'])
                    # append the text to the list if not empty
                    if description[0].text.strip() != '':
                        skills.append(description[0].text.strip())
                # convert the list to a string
                skills_string = ','.join([str(skill) for skill in skills])
                domain_and_skills.append(skills_string)
        # both domain and skills are present
        else:
            domain_string = description[0].text.strip()
            # similar procedure as above to collect the skill names
            skills = []
            images = description[1].find_all('img')
            if len(images) == 0:
                skills.append(description[1].text.strip())
            else:
                for image in images:
                    skills.append(image['alt'])
                if description[1].text.strip() != '':
                    skills.append(description[1].text.strip())
            skills_string = ','.join([str(skill) for skill in skills])
            # combine domain and skills
            domain_string = domain_string + ',' + skills_string
            domain_and_skills.append(domain_string)

for i in range(0, len(job_title)):
    print(job_title[i])
    print(job_location[i])
    print(domain_and_skills[i])
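The repeated image/alt handling in both branches could also be factored into a small helper; a sketch with a hypothetical extract_skills name, following the same logic as above:
def extract_skills(desc):
    # collect skill names from the alt attribute of each image,
    # plus any bare text (e.g. Shell Scripts, which is text rather than an image)
    skills = [image['alt'] for image in desc.find_all('img')]
    if desc.text.strip() != '':
        skills.append(desc.text.strip())
    return ','.join(skills)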
I am trying to extract a certain section from HTML files. To be specific, I am looking for the "ITEM 1" section of 10-K filings (the annual business reports that US companies file). E.g.:
https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002
Problem: However, I am not able to find the "ITEM 1" section, nor do I have an idea how to tell my algorithm to search from that point "ITEM 1" to another point (e.g. "ITEM 1A") and extract the text in between.
I am super thankful for any help.
Among others, I have tried this (and similar), but my bd is always empty:
try:
    # bd = soup.body.findAll(text=re.compile('^ITEM 1$'))
    # bd = soup.find_all(name="ITEM 1")
    # bd = soup.find_all(["ITEM 1", "ITEM1", "Item 1", "Item1", "item 1", "item1"])
    print(" Business Section (Item 1): ", bd.content)
except:
    print("\n Section not found!")
Using Python 3.7 and Beautifulsoup4
Regards Heka
As I mentioned in a comment, because of the nature of EDGAR, this may work on one filing but fail on another. The principles, though, should generally work (after some adjustments...)
import requests
import lxml.html
url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002'
source = requests.get(url)
doc = lxml.html.fromstring(source.text)
tabs = doc.xpath('//table[./tr/td/font/a[@name="a_002"]]/following-sibling::p/font')
# in this filing, Item 1 is hiding in a series of <p> tags following a table with an <a> tag
# whose "name" attribute has a value of "a_002"

flag = ''
for i in tabs:
    if flag == 'stop':
        break
    if i.text is not None:  # we now start extracting the text from each <p> tag and move to the next
        print(i.text_content().strip().replace('\n', ''))
        nxt = i.getparent().getnext()
        # the following detects when the <p> tags of Item 1 end and the next Item begins, and then stops
        if nxt is not None and nxt.tag == 'table':
            for j in nxt.iterdescendants():
                if j.tag == 'a' and j.values()[0] == 'a_003':
                    # we have encountered the <a> tag whose "name" attribute has a value of "a_003",
                    # indicating the beginning of the next Item; so we stop
                    flag = 'stop'
The output is the text of Item 1 in this filing.
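Since the question used BeautifulSoup, roughly the same anchor-to-anchor walk can be written with next_elements. A sketch, under the same assumption that this particular filing marks its Items with a_002/a_003 anchors:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

start = soup.find('a', attrs={'name': 'a_002'})  # anchor at the start of Item 1
end = soup.find('a', attrs={'name': 'a_003'})    # anchor at the start of Item 1A

texts = []
for el in start.next_elements:
    if el is end:
        break
    # keep only the text nodes between the two anchors
    if isinstance(el, str) and el.strip():
        texts.append(el.strip())
print(' '.join(texts))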
There are special characters (e.g. a newline between "ITEM" and "1"). Remove them first:
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
doc.loadHtml(doc.replaceReg(doc.html, 'ITEM[\s]+','ITEM '))
item1 = doc.getElementByText('ITEM 1')
print(item1) # {'tag': 'B', 'html': 'ITEM 1. BUSINESS'}
# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)
If you use the latest version, you can use the following method instead:
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
item1 = doc.getElementByReg('ITEM[\s]+1') # Incoming regex
print(item1,item1.text) # {'tag': 'B', 'html': 'ITEM\n 1. BUSINESS'} ITEM 1. BUSINESS
# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)
I'm currently writing a Python script, and part of it gets the win shares from the first 4 seasons of the career of every player drafted into the NBA between 2005 and 2015. I've been messing around with this for almost 2 hours (getting increasingly frustrated), but I've been unable to get the win shares for the individual players. I'm trying to use the "Advanced" table at the following link as a test case: https://www.basketball-reference.com/players/b/bogutan01.html#advanced::none
When getting the players' names from the draft pages I had no problems, but I've tried so many iterations of the following code and have had no success in accessing the td element the stat is in.
playerSoup = BeautifulSoup(playerHtml)
playertr = playerSoup.find_all("table", id = "advanced").find("tbody").findAll("tr")
playerws = playertr.findAll("td")[21].getText()
This page uses JavaScript to add the tables, but it doesn't read the data from the server. All the tables are already in the HTML, but as comments <!-- ... -->
Using BeautifulSoup you can find all comments and then check which one contains the text "Advanced". Then you can parse that comment as normal HTML with BeautifulSoup.
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
url = 'https://www.basketball-reference.com/players/b/bogutan01.html#advanced::none'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

all_comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for item in all_comments:
    if "Advanced" in item:
        adv = BeautifulSoup(item, 'html.parser')
        playertr = adv.find("table", id="advanced")
        if not playertr:
            # print('skip')
            continue  # skip comments without the table - go back to `for`
        playertr = playertr.find("tbody").findAll("tr")
        playerws = adv.find_all("td")[21].getText()
        print('playertr:', playertr)
        print('playerws:', playerws)
        for row in playertr:
            if row:
                print(row.find_all('th')[0].text)
                all_td = row.find_all('td')
                print([x.text for x in all_td])
                print('--')
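As an aside, once the right comment has been found, pandas can parse the table in it directly (assuming pandas is installed, and that the Advanced table has a WS column header, which I have not re-verified):
import io
import pandas as pd

# item is the comment string containing the "advanced" table
df = pd.read_html(io.StringIO(str(item)))[0]
print(df['WS'])  # the Win Shares column, if present under that header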
I have been scratching my head for nearly 4 days trying to find the best way to loop through a table of URLs on one website, request each URL, and scrape text from 2 different areas of the second site.
I have tried to rewrite this script multiple times, using several different solutions to achieve my desired results; however, I have not been able to fully accomplish this.
Currently, I am able to select the first link of the table on page one, go to the new page, and select the data I need, but I can't get the code to continue to loop through every link on the first page.
import requests
from bs4 import BeautifulSoup
journal_site = "https://journals.sagepub.com"
site_link = 'http://journals.sagepub.com/action/showPublications?pageSize=100&startPage='

# each page contains 100 results I need to scrape from
page_1 = '0'
page_2 = '1'
page_3 = '3'
page_4 = '4'

journal_list = site_link + page_1
r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')

for table_row in soup.select('div.results'):
    journal_name = table_row.findAll('tr', class_='False')
    journal_link = table_row.find('a')['href']
    journal_page = journal_site + journal_link
    r = requests.get(journal_page)
    soup = BeautifulSoup(r.text, 'html.parser')
    for journal_header, journal_description in zip(soup.select('main'), soup.select('div.journalCarouselTextText')):
        try:
            title = journal_header.h1.text.strip()
            description = journal_description.p.text.strip()
            print(title, ':', description)
        except AttributeError:
            continue
What is the best way to find the title and the description for every journal_name? Thanks in advance for the help!
Most of your code works for me; I just needed to modify the middle section, leaving the parts before and after the same:
# all code same up to here
journal_list = site_link + page_1
r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find("div", { "class" : "results" })
table = results.find('table')
for row in table.find_all('a', href=True):
    journal_link = row['href']
    journal_page = journal_site + journal_link
    # from here same as your code
I stopped it after it got the fourth response (title/description) of the 100 results from the first page. I'm pretty sure it will get all the expected results; it only needs to loop through the 4 subsequent pages as well.
Hope this helps.
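For completeness, a sketch of that paging loop, reusing the page values from the question (note it assigns '0', '1', '3', '4'):
for page in [page_1, page_2, page_3, page_4]:
    journal_list = site_link + page
    r = requests.get(journal_list)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find("div", {"class": "results"})
    table = results.find('table')
    for row in table.find_all('a', href=True):
        journal_link = row['href']
        journal_page = journal_site + journal_link
        # ... then scrape journal_page exactly as above ...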