Find all divs with an id (not class) inside <article> - BeautifulSoup / Python

In my personal scraping project I cannot locate any of the job cards on https://unjobs.org, neither with requests / requests_html nor with Selenium. Job titles are the only field I can print to the console. Company names and deadlines seem to live in iframes, but those have no src, and somehow the hrefs are not scrapeable either. I am not sure whether the site is an SPA, and DevTools shows no XHR of interest. Which selector or script tag contains all the data?

You are dealing with the Cloudflare firewall, so you have to inject the cookies. I can't share an answer that shows the cookie injection itself, because Cloudflare's bots are clever enough to crawl threads like this and then tighten the security.
Anyway, below is a solution using Selenium:
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

mainurl = "https://unjobs.org/"

def main(driver):
    driver.get(mainurl)
    try:
        # Wait for the job cards: <div> elements with an id attribute inside <article>
        elements = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, "//article/div[@id]"))
        )
        data = (
            (
                x.find_element(By.CLASS_NAME, 'jtitle').text,
                x.find_element(By.CLASS_NAME, 'jtitle').get_attribute("href"),
                x.find_element(By.TAG_NAME, 'br').text,
                x.find_element(By.CSS_SELECTOR, '.upd.timeago').text,
                x.find_element(By.TAG_NAME, 'span').text
            )
            for x in elements
        )
        df = pd.DataFrame(data)
        print(df)
    except TimeoutException:
        exit('Unable to locate element')
    finally:
        driver.quit()

if __name__ == "__main__":
    driver = webdriver.Firefox()
    main(driver)
Note: you can use a headless browser as well.
Output:
0 1 2 3 4
0 Republication : Une consultance internationale... https://unjobs.org/vacancies/1627733212329 about 9 hours ago Closing date: Friday, 13 August 2021
1 Project Management Support Associate (Informat... https://unjobs.org/vacancies/1627734534127 about 9 hours ago Closing date: Tuesday, 17 August 2021
2 Finance Assistant - Retainer, Nairobi, Kenya https://unjobs.org/vacancies/1627734537201 about 10 hours ago Closing date: Saturday, 14 August 2021
3 Procurement Officer, Sana'a, Yemen https://unjobs.org/vacancies/1627734545575 about 10 hours ago Closing date: Wednesday, 4 August 2021
4 ICT Specialist (Geospatial Information Systems... https://unjobs.org/vacancies/1627734547681 about 10 hours ago Closing date: Saturday, 14 August 2021
5 Programme Management - Senior Assistant (Grant... https://unjobs.org/vacancies/1627734550335 about 10 hours ago Closing date: Thursday, 5 August 2021
6 Especialista en Normas Internacionales de Cont... https://unjobs.org/vacancies/1627734552666 about 10 hours ago Closing date: Saturday, 14 August 2021
7 Administration Assistant, Juba, South Sudan https://unjobs.org/vacancies/1627734561330 about 10 hours ago Closing date: Wednesday, 11 August 2021
8 Project Management Support - Senior Assistant,... https://unjobs.org/vacancies/1627734570991 about 10 hours ago Closing date: Saturday, 14 August 2021
9 Administration Senior Assistant [Administrativ... https://unjobs.org/vacancies/1627734572868 about 10 hours ago Closing date: Wednesday, 11 August 2021
10 Project Management Support Officer, Juba, Sout... https://unjobs.org/vacancies/1627734574639 about 10 hours ago Closing date: Wednesday, 11 August 2021
11 Information Management Senior Associate, Bamak... https://unjobs.org/vacancies/1627734576597 about 10 hours ago Closing date: Saturday, 7 August 2021
12 Regional Health & Safety Specialists (French a... https://unjobs.org/vacancies/1627734578207 about 10 hours ago Closing date: Friday, 6 August 2021
13 Project Management Support - Associate, Bonn, ... https://unjobs.org/vacancies/1627734587268 about 10 hours ago Closing date: Tuesday, 10 August 2021
14 Associate Education Officer, Goré, Chad https://unjobs.org/vacancies/1627247597092 a day ago Closing date: Tuesday, 3 August 2021
15 Senior Program Officer, High Impact Africa 2 D... https://unjobs.org/vacancies/1627597499846 a day ago Closing date: Thursday, 12 August 2021
16 Specialist, Supply Chain, Geneva https://unjobs.org/vacancies/1627597509615 a day ago Closing date: Thursday, 12 August 2021
17 Project Manager, Procurement and Supply Manage... https://unjobs.org/vacancies/1627597494487 a day ago Closing date: Thursday, 12 August 2021
18 WCO Drug Programme: Analyst for AIRCOP Project... https://unjobs.org/vacancies/1627594132743 a day ago Closing date: Tuesday, 31 August 2021
19 Regional Desk Assistant, Geneva https://unjobs.org/vacancies/1627594929351 a day ago Closing date: Thursday, 26 August 2021
20 Programme Associate, Zambia https://unjobs.org/vacancies/1627586510917 a day ago Closing date: Wednesday, 11 August 2021
21 Associate Programme Management Officer, Entebb... https://unjobs.org/vacancies/1627512175261 a day ago Closing date: Saturday, 14 August 2021
22 Expert in Transport Facilitation and Connectiv... https://unjobs.org/vacancies/1627594978539 a day ago Closing date: Sunday, 15 August 2021
23 Content Developer for COP Trainings (two posit... https://unjobs.org/vacancies/1627594862178 a day ago
24 Consultant (e) en appui aux Secteurs, Haiti https://unjobs.org/vacancies/1627585454029 a day ago Closing date: Sunday, 8 August 2021

It looks like Cloudflare knows your request is not coming from an actual browser and is serving a captcha instead of the actual site, and/or the site needs JavaScript in order to run.
I would try something like Puppeteer and see if the response you get is valid.
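If you want to stay in Python, here is a minimal sketch using pyppeteer (a Python port of Puppeteer); this is only a starting point, and whether it gets past the protection is not guaranteed:
import asyncio
from pyppeteer import launch

async def fetch(url):
    browser = await launch()              # headless Chromium
    page = await browser.newPage()
    await page.goto(url)
    html = await page.content()           # rendered HTML after JavaScript has run
    await browser.close()
    return html

html = asyncio.get_event_loop().run_until_complete(fetch('https://unjobs.org/'))
print(html[:500])  # inspect whether you got the site or a captcha page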

Related

turning one column into multiple pro-rated columns

I have data regarding an insurance customer's premium during a certain year:
User ID  Period From  Period To  Period From-Period To (months)  Total Premium
A8856    Jan 2022     Apr 2022   4                               $600
A8857    Jan 2022     Feb 2022   2                               $400
And I'm trying to turn it into a pro-rated one. The output I'm expecting looks like this:
User ID  Period From  Total Premium
A8856    Jan 2022     $150
A8856    Feb 2022     $150
A8856    Mar 2022     $150
A8856    Apr 2022     $150
A8857    Jan 2022     $200
A8857    Feb 2022     $200
What kind of code do you think I should use? I use Python, and any help is really appreciated.
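A minimal pandas sketch of one way to do this; the column names follow the question's table, and the month parsing assumes "Mon YYYY" strings:
import pandas as pd

df = pd.DataFrame({
    "User ID": ["A8856", "A8857"],
    "Period From": ["Jan 2022", "Jan 2022"],
    "Period To": ["Apr 2022", "Feb 2022"],
    "Total Premium": [600, 400],
})

start = pd.to_datetime(df["Period From"], format="%b %Y")
end = pd.to_datetime(df["Period To"], format="%b %Y")

rows = []
for i, row in df.iterrows():
    months = pd.date_range(start[i], end[i], freq="MS")  # one timestamp per covered month
    for m in months:
        rows.append({
            "User ID": row["User ID"],
            "Period From": m.strftime("%b %Y"),
            "Total Premium": row["Total Premium"] / len(months),  # even split per month
        })

print(pd.DataFrame(rows))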

web-scraping an unordered table

I'm trying to scrape with BeautifulSoup and then print the result with pandas, but the table I need to work with has row spans that appear in different places every month.
"https://to.sze.hu/kezdolap" - the top table in the middle div.
The path is soup.select("#content > div:nth-child(2) > div > div > div > table").
You can do:
df_list = pd.read_html('https://to.sze.hu/kezdolap', header=0)
Which will add your tables to a list, then simply:
df_list[0]
Gives you:
Ügyfélfogadás NAPPALI tagozatos hallgatók számára (full time students) Ügyfélfogadás NAPPALI tagozatos hallgatók számára (full time students).1 Ügyfélfogadás NAPPALI tagozatos hallgatók számára (full time students).2
0 2021. augusztus 25-augusztus 27./ 25. August -... szerda/Wednesday 9.30-11.00
1 2021. augusztus 25-augusztus 27./ 25. August -... csütörtök/Thursday 13.00-14.30
2 2021. augusztus 25-augusztus 27./ 25. August -... péntek/Friday 13.00-14.30
3 2021. augusztus 30-szeptember 3./ 30. August -... kedd/Tuesday 9.30-11.00
4 2021. augusztus 30-szeptember 3./ 30. August -... csütörtök/Thursday 13.00-14.30
5 2021. augusztus 30-szeptember 3./ 30. August -... péntek/Friday 13.00-14.30
6 2021. szeptember 6-17. / 06. September-17. Sep... kedd/Tuesday 9.30 – 11.00
7 2021. szeptember 6-17. / 06. September-17. Sep... csütörtök/Thursday 13.00 – 14.30
8 2021. szeptember 20-30. / 20-30 September 2021 kedd/Tuesday 10.00 – 11.00
9 2021. szeptember 20-30. / 20-30 September 2021 szerda/Wednesday 10.00 – 11.00
10 2021. szeptember 20-30. / 20-30 September 2021 szerda/Wednesday 13.00 – 14.00
11 2021. szeptember 20-30. / 20-30 September 2021 csütörtök/Thursday 10.00 - 11.00

How to calculate the overlap date in pyspark

I have data on users who have worked for multiple companies; some users worked at more than one company at the same time. How do I aggregate the overall experience without counting the overlapping experience twice?
I have gone through some links but could not get the right solution. Any help will be appreciated.
EMP CSV DATA
fullName,Experience_datesEmployeed,Experience_expcompany,Experience_expduraation, Experience_position
David,Feb 1999 - Sep 2001, Foothill,2 yrs 8 mos, Marketing Assoicate
David,1994 - 1997, abc,3 yrs,Senior Auditor
David,Jun 2020 - Present, Fellows INC,3 mos,Director Board
David,2017 - Jun 2019, Fellows INC ,2 yrs,Fellow - Class 22
David,Sep 2001 - Present, The John D.,19 yrs, Manager
Expected output:
FullName,Total_Experience
David,24.8 yrs
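The key step is merging the overlapping employment intervals before summing the durations. Here is a minimal plain-Python sketch of that step, assuming the mixed period strings have already been parsed into (start, end) date pairs; the concrete dates below, including the date substituted for "Present", are illustrative:
from datetime import date

# Employment periods parsed from the CSV; parsing the mixed "Feb 1999",
# "1994", "Present" formats is a separate preprocessing step.
periods = [
    (date(1994, 1, 1), date(1997, 12, 31)),   # abc
    (date(1999, 2, 1), date(2001, 9, 30)),    # Foothill
    (date(2001, 9, 1), date(2021, 8, 1)),     # The John D. ("Present" -> assumed today)
    (date(2017, 1, 1), date(2019, 6, 30)),    # Fellows INC
    (date(2020, 6, 1), date(2021, 8, 1)),     # Fellows INC ("Present" -> assumed today)
]

# Merge overlapping intervals so concurrent jobs are only counted once
merged = []
for start, end in sorted(periods):
    if merged and start <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], end)  # extend the current interval
    else:
        merged.append([start, end])              # start a new interval

total_days = sum((end - start).days for start, end in merged)
print(f"David,{total_days / 365.25:.1f} yrs")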

Using Python and BeautifulSoup to scrape a list from a URL

I am new to BeautifulSoup, so please excuse any beginner mistakes here. I am attempting to scrape a URL and want to store the list of movies under each date.
Below is the code I have so far:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    mv=ul.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
I am getting "AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?"
Expected result in list or dataframe
29th May 2020 Romantic
29th May 2020 Sohreyan Da Pind Aa Gaya
5th June 2020 Lakshmi Bomb
and so on
Thanks in advance for help.
This script will get all movies and corresponding dates to a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/calendar?region=IN&ref_=rlm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out, last = [], ''
for tag in soup.select('#main h4, #main li'):
    if tag.name == 'h4':
        last = tag.get_text(strip=True)
    else:
        out.append({'Date': last, 'Movie': tag.get_text(strip=True).rsplit('(', maxsplit=1)[0]})

df = pd.DataFrame(out)
print(df)
Prints:
Date Movie
0 29 May 2020 Romantic
1 29 May 2020 Sohreyan Da Pind Aa Gaya
2 05 June 2020 Laxmmi Bomb
3 05 June 2020 Roohi Afzana
4 05 June 2020 Nikamma
.. ... ...
95 26 March 2021 Untitled Luv Ranjan Film
96 02 April 2021 F9
97 02 April 2021 Bell Bottom
98 02 April 2021 NTR Trivikiram Untitled Movie
99 09 April 2021 Manje Bistre 3
[100 rows x 2 columns]
I think you should replace "ul" with "h1" on the 10th line, and add a definition of the variable "movielist" before the loop.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
# add code here
movielist = []
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    # replace ul with h1 here
    mv=h1.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
print(movielist)
Here is the same fix in full: the movielist is defined up front, and find_all('a') is called on h1 (the matching <ul>) instead of on the whole ResultSet.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
movielist = []
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    mv=h1.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
The reason the dates don't line up in the output is that the retrieved 'date' values look like the following, so you need to fix the pairing logic: there are multiple titles under the same release date, so the number of release dates and the number of titles don't match up. I can't help you much more because I don't have the time. Have a good night.
29 May 2020
05 June 2020
07 June 2020
07 June 2020 Romantic
12 June 2020
12 June 2020 Sohreyan Da Pind Aa Gaya
18 June 2020
18 June 2020 Laxmmi Bomb
19 June 2020
19 June 2020 Roohi Afzana
25 June 2020
25 June 2020 Nikamma
26 June 2020
26 June 2020 Naandhi
02 July 2020
02 July 2020 Mandela
03 July 2020
03 July 2020 Medium Spicy
10 July 2020
10 July 2020 Out of the Blue
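A minimal sketch of one way to fix that pairing (my own illustration, not taken from the answers above): instead of zipping two separately collected lists, walk from each h4 to the <ul> that immediately follows it, so every date is matched with its own movies.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')

movielist = []
for h4 in soup.find_all('h4'):
    ul = h4.find_next_sibling('ul')   # the list of movies belonging to this date
    if ul is None:
        continue
    for movie in ul.find_all('a'):
        movielist.append((h4.get_text(strip=True), movie.get_text(strip=True)))

print(movielist)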

How do I find a block containing 2 tags in a loop?

I am looking to scrape the contents of the following HTML and want to capture each h2, then every element until the next h2, using Beautiful Soup. Is this possible?
<hr /><h2>California</h2>
<p><strong>Term 1:</strong> (Eastern division): Tuesday 29 January — Friday 12 April</p>
<p><strong>Term 1:</strong> (Western division): Tuesday 5 February — Friday 12 April</p>
<p><strong>Term 2</strong><strong>:</strong> Monday 29 April — Friday 5 July</p>
<p><strong>Term 3:</strong> Monday 22 July — Friday 27 September</p>
<p><strong>Term 4:</strong> Monday 14 October — Friday 20 December</p>
<hr /><h2>New York</h2>
<p><strong>Term 1</strong>: Tuesday 29 January — Friday 12 April</p>
<p><strong>Term 2:</strong> Monday 29 April — Friday 5 July</p>
<p><strong>Term 3</strong>: Monday 22 July — Friday 27 September</p>
<p><strong>Term 4</strong>: Monday 14 October — Friday 13 December</p>
soup = BeautifulSoup(page.text, 'html.parser')
for each_div in soup.findAll(['h2', 'p']):
    myval = str(each_div.prettify("ascii"))
I want to get the following results for each state
Here's something I think you should be able to work with. The list capture keeps track of the elements you want for each header. The code uses the find_next_siblings method to get all of the siblings in the tree and iterate over them. When it reaches another h2 tag, it breaks.
soup = BeautifulSoup(content, 'html.parser')
for head in soup.find_all('h2'):
    capture = [head]
    for sibling in head.find_next_siblings():
        if sibling.name == 'h2':
            break
        capture += [sibling]
I would just change how you store off the captured tags.
Edit: forgot to mention that content is the HTML string provided in your question.
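For instance, here is a small sketch of one way to store the captures, keyed by state name; the dict structure and the shortened sample HTML are my choices, not the answerer's:
from bs4 import BeautifulSoup

# Shortened sample of the HTML from the question
content = """<h2>California</h2>
<p><strong>Term 1:</strong> Tuesday 29 January — Friday 12 April</p>
<h2>New York</h2>
<p><strong>Term 1:</strong> Tuesday 29 January — Friday 12 April</p>"""

soup = BeautifulSoup(content, 'html.parser')

# Group each <p> under the <h2> that precedes it
terms_by_state = {}
for head in soup.find_all('h2'):
    capture = []
    for sibling in head.find_next_siblings():
        if sibling.name == 'h2':
            break
        if sibling.name == 'p':
            capture.append(sibling.get_text(' ', strip=True))
    terms_by_state[head.get_text(strip=True)] = capture

print(terms_by_state)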
