web-scraping an unordered table - python

I'm trying to scrape a table with BeautifulSoup and then print it out with pandas, but the table I need to work with has cell spans that appear at irregular intervals each month.
The page is "https://to.sze.hu/kezdolap", the top table in the middle div.
The path is soup.select("#content > div:nth-child(2) > div > div > div > table")

You can do:
df_list = pd.read_html('https://to.sze.hu/kezdolap', header=0)
which collects every table on the page into a list of DataFrames. Then simply:
df_list[0]
Gives you:
Ügyfélfogadás NAPPALI tagozatos hallgatók számára (full time students) Ügyfélfogadás NAPPALI tagozatos hallgatók számára (full time students).1 Ügyfélfogadás NAPPALI tagozatos hallgatók számára (full time students).2
0 2021. augusztus 25-augusztus 27./ 25. August -... szerda/Wednesday 9.30-11.00
1 2021. augusztus 25-augusztus 27./ 25. August -... csütörtök/Thursday 13.00-14.30
2 2021. augusztus 25-augusztus 27./ 25. August -... péntek/Friday 13.00-14.30
3 2021. augusztus 30-szeptember 3./ 30. August -... kedd/Tuesday 9.30-11.00
4 2021. augusztus 30-szeptember 3./ 30. August -... csütörtök/Thursday 13.00-14.30
5 2021. augusztus 30-szeptember 3./ 30. August -... péntek/Friday 13.00-14.30
6 2021. szeptember 6-17. / 06. September-17. Sep... kedd/Tuesday 9.30 – 11.00
7 2021. szeptember 6-17. / 06. September-17. Sep... csütörtök/Thursday 13.00 – 14.30
8 2021. szeptember 20-30. / 20-30 September 2021 kedd/Tuesday 10.00 – 11.00
9 2021. szeptember 20-30. / 20-30 September 2021 szerda/Wednesday 10.00 – 11.00
10 2021. szeptember 20-30. / 20-30 September 2021 szerda/Wednesday 13.00 – 14.00
11 2021. szeptember 20-30. / 20-30 September 2021 csütörtök/Thursday 10.00 - 11.00
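As a side note, read_html expands row spans automatically, which is why the merged date cells above repeat on every row. A minimal offline sketch of that behavior (the HTML below is a made-up stand-in for the real table):

```python
from io import StringIO

import pandas as pd

# Two schedule rows share one date cell via rowspan, like the real table.
html = StringIO("""
<table>
  <tr><th>Date</th><th>Day</th><th>Hours</th></tr>
  <tr><td rowspan="2">Aug 25-27</td><td>Wednesday</td><td>9.30-11.00</td></tr>
  <tr><td>Thursday</td><td>13.00-14.30</td></tr>
</table>
""")

df = pd.read_html(html, header=0)[0]
print(df)  # the spanned date repeats on both rows
```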


Extract date and sort rows by date

I have a dataset that includes some strings in the following forms:
Text
Jun 28, 2021 — Brendan Moore is p...
Professor of Psychology at University
Aug 24, 2019 — Chemistry (Nobel prize...
by A Craig · 2019 · Cited by 1 — Authors. ...
... 2020 | Volume 8 | Article 330Edited by:
I would like to create a new column that numbers the rows by date in ascending order, where a date exists.
To do so, I need to extract the part of the string that contains the date information from each row, if present.
Something like this:
Text Numbering
Jun 28, 2021 — Brendan Moore is p... 2
Professor of Psychology at University -1
Aug 24, 2019 — Chemistry (Nobel prize... 1
by A Craig · 2019 · Cited by 1 — Authors. ... -1
... 2020 | Volume 8 | Article 330Edited by: -1
All rows not starting with a date (one that follows the format Jun 28, 2021 —) are assigned -1.
The first step would be to identify the pattern xxx xx, xxxx;
then to transform the date string into a datetime (yyyy-mm-dd).
Once the date information is extracted, it needs to be converted to a numeric value and then sorted.
I am having difficulty with the last point, specifically how to filter only the dates and sort them appropriately.
The expected output would be
Text Numbering (sort by date asc)
Jun 28, 2021 — Brendan Moore is p... 2
Professor of Psychology at University -1
Aug 24, 2019 — Chemistry (Nobel prize... 1
by A Craig · 2019 · Cited by 1 — Authors. ... -1
... 2020 | Volume 8 | Article 330Edited by: -1
Mission accomplished:
import numpy as np
import pandas as pd

# Find rows that start with a date
matches = df['Text'].str.match(r'^\w+ \d+, \d{4}')
# Parse dates out of date rows
df['date'] = pd.to_datetime(df[matches]['Text'], format='%b %d, %Y', exact=False, errors='coerce')
# Assign numbering for dates (one dummy group over the sorted dates, then cumcount)
df['Numbering'] = df['date'].sort_values().groupby(np.ones(df.shape[0])).cumcount() + 1
# -1 for the non-dates
df.loc[~matches, 'Numbering'] = -1
# Cleanup
df.drop('date', axis=1, inplace=True)
Output:
>>> df
Text Numbering
0 Jun 28, 2021 - Brendan Moore is p... 2
1 Professor of Psychology at University -1
2 Aug 24, 2019 - Chemistry (Nobel prize... 1
3 by A Craig - 2019 - Cited by 1 - Authors. ... -1
4 ... 2020 | Volume 8 | Article 330Edited by: -1
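For what it's worth, the dummy-group cumcount trick can also be replaced by Series.rank. This is an assumed equivalent sketched on a cut-down version of the data, not the answer's own code:

```python
import pandas as pd

df = pd.DataFrame({'Text': [
    'Jun 28, 2021 - Brendan Moore is p...',
    'Professor of Psychology at University',
    'Aug 24, 2019 - Chemistry (Nobel prize...',
]})

# Rows without a leading date parse to NaT thanks to errors='coerce'
dates = pd.to_datetime(df['Text'], format='%b %d, %Y',
                       exact=False, errors='coerce')

# rank() numbers the valid dates from earliest to latest and skips NaT,
# so the NaN ranks can be turned straight into the -1 sentinel
df['Numbering'] = dates.rank(method='first').fillna(-1).astype(int)
print(df)
```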

Find all div with id (not class) inside <article> beautifulsoup

In my personal scraping project I cannot locate any job cards on https://unjobs.org, neither with requests / requests_html nor with Selenium. Job titles are the only fields I can print to the console. Company names and deadlines seem to be located in iframes, but there is no src, and somehow the hrefs are not scrapeable either. I am not sure whether that site is an SPA, and DevTools shows no XHR of interest. Please advise: which selector or script tag contains all the data?
You are dealing with the Cloudflare firewall, so you have to inject cookies. I'd rather not post a full cookie-injection answer here, since Cloudflare's bots are quite good at finding such threads and tightening the protection accordingly.
Anyway, below is a solution using Selenium:
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

mainurl = "https://unjobs.org/"


def main(driver):
    driver.get(mainurl)
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, "//article/div[@id]"))
        )
        data = (
            (
                x.find_element_by_class_name('jtitle').text,
                x.find_element_by_class_name('jtitle').get_attribute("href"),
                x.find_element_by_tag_name('br').text,
                x.find_element_by_css_selector('.upd.timeago').text,
                x.find_element_by_tag_name('span').text
            )
            for x in element
        )
        df = pd.DataFrame(data)
        print(df)
    except TimeoutException:
        exit('Unable to locate element')
    finally:
        driver.quit()


if __name__ == "__main__":
    driver = webdriver.Firefox()
    main(driver)
Note: you can use a headless browser as well.
Output:
0 1 2 3 4
0 Republication : Une consultance internationale... https://unjobs.org/vacancies/1627733212329 about 9 hours ago Closing date: Friday, 13 August 2021
1 Project Management Support Associate (Informat... https://unjobs.org/vacancies/1627734534127 about 9 hours ago Closing date: Tuesday, 17 August 2021
2 Finance Assistant - Retainer, Nairobi, Kenya https://unjobs.org/vacancies/1627734537201 about 10 hours ago Closing date: Saturday, 14 August 2021
3 Procurement Officer, Sana'a, Yemen https://unjobs.org/vacancies/1627734545575 about 10 hours ago Closing date: Wednesday, 4 August 2021
4 ICT Specialist (Geospatial Information Systems... https://unjobs.org/vacancies/1627734547681 about 10 hours ago Closing date: Saturday, 14 August 2021
5 Programme Management - Senior Assistant (Grant... https://unjobs.org/vacancies/1627734550335 about 10 hours ago Closing date: Thursday, 5 August 2021
6 Especialista en Normas Internacionales de Cont... https://unjobs.org/vacancies/1627734552666 about 10 hours ago Closing date: Saturday, 14 August 2021
7 Administration Assistant, Juba, South Sudan https://unjobs.org/vacancies/1627734561330 about 10 hours ago Closing date: Wednesday, 11 August 2021
8 Project Management Support - Senior Assistant,... https://unjobs.org/vacancies/1627734570991 about 10 hours ago Closing date: Saturday, 14 August 2021
9 Administration Senior Assistant [Administrativ... https://unjobs.org/vacancies/1627734572868 about 10 hours ago Closing date: Wednesday, 11 August 2021
10 Project Management Support Officer, Juba, Sout... https://unjobs.org/vacancies/1627734574639 about 10 hours ago Closing date: Wednesday, 11 August 2021
11 Information Management Senior Associate, Bamak... https://unjobs.org/vacancies/1627734576597 about 10 hours ago Closing date: Saturday, 7 August 2021
12 Regional Health & Safety Specialists (French a... https://unjobs.org/vacancies/1627734578207 about 10 hours ago Closing date: Friday, 6 August 2021
13 Project Management Support - Associate, Bonn, ... https://unjobs.org/vacancies/1627734587268 about 10 hours ago Closing date: Tuesday, 10 August 2021
14 Associate Education Officer, Goré, Chad https://unjobs.org/vacancies/1627247597092 a day ago Closing date: Tuesday, 3 August 2021
15 Senior Program Officer, High Impact Africa 2 D... https://unjobs.org/vacancies/1627597499846 a day ago Closing date: Thursday, 12 August 2021
16 Specialist, Supply Chain, Geneva https://unjobs.org/vacancies/1627597509615 a day ago Closing date: Thursday, 12 August 2021
17 Project Manager, Procurement and Supply Manage... https://unjobs.org/vacancies/1627597494487 a day ago Closing date: Thursday, 12 August 2021
18 WCO Drug Programme: Analyst for AIRCOP Project... https://unjobs.org/vacancies/1627594132743 a day ago Closing date: Tuesday, 31 August 2021
19 Regional Desk Assistant, Geneva https://unjobs.org/vacancies/1627594929351 a day ago Closing date: Thursday, 26 August 2021
20 Programme Associate, Zambia https://unjobs.org/vacancies/1627586510917 a day ago Closing date: Wednesday, 11 August 2021
21 Associate Programme Management Officer, Entebb... https://unjobs.org/vacancies/1627512175261 a day ago Closing date: Saturday, 14 August 2021
22 Expert in Transport Facilitation and Connectiv... https://unjobs.org/vacancies/1627594978539 a day ago Closing date: Sunday, 15 August 2021
23 Content Developer for COP Trainings (two posit... https://unjobs.org/vacancies/1627594862178 a day ago
24 Consultant (e) en appui aux Secteurs, Haiti https://unjobs.org/vacancies/1627585454029 a day ago Closing date: Sunday, 8 August 2021
It looks like Cloudflare can tell your request is not coming from an actual browser and is serving a captcha instead of the real site, and/or the site needs JavaScript to render.
I would try something like Puppeteer and check whether the response you get back is valid.
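On the selector question in the title: once you are past Cloudflare and have the HTML, BeautifulSoup can match divs that carry an id (regardless of class) with the CSS attribute selector div[id]. A sketch on stand-in markup, since the real page can't be fetched offline; the jtitle class name mirrors the Selenium answer above:

```python
from bs4 import BeautifulSoup

# Stand-in markup that only mimics the structure of the real page.
html = """
<article>
  <div id="j123"><a class="jtitle" href="/vacancies/1">Officer</a></div>
  <div class="ad">not a job card</div>
  <div id="j456"><a class="jtitle" href="/vacancies/2">Assistant</a></div>
</article>
"""

soup = BeautifulSoup(html, 'html.parser')
cards = soup.select('article div[id]')   # any div inside <article> that has an id
titles = [card.select_one('.jtitle').text for card in cards]
print(titles)
```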

Select corresponding column value for max value of separate column(from a specific range of column) of pandas data frame

year month quantity
DateNew
2005-01 2005 January 49550
2005-02 2005 February 96088
2005-03 2005 March 28874
2005-04 2005 April 66917
2005-05 2005 May 24070
... ... ... ...
2018-08 2018 August 132629
2018-09 2018 September 104394
2018-10 2018 October 121305
2018-11 2018 November 121049
2018-12 2018 December 174984
This is the data frame that I have. I want to select the maximum quantity for each year and return the corresponding month for it.
I have tried this so far
df.groupby('year').max()
But with this I get the max value of each and every column independently, and hence September for every year.
I have no clue how to approach the actual solution.
I think you want idxmax:
df.loc[df.groupby('year')['quantity'].idxmax()]
Output:
year month quantity
DateNew
2005-02 2005 February 96088
2018-12 2018 December 174984
Or just for the months:
df.loc[df.groupby('year')['quantity'].idxmax(), 'month']
Output:
DateNew
2005-02 February
2018-12 December
Name: month, dtype: object
Also, you can use sort_values followed by duplicated:
df.loc[~df.sort_values('quantity').duplicated('year', keep='last'), 'month']
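For reference, a self-contained run of the idxmax approach on a cut-down version of the frame from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {'year': [2005, 2005, 2005, 2018, 2018],
     'month': ['January', 'February', 'March', 'August', 'December'],
     'quantity': [49550, 96088, 28874, 132629, 174984]},
    index=pd.Index(['2005-01', '2005-02', '2005-03', '2018-08', '2018-12'],
                   name='DateNew'))

# idxmax returns the index label of the max quantity within each year,
# and .loc pulls those full rows back out
best = df.loc[df.groupby('year')['quantity'].idxmax()]
print(best['month'])
```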

Sort Plots by year

I have this data frame from which I want to draw 3 plots, one per year, with Unspsc Desc on the x-axis and Total_Price on the y-axis. For example, plot one will be specific to the year 2018 and only contain the Unspsc Desc and Total_Price values for 2018:
Material Total_Price Year_Purchase
Gasket 50,000 2018
Washer 6,000 2019
Bolts 7,000 2019
Nut 3,000 2020
Gasket 25,000 2019
Gasket 2500 2020
Washer 33500 2018
Nuts 7000 2019
The code I was using:
dw.groupby(['Unspsc Desc', 'Total_Price']).Year_Purchase.sort_values().plot.bar()
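The question carries no worked answer here, so the following is only a guess at the intent: one bar chart per Year_Purchase, with Material on the x-axis and Total_Price on the y-axis, using the column names from the sample table (not the 'Unspsc Desc' name mentioned in the text). The output file names are made up.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

dw = pd.DataFrame({
    'Material': ['Gasket', 'Washer', 'Bolts', 'Nut', 'Gasket', 'Gasket', 'Washer', 'Nuts'],
    'Total_Price': [50000, 6000, 7000, 3000, 25000, 2500, 33500, 7000],
    'Year_Purchase': [2018, 2019, 2019, 2020, 2019, 2020, 2018, 2019],
})

# One figure per year, saved to a hypothetical prices_<year>.png
for year, grp in dw.groupby('Year_Purchase'):
    ax = grp.plot.bar(x='Material', y='Total_Price', title=str(year), legend=False)
    ax.figure.savefig(f'prices_{year}.png')
    plt.close(ax.figure)
```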

How to save split data in panda in reverse order?

You can use this to create the dataframe:
xyz = pd.DataFrame({'release' : ['7 June 2013', '2012', '31 January 2013',
'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save them into 3 columns named day, month and year, using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
The resulting dataframe works perfectly when a row splits into 3 strings, but whenever there are fewer than 3, the data is saved in the wrong columns.
I have tried split and rsplit, both are giving the same result.
Any solution to get the data in the right place?
The last token is the year and it is always present, so it should be saved first; then the month, if present; and likewise the day.
You could do:
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
lambda x: pd.Series(x.split()[::-1]))
In [18]: dataframe
Out[18]:
release year month day
0 7 June 2013 2013 June 7
1 2012 2012 NaN NaN
2 31 January 2013 2013 January 31
3 February 2008 2008 February NaN
4 17 June 2014 2014 June 17
5 2013 2013 NaN NaN
Try reversing the column order of the split result (note that a DataFrame has no .reverse() method; slice the columns instead):
dataframe[['year','month','day']] = dataframe['release'].str.rsplit(expand=True).iloc[:, ::-1]
Be aware this only lines up correctly when every row splits into the same number of tokens.
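To double-check, here is a self-contained run of the per-row reversal idea from the first answer, written with an explicit join so no column-alignment subtleties get in the way:

```python
import pandas as pd

xyz = pd.DataFrame({'release': ['7 June 2013', '2012', '31 January 2013',
                                'February 2008', '17 June 2014', '2013']})

# Reverse the tokens per row, so the year (always present) comes first
parts = xyz['release'].apply(lambda x: pd.Series(x.split()[::-1]))
parts.columns = ['year', 'month', 'day']
xyz = xyz.join(parts)
print(xyz)
```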
