I'm attempting to extract some links from a chunk of HTML parsed with BeautifulSoup and append them to the rows of a new pandas DataFrame.
So far, I have this code:
url = "http://www.reed.co.uk/jobs
datecreatedoffset=Today&isnewjobssearch=True&pagesize=100"
r = ur.urlopen(url).read()
soup = BShtml(r, "html.parser")
adcount = soup.find_all("div", class_="pages")
print(adcount)
From my output I then want to take every link, identified by href="", and store each one in a new row of a pandas DataFrame.
Using the above snippet I would end up with 6 rows in my new dataset.
Any help would be appreciated!
Your link gives a 404, but the logic should be the same as below. You just need to extract the anchor tags with the page class and join them to the base URL:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # "from urlparse import urljoin" on Python 2
base = "http://www.reed.co.uk/jobs"
url = "http://www.reed.co.uk/jobs?keywords=&location=&jobtitleonly=false"
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
df = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div.pages a.page")])
print(df)
Which gives you:
links
0 http://www.reed.co.uk/jobs?cached=True&pageno=2
1 http://www.reed.co.uk/jobs?cached=True&pageno=3
2 http://www.reed.co.uk/jobs?cached=True&pageno=4
3 http://www.reed.co.uk/jobs?cached=True&pageno=5
4 http://www.reed.co.uk/jobs?cached=True&pageno=...
5 http://www.reed.co.uk/jobs?cached=True&pageno=2
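Note that the first and last rows are the same URL (pageno=2), presumably because a numbered link and a "next page" arrow both point there; if the repeat is unwanted, df.drop_duplicates() will remove it.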
I am trying to create a simple scraper to gather basketball stats. I was able to get the info I want; however, I can't figure out how to organize it all in a table.
I keep getting a "TypeError: 'in <string>' requires string as left operand, not NoneType".
Please see my code below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
url = 'https://basketball.realgm.com/ncaa/boxscore/2021-01-29/North-Texas-at-Rice/367436'
page = requests.get(url)
soup = BeautifulSoup(page.content , 'html.parser')
#Extracting Columns
tables = soup.find('div', class_= 'boxscore-gamesummary')
columns = tables.find_all('th', class_='nosort')
#Extracting Stats
tables = soup.find('div', class_= 'boxscore-gamesummary')
stats = tables.find_all('td')
#Filling DataFrame
temp_df = pd.DataFrame(stats).transpose()
temp_df.columns = columns
final_df = pd.concat([final_df,temp_df], ignore_index=True)
final_df
Looking forward to hearing from someone
Pandas already has a built-in method, pd.read_html, to get a DataFrame from HTML, which should make things much easier here.
Code
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://basketball.realgm.com/ncaa/boxscore/2021-01-29/North-Texas-at-Rice/367436'
page = requests.get(url)
soup = BeautifulSoup(page.content , 'html.parser')
tables = soup.find('div', class_= 'boxscore-gamesummary').find_all('table')
df = pd.read_html(str(tables))[0]
print(df)
Output
Unnamed: 0 1 2 Final
0 UNT (8-5) 36 43 79
1 RU (10-7) 37 37 74
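If you would rather keep the manual BeautifulSoup approach from the question, the key change is to turn the th/td tags into plain strings (e.g. with get_text()) before handing them to pandas; as written, the question's code passes bs4 Tag objects straight into the DataFrame. A rough sketch along those lines (it assumes the number of td cells divides evenly by the number of headers, so adjust the slicing if the layout differs):
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://basketball.realgm.com/ncaa/boxscore/2021-01-29/North-Texas-at-Rice/367436'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
summary = soup.find('div', class_='boxscore-gamesummary')

# pull plain text out of the tags instead of passing the Tag objects themselves
columns = [th.get_text(strip=True) for th in summary.find_all('th', class_='nosort')]
stats = [td.get_text(strip=True) for td in summary.find_all('td')]

# split the flat list of cells into one chunk per row
# (assumption: the cell count divides evenly by the number of headers)
rows = [stats[i:i + len(columns)] for i in range(0, len(stats), len(columns))]
final_df = pd.DataFrame(rows, columns=columns)
print(final_df)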
I want to create a table with the information available on this website. I want the table to have 3 columns: 0 series/date, 1 title and 2 links. I already managed to get the first two columns but I don't know how to get the link for each entry.
import pandas as pd
import requests
url = "http://legislaturautuado.com/pgs/resolutions.php?st=5&f=2016"
r = requests.get(url)
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()
Will it be possible to get what I want by only using pandas?
As far as I know, it's not possible with pandas only. It can be done with BeautifulSoup, though:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "http://legislaturautuado.com/pgs/resolutions.php?st=5&f=2016"
r = requests.get(url)
html_table = BeautifulSoup(r.text, 'html.parser').find('table')
r.close()
df = pd.read_html(str(html_table), header=0)[0]
df['Link'] = [link.get('href') for link in html_table.find_all('a')]
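Note that this assumes every row of the table contains exactly one <a> tag and that find_all('a') returns them in the same order as the rows read_html produces; if some rows can lack a link, build the column by iterating over the <tr> elements instead so the lengths stay aligned.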
I am relatively new to python. I am planning to
a) obtain a list of URLs from the following url (https://aviation-safety.net/database/) with data from the year 1919 onwards (https://aviation-safety.net/database/dblist.php?Year=1919).
b) obtain the data (date, type, registration, operator, fat., location, cat) from 1919 to the current year
However, I ran into some problems and am still stuck on a).
Any form of help is appreciated, thank you so much!
#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find('a', href = True)

    #try clause to go through the content and grab the URLs
    try:
        for row in datatable:
            cols = row.find_all("|")
            if len(cols) > 1:
                links.append(x, cols = cols)
    except: pass
#place links into numpy array
links_array = np.asarray(links)
len(links_array)
#check if links are in dataframe
df = pd.DataFrame(links_array)
df.columns = ['url']
df.head(10)
I can't seem to get the URLs.
It would be great if I could get the following:
S/N URL
1 https://aviation-safety.net/database/dblist.php?Year=1919
2 https://aviation-safety.net/database/dblist.php?Year=1920
3 https://aviation-safety.net/database/dblist.php?Year=1921
You're not extracting the href attributes from the tags you are pulling. What you want to do is find all <a> tags with links (which you did, but you need to use find_all, since find will just return the first one it finds), then iterate through those tags. I chose to have it check whether the href contains the substring 'Year' and, if it does, put that link into the list.
#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href = True)
    return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)
#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)
Output:
df.head(10)
Out[24]:
url
0 https://aviation-safety.net/database/dblist.ph...
1 https://aviation-safety.net/database/dblist.ph...
2 https://aviation-safety.net/database/dblist.ph...
3 https://aviation-safety.net/database/dblist.ph...
4 https://aviation-safety.net/database/dblist.ph...
5 https://aviation-safety.net/database/dblist.ph...
6 https://aviation-safety.net/database/dblist.ph...
7 https://aviation-safety.net/database/dblist.ph...
8 https://aviation-safety.net/database/dblist.ph...
9 https://aviation-safety.net/database/dblist.ph...
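For part b), once the year URLs are collected, each dblist.php page holds an ordinary HTML table, so one option is to hand the page source to pd.read_html. A rough, untested sketch (the table index and the need for a browser-like User-Agent header are assumptions; adjust if the site serves the listing differently):
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: some sites reject the default requests user agent

def get_year_table(year_url):
    r = requests.get(year_url, headers=headers)
    # assume the accident listing is the first table pd.read_html can parse on the page
    return pd.read_html(r.text)[0]

df_1919 = get_year_table('https://aviation-safety.net/database/dblist.php?Year=1919')
print(df_1919.head())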
I am relatively new to programming and completely new to Stack Overflow. I thought a good way to learn would be a Python and Excel based project, but I am stuck. My plan was to scrape a website of addresses using BeautifulSoup, look up the Zillow estimates of value for those addresses, and populate them into tabular form in Excel. I am unable to figure out how to get the addresses (the HTML on the site I am trying to scrape seems pretty messy), but I was able to pull Google address links from the site. Sorry if this is a very basic question; any advice would help though:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import pandas as pd
req = Request("http://www.tjsc.com/Sales/TodaySales")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
count = 0
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
    count = count + 1
print(links)
print("count is", count)
po = links
pd.DataFrame(po).to_excel('todaysale.xlsx', header=False, index=False)
You are on the right track. Instead of 'a', you need to use a different HTML tag, 'td', for the rows, and 'th' for the column names. Here is one way to implement it. The list_slice function converts every 14 elements into one row, since the original table has 14 columns.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = "http://www.tjsc.com/Sales/TodaySales"
r = requests.get(url, verify=False)
text = r.text
soup = bs(text, 'lxml')
# Get column headers from the html file
header = []
for c_name in soup.findAll('th'):
    header.append(c_name)
# clean up the extracted header content
header = [h.contents[0].strip() for h in header]
# get each row of the table
row = []
for link in soup.find_all('td'):
    row.append(link.get_text().strip())
def list_slice(my_list, step):
    """This function takes any list, and divides it to chunks of size of "step"
    """
    return [my_list[x:x + step] for x in range(0, len(my_list), step)]
# creating the final dataframe
df = pd.DataFrame(list_slice(row, 14), columns=header[:14])
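To land the result in Excel, as in the original attempt, the frame can then be written out with df.to_excel('todaysale.xlsx', index=False) (this needs openpyxl or another Excel engine installed).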
I am trying to scrape a table from an SEC 10-K filing. I think it is going all right except for the part where pandas converts it to a DataFrame; as I am new to DataFrames, I think I am making a mistake in the indexing. Please help me with this, as I am getting the following error: "IndexError: index 2 is out of bounds for axis 0 with size 2"
I am using this code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1022344/000155837017000934/spg-20161231x10k.htm#Item8FinancialStatementsandSupplementary'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'lxml')
table = soup.find_all('table')[0]
new_table = pd.DataFrame(columns=range(0,2), index = [0])
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker,column_marker] = column.get_text()
        column_marker += 1
new_table
If the DataFrame issue isn't resolvable, then please suggest any other substitute, like writing the data to CSV/Excel. Also, any suggestion for extracting multiple tables at once would be really helpful.
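The IndexError comes from pre-allocating new_table with only two columns and a single row and then writing into cells that don't exist (row_marker is also never incremented). Rather than filling a DataFrame cell by cell, a simpler route is pd.read_html, which also covers the CSV/Excel output and the multiple-table case. A minimal sketch, assuming the tables of interest are among those pandas can parse from the filing (the SEC may also require a User-Agent header identifying your client):
import requests
import pandas as pd

url = 'https://www.sec.gov/Archives/edgar/data/1022344/000155837017000934/spg-20161231x10k.htm#Item8FinancialStatementsandSupplementary'
headers = {'User-Agent': 'your-name your@email.example'}  # assumption: identify your client to SEC.gov
html_doc = requests.get(url, headers=headers).text

# read_html returns one DataFrame per <table> it manages to parse
tables = pd.read_html(html_doc)
print(len(tables))

df = tables[0]                       # pick the table you need by index
df.to_csv('table_0.csv', index=False)

# or write every table to its own sheet in a single Excel file
with pd.ExcelWriter('all_tables.xlsx') as writer:
    for i, t in enumerate(tables):
        t.to_excel(writer, sheet_name=f'table_{i}', index=False)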