Adding href to pandas .read_html DF - Python

I want to create a table with the information available on this website. I want the table to have 3 columns: 0 series/date, 1 title and 2 links. I already managed to get the first two columns but I don't know how to get the link for each entry.
import pandas as pd
import requests
url = "http://legislaturautuado.com/pgs/resolutions.php?st=5&f=2016"
r = requests.get(url)
df_list = pd.read_html(r.text)
df = df_list[0]
df.head()
Is it possible to get what I want using only pandas?

As far as I know, it's not possible with pandas only. It can be done with BeautifulSoup, though:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "http://legislaturautuado.com/pgs/resolutions.php?st=5&f=2016"
r = requests.get(url)
html_table = BeautifulSoup(r.text, 'html.parser').find('table')
r.close()
df = pd.read_html(str(html_table), header=0)[0]
df['Link'] = [link.get('href') for link in html_table.find_all('a')]
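If you are on pandas 1.5 or newer, read_html can also pull the hrefs out for you via its extract_links argument, so BeautifulSoup isn't strictly needed. A minimal sketch, assuming that pandas version is available (the title column is accessed by position because the real column names depend on the page):
import pandas as pd
import requests
url = "http://legislaturautuado.com/pgs/resolutions.php?st=5&f=2016"
r = requests.get(url)
# extract_links="body" turns every body cell into a (text, href) tuple;
# href is None for cells that contain no link.
df = pd.read_html(r.text, extract_links="body")[0]
# Unwrap the tuples: add the href from the title column, then keep only the text elsewhere.
df["Link"] = df[df.columns[1]].map(lambda cell: cell[1])
for col in df.columns[:-1]:
    df[col] = df[col].map(lambda cell: cell[0])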

Related

TypeError: 'in <string>' requires string as left operand, not NoneType

I am trying to create a simple scraper to gather basketball stats. I was able to get the info I want; however, I can't figure out how to organize it all in a table.
I keep getting "TypeError: 'in <string>' requires string as left operand, not NoneType".
Please see my code below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
url = 'https://basketball.realgm.com/ncaa/boxscore/2021-01-29/North-Texas-at-Rice/367436'
page = requests.get(url)
soup = BeautifulSoup(page.content , 'html.parser')
#Extracting Columns
tables = soup.find('div', class_= 'boxscore-gamesummary')
columns = tables.find_all('th', class_='nosort')
#Extracting Stats
tables = soup.find('div', class_= 'boxscore-gamesummary')
stats = tables.find_all('td')
#Filling DataFrame
temp_df = pd.DataFrame(stats).transpose()
temp_df.columns = columns
final_df = pd.concat([final_df,temp_df], ignore_index=True)
final_df
Looking forward to hearing from someone
Pandas already has a built-in method to get a DataFrame from HTML, which should make things much easier here.
Code
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://basketball.realgm.com/ncaa/boxscore/2021-01-29/North-Texas-at-Rice/367436'
page = requests.get(url)
soup = BeautifulSoup(page.content , 'html.parser')
tables = soup.find('div', class_= 'boxscore-gamesummary').find_all('table')
df = pd.read_html(str(tables))[0]
print(df)
Output
  Unnamed: 0   1   2  Final
0  UNT (8-5)  36  43     79
1  RU (10-7)  37  37     74
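If the goal is to stack several games into one final_df (which the pd.concat in the question suggests), the same read_html call can be reused in a loop. A rough sketch; the URL list is a placeholder to fill in with the games you want:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Placeholder list: add one box-score URL per game to collect.
urls = [
    'https://basketball.realgm.com/ncaa/boxscore/2021-01-29/North-Texas-at-Rice/367436',
]
frames = []
for game_url in urls:
    page = requests.get(game_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # Grab just the game-summary table and let pandas parse it.
    summary = soup.find('div', class_='boxscore-gamesummary').find('table')
    frames.append(pd.read_html(str(summary))[0])
final_df = pd.concat(frames, ignore_index=True)
print(final_df)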

Extract multiple page web table into Excel

I have a table that spans many pages. I'm able to pull the info from a designated page and write it to a CSV file. My goal now is to iterate through all the pages and append each page's rows below the previous page's. Here is the code so far, which works on a single page:
import requests
import pandas as pd
url = 'https://www.mineralanswers.com/oklahoma/producers?page=1'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')
The page URL is set up in the "...producers?page=1, ...producers?page=2, ...producers?page=3" format, so I feel like this should be possible with a loop; I'm just having trouble appending the data instead of overwriting it.
Here is corrected example code that fetches 3 pages and appends them to one DataFrame.
import requests
import pandas as pd
df = pd.DataFrame()
for page in range(1, 4):
    url = 'https://www.mineralanswers.com/oklahoma/producers?page=' + str(page)
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df.append(df_list[-1], ignore_index=True)

df.to_csv('my data.csv')
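Note that DataFrame.append was deprecated and later removed in pandas 2.0, so on current versions the usual pattern is to collect the pages in a list and concatenate once at the end; roughly:
import requests
import pandas as pd
frames = []
for page in range(1, 4):
    url = 'https://www.mineralanswers.com/oklahoma/producers?page=' + str(page)
    html = requests.get(url).text
    # read_html returns a list of tables; keep the last one, as in the original code
    frames.append(pd.read_html(html)[-1])
df = pd.concat(frames, ignore_index=True)
df.to_csv('my data.csv')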

Crawling data from the web, then restructuring it into a pandas DataFrame

I have code like this:
import os
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
from datetime import datetime, timedelta
URL_TEMPLATES ='https://freemeteo.vn/thoi-tiet/ngoc-ha/history/daily-history/?gid=1572463&station=11376&date={}-{:02d}-{:02d}&language=vietnamese&country=vietnam' #%loc
urls = URL_TEMPLATES.format(2015,1,1)
html_docs = requests.get(urls).text
soups = BeautifulSoup(html_docs, 'html.parser')
tables = soups.find(class_='table-list')
tables
Then the result looks like this:
<div class="table-list">
<ul><li>Yên Phụ</li>
<li>Hữu Tiệp</li>
Can anyone help me turn these tables into a pandas DataFrame that is easy to work with, e.g. so I can select the 'Yên Phụ' string? Thanks
You can parse the tables directly with pandas.
import pandas as pd
url = 'https://freemeteo.vn/thoi-tiet/ngoc-ha/history/daily-history/?gid=1572463&station=11376&date={}-{:02d}-{:02d}&language=vietnamese&country=vietnam'.format(2015, 1, 1)
tables = pd.read_html(url)
That will give you a list of dataframes. Each table on the page will be one dataframe.
Then you can query a dataframe like:
tables[0].query("Tên == 'Hà Nội'")
If you just wanted the cities in the table-list div:
import requests
from bs4 import BeautifulSoup

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
table_list = soup.find('div', {'class': 'table-list'})

names, links = [], []
for city in table_list.find_all('a'):
    names.append(city.text)
    links.append(city['href'])
To turn the above 2 lists into a dataframe:
df = pd.DataFrame(zip(names, links), columns=['City', 'Link'])
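With that in place, pulling out a single city (the 'Yên Phụ' entry from the question, say) is just a boolean filter:
print(df[df['City'] == 'Yên Phụ'])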

How do I append the output from BeautifulSoup to a pandas DataFrame

I am relatively new to Python. I am planning to:
a) obtain a list of URLs from the following URL (https://aviation-safety.net/database/) with data from the year 1919 onwards (https://aviation-safety.net/database/dblist.php?Year=1919), and
b) obtain the data (date, type, registration, operator, fat., location, cat) from 1919 to the current year.
However, I ran into some problems and am still stuck on a).
Any form of help is appreciated, thank you so much!
#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find('a', href = True)

    #try clause to go through the content and grab the URLs
    try:
        for row in datatable:
            cols = row.find_all("|")
            if len(cols) > 1:
                links.append(x, cols = cols)
    except: pass
#place links into numpy array
links_array = np.asarray(links)
len(links_array)
#check if links are in dataframe
df = pd.DataFrame(links_array)
df.columns = ['url']
df.head(10)
I can't seem to get the URLs.
It would be great if I could get the following:
S/N URL
1 https://aviation-safety.net/database/dblist.php?Year=1919
2 https://aviation-safety.net/database/dblist.php?Year=1920
3 https://aviation-safety.net/database/dblist.php?Year=1921
You're not extracting the href attributes from the tags you are pulling. What you want to do is find all <a> tags with links (which you did, but you need to use find_all, as find will just return the first one it finds). Then iterate through those tags: the code below checks whether the href contains the substring 'Year' and, if it does, appends that URL to the list.
#import packages
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
#start of code
mainurl = "https://aviation-safety.net/database/"
def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href = True)
    return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)
Output:
df.head(10)
Out[24]:
url
0 https://aviation-safety.net/database/dblist.ph...
1 https://aviation-safety.net/database/dblist.ph...
2 https://aviation-safety.net/database/dblist.ph...
3 https://aviation-safety.net/database/dblist.ph...
4 https://aviation-safety.net/database/dblist.ph...
5 https://aviation-safety.net/database/dblist.ph...
6 https://aviation-safety.net/database/dblist.ph...
7 https://aviation-safety.net/database/dblist.ph...
8 https://aviation-safety.net/database/dblist.ph...
9 https://aviation-safety.net/database/dblist.ph...
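For part b), one possible sketch is to feed the collected year URLs to pd.read_html in a loop and stack the results. This assumes the accident list is the first table on each year page (adjust the index if not), and the site may also insist on a browser-like User-Agent header:
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}  # some sites reject the default requests user agent
frames = []
for year_url in links:  # the list built above
    html = requests.get(year_url, headers=headers).text
    try:
        # assume the first table on the page is the accident list; adjust if needed
        frames.append(pd.read_html(html)[0])
    except ValueError:  # read_html raises ValueError when no tables are found
        continue
all_years = pd.concat(frames, ignore_index=True)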

Appending links to new rows in pandas df after using beautifulsoup

I'm attempting to extract some links from a chunk of BeautifulSoup HTML and append them to rows of a new pandas DataFrame.
So far, I have this code:
url = "http://www.reed.co.uk/jobs
datecreatedoffset=Today&isnewjobssearch=True&pagesize=100"
r = ur.urlopen(url).read()
soup = BShtml(r, "html.parser")
adcount = soup.find_all("div", class_="pages")
print(adcount)
From my output I then want to take every link, identified by href="" and store each one in a new row of a pandas dataframe.
Using the above snippet I would end up with 6 rows in my new dataset.
Any help would be appreciated!
Your link gives a 404, but the logic should be the same as below. You just need to extract the anchor tags with the page class and join them to the base URL:
import pandas as pd
from urllib.parse import urljoin  # `from urlparse import urljoin` on Python 2
import requests
from bs4 import BeautifulSoup
base = "http://www.reed.co.uk/jobs"
url = "http://www.reed.co.uk/jobs?keywords=&location=&jobtitleonly=false"
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
df = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div.pages a.page")])
print(df)
Which gives you:
links
0 http://www.reed.co.uk/jobs?cached=True&pageno=2
1 http://www.reed.co.uk/jobs?cached=True&pageno=3
2 http://www.reed.co.uk/jobs?cached=True&pageno=4
3 http://www.reed.co.uk/jobs?cached=True&pageno=5
4 http://www.reed.co.uk/jobs?cached=True&pageno=...
5 http://www.reed.co.uk/jobs?cached=True&pageno=2
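Note that the pager repeats one entry (row 5 points at pageno=2 again, via the "next" link), so an extra drop_duplicates call tidies that up if it matters:
df = df.drop_duplicates().reset_index(drop=True)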
