I don't know if there is such a thing, but I'm trying to do an ordered dict comprehension. However, it doesn't seem to work.
import requests
from bs4 import BeautifulSoup
from collections import OrderedDict
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
t_data = OrderedDict()
rows = tables[1].find_all('tr')
t_data = {row.th.text: row.td.text for row in rows if row.td}
It's left as a normal dict comprehension for now (I've also left out the usual requests to soup boilerplate).
Any ideas?
There is no comprehension syntax that produces an OrderedDict directly. You can, however, pass a generator expression to the OrderedDict constructor.
Try this on for size:
import requests
from bs4 import BeautifulSoup
from collections import OrderedDict
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
rows = tables[1].find_all('tr')
t_data = OrderedDict((row.th.text, row.td.text) for row in rows if row.td)
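Worth noting: on Python 3.7+ plain dicts preserve insertion order as a language guarantee, so the ordinary dict comprehension from the question already keeps the rows in document order; OrderedDict only matters if you rely on its extra behaviour, such as move_to_end() or order-sensitive equality.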
Here I am trying to get the value of every column in the table shown in the picture (for three different pages) and store them in a pandas DataFrame. I have collected the data and now have a list of lists, but when I try to add them to a dictionary I get an empty dictionary. Can anyone tell me what I'm doing wrong, or suggest an alternative way to create 3 DataFrames, one for each table?
Here is my code:
import numpy as np
import pandas as pd
from datetime import datetime
import pytz
import requests
import json
from bs4 import BeautifulSoup
url_list = ['https://www.coingecko.com/en/coins/ethereum/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel',
'https://www.coingecko.com/en/coins/cardano/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel',
'https://www.coingecko.com/en/coins/chainlink/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel']
results = []
for url in url_list:
    response = requests.get(url)
    src = response.content
    soup = BeautifulSoup(response.text, 'html.parser')
    results.append(soup.find_all("td", class_="text-center"))

collected_data = dict()

for result in results:
    for r in result:
        datas = r.find_all("td", title=True)
        for data in datas:
            collected_data.setdefault(data.text)

collected_data
What happens?
In your first for loop you only append the result sets of soup.find_all("td", class_="text-center") to results, so each r you later iterate over is already a <td> tag.
That is why datas = r.find_all("td", title=True) won't find what you are looking for; a <td> does not contain further <td> elements.
Note also that the column headers are not placed in <td> but in <th>.
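As a minimal standalone illustration (the cell markup here is made up, just to show the effect): a <td> returned by the first find_all contains no nested <td> elements, so searching inside it comes back empty.
from bs4 import BeautifulSoup

# Hypothetical single cell, similar to what the first find_all returns
cell = BeautifulSoup('<td class="text-center" title="x">42</td>', 'html.parser').td
print(cell.find_all('td', title=True))  # -> [] , nothing nested inside the <td>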
How to fix?
You could select more specifically: iterate over all <tr> in <tbody>:
for row in soup.select('tbody tr'):
While iterating, select the <th> and <td> cells of each row and zip() them together with the list of column headers into a dict():
    data.append(
        dict(zip([x.text for x in soup.select('thead th')], [x.text.strip() for x in row.select('th,td')]))
    )
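For instance, zipping the header texts with one row's cell texts (values taken from the output below) gives the mapping for that row:
headers = ['Date', 'Market Cap', 'Volume', 'Open', 'Close']
cells = ['2021-09-05', '$456,929,768,632', '$24,002,848,309', '$3,894.94', 'N/A']
dict(zip(headers, cells))
# {'Date': '2021-09-05', 'Market Cap': '$456,929,768,632', ...}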
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_list = ['https://www.coingecko.com/en/coins/ethereum/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel',
'https://www.coingecko.com/en/coins/cardano/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel',
'https://www.coingecko.com/en/coins/chainlink/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel']
data = []
for url in url_list:
    response = requests.get(url)
    src = response.content
    soup = BeautifulSoup(response.text, 'html.parser')

    for row in soup.select('tbody tr'):
        data.append(
            dict(zip([x.text for x in soup.select('thead th')], [x.text.strip() for x in row.select('th,td')]))
        )

pd.DataFrame(data)
Output
| Date       | Market Cap       | Volume          | Open      | Close     |
|------------|------------------|-----------------|-----------|-----------|
| 2021-09-05 | $456,929,768,632 | $24,002,848,309 | $3,894.94 | N/A       |
| 2021-09-04 | $462,019,852,288 | $30,463,347,266 | $3,936.16 | $3,894.94 |
| 2021-09-03 | $444,936,758,975 | $28,115,776,510 | $3,793.30 | $3,936.16 |
EDIT
To get a data frame per URL you can change the code to the following. It appends the frames to a list, so that you can iterate over them later.
Note: this is based on your comment, so use it if it fits. I would suggest also storing the coin provider as a column, so you could filter, group by, etc. across all providers, but that should be asked in a new question if it matters.
dfList = []

for url in url_list:
    response = requests.get(url)
    src = response.content
    soup = BeautifulSoup(response.text, 'html.parser')
    data = []
    coin = url.split("/")[5].upper()

    for row in soup.select('tbody tr'):
        data.append(
            dict(zip([f'{x.text}_{coin}' for x in soup.select('thead th')], [x.text.strip() for x in row.select('th,td')]))
        )

    # if you like to save directly as csv... change next line to -> pd.DataFrame(data).to_csv(f'{coin}.csv')
    dfList.append(pd.DataFrame(data))
Output
Select a data frame by list index, for example dfList[0]:
| Date_ETHEREUM | Market Cap_ETHEREUM | Volume_ETHEREUM | Open_ETHEREUM | Close_ETHEREUM |
|---------------|---------------------|-----------------|---------------|----------------|
| 2021-09-05    | $456,929,768,632    | $24,002,848,309 | $3,894.94     | N/A            |
| 2021-09-04    | $462,019,852,288    | $30,463,347,266 | $3,936.16     | $3,894.94      |
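Each entry in dfList is then an ordinary DataFrame, e.g. the Ethereum frame at index 0 (column names as shown in the output above):
eth = dfList[0]
print(eth.head())             # first rows of the Ethereum table
print(eth['Open_ETHEREUM'])   # a single column, selected by its suffixed name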
So I'm working on a web scraper function to get movie data from IMDB. While I'm able to get the data I need into a dictionary, because of how I've written the function, it appends a list of lists to the dictionary. I would like to return a single list.
So right now I'm getting
dict_key: [[A,B,C,D],[E,F,G,H],...] and I want dict_key: [A,B,C,D,E,F,G,H].
Eventually, I want to then take this dictionary and convert it to a pandas dataframe with col names corresponding to the dictionary keys.
Here is my function:
It takes a list of URLs, HTML tags, and variable names, and gets the movie category (or categories), year, and length of each movie.
def web_scraper(urls, class_list, col_names):
    import requests  # Import necessary modules
    from bs4 import BeautifulSoup
    import pandas as pd

    class_dict = {}
    for col in col_names:
        class_dict[col] = []

    for url in urls:
        page = requests.get(url)  # Link to the page
        soup = BeautifulSoup(page.content, 'html.parser')  # Create a soup object

        for i in range(len(class_list)):  # Loop through class_list and col_names
            names = soup.select(class_list[i])  # Get text
            names = [name.getText(strip=True) for name in names]  # append text to dataframe
            class_dict[col_names[i]].append(names)

    for class_ in class_dict:  # Here is my attempt to flatten the list
        class_ = [item for sublist in class_ for item in sublist]

    return class_dict
Use list.extend instead of list.append:
def web_scraper(urls, class_list, col_names):
    import requests  # Import necessary modules
    from bs4 import BeautifulSoup
    import pandas as pd

    class_dict = {}
    for col in col_names:
        class_dict[col] = []

    for url in urls:
        page = requests.get(url)  # Link to the page
        soup = BeautifulSoup(page.content, 'html.parser')  # Create a soup object

        for i in range(len(class_list)):  # Loop through class_list and col_names
            names = soup.select(class_list[i])  # Get text
            names = [name.getText(strip=True) for name in names]
            class_dict[col_names[i]].extend(names)  # <--- use list.extend

    return class_dict
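For reference, a tiny sketch of the difference between the two list methods, independent of the scraper itself:
row = [1, 2]
row.append([3, 4])   # -> [1, 2, [3, 4]]  (the whole list becomes one nested element)
row = [1, 2]
row.extend([3, 4])   # -> [1, 2, 3, 4]    (the elements are added individually)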
I have a code like this:
import os
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
from datetime import datetime, timedelta
URL_TEMPLATES = 'https://freemeteo.vn/thoi-tiet/ngoc-ha/history/daily-history/?gid=1572463&station=11376&date={}-{:02d}-{:02d}&language=vietnamese&country=vietnam'  #%loc
urls = URL_TEMPLATES.format(2015, 1, 1)
html_docs = requests.get(urls).text
soups = BeautifulSoup(html_docs, 'html.parser')
tables = soups.find(class_='table-list')
tables
Then the results look like this:
<div class="table-list">
<ul><li>Yên Phụ</li>
<li>Hữu Tiệp</li>
Can anyone help me turn this into a pandas DataFrame so it's easier to handle, e.g. so I can select the 'Yên Phụ' string? Thanks
You can parse the tables directly with pandas.
import pandas as pd
url = 'https://freemeteo.vn/thoi-tiet/ngoc-ha/history/daily-history/?gid=1572463&station=11376&date=2015-01-01&language=vietnamese&country=vietnam'
tables = pd.read_html(url)
That will give you a list of dataframes. Each table on the page will be one dataframe.
Then you can query a dataframe like:
tables[0].query("Tên == 'Hà Nội'")
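If you are not sure which entry of the returned list holds the table you want, a quick sketch to inspect them (nothing page-specific assumed) is:
for i, t in enumerate(tables):
    print(i, t.shape, list(t.columns)[:3])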
If you just wanted the cities in the table-list div:
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
table_list = soup.find('div', {'class': 'table-list'})
names, links = [], []
for city in table_list.find_all('a'):
    names.append(city.text)
    links.append(city['href'])
To turn the above 2 lists into a dataframe:
df = pd.DataFrame(zip(names, links), columns=['City', 'Link'])
I'm trying to soup-ify get requests
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_page = requests.get('https://www.dataquest.io')
soup = BeautifulSoup(html_page, "lxml")
soup.find_all('<\a>')
However, this just returns an empty list.
This will pull the table rows and assign each row to a dictionary, which is appended to a list. You may want to tweak the selectors slightly.
from bs4 import BeautifulSoup
import requests
from pprint import pprint
output_data = []  # This is a LoD containing all of the table data

for i in range(1, 453):  # For loop used to paginate
    data_page = requests.get(f'https://www.dataquest.io?')
    print(data_page)
    soup = BeautifulSoup(data_page.text, "lxml")

    # Find all of the table rows
    elements = soup.select('div.head_table_t')
    try:
        secondary_elements = soup.select('div.list_table_subs')
        elements = elements + secondary_elements
    except:
        pass
    print(len(elements))

    # Iterate through the rows and select individual column and assign it to the dictionary with the correct header
    for element in elements:
        data = {}
        data['Name'] = element.select_one('div.col_1 a').text.strip()
        data['Page URL'] = element.select_one('div.col_1 a')['href']
        output_data.append(data)  # Append dictionary (contact info) to the list
        pprint(data)  # Pretty Print the dictionary out (to see what you're receiving, this can be removed)
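Since the question already imports pandas, the collected list of dicts converts straight into a DataFrame if that is the end goal (a sketch, assuming output_data has been filled as above):
import pandas as pd

df = pd.DataFrame(output_data)  # one row per scraped element, columns 'Name' and 'Page URL'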
I'm attempting to extract some links from a chunk of beautiful soup html and append them to rows of a new pandas dataframe.
So far, I have this code:
url = "http://www.reed.co.uk/jobs
datecreatedoffset=Today&isnewjobssearch=True&pagesize=100"
r = ur.urlopen(url).read()
soup = BShtml(r, "html.parser")
adcount = soup.find_all("div", class_="pages")
print(adcount)
From my output I then want to take every link, identified by href="" and store each one in a new row of a pandas dataframe.
Using the above snippet I would end up with 6 rows in my new dataset.
Any help would be appreciated!
Your link gives a 404, but the logic should be the same as below. You just need to extract the anchor tags with the page class and join them to the base URL:
import pandas as pd
import requests
from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin
from bs4 import BeautifulSoup
base = "http://www.reed.co.uk/jobs"
url = "http://www.reed.co.uk/jobs?keywords=&location=&jobtitleonly=false"
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
df = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div.pages a.page")])
print(df)
Which gives you:
links
0 http://www.reed.co.uk/jobs?cached=True&pageno=2
1 http://www.reed.co.uk/jobs?cached=True&pageno=3
2 http://www.reed.co.uk/jobs?cached=True&pageno=4
3 http://www.reed.co.uk/jobs?cached=True&pageno=5
4 http://www.reed.co.uk/jobs?cached=True&pageno=...
5 http://www.reed.co.uk/jobs?cached=True&pageno=2