Is there an OrderedDict comprehension? - python

I don't know if there is such a thing, but I'm trying to do an ordered dict comprehension. However, it doesn't seem to work.
import requests
from bs4 import BeautifulSoup
from collections import OrderedDict
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
t_data = OrderedDict()
rows = tables[1].find_all('tr')
t_data = {row.th.text: row.td.text for row in rows if row.td }
It's left as a normal dict comprehension for now (I've also left out the usual requests to soup boilerplate).
Any ideas?

You can't directly do a comprehension with an OrderedDict. You can, however, use a generator in the constructor for OrderedDict.
Try this on for size:
import requests
from bs4 import BeautifulSoup
from collections import OrderedDict
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
rows = tables[1].find_all('tr')
t_data = OrderedDict((row.th.text, row.td.text) for row in rows if row.td)
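The same pattern can be tried without any scraping. Below is a minimal sketch, assuming a hypothetical list of (header, value) pairs in place of the table rows; the names pairs and plain are made up for illustration. It also shows that on Python 3.7 and later a plain dict comprehension keeps insertion order, so OrderedDict is only needed for its extra methods or on older versions.

```python
from collections import OrderedDict

# Hypothetical (header, value) pairs standing in for the scraped rows;
# None marks a row that has no <td> cell.
pairs = [("Name", "Ada"), ("Born", "1815"), ("Notes", None)]

# Generator expression passed to the constructor, with a filter
# playing the role of `if row.td` in the answer above.
t_data = OrderedDict((key, value) for key, value in pairs if value is not None)

# On Python 3.7+ a plain dict comprehension preserves insertion order too.
plain = {key: value for key, value in pairs if value is not None}

print(list(t_data.items()))
print(list(plain.items()) == list(t_data.items()))
```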

Related

How to collect "td" text from list of lists and add them into the dictionary python beautifulSoup

Here I am trying to get the value of every column in the table shown in the picture (for three different pages) and store them in a pandas DataFrame. I have collected the data and now have a list of lists, but when I try to add them to a dictionary I get an empty dictionary. Can anyone tell me what I'm doing wrong, or suggest an alternative way to create three DataFrames, one for each table?
Here is my code:
import numpy as np
import pandas as pd
from datetime import datetime
import pytz
import requests
import json
from bs4 import BeautifulSoup
url_list = ['https://www.coingecko.com/en/coins/ethereum/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel',
'https://www.coingecko.com/en/coins/cardano/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel',
'https://www.coingecko.com/en/coins/chainlink/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel']
results = []
for url in url_list:
    response = requests.get(url)
    src = response.content
    soup = BeautifulSoup(response.text , 'html.parser')
    results.append(soup.find_all( "td",class_= "text-center"))
collected_data = dict()
for result in results:
    for r in result:
        datas = r.find_all("td", title=True)
        for data in datas:
            collected_data.setdefault(data.text)
collected_data
What happens?
In your first for loop you only append the result set of soup.find_all("td", class_="text-center") to results.
Each item in that result set is itself a <td>, and r.find_all("td", title=True) searches only its descendants, so you won't find what you are looking for.
Note also that the column headers are placed in <th>, not in <td>.
How to fix?
You could select more specifically and iterate over all <tr> elements in <tbody>:
for row in soup.select('tbody tr'):
While iterating, select the <th> and <td> cells of each row and zip() them with the list of column headers to build a dict():
data.append(
dict(zip([x.text for x in soup.select('thead th')], [x.text.strip() for x in row.select('th,td')]))
)
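To see what dict(zip(...)) does in isolation, here is a minimal sketch that uses hard-coded lists in place of the scraped headers and cells (the values are taken from the output shown in this answer; the variable names are made up for illustration).

```python
# Column headers, as they would come from soup.select('thead th')
headers = ["Date", "Market Cap", "Volume", "Open", "Close"]

# Cell texts of one table row, as they would come from row.select('th,td')
cells = ["2021-09-05", "$456,929,768,632", "$24,002,848,309", "$3,894.94", "N/A"]

# zip() pairs the items position by position; dict() turns the pairs into a mapping.
row = dict(zip(headers, cells))
print(row["Date"], row["Close"])

# zip() stops at the shorter input, so a row with missing cells
# silently yields fewer keys instead of raising an error.
short_row = dict(zip(headers, cells[:2]))
print(list(short_row))
```

Because of that truncation, it may be worth checking len(cells) against len(headers) when a page contains irregular rows.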
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup
url_list = ['https://www.coingecko.com/en/coins/ethereum/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel',
'https://www.coingecko.com/en/coins/cardano/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel',
'https://www.coingecko.com/en/coins/chainlink/historical_data/usd?start_date=2021-08-06&end_date=2021-09-05#panel']
data = []
for url in url_list:
    response = requests.get(url)
    src = response.content
    soup = BeautifulSoup(response.text , 'html.parser')
    for row in soup.select('tbody tr'):
        data.append(
            dict(zip([x.text for x in soup.select('thead th')], [x.text.strip() for x in row.select('th,td')]))
        )
pd.DataFrame(data)
Output
Date        Market Cap        Volume           Open       Close
2021-09-05  $456,929,768,632  $24,002,848,309  $3,894.94  N/A
2021-09-04  $462,019,852,288  $30,463,347,266  $3,936.16  $3,894.94
2021-09-03  $444,936,758,975  $28,115,776,510  $3,793.30  $3,936.16
EDIT
To get one data frame per URL, you can change the code as follows. It appends the frames to a list that you can iterate over afterwards.
Note: This is based on your comment, and if it fits, okay. I would suggest also storing the coin provider as a column, so that you can filter, group by, and so on across all providers. But that should be asked in a new question, if it matters.
dfList = []
for url in url_list:
    response = requests.get(url)
    src = response.content
    soup = BeautifulSoup(response.text , 'html.parser')
    data = []
    coin = url.split("/")[5].upper()
    for row in soup.select('tbody tr'):
        data.append(
            dict(zip([f'{x.text}_{coin}' for x in soup.select('thead th')], [x.text.strip() for x in row.select('th,td')]))
        )
    # if you like to save directly as csv... change next line to -> pd.DataFrame(data).to_csv(f'{coin}.csv')
    dfList.append(pd.DataFrame(data))
Output
Select a data frame by list index, for example dfList[0]:
Date_ETHEREUM  Market Cap_ETHEREUM  Volume_ETHEREUM  Open_ETHEREUM  Close_ETHEREUM
2021-09-05     $456,929,768,632     $24,002,848,309  $3,894.94      N/A
2021-09-04     $462,019,852,288     $30,463,347,266  $3,936.16      $3,894.94
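The note above suggests keeping the coin as its own column rather than baking it into the column names. A minimal sketch of that idea follows, assuming the URL path always has the shape /en/coins/<coin>/historical_data/...; the helpers coin_from_url and tag_rows are hypothetical names, not part of the answer's code.

```python
from urllib.parse import urlparse

def coin_from_url(url):
    """Return the coin slug, assuming a path like /en/coins/<coin>/historical_data/usd."""
    parts = urlparse(url).path.strip("/").split("/")
    # parts -> ['en', 'coins', '<coin>', 'historical_data', 'usd']
    return parts[2]

def tag_rows(rows, coin):
    """Return new row dicts with an extra 'Coin' key; the other column names stay unchanged."""
    return [{"Coin": coin, **row} for row in rows]

url = ("https://www.coingecko.com/en/coins/ethereum/historical_data/usd"
       "?start_date=2021-08-06&end_date=2021-09-05#panel")

# One hard-coded row standing in for the scraped table rows.
rows = [{"Date": "2021-09-05", "Open": "$3,894.94"}]

tagged = tag_rows(rows, coin_from_url(url))
print(tagged)
```

Rows tagged this way from all three URLs can go into a single list and one data frame, which can then be filtered or grouped by the Coin column.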

How do I iterate through multiple URLs with BeautifulSoup and return a single list rather than a list of lists? [duplicate]

So I'm working on a web scraper function to get movie data from IMDB. While I'm able to get the data I need into a dictionary, because of how I've written the function, it appends a list of lists to the dictionary. I would like to return a single list.
So right now I'm getting
dict_key: [[A,B,C,D],[E,F,G,H],...] and I want dict_key: [A,B,C,D,E,F,G,H].
Eventually, I want to then take this dictionary and convert it to a pandas dataframe with col names corresponding to the dictionary keys.
Here is my function:
It takes a list of URLs, HTML tags, and variable names and gets the movie category(s), year, and length of movie.
def web_scraper(urls, class_list, col_names):
    import requests # Import necessary modules
    from bs4 import BeautifulSoup
    import pandas as pd
    class_dict = {}
    for col in col_names:
        class_dict[col] = []
    for url in urls:
        page = requests.get(url) # Link to the page
        soup = BeautifulSoup(page.content, 'html.parser') # Create a soup object
        for i in range(len(class_list)): # Loop through class_list and col_names
            names = soup.select(class_list[i]) # Get text
            names = [name.getText(strip=True) for name in names] # append text to dataframe
            class_dict[col_names[i]].append(names)
    for class_ in class_dict: # Here is my attempt to flatten the list
        class_ = [item for sublist in class_ for item in sublist]
    return class_dict
Use list.extend instead of list.append:
def web_scraper(urls, class_list, col_names):
    import requests # Import necessary modules
    from bs4 import BeautifulSoup
    import pandas as pd
    class_dict = {}
    for col in col_names:
        class_dict[col] = []
    for url in urls:
        page = requests.get(url) # Fetch the page
        soup = BeautifulSoup(page.content, 'html.parser') # Create a soup object
        for i in range(len(class_list)): # Loop through class_list and col_names
            names = soup.select(class_list[i]) # Select the matching elements
            names = [name.getText(strip=True) for name in names] # Extract their text
            class_dict[col_names[i]].extend(names) # <--- use list.extend
    return class_dict
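The difference between the two methods is easy to check without any scraping. Here is a minimal sketch, assuming two hypothetical pages that each produced a short list of names:

```python
# Hypothetical per-page results, standing in for the scraped `names` lists.
pages = [["A", "B", "C", "D"], ["E", "F", "G", "H"]]

nested = []
flat = []
for names in pages:
    nested.append(names)  # adds the whole list as ONE element -> list of lists
    flat.extend(names)    # adds each item of the list -> single flat list

print(nested)
print(flat)

# An already nested list can still be flattened afterwards,
# as long as the result is assigned back to a name you keep.
flattened = [item for sublist in nested for item in sublist]
print(flattened == flat)
```

The flattening loop in the question had no effect for two reasons: iterating over a dictionary yields its keys rather than its lists, and assigning to the loop variable only rebinds that local name without changing the dictionary.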

Crawling data from the web then restructure to a Pandas DataFrame

I have a code like this:
import os
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
from datetime import datetime, timedelta
URL_TEMPLATES ='https://freemeteo.vn/thoi-tiet/ngoc-ha/history/daily-history/?gid=1572463&station=11376&date={}-{:02d}-{:02d}&language=vietnamese&country=vietnam' #%loc
urls = URL_TEMPLATES.format(2015,1,1)
html_doc = requests.get(urls).text
soup = BeautifulSoup(html_doc, 'html.parser')
tables = soup.find(class_='table-list')
tables
The result looks like this:
<div class="table-list">
<ul><li>Yên Phụ</li>
<li>Hữu Tiệp</li>
Can anyone help me turn tables into a pandas DataFrame that is easy to work with, so that I can, for example, select the string 'Yên Phụ'? Thanks
You can parse the tables directly with pandas.
import pandas as pd
url_template = 'https://freemeteo.vn/thoi-tiet/ngoc-ha/history/daily-history/?gid=1572463&station=11376&date={}-{:02d}-{:02d}&language=vietnamese&country=vietnam'
url = url_template.format(2015, 1, 1)  # fill in year, month and day, as in the question
tables = pd.read_html(url)
That will give you a list of dataframes. Each table on the page will be one dataframe.
Then you can query a dataframe like:
tables[0].query("Tên == 'Hà Nội'")
If you just wanted the cities in table-list div:
import requests
from bs4 import BeautifulSoup
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
table_list = soup.find('div', {'class': 'table-list'})
names, links = [], []
for city in table_list.find_all('a'):
    names.append(city.text)
    links.append(city['href'])
To turn the above 2 lists into a dataframe:
df = pd.DataFrame(zip(names, links), columns=['City', 'Link'])
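The URL in the question is a template with three placeholders (year, month, day), and the placeholders have to be filled in before the URL is requested. Below is a minimal sketch of building the URLs for several consecutive days, assuming that is what the datetime and timedelta imports in the question were meant for; the helper daily_urls is a hypothetical name.

```python
from datetime import date, timedelta

URL_TEMPLATE = ('https://freemeteo.vn/thoi-tiet/ngoc-ha/history/daily-history/'
                '?gid=1572463&station=11376&date={}-{:02d}-{:02d}'
                '&language=vietnamese&country=vietnam')

def daily_urls(start, days):
    """Return one formatted URL per day, starting at `start` (hypothetical helper)."""
    urls = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        urls.append(URL_TEMPLATE.format(day.year, day.month, day.day))
    return urls

urls = daily_urls(date(2015, 1, 30), 3)
for u in urls:
    print(u)
```

Each of these URLs can then be passed to pd.read_html or requests.get in turn.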

Soup-ify get requests

I'm trying to soup-ify get requests
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_page = requests.get('"https://www.dataquest.io"')
soup = BeautifulSoup(html_page, "lxml")
soup.find_all('<\a>')
However, this just returns an empty list
This will pull the table rows and assign each row to a dictionary, which is appended to a list. You may want to tweak the selectors slightly.
from bs4 import BeautifulSoup
import requests
from pprint import pprint
output_data = [] # This is a LoD containing all of the table data
for i in range(1, 453): # For loop used to paginate
    data_page = requests.get(f'https://www.dataquest.io?')
    print(data_page)
    soup = BeautifulSoup(data_page.text, "lxml")
    # Find all of the table rows
    elements = soup.select('div.head_table_t')
    try:
        secondary_elements = soup.select('div.list_table_subs')
        elements = elements + secondary_elements
    except:
        pass
    print(len(elements))
    # Iterate through the rows and select individual column and assign it to the dictionary with the correct header
    for element in elements:
        data = {}
        data['Name'] = element.select_one('div.col_1 a').text.strip()
        data['Page URL'] = element.select_one('div.col_1 a')['href']
        output_data.append(data) # Append dictionary (contact info) to the list
        pprint(data) # Pretty Print the dictionary out (to see what you're receiving, this can be removed)

Appending links to new rows in pandas df after using beautifulsoup

I'm attempting to extract some links from a chunk of beautiful soup html and append them to rows of a new pandas dataframe.
So far, I have this code:
url = "http://www.reed.co.uk/jobs?datecreatedoffset=Today&isnewjobssearch=True&pagesize=100"
r = ur.urlopen(url).read()
soup = BShtml(r, "html.parser")
adcount = soup.find_all("div", class_="pages")
print(adcount)
From my output I then want to take every link, identified by href="" and store each one in a new row of a pandas dataframe.
Using the above snippet I would end up with 6 rows in my new dataset.
Any help would be appreciated!
Your link gives a 404, but the logic should be the same as below. You just need to extract the anchor tags with the page class and join them to the base URL:
import pandas as pd
from urllib.parse import urljoin  # Python 3; on Python 2 this was `from urlparse import urljoin`
import requests
from bs4 import BeautifulSoup
base = "http://www.reed.co.uk/jobs"
url = "http://www.reed.co.uk/jobs?keywords=&location=&jobtitleonly=false"
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
df = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div.pages a.page")])
print(df)
Which gives you:
links
0 http://www.reed.co.uk/jobs?cached=True&pageno=2
1 http://www.reed.co.uk/jobs?cached=True&pageno=3
2 http://www.reed.co.uk/jobs?cached=True&pageno=4
3 http://www.reed.co.uk/jobs?cached=True&pageno=5
4 http://www.reed.co.uk/jobs?cached=True&pageno=...
5 http://www.reed.co.uk/jobs?cached=True&pageno=2
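The joining step can be checked on its own with the standard library. Here is a minimal sketch, assuming two hypothetical href values of the kind a pagination link might carry:

```python
from urllib.parse import urljoin

base = "http://www.reed.co.uk/jobs"

# Hypothetical href values, standing in for what a["href"] might return.
hrefs = [
    "/jobs?cached=True&pageno=2",           # root-relative link
    "http://www.reed.co.uk/jobs?pageno=3",  # already absolute
]

# urljoin resolves relative links against the base and leaves absolute ones unchanged.
links = [urljoin(base, href) for href in hrefs]
for link in links:
    print(link)
```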
