I'm trying to extract information from HTML pages that I fetch from URLs built in a for loop and then parse with Beautiful Soup.
I manage to isolate the information correctly, but when I try to export the data I get the error "All arrays must be of the same length".
import random
import requests
import pandas as pd
import bs4 as bs

weblink = []
filing_type = []
company_name = []
name = []
date = []

# Importing the file
df = pd.read_csv('Downloads\Dropped_Companies.csv')

# Getting the companies' names into a list
companies_column = list(df.columns.values)[4]
name_ = df[companies_column].tolist()

# Formatting the companies' names for creating the URLs
for CompanyName in name_:
    company_name.append(CompanyName.lower().replace(" ", '_'))

for item in range(0, len(company_name)):
    link = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&company=' + company_name[item] + '&type=10-K&dateb=&owner=exclude&count=100'
    # Getting the HTML text (headers_list is defined elsewhere and not shown)
    headers = random.choice(headers_list)
    r = requests.Session()
    r.headers = headers
    html = r.get(link).text
    # Calling Beautiful Soup for better HTML text
    soup = bs.BeautifulSoup(html, 'lxml')
    tet_ = soup.find_all("a", id="documentsbutton")
    # Get the links
    for link in tet_:
        weblink.append('https://www.sec.gov' + link.get('href'))
    test11 = soup.find_all("table", class_="tableFile2")
    for link in test11:
        row = link.find_all("td", nowrap="nowrap")
        for i in range(0, len(row), 3):
            filing_type.append(row[i].getText())
            date.append(link.find("td", class_="small").find_next_sibling("td").text)
            name.append(company_name[item])

data = {'Company Name': name, 'Filing Date': date, 'Filing Type': filing_type, "Weblink": weblink}
outputdf = pd.DataFrame(data)
outputdf.to_csv('Downloads/t_10KLinks.csv')
Since the lists end up with different lengths, build the frame with pandas.DataFrame.from_dict and orient='index' instead (see the pandas.DataFrame.from_dict documentation):
data = {'Company Name': name, 'Filing Date': date, 'Filing Type': filing_type, "Weblink": weblink}
outputdf = pd.DataFrame.from_dict(data, orient='index')
outputdf.to_csv('Downloads/t_10KLinks.csv')
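With orient='index' each key becomes a row, and shorter lists are padded with NaN instead of raising "All arrays must be of the same length"; transposing with .T turns the keys back into columns. A minimal sketch with made-up lists of unequal length:

import pandas as pd

# Hypothetical short lists standing in for the scraped results
data = {
    'Company Name': ['a_corp', 'b_corp', 'c_corp'],
    'Filing Date': ['2020-01-01', '2020-02-01'],
    'Filing Type': ['10-K', '10-K', '10-K'],
    'Weblink': ['https://www.sec.gov/doc1'],
}

# Each key becomes a row, short rows are padded with NaN; .T turns the keys back into columns
outputdf = pd.DataFrame.from_dict(data, orient='index').T
print(outputdf)

Keep in mind this only papers over the symptom: the lists have different lengths because each one is appended a different number of times per company, so a padded row will not necessarily line a filing up with its own link.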
I'm trying to scrape some data from a website and I have an issue matching the data from every subpage to the data of the main page.
For example: the main page has a country name like "Alabama Trucking Companies", and when I follow its link I find some cities (Abbeville, Adamsville, etc.). I need to associate every city's details (city name and city link) with its country name.
I scraped the country names from the main page and the city names from the subpages, but the code below extracts the data from the main and sub pages separately, without matching them to each other. How can I solve this?
The code that I've used:
from datetime import datetime
import requests
import pandas as pd
from bs4 import BeautifulSoup

start_time = datetime.now()

url = 'https://www.quicktransportsolutions.com/carrier/usa-trucking-companies.php'
page_country = requests.get(url).content
soup_country = BeautifulSoup(page_country, 'lxml')
countries = soup_country.find('div', {'class': 'col-xs-12 col-sm-9'})

countries_list = []
country_info = countries.find_all('div', {'class': 'col-md-4 column'})
for i in country_info:
    title_country = i.text.strip()
    href_country = i.find('a', href=True)['href']
    countries_list.append({'Country Title': title_country, 'Link': f'https://www.quicktransportsolutions.com//carrier//{href_country}'})

countries_links = []
for i in pd.DataFrame(countries_list)['Link']:
    page_city = requests.get(i).content
    soup_city = BeautifulSoup(page_city, 'lxml')
    city = soup_city.find('div', {'align': 'center', 'class': 'table-responsive'})
    countries_links.append(city)

cities_list = []
for i in countries_links:
    city_info = i.find_all('td', "")
    for i in city_info:
        title_city = i.text.strip()
        try:
            href_city = i.find('a', href=True)['href']
        except:
            continue
        cities_list.append({'City Title': title_city, 'City Link': href_city})

end_time = datetime.now()
print(f'Duration: {end_time - start_time}')

df = pd.DataFrame(cities_list)
df = df.loc[df['City Link'] != '#'].drop_duplicates().reset_index(drop=True)
df
The expected output pairs every city (and its link) with its country name.
Instead of parsing all of the state links and adding them to a list before crawling each of the city pages, you can parse each state, extract its link, then immediately follow that link to get all of the cities for that state before moving on to the next state, appending everything to one master list as you go.
For example:
from datetime import datetime
import requests
import pandas as pd
from bs4 import BeautifulSoup

start_time = datetime.now()

url = 'https://www.quicktransportsolutions.com/carrier/usa-trucking-companies.php'
page_country = requests.get(url).content
soup_country = BeautifulSoup(page_country, 'lxml')
countries = soup_country.find('div', {'class': 'col-xs-12 col-sm-9'})

data_list = []
country_info = countries.find_all('div', {'class': 'col-md-4 column'})
for i in country_info:
    title_country = i.text.strip()
    href_country = i.find('a', href=True)['href']
    link = f'https://www.quicktransportsolutions.com/carrier/{href_country}'

    page_city = requests.get(link).content
    soup_city = BeautifulSoup(page_city, 'lxml')
    city = soup_city.find('div', {'align': 'center', 'class': 'table-responsive'})

    city_info = city.find_all('td', "")
    for i in city_info:
        title_city = i.text.strip()
        try:
            href_city = i.find('a', href=True)['href']
        except:
            continue
        row = {
            'Country Title': title_country,
            'Link': link,
            'City Title': title_city,
            'City Link': href_city
        }
        data_list.append(row)

end_time = datetime.now()
print(f'Duration: {end_time - start_time}')

df = pd.DataFrame(data_list)
df = df.loc[df['City Link'] != '#'].drop_duplicates().reset_index(drop=True)
I'm trying to web scrape a data table on Wikipedia using Python and bs4, but my code never picks up the first column (index zero). I suspect something is wrong with the indexing but I can't figure it out. Please help. See the code below.
import requests
import pandas as pd
from bs4 import BeautifulSoup

response_obj = requests.get('https://en.wikipedia.org/wiki/Metro_Manila').text
soup = BeautifulSoup(response_obj, 'lxml')
Neighborhoods_MM_Table = soup.find('table', {'class': 'wikitable sortable'})
rows = Neighborhoods_MM_Table.select("tbody > tr")[3:8]

cities = []
for row in rows:
    city = {}
    tds = row.select('td')
    city["City or Municipal"] = tds[0].text.strip()
    city["%_Population"] = tds[1].text.strip()
    city["Population"] = float(tds[2].text.strip().replace(",", ""))
    city["area_sqkm"] = float(tds[3].text.strip().replace(",", ""))
    city["area_sqm"] = float(tds[4].text.strip().replace(",", ""))
    city["density_sqm"] = float(tds[5].text.strip().replace(",", ""))
    city["density_sqkm"] = float(tds[6].text.strip().replace(",", ""))
    cities.append(city)

print(cities)
df = pd.DataFrame(cities)
df.head()
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    target = [item.get_text(strip=True) for item in soup.findAll(
        "td", style="text-align:right") if "%" in item.text] + [""]
    df = pd.read_html(r.content, header=0)[5]
    df = df.iloc[1:-1]
    df['Population (2015)[3]'] = target
    print(df)
    df.to_csv("data.csv", index=False)

main("https://en.wikipedia.org/wiki/Metro_Manila")
I was able to loop the web scraping process, but the data collected from each page replaces the data from the page before, so the resulting file only contains the data from the last page. What do I need to do?
from bs4 import BeautifulSoup
import requests
import pandas as pd

print('all imported successfully')

for x in range(1, 44):
    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    names = soup.find_all('div', attrs={'class': 'consumer-information__name'})
    headers = soup.find_all('h2', attrs={'class': 'review-content__title'})
    bodies = soup.find_all('p', attrs={'class': 'review-content__text'})
    ratings = soup.find_all('div', attrs={'class': 'star-rating star-rating--medium'})
    dates = soup.find_all('div', attrs={'class': 'review-content-header__dates'})
    print('pass1')
    df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings, 'Date': dates})
    df.to_csv('birchbox006.csv', index=False, encoding='utf-8')
    print('excel done')
Because you are using a loop, the variables are constantly overwritten. Normally what you'd do in a situation like this is keep a collection and append to it throughout the loop:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json

print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 44):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []

    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('article', {'class': 'review'})
    for article in articles:
        names.append(article.find('div', attrs={'class': 'consumer-information__name'}).text.strip())
        headers.append(article.find('h2', attrs={'class': 'review-content__title'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class': 'review-content__text'}).text.strip())
        except:
            bodies.append('')
        try:
            # The rating is usually exposed via the star image's alt text; adjust if the markup differs
            ratings.append(article.find('div', attrs={'class': 'star-rating star-rating--medium'}).find('img')['alt'])
        except:
            ratings.append('')
        dateElements = article.find('div', attrs={'class': 'review-content-header__dates'}).text.strip()
        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])

    # Create a temporary dataframe for this page, then append it to the "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings,
                            'Published Date': published, 'Updated Date': updated, 'Reported Date': reported})
    df = pd.concat([df, temp_df], sort=False).reset_index(drop=True)
    print('pass1')

df.to_csv('birchbox006.csv', index=False, encoding='utf-8')
print('excel done')
The reason is that you are overwriting your variables in each iteration.
If you want to extend these lists instead, you can do, for example:
from bs4 import BeautifulSoup
import requests

names = []
headers = []
bodies = []
ratings = []
dates = []

for x in range(1, 44):
    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    names += soup.find_all('div', attrs={'class': 'consumer-information__name'})
    headers += soup.find_all('h2', attrs={'class': 'review-content__title'})
    bodies += soup.find_all('p', attrs={'class': 'review-content__text'})
    ratings += soup.find_all('div', attrs={'class': 'star-rating star-rating--medium'})
    dates += soup.find_all('div', attrs={'class': 'review-content-header__dates'})
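These lists still hold bs4 Tag objects, so you would typically pull out the text before building the frame once after the loop. A minimal sketch of that last step (assuming the lists above, and that each selector matched once per review so the lengths agree):

import pandas as pd

# Convert the collected Tag objects to plain strings, then build the frame once
df = pd.DataFrame({
    'User Name': [n.text.strip() for n in names],
    'Header': [h.text.strip() for h in headers],
    'Body': [b.text.strip() for b in bodies],
    'Date': [d.text.strip() for d in dates],
})
df.to_csv('birchbox006.csv', index=False, encoding='utf-8')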
You'll have to store that data somewhere after each iteration. There are a few ways to do it: you can store everything in a list and then create your dataframe at the end, or, as done here, create a "temporary" dataframe after each iteration and append it to the final dataframe. Think of it like bailing water: you have a small bucket that you empty into a large bucket, which collects all the water you are trying to gather.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json

print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 44):
    published = []
    updated = []
    reported = []

    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    names = [x.text.strip() for x in soup.find_all('div', attrs={'class': 'consumer-information__name'})]
    headers = [x.text.strip() for x in soup.find_all('h2', attrs={'class': 'review-content__title'})]
    bodies = [x.text.strip() for x in soup.find_all('p', attrs={'class': 'review-content__text'})]
    ratings = [x.text.strip() for x in soup.find_all('div', attrs={'class': 'star-rating star-rating--medium'})]
    dateElements = soup.find_all('div', attrs={'class': 'review-content-header__dates'})
    for date in dateElements:
        jsonData = json.loads(date.text.strip())
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])

    # Create a temporary dataframe for this page, then append it to the "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings,
                            'Published Date': published, 'Updated Date': updated, 'Reported Date': reported})
    df = pd.concat([df, temp_df], sort=False).reset_index(drop=True)
    print('pass1')

df.to_csv('birchbox006.csv', index=False, encoding='utf-8')
print('excel done')
I want to extract data from this website:
https://forecast.weather.gov/MapClick.php?lat=35.0868&lon=-90.0568
The information I want (the current conditions details, e.g. the wind speed) sits in cells that share the same tag name under the same parent, and I couldn't find a way to tell them apart. I have successfully extracted some of the data, but I couldn't fetch this part. Here is my code:
import requests
from bs4 import BeautifulSoup

def weatherFetch(latitude, longitude):
    URL = 'https://forecast.weather.gov/MapClick.php?'
    URL = URL + 'lat=' + str(latitude) + '&lon=' + str(longitude)
    print(URL)
    dictionary = {
        'latitude': str(latitude), 'longitude': str(longitude),
        'cityName': '', 'weatherCondition': '', 'temprature': ''
    }
    res = requests.get(URL)
    if res.status_code == 200:  # we have used legit coordinates
        soup = BeautifulSoup(res.text, 'html.parser')
        arr = soup.findAll('div', {'class': 'panel panel-default'})
        if arr:
            try:
                cityName = arr[0].find("h2", "panel-title").text
                weatherCondition = arr[0].find("p", "myforecast-current").text
                temprature = arr[0].find("p", "myforecast-current-lrg").text
                windSpeed = arr[0].find_next("td", "text-right")  # this is where I am supposed to fetch the wind speed
                print(windSpeed)
                dictionary['cityName'] = cityName
                dictionary['weatherCondition'] = weatherCondition
                dictionary['temprature'] = temprature
            except:
                return dictionary
Find the element with id current_conditions_detail, then find all the tr tags inside that table. Each tr contains two td tags: the first is the label and the second is the value. A rough sketch of that approach is below.
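For example, a minimal sketch along those lines, hard-coding the coordinates from the question and assuming the page keeps the current_conditions_detail id:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://forecast.weather.gov/MapClick.php?lat=35.0868&lon=-90.0568')
soup = BeautifulSoup(res.text, 'html.parser')

details = {}
# The current conditions detail table is identified by id="current_conditions_detail"
table = soup.find(id='current_conditions_detail')
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds) == 2:  # first cell is the label, second is the value
        details[tds[0].text.strip()] = tds[1].text.strip()

print(details.get('Wind Speed'))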
You could just use pandas to get the table, then filter out the value you want with .loc.
I'm not sure what the rest of your code is trying to do; you're building a dictionary but only returning it when there's an exception?
import requests
from bs4 import BeautifulSoup
import pandas as pd

def weatherFetch(latitude, longitude):
    URL = 'https://forecast.weather.gov/MapClick.php?'
    URL = URL + 'lat=' + str(latitude) + '&lon=' + str(longitude)
    print(URL)
    dictionary = {
        'latitude': str(latitude), 'longitude': str(longitude),
        'cityName': '', 'weatherCondition': '', 'temprature': ''
    }
    res = requests.get(URL)
    if res.status_code == 200:  # we have used legit coordinates
        soup = BeautifulSoup(res.text, 'html.parser')
        arr = soup.findAll('div', {'class': 'panel panel-default'})
        if arr:
            try:
                cityName = arr[0].find("h2", "panel-title").text
                weatherCondition = arr[0].find("p", "myforecast-current").text
                temprature = arr[0].find("p", "myforecast-current-lrg").text

                df = pd.read_html(str(arr[0]))[0]
                windSpeed = df.loc[df[0] == 'Wind Speed', 1][1]
                print(windSpeed)

                dictionary['cityName'] = cityName
                dictionary['weatherCondition'] = weatherCondition
                dictionary['temprature'] = temprature
            except:
                return dictionary

latitude, longitude = 35.0868, -90.0568
weatherFetch(latitude, longitude)
Output:
https://forecast.weather.gov/MapClick.php?lat=35.0868&lon=-90.0568
SW 5 mph
I have been developing a Python web crawler for this website. I made two functions, which work well separately.
One collects the list of stock items, and the other collects the content data of each item.
I would like the output of my code to be pairs of
"list#1/content#1",
"list#2/content#2",
"list#3/content#3", and so on.
What needs to be modified in my code in order to achieve this?
Thanks.
from bs4 import BeautifulSoup
import urllib.request

CAR_PAGE_TEMPLATE = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page="
BASE_PAGE = 'http://www.bobaedream.co.kr'

def fetch_post_list():
    for i in range(20, 21):
        URL = CAR_PAGE_TEMPLATE + str(i)
        res = urllib.request.urlopen(URL)
        html = res.read()
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', class_='cyber')
        #print("Page#", i)

        # 50 lists per each page
        lists = table.find_all('tr', itemtype="http://schema.org/Article")
        count = 0
        for lst in lists:
            if lst.find_all('td')[3].find('em').text:
                lst_price = lst.find_all('td')[3].find('em').text
                lst_title = lst.find_all('td')[1].find('a').text
                lst_link = lst.find_all('td')[1].find('a')['href']
                lst_photo_url = ''
                if lst.find_all('td')[0].find('img'):
                    lst_photo_url = lst.find_all('td')[0].find('img')['src']
                count += 1
            else:
                continue
            #print('#', count, lst_title, lst_photo_url, lst_link, lst_price)
    return lst_link

def fetch_post_content(lst_link):
    URL = BASE_PAGE + lst_link
    res = urllib.request.urlopen(URL)
    html = res.read()
    soup = BeautifulSoup(html, 'html.parser')

    # Basic Information
    table = soup.find('div', class_='rightarea')

    # Number, Year, Mileage, Gas Type, Color, Accident
    content_table1 = table.find_all('div')[0]
    dds = content_table1.find_all('dd')
    for dd in dds:
        car_span_t = dd.find_all('span', {'class': 't'})[0]
        car_span_s = dd.find_all('span', {'class': 's'})[0]
        #print(car_span_t.text, ':', car_span_s.text)

    # Seller Information
    content_table2 = table.find_all('div')[1]
    dds2 = content_table2.find_all('dd')
    for dd2 in dds2:
        seller_span_t = dd.find_all('span', {'class': 't'})[0]
        seller_span_s = dd.find_all('span', {'class': 's'})[0]
        #print(seller_span_t.text, ':', seller_span_s.text)

    return dds
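One way to get the "list#N/content#N" pairs (a rough sketch only, reusing CAR_PAGE_TEMPLATE, BeautifulSoup, urllib.request and fetch_post_content from the code above) is to fetch the content for each listing as soon as it is parsed, instead of returning only the last lst_link:

def fetch_pairs():
    """Sketch: pair every listing with the content of its detail page."""
    pairs = []
    for i in range(20, 21):
        URL = CAR_PAGE_TEMPLATE + str(i)
        html = urllib.request.urlopen(URL).read()
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', class_='cyber')
        for lst in table.find_all('tr', itemtype="http://schema.org/Article"):
            tds = lst.find_all('td')
            if not tds[3].find('em').text:
                continue
            listing = {
                'title': tds[1].find('a').text,
                'price': tds[3].find('em').text,
                'link': tds[1].find('a')['href'],
            }
            # Follow the link right away so the listing and its content stay paired
            content = fetch_post_content(listing['link'])
            pairs.append((listing, content))
    return pairs

for listing, content in fetch_pairs():
    print(listing['title'], '->', len(content), 'detail fields')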