Cleaning up data written from BeautifulSoup to Text File - python

I am trying to write a program that will collect specific information from an ebay product page and write that information to a text file. To do this I'm using BeautifulSoup and Requests and I'm working with Python 2.7.9.
I've been mostly using this tutorial (Easy Web Scraping with Python) with a few modifications. So far everything works as intended until it writes to the text file. The information is written, just not in the format that I would like.
What I'm getting is this:
{'item_title': u'Old Navy Pink Coat M', 'item_no': u'301585876394', 'item_price': u'US $25.00', 'item_img': 'http://i.ebayimg.com/00/s/MTYwMFgxMjAw/z/Sv0AAOSwv0tVIoBd/$_35.JPG'}
What I was hoping for was something that would be a bit easier to work with.
For example :
New Shirt 5555555555 US $25.00 http://ImageURL.jpg
In other words I want just the scraped text and not the brackets, the 'item_whatever', or the u'.
After a bit of research I suspect my problem has to do with how the information is encoded as it's being written to the text file, but I'm not sure how to fix it.
So far I have tried,
def collect_data():
    with open('writetest001.txt','w') as x:
        for product_url in get_links():
            get_info(product_url)
            data = "'{0}','{1}','{2}','{3}'".format(item_data['item_title'],'item_price','item_no','item_img')
            x.write(str(data))
In the hopes that it would make the data easier to format in the way I want. It only resulted in "NameError: global name 'item_data' is not defined" displayed in IDLE.
I have also tried using .split() and .decode('utf-8') in various positions, but I either get AttributeErrors or the written output does not change.
Here is the code for the program itself.
import requests
import bs4

#Main URL for Harvesting
main_url = 'http://www.ebay.com/sch/Coats-Jackets-/63862/i.html?LH_BIN=1&LH_ItemCondition=1000&_ipg=24&rt=nc'

#Harvests Links from "Main" Page
def get_links():
    r = requests.get(main_url)
    data = r.text
    soup = bs4.BeautifulSoup(data)
    return [a.attrs.get('href') for a in soup.select('div.gvtitle a[href^=http://www.ebay.com/itm]')]

print "Harvesting Now... Please Wait...\n"
print "Harvested:", len(get_links()), "URLs"
#print (get_links())
print "Finished Harvesting... Scraping will Begin Shortly...\n"

#Scrapes Select Information from each page
def get_info(product_url):
    item_data = {}
    r = requests.get(product_url)
    data = r.text
    soup = bs4.BeautifulSoup(data)
    #Fixes the 'Details about ' problem in the Title
    for tag in soup.find_all('span', {'class': 'g-hdn'}):
        tag.decompose()
    item_data['item_title'] = soup.select('h1#itemTitle')[0].get_text()
    #Grabs the Price, if the item is on sale, grabs the sale price
    try:
        item_data['item_price'] = soup.select('span#prcIsum')[0].get_text()
    except IndexError:
        item_data['item_price'] = soup.select('span#mm-saleDscPrc')[0].get_text()
    item_data['item_no'] = soup.select('div#descItemNumber')[0].get_text()
    item_data['item_img'] = soup.find('img', {'id': 'icImg'})['src']
    return item_data

#Collects information from each page and write to a text file
write_it = open("writetest003.txt","w","utf-8")

def collect_data():
    for product_url in get_links():
        write_it.write(str(get_info(product_url)) + '\n')

collect_data()
write_it.close()

You were on the right track.
You need a local variable to assign the results of get_info to. The variable item_data you tried to reference only exists within the scope of the get_info function. You can use the same variable name though, and assign the results of the function to it.
There was also a syntax issue in the section you tried with respect to how you're formatting the items.
Replace the section you tried with this:
for product_url in get_links():
    item_data = get_info(product_url)
    data = "{0},{1},{2},{3}".format(*(item_data[item] for item in ('item_title','item_price','item_no','item_img')))
    x.write(data)
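If the records then run together in the file, or you hit a UnicodeEncodeError on a non-ASCII title, here is a minimal sketch of the full collect_data (assuming your existing get_links and get_info stay as they are) that writes one product per line and handles the encoding explicitly via io.open, which accepts an encoding argument in Python 2:
import io

def collect_data():
    # io.open lets us pass an encoding in Python 2, unlike the plain open() builtin
    with io.open('writetest001.txt', 'w', encoding='utf-8') as x:
        for product_url in get_links():
            item_data = get_info(product_url)
            data = u"{0},{1},{2},{3}".format(*(item_data[item] for item in
                    ('item_title', 'item_price', 'item_no', 'item_img')))
            x.write(data + u'\n')  # one product per line

collect_data()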

Related

Beautifulsoup unable to get data from mds-data-table from morningstar

I'm trying to get the dividend information from morningstar.
The following code works for scraping info from finviz, but the dividend information does not match what my broker platform shows.
import urllib3
from bs4 import BeautifulSoup

symbol = 'bxs'
morningstar_url = 'https://www.morningstar.com/stocks/xnys/' + symbol + '/dividends'
http = urllib3.PoolManager()
response = http.request('GET', morningstar_url)
soup = BeautifulSoup(response.data, 'lxml')
html = list(soup.children)[1]
[type(item) for item in list(soup.children)]

def display_elements(L, show=0):
    test = list(L.children)
    if(show):
        for i in range(len(test)):
            print(i)
            print(test[i])
            print()
    return(test)

test = display_elements(html, 1)
I have no issue printing out the elements but cannot find the element that houses the information such as "Total Yield %" of 2.8%. How do I get inside the mds-data-table to extract the information?
Great question! I've actually worked on this specifically, but years ago. Morningstar only loads the tables after running a script, precisely to prevent this type of scraping. If you view the page source right after it loads, you won't see that HTML at all.
What you're going to want to do is find the JavaScript code that loads those elements and point bs4 at the URL it requests. You'll have to poke around the files, but somewhere deep in those js files you'll find a dynamic URL. It'll be hidden, but it'll be in there somewhere. I'll go look at some of my old code and see if I can find something that helps.
So here's an edited sample of what used to work for me:
import logging
import time
from urllib.request import urlopen

exchange = 'NYSE'
ticker = 'V'

if exchange == 'NYSE':
    exchange_code = "XNYS"
elif exchange in ["NasdaqNM", "NASDAQ"]:
    exchange_code = "XNAS"
else:
    logging.info("Unknown Exchange Code for {}".format(ticker))
    exchange_code = None

time_now = int(time.time())
time_delay = int(time.time() + 150)
morningstar_raw = urlopen(f'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t={exchange_code}:{ticker}&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=asc&columnYear=5&rounding=3&view=raw&r=354589&callback=jsonp{time_now}&_={time_delay}')
print(morningstar_raw)
Granted, this solution is from a file last edited sometime in 2018, and they may have changed up their scripting, but you can find this and much more in my GitHub project wxStocks.
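If that request still goes through, note that the body comes back wrapped in the jsonp<timestamp>(...) callback named in the URL rather than as bare JSON. Here's a minimal sketch of unwrapping it, assuming the payload inside the wrapper is still JSON:
import json
import re

raw = morningstar_raw.read().decode('utf-8')
# strip the jsonp<timestamp>( ... ) wrapper so the body can be parsed as JSON
match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', raw, re.DOTALL)
if match:
    payload = json.loads(match.group(1))
    print(list(payload.keys()))  # inspect which fields the endpoint actually returns
else:
    print('Unexpected response format:', raw[:200])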

parsing text from an object tag

I'm creating a small Amazon Alexa skill called JokePro, and I have made a website that I can upload jokes to directly. The jokes go into a txt file and are then loaded onto the (crude) page from there.
I am looking to randomly select lines from the joke file, which is displayed directly on the page with an object tag.
How would I go about scraping the text given by the object tag?
http://jokepro.dx.am
import bs4
import requests

source = requests.get("http://jokepro.dx.am/")
bs4call = bs4.BeautifulSoup(source.text, "html.parser")
parsed = bs4call.find('pre')  # I've replaced 'pre' with 'object' as well
Any help would be appreciated.
If I understand you correctly, you want to load the text file described by the <object> tag and then select a random line from it:
import bs4
import requests
import random
url = "http://jokepro.dx.am/"
source = requests.get(url)
bs4call = bs4.BeautifulSoup(source.text, "html.parser")
obj = bs4call.find('object')
text = requests.get(url + obj['data']).text
# print(text) # <-- to print the textfile
print( random.choice(text.splitlines()) )
This prints (for example):
want to know a REALLY good joke? A high school student making this application in a week!
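One caveat on the url + obj['data'] concatenation: it only works while the data attribute is a relative path and url ends with a slash. urljoin handles both relative and absolute values; a tiny sketch:
from urllib.parse import urljoin

text_url = urljoin(url, obj['data'])  # resolves both relative and absolute 'data' values
text = requests.get(text_url).text
print(random.choice(text.splitlines()))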

Web scraping with python ('NoneType' object has no attribute 'get_text')

I would like to extract drug information from multiple pages, such as https://www.medindia.net/doctors/drug_information/abacavir.htm,
https://www.medindia.net/doctors/drug_information/talimogene_laherparepvec.htm,
and so on.
On each page, the information that I would like to extract is as follows: General, Brands, Prescription Contraindications, Side Effects, Dosage, How to Take, Warnings, and Storage.
Using Beautiful Soup, I am able to identify the class needed for extraction. However, when I try to extract the information and store it in a variable, it shows 'NoneType' object has no attribute 'get_text'. It seems that there is no element with the class 'drug-content'. However, when I print the items, the class does show up. Please help me. Below is my code:
import pandas as pd
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.medindia.net/doctors/drug_information/abacavir.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
drug = soup.find(class_='mi-container__fluid')
print(drug)
# whole page contain drug content
items = drug.find_all(class_='drug-content')
print(items)
# extract drug information from drug content into individual variable
general = items[0].find(class_='drug-content').get_text(strip=True).replace("\n", "")
brand = items[1].find(class_='report-content').get_text(strip=True).replace("\n", "")
prescription = items[1].find(class_='drug-content').get_text(strip=True).replace("\n", "")
contraindications = items[2].find(class_='drug-content').get_text(strip=True).replace("\n", "")
side_effect = items[2].find(class_='drug-content').get_text(strip=True).replace("\n", "")
dosage = items[3].find(class_='drug-content').get_text(strip=True).replace("\n", "")
how_to_use = items[4].find(class_='drug-content').get_text(strip=True).replace("\n", "")
warnings = items[5].find(class_='drug-content').get_text(strip=True).replace("\n", "")
storage = items[7].find(class_='drug-content').get_text(strip=True).replace("\n", "")
I have tried changing the class to 'report-content drug-widget'. However, with that class I am unable to extract the general information. Also, side effects are unavailable for this drug. How can I put an NA into the variable if the information is not available for the drug?
# whole page contain drug content
items = drug.find_all(class_='report-content drug-widget')
print(items)
# extract drug information from drug content into individual variable
general = items.find(class_='drug-content').get_text(strip=True).replace("\n", "")
brand = items[0].find(class_='drug-content').get_text(strip=True).replace("\n", "")
Please advise how to extract the information, and how I can put NA where the information I need is not available.
I can help you with the first one; it should get you started on how to deal with non-finds and how to search for the pattern you're looking for:
try:
    general = items[0].find('h3', attrs={'style': 'margin:0px!important'}).get_text(strip=True).replace("\n", "").replace("\xa0", " ")
except:
    general = "N/A"
You can slice the Generic Name: prefix off, since it's probably the same length on each page, with:
general = general[15:]
print(general):
#'Abacavir'
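For the remaining fields, you can wrap that same pattern in a small helper so every lookup falls back to 'N/A' instead of raising when a section is missing. A rough sketch (safe_text is just a name I made up, and the indexes still depend on how many report-content drug-widget blocks the page actually has):
def safe_text(tag, default='N/A'):
    # return cleaned text if the tag was found, otherwise the default
    if tag is None:
        return default
    return tag.get_text(strip=True).replace("\n", "").replace("\xa0", " ")

brand = safe_text(items[1].find(class_='report-content')) if len(items) > 1 else 'N/A'
prescription = safe_text(items[1].find(class_='drug-content')) if len(items) > 1 else 'N/A'
side_effect = safe_text(items[2].find(class_='drug-content')) if len(items) > 2 else 'N/A'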

Unable to get Facebook Group members after first page using Python

I am trying to get the names of the members of a group I belong to. I am able to get the names on the first page but am not sure how to go to the next page:
My Code:
url = 'https://graph.facebook.com/v2.5/1671554786408615/members?access_token=<MY_CUSTOM_ACCESS_CODE_HERE>'
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)
for each in data['data']:
    print each['name']
Using the code above I successfully get all the names on the first page, but the question is: how do I go to the next page?
In the Graph API Explorer output screen I see this:
What change does my code need to keep going to the next pages and get the names of ALL members of the group?
The JSON returned by the Graph API is telling you where to get the next page of data, in data['paging']['next']. You could give something like this a try:
def printNames():
    json_obj = urllib2.urlopen(url)
    data = json.load(json_obj)
    for each in data['data']:
        print each['name']
    return data['paging']['next']  # Return the URL to the next page of data

url = 'https://graph.facebook.com/v2.5/1671554786408615/members?access_token=<MY_CUSTOM_ACCESS_CODE_HERE>'
url = printNames()
print "====END OF PAGE 1===="
url = printNames()
print "====END OF PAGE 2===="
You would need to add checks; for instance, ['paging']['next'] will only be present in your JSON object if there is a next page, so you might want to modify your function to return a more complex structure to convey this information. But this should give you the idea.
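If you'd rather not call printNames once per page by hand, here's a compact sketch of a loop that simply follows 'next' until the Graph API stops returning one (same urllib2 and json imports as your script):
def print_all_names(url):
    while url:
        data = json.load(urllib2.urlopen(url))
        for each in data['data']:
            print each['name']
        # 'next' is only present while there are more pages
        url = data.get('paging', {}).get('next')

print_all_names('https://graph.facebook.com/v2.5/1671554786408615/members?access_token=<MY_CUSTOM_ACCESS_CODE_HERE>')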

Extracting parts of a webpage with python

So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv

input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(get_legal(pg))
This gives me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a> '):
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a> ', end_link+1)
        legal += page[start_legal+2: end_link]
        if
            break
    return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using requests library to fetch page content here - this is just a recommended and very easy to use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})

def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')

for tag in tags_you_want:
    print 'statute: ' + tag.getText()
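If you want those statutes in your output text file instead of on stdout, the same pieces drop into your original read/write loop. A rough sketch, reusing fetch_parent_tag from above and assuming the urlfile and legalfile names from your script:
with open(urlfile, "r") as infile, open(legalfile, "w") as outfile:
    for line in infile:
        response = requests.get(line.strip())
        html = BeautifulSoup(response.content)
        parent_tag = fetch_parent_tag(html.findAll('a', attrs={'class': 'pageSubNavTxt'}))
        statutes = [tag.getText() for tag in parent_tag.findAll('a')] if parent_tag else []
        outfile.write(', '.join(statutes) + '\n')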
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have done some million HTTP requests to a webservice (and they were ok) each fetching about 1k bytes. This would have taken long, and would have been quite inconvenient (requiring some error-handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could be similarly accommodating.
