I have been scratching my head over how to tackle this dilemma of mine for a while now. I have an Address column in my csv file, which contains a list of addresses. I want to direct Python to search the website below with each individual address value from the csv file and save the results into a new csv file.
import csv
import requests

with open('C:/Users/thefirstcolumn.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['Address'])
        website = requests.get('https://etrakit.friscotexas.gov/Search/permit.aspx')
        writer = csv.writer(open('thematchingresults.csv', 'w'))
        print(website.content)
For example:
One of the address values I have in the csv file:
6525 Mountain Sky Rd
returns three rows of data when I manually paste the address into the search box. How can I tell Python to search the website for each of the addresses in the csv file and save the results for each address to a new csv file? How can I accomplish this mountainous task?
The requests module only downloads the static HTML of a page; it cannot interact with JavaScript.
You need to use Selenium to interact with the website.
For example
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get('https://etrakit.friscotexas.gov/Search/permit.aspx')

# read in addresses
with open('file.csv', 'r') as f:
    addresses = f.readlines()

# use css selectors to locate the search field
for address in addresses:
    driver.find_element_by_css_selector('#cplMain_txtSearchString').clear()
    driver.find_element_by_css_selector('#cplMain_txtSearchString').send_keys(address.strip())
    driver.find_element_by_css_selector('#cplMain_btnSearch').click()
    time.sleep(5)
    # JS-injected HTML
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # extract relevant info from the soup
    # and save to your new csv here
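To flesh out those last two comments, here is a minimal sketch of the extraction step that could sit inside the loop above. The table id cplMain_rgSearchRslts is an assumption, not something confirmed on the site; check the real id of the results grid in the browser's inspector:

import csv  # place this import at the top of the script

# Hypothetical id for the results grid; confirm it in the page source.
results_table = soup.find('table', id='cplMain_rgSearchRslts')
with open('thematchingresults.csv', 'a', newline='') as out:
    writer = csv.writer(out)
    if results_table:
        for tr in results_table.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            if cells:
                # prepend the address so each result row stays traceable
                writer.writerow([address.strip()] + cells)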
You would need to do a POST request for each value you have in the csv file. For example, to search for "6525 Mountain Sky Rd" at https://etrakit.friscotexas.gov/Search/permit.aspx, you can look at the browser's developer console to see which POST parameters the search form submits.
You can use something like requests and pass the header values and form data, or you could use something like CasperJS or Selenium to emulate the browser.
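As a rough sketch of the requests approach: the form-field names below (the search-box name and the hidden ASP.NET fields) are placeholders that must be copied from the developer console, not confirmed values:

import csv
import requests

SEARCH_URL = 'https://etrakit.friscotexas.gov/Search/permit.aspx'

with open('C:/Users/thefirstcolumn.csv') as csvfile:
    for row in csv.DictReader(csvfile):
        # Placeholder field names; copy the real ones (including hidden
        # fields such as __VIEWSTATE) from the developer console's network tab.
        form_data = {
            'ctl00$cplMain$txtSearchString': row['Address'],
            '__VIEWSTATE': '...',
            '__EVENTVALIDATION': '...',
        }
        response = requests.post(SEARCH_URL, data=form_data)
        # parse response.text here and write the matching rows to a new csv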
Related
I am very new to Python and BeautifulSoup. I wrote the code below to call up the website https://www.baseball-reference.com/leagues/MLB-standings.shtml, with the goal of scraping the table at the bottom named "MLB Detailed Standings" and exporting it to a CSV file. My code successfully creates a CSV file, but it pulls the wrong table and is missing the first column with the team names: it grabs the "East Division" table at the top (excluding the first column) rather than my target, the full "MLB Detailed Standings" table at the bottom.
Wondering if there is a simple way to pull the MLB Detailed Standings table at the bottom. When I inspect the page, the ID for the specific table I am trying to pull is "expanded_standings_overall". Do I need to reference this in my code? Any other guidance on reworking the code to pull the correct table would be greatly appreciated. Again, I am very new and trying my best to learn.
import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static urls
season = datetime.datetime.now().year
URL = "https://www.baseball-reference.com/leagues/MLB-standings.shtml".format(season=season)

# request the data
batting_html = requests.get(URL).text

def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", class_=["sortable,", "stats_table", "now_sortable"])

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batting_html, 'BBRefTest.csv')
First of all, yes, it would be better to reference the ID: you can expect the developer to have made this ID unique to this table, whereas classes are just style descriptors.
Now, the problem runs deeper. A quick look at the page source shows that the HTML that defines the table is commented out a few tags above. I suspect a script 'enables' this code on the client side (in your browser). requests.get, which just pulls the HTML without processing any JavaScript, doesn't catch it (you can check the content of batting_html to verify this).
A very quick and dirty fix would be to catch the commented out code and reprocess it in BeautifulSoup:
from bs4 import Comment
...
# parse input
soup = BeautifulSoup(input_html, "lxml")
dynamic_content = soup.find("div", id="all_expanded_standings_overall")
comments = dynamic_content.find(string=lambda text: isinstance(text, Comment))
table = BeautifulSoup(comments, "lxml")
# get headers
By the way, you want to specify utf8 encoding when writing your file ...
with open(out_file_name, "w", encoding="utf8") as out_file:
writer = csv.writer(out_file)
...
Now, that's really 'quick and dirty', and I would dig deeper into the HTML and JavaScript to see what is really happening before scaling this out to other pages.
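Putting the pieces together, the parsing step of the original function might look roughly like this (a sketch based on the snippet above; the header/row extraction and CSV writing stay as in the original code):

from bs4 import BeautifulSoup, Comment

def parse_array_from_fangraphs_html(input_html, out_file_name):
    # parse input
    soup = BeautifulSoup(input_html, "lxml")

    # the table markup is commented out in the static HTML, so pull the
    # comment out of its wrapping div and re-parse it
    dynamic_content = soup.find("div", id="all_expanded_standings_overall")
    comment = dynamic_content.find(string=lambda text: isinstance(text, Comment))
    table = BeautifulSoup(comment, "lxml").find("table", id="expanded_standings_overall")

    # ... extract headers and rows from `table` as before, and open the
    # output file with encoding="utf8" when writing the CSV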
I know this is a repeated question; however, from all the answers on the web I could not find a solution, as all of them throw errors.
I am simply trying to scrape headlines from a website and save them to a txt file.
The scraping code works well; however, it saves only the last string, skipping all the headlines before the last one.
I have tried looping, putting the writing code before the scraping, appending to a list, etc., and different methods of scraping, but all have the same issue.
Please help.
Here is my code:
def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        file_object.write(str(headlines.text.strip()))
Every time your for loop runs, it overwrites the headlines variable, so when you get to writing to the file, the headlines variable only stores the last headline. An easy solution to this is to bring the for loop inside your with statement, like so:
with open(filename, 'w') as file_object:
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
        file_object.write(headlines.text.strip() + "\n")  # write a newline after each headline
Here is the full working code, corrected as per the advice:
from bs4 import BeautifulSoup
import requests

def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        for headlines in page.find_all("h2"):
            print(headlines.text.strip())
            file_object.write(headlines.text.strip()+"\n")
This code throws an error in a Jupyter notebook when the file is opened from within it; however, when the file is opened outside Jupyter, the headlines are saved...
I am trying to loop through a list of URLs and scrape some data from each link. Here is my code.
from bs4 import BeautifulSoup as bs
import webbrowser
import requests

url_list = ['https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy',
            'https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy_history']

for link in url_list:
    File = webbrowser.open(link)
    File = requests.get(link)
    data = File.text
    soup = bs(data, "lxml")
    tspans = soup.find_all("tspan")
    tspans
I think this is pretty close, but I'm getting nothing for the 'tspans' variable. I get no error; 'tspans' just shows [].
This is an internal corporate intranet, so I can't share the exact details, but I think it's just a matter of grabbing all the HTML elements named 'tspans' and writing all of them to a text file or a CSV file. That's my ultimate goal. I want to collate everything into a large list and write it all to a file.
As an aside, I was going to use Selenium to log into this site, which requires creds, but it seems like the code I'm testing now lets you open new tabs in a browser, and everything loads up fine if you are already logged in. Is this the best practice, or should I use the full login creds + Selenium? I'm just trying to keep things simple.
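For reference, a minimal sketch of collecting the tspan text into a CSV, assuming the elements are actually present in the HTML that requests fetches (if they are injected by JavaScript, requests alone will return none, which would explain the empty list, and Selenium as mentioned above would be needed):

import csv
import requests
from bs4 import BeautifulSoup as bs

url_list = ['https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy',
            'https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy_history']

with open('tspans_output.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    for link in url_list:
        soup = bs(requests.get(link).text, "lxml")
        # one row per tspan element, alongside the page it came from
        for tspan in soup.find_all("tspan"):
            writer.writerow([link, tspan.get_text(strip=True)])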
I am working on a web scraping project which involves scraping URLs from a website based on a search term, storing them in a CSV file (under a single column), and finally scraping the information from these links and storing it in a text file.
I am currently stuck on 2 issues.
1. Only the first few links are scraped. I'm unable to extract links from other pages (the website contains a "load more" button). I don't know how to use the XHR object in the code.
2. The second half of the code reads only the last link (stored in the csv file), scrapes the respective information and stores it in a text file. It does not go through all the links from the beginning. I am unable to figure out where I have gone wrong in terms of file handling and f.seek(0).
from pprint import pprint
import requests
import lxml
import csv
import urllib2
from bs4 import BeautifulSoup

def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content, "lxml")
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
    results = soup.findAll('a', {'rel': 'bookmark'})
    for r in results:
        if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
            newlinks.append(r["href"])

pprint(get_url_for_search_key('digital advertising'))

with open('ctp_output.csv', 'w+') as f:
    f.write('\n'.join(get_url_for_search_key('digital advertising')))
    f.seek(0)
Reading CSV file, scraping respective content and storing in .txt file
with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))
        with open('ctp_output.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
Regarding your second problem, your mode is off. You'll need to convert w+ to a+. In addition, your indentation is off.
with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))
        with open('ctp_output.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
The + suffix will create the file if it doesn't exist. However, w+ will erase all contents before writing at each iteration. a+ on the other hand will append to a file if it exists, or create it if it does not.
For your first problem, there's no option but to switch to something that can automate clicking browser buttons and whatnot. You'd have to look at selenium. The alternative is to manually search for that button, extract the url from the href or text, and then make a second request. I leave that to you.
If there are more pages with results observe what changes in the URL when you manually click to go to the next page of results.
I can guarantee 100% that a small piece of the URL will have either a subpage number or some other variable encoded in it that strictly relates to the subpage.
Once you have figured out the pattern, you just fit it into a for loop where you .format() the URL you want to scrape, and keep navigating this way through all the subpages of the results.
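As a sketch under those assumptions (the pagination pattern below is hypothetical; use whatever parameter you actually see changing in the URL, and the real number of subpages):

import requests
from bs4 import BeautifulSoup

# Hypothetical pagination pattern for the site in the question; replace it
# with the pattern you observe when clicking through the result pages.
base_url = 'http://www.marketing-interactive.com/page/{page}/?s={query}'

all_links = []
for page in range(1, 11):  # assumed 10 subpages; read the real count from the site
    response = requests.get(base_url.format(page=page, query='digital+advertising'))
    soup = BeautifulSoup(response.content, 'lxml')
    all_links.extend(a['href'] for a in soup.findAll('a', {'rel': 'bookmark'}))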
As for the last subpage number: you have to inspect the HTML code of the site you are scraping, find the variable responsible for it, and extract its value. See if there is "class": "Page" or an equivalent in their code; it may contain the number you will need for your for loop.
Unfortunately there is no magic "navigate through subresults" option...
But this gets pretty close :).
Good luck.
I am parsing some content from the web and then saving it to a file. So far I manually create the filename.
Here's my code:
import requests

url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
html = requests.get(url).text.encode('utf-8')

with open("html_output_test.html", "wb") as file:
    file.write(html)
How could I automate the process of creating and saving the following html filename from the url:
The-Google-Way-Revolutionizing-Management (instead of html_output_test)?
This name comes from the original bookstore URL that I posted, which was probably modified to avoid product advertising.
Thanks!
You can use BeautifulSoup to get the title text from the page; I would let requests handle the encoding by using .content:
url = "http://rads.stackoverflow.com/amzn/click/1593271840"
html = requests.get(url).content
from bs4 import BeautifulSoup
print(BeautifulSoup(html).title.text)
with open("{}.html".format(BeautifulSoup(html).title.text), "wb") as file:
file.write(html)
The Google Way: How One Company is Revolutionizing Management As We Know It: Bernard Girard: 9781593271848: Amazon.com: Books
For that particular page, if you just want "The Google Way: How One Company is Revolutionizing Management As We Know It", the product title is in the class a-size-large:
text = BeautifulSoup(html).find("span", attrs={"class": "a-size-large"}).text

with open("{}.html".format(text), "wb") as file:
    file.write(html)
The link with The-Google-Way-Revolutionizing-Management is in the link tag:
link = BeautifulSoup(html).find("link",attrs={"rel":"canonical"})
print(link["href"])
http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840
So to get that part you need to parse it:
print(link["href"].split("/")[3])
The-Google-Way-Revolutionizing-Management
link = BeautifulSoup(html).find("link", attrs={"rel": "canonical"})

with open("{}.html".format(link["href"].split("/")[3]), "wb") as file:
    file.write(html)
You could parse the web page using BeautifulSoup, get the title of the page, then slugify it and use that as the file name, or generate a random file name with something like os.tmpfile.
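A rough sketch of that slugify approach (the regex-based slugify helper here is a hand-rolled assumption, not a library call):

import re
import requests
from bs4 import BeautifulSoup

def slugify(text):
    # keep letters, digits and hyphens; collapse everything else into single hyphens
    return re.sub(r'[^A-Za-z0-9]+', '-', text).strip('-')

url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
html = requests.get(url).content
title = BeautifulSoup(html, "html.parser").title.text

with open("{}.html".format(slugify(title)), "wb") as file:
    file.write(html)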