How to repeatedly scrape dynamic web content and save it in new files - Python

I've written a little function in Python to scrape the content of a certain Wikipedia page and save it in an external file. My code below does that. However, I cannot figure out how to make my code save the repeatedly scraped content in a new file each time, e.g. Wiki1.txt, Wiki2.txt, and so on. Currently, my timer runs, but each time the page's content is scraped anew, it overwrites the previously saved content.
import time

import requests
from bs4 import BeautifulSoup as bs


def Wiki():
    url = requests.get("https://en.wikipedia.org/wiki/Digital_identity")
    html = url.content
    html2 = bs(html, 'html5lib')
    Wiki = html2.find_all('p')
    Data = [i.text for i in Wiki]
    with open("Wiki.txt", "w", encoding="utf-8") as f:
        f.write(''.join(Data))
    return


if __name__ == '__main__':
    while True:  # run the forever loop
        Wiki()
        wait_time = 10
        print("Waiting {} minutes".format(wait_time))
        time.sleep(wait_time * 60)  # sleep for wait_time minutes
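No answer is recorded here, but one way to get the Wiki1.txt, Wiki2.txt, ... behaviour described above is to build the filename from a counter that increments on every pass. A minimal sketch of that idea follows; the wiki(run_number) signature and variable names are illustrative, not part of the original code:

import time

import requests
from bs4 import BeautifulSoup as bs


def wiki(run_number):
    # Scrape the page and write it to a numbered file, e.g. Wiki1.txt, Wiki2.txt.
    response = requests.get("https://en.wikipedia.org/wiki/Digital_identity")
    soup = bs(response.content, 'html5lib')
    data = [p.text for p in soup.find_all('p')]
    filename = "Wiki{}.txt".format(run_number)  # a new name on every call
    with open(filename, "w", encoding="utf-8") as f:
        f.write(''.join(data))


if __name__ == '__main__':
    counter = 1
    while True:
        wiki(counter)
        counter += 1
        wait_time = 10
        print("Waiting {} minutes".format(wait_time))
        time.sleep(wait_time * 60)

A timestamp from time.strftime("%Y%m%d-%H%M%S") would work just as well as the counter if you prefer not to track one.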

Related

How to exclude duplicate crawled links?

import asyncio

import aiohttp
from bs4 import BeautifulSoup as bs


async def test():
    old_links = open("old_links.txt", "r", encoding="utf-8")
    while True:
        base_url = 'https://www.cnbc.com/world/'
        async with aiohttp.ClientSession() as sess:
            async with sess.get(base_url, headers={'user-agent': 'Mozilla/5.0'}) as res:
                text = await res.text()
                soup = bs(text, 'lxml')
                title = soup.find("div", attrs={"class": "LatestNews-container"}).a.text.strip()
                link = soup.find("div", attrs={"class": "LatestNews-container"}).a["href"]
                if link not in old_links:
                    print(f"title : {title}")
                    print(f"link : {link}")
                    f = open("old_links.txt", "w", encoding="utf-8")
                    f.write(f"{link}\n")
                    f.close()
                else:
                    print("No update")
        await asyncio.sleep(3)
The site above is just an example. I'm making a Discord bot in Python, and I'm writing a crawling bot for some sites that don't offer a mailing service. Python isn't my main field; I only managed to put this code together through searching and some simple study. At first I kept a list with old_links = [] and used that, but when I reboot the Discord bot, posts that were already crawled in the past get sent to Discord again as messages. To solve this, I want to save the link of each post the bot has already sent to a .txt file on my computer, and compare newly crawled links against that file each time the bot runs. Saving the crawled links to the .txt file works, but I haven't managed to implement the part that compares against the links stored in the file. How should I write that code?
You appear to be opening old_links.txt as a file object without ever reading its contents. Assuming your existing links are separated by newlines, you can use
old_links = []
with open("old_links.txt", "r", encoding="utf-8") as file:
    old_links = file.read().splitlines()
to create a list of the old links read from the file. The with block makes sure the file is closed properly as soon as the block finishes and limits the file object's scope.
When you want to add a new link, simply append it to the list like so:
old_links.append(new_link)
Make sure you write the new lines back to the file after every update so that your changes save correctly:
with open("old_links.txt", "w", encoding="utf-8") as file:
file.write("\n".join(old_links))
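Putting the answer together with the polling loop, a rough sketch of the read-check-append-save cycle could look like the following; the load_old_links / save_old_links helpers are illustrative names, not from the original code:

def load_old_links(path="old_links.txt"):
    # Return the previously seen links as a list (empty if the file is missing).
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read().splitlines()
    except FileNotFoundError:
        return []


def save_old_links(links, path="old_links.txt"):
    # Rewrite the whole file so it always mirrors the in-memory list.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(links))


old_links = load_old_links()
# ... inside the scraping loop, after `link` has been extracted:
# if link not in old_links:
#     old_links.append(link)
#     save_old_links(old_links)
#     # send the Discord message here
# else:
#     print("No update")

Rewriting the whole file on every update keeps the file and the in-memory list in sync, which is what the answer's last snippet relies on.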

Why does my Python program close after running the first loop?

I'm new to Python and scraping. I'm trying to run two loops. One goes and scrapes ids from one page. Then, using those ids, I call another API to get more info/properties.
But when I run this program, it just runs the first bit fine (gets the IDs), but then it closes and doesn't run the 2nd part. I feel I'm missing something really basic about control flow in Python here. Why does Python close after the first loop when I run it in Terminal?
import requests
import csv
import time
import json
from bs4 import BeautifulSoup, Tag
file = open('parcelids.csv', 'w')
writer = csv.writer(file)
writer.writerow(['parcelId'])

for x in range(1, 10):
    time.sleep(1)  # slowing it down
    url = 'http://apixyz/Parcel.aspx?Pid=' + str(x)
    source = requests.get(url)
    response = source.content
    soup = BeautifulSoup(response, 'html.parser')
    parcelId = soup.find("span", id="MainContent_lblMblu").text.strip()
    writer.writerow([parcelId])

out = open('mapdata.csv', 'w')

with open('parcelIds.csv', 'r') as in1:
    reader = csv.reader(in1)
    writer = csv.writer(out)
    next(reader, None)  # skip header
    for row in reader:
        row = ''.join(row[0].split())[:-2].upper().replace('/', '-')  # formatting
        url = "https://api.io/api/properties/"
        url1 = url + row
        time.sleep(1)  # slowing it down
        response = requests.get(url1)
        resp_json_payload = response.json()
        address = resp_json_payload['property']['address']
        writer.writerow([address])
If you are running on Windows (where filenames are not case sensitive), the file you opened for writing (parcelids.csv) is still open when you reopen it to read from it.
Try closing the file before opening it again for reading.
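A sketch of that fix, keeping the asker's placeholder URLs (apixyz, api.io) and using one consistent spelling of the filename: wrap each stage in a with block, so parcelids.csv is guaranteed to be closed before it is reopened for reading.

import csv
import time

import requests
from bs4 import BeautifulSoup

# Stage 1: write the parcel IDs; the with block closes the file when it ends.
with open('parcelids.csv', 'w') as id_file:
    writer = csv.writer(id_file)
    writer.writerow(['parcelId'])
    for x in range(1, 10):
        time.sleep(1)  # slowing it down
        url = 'http://apixyz/Parcel.aspx?Pid=' + str(x)
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        parcel_id = soup.find("span", id="MainContent_lblMblu").text.strip()
        writer.writerow([parcel_id])

# Stage 2: parcelids.csv is closed by now, so reopening it for reading is safe.
with open('parcelids.csv', 'r') as in1, open('mapdata.csv', 'w') as out:
    reader = csv.reader(in1)
    writer = csv.writer(out)
    next(reader, None)  # skip header
    for row in reader:
        parcel = ''.join(row[0].split())[:-2].upper().replace('/', '-')  # formatting
        time.sleep(1)  # slowing it down
        response = requests.get("https://api.io/api/properties/" + parcel)
        address = response.json()['property']['address']
        writer.writerow([address])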

How to properly extract URLs from HTML code?

I have saved a website's HTML code in a .txt file on my computer. I would like to extract all URLs from this text file using the following code:
def get_net_target(page):
    start_link = page.find("href=")
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url


my_file = open("test12.txt")
page = my_file.read()
print(get_net_target(page))
However, the script only prints the first URL and none of the other links. Why is this?
You need to implement a loop to go through all URLs.
print(get_net_target(page)) only prints the first URL found in page, so you will need to call this function again and again, each time replacing page by the substring page[end_quote+1:] until no more URL is found.
To get you started: next_index will store the position where the last URL ended, and the loop will then retrieve the following URLs:
next_index = 0  # position in page from which the next URL search starts


def get_net_target(page):
    global next_index
    start_link = page.find("href=")
    if start_link == -1:  # no more URLs
        return ""
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    next_index = end_quote
    url = page[start_quote + 1:end_quote]
    return url


my_file = open("test12.txt")
page = my_file.read()

while True:
    url = get_net_target(page)
    if url == "":  # no more URLs
        break
    print(url)
    page = page[next_index:]  # continue with the rest of the page
Also be careful: this only catches links whose href value is enclosed in double quotes ("), but attribute values can also be enclosed in single quotes (') or in nothing at all...
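If those looser quoting styles matter, a more robust alternative (just a sketch, not part of the original answer) is to let an HTML parser collect the href attributes instead of scanning for quote characters by hand:

from bs4 import BeautifulSoup

with open("test12.txt", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Every <a> tag that actually carries an href attribute, regardless of quoting style.
for tag in soup.find_all("a", href=True):
    print(tag["href"])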

Python URLs in file Requests

I have a problem with my Python script, in which I want to scrape the same content from every website. I have a file with a lot of URLs and I want Python to go over them, feeding each one into requests.get(url). After that I write the output to a file named 'somefile.txt'.
I have the following Python script (version 2.7 - Windows 8):
from lxml import html
import requests
urls = ('URL1',
        'URL2',
        'URL3'
        )

for url in urls:
    page = requests.get(url)

tree = html.fromstring(page.text)
visitors = tree.xpath('//b["no-visitors"]/text()')
print 'Visitors: ', visitors

f = open('somefile.txt', 'a')
print >> f, 'Visitors:', visitors  # or f.write('...\n')
f.close()
As you can see, I have not included the file with the URLs in the script. I tried many tutorials but failed. The file would be named 'urllist.txt'. With the current script I only get the data from URL3; ideally I want to get the data for every URL in urllist.txt.
My attempt at reading the URLs from the text file:
with open('urllist.txt', 'r') as f:  # text file containing the URLs
    for url in f:
        page = requests.get(url)
You'll need to remove the newline from your lines:
with open('urllist.txt', 'r') as f:  # text file containing the URLs
    for url in f:
        page = requests.get(url.strip())
The str.strip() call removes all whitespace (including tabs and newlines and carriage returns) from the line.
Do make sure you then process page in the loop; if you run your code to extract the data outside the loop all you'll get is the data from the last response you loaded. You may as well open the output file just once, in the with statement so Python closes it again:
with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
    for url in urls:
        page = requests.get(url.strip())
        tree = html.fromstring(page.content)
        visitors = tree.xpath('//b["no-visitors"]/text()')
        print 'Visitors: ', visitors
        print >> output, 'Visitors:', visitors
You should either save each page in a separate variable, or perform all of the processing inside the loop over the URL list.
As your code stands, by the time the page parsing happens, page only contains the data from the last GET, because you overwrite the page variable on each iteration.
Something like the following should append all the pages' info.
for url in urls:
    page = requests.get(url)
    tree = html.fromstring(page.text)
    visitors = tree.xpath('//b["no-visitors"]/text()')
    print 'Visitors: ', visitors
    f = open('somefile.txt', 'a')
    print >> f, 'Visitors:', visitors  # or f.write('...\n')
    f.close()

Python Blog RSS Feed Scraping BeautifulSoup Output to .txt Files

Apologies in advance for the long block of code following. I'm new to BeautifulSoup, but found there were some useful tutorials using it to scrape RSS feeds for blogs. Full disclosure: this is code adapted from this video tutorial which has been immensely helpful in getting this off the ground: http://www.youtube.com/watch?v=Ap_DlSrT-iE.
Here's my problem: the video does a great job of showing how to print the relevant content to the console. I need to write out each article's text to a separate .txt file and save it to some directory (right now I'm just trying to save to my Desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I've tried to comment this for people to see quickly--it's the last comment beginning # Here's where I'm lost...), but I can't seem to figure it out on my own.
Currently, the program takes the text from the last article it read in and writes that out to as many .txt files as are indicated by the variable listIterator. So, in this case I believe there are 20 .txt files that get written out, but they all contain the text of the last article that's looped over. What I want the program to do is loop over each article and write the text of each one to a separate .txt file. Sorry for the verbosity, but any insight would be really appreciated.
from urllib import urlopen
from bs4 import BeautifulSoup
import re

# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

# On RSS Feed site, find tags for title of articles and
# tags for article links to be downloaded.
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))

for i in listIterator:
    # Print each title to console to ensure program is working.
    print findPatTitle[i]

    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()

    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    # Define article variable that will contain all the content between the
    # beginning of the article and the end, as indicated by the variables above.
    article = articlePage[divBegin:divEnd]

    # Parse the page using BeautifulSoup.
    soup = BeautifulSoup(article)

    # Compile list of all <p> tags for each article and store in paragList.
    paragList = soup.findAll('p')

    # Create empty string to eventually convert items in paragList to string to
    # be written to .txt files.
    para_string = ''

    # Here's where I'm lost and have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)
    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i]) + '.txt', 'w')
        ofile.write(para_string)
        ofile.close()
The reason it seems that only the last article is written out is that every article is written to the same 20 files, over and over again. Let's have a look at the following:
for i in paragList:
    para_string = para_string + str(i)
newlist = range(len(findPatTitle))
for i in newlist:
    ofile = open(str(listIterator[i]) + '.txt', 'w')
    ofile.write(para_string)
    ofile.close()
You are writing para_string to the same 20 files over and over again on each iteration. What you need to do instead is append each article's para_string to a separate list, say paraStringList, and then write its contents out to separate files, like so:
for i, var in enumerate(paraStringList):  # enumerate yields (index, item) tuples
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
Note that this needs to be outside of your main loop, i.e. for i in listIterator: (...). This is a working version of the program:
from urllib import urlopen
from bs4 import BeautifulSoup
import re

webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()

patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')

findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]

listIterator = []
listIterator[:] = range(len(findPatTitle))

paraStringList = []

for i in listIterator:
    print findPatTitle[i]

    articlePage = urlopen(findPatLink[i]).read()

    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")

    article = articlePage[divBegin:divEnd]

    soup = BeautifulSoup(article)
    paragList = soup.findAll('p')

    para_string = ''
    for i in paragList:
        para_string += str(i)

    paraStringList.append(para_string)

for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
