import asyncio
import aiohttp
from bs4 import BeautifulSoup as bs

async def test():
    old_links = open("old_links.txt", "r", encoding="utf-8")
    while True:
        base_url = 'https://www.cnbc.com/world/'
        async with aiohttp.ClientSession() as sess:
            async with sess.get(base_url, headers={'user-agent': 'Mozilla/5.0'}) as res:
                text = await res.text()
        soup = bs(text, 'lxml')
        title = soup.find("div", attrs={"class": "LatestNews-container"}).a.text.strip()
        link = soup.find("div", attrs={"class": "LatestNews-container"}).a["href"]
        if link not in old_links:
            print(f"title : {title}")
            print(f"link : {link}")
            f = open("old_links.txt", "w", encoding="utf-8")
            f.write(f"{link}\n")
            f.close()
        else:
            print("No update")
        await asyncio.sleep(3)
The site above is just an example. I'm making a Discord bot in Python, and part of it is a crawling bot for some sites that don't offer a mailing/notification service. I'm not a Python major; I managed to put this code together through searching and simple study. Originally I made a list with old_links = [] and used that, but when I reboot the Discord bot, posts that were already crawled in the past get sent to Discord again as messages. To solve this, I want to save the link of each post the bot has already announced to a .txt file on my computer, and every time the bot runs on its schedule, compare the newly crawled link against the links stored in that file before sending. Saving the crawled links to the .txt file works, but I couldn't get the part that compares against the links stored in the .txt file to work. How should I write that code?
You appear to be opening old_links as a text file without reading its contents. Assuming your existing links are separated by newlines, you can use
old_links = []
with open("old_links.txt", "r", encoding="utf-8") as file:
    old_links = file.read().splitlines()
to create a list of the old links read from the file. The with block makes sure the file is closed properly as soon as the block finishes, and it limits the file handle's scope.
When you want to add a new link, simply append the new link to the array like so:
old_links.append(new_link)
Make sure you write the new lines back to the file after every update so that your changes save correctly:
with open("old_links.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(old_links))
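Putting the pieces together, here is a minimal sketch of how the comparison could look inside your loop. The CNBC selectors and the 3-second interval are carried over from your snippet; only the file handling and the membership check change, and the sketch assumes old_links.txt already exists with one link per line:
import asyncio
import aiohttp
from bs4 import BeautifulSoup as bs

async def test():
    # Load previously sent links once, at startup.
    with open("old_links.txt", "r", encoding="utf-8") as file:
        old_links = file.read().splitlines()

    while True:
        base_url = 'https://www.cnbc.com/world/'
        async with aiohttp.ClientSession() as sess:
            async with sess.get(base_url, headers={'user-agent': 'Mozilla/5.0'}) as res:
                text = await res.text()

        soup = bs(text, 'lxml')
        container = soup.find("div", attrs={"class": "LatestNews-container"})
        title = container.a.text.strip()
        link = container.a["href"]

        if link not in old_links:
            print(f"title : {title}")
            print(f"link : {link}")
            old_links.append(link)
            # Rewrite the whole file so the list survives a reboot.
            with open("old_links.txt", "w", encoding="utf-8") as file:
                file.write("\n".join(old_links))
        else:
            print("No update")

        await asyncio.sleep(3)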
Related
I've written a little function in Python to scrape the content from a certain Wikipedia page and save it in an external file. My code below does that. However, I cannot figure out how to make my code save the repeatedly scraped content to a new file each time, e.g. Wiki1.txt, Wiki2.txt, and so on. Currently, my timer runs, but each time the page's content is scraped anew it overwrites the previously saved content.
# imports as implied by the snippet: re.get() suggests requests was imported under the alias re
import time
import requests as re
from bs4 import BeautifulSoup as bs

def Wiki():
    url = re.get("https://en.wikipedia.org/wiki/Digital_identity")
    html = url.content
    html2 = bs(html, 'html5lib')
    Wiki = html2.find_all('p')
    Data = [i.text for i in Wiki]
    with open("Wiki.txt", "w", encoding="utf-8") as f:
        f.write(''.join(Data))
        f.close()
    return

if __name__ == '__main__':
    while True:  # run the forever loop
        Wiki()  # name of function to be written yet
        wait_time = 10
        print("Waiting {} minutes".format(wait_time))
        time.sleep(wait_time * 10)  # note: this sleeps wait_time * 10 seconds, not 10 minutes
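One way to get Wiki1.txt, Wiki2.txt, and so on is to build the filename from a counter that the calling loop increments. A minimal sketch, reusing the URL and parsing from the snippet above (the name wiki_snapshot and the counter argument are just illustrative):
import time
import requests
from bs4 import BeautifulSoup as bs

def wiki_snapshot(counter):
    # Fetch the page and extract all paragraph text, as in the snippet above.
    response = requests.get("https://en.wikipedia.org/wiki/Digital_identity")
    soup = bs(response.content, 'html5lib')
    paragraphs = [p.text for p in soup.find_all('p')]

    # Build a new filename from the counter so earlier snapshots are kept.
    filename = "Wiki{}.txt".format(counter)
    with open(filename, "w", encoding="utf-8") as f:
        f.write(''.join(paragraphs))

if __name__ == '__main__':
    counter = 1
    while True:
        wiki_snapshot(counter)
        counter += 1
        wait_time = 10
        print("Waiting {} minutes".format(wait_time))
        time.sleep(wait_time * 60)  # 60 seconds per minute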
I'm running a Python script with BeautifulSoup in order to extract text, topics and tags from web articles. The website contains 210 pages, and each page contains 10 articles (each article's URL is stored in a txt file).
I'm using the following code :
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

data = []
with open('urls.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if response.ok:
            try:
                soup = BeautifulSoup(response.text, "html.parser")
                text = soup.select_one('div.para_content_text').get_text(strip=True)
                topic = soup.select_one('div.article_tags_topics').get_text(strip=True)
                tags = soup.select_one('div.article_tags_tags').get_text(strip=True)
            except AttributeError:
                print(" ")
            data.append(
                {
                    'text': text,
                    'topic': topic,
                    'tags': tags
                }
            )
            pd.DataFrame(data).to_csv('text.csv', index=False, header=True)
            time.sleep(3)
My code seems to be correct, but it has been running for several days now.
I would like to understand whether an error is blocking progress or the process is simply very long.
To do this, I would like to know if it would be possible to add a "component" to my code that would let me track the number of URLs processed in real time.
Any ideas?
The way your code is written now, you are accumulating all the data in memory until it's all fetched. The easiest way to keep track of the progress without changing the code too much would be to just print either the current URL, or the number of the URL you're processing.
A better way that involves changing the code a little more would be to write the data to the CSV file as you are parsing it, instead of all at once in the end. Something like
print("text,topic,tags")
with open('urls.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        # Getting the data you want...
        print(f"{text},{topic},{tags}")
If you are going with this method, make sure to escape/remove commas, or use an actual CSV library to produce the lines.
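If you go the CSV-library route, here is a sketch of how both points (real-time progress and safe quoting) could look; the selectors are carried over from the question, and the running counter is simply printed before each request:
import csv
import time
import requests
from bs4 import BeautifulSoup

with open('urls.txt', 'r') as inf, open('text.csv', 'w', newline='') as outf:
    writer = csv.writer(outf)
    writer.writerow(['text', 'topic', 'tags'])

    for i, row in enumerate(inf, start=1):
        url = row.strip()
        print(f"[{i}] fetching {url}")  # real-time progress indicator
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if not response.ok:
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        try:
            text = soup.select_one('div.para_content_text').get_text(strip=True)
            topic = soup.select_one('div.article_tags_topics').get_text(strip=True)
            tags = soup.select_one('div.article_tags_tags').get_text(strip=True)
        except AttributeError:
            continue  # skip pages missing the expected elements

        writer.writerow([text, topic, tags])  # the csv module escapes commas and quotes
        outf.flush()  # keep partial results on disk as you go
        time.sleep(3)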
I'm new to Python and scraping. I'm trying to run two loops. One goes and scrapes IDs from one page. Then, using those IDs, I call another API to get more info/properties.
When I run this program, the first part runs fine (it gets the IDs), but then it exits and doesn't run the second part. I feel I'm missing something really basic about control flow in Python here. Why does Python exit after the first loop when I run it in Terminal?
import requests
import csv
import time
import json
from bs4 import BeautifulSoup, Tag

file = open('parcelids.csv', 'w')
writer = csv.writer(file)
writer.writerow(['parcelId'])

for x in range(1, 10):
    time.sleep(1)  # slowing it down
    url = 'http://apixyz/Parcel.aspx?Pid=' + str(x)
    source = requests.get(url)
    response = source.content
    soup = BeautifulSoup(response, 'html.parser')
    parcelId = soup.find("span", id="MainContent_lblMblu").text.strip()
    writer.writerow([parcelId])

out = open('mapdata.csv', 'w')

with open('parcelIds.csv', 'r') as in1:
    reader = csv.reader(in1)
    writer = csv.writer(out)
    next(reader, None)  # skip header
    for row in reader:
        row = ''.join(row[0].split())[:-2].upper().replace('/', '-')  # formatting
        url = "https://api.io/api/properties/"
        url1 = url + row
        time.sleep(1)  # slowing it down
        response = requests.get(url1)
        resp_json_payload = response.json()
        address = resp_json_payload['property']['address']
        writer.writerow([address])
If you are running on Windows (where filenames are not case-sensitive), then the file you have open for writing (parcelids.csv) is still open when you reopen it to read from it.
Try closing the file before opening it to read from it.
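A minimal sketch of that fix, using with blocks so the first file is guaranteed to be closed before it is reopened (the scraping loop and the API lookups from the question are elided; only the file handling is shown):
import csv

# Write the parcel IDs; the with block closes the file as soon as it ends.
with open('parcelids.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['parcelId'])
    # ... the scraping loop from the question goes here ...

# By this point parcelids.csv is closed, so reopening it for reading is safe.
with open('parcelids.csv', 'r') as in1, open('mapdata.csv', 'w', newline='') as out:
    reader = csv.reader(in1)
    writer = csv.writer(out)
    next(reader, None)  # skip header
    for row in reader:
        # ... the API lookup for each row from the question goes here ...
        pass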
I have been stuck for hours on how to write my crawled data into multiple files. I wrote code that scrapes a website and extracts the body of each link on the site. An example is crawling a news website: you extract all the links and then extract the body of each link. I have done that successfully. But my concern now is that instead of storing them all in one file using the code below,
def save_data(data):
    the_file = open('raw_data.txt', 'w')
    for title_text, body_content, url in data:
        the_file.write("%s\n" % [title_text, body_content, url])
how do I write the code so that each article is stored in a different file? So I would end up with something like Article_00, Article_01, Article_02, and so on.
Thanks
If you want to save the data in multiple files, then you must open multiple files for writing.
Use enumerate to get a counter for which data set you're iterating over, so you can use it in the filename like this:
def save_data(data):
    for i, (title_text, body_content, url) in enumerate(data):
        file = open('Article_%02d' % (i,), 'w+')
        file.write("%s\n" % [title_text, body_content, url])
        file.close()
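For example, calling it with some made-up sample data (the tuples below are purely illustrative) writes Article_00 and Article_01:
sample = [
    ("First headline", "Body of the first article...", "https://example.com/a"),
    ("Second headline", "Body of the second article...", "https://example.com/b"),
]
save_data(sample)  # creates Article_00 and Article_01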
I have a problem with my Python script, in which I want to scrape the same content from every website. I have a file with a lot of URLs, and I want Python to go over them and pass each one to requests.get(url). After that I write the output to a file named 'somefile.txt'.
I have the following Python script (version 2.7, Windows 8):
from lxml import html
import requests

urls = ('URL1',
        'URL2',
        'URL3'
        )

for url in urls:
    page = requests.get(url)

tree = html.fromstring(page.text)
visitors = tree.xpath('//b["no-visitors"]/text()')
print 'Visitors: ', visitors

f = open('somefile.txt', 'a')
print >> f, 'Visitors:', visitors  # or f.write('...\n')
f.close()
As you can see, I have not included the file with the URLs in the script. I tried out many tutorials but failed. The filename would be 'urllist.txt'. In the current script I only get the data from URL3; ideally I want to get the data for every URL in urllist.txt.
My attempt at reading over the text file:
with open('urllist.txt', 'r') as f:  # text file containing the URLS
    for url in f:
        page = requests.get(url)
You'll need to remove the newline from your lines:
with open('urllist.txt', 'r') as f:  # text file containing the URLS
    for url in f:
        page = requests.get(url.strip())
The str.strip() call removes all whitespace (including tabs, newlines and carriage returns) from the start and end of the line.
Do make sure you then process page inside the loop; if you run your extraction code outside the loop, all you'll get is the data from the last response you loaded. You may as well open the output file just once, in the with statement, so Python closes it again for you:
with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
    for url in urls:
        page = requests.get(url.strip())
        tree = html.fromstring(page.content)
        visitors = tree.xpath('//b["no-visitors"]/text()')
        print 'Visitors: ', visitors
        print >> output, 'Visitors:', visitors
You should either save each page in a separate variable, or perform all the computation within the loop over the URL list.
Based on your code, by the time your page parsing happens, it will only contain the data for the last page fetched, since you are overwriting the page variable on each iteration.
Something like the following should append all the pages' info:
for url in urls:
    page = requests.get(url)
    tree = html.fromstring(page.text)
    visitors = tree.xpath('//b["no-visitors"]/text()')
    print 'Visitors: ', visitors
    f = open('somefile.txt', 'a')
    print >> f, 'Visitors:', visitors  # or f.write('...\n')
    f.close()