The script reads a single URL from a text file, imports information from that web page, and stores it in a CSV file. The script works fine for a single URL.
Problem: I have added several URLs to my text file, one per line, and now I want my script to read the first URL, do the desired operation, then go back to the text file to read the second URL and repeat.
Once I added the for loop to get this done, I started facing the error below:
Traceback (most recent call last):
File "C:\Users\T947610\Desktop\hahah.py", line 22, in
table = soup.findAll("table", {"class":"display"})[0] #Facing error in this statement
IndexError: list index out of range
f = open("URL.txt", 'r')
for line in f.readlines():
print (line)
page = requests.get(line)
print(page.status_code)
print(page.content)
soup = BeautifulSoup(page.text, 'html.parser')
print("soup command worked")
table = soup.findAll("table", {"class":"display"})[0] #Facing error in this statement
rows = table.findAll("tr")
findAll doesn't throw the exception itself; when it can't find the data it returns an empty result, and indexing that empty result with [0] is what raises the IndexError. I've had the same issue and work around it with try/except, though you'll probably need to handle the empty values differently than I've shown. For example:
f = open("URL.txt", 'r')
for line in f.readlines():
print (line)
page = requests.get(line)
print(page.status_code)
print(page.content)
soup = BeautifulSoup(page.text, 'html.parser')
print("soup command worked")
try:
table = soup.findAll("table", {"class":"display"})[0] #Facing error in this statement
rows = table.findAll("tr")
except IndexError:
table = None
rows = None
If the single-URL input was working, the new input lines from the .txt file are probably the problem. Try applying .strip() to the line; a line read from a file normally carries whitespace (at least a trailing newline) at the end:
page = requests.get(line.strip())
Also, if soup.findAll() finds nothing, it returns an empty list, and indexing an empty list with [0] raises exactly the IndexError you're seeing. Try printing the soup and checking its content.
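Putting the two suggestions together, a minimal sketch (not your original script, just the same requests/BeautifulSoup calls) would strip each line and check the findAll result before indexing:
import requests
from bs4 import BeautifulSoup

with open("URL.txt", 'r') as f:
    for line in f:
        url = line.strip()                                   # drop the trailing newline
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        tables = soup.findAll("table", {"class": "display"})
        if tables:                                           # empty list means no table matched
            rows = tables[0].findAll("tr")
        else:
            print("No matching table found on", url)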
Related
I have a .txt file with a different link on each line that I want to iterate over and parse into BeautifulSoup(response.text, "html.parser"). I'm having a couple of issues though.
I can see the lines iterating from the text file, but when I assign them to requests.get(websitelink), the code that previously worked (without iteration) no longer prints any of the data I scrape.
All I receive are some blank lines in the results.
I'm new to Python and BeautifulSoup, so I'm not quite sure what I'm doing wrong. I've tried parsing the lines as strings, but that didn't seem to work.
import requests
from bs4 import BeautifulSoup, CData

filename = 'item_ids.txt'

with open(filename, "r") as fp:
    lines = fp.readlines()
    for line in lines:
        # Test to see if iteration from line to line works
        print(line)

        # Assign single line to websitelink
        websitelink = line

        # Parse websitelink into requests
        response = requests.get(websitelink)
        soup = BeautifulSoup(response.text, "html.parser")

        # initialize and reset vars for cd loop
        count = 0
        weapon = ''
        stats = ''

        # iterate through cdata on page, and parse wanted data
        for cd in soup.findAll(text=True):
            if isinstance(cd, CData):
                # print(cd)
                count += 1
                if count == 1:
                    weapon = cd
                if count == 6:
                    stats = cd

        # concatenate cdata info
        both = weapon + " " + stats
        print(both)
The code should follow these steps:
Read a line (URL) from the text file, and assign it to a variable to be used with requests.get(websitelink)
BeautifulSoup scrapes that link for the CData and prints it
Repeat steps 1 and 2 until the final line of the text file (last URL) is reached
Any help would be greatly appreciated,
Thanks
I don't know whether this will help you or not, but adding strip() to the line when you assign it to websitelink made your code work for me. You could try it:
websitelink = line.strip()
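In context it is a small change inside your loop; as a sketch (with the CData parsing left exactly as in your original code):
import requests
from bs4 import BeautifulSoup

with open('item_ids.txt', 'r') as fp:
    for line in fp:
        # strip() removes the trailing newline that comes with each line read from the file
        websitelink = line.strip()
        response = requests.get(websitelink)
        soup = BeautifulSoup(response.text, "html.parser")
        # ... the CData parsing from the question continues here unchanged ...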
I'm taking data from a website, and writing it to a .txt file.
import textwrap

import requests
from bs4 import BeautifulSoup

head = 'mpg123 -q '
tail = ' &'

url = 'http://www.ndtv.com/article/list/top-stories/'
r = requests.get(url)
soup = BeautifulSoup(r.content)

g_data = soup.find_all("div", {"class":"nstory_intro"})

log = open("/home/pi/logs/newslog.txt", "w")

soup = BeautifulSoup(g_data)

# Will grab data from website, and write it to .txt file
for item in g_data:
    shorts = textwrap.wrap(item.text, 100)
    text_file = open("Output.txt", "w")
    text_file.write("%s" % g_data)
    print 'Wrote Data Locally On Pi'
    text_file.close()
    for sentance in shorts:
        print 'End.'
        # text_file = open("Output.txt", "w")
        # text_file.close()
I know the website pulls the correct information; however, when I run the script in the console, I keep getting this error:
TypeError: 'ResultSet' does not have the buffer interface
I tried looking around on Google, and I see this a lot as TypeError: 'str' does not have the buffer interface for strings when mixing Python 2.x and Python 3.x. I tried implementing some of those solutions in the code, but I still keep getting the 'ResultSet' error.
ResultSet is the type of your g_data:
In [8]: g_data = soup.find_all('div',{'class':'nstory_intro'})
In [9]: type(g_data)
Out[9]: bs4.element.ResultSet
You'd better use a context manager to handle opening and closing the file automatically.
If you just want to write the text content of g_data to Output.txt, you should do this:
with open('Output.txt', 'w') as f:
    for item in g_data:
        f.write(item.text + '\n')
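If you also want to keep the 100-character wrapping from your original code, a sketch along the same lines (assuming textwrap is imported) would be:
import textwrap

with open('Output.txt', 'w') as f:
    for item in g_data:
        # wrap each story's text to 100-character lines before writing
        for short_line in textwrap.wrap(item.text, 100):
            f.write(short_line + '\n')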
With the help of the forum, I made a script that catches all the links to the topics on this page: https://www.inforge.net/xi/forums/liste-proxy.1118/. These topics contain lists of proxies. The script is this:
import urllib.request, re
from bs4 import BeautifulSoup

url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")

base = "https://www.inforge.net/xi/"

for tag in soup.find_all("a", {"class":"PreviewTooltip"}):
    links = tag.get("href")
    final = [base + links]
    final2 = urllib.request.urlopen(final)
    for line in final2:
        ip = re.findall("(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", line)
        ip = ip[3:-1]
        for addr in ip:
            print(addr)
The output is:
Traceback (most recent call last):
File "proxygen5.0.py", line 13, in <module>
sourcecode = urllib.request.urlopen(final)
File "/usr/lib/python3.5/urllib/request.py", line 162, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.5/urllib/request.py", line 456, in open
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
I know that the problem is in the line final2 = urllib.request.urlopen(final), but I don't know how to solve it.
What can I do to print the IPs?
This code should do what you want; it's commented so you can understand all the steps:
import urllib.request, re
from bs4 import BeautifulSoup

url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")

base = "https://www.inforge.net/xi/"

# Iterate over all the <a> tags
for tag in soup.find_all("a", {"class":"PreviewTooltip"}):
    # Get the link from the tag
    link = tag.get("href")
    # Compose the new link
    final = base + link

    print('Request to {}'.format(final))  # To know what we are doing

    # Download the 'final' link content
    result = urllib.request.urlopen(final)

    # For every line in the downloaded content
    for line in result:
        # Find one or more IP(s); lines must be converted to string because `bytes` objects are given
        ip = re.findall("(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", str(line))
        # If one or more IP(s) are found
        if ip:
            # Print them on separate lines
            print('\n'.join(ip))
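If you would rather collect all the addresses first and print them once at the end, a small variation of the same loop (same base URL and regex as above) could be:
proxies = []

for tag in soup.find_all("a", {"class": "PreviewTooltip"}):
    final = base + tag.get("href")
    result = urllib.request.urlopen(final)
    for line in result:
        # accumulate every ip:port match found on this line
        proxies.extend(re.findall("(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", str(line)))

print('\n'.join(proxies))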
I have a problem with my Python script: I want to scrape the same content from every website. I have a file with a lot of URLs and I want Python to go over them and place each one into the requests.get(url) call. After that I write the output to a file named 'somefile.txt'.
I have the following Python script (version 2.7 - Windows 8):
from lxml import html
import requests

urls = ('URL1',
        'URL2',
        'URL3'
       )

for url in urls:
    page = requests.get(url)

tree = html.fromstring(page.text)
visitors = tree.xpath('//b["no-visitors"]/text()')
print 'Visitors: ', visitors
f = open('somefile.txt', 'a')
print >> f, 'Visitors:', visitors  # or f.write('...\n')
f.close()
As you can see I have not included the file with the URLs in the script. I tried out many tutorials but failed. The filename would be 'urllist.txt'. With the current script I only get the data from URL3; ideally I want to get all the data from urllist.txt.
Attempt at reading over the text file:
with open('urllist.txt', 'r') as f:  # text file containing the URLS
    for url in f:
        page = requests.get(url)
You'll need to remove the newline from your lines:
with open('urllist.txt', 'r') as f:  # text file containing the URLS
    for url in f:
        page = requests.get(url.strip())
The str.strip() call removes all whitespace (including tabs, newlines, and carriage returns) from the start and end of the line.
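For example (the URL here is purely hypothetical, standing in for a line read from urllist.txt):
line = 'http://example.com/stats\n'   # hypothetical line, as read from the file
print repr(line)           # 'http://example.com/stats\n'
print repr(line.strip())   # 'http://example.com/stats'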
Do make sure you then process page in the loop; if you run your code to extract the data outside the loop all you'll get is the data from the last response you loaded. You may as well open the output file just once, in the with statement so Python closes it again:
with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
    for url in urls:
        page = requests.get(url.strip())
        tree = html.fromstring(page.content)
        visitors = tree.xpath('//b["no-visitors"]/text()')
        print 'Visitors: ', visitors
        print >> output, 'Visitors:', visitors
You should either save each page in a separate variable, or perform all the computation within the loop over the URL list.
Based on your code, by the time your page parsing happens it will only contain the data for the last page fetched, since you are overriding the page variable on each iteration.
Something like the following should append all the pages' info.
for url in urls:
    page = requests.get(url)
    tree = html.fromstring(page.text)
    visitors = tree.xpath('//b["no-visitors"]/text()')
    print 'Visitors: ', visitors
    f = open('somefile.txt', 'a')
    print >> f, 'Visitors:', visitors  # or f.write('...\n')
    f.close()
I have a list of URLs in a text file from which I want to fetch the article text, author, and article title. Once these three elements are obtained I want them written to a file. So far I can read the URLs from the text file, but Python only prints out the URLs and the content of one article (the final one). How can I rewrite my script so that Python reads and writes every single URL and its content?
I have the following Python script (version 2.7 - Mac OS X Yosemite):
from newspaper import Article

f = open('text.txt', 'r')  # text file containing the URLS

for line in f:
    print line

url = line
first_article = Article(url)
first_article.download()
first_article.parse()

# write/append to file
with open('anothertest.txt', 'a') as f:
    f.write(first_article.title)
    f.write(first_article.text)

print str(first_article.title)

for authors in first_article.authors:
    print authors
if not authors:
    print 'No author'

print str(first_article.text)
You're getting the last article, because you're iterating over all the lines of the file:
for line in f:
    print line
and once the loop is over, line contains the last value.
url = line
If you move the contents of your code inside the loop, like so:
with open('text.txt', 'r') as f:  # text file containing the URLS
    with open('anothertest.txt', 'a') as fout:
        for url in f:
            print(u"URL Line: {}".format(url.encode('utf-8')))
            # you might want to remove endlines and whitespaces from
            # around the URL, which is what strip() does
            article = Article(url.strip())
            article.download()
            article.parse()
            # write/append to file
            fout.write(article.title)
            fout.write(article.text)
            print(u"Title: {}".format(article.title.encode('utf-8')))
            # print authors only if there are authors to show.
            if len(article.authors) == 0:
                print('No author!')
            else:
                for author in article.authors:
                    print(u"Author: {}".format(author.encode('utf-8')))
            print("Text of the article:")
            print(article.text.encode('utf-8'))
I also made a few changes to improve your code:
use with open() also for reading the file, to properly release the file descriptor when you don't need it anymore;
call the output file fout to avoid shadowing the first file handle;
open fout only once, before entering the loop, to avoid opening and closing the file at each iteration;
check the length of article.authors instead of checking for the existence of authors, because authors won't exist when you never get inside the loop (article.authors being empty), as sketched below.
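A tiny hypothetical snippet showing why the original check fails when the author list is empty:
authors_list = []              # stands in for first_article.authors being empty

for authors in authors_list:
    print authors              # never runs, so the name 'authors' is never assigned

if not authors:                # raises NameError: name 'authors' is not defined
    print 'No author'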
HTH