Catch both links and IPs with Python 3 - python

With the help of the forum, I made a script that catches all the links to the topics on this page: https://www.inforge.net/xi/forums/liste-proxy.1118/ . These topics contain lists of proxies. The script is this:
import urllib.request, re
from bs4 import BeautifulSoup

url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")
base = "https://www.inforge.net/xi/"

for tag in soup.find_all("a", {"class":"PreviewTooltip"}):
    links = tag.get("href")
    final = [base + links]
    final2 = urllib.request.urlopen(final)
    for line in final2:
        ip = re.findall("(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", line)
        ip = ip[3:-1]
        for addr in ip:
            print(addr)
The output is:
Traceback (most recent call last):
  File "proxygen5.0.py", line 13, in <module>
    sourcecode = urllib.request.urlopen(final)
  File "/usr/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 456, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
I know that the problem is in the line final2 = urllib.request.urlopen(final), but I don't know how to solve it.
What can I do to print the IPs?

This code should do what you want; it's commented so you can follow all the steps:
import urllib.request, re
from bs4 import BeautifulSoup

url = "https://www.inforge.net/xi/forums/liste-proxy.1118/"
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")
base = "https://www.inforge.net/xi/"

# Iterate over all the <a> tags
for tag in soup.find_all("a", {"class": "PreviewTooltip"}):
    # Get the link from the tag
    link = tag.get("href")
    # Compose the new link
    final = base + link
    print('Request to {}'.format(final))  # To know what we are doing
    # Download the 'final' link content
    result = urllib.request.urlopen(final)
    # For every line in the downloaded content
    for line in result:
        # Find one or more IP(s); we need to convert the line to str because `bytes` objects are given
        ip = re.findall(r"(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3}):(?:[\d]{1,5})", str(line))
        # If one or more IP(s) are found
        if ip:
            # Print them on separate lines
            print('\n'.join(ip))
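
The original traceback came from passing a list to urlopen(); above, final is composed as a plain string instead. As an optional refinement (a sketch, assuming the forum pages decode as UTF-8), you can decode the response body once rather than calling str() on each bytes line, which avoids matching against the b'...' representation:

# Optional sketch: decode the whole body once, then search it (assumes UTF-8 pages)
body = urllib.request.urlopen(final).read().decode("utf-8", errors="ignore")
ips = re.findall(r"(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}", body)
if ips:
    print('\n'.join(ips))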

Related

Web scraping python: IndexError: list index out of range

The script reads a single URL from a text file, then imports information from that web page and stores it in a CSV file. The script works fine for a single URL.
Problem: I have added several URLs to my text file, line by line, and now I want my script to read the first URL, do the desired operation, then go back to the text file to read the second URL and repeat.
Once I added the for loop to get this done, I started facing the error below:
Traceback (most recent call last):
  File "C:\Users\T947610\Desktop\hahah.py", line 22, in <module>
    table = soup.findAll("table", {"class":"display"})[0] #Facing error in this statement
IndexError: list index out of range
f = open("URL.txt", 'r')
for line in f.readlines():
    print(line)
    page = requests.get(line)
    print(page.status_code)
    print(page.content)
    soup = BeautifulSoup(page.text, 'html.parser')
    print("soup command worked")
    table = soup.findAll("table", {"class":"display"})[0] #Facing error in this statement
    rows = table.findAll("tr")
Indexing the result of findAll throws an IndexError when it can't find the data. I've had this same issue and I work around it with try/except; you'll probably need to handle the empty values differently than I've shown, but for example:
import requests
from bs4 import BeautifulSoup

f = open("URL.txt", 'r')
for line in f.readlines():
    print(line)
    page = requests.get(line)
    print(page.status_code)
    print(page.content)
    soup = BeautifulSoup(page.text, 'html.parser')
    print("soup command worked")
    try:
        table = soup.findAll("table", {"class":"display"})[0]  # the statement that was failing
        rows = table.findAll("tr")
    except IndexError:
        table = None
        rows = None
If the single-URL input was working, maybe the new input line from the .txt file is the problem. Try applying .strip() to the line; lines read from a file normally have whitespace (at least a newline) at the end:
page = requests.get(line.strip())
Also, if soup.findAll() finds nothing, it returns an empty list, and indexing [0] on an empty list raises exactly this IndexError. Try printing the soup and checking the content.
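
If you prefer not to rely on an exception, here is a small sketch of the same guard without try/except (assuming the rest of the loop stays as in the question):

tables = soup.findAll("table", {"class": "display"})
if tables:
    # a matching table exists, so indexing [0] is safe
    rows = tables[0].findAll("tr")
else:
    # nothing matched on this page; skip it instead of crashing
    print("No table with class 'display' found for:", line.strip())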

Download all Images in a Web directory

I am trying to gather all images in a specific directory on my webserver, using BeautifulSoup4.
So far I have this code:
from init import *
from bs4 import BeautifulSoup
import urllib
import urllib.request

# use this image scraper from the location that
# you want to save scraped images to
def make_soup(url):
    html = urllib.request.urlopen(url)
    return BeautifulSoup(html, features="html.parser")

def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    images = [img for img in soup.findAll('img')]
    print(str(len(images)) + "images found.")
    print('Downloading images to current working directory.')
    # compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        filename = each.split('/')[-1]
        urllib.request.Request(each, filename)
    return image_links

# a standard call looks like this
get_images('https://omabilder.000webhostapp.com/img/')
This however, spits out the following error
7images found.
Downloading images to current working directory.
Traceback (most recent call last):
  File "C:\Users\MyPC\Desktop\oma projekt\getpics.py", line 1, in <module>
    from init import *
  File "C:\Users\MyPC\Desktop\oma projekt\init.py", line 9, in <module>
    from getpics import *
  File "C:\Users\MyPC\Desktop\oma projekt\getpics.py", line 26, in <module>
    get_images('https://omabilder.000webhostapp.com/img/')
  File "C:\Users\MyPC\Desktop\oma projekt\getpics.py", line 22, in get_images
    urllib.request.Request(each, filename)
  File "C:\Users\MyPC\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 328, in __init__
    self.full_url = url
  File "C:\Users\MyPC\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 354, in full_url
    self._parse()
  File "C:\Users\MyPC\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 383, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: '/icons/blank.gif'
What I do not understand is the following:
There is no GIF in the directory and no /icons/ subdirectory.
Furthermore, it says 7 images were found, when only about 3 are uploaded to the website.
The gifs are the icons next to the links on your website (tiny ~20x20 px images). They're actually shown on the website. If I understand correctly, you want to download the png images -- these are links, rather than images at the url you've provided.
If you want to download the png images from the links, then you can use something like this:
from bs4 import BeautifulSoup
import urllib
import urllib.request
import os

# use this image scraper from the location that
# you want to save scraped images to
def make_soup(url):
    html = urllib.request.urlopen(url)
    return BeautifulSoup(html, features="html.parser")

def get_images(url):
    soup = make_soup(url)
    # get all links (start with "a")
    images = [link["href"] for link in soup.find_all('a', href=True)]
    # keep ones that end with png
    images = [im for im in images if im.endswith(".png")]
    print(str(len(images)) + " images found.")
    print('Downloading images to current working directory.')
    # compile our unicode list of image links
    for each in images:
        urllib.request.urlretrieve(os.path.join(url, each), each)
    return images

# a standard call looks like this
get_images('https://omabilder.000webhostapp.com/img/')
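
One caveat with the os.path.join(url, each) call: it is meant for filesystem paths, so on Windows it can splice a backslash into the URL if the base does not end with a slash. urllib.parse.urljoin is the safer way to compose download URLs; a minimal sketch of that variant:

from urllib.parse import urljoin

for each in images:
    # urljoin keeps forward slashes and resolves relative links correctly
    urllib.request.urlretrieve(urljoin(url, each), each)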

Iterating website URLs from a text file into BeautifulSoup w/ Python

I have a .txt file with a different link on each line that I want to iterate, and parse into BeautifulSoup(response.text, "html.parser"). I'm having a couple issues though.
I can see the lines iterating from the text file, but when I assign them to my requests.get(websitelink), my code that previously worked (without iteration) no longer prints any data that I scrape.
All I receive are some blank lines in the results.
I'm new to Python and BeautifulSoup, so I'm not quite sure what I'm doing wrong. I've tried parsing the lines as a string, but that didn't seem to work.
import requests
from bs4 import BeautifulSoup, CData  # CData imported here because the loop below checks for it

filename = 'item_ids.txt'
with open(filename, "r") as fp:
    lines = fp.readlines()
    for line in lines:
        # Test to see if iteration from line to line works
        print(line)
        # Assign single line to websitelink
        websitelink = line
        # Parse websitelink into requests
        response = requests.get(websitelink)
        soup = BeautifulSoup(response.text, "html.parser")

        # initialize and reset vars for cd loop
        count = 0
        weapon = ''
        stats = ''

        # iterate through cdata on page, and parse wanted data
        for cd in soup.findAll(text=True):
            if isinstance(cd, CData):
                # print(cd)
                count += 1
                if count == 1:
                    weapon = cd
                if count == 6:
                    stats = cd

        # concatenate cdata info
        both = weapon + " " + stats
        print(both)
The code should follow these steps:
1. Read a line (URL) from the text file, and assign it to a variable to be used with requests.get(websitelink)
2. BeautifulSoup scrapes that link for the CData and prints it
3. Repeat steps 1 & 2 until the final line of the text file (last URL)
Any help would be greatly appreciated,
Thanks
I don't know whether this will help you or not, but adding a strip() to the line when assigning it to websitelink made your code work for me. You could try it:
websitelink = line.strip()
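
For reference, a minimal sketch of the reading loop with that change (plus skipping blank lines, which a trailing newline in the file can produce):

with open(filename, "r") as fp:
    for line in fp:
        websitelink = line.strip()   # remove the trailing newline and surrounding whitespace
        if not websitelink:
            continue                 # skip empty lines
        response = requests.get(websitelink)
        soup = BeautifulSoup(response.text, "html.parser")
        # ... rest of the scraping loop unchanged ...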

Beautifulsoup download all .zip files from Google Patent Search

What I am trying to do is use BeautifulSoup to download every zip file from the Google Patent archive. Below is the code that I've written thus far, but it seems that I am having trouble getting the files to download into a directory on my desktop. Any help would be greatly appreciated.
from bs4 import BeautifulSoup
import urllib2
import re
import pandas as pd

url = 'http://www.google.com/googlebooks/uspto-patents-grants.html'
site = urllib2.urlopen(url)
html = site.read()
soup = BeautifulSoup(html)
soup.prettify()
path = open('/Users/username/Desktop/', "wb")
for name in soup.findAll('a', href=True):
    print name['href']
    linkpath = name['href']
    rq = urllib2.request(linkpath)
    res = urllib2.urlopen(rq)
    path.write(res.read())
The result I am supposed to get is that all of the zip files download into a specific directory. Instead, I am getting the following error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-874f34e07473> in <module>()
     17     print name['href']
     18     linkpath = name['href']
---> 19     rq = urllib2.request(namep)
     20     res = urllib2.urlopen(rq)
     21     path.write(res.read())

AttributeError: 'module' object has no attribute 'request'
In addition to using a non-existent request entity from urllib2, you don't output to a file correctly - you can't just open the directory, you have to open each file for output separately.
Also, the 'Requests' package has a much nicer interface than urllib2. I recommend installing it.
Note that, today anyway, the first .zip is 5.7 GB, so streaming to a file is essential.
Really, you want something more like this:
from BeautifulSoup import BeautifulSoup
import requests

# point to output directory
outpath = 'D:/patent_zips/'
url = 'http://www.google.com/googlebooks/uspto-patents-grants.html'
mbyte = 1024 * 1024

print 'Reading: ', url
html = requests.get(url).text
soup = BeautifulSoup(html)

print 'Processing: ', url
for name in soup.findAll('a', href=True):
    zipurl = name['href']
    if zipurl.endswith('.zip'):
        outfname = outpath + zipurl.split('/')[-1]
        r = requests.get(zipurl, stream=True)
        if r.status_code == requests.codes.ok:
            fsize = int(r.headers['content-length'])
            print 'Downloading %s (%sMb)' % (outfname, fsize / mbyte)
            with open(outfname, 'wb') as fd:
                for chunk in r.iter_content(chunk_size=1024):  # chunk size can be larger
                    if chunk:  # skip keep-alive chunks
                        fd.write(chunk)
This is your problem:
rq = urllib2.request(linkpath)
urllib2 is a module and it has no request entity/attribute in it.
I see a Request class in urllib2, but I'm unsure if that's what you intended to actually use...
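
For completeness, if you did want to stay with urllib2, that capital-R Request class would be the thing to use; a minimal sketch (assuming linkpath holds an absolute .zip URL and outfname is a per-file output path, not the bare directory):

rq = urllib2.Request(linkpath)
res = urllib2.urlopen(rq)
with open(outfname, 'wb') as fd:
    fd.write(res.read())  # note: reads the whole zip into memory, unlike the streaming version above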

File is not created in Python when Unicode is present in the file name

I need to store HTML pages as text files, named after the web page title. Something went wrong with my code, so it is not creating the file inside the directory. I have write permission on the directory. I am using Ubuntu 12.04 LTS.
Directory: /home/user1/
File name (from print name): Mathrubhumi Sports - ശ്രീനിക്ക് പച്ചക്കൊടി
The file name contains Unicode characters.
import os
from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://www.mathrubhumi.com/sports/story.php?id=397111"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, 'lxml')
texts = soup.findAll(text=True)
name = soup.title.text
name = name + '.txt'

def contains_unicode(text):
    try:
        str(text)
    except:
        return True
    return False

result = ''.join((text for text in texts if contains_unicode(text)))
# Output to a file
with open(os.path.join('/home/user1/textfiles',name,'w') as out:
    out.write(result)
Please help me to debug it
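
Before looking at encodings: the open() line in the question also has a syntax problem, since the 'w' mode ended up inside os.path.join and a closing parenthesis is missing. The corrected call would be:

with open(os.path.join('/home/user1/textfiles', name), 'w') as out:
    out.write(result)

Even with that fixed, writing Unicode text to a plain Python 2 file object can still fail, which is what the answer below addresses.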
I tried this and it worked; it created a file called Mathrub*.txt with some text in it in the current directory.
import codecs
import os
from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://www.mathrubhumi.com/sports/story.php?id=397111"
raw = urlopen(url).read()
soup = BeautifulSoup(raw, 'lxml')
texts = soup.findAll(text=True)
name = soup.title.string
name = name + '.txt'

def contains_unicode(text):
    try:
        str(text)
    except:
        return True
    return False

result = ''.join((text for text in texts if contains_unicode(text)))
# Output to a file
with codecs.open(name, 'w', encoding="utf-8") as out:
    out.write(result)
Before adding the codecs part, it complained loudly about trying to write some characters that it did not know how to interpret.
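
As an aside, on Python 3 the codecs module isn't needed for this step: the built-in open() accepts an encoding argument directly, so the equivalent would be:

# Python 3 equivalent of the codecs.open() call above
with open(name, 'w', encoding='utf-8') as out:
    out.write(result)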
