Metaprogramming Python Script for e-mail Capture - python

How can I modify the code below to capture all e-mails instead of images:
import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "URL WITH IMAGES"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)
# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName, 'wb')
        output.write(imgData)
        output.close()
    except:
        pass
I need to build a directory of results from an array of websites. I'm generating code for Unix with C++, calling the .py file multiple times and appending the output to an existing file each time.

Parsing/validating an email address properly requires a strong regex; you can find those on Google. Here is a simple email-address parsing regex:
emails = re.findall(r'([a-zA-Z0-9\.]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,3})', urlContent)
This is just a rudimentary example; you will want a more powerful one.
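For instance, a minimal sketch of the script above with the image logic swapped for email capture; the URL and the emails.txt output name are placeholders, and the pattern is the rudimentary one shown here:
import re
import urllib2

url = "URL WITH EMAIL ADDRESSES"  # placeholder
urlContent = urllib2.urlopen(url).read()

# rudimentary pattern; swap in a stronger one for real validation
emails = re.findall(r'([a-zA-Z0-9\.]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,3})', urlContent)

# append to an existing file on each run, one address per line
with open('emails.txt', 'a') as output:
    for email in set(emails):  # drop duplicates
        output.write(email + '\n')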

Related

How to download Flickr images using photos url (does not contain .jpg, .png, etc.) using Python

I want to download images from Flickr using the following types of links with Python:
https://www.flickr.com/photos/66176388@N00/2172469872/
https://www.flickr.com/photos/clairity/798067744/
This data is obtained from the xml file given at https://snap.stanford.edu/data/web-flickr.html
Is there any Python script or way to download these images automatically?
Thanks.
I tried to find the answer from other sources and compiled one as follows:
import re
from urllib import request

def download(url, save_name):
    html = request.urlopen(url).read()
    html = html.decode('utf-8')
    # grab the first large-size (_b) JPEG URL embedded in the page
    img_url = re.findall(r'https:[^" \\:]*_b\.jpg', html)[0]
    print(img_url)
    with open(save_name, "wb") as fp:
        fp.write(request.urlopen(img_url).read())

download('https://www.flickr.com/photos/clairity/798067744/sizes/l/', 'image.jpg')

Python download image from short url by keeping its own name

I would like to download an image file from a shortened or generated URL that doesn't contain the file name.
I have tried to use the Content-Disposition header; however, my file name is not ASCII, so it can't be printed.
I have found I can use urlretrieve or requests to download the file, but then I have to save it under a different name.
I want to download it while keeping its own name.
How can I do this?
matches = re.match(expression, message, re.I)
url = matches[0]
print(url)
original = urlopen(url)
remotefile = urlopen(url)
#blah = remotefile.info()['content-Disposition']
#print(blah)
#value, params = cgi.parse_header(blah)
#filename = params["filename*"]
#print(filename)
#print(original.url)
#filename = "filedown.jpg"
#urlretrieve(url, filename)
These are the approaches I have tried, but none of them work.
I was able to get this to work with the requests library, because you can use it to get the URL that the shortened URL redirects to. Then I applied your code to the redirected URL and it worked. There might be a way to do this with only urllib (I assume that's what you are using), but I don't know it.
import requests
from urllib.request import urlopen
import cgi

def getFilenameFromURL(url):
    req = requests.request("GET", url)
    # req.url is now the url the shortened url redirects to
    original = urlopen(req.url)
    value, params = cgi.parse_header(original.info()['Content-Disposition'])
    filename = params["filename*"]
    print(filename)
    return filename

getFilenameFromURL("https://shorturl.at/lKOY3")
You can then use urlretrieve with this. It's inefficient, but it works. Also, since you can get the actual URL with the requests library, you can probably get the filename through it as well.
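For instance, a short sketch combining the function above with urlretrieve, which also follows the redirect:
from urllib.request import urlretrieve

url = "https://shorturl.at/lKOY3"
filename = getFilenameFromURL(url)
urlretrieve(url, filename)  # saves the file under its own name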

How can I input a filename and download the file in Python?

I have a database of files. I'm writing a program that asks the user to input a file name, uses that input to find the file, downloads it, makes a folder locally, and saves the file there. Which Python module should be used?
Can be as small as this:
import requests

my_filename = input('Please enter a filename:')
my_url = 'http://www.somedomain/'
r = requests.get(my_url + my_filename, allow_redirects=True)
with open(my_filename, 'wb') as fh:
    fh.write(r.content)
Well, do you have the database online?
If so, I would suggest the requests module: very pythonic and fast.
Another great module based on requests is robobrowser.
You may also need Beautiful Soup to parse HTML or XML data.
I would avoid selenium because it's designed for web testing: it needs a browser and its webdriver, and it's pretty slow. It doesn't fit your needs at all.
Finally, to interact with the database I'd use sqlite3 (see the sketch after the sample below).
Here is a sample:
import os
import requests
from requests import Session

filename = input()
with Session() as session:
    url = f'http://www.domain.example/{filename}'
    try:
        response = session.get(url)
    except requests.exceptions.ConnectionError:
        print('File not existing')
        raise SystemExit  # no response to save, so stop here
    download_path = f'C:\\Users\\{os.getlogin()}\\Downloads\\your application'
    os.makedirs(download_path, exist_ok=True)
    with open(os.path.join(download_path, filename), mode='wb') as dbfile:
        dbfile.write(response.content)
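For the database side mentioned above, a minimal sqlite3 sketch; the files.db name and the files(name, url) schema are assumptions for illustration:
import sqlite3

# hypothetical schema: files(name TEXT, url TEXT)
conn = sqlite3.connect('files.db')
row = conn.execute('SELECT url FROM files WHERE name = ?', (filename,)).fetchone()
conn.close()
if row is None:
    print('No such file in the database')
else:
    print('Download it from:', row[0])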
However, you should read how to ask a good question.

Unable to save downloaded images into a folder on the desktop using python

I have made a scraper which at this moment parses image links and saves the downloaded images into the Python directory by default. The only thing I want to do now is choose a folder on the desktop to save those images in, but I can't. Here is what I'm up to:
import requests
import os.path
import urllib.request
from lxml import html

def Startpoint():
    url = "https://www.aliexpress.com/"
    response = requests.get(url)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="item-inner"]')
    for title in titles:
        Pics = "https:" + title.xpath('.//span[@class="pic"]//img/@src')[0]
        endpoint(Pics)

def endpoint(images):
    sdir = (r'C:\Users\ar\Desktop\mth')
    testfile = urllib.request.URLopener()
    xx = testfile.retrieve(images, images.split('/')[-1])
    filename = os.path.join(sdir, xx)
    print(filename)

Startpoint()
Upon execution the above code throws an error showing: "join() argument must be str or bytes, not 'tuple'"
You can download images with Python's urllib module. See the official urllib documentation for Python 2.7, or, if you want to use Python 3, the urllib documentation for Python 3.
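As for the error itself: URLopener.retrieve returns a (filename, headers) tuple, and that tuple is what ends up in os.path.join. A minimal sketch of a fixed endpoint, keeping the folder path from the question:
import os.path
import urllib.request

def endpoint(image_url):
    sdir = r'C:\Users\ar\Desktop\mth'
    # build the full target path first, then download straight to it
    filename = os.path.join(sdir, image_url.split('/')[-1])
    urllib.request.urlretrieve(image_url, filename)
    print(filename)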
You could use urllib.request, BytesIO from io, and PIL's Image (if you have a direct url to the image):
from PIL import Image
from io import BytesIO
import urllib.request

def download_image(url):
    req = urllib.request.Request(url)
    response = urllib.request.urlopen(req)
    content = response.read()
    img = Image.open(BytesIO(content))
    img.filename = url
    return img
The images are dynamic now. So, I thought to update this post:
import os
from selenium import webdriver
import urllib.request
from lxml.html import fromstring

url = "https://www.aliexpress.com/"

def get_data(link):
    driver.get(link)
    tree = fromstring(driver.page_source)
    for title in tree.xpath('//li[@class="item"]'):
        pics = "https:" + title.xpath('.//*[contains(@class,"img-wrapper")]//img/@src')[0]
        os.chdir(r"C:\Users\WCS\Desktop\test")
        urllib.request.urlretrieve(pics, pics.split('/')[-1])

if __name__ == '__main__':
    driver = webdriver.Chrome()
    get_data(url)
    driver.quit()
This is the code to download an HTML file from the web:
import random
import urllib.request

def download(url):
    # random number to use as the file name
    name = random.randrange(1, 1000)
    full_name = str(name) + ".html"  # compatible data type
    urllib.request.urlretrieve(url, full_name)  # main function

download("any url")
This code downloads any HTML file from the internet; you just have to provide the link to the function.
In your case you said you have retrieved the image links from the web page, so you can change the extension from ".html" to a compatible type. The problem is that the images may have different extensions, maybe ".jpg", ".png", etc.
So what you can do is match the ending of the link with string comparisons and then assign the extension at the end.
Here is an example for illustration:
import random
import urllib.request

def download(url):
    # pick the extension that matches the end of the link
    if url.endswith(".png"):
        extension = ".png"
    elif url.endswith(".jpg"):
        extension = ".jpg"
    else:
        extension = ".html"
    # random function to give a name to the file
    name = random.randrange(1, 1000)
    full_name = str(name) + extension  # compatible extension
    urllib.request.urlretrieve(url, full_name)  # main function

download("any url")
You can add more branches for other extension types.
If this helps in your situation, give it a thumbs up, buddy.

How to download specific GIF images (condition: phd*.gif) from a website using Python's BeautifulSoup?

I have the following code that downloads all images from a web-link.
from BeautifulSoup import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os
import sys

def main(url, out_folder="/test/"):
    """Downloads all the images at 'url' to /test/"""
    soup = bs(urlopen(url))
    parsed = list(urlparse.urlparse(url))
    for image in soup.findAll("img"):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlparse.urlunparse(parsed), outpath)

def _usage():
    print "usage: python dumpimages.py http://example.com [outpath]"

if __name__ == "__main__":
    url = sys.argv[-1]
    out_folder = "/test/"
    if not url.lower().startswith("http"):
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1)
    main(url, out_folder)
I want to modify it so that it downloads only images named as 'phd210223.gif' (for example), that is, images satisfying the condition: 'phd*.gif'
And I want to put it in a loop, so that after fetching such images from one webpage, it increments the page ID by 1 and downloads the same from the next page: 'http://www.example.com/phd.php?id=2'
How can I do this?
Instead of checking the name in the loop, you can use BeautifulSoup's built-in support for regular expressions. Provide the compiled regular expression as the value of the src argument:
import re
from bs4 import BeautifulSoup as bs  # note, you should use beautifulsoup4

for image in soup.find_all("img", src=re.compile(r'phd\d+\.gif$')):
    ...
The phd\d+\.gif$ regular expression searches for text starting with phd, followed by one or more digits, followed by a dot, followed by gif at the end of the string.
Note that you are using an outdated and unmaintained BeautifulSoup3, switch to beautifulsoup4:
pip install beautifulsoup4
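For the second part of the question (incrementing the page ID), a minimal sketch assuming the URL pattern from the question:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

for page_id in range(1, 11):  # pages 1..10; adjust the range as needed
    url = 'http://www.example.com/phd.php?id=%d' % page_id
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    for image in soup.find_all('img', src=re.compile(r'phd\d+\.gif$')):
        print(image['src'])  # a matching image; download it as in the original script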
A regular expression can help solve this: when the pattern is found in the string/URL, a match object is returned, otherwise None.
import re

reg = re.compile(r'phd.*\.gif$')
str1 = 'path/phd12342343.gif'
str2 = 'path/dhp12424353153.gif'
print re.search(reg, str1)  # matches
print re.search(reg, str2)  # None
I personally prefer using Python's default tools, so I use html.parser. What you need is something like this:
import re
import urllib.request
import html.parser

class LinksHTMLParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.gifs = list()
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    gifName = re.split("/", value)[-1]
                    # keep only links satisfying the phd*.gif condition
                    if re.match(r'phd.*\.gif$', gifName):
                        self.gifs.append(value)

parser = LinksHTMLParser()
parser.feed(urllib.request.urlopen("YOUR URL HERE").read().decode("utf-8"))
for gif in parser.gifs:
    urllib.request.urlretrieve(gif, re.split("/", gif)[-1])  # urlretrieve(url, local_name)
