Python: Take URL from list and access it
I'm trying to take URLs from a list (~1500 entries) and access them one by one using the twill library for Python. The reason I'm using twill is that I like it, and I might have to perform basic form filling later on.
The problem I have is writing the body of the loop.
I'm sure this is actually pretty simple to solve, but the solution just won't come to me at the moment.
from twill.commands import *

CONTAINER = open('urls.txt')  # opening file
CONTAINER_CONTENTS = CONTAINER.readlines()  # reading
# remove the trailing newline from each URL -- note this must be s.strip(),
# with parentheses; bare s.strip maps the method object without calling it
CONTAINER_CONTENTS = map(lambda s: s.strip(), CONTAINER_CONTENTS)

for url in CONTAINER_CONTENTS:
    # <educate me>
    go(url)
    # etc.
Thanks in advance.
from twill.commands import *

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        go(url)
        # now do something with the page
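Since the question mentions possibly doing basic form filling later, here is a hedged sketch of how that usually looks with twill's fv and submit commands. The URL, form index, field names, and values below are made-up placeholders, not anything from the question:

from twill.commands import go, showforms, fv, submit

go('http://example.com/login')   # placeholder URL
showforms()                      # list the forms on the page to find their indices and field names
fv('1', 'username', 'myuser')    # fill field 'username' in form 1 (hypothetical names)
fv('1', 'password', 'secret')
submit()                         # click the form's submit button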
Related
Pulling info from an API URL
I'm trying to pull the average of temperatures from this API for a bunch of different ZIP codes. I can currently do so by manually changing the ZIP code in the API URL, but I was hoping to be able to loop through a list of ZIP codes, or ask for input and use those ZIP codes. However, I'm rather new and have no idea how to add variables to a link; either that, or I'm overcomplicating it. So basically I was searching for some way to add a variable to the link (or something to the same effect) so I can change it whenever I want.

import urllib.request
import json

out = open("output.txt", "w")
link = "http://api.openweathermap.org/data/2.5/weather?zip={zip-code},us&appid={api-key}"
print(link)
x = urllib.request.urlopen(link)
url = x.read()
out.write(str(url, 'utf-8'))
returnJson = json.loads(url)
print('\n')
print(returnJson["main"]["temp"])
You can achieve what you want by looping through a list of ZIP codes and creating a new URL from each one:

import urllib.request
import json

zipCodes = ['123', '231', '121']
out = open("output.txt", "w")
for i in zipCodes:
    link = "http://api.openweathermap.org/data/2.5/weather?zip=" + i + ",us&appid={api-key}"
    x = urllib.request.urlopen(link)
    url = x.read()
    out.write(str(url, 'utf-8'))
    returnJson = json.loads(url)
    print(returnJson["main"]["temp"])
out.close()
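As a side note, the query string can also be built with urllib.parse.urlencode instead of string concatenation. This is only a sketch with a placeholder API key; note that urlencode percent-escapes the comma in the zip parameter, which the server should decode normally:

import urllib.parse

API_KEY = 'your-api-key'  # placeholder -- substitute a real key
params = urllib.parse.urlencode({'zip': '90210,us', 'appid': API_KEY})
link = 'http://api.openweathermap.org/data/2.5/weather?' + params
print(link)  # ...weather?zip=90210%2Cus&appid=your-api-key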
Downloading sites from a list
So, I am a bit new to Python, and I can't wrap my head around why this code snippet is not working. In short, I have a list of 500 sites, all in the format https://www.domain . com/subfolder/subfolder, separated by newlines, and I am trying to download them. This is the code:

import wget

f = open("500_sites.txt", "r")
content = f.readlines()
url = ""
for x in range(1, len(content)):
    print(content[x])
    wget.download(content[x], 'index.html')
    input("wait a bit")

I am expecting the code to read the text file line by line into the content list. Then, I would like the wget.download() function to download the whole source code of the content[x] webpage. Using wget.download() with a hard-coded variable, it works perfectly:

...
url = "https://domain . com/subfolder/subfolder"
wget.download(url, 'index.html')
...

Thanks in advance!
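The likely culprits here are that readlines() keeps the trailing newline on each URL, which breaks wget.download(), and that range(1, len(content)) skips the first line. A minimal sketch of a fix; the indexed filename is my own choice so each download gets a unique name:

import wget

with open("500_sites.txt") as f:
    urls = [line.strip() for line in f if line.strip()]  # drop newlines and blank lines

for i, url in enumerate(urls):
    print(url)
    wget.download(url, 'index_{}.html'.format(i))  # unique name per site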
Parsing the file name from a list of URL links
OK, so I am using a script that downloads files from the URLs listed in urls.txt:

import urllib.request

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    urllib.request.urlretrieve(link)

Unfortunately, they are saved as temporary files due to the lack of a second argument in my urllib.request.urlretrieve call. As there are thousands of links in my text file, naming them separately is not an option. The thing is that the name of the file is contained in those links, i.e.

/DocumentXML2XLSDownload.vm?firsttime=true&repengback=true&documentId=XXXXXX&xslFileName=rher2xml.xsl&outputFileName=XXXX_2017_06_25_4.xls

where the name of the file comes after outputFileName=. Is there an easy way to parse out the file names and then use them as the second argument to urllib.request.urlretrieve? I was thinking of extracting those names in Excel and placing them in another text file that would be read in a similar fashion to urls.txt, but I'm not sure how to implement it in Python. Or is there a way to do it exclusively in Python, without using Excel?
You could parse the link on the go. Example using a regular expression:

import re
import urllib.request

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    regexp = r'((?<=\?outputFileName=)|(?<=\&outputFileName=))[^&]+'
    match = re.search(regexp, link.rstrip())
    if match is None:
        # Make the user aware that something went wrong, e.g. raise an
        # exception and/or just print something
        print("WARNING: Couldn't find file name in link [" + link + "]. Skipping...")
    else:
        file_name = match.group(0)
        urllib.request.urlretrieve(link, file_name)
You can use urlparse and parse_qs to get values out of the query string:

from urlparse import urlparse, parse_qs

parse = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html?name=Python&version=2')
print(parse_qs(parse.query)['name'][0])  # prints Python
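Since the question's script uses Python 3's urllib.request, the same idea in Python 3 (where urlparse moved to urllib.parse) applied to the question's example link looks like this:

from urllib.parse import urlparse, parse_qs

link = '/DocumentXML2XLSDownload.vm?firsttime=true&repengback=true&documentId=XXXXXX&xslFileName=rher2xml.xsl&outputFileName=XXXX_2017_06_25_4.xls'
file_name = parse_qs(urlparse(link).query)['outputFileName'][0]
print(file_name)  # XXXX_2017_06_25_4.xls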
How do I scrape images using Python while ignoring their height & width in the URL?
I'm attempting to write a Python script to download images from an API. The API returns the images in a format like this: https://stackoverflow.com/media/GetImage?ID=98383838&imageName=03833883.jpg&width=640&height=480 with each image on a new line. I'm trying to use urllib, but struggling to figure out how to ignore the width/height while processing each jpg, as I want the full-size images rather than the 640x480 ones. I've been testing with the following:

import urllib
import re

input_file = open('imgurls.txt', 'r')
x = 0
for line in input_file:
    URL = line
    urllib.urlretrieve(URL, str(x) + ".jpg")
    x += 1

I'm not sure how to approach the width/height issue. I believe I should use rsplit, but I'm not really sure. I'll also need to skip to the next line if the line being read is not a URL, to avoid errors.
cricket_007's answer looks great to me. A slightly more robust approach is to use urlparse to break up the URL, remove the query parameters you don't need, and reconstruct it:

import urllib
import urlparse

url = 'https://stackoverflow.com/media/GetImage?ID=98383838&imageName=03833883.jpg&width=640&height=480'
parsed = urlparse.urlparse(url)
parsed_query = urlparse.parse_qs(parsed.query)
parsed_query.pop('width', None)
parsed_query.pop('height', None)
result = urlparse.urlunparse((parsed.scheme, parsed.netloc, parsed.path,
                              parsed.params, urllib.urlencode(parsed_query, True),
                              parsed.fragment))

Note that parse_qs returns a dict, so the reconstructed query string may list the remaining parameters in a different order.
You can split off the last two query parameters from the URL, then join the URL back together:

url = 'https://stackoverflow.com/media/GetImage?ID=98383838&imageName=03833883.jpg&width=640&height=480'
full_img_url = '&'.join(url.split('&')[:-2])
# 'https://stackoverflow.com/media/GetImage?ID=98383838&imageName=03833883.jpg'

This assumes width and height are always last.
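Neither snippet covers the asker's last point, skipping lines that are not URLs. A hedged sketch combining the split approach above with a simple scheme check; the str(x) + ".jpg" filename scheme is taken from the question:

import urllib

x = 0
for line in open('imgurls.txt'):
    url = line.strip()
    if not url.startswith(('http://', 'https://')):
        continue  # skip blank or non-URL lines
    full_img_url = '&'.join(url.split('&')[:-2])  # drop width/height, as above
    urllib.urlretrieve(full_img_url, str(x) + ".jpg")
    x += 1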
Creating a Python program that scrapes files from a website
This is what I have so far:

import urllib

Champions = ["Aatrox","Ahri","Akali","Alistar","Amumu","Anivia","Annie","Ashe","Azir","Blitzcrank",
             "Brand","Braum","Caitlyn","Cassiopeia","ChoGath","Corki","Darius","Diana","DrMundo","Draven",
             "Elise","Evelynn","Ezreal","Fiddlesticks","Fiora","Fizz","Galio","Gangplank","Garen","Gnar",
             "Gragas","Graves","Hecarim","Heimerdinger","Irelia","Janna","JarvanIV","Jax","Jayce","Jinx",
             "Kalista","Karma","Karthus","Kassadin","Katarina","Kayle","Kennen","KhaZix","KogMaw","LeBlanc",
             "LeeSin","Leona","Lissandra","Lucian","Lulu","Lux","Malphite","Malzahar","Maokai","MasterYi",
             "MissFortune","Mordekaiser","Morgana","Nami","Nasus","Nautilus","Nidalee","Nocturne","Nunu","Olaf",
             "Orianna","Pantheon","Poppy","Quinn","Rammus","RekSai","Renekton","Rengar","Riven","Rumble",
             "Ryze","Sejuani","Shaco","Shen","Shyvana","Singed","Sion","Sivir","Skarner","Sona",
             "Soraka","Swain","Syndra","Talon","Taric","Teemo","Thresh","Tristana","Trundle","Tryndamere",
             "TwistedFate","Twitch","Udyr","Urgot","Varus","Vayne","Veigar","VelKoz","Vi","Viktor",
             "Vladimir","Volibear","Warwick","Wukong","Xerath","XinZhao","Yasuo","Yorick","Zac","Zed",
             "Ziggs","Zilean","Zyra"]
currentCount = 0
while currentCount < len(Champions):
    urllib.urlretrieve("http://www.lolflavor.com/champions/" + Champions[currentCount] +
                       "/Recommended/" + Champions[currentCount] + "_lane_scrape.json",
                       "C:\Users\Jay\Desktop\LolFlavor\ " + Champions[currentCount] +
                       "\ " + Champions[currentCount] + "_lane_scrape.json")
    currentCount += 1

What the program is meant to do is use the list and currentCount to get the champion, then go to the website, e.g. for "Aatrox": http://www.lolflavor.com/champions/Aatrox/Recommended/Aatrox_lane_scrape.json. It then downloads and stores the file as LolFlavor/Aatrox/Aatrox_lane_scrape.json (in this case), where the Aatrox part changes depending on the champion. Can anyone help me get it to work?
EDIT: current code, which raises a ValueError:

import json
import os
import requests

Champions = [...]  # same list as above

for champ in Champions:
    os.makedirs("C:\\Users\\Jay\\Desktop\\LolFlavor\\{}\\Recommended".format(champ), exist_ok=True)
    with open(r"C:\Users\Jay\Desktop\LolFlavor\{}\Recommended\{}_lane_scrape.json".format(champ, champ), "w") as f:
        r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_lane_scrape.json".format(champ, champ))
        json.dump(r.json(), f)
    with open(r"C:\Users\Jay\Desktop\LolFlavor\{}\Recommended\{}_jungle_scrape.json".format(champ, champ), "w") as f:
        r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_jungle_scrape.json".format(champ, champ))
        json.dump(r.json(), f)
    with open(r"C:\Users\Jay\Desktop\LolFlavor\{}\Recommended\{}_support_scrape.json".format(champ, champ), "w") as f:
        r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_support_scrape.json".format(champ, champ))
        json.dump(r.json(), f)
    with open(r"C:\Users\Jay\Desktop\LolFlavor\{}\Recommended\{}_aram_scrape.json".format(champ, champ), "w") as f:
        r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_aram_scrape.json".format(champ, champ))
        json.dump(r.json(), f)
import requests

Champions = [...]  # same list as above

for champ in Champions:
    r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_lane_scrape.json".format(champ, champ))
    print(r.json())

If you want to save each one to a file, dump the json:

import json
import simplejson

for champ in Champions:
    with open(r"C:\Users\Jay\Desktop\LolFlavor\{}_lane_scrape.json".format(champ), "w") as f:
        try:
            r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_lane_scrape.json".format(champ, champ))
            json.dump(r.json(), f)
        except simplejson.scanner.JSONDecodeError:
            # the exception object has no .r attribute; r.url on the
            # response is the failing request
            print(r.url)

The error comes from a 404 (file or directory not found): one of the calls fails, so there is no valid json to decode. The offending URL is http://www.lolflavor.com/champions/Wukong/Recommended/Wukong_lane_scrape.json, which will also give you a 404 error if you try it in your browser. That is because there is no champion page for Wukong, which you can confirm by opening http://www.lolflavor.com/champions/Wukong/ in your browser.

There is no need to index the list with a while loop; simply iterate over the list items directly and use str.format to pass the variables into the URL. Also, make sure you use a raw string (the r prefix) for file paths containing \, since backslashes have a special meaning in Python: they are used to escape characters, so \n or \r etc. in your paths would cause problems. You can also use / or escape with \\.
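A hedged hardening sketch for the same loop: check the HTTP status before decoding, so missing champions (like the Wukong 404 above) are skipped instead of raising. The URL scheme is taken from the question; the shortened list is for illustration only:

import requests

Champions = ["Aatrox", "Ahri", "Wukong"]  # shortened list for illustration

for champ in Champions:
    url = "http://www.lolflavor.com/champions/{}/Recommended/{}_lane_scrape.json".format(champ, champ)
    r = requests.get(url)
    if r.status_code != 200:
        # e.g. the Wukong URL returns 404, so skip it rather than crash
        print("Skipping {}: HTTP {}".format(champ, r.status_code))
        continue
    data = r.json()
    print(data)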