Creating a Python program that scrapes files from a website - python

This is what I have so far:
import urllib
Champions=["Aatrox","Ahri","Akali","Alistar","Amumu","Anivia","Annie","Ashe","Azir","Blitzcrank","Brand","Braum","Caitlyn","Cassiopeia","ChoGath","Corki","Darius","Diana","DrMundo","Draven","Elise","Evelynn","Ezreal","Fiddlesticks","Fiora","Fizz","Galio","Gangplank","Garen","Gnar","Gragas","Graves","Hecarim","Heimerdinger","Irelia","Janna","JarvanIV","Jax","Jayce","Jinx","Kalista","Karma","Karthus","Kassadin","Katarina","Kayle","Kennen","KhaZix","KogMaw","LeBlanc","LeeSin","Leona","Lissandra","Lucian","Lulu","Lux","Malphite","Malzahar","Maokai","MasterYi","MissFortune","Mordekaiser","Morgana","Nami","Nasus","Nautilus","Nidalee","Nocturne","Nunu","Olaf","Orianna","Pantheon","Poppy","Quinn","Rammus","RekSai","Renekton","Rengar","Riven","Rumble","Ryze","Sejuani","Shaco","Shen","Shyvana","Singed","Sion","Sivir","Skarner","Sona","Soraka","Swain","Syndra","Talon","Taric","Teemo","Thresh","Tristana","Trundle","Tryndamere","TwistedFate","Twitch","Udyr","Urgot","Varus","Vayne","Veigar","VelKoz","Vi","Viktor","Vladimir","Volibear","Warwick","Wukong","Xerath","XinZhao","Yasuo","Yorick","Zac","Zed","Ziggs","Zilean","Zyra"]
currentCount=0
while currentCount < len(Champions):
    urllib.urlretrieve("http://www.lolflavor.com/champions/"+Champions[currentCount]+ "/Recommended/"+Champions[currentCount]+"_lane_scrape.json","C:\Users\Jay\Desktop\LolFlavor\ " +Champions[currentCount]+ "\ "+Champions[currentCount]+ "_lane_scrape.json")
    currentCount+=1
What the program is meant to do is use the list and currentCount to get the champion, then go to the website, e.g. for "Aatrox": http://www.lolflavor.com/champions/Aatrox/Recommended/Aatrox_lane_scrape.json, and then download and store the file in the folder LolFlavor/Aatrox/Aatrox_lane_scrape.json in this case.
The "Aatrox" part changes depending on the champion.
Can anyone help me get it to work?
EDIT: CURRENT CODE WITH VALUE ERROR:
import json
import os
import requests
Champions=["Aatrox","Ahri","Akali","Alistar","Amumu","Anivia","Annie","Ashe","Azir","Blitzcrank","Brand","Braum","Caitlyn","Cassiopeia","ChoGath","Corki","Darius","Diana","DrMundo","Draven","Elise","Evelynn","Ezreal","Fiddlesticks","Fiora","Fizz","Galio","Gangplank","Garen","Gnar","Gragas","Graves","Hecarim","Heimerdinger","Irelia","Janna","JarvanIV","Jax","Jayce","Jinx","Kalista","Karma","Karthus","Kassadin","Katarina","Kayle","Kennen","KhaZix","KogMaw","LeBlanc","LeeSin","Leona","Lissandra","Lucian","Lulu","Lux","Malphite","Malzahar","Maokai","MasterYi","MissFortune","Mordekaiser","Morgana","Nami","Nasus","Nautilus","Nidalee","Nocturne","Nunu","Olaf","Orianna","Pantheon","Poppy","Quinn","Rammus","RekSai","Renekton","Rengar","Riven","Rumble","Ryze","Sejuani","Shaco","Shen","Shyvana","Singed","Sion","Sivir","Skarner","Sona","Soraka","Swain","Syndra","Talon","Taric","Teemo","Thresh","Tristana","Trundle","Tryndamere","TwistedFate","Twitch","Udyr","Urgot","Varus","Vayne","Veigar","VelKoz","Vi","Viktor","Vladimir","Volibear","Warwick","Wukong","Xerath","XinZhao","Yasuo","Yorick","Zac","Zed","Ziggs","Zilean","Zyra"]
for champ in Champions:
    os.makedirs("C:\\Users\\Jay\\Desktop\\LolFlavor\\{}\\Recommended".format(champ), exist_ok=True)
    with open(r"C:\Users\Jay\Desktop\LolFlavor\{}\Recommended\{}_lane_scrape.json".format(champ,champ),"w") as f:
        r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_lane_scrape.json".format(champ,champ))
        json.dump(r.json(),f)
    with open(r"C:\Users\Jay\Desktop\LolFlavor\{}\Recommended\{}_jungle_scrape.json".format(champ,champ),"w") as f:
        r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_jungle_scrape.json".format(champ,champ))
        json.dump(r.json(),f)
    with open(r"C:\Users\Jay\Desktop\LolFlavor\{}\Recommended\{}_support_scrape.json".format(champ,champ),"w") as f:
        r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_support_scrape.json".format(champ,champ))
        json.dump(r.json(),f)
    with open(r"C:\Users\Jay\Desktop\LolFlavor\{}\Recommended\{}_aram_scrape.json".format(champ,champ),"w") as f:
        r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_aram_scrape.json".format(champ,champ))
        json.dump(r.json(),f)

import requests
Champions=["Aatrox","Ahri","Akali","Alistar","Amumu","Anivia","Annie","Ashe","Azir","Blitzcrank","Brand","Braum","Caitlyn","Cassiopeia","ChoGath","Corki","Darius","Diana","DrMundo","Draven","Elise","Evelynn","Ezreal","Fiddlesticks","Fiora","Fizz","Galio","Gangplank","Garen","Gnar","Gragas","Graves","Hecarim","Heimerdinger","Irelia","Janna","JarvanIV","Jax","Jayce","Jinx","Kalista","Karma","Karthus","Kassadin","Katarina","Kayle","Kennen","KhaZix","KogMaw","LeBlanc","LeeSin","Leona","Lissandra","Lucian","Lulu","Lux","Malphite","Malzahar","Maokai","MasterYi","MissFortune","Mordekaiser","Morgana","Nami","Nasus","Nautilus","Nidalee","Nocturne","Nunu","Olaf","Orianna","Pantheon","Poppy","Quinn","Rammus","RekSai","Renekton","Rengar","Riven","Rumble","Ryze","Sejuani","Shaco","Shen","Shyvana","Singed","Sion","Sivir","Skarner","Sona","Soraka","Swain","Syndra","Talon","Taric","Teemo","Thresh","Tristana","Trundle","Tryndamere","TwistedFate","Twitch","Udyr","Urgot","Varus","Vayne","Veigar","VelKoz","Vi","Viktor","Vladimir","Volibear","Warwick","Wukong","Xerath","XinZhao","Yasuo","Yorick","Zac","Zed","Ziggs","Zilean","Zyra"]
for champ in Champions:
    r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_lane_scrape.json".format(champ,champ))
    print(r.json())
If you want to save each to a file, dump the JSON:
import json
import simplejson

for champ in Champions:
    with open(r"C:\Users\Jay\Desktop\LolFlavor\{}_lane_scrape.json".format(champ),"w") as f:
        try:
            r = requests.get("http://www.lolflavor.com/champions/{}/Recommended/{}_lane_scrape.json".format(champ, champ))
            json.dump(r.json(),f)
        except simplejson.scanner.JSONDecodeError:
            print(r.url)
The error comes from a 404 - File or directory not found: one of your calls fails, so there is no valid JSON to decode.
The offending url is:
u'http://www.lolflavor.com/champions/Wukong/Recommended/Wukong_lane_scrape.json'
which, if you try it in your browser, will also give you a 404 error. That is because there is no champion page for Wukong, which can be confirmed by opening http://www.lolflavor.com/champions/Wukong/ in your browser.
There is no need to index the list using a while loop; simply iterate over the list items directly and use str.format to pass the variables into the URL. Also make sure you use a raw string (the r prefix) for the file path when using backslashes, as they have a special meaning in Python: they are used to escape characters, so \n or \r etc. in your paths would cause problems. You can also use / or escape them with \\.
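Putting those points together, here is a minimal sketch (assuming the Champions list defined above; the save folder is a placeholder you would adjust) that iterates over the list directly, skips champions the site does not serve (such as the Wukong 404 above), and sidesteps backslash escaping by building paths with os.path.join:
import json
import os

import requests

save_root = r"C:\Users\Jay\Desktop\LolFlavor"    # placeholder: adjust to your own folder

for champ in Champions:                          # the list defined above
    url = "http://www.lolflavor.com/champions/{}/Recommended/{}_lane_scrape.json".format(champ, champ)
    r = requests.get(url)
    if r.status_code != 200:                     # e.g. 404 for champions the site does not have
        print("Skipping {}: HTTP {}".format(champ, r.status_code))
        continue
    champ_dir = os.path.join(save_root, champ)
    os.makedirs(champ_dir, exist_ok=True)        # create the per-champion folder (Python 3)
    with open(os.path.join(champ_dir, "{}_lane_scrape.json".format(champ)), "w") as f:
        json.dump(r.json(), f)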

Related

Search for a word in webpage and save to TXT in Python

I am trying to: load links from a .txt file, search for a specific word, and if the word exists on that webpage, save the link to another .txt file. But I am getting the error: No scheme supplied. Perhaps you meant http://<_io.TextIOWrapper name='import.txt' mode='r' encoding='cp1250'>?
Note: the links have HTTPS://
The code:
import requests
list_of_pages = open('import.txt', 'r+')
save = open('output.txt', 'a+')
word = "Word"
save.truncate(0)
for page_link in list_of_pages:
    res = requests.get(list_of_pages)
    if word in res.text:
        response = requests.request("POST", url)
        save.write(str(response) + "\n")
Can anyone explain why? Thank you in advance!
Try putting http:// in front of the links.
When you use res = requests.get(list_of_pages) you're creating an HTTP connection to list_of_pages. But requests.get takes a URL string as a parameter (e.g. http://localhost:8080/static/image01.jpg), and look at what list_of_pages is - it's an already opened file, not a string. You have to either use the requests library or the file IO API, not both.
If you have an already opened file, you don't need to create an HTTP request at all. You don't need requests.get(). Parse list_of_pages like a normal, local file.
Or, if you would like to go the other way, don't open the text file at all; make list_of_pages a string with the URL of that file.
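For illustration, here is a minimal sketch of what the question seems to be after - reading the file line by line and passing each line (a URL string) to requests.get. It assumes import.txt holds one URL per line and reuses the word and output file from the question:
import requests

word = "Word"

with open('import.txt', 'r') as list_of_pages, open('output.txt', 'w') as save:
    for page_link in list_of_pages:
        page_link = page_link.strip()       # drop the trailing newline
        if not page_link:
            continue
        res = requests.get(page_link)       # pass the URL string, not the file object
        if word in res.text:
            save.write(page_link + "\n")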

Parsing the file name from list of url links

OK, so I am using a script that downloads files from URLs listed in urls.txt.
import urllib.request
with open("urls.txt", "r") as file:
linkList = file.readlines()
for link in linkList:
urllib.request.urlretrieve(link)
Unfortunately they are saved as temporary files due to the lack of a second argument in my urllib.request.urlretrieve call. As there are thousands of links in my text file, naming them separately is not an option. The thing is that the name of the file is contained in those links, i.e. /DocumentXML2XLSDownload.vm?firsttime=true&repengback=true&documentId=XXXXXX&xslFileName=rher2xml.xsl&outputFileName=XXXX_2017_06_25_4.xls where the name of the file comes after outputFileName=
Is there an easy way to parse out the file names and then use them as the second argument of urllib.request.urlretrieve? I was thinking of extracting those names in Excel and placing them in another text file that would be read in a similar fashion to urls.txt, but I'm not sure how to implement it in Python. Or is there a way to do it exclusively in Python, without using Excel?
You could parse the link on the go.
Example using a regular expression:
import re
import urllib.request

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    link = link.rstrip()                    # drop the trailing newline before using the URL
    regexp = r'((?<=\?outputFileName=)|(?<=&outputFileName=))[^&]+'
    match = re.search(regexp, link)
    if match is None:
        # Make the user aware that something went wrong, e.g. raise an exception
        # and/or just print something
        print("WARNING: Couldn't find file name in link [" + link + "]. Skipping...")
    else:
        file_name = match.group(0)
        urllib.request.urlretrieve(link, file_name)
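For a quick sanity check, here is the same pattern applied to the sample link from the question (the XXXX values are the question's own placeholders):
import re

link = ("/DocumentXML2XLSDownload.vm?firsttime=true&repengback=true"
        "&documentId=XXXXXX&xslFileName=rher2xml.xsl"
        "&outputFileName=XXXX_2017_06_25_4.xls")
regexp = r'((?<=\?outputFileName=)|(?<=&outputFileName=))[^&]+'
print(re.search(regexp, link).group(0))  # XXXX_2017_06_25_4.xls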
You can use urlparse and parse_qs to get the query string
from urlparse import urlparse, parse_qs   # Python 2; in Python 3: from urllib.parse import urlparse, parse_qs
parse = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html?name=Python&version=2')
print(parse_qs(parse.query)['name'][0])   # prints Python
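Applied to the question's URL, the same approach pulls out the outputFileName value directly. A sketch using the question's placeholder values (Python 3 import shown):
from urllib.parse import urlparse, parse_qs

link = ("/DocumentXML2XLSDownload.vm?firsttime=true&repengback=true"
        "&documentId=XXXXXX&xslFileName=rher2xml.xsl"
        "&outputFileName=XXXX_2017_06_25_4.xls")

query = parse_qs(urlparse(link).query)
file_name = query['outputFileName'][0]
print(file_name)  # XXXX_2017_06_25_4.xls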

Make python save to a folder created in the directory of the py file being run

I'm trying to save a bunch of pages in a folder next to the .py file that creates them. I'm on Windows, so when I try to add the trailing backslash to the file path it turns into an escape character instead.
Here's what I'm talking about:
from bs4 import BeautifulSoup
import urllib2, urllib
import csv
import requests
from os.path import expanduser
print "yes"
with open('intjpages.csv', 'rb') as csvfile:
    pagereader = csv.reader(open("intjpages.csv","rb"))
    i=0
    for row in pagereader:
        print row
        agentheader = {'User-Agent': 'Nerd'}
        request = urllib2.Request(row[0],headers=agentheader)
        url = urllib2.urlopen(request)
        soup = BeautifulSoup(url)
        for div in soup.findAll('div', {"class" : "side"}):
            div.extract()
        body = soup.find_all("div", { "class" : "md" })
        name = "page" + str(i) + ".html"
        path_to_file = "\cleanishdata\"
        outfile = open(path_to_file + name, 'w')
        #outfile = open(name,'w') #this works fine
        body=str(body)
        outfile.write(body)
        outfile.close()
        i+=1
I can save the files to the same folder the .py file is in, but when I process the files using RapidMiner it picks up the program file too. Also, it would just be neater if I could save them in their own directory.
I am surprised this hasn't already been answered on the entire internet.
EDIT: Thanks so much! I ended up using information from both of your answers. IDLE was making me use r'\string\' to concatenate the strings with the backslashes. I needed to use the path_to_script technique of abamert to solve the problem of creating a new folder wherever the .py file is. Thanks again! Here are the relevant code changes:
name = "page" + str(i) + ".txt"
path_to_script_dir = os.path.dirname(os.path.abspath("links.py"))
newpath = path_to_script_dir + r'\\' + 'cleanishdata'
if not os.path.exists(newpath): os.makedirs(newpath)
outfile = open(path_to_script_dir + r'\\cleanishdata\\' + name, 'w')
body=str(body)
outfile.write(body)
outfile.close()
i+=1
Are you sure you're escaping your backslashes properly?
The \" in your string "\cleanishdata\" is actually an escaped quote character (").
You probably want
"\\cleanishdata\\"
(note that a raw string such as r"\cleanishdata\" would not work here either, because a raw string literal cannot end with a single backslash).
You probably also want to check out the os.path library, particularly os.path.join and os.path.dirname.
For example, if your file is in C:\Base\myfile.py and you want to save files to C:\Base\cleanishdata\output.txt, you'd want:
os.path.join(
    os.path.dirname(os.path.abspath(sys.argv[0])),  # C:\Base\
    'cleanishdata',
    'output.txt')
A better solution than hardcoding the path to the .py file is to just ask Python for it:
import sys
import os
path_to_script = sys.argv[0]
path_to_script_dir = os.path.dirname(os.path.abspath(path_to_script))
Also, it's generally better to use os.path methods instead of string manipulation:
outfile = open(os.path.join(path_to_script_dir, name), 'w')
Besides making your program continue to work as expected even if you move it to a different location or install it on another machine or give it to a friend, getting rid of the hardcoded paths and the string-based path concatenation means you don't have to worry about backslashes anywhere, and this problem never arises in the first place.

Python: Take URL from list and access it

I'm trying to take a URL from a list (~1500 entries) and access them one by one using the twill lib for python. The reason that I'm using twill is because I like it and I might have to perform basic formfilling later on.
The problem I have is declaring the contents of the loop.
I'm sure this is actually pretty simple to solve, but the solution just won't come to my mind at the moment.
from twill.commands import *
CONTAINER = open('urls.txt')  # opening file
CONTAINER_CONTENTS = CONTAINER.readlines()  # reading
CONTAINER_CONTENTS = map(lambda s: s.strip(), CONTAINER_CONTENTS)  # this is just to remove the newline that was appended to each URL

for i in CONTAINER_CONTENTS:
    <educate me>
    ..
    go(url)
    etc.
Thanks in Advance.
from twill.commands import *

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        go(url)
        # now do something with the page

Creating a urlretrieve function using urllib2 in Python

I want to have a function which can save a page from the web into a designated path using urllib2.
The problem with urllib is that it doesn't check for Error 404; urllib2 can check for HTTP errors, but unfortunately it doesn't have an equivalent of urlretrieve.
How can I make a function to save the file permanently to a path?
def save(url,path):
    g=urllib2.urlopen(url)
    *do something to save g to 'path'*
Just use .read() to get the contents and write it to a file path.
def save(url, path):
    g = urllib2.urlopen(url)
    with open(path, "w") as fH:
        fH.write(g.read())
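Since the question is specifically about catching a 404, here is a minimal sketch of the same function with error handling added; urllib2 raises urllib2.HTTPError for 404 and other HTTP error codes, and binary mode is used so non-text content is saved intact:
import urllib2

def save(url, path):
    try:
        g = urllib2.urlopen(url)
    except urllib2.HTTPError as e:   # raised for 404 and other HTTP error codes
        print("Could not fetch {}: HTTP {}".format(url, e.code))
        return False
    with open(path, "wb") as fH:     # binary mode keeps images and other non-text files intact
        fH.write(g.read())
    return True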
