I am working on my first Python project. I want to make a crawler that visits a website and extracts all its links (to a depth of 2). It should store the links in two lists that form a one-to-one register correlating each source link with the target links it contains. Then it should create a CSV file with two columns (Target and Source), so I can open it with Gephi to create a graph of the site's structure.
The code breaks down at the for loop in the code execution section: it just never stops extracting links (I've tried it on a fairly small blog, and it never ends). What is the problem, and how can I solve it?
A few points to consider:
- I'm really new to programming and Python, so I realize my code must be pretty unpythonic. Also, since I've pieced it together while searching for ways to build it and solve my problems, it is somewhat patchy. Sorry, and thanks for your help!
import os
import re
import csv
import urllib
from itertools import izip

myurl = raw_input("Introduce URL to crawl => ")
Dominios = myurl.split('.')
Dominio = Dominios[1]
#Variables Block 1
Target = []
Source = []
Estructura = [Target, Source]
links = []
#Variables Block 2
csv_columns = ['Target', 'Source']
csv_data_list = Estructura
currentPath = os.getcwd()
csv_file = "crawleo_%s.csv" % Dominio
# Block 1 => Extract links from a page
def page_crawl(seed):
    try:
        for link in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(seed).read(), re.I):
            Source.append(seed)
            Target.append(link)
            links.append(link)
    except IOError:
        pass
# Block 2 => Write csv file
def WriteListToCSV(csv_file, csv_columns, csv_data_list):
    try:
        with open(csv_file, 'wb') as csvfile:
            writer = csv.writer(csvfile, dialect='excel', quoting=csv.QUOTE_NONNUMERIC)
            writer.writerow(csv_columns)
            writer.writerows(izip(Target, Source))
    except IOError as (errno, strerror):
        print("I/O error({0}): {1}".format(errno, strerror))
    return
# Block 3 => Code execution
page_crawl(myurl)
seed_links = (links)
for sublink in seed_links:  # Problem is with this loop
    page_crawl(sublink)
seed_sublinks = (links)
## print Estructura  # Line just to check if code was working
#for thirdlinks in seed_sublinks:  # Commented out until prior problems are solved
#    page_crawl(thirdlinks)
WriteListToCSV(csv_file, csv_columns, csv_data_list)
seed_links and links point to the same list. So when you add elements to links inside the page_crawl function, you are also extending the list the for loop is iterating over. What you need to do is clone the list when you create seed_links.
This happens because Python variables hold references to objects; multiple names can point to the same object.
If you want to see this with your own eyes, try printing sublink inside the for loop. You will notice that more links get printed than you initially put in. You will probably also notice that you are trying to crawl the entire web :-)
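A minimal sketch of that fix, keeping the rest of the question's code as-is: make a copy of links before looping over it.

page_crawl(myurl)
seed_links = list(links)   # a copy (links[:] also works), so appends inside page_crawl don't grow this list
for sublink in seed_links:
    page_crawl(sublink)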
I don't immediately see what is wrong, but a few remarks:
- You work with global variables, which is bad practice. Better to use local variables and pass results back via return.
- Is it possible that a link at the second level refers back to the first level? Then you have a cycle in the data, and you need to make provisions for that (keep track of pages you have already visited), so you need to inspect what is returned.
- I would implement this recursively (with the provisions above), because that makes the code simpler, albeit a little more abstract; see the sketch below.
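A rough sketch of that recursive approach, reusing the question's Python 2 urllib/regex extraction (the names crawl, visited, and pairs are mine, not from the original code):

import re
import urllib

def crawl(seed, visited, pairs, depth=2):
    # Stop at the requested depth and never revisit a page (prevents cycles).
    if depth == 0 or seed in visited:
        return
    visited.add(seed)
    try:
        html = urllib.urlopen(seed).read()
    except IOError:
        return
    for link in re.findall('''href=["'](.[^"']+)["']''', html, re.I):
        pairs.append((link, seed))   # (Target, Source) rows for the CSV
        crawl(link, visited, pairs, depth - 1)

pairs = []
crawl(myurl, set(), pairs)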
Related
I am very new to programming and trying to learn by doing: creating a text adventure game while reading Python documentation and blogs.
My issue is that I'm attempting to save/load data in a text game to create elements that carry over from game to game and are passed as arguments. Specifically, my goal is to recall, update, and save an incrementing iteration number each time the game is played past the intro: import the saved march_iteration number, display it to the user as a default name suggestion, then increment it and save the updated march_iteration number.
From my attempts at debugging, I seem to be updating the value and saving the updated value of 2 to the game.sav file correctly, so I believe my issue is either that I'm failing to load the data properly or that I'm overwriting the saved value with the static one somehow. I've read as much documentation as I can find, but from the articles I've read on saving and loading to JSON I cannot identify where my code is wrong.
Below is a small code snippet I wrote just to try and get the save/load working. Any insight would be greatly appreciated.
import json

def _save(dummy):
    f = open("game.sav", 'w+')
    json.dump(world_states, f)
    f.close

def _continue(dummy):
    f = open("game.sav", 'r+')
    world_states = json.load(f)
    f.close

world_states = {
    "march_iteration": 1
}

def _resume():
    _continue("")

_resume()
print("world_states['march_iteration']", world_states['march_iteration'])
current_iteration = world_states["march_iteration"]

def name_the_march(curent_iteration=world_states["march_iteration"]):
    march_name = input("\nWhat is the name of your march? We suggest TrinMar#{}. >".format(current_iteration))
    if len(march_name) == 0:
        print("\nThe undifferentiated units shift nervously, unnerved and confused, perhaps even angry.")
        print("\nPlease give us a proper name executor. The march must not be nameless, that would be chaos.")
        name_the_march()
    else:
        print("\nThank you Executor. The {} march begins its long journey.".format(march_name))
        world_states['march_iteration'] = (world_states['march_iteration'] + 1)
        print("world_states['march_iteration']", world_states['march_iteration'])
        # Line above used only for debugging purposes
        _save("")

name_the_march()
I seem to have found a solution which works for my purposes, allowing me to load, update, and resave. It isn't the most efficient, but it works; the prints are just there to show the number being properly loaded and updated before being resaved.
Pre-requisite: This example assumes you've already created a file for this to open.
import json

# Initial data
iteration = 1

# Restore previously saved data from a file
with open('filelocation/filename.json') as f:
    iteration = json.load(f)
print(iteration)

iteration = iteration + 1
print(iteration)

# Save the updated data
f = open("filename.json", 'w')
json.dump(iteration, f)
f.close()
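For completeness, the same load/update/save round-trip can be applied to the whole world_states dictionary from the question. A minimal sketch, assuming the game.sav layout used above (not part of the original answer):

import json

world_states = {"march_iteration": 1}

# Restore previously saved state, if a save file exists.
try:
    with open("game.sav") as f:
        world_states = json.load(f)
except IOError:
    pass  # first run: keep the defaults

world_states["march_iteration"] += 1

# Save the updated state.
with open("game.sav", "w") as f:
    json.dump(world_states, f)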
I am using the Instaloader package to scrape some data from Instagram.
Ideally, I am looking to scrape the posts associated with a specific hashtag. I created the code below, and it outputs lines of scraped data, but I am unable to write this output to a .txt, .csv, or .json file.
I tried to first append the loop output to a list, but the list stayed empty. My efforts to write the output to a file have also been unsuccessful.
I know I am missing a step here; please share any input you have! Thank you.
import instaloader
import json
L = instaloader.Instaloader()

for posts in L.get_hashtag_posts('NewYorkCity'):
    L.download_hashtag('NewYorkCity', max_count=10)
    with open('output.json', 'w') as f:
        json.dump(posts, f)
    break
print('Done')
Looking at your code, it seems the indentation might be a bit off. When you use open with the "with" statement, it effectively wraps the file object in a try:/finally: for you.
In your case you created a variable f representing this file object. I believe if you make the following change you can write your JSON data to the file.
import instaloader
import json
L = instaloader.Instaloader()

for posts in L.get_hashtag_posts('NewYorkCity'):
    L.download_hashtag('NewYorkCity', max_count=10)
    with open('output.json', 'w') as f:
        f.write(json.dumps(posts))
    break
print('Done')
I am sure you intended this, but if your goal with the break was only to get the first value from the loop, you could make this edit as well:
for posts in L.get_hashtag_posts('NewYorkCity')[0]:
This will return the first item in the list.
If you would like the first 3, for example, you could do [:3]
See this tutorial on Python Data Structures
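One caveat from me (not part of the answer above): as far as I know, get_hashtag_posts returns an iterator rather than a list, so indexing or slicing it directly may fail, and Post objects are not JSON-serializable as-is. A rough sketch of an alternative using itertools.islice (the shortcode/url attributes are ones I believe the Post class exposes; check the instaloader docs):

import json
from itertools import islice
import instaloader

L = instaloader.Instaloader()

# Take only the first 3 posts from the hashtag feed.
first_posts = islice(L.get_hashtag_posts('NewYorkCity'), 3)

# Reduce each Post to plain fields so json can serialize it.
with open('output.json', 'w') as f:
    json.dump([{'shortcode': p.shortcode, 'url': p.url} for p in first_posts], f)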
Okay, so I decided I'd like a program to download osu maps based on the map number (for lack of a better term). After doing some testing with the links to understand the redirecting, I got a program which gets to the .../download page; when I go to said page in a browser, the map downloads. However, when trying to download it via requests, I get HTML instead.
def grab(self, identifier=None):
    if not identifier:
        print("Missing Argument: 'identifier'")
        return
    mapLink = f"https://osu.ppy.sh/beatmaps/{identifier}"
    dl = requests.get(mapLink, allow_redirects=True)
    if not dl:
        print("Error: map not found!")
        return
    mapLink2 = dl.url
    mapLink2 = f"https://osu.ppy.sh/beatmapsets/{self.parseLink(mapLink2)}/download"
    dl = requests.get(mapLink2)
    with open(f"{identifier}.osz", "wb") as f:
        f.write(dl.content)
And, in case it is necessary, here is self.parseLink:
def parseLink(self, mapLink=None):
    if not mapLink:
        return None
    id = mapLink.replace("https://osu.ppy.sh/beatmapsets/", "")
    id = id.split("#")
    return id[0]
Ideally, when I open the file at the end of grab(), it should save a usable .osz file - one which is NOT html, and can be dragged into the actual game and used. Of course, this is still extremely early in my testing, and I will figure out a way to make the filename the song name for convenience.
Edit: an example of an identifier is OsuMaps().grab("1385415"), in case you want to test.
There is a very quick way to get around:
- Needing to be logged in
- Needing a specific element
This workaround comes in the form of https://bloodcat.com/osu/ - to get a download link directly to a map, all you need is: https://bloodcat.com/osu/s/<beatmap set number>.
Here is an example:
id = "653534" # this map is ILY - Panda Eyes
mapLink = f"https://bloodcat.com/osu/s/{id}" # adds id to the link
dl = requests.get(mapLink)
if len(dl.content) > 330: # see below for explanation
with open(f"{lines[i].rstrip()}.osz", "wb") as f:
f.write(dl.content)
else:
print("Map doesn't exist")
The line if len(dl.content) > 330 is my workaround for a link not working. .osz files contain thousands upon thousands of bytes, whereas the site's "not found" page comes to fewer than 330, so we can use this to check whether the response is too short to be a beatmap.
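As an aside (my addition, not part of the original workaround), a slightly more robust check is to look at the response's Content-Type header, assuming the site serves real .osz downloads with a non-HTML type:

import requests

dl = requests.get(f"https://bloodcat.com/osu/s/{id}")
content_type = dl.headers.get("Content-Type", "")

# A real beatmap download should not come back as an HTML page.
if dl.ok and "text/html" not in content_type:
    with open(f"{id}.osz", "wb") as f:
        f.write(dl.content)
else:
    print("Map doesn't exist")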
That's all! Feel free to use the code if you'd like.
QUESTION:
I am finding issues with the syntax of the code, in particular the for loop which I use to loop through the external file.
My program is a dice game which is supposed to register users and then allow them to log in to the game afterwards. At the end it must access the external file, which has previously been used to store the winners' names (keep in mind the authorised names have a separate file), loop through it, and output the top 5 winners' names and scores to the shell.
I used a for loop to loop through the file and append its contents to an array called 'Top 5 Winners'; however, I seem to struggle with the syntax as I am quite new to Python.
The code that accesses the file.
with open("Top 5 Winners.txt","r") as db:
top5Winners=[]
for i in db(0,len([db])):
top5Winners.append(line)
top5Winners.sort()
top5Winners.reverse()
for i in range(5):
print(top5Winners[i])
Error Code:
for i in db(0,len([db])):
The len() part of the code is the issue
NOTE:
I also wouldn't mind any tips as to how I can make this bit of code more efficient, so I can apply them in my later projects.
Your indentation isn't as it should be. You did open the file and make it readable, but the code that works with it has to be indented under the with statement. See the following example:
with open(file, 'r') as db:
    # code with file (db)
# rest of the code
So you can combine this with your code like this:
top5winners = []  # Make a list variable
with open("Top 5 Winners.txt", "r") as db:  # Open your file
    for i in db:  # Loop through the contents of the file
        top5winners.append(i)  # Append each line to the list
top5winners.sort(reverse=True)  # Sort the list in descending order
for i in range(0, 5):  # Loop through a range
    print(top5winners[i])  # Print items from the list
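If each line in the file stores both a name and a score, you would also want to sort numerically by score rather than sorting the raw lines as strings. A sketch under the assumption of a "name,score" line format (the question doesn't show the actual layout):

top5winners = []
with open("Top 5 Winners.txt", "r") as db:
    for line in db:
        name, score = line.rstrip().split(",")   # assumed "name,score" format
        top5winners.append((int(score), name))

top5winners.sort(reverse=True)                   # highest score first
for score, name in top5winners[:5]:
    print(name, score)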
Please note that StackOverflow is intended for help with specific cases, not a site to ask others to write a piece of code.
Sincerely, Chris Fowl.
I'm relatively new to Python, and I'm working through a screen-scraping application that gathers data from multiple financial sites. I have four procedures for now. Two run in just a couple of minutes, and the other two take hours each. These two look up information on particular stock symbols that I have in a CSV file. There are 4,000+ symbols that I'm using. I know enough to know that the vast majority of the time is spent in IO over the wire. It's essential that I get these down to half an hour each (or better; is that too ambitious?) for this to be of any practical use to me. I'm using Python 3 and BeautifulSoup.
I have the general structure of what I'm doing below. I've abbreviated the conceptually non-essential sections. I'm reading many threads on making multiple calls/threads at once to speed things up, and it seems like there are a lot of options. Can anyone point me in the right direction to pursue, based on the structure of what I have so far? It'd be a huge help. I'm sure it's obvious, but this procedure gets called along with the other data-download procedures in a main driver module. Thanks in advance...
from bs4 import BeautifulSoup
import misc modules

class StockOption:
    def __init__(self, DateDownloaded, OptionData):
        self.DateDownloaded = DateDownloaded
        self.OptionData = OptionData

    def ForCsv(self):
        return [self.DateDownloaded, self.Optiondata]

def extract_options(TableRowsFromBeautifulSoup):
    optionsList = []
    for opt in range(0, len(TableRowsFromBeautifulSoup))
        optionsList.append(StockOption(data parsed from TableRows arg))
    return optionsList

def run_proc():
    symbolList = read in csv file of tickers
    for symb in symbolList:
        webStr = #write the connection string
        try:
            with urllib.request.urlopen(webStr) as url: page = url.read()
            soup = BeautifulSoup(page)
            if soup.text.find('There are no All Markets results for') == -1:
                tbls = soup.findAll('table')
                if len(tbls[9]) > 1:
                    expStrings = soup.findAll('td', text=True, attrs={'align': 'right'})[0].contents[0].split()
                    expDate = datetime.date(int(expStrings[6]), int(currMonth), int(expStrings[5].replace(',', '')))
                    calls = extract_options(tbls[9], symb, 'Call', expDate)
                    puts = extract_options(tbls[13], symb, 'Put', expDate)
                    optionsRows = optionsRows + calls
                    optionsRows = optionsRows + puts
        except urllib.error.HTTPError as err:
            if err.code == 404:
                pass
            else:
                raise

    opts = [0] * (len(optionsRows))
    for option in range(0, len(optionsRows)):
        opts[option] = optionsRows[option].ForCsv()

    # Write to the csv file.
    with open('C:/OptionsChains.csv', 'a', newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        a.writerows(opts)

if __name__ == '__main__':
    run_proc()
There are some mistakes in the abbreviated code you have given, so it is a little hard to follow. If you could show more code and check it, it would be easier to understand your problem.
From the code and the problem description, I have some advice to share with you:
- In the run_proc() function, the code reads a webpage for every symbol. If the URLs are the same or some URLs are repeated, how about downloading those webpages just once, keeping them in memory or on disk, and then analyzing the page contents for every symbol? It will save the repeated round trips over the wire.
- BeautifulSoup makes the code easy to write, but it is a little slow. If lxml can do your work, it will save a lot of time on parsing the webpage contents.
Hope it will help.
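A minimal sketch of the lxml suggestion (my illustration; webStr and the table index come from the question's code and may not match the real page):

import urllib.request
from lxml import html   # pip install lxml

page = urllib.request.urlopen(webStr).read()
tree = html.fromstring(page)

# lxml equivalents of soup.text and soup.findAll('table')
if 'There are no All Markets results for' not in tree.text_content():
    tbls = tree.xpath('//table')
    rows = tbls[9].xpath('.//tr')   # rows of the table that held the call options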
I was pointed in the right direction by the following post (thanks to the authors, by the way):
How to scrape more efficiently with Urllib2?
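The gist of that direction is to run the downloads concurrently, since the time is spent waiting on the network. A hedged sketch with concurrent.futures (my illustration, not the code from the linked post; fetch_symbol and the placeholder URL are hypothetical):

import concurrent.futures
import urllib.request

def fetch_symbol(symb):
    # Hypothetical per-symbol step: build the URL and download the raw page.
    webStr = "http://example.com/options?symbol=" + symb   # placeholder URL
    with urllib.request.urlopen(webStr) as url:
        return symb, url.read()

symbols = ['AAPL', 'MSFT', 'GOOG']   # in the real script, read from the CSV of tickers

# Run up to 20 downloads at a time; parsing can then happen on the results.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for symb, page in pool.map(fetch_symbol, symbols):
        print(symb, len(page))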