Download numbered image files into numbered folders from the internet using Python

The Problem
http://www.fdci.org/imagelibrary/EventCollection/1980/Big/IMG_2524.jpg
I have a link of this type, where 1980 is the initial folder number that changes, and the image filenames in the format IMG_2524.jpg change as well.
What I wish to do is download all images from these URLs by iterating over both numbers: 1900-2000 for the folder and IMG_2000.jpg to IMG_4000.jpg for the filename.
The downloaded files must be saved inside a folder named after the folder number they come from.
I think a for loop should be the way to go, but being a newbie I am somewhat lost.
Please help, thanks.
UPDATE
text_file = open('Output.txt', 'w')
for i in xrange(1900,2001):
    for j in xrange(2000, 4001):
        year = str(i)
        image = str(j)
        new_link = 'http://www.fdci.org/imagelibrary/EventCollection/'+year+'/Big/IMG_'+image+'.jpg'
        text_file.write(new_link + '\n')  # newline so each link lands on its own line
text_file.close()
Thanks to anmol.

Actually you need two for loops, nested: the outer one over the year and the inner one over the image number. That gives you every IMG_2000-IMG_4000 link for each year in the range 1900-2000.
for i in xrange(1900,2001):
    for j in xrange(2000, 4001):
        year = str(i)
        image = str(j)
        new_link = 'http://www.fdci.org/imagelibrary/EventCollection/'+year+'/Big/IMG_'+image+'.jpg'
        print new_link

# Now you will get the possible links within the given ranges,
# then you can use urllib2 to fetch the response from the link
# and do whatever you wanna do
Sample output:
http://www.fdci.org/imagelibrary/EventCollection/2000/Big/IMG_3994.jpg
http://www.fdci.org/imagelibrary/EventCollection/2000/Big/IMG_3995.jpg
http://www.fdci.org/imagelibrary/EventCollection/2000/Big/IMG_3996.jpg
http://www.fdci.org/imagelibrary/EventCollection/2000/Big/IMG_3997.jpg
http://www.fdci.org/imagelibrary/EventCollection/2000/Big/IMG_3998.jpg
http://www.fdci.org/imagelibrary/EventCollection/2000/Big/IMG_3999.jpg
http://www.fdci.org/imagelibrary/EventCollection/2000/Big/IMG_4000.jpg
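To go from printing links to actually saving files into per-year folders (as the question asks), something like the sketch below could work. It uses the third-party requests library on Python 3 rather than urllib2, but the idea is the same; it simply skips URLs that don't return a 200, since not every number in the range will exist. The ranges and URL pattern are taken from the question.
import os
import requests

base = 'http://www.fdci.org/imagelibrary/EventCollection/{year}/Big/IMG_{num}.jpg'

for year in range(1900, 2001):
    os.makedirs(str(year), exist_ok=True)        # local folder named after the remote one
    for num in range(2000, 4001):
        url = base.format(year=year, num=num)
        resp = requests.get(url)
        if resp.status_code == 200:              # only save images that actually exist
            path = os.path.join(str(year), 'IMG_{}.jpg'.format(num))
            with open(path, 'wb') as fh:
                fh.write(resp.content)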

Related

How to take multiple pages as input in pdfplumber?

I am using pdfplumber to read input from a PDF file.
My question is: how can I take input from pages 1-7 using pdfplumber?
I'm using this code:
filename = "1st Year 1stSemester.pdf"
pdf = pdfplumber.open(filename)
totalpages = len(pdf.pages)
p0 = pdf.pages[0-6]
table = p0.extract_table()
table
I want to take input from page 1 to 7
I've also tried p0 = pdf.pages[0,1,2,3,6]. It also doesn't work.
for i in range(0, 7):
    print(pdf.pages[i].extract_text())
To show the pages, use a for loop over the page indices you want. Just like array indexing, the count starts at 0, so range(0, 7) covers indices 0 through 6, which are pages 1 through 7.
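Since pdf.pages is an ordinary list, a slice covers pages 1-7 directly. A minimal sketch, assuming the same file name as in the question:
import pdfplumber

with pdfplumber.open("1st Year 1stSemester.pdf") as pdf:
    for page in pdf.pages[0:7]:          # pages 1-7 are 0-based indices 0..6
        table = page.extract_table()     # or page.extract_text() for plain text
        print(table)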

Select and filter files in a directory by enumerating in a loop

I have a folder that contains many files with the .EOF extension (satellite orbital information files). As you can see in my example, each file name contains a date range like 20190729_20190731. I want to sort them by that date with Python, then keep only the 1st, 24th, 47th, ... file and delete the others, because I only need the orbit file for every 24 days (for example V20190822T225942_20190824T005942), not one for every day. In other words, starting from the first file I need, I select the file 24 days after it, then the one 24 days after that (47 days after the first), and so on. I want to keep only those files and delete everything else in my EOF source folder. The files I want to keep look like these:
S1A_OPER_AUX_POEORB_OPOD_20190819T120911_V20190729T225942_20190731T005942.EOF
S1A_OPER_AUX_POEORB_OPOD_20190912T120638_V20190822T225942_20190824T005942.EOF
.
.
.
Mr Zach Young wrote the code below and I appreciate him so much; I never thought somebody would help me. I think I'm very close to the goal.
The error is raised at the line print(f'Keeping {eof_file}'). I changed the syntax to print(f"Keeping {eof_file}") but I get the same error.
import os        # needed for os.listdir below
import pprint

items = os.listdir("C:/Users/m/Desktop/EOF")

eof_files = []
for item in items:
    # make sure case of item and '.eof' match
    if item.lower().endswith('.eof'):
        eof_files.append(item)

eof_files.sort(key=lambda fname: fname.split('_')[5])

print('All EOF files, sorted')
pprint.pprint(eof_files)

print('\nKeeping:')

files_to_delete = []
count = 0
offset = 2
for eof_file in eof_files:
    if count == offset:
        print(f"Keeping: [eof_file]")
        # reset count
        count = 0
        continue
    files_to_delete.append(eof_file)
    count += 1

print('\nRemoving:')
for f_delete in files_to_delete:
    print(f'Removing: [f_delete]')
Here's a top-to-bottom demonstration.
I recommend that you:
Run that script as-is and make sure your print statements match mine
Swap in your items = os.listdir(...), and see that your files are properly sorted
Play with the offset variable and make sure you can control what should be kept and what should be deleted; notice that an offset of 2 keeps every third file because count starts at 0
You might need to play around and experiment to make sure you're happy before moving to the final step:
Finally, swap in your os.remove(f_delete)
#!/usr/bin/env python3
import pprint

items = [
    'foo_bar_baz_bak_bam_20190819T120907_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120901_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120905_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120902_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120903_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120904_V2..._SomeOtherDate.EOF',
    'foo_bar_baz_bak_bam_20190819T120906_V2..._SomeOtherDate.EOF',
    'bogus.txt'
]

eof_files = []
for item in items:
    # make sure case of item and '.eof' match
    if item.lower().endswith('.eof'):
        eof_files.append(item)

eof_files.sort(key=lambda fname: fname.split('_')[5])

print('All EOF files, sorted')
pprint.pprint(eof_files)

print('\nKeeping:')

files_to_delete = []
count = 0
offset = 2
for eof_file in eof_files:
    if count == offset:
        print(f'Keeping {eof_file}')
        # reset count
        count = 0
        continue
    files_to_delete.append(eof_file)
    count += 1

print('\nRemoving:')
for f_delete in files_to_delete:
    print(f'Removing {f_delete}')
When I run that, I get:
All EOF files, sorted
['foo_bar_baz_bak_bam_20190819T120901_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120902_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120903_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120904_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120905_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120906_V2..._SomeOtherDate.EOF',
'foo_bar_baz_bak_bam_20190819T120907_V2..._SomeOtherDate.EOF']
Keeping:
Keeping foo_bar_baz_bak_bam_20190819T120903_V2..._SomeOtherDate.EOF
Keeping foo_bar_baz_bak_bam_20190819T120906_V2..._SomeOtherDate.EOF
Removing:
Removing foo_bar_baz_bak_bam_20190819T120901_V2..._SomeOtherDate.EOF
Removing foo_bar_baz_bak_bam_20190819T120902_V2..._SomeOtherDate.EOF
Removing foo_bar_baz_bak_bam_20190819T120904_V2..._SomeOtherDate.EOF
Removing foo_bar_baz_bak_bam_20190819T120905_V2..._SomeOtherDate.EOF
Removing foo_bar_baz_bak_bam_20190819T120907_V2..._SomeOtherDate.EOF
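For reference, here is a sketch of what the final version might look like once the printed output above has been verified, with os.remove() swapped in. It assumes the folder path from the question, and uses enumerate with a modulo test instead of the manual counter; step = 23 keeps the 1st, 24th, 47th, ... file as described in the question, and can be adjusted.
import os

folder = "C:/Users/m/Desktop/EOF"

eof_files = [f for f in os.listdir(folder) if f.lower().endswith('.eof')]
eof_files.sort(key=lambda fname: fname.split('_')[5])

step = 23  # keep the 1st, 24th, 47th, ... file; tune this for the spacing you need
files_to_delete = []
for index, eof_file in enumerate(eof_files):
    if index % step == 0:
        print(f'Keeping {eof_file}')
    else:
        files_to_delete.append(eof_file)

for f_delete in files_to_delete:
    print(f'Removing {f_delete}')
    os.remove(os.path.join(folder, f_delete))  # only enable once the Keeping/Removing lists look right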

How to avoid overwriting data when creating a list? Selenium Webdriver, Python

I want to scrape every page on the following website: https://www.top40.nl/top40/2020/week-34 (for each year and week number) by clicking on a song, moving to 'songinfo', and then scraping all the data in the table listed there. For this question, I have only scraped the title so far.
This is the URL I use:
url = 'https://www.top40.nl/top40/'
However, when I print songs_list, it only returns the last title on the website. As such, I believe I am overwriting it.
Hopefully someone can explain to me which mistake(s) I am making; and if there is an easier way to scrape the table on each page, I'd be very happy to hear it.
Please find my python code below:
for year in range(2015,2016):
    for week in range(1,2):
        page_url = url + str(year) + '/' + 'week-' + str(week)
        driver.get(page_url)
        lists = driver.find_elements_by_xpath("//a[@data-linktype='title']")
        links = []
        for l in lists:
            print(l.get_attribute('href'))
            links.append(l.get_attribute('href'))
        for link in links:
            driver.get(link)
            driver.find_element_by_xpath("//a[@href='#songinfo']").click()
            songs = driver.find_elements_by_xpath(""".//*[@id="songinfo"]/table/tbody/tr[2]/td""")
            songs_list = []
            for s in songs:
                print(s.get_attribute('innerHTML'))
                songs_list.append(s.get_attribute('innerHTML'))
The line songs_list = [] is inside the for link in links loop, so with each new iteration it gets set to an empty list (and then you append to this new, empty list). Once all your loops end, you only see the songs_list created in the final iteration.
The simplest fix is to place the songs_list = [] line outside all for loops, ex:
songs_list = []
for year in range(2015,2016):
    for week in range(1,2):
        # etc
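Putting it together, here is a sketch of the corrected loop under the same assumptions as the question (a driver already created, url = 'https://www.top40.nl/top40/', and the same Selenium 3 style find_elements_by_xpath calls):
songs_list = []
for year in range(2015, 2016):
    for week in range(1, 2):
        driver.get(url + str(year) + '/week-' + str(week))
        # collect the hrefs first, because navigating away invalidates the elements
        links = [a.get_attribute('href')
                 for a in driver.find_elements_by_xpath("//a[@data-linktype='title']")]
        for link in links:
            driver.get(link)
            driver.find_element_by_xpath("//a[@href='#songinfo']").click()
            for cell in driver.find_elements_by_xpath('//*[@id="songinfo"]/table/tbody/tr[2]/td'):
                songs_list.append(cell.get_attribute('innerHTML'))

print(songs_list)  # now holds one entry per song page, not just the last one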

Storing Multi-dimensional Lists?

(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I'm wanting everything to be in a JSON format because I want to save this and load it in again later when I add "tags".
So, to be less vague: I'm writing a program which takes in data like which characters you have and what the missions require you to do (you can complete multiple at once if the attributes align), then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data, but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name, so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays: one for the headers of the table and one for the rows of the table. The rows contain the "answers" to the headers' "questions"/"titles"; i.e. Maximum Level, 50.
This is true for everything but the first entry, which is the Name, Pronunciation (and I just want to store the name, of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
    HeaderArray[0] gives me the name
    RowArray[Iterations + 1] gives me data type 2
    RowArray[Iterations + 2] gives me data type 3
    Repeat until Array[Iterations + 8]
    Iterations += 9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os

target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir, 'TsumData.txt')

StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links

def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip()) #This is everything
            rows = td.findChildren('tr')#[0]
            headers = td.findChildren('th')#[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content)
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub('\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content)
    print(HeaderArray)
    print(RowArray)
    return (RowArray, HeaderArray)
    #Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
    #print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
    #TempFile = open(fullname, 'w') #Read only, Write Only, Append
    #TempFile.write("EHLLO")
    #TempFile.close()
    #print(td.tbody.Series)
    #print(td.tbody[Series])
    #print(td.tbody["Series"])
    #print(td.data-name)
    #time.sleep(1)

if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links: #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38: #There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainHeaderArray.append(TempHA)
            MainRowArray.append(TempRA)
            MaxIterations -= 1
        Iterations += 1
        #print(MaxIterations)
        if MaxIterations <= 0: #I don't want to scrape the entire website for a prototype
            break
    #print("This is the end ??")
    #time.sleep(3)
    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    print(MainHeaderArray)
    #time.sleep(2.5)
    #print(MainRowArray)
    #time.sleep(2.5)
    #print(zip())
    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []
    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations: #This will fire 1 time per Tsum
        print(Iterations)
        print(MainHeaderArray[Iterations][0]) #Holy this gives us Mickey ;
        print(MainHeaderArray[Iterations+1][0])
        print(MainHeaderArray[Iterations+2][0])
        print(MainHeaderArray[Iterations+3][0])
        TsumName.append(MainHeaderArray[Iterations][0])
        print(MainRowArray[Iterations][1])
        #At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        #Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
        print(Iterations)
    print("It's Over")
    time.sleep(3)
    print(TsumName)
    print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use with open(file) as json_file and write/read (super easy).
I ultimately stored 3 JSON files. No big deal, and much easier than appending everything into one big file.
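For what it's worth, a minimal sketch of that save/load step, assuming a hypothetical dict that maps each Tsum's name to its list of category strings (the names, categories and file name here are only illustrative):
import json

# hypothetical structure: name -> list of category/trait strings
tsum_categories = {
    "Mickey": ["Black", "White Gloves", "Mickey & Friends"],
    "Sully":  ["Blue", "Monster's Inc", "Male"],
}

# write it out once scraping is done
with open('TsumCategories.json', 'w') as json_file:
    json.dump(tsum_categories, json_file, indent=1)

# read it back later when adding "tags" / looking up traits
with open('TsumCategories.json') as json_file:
    loaded = json.load(json_file)

print(loaded["Sully"])   # ['Blue', "Monster's Inc", 'Male']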

Python: No Traceback when Scraping Data into Excel Spreadsheet

I'm an inexperienced coder working in Python. I wrote a script to automate a process where certain information is ripped from a webpage and then pasted into a new Excel spreadsheet. I've written and executed the code, but the Excel spreadsheet I've designated to receive the data is completely empty. Worst of all, there is no traceback error. Would you help me find the problem in my code? And how do you generally solve your own problems when you aren't given a traceback error?
import xlsxwriter, urllib.request, string

def main():
    #gets the URL for the expert page
    open_sesame = urllib.request.urlopen('https://aries.case.com.pl/main_odczyt.php?strona=eksperci')
    #reads the expert page
    readpage = open_sesame.read()
    #opens up a new file in excel
    workbook = xlsxwriter.Workbook('expert_book.xlsx')
    #adds worksheet to file
    worksheet = workbook.add_worksheet()
    #initializing the variable used to move names and dates
    #in the excel spreadsheet
    boxcoA = ""
    boxcoB = ""
    #initializing expert attribute variables and lists
    expert_name = ""
    url_ticker = 0
    name_ticker = 0
    raw_list = []
    url_list = []
    name_list = []
    date_list = []
    #this loop goes through and finds all the lines
    #that contain the expert URL and name and saves them to raw_list::
    #raw_list loop
    for i in readpage:
        i = str(i)
        if i.startswith('<tr><td align=left><a href='):
            raw_list += i
    #this loop goes through the lines in raw list and extracts
    #the name of the expert, saving it to a list::
    #name_list loop
    for n in raw_list:
        name_snip = n.split('target=_blank>','</a></td><')[1]
        name_list += name_snip
    #this loop fills a list with the dates the profiles were last updated::
    #date_list
    for p in raw_list:
        url_snipoff = p[28:]
        url_snip = url_snipoff.split('"')[0]
        url_list += url_snip
        expert_url = 'https://aries.case.com.pl/'+url_list[url_ticker]
        open_expert = urllib2.openurl(expert_url)
        read_expert = open_expert.read()
        for i in read_expert:
            if i.startswith('<p align=left><small>Last update:'):
                update = i.split('Last update:','</small>')[1]
                open_expert.close()
                date_list += update
    #now that we have a list of expert names and a list of profile update dates
    #we can work on populating the excel spreadsheet
    #this operation will iterate just as long as the list is long
    #meaning that it will populate the excel spreadsheet
    #with all of our names and dates that we wanted
    for z in raw_list:
        boxcoA = string('A',z)
        boxcoB = string('B',z)
        worksheet.write(boxcoA, name_list[z])
        worksheet.write(boxcoB, date_list[z])
    workbook.close()
    print('Operation Complete')

main()
The lack of a traceback only means your code raises no exceptions. It does not mean your code is logically correct.
I would look for logic errors by adding print statements, or using a debugger such as pdb or pudb.
One problem I notice with your code is that the first loop seems to presume that i is a line, whereas it is actually a character. You might find splitlines() more useful.
If there is no traceback then there is no error.
Most likely something has gone wrong with your scraping/parsing code and your raw_list or other arrays aren't populated.
Try printing out the data that should be written to the worksheet in the last loop to see if there is any data to be written.
If you aren't writing any data to the worksheet, then it will be empty.
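To illustrate both suggestions, here is a small diagnostic sketch under the question's assumptions: it decodes the page, splits it into lines with splitlines(), prints how many lines match the startswith() test (if that count is zero, the rest of the script has nothing to write), and writes whatever matched by (row, column) index so something visible lands in the spreadsheet. Whether that startswith() pattern matches the live page's HTML is exactly what needs checking.
import urllib.request
import xlsxwriter

html = urllib.request.urlopen(
    'https://aries.case.com.pl/main_odczyt.php?strona=eksperci').read().decode('utf-8', 'replace')

matches = [line for line in html.splitlines()
           if line.startswith('<tr><td align=left><a href=')]
print(len(matches), 'matching lines found')      # if this is 0, nothing will ever be written

workbook = xlsxwriter.Workbook('expert_book.xlsx')
worksheet = workbook.add_worksheet()
for row, line in enumerate(matches):
    worksheet.write(row, 0, line)                # column A: the raw matched line, for inspection
workbook.close()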
