I'm writing a script that scrapes data from JSON files. The website's link structure looks like this:
https://go.lime-go.com/395012/Organization/pase1009/
I want the Python script to step through a certain range of these links and visit each one. For example, the link currently ends in pase1009; after the script has visited that link, I want it to go to pase1010, and so on.
I'm really new to Python and am trying to learn how to use loops, counters, etc., but I don't get it.
My Python code:
rlista = "https://go.lime-go.com/395012/Organization/pase1009/getEmployees"
page = self.driver.get(rlista)
time.sleep(2)
Best regards,
Tobias
You can combine several strings into one with the + operator.
So you could save your base link in a variable and append the number to it inside the loop.
It would look something like this:
baseLink = "https://your-link.com/any/further/stuff/pase"
for k in range(1000, 1010, 2):
    link = baseLink + str(k)
    print(link)
Your links would then be
https://your-link.com/any/further/stuff/pase1000
https://your-link.com/any/further/stuff/pase1002
https://your-link.com/any/further/stuff/pase1004
https://your-link.com/any/further/stuff/pase1006
https://your-link.com/any/further/stuff/pase1008
since k starts at 1000, increments by 2, and stops before 1010 (range(start, stop, step)).
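Applied to the URLs from your question, a minimal sketch could look like this (assuming driver is the already-initialised Selenium WebDriver from your snippet, i.e. self.driver, and that 1009 to 1019 is just an example range):
import time

base = "https://go.lime-go.com/395012/Organization/pase"

for k in range(1009, 1020):            # pase1009, pase1010, ..., pase1019
    link = base + str(k) + "/getEmployees"
    driver.get(link)                   # visit the generated URL with Selenium
    time.sleep(2)                      # give the page a moment to load
The end of a range is exclusive, so range(1009, 1020) stops at 1019; adjust the bounds to however many pages you actually need.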
So, I am learning both Python and web scraping, so please forgive me if this is something extremely basic.
I found a script and modified it to scrape yell.com.
Now, I understand pagination and am able to scrape the entire set for one city using code similar to the one below.
for x in range(1, 9):
    print(f'Scraping page {x}')
    content = extract(f'https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=134234234&keywords=dentists&location=birmingham&pageNum={x}')
    transform(content)
    time.sleep(5)
load()
print('Saved to CSV')
Now, I have a list of cities that I'd like to scrape.
So, for instance, the location=birmingham parameter above would change to location=portsmouth.
The solution I have come up with is to define the entire city list in an array (it could be huge) and then loop over it.
However, I want the scrape to run through the entire page range defined above before moving on to a different city, with the range reset each time. And I can't figure that bit out.
It sounds like you just need a second for loop to go through your long list of cities. The city can then be included in your URL. For example:
cities = ['birmingham', 'portsmouth', 'london']  # long list of cities

for city in cities:
    print(f'City - {city}')
    for x in range(1, 9):
        print(f'Scraping page {x}')
        content = extract(f'https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=134234234&keywords=dentists&location={city}&pageNum={x}')
        transform(content)
        time.sleep(5)
load()
print('Saved to CSV')
I have the following base URL that I would like to iterate over:
http://www.blabla.com/?mode_id=1
Basically, I would like a Python for loop to iterate over the mode_id values like this:
http://www.blabla.com/?mode_id=1
http://www.blabla.com/?mode_id=2
http://www.blabla.com/?mode_id=3
http://www.blabla.com/?mode_id=4, etc.
I tried the loop below, but it does not work:
for i in range(0, 200, 1):
    url = 'http://www.blabla.com/?mode_id= + str(i)'
    driver.get(url)
How can I make it run properly? Thank you
You could use:
for i in range(200):
    url = 'http://www.blabla.com/?mode_id={}'.format(i)
    driver.get(url)
Remarks:
In your original code, + str(i) is inside the quotes, so it is treated as part of the literal string instead of being evaluated; that is why every request went to the same (wrong) URL.
If you're going to iterate one by one from zero, you can just use range(200); there is no need for the other arguments.
You should avoid concatenating strings in Python. There are better ways, like format (as in my example).
Make sure your indentation is correct.
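If you are on Python 3.6 or newer, an f-string (as used elsewhere on this page) is an equally readable alternative to format; a small variation of the loop above:
for i in range(200):
    url = f'http://www.blabla.com/?mode_id={i}'
    driver.get(url)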
I plan to automate something in Python that creates several .docx files in a while loop. Each file will have its own unique name and some information inside it. My problem is that, when looping, the information inside the documents keeps stacking up.
I believe there is a simple solution out there; I just can't seem to find it.
Here is the block of code:
i = 1
while i < 10:
    os.chdir("C:\\Users\\user\\Desktop\\" + FolderName)
    doc.save(str(doc_number[i]) + str(essay_type[i]) + ' ' + str(titles[i]) + ' ' + str(writer[i]) + '.docx');
    doc.add_paragraph('Title/Keyword:' + str(titles[i]));
    doc.add_paragraph('Reasech Link:' + str(link[i]));
    doc.add_paragraph('Target Site:' + str(keyword[i]));
    doc.save(str(doc_number[i]) + str(essay_type[i]) + ' ' + str(titles[i]) + ' ' + str(writer[i]) + '.docx');
    i += 2
This is the first document; I would like every document to have output like this.
This is the last document created. As you can see, the information from the first document, as well as from the next three documents, is all stacked and shown in the final output of this last document.
The paragraphs stack because you keep adding them to the same doc object on every pass through the loop. Create a fresh Document() for each file instead. Rearrange your code like this:
import os
from docx import Document  # python-docx

os.chdir("C:\\Users\\user\\Desktop\\" + FolderName)
i = 1
while i < 10:
    doc = Document()  # start a brand-new document on every iteration
    doc.add_paragraph('Title/Keyword:' + str(titles[i]))
    doc.add_paragraph('Research Link:' + str(link[i]))
    doc.add_paragraph('Target Site:' + str(keyword[i]))
    doc.save(str(doc_number[i]) + str(essay_type[i]) + ' ' + str(titles[i]) + ' ' + str(writer[i]) + '.docx')
    i += 2
I am writing a little script where I want to collect all the "code" values for a tag.
For example:
https://www.instagram.com/explore/tags/%s/?__a=1
The next page will be:
https://www.instagram.com/explore/tags/plebiscito/?__a=1&max_id=end_cursor
However, my problem is making each URL give me what I need (the comments and the usernames of the people).
As the script stands, it does not do that.
The obtain_max_id function works and retrieves the successive end_cursor values, but I do not know how to adapt it.
I appreciate your help!
In conclusion, I need to adapt the obtain_max_id function into my connect_main function so that I can extract the information I need from each of the URLs.
This is simple.
import requests
import json

host = "https://www.instagram.com/explore/tags/plebiscito/?__a=1"
r = requests.get(host).json()

for x in r['tag']['media']['nodes']:
    print(x['code'])

next = r['tag']['media']['page_info']['end_cursor']

while next:
    r = requests.get(host + "&max_id=" + next).json()
    for x in r['tag']['media']['nodes']:
        print(x['code'])
    next = r['tag']['media']['page_info']['end_cursor']
You have all the data you want in your data variable (in JSON form), right after you execute the line:
data = json.loads(finish.text)
in the while loop inside your obtain_max_id() method. Just use that.
Assuming everything inside the else block of your connect_main() method works, you could simply use that code inside the above while loop, right after you have all the data in your data variable.
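For example, a hedged sketch of how that merge could look, based on the loop above (the owner/username and caption field names are assumptions here; inspect the JSON stored in your data variable to confirm what the endpoint actually returns):
import requests

host = "https://www.instagram.com/explore/tags/plebiscito/?__a=1"
url = host

while url:
    media = requests.get(url).json()['tag']['media']
    for node in media['nodes']:
        print(node['code'])
        # Field names below are assumptions -- check your data variable
        # to see how usernames and comments are actually keyed.
        print(node.get('owner', {}).get('username'))
        print(node.get('caption'))
    end_cursor = media['page_info']['end_cursor']
    url = (host + "&max_id=" + end_cursor) if end_cursor else None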
I am having a bit of trouble coding a process or script that would do the following:
I need to get data from the URL of:
nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430/gfs_hd_00z
But the file URLs change (the day and the model run change), so the script has to assume this base structure, with the following variables:
Y - Year
M - Month
D - Day
C - Model Forecast/Initialization Hour
F - Model Frame Hour
Like so:
nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hdYYYYMMDD/gfs_hd_CCz
The script would run and then plug the current date (the YYYYMMDD part, as well as CC) into those variables.
So while the goal is to get, for example,
http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140430/gfs_hd_00z
the variables should correspond to the current date in the format
http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hdYYYYMMDD/gfs_hd_CCz
Can you please advise how to go about getting the URLs for the latest date in this format? Whether it's a script or something with wget, I'm all ears. Thank you in advance.
In Python, the requests library can be used to get at the URLs.
You can generate the URL by combining the base URL string with a timestamp built using the datetime class, stepping through candidate times with timedelta and formatting the date as required with strftime.
That is, start by getting the current time with datetime.datetime.now(), then in a loop subtract an hour (or whichever time step you think they're using) via timedelta and keep checking the URL with requests. The first one that exists is the latest one, and you can then do whatever further processing you need with it.
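A minimal sketch of that idea (the URL pattern is taken from the question; the six-hour step assumes the usual 00z/06z/12z/18z model runs, and treating an HTTP 200 response as "this run exists" is an assumption about the server):
import datetime
import requests

BASE = "http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd{date}/gfs_hd_{run:02d}z"

def latest_run(max_lookback_hours=48):
    """Walk backwards in 6-hour steps until a run URL responds."""
    candidate = datetime.datetime.utcnow().replace(minute=0, second=0, microsecond=0)
    candidate -= datetime.timedelta(hours=candidate.hour % 6)   # round down to 00/06/12/18
    for _ in range(max_lookback_hours // 6):
        url = BASE.format(date=candidate.strftime("%Y%m%d"), run=candidate.hour)
        if requests.get(url).status_code == 200:                # assumed existence check
            return url
        candidate -= datetime.timedelta(hours=6)
    return None

print(latest_run())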
If you need to scrape the contents of the page, scrapy works well for that.
I'd try scraping the index one level up at http://nomads.ncep.noaa.gov/dods/gfs_hd ; the last link of the relevant form there should take you to the daily downloads page, where you could do something similar.
Here's an outline of scraping the daily downloads page:
from bs4 import BeautifulSoup
import urllib.request

grdd = urllib.request.urlopen('http://nomads.ncep.noaa.gov/dods/gfs_hd/gfs_hd20140522')
soup = BeautifulSoup(grdd, 'html.parser')

datalinks = 'http://nomads.ncep.noaa.gov:80/dods/gfs_hd/gfs_hd'
for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.startswith(datalinks):
        print('Suitable link: ' + href[len(datalinks):])
        # Figure out if you already have it, choose if you want info, das, dds, etc.
and scraping the index page that lists the last thirty daily directories would, of course, be very similar.
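For completeness, a sketch of that "one level up" idea (assuming the index links are absolute URLs with the same prefix as on the daily page, and that the directory names sort chronologically because they end in YYYYMMDD):
from bs4 import BeautifulSoup
import urllib.request

index = urllib.request.urlopen('http://nomads.ncep.noaa.gov/dods/gfs_hd')
soup = BeautifulSoup(index, 'html.parser')

prefix = 'http://nomads.ncep.noaa.gov:80/dods/gfs_hd/gfs_hd'
days = [a.get('href') for a in soup.find_all('a')
        if a.get('href') and a.get('href').startswith(prefix)]

if days:
    # Directory names end in YYYYMMDD, so a plain sort puts the newest last.
    print('Latest daily page: ' + sorted(days)[-1])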
The easiest solution would be just to mirror the parent directory:
wget -np -m -r http://nomads.ncep.noaa.gov:9090/dods/gfs_hd
However, if you just want the latest date, you can use Mojo::UserAgent as demonstrated on Mojocast Episode 5
use strict;
use warnings;
use Mojo::UserAgent;
my $url = 'http://nomads.ncep.noaa.gov:9090/dods/gfs_hd';
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;
my @links = $dom->find('a')->attr('href')->each;
my @gfs_hd = reverse sort grep {m{gfs_hd/}} @links;
print $gfs_hd[0], "\n";
On May 23rd, 2014, this outputs:
http://nomads.ncep.noaa.gov:9090/dods/gfs_hd/gfs_hd20140523