I'm new to Python and scraping. I'm trying to run two loops: one scrapes IDs from one page, and then, using those IDs, I call another API to get more info/properties.
But when I run this program, the first part runs fine (it gets the IDs), but then it exits and doesn't run the second part. I feel I'm missing something really basic about Python control flow here. Why does Python exit after the first loop when I run it in the Terminal?
import requests
import csv
import time
import json
from bs4 import BeautifulSoup, Tag

file = open('parcelids.csv', 'w')
writer = csv.writer(file)
writer.writerow(['parcelId'])

for x in range(1, 10):
    time.sleep(1)  # slowing it down
    url = 'http://apixyz/Parcel.aspx?Pid=' + str(x)
    source = requests.get(url)
    response = source.content
    soup = BeautifulSoup(response, 'html.parser')
    parcelId = soup.find("span", id="MainContent_lblMblu").text.strip()
    writer.writerow([parcelId])

out = open('mapdata.csv', 'w')

with open('parcelIds.csv', 'r') as in1:
    reader = csv.reader(in1)
    writer = csv.writer(out)
    next(reader, None)  # skip header
    for row in reader:
        row = ''.join(row[0].split())[:-2].upper().replace('/', '-')  # formatting
        url = "https://api.io/api/properties/"
        url1 = url + row
        time.sleep(1)  # slowing it down
        response = requests.get(url1)
        resp_json_payload = response.json()
        address = resp_json_payload['property']['address']
        writer.writerow([address])
If you are running on Windows (where filenames are not case sensitive), then parcelids.csv, the file you still have open for writing, is the same file you later reopen as parcelIds.csv to read from, and its contents may not have been flushed to disk yet.
Try closing the file before opening it to read from it, for example by wrapping each phase in a with block.
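A minimal sketch of that rework, keeping the URLs and selectors from your post (treat them as placeholders for your real endpoints):

import csv
import time

import requests
from bs4 import BeautifulSoup

# Part 1: write the IDs inside a 'with' block so the file is flushed and
# closed before part 2 tries to read it back.
with open('parcelids.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['parcelId'])
    for x in range(1, 10):
        time.sleep(1)  # slow it down
        source = requests.get('http://apixyz/Parcel.aspx?Pid=' + str(x))
        soup = BeautifulSoup(source.content, 'html.parser')
        writer.writerow([soup.find("span", id="MainContent_lblMblu").text.strip()])

# Part 2: the ID file is now closed, so it can safely be reopened for reading.
# Note the lowercase 'parcelids.csv', matching the name used above exactly.
with open('mapdata.csv', 'w', newline='') as out, open('parcelids.csv', 'r') as in1:
    writer = csv.writer(out)
    reader = csv.reader(in1)
    next(reader, None)  # skip the header row
    for row in reader:
        pid = ''.join(row[0].split())[:-2].upper().replace('/', '-')
        time.sleep(1)  # slow it down
        response = requests.get("https://api.io/api/properties/" + pid)
        writer.writerow([response.json()['property']['address']])

With both files managed by with statements, the second loop always sees the complete list of IDs, regardless of filename case.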
I'm running a Python script with BeautifulSoup in order to extract the text, topics and tags from web articles. The website contains 210 pages, and each page contains 10 articles (each article's URL is stored in a txt file).
I'm using the following code:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = []

with open('urls.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if response.ok:
            try:
                soup = BeautifulSoup(response.text, "html.parser")
                text = soup.select_one('div.para_content_text').get_text(strip=True)
                topic = soup.select_one('div.article_tags_topics').get_text(strip=True)
                tags = soup.select_one('div.article_tags_tags').get_text(strip=True)
            except AttributeError:
                print(" ")
            data.append(
                {
                    'text': text,
                    'topic': topic,
                    'tags': tags
                }
            )

pd.DataFrame(data).to_csv('text.csv', index=False, header=True)
time.sleep(3)
My code seems to be correct, but it has been running for several days now.
I would like to understand whether an error is blocking progress or whether the process is simply very long.
To do this, I would like to know if it would be possible to add a "component" to my code that would let me track the number of URLs processed in real time.
Any ideas?
The way your code is written now, you are accumulating all the data in memory until it's all fetched. The easiest way to keep track of the progress without changing the code too much would be to just print either the current URL, or the number of the URL you're processing.
A better way that involves changing the code a little more would be to write the data to the CSV file as you are parsing it, instead of all at once in the end. Something like
print("text,topic,tags")
with open('urls.txt', 'r') as inf:
for row in inf:
url = row.strip()
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
# Getting the data you want...
print(f"{text},{topic},{tags}")
If you are going with this method, make sure to escape/remove commas, or use an actual CSV library to produce the lines.
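If you want both a proper CSV writer and a visible progress counter, here is a rough sketch under the same assumptions as your script (file names, selectors and the 3-second delay are taken from the question; the enumerate counter is only there for progress reporting):

import csv
import time

import requests
from bs4 import BeautifulSoup

with open('urls.txt', 'r') as inf, open('text.csv', 'w', newline='') as outf:
    writer = csv.writer(outf)
    writer.writerow(['text', 'topic', 'tags'])

    urls = [line.strip() for line in inf if line.strip()]
    for i, url in enumerate(urls, start=1):
        # Progress indicator: current position out of the total number of URLs.
        print(f"[{i}/{len(urls)}] {url}")

        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if not response.ok:
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        try:
            text = soup.select_one('div.para_content_text').get_text(strip=True)
            topic = soup.select_one('div.article_tags_topics').get_text(strip=True)
            tags = soup.select_one('div.article_tags_tags').get_text(strip=True)
        except AttributeError:
            continue  # skip pages that don't have the expected structure

        # The csv module takes care of quoting commas inside the fields.
        writer.writerow([text, topic, tags])
        time.sleep(3)

Because each row is written to text.csv as it is scraped, you can also watch the output file grow to gauge progress.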
I am attempting to read a CSV file that contains a long list of URLs. I need to iterate through the list and collect the URLs that return a 301, 302, or 404 response. When testing the script I get "exited with code 0", so I know it is error free, but it is not doing what I need it to. I am new to Python and to working with files; my experience has been primarily UI automation. Any suggestions would be gladly appreciated. Below is the code.
import csv
import requests
import responses
from urllib.request import urlopen
from bs4 import BeautifulSoup

f = open('redirect.csv', 'r')
contents = []

with open('redirect.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

def run():
    resp = urllib.request.urlopen(url)
    print(self.url, resp.getcode())

run()
print(run)
Given you have a CSV similar to the following (the heading is URL)
URL
https://duckduckgo.com
https://bing.com
You can do something like this using the requests library.
import csv
import requests

with open('urls.csv', newline='') as csvfile:
    errors = []
    reader = csv.DictReader(csvfile)
    # Iterate through each line of the csv file
    for row in reader:
        try:
            # Don't follow redirects, so 301/302 responses are reported as-is
            # rather than being resolved to their final destination.
            r = requests.get(row['URL'], allow_redirects=False)
            if r.status_code in [301, 302, 404]:
                # print(f"{r.status_code}: {row['URL']}")
                errors.append([row['URL'], r.status_code])
        except requests.RequestException:
            pass  # ignore URLs that can't be fetched at all
Uncomment the print statement if you want to see the results in the terminal. As written, the code appends each failing URL and its status code to the errors list, which you can print or process further once the loop finishes.
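If you would rather keep the failures on disk than in the terminal, a short follow-on sketch (bad_urls.csv is just an example name) could dump the errors list with the same csv module:

import csv

# 'errors' is the list of [URL, status_code] pairs built in the loop above.
with open('bad_urls.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['URL', 'status_code'])
    writer.writerows(errors)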
I use a multiprocessing pool to speed up my scraping and everything is okay, except that I don't understand why Python writes the header of my CSV every 30 rows. I know it is linked to the pool parameter I entered, but how can I correct this behavior?
def parse(url):
    dico = {i: '' for i in colonnes}
    r = requests.get("https://change.org" + url, headers=headers, timeout=10)
    # sleep(2)
    if r.status_code == 200:
        # I scrape my data here
        ...
        pprint(dico)
        writer.writerow(dico)
    return dico

with open(lang + '/petitions_' + lang + '.csv', 'a') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=colonnes)
    writer.writeheader()
    with Pool(30) as p:
        p.map(parse, liens)
Can someone tell me where to put the writer.writerow(dico) call to avoid the repetition of the header?
Thanks
Check if the file exists:
os.path.isfile('mydirectory/myfile.csv')
If it exists, don't write the header again. Create one function (def ...) for writing the header and another for writing the data.
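A rough sketch of that split (the function names are made up; colonnes is the field list from the question):

import csv
import os

def write_header_if_needed(path, colonnes):
    # Write the header only when the file doesn't exist yet (or is empty).
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        with open(path, 'w', newline='') as csvfile:
            csv.DictWriter(csvfile, fieldnames=colonnes).writeheader()

def write_rows(path, rows, colonnes):
    # Append data rows; the header is never touched here.
    with open(path, 'a', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=colonnes)
        writer.writerows(rows)

Call write_header_if_needed once at startup and write_rows for the scraped dictionaries, and the header cannot be duplicated.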
Looks like the "header" you are referring to comes from the writer.writeheader() line, not the writer.writerow() line.
Without a complete piece of your code, I can only assume that you have something like an outer loop that wraps around the with open block. So, every time your code enters the with block, a header line is printed, and then 30 lines of your scraped data (because of the pool size).
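If that is the case, a sketch of the fix is to hoist the open and the writeheader() call above the outer loop so the header is written exactly once; batches below is a stand-in for whatever that outer loop iterates over, and lang, colonnes and parse come from your code:

import csv
from multiprocessing import Pool

# Open the file and write the header once, outside the outer loop;
# only the pool mapping happens inside it.
with open(lang + '/petitions_' + lang + '.csv', 'a', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=colonnes)
    writer.writeheader()      # executed exactly once
    for batch in batches:     # stand-in for your outer loop
        with Pool(30) as p:
            p.map(parse, batch)

Note that having the worker processes call writer.writerow directly is itself fragile; collecting the dico values returned by p.map and writing them in the parent process is a safer pattern.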
I am new to scraping with Python. After using a lot of useful resources I was able to scrape the content of a page. However, I am having trouble saving this data into a .csv file.
Python:
import mechanize
import time
import requests
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox(executable_path=r'C:\Users\geckodriver.exe')
driver.get("myUrl.jsp")

username = driver.find_element_by_name('USER')
password = driver.find_element_by_name('PASSWORD')
username.send_keys("U")
password.send_keys("P")

main_frame = driver.find_element_by_xpath('//*[@id="Frame"]')
src = driver.switch_to_frame(main_frame)
table = driver.find_element_by_xpath("/html/body/div/div[2]/div[5]/form/div[7]/div[3]/table")
rows = table.find_elements(By.TAG_NAME, "tr")

for tr in rows:
    outfile = open("C:/Users/Scripts/myfile.csv", "w")
    with outfile:
        writers = csv.writer(outfile)
        writers.writerows(tr.text)
Problem:
Only one of the rows gets written to the output .csv file. However, when I print tr.text to the console, all the required rows show up. How can I get the text of every tr element written to the file?
Currently your code will open the file, write one line, close it, then on the next row open it again and overwrite the line. Please consider the following code snippet:
# We use 'with' to open the file and auto close it when done;
# the syntax is best modified as follows.
with open('C:/Users/Scripts/myfile.csv', 'w', newline='') as outfile:
    writers = csv.writer(outfile)
    # We only need to open the file once, so we open it first,
    # then loop through each row and write it into the open file.
    for tr in rows:
        # writerow expects a sequence of cell values, so wrap the row text in
        # a list (writerows over a plain string writes one character per line).
        writers.writerow([tr.text])
I have a problem with my Python script in which I want to scrape the same content from every website. I have a file with a lot of URLs and I want Python to go over them, passing each one to requests.get(url). After that I write the output to a file named 'somefile.txt'.
I have the following Python script (version 2.7 - Windows 8):
from lxml import html
import requests

urls = ('URL1',
        'URL2',
        'URL3'
        )

for url in urls:
    page = requests.get(url)

tree = html.fromstring(page.text)
visitors = tree.xpath('//b["no-visitors"]/text()')
print 'Visitors: ', visitors

f = open('somefile.txt', 'a')
print >> f, 'Visitors:', visitors  # or f.write('...\n')
f.close()
As you can see, I have not included the file with the URLs in the script. I tried out many tutorials but failed. The filename would be 'urllist.txt'. In the current script I only get the data from URL3; in an ideal case I want to get all data from urllist.txt.
My attempt at reading over the text file:
with open('urllist.txt', 'r') as f:  # text file containing the URLS
    for url in f:
        page = requests.get(url)
You'll need to remove the newline from your lines:
with open('urllist.txt', 'r') as f:  # text file containing the URLS
    for url in f:
        page = requests.get(url.strip())
The str.strip() call removes all whitespace (including tabs, newlines and carriage returns) from the start and end of the line.
Do make sure you then process page in the loop; if you run your code to extract the data outside the loop all you'll get is the data from the last response you loaded. You may as well open the output file just once, in the with statement so Python closes it again:
with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
    for url in urls:
        page = requests.get(url.strip())
        tree = html.fromstring(page.content)
        visitors = tree.xpath('//b["no-visitors"]/text()')
        print 'Visitors: ', visitors
        print >> output, 'Visitors:', visitors
You should either save each page in a separate variable, or perform all the computation within the loop over the URL list.
Based on your code, by the time your page parsing happens it will only contain the data from the last page fetched, since you are overwriting the page variable in each iteration.
Something like the following should append all the pages' info.
for url in urls:
    page = requests.get(url)
    tree = html.fromstring(page.text)
    visitors = tree.xpath('//b["no-visitors"]/text()')
    print 'Visitors: ', visitors
    f = open('somefile.txt', 'a')
    print >> f, 'Visitors:', visitors  # or f.write('...\n')
    f.close()