I am trying to build a database of articles for text-mining purposes.
I extract the body of each article via web scraping and then save it to a CSV file. However, I can't manage to save all the body texts.
My code saves only the text of the last URL (article), although when I print what I am scraping (and what I am supposed to save) I get the body of every article.
I included only some of the URLs from the list (which contains many more), just to give you an idea:
import requests
from bs4 import BeautifulSoup
import csv
r=["http://www.nytimes.com/2016/10/12/world/europe/germany-arrest-syrian-refugee.html",
"http://www.nytimes.com/2013/06/16/magazine/the-effort-to-stop-the- attack.html",
"http://www.nytimes.com/2016/10/06/world/europe/police-brussels-knife-terrorism.html",
"http://www.nytimes.com/2016/08/23/world/europe/france-terrorist-attacks.html",
"http://www.nytimes.com/interactive/2016/09/09/us/document-Review-of-the-San-Bernardino-Terrorist-Shooting.html",
]
for url in r:
    t= requests.get(url)
    t.encoding = "ISO-8859-1"
    soup = BeautifulSoup(t.content, 'lxml')
    text = soup.find_all(("p",{"class": "story-body-text story-content"}))
    print(text)
with open('newdb30.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ',quotechar='|', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(text)
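Try declaring a variable such as all_text = "" before the for loop and appending each article's text to it at the end of every iteration, e.g. all_text += article_text + "\n" (the \n starts a new line). Note that text is a list of tags rather than a string, so convert it first, for example by joining p.get_text() for each p in text.
Then, in the last line, write all_text instead of text. Since writerow() expects a sequence of fields, pass it as a one-element list: spamwriter.writerow([all_text]).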
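The accumulate-then-write idea can be sketched roughly as follows. This is a minimal sketch, not the asker's code: it assumes the paragraphs have already been extracted as plain strings (for instance with get_text()), the helper names collect_rows and write_rows are hypothetical, and the sample data is made up so the example runs without network access.

```python
import csv
import io


def collect_rows(articles):
    """Build one (url, body) row per article.

    `articles` maps each URL to the list of paragraph strings scraped from it.
    """
    rows = []
    for url, paragraphs in articles.items():
        # Join the paragraphs of one article into a single text field.
        body = "\n".join(p.strip() for p in paragraphs if p.strip())
        rows.append((url, body))
    return rows


def write_rows(rows, csvfile):
    """Write all rows in one go, after the scraping loop has finished."""
    writer = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["url", "body"])
    writer.writerows(rows)


# Hypothetical sample data standing in for the scraped pages.
sample = {
    "http://example.com/a": ["First paragraph.", " Second paragraph. "],
    "http://example.com/b": ["Only paragraph."],
}
buffer = io.StringIO()  # stands in for open("newdb30.csv", "w", newline="")
rows = collect_rows(sample)
write_rows(rows, buffer)
print(buffer.getvalue())
```

Keeping one row per article (URL plus body) rather than one long string makes it easier to tell the articles apart later; whether that layout suits the text-mining step is a judgment call.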
Related
I have a script that reads URLs from a text file, performs a request for each, and then saves all the responses in one text file. How can I save each response in a different text file instead? For example, if my text file input.txt has 20 URLs, I would like to save the responses in 20 different .txt files such as output1.txt, output2.txt, and so on, so that the response to each request is saved in a new .txt file. Thank you.
import requests
from bs4 import BeautifulSoup
with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class":'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            with open('output.txt', 'a+') as f:
                f.write(data + "\n")
                print(data)
Here's a quick way to implement what others have hinted at:
import requests
from bs4 import BeautifulSoup
with open('input.txt', 'r') as f_in:
    for i, line in enumerate(map(str.strip, f_in)):
        if not line:
            continue
        ...
        with open(f'output_{i}.txt', 'w') as f:
            f.write(data + "\n")
            print(data)
You can make a new file by using open('something.txt', 'w'). If the file is found, it'll erase its content. Else, it'll make a new file named 'something.txt'. Now, you can use file.write() to write your info!
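As a rough, self-contained illustration of the one-file-per-response idea: the sketch below assumes the response bodies are already available as strings (made-up sample data stands in for requests.get(url).text), and save_responses is a hypothetical helper name. It writes into a temporary directory so it can be run safely.

```python
import tempfile
from pathlib import Path


def save_responses(responses, out_dir):
    """Write each response body to its own numbered file and return the paths."""
    out_dir = Path(out_dir)
    paths = []
    # start=1 so the files are named output1.txt, output2.txt, ...
    for i, body in enumerate(responses, start=1):
        path = out_dir / f"output{i}.txt"
        # write_text opens the file for writing: it is created if missing
        # and truncated if it already exists.
        path.write_text(body + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Hypothetical response bodies standing in for requests.get(url).text.
bodies = ["first page", "second page", "third page"]
with tempfile.TemporaryDirectory() as tmp:
    written = save_responses(bodies, tmp)
    names = [p.name for p in written]
    contents = [p.read_text(encoding="utf-8") for p in written]
print(names)
```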
I'm not sure if I understood your problem correctly.
I would create a list, build an object for each URL request and its response, add the objects to the list, and then write a separate file for each object.
There are at least two ways you could generate a file for each URL. One, shown below, is to create a hash of some data that is unique to the file. In this case I chose the category text, but you could also use the whole contents of the file. Hashing produces a string that is safe to use as a filename; note that two links with identical category text still hash to the same name, so hash something that differs per link (such as the href) if that can happen.
Another way, not shown, is to find some unique value within the data itself and use it as the filename without hashing it. However, this can cause more problems than it solves since data on the Internet should not be trusted.
Here's your code with a SHA-256 hash used for the filename. (MD5 would also work for this purpose: it is not a secure hashing function for passwords, but it is fine for creating unique filenames.)
Updated Snippet
import hashlib
import requests
from bs4 import BeautifulSoup
with open('input.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        if not line:
            continue
        response = requests.get(line)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        categories = soup.find_all("a", {"class":'navlabellink nvoffset nnormal'})
        for category in categories:
            data = line + "," + category.text
            filename = hashlib.sha256()
            filename.update(category.text.encode('utf-8'))
            with open('{}.html'.format(filename.hexdigest()), 'w') as f:
                f.write(data + "\n")
                print(data)
Code added
filename = hashlib.sha256()
filename.update(category.text.encode('utf-8'))
with open('{}.html'.format(filename.hexdigest()), 'w') as f:
Capturing Updated Pages
If you care about catching contents of a page at different points in time, hash the whole contents of the file. That way, if anything within the page changes the previous contents of the page aren't lost. In this case, I hash both the url and the file contents and concatenate the hashes with the URL hash followed by a hash of the file contents. That way, all versions of a file are visible when the directory is sorted.
for category in categories:
    data = line + "," + category.text
    # Hash the URL so that all versions of one page share a filename prefix
    hashed_url = hashlib.sha256()
    hashed_url.update(category['href'].encode('utf-8'))
    # Fetch the linked page and hash its contents
    page = requests.get(category['href'])
    hashed_content = hashlib.sha256()
    hashed_content.update(page.text.encode('utf-8'))
    # filename is already a complete string, so pass it to open() directly
    filename = '{}_{}.html'.format(hashed_url.hexdigest(), hashed_content.hexdigest())
    with open(filename, 'w') as f:
        f.write(data + "\n")
        print(data)
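To see how the combined filename behaves, here is a small sketch that needs no network access. The function name versioned_filename, the URL, and the page bodies are all made up for illustration.

```python
import hashlib


def versioned_filename(url, content):
    """Return '<sha256 of url>_<sha256 of content>.html'."""
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"{url_hash}_{content_hash}.html"


# Hypothetical URL and page bodies.
url = "http://example.com/page"
first = versioned_filename(url, "<p>version 1</p>")
second = versioned_filename(url, "<p>version 2</p>")

# Same URL -> same 64-character prefix; changed content -> different suffix.
print(first[:64] == second[:64], first == second)  # prints: True False
```

Sorting the directory groups all versions of one URL together, but it does not order them chronologically; if that matters, a timestamp in the filename would be one option.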
I am new to Python and I am trying to loop through a list of URLs in a CSV file and grab each website's title using BeautifulSoup, which I would then like to save to a file, Headlines.csv. But I am unable to grab the webpage title. If I use a variable with a single URL as follows:
url = 'https://www.space.com/japan-hayabusa2-asteroid-samples-landing-date.html'
resp = req.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.title.text)
It works just fine and I get the title Japanese capsule carrying pieces of asteroid Ryugu will land on Earth Dec. 6 | Space
But when I use the loop,
import csv
with open('urls_file2.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url)
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')
        print(soup.title.text)
I get the following
['\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']
and an error message
InvalidSchema: No connection adapters were found for "['\\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']"
I am not sure what I am doing wrong.
You have a byte order mark (\ufeff) at the start of the URL you read from your file.
It looks like the file was saved as UTF-8 with a signature (BOM), i.e. with the utf-8-sig encoding.
You need to read the file with encoding='utf-8-sig'.
Read more here.
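The effect of the two encodings can be checked with a small experiment. This sketch writes a temporary CSV file with a BOM and reads it back both ways; the URL is a placeholder and first_cell is a hypothetical helper.

```python
import csv
import os
import tempfile

# Write a small CSV the way some spreadsheet tools do: UTF-8 with a BOM.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    f.write("https://example.com/a\r\n")


def first_cell(path, encoding):
    """Return the first cell of the first row, decoded with `encoding`."""
    with open(path, newline="", encoding=encoding) as f:
        return next(csv.reader(f))[0]


with_bom = first_cell(path, "utf-8")         # BOM survives as '\ufeff'
without_bom = first_cell(path, "utf-8-sig")  # BOM is stripped on read
os.remove(path)

print(repr(with_bom))
print(repr(without_bom))
```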
As the previous answer already mentioned, the "\ufeff" means you need to change the encoding.
The second issue is that when you read a CSV file, each row comes back as a list containing all of that row's columns. The keyword here is list: you are passing requests a list instead of a string.
Based on the example you have given, I assume that your URLs are in the first column of the CSV. Python lists start at index 0, not 1, so to extract the URL you need the element at index 0, which is the first column.
import csv
with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url[0])
To read up more on lists, you can refer here.
You can add more columns to the CSV file and experiment to see how the results would appear.
If you would like to refer to the column name while reading each row, you can refer here.
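For reference, here is a short sketch contrasting index-based and name-based access. It assumes a CSV with a header row; the column names "url" and "source" and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical file contents; the column names "url" and "source" are made up.
sample = "url,source\nhttps://example.com/a,feed\nhttps://example.com/b,manual\n"

# csv.reader yields each row as a list, so the URL is row[0].
rows = list(csv.reader(io.StringIO(sample)))
urls_by_index = [row[0] for row in rows[1:]]  # skip the header row

# csv.DictReader uses the header row as keys, so the URL is row["url"].
urls_by_name = [row["url"] for row in csv.DictReader(io.StringIO(sample))]

print(urls_by_index)
print(urls_by_name)
```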
For work, I was asked to create a spreadsheet of the names and addresses of all allopathic medical schools in the United States. Being new to python, I thought that this would be the perfect situation to try web scraping. While I eventually wrote a program that returned the data I needed, I know that there is a better way to do it as there were some extraneous characters (eg: ", ], [) that I had to go into excel and manually remove. I would just like to know if there was a better way I could have written this code so I can get what I needed, minus the extraneous characters.
Edit: I have also attached an image of the csv file that was created to show the extraneous characters that I'm speaking about.
from bs4 import BeautifulSoup
import requests
import csv
link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School" # noqa
# link to the site we want to scrape from
page_response = requests.get(link)
# fetching the content using the requests library
soup = BeautifulSoup(page_response.text, "html.parser")
# Calling BeautifulSoup in order to parse our document
data = []
# Empty list for the first scrape. We only get one column with many rows.
# We still have the line break tags here </br>
for tr in soup.find_all('tr', {'valign': 'top'}):
    values = [td.get_text('</b>', strip=True) for td in tr.find_all('td')]
    data.append(values)
data2 = []
# New list that we'll use to have name on index i, address on index i+1
for i in data:
    test = list(str(i).split('</b>'))
    # Using the line breaks to our advantage.
    name = test[0].strip("['")
    '''Here we are saying that the name of the school is the first element
    before the first line break'''
    addy = test[1:]
    # The address is what comes after this first line break
    data2.append(name)
    data2.append(addy)
    # Append the name of the school and address to our new list.
school_name = data2[::2]
# Making a new list that consists of the school name
school_address = data2[1::2]
# Another list that consists of the school's address.
with open("Medschooltest.csv", 'w', encoding='utf-8') as toWrite:
    writer = csv.writer(toWrite)
    writer.writerows(zip(school_name, school_address))
    '''Zip the two together making a 2 column table with the school's name and
    its address'''
print("CSV Completed!")
Created CSV file
It seems applying conditional statements along with string manipulation can do the trick. I think the following script will lead you real close to what you want.
from bs4 import BeautifulSoup
import requests
import csv
link = "https://members.aamc.org/eweb/DynamicPage.aspx?site=AAMC&webcode=AAMCOrgSearchResult&orgtype=Medical%20School" # noqa
res = requests.get(link)
soup = BeautifulSoup(res.text, "html.parser")
with open("membersInfo.csv","w",newline="") as infile:
    writer = csv.writer(infile)
    writer.writerow(["Name","Address"])
    for tr in soup.find_all('table', class_='bodyTXT'):
        items = ', '.join([item.string for item in tr.select_one('td') if item.string!="\n" and item.string!=None])
        name = items.split(",")[0].strip()
        address = items.split(name)[1].strip(",")
        writer.writerow([name,address])
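The stray characters in the original output (", ], [) come from calling str() on a list and writing a list as a single CSV field. A rough sketch of the alternative, keeping the fragments as a list and joining them explicitly, is shown below. It assumes the text fragments of each cell have already been collected (for example from BeautifulSoup's stripped_strings); split_entry is a hypothetical helper and the school names and addresses are invented placeholders.

```python
import csv
import io


def split_entry(parts):
    """Split the text fragments of one table cell into (name, address)."""
    cleaned = [p.strip() for p in parts if p and p.strip()]
    if not cleaned:
        return None
    name, address_lines = cleaned[0], cleaned[1:]
    # Join the address lines into one plain string instead of writing a list.
    return name, ", ".join(address_lines)


# Hypothetical cells: each inner list holds the strings found in one <td>.
cells = [
    ["Example School of Medicine", "\n", "1 Main Street", "Springfield, ST 00000"],
    ["\n", None],
    ["Sample University College of Medicine", "200 College Ave"],
]

buffer = io.StringIO()  # stands in for open("Medschooltest.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["Name", "Address"])
entries = [e for e in map(split_entry, cells) if e is not None]
writer.writerows(entries)
print(buffer.getvalue())
```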
If you have knowledge of SQL AND the data is in such a structured manner, it would be the best solution to extract it to a database.
I've encountered an issue with the CSV-writing part of my web-scraping project.
I have data formatted like this:
table = {
"UR": url,
"DC": desc,
"PR": price,
"PU": picture,
"SN": seller_name,
"SU": seller_url
}
I get it from a loop that parses an HTML page and returns this table.
The table itself is fine: it changes on every iteration of the loop.
The problem is that when I try to write every table I get from that loop to my CSV file, the same thing is written over and over again.
Only the first element I get from the loop is written, and it is written about 10 million times instead of about 45 times (the number of articles per page).
I tried to do it with the plain csv library and then with pandas.
So here's my loop:
if os.path.isfile(file_path) is False:
    open(file_path, 'a').close()
file = open(file_path, "a", encoding = "utf-8")
i = 1
while True:
    final_url = website + brand_formatted + "+handbags/?p=" + str(i)
    request = requests.get(final_url)
    soup = BeautifulSoup(request.content, "html.parser")
    articles = soup.find_all("div", {"class": "dui-card searchresultitem"})
    for article in articles:
        table = scrap_it(article)
        write_to_csv(table, file)
    if i == nb_page:
        break
    i += 1
file.close()
And here is my function that writes to the CSV file:
def write_to_csv(table, file):
    import csv
    writer = csv.writer(file, delimiter = " ")
    writer.writerow(table["UR"])
    writer.writerow(table["DC"])
    writer.writerow(table["PR"])
    writer.writerow(table["PU"])
    writer.writerow(table["SN"])
    writer.writerow(table["SU"])
I'm pretty new to writing CSV files and to Python in general, but I can't figure out why this isn't working. I've followed many guides and ended up with more or less the same code for writing the CSV file.
Edit: here's a screenshot of the output in my CSV file.
You can see that every element is exactly the same, even though my table changes.
EDIT: I fixed my problem by creating a file for each article I scrape. That's a lot of files, but apparently it is fine for my project.
This might be the solution you wanted:
import csv
fieldnames = ['UR', 'DC', 'PR', 'PU', 'SN', 'SU']
def write_to_csv(table, file):
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writerow(table)
Reference: https://docs.python.org/3/library/csv.html
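A slightly fuller sketch of the DictWriter approach follows. It is an illustration under assumptions: the sample records are invented stand-ins for whatever scrap_it() returns, and an in-memory buffer replaces the real file. It writes a header once and then one row per article, and it also shows what happens when writerow() receives a plain string, as in the question's code.

```python
import csv
import io

FIELDNAMES = ["UR", "DC", "PR", "PU", "SN", "SU"]

# Hypothetical scraped records standing in for the output of scrap_it().
tables = [
    {"UR": "http://example.com/1", "DC": "Red bag", "PR": "10",
     "PU": "http://example.com/1.jpg", "SN": "alice",
     "SU": "http://example.com/alice"},
    {"UR": "http://example.com/2", "DC": "Blue bag", "PR": "12",
     "PU": "http://example.com/2.jpg", "SN": "bob",
     "SU": "http://example.com/bob"},
]

buffer = io.StringIO()  # stands in for open(file_path, "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()          # one header row
for table in tables:
    writer.writerow(table)    # one CSV row per article

lines = buffer.getvalue().splitlines()
print(lines)

# For comparison: writerow() on a plain string treats it as a sequence of
# characters, so every character becomes its own column.
demo = io.StringIO()
csv.writer(demo, delimiter=" ").writerow("abc")
print(repr(demo.getvalue()))
```

Creating the writer once, outside the loop, avoids rebuilding it for every article; opening the real file with newline="" is what the csv module documentation recommends.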
I have a problem with my Python script, in which I want to scrape the same content from every website. I have a file with a lot of URLs and I want Python to go over them and pass each one to requests.get(url). After that I write the output to a file named 'somefile.txt'.
I have the following Python script (version 2.7, Windows 8):
from lxml import html
import requests
urls = ('URL1',
'URL2',
'URL3'
)
for url in urls:
    page = requests.get(url)
tree = html.fromstring(page.text)
visitors = tree.xpath('//b["no-visitors"]/text()')
print 'Visitors: ', visitors
f = open('somefile.txt', 'a')
print >> f, 'Visitors:', visitors # or f.write('...\n')
f.close()
As you can see, I have not included the file with the URLs in the script. I tried many tutorials but failed. The filename would be 'urllist.txt'. With the current script I only get the data from URL3; ideally I want to get the data for every URL in urllist.txt.
My attempt at reading the text file:
with open('urllist.txt', 'r') as f: #text file containing the URLS
    for url in f:
        page = requests.get(url)
You'll need to remove the newline from your lines:
with open('urllist.txt', 'r') as f: #text file containing the URLS
    for url in f:
        page = requests.get(url.strip())
The str.strip() call removes all leading and trailing whitespace (including tabs, newlines and carriage returns) from the line.
Do make sure you then process page in the loop; if you run your code to extract the data outside the loop all you'll get is the data from the last response you loaded. You may as well open the output file just once, in the with statement so Python closes it again:
with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
    for url in urls:
        page = requests.get(url.strip())
        tree = html.fromstring(page.content)
        visitors = tree.xpath('//b["no-visitors"]/text()')
        print 'Visitors: ', visitors
        print >> output, 'Visitors:', visitors
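For readers on Python 3, the same loop structure might look roughly like the sketch below. It is not a drop-in replacement: the network call and XPath query are replaced by a hypothetical fake_fetch function, and in-memory buffers stand in for urllist.txt and somefile.txt, so the example runs offline. The helper name scrape_all is also made up.

```python
import io


def scrape_all(url_lines, fetch, output):
    """Fetch every non-empty URL line and write one result line per URL."""
    count = 0
    for line in url_lines:
        url = line.strip()       # drop the trailing newline and stray spaces
        if not url:
            continue             # skip blank lines
        visitors = fetch(url)    # process inside the loop, once per page
        print("Visitors:", visitors, file=output)
        count += 1
    return count


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url) plus the XPath query."""
    return [str(len(url))]


urls = io.StringIO("http://example.com/a\n\nhttp://example.com/bb\n")
out = io.StringIO()  # stands in for open("somefile.txt", "a")
processed = scrape_all(urls, fake_fetch, out)
print(out.getvalue())
```

Passing the fetch step in as a function keeps the loop testable without network access; in real use it would wrap requests.get and the parsing.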
You should either save each page in a separate variable, or perform all the computation inside the loop over the URL list.
In your code, by the time the page parsing happens, page only contains the data for the last page fetched, since you overwrite the page variable on each iteration.
Something like the following should append the info for every page.
for url in urls:
    page = requests.get(url)
    tree = html.fromstring(page.text)
    visitors = tree.xpath('//b["no-visitors"]/text()')
    print 'Visitors: ', visitors
    f = open('somefile.txt', 'a')
    print >> f, 'Visitors:', visitors # or f.write('...\n')
    f.close()