I am trying to scrape some text from a webpage and save it to a text file using the following code (I am reading the links from a text file called links.txt):
import requests
import csv
import random
import string
import re
from bs4 import BeautifulSoup

# Create random string of specific length
def randStr(chars=string.ascii_uppercase + string.digits, N=10):
    return ''.join(random.choice(chars) for _ in range(N))

with open("links.txt", "r") as a_file:
    for line in a_file:
        stripped_line = line.strip()
        endpoint = stripped_line
        response = requests.get(endpoint)
        data = response.text
        soup = BeautifulSoup(data, "html.parser")
        for pictags in soup.find_all('col-md-2'):
            lastfilename = randStr()
            file = open(lastfilename + ".txt", "w")
            file.write(pictags.txt)
            file.close()
        print(stripped_line)
The webpage has the following element:
<div class="col-md-2">
The problem is that after running the code nothing happens and I am not receiving any error.
To get all keyword text from the page into a file, you can do:
import requests
from bs4 import BeautifulSoup

url = "http://www.mykeyworder.com/keywords?tags=dog&exclude=&language=en"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

with open("data.txt", "w") as f_out:
    for inp in soup.select('input[type="checkbox"]'):
        print(inp["value"], file=f_out)
This creates data.txt with content:
dog
animal
canine
pet
cute
puppy
happy
young
adorable
...and so on.
From the BeautifulSoup documentation, you can see that your line for pictags in soup.find_all('col-md-2') searches for elements with the tag name 'col-md-2', not elements with that class name. In other words, your code is looking for elements like <col-md-2></col-md-2>.
Fix your code to use for pictags in soup.find_all(class_='col-md-2') and try again.
You can also match elements by their attributes: pass a dictionary to the attrs parameter of find_all with the desired attributes of the elements you're looking for.
pictags = soup.find_all(attrs={'class': 'col-md-2'})
This will find all elements with the class 'col-md-2'.
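Putting that fix into the original loop might look like the sketch below. Note the switch from .txt to .text, which is an assumption about what the original code intended, since Tag objects expose their text content through .text:
# a minimal sketch of the corrected loop, assuming the original intent
# was to save each matching element's text to its own file
for pictags in soup.find_all(class_='col-md-2'):
    lastfilename = randStr()
    with open(lastfilename + ".txt", "w") as file:
        file.write(pictags.text)  # .text, not .txt, returns the tag's text content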
I want to open a txt file (which contains multiple links) and scrape each page's title using BeautifulSoup.
My txt file contains link like this:
https://www.lipsum.com/7845284869/
https://www.lipsum.com/56677788/
https://www.lipsum.com/01127111236/
My code:
import requests as rq
from bs4 import BeautifulSoup as bs

with open('output1.csv', 'w', newline='') as f:
    url = open('urls.txt', 'r', encoding='utf8')
    request = rq.get(str(url))
    soup = bs(request.text, 'html.parser')
    title = soup.findAll('title')
    pdtitle = {}
    for pdtitle in title:
        pdtitle.append(pdtitle.text)
    f.write(f'{pdtitle}')
I want to open every link from the txt file and scrape the title from each. The main problem is that opening the txt file into the url variable is not working. How do I open the file and save the data to a CSV?
Your code isn't working because the url variable holds the whole file object, not the individual URLs. You need to read the lines and request them one by one:
import requests as rq
from bs4 import BeautifulSoup as bs

with open('urls.txt', 'r') as f:
    urls = f.readlines()

with open('output1.csv', 'w', newline='') as f:
    for url in urls:
        request = rq.get(str(url))
        soup = bs(request.text, 'html.parser')
        titles = soup.findAll('title')
        for title in titles:
            f.write(f'{title.text}\n')
Your URLs may not be working because they are read with a trailing newline character (\n). You need to strip the text before putting the URLs in a list.
Also, you are using .find_all('title'), and this will return a list, which is probably not what you are looking for. You probably just want the first title and that's it. In that case, .find('title') would be better. I have provided some possible corrections below.
from bs4 import BeautifulSoup
import requests

filepath = '...'

with open(filepath) as f:
    urls = [i.strip() for i in f.readlines()]

titles = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    title = soup.find('title')   # Note: will find the FIRST title only
    titles.append(title.text)    # Grabs the TEXT of the title only, removes HTML

# Make sure to prepend the desired location, e.g. 'C:/user/name/urls.csv'
with open('urls.csv', 'w') as new_csv:
    for title in titles:
        new_csv.write(title + '\n')  # The '\n' ensures a new row is written
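If the titles can themselves contain commas or quotes, writing raw lines can produce a malformed CSV. A minimal sketch using the standard csv module, assuming the titles list built above, handles the quoting automatically:
import csv

# a sketch using csv.writer instead of raw writes, so titles that
# contain commas or quotes are escaped correctly
with open('urls.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for title in titles:
        writer.writerow([title])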
I'm working on a program to scrape companies' address data from a website. For this, I have already built a list of the links where this data can be found. On each of those links, the page source of the data I need looks like this:
http://imgur.com/a/V0kBK
For some reason though, I cannot seem to fetch the first line of the address, in this case "'t walletje 104 101"
All the other information comes through fine, as you can see here:
http://imgur.com/a/aUmSI
This is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import string
import urllib

urllist = []
url1 = "http://www.bisy.be/"
fname = "test2.txt"
fh = open(fname)

for line in fh:
    line = line.rstrip()
    if line.startswith(" <tr class="):
        words = line.split()
        url2 = words[6]
        url3 = url1 + url2[1:48]
        urllist.append(url3)

for link in urllist:
    document = urllib.request.urlopen(link)
    html = document.read()
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("br")
    name2 = name.text.strip()
    print(name2)
This code covers the basics so far; once everything works I will clean it up and fine-tune it a bit.
An example link for the people who want to check out the page source: http://www.bisy.be/?q=nl/bedrijf_fiche&id=KNO01C2015&nr=2027420160
Is there anyone who can point me in the right direction?
This is a workaround: find the p tags with a specific font size:
elements = soup.find_all("p")
for tag in elements:
    try:
        if "9pt" in tag["style"]:  # the address line is styled with a 9pt font
            details = tag.text
            print(details)
    except KeyError:  # skip <p> tags that have no style attribute
        pass
You can't select text in a br tag; it is a line break, not a container like p, span, or div. You can try the BeautifulSoup approach above, or use the regex below, which is faster:
import re

for link in urllist:
    document = urllib.request.urlopen(link)
    html = document.read().decode('utf-8')
    name = re.compile(r"9pt; font-family: Arial;'>(.*?)<a", re.DOTALL).findall(html)
    # clean html tags out of the match
    name = re.sub('<[^>]*>', '', name[0])
    print(name)
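Another option worth sketching: since a br element is empty, the address text is a sibling node rather than its content. Assuming the address is the text node sitting directly after the first br on the page (untested against this particular site), BeautifulSoup's next_sibling could reach it:
# a minimal sketch, assuming the address is the text node
# immediately following the first <br> on the page
br = soup.find("br")
if br and br.next_sibling:
    print(str(br.next_sibling).strip())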
I got a suggestion to use BeautifulSoup to delete a tag with a certain id from an HTML file, for example deleting <div id=needDelete>...</div>. Below is my code, but it doesn't seem to be working correctly:
import os, re
from bs4 import BeautifulSoup

cwd = os.getcwd()
print ('Now you are at this directory: \n' + cwd)

# find files that have an HTML extension
Files = os.listdir(cwd)
print (Files)

def func(file):
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            print ('HTML files are \n' + file)
            f = open(file, "r+")
            soup = BeautifulSoup(f, 'html.parser')
            matches = str(soup.find_all("div", id="jp-post-flair"))
            # The soup.find_all part should be correct, as I tested it by
            # printing the matches, and the result matches the text I want to delete.
            f.write(f.read().replace(matches, ''))
            # maybe the above line isn't correct
            f.close()

func(file)
Could you help me check which part of the code is wrong, and perhaps suggest how I should approach it?
Thank you very much!!
You can use the .decompose() method to remove the element/tag:
with open(file, "r+") as f:
    soup = BeautifulSoup(f, 'html.parser')
    elements = soup.find_all("div", id="jp-post-flair")
    for element in elements:
        element.decompose()
    f.seek(0)      # rewind to the start of the file before writing
    f.truncate()   # drop the old contents so nothing is left over
    f.write(str(soup))
It's also worth mentioning that you can probably just use the .find() method because an id attribute should be unique within a document (which means that there will likely only be one element in most cases):
with open(file, "r+") as f:
    soup = BeautifulSoup(f, 'html.parser')
    element = soup.find("div", id="jp-post-flair")
    if element:
        element.decompose()
    f.seek(0)
    f.truncate()
    f.write(str(soup))
As an alternative, based on the comments below:
If you only want to parse and modify part of the document, BeautifulSoup has a SoupStrainer class that allows you to selectively parse parts of the document.
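For reference, a minimal sketch of what that might look like here; html_doc stands in for the file contents, and only the matching div is parsed at all:
from bs4 import BeautifulSoup, SoupStrainer

# parse only <div id="jp-post-flair"> and skip the rest of the document
only_flair = SoupStrainer("div", id="jp-post-flair")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_flair)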
You mentioned that the indentation and formatting in the HTML file were being changed. Rather than converting the soup object directly into a string, you can check out the relevant output formatting section in the documentation.
Depending on the desired output, here are a few potential options:
soup.prettify(formatter="minimal")  # default: escapes only the characters that would break the markup
soup.prettify(formatter="html")     # converts characters to HTML entities where possible
soup.prettify(formatter=None)       # no escaping at all; fastest, but can produce invalid markup
https://example.net/users/x
Here, x is a number ranging from 1 to 200000. I want to run a loop that fetches all of these URLs and extracts the contents of each one using Beautiful Soup.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
content = urlopen(re.compile(r"https://example.net/users/[0-9]//"))
soup = BeautifulSoup(content)
Is this the right approach? I have to perform two things:
1. Get a continuous set of URLs.
2. Extract and store the retrieved contents of every page/URL.
UPDATE:
I have to get only one particular value from each of the webpages.
soup = BeautifulSoup(content)
divTag = soup.find_all("div", {"class": "classname"})

for tag in divTag:
    ulTags = tag.find_all("ul", {"class": "classname"})
    for tag in ulTags:
        aTags = tag.find_all("a", {"class": "classname"})
        for tag in aTags:
            name = tag.find('img')['alt']
            print(name)
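As an aside, those three nested loops can probably be collapsed into a single CSS selector. A sketch, assuming the same placeholder class names as above:
# equivalent in one pass; the [alt] part keeps only <img> tags
# that actually carry an alt attribute
for img in soup.select('div.classname ul.classname a.classname img[alt]'):
    print(img['alt'])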
You could try this:
import urllib2
import shutil

urls = []
for i in range(10):
    urls.append('https://www.example.org/users/' + str(i))  # note: i must be converted to str before concatenating

def getUrl(urls):
    for url in urls:
        # Build a file name based only on the url string
        file_name = url.replace('https://', '').replace('.', '_').replace('/', '_')
        response = urllib2.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(urls)
If you just need the contents of a web page, you could probably use lxml, from which you could parse the content. Something like:
from lxml import etree
import requests

r = requests.get('https://example.net/users/x')
dom = etree.HTML(r.text)  # lenient HTML parser; fromstring() would require well-formed XML
# parse something, e.g. a title element
title = dom.xpath('//h1[@class="title"]')[0].text  # @class, not #class, in XPath
Additionally, if you are scraping 10s or 100s of thousands of pages, you might want to look into something like grequests where you can do multiple asynchronous HTTP requests.
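For instance, a minimal grequests sketch, assuming the URL pattern from the question and a concurrency cap of 20:
import grequests

# build lazy request objects for a slice of the user id range
reqs = (grequests.get('https://example.net/users/%d' % i) for i in range(1, 201))

# fire them off, at most 20 at a time
for response in grequests.map(reqs, size=20):
    if response is not None:  # failed requests come back as None
        print(response.status_code, response.url)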
My overall goal is to isolate the tags that contain a certain word in their text and print only those to a text file.
So far, I have been able to extract a particular tag, in this case <title>, and get those to print to a text file.
My question is: once I've got all the text extracted, what can I do with it? I am having trouble figuring out a way to isolate a particular word and trim the text down to only what I need.
Here is what I have so far:
import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://www.website.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
page = soup.findAll('title')

for element in page:
    print element

file_name = raw_input("What do you want to name the file?> ")
text_file = open("/Users/user1/Projects/%s.txt" % file_name, "w")
text_file.write("%s" % page)
text_file.close()
What gets returned to me is:
<title>food</title>
<title>ball</title>
<title>car</title>
<title>desk</title>
<title>blue food</title>
<title>green food</title>
<title>red ball</title>
How would I get it to only print results that include 'food'?
You can get the contents of an element using .string. If you only wish to include results with food, add a check for that:
for element in page:
    if 'food' in element.string:
        print element.string
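Tying that back to the file-writing part of your script, here is a minimal sketch (Python 2, reusing the page and file_name variables from your code). Note that element.string can be None for tags with nested markup, hence the extra check:
# keep only the titles whose text contains 'food'
matching = [element.string for element in page
            if element.string and 'food' in element.string]

text_file = open("/Users/user1/Projects/%s.txt" % file_name, "w")
text_file.write("\n".join(matching))
text_file.close()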
For example, if you want to extract the data from the page and put it in a CSV file, you can do it like this:
import urllib2
from BeautifulSoup import BeautifulSoup
import csv

file_name = raw_input("What do you want to name the file?> ")
c = csv.writer(open("%s.csv" % (file_name), "a"), delimiter=";")  # Open the CSV file for appending and create a writer

url = 'http://www.website.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
page = soup.findAll('title')

for element in page:
    element = element.text.encode('utf-8')
    c.writerow([element])
You can then use the CSV file in Excel and/or a text editor, which can be useful. My code is far from perfect, but it should work :)