How To Scrape Specific Characters in Selenium using Python - python

I want to scrape 70 characters from this HTML code:
<p>2) Proof of payment emailed to satrader03<strong>#gmail.com</strong> direct from online banking 3) Selfie of you holding your ID 4) Selfie of you holding your bank card from which payment will be made OR 5) Skype or what's app Video call while logged onto online banking displaying account name which should match personal verified name Strictly no 3rd party payments</p>
I want to know how to scrape a specific number of characters with Selenium, for example the first 30 characters or some other range.
Here is my code:
description = driver.find_elements_by_css_selector("p")
items = len(title)
with open('btc_gmail.csv', 'a', encoding="utf-8") as s:
    for i in range(items):
        s.write(str(title[i].text) + ',' + link[i].text + ',' + description[i].text + '\n')
How can I scrape just 30 characters, or 70, or any other number?
Edit (full code):
import time
from random import randrange
from selenium import webdriver

driver = webdriver.Firefox()
r = randrange(3, 7)
for url_p in url_pattren:
    time.sleep(3)
    url1 = 'https://www.bing.com/search?q=site%3alocalbitcoins.com+%27%40gmail.com%27&qs=n&sp=-1&pq=site%3alocalbitcoins+%27%40gmail.com%27&sc=1-31&sk=&cvid=9547A785CF084BAE94D3F00168283D1D&first=' + str(url_p) + '&FORM=PERE3'
    driver.get(url1)
    time.sleep(r)
    title = driver.find_elements_by_tag_name('h2')
    link = driver.find_elements_by_css_selector("cite")
    description = driver.find_elements_by_css_selector("p")
    items = len(title)
    with open('btc_gmail.csv', 'a', encoding="utf-8") as s:
        for i in range(items):
            s.write(str(title[i].text) + ',' + link[i].text + ',' + description[i].text[30:70] + '\n')
Any Solution?

You can get the text of the tag and then use a slice on the string:
>>> description = driver.find_elements_by_css_selector("p")[0].text
>>> description[30:70]  # characters at indices 30 through 69 of the text
' satrader03#gmail.com direct from online'
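Applied to your CSV loop, the same slice works on each element's .text. A minimal sketch, using zip so mismatched list lengths can't raise an IndexError:

with open('btc_gmail.csv', 'a', encoding="utf-8") as s:
    for t, l, d in zip(title, link, description):
        # keep only characters 30 through 69 of the description text
        s.write(t.text + ',' + l.text + ',' + d.text[30:70] + '\n')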

Related

Excluding 'duplicated' scraped URLs in Python app?

I've never used Python before, so excuse my lack of knowledge, but I'm trying to scrape a XenForo forum for all of its threads. So far so good, except that it's picking up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11
Ideally, what I want to scrape is just one of these:
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
Here is my script:
from bs4 import BeautifulSoup
import requests

def get_source(url):
    return requests.get(url).content

def is_forum_link(self):
    return self.find('special string') != -1

def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select("a[href*=" + word + "]")

main_url = "http://example.com/forum/"
forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])
print('Fetched ' + str(len(forums)) + ' forums')

threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink
print('Fetched ' + str(len(threads)) + ' threads')
This solution assumes that what should be removed from the URL to check for uniqueness is always of the form "/page-#...". If that is not the case, this solution will not work.
Instead of using a list to store your URLs, you can use a set, which only keeps unique values. Then, before adding each URL to the set, remove the last instance of "page" and anything after it whenever it matches the format "/page-#", where # is any number.
forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
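A regular expression can do the same trimming in one step; a minimal sketch (strip_page_suffix is an illustrative helper, not part of the original code):

import re

def strip_page_suffix(url):
    # drop a trailing "page-<number>" (and anything after it), keeping the slash
    return re.sub(r'/page-\d+.*$', '/', url)

print(strip_page_suffix('threads/my-gap-year-uni-story.13846/page-9'))
# threads/my-gap-year-uni-story.13846/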

How to make my session.get() link into a variable?

My goal is to scrape multiple profile links and then scrape specific data on each of these profiles.
Here is my code to get multiple profile links (it should work fine):
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re

session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms')
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.html, 'html.parser')
profiles = soup.find_all(href=re.compile("/profile/kaid"))
for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]
    final_profile_link = 'https://www.khanacademy.org' + text_link_nodiscussion
    print(final_profile_link)
Now here is my code to get the specific data on just one profile (it should work fine too):
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re

session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/Kkasparas/')
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.html, 'html.parser')
user_info_table = soup.find('table', class_='user-statistics-table')
if user_info_table is not None:
    dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
else:
    dates = points = videos = 'NA'
user_socio_table = soup.find_all('div', class_='discussion-stat')
data = {}
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number
full_data_keys = ['questions', 'votes', 'answers', 'flags raised', 'project help requests', 'project help replies', 'comments', 'tips and thanks']
for header_value in full_data_keys:
    if header_value not in data.keys():
        data[header_value] = 'NA'
user_calendar = soup.find('div', class_='streak-calendar-scroll-container')
if user_calendar is not None:
    # for getdate in user_calendar:
    last_activity = user_calendar.find('span', class_='streak-cell filled')
    last_activity_date = last_activity['title']
    # print(last_activity)
    # print(last_activity_date)
else:
    last_activity_date = 'NA'
filename = "khanscrapetry1.csv"
f = open(filename, "w")
headers = "date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)
f.write(dates + "," + points.replace(",", "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")
f.close()
My question is: how can I automate these scripts? In other words, how can I merge them?
The goal is to have a variable that holds a different profile link on each pass, and then, for each profile link, to get the specific data and write it to the CSV file (a new row for each profile).
It is fairly straightforward to do this. Instead of printing the profile links, store them in a list variable. Then loop through the list, scrape each link, and write to the CSV file. Some pages do not have all the details, so you have to handle those exceptions as well; in the code below they are also marked as 'NA', following the convention used in your code. One other note for the future: consider using Python's built-in csv module for reading and writing CSV files (a sketch of that appears after the sample output below).
Merged Script
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re

session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms')
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.html, 'html.parser')
profiles = soup.find_all(href=re.compile("/profile/kaid"))
profile_list = []
for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]
    final_profile_link = 'https://www.khanacademy.org' + text_link_nodiscussion
    profile_list.append(final_profile_link)
filename = "khanscrapetry1.csv"
f = open(filename, "w")
headers = "date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)
for link in profile_list:
    print("Scraping ", link)
    session = HTMLSession()
    r = session.get(link)
    r.html.render(sleep=5)
    soup = BeautifulSoup(r.html.html, 'html.parser')
    user_info_table = soup.find('table', class_='user-statistics-table')
    if user_info_table is not None:
        dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
    else:
        dates = points = videos = 'NA'
    user_socio_table = soup.find_all('div', class_='discussion-stat')
    data = {}
    for gettext in user_socio_table:
        category = gettext.find('span')
        category_text = category.text.strip()
        number = category.previousSibling.strip()
        data[category_text] = number
    full_data_keys = ['questions', 'votes', 'answers', 'flags raised', 'project help requests', 'project help replies', 'comments', 'tips and thanks']
    for header_value in full_data_keys:
        if header_value not in data.keys():
            data[header_value] = 'NA'
    user_calendar = soup.find('div', class_='streak-calendar-scroll-container')
    if user_calendar is not None:
        last_activity = user_calendar.find('span', class_='streak-cell filled')
        try:
            last_activity_date = last_activity['title']
        except TypeError:
            last_activity_date = 'NA'
    else:
        last_activity_date = 'NA'
    f.write(dates + "," + points.replace(",", "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")
f.close()
Sample Output from khanscrapetry1.csv
date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0,Saturday Jun 4 2016
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0,Saturday Jun 4 2016
6 years ago,3164708,1276,164,2793,348,67,16,3,5663,885,Wednesday Oct 31 2018
6 years ago,3164708,1276,164,2793,348,67,16,3,5663,885,Wednesday Oct 31 2018
NA,NA,NA,18,NA,0,0,NA,NA,0,NA,Monday Dec 24 2018
NA,NA,NA,18,NA,0,0,NA,NA,0,NA,Monday Dec 24 2018
5 years ago,240334,56,7,42,6,0,2,NA,12,2,Tuesday Nov 20 2018
5 years ago,240334,56,7,42,6,0,2,NA,12,2,Tuesday Nov 20 2018
...
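As mentioned above, Python's built-in csv module takes care of quoting and separators for you. A minimal sketch of the writing side (write_rows is a hypothetical helper, not part of the original script; the field order matches the headers used above):

import csv

FIELDS = ["date_joined", "points", "videos", "questions", "votes", "answers",
          "flags", "project_request", "project_replies", "comments",
          "tips_thx", "last_date"]

def write_rows(rows, filename="khanscrapetry1.csv"):
    # rows: one 12-item list per profile, in FIELDS order
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(rows)

# usage: collect each profile's values in the scraping loop, then write once
write_rows([["6 years ago", "1527829", "1123", "25", "100", "2", "0",
             "NA", "NA", "0", "0", "Saturday Jun 4 2016"]])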

How to use Python to interpret a URL

I'm writing code that attempts to extract the text from the Library of Babel.
They basically use a system of Hexes, Walls, Shelves, Volumes and Pages to split up their library of randomly generated text files. Here is an example (https://libraryofbabel.info/book.cgi?2-w1-s2-v22:1).
Here we have Hex: 2, Wall: 1, Shelf: 2, Volume: 22, Page: 1.
I would ideally like to randomly generate values for all of these variables and extract the text from the resulting page, but I am not getting the output I would expect.
Here is my code:
import requests
from bs4 import BeautifulSoup
import random
hex = str(random.randint(0, 6))
wall = str(random.randint(1, 4))
shelf = str(random.randint(1, 5))
vol = str(random.randint(1, 32))
page = str(random.randint(1, 410))
print("Fetching: " + " Hex: " + hex + ", Wall: " + wall + ", Shelf: " + shelf + ", Vol: " + vol + ", Page: " + page)
babel_url = str("https://libraryofbabel.info/browse.cgi?" + hex + "-w" + wall + "-s" + shelf + "-v" + vol + ":" + page)
r = requests.get(babel_url)
soup = BeautifulSoup(r.text)
print(soup.get_text())
My output is identical to what I get if I change the URL to just https://libraryofbabel.info/browse.cgi. print(babel_url) shows me that the URL is built the way I intended, but something isn't interpreting it the way I want.
I've also found that just pasting https://libraryofbabel.info/book.cgi?2-w1-s2-v22:1 into Chrome drops me at https://libraryofbabel.info/book.cgi, but if I navigate to that page (or any other) from within the site, I can move between pages at will.
The only thing I get in the output worth mentioning is:
It appears your browser has javascript disabled. Follow this link to browse without javascript.
Put on your glasses:
You are requesting browse.cgi instead of book.cgi
https://libraryofbabel.info/browse.cgi?2-w2-s1-v10:72
instead of
https://libraryofbabel.info/book.cgi?2-w2-s1-v10:72
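In other words, only the script name in the URL needs to change. A minimal sketch of the corrected request, keeping the question's random page selection (hex_ is renamed here only to avoid shadowing the built-in hex()):

import random
import requests
from bs4 import BeautifulSoup

hex_ = str(random.randint(0, 6))
wall = str(random.randint(1, 4))
shelf = str(random.randint(1, 5))
vol = str(random.randint(1, 32))
page = str(random.randint(1, 410))

# book.cgi, not browse.cgi, as the answer points out
babel_url = ("https://libraryofbabel.info/book.cgi?"
             + hex_ + "-w" + wall + "-s" + shelf + "-v" + vol + ":" + page)
r = requests.get(babel_url)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.get_text())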

Get text from a webpage as an iterable object in Python 3.3

I'm trying to get the text from a webpage with Python 3.3 and then search through that text for certain strings. When I find a matching string, I need to save the following text. For example, take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy
and I need to save the text after each category (card text, rarity, etc.) in the card info.
Currently I'm using Beautiful Soup, but get_text causes a UnicodeEncodeError and doesn't return an iterable object. Here is the relevant code:
urlStr = urllib.request.urlopen(
    'http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName
).read()
htmlRaw = BeautifulSoup(urlStr)
htmlText = htmlRaw.get_text
for line in htmlText:
    line = line.strip()
    if "Converted Mana Cost:" in line:
        cmc = line.next()
        message += "*Converted Mana Cost: " + cmc + "* \n\n"
    elif "Types:" in line:
        type = line.next()
        message += "*Type: " + type + "* \n\n"
    elif "Card Text:" in line:
        rulesText = line.next()
        message += "*Rules Text: " + rulesText + "* \n\n"
    elif "Flavor Text:" in line:
        flavor = line.next()
        message += "*Flavor Text: " + flavor + "* \n\n"
    elif "Rarity:" in line:
        rarity = line.next()
        message += "*Rarity: " + rarity + "* \n\n"
This is incorrect:
htmlText = htmlRaw.get_text
As get_text is a method of the BeautifulSoup class, you're assigning the method itself to htmlText, not its result. There is a property variant that does what you want here:
htmlText = htmlRaw.text
You're also using a HTML parser to simply strip tags, when you could use it to target the data you want:
# unique id for the html section containing the card info
card_id = 'ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rightCol'
# grab the html section with the card info
card_data = htmlRaw.find(id=card_id)
# create a generator to iterate over the rows
card_rows = ( row for row in card_data.find_all('div', 'row') )
# create a generator to produce functions for retrieving the values
card_rows_getters = ( lambda x: row.find('div', x).text.strip() for row in card_rows )
# create a generator to get the values
card_values = ( (get('label'), get('value')) for get in card_rows_getters )
# dump them into a dictionary
cards = dict( card_values )
print(cards)
{'Artist:': 'Scott Chou',
 'Card Name:': 'Dark Prophecy',
 'Card Number:': '93',
 'Card Text:': 'Whenever a creature you control dies, you draw a card and lose 1 life.',
 'Community Rating:': 'Community Rating: 3.617 / 5\xa0\xa0(64 votes)',
 'Converted Mana Cost:': '3',
 'Expansion:': 'Magic 2014 Core Set',
 'Flavor Text:': 'When the bog ran short on small animals, Ekri turned to the surrounding farmlands.',
 'Mana Cost:': '',
 'Rarity:': 'Rare',
 'Types:': 'Enchantment'}
Now you have a dictionary of the information you want (plus a few extras), which will be a lot easier to deal with.
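If the chained generator expressions are hard to follow, a plain loop builds the same dictionary; a minimal sketch, assuming htmlRaw was parsed as above and every row contains both a label and a value div:

# same unique id for the html section containing the card info
card_id = 'ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rightCol'
card_data = htmlRaw.find(id=card_id)

cards = {}
for row in card_data.find_all('div', 'row'):
    # each row holds a 'label' div and a 'value' div
    label = row.find('div', 'label').text.strip()
    value = row.find('div', 'value').text.strip()
    cards[label] = value
print(cards)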

Using BeautifulSoup to get prices from Craigslist

I am new to coding in Python (maybe a couple of days in) and basically learning from other people's code on Stack Overflow. The code I am trying to write uses BeautifulSoup to get the pid and the corresponding price for motorcycles on Craigslist. I know there are many other ways of doing this, but my current code looks like this:
from bs4 import BeautifulSoup
from urllib2 import urlopen

u = ""
count = 0
while (count < 9):
    site = "http://sfbay.craigslist.org/mca/" + str(u)
    html = urlopen(site)
    soup = BeautifulSoup(html)
    postings = soup('p', {"class": "row"})
    f = open("pid.txt", "a")
    for post in postings:
        x = post.getText()
        y = post['data-pid']
        prices = post.findAll("span", {"class": "itempp"})
        if prices == "":
            w = 0
        else:
            z = str(prices)
            z = z[:-8]
            w = z[24:]
        filewrite = str(count) + " " + str(y) + " " + str(w) + '\n'
        print y
        print w
        f.write(filewrite)
    count = count + 1
    index = 100 * count
    print "index is " + str(index)
    u = "index" + str(index) + ".html"
It works fine, and as I keep learning I plan to optimize it. The problem I have right now is that entries without a price are still showing up. Is there something obvious that I am missing?
Thanks.
The problem is how you're comparing prices. You have:
prices = post.findAll("span", {"class":"itempp"})
In BS, .findAll returns a list of elements. When you compare that list to an empty string, the result will always be False:
>>> [] == ""
False
Change if prices == "": to if prices == []: (or, more idiomatically, if not prices:) and everything should be fine.
I hope this helps.
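As a side note, slicing the str() of the result list (z[:-8], z[24:]) is fragile. Inside the for post in postings loop, pulling the text out of the first matched element is sturdier; a sketch in the question's Python 2 style:

prices = post.findAll("span", {"class": "itempp"})
if not prices:
    w = 0
else:
    # visible text of the first price span, e.g. "$1200"
    w = prices[0].getText().strip()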
