I have a program that takes the text from a website using the following code:
from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"\chromedriver.exe")

def get_raw_input(link_input, website_input, driver):
    driver.get(f'{website_input}')
    try:
        here_button = driver.find_element_by_xpath('/html/body/div[2]/h3/a')
        here_button.click()
        raw_data = driver.find_element_by_xpath('/html/body/pre').text
    except:
        move_on = False
        while move_on == False:
            try:
                raw_data = driver.find_element_by_class_name('output').text
                move_on = True
            except:
                pass
    driver.close()
    return raw_data
The section of text it is targeting is formatted like so:
englishword tab frenchword
However, the text I get back is in this format:
englishword space frenchword
The English part of the text could be a phrase with spaces in it, so I cannot simply .split(" ") since that may split the phrase as well.
My end goal is to keep the formatting using tab instead of space so I can .split("\t") to make things easier for later manipulation.
Any help would be greatly appreciated :)
Selenium returns element text the way the browser renders it, so it "normalizes" whitespace: runs of inner whitespace characters collapse into a single space.
You can see some discussion here. The solution suggested by the Selenium developers for getting the text with its original spacing is to query the element's textContent property.
Here is the example:
raw_data = driver.find_element_by_class_name('output').get_property('textContent')
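Once the un-normalized text is retrieved that way, the tab separator from the question should survive, so a split along these lines becomes possible (a minimal sketch, assuming one English/French pair per line):
pairs = []
for line in raw_data.splitlines():
    parts = line.split('\t')  # the tab is preserved in textContent
    if len(parts) == 2:
        english, french = parts
        pairs.append((english, french))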
Related
I have already extracted the text from a PDF page into a Text variable.
I'm looking to extract the number that comes after the string 'your number is' (the 14-character string was matched on span (982, 996)):
import PyPDF2, re

object = PyPDF2.PdfFileReader(filename)
PageObj = object.getPage(0)
Text = PageObj.extractText()
ResSearch = re.search(String, Text)
I'm getting a result: span=(982, 996), match='your number is'. Now all I need is to extract the three-digit number that comes after that (e.g. 'your number is 105'), as the files change daily, so the fetching should be dynamic.
Thank you everyone !!
The problem is about the regex, not the PDF itself. Under the assumption that there is at most one match per page you can use search; otherwise use findall. Have a look at the re documentation on how to use group, in particular the section on (...).
import PyPDF2, re

filename = ''  # path to the PDF file
pdf_r = PyPDF2.PdfFileReader(open(filename, 'rb'))
text = pdf_r.getPage(0).extractText()  # from 1st page or make a loop

if p := re.search(r'your number is (\d{3})', text):
    my_number = int(p.groups()[0])  # as int
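If more than one match per page is possible, a findall variant along the same lines could be used instead (a small sketch under that assumption):
# collect every three-digit number following the phrase on the page
numbers = [int(n) for n in re.findall(r'your number is (\d{3})', text)]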
Use PyPDF4; the syntax is the same and it doesn't "have" the same extractText issue.
from the doc:
This works well for some PDF files, but poorly for others, depending on the generator used. [...] Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
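A minimal sketch of the same approach with PyPDF4, assuming it keeps the PdfFileReader / getPage / extractText interface of PyPDF2:
import PyPDF4, re

filename = ''  # path to the PDF file
pdf_r = PyPDF4.PdfFileReader(open(filename, 'rb'))
text = pdf_r.getPage(0).extractText()

if p := re.search(r'your number is (\d{3})', text):
    my_number = int(p.group(1))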
I am trying to build a script that will extract specific parts (namely the link & its related description) out of an HTML file and return the result per line.
I'm trying to build it using lists in Python, yet I'm making a mistake somewhere!
This is what I've done so far, but my values list comes back empty:
import re

def subtext(data, first_link, last_link, first_descr, last_descr):
    values = []
    link = re.search('''"first_link"(.+?)"last_link"''', data)
    values.append(link)
    descr = re.search('''"first_descr"(.+?)"last_descr"''', data)
    values.append(descr)
    while values:
        print(values)

html_file = input("Type filepath: ")
html_code = open(html_file, "r")
html_data = html_code.read()
subtext(html_data, '''11px;">''', '''</td><td style="font-''')
html_code.close()
There is an HTML parser for Python. But if you want to use your own code, then you need to fix these mistakes:
link = re.search('''"first_link"(.+?)"last_link"''', data)
values.append(link)
First of all, your regex will search for the literal strings "first_link" and "last_link" instead of the values from the function args. Use .format to build the pattern string from the args.
Also, in the above code link will be a re.Match object, not a string. Use group() to pick the string out of the object, and make sure that it actually found something first. Same story with the next re.search.
while values:
    print(values)
Here you will get into an infinite loop of prints. Simply do print(values) without any loop.
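Putting those fixes together, a sketch of the corrected function could look like this (re.escape is used here on the assumption that the markers may contain regex metacharacters such as ">"):
import re

def subtext(data, first_link, last_link, first_descr, last_descr):
    values = []
    link = re.search('{}(.+?){}'.format(re.escape(first_link), re.escape(last_link)), data)
    if link:  # only append when something was found
        values.append(link.group(1))
    descr = re.search('{}(.+?){}'.format(re.escape(first_descr), re.escape(last_descr)), data)
    if descr:
        values.append(descr.group(1))
    print(values)  # no while loop needed
    return values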
I've been trying for hours to find a way to do this, and so far I've found nothing. I've tried using find element by css, xpath, and partial text using the not function. I'm trying to scan a webpage for all the links that don't contain the word 'google', and append them to an array.
Keep in mind speak and get_audio are separate functions I have not included.
from selenium import webdriver

driver = webdriver.Chrome(executable_path='mypath')
url = "https://www.google.com/search?q="
driver.get(url + text.lower())
speak("How many articles should I pull?")
n = get_audio()
speak(f"I'll grab {n} articles")
url_array = []
for a in driver.find_elements_by_xpath("//*[not(contains(text(), 'google'))]"):
    url_array.append(a.get_attribute('href'))
print(url_array)
I always get an error along the lines of find_elements_* can't take (whatever I put in there), or it works but it adds everything to the array, even the links with google in them. Anyone have any ideas? Thanks!
I finally got it by defining a new function and filtering the list after it was made, instead of trying to get selenium to do it.
def Filter(string, substr):
    return [s for s in string if
            any(sub not in s for sub in substr)]
Then using that, plus a filter to get rid of the None values:
url_array_2 = []
for a in driver.find_elements_by_xpath('.//a'):
    url_array_2.append(a.get_attribute('href'))
url_array_1 = list(filter(None, url_array_2))
flist = ['google']
url_array = Filter(url_array_1, flist)
print(url_array)
Worked perfectly :)
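For what it's worth, the filtering can probably also be pushed into the XPath itself by matching the href attribute rather than the element text; a sketch of that alternative:
url_array = []
for a in driver.find_elements_by_xpath("//a[not(contains(@href, 'google'))]"):
    href = a.get_attribute('href')
    if href:  # skip anchors without an href
        url_array.append(href)
print(url_array)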
I have a loop which scans a website for a particular element, scrapes it into a list, and then puts the result into a string variable.
Postalcode3 outputs fine to the DF, and this in turn outputs correctly to the CSV; however, postalcode4 does not output anything, and those cells are simply left blank in the CSV.
Here is the loop function -
import requests
from lxml import html

for i in range(30):
    page = requests.get('https://www.example.com' + df.loc[i, 'ga:pagePath'])
    tree = html.fromstring(page.content)
    postalcode2 = tree.xpath('//span[@itemprop="postalCode"]/text()')
    postalcode = tree.xpath('//span[@itemprop="addressRegion"]/text()')
    if not postalcode2 and not postalcode:
        print(postalcode, postalcode2)
    elif not postalcode2:
        postalcode4 = postalcode[0]
        # postalcode4 = postalcode4.replace(' ','')
        df.loc[i, 'postcode'] = postalcode4
    elif not postalcode:
        postalcode3 = postalcode2[0]
        if 'Â' not in postalcode3:
            postalcode3 = postalcode3.replace('\\xa0', '')
            postalcode3 = postalcode3.replace(' ', '')
        else:
            postalcode3 = postalcode3.replace('\\xa0Â', '')
            postalcode3 = postalcode3.replace(' ', '')
        df.loc[i, 'postcode'] = postalcode3
I have debugged it and can see that the string output by postalcode4 is correct and in the same format as postalcode3.
Postalcode3 has a number of character-removal steps in place because that particular web element comes back full of useless characters.
I'm not entirely sure what's gone wrong.
This is how I read in the DF and insert the new column which will be written into by the loop function.
import pandas

files = 'example.csv'
df = pandas.read_csv(files, index_col=0)
df.insert(5, 'postcode', '')
It's possible you aren't handling the web output correctly.
The content attribute of a requests.get response is a bytestring, but HTML content is text. If you don't decode the bytestring before you build the HTML tree, you may well find extraneous characters from the encoding appearing in your text. The correct way to handle these is not to keep working with the bytestring and strip characters out afterwards, but to decode the incoming bytes to text before calling html.fromstring.
You should really determine the correct encoding from the charset in the Content-Type header, if it's present. As an experiment you might try replacing
tree = html.fromstring(page.content)
with
tree = html.fromstring(page.content.decode('utf-8'))
since many web sites use UTF-8 encoding. You may find that the responses then appear to make more sense, and that you don't need to strip so much "extraneous" stuff out.
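Alternatively, requests can do the decoding for you: the text attribute of the response is an already-decoded string, using the charset the server declared (or requests' best guess). A sketch of that variant, with a placeholder URL in place of the real page path:
import requests
from lxml import html

page = requests.get('https://www.example.com/some-page')  # hypothetical page path
print(page.encoding)               # the charset requests decided on
tree = html.fromstring(page.text)  # page.text is already decoded text
postcode = tree.xpath('//span[@itemprop="postalCode"]/text()')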
I am trying to perform a search and count of data in a website using the code below. You can see I have added a few extra prints in the code for debugging; currently the result is always "0", which says to me there is an error of some sort in reading the data. If I print the variable called html, I can clearly see that all three strings I am searching for are contained in it, yet as previously mentioned none of my debug prints fire, and the final print of count simply returns "0". As you can see, I have tried three different methods, with the same problem each time.
import urllib2
import urllib
import re
import json
import mechanize
post_url = "url_of_fishermans_finds"
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.addheaders = [('User-agent', 'Firefox')]
html = browser.open(post_url).read().decode('UTF-8')
# Attempted method 1
print html.count("SEA BASS")
# Attempted method 2
count = 0
enabled = False
for line in html:
    if 'MAIN FISHERMAN' in line:
        print "found main fisherman"
        enabled = True
    elif 'SEA BASS' in line:
        print "found fish"
        count += 1
    elif 'SECONDARY FISHERMAN' in line:
        print "found secondary fisherman"
        enabled = False
print count
# Attempted method 3
relevant = re.search(r"MAIN FISHERMAN(.*)SECONDARY FISHERMAN", html)[1]
found = relevant.count("SEA BASS")
print found
It is probably something really simple; any comments or help would be greatly appreciated. Kind regards, AEA
Regarding your regular-expression method #3, it appears you aren't grouping your search result before running count. I don't have the HTML you're looking at, but you may also be running into trouble with your use of '.' if there are newlines between your two search terms. With these issues in mind, try something like the following to correct these errors (note: Python 3 syntax):
relevantcompile = re.compile("MAIN FISHERMAN(.*)SECONDARY FISHERMAN", re.DOTALL)
relevantsearch = re.search(relevantcompile, html)
relevantgrouped = relevantsearch.group()
relevantcount = relevantgrouped.count("SEA BASS")
print(relevantcount)
Also, keep in mind the comments above regarding the case sensitivity of regular expression searches. Hope this helps :)
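If case does turn out to matter, the same compile call can take the re.IGNORECASE flag as well, along these lines:
# case-insensitive variant of the same pattern
relevantcompile = re.compile("MAIN FISHERMAN(.*)SECONDARY FISHERMAN", re.DOTALL | re.IGNORECASE)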