I have a loop that scans a website for a particular element, scrapes it into a list, and then stores the result in a string variable.
postalcode3 outputs fine to the DF and this in turn is written correctly to the CSV; postalcode4, however, does not output anything, and those cells simply end up empty in the CSV.
Here is the loop function -
import requests        # imports implied by the snippet
from lxml import html

for i in range(30):
    page = requests.get('https://www.example.com' + df.loc[i, 'ga:pagePath'])
    tree = html.fromstring(page.content)
    postalcode2 = tree.xpath('//span[@itemprop="postalCode"]/text()')
    postalcode = tree.xpath('//span[@itemprop="addressRegion"]/text()')
    if not postalcode2 and not postalcode:
        print(postalcode, postalcode2)
    elif not postalcode2:
        postalcode4 = postalcode[0]
        # postalcode4 = postalcode4.replace(' ', '')
        df.loc[i, 'postcode'] = postalcode4
    elif not postalcode:
        postalcode3 = postalcode2[0]
        if 'Â' not in postalcode3:
            postalcode3 = postalcode3.replace('\xa0', '')
            postalcode3 = postalcode3.replace(' ', '')
        else:
            postalcode3 = postalcode3.replace('\xa0Â', '')
            postalcode3 = postalcode3.replace(' ', '')
        df.loc[i, 'postcode'] = postalcode3
I have debugged it and can see that the string held in postalcode4 is correct and in the same format as postalcode3.
postalcode3 has a load of character-removal steps because that particular web element comes full of useless characters.
I'm not entirely sure what's gone wrong.
This is how I read in the DF and insert the new column which will be written into by the loop function.
import pandas

files = 'example.csv'
df = pandas.read_csv(files, index_col=0)
df.insert(5, 'postcode', '')
It's possible you aren't handling the web output correctly.
The content attribute of a requests.get response is a bytestring, but HTML content is text. If you don't decode the bytestring before you build the HTML tree, you may well find extraneous characters from the encoding appearing in your text. The correct way to handle these is not to carry on with a bytestring, but to convert the incoming bytestring to text by decoding it before calling html.fromstring.
You should really determine the correct encoding from the charset parameter of the Content-Type header, if it's present. As an experiment you might try replacing
tree = html.fromstring(page.content)
with
tree = html.fromstring(page.content.decode('utf-8'))
since many web sites will use UTF8 encoding. You may find that the responses then appear to make more sense, and that you don't need to strip so much "extraneous" stuff out.
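For what it's worth, requests will also do the decoding for you; a minimal sketch, assuming the site declares its charset in the Content-Type header (the other names here reuse those from the question):

page = requests.get('https://www.example.com' + df.loc[i, 'ga:pagePath'])
print(page.encoding)               # the encoding requests inferred from the headers
tree = html.fromstring(page.text)  # .text is the already-decoded str form of .content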
I have a program that takes the text from a website using the following code:
from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"\chromedriver.exe")

def get_raw_input(link_input, website_input, driver):
    driver.get(website_input)
    try:
        here_button = driver.find_element_by_xpath('/html/body/div[2]/h3/a')
        here_button.click()
        raw_data = driver.find_element_by_xpath('/html/body/pre').text
    except:
        move_on = False
        while move_on == False:
            try:
                raw_data = driver.find_element_by_class_name('output').text
                move_on = True  # was `move_on == True`, a comparison that never ends the loop
            except:
                pass
    driver.close()
    return raw_data
The section of text it is targeting is formatted like so:
englishword tab frenchword
however, the return I get is in this format:
englishword space frenchword
The English part of the text could be a phrase with spaces in it, so I cannot simply .split(" ") since that may split the phrase as well.
My end goal is to keep the formatting using tab instead of space so I can .split("\t") to make things easier for later manipulation.
Any help would be greatly appreciated :)
Selenium returns element text the way the browser renders it, so it typically "normalizes" whitespace (all runs of inner whitespace collapse to a single space).
You can see some discussion here. The solution the Selenium developers suggest for getting the actual spacing is to query the textContent property of the element.
Here is the example:
raw_data = driver.find_element_by_class_name('output').get_property('textContent')
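From there the tabs survive, so splitting works as intended; a small sketch, assuming each line holds exactly one tab between the English and French parts (the 'output' class name is from the question):

raw_data = driver.find_element_by_class_name('output').get_property('textContent')
pairs = []
for line in raw_data.splitlines():
    english, french = line.split('\t', 1)  # split on the first tab only
    pairs.append((english, french))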
I'd like to make a program that makes offline copies of math questions from Khan Academy. I have a huge 21.6MB text file that contains data on all of their exercises, but I have no idea how to start analyzing it, much less start pulling the questions from it.
Here is a pastebin containing a sample of the JSON data. If you want to see all of it, you can find it here. Warning for long load time.
I've never used JSON before, but I wrote up a quick Python script to try to load individual "sub-blocks" (or whatever the correct term is) of data.
import sys
import json

exercises = open("exercises.txt", "r+b")
byte = 0
while byte < 1000:  # while byte < character we want to read up to
                    # keep at 1000 for testing purposes
    char = exercises.read(1)
    sys.stdout.write(char)
    # Here we decide what to do based on what char we have
    if str(char) == "{":
        frontbracket = byte
        while True:
            char = exercises.read(1)
            if str(char) == "}":
                # use the file position; the old code reused `byte`,
                # which never advanced inside this inner loop
                backbracket = exercises.tell() - 1
                break
        exercises.seek(frontbracket)
        block = exercises.read(backbracket - frontbracket + 1)  # +1 includes the closing brace
        print "Block is " + str(backbracket - frontbracket + 1) + " bytes long"
        # note: this still assumes no nested {...} inside a block
        jsonblock = json.loads(block)
        sys.stdout.write(block)
        print jsonblock["translated_display_name"]
        print "\nENDBLOCK\n"
    byte = exercises.tell()  # resync the counter with the file position
Ok, the repeated pattern appears to be this: http://pastebin.com/4nSnLEFZ
To get an idea of the structure of the response, you can use JSONlint to copy/paste portions of your string and 'validate'. Even if the portion you copied is not valid, it will still format it into something you can actually read.
First, I used the requests library to pull the JSON for you. It's a super-simple library for things like this. The API is slow to respond, because it seems you're pulling everything, but it should work fine.
Once you get a response from the API, you can convert that directly to python objects using .json(). What you have is essentially a mixture of nested lists and dictionaries that you can iterate through and pull specific details. In my example below, my_list2 has to use a try/except structure because it would seem that some of the entries do not have two items in the list under translated_problem_types. In that case, it will just put 'None' instead. You might have to use trial and error for such things.
Finally, since you haven't used JSON before, it's also worth noting that a JSON object behaves like a dictionary: you are not guaranteed the order in which you receive the keys. However, in this case the outermost structure is a list, so in theory it's possible that there is a consistent order, but don't rely on it; we don't know how the list is constructed.
import requests

api_call = requests.get('https://www.khanacademy.org/api/v1/exercises')
json_response = api_call.json()

# Assume we first want to list "author name" with "author key"
# This should loop through the repeated pattern in the pastebin
# access items as a dictionary
my_list1 = []
for item in json_response:
    my_list1.append([item['author_name'], item['author_key']])
print my_list1[0:5]

# Now let's assume we want the 'sha' of the SECOND entry in translated_problem_types
# to also be listed with author name
my_list2 = []
for item in json_response:
    try:
        the_second_entry = item['translated_problem_types'][0]['items'][1]['sha']
    except IndexError:
        the_second_entry = 'None'
    my_list2.append([item['author_name'], item['author_key'], the_second_entry])
print my_list2[0:5]
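If you want to see which fields are available before writing loops like these, a quick way to poke at the structure (same json_response as above; this is just exploration, not part of the pipeline):

print type(json_response)              # list
print type(json_response[0])           # dict
print sorted(json_response[0].keys())  # field names of the first exercise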
I am trying to capture all of the claim text in a bunch of XML patent files, but I'm having trouble with tags nested within <claim-text>. Sometimes there's another <claim-text> inside, and sometimes there is also a <claim-ref> interrupting the text. In my output, the text gets cut off. Usually there are over 10 claims. I am trying to get only the text inside the claim text.
I've already looked and tried the following but these don't work:
xml elementree missing elements python and
How to get all sub-elements of an element tree with Python ElementTree?
I've included a snippet here as it does get quite long to capture all.
My code for this is below (where fullname is the file name and directory).
from lxml.etree import iterparse  # assuming lxml; xml.etree.ElementTree offers the same iterparse

for _, elem in iterparse(fullname):
    description = ''  # reset to empty string at beginning of each loop
    abtext = ''       # reset to empty string at beginning of each loop
    claimtext = ''    # reset to empty string
    if elem.tag == 'claims':
        for node4 in tree.findall('.//claims/claim/claim-text'):
            claimtext = claimtext + node4.text
        f.write('\n\nCLAIMTEXT\n\n\n')
        f.write(smart_str(claimtext) + '\n\n')
        # put row in df
        # (tree, f, data, cat, i and df are defined earlier in my script)
        row = dict(zip(['PATENT_ID', 'CLASS', 'ABSTRACT', 'DESCRIPTION', 'CLAIMS'],
                       [data, cat, abtext, description, claimtext]))
        row_s = pd.Series(row)
        row_s.name = i
        df = df.append(row_s)
So the resulting problem is twofold: a) only one of the texts gets printed to file, and b) nothing comes into the dataframe at all. I'm not sure if that's one problem or two separate problems. I can get the claims to print into a file, and that works fine but skips some of the text.
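One behaviour worth knowing here, since it matches the cut-off symptom: in lxml (as in ElementTree), node.text only returns the text before the first child element, so a nested <claim-ref> or <claim-text> truncates it. itertext() walks the element and all of its descendants; a small sketch reusing node4 from the loop above:

# node4.text stops at the first nested tag (e.g. <claim-ref>);
# ''.join(node4.itertext()) collects the text of the element and all descendants
claimtext = claimtext + ''.join(node4.itertext())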
I'm attempting to save data from several lists in XML format, but I cannot understand how to make the XML display properly. An example of my code right now is as follows:
from lxml import etree

#Create XML Root
articles = etree.Element('root')

#Create Lists & Data
t_list = ['title1', 'title2', 'title3', 'title4', 'title5']
c_list = ['content1', 'content2', 'content3', 'content4', 'content5']
sum_list = ['summary1', 'summary2', 'summary3', 'summary4', 'summary5']
s_list = ['source1', 'source2', 'source3', 'source4', 'source5']

i = 0
for t in t_list:
    for i in range(len(t_list)):
        #Create SubElements of XML Root
        article = etree.SubElement(articles, 'Article')
        titles = etree.SubElement(article, 'Title')
        summary = etree.SubElement(article, 'Summary')
        source = etree.SubElement(article, 'Source')
        content = etree.SubElement(article, 'Content')
        #Add List Data to SubElements
        titles.text = t_list[i]
        summary.text = sum_list[i]
        source.text = s_list[i]
        content.text = c_list[i]

print(etree.tostring(articles, pretty_print=True))
My current output is written in a very jumbled fashion, all on a single line, as follows:
b'<root>\n <Article>\n <Title>title1</Title>\n <Summary>summary1</Summary>\n <Source>source1</Source>\n <Content>content1</Content>\n </Article>\n
It looks like pretty_print within lxml is adding the proper indentation and \n breaks, as I would want, but they don't seem to be interpreted during output; everything is written on a single line.
The output I'm trying to get is as follows:
<root>
  <Article>
    <Title>title1</Title>
    <Summary>summary1</Summary>
    <Source>source1</Source>
    <Content>content1</Content>
  </Article>
Ideally, I'd like for my output to be viewed as a valid XML document, and display in proper nested format.
Your "Current Output" is the representation (internal python representation) of the bytestring generated by etree.tostring(), and seems that in Python3 print(somebytestring) prints the representation instead of the actual string.
Hopefully the solution is quite simple: just pass the desired encoding to etree.tostring(), ie:
xml = etree.tostring(articles, encoding="unicode", pretty_print=True)
print(xml)
I've only used the base ET module in Python and can't find an lxml download for Python 3.5 (which I'm on) to test it, but the b before the line indicates bytes, and a quick glance at the documentation indicates that tostring() has an encoding keyword. Setting it to "unicode" makes tostring() return str; with encoding='utf-8' you would still get bytes that you'd have to decode before printing.
I'll also mention that you don't need to set "i" before your for-loop (Python creates the "i" it needs for the loop), and, personally, I would zip the lists and iterate over the items themselves (though that won't have any real impact on the code in this situation).
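A minimal sketch of that zip-based loop, reusing the names from the question:

for t, summ, src, c in zip(t_list, sum_list, s_list, c_list):
    article = etree.SubElement(articles, 'Article')
    etree.SubElement(article, 'Title').text = t
    etree.SubElement(article, 'Summary').text = summ
    etree.SubElement(article, 'Source').text = src
    etree.SubElement(article, 'Content').text = c

print(etree.tostring(articles, encoding='unicode', pretty_print=True))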
So I'm trying to take data from a website and parse it into an object. The data is separated by vertical bars ("|"). However, when I split my string using .split('|'), I get
TypeError: 'str' does not support the buffer interface
I am still attempting to learn Python. I have done some digging on this error, but every example I could find related to sending or receiving data, and there was a lot of jargon I was unfamiliar with. One solution said to use .split(b'|'), but then this somehow turned my string into bytes and prevented me from printing it in the final line.
Below is my code. Any help you can offer would be greatly appreciated!
import urllib.request

with urllib.request.urlopen('ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqtraded.txt') as response:
    html = response.read()

rawStockList = html.splitlines()
stockList = []
for i in rawStockList:
    stockInfo = i.split('|')
    try:
        # Stock is a class defined elsewhere in my script
        stockList.append(Stock(stockInfo[1], stockInfo[2], stockInfo[5], stockInfo[10], 0))
    except IndexError:
        pass
print([str(item) for item in stockList])
The items inside rawStockList are actually of type bytes, since response.read() returns bytes (it doesn't necessarily know the encoding). You need to turn each one into a proper string by decoding it with an encoding. Assuming the file is encoded in UTF-8, you need something like:
for i in rawStockList:
    stockInfo = i.decode('utf8').split('|')
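Alternatively, a sketch that decodes the whole response once, right after reading it, so everything downstream is already str (still assuming the file is UTF-8):

html = response.read().decode('utf8')
rawStockList = html.splitlines()  # a list of str now, so i.split('|') works unchanged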