Cannot remove headers-first XPATH in Selenium - python

I need some help with the issue below.
My code:
def rows_to_sheet():
    line = 1
    for i in data:
        browser.get(i)
        trs = browser.find_elements_by_xpath("/html/body/table[3]/tbody/tr")
        for n, tr in enumerate(trs):
            row = [td.text for td in tr.find_elements_by_tag_name("td")]
            worksheet.write_row("A{}".format(line), row)
            line += 1
tr[1] contains the headers, which are written to excel.xlsx in a separate function:
/html/body/table[3]/tbody/tr[1]/td
The data I want to write to Excel comes from tr[2], tr[3], tr[4], and so on, depending on the search keyword:
/html/body/table[3]/tbody/tr[2]/td
/html/body/table[3]/tbody/tr[3]/td
/html/body/table[3]/tbody/tr[4]/td
/html/body/table[3]/tbody/tr[5]/td
However, tr[1] (the header row) is also written to excel.xlsx each time, since I could not exclude "tr[1]" from the code block.
Could anyone help me avoid/remove the "tr[1]" iteration in my code, please?
Many thanks in advance!
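One common way to skip the header row is to slice it off the list of matched rows (or to restrict the XPath itself with position() > 1). A minimal sketch along those lines, reusing the names data, browser, and worksheet from the question:

def rows_to_sheet():
    line = 1
    for url in data:
        browser.get(url)
        trs = browser.find_elements_by_xpath("/html/body/table[3]/tbody/tr")
        # trs[0] corresponds to tr[1], the header row, so slice it off
        for tr in trs[1:]:
            row = [td.text for td in tr.find_elements_by_tag_name("td")]
            worksheet.write_row("A{}".format(line), row)
            line += 1

Alternatively, the XPath "/html/body/table[3]/tbody/tr[position()>1]" matches only the data rows, so no slicing is needed.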

Related

How to find an average from a specific row of a csv file by using loops in python?

import csv

f = open('TB_burden_countries_2014-09-29.csv')
for row in csv.reader(f):
    print(row[7])
This basically reads the file and prints out that specific column of each row; now, how do I find the average of that column using loops? Thank you.
After a quick Google search I found this post by "Billy":
Formatting data in a CSV file (calculating average) in python
Basically, you want statistics of each row. In general you should do something like this:

import csv

with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        # csv.reader yields strings, so convert the scores to numbers first
        scores = [int(s) for s in row[1:]]
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX': max(scores),
            'MIN': min(scores),
            'AVE': 1.0 * sum(scores) / len(scores),
        }
        output_mesg = "name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Try not to worry about whether a specific step is locally inefficient. A good Pythonic script should be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending to row won't change anything, since row is a local variable in the for loop and will be garbage collected.
row[1:3] only gives the second and third elements. row[1:4] gives what you want, as does row[1:]. Indexing in Python is normally end-exclusive.
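For example, with end-exclusive slicing:

row = ['Alice', '90', '85', '88']  # a hypothetical CSV row
row[1:3]  # ['90', '85'] -- second and third elements only
row[1:4]  # ['90', '85', '88']
row[1:]   # ['90', '85', '88']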
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get the work done as quickly as possible with the least effort? Can I finish this task in 30 seconds?
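Coming back to the original question of averaging a single field across all rows, a minimal sketch (assuming column index 7 holds numeric values and the file starts with a header row):

import csv

total = 0.0
count = 0
with open('TB_burden_countries_2014-09-29.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row (assumed to be present)
    for row in reader:
        if row[7]:  # ignore empty cells
            total += float(row[7])
            count += 1

print(total / count if count else 'no data in column 7')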

Loop through changing xpath values w/ Selenium

I'm working on scraping a site that has a dropdown menu of hundreds of schools. I am trying to go through and grab tables only for schools from a certain district in the state. So far I have isolated the values for those schools, but I've been unable to substitute the XPath value with what is stored in my dataframe/list.
Here is my code:
ousd_list = ousd['name'].to_list()

for i in range(0, 129):
    n = 0
    driver.find_element_by_xpath(('"//option[#value="', ousd_list[n], ']"'))
    driver.find_elements_by_name("submit1").click()
    table = driver.find_elements_by_id("ContentPlaceHolder1_grdDisc")
    tdf = pd.read_html(table)
    tdf.to_csv(index=False)
    n += 1
    driver.get('https://dq.cde.ca.gov/dataquest/Expulsion/ExpSearchName.asp?TheYear=2018-19&cTopic=Expulsion&cLevel=School&cName=&cCounty=&cTimeFrame=S')
I suspect the issue is on the find_element_by_xpath line, but I'm not sure how else I would go about resolving this issue. Any advice?
The mistake is not in the scraping part but in your code logic: because you set n = 0 at the beginning of the loop body, it resets to 0 on every iteration, so each pass just finds ousd_list[0].
Try:
ousd_list = ousd['name'].to_list()

for ousd_name in ousd_list:
    driver.find_element_by_xpath(f'//option[@value="{ousd_name}"]')
    driver.find_element_by_name("submit1").click()
    table = driver.find_element_by_id("ContentPlaceHolder1_grdDisc")
    # read_html expects HTML text and returns a list of DataFrames
    tdf = pd.read_html(table.get_attribute('outerHTML'))[0]
    tdf.to_csv(index=False)
    driver.get('https://dq.cde.ca.gov/dataquest/Expulsion/ExpSearchName.asp?TheYear=2018-19&cTopic=Expulsion&cLevel=School&cName=&cCounty=&cTimeFrame=S')
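If the school list is a real <select> element, Selenium's Select helper is another way to pick each option by its value. A rough sketch, where 'cName' is only a guess at the dropdown's name attribute:

from selenium.webdriver.support.ui import Select

for ousd_name in ousd_list:
    # 'cName' is a placeholder; use the dropdown's actual name or id
    dropdown = Select(driver.find_element_by_name('cName'))
    dropdown.select_by_value(ousd_name)
    driver.find_element_by_name("submit1").click()
    # ... scrape the table, then driver.get(...) back to the search page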

Using copy.deepcopy with python-pptx to add a column to a table leads to cell attributes being corrupted

I'm trying to append a column to a table in PowerPoint using python-pptx. A number of threads mention the solution:
def append_col(prs_obj, sl_i, sh_i):
    # prs_obj is a pptx.Presentation('path') object.
    # sl_i and sh_i are int indexes used to locate a particular table object.
    tab = prs_obj.slides[sl_i].shapes[sh_i].table
    new_col = copy.deepcopy(tab._tbl.tblGrid.gridCol_lst[-1])
    tab._tbl.tblGrid.append(new_col)  # copies last grid element
    for tr in tab._tbl.tr_lst:
        # duplicate last cell of each row
        new_tc = copy.deepcopy(tr.tc_lst[-1])
        tr.append(new_tc)
        cell = _Cell(new_tc, tr.tc_lst)
        cell.text = '--'
    return tab
After running this and opening the file in PowerPoint, the new column is there, but it doesn't contain the cell.text. If you click in the new cell and type, the letters appear in the cell of the previous column. Saving in PowerPoint lets you edit the column as normal, but by then you've obviously lost the cell.text (and formatting).
QUESTION UPDATE 1 - FOLLOWING COMMENT FROM @scanny
For the simplest possible case, a 1x3 table like |xx|--|xx|, the tab._tbl.xml dumps before and after appending the column are shown in these screenshots: xml diff 1, xml diff 2, xml diff 3, xml diff 4.
QUESTION UPDATE 2 - FOLLOWING COMMENT FROM @scanny
I modified the above append_col function to forcibly remove the extLst element from the copied gridCol. This stopped the problem of typing in one cell and text appearing in another cell.
def append_col(prs_obj, sl_i, sh_i):
    # existing lines removed for brevity
    # New Code
    tblchildren = tab._tbl.getchildren()
    for child in tblchildren:
        if isinstance(child, oxml.table.CT_TableGrid):
            ws = set()
            for j in child:
                if j.w not in ws:
                    ws.add(j.w)
                else:
                    for elem in j:
                        j.remove(elem)
    return tab
However, cell.text (and formatting) are still missing. Moreover, manually saving the presentation changes the tab.xml object back. The screenshots before and after manually opening and saving the PowerPoint presentation are:
AFTER removing extLst, before manual save - xml diff 1
AFTER removing extLst, AFTER manual save - xml diff 2
If you're serious about solving this sort of problem, you'll need to reverse-engineer the PowerPoint XML for this aspect of tables.
The place to start is with before and after (adding a column) XML dumps of the table, identifying the changes PowerPoint makes, then duplicating those that matter (things like revision numbers probably don't matter).
This process is easier with a small example, say going from a 2 x 2 table to a 2 x 3 table.
You can get the XML for a python-pptx XML element using its .xml attribute, like:
print(tab._tbl.xml)
You could compare the deepcopy results and then have concrete differences from which to start explaining why it isn't working. I expect you'll find that table items have unique ids, and when you duplicate those, funky things happen.
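A minimal way to capture those before/after dumps for comparison (file names are just placeholders):

before_xml = tab._tbl.xml  # snapshot before appending the column
# ... run the append_col logic here ...
after_xml = tab._tbl.xml   # snapshot after appending the column

with open('before.xml', 'w') as f:
    f.write(before_xml)
with open('after.xml', 'w') as f:
    f.write(after_xml)
# diff before.xml and after.xml to see exactly what changed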
With help from Scanny, I've come up with the following workaround which works:
import copy

from pptx.table import _Cell
import pptx.oxml as oxml  # assumption: this is how the oxml name used below is bound


def append_col(prs_obj, sl_i, sh_i):
    tab = prs_obj.slides[sl_i].shapes[sh_i].table
    new_col = copy.deepcopy(tab._tbl.tblGrid.gridCol_lst[-1])
    tab._tbl.tblGrid.append(new_col)  # copies last grid element
    for tr in tab._tbl.tr_lst:
        new_tc = copy.deepcopy(tr.tc_lst[-1])
        tr.tc_lst[-1].addnext(new_tc)
        cell = _Cell(new_tc, tr.tc_lst)
        for paragraph in cell.text_frame.paragraphs:
            for run in paragraph.runs:
                run.text = '--'
    tblchildren = tab._tbl.getchildren()
    for child in tblchildren:
        if isinstance(child, oxml.table.CT_TableGrid):
            ws = set()
            for j in child:
                if j.w not in ws:
                    ws.add(j.w)
                else:
                    # print('j:\n', j.xml)
                    for elem in j:
                        j.remove(elem)
    return tab
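A quick usage sketch (file names and indexes below are placeholders, not from the original post):

from pptx import Presentation

prs = Presentation('deck_in.pptx')  # placeholder input file
append_col(prs, sl_i=0, sh_i=0)     # slide 0, shape 0 assumed to hold the table
prs.save('deck_out.pptx')           # placeholder output file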

Loop list in Python Selenium

I have a main page with links to 5 other pages, which have the following XPaths from tr[1] to tr[5]:
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[1]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[2]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[3]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[4]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[5]/td[3]/div[1]/a
Inside every page, I perform the following actions:
driver.find_element_by_name('key').send_keys('test_1')
driver.find_element_by_name('i18n[en_EN][value]').send_keys('Test 1')
# and at the end this takes me back to the main page again
driver.find_element_by_xpath('/html/body/div[3]/div[2]/div/div[3]/div/ul/li[2]/a').click()
How can I iterate so that the script goes through all 5 pages and performs the above actions? I tried a for loop, but I guess I didn't do it right... any help would be much appreciated.
You can try this:
xpath = '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[{}]/td[3]/div[1]/a'

for i in range(1, 6):
    driver.find_element_by_xpath(xpath.format(i)).click()
Seems like I figured it out, so here is the answer that works for me now.
wls = ['/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[1]/td[3]/div[1]/a',
       '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[2]/td[3]/div[1]/a',
       '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[3]/td[3]/div[1]/a',
       '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[4]/td[3]/div[1]/a',
       '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[5]/td[3]/div[1]/a']

for i in wls:
    driver.find_element_by_xpath(i).click()
Another option, shown below, builds each XPath from a template:
template = '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[{}]/td[3]/div[1]/a'

for x in range(1, 6):
    a = template.format(x)
    print(a)
    # do what you need to do with the 'a' element.
Output:
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[1]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[2]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[3]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[4]/td[3]/div[1]/a
/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[5]/td[3]/div[1]/a
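Putting the pieces together, a rough sketch of the whole loop might look like this (the link template, field names, and back-navigation XPath come from the question; the values typed into the fields are placeholders):

link_template = '/html/body/div[3]/div[2]/div/div[5]/div/div[2]/table/tbody/tr[{}]/td[3]/div[1]/a'

for i in range(1, 6):
    # open the i-th linked page from the main page
    driver.find_element_by_xpath(link_template.format(i)).click()

    # actions inside the page (placeholder values)
    driver.find_element_by_name('key').send_keys('test_{}'.format(i))
    driver.find_element_by_name('i18n[en_EN][value]').send_keys('Test {}'.format(i))

    # back to the main page
    driver.find_element_by_xpath('/html/body/div[3]/div[2]/div/div[3]/div/ul/li[2]/a').click()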

Python: No Traceback when Scraping Data into Excel Spreadsheet

I'm an inexperienced coder working in Python. I wrote a script to automate a process in which certain information is scraped from a webpage and then pasted into a new Excel spreadsheet. I've written and executed the code, but the Excel spreadsheet I designated to receive the data is completely empty. Worst of all, there is no traceback error. Could you help me find the problem in my code? And how do you generally solve your own problems when no traceback error is provided?
import xlsxwriter, urllib.request, string

def main():
    #gets the URL for the expert page
    open_sesame = urllib.request.urlopen('https://aries.case.com.pl/main_odczyt.php?strona=eksperci')
    #reads the expert page
    readpage = open_sesame.read()
    #opens up a new file in excel
    workbook = xlsxwriter.Workbook('expert_book.xlsx')
    #adds worksheet to file
    worksheet = workbook.add_worksheet()
    #initializing the variable used to move names and dates
    #in the excel spreadsheet
    boxcoA = ""
    boxcoB = ""
    #initializing expert attribute variables and lists
    expert_name = ""
    url_ticker = 0
    name_ticker = 0
    raw_list = []
    url_list = []
    name_list = []
    date_list = []
    #this loop goes through and finds all the lines
    #that contain the expert URL and name and saves them to raw_list::
    #raw_list loop
    for i in readpage:
        i = str(i)
        if i.startswith('<tr><td align=left><a href='):
            raw_list += i
    #this loop goes through the lines in raw list and extracts
    #the name of the expert, saving it to a list::
    #name_list loop
    for n in raw_list:
        name_snip = n.split('target=_blank>','</a></td><')[1]
        name_list += name_snip
    #this loop fills a list with the dates the profiles were last updated::
    #date_list
    for p in raw_list:
        url_snipoff = p[28:]
        url_snip = url_snipoff.split('"')[0]
        url_list += url_snip
        expert_url = 'https://aries.case.com.pl/' + url_list[url_ticker]
        open_expert = urllib2.openurl(expert_url)
        read_expert = open_expert.read()
        for i in read_expert:
            if i.startswith('<p align=left><small>Last update:'):
                update = i.split('Last update:','</small>')[1]
        open_expert.close()
        date_list += update
    #now that we have a list of expert names and a list of profile update dates
    #we can work on populating the excel spreadsheet
    #this operation will iterate just as long as the list is long
    #meaning that it will populate the excel spreadsheet
    #with all of our names and dates that we wanted
    for z in raw_list:
        boxcoA = string('A',z)
        boxcoB = string('B',z)
        worksheet.write(boxcoA, name_list[z])
        worksheet.write(boxcoB, date_list[z])
    workbook.close()
    print('Operation Complete')

main()
The lack of a traceback only means your code raises no exceptions. It does not mean your code is logically correct.
I would look for logic errors by adding print statements, or using a debugger such as pdb or pudb.
One problem I notice with your code is that the first loop seems to presume that i is a line, whereas it is actually a single character. You might find splitlines() more useful here.
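A small sketch of that suggestion (decoding as UTF-8 is an assumption about the page's encoding):

import urllib.request

open_sesame = urllib.request.urlopen('https://aries.case.com.pl/main_odczyt.php?strona=eksperci')
readpage = open_sesame.read().decode('utf-8', errors='replace')  # bytes -> str

# each i is now a whole line of HTML, not a single character
for i in readpage.splitlines():
    if i.startswith('<tr><td align=left><a href='):
        print(i)  # or collect it, e.g. raw_list.append(i)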
If there is no traceback, then no error is being raised.
Most likely something has gone wrong in your scraping/parsing code and your raw_list or other lists aren't populated.
Try printing out the data that should be written to the worksheet in the last loop, to see whether there is any data to write at all.
If you aren't writing data to the worksheet, then it will be empty.
