I'm trying to write a web spider to gather some links and text.
I'm working with a table where the second cell of each row contains a number. All I want to do is read that number and, if it's one I need, grab the links and text in cells 2 and 4.
Everything works fine except that I can't seem to compare the numbers from the cell to a list of numbers I have.
I get the number using cells[1].get_text() (I create a list of all the cells for each row). This works fine and type() returns <class 'str'>; I also make sure to convert my numbers list to strings.
But when I try to compare them, the comparison always returns False.
import bs4

file = open(r"some html file", 'rb')
rng_lst = [str(x) for x in range(5, 43)]
soup = bs4.BeautifulSoup(file, 'html.parser')  # parser named explicitly to avoid the bs4 warning
table = soup.findAll('table')[0]
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) >= 6:
        check = cells[1].get_text()
        for n in rng_lst:
            if n == check:
                pass  # do stuff
I've tried everything I can think of and I ALWAYS get False. Using == or is doesn't work; if I try using in it does match, but then when I need cell number 5 I can also get 15 or 25.
Most likely, you just need to strip the text you are getting from a cell:
check = cells[1].get_text(strip=True)
It is still a guess, but an "educated" one.
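To see the failure mode concretely, here is a small illustration (the exact whitespace will depend on your HTML): cell text scraped from HTML often carries an invisible newline or surrounding spaces, so the equality test fails even though the digits match:

raw = ' 5\n'               # what get_text() can return for a cell
print(raw == '5')          # False - hidden whitespace defeats the comparison
print(raw.strip() == '5')  # True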
I am trying to find the numbers in a DataFrame of URLs that are 8 to 16 digits in length. There are thousands of URLs and there is no pattern: the number sometimes appears in the middle, sometimes at the end. The only pattern I see is that there is always an "=" before the number. I want to save the extracted results to a column in the DataFrame.
I tried the below; it works for some URLs but not all. Please help.
Example 1 (works):
url = "http://www.dx.com/cgi-bin/tracking?action=track&language=english&ascend_header=1&cntry_code=us&initial=x&mps=y&tracknumbers=9261297937924338299022"
url.partition("&tracknumbers=")[2]
Result: 9261297937924338299022
Example 2 (fails):
url = "http://www.dx.com/track/?trknbr=279076160403&utm_source=email&utm_medium=flow-email&utm_campaign=Email%20%231%20%28UbXvKS%29&_kx=t2f6aIumzJbeNUfOHnSk_hHhn4e7OS4SAoAiz2KwVYg%3D.Nv6kNb"
url.partition("?trknbr=")[2]
Result: 279076160403&utm_source=email&utm_medium=flow-email&utm_campaign=Email%20%231%20%28UbXvKS%29&_kx=t2f6aIumzJbeNUfOHnSk_hHhn4e7OS4SAoAiz2KwVYg%3D.Nv6kNb
I want to get only the number.
import re

PATTERN = re.compile(r"\w*=(\d{8,16})")

def find_numbers(url):
    return PATTERN.findall(url)

# update your dataframe
df["values"] = df["URL"].map(find_numbers)
I'm trying to iterate through a list of NFL QBs (over 100) and create a list of links that I will use later.
The links follow a standard format; however, if there are multiple players with the same name (such as 'Josh Allen'), the link format needs to change.
I've been trying to do this with different nested while/for loops with try/except, with little to no success. This is what I have so far:
import pandas as pd

test = ['Josh Allen', 'Lamar Jackson', 'Derek Carr']
empty_list = []
name_int = 0
for names in test:
    try:
        q_b_name = names.split()
        link1 = q_b_name[1][0].capitalize()
        link2 = q_b_name[1][0:4].capitalize() + q_b_name[0][0:2].capitalize() + f'0{name_int}'
        q_b = pd.read_html(f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/')
        q_b1 = q_b[0]
        # filter_stats is a function that only works with QB data
        df = filter_stats(q_b1)
        # triggers the except if the link wasn't a QB
        df.head(5)
        empty_list.append(f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/')
    except:
        # adds one to the variable to change the link to find the proper QB link
        name_int += 1
The result only appends the final correct link. I need to append each correct link to the empty list.
Still a beginner in Python and trying to challenge myself with different projects. Thanks!
As stated, the try/except will work in that it will try the code under the try block. If at any point within that block it fails or raises an exception/error, it goes and executes the block of code under the except.
There are better ways to go about this problem (for example, I'd use BeautifulSoup to simply check the HTML for the "QB" position), but since you are a beginner, I think working through this process will help you understand the loops.
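For reference, a rough sketch of that alternative (the markup details are an assumption - check the actual page HTML before relying on it):

import requests
from bs4 import BeautifulSoup

def is_qb(url):
    # Fetch the player page and look for the position marker in its text
    resp = requests.get(url)
    if resp.status_code != 200:
        return False
    soup = BeautifulSoup(resp.text, 'html.parser')
    text = soup.get_text(' ', strip=True)
    return 'Position: QB' in text  # assumes the page renders position this way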
So what this code does:
1) It formats your player name into the link format.
2) It enters a while loop that keeps trying player-ID suffixes until it finds a QB table (or gives up).
3) It gets the table.
4a) It calls a function that checks whether the table contains 'passing' stats by looking at the column headers.
4b) If it finds 'passing' in the columns, it returns True to indicate it is a "QB" type of table (keep in mind sometimes there might be running backs or other positions who have passing stats, but we'll ignore that). If it returns True, the while loop stops and we move to the next name in your test list.
4c) If it returns False, it increments your name_int and checks the next link.
5) To take care of the case where it never finds a QB table, the while loop gives up after 10 attempts.
Code:
import pandas as pd

def check_stats(q_b1):
    for col in q_b1.columns:
        if 'passing' in col.lower():
            return True
    return False

test = ['Josh Allen', 'Lamar Jackson', 'Derek Carr']
empty_list = []

for names in test:
    name_int = 0
    q_b_name = names.split()
    link1 = q_b_name[1][0].capitalize()
    qbStatsInTable = False
    while qbStatsInTable == False:
        link2 = q_b_name[1][0:4].capitalize() + q_b_name[0][0:2].capitalize() + f'0{name_int}'
        url = f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/'
        try:
            q_b = pd.read_html(url, header=0)
            q_b1 = q_b[0]
        except Exception as e:
            print(e)
            break
        # Check if "passing" is in the table columns
        qbStatsInTable = check_stats(q_b1)
        if qbStatsInTable == True:
            print(f'{names} - Found QB Stats in {link1}/{link2}/gamelog/')
            empty_list.append(url)
        else:
            name_int += 1
            if name_int == 10:
                print(f'Did not find a link for {names}')
                break  # give up on this name
Output:
print(empty_list)
['https://www.pro-football-reference.com/players/A/AlleJo02/gamelog/', 'https://www.pro-football-reference.com/players/J/JackLa00/gamelog/', 'https://www.pro-football-reference.com/players/C/CarrDe02/gamelog/']
I'm trying to append a column to a table in PowerPoint using python-pptx. A number of threads mention the solution:
import copy
from pptx.table import _Cell

def append_col(prs_obj, sl_i, sh_i):
    # prs_obj is a pptx.Presentation('path') object.
    # sl_i and sh_i are int indexes to locate a particular table object.
    tab = prs_obj.slides[sl_i].shapes[sh_i].table
    new_col = copy.deepcopy(tab._tbl.tblGrid.gridCol_lst[-1])
    tab._tbl.tblGrid.append(new_col)  # copies last grid element
    for tr in tab._tbl.tr_lst:
        # duplicate last cell of each row
        new_tc = copy.deepcopy(tr.tc_lst[-1])
        tr.append(new_tc)
        cell = _Cell(new_tc, tr.tc_lst)
        cell.text = '--'
    return tab
After running this, when you open PowerPoint the new column will be there, but it won't contain the cell text. If you click in the new column's cell and type, the letters appear in the cell of the previous column. Saving the presentation in PowerPoint lets you edit the column as normal, but obviously you've lost the cell text (and formatting).
QUESTION UPDATE 1 - FOLLOWING COMMENT FROM @scanny
For the simplest possible case, a 1x3 table like |xx|--|xx|, the tab._tbl.xml prints before and after appending the column are:
[Screenshots: xml diffs 1-4 of tab._tbl before and after appending the column]
QUESTION UPDATE 2 - FOLLOWING COMMENT FROM @scanny
I modified the above append_col function to forcibly remove the extLst element from the copied gridCol. This stopped the problem of typing in one cell and the text appearing in another.
def append_col(prs_obj, sl_i, sh_i):
    # ... existing lines removed for brevity ...
    # New code: for gridCols whose width was already seen, strip their
    # child elements (the copied extLst)
    tblchildren = tab._tbl.getchildren()
    for child in tblchildren:
        if isinstance(child, oxml.table.CT_TableGrid):
            ws = set()
            for j in child:
                if j.w not in ws:
                    ws.add(j.w)
                else:
                    for elem in j:
                        j.remove(elem)
    return tab
However, the cell text (and formatting) is still missing. Moreover, manually saving the presentation changes the table XML back. The screenshots before and after manually saving the PowerPoint presentation are:
[Screenshot: xml diff 1 - after removing extLst, before manual save]
[Screenshot: xml diff 2 - after removing extLst, after manual save]
If you're serious about solving this sort of problem, you'll need to reverse-engineer the PowerPoint XML for this aspect of tables.
The place to start is with before and after (adding a column) XML dumps of the table, identifying the changes made by PowerPoint, then duplicating those that matter (things like revision numbers probably don't matter).
This process is simplified by having a small example, say a 2x2 table going to a 2x3 table.
You can get the XML for a python-pptx XML element using its .xml attribute, like:
print(tab._tbl.xml)
You could compare the deepcopy results and then have concrete differences to start to explain the results not working. I expect you'll find that table items have unique ids, and when you duplicate those, funky things happen.
With help from scanny, I've come up with the following workaround, which works:
import copy
import pptx.oxml as oxml  # assumed import: provides the CT_TableGrid class used below
from pptx.table import _Cell

def append_col(prs_obj, sl_i, sh_i):
    tab = prs_obj.slides[sl_i].shapes[sh_i].table
    new_col = copy.deepcopy(tab._tbl.tblGrid.gridCol_lst[-1])
    tab._tbl.tblGrid.append(new_col)  # copies last grid element
    for tr in tab._tbl.tr_lst:
        new_tc = copy.deepcopy(tr.tc_lst[-1])
        tr.tc_lst[-1].addnext(new_tc)  # insert as sibling instead of plain append
        cell = _Cell(new_tc, tr.tc_lst)
        for paragraph in cell.text_frame.paragraphs:
            for run in paragraph.runs:
                run.text = '--'
    tblchildren = tab._tbl.getchildren()
    for child in tblchildren:
        if isinstance(child, oxml.table.CT_TableGrid):
            ws = set()
            for j in child:
                if j.w not in ws:
                    ws.add(j.w)
                else:
                    # print('j:\n', j.xml)
                    for elem in j:
                        j.remove(elem)
    return tab
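For completeness, a minimal usage sketch (the file name and indexes are placeholders, and it assumes the shape at that index is actually a table):

from pptx import Presentation

prs = Presentation('deck.pptx')        # hypothetical file
tab = append_col(prs, sl_i=0, sh_i=0)  # first shape on first slide
prs.save('deck_with_extra_col.pptx')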
(Code below)
I'm scraping a website, and the data I'm getting back is in two multi-dimensional arrays. I want everything in JSON format because I want to save it and load it in again later, when I add "tags".
So, less vague: I'm writing a program which takes in data like what characters you have and what the missions require you to do (you can complete multiple at once if the attributes align), then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data, but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name, so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have two arrays: one for the headers of the table and one for the rows of the table. The rows contain the "answers" to the headers' "questions"/"titles", e.g. Maximum Level, 50.
This is true for everything but the first entry, which is the Name, Pronunciation (and I just want to store the name, of course).
So:
Iterations = 0
While Iterations <= (RowArray length / 9):
    HeaderArray[0] gives me the name
    RowArray[Iterations + 1] gives me data type 2
    RowArray[Iterations + 2] gives me data type 3
    ... repeat until RowArray[Iterations + 8]
    Iterations += 9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure that's going to make this easier, because my end goal here is to send in "CharacterName" and get stuff back based on that, AND to send in "DesiredTraits" and get back the character names that fit those traits. Which means I also need to figure out how to store that category data semi-efficiently. There are over 80 possible categories and most characters only fit about 10, and I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code-readability reasons - I don't want a file for each character.
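For what it's worth, one way to sidestep the parallel-list bookkeeping is a dict keyed by character name plus a reverse index from trait to names - a minimal sketch, with hypothetical names and data:

characters = {}  # name -> {header: value}
by_trait = {}    # trait -> set of names that have it

def add_character(name, headers, rows, traits):
    characters[name] = dict(zip(headers, rows))
    for trait in traits:
        by_trait.setdefault(trait, set()).add(name)

def rank_by_traits(desired):
    # Characters matching the most desired traits come first
    def matches(name):
        return sum(name in by_trait.get(t, set()) for t in desired)
    return sorted(characters, key=matches, reverse=True)

add_character('Sully', ['Maximum Level', 'Box Type'], ['50', 'Premium'],
              ['Blue', 'Monsters Inc', 'Male'])
print(rank_by_traits({'Blue', 'Monsters Inc', 'Male'}))  # ['Sully']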
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os

target_dir = r"D:\00Coding\Js\WebScraper"  # Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir, 'TsumData.txt')

StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links

def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip()) #This is everything
            rows = td.findChildren('tr')#[0]
            headers = td.findChildren('th')#[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub(r'\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content)
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub(r'\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content)
    print(HeaderArray)
    print(RowArray)
    return RowArray, HeaderArray
    #Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
    #print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
    #TempFile = open(fullname, 'w') #Read only, Write Only, Append
    #TempFile.write("EHLLO")
    #TempFile.close()
    #print(td.tbody.Series)
    #print(td.tbody[Series])
    #print(td.tbody["Series"])
    #print(td.data-name)
    #time.sleep(1)

if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links:  #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38:  #There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainHeaderArray.append(TempHA)
            MainRowArray.append(TempRA)
            MaxIterations -= 1
        Iterations += 1
        #print(MaxIterations)
        if MaxIterations <= 0:  #I don't want to scrape the entire website for a prototype
            break
    #print("This is the end ??")
    #time.sleep(3)
    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    print(MainHeaderArray)
    #time.sleep(2.5)
    #print(MainRowArray)
    #time.sleep(2.5)
    #print(zip())

    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []

    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations:  #This will fire 1 time per Tsum
        print(Iterations)
        print(MainHeaderArray[Iterations][0])  #Holy this gives us Mickey ;
        print(MainHeaderArray[Iterations+1][0])
        print(MainHeaderArray[Iterations+2][0])
        print(MainHeaderArray[Iterations+3][0])
        TsumName.append(MainHeaderArray[Iterations][0])
        print(MainRowArray[Iterations][1])
        #At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        #Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
    print(Iterations)
    print("It's Over")
    time.sleep(3)
    print(TsumName)
    print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back from a short break. With some minor looking over, I see what happened - my append code is saying "append this list to the array", meaning I've got a list of lists for both the header and row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se, but they are definitely two lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my Google searches or wondering why the list stuff isn't working (as it's expecting one value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has, as strings.
So Array[10] = ArrayOfCategories[Tsum] (which contains every attribute, in string form, that the Tsum has).
So that'll be, e.g., TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"].
And then I can just use the "switch" that I've already made to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use with open(...) as json_file and json.dump / json.load to write and read (super easy).
I ultimately stored 3 JSON files. No big deal - much easier than appending everything into one big file.
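A minimal sketch of that pattern (the file name and data structure are placeholders):

import json

data = {'Mickey': {'Series': 'Mickey & Friends', 'Maximum Level': '50'}}

with open('tsum_data.json', 'w') as json_file:
    json.dump(data, json_file, indent=2)

with open('tsum_data.json') as json_file:
    loaded = json.load(json_file)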
I am trying to create a loop involving Pandas/ Python and an Excel file. The column in question is named "ITERATION" and it has numbers ranging from 1 to 6. I'm trying to query the number of hits in the Excel file in the following iteration ranges:
1 to 2
3
4 to 6
I've already made a preset data frame named "df".
iteration_list = ["1,2", "3", "4,5,6"]
i = 1
for k in iteration_list:
table = df.query('STATUS == ["Sold", "Refunded"]')
table["ITERATION"] = table["ITERATION"].apply(str)
table = table.query('ITERATION == ["%s"]' % k)
table = pd.pivot_table(table, columns=["Month"], values=["ID"], aggfunc=len)
table.to_excel(writer, startrow = i)
i = i + 3
The snippet above works only for the number "3". The other two scenarios don't work, because the query literally searches for the string "1,2". I've tried other ways, such as:
iteration_list = [1:2, 3, 4:6]
iteration_list = [{1:2}, 3, {4:6}]
to no avail.
Does anyone have any suggestions?
EDIT
After looking over Stidgeon's answer, I came up with the following alternative. Stidgeon's answer DOES produce output, but not the output I'm looking for (it gives six outputs - one for each iteration from 1 to 6 - in each loop).
Above, my list was the following:
iteration_list = ["1,2", "3", "4,5,6"]
If you play around with the quotation marks, you can input exactly what you want, since your string is inserted literally into this line where the %s is:
table = table.query('ITERATION == ["%s"]' % k)
You can essentially play around with the list to fit your precise needs with quotations. Here is a version that works:
iteration_list = ['1", "2', '3', '4", "5", "6']  # '1", "2' expands to ITERATION == ["1", "2"]
Just focusing on getting the values out of the list of strings, this works for me (though - as always - there may be more Pythonic approaches):

lst = ['1,2', '3', '4,5,6']
for item in lst:
    items = item.split(',')
    for _ in items:
        print(int(_))
Though instead of printing at the end, you can pass each value on to the rest of your script.
This will work if all your strings are either single numbers or numbers separated by commas. If the data aren't consistently formatted like that, you may have to tweak this code.
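For instance, a sketch that feeds the parsed integers straight into the DataFrame filter with isin, instead of building query strings (assumes the ITERATION column holds plain ints):

iteration_list = ['1,2', '3', '4,5,6']
for group in iteration_list:
    iterations = [int(x) for x in group.split(',')]
    subset = df[df["ITERATION"].isin(iterations)]
    # ... pivot and write `subset` as before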