I am using pdfplumber to take input from a pdf file.
My question is how can I take from page 1-7 input using pdfplumber.
I'm using this code:
filename = "1st Year 1stSemester.pdf"
pdf = pdfplumber.open(filename)
totalpages = len(pdf.pages)
p0 = pdf.pages[0-6]
table = p0.extract_table()
table
I want to take input from page 1 to 7
I've also tried p0 = pdf.pages[0,1,2,3,6]. It also doesn't work.
for i in range (0,7):
print(filename.pages[i].extract_text)
To show the pages, use the for loop. Input the desired pages you want to show for example in your case you want to display the pages 1-7, just like how you count an array, it start with 0 till the last page which is 7.
Related
I am trying to find the numbers in the Data frame of URL’s which are 8 to 16 digits in length. There are 1000’s of URL's and there is no pattern. The number sometimes appears in between sometimes at the end. The only pattern I see is the there is always an "=" before the number. I want to save the the extracted results to a Column in DF.
I tried the below, they work for some URL's but not all. Please help
Example- 1 (Works)
url="http://www.dx.com/cgi-bin/tracking?action=track&language=english&ascend_header=1&cntry_code=us&initial=x&mps=y&tracknumbers=9261297937924338299022"
url.partition("&tracknumbers=")[2]
Result- 9261297937924338299022
Example-2 (Failed)
url= "http://www.dx.com/track/?trknbr=279076160403&utm_source=email&utm_medium=flow-email&utm_campaign=Email%20%231%20%28UbXvKS%29&_kx=t2f6aIumzJbeNUfOHnSk_hHhn4e7OS4SAoAiz2KwVYg%3D.Nv6kNb"
url.partition("?trknbr=")[2]
Result- 279076160403&utm_source=email&utm_medium=flow-email&utm_campaign=Email%20%231%20%28UbXvKS%29&_kx=t2f6aIumzJbeNUfOHnSk_hHhn4e7OS4SAoAiz2KwVYg%3D.Nv6kNb
I want to get only the number.
import re
PATTERN = re.compile(r"\w*=(\d{8,16})")
def find_numbers(url):
return PATTERN.findall(url)
# update your dataframe
df["values"] = df["URL"].map(lambda x: find_numbers(x))
I'm working on scraping a site that has a dropdown menu of hundreds of schools. I am trying to go through and grab tables for only schools from a certain district in the state. So far I have isolated the values for only those schools, but I've bee unable to replace the xpath values from what is stored in my dataframe/list.
Here is my code:
ousd_list = ousd['name'].to_list()
for i in range(0,129):
n = 0
driver.find_element_by_xpath(('"//option[#value="',ousd_list[n],']"'))
driver.find_elements_by_name("submit1").click()
table = driver.find_elements_by_id("ContentPlaceHolder1_grdDisc")
tdf = pd.read_html(table)
tdf.to_csv(index=False)
n += 1
driver.get('https://dq.cde.ca.gov/dataquest/Expulsion/ExpSearchName.asp?TheYear=2018-19&cTopic=Expulsion&cLevel=School&cName=&cCounty=&cTimeFrame=S')
I suspect the issue is on the find_element_by_xpath line, but I'm not sure how else I would go about resolving this issue. Any advice?
The mistake is not in the scraping part but your code logic, since you put n=0 in the beginning of your loop, it resets to 0 and every loop will just find your ousd_list[0].
Try,
ousd_list = ousd['name'].to_list()
for ousd_name in ousd_list :
driver.find_element_by_xpath(f'//option[#value="{ousd_name}"]')
driver.find_elements_by_name("submit1").click()
table = driver.find_elements_by_id("ContentPlaceHolder1_grdDisc")
tdf = pd.read_html(table)
tdf.to_csv(index=False)
driver.get('https://dq.cde.ca.gov/dataquest/Expulsion/ExpSearchName.asp?TheYear=2018-19&cTopic=Expulsion&cLevel=School&cName=&cCounty=&cTimeFrame=S')
(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I'm wanting everything to be in a JSON format because I want to save this and load it in again later when I add "tags".
So, less vague. I'm writing a program which takes in data like what characters you have and what missions are requiring you to do (you can complete multiple at once if the attributes align), and then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, 1 for the headers of the table and one for the rows of the table. The rows contain the "Answers" for the Header's "Questions" / "Titles" ; ie Maximum Level, 50
This is true for everything but the first entry which is the Name, Pronunciation (and I just want to store the name of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
HeaderArray[0] gives me the name
RowArray[Iterations + 1] gives me data type 2
RowArray[Iterations + 2] gives me data type 3
Repeat until Array[Iterations + 8]
Iterations +=9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os
target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir,'TsumData.txt')
StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'
def make_soup(url):
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')
return soup
def get_links(url):
soup = make_soup(url)
a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
links = [urljoin(URLPrefix, a['href'])for a in a_tags] # convert relative url to absolute url
return links
def get_tds(link):
soup = make_soup(link)
#tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
tds = soup.find_all('table', class_="wikia-infobox")
RowArray = []
HeaderArray = []
if tds:
for td in tds:
#print(td.text.strip()) #This is everything
rows = td.findChildren('tr')#[0]
headers = td.findChildren('th')#[0]
for row in rows:
cells = row.findChildren('td')
for cell in cells:
cell_content = cell.getText()
clean_content = re.sub( '\s+', ' ', cell_content).strip()
if clean_content:
RowArray.append(clean_content)
for row in rows:
cells = row.findChildren('th')
for cell in cells:
cell_content = cell.getText()
clean_content = re.sub( '\s+', ' ', cell_content).strip()
if clean_content:
HeaderArray.append(clean_content)
print(HeaderArray)
print(RowArray)
return(RowArray, HeaderArray)
#Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
#print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
#TempFile = open(fullname, 'w') #Read only, Write Only, Append
#TempFile.write("EHLLO")
#TempFile.close()
#print(td.tbody.Series)
#print(td.tbody[Series])
#print(td.tbody["Series"])
#print(td.data-name)
#time.sleep(1)
if __name__ == '__main__':
links = get_links(StartURL)
MainHeaderArray = []
MainRowArray = []
MaxIterations = 60
Iterations = 0
for link in links: #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
#print("Getting tds calling")
if Iterations > 38: #There are this many webpages it'll first look at that don't have the data I need
TempRA, TempHA = get_tds(link)
MainHeaderArray.append(TempHA)
MainRowArray.append(TempRA)
MaxIterations -= 1
Iterations += 1
#print(MaxIterations)
if MaxIterations <= 0: #I don't want to scrape the entire website for a prototype
break
#print("This is the end ??")
#time.sleep(3)
#jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
print(MainHeaderArray)
#time.sleep(2.5)
#print(MainRowArray)
#time.sleep(2.5)
#print(zip())
TsumName = []
TsumSeries = []
TsumBoxType = []
TsumSkillDescription = []
TsumFullCharge = []
TsumMinScore = []
TsumScoreIncreasePerLevel = []
TsumMaxScore = []
TsumFullUpgrade = []
Iterations = 0
MaxIterations = len(MainRowArray)
while Iterations <= MaxIterations: #This will fire 1 time per Tsum
print(Iterations)
print(MainHeaderArray[Iterations][0]) #Holy this gives us Mickey ;
print(MainHeaderArray[Iterations+1][0])
print(MainHeaderArray[Iterations+2][0])
print(MainHeaderArray[Iterations+3][0])
TsumName.append(MainHeaderArray[Iterations][0])
print(MainRowArray[Iterations][1])
#At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
#Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
TsumSeries[Iterations] = MainRowArray[Iterations+1]
TsumBoxType[Iterations] = MainRowArray[Iterations+2]
TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
TsumMinScore[Iterations] = MainRowArray[Iterations+5]
TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
Iterations += 9
print(Iterations)
print("It's Over")
time.sleep(3)
print(TsumName)
print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use the with open file as json_file , write/read (super easy).
Ultimately stored 3 json files. No big deal. Much easier than appending into one big file.
There's the website Forbes Most Admired Companies with a list of 50 companies and I am trying to parse that list and export it into a csv file
The code I have only get me 20 because the page load when you scroll down. is there a way to simulate the scroll down or make it load entirely?
from lxml import html
import requests
def schindler(max): # create a list of the companies
page = requests.get('http://beta.fortune.com/worlds-most-admired-companies/list/')
tempContainer = html.fromstring(page.content)
names = []
position = 1
while position <= max:
names.extend(tempContainer.xpath('//*[#id="pageContent"]/div[2]/div/div/div[1]/div[1]/ul/li['+str(position)+']/a/span[2]/text()'))
position = position + 1
return names
(That was only the list creation, no problem with the .csv exporter)
I then print it to chek and only 20 items appear in the list
print(schindler(50))
It would appear that you're able to fetch the data as JSON. The 20 in the url appears to be the rank at which to start and 30 the number of items.
Sample code:
url = "http://fortune.com/api/v2/list/1918408/expand/item/ordering/asc/20/30"
resp = requests.get(url)
for entry in resp.json()['list-items']:
print(entry['rank'], entry['name'])
i'm trying to write a web spider to gather me some links and text.
I have a table i'm working with and the second cell of each row has a number in it, all i want to do is get that number, if it's the one i need then grab the links and text in cell 2&4.
Everything works fine except that i can't seem to be able to compare the numbers from the cell to a list of numbers i have.
I get the number using cells[1].get_text() (i create a list of all the cells for each row), this works fine and the type() returns 'class 'str'', i also make sure to convert my numbers list to string.
But when i try to compare them it always returns 'False'
import bs4
file = open(r"some html file", 'rb')
rng_lst = [str(x) for x in range(5, 43)]
soup = bs4.BeautifulSoup(file)
table = soup.findAll('table')[0]
for row in table.findAll('tr'):
cells = row.findAll('td')
if len(cells) >= 6:
check = cells[1].get_text()
for n in rng_lst:
if n == check:
# do stuff
I've tried everything i can think of and i ALWAYS get 'False', using == or 'is' doesn't work, if i try using 'in' it does work but then if i need cell number 5 i can get 15 or 25 also.
Most likely, you just need to strip the text you are getting from a cell:
check = cells[1].get_text(strip=True)
It is still a guess, but an "educated" one.