Parse HTML and Read HTML Table with Selenium Python

I am converting some of my web-scraping code from R to Python (I can't get geckodriver to work with R, but it works with Python). Anyway, I am trying to understand how to parse and read HTML tables with Python. For quick background, here is my R code:
doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")
WebElem <- readHTMLTable(doc, stringsAsFactors = FALSE)[[7]]
I would parse the HTML page to the doc object. Then I would start with doc[[1]], and move through higher numbers until I saw the data I wanted. In this case I got to doc[[7]] and saw the data I wanted. I then would read that HTML table and assign it to the WebElem object. Eventually I would turn this into a dataframe and play with it.
So what I am doing in Python is this:
from bs4 import BeautifulSoup

html = driver.page_source                 # driver is the running Selenium WebDriver
doc = BeautifulSoup(html, "html.parser")  # parse the page source
Then I started to play with doc.get_text but I don't really know how to get just the data I want to see. The data I want to see is like a 10x10 matrix. When I used R, I would just use doc[[7]] and that matrix would almost be in a perfect structure for me to convert it to a dataframe. However, I just can't seem to do that with Python. Any advice would be much appreciated.
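(For context, the rough Python analogue of stepping through doc[[1]], doc[[2]], ... in R is indexing into soup.find_all('table'). This is only a hedged sketch: the index 6, i.e. the 7th table, is carried over from the R code and the page may not actually have that many tables.)
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(driver.page_source, "html.parser")
tables = soup.find_all("table")        # roughly what doc[[1]], doc[[2]], ... walk through in R
df = pd.read_html(str(tables[6]))[0]   # 7th table, if the page really has one there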
UPDATE:
I have been able to get the data I want using Python--I followed this blog for creating a dataframe with Python: Python Web-Scraping. Here is the website that we are scraping in that blog: Most Popular Dog Breeds. In that blog post, you have to work your way through the elements, create a dict, loop through each row of the table to store the data in each column, and then you are able to create a dataframe--roughly the pattern sketched below.
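For reference, that dict-and-loop pattern looks roughly like this (a sketch from memory rather than the blog's exact code; the column names are assumptions based on the dog-breeds table):
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find("table")

# Build a dict of lists, one entry per column, by walking every row of the table.
data = {"breed": [], "2015": [], "2014": [], "2013": []}
for row in table.find_all("tr")[1:]:                       # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == 4:
        data["breed"].append(cells[0])
        data["2015"].append(cells[1])
        data["2014"].append(cells[2])
        data["2013"].append(cells[3])

df = pd.DataFrame(data)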
With R, the only code I had to write was:
doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")
df <- as.data.frame(readHTMLTable(doc, stringsAsFactors = FALSE))
With just that, I have a pretty nice dataframe that I only need to adjust the column names and data types--it looks like this with just that code:
                  NULL.V1 NULL.V2 NULL.V3 NULL.V4
1                   BREED    2015    2014    2013
2   Retrievers (Labrador)       1       1       1
3    German Shepherd Dogs       2       2       2
4     Retrievers (Golden)       3       3       3
5                Bulldogs       4       4       5
6                 Beagles       5       5       4
7         French Bulldogs       6       9      11
8      Yorkshire Terriers       7       6       6
9                 Poodles       8       7       8
10            Rottweilers       9      10       9
Is there not something available in Python to make this a bit simpler, or is this just simpler in R because R is more built for dataframes (at least that's how it seems to me, but I could be wrong)?

Ok, after some hefty digging around I feel like I came to a good solution--matching that of R. If you are looking at the HTML provided in the link above, Dog Breeds, and you have the web driver running for that link, you can run the following code:
import pandas as pd

tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')
df = pd.read_html(tbl)
Then you are looking at a pretty nice dataframe after only a couple of lines of code:
In [145]: df
Out[145]:
[                       0     1     2       3
 0                  BREED  2015  2014  2013.0
 1  Retrievers (Labrador)     1     1     1.0
 2   German Shepherd Dogs     2     2     2.0
 3    Retrievers (Golden)     3     3     3.0
 4               Bulldogs     4     4     5.0
 5                Beagles     5     5     4.0
I feel like this is much easier than working through the tags, creating a dict, and looping through each row of data as the blog suggests. It might not be the most correct way of doing things (I'm new to Python), but it gets the job done quickly. I hope this helps out some fellow web-scrapers.
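One caveat: pd.read_html returns a list of DataFrames, so df[0] is the actual table. Also, newer Selenium versions (4.3+) drop find_element_by_xpath; a hedged, updated sketch of the same idea (same XPath assumed) would be:
from selenium.webdriver.common.by import By
import pandas as pd

tbl = driver.find_element(By.XPATH, "//html/body/main/article/section[2]/div/article/table").get_attribute("outerHTML")
df = pd.read_html(tbl)[0]   # read_html returns a list, so take the first (and only) table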

tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')
df = pd.read_html(tbl)
It worked pretty well.

First, read Selenium with Python; it will give you a basic idea of how Selenium works with Python.
Then, if you want to locate elements in Python, there are two ways (a short sketch of both follows below):
Use the Selenium API; you can refer to Locating Elements.
Use BeautifulSoup; there is a nice document you can read, the BeautifulSoup Documentation.
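A minimal sketch of both options, assuming an already-running driver and a page element with id "example" (hypothetical):
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# Option 1: locate the element through the Selenium API
element = driver.find_element(By.ID, "example")
print(element.text)

# Option 2: hand the page source to BeautifulSoup and locate it there
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find(id="example").get_text())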

Related

Making a table by combining data from 4 different lists

I am trying to make a table by combining four lists.
My code is as below:
from selenium import webdriver
import time
driver_path= "C:\\Users\\Bacanli\\Desktop\\chromedriver.exe"
browser=webdriver.Chrome(driver_path)
browser.get("http://www.bddk.org.tr/BultenHaftalik")
time.sleep(3)
Krediler=browser.find_element_by_xpath("//*[@id='tabloListesiItem-253']/span")
Krediler.click()
elements = browser.find_elements_by_css_selector("td.ortala:nth-child(2)")
TPs=browser.find_elements_by_css_selector("td[data-label='TP']")
YPs=browser.find_elements_by_css_selector("td[data-label='YP']")
Toplams=browser.find_elements_by_css_selector("td[data-label='Toplam']")
My intent is to make a new table by combining elements, TPs, YPs, and Toplams.
Thanks for your help.
Pandas makes this easy for you:
import pandas as pd
df = pd.read_html('http://www.bddk.org.tr/BultenHaftalik')
will create a list of pandas dataframes from html tables on the page. The table you want is df[3].
Result df[3].head():
   Unnamed: 0  Sektör / Krediler ( 9 Temmuz 2021 Cuma ) (Milyon TL)           TP           YP       TOPLAM
0           1  Toplam Krediler (2+10)                                2.479.94928  1.427.80395  3.907.75323
1           2  Tüketici Kredileri ve Bireysel Kredi Kartları (3+7)     877.62363        30181    877.92544
2           3  Tüketici Kredileri (4+5+6)                               710.18775        11070    710.29845
3           4  a) Konut                                                 278.38213         7473    278.45686
4           5  b) Taşıt                                                  14.91958          000     14.91958
Export to csv with df[3].to_csv('filename.csv') (or you could use the export-to-Excel button above the table on the website).
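If you would rather build the table directly from the four Selenium element lists in the question (instead of re-downloading the page with read_html), a hedged sketch would be the following; the column names and output file name are my own assumptions:
import pandas as pd

# elements, TPs, YPs and Toplams are the lists of WebElements located in the question.
rows = zip(elements, TPs, YPs, Toplams)
df = pd.DataFrame(
    [[e.text, tp.text, yp.text, t.text] for e, tp, yp, t in rows],
    columns=["Sektör", "TP", "YP", "Toplam"],
)
df.to_csv("krediler.csv", index=False)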

Pandas.read_csv issue

I am trying to read the messages from a CSV file, but the values under the class label are not read the same as they appear in the CSV dataset.
import pandas

messages = pandas.read_csv('bitcoin_reddit.csv', delimiter='\t',
                           names=["title", "class"])
print (messages)
Under the class label, pandas can only read NaN.
Here is my CSV file:
title,url,timestamp,class
"It's official! 1 Bitcoin = $10,000 USD",https://v.redd.it/e7io27rdgt001,29/11/2017 17:25,0
The last 3 months in 47 seconds.,https://v.redd.it/typ8fdslz3e01,4/2/2018 18:42,0
It's over 9000!!!,https://i.imgur.com/jyoZGyW.gifv,26/11/2017 20:55,1
Everyone who's trading BTC right now,http://cdn.mutually.com/wp-content/uploads/2017/06/08-19.jpg,7/1/2018 12:38,1
I hope James is doing well,https://i.redd.it/h4ngqma643101.jpg,1/12/2017 1:50,1
Weeeeeeee!,https://i.redd.it/iwl7vz69cea01.gif,17/1/2018 1:13,0
Bitcoin.. The King,https://i.redd.it/4tl0oustqed01.jpg,1/2/2018 5:46,1
Nothing can increase by that much and still be a good investment.,https://i.imgur.com/oWePY7q.jpg,14/12/2017 0:02,1
"This is why I want bitcoin to hit $10,000",https://i.redd.it/fhzsxgcv9nyz.jpg,18/11/2017 18:25,1
Bitcoin Doesn't Give a Fuck.,https://v.redd.it/ty2y74gawug01,18/2/2018 15:19,-1
Working Hard or Hardly Working?,https://i.redd.it/c2o6204tvc301.jpg,12/12/2017 12:49,1
The separator in your csv file is a comma, not a tab. And since , is the default, there is no need to define it.
However, names= defines custom names for the columns. Your header already provides these names, so passing the column names you are interested in to usecols is all you need then:
>>> pd.read_csv(file, usecols=['title', 'class'])
title class
0 It's official! 1 Bitcoin = $10,000 USD 0
1 The last 3 months in 47 seconds. 0
2 It's over 9000!!! 1
3 Everyone who's trading BTC right now 1
4 I hope James is doing well 1
5 Weeeeeeee! 0
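Put differently, the NaN came from telling read_csv to split on tabs: each whole line landed in the title column and nothing was left for class, while names= pushed the real header row down into the data. A short sketch contrasting the two calls (same file name as in the question):
import pandas as pd

# Broken: the file is comma-separated, so splitting on tabs leaves "class" empty (NaN).
# messages = pd.read_csv('bitcoin_reddit.csv', delimiter='\t', names=["title", "class"])

# Working: use the default comma separator and the file's own header,
# keeping just the two columns of interest.
messages = pd.read_csv('bitcoin_reddit.csv', usecols=['title', 'class'])
print(messages.head())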

Best way to match a list of words with a list of job descriptions in python

Here is my problem (I'm working in Python):
I have a Dataframe with columns: Index(['job_title', 'company', 'job_label', 'description'], dtype='object')
And I have a list of words that contains 300 skills:
keywords = ["C++","Data Analytics","python","R", ............ "Django"]
I need to match those keywords against each of the job descriptions and obtain a new dataframe saying whether it is true or false that C++ is in job description[0], job description[1], job description[2], and so on.
My new dataframe will be:
columns : ['job_title', 'company', 'description', "C++", "Data Analytics",
....... "Django"]
Where each keyword column says true or false depending on whether that keyword matches (is found in) the job description.
There might be other ways to structure the dataframe (I'm open to suggestions).
Hope I'm clear with my question. I tried regex but I can't make it iterate through each row; I tried a loop using the "fnmatch" library and I can't make it work. The best approach so far was:
df["microservice"]= df.description.str.contains("microservice")
df["cloud-based architecture"] = df.description.str.contains("cloud-based architecture")
df["service oriented architecture"] = df.description.str.contains("service oriented architecture")
However, first, I could not manage to make it loop through each row of the description column, so I have to repeat the code 300 times, once for each word (it doesn't make sense). Second, with this approach I have problems with a few words such as "R", because it finds the letter R inside every description, so it returns true for all of them.
Iterate over list of keywords and extract each column from the description one:
for name in keywords:
    df[name] = df['description'].apply(lambda x: True if name in x else False)
EDIT:
That doesn't solve the problem with R. To do so you could add a space to make sure it's isolated so the code would be:
for name in keywords:
    df[name] = df['description'].apply(lambda x: True if ' '+str(name)+' ' in x else False)
But that's really ugly and not optimised. A regular expression should do the trick, but I have to look back into it: found it! [ ]*+[str(name)]+[.?!] is better! (and more appropriate)
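A hedged sketch of the whole-word idea (not the answer's exact regex): re.escape keeps "C++" safe, and the lookarounds stop "R" from matching inside longer words. The sample dataframe here is a made-up stand-in for the question's data.
import re
import pandas as pd

df = pd.DataFrame({"description": ["Work with R and python", "Rust developer", "C++ and Django"]})
keywords = ["C++", "python", "R", "Django"]

for name in keywords:
    # (?<!\w) ... (?!\w) require the keyword to stand alone, so "R" does not match "Rust".
    pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
    df[name] = df["description"].str.contains(pattern, regex=True)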
One way is to build a regex string to identify any keyword in your string... this example is case insensitive and will find any substring matches - not just whole words...
import pandas as pd
import re
keywords = ['python', 'C++', 'admin', 'Developer']
rx = '(?i)(?P<keywords>{})'.format('|'.join(re.escape(kw) for kw in keywords))
Then with a sample DF of:
df = pd.DataFrame({
    'job_description': ['C++ developer', 'traffic warden', 'Python developer', 'linux admin', 'cat herder']
})
You can find all keywords for the relevant column...
matches = df['job_description'].str.extractall(rx)
Which gives:
           keyword
  match
0 0            C++
  1      developer
2 0         Python
  1      developer
3 0          admin
Then you want to get a list of "dummies" and take the max (so you always get a 1 where a word was found) using:
dummies = pd.get_dummies(matches).max(level=0)
Which gives:
   keyword_C++  keyword_Python  keyword_admin  keyword_developer
0            1               0              0                  1
2            0               1              0                  1
3            0               0              1                  0
You then left join that back to your original DF:
result = df.join(dummies, how='left')
And the result is:
    job_description  keyword_C++  keyword_Python  keyword_admin  keyword_developer
0     C++ developer          1.0             0.0            0.0                1.0
1    traffic warden          NaN             NaN            NaN                NaN
2  Python developer          0.0             1.0            0.0                1.0
3       linux admin          0.0             0.0            1.0                0.0
4        cat herder          NaN             NaN            NaN                NaN
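A small follow-up of my own (not part of the answer above): if you would rather see 0 than NaN for the rows with no match, fill them after the join:
result = df.join(dummies, how='left').fillna(0)   # "traffic warden" and "cat herder" rows become 0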
skill = "C++", or any of the others
frame = an instance of Index(['job_title', 'company', 'job_label', 'description'], dtype='object')
jobs = a list/np.array of frames, which is probably your input
A naive implementation could look a bit like this:
for skill in keywords:
    for frame in jobs:
        if skill in frame["description"]:  # or more exact matching, but this is what's in the question
            pass  # exists: record the match here
But you need to put more work into what output structure you are going to use. Just having an output array of 300 columns, most of which contain False, isn't going to be a good plan. I've never worked with pandas myself, but if these were normal numpy arrays (which pandas DataFrames are under the hood), I would add a column "skills" that enumerates them.
You can leverage .apply() like so (@Jacco van Dorp made a solid suggestion of storing all of the found skills inside a single column, which I agree is likely the best approach to your problem):
df = pd.DataFrame([['Engineer','Firm','AERO1','Work with python and Django'],
                   ['IT','Dell','ITD4','Work with Django and R'],
                   ['Office Assistant','Dental','OAD3','Coordinate schedules'],
                   ['QA Engineer','Factory','QA2','Work with R and python'],
                   ['Mechanic','Autobody','AERO1','Love the movie Django']],
                  columns=['job_title','company','job_label','description'])
Which yields:
job_title company job_label description
0 Engineer Firm AERO1 Work with python and Django
1 IT Dell ITD4 Work with Django and R
2 Office Assistant Dental OAD3 Coordinate schedules
3 QA Engineer Factory QA2 Work with R and python
4 Mechanic Autobody AERO1 Love the movie Django
Then define your skill set and your list comprehension to pass to .apply():
skills = ['python','R','Django']
df['skills'] = df.apply(lambda x: [i for i in skills if i in x['description'].split()], axis=1)
Which yields this column:
skills
0 [python, Django]
1 [R, Django]
2 []
3 [python, R]
4 [Django]
If you are still interested in having individual columns for each skill, I can edit my answer to provide that as well.
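For completeness, a hedged sketch of those individual columns, derived from the skills column built above (one boolean column per skill; this is my own addition rather than the answer's edit):
# For each skill, check whether it appears in the list stored in the skills column.
for skill in skills:
    df[skill] = df['skills'].apply(lambda found: skill in found)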

Assistance with numpy savetxt function

I am very new to Python and am struggling to figure out how to output my data to a file.
The following section of my script works great; however, I would like to be able to output the printed data to a text file.
import numpy as np

# tesseroid and model are defined earlier in the script
x = np.linspace(141, 144.5, 500)
print(x)
y = np.linspace(-38.53, -38.53, 500)
z = np.linspace(0, 0, 500)
gz = tesseroid.gz(x, y, z, model)
print(gz)
As an addendum to that: when it prints in my terminal, it prints the data as follows...
(x)
1 2 3 4
5 ... 500
(gz)
1 2 3 4
5 ... 500
...however I would love it to output the data in a single column like this:
1
2
3
4
5
...
8
It would be even better if they could both be in the same file...
1 1
2 2
3 3
4 4
5 5
...
500 500
...however this is not necessary, as I could easily manually manipulate them from there.
Many thanks in advance, and apologies if it is a very simple question; it is just something that I haven't been able to figure out with my limited experience.
Edit: Please note that it does not necessarily have to use the numpy savetxt function. If there is an easier way to perform this task then I am perfectly willing to use it instead.
As your title says, you can use np.savetxt() to save in one column:
np.savetxt('x.txt', x.ravel())
np.savetxt('gz.txt', gz.ravel())
or to save in two columns:
np.savetxt('x_gz.txt', np.vstack((x.ravel(), gz.ravel())).T)
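For completeness, a hedged sketch of the two-column file with a fixed number format and a header line; the fmt and header arguments are optional extras of my own, not part of the answer above, and gz is a placeholder for the tesseroid output:
import numpy as np

x = np.linspace(141, 144.5, 500)
gz = np.linspace(0, 0, 500)   # placeholder for the tesseroid output

# column_stack pairs the two 1-D arrays into a (500, 2) array;
# fmt controls the number format and header adds an optional first line.
np.savetxt('x_gz.txt', np.column_stack((x, gz)), fmt='%.6f', header='x gz')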

Pivoting a DataFrame in Pandas for output to CSV

This is a simple question for which answers are surprisingly difficult to find online. Here's the situation:
>>> A
[('hey', 'you', 4), ('hey', 'not you', 5), ('not hey', 'you', 2), ('not hey', 'not you', 6)]
>>> A_p = pandas.DataFrame(A)
>>> A_p
         0        1  2
0      hey      you  4
1      hey  not you  5
2  not hey      you  2
3  not hey  not you  6
>>> B_p = A_p.pivot(0,1,2)
>>> B_p
1        not you  you
0
hey            5    4
not hey        6    2
This isn't quite what's suggested in the documentation for pivot -- there, it shows results without the 1 and 0 in the upper-left-hand corner. And that's what I'm looking for, a DataFrame object that prints as
         not you  you
hey            5    4
not hey        6    2
The problem is that the normal behavior results in a csv file whose first line is
0,not you,you
when I really want
not you, you
When the normal csv file (with the preceding "0,") reads into R, it doesn't properly set the column and row names from the frame object, resulting in painful manual manipulation to get it in the right format. Is there a way to get pivot to give me a DataFrame object without that additional information in the upper-left corner?
Well, you have:
In [17]: B_p.to_csv(sys.stdout)
0,not you,you
hey,5.0,4.0
not hey,6.0,2.0
In [18]: B_p.to_csv(sys.stdout, index=False)
not you,you
5.0,4.0
6.0,2.0
But I assume you want the row names. Setting the index name to None (B_p.index.name = None) gives a leading comma:
In [20]: B_p.to_csv(sys.stdout)
,not you,you
hey,5.0,4.0
not hey,6.0,2.0
This roughly matches (ignoring quoted strings) what R writes in write.csv when row.names=TRUE:
"","a","b"
"foo",0.720538259472741,-0.848304940318957
"bar",-0.64266667412325,-0.442441171401282
"baz",-0.419181615269841,-0.658545964124229
"qux",0.881124313748992,0.36383198969179
"bar2",-1.35613767310069,-0.124014006180608
Any of these help?
EDIT: Added the index_label=False option today which does what you want:
In [2]: df
Out[2]:
A B
one 1 4
two 2 5
three 3 6
In [3]: df.to_csv('foo.csv', index_
index_exp index_label= index_name=
In [3]: df.to_csv('foo.csv', index_name=False)
In [4]:
11:24 ~/code/pandas (master)$ R
R version 2.14.0 (2011-10-31)
> read.csv('foo.csv')
A B
one 1 4
two 2 5
three 3 6
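In current pandas the same round trip can be written with index_label=False, which leaves the index column out of the CSV header so the file starts with "not you,you" and R's read.csv picks up the row names cleanly. A hedged sketch using the data from the question:
import pandas as pd

A = [('hey', 'you', 4), ('hey', 'not you', 5), ('not hey', 'you', 2), ('not hey', 'not you', 6)]
B_p = pd.DataFrame(A).pivot(index=0, columns=1, values=2)

# index_label=False suppresses the index column's name in the header row,
# so there is no stray "0" (or leading comma) in the top-left corner.
B_p.to_csv('foo.csv', index_label=False)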
