Extract information from a website using XPath in Python

I'm trying to extract some useful information from a website. I've got part of the way, but now I'm stuck and in need of your help!
I need the information from this table:
http://gbgfotboll.se/serier/?scr=scorers&ftid=57700
I wrote this code and it gets the information that I wanted:
import lxml.html
from lxml.etree import XPath

url = "http://gbgfotboll.se/serier/?scr=scorers&ftid=57700"
rows_xpath = XPath("//*[@id='content-primary']/div[1]/table/tbody/tr")
name_xpath = XPath("td[1]//text()")
team_xpath = XPath("td[2]//text()")
league_xpath = XPath("//*[@id='content-primary']/h1//text()")

html = lxml.html.parse(url)
divName = league_xpath(html)[0]
for id, row in enumerate(rows_xpath(html)):
    scorername = name_xpath(row)[0]
    team = team_xpath(row)[0]
    print scorername, team
print divName
I get this error
scorername = name_xpath(row)[0]
IndexError: list index out of range
I do understand why I get the error. What I really need help with is that I only need the first 12 rows. This is what the extraction should do in these three possible scenarios:
If there are fewer than 12 rows: take all the rows except the last row.
If there are exactly 12 rows: same as above.
If there are more than 12 rows: simply take the first 12 rows.
How can I do this?
EDIT1
It is not a duplicate. Sure, it is the same site, but I have already done what that question asked for, which was to get all the values from a row; I can already do that. I don't need the last row, and I don't want it to extract more than 12 rows if there are more.

I think this is what you want:
# coding: utf-8
from lxml import etree
import lxml.html

collected = []  # list of lists: [[col1, col2, ...], [col1, col2, ...]]
dom = lxml.html.parse("http://gbgfotboll.se/serier/?scr=scorers&ftid=57700")

# all table rows
xpatheval = etree.XPathDocumentEvaluator(dom)
rows = xpatheval('//div[@id="content-primary"]/div/table[1]/tbody/tr')

# If there are 12 rows or fewer: take all the rows except the last.
if len(rows) <= 12:
    rows.pop()
else:
    # If there are more than 12 rows: simply take the first 12 rows.
    rows = rows[0:12]

for row in rows:
    # all columns of the current table row (Spelare, Lag, Mal, straffmal)
    columns = row.findall("td")
    # pick the textual data from each <td>
    collected.append([column.text for column in columns])

for i in collected: print i
Output:

This is how you can get the rows you need based on what you described in your post. This is just the logic, treating rows as a list; you will have to incorporate it into your code as needed.
if len(rows) <= 12:
    print rows[0:-1]
elif len(rows) > 12:
    print rows[0:12]
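Equivalently, since the two branches cover every case, the slicing can be collapsed into a single expression; a quick sketch with a stand-in list:
rows = list(range(15))  # stand-in for the scraped <tr> elements
rows = rows[:12] if len(rows) > 12 else rows[:-1]
print len(rows)  # 12 here; for 12 rows or fewer it would be the original length minus one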

Related

How to display a count of combinations from a data set in python [duplicate]

This question already has answers here: Count occurrences in DataFrame (2 answers). Closed 5 months ago.
I have a data set of customers and products, and I would like to know which combinations of products are the most popular combinations chosen by customers, and to display that in a table (like a traditional mileage chart or some other neat way).
Example dataset:
Example output:
I am able to tell that the most popular combination of products for customers is P1 with P2 and the least popular is P1 with P3. My actual dataset is of course much larger in terms of customers and products.
I'd also be keen to hear any ideas on better output visualisations too, especially as I can't figure out how best to display popular 3-way or 4-way combinations.
Thank you
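One way to count the pairwise combinations directly in Python is sketched below; it assumes the data can be loaded into a pandas DataFrame with hypothetical columns customer and product (the real column names may differ):
import pandas as pd
from itertools import combinations
from collections import Counter

# Hypothetical stand-in for the real customer/product data
df = pd.DataFrame({
    "customer": ["C1", "C1", "C2", "C2", "C3", "C3"],
    "product":  ["P1", "P2", "P1", "P2", "P1", "P3"],
})

pair_counts = Counter()
for _, products in df.groupby("customer")["product"]:
    # count each unordered pair of products bought by the same customer
    for pair in combinations(sorted(set(products)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common())  # e.g. [(('P1', 'P2'), 2), (('P1', 'P3'), 1)]
A matrix view like the mileage chart in the question could then be built by pivoting these pair counts.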
I have a full code example that may work for what you are doing... or at least give you some ideas on how to move forward.
This script uses openpyxl to read the info from the first sheet. It is turned into a dictionary keyed by customer, where the values are strings of the product combinations. The combinations are then counted and the counts are placed into a second sheet (see image).
Results:
The Code:
from openpyxl import load_workbook
from collections import Counter

# Load the workbook and the two worksheets (data in, results out)
data_wb = load_workbook(filename=r"C:\\Users\---Place your loc here---\SO_excel.xlsx")
data_ws = data_wb['Sheet1']
results_ws = data_wb['Sheet2']

# Finding max rows in the sheets
data_max_rows = data_ws.max_row
results_max_rows = results_ws.max_row

# Collecting values and building a dict keyed by customer
customer_dict = {}
for row in data_ws.iter_rows(min_row=2, max_col=2, max_row=data_max_rows):
    # Gathering row values and creating variables for the relevant ones
    row_list = [cell.value for cell in row]
    customer_cell = row_list[0]
    product_cell = row_list[1]  # product assumed to be in the second column
    # Building the combination string for this customer
    if customer_cell not in customer_dict:
        customer_dict[customer_cell] = product_cell
    else:
        customer_dict[customer_cell] += "_" + product_cell

# Counting occurrences of each combination string
count_dict = Counter(customer_dict.values())

# Column titles
results_ws.cell(1, 1).value = "Combination"
results_ws.cell(1, 2).value = "Occurrences"

# Placing values into the spreadsheet
count = 2
for key, value in count_dict.items():
    results_ws.cell(count, 1).value = key
    results_ws.cell(count, 2).value = value
    count += 1

data_wb.save(filename=r"C:\\Users\---Place your loc here---\SO_excel.xlsx")
data_wb.close()

How to calculate the number of occurrences between data in excel?

I have a huge CSV table with thousands of rows. I want to make a table of the number of times two elements occur together, divided by how many times each element appears.
For example, Bitcoin appeared 8 times in these rows, 2 of those times together with API, so for the relation of Bitcoin to API: API always appears with Bitcoin, so the value of API appearing with Bitcoin is 1, and Bitcoin appearing with API is 1/4.
I want something that looks like this in the end:
How can I do it with Python or any other tool?
This is a sample of the file:
This, I think, does do the job. I typed your spreadsheet into a csv by hand (would have been nice to be able to cut and paste), and the results seem reasonable.
import itertools
import csv
import numpy as np

words = {}
for row in open('input.csv'):
    parts = row.rstrip().split(',')
    for a, b in itertools.combinations(parts, 2):
        if a not in words:
            words[a] = [b]
        else:
            words[a].append(b)
        if b not in words:
            words[b] = [a]
        else:
            words[b].append(a)
print(words)

size = len(words)
keys = list(words.keys())
track = np.zeros((size, size))
for i, k in enumerate(keys):
    track[i, i] = len(words[k])
    for j in words[k]:
        track[i, keys.index(j)] += 1
        track[keys.index(j), i] += 1
print(keys)

# Scale to [0,1].
for row in range(track.shape[0]):
    track[row, :] /= track[row, row]

# Create a csv with the results.
fout = open('corresp.csv', 'w')
print(','.join([' '] + keys), file=fout)
for row in range(track.shape[0]):
    print(keys[row], file=fout, end=',')
    print(','.join(f"{track[row,i]}" for i in range(track.shape[1])), file=fout)
Here's the first few lines of the result:
,API,Backend Development,Bitcoin,Docker,Article Rewriting,Article writing,Blockchain,Content Writing,Ghostwriting,Android,Ethereum,PHP,React.js,C Programming,C++ Programming,ASIC,Digital ASIC Coding,Embedded Software,Article Writing,Blog,Copy Typing,Affiliate Marketing,Brand Marketing,Bulk Marketing,Sales,BlockChain,Business Strategy,Non-fungible Tokens,Technical Writing,.NET,Arduino,Software Architecture,Bluetooth Low Energy (BLE),C# Programming,Ada programming,Programming,Haskell,Rust,Algorithm,Java,Mathematics,Machine Learning (ML),Matlab and Mathematica,Data Entry,HTML,Circuit Designs,Embedded Systems,Electronics,Microcontroller, C++ Programming,Python
API,1.0,0.14285714285714285,0.5714285714285714,0.14285714285714285,0.0,0.0,0.2857142857142857,0.0,0.0,0.0,0.14285714285714285,0.0,0.14285714285714285,0.2857142857142857,0.2857142857142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Backend Development,0.6666666666666666,1.0,0.6666666666666666,0.6666666666666666,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bitcoin,0.21052631578947367,0.05263157894736842,1.0,0.05263157894736842,0.0,0.0,0.2631578947368421,0.0,0.0,0.05263157894736842,0.10526315789473684,0.10526315789473684,0.05263157894736842,0.15789473684210525,0.21052631578947367,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.05263157894736842,0.0,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Docker,0.6666666666666666,0.6666666666666666,0.6666666666666666,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I had a look at this by creating a pivot table in Excel for every combination of columns there is: AB, AC, AD, BC, BD, CD. Putting the unique entries from the first column (e.g. A) in the rows, the unique entries from the second (e.g. B) in the columns, and then putting column A in the values area, I find all matches and the count of all matches.
This is a clunky method, but I note from the Python-based method that has been submitted that my answer is essentially no more or less clunky than that!
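For comparison, the same pair-of-columns pivot idea can be sketched in pandas; the column names below are hypothetical stand-ins for the real sheet's headers:
import pandas as pd
from itertools import combinations

# Hypothetical two-column example; the real CSV has more columns.
df = pd.DataFrame({
    "A": ["Bitcoin", "Bitcoin", "API"],
    "B": ["API", "Docker", "Docker"],
})

# One cross-tabulation per pair of columns, mirroring the Excel pivot tables.
for col1, col2 in combinations(df.columns, 2):
    print(pd.crosstab(df[col1], df[col2]))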

BeautifulSoup: how to iterate over a table

I am trying to extract data from a dynamic table with the following structure:
Team 1 - Score - Team 2 - Minute first goal.
It is a table of soccer match results with about 10 matches per table and one table for each matchday. This is an example of the website I'm working with: https://www.resultados-futbol.com/premier/grupo1/jornada1
For this I am doing web scraping with BeautifulSoup in Python. Although I've made good progress, I'm running into a problem. I would like to write code that iterates over each row of the table, cell by cell, and appends each value to a list, so that I would have, for example:
List Team 1: Real Madrid, Barcelona
Score list: 1-0, 1-0
List Team 2: Atletico Madrid, Sevilla
First goal minutes list: 17', 64'
Once I have the lists, my intention is to make a complete dataframe with all the extracted data. However, I have the following problem: matches that end 0-0. For those there is nothing in the 'Minute first goal' column, so nothing is extracted and I can't 'fill' that value in my dataframe in any way, and I get an error. To continue with the previous example, imagine the second game ended 0-0, so the first-goal minutes list has only one entry (17').
In my mind the solution would be a loop that takes the data cell by cell, with a condition on 'Score': if it is 0-0, a value such as 'No goals' is added to the list of first-goal minutes.
This is the code I am using. I paste only the part in which I would like to create the loop:
page = BeautifulSoup(driver.page_source, 'html.parser')  # I have to use Selenium first because I have to expand some buttons on the page
table = page.find('div', class_='contentitem').find_all('tr', class_='vevent')

teams1 = []
teams2 = []
scores = []

for cell in table:
    team1 = cell.find('td', class_='team1')
    for name in team1:
        nteam1 = name.text
        teams1.append(nteam1)
    team2 = cell.find('td', class_='team2')
    for name in team2:
        nteam2 = name.text
        teams2.append(nteam2)
    score = cell.find('span', class_='clase')
    for name in score:
        nscore = name.text
        scores.append(nscore)
It is not clear to me how to iterate over the table so that the content of each cell is stored in the lists, and it is essential to include the condition "when the score cell is 0-0, create a no-goals entry in the list".
If someone could help me, I would be very grateful. Best regards
You are close to your goal, but can optimize your script a bit.
Do not use these different lists, just use one:
data = []
Try to get all the information in one loop; there is a td that contains all of it, and push a dict onto your list:
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()

    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score if score != '0-0' else 'No goals'
    })
Push your data into DataFrame
pd.DataFrame(data)
Example
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://www.resultados-futbol.com/premier/grupo1/jornada1'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')  # I have to use Selenium first because I have to expand some buttons on the page

data = []

for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()

    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score if score != '0-0' else 'No goals'
    })

pd.DataFrame(data)

How to Use Beautiful Soup to Scrape SEC's EDGAR Database and Retrieve the Desired Data

Apologies in advance for the long question - I am new to Python and I'm trying to be as explicit as I can about a fairly specific situation.
I am trying to identify specific data points from SEC filings on a routine basis, but I want to automate this instead of having to manually search for a company's CIK ID and form filing. So far, I have been able to get to a point where I am downloading metadata about all filings received by the SEC in a given time period. It looks like this:
index cik conm type date path
0 0 1000045 NICHOLAS FINANCIAL INC 10-Q 2019-02-14 edgar/data/1000045/0001193125-19-039489.txt
1 1 1000045 NICHOLAS FINANCIAL INC 4 2019-01-15 edgar/data/1000045/0001357521-19-000001.txt
2 2 1000045 NICHOLAS FINANCIAL INC 4 2019-02-19 edgar/data/1000045/0001357521-19-000002.txt
3 3 1000045 NICHOLAS FINANCIAL INC 4 2019-03-15 edgar/data/1000045/0001357521-19-000003.txt
4 4 1000045 NICHOLAS FINANCIAL INC 8-K 2019-02-01 edgar/data/1000045/0001193125-19-024617.txt
Despite having all this information, as well as being able to download these text files and see the underlying data, I am unable to parse it because it is in XBRL format and a bit out of my wheelhouse. Instead I came across this script (kindly provided on this site: https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python):
from bs4 import BeautifulSoup
import requests
import sys

# Access page
cik = '0000051143'
type = '10-K'
dateb = '20160101'

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text

# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text

# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text

# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholder's equity: " + tag.text)
Just running this script works exactly how I'd like it to. It returns the stockholder's equity for a given company (IBM in this case) and I can then take that value and write it to an Excel file.
My two-part question is this:
I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote them to a list of tuples - I think that's what it's called - which looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, replace the hard-coded values at the start of the script I found, and loop through it efficiently so I can end up with a list of the desired values for each company, filing, and date?
Is there generally a better way to do this? I would think there would be some sort of API or Python package for querying the data I'm interested in. I know there is some high-level information out there for Form 10-Ks and Form 10-Qs; however, I am working with Form Ds, which is somewhat obscure. I just want to make sure I am spending my time effectively on the best possible solution.
Thank you for the help!
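On the second question: one option worth checking against the current SEC documentation is EDGAR's JSON submissions endpoint, which lists a company's recent filings without any HTML scraping. A hedged sketch follows; the response layout is assumed from the documented data.sec.gov format, and the User-Agent string is a placeholder you should replace with your own contact details:
import requests

cik = '1000045'  # Nicholas Financial, from the metadata table above
url = "https://data.sec.gov/submissions/CIK{:010d}.json".format(int(cik))
# The SEC asks for a descriptive User-Agent with contact information.
resp = requests.get(url, headers={"User-Agent": "Your Name your.email@example.com"})
recent = resp.json()["filings"]["recent"]

# Parallel lists: form type, filing date and accession number per filing.
for form, date, accession in zip(recent["form"], recent["filingDate"],
                                 recent["accessionNumber"]):
    if form == "D":
        print(date, accession)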
You need to define a function which is essentially most of the code you have posted, and that function should take 3 keyword arguments (your 3 values). Then, rather than defining the three in your code, you just pass in those values and return a result.
Then you take the list you created and write a simple for loop around it to call the function you defined with those three values, and then do something with the result.
def get_data(value1, value2, value3):
    # your main code here, but using the arguments above
    return content

for company in companies:
    content = get_data(*company)  # unpack the (cik, type, date) tuple
    # do something with content
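Tied back to the list of tuples from the question, a minimal runnable sketch might look like the following; it assumes each tuple is ordered (CIK, form type, date) and it only fetches the search page, with the parsing from the posted script going inside get_data:
import requests

def get_data(cik, form_type, dateb):
    # Essentially the posted script, with the three hard-coded values
    # replaced by parameters; here it only fetches the search page.
    base_url = ("https://www.sec.gov/cgi-bin/browse-edgar?"
                "action=getcompany&CIK={}&type={}&dateb={}")
    edgar_resp = requests.get(base_url.format(cik, form_type, dateb))
    return edgar_resp.text  # parse this as in the original script

# The list of tuples from the question: (CIK, form type, date)
filings = [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206')]
results = [get_data(cik, form_type, dateb) for cik, form_type, dateb in filings]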
Assuming you have a dataframe sec with correctly named columns for your list of filings, above, you first need to extract from the dataframe the relevant information into three lists:
cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)
Then you create your base_url, with the items inserted and get your data:
for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)
And go from there.

Scraping an HTML table with BeautifulSoup

I'm trying to scrape the data in a table on the FT website, but I cannot get my code to work. I've been through other similar questions here on Stack Overflow, and while they have helped, it's beyond my skill to get the code working.
I'm looking to scrape the table and output to a list of dicts, or a dict of dicts, which I would then put into a pandas DataFrame.
EDIT FOR CLARITY:
I want to:
Scrape the table
strip out the html tags
return a dict where the first cell of each row is the key, and the rest are the values for that key
So far I can do (1); (2) I see as more of a cleanup exercise, so it shouldn't be too hard; (3) is where I have issues. Some of the rows contain only one entry because they are section headings, but they are not marked up as such in the HTML, so the standard dict comprehensions I've seen in other answers are either returning an error (because a key has no values) or setting the first entry as the key for all the rest of the data.
The table is here.
My code so far is:
from bs4 import BeautifulSoup
import urllib2
import lxml

soup = BeautifulSoup(urllib2.urlopen('http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet').read())
table = soup.find('table', {'data-ajax-content': 'true'})

for row in table.findAll('tr'):
    for cell in row.findAll('td'):
        print cell.findAll(text=True)
Which gets me this kind of output:
[u'Fiscal data as of Dec 31 2013']
[u'2013']
[u'2012']
[u'2011']
[u'ASSETS']
[u'Cash And Short Term Investments']
[u'416']
[u'660']
[u'495']
I have tried:
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print dict(zip(headers, values))
which may work, but I'm getting:
urllib2.HTTPError: HTTP Error 407: Proxy Authorization Required,
which I assume is because I'm behind a corporate proxy.
EDIT:
I tried the code above from home, and it gets this:
{u'Fiscal data as of Dec 31 2013201320122011': u'ASSETS'}
{u'Fiscal data as of Dec 31 2013201320122011': u'LIABILITIES'}
{u'Fiscal data as of Dec 31 2013201320122011': u'SHAREHOLDERS EQUITY'}
which is tantalisingly close, but has only captured the first row of each section.
Any help is greatly appreciated. I am new to python, so if you have time to explain your answer, that will also meet with my gratitude.
EDIT:
I've read around a bit more and tried a few more things:
table = soup.find('table', {'data-ajax-content' : 'true'})
rows = table.findAll('tr')
dict_for_series = {row[0]:row[1:] for row in rows}
print dict_for_series
Which results in:
{<tr><td class="label">Fiscal data as of Dec 31 2013</td><td>2013</td><td>2012</td><td>2011</td></tr>: [<tr class="section even"><td colspan="4">ASSETS</td></tr>, <tr class="odd"><td class="label">Cash And Short Term Investments</td><td>416</td><td>660</td><td>495</td></tr>, <tr class="even"><td class="label">Total Receivables, Net</td><td>1,216</td><td>1,122</td><td>1,102</td></tr>, <tr class="odd"><td class="label">Total Inventory</td><td>49</td><td>55</td><td>72</td><
In this case it seems the code sets the first entry as the key, and the rest as values.
Another attempt:
table = soup.find('table', {'data-ajax-content' : 'true'})
rows = table.findAll('tr')
d = []
for row in rows:
    d.append(row.findAll('td'))
rowsdict = {t[0]:t[1:] for t in d}
dictSer = Series(rowsdict)
dictframe = DataFrame(dictSer)
print dictframe
Which results in:
0
<td class="label">Fiscal data as of Dec 31 2013</td> [<td>2013</td>, <td>2012</td>, <td>2011</td>]
<td colspan="4">ASSETS</td> []
<td class="label">Cash And Short Term Investments</td> [<td>416</td>, <td>660</td>, <td>495</td>]
<td class="label">Total Receivables, Net</td> [<td>1,216</td>, <td>1,122</td>, <td>1,102</td>]
which is very close to what I want; the structure is almost right, but judging by the placement of the square brackets, it is treating all the values, e.g. <td>1,216</td>, as single cells rather than their text.
Anyway, I'll keep playing around and trying to make it work, but if anyone has any pointers, please let me know!
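For what it's worth, a minimal sketch of the direction the last attempt is heading in: take the text of each cell rather than the <td> tag itself, and skip the single-cell section-heading rows. This assumes the same table markup as in the code above:
from bs4 import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen('http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet').read())
table = soup.find('table', {'data-ajax-content': 'true'})

data = {}
for row in table.findAll('tr'):
    cells = [td.get_text(strip=True) for td in row.findAll('td')]
    if len(cells) > 1:  # skip one-cell section headings such as ASSETS
        data[cells[0]] = cells[1:]
print data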
