I'm trying to create a PDF reader in Python. I've already read the PDF and have a list with its content, and now I want to get back the numbers with eleven digits, like 123.456.789-33 or 124.323.432.33.
from PyPDF2 import PdfReader
import re
reader = PdfReader(r"\\abcdacd.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
num = re.findall(r'\d+', text)  # every run of digits, which breaks the numbers apart at the separators
print(num)
Here's the output:
['01', '01', '2000', '26', '12', '2022', '04483203983', '044', '832', '039', '83', '20210002691450', '5034692', '79', '2020', '8', '24', '0038', '1', '670', '03', '2', '14', '2', '14', '1', '670', '03', '2', '14', '2', '14', '1', '1', '8', '21', '1']
If someone could help me, I'd be really thankful.
Change the regex pattern to the following (to match groups of digits separated by optional punctuation):
s = 'text text 123.456.789-33 or 124.323.432.33 text or 12323112333 or even 123,231,123,33 '
num = re.findall(r'\d{3}[.,]?\d{3}[.,]?\d{3}[.,-]?\d{2}', s)
print(num)
['123.456.789-33', '124.323.432.33', '12323112333', '123,231,123,33']
You can try:
\b(?:\d[.-]*){11}\b
import re
s = '''\
123.456.789-33
124.323.432.33
111-2-3-4-5-6-7-8-9'''
pat = re.compile(r'\b(?:\d[.-]*){11}\b')
for m in pat.findall(s):
    print(m)
Prints:
123.456.789-33
124.323.432.33
111-2-3-4-5-6-7-8-9
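To run this over the whole PDF rather than just the first page, here is a minimal sketch combining the PyPDF2 loop from the question with the pattern above (the file path is the same placeholder used in the question):

from PyPDF2 import PdfReader
import re

reader = PdfReader(r"\\abcdacd.pdf")
pattern = re.compile(r'\b(?:\d[.-]*){11}\b')

numbers = []
for page in reader.pages:
    text = page.extract_text() or ''  # guard against pages that yield no text
    numbers.extend(pattern.findall(text))
print(numbers)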
I'm a newbie in this area. Here is the website I need to crawl: "http://py4e-data.dr-chuck.net/comments_1430669.html" and here is its source code: "view-source:http://py4e-data.dr-chuck.net/comments_1430669.html"
It's a simple website for practice. The HTML code looks something like:
<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the actual data for your assignment - good luck!</h1>
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
I need to get the numbers inside the <span class="comments"> tags (100, 100, 99).
Below is my code:
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')
tag = soup.span
print(tag)         # <span class="comments">100</span>
print(tag.string)  # 100
I got the number 100, but only the first one; now I want to get all of them by iterating through a list or something like that. What is the method to do this with BeautifulSoup?
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html,'html.parser')
tags = soup.find_all("span")
for i in tags:
    print(i.string)
You can use the find_all() function and then iterate over the result to get the numbers.
If you also want the names, you can use a Python dictionary:
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html,'html.parser')
tags = soup.find_all("span")
comments = {}
for tag in tags:
    commentorName = tag.find_previous('tr').text   # note: the row's text includes the comment count as well
    commentorComments = tag.string
    comments[commentorName] = commentorComments
print(comments)
This will give you the output as:
{'Melodie100': '100', 'Machaela100': '100', 'Rhoan99': '99', 'Murrough96': '96', 'Lilygrace93': '93', 'Ellenor93': '93', 'Verity89': '89', 'Karlie88': '88', 'Berlin85': '85', 'Skylar84': '84', 'Benny84': '84', 'Crispin81': '81', 'Asya79': '79', 'Kadi76': '76', 'Dua74': '74', 'Stephany73': '73', 'Eila71': '71', 'Jennah70': '70', 'Eduardo67': '67', 'Shannan61': '61', 'Chymari60': '60', 'Inez60': '60', 'Charlene59': '59', 'Rosalin54': '54', 'James53': '53', 'Rhy53': '53', 'Zein52': '52', 'Ayren50': '50', 'Marissa46': '46', 'Mcbride46': '46', 'Ruben45': '45', 'Mikee41': '41', 'Carmel38': '38', 'Idahosa37': '37', 'Brooklin37': '37', 'Betsy36': '36', 'Kayah34': '34', 'Szymon26': '26', 'Tea24': '24', 'Queenie24': '24', 'Nima23': '23', 'Eassan23': '23', 'Haleema21': '21', 'Rahma17': '17', 'Rob17': '17', 'Roma16': '16', 'Jeffrey14': '14', 'Yorgos12': '12', 'Denon11': '11', 'Jasmina7': '7'}
Try the following approach:
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')
data = []
for tr in soup.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    data.append(row[1])  # or data.append(row) for both
print(data)
Giving you data holding a list containing just the one column:
['Comments', '100', '100', '99', '96', '93', '93', '89', '88', '85', '84', '84', '81', '79', '76', '74', '73', '71', '70', '67', '61', '60', '60', '59', '54', '53', '53', '52', '50', '46', '46', '45', '41', '38', '37', '37', '36', '34', '26', '24', '24', '23', '23', '21', '17', '17', '16', '14', '12', '11', '7']
First locate all of the table <tr> rows. Then extract all of the <td> values for each row. As you only want the second one, append row[1] to a data list holding your values.
You can skip the first one if needed with data[1:].
This approach would also let you save the name at the same time by appending the whole row, e.g. use data.append(row) instead.
You could then display the entries using:
for name, comment in data[1:]:
    print(name, comment)
Giving output starting:
Melodie 100
Machaela 100
Rhoan 99
Murrough 96
Lilygrace 93
Ellenor 93
Verity 89
Karlie 88
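If you want to be sure you only pick up the comment counts (and not some other <span> that might appear on a page), you can also filter on the class and convert the values to integers; a small sketch along the same lines as the answers above:

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()
soup = BeautifulSoup(html, 'html.parser')

# only the spans that carry class="comments"
counts = [int(span.string) for span in soup.find_all('span', class_='comments')]
print(counts)       # [100, 100, 99, ...]
print(sum(counts))  # useful if the assignment asks for a total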
I am trying to use beautiful soup to pull the table corresponding to the HTML code below
<table class="sortable stats_table now_sortable" id="team_pitching" data-cols-to-freeze=",2">
<caption>Team Pitching</caption>
from https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2. Here is a screenshot of the site layout and HTML code I am trying to extract from.
I was using the code
import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2'
res = requests.get(url)
soup1 = BS(res.content, 'html.parser')
table1 = soup1.find('table', {'id': 'team_pitching'})
table1
I can't seem to figure out how to get this working. The table above can be extracted with the line
table1 = soup1.find('table',{'id':'team_batting'})
and I figured similar code should work for the one below. Additionally, is there a way to extract this using the table class "sortable stats_table now_sortable" rather than id?
The problem is that if you open the page normally it shows all the tables, but if you load it with Developer Tools only the first table is shown. So when you make your request, the remaining tables are not included in the HTML you get back. The table you're looking for is not shown until the "Show team pitching" button is pressed; to deal with this you could use Selenium and get the full HTML response.
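If you do go the Selenium route, a rough sketch of that idea (this assumes a local Chrome driver is available; the id is the one from the question):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
driver.get('https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2')
html = driver.page_source    # the rendered HTML, including the pitching table
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'team_pitching'})
print(table is not None)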
That is because the table you are looking for, i.e. the <table> with id="team_pitching", is present as a comment inside the soup. You can check this for yourself by printing the soup.
You need to:
Extract that comment from the soup
Convert it into a soup object
Extract the table data from the soup object.
Here is the complete code that does the above mentioned steps.
from bs4 import BeautifulSoup, Comment
import requests
url = 'https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
main_div = soup.find('div', {'id': 'all_team_pitching'})
# Extracting the comment from the above selected <div>
for comments in main_div.find_all(text=lambda x: isinstance(x, Comment)):
    temp = comments.extract()
# Converting the above extracted comment to a soup object
s = BeautifulSoup(temp, 'lxml')
trs = s.find('table', {'id': 'team_pitching'}).find_all('tr')
# Printing the first four data rows of the table (the header row is skipped)
for tr in trs[1:5]:
    print(list(tr.stripped_strings))
The first four data rows from the table:
['1', 'Tyler Ahearn', '21', '1', '0', '1.000', '1.93', '6', '0', '0', '1', '9.1', '8', '5', '2', '0', '4', '14', '0', '0', '0', '42', '1.286', '7.7', '0.0', '3.9', '13.5', '3.50']
['2', 'Jack Anderson', '20', '2', '0', '1.000', '0.79', '4', '1', '0', '0', '11.1', '6', '4', '1', '0', '3', '11', '1', '0', '0', '45', '0.794', '4.8', '0.0', '2.4', '8.7', '3.67']
['3', 'Shane Drohan', '*', '21', '0', '1', '.000', '4.08', '4', '4', '0', '0', '17.2', '15', '12', '8', '0', '11', '27', '1', '0', '2', '82', '1.472', '7.6', '0.0', '5.6', '13.8', '2.45']
['4', 'Conor Grady', '21', '2', '0', '1.000', '3.00', '4', '4', '0', '0', '15.0', '10', '5', '5', '3', '8', '15', '1', '0', '2', '68', '1.200', '6.0', '1.8', '4.8', '9.0', '1.88']
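As for the follow-up question about selecting by class rather than id: once the comment has been converted into its own soup object (s above), you can match on the class attribute instead. A small sketch; note that BeautifulSoup matches class_ against any one of a tag's classes, and several tables on the site share stats_table, so in the full page soup this may return more than one table:

# s is the soup built from the extracted comment, as in the answer above
for table in s.find_all('table', class_='stats_table'):
    caption = table.caption.get_text(strip=True) if table.caption else None
    print(table.get('id'), caption)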
I have a number of pages containing statistics in lists that I am scraping. Everything is working except for one minor issue I cannot seem to resolve: because I use the text of the data fields to find them, a heading that is very similar to another picks up the wrong value. Does anyone know how to correct for this?
HTML looks like this:
<li><span class="bp3-tag p p-50">50</span> <span class="some explaining words.">Positioning</span>
<li><span class="bp3-tag p p-14">14</span> <span class="some other explaining words.">BB Positioning</span>
Code looks like this, and the output returns 14 for both values when it should return 50 for Positioning and 14 for BB Positioning...
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

fifa_stats = ['Positioning', 'BB Positioning']

url = urlopen(req)
soups = bs(url, 'lxml')

def statistics(soups):
    data = {}
    divs_without_skill = soups[1].find_all('div', {'class': 'col-3'})
    more_lis = [div.find_all('li') for div in divs_without_skill]
    lis = soups[0].find_all('li') + more_lis[0]
    for li in lis:
        for stat in fifa_stats:
            if stat in li.text:
                data[stat.replace(' ', '_').lower()] = str(
                    (li.text.split(' ')[0]).replace('\n', ''))
    return data
Any help greatly appreciated.
import requests
from bs4 import BeautifulSoup
from pprint import pp
def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # map each stat section heading (h5) to the list of values (.bp3-tag spans) in that section
    goal = {x.h5.text: [i.text for i in x.select('.bp3-tag')]
            for x in soup.select('div.column.col-3')[7:-1]}
    pp(goal)
main('https://sofifa.com/player/244042/moussa-djitte/210049')
Output:
{'Attacking': ['56', '71', '64', '62', '53'],
'Skill': ['72', '46', '29', '36', '70'],
'Movement': ['78', '79', '83', '65', '74'],
'Power': ['67', '77', '74', '70', '59'],
'Mentality': ['51', '29', '69', '57', '65', '55'],
'Defending': ['33', '14', '16'],
'Goalkeeping': ['8', '8', '6', '15', '13']}
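Another way to deal with the original problem directly is to drop the substring test: 'Positioning' is a substring of 'BB Positioning', so if stats in li.text matches both list entries against the same <li>. Comparing the label text exactly avoids that. A rough sketch against a simplified version of the HTML from the question (variable names are illustrative):

from bs4 import BeautifulSoup

html = '''
<li><span class="bp3-tag p p-50">50</span> <span>Positioning</span></li>
<li><span class="bp3-tag p p-14">14</span> <span>BB Positioning</span></li>
'''
soup = BeautifulSoup(html, 'html.parser')

wanted = ['Positioning', 'BB Positioning']
data = {}
for li in soup.find_all('li'):
    spans = li.find_all('span')
    value = spans[0].get_text(strip=True)   # the number, e.g. '50'
    label = spans[-1].get_text(strip=True)  # the stat name, e.g. 'Positioning'
    if label in wanted:                     # exact comparison, not a substring test
        data[label.replace(' ', '_').lower()] = value

print(data)  # {'positioning': '50', 'bb_positioning': '14'}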
Hi, I want to create an invoice like this image.
I use ReportLab and for my header I use SPAN, but my output is this:
My code is:
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter, inch,A5
from reportlab.platypus import Image, Paragraph, SimpleDocTemplate, Table
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("complex_cell_values.pdf", pagesize=A5)
elements = []
styleSheet = getSampleStyleSheet()
I = Image('replogo.gif')
I.drawHeight = 1.6*inch
I.drawWidth = 5*inch
data= [['','',I,'',''],
['Total Price', 'Price', 'QTY','Description', 'S.No'],
['00', 'rial 360,000', '02', '05', '04'],
['10', '11', '12', '06', '14'],
['20', '21', '22', '23', '24'],
['30', '31', '32', '33', '34']]
t = Table(data, style=[('BOX', (0,0), (-1,-1), 2, colors.black),
                       ('GRID', (0,1), (-1,-1), 0.5, colors.black),
                       ('SPAN', (0,0), (1,0)),   # merge the two leftmost cells of the header row
                       ('SPAN', (3,0), (4,0)),   # merge the two rightmost cells of the header row
                       ('ALIGN', (1,0), (4,-1), 'CENTER')])
t._argW[3] = 1.5*inch  # widen the Description column
elements.append(t)
doc.build(elements)
Does anybody have an idea how I can fix this?
I found the answer: if I use t._argW[x] (where x is the column index) for every column, I can get the desired output.
I added this code:
t._argW[4]=0.4*inch
t._argW[3]=2*inch
t._argW[2]=0.6*inch
t._argW[1]=1*inch
t._argW[0]=1.3*inch
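As a side note, the same widths can be set without touching the private _argW attribute: Table accepts a colWidths argument, so they can be passed when the table is created. A small sketch using the widths from the fix above, listed left to right:

t = Table(data,
          colWidths=[1.3*inch, 1*inch, 0.6*inch, 2*inch, 0.4*inch],
          style=[('BOX', (0,0), (-1,-1), 2, colors.black),
                 ('GRID', (0,1), (-1,-1), 0.5, colors.black),
                 ('SPAN', (0,0), (1,0)),
                 ('SPAN', (3,0), (4,0)),
                 ('ALIGN', (1,0), (4,-1), 'CENTER')])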
I have a CSV file with player attributes:
['Peter Regin', '2', 'DAN', 'N', '1987', '6', '6', '199', '74', '2', '608000', '', '77', '52', '74', '72', '58', '72', '71', '72', '70', '72', '74', '68', '74', '41', '40', '51']
['Andrej Sekera', '8', 'SVK', 'N', '1987', '6', '6', '198', '72', '3', '1323000', '', '65', '39', '89', '78', '75', '70', '72', '56', '53', '56', '57', '72', '57', '59', '70', '51']
For example, I want to check whether a player is a CENTER ('2' at position 1 in my list), and then I want to modify the element at index 12 (which is '77' for Peter Regin).
How can I do that using the csv module?
import csv

class ManipulationFichier:
    def __init__(self, fichier):
        self.fichier = fichier

    def read(self):
        with open(self.fichier) as f:
            reader = csv.reader(f)
            for row in reader:
                print(row)

    def write(self):
        with open(self.fichier) as f:
            writer = csv.writer(f)
            for row in f:
                if row[1] == 2:
                    writer.writerows(row[1] for row in f)
which does nothing useful.
Thanks,
In general, CSV files cannot be reliably modified in-place.
Read the entire file into memory (usually a list of lists, as in your example), modify the data, then write the entire file back.
Unless your file is really huge, and you do this really often, the performance hit will be negligible.
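A minimal sketch of that read-modify-write pattern with the csv module, assuming the file has no header row; the filename and the new attribute value are placeholders:

import csv

filename = 'players.csv'  # placeholder path

# read everything into memory
with open(filename, newline='') as f:
    rows = list(csv.reader(f))

# csv fields are strings, so compare against '2', not 2
for row in rows:
    if row[1] == '2':   # the player is a CENTER
        row[12] = '80'  # illustrative new value for the element at index 12

# write the whole file back
with open(filename, 'w', newline='') as f:
    csv.writer(f).writerows(rows)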