BeautifulSoup xml get class name value - python

I am using BeautifulSoup to parse Tableau twb XML files to get a list of worksheets in the report.
The XML that holds the value I am looking for is:
<window class='worksheet' name='ML Productivity'>
I am struggling with how to find all of the elements with class='worksheet' and then get the name value from each of them, e.g. the 'ML Productivity' value here.
The code I have so far is below.
import sys, os
import bs4 as bs

twbpath = "C:/tbw tbwx files/"
outpath = "C:/out/"

outFile = open(outpath + 'output.txt', "w")
#twbList = open(outpath + 'twb.txt', "w")

for subdir, dirs, files in os.walk(twbpath):
    for file in files:
        if file.endswith('.twb'):
            print(subdir.replace(twbpath,'') + '-' + file)
            filepath = open(subdir + '/' + file, encoding='utf-8').read()
            soup = bs.BeautifulSoup(filepath, 'xml')
            classnodes = soup.findAll('window')
            for classnode in classnodes:
                if str(classnode) == 'worksheet':
                    outFile.writelines(file + ',' + str(classnode) + '\n')
                    print(subdir.replace(twbpath,'') + '-' + file, classnode)
outFile.close()

You can filter the desired window element by the class attribute value and then treat the result like a dictionary to get the desired attribute:
soup.find('window', {'class': 'worksheet'})['name']
If there are multiple window elements you need to locate, use find_all():
for window in soup.find_all('window', {'class': 'worksheet'}):
    print(window['name'])
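Putting that together with the os.walk loop from the question, a minimal sketch (paths and file handling taken from the question itself) might look like:
import os
import bs4 as bs

twbpath = "C:/tbw tbwx files/"  # example path from the question

for subdir, dirs, files in os.walk(twbpath):
    for file in files:
        if file.endswith('.twb'):
            with open(os.path.join(subdir, file), encoding='utf-8') as f:
                soup = bs.BeautifulSoup(f.read(), 'xml')
            # Filter by the class attribute, then read the name attribute
            for window in soup.find_all('window', {'class': 'worksheet'}):
                print(file, window['name'])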

Related

Extract date from file name in python, not fixed name

I need to get the date from a file name in Python. I found many solutions, but they all assume a fixed name and date. I don't know what the name of the file will be; the date part changes. How can I do that?
I have code that works for a known file name (the current date); the file is called micro20230125.txt:
import re
import os
from datetime import datetime

header = """#SANR0000013003;*;#CNR0010;*;#RINVAL-777.0;*;"""
current_timestamp = datetime.today().strftime('%Y%m%d')
input_file = "micro" + current_timestamp + ".txt"
output_file = os.path.splitext(input_file)[0] + ".zrxp"

with open(input_file, "r") as f:
    first_line = f.readline().strip('\n')
    text = re.search(r'(\d{6})', first_line).group(1)
    text = header + "\n" + text + "\n"

with open(output_file, "w") as f:
    f.write(text)
print(text)
but I don't need the current date. I will get a file with some arbitrary date, so how can I extract the unknown date from the file name? How do I change the current_timestamp variable?
I tried to use a regex but I messed something up.
EDIT: DIFFERENT CODE, SIMILAR PROBLEM:
I was dealing with this code and then realized: Python doesn't know what those numbers in the name represent, so why treat them like a date and complicate things? They are just numbers. As a matter of fact, I need those numbers as well as the full file name. So I came up with different code.
import re
import os

def get_numbers_from_filename(filename):
    return re.search(r'\d+', filename).group(0)  # returns only the numbers

for filename in os.listdir("my path"):
    print(get_numbers_from_filename(filename))

def get_numbers_from_filename(filename):
    return re.search(r"(.)+", filename).group(0)  # returns the whole name

for filename in os.listdir("my path"):
    print(get_numbers_from_filename(filename))
The file was micro20230104.txt, and the result is 20230104 from the first function and micro20230104.txt from the second.
Now I want to use that result, not just print it.
But no matter how I try, it gives me an error.
import re
import os

def get_numbers_from_filename(filename):
    return re.search(r"(.)+", filename).group(0)

for filename in os.listdir("my path"):
    print(get_numbers_from_filename(filename))
    m = get_numbers_from_filename(filename)
    output_file = os.path.splitext(m)[0] + ".zrxp"
    with open(m, "r") as f:
        first_line = f.readline().strip('\n')
        text = re.search(r'(\d{6})', first_line).group(1)
        text = header + "\n" + text + "\n"
    with open(output_file, "w") as f:
        f.write(text)
    print(text)
but it says:
error: there is no such file
What should I do? What am I doing wrong?
Well, in case all the files have the format 'micro[YearMonthDay].txt', you can try this solution:
import os
import re
from datetime import datetime

header = """#SANR0000013003;*;#CNR0010;*;#RINVAL-777.0;*;"""

# Change the variable folder_path to your actual directory path.
folder_path = "\\path_files\\"

filenames = []
# Iterate over the directory
for path in os.listdir(folder_path):
    # Check whether the current path is a file
    if os.path.isfile(os.path.join(folder_path, path)):
        filenames.append(path)

dates = []
for filename in filenames:
    # First solution:
    filename = filename.replace('micro', '')
    filename = filename.replace('.txt', '')
    date = datetime.strptime(filename, "%Y%m%d")
    # Second solution:
    # date = datetime.strptime(filename, "micro%Y%m%d.txt")
    dates.append(date)

for date in dates:
    print(date.strftime("%Y/%m/%d"))
    # Rebuild the file name with the date formatted as it appears in the name
    input_file = os.path.join(folder_path, f'micro{date.strftime("%Y%m%d")}.txt')
    output_file = os.path.splitext(input_file)[0] + ".zrxp"
    with open(input_file, "r") as f:
        first_line = f.readline().strip('\n')
        text = re.search(r'(\d{6})', first_line).group(1)
        text = header + "\n" + text + "\n"
    with open(output_file, "w") as f:
        f.write(text)
    print(text)
Use whichever solution you prefer and comment out the other one.
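Alternatively, since the question mentions trying a regex, here is a minimal sketch (assuming the date is always eight digits between 'micro' and '.txt') that extracts the unknown date with re and only then parses it:
import os
import re
from datetime import datetime

folder_path = "my path"  # placeholder path from the question

for filename in os.listdir(folder_path):
    match = re.search(r'micro(\d{8})\.txt', filename)
    if not match:
        continue  # skip files that don't fit the pattern
    date = datetime.strptime(match.group(1), "%Y%m%d")
    print(filename, "->", date.strftime("%Y-%m-%d"))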
I hope I could help! :D

How can I merge two PDF files without overlapping content

Using another Stack Overflow question and answer, I was able to locate code which partially resolves what I am trying to do: merge PDF files.
However, this modified code results in the contents of the two PDFs overlapping each other. I am trying to stack them, i.e. vertically concatenate the pages:
Example:
PDF1 contents -> "Hello World"
PDF2 contents -> "I am Bill"
The code below produces the two overlapping; the desired result would show one after the other.
Code used, resulting in the overlapping output:
import os
import pdfrw

dirPATH = r'c:\users\<username>\projects\concat_pdfs'
pdf1 = os.path.join(dirPATH, 'PDF1.pdf')
pdf2 = os.path.join(dirPATH, 'PDF2.pdf')

def concat_pdfs(pdf1, pdf2, output):
    form = pdfrw.PdfReader(pdf1)
    olay = pdfrw.PdfReader(pdf2)
    for form_page, overlay_page in zip(form.pages, olay.pages):
        merge_obj = pdfrw.PageMerge()
        overlay = merge_obj.add(overlay_page)[0]
        pdfrw.PageMerge(form_page).add(overlay).render()
    writer = pdfrw.PdfWriter()
    writer.write(output, form)

concat_pdfs(pdf1, pdf2, 'result.pdf')
Thanks in advance!
Have you tried
import pdfrw

def combine_pdfs(dir_path1, dir_path2, save_path):
    pdf1 = pdfrw.PdfReader(dir_path1)
    pdf2 = pdfrw.PdfReader(dir_path2)
    pdf_writer = pdfrw.PdfWriter()
    for page in pdf1.pages:
        pdf_writer.addpage(page)
    for page in pdf2.pages:
        pdf_writer.addpage(page)
    pdf_writer.write(save_path)
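This appends the pages of the second PDF after the pages of the first instead of overlaying them, which is the stacking behaviour asked for. A usage sketch with the question's sample files (the paths are assumptions):
combine_pdfs('PDF1.pdf', 'PDF2.pdf', 'result.pdf')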
Here's an example using the PyPDF2 library (files here is a dict of uploaded file objects, e.g. from a Flask request):
merger = PdfFileMerger()
for filename in files:
    f = files[filename]
    loc = "/tmp/" + secure_filename(filename).replace(".pdf", "") + "_" + str(time.time()) + ".pdf"
    f.save(loc)
    f.close()
    reader = PdfFileReader(loc, "rb")
    merger.append(reader)
dest = "/tmp/merged_" + str(time.time()) + ".pdf"
merger.write(dest)
Here is another using pikepdf:
pdf = Pdf.new()
for filename in files:
    f = files[filename]
    loc = "/tmp/" + secure_filename(filename).replace(".pdf", "") + "_" + str(time.time()) + ".pdf"
    f.save(loc)
    f.close()
    reader = Pdf.open(loc)
    pdf.pages.extend(reader.pages)
dest = "/tmp/merged_" + str(time.time()) + ".pdf"
pdf.save(dest)
Imports might look something like:
import time
import pdfkit
import os
from PyPDF2 import PdfFileMerger, PdfFileReader
from werkzeug.utils import secure_filename
from pikepdf import Pdf
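Outside of that upload-handler context, a minimal local-file sketch with pikepdf (the input paths are hypothetical) could be:
from pikepdf import Pdf

# Hypothetical local input files
paths = ["PDF1.pdf", "PDF2.pdf"]

out = Pdf.new()
for p in paths:
    with Pdf.open(p) as src:
        out.pages.extend(src.pages)  # append pages one after another, no overlap
out.save("merged.pdf")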

How to download and save all PDFs from a dynamic web page?

I am trying to download and save in a folder all the PDFs contained in some web pages with dynamic elements, e.g.: https://www.bankinter.com/banca/nav/documentos-datos-fundamentales
Every PDF at this URL has a similar href. Here are two of them:
"https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc=workspace://SpacesStore/fb029023-dd29-47d5-8927-31021d834757;1.0&nameDoc=ISIN_ES0213679FW7_41-Bonos_EstructuradosGarantizad_19.16_es.pdf"
"https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc=workspace://SpacesStore/852a7524-f21c-45e8-a8d9-1a75ce0f8286;1.1&nameDoc=20-Dep.Estruc.Cont.Financieros_18.1_es.pdf"
Here is what I did for another site; this code works as desired:
import os
import requests

link = 'https://www.bankia.es/estaticos/documentosPRIIPS/json/jsonSimple.txt'
base = 'https://www.bankia.es/estaticos/documentosPRIIPS/{}'

dirf = os.environ['USERPROFILE'] + r"\Documents\TFM\PdfFolder"
if not os.path.exists(dirf): os.makedirs(dirf)
os.chdir(dirf)

res = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
for item in res.json():
    if not 'nombre_de_fichero' in item: continue
    link = base.format(item['nombre_de_fichero'])
    filename_bankia = item['nombre_de_fichero'].split('.')[-2] + ".PDF"
    with open(filename_bankia, 'wb') as f:
        f.write(requests.get(link).content)
You have to make a POST HTTP request with the appropriate JSON parameter. Once you get the response, you have to parse two fields, objectId and nombreFichero, and use them to build the right links to the PDFs. The following should work:
import os
import requests

url = 'https://bancaonline.bankinter.com/publico/rs/documentacionPrix/list'
base = 'https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc={}&nameDoc={}'
payload = {"cod_categoria": 2, "cod_familia": 3, "divisaDestino": None, "vencimiento": None, "edadActuarial": None}

dirf = os.environ['USERPROFILE'] + r"\Desktop\PdfFolder"
if not os.path.exists(dirf): os.makedirs(dirf)
os.chdir(dirf)

r = requests.post(url, json=payload)
for item in r.json():
    objectId = item['objectId']
    nombreFichero = item['nombreFichero'].replace(" ", "_")
    filename = nombreFichero.split('.')[-2] + ".PDF"
    link = base.format(objectId, nombreFichero)
    with open(filename, 'wb') as f:
        f.write(requests.get(link).content)
After executing the above script, wait a little for it to finish, as the site is really slow.
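If some of the PDFs are large, a hedged variant of the download step that streams each response to disk instead of loading it into memory (this helper is an assumption, not part of the original answer):
import requests

def download_pdf(link, filename):
    # Stream the body in chunks rather than buffering it all in memory
    with requests.get(link, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)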

python writing jpg files into a new html file

Please can someone help me?
I have been trying to write a bunch of jpg files into an HTML file that I create myself, just to display them, but I can't seem to get anywhere, and there have been so many errors along the way.
Can anyone please help? I have no experience with HTML from Python.
Here's the code:
def download_images(img_urls, dest_dir):
    #html_file = open("index.html", 'rb')
    html_file = open("index.html", 'w')
    print("Retrieving...")
    html_file.write("""<verbatim>""")
    html_file.write("""<html>""")
    html_file.write("""<body>""")
    for url,i in zip(img_urls,range(len(img_urls))):
        image_opened = urllib.request.urlopen(url)
        urllib.request.urlretrieve(url, "img" + str(i) + ".jpg")
        img_tag = r'"""<img"""' + str(i) + r' src="/edu/python/exercises/img' + str(i) + r'"""">"""'.format(urllib.request.urlretrieve(url, "img" + str(i) + ".jpg"))
        html_file = open("index.html", 'w')
        html_file.write(urllib.request.urlretrieve(url, "img" + str(i) + ".jpg"))
        #print('<img' + str(i) + ' src="/edu/python/exercises/img"' + str(i) + '>')
    html_file.write(r""""</body>""""")
    html_file.write("""</html>""")
    html_file.close()
I will go through what you have so far and comment on it.
The first few bits look okay until
image_opened = urllib.request.urlopen(url)
This line opens a stream to the requested url; you don't do anything with it, and you don't need it because you download the image using:
urllib.request.urlretrieve(url, "img" + str(i) + ".jpg")
You then create the html img line, which you have overcomplicated a bit. You are trying to produce a line that reads something like this:
<img src='img1.jpg' />
What you seem to be doing in:
img_tag = r'"""<img"""' + str(i) + r' src="/edu/python/exercises/img' + str(i) + r'"""">"""'.format(urllib.request.urlretrieve(url, "img" + str(i) + ".jpg"))
is starting to create the right string, but then you attempt to download the jpg again. You just need to create the string as follows:
img_tag = "<img src='img" + str(i) + ".jpg' />"
You then open the html output file again using:
html_file = open("index.html", 'w')
You don't need to do this as you still have the file open from when you opened it at the beginning of the method.
You then attempt to write the html string to the file doing:
html_file.write(urllib.request.urlretrieve(url, "img" + str(i) + ".jpg"))
This instead tries to download the jpg again and writes the result into the html file. Instead you want to write the img_tag:
html_file.write(img_tag)
You write the end of the file okay and close it.
html_file.write(r""""</body>""""")
html_file.write("""</html>""")
html_file.close()
Once you fix this you should have a function that looks like:
import urllib.request

def download_images(img_urls, dest_dir):
    html_file = open("index.html", 'w')
    print("Retrieving...")
    html_file.write("<html>")
    html_file.write("<body>")
    for url, i in zip(img_urls, range(len(img_urls))):
        urllib.request.urlretrieve(url, "img" + str(i) + ".jpg")  # Downloads image
        img_tag = "<img src='img" + str(i) + ".jpg' />"
        html_file.write(img_tag)
    html_file.write("</body>")
    html_file.write("</html>")
    html_file.close()
That you can call with something like:
download_images(["a.jpg","b.jpg"],"")
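Note that dest_dir is accepted but never used above, so everything lands in the current directory. A sketch of the same function that actually writes into dest_dir (assuming the folder may not exist yet) might look like:
import os
import urllib.request

def download_images(img_urls, dest_dir):
    os.makedirs(dest_dir, exist_ok=True)  # create the target folder if needed
    with open(os.path.join(dest_dir, "index.html"), 'w') as html_file:
        html_file.write("<html><body>")
        for i, url in enumerate(img_urls):
            img_name = "img" + str(i) + ".jpg"
            urllib.request.urlretrieve(url, os.path.join(dest_dir, img_name))
            html_file.write("<img src='" + img_name + "' />")
        html_file.write("</body></html>")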

Iterate through multiple files and append text from HTML using Beautiful Soup

I have a directory of downloaded HTML files (46 of them) and I am attempting to iterate through each of them, read their contents, strip the HTML, and append only the text to a text file. However, nothing gets written to my text file, and I'm unsure where I'm messing up.
import os
import glob
from bs4 import BeautifulSoup

path = "/"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (path)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        f.close()
-----update----
I've updated my code as below; however, the text file still doesn't get created.
import os
import glob
from bs4 import BeautifulSoup

path = "/"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()
-----update 2-----
Ah, I caught that I had my directory incorrect, so now I have:
import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()
When this is executed, I get this error:
Traceback (most recent call last):
File "C:\Users\Me\Downloads\bsoup.py, line 11 in <module>
myfile.write(soup)
TypeError: must be str, not BeautifulSoup
I fixed this last error by changing
myfile.write(soup)
to
myfile.write(soup.get_text())
-----update 3 ----
It's working properly now, here's the working code:
import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read())
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())
        myfile.close()
Actually, you are not reading the html file; this should work:
soup = BeautifulSoup(open(webpage, 'r').read(), 'lxml')
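Dropped into the loop from the question, that fix would look roughly like this (path is the question's own download directory):
import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    with open(infile, 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())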
If you want to use lxml.html directly, here is a modified version of some code I've been using for a project. If you want to grab all the text, just don't filter by tag. There may be a way to do it without iterating, but I don't know it. It saves the data as UTF-8, so you will have to take that into account when opening the file.
import os
import glob
import lxml.html

path = '/'

# Whatever tags you want to pull text from.
visible_text_tags = ['p', 'li', 'td', 'h1', 'h2', 'h3', 'h4',
                     'h5', 'h6', 'a', 'div', 'span']

for infile in glob.glob(os.path.join(path, "*.html")):
    doc = lxml.html.parse(infile)
    file_text = []

    for element in doc.iter():  # Iterate once through the entire document
        try:  # Grab tag name and text (+ tail text)
            tag = element.tag
            text = element.text
            tail = element.tail
        except:
            continue

        words = None  # text words split to list
        if tail:  # combine text and tail
            text = text + " " + tail if text else tail
        if text:  # lowercase and split to list
            words = text.lower().split()

        if tag in visible_text_tags:
            if words:
                file_text.append(' '.join(words))

    with open('example.txt', 'a', encoding='utf8') as myfile:
        myfile.write(' '.join(file_text))
