How do I read all html-files in a directory recursively? - python

I am trying to get the doctype of every HTML file printed to a txt file. I have no experience in Python, so bear with me a bit. :)
The final script is supposed to delete elements from each HTML file depending on the HTML version given in the doctype set in the file. I've attempted to list files in PHP as well, and it works to some extent, but I think Python is a better choice for this task.
The script below is what I've got right now, but I can't figure out how to write a "for each" that gets the doctype of every HTML file in the arkivet folder recursively. It currently only prints the filename and extension, and I don't know how to get the path to each file, or how to use BeautifulSoup to edit and extract info from the files.
import fnmatch
from urllib.request import urlopen as uReq
import os
from bs4 import BeautifulSoup as soup
from bs4 import Doctype

files = ['*.html']
matches = []
for root, dirnames, filenames in os.walk("arkivet"):
    for extensions in files:
        for filename in fnmatch.filter(filenames, extensions):
            matches.append(os.path.join(root, filename))
            print(filename)
matches is a list, but I am not sure how to handle it properly in Python. I would like to print the folder names, the filenames with extension, and each file's doctype into a text file in the root.
The script runs in the CLI on a local Vagrant Debian server with Python 3.5 (Python 2.x is present too). All files and folders live in a folder called arkivet (archive) under the server's public root.
Any help appreciated! I'm stuck here :)

As you did not mark any of the answers as solutions, I'm guessing you never quite got your answer. Here's a chunk of code that recursively searches for files, prints the full file path, and shows the doctype string in the HTML file if it exists.
import os
from bs4 import BeautifulSoup, Doctype

directory = '/home/brian/Code/sof'
for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        if filename.endswith('.html'):
            fname = os.path.join(root, filename)
            print('Filename: {}'.format(fname))
            with open(fname) as handle:
                soup = BeautifulSoup(handle.read(), 'html.parser')
                for item in soup.contents:
                    if isinstance(item, Doctype):
                        print('Doctype: {}'.format(item))
                        break

Vikas's answer is probably what you are asking for, but in case he interpreted the question incorrectly, it's worth knowing that you have access to all three of those variables as you loop: root, dirnames, and filenames. You are currently printing just the base file name:
print(filename)
It is also possible to print the full path instead:
print(os.path.join(root, filename))
Vikas solved the missing directory name by using a different function (os.listdir), but I think that loses the ability to recurse.
A combination of os.walk as you posted, and reading the interior of the file with open as Vikas posted is perhaps what you are going for?
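If so, a minimal sketch of that combination might look like the following; it walks arkivet, reads each HTML file, and writes the folder, filename, and doctype to a text file in the root (the output name doctypes.txt is an assumption):
import os
from bs4 import BeautifulSoup, Doctype

with open('doctypes.txt', 'w') as out:
    for root, dirnames, filenames in os.walk('arkivet'):
        for filename in filenames:
            if filename.endswith('.html'):
                fname = os.path.join(root, filename)
                with open(fname) as handle:
                    parsed = BeautifulSoup(handle.read(), 'html.parser')
                # take the first Doctype node, if any
                doctype = next((str(item) for item in parsed.contents
                                if isinstance(item, Doctype)), 'no doctype found')
                out.write('{}\t{}\t{}\n'.format(root, filename, doctype))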

If you want to read all the HTML files in a particular directory, you can try this one:
import os
from bs4 import BeautifulSoup

directory = '/Users/xxxxx/Documents/sample/'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            # parse the html as you wish

Related

Cleanup HTML using lxml and XPath in Python

I'm learning Python and the lxml toolkit. I need to process multiple .htm files in the local directory (recursively) and remove unwanted tags including their content (divs with the IDs "box", "columnRight", "adbox", "footer", divs with class="box", plus all stylesheets and scripts).
I can't figure out how to do this. I have code that lists all .htm files in the directory:
#!/usr/bin/python
import os
from lxml import html
import lxml.html as lh

path = '/path/to/directory'
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            doc = lh.parse(name)
So I need to add the part that creates a tree, processes the HTML, and removes the unnecessary divs, like:
for element in tree.xpath('//div[@id="header"]'):
    element.getparent().remove(element)
How do I adjust the code for this?
It's hard to tell without seeing your actual files, but try the following and see if it works:
First you don't need both
from lxml import html
import lxml.html as lh
So you can drop the first. Then
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            tree = lh.parse(os.path.join(root, name))  # parse with the full path, not just the name
            docroot = tree.getroot()  # renamed so it doesn't clobber the walk's root
            for element in docroot.xpath('//div[@id="header"]'):
                element.getparent().remove(element)
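Untested against your actual files, but extending that same loop to the ids, the class, and the stylesheets and scripts listed in the question might look like this sketch; writing each file back in place is an assumption:
import os
import lxml.html as lh

path = '/path/to/directory'
# the divs by id, the div class, plus scripts and stylesheets, as one XPath union
unwanted = ('//div[@id="box" or @id="columnRight" or @id="adbox" or @id="footer"]'
            ' | //div[@class="box"] | //script | //style | //link[@rel="stylesheet"]')

for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            fullname = os.path.join(root, name)
            tree = lh.parse(fullname)
            for element in tree.getroot().xpath(unwanted):
                element.getparent().remove(element)
            tree.write(fullname, method="html")  # overwrite the original file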

How do I iterate through files in my directory so they can be opened/read using PyPDF2?

I am working on an invoice scraper for work, where I have successfully written all the code to scrape the fields that I need using PyPDF2. However, I am having trouble figuring out how to put this code into a for loop so I can iterate through all the invoices stored in my directory. There could be anywhere from 1 to 250+ files depending on which project I am using this for.
I thought I would be able to use "*.pdf" in place of the pdf name, but it does not work for me. I am relatively new to Python and have not used that many loops before, so any guidance would be appreciated!
import re
import PyPDF2

pdfFileObj = open(r'C:\Users\notylerhere\Desktop\Test Invoices\SampleInvoice.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)

# Print all text on page
#print(pageObj.extractText())

# Grab account number and meter number
accountNumber = re.compile(r'\d\d\d\d\d-\d\d\d\d\d')
meterNumber = re.compile(r'(\d\d\d\d\d\d\d\d)')
moAccountNumber = accountNumber.search(pageObj.extractText())
moMeterNumber = meterNumber.search(pageObj.extractText())
print('Account Number: ' + moAccountNumber.group())
print('Meter Number: ' + moMeterNumber.group(1))
Thanks very much!
Another option is glob:
import glob

files = glob.glob("c:/mydirectory/*.pdf")
for file in files:
    ...  # do your processing of file here

You need to ensure everything past the colon is properly indented.
You want to iterate over your directory and deal with every file independently.
There are many functions depending on your use case. os.walk is a good place to start.
Example:
import os

for root, directories, files in os.walk('.'):
    for file in files:
        if file.endswith('.pdf'):
            openAndDoStuff(os.path.join(root, file))  # pass the full path, not just the name
import os
import PyPDF2

for el in os.listdir(os.getcwd()):
    if el.endswith("pdf"):
        pdf_reader = PyPDF2.PdfFileReader(open(os.getcwd() + "/" + el, "rb"))  # PDFs must be opened in binary mode
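To tie this back to the code in the question, one hedged way to wire the extraction into a loop (the directory is taken from the question's sample path):
import glob
import re
import PyPDF2

accountNumber = re.compile(r'\d\d\d\d\d-\d\d\d\d\d')
meterNumber = re.compile(r'(\d\d\d\d\d\d\d\d)')

for pdf_path in glob.glob(r'C:\Users\notylerhere\Desktop\Test Invoices\*.pdf'):
    with open(pdf_path, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        text = pdfReader.getPage(0).extractText()
    print(pdf_path)
    moAccountNumber = accountNumber.search(text)
    moMeterNumber = meterNumber.search(text)
    if moAccountNumber:
        print('Account Number: ' + moAccountNumber.group())
    if moMeterNumber:
        print('Meter Number: ' + moMeterNumber.group(1))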

Find a file recursively using python

I have the files below located somewhere on a RHEL machine.
temp_file2.txt
temp_file3.txt
I'm looking for a Python script to find the above files recursively in all directories (I used a wildcard, but it didn't work) and print a message saying whether each file exists or not.
The code snippet below returns nothing:
import glob

for filename in glob.iglob('*/*.txt', recursive=True):
    print(filename)
The one below returns the filename only if it exists in the current working directory:
import glob

for filename in glob.iglob('*.txt', recursive=True):
    print(filename)
This approach seems to have worked for me, using python3.6
import glob

for f in glob.iglob('./**/*.yml', recursive=True):
    print(f)
I was also able to use os.getcwd() + '/**/*.yml'. It appears there must be a directory definition at the start of the glob.
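For the two files from the question specifically, the same pattern would presumably look like this small sketch (searching downward from the current directory is an assumption):
import glob

for target in ('temp_file2.txt', 'temp_file3.txt'):
    hits = glob.glob('./**/' + target, recursive=True)
    if hits:
        print(target, 'exists at', hits)
    else:
        print(target, 'does not exist')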

Read all PDFs in a directory (image)

I have attached an image to help show what I've done. I'm trying to write a program that will add a blank page to all PDFs in the directory that have an odd number of pages. However I can't seem to read all the PDFs in a directory.
The script I have works on a single PDF, but I have 1000's of these to do. Why can't I read all the PDFs in the user_input directory?
The code is here:
from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger
import os

user_input = input("Enter the path of your file: ")
files = os.listdir(user_input)
for file in files:
    print(file)
    pdfReader = PdfFileReader(open(files, 'rb'))
Use the following code. It will give you the list of PDF files in the directory:
import glob, os

def readfiles(path):
    os.chdir(path)
    pdfs = []
    for file in glob.glob("*.pdf"):
        print(file)
        pdfs.append(file)
    return pdfs
In order to process every PDF file in the folder, you need a few things.
get to the right directory
get all files
get only the PDF files
os is perfect for this. It can get all the files and then let you determine what to do with them. One problem I had (it may be yours as well) was that my path had spaces in it, and the input arrived with every space escaped as "\ " ("something\ long\ with\ spaces/abcd/pdf\ folder"), which is not a valid path for os.chdir(). Removing the "\" from the original user input worked just fine. Let me know if you need any further help.
import os

os.chdir(raw_input("enter the path: ").replace("\\", ""))
print os.listdir(".")
for file in os.listdir("."):
    if file.endswith(".pdf"):
        print file
        process(file)  # do whatever it is you need to here
Is the .py file in the same directory as the PDFs? If not, you will need the full path to read the file, not just the filename returned by os.listdir.
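Since the end goal is padding odd-page PDFs with a blank page, here is a hedged sketch of that step with the full-path fix applied; the _padded.pdf output suffix is an assumption:
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

user_input = input("Enter the path of your folder: ")
for name in os.listdir(user_input):
    if name.endswith('.pdf'):
        full_path = os.path.join(user_input, name)  # full path, not just the name
        with open(full_path, 'rb') as fh:
            reader = PdfFileReader(fh)
            writer = PdfFileWriter()
            for page_number in range(reader.getNumPages()):
                writer.addPage(reader.getPage(page_number))
            if reader.getNumPages() % 2 == 1:
                writer.addBlankPage()  # reuses the last page's dimensions
            with open(full_path[:-4] + '_padded.pdf', 'wb') as out:
                writer.write(out)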

Retrieving a tag value from multiple XML files in a directory using Python

I am currently learning Python to automate a few things in my job. I need to retrieve a tag value from multiple XML files in a directory. The directory has many subfolders too.
I tried the following code and understand what is missing, but I am not able to fix it. Here is my code:
from xml.dom.minidom import parse, parseString
import os

def jarv(dir):
    for r, d, f in os.walk(dir):
        for files in f:
            if files.endswith(".xml"):
                print files
                dom = parse(files)
                name = dom.getElementsByTagName('rev')
                print name[0].firstChild.nodeValue

jarv("/path")
I understand that when the dom = parse(files) line executes, it only has the filename without the path, so it fails with "no such file or directory".
I don't know how to fix this.
You have to use os.path.join() to build the correct path from the dirname and the filename:
dom = parse(os.path.join(r, files))
should do it
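Putting it together, the fixed function would presumably look like this (Python 3 print syntax used in this sketch):
from xml.dom.minidom import parse
import os

def jarv(dir):
    for r, d, f in os.walk(dir):
        for files in f:
            if files.endswith(".xml"):
                print(files)
                dom = parse(os.path.join(r, files))  # full path, not just the name
                name = dom.getElementsByTagName('rev')
                print(name[0].firstChild.nodeValue)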
