How can I delete a piece of text in WSDL by Python - python

I need to delete some text with tags in many XML files. For example:
this:deleteMeAlways different text</this:deleteMe>.
Im used re.sub("\<this*\<\/this\:deleteMe\>", "", file), but it doesnt work

Related

Converting pdf files to txt files but only getting last page of pdf file

I'm trying to convert a list of PDF files in a directory to txt. At the moment, however, I'm only getting the last page of the pdf files in the newly created txt. files.
The code:
import os, PyPDF2
import re
for file in os.listdir("Documents/Python/"):
if file.endswith(".pdf"):
fpath=os.path.join("Documents/Python/", file)
pdffileobj=open(fpath,'rb')
pdfreader=PyPDF2.PdfFileReader(pdffileobj)
x=pdfreader.numPages
pageobj=pdfreader.getPage(x-1)
text=pageobj.extractText()
newfpath=re.sub(".pdf","txt",fpath)
file1=open(newfpath,"a")
file1.writelines(text)
Txt files with all the pages
You are only getting the text of the last page because you are only ever reading the text of the last page of each pdf pageobj=pdfreader.getPage(x-1)
Although it works, it looks like pdfreader.numPages is deprecated now. The way to do it is len(reader.pages) if you wanted the number of pages. You could also just loop through each page object without getting the number of pages for page in pdfreader.pages:
The main thing you are missing is a second loop to go through each page of the pdf and extract the text.
for file in os.listdir("/your/path"):
if file.endswith(".pdf"):
fpath=os.path.join("/your/path", file)
pdffileobj=open(fpath,'rb')
pdfreader=PyPDF2.PdfFileReader(pdffileobj)
newfpath=re.sub(".pdf",".txt",fpath)
with open(newfpath,"w") as file:
#loop through each page and write to file
for page in pdfreader.pages:
text=page.extractText()
file.write(text)

Python: Iterate through directory to find specific text file, open it, write some of its contents to a new file, close it, and move on to the next dir

I have a script that takes an input text file then finds data in it, puts that data as a variable, then later I call that variable to write to a new file. This snippet of code is just for reading the txt file and storing the data from it as variables.
searchfile = open('C://Users//Me//DynamicFolder//report//summary.txt','r', encoding='utf-8')
slab_count=0
slab_number=[]
slab_total=0
for line in searchfile:
if "Slab" in line:
slab_num = ([float(s) for s in re.findall(r'[-+]?(?:\d*\.\d+|\d+)', line)])
slab_percent = slab_num[-1]
slab_number.append(slab_percent)
slab_count=slab_count+1
slab_total=0
for slab_percent in slab_number:
slab_total+=slab_percent
searchfile.close()
I am using xlsxwriter to write the variables to an excel doc.
My question is, how do I iterate this to search through a given directories sub-directories for summary.txt when there is a dynamic folder.
So C://Users//Me//DynamicFolder//report//summary.txt is a path to one of the files. There are several folders I named DynamicFolder that are there because another process puts them there, they change their names all the time. I need have this script go into each of those dynamic folders to a subdir called report, this is a static name and is always the same. So each of those dynamicfolders has another subdir called report, and in the report folder is a file called summary.txt. I am trying to go through each of those dynamicfolders into the subdir report > summary.txt and then opening and writing data from those txt files.
How do I iterate or loop this? Right now I have 18 folders with those DynamicFolder names that will change when they are over written. How can I put this snip of code to iterate through?
for path in Path('C://Users//Me//DynamicFolder//report//summary.txt').rglob('summary.txt'):
report folder is not the only folder with a summary.txt file, but its the only folder with the file I want. So this code above pulls ALL summary.txt files from all subdir's under the DynamicFolder (not just report folder). I am wondering if I can make this JUST do the 'report' subdir folders under DynamicFolders, and somehow use this to iterate the rest of my code?

reading in multiple text file extensions .pdf, .txt and .htm

I have a folder where I want to read all the text files from and put them into a corpus, however I am only able to do it with .txt files. How can I expand the code below to read in .pdf, .htm and .txt files?
corpus_raw = u""
for file_name in file_names:
with codecs.open(file_name, "r", "utf-8") as file_name:
corpus_raw += file_name.read()
print("Document is {0} characters long".format(len(corpus_raw)))
print()
For example:
with open ('/data/text_file.txt', "r", encoding = "utf-8") as f:
print(f.read())
Read in data where I can view it on a notebook.
with open ('/data/text_file.pdf', "r", encoding = "utf-8") as f:
print(f.read())
Read nothing.
There are two types of files, binary files and plain-text files. A file can have one or the other, or sometimes both.
Html files are plaintext, human readable files, which you can edit by hand, but PDF Files are binary + Text files where you'll need special programs to edit them.
If you want to read from pdf or html, it's possible. I wasn't sure if you meant to extract the text, or to extract the source code, so I'll provide explanations to both.
Extracting Text
Extracting text can be done easily for html files. Using webbrowser, you can open your file in the browser, and then use urllib for extracting text. For more info, refer to the answers here: Extracting text from HTML file using Python
For pdf files, you can use a python module called PyPDF2. Download it using pip:
$ pip install PyPDF2
and get started.
Here is an example of a simple program I found on the internet:
import PyPDF2
# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Extracting Source Code
Extracting source code is best done using python's open function as you did above.
For html files, you can just do what you did with text files. Or maybe to be simpler,
file = open("c:\\path\\to\\file")
print(file.read())
you can just do the above.
For pdf files, you do pretty much the same, but specifying the mode for editing in a different parameter in the open function. For more info, visit the sites in the More Info section.
file = open("c:\\path\\to\\file.extension", "a") #specifies the mode of editing. Unfortunately, you'll only be able to store data, not display it. But you can edit it, then save it after wards
print(file.readable()) #Will return false, proving to be not readable.
file.save("c:\\path\\to\\save\\in.extension")
More Info
https://www.geeksforgeeks.org/working-with-pdf-files-in-python/
https://www.programiz.com/python-programming/methods/built-in/open
This should work for htm/html files with no problem - they are basically just text files. Above, I only see that reading in .pdf has failed - was there a problem with .htm?
Also, reading in a .pdf may be much more difficult/involved than you think. A pdf contains a lot more information than just plaintext, and cannot be meaningfully edited in, say, notepad. As an example of what I mean, here's a small sample of what I got when I opened a .pdf in notepad:
%PDF-1.7
%âãÏÓ
1758 0 obj
<</Filter/FlateDecode/First 401/Length 908/N 51/Type/ObjStm>>stream
hÞ”ØQk\7à¿2ÍK,i4
Cã(Á”¾•–öâ.Ýn‚w]òó3rm˜Ÿ =ÄÜÝèΑ®?ÉÍ…e¦ê?Å/2e¥ÂJÙˆ+SÉT«ù7$"T„ZËT”´ù2£®L~©¯fÊ©±É–iÌ(¦ÄF¹&OðÑ’Œ|hnžU}Žñ¾®ûDOÉæCÄç'¿IF¸/±Å¿”±/ÿ!¾›Ú˜Æ>¤ùeiêóuÚ3õ®äU̘Է’Ìhì´$!_Êœ3©o­úaÇÖÅÏç·rGòuê‡Gé¾é>Žà›ì¾õä›ò£Õì›ðѵx¨ùQXÇ3ð'åC=ªJÃ6óç:¯Öý—ZòóúI¹ù…Ÿ3—ñ$<Éw‘èÍ›«›/dz/¸z¿¿?Ço'ÑoW¿îÆõX矮¯}Ý»ítþ#?~ö¥ç_ü”×éÓÕÇíÛyü6Ç÷·»û͇åòøé÷ýù°ýôöá´?n§}8ž·Ãa·ÿÜ>ßÞo‡ý¿§Wat£õ…Ñ~ûÏ[ýQÌÍß»¯çížRŽI
$L’ù¤“úËI%Ã$OâTHb˜dóI5&$(éé´SI“€ˆE”-&Š("4&E”=$1ÁPDYa1 ˆ`(‚çEä“€†"x^DŽÁ#C</"ÇŽ` ¢B</"ÇŽ¨#D…"x^DŽQˆ
EÔ±#*Q¡ˆº "vD"*QDÄŽ¨#„#uADì"Š¨"bG!P„Ì‹(±#ˆ(BæE”ØD!ó"Jì"!ó"JìˆD4(BæE”Ø
ˆhPD[;¢
Šh"bG4 ¢AmADìˆD(ÑDÄŽP B¡ˆ¶ "v„
E輎¡#„B:/‚cG(¡P„΋àØ
Dt(BçEpìˆDt(BçEpìˆDt(¢/ˆˆÑˆEô±#:Ñ¡ˆ¾ "vD"Šè"bGaPD_;€ƒ"l^Da#„A6/¢ÆŽ0  ›QcG1Þ¡¨y5–DN eA6¢Ö‹¬‚² ‹ç#O…ÉEzQ•ð›ª´#£]„¡wU ¿¬J:ô"ñPüŸÑçSÿ(íÃñ¯íÛÿA?û°§7¿8ìBÀawü‡nww›ßû]€ %“xw
endstream
endobj
1759 0 obj
<</Filter/FlateDecode/First 1907/Length 3450/N 200/Type/ObjStm>>stream
There are, however, options. I would suggest reading the page at https://www.geeksforgeeks.org/working-with-pdf-files-in-python/ as a starting point.

How to rename filename in Arabic in Python

I have a script to rename folder and file name from English to another language. So far this script work well for language reading from left to right. However, when I run this script on a language reading from right to left (e.g. Arabic), I run into an:
errorFileExistsError: [WinError 183] Cannot create a file when that file already exists
My folder structure looks like this:
C:\Users\ABC\Desktop\Template\Report Element Snippets\Review.
Inside the Review folder, I have a file called Blue Review Note.xml. The full file path of this xml file should be
C:\User\ABC\Desktop\Template\Report Element Snippets\Review\Blue Review Note.xml
I will need to rename the Report Element Snippets and Review folders first then run another loop to rename the xml file to Arabic.
The code to rename the xml file is:
os.rename(os.path.join(dirpath,file)
os.path.join(dirpath,newfname))
The problem I could see from tracing the print out of the path is that os.path.join(dirpath,file) give me:
C:\Users\ABC\Desktop\Template\تقرير قصاصات العنصر\إعادة النظر\Blue Review Note. xml
where إعادة النظر is Review and تقرير قصاصات العنصر is Report Element Snippets
but os.path.join(dirpath,newfname) give me:
C:\Users\ABC\Desktop\Template\تقرير قصاصات العنصر\إعادة النظر\ملاحظة مراجعة باللون الأزرق.xml
ملاحظة مراجعة باللون الأزرق.xml is for Blue Review Note.xml
As you can see, the join statement has split the ملاحظة مراجعة باللونالأزرق.xml into 2 parts in the full path. The ملاحظة مراجعة bit is not put in at the starting of the Arabic path and leave the file name as باللون الأزرق.xml but it doesn't have the \ to separate the xml file name with تقرير قصاصات العنصر folder. To me the paths between before and after rename of the xml file are different hence Python can't apply the rename on the folder.
I just wonder has anyone has this issue before when working with Arabic file name?

Create a .xml file with same as .txt file after conversion using elementree python

I am in a project in which I have the task to convert a text file to xml file.
But the restriction is, the file names should be same. I use element tree
module of python for this but while writing to the tree I use tree.write()
method in which I have to explicitly define / hardcode the xml filename
myself.
I want to know is there any procedure to automatically create the .xml file
with same name as the text file
the sh module provides great flexibility and may help you for what you want to do if I understand the question correctly. I have shown an example in the following:
import sh
name = "filename.xml"
new_name = name[-3] + "txt"
sh.touch(new_name)
from there you can open the created file and write directly to it.

Categories

Resources