The idea is to open a text file that contains abbreviations and their full words, like a table with 2 columns and n rows. Then open an HTML file, strip the HTML tags, search for the abbreviations, replace them, and save the result in a new text file.
It should read from a file containing lines like:
RASPUKNUTI, raspuknutivi
topografski u slucaju reflektivni za svaki...
import re
from bs4 import BeautifulSoup
import codecs
#-------------------------------- read in the data to search through
dat=open('citaj.txt',"r")
bs4_objekt=BeautifulSoup(dat,"lxml",from_encoding="UTF-8")
onlytext=bs4_objekt.text.strip()
#
z=open('zamijeni_kratice3.txt','r')
text=z.read()
lista_rijeci=text.split('\n')
for rijec in lista_rijeci:
    odjeli=rijec.split("|")
    samotext=re.sub("\s({0})".format(odjeli[0]),"{0}".format(odjeli[1]),onlytext)
    #sm2=re.sub(r'\s(refl.)','reflektivni',samotext)
z.close()
with codecs.open('novi_HAZU.txt','w',encoding='utf8') as f:
    f.write(sm2)
    f.close()
The words from the format() call do not get replaced, and no error is shown. When I put in a replacement for just one word, it works fine:
#sm2=re.sub(r'\s(refl.)','reflektivni',samotext)
I'm going in circles here. Any suggestions or ideas?
My goal is to get something similar to the Python interpreter output, as opposed to the current state of the file: picture
Or the closest I can get to the original: address
The problem I see is that your code does not keep the changes after each substitution: every iteration substitutes into the original onlytext, so each result is thrown away on the next pass (and sm2 is never defined, because that line is commented out). Please try:
import re
from bs4 import BeautifulSoup
import codecs
#-------------------------------- read in the data to search through
dat=open('citaj.txt',"r")
bs4_objekt=BeautifulSoup(dat,"lxml",from_encoding="UTF-8")
onlytext=bs4_objekt.text #.strip()
#
z=open('zamijeni_kratice3.txt','r')
text=z.read()
lista_rijeci=text.split('\n')
for rijec in lista_rijeci:
    odjeli=rijec.split("|")
    onlytext=re.sub("({0})".format(odjeli[0]),"{0}".format(odjeli[1]),onlytext)
z.close()
with codecs.open('novi_HAZU.txt','w',encoding='utf8') as f:
    f.write(onlytext)
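One more thing worth checking: if an abbreviation contains regex metacharacters (for example the trailing dot in "refl."), re.sub treats them as pattern syntax. Escaping each abbreviation with re.escape() avoids that. A minimal sketch, with hypothetical sample data:
import re
# hypothetical lookup entry and sample text, just to show the escaping
onlytext = "topografski u slucaju refl. za svaki"
for rijec in ["refl.|reflektivni"]:
    odjeli = rijec.split("|")
    # re.escape() keeps the "." in "refl." from matching any character
    onlytext = re.sub(re.escape(odjeli[0]), odjeli[1], onlytext)
print(onlytext)  # -> topografski u slucaju reflektivni za svaki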
I have a scenario where I have PDFs with a letterhead and a table-like body of text. I have tried using pdfminer, but I'm struggling to figure out how to approach my problem.
An example of the format for one of my PDFs:
Specifically, pdfminer reads the data starting from the letterhead up until the table header. It then reads the table header in a row-like fashion from left to right. From there it's just beyond messy.
Here is the Python to convert the PDF to text:
from pdfminer.high_level import extract_text

text = extract_text('./quote2.pdf')
print(text)

# write the extracted text out, closing the file properly
with open("results2.txt", "w") as f:
    f.write(text)
And here is a snippet of what the output looks like:
... letterhead info
ITEM #
DESCRIPTION
561347
55 PCs-792.00 LB
6061-T651 PLATE AMS 4027
4 S/C 6" SQUARE
CUTTING PLATE SAW ALUM
PACKAGING SKIDDING
SHIP VIA : OUR TRUCK
Quotation
DATE:
CUSTOMER NUMBER:
QUOTE NUMBER:
FOB:
4/1/2022
319486
957242
Destination
SHIP TO:
The idea was to use regex to extract the relevant numbers. As you can see, it read the first 2 records for the ITEM and DESCRIPTION columns, but from there it starts back up from the letterhead, and it's even messier below.
Is there perhaps a way to separate the letterhead from the rest of the body as a starting step? I'm very new to Python and not sure how to get what I want; help much appreciated!
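For illustration, a minimal regex sketch over the extracted text above; the patterns are assumptions based on that one sample (6-digit numbers, quantities shaped like "55 PCs-792.00 LB") and will likely need tuning for other quotes:
import re
from pdfminer.high_level import extract_text

text = extract_text('./quote2.pdf')

# assumption: item/customer/quote numbers are exactly 6 digits
numbers = re.findall(r'\b\d{6}\b', text)
# assumption: quantity lines look like "55 PCs-792.00 LB"
quantities = re.findall(r'(\d+)\s*PCs-([\d.]+)\s*LB', text)
print(numbers)      # e.g. ['561347', '319486', '957242']
print(quantities)   # e.g. [('55', '792.00')]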
I am trying to parse the skills section of a resume in Python. I found a library by Omkar Pathak called pyresparser, and I was able to extract a PDF resume's contents into a resume.txt file.
However, I was wondering how I can go about extracting only the skills section from the resume into a list and then writing that list into a query.txt file.
I'm reading the contents of resume.txt into a list and then comparing that to a list called skills, which stores the extracted contents from a skills.csv file. Currently, the skills list is empty, and I was wondering how I can go about storing the skills into that list. Is this the correct approach? Any help is greatly appreciated, thank you!
import string
import csv
import re
import sys
import importlib
import os
import spacy
from pyresparser import ResumeParser
import pandas as pd
import nltk
from spacy.matcher import Matcher  # the class is Matcher; lowercase "matcher" raises ImportError
import multiprocessing as mp

def main():
    data = ResumeParser("C:/Users/infinitel88p/Downloads/resume.pdf").get_extracted_data()
    print(data)
    # Added encoding utf-8 to prevent unicode error
    with open("C:/Users/infinitel88p/Downloads/resume.txt", "w", encoding='utf-8') as rf:
        rf.truncate()
        rf.write(str(data))
        print("Resume results are getting printed into resume.txt.")
    # Extracting skills
    resume_list = []
    skill_list = []
    data = pd.read_csv("skills.csv")
    skills = list(data.columns.values)
    resume_file = os.path.dirname(__file__) + "/resume.txt"
    with open(resume_file, 'r', encoding='utf-8') as f:
        for line in f:
            resume_list.append(line.strip())
    for token in resume_list:
        if token.lower() in skills:
            skill_list.append(token)
    print(skill_list)

if __name__ == "__main__":
    main()
An easy way (though not an efficient one) to do it:
Have a set of all possible relevant skills in a text file. For the words in the skills section of the resume, or for all the words in the resume, take each word and check whether it matches any of the words from the text file. If a word matches, that skill is present in the resume. This way you can identify the set of skills present in the resume; see the sketch below.
To go further, you can use naive Bayes classification or unigram probabilities to extract more relevant skills.
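A minimal sketch of that set-lookup idea; the file name skills.txt and the simple whitespace tokenization are assumptions:
# skills.txt is assumed to hold one known skill per line
with open("skills.txt", "r", encoding="utf-8") as f:
    known_skills = {line.strip().lower() for line in f if line.strip()}

with open("resume.txt", "r", encoding="utf-8") as f:
    words = f.read().lower().split()

# strip common punctuation before intersecting with the skill set
found = {w.strip(",.;:()") for w in words} & known_skills
print(sorted(found))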
I wrote a Python script:
from string import punctuation
from collections import Counter
import urllib
from stripogram import html2text
myurl = urllib.urlopen("https://www.google.co.in/?gfe_rd=cr&ei=v-PPV5aYHs6L8Qfwwrlg#q=samsung%20j7")
html_string = myurl.read()
text = html2text( html_string )
file = open("/home/nextremer/Final_CF/contentBased/contentCount/hi.txt", "w")
file.write(text)
file.close()
Using this script I didn't get the output I wanted, only some HTML code.
I want to save all of the webpage's text content in a text file.
I tried urllib2 and bs4 as well, but didn't get results.
I don't want the output as HTML structure; I want all the text data from the webpage.
What do you mean by "webpage text"?
It seems you don't want the full HTML file. If you just want the text you see in your browser, that is not so easily solved, as parsing an HTML document can be very complex, especially with JavaScript-rich pages.
It starts with assessing whether a string between "<" and ">" is a regular tag, and goes all the way to analyzing the CSS properties changed by JavaScript behavior.
That is why people write very big and complex rendering engines for web browsers.
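That said, for static pages a rough approximation is often enough. A minimal sketch with requests and BeautifulSoup (both library choices are assumptions; the question only mentions bs4):
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.example.com").text
soup = BeautifulSoup(html, "html.parser")

# drop script/style elements, then take the visible text
for tag in soup(["script", "style"]):
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)

with open("page_text.txt", "w", encoding="utf-8") as f:
    f.write(text)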
You don't need to write any hard algorithms to extract data from search results. Google has an API for this.
Here is an example: https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py
But to use it, you must first register with Google for an API key.
All the information you need is here: https://developers.google.com/api-client-library/python/start/get_started
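A minimal sketch along the lines of that sample; API_KEY and CSE_ID are placeholders you obtain through the registration step:
from googleapiclient.discovery import build

API_KEY = "your-api-key"  # placeholder: from the Google API console
CSE_ID = "your-cse-id"    # placeholder: ID of your custom search engine

service = build("customsearch", "v1", developerKey=API_KEY)
results = service.cse().list(q="samsung j7", cx=CSE_ID).execute()
for item in results.get("items", []):
    print(item["title"], item["link"])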
import urllib
# Python 2; in Python 3 this lives at urllib.request.urlretrieve
urllib.urlretrieve("http://www.example.com/test.html", "test.txt")
I'm working on a project that requires input from data displayed in a live Flash graph (a data-logging chart at http://137.205.144.34/flash/index.html#menuIndex=1&accordionIndex=2&menuId=mimic1&menuStruct=S1R2M3C1H1). As the HTML couldn't be accessed directly, I used Firebug to monitor my activity, and found the data I wanted stored at http://137.205.144.34/services/unload.cmd?format=csvx&sched=&start=-240:00:00&id=75631&step=864.
However, when I try to access this URL, it automatically saves a file (containing the data) to my PC, so I can't access the HTML source code. Using the URL, I have used BeautifulSoup to import the data, but I can't search or manipulate it using HTML tags, as they are unknown. The only data I actually want is the latest hourly reading, one of roughly 1300 lines, and of that line I only need the last value.
Is there a way I could find the HTML tags? If not, what would be the best way to extract the bit of data I need?
Any help would be greatly appreciated,
Thanks.
The file you are downloading has no HTML in it. It is a comma-separated file, and you should use the csv module to parse it.
This code will print the first item in each row (the item that contains the date and time):
import csv
with open('unload.cmd', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print row[0]
This works assuming that you are using the file downloaded with the default name.
To download the file programmatically instead, read it into a string and then use that as the source for csv.reader():
import urllib
import csv
import StringIO
url = 'http://137.205.144.34/services/unload.cmd?format=csvx&sched=&start=-240:00:00&id=75631&step=864'
f = urllib.urlopen(url)
data = f.read()
reader = csv.reader(StringIO.StringIO(data))
for row in reader:
    if row: print row[0]
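And since only the last value of the latest reading is needed, something like this works (assuming the newest reading is the last non-empty row and the wanted value is its last column; both are assumptions about the feed):
rows = [row for row in csv.reader(StringIO.StringIO(data)) if row]
latest = rows[-1]    # assumption: newest reading is the last row
print latest[-1]     # assumption: the wanted value is the last column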
Here,
http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500
There is a table. My goal is to extract the table and save it to a CSV file. I wrote some code:
import urllib
import os
web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
web.close()
ff = open(r"D:\ex\python_ex\urllib\output.txt", "w")
ff.write(s)
ff.close()
I'm lost from here. Can anyone help with this? Thanks!
Pandas can do this right out of the box, saving you from having to parse the html yourself. read_html() extracts all tables from your html and puts them in a list of dataframes. to_csv() can be used to convert each dataframe to a csv file. For the web page in your example, the relevant table is the last one, which is why I used df_list[-1] in the code below.
import requests
import pandas as pd
url = 'http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')
It's simple enough to do in one line, if you prefer:
pd.read_html(requests.get(<url>).content)[-1].to_csv(<csv file>)
P.S. Just make sure you have lxml, html5lib, and BeautifulSoup4 packages installed in advance.
So essentially you want to parse an HTML file to get elements out of it. You can use BeautifulSoup or lxml for this task.
You already have solutions using BeautifulSoup. I'll post a solution using lxml:
from lxml import etree
import urllib.request
web = urllib.request.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
html = etree.HTML(s)
## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')
## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]
## Get text from rest all 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
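Since the goal is a CSV file, the extracted header and rows can then be written out with the csv module; a small sketch continuing from the variables above (the output file name is arbitrary):
import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)      # the 'th' texts extracted above
    writer.writerows(td_content) # one CSV row per 'tr'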
I would recommend BeautifulSoup, as it has the most functionality. I modified a table parser that I found online that can extract all tables from a webpage, as long as there are no nested tables. Some of the code is specific to the problem I was trying to solve, but it should be pretty easy to modify for your usage. Here is the pastebin link.
http://pastebin.com/RPNbtX8Q
You could use it as follows:
import csv
from urllib2 import Request, urlopen, URLError
from TableParser import TableParser

url_addr = 'http://foo/bar'
req = Request(url_addr)
url = urlopen(req)
tp = TableParser()
tp.feed(url.read())

# NOTE: Here you need to know exactly how many tables are on the page and which one
# you want. Let's say it's the first table
my_table = tp.get_tables()[0]

filename = 'table_as_csv.csv'
with open(filename, 'wb') as f:
    writer = csv.writer(f)
    for row in my_table:
        writer.writerow(row)
The code above is an outline, but if you use the table parser from the pastebin link you should be able to get where you want to go.
You need to parse the table into an internal data structure and then output it in CSV form.
Use BeautifulSoup to parse the table. This question is about how to do that (the accepted answer uses version 3.0.8, which is out of date by now, but you can still use it, or convert the instructions to work with BeautifulSoup version 4).
Once you have the table in a data structure (probably a list of lists in this case), you can write it out with csv.writer, as sketched below.
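A minimal sketch of that approach with BeautifulSoup 4; the table id is taken from the lxml answer above, and the output file name is arbitrary:
import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html fetched as in the question
table = soup.find('table', id='Report1_dgReportDemographic')

# build a list of lists: one inner list of cell texts per table row
rows = [[cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')]

with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)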
Look at the BeautifulSoup module. In the documentation you will find many examples of parsing HTML.
Also, for CSV you have a ready solution: the csv module.
It should be quite easy.
Look at this answer about parsing a table with BeautifulSoup and writing it to a text file.
Also try searching Google for "python beautifulsoup".