How can I extract text from a bytes file using Python

I am trying to code a script that downloads a website's source, saves all the HTML to a file, and then extracts some information from it.
So far I've done the first part: I've saved all the HTML into a text file.
Now I have to extract the relevant information and save it in another text file.
But I'm having problems with the encoding, and I also don't know very well how to extract the text in Python.
Parsing a website:
import urllib.request

# file name to store the data
file_name = r'D:\scripts\datos.txt'

# I want to get the text that goes after this tag <p class="item-description"> and before this other one </p>
tag_starts_with = '<p class="item-description">'
tag_ends_with = '</p>'

# I get the website code and I save it into a text file
with urllib.request.urlopen("http://www.website.com/") as response, open(file_name, 'wb') as out_file:
    data = response.read()
    out_file.write(data)
    print(out_file)  # First question: how can I print the file? This gives me an error, I can't print bytes

# the file is now full of HTML text, so I want to open it and process it
file_for_results = open(r'D:\scripts\datos.txt', encoding="utf8")

# Extract information from the file
# Second question: how do I take a substring of the lines in the file and get the text between
# <p class="item-description"> and </p> so I can store it in file_for_results?

Here is the pseudocode that I have not been able to turn into code:

for line in file_to_filter:
    if line contains word_starts_with
        copy in file_for_results until you find </p>

I am assuming this is an assignment of some sort, where you need to parse the HTML with a given algorithm; if not, just use Beautiful Soup.
The pseudocode actually translates to python code quite easily:
file_to_filter = open("file.html", 'r')
out_file = open("text_output", 'w')

copying = False
for line in file_to_filter:
    if word_starts_with in line:
        copying = True                      # start copying at the opening tag
    if copying:
        print(line, end='', file=out_file)  # store the line in the other file
    if word_ends_with in line:
        break                               # stop once the closing tag is found
And of course you need to close the files, make sure you remove the tags and so on, but this is roughly what your code should be given this algorithm.
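If Beautiful Soup is an option, the whole task (fetch the page, strip the tags, store the text) only takes a few lines. A minimal sketch, assuming the page is UTF-8, that the bs4 package is installed, and using D:\scripts\results.txt as a made-up output path:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("http://www.website.com/") as response:
    html = response.read().decode("utf-8")  # decode the raw bytes before parsing

soup = BeautifulSoup(html, "html.parser")
with open(r'D:\scripts\results.txt', 'w', encoding='utf8') as out_file:
    for p in soup.find_all("p", class_="item-description"):
        out_file.write(p.get_text() + "\n")  # keep only the text between the tags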

Related

Read in Python code/regular expressions from external file

I have a section of Python (Sigil) code:
for (id, href) in bk.text_iter():
    html = bk.readfile(id)
    html = re.sub(r'<title></title>', '<title>Mara’s Tale</title>', html)
    html = re.sub(r'<p>Mara’s Tale</p>', '<p class="title">Mara’s Tale</p>', html)
bk.writefile(id, html)
Ideally, I'd like to read the regular expressions in from an external text file (or just read in that block of code). Any suggestions? I've done something similar in Perl with a try, but I'm a Python novice.
Also, quick supplementary question - shouldn't bk.writefile be indented? And, if so, why is my code working? It looks as though it's outside the for block, and therefore will only write to the final file, if that (it's an epub, so there are several html files), but it's updating all relevant files.
Regarding bk, my understanding is that this object is the whole epub, and what this code is doing is reading each html file that makes up an epub via text_iter, so id is each individual file.
EDIT TO ADD
Ah! That bk.writefile should indeed be indented. I got away with it because, at the point I run this code, I only have a single html file.
As for reading something from a file, it's easy. Assume you have the file 'my_file.txt' in the same folder where the script is saved:
f = open('my_file.txt', 'r')
content = f.read()            # read the whole content of the file into the string 'content'
lines = content.splitlines()  # split that content into the list 'lines'
f.close()

print(lines[0])  # first line
print(lines[1])  # second line
# etc.
As for "shouldn't bk.writefile be indented?": yep, it seems the loop builds and changes the variable html several times but only the last iteration gets saved, which looks weird. Perhaps it should be indented, but that's just a guess.
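Tying the two points together, here is a minimal sketch of reading the substitution pairs from an external file and applying them inside the Sigil loop. The file name 'regexes.txt' and its tab-separated pattern/replacement format are assumptions, not anything Sigil prescribes:

import re

substitutions = []
with open('regexes.txt', 'r', encoding='utf-8') as f:  # one pattern<TAB>replacement per line
    for line in f.read().splitlines():
        if not line.strip():
            continue  # skip blank lines
        pattern, replacement = line.split('\t', 1)
        substitutions.append((pattern, replacement))

for (id, href) in bk.text_iter():
    html = bk.readfile(id)
    for pattern, replacement in substitutions:
        html = re.sub(pattern, replacement, html)
    bk.writefile(id, html)  # indented, so every file in the epub gets written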

How to read/edit a .txt file that has html tags

I'm trying to clean a .txt file of html tags. I have the content of this link saved to a .txt file.
https://www.sec.gov/Archives/edgar/data/1630970/000149315218014686/0001493152-18-014686.txt
I want to remove the HTML tags, but I'm having trouble with actually reading / writing the file.
I've just tried opening the file before processing it with BeautifulSoup.
f = open('test_file.txt',"r")
print(f)
returns:
<_io.TextIOWrapper name='test_file.txt' mode='r' encoding='UTF-8'>
The desired output would be the contents of the file printed out. I feel slightly crazy for not being able to open this.
If you use a proper HTML parser like Beautiful Soup, you can easily remove the HTML tags and get only the text:

from pathlib import Path
from bs4 import BeautifulSoup

contents = Path(file_path).read_text()
soup = BeautifulSoup(contents, 'html.parser')
print(soup.text)

Note that the above is Python 3 code and uses the bs4 package.
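If you also want to save the tag-free text instead of just printing it, one more line does it (the output file name here is just an example, not anything from the question):

Path('test_file_clean.txt').write_text(soup.text, encoding='utf-8')  # hypothetical output name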
The problem is that you're printing out the file object returned by open, not the text contained in the file that the object represents.
You need to tell it to read the file. The simplest ways of doing that are to use readlines, or, as the documentation notes, to just iterate over the object directly:
for line in f:
    print(line)
You can read the file natively, like this. You are missing the .read():

f = open("test_file.txt", "r")
if f.mode == 'r':
    contents = f.read()
    print(contents)

Add new line to output in csv file in python

I am a newbie at Python and I have a web scraper program that retrieves links and puts them into a .csv file. I need to add a new line after each web link in the output, but I do not know how to use \n properly. Here is my code:
file = open('C:\Python34\census_links.csv', 'a')
file.write(str(census_links))
file.write('\n')
Hard to answer your question without knowing the format of the variable census_links.
But presuming it is a list of link strings, you would want to loop over each link in the list, append a newline character to it, and then write that link + newline to the output file:
file = open('C:/Python34/census_links.csv', 'a')

# Simulating a list of links:
census_links = ['example.com', 'sample.org', 'xmpl.net']

for link in census_links:
    file.write(link + '\n')  # append a newline to each link as you process the links

file.close()  # you will need to close the file to ensure all the data is written
E. Ducateme has already answered the question, but you could also use the csv module (most of the code is from here):
import csv

# This is assuming that "census_links" is a list
census_links = ["Example.com", "StackOverflow.com", "Google.com"]

file = open('C:\Python34\census_links.csv', 'a')
writer = csv.writer(file)
for link in census_links:
    writer.writerow([link])
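The same idea with a with block, so the file is closed automatically; opening the file with newline='' is what the csv documentation recommends when handing a file object to csv.writer:

import csv

census_links = ["Example.com", "StackOverflow.com", "Google.com"]
with open(r'C:\Python34\census_links.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for link in census_links:
        writer.writerow([link])  # writerow adds the row terminator for you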

Searching for and manipulating the content of a keyword in a huge file

I have a huge HTML file that I have converted to a text file. (The file is the Facebook home page's source.) Assume the text file has a specific keyword in some places of it, for example: "some_keyword: [bla bla]". How would I print all the different bla blas that come after some_keyword?
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
Imagine there are 50 different names with this format in the page. How would I print all the names that come after "name:", considering that the text is very large and the program crashes when you read() it all at once or try to search through its lines?
Sample File:
shortProfiles:{"100000094503825":{id:"100000094503825",name:"Bla blah",firstName:"Blah",vanity:"blah",thumbSrc:"https://scontent-lax3-1.xx.fbcdn.net/v/t1.0-1/c19.0.64.64/p64x64/10354686_10150004552801856_220367501106153455_n.jpg?oh=3b26bb13129d4f9a482d9c4115b9eeb2&oe=5883062B",uri:"https://www.facebook.com/blah",gender:2,i18nGender:16777216,type:"friend",is_friend:true,mThumbSrcSmall:null,mThumbSrcLarge:null,dir:null,searchTokens:["Bla"],alternateName:"",is_nonfriend_messenger_contact:false},"1347968857":
Based on your comment, since you are the person responsible for writing the data to the file, write the data in JSON format and read it back from the file using json.loads():
import json

json_file = open('/path/to/your_file')
json_str = json_file.read()
json_data = json.loads(json_str)

for item in json_data:
    print(item['name'])
Explanation:
Let's say data is the variable storing
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
which changes dynamically within your code at the point where you perform the write operation on the file. Instead of writing it directly, append it to a list:
a = []
for item in page_content:
    # data = some xy logic on the HTML file
    a.append(data)
Now write this whole list to the file using json.dump().
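For example (the file name 'profiles.json' is an assumption, and each element of the list a is expected to be a dict with a 'name' key):

import json

a = [{"id": "1126830890", "name": "Hillary Clinton", "firstName": "Hillary"}]

with open('profiles.json', 'w') as f:
    json.dump(a, f)  # write the whole list out as valid JSON

with open('profiles.json') as f:
    for item in json.load(f):
        print(item['name'])  # read it back and print every name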
I just wanted to throw this out there, even though I agree with all the comments about dealing with the HTML directly or using Facebook's API (probably the safest way): open file objects in Python act as generators that yield one line at a time without reading the entire file into memory, and the re module can be used to extract information from text.
This can be done like so:
import re

regex = re.compile(r"(?:some_keyword:\s\[)(.*?)\]")

with open("filename.txt", "r") as fp:
    for line in fp:
        for match in regex.findall(line):
            print(match)
Of course this only works if the file is in a "line-based" format, but the end effect is that only the line you are on is loaded into memory at any one time.
Here are the Python 2 docs for the re module: https://docs.python.org/2/library/re.html
Here are the Python 3 docs for the re module: https://docs.python.org/3/library/re.html
I cannot find documentation which details the generator capabilities of file objects in Python; it seems to be one of those well-known secrets... Please feel free to edit and remove this paragraph if you know where in the Python docs this is detailed.

File size increasing after extraction?

This is a pretty general question, and I don't even know whether this is the correct community for the question, if not just tell me.
I recently had an HTML file from which I was extracting ~90 lines of HTML code (the total was ~8000 lines). I did this with a simple Python script and stored my output (the shortened HTML code) in a text file. Now I am curious: the file size has increased. What could cause the file to get bigger after I extracted some part out of it?
File size before: 319.374 Bytes
File size after: 321.516 Bytes
Is this because of the different file formats html and txt?
Any help or suggestions appreciated!
Code:
import glob
import os
import re

def extractor():
    os.chdir(r"F:\Test")  # the directory containing my html
    for file in glob.iglob("*.html"):  # iterates over all files in the directory ending in .html
        with open(file, encoding="utf8") as f, open((file.rsplit(".", 1)[0]) + ".txt", "w", encoding="utf8") as out:
            contents = f.read()
            extract = re.compile(r'StartTag.*?EndTag', re.S)
            cut = extract.sub('', contents)
            if re.search(extract, contents) is not None:
                out.write(cut)
                out.close()

extractor()
EDIT: I also tried using ".html" instead of ".txt" as the file format for my output file. However, the difference still remains.
This code does not write to the original HTML file. Something else must be causing the increased file size.
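One way to check what actually ended up on disk, rather than guessing, is to compare byte counts and look at the line endings: writing in text mode on Windows turns '\n' into '\r\n', and that alone can make an output file larger than its input. The file names below are assumptions based on the question:

import os

html_path = r"F:\Test\page.html"  # hypothetical input file
txt_path = r"F:\Test\page.txt"    # the output extractor() would write for it

print(os.path.getsize(html_path), "bytes in")
print(os.path.getsize(txt_path), "bytes out")

with open(txt_path, 'rb') as f:  # read raw bytes so nothing gets translated
    data = f.read()
print(data.count(b'\r\n'), "CRLF line endings in the output")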
