File size increasing after extraction? - python

This is a fairly general question, and I'm not sure whether this is the right community for it; if not, just tell me.
I recently had an HTML file (about 8,000 lines in total) from which I extracted about 90 lines with a simple Python script. I stored the output (the shortened HTML) in a text file. Now I am curious, because the file size has increased. What could cause the file to get bigger after I cut part of it out?
File size before: 319.374 Bytes
File size after: 321.516 Bytes
Is this because of the different file formats html and txt?
Any help or suggestions appreciated!
Code:
import glob
import os
import re
def extractor():
    os.chdir(r"F:\Test") # the directory containing my html
    for file in glob.iglob("*.html"): # iterates over all files in the directory ending in .html
        with open(file, encoding="utf8") as f, open((file.rsplit(".", 1)[0]) + ".txt", "w", encoding="utf8") as out:
            contents = f.read()
            extract = re.compile(r'StartTag.*?EndTag', re.S)
            cut = extract.sub('', contents)
            if re.search(extract, contents) is not None:
                out.write(cut)
                out.close()
extractor()
EDIT: I also tried using ".html" instead of ".txt" as the file format for my output file. However, the difference remains.

This code does not write to the original HTML file, so something else must be causing the increased file size.
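One thing worth checking, offered as a possibility rather than a confirmed diagnosis: when a file is written in text mode on Windows, Python translates every "\n" into "\r\n", so a file that originally used bare "\n" line endings gains one byte per line. The sketch below forces that translation with newline="\r\n" so it behaves the same on any platform; the helper name count_line_endings and the temporary file names are made up for illustration.

```python
import os
import tempfile


def count_line_endings(path):
    """Return (number of CRLF endings, number of bare LF endings) in a file."""
    with open(path, "rb") as f:
        data = f.read()
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf
    return crlf, lf_only


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.html")
    dst = os.path.join(tmp, "out.txt")

    # A tiny input file with Unix-style line endings.
    with open(src, "wb") as f:
        f.write(b"<p>one</p>\n<p>two</p>\n")

    # Copy it in text mode; newline="\r\n" mimics the Windows default on write.
    with open(src, encoding="utf8") as f, \
            open(dst, "w", encoding="utf8", newline="\r\n") as out:
        out.write(f.read())

    sizes = (os.path.getsize(src), os.path.getsize(dst))
    endings = (count_line_endings(src), count_line_endings(dst))

print(sizes)    # the copy is larger although no text was added
print(endings)
```

Running count_line_endings on the real input and output files would show whether this is what happened; passing newline="" to both open() calls keeps the original line endings untouched.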

Related

Editing a txt file in Python to edit the formatting and then create new txt file

Hello, and thank you for taking the time to read this post. This is literally my first time trying to use Python, so bear with me.
My Target/Goal: Edit the original text file (Original .txt file) so that for every domain listed an "OR" is added in between them (below target formatting image). Any help is greatly appreciated.
I have been able to google the information to open and read the txt file, however, I am not sure how to do the formatting part.
[Images omitted: script, original .txt file, target formatting]
You can achieve this in a couple of lines:
with open(my_file) as fd:
    result = fd.read().replace("\n", " OR ")
You could then write this to another file with:
with open(formatted_file, "w") as fd:
    fd.write(result)
Something you could do is the following:
import re
# This opens the file in read mode
with open('Original.txt', 'r') as file:
    # Read the contents of the file
    contents = file.read()
# It seems that your original file has a line break after each domain, so
# you could replace the line breaks with the word "OR" using a regular expression
contents = re.sub(r'\n+', ' OR ', contents)
# Then you should open the file in write mode
with open('Original.txt', 'w') as file:
    # and finally write the modified contents to the file
    file.write(contents)
A suggestion: you may want to write to a different file first to see whether you are happy with the result (or make a copy of Original.txt, just in case):
with open('AnotherOriginal.txt', 'w') as file:
    file.write(contents)
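Both replacements above also turn a trailing newline at the end of the file into a dangling " OR ". A minimal sketch that avoids this, assuming one domain per line: split into lines, drop the empty ones, and join. The function name join_with_or and the sample domains are made up for illustration.

```python
def join_with_or(text):
    """Join the non-empty lines of text with ' OR ' between them."""
    domains = [line.strip() for line in text.splitlines() if line.strip()]
    return " OR ".join(domains)


# Sample input with a blank line and a trailing newline.
sample = "example.com\nexample.org\n\nexample.net\n"
print(join_with_or(sample))
```

Because str.join only places the separator between items, nothing is appended after the last domain.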

Edit Minecraft .dat File in Python

I'm looking to edit a Minecraft Windows 10 level.dat file in python. I've tried using the package nbt and pyanvil but get the error OSError: Not a gzipped file. If I print open("level.dat", "rb").read() I get a lot of nonsensical data. It seems like it needs to be decoded somehow, but I don't know what decoding it needs. How can I open (and ideally edit) one of these files?
To read the data:
from nbt import nbt
nbtfile = nbt.NBTFile("level.dat", 'rb')
print(nbtfile)  # Here you should get a TAG_Compound('Data')
print(nbtfile["Data"].tag_info())  # "Data" comes from the line above
for tag in nbtfile["Data"].tags:  # This loop shows each entry
    print(tag.tag_info())
As for editing:
# Writing data (changing the difficulty value)
nbtfile["Data"]["Difficulty"].value = 2
print(nbtfile["Data"]["Difficulty"].tag_info())
nbtfile.write_file("level.dat")
EDIT:
It looks like Mojang doesn't use the same format for Java and Bedrock: Bedrock's level.dat file is stored in little-endian format and uses non-compressed UTF-8.
As an alternative, Amulet-NBT is supposed to be a Python library written in Cython for reading and editing NBT files (it supposedly works with Bedrock too).
nbtlib also seems to work, as long as you set byteorder="little" when loading the file.
Let me know if you need more help.
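The "OSError: Not a gzipped file" message suggests the file is simply not gzip-compressed. A quick way to check with only the standard library, as a sketch: gzip streams start with the two bytes 1f 8b. The helper name is_gzipped and the demo file names are made up for illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "plain.dat")
    packed = os.path.join(tmp, "packed.dat")

    # The same payload, written once raw and once gzip-compressed.
    with open(plain, "wb") as f:
        f.write(b"\x0a\x00\x00")
    with gzip.open(packed, "wb") as f:
        f.write(b"\x0a\x00\x00")

    results = (is_gzipped(plain), is_gzipped(packed))

print(results)
```

If is_gzipped("level.dat") returns False, a library that always expects gzip input will fail on that file.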
You'll have to give the path either relative to the current working directory:
path/to/file.dat
or as the absolute path to the file:
C:user/dir/path/to/file.dat
Read the data, replace the values, and then write it back:
# Read in the file
with open('file.dat', 'r') as file:
    filedata = file.read()
# Replace the target string ('old value' and 'new value' are placeholders)
filedata = filedata.replace('old value', 'new value')
# Write the file out again
with open('file.dat', 'w') as file:
    file.write(filedata)

Python - doc to docx file converter input, file path from a txt file

Hi Stack Overflow community,
Situation: I'm trying to run this converter, found here.
However, what I want is for it to read a list of file paths from a text file and convert them.
The reason is that these file paths are filtered manually, so I don't have to convert unnecessary files; there is a large number of unnecessary files in the folder.
How can I go about this? Thank you.
with open("file_path", 'r') as file_content:
    content = file_content.read()
    content = content.split('\n')
You can read the file's data with the method above, then convert it into a list (or any other iterable data type) so that you can use it in a for loop. I used content = content.split('\n') to split the contents on '\n' (every time you press the Enter key, a newline character '\n' is inserted); you can split on any other character.
for i in content:
    pass  # the code you want to execute goes here
Looking at your situation, I guess this is what you want (converting only certain files in a directory), in which case you don't need an extra '.txt' file:
import os
for f in os.listdir(path):
    if f.startswith("Prelim") and f.endswith(".doc"):
        convert(os.path.join(path, f))  # listdir returns bare file names
But if for some reason you want to stick with the ".txt" processing, this may help:
with open("list.txt") as f:
    for line in f:
        file_path = line.strip()  # drop the trailing newline
        if file_path:
            convert(file_path)
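A minimal sketch of the list-file approach that can be tried without the real converter. The convert argument stands in for whatever conversion function you use; here it is replaced by a plain list append, and the sample paths are made up for illustration.

```python
import io


def read_paths(lines):
    """Return the non-empty lines with surrounding whitespace removed."""
    return [line.strip() for line in lines if line.strip()]


def convert_all(lines, convert):
    """Call convert(path) for every path listed and return the paths handled."""
    converted = []
    for path in read_paths(lines):
        convert(path)
        converted.append(path)
    return converted


# An in-memory stand-in for the text file; open("list.txt") would work the same way.
listing = io.StringIO("C:/docs/a.doc\n\nC:/docs/b.doc\n")
seen = []
convert_all(listing, seen.append)
print(seen)
```

Stripping each line matters because lines read from a file keep their trailing newline, which would otherwise become part of the file name.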

How can I extract a text from a bytes file using python

I am trying to write a script that fetches the source of a website, saves all the HTML in a file, and then extracts some information.
So far I've done the first part: I've saved all the HTML into a text file.
Now I have to extract the relevant information and save it in another text file.
But I'm having problems with encoding, and I also don't really know how to extract the text in Python.
Parsing a website:
import urllib.request
File name to store the data:
file_name = r'D:\scripts\datos.txt'
I want to get the text that comes after this tag <p class="item-description"> and before this other one </p>:
tag_starts_with = '<p class="item-description">'
tag_ends_with = '</p>'
I get the website code and save it into a text file:
with urllib.request.urlopen("http://www.website.com/") as response, open(file_name, 'wb') as out_file:
    data = response.read()
    out_file.write(data)
    print (out_file) # First question: how can I print the file? It gives me an error, I can't print bytes
The file is now full of HTML text, so I want to open it and process it:
file_for_results = open(r'D:\scripts\datos.txt',encoding="utf8")
Extract information from the file:
Second question: how do I take a substring of the lines in the file and get the text between <p class="item-description"> and </p>, so I can store it in file_for_results?
Here is the pseudocode that I'm not able to turn into code:
for line in file_to_filter:
    if line contains word_starts_with
        copy in file_for_results until you find </p>
I am assuming this is an assignment of some sort where you need to parse the HTML with a given algorithm; if not, just use Beautiful Soup.
The pseudocode actually translates to Python code quite easily:
word_starts_with = '<p class="item-description">'
word_ends_with = '</p>'
copying = False
with open("file.html", 'r') as file_to_filter, open("text_output", 'w') as out_file:
    for line in file_to_filter:
        if word_starts_with in line:
            copying = True  # start copying at the opening tag
        if copying:
            print(line, end='', file=out_file)  # Store data in another file
            if word_ends_with in line:
                break  # stop at the first closing tag
You still need to strip the tags from the copied lines and so on, but this is roughly what your code should look like given this algorithm.
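If a line-by-line scan is not required, a parser handles tags that span lines or share a line. Below is a minimal sketch using the standard library's html.parser (swapped in here for Beautiful Soup so the example has no third-party dependency). The class name, function name, and sample HTML are made up for illustration, and the sketch assumes the paragraphs of interest are not nested inside one another.

```python
from html.parser import HTMLParser


class ItemDescriptionParser(HTMLParser):
    """Collect the text inside every <p class="item-description"> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.descriptions = []
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and dict(attrs).get("class") == "item-description":
            self.in_item = True
            self._parts = []

    def handle_data(self, data):
        if self.in_item:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_item:
            self.descriptions.append("".join(self._parts).strip())
            self.in_item = False


def extract_descriptions(html_text):
    parser = ItemDescriptionParser()
    parser.feed(html_text)
    parser.close()
    return parser.descriptions


sample = (
    '<div><p class="item-description">First item</p>'
    '<p>skip</p>'
    '<p class="item-description">Second <b>item</b></p></div>'
)
print(extract_descriptions(sample))
```

To use it on the downloaded page, read the saved file with open(file_name, encoding="utf8") and pass its contents to extract_descriptions; the encoding is an assumption about the site.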

How to write lines in a txt file, with data from a csv file

How can I tell Python to open a CSV file, and merge all columns per line, into new lines in a new TXT file?
To explain:
I'm trying to download a bunch of member profiles from a website, for a research project. To do this, I want to write a list of all the URLs in a TXT file.
The URLs are akin to this: website.com-name-country-title-id.html
I have written a script that takes all these bits of information for each member and saves them in columns (name/country/title/id), in a CSV file, like this:
mark japan rookie married
john sweden expert single
suzy germany rookie married
etc...
Now I want to open this CSV and write a TXT file with lines like these:
www.website.com/mark-japan-rookie-married.html
www.website.com/john-sweden-expert-single.html
www.website.com/suzy-germany-rookie-married.html
etc...
Here's the code I have so far. As you can probably tell I barely know what I'm doing so help will be greatly appreciated!!!
import csv
x = "http://website.com/"
y = ".html"
csvFile=csv.DictReader(open("NameCountryTitleId.csv")) #This file is stored on my computer
file = open("urls.txt", "wb")
for row in csvFile:
    strArgument=str(row['name'])+"-"+str(row['country'])+"-"+str(row['title'])+"-"+str(row['id'])
    try:
        file.write(x + strArgument + y)
    except:
        print(strArgument)
file.close()
I don't get any error messages after running this, but the TXT file is completely empty.
Rather than using a DictReader, use a regular reader to make it easier to join the row:
import csv
url_format = "http://website.com/{}.html"
csv_file = 'NameCountryTitleId.csv'
urls_file = 'urls.txt'
with open(csv_file, 'rb') as infh, open(urls_file, 'w') as outfh:
    reader = csv.reader(infh)
    for row in reader:
        url = url_format.format('-'.join(row))
        outfh.write(url + '\n')
The with statement ensures the files are closed properly again when the code completes.
Further changes I made:
In Python 2, open CSV files in binary mode; the csv module handles line endings itself, because correctly quoted column data can contain embedded newlines.
Regular text files should still be opened in text mode, though.
When writing lines to a file, do remember to add a newline character to delineate lines.
Using a string format (str.format()) is far more flexible than using string concatenations.
str.join() lets you join a sequence of strings together with a separator.
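For Python 3, the same idea can be sketched as follows. The csv module documentation recommends opening CSV files in text mode with newline="" instead of binary mode. The function name write_urls is made up for illustration, and the sample assumes the columns are comma-separated, which the question does not state.

```python
import csv
import io

URL_FORMAT = "http://website.com/{}.html"


def write_urls(csv_stream, out_stream):
    """Join the columns of each CSV row with '-' and write one URL per line."""
    count = 0
    for row in csv.reader(csv_stream):
        if not row:  # skip blank lines
            continue
        out_stream.write(URL_FORMAT.format("-".join(row)) + "\n")
        count += 1
    return count


# In-memory stand-ins for the two files, so the sketch runs anywhere.
sample_csv = io.StringIO("mark,japan,rookie,married\njohn,sweden,expert,single\n")
result = io.StringIO()
write_urls(sample_csv, result)
print(result.getvalue(), end="")

# With real files in Python 3:
# with open("NameCountryTitleId.csv", newline="") as infh, open("urls.txt", "w") as outfh:
#     write_urls(infh, outfh)
```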
It's actually quite simple: you are working with strings, yet the file you are writing to is opened in bytes mode, so every single write fails and the row is printed to the screen instead. Try changing this line:
file = open("urls.txt", "wb")
to this:
file = open("urls.txt", "w")
EDIT:
I stand corrected. However, I would like to point out that without newlines or some other separator, how do you intend to use the URLs later on? If you put a newline after each URL, they would be easy to recover.
