How to save data in multiple files in Python

I'm trying to make a bot in Python that copies some text from a webpage. On every run it grabs 10k+ texts, so I want to save those texts in different files, with each file holding 100+ texts.
How can I do this in Python?
Thanks.

Assuming you don't care what the file names are, you could write each batch of messages to a new temp file, thus:
import tempfile

texts = some_function_grabbing_text()
while texts:
    # NamedTemporaryFile with delete=False keeps the file on disk after it is closed;
    # mode="w" lets us write str objects instead of bytes.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as fp:
        fp.write(texts)
    texts = some_function_grabbing_text()
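If you also want to cap each file at a fixed number of texts (the question mentions 100+ per file), here is a minimal sketch, assuming some_function_grabbing_text() returns the full list of scraped strings rather than one batch at a time:
import tempfile

all_texts = some_function_grabbing_text()   # assumed: the full list of scraped strings
for i in range(0, len(all_texts), 100):     # 100 texts per file
    with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as fp:
        fp.write("\n".join(all_texts[i:i + 100]))
        print("wrote", fp.name)             # report where each chunk ended up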

Related

PDF File dedupe issue with same content, but generated at different time periods from a docx

I am working on a PDF file dedupe project and have analyzed many libraries in Python that read files, generate a hash value, and compare it with the next file to detect duplicates - similar to the logic below, or using Python's filecmp lib. The issue I found with this logic is that if a PDF is generated from a source DOCX (Save to PDF) at different times, the outputs are not considered duplicates - even though the content is exactly the same. Why does this happen? Is there any other logic to read the content and then create a unique hash value based on the actual content?
import hashlib

def calculate_hash_val(path, blocks=65536):
    # Read the file in fixed-size blocks and feed each block to the hash.
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        data = file.read(blocks)
        while len(data) > 0:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()
One of the things that happens is that metadata, including the time of creation, gets saved into the file. It is invisible in the rendered PDF, but it makes the hash different.
Here is an explanation of how to find and strip out that data with at least one tool. I am sure that there are many others.
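If stripping metadata is not practical, another option is to hash the extracted text rather than the raw bytes, so that two PDFs with identical content but different metadata produce the same digest. A minimal sketch, assuming the pypdf package is installed (pypdf is not part of the original question):
import hashlib
from pypdf import PdfReader  # assumed third-party dependency

def content_hash(path):
    # Hash only the extracted page text, ignoring metadata such as creation time.
    reader = PdfReader(path)
    hasher = hashlib.md5()
    for page in reader.pages:
        hasher.update((page.extract_text() or "").encode("utf-8"))
    return hasher.hexdigest()
Note that text extraction is not byte-exact, so this only works when both PDFs extract to the same text.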

Printing top few lines of a large JSON file in Python

I have a JSON file whose size is about 5 GB. I neither know how the JSON file is structured nor the names of the roots in the file. I'm not able to load the file on my local machine because of its size, so I'll be working on high-computation servers.
I need to load the file in Python and print the first N lines to understand the structure and proceed further with data extraction. Is there a way to load and print the first few lines of a JSON file in Python?
If you want to do it in Python, you can do this:
N = 3
with open("data.json") as f:
    for i in range(N):
        print(f.readline(), end='')
You can also use the command head to display the first N lines of the file, to get a sample of the JSON and see how it is structured.
Then use this sample to work on your data extraction.
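If the first few lines are not enough (for example, when the whole 5 GB file is written on a single line), a streaming parser can reveal the structure without loading everything into memory. A minimal sketch, assuming the third-party ijson package is installed (ijson is not mentioned in the original question):
import ijson  # streaming JSON parser, assumed to be installed

# Print the first parser events; the prefixes show the top-level keys and nesting.
with open("data.json", "rb") as f:
    for i, (prefix, event, value) in enumerate(ijson.parse(f)):
        print(prefix, event, value)
        if i >= 20:
            break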
Best regards

Is there a method to read PDF files line by line?

I have a PDF file of over 100 pages. There are boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the PDF file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text displayed as it would be read, left to right.
I've tried:
PyPDF2 - the only tool is extractText(). Fast but does not give gaps in the elements. Results are jumbled.
Pdfminer - PDFPageInterpreter() method with LAParams. This works well but is slow. At least 2 seconds per page and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula_py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re

# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(), filetypes=[("PDF files", "*.pdf")])
print(file_path)                          # print that path
file_data = parser.from_file(file_path)   # parse data from the file
text = file_data['content']               # get the file's text content
by_page = text.split('... Information')   # split the document into pages by a string that always
                                          # appears at the top of each page
for i in range(1, len(by_page)):          # loop page by page
    info = by_page[i]                     # get one page worth of data from the pdf
    reformated = info.replace("\n", "&")  # replace the newlines with "&" to make it more readable
    print("Page: ", i)                    # print page number
    print(reformated, "\n\n")             # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the pdf to be read left to right. Also, if I could get a pure python solution, that would be a bonus. I don't want my end users to be forced to install java (I think the tika and tabula-py methods are dependent on java).
I did this for .docx with the code below, where txt is the text of the .docx. Hope this helps.
import re

# Insert a blank line after sentence-ending punctuation (., ?, !), optionally
# followed by a closing quote, so each sentence starts on its own line.
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)
print(new)
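For the PDF reading order itself, a pure-Python option is pdfminer.six's high-level extract_text() helper, which walks the pages top to bottom and lets you tune how text is grouped into lines via LAParams (it will still be slower than tika). A minimal sketch, assuming pdfminer.six is installed and "document.pdf" is a placeholder file name:
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# LAParams controls how characters and lines are grouped; a larger line_margin
# merges text that sits close together vertically into the same line.
text = extract_text("document.pdf", laparams=LAParams(line_margin=0.5))
for line in text.splitlines():
    print(line)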

What's the best way to save historical data?

I want to write a program in Python that saves the history of cryptocurrency purchases/sales I make. I want to be able to save data such as the time of purchase, the price of the currency at the time of the transaction, and the profit, so that I can look for patterns. How would I go about saving this data?
You can make yourself a program that writes a history file or log as a plain .txt file very easily:
import datetime
import os

os.chdir(r"C:\Users\<username>\Documents")  # location of the history file (raw string avoids backslash-escape issues)

day_of_purchase = datetime.date.today()
price_of_currency = 10
profit = 0

file_object = open("textfile.txt", "w")  # use "a" instead of "w" to append to existing history
file_object.write(str(day_of_purchase) + "," + str(price_of_currency) + "," + str(profit) + "\n")
file_object.close()
If you want to keep adding history output to the file, just open the .txt file in append mode. You could also set up a more structured .csv file using the csv module: https://docs.python.org/3.4/library/csv.html. There are also libraries for writing Excel files, although Excel can read .csv files directly, using a designated character as the column delimiter. (In the above case, select the comma as the delimiter.)
You can make the program more sophisticated by adding some interactive components, such as reading values with the input() function.
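As a concrete illustration of the csv suggestion above, here is a minimal sketch that appends one transaction per row (the file name and column values are assumptions):
import csv
import datetime

# "a" mode appends, so earlier history is kept; newline="" is required by the csv module.
with open("history.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([datetime.datetime.now().isoformat(), "BTC", 10000.0, 0.0])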

Python: reading local HTML files, using findall function to extract text into new HTML file

I'm trying to extract elements from a number of different HTML files using findall and put them into a new HTML file. So far I have:
news = ['16-10-2017.html', '17-10-2017.html', '18-10-2017.html', '19-10-2017.html', '21-10,2017.html', '22-10-2017.html']

def extracted():
    raw_news = open(news, 'r', encoding='UTF-8')
I'm creating a function that will read these files and extract specific parts so I can put them into a new HTML file, but I'm not sure if this code for reading the files is correct. How would I be able to extract elements from these files?
You need to loop over the list and open one file at a time (otherwise Python will complain that it expected a 'str' and got a 'list' instead). Once you are inside the loop, you can operate on each file and save the text you want to find into some other data structure.
Change your working directory to the directory where you have these files and then:
def extracted(news):
    for page in news:
        # page is already the file name, so open it directly (not news[page])
        with open(page, 'r', encoding='UTF-8') as f:
            raw_news = f.read()
            # Now you have raw_news from one page and you can operate on it.
            # On the next iteration the same code runs on the next HTML file.
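Putting it together with findall, here is a minimal sketch that collects matches from every file and writes them into a new HTML file (the regex pattern and the output file name are assumptions, since the question does not say which elements to extract):
import re

news = ['16-10-2017.html', '17-10-2017.html', '18-10-2017.html',
        '19-10-2017.html', '21-10,2017.html', '22-10-2017.html']

def extracted(news):
    snippets = []
    for page in news:
        with open(page, 'r', encoding='UTF-8') as f:
            raw_news = f.read()
        # Hypothetical pattern: grab the text of every <h2> headline.
        snippets.extend(re.findall(r'<h2[^>]*>(.*?)</h2>', raw_news, re.S))
    # Write everything that was found into a new HTML file.
    with open('extracted.html', 'w', encoding='UTF-8') as out:
        out.write('<html><body>\n')
        for s in snippets:
            out.write('<p>{}</p>\n'.format(s))
        out.write('</body></html>\n')
    return snippets

extracted(news)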
