Python Script Only Reads First 100 Files - python

I have a folder with 616 files, but my script only reads the first 100. What settings do I need to change to get it to read them all? It's probably relevant that I'm using Anaconda Navigator's Jupyter Notebook.
Here's my code:
import re
import string
from collections import Counter
import os
import glob

def word_count(file_tokens):
    # Counter tallies every token in one pass; no explicit loop is needed
    return Counter(file_tokens)
files_list = glob.glob("german/test/*/negative/*")
print(files_list)

for path in files_list:
    corpus, tache, classe, file_name = path.split("\\")
    file = open(path, mode="r", encoding="utf-8")
    read_file = file.read()
    ##lowercase
    file_clean = read_file.lower()
    ##tokenize
    file_tokens = file_clean.split()
    ##word count and sort
    print(word_count(file_tokens))

You are probably hitting your system's limit on the number of open files. You can either close every file at the end of the loop, or use a context manager in the loop:
with open(path, mode="r", encoding="utf-8") as file:
    ...
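For instance, the whole loop could look like this (a minimal sketch reusing the variable names from the question), so each file is closed as soon as it has been read:

for path in files_list:
    corpus, tache, classe, file_name = path.split("\\")
    # the file is closed automatically when the with-block exits
    with open(path, mode="r", encoding="utf-8") as file:
        file_tokens = file.read().lower().split()
    print(word_count(file_tokens))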

Have you tried printing the length of the files_list variable to check whether it is 616 or 100?
print(len(files_list))

Related

Python: Look for Files that will constantly change name

So, I'll briefly explain my idea, then what I've tried and the errors I've got so far.
I want to make a Python script that will:
Search for files in a directory, example: /home/mystuff/logs
If it finds one, it will execute a command like print('Errors found'), and then stop.
If not, it will keep executing on and on.
But other logs will be there too, so my intention is to make Python read only the logs in /home/mystuff/logs that match the current date/time, since I want it to be executed every 2 minutes.
Here is my code:
import time
import os
from time import sleep

infile = r"/home/mystuff/logs`date +%Y-%m-%d`*"
keep_phrases = ["Error",
                "Lost Connection"]

while True:
    with open(infile) as f:
        f = f.readlines()
    if phrase in f:
        cmd = ['#print something']
        erro = 1
    else:
        sleep(1)
I've searched for a few regex approaches for the current date, but found nothing about files whose names keep changing according to the date/time. Do you have any ideas?
You can't use shell features like command substitutions in file names. To the OS, and to Python, a file name is just a string. But you can easily create a string which contains the current date and time.
from datetime import datetime
infile = r"/home/mystuff/logs%s" % datetime.now().strftime('%Y-%m-%d')
(The raw string doesn't do anything useful, because the string doesn't contain any backslashes. But it's harmless, so I left it in.)
You also can't open a wildcard; but you can expand it to a list of actual file names with glob.glob(), and loop over the result.
from glob import glob

for file in glob(infile + '*'):
    with open(file, 'r') as f:
        # ...
If you are using a while True: loop you need to calculate today's date inside the loop; otherwise you will be perpetually checking for files from the time when the script was started.
In summary, your changed script could look something like this. I have changed the infile variable name here because it isn't actually a file or a file name, and fixed a few other errors in your code.
# Unused imports
# import time
# import os
from datetime import datetime
from glob import glob
from time import sleep

keep_phrases = ["Error",
                "Lost Connection"]

while True:
    pattern = "/home/mystuff/logs%s*" % datetime.now().strftime('%Y-%m-%d')
    for file in glob(pattern):
        with open(file) as f:
            for line in f:
                if any(phrase in line for phrase in keep_phrases):
                    cmd = ['#print something']
                    erro = 1
                    break
    sleep(120)

Writing in a CSV variables captured from every JSON in current dir

I'm trying to go through every JSON file in my current directory and find two specific variables, productId and userProfileId (both are captured correctly in the output file), but I can't get it to run for every file in the folder.
This is my best try so far
import json
import csv
import os

KEYS = ['user_id','product_id']

for files in os.walk("."):
    for filename in files:
        for i in filename:
            if i.endswith(".json"):
                print(i)
                with open(i) as json_data:
                    order_parsed = json.load(json_data)
                    products_data = order_parsed['items']
                    user_data = order_parsed['clientProfileData']
                with open('user-item.csv','w') as dataFile:
                    newFileWriter = csv.writer(dataFile)
                    newFileWriter.writerow(KEYS)
                    for item in products_data:
                        productId = (products_data[0]['productId'])
                        userId = (user_data["userProfileId"])
                        print(productId)
                        print(userId)
                        newFileWriter.writerow([userId,productId])
To loop through all files in a folder, you can use this for loop.
for file in os.listdir('folder_path'):
    if file[-5:] == ".json":
        arq = open(file,'r')
You are doing the dictionary key lookups inside the with block. Try moving them out of the with block by unindenting the products_data and user_data lines one level.
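Putting both suggestions together, a minimal sketch could look like this (it assumes the JSON layout from the question: an 'items' list whose entries carry a 'productId', and a 'clientProfileData' dict with a 'userProfileId'). The CSV is opened once, outside the loop, so the header is only written once and earlier rows are not overwritten:

import json
import csv
import os

KEYS = ['user_id', 'product_id']

with open('user-item.csv', 'w') as data_file:
    writer = csv.writer(data_file)
    writer.writerow(KEYS)
    for name in os.listdir('.'):
        if not name.endswith('.json'):
            continue
        with open(name) as json_data:
            order_parsed = json.load(json_data)
        user_id = order_parsed['clientProfileData']['userProfileId']
        for item in order_parsed['items']:
            writer.writerow([user_id, item['productId']])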

How to combine several text files into one file?

I want to combine several text files into one output file.
My original code downloads 100 text files; each time, I filter the text for several words and then write it to the output file.
Here is the part of my code that is supposed to combine the new text with the output text. Each time, the result overwrites the output file, deleting the previous content and adding the new text.
import fileinput
import glob

urls = ['f1.txt', 'f2.txt','f3.txt']
N = 0
print "read files"
for url in urls:
    read_files = glob.glob(urls[N])
    with open("result.txt", "wb") as outfile:
        for f in read_files:
            with open(f, "rb") as infile:
                outfile.write(infile.read())
    N += 1
And I also tried this:
import fileinput
import glob

urls = ['f1.txt', 'f2.txt','f3.txt']
N = 0
print "read files"
for url in urls:
    file_list = glob.glob(urls[N])
    with open('result-1.txt', 'w') as file:
        input_lines = fileinput.input(file_list)
        file.writelines(input_lines)
    N += 1
Are there any suggestions?
I need to concatenate/combine approximately 100 text files into one .txt file in a sequential manner. (Each time I read one file and add it to result.txt.)
The problem is that you are re-opening the output file on each loop iteration, which causes it to be overwritten -- unless you explicitly open it in append mode.
The glob logic is also unnecessary when you already know the filename.
Try this instead:
with open("result.txt", "wb") as outfile:
for url in urls:
with open(url, "rb") as infile:
outfile.write(infile.read())
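If you do want to keep opening the output file inside the loop, append mode avoids the overwriting; a minimal sketch:

for url in urls:
    # "ab" appends instead of truncating on every iteration
    with open("result.txt", "ab") as outfile:
        with open(url, "rb") as infile:
            outfile.write(infile.read())

Note that with append mode, re-running the script keeps adding to any existing result.txt, so the single-open version above is usually preferable.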

Error when trying to read and write multiple files

I modified the code based on the comments from experts in this thread. Now the script reads and writes all the individual files. The script iterates, highlights the matches and writes the output. The current issue is that, after highlighting the last instance of the search term, the script drops all the remaining content after that last instance in the output of each file.
Here is the modified code:
import os
import sys
import re

source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)
for f in listfiles:
    filepath = source+'\\'+f
    infile = open(filepath, 'r+')
    source_content = infile.read()
    color = ('red')
    regex = re.compile(r"(\b be \b)|(\b by \b)|(\b user \b)|(\bmay\b)|(\bmight\b)|(\bwill\b)|(\b's\b)|(\bdon't\b)|(\bdoesn't\b)|(\bwon't\b)|(\bsupport\b)|(\bcan't\b)|(\bkill\b)|(\betc\b)|(\b NA \b)|(\bfollow\b)|(\bhang\b)|(\bbelow\b)", re.I)
    i = 0; output = ""
    for m in regex.finditer(source_content):
        output += "".join([source_content[i:m.start()],
                           "<strong><span style='color:%s'>" % color[0:],
                           source_content[m.start():m.end()],
                           "</span></strong>"])
        i = m.end()
    outfile = open(filepath, 'w+')
    outfile.seek(0)
    outfile.write(output)
print "\nProcess Completed!\n"
infile.close()
outfile.close()
raw_input()
The error message tells you what the error is:
No such file or directory: 'sample1.html'
Make sure the file exists, or use a try statement to give it a default behavior.
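For example, a minimal sketch of the try approach, reusing the filepath variable from the question's code; it skips unreadable files instead of crashing:

try:
    infile = open(filepath, 'r+')
except IOError:
    # The file is missing or unreadable; skip it and continue
    print("Skipping %s" % filepath)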
The reason you get that error is that the Python script doesn't know where the files you want to open are located.
You have to provide the file path to open it, as I have done below. I have simply concatenated the source path+'\\'+filename and saved the result in a variable named filepath. Now simply pass this variable to open().
import os
import sys

source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)
for f in listfiles:
    filepath = source+'\\'+f  # This is the file path
    infile = open(filepath, 'r')
Also, there are a couple of other problems with your code. If you want to open the file for both reading and writing, then you have to use r+ mode. Moreover, on Windows, if you open a file using r+ mode, you may have to call file.seek() before file.write() to avoid another issue. You can read the reason for using file.seek() here.
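A minimal sketch of that read-modify-write pattern in r+ mode (modify_text is a hypothetical stand-in for whatever transformation you apply):

infile = open(filepath, 'r+')
source_content = infile.read()
output = modify_text(source_content)  # hypothetical transformation
infile.seek(0)      # rewind before writing over the old contents
infile.write(output)
infile.truncate()   # drop leftover bytes if the new text is shorter
infile.close()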

Python - How to convert many separate PDFs to text?

Question: How can I read in many PDFs in the same path using Python package "slate"?
I have a folder with over 600 PDFs.
I know how to use the slate package to convert single PDFs to text, using this code:
migFiles = [filename for filename in os.listdir(path)
            if re.search(r'(.*\.pdf$)', filename) != None]

with open(migFiles[0]) as f:
    doc = slate.PDF(f)
len(doc)
However, this limits you to one PDF at a time, specified by "migFiles[0]" - 0 being the first PDF in my path.
How can I read in many PDFs to text at once, retaining them as separate strings or txt files? Should I use another package? How could I create a "for loop" to read in all the PDFs in the path?
Try this version:
import glob
import os
import slate

for pdf_file in glob.glob("{}/{}".format(path, "*.pdf")):
    with open(pdf_file) as pdf:
        txt_file = "{}.txt".format(os.path.splitext(pdf_file)[0])
        with open(txt_file, 'w') as txt:
            # slate.PDF returns a list of per-page strings, so join before writing
            txt.write("".join(slate.PDF(pdf)))
This will create a text file with the same name as the pdf in the same directory as the pdf file with the converted contents.
Or, if you want to keep the contents in memory - try this version; but keep in mind that if the converted content is large you may exhaust your available memory:
import glob
import os
import slate

pdf_as_text = {}
for pdf_file in glob.glob("{}/{}".format(path, "*.pdf")):
    with open(pdf_file) as pdf:
        file_without_extension = os.path.splitext(pdf_file)[0]
        pdf_as_text[file_without_extension] = slate.PDF(pdf)
Now you can use pdf_as_text['somefile'] to get the text contents.
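For example, a small usage sketch ('somefile' is just a placeholder key; slate.PDF returns a list of per-page strings, so the pages are joined here):

text = "\n".join(pdf_as_text['somefile'])
print(text[:200])  # show the first 200 characters of the converted PDF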
What you can do is use a simple loop:
docs = []
for filename in migFiles:
    with open(filename) as f:
        docs.append(slate.PDF(f))
        # or, instead of saving the file's text to memory, just process it now
Then, docs[i] will hold the text of the (i+1)-th pdf file, and you can do whatever you want with the file whenever you want. Alternatively, you can process the file inside the for loop.
If you want to convert to text, you can do:
docs = []
separator = ' '  # The character you want to use to separate contents of
# consecutive pages; if you want the contents of each page to be separated
# by a newline, use separator = '\n'
for filename in migFiles:
    with open(filename) as f:
        docs.append(separator.join(slate.PDF(f)))  # turn the pages into plain text
or
separator = ' '
for filename in migFiles:
    with open(filename) as f:
        txtfile = open(filename[:-4] + ".txt", 'w')
        # if filename = "abc.pdf", then filename[:-4] = "abc"
        txtfile.write(separator.join(slate.PDF(f)))
        txtfile.close()
