I have some txt files whose names follow this pattern:
arc.1.txt, arc.2.txt,...,arc.100.txt,..., arc.500.txt,...,arc.838.txt
I know we can write a program using a for loop to open the files one by one if we know the total number of files. I want to know: is it possible to use a while loop to open them without counting the number of files first?
import glob

for each_file in glob.glob("arc.*.txt"):  # glob uses shell wildcards, not regex
    print(each_file)
It is definitely possible to use a while loop assuming that the files are numbered in sequential order:
i = 0
while True:
    i += 1
    filename = 'arc.%d.txt' % i
    try:
        with open(filename, 'r') as file_handle:
            ...
    except IOError:
        break
Though this becomes pretty ugly with all the nesting. You're probably better off getting the list of filenames using something like glob.glob.
from glob import glob

filenames = glob('arc.*.txt')
for filename in filenames:
    with open(filename) as file_handle:
        ...
There are some race conditions associated with this second approach -- if a file somehow gets deleted between when glob found it and when it is actually time to process it, your program could have a bad day.
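If that worries you, one defensive pattern (just a sketch; the arc.*.txt pattern comes from the question above) is to keep the glob but skip any file that has vanished in the meantime:
from glob import glob

# Sketch: combine glob with exception handling so a file deleted mid-run
# is skipped instead of crashing the program.
for filename in glob('arc.*.txt'):
    try:
        with open(filename) as file_handle:
            ...  # process the file
    except FileNotFoundError:
        # The file existed when glob ran but is gone now; skip it.
        continue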
If you add all the filenames to a list, you can remove them one by one and set the while condition to while len(filenames) > 0:, then open the next file on each iteration.
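A minimal sketch of that idea (it assumes the names are gathered with glob first; any list of names would do):
from glob import glob

filenames = glob('arc.*.txt')  # assumption: collect the names up front
while len(filenames) > 0:
    filename = filenames.pop()  # remove one name from the list
    with open(filename) as file_handle:
        ...  # process the file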
Related
For example, I have some 43000 txt files in my folder, but I don't want to read all of them, just the ones in a given range, say from 1.txt to 14400.txt. How can I achieve this in Python? For now, I'm reading all the files in the directory like this:
for each in glob.glob("data/*.txt"):
with open(each , 'r') as file:
content = file.readlines()
with open('{}.csv'.format(each[0:-4]) , 'w') as file:
file.writelines(content)
Any way I can achieve the desired results?
Since glob.glob() returns a list, you can simply iterate over a slice of it using something like:
import glob

for each in glob.glob("*")[:5]:
    print(each)
Just use variable list boundaries and I think this achieves the results you are looking for.
Edit: Also, note that slicing past the end of a list is safe in Python (you simply get fewer items back), so an out-of-bounds slice won't raise an error; only indexing individual elements would.
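A quick demonstration of the difference:
files = ['1.txt', '2.txt', '3.txt']
print(files[:5])  # ['1.txt', '2.txt', '3.txt'] -- an over-long slice is clamped, no error
# files[5] would raise IndexError: list index out of range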
If the files have numerically consecutive names starting with 1.txt, you can use range() to help construct the filenames:
for num in range(1, 14401):  # range's end is exclusive, so 14401 includes 14400.txt
    filename = "data/%d.txt" % num
    with open(filename) as file:
        content = file.readlines()
I found a solution here: How to extract numbers from a string in Python?
import os
import re

filepath = './'
for filename in os.listdir(filepath):
    numbers_in_name = re.findall(r'\d+', filename)  # r'\d+' matches whole numbers, not single digits
    if numbers_in_name and int(numbers_in_name[0]) < 5:
        print(os.path.join(filepath, filename))
        # do other stuff with the filenames
You can use re to get the numbers in the filename. This example prints all filenames where the first number is smaller than 5.
I want to iterate through the filenames in a particular folder. I then wish to choose the first filename that satisfies a criterion (the file name ends with '.txt').
Should I use a For loop and break it when I see the first file ending with .txt?
Or should I use a While loop?
The while loop does not seem to work. It keeps on going, printing filenames, as per the code below.
Following is the code I am using:
import os
import pdb
asciipath = 'C:\\Users\\rmore\\Desktop\\Datalab'
x = 0
file = os.listdir(asciipath)
while file[x].endswith('.txt'):
    print(file[x])
    x = x + 1
It is possible to do this with a while loop, but it will overly complicate the code.
I would use a for loop here. I would also rename file to files just to make what is happening a little more clear.
Edit:
As pointed out, an else clause on the loop would make for a cleaner solution:
files = os.listdir(asciipath)
for file in files:
    if file.endswith('.txt'):
        print(file)
        break
else:
    print('No txt file found')
The break is key for stopping the loop after you find the first file that ends with .txt
Also note that the else statement is on the for loop and NOT inside of the loop. The else will only be triggered if the break statement is NOT triggered.
A Pythonic way would be to use next on a generator expression:
next((f for f in files if f.endswith('.txt')), 'file not found')
Or you can loop over the files and return as soon as the condition is matched:
def find_first_txt_file(files):
    for file in files:
        if file.endswith('.txt'):
            return file
    return 'file not found'
Make it easier for yourself and use pathlib and glob.
from pathlib import Path
p = Path(asciipath)
print(next(p.glob('*.txt')))
I'm new to Python and got stuck on a problem I encountered while studying loops and folder navigation.
The task is simple: loop through a folder and count all '.txt' files.
I believe there may be some modules that tackle this task easily, and I would appreciate it if you could share them. But since this is just a random question I encountered while learning Python, it would be nice if it could be solved using the tools I've just acquired, like for/while loops.
I used for and while clauses to loop through a folder. However, I'm unable to loop through the folder entirely.
Here is the code I used:
import os
count = 0  # set count default
path = 'E:\\'  # set path
while os.path.isdir(path):
    for file in os.listdir(path):  # loop through the folder
        print(file)  # print text to keep track of the process
        if file.endswith('.txt'):
            count += 1
            print('+1')
        elif os.path.isdir(os.path.join(path, file)):  # if it is a subfolder
            print(os.path.join(path, file))
            path = os.path.join(path, file)
            print('is dir')
            break
        else:
            path = os.path.join(path, file)
Since the number of files and subfolders in a folder is unknown, I think a while loop is appropriate here. However, my code has many errors or pitfalls I don't know how to fix. For example, if multiple subfolders exist, this code will only loop through the first subfolder and ignore the rest.
Your problem is that you quickly end up trying to look at non-existent files. Imagine a directory structure where a non-directory named A (E:\A) is seen first, then a file b (E:\b).
On your first loop, you get A, detect that it does not end in .txt and is not a directory, so the else branch changes path to E:\A.
On your second iteration, you get b (meaning E:\b), but all your tests (aside from the .txt extension test) and operations concatenate it with the new path, so you test relative to E:\A\b, not E:\b.
Similarly, if E:\A is a directory, you break the inner loop immediately, so even if E:\c.txt exists, if it occurs after A in the iteration order, you never even see it.
Directory tree traversal code must involve a stack of some sort, either explicitly (by appending and popping from a list of directories for eventual processing), or implicitly (via recursion, which uses the call stack to achieve the same purpose).
In any event, your specific case should really just be handled with os.walk:
for root, dirs, files in os.walk(path):
    print(root)  # print text to keep track of the process
    count += sum(1 for f in files if f.endswith('.txt'))
    # This second line matches your existing behavior (directories ending in
    # .txt were counted too), but might not be intended; remove it if
    # directories ending in .txt should not be included in the count
    count += sum(1 for d in dirs if d.endswith('.txt'))
Just for illustration, the explicit stack approach to your code would be something like:
import os

count = 0  # set count default
paths = ['E:\\']  # make a stack of paths to process
while paths:
    # paths.pop() gets the top of the directory stack to process.
    # os.scandir is easier and more efficient than os.listdir,
    # though it must be closed (the with statement does this for us).
    with os.scandir(paths.pop()) as entries:
        for entry in entries:  # loop through the folder
            print(entry.name)  # print text to keep track of the process
            if entry.name.endswith('.txt'):
                count += 1
                print('+1')
            elif entry.is_dir():  # if it is a subfolder
                print(entry.path, 'is dir')
                # Add it to the paths stack to get to it eventually
                paths.append(entry.path)
You probably want to apply recursion to this problem. In short, you will need a function to handle directories that calls itself when it encounters a sub-directory.
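Here is a minimal sketch of that recursion (the function name and starting path are placeholders, and it assumes counting .txt files is the goal, as in the question):
import os

def count_txt_files(path):
    # Hypothetical helper: count .txt files under path, recursing into subfolders.
    count = 0
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            count += count_txt_files(full)  # the function calls itself here
        elif name.endswith('.txt'):
            count += 1
    return count

print(count_txt_files('E:\\'))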
This might be more than you need, but it will let you list all the .txt files within the directory, and you can also add criteria to search within the files themselves. Here is the function:
import os
import pandas as pd

def file_search(root, extension, search, search_type):
    col1 = []
    col2 = []
    rootdir = root
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            if "." + extension in file.lower():
                try:
                    with open(os.path.join(subdir, file)) as f:
                        contents = f.read()
                    if search_type == 'any':
                        if any(word.lower() in contents.lower() for word in search):
                            col1.append(subdir)
                            col2.append(file)
                    elif search_type == 'all':
                        if all(word.lower() in contents.lower() for word in search):
                            col1.append(subdir)
                            col2.append(file)
                except Exception:  # skip files that can't be opened or read
                    pass
    df = pd.DataFrame({'Folder': col1, 'File': col2})[['Folder', 'File']]
    return df
Here is an example of how to use the function:
search_df = file_search(root='E:\\',            # note: r'E:\\' would leave a doubled backslash
                        search=['foo', 'bar'],  # words to search for
                        extension='txt',        # could change this to 'csv' or 'sql' etc.
                        search_type='all')      # use 'any' or 'all'
search_df
The analysis of your code has already been addressed by #ShadowRanger's answer quite well.
I will try to address this part of your question:
there may be some modules to tackle this task easily
For these kinds of tasks, there actually exists the glob module, which implements Unix-style pathname pattern expansion.
To count the number of .txt files in a directory and all its subdirectories, one may simply use the following:
import os
from glob import iglob, glob
dirpath = '.' # for example
# getting all matching elements in a list and computing its length
len(glob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
# or iterating through all matching elements and summing 1 each time a new item is found
# (this approach is more memory-efficient)
sum(1 for _ in iglob(os.path.join(dirpath, '**/*.txt'), recursive=True))
# 772
Basically glob.iglob() is the iterator version of glob.glob().
For nested directories it's easier to use functions like os.walk.
Take this for example:
subfiles = []
for dirpath, subdirs, files in os.walk(path):
    for x in files:
        if x.endswith(".txt"):
            subfiles.append(os.path.join(dirpath, x))
and it'll return a list of all txt files. Otherwise you'll need to use recursion for a task like this.
Currently I am trying to upload a set of files via an API call. The files have sequential names: part0.xml, part1.xml, etc. The code loops through all the files and uploads them properly, but it doesn't seem to break the loop, and after it uploads the last available file in the directory I get an error:
No such file or directory.
I don't really understand how to make it stop as soon as the last file in the directory is uploaded. It's probably a very dumb question, but I am really lost. How do I stop it from looping through non-existent files?
The code:
part = 0
while True:
    with open('part%d.xml' % part, 'rb') as xml:
        # here goes the API call code
        part += 1
I also tried something like this:
import glob
part = 0
for fname in glob.glob('*.xml'):
    with open('part%d.xml' % part, 'rb') as xml:
        # here goes the API call code
        part += 1
Edit: Thank you all for the answers, learned a lot. Still lots to learn. :)
You almost had it. This is your code with some stuff removed:
import glob
for fname in glob.glob('part*.xml'):
    with open(fname, 'rb') as xml:
        ...  # here goes the API call code
It is possible to make the glob more specific, but as it is, it solves the "foo.xml" problem. The key is not to use counters in Python; the idiomatic iteration is for x in y: and you don't need a counter.
glob will return the filenames in alphabetical order, so you don't even have to worry about that. However, remember that ['part1', 'part10', 'part2'] sort in that order; there are a few ways to cope with that, but it would be a separate question.
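For what it's worth, one common way to cope (a sketch, assuming the number embedded in each name is what should drive the order) is a numeric sort key:
import re

def numeric_key(filename):
    # Pull the number out of names like 'part10.xml' so they sort numerically.
    match = re.search(r'\d+', filename)
    return int(match.group()) if match else 0

filenames = ['part1.xml', 'part10.xml', 'part2.xml']
print(sorted(filenames, key=numeric_key))
# ['part1.xml', 'part2.xml', 'part10.xml']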
Alternatively, you can simply use a regex.
import os
import re

files = [f for f in os.listdir() if re.search(r'part[\d]+\.xml$', f)]
for f in files:
    ...  # process each file
This will be really useful in case you require advanced filtering.
Note: you can do similar filtering with the list returned by glob.glob().
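For example (a sketch mirroring the os.listdir version above):
import glob
import re

# Same filtering idea, applied to glob's results instead of os.listdir's
files = [f for f in glob.glob('*.xml') if re.search(r'part\d+\.xml$', f)]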
If you are not familiar with list comprehensions and regex, I would recommend you refer to:
Regex - howto
List Comprehensions
Your for loop is saying "for every file that ends with .xml"; if you have any file that ends with .xml that isn't a sequential part%d.xml, you're going to get an error. Imagine you have part0.xml and foo.xml. The for loop is going to loop twice; on the second loop, it's going to try to open part1.xml, which doesn't exist.
Since you know the filenames already, you don't even need to use glob.glob(); just check if each file exists before opening it, until you find one that doesn't exist.
import os
from itertools import count
filenames = ('part%d.xml' % part_num for part_num in count())
for filename in filenames:
    if os.path.exists(filename):
        with open(filename, 'rb') as xmlfile:
            do_stuff(xmlfile)  # here goes the API call code
    else:
        break
If for any reason you're worried about files disappearing between os.path.exists(filename) and open(filename, 'rb'), this code is more robust:
import os
from itertools import count
filenames = ('part%d.xml' % part_num for part_num in count())
for filename in filenames:
    try:
        xmlfile = open(filename, 'rb')
    except IOError:
        break
    else:
        with xmlfile:
            do_stuff(xmlfile)  # here goes the API call code
Consider what happens if there are other files that match '*.xml'.
Suppose you have 11 files, "part0.xml"..."part10.xml", but also a file called "foo.xml".
Then the for loop will iterate 12 times (since there are 12 matches for the glob). On the 12th iteration, you are trying to open "part11.xml" which doesn't exist.
One approach is to dump the glob and just handle the exception:
part = 0
while True:
    try:
        with open('part%d.xml' % part, 'rb') as xml:
            # here goes the API call code
            part += 1
    except IOError:
        break
When you use a counter, you need to test whether the file exists:
import os
from itertools import count
for part in count():
    filename = 'part%d.xml' % part
    if not os.path.exists(filename):
        break
    with open(filename) as inp:
        ...  # do something
You are doing it wrong.
Suppose the folder has 3 files: part0.xml, part1.xml, and foo.xml. The loop will iterate 3 times, and on the third iteration it will fail, because it tries to open part2.xml, which is not present.
Don't loop through all files with the extension .xml.
Only loop through files that start with 'part', have a digit in the name before the extension, and have the extension .xml.
So your code will look like this:
import glob
for fname in glob.glob('part*[0-9].xml'):
    with open(fname, 'rb') as xml:
        ...  # here goes the API call code
Read - glob – Filename pattern matching
If you want files to be uploaded in sequential order then read : String Natural Sort
So I'm importing a list of names,
e.g.
the text file would include:
Eleen
Josh
Robert
Nastaran
Miles
my_list = ['Eleen','Josh','Robert','Nastaran','Miles']
Then I'm assigning each name to a list, and I want to write a new Excel file for each name in that list.
#1. Is there any way I can create a for loop around the line:
temp = os.path.join(dir, '....xls')
_________________________
def high_throughput(names):
    import os
    import re

    # Reading file
    in_file = open(names, 'r')
    dir, file = os.path.split(names)
    temp = os.path.join(dir, '***this is where i want to put a for loop '
                             'for each name in the input list of names***.xls')
    out_file = open(temp, 'w')
    data = []
    for line in in_file:
        data.append(line)
    in_file.close()
I'm still not sure what you're trying to do (and by "not sure", I mean "completely baffled"), but I think I can explain some of what you're doing wrong, and how to do it right:
in_file = open(names, 'r')
dir, file = os.path.split(names)
temp = os.path.join(dir, '***this is where i want to put a for loop '
                         'for each name in the input list of names***.xls')
At this point, you don't have the input list of names. That's what you're reading from in_file, and you haven't read it yet. Later on, you read those names into data, after which you can use them. So:
in_file = open(names, 'r')
dir, file = os.path.split(names)
data = []
for line in in_file:
    data.append(line)
in_file.close()

for name in data:
    temp = os.path.join(dir, '{}.xls'.format(name.strip()))  # strip the trailing newline
    out_file = open(temp, 'w')
Note that I put the for loop outside the function call, because you have to do that. And that's a good thing, because you presumably want to open each path (and do stuff to each file) inside that loop, not open a single path made out of a loop of files.
But if you don't insist on using a for loop, there is something that may be closer to what you were looking for: a list comprehension. You have a list of names. You can use that to build a list of paths. And then you can use that to build a list of open files. Like this:
paths = [os.path.join(dir, '{}.xls'.format(name)) for name in data]
out_files = [open(path, 'w') for path in paths]
Then, later, after you've built up the string you want to write to all the files, you can do this:
for out_file in out_files:
    out_file.write(stuff)
However, this is kind of an odd design. Mainly because you have to close each file. They may get closed automatically by the garbage collection, and even if they don't, they may get flushed… but unless you get lucky, all that data you wrote is just sitting around in buffers in memory and never gets written to disk. Normally you don't want to write programs that depend on getting lucky. So, you want to close your files. With this design, you'd have to do something like:
for out_file in out_files:
    out_file.close()
It's probably a lot simpler to go back to the one big loop I suggested in the first place, so you can do this:
for name in data:
    temp = os.path.join(dir, '{}.xls'.format(name.strip()))
    out_file = open(temp, 'w')
    out_file.write(stuff)
    out_file.close()
Or, even better:
for name in data:
    temp = os.path.join(dir, '{}.xls'.format(name.strip()))
    with open(temp, 'w') as out_file:
        out_file.write(stuff)
A few more comments, while we're here…
First, you really shouldn't be trying to generate .xls files manually out of strings. You can use a library like openpyxl. Or you can create .csv files instead—they're easy to create with the csv library that comes built in with Python, and Excel can handle them just as easily as .xls files. Or you can use win32com or pywinauto to take control of Excel and make it create your files. Really, anything is better than trying to generate them by hand.
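For instance, writing a .csv that Excel can open takes only a few lines with the built-in csv module (a sketch; the filename and columns are made up):
import csv

# Sketch: write a header and one row of made-up data to a .csv file
with open('Eleen.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'score'])  # hypothetical header
    writer.writerow(['Eleen', 42])      # hypothetical data row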
Second, the fact that you can write for line in in_file: means that in_file is some kind of sequence of lines. So, if all you want to do is convert it to a list of lines, you can do that in one step:
data = list(in_file)
But really, the only reason you want this list in the first place is so you can loop around it later, creating the output files, right? So why not just hold off, and loop over the lines in the file in the first place?
Whatever you do to generate the output stuff, do that first. Then loop over the file with the list of filenames and write stuff. Like this:
stuff = ...  # whatever you were doing later, in the code you haven't shown
dir = os.path.dirname(names)
with open(names, 'r') as in_file:
    for line in in_file:
        temp = os.path.join(dir, '{}.xls'.format(line.strip()))  # strip the newline
        with open(temp, 'w') as out_file:
            out_file.write(stuff)
That replaces all of the code in your sample (except for that function named high_throughput that imports some modules locally and then does nothing).
Take a look at openpyxl, especially if you need to create .xlsx files. Below example assumes the Excel workbooks are created as blank.
from openpyxl import Workbook

names = ['Eleen', 'Josh', 'Robert', 'Nastaran', 'Miles']
for name in names:
    wb = Workbook()
    wb.save('{0}.xlsx'.format(name))
Try this:
in_file = open(names, 'r')
dir, file = os.path.split(names)
for name in in_file:
    temp = os.path.join(dir, name.strip() + '.xls')  # strip the newline
    with open(temp, 'w') as out_file:
        ...  # write data to out_file