Mapping articles from folder into a list

Mapping articles from folder into a list - python

I have a folder with few articles and I would like to map text of each article into a common list in order to use the list for the tf-idf transformation. For example:
folder = [article1, article2, article3]
into list
list = ['text_of_article1', 'text_of_article2', 'text_of_article3']
def multiple_file(arg): #arg is path to the folder with multiple files
'''Function opens multiple files in a folder and maps each of them to a list
as a string'''
import glob, sys, errno
path = arg
files = glob.glob(path)
list = [] #list where file string would be appended
for item in files:
try:
with open(item) as f: # No need to specify 'r': this is the default.
list.append(f.read())
except IOError as exc:
if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
raise # Propagate other kinds of IOError.
return list
When I set the path to the folder with my articles I get an empty list. However, when I set it directly to one article, then that article appears in the list. How could I get all of them mapped into my list. :S
This is the code, not sure if this is what you had in mind:
def multiple_files(arg): #arg is path to the folder with multiple files
'''Function opens multiple files in a folder and maps each of them to a list
as a string'''
import glob, sys, errno, os
path = arg
files = os.listdir(path)
list = [] #list where file string would be appended
for item in files:
try:
with open(item) as f: # No need to specify 'r': this is the default.
list.append(f.read())
except IOError as exc:
if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
raise # Propagate other kinds of IOError.
return list
And this is the error:
Traceback (most recent call last):
File "<ipython-input-7-13e1457699ff>", line 1, in <module>
x = multiple_files(path)
File "<ipython-input-5-6a8fab5c295f>", line 10, in multiple_files
with open(item) as f: # No need to specify 'r': this is the default.
IOError: [Errno 2] No such file or directory: 'u02.txt'
Article No. 2 is actually the first one in the newly created list.

Suppose path == "/home/docs/guzdeh". If you just say glob.glob(path) you only get [path] because nothing else matches the pattern. You want glob.glob(path + "/*") to get everything in that directory, or glob.glob(path + "/*.txt") for all the txt files.
Alternatively you could use import os; os.listdir(path), which I think makes more sense.
UPDATE:
Regarding the new code, the problem is that os.listdir only returns the path relative to the directory listed. Therefore you need to combine the two for python to know where you're talking about. Add:
item = os.path.join(path, item)
before trying to open(item). You might also want to name your variables better.

Related

How to use os.system to convert all files in a folder at once using external python script

I've managed to find out the method to convert a file from one file extension to another (.evtx to .xml) using an external script. Below is what I am using:
os.system("file_converter.py file1.evtx > file1.xml")
This successfully converts a file from .txt to .xml using the external script I called (file_converter.py).
I am now trying to find out a method on how I can use 'os.system' or perhaps another method to convert more than one file at once, I would like for my program to dive into a folder and convert all of the 10 files I have at once to .xml format.
The questions I have are how is this possible as os.system only takes 1 argument and I'm not sure on how I could make it locate through a directory as unlike the first file I converted was on my standard home directory, but the folder I want to access with the 10 files is inside of another folder, I am trying to find out a way to address this argument and for the conversion to be done at once, I also want the file name to stay the same for each individual file with the only difference being the '.xml' being changed from '.evtx' at the end.
The file "file_converter.py" is downloadable from here

import threading
import os
def file_converter(file):
os.system("file_converter.py {0} > {1}".format(file, file.replace(".evtx", ".xml")))
base_dir = "C:\\Users\\carlo.zanocco\\Desktop\\test_dir\\"
for file in os.listdir(base_dir):
threading.Thread(target=file_converter, args=(file,)).start()
Here my sample code.
You can generate multiple thread to run the operation "concurrently". The program will check for all files in the directory and convert it.
EDIT python2.7 version
Now that we have more information about what you want I can help you.
This program can handle multiple file concurrently from one folder, it check also into the subfolders.
import subprocess
import os
base_dir = "C:\\Users\\carlo.zanocco\\Desktop\\test_dir\\"
commands_to_run = list()
#Search all files
def file_list(directory):
allFiles = list()
for entry in os.listdir(directory):
fullPath = os.path.join(directory, entry)
#if is directory search for more files
if os.path.isdir(fullPath):
allFiles = allFiles + file_list(fullPath)
else:
#check that the file have the right extension and append the command to execute later
if(entry.endswith(".evtx")):
commands_to_run.append("C:\\Python27\\python.exe file_converter.py {0} > {1}".format(fullPath, fullPath.replace(".evtx", ".xml")))
return allFiles
print "Searching for files"
file_list(base_dir)
print "Running conversion"
processes = [subprocess.Popen(command, shell=True) for command in commands_to_run]
print "Waiting for converted files"
for process in processes:
process.wait()
print "Conversion done"
The subprocess module can be used in two ways:
subprocess.Popen: it run the process and continue the execution
subprocess.call: it run the process and wait for it, this function return the exit status. This value if zero indicate that the process terminate succesfully
EDIT python3.7 version
if you want to solve all your problem just implement the code that you share from github in your program. You can easily implement it as function.
import threading
import os
import Evtx.Evtx as evtx
import Evtx.Views as e_views
base_dir = "C:\\Users\\carlo.zanocco\\Desktop\\test_dir\\"
def convert(file_in, file_out):
tmp_list = list()
with evtx.Evtx(file_in) as log:
tmp_list.append(e_views.XML_HEADER)
tmp_list.append("<Events>")
for record in log.records():
try:
tmp_list.append(record.xml())
except Exception as e:
print(e)
tmp_list.append("</Events>")
with open(file_out, 'w') as final:
final.writelines(tmp_list)
#Search all files
def file_list(directory):
allFiles = list()
for entry in os.listdir(directory):
fullPath = os.path.join(directory, entry)
#if is directory search for more files
if os.path.isdir(fullPath):
allFiles = allFiles + file_list(fullPath)
else:
#check that the file have the right extension and append the command to execute later
if(entry.endswith(".evtx")):
threading.Thread(target=convert, args=(fullPath, fullPath.replace(".evtx", ".xml"))).start()
return allFiles
print("Searching and converting files")
file_list(base_dir)
If you want to show your files generate, just edit as above:
def convert(file_in, file_out):
tmp_list = list()
with evtx.Evtx(file_in) as log:
with open(file_out, 'a') as final:
final.write(e_views.XML_HEADER)
final.write("<Events>")
for record in log.records():
try:
final.write(record.xml())
except Exception as e:
print(e)
final.write("</Events>")
UPDATE
If you want to delete the '.evtx' files after the conversion you can simply add the following rows at the end of the convert function:
try:
os.remove(file_in)
except(Exception, ex):
raise ex
Here you just need to use try .. except because you run the thread only if the input value is a file.
If the file doesn't exist, this function throws an exception, so it's necessary to check os.path.isfile() first.

import os, sys
DIR = "D:/Test"
# ...or as a command line argument
DIR = sys.argv[1]
for f in os.listdir(DIR):
path = os.path.join(DIR, f)
name, ext = os.path.splitext(f)
if ext == ".txt":
new_path = os.path.join(DIR, f"{name}.xml")
os.rename(path, new_path)
Iterates over a directory, and changes all text files to XML.

Need a python script which outputs a list of files with the same content in a single directory

Given a directory containing say a few thousand files, please output a list of all the names of the
files in the directory that are exactly the same,
I can write script to read files from dir, but i need to compare contents of each and every file in the directory and output should be like matches {f1,f2} {f4,f6}
#!/usr/bin/env python
import sys
import glob
import errno
path = '/home/harish/myfiles'
files = glob.glob(path)
for name in files: # 'file' is a builtin type, 'name' is a less-ambiguous variable name.
try:
with open(name) as f: # No need to specify 'r': this is the default.
sys.stdout.write(f.read())
except IOError as exc:
if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
raise # Propagate other kinds of IOError.

You can find a md5 sum for each file
md5sum = hashlib.md5(open(full_path, 'rb').read()).hexdigest()
Put them in dictionary like this:
{'md5' : ["file_name1, file_name2"]}
then
for key, value in md5s:
if len(value) > 1:
print(values)

How to get the files with the biggest size in the folders, change their name and save to a different folder

I need to get files with the biggest size in different folders, change their name to folder name that they belong to and save to a new folder. I have something like this and I got stuck:
import os
# Core settings
rootdir = 'C:\\Users\\X\\Desktop\\humps'
to_save = 'C:\\Users\\X\\Desktop\\new'
for root, dirs, files in os.walk(rootdir):
new_list = []
for file in files:
if file.endswith(".jpg"):
try:
print(file)
os.chdir(to_save)
add_id = root.split("humps\\")[1]
add_id = add_id.split("\\")[0]
file_name = os.path.join(root,file)
new_list.append(file_name)
bigfile = max(new_list, key=lambda x: x.stat().st_size)
except:
pass
To make it more clear: Let's say the name of the sub-folder is "elephant" and there are different elephant photos and subfolders in in this elephant folder. I want to go through those photos and subfolders and find the elephant foto with the biggest size, name it as elephant and save it to my target folder. Also repaet it for other sub folders such as lion, puma etc.
How I could achieve what I want ?

To find biggest file and save to another location
import os
import shutil
f_list = []
root = "path/to/directory"
root = os.path.abspath(root)
for folder, subfolders, files in os.walk(root):
for file in files:
filePath = os.path.join(folder, file)
f_list.append(filePath)
bigest_file = max(f_list,key=os.path.getsize)
new_path = "path/where/you/want/to/save"
shutil.copy(biggest_file,new_path)
if you want only images then add one more condition in loop
for folder, subfolders, files in os.walk(root):
for file in files:
if file.endswith(".jpg"):
filePath = os.path.join(folder, file)
f_list.append(filePath)
To get all folders biggest file
root = "demo"
root = os.path.abspath(root)
def test(path):
big_files = []
all_paths = [x[0] for x in os.walk(path)]
for paths in all_paths:
f_list = filter(os.path.isfile, os.listdir(paths))
if len(f_list) > 0:
big_files.append((paths,max(f_list,key=os.path.getsize)))
return big_files
print test(root)

How to get the files with the biggest size in the folders, change their name and save to a different folder
Basically you already have a good description of what you need to do. You just need to follow it step by step:
get all files in some search directory
filter for relevant files ("*.jpg")
get their sizes
find the maximum
copy to new directory with name of search directory
IMO it's an important skill to be able to break down a task into smaller tasks. Then, you just need to implement the smaller tasks and combine:
def iterate_files_recursively(directory="."):
for entry in os.scandir(directory):
if entry.is_dir():
for file in iterate_files_recursively(entry.path):
yield file
else:
yield entry
files = iterate_files_recursively(subfolder_name)
I'd use os.scandir because it avoids building up a (potentially) huge list of files in memory and instead allows me (via a generator) to work one file at a time. Note that starting with 3.6 you can use the result of os.scandir as a context manager (with syntax).
images = itertools.filterfalse(lambda f: not f.path.endswith('.jpg'), files)
Filtering is relatively straightforward except for the IMO strange choice of ìtertools.filterfalse to only keep elements for which its predicate returns False.
biggest = max(images, key=(lambda img: img.stat().st_size))
This is two steps in one: Get the maximum with the builtin max function, and use the file size as "key" to establish an order. Note that this raises a ValueError if you don't have any images ... so you might want to supply default=None or handle that exception.
shutil.copy(biggest.path, os.path.join(target_directory, subfolder_name + '.jpg')
shutil.copy copies the file and some metadata. Instead of hardcoding path separators, please use os.path.join!
Now all of this assumes that you know the subfolder_name. You can scan for those easily, too:
def iterate_directories(directory='.'):
for entry in os.scandir(directory):
if entry.is_dir():
yield entry

Here's some code that does what you want. Instead of using the old os.walk function, it uses modern pathlib functions.
The heart of this code is the recursive biggest function. It scans all the files and directories in folder, saving the matching file names to the files list, and recursively searching any directories it finds. It then returns the path of the largest file that it finds, or None if no matching files are found.
from pathlib import Path
import shutil
def filesize(path):
return path.stat().st_size
def biggest(folder, pattern):
''' Find the biggest file in folder that matches pattern
Search recursively in all subdirectories
'''
files = []
for f in folder.iterdir():
if f.is_file():
if f.match(pattern):
files.append(f)
elif f.is_dir():
found = biggest(f, pattern)
if found:
files.append(found)
if files:
return max(files, key=filesize)
def copy_biggest(src, dest, pattern):
''' Find the biggest file in each folder in src that matches pattern
and copy it to dest, using the folder's name as the new file name
'''
for path in src.iterdir():
if path.is_dir():
found = biggest(path, pattern)
if found:
newname = dest / path
print(path, ':', found, '->', newname)
shutil.copyfile(found, newname)
You can call it like this:
rootdir = r'C:\Users\X\Desktop\humps'
to_save = r'C:\Users\X\Desktop\new'
copy_biggest(Path(rootdir), Path(to_save), '*.jpg')
Note that the copied files will have the same name as the top-level folder in rootdir that they were found in, with no file extension. If you want to give them a .jpg extension, you can change
newname = dest / path
to
newname = (dest / path).with_suffix('.jpg')
The shutil module on older versions of Python 3 doesn't understand pathlib paths. But that's easy enough to remedy. In the copy_biggest function, replace
shutil.copyfile(found, newname)
with
shutil.copyfile(str(found), str(newname))

Python zip a sub folder and not the entire folder path

I have a program to zip all the contents in a folder. I did not write this code but I found it somewhere online and I am using it. I intend to zip a folder for example say, C:/folder1/folder2/folder3/ . I want to zip folder3 and all its contents in a file say folder3.zip. With the below code, once i zip it, the contents of folder3.zip wil be folder1/folder2/folder3/and files. I do not want the entire path to be zipped and i only want the subfolder im interested to zip (folder3 in this case). I tried some os.chdir etc, but no luck.
def makeArchive(fileList, archive):
"""
'fileList' is a list of file names - full path each name
'archive' is the file name for the archive with a full path
"""
try:
a = zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED)
for f in fileList:
print "archiving file %s" % (f)
a.write(f)
a.close()
return True
except: return False
def dirEntries(dir_name, subdir, *args):
# Creates a list of all files in the folder
'''Return a list of file names found in directory 'dir_name'
If 'subdir' is True, recursively access subdirectories under 'dir_name'.
Additional arguments, if any, are file extensions to match filenames. Matched
file names are added to the list.
If there are no additional arguments, all files found in the directory are
added to the list.
Example usage: fileList = dirEntries(r'H:\TEMP', False, 'txt', 'py')
Only files with 'txt' and 'py' extensions will be added to the list.
Example usage: fileList = dirEntries(r'H:\TEMP', True)
All files and all the files in subdirectories under H:\TEMP will be added
to the list. '''
fileList = []
for file in os.listdir(dir_name):
dirfile = os.path.join(dir_name, file)
if os.path.isfile(dirfile):
if not args:
fileList.append(dirfile)
else:
if os.path.splitext(dirfile)[1][1:] in args:
fileList.append(dirfile)
# recursively access file names in subdirectories
elif os.path.isdir(dirfile) and subdir:
print "Accessing directory:", dirfile
fileList.extend(dirEntries(dirfile, subdir, *args))
return fileList
You can call this by makeArchive(dirEntries(folder, True), zipname).
Any ideas as to how to solve this problem? I am uing windows OS annd python 25, i know in python 2.7 there is shutil make_archive which helps but since i am working on 2.5 i need another solution :-/

You'll have to give an arcname argument to ZipFile.write() that uses a relative path. Do this by giving the root path to remove to makeArchive():
def makeArchive(fileList, archive, root):
"""
'fileList' is a list of file names - full path each name
'archive' is the file name for the archive with a full path
"""
a = zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED)
for f in fileList:
print "archiving file %s" % (f)
a.write(f, os.path.relpath(f, root))
a.close()
and call this with:
makeArchive(dirEntries(folder, True), zipname, folder)
I've removed the blanket try:, except:; there is no use for that here and only serves to hide problems you want to know about.
The os.path.relpath() function returns a path relative to root, effectively removing that root path from the archive entry.
On python 2.5, the relpath function is not available; for this specific usecase the following replacement would work:
def relpath(filename, root):
return filename[len(root):].lstrip(os.path.sep).lstrip(os.path.altsep)
and use:
a.write(f, relpath(f, root))
Note that the above relpath() function only works for your specific case where filepath is guaranteed to start with root; on Windows the general case for relpath() is a lot more complex. You really want to upgrade to Python 2.6 or newer if at all possible.

ZipFile.write has an optional argument arcname. Use this to remove parts of the path.
You could change your method to be:
def makeArchive(fileList, archive, path_prefix=None):
"""
'fileList' is a list of file names - full path each name
'archive' is the file name for the archive with a full path
"""
try:
a = zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED)
for f in fileList:
print "archiving file %s" % (f)
if path_prefix is None:
a.write(f)
else:
a.write(f, f[len(path_prefix):] if f.startswith(path_prefix) else f)
a.close()
return True
except: return False
Martijn's approach using os.path is much more elegant, though.

Extract files from zip without keep the top-level folder with python zipfile

I'm using the current code to extract the files from a zip file while keeping the directory structure:
zip_file = zipfile.ZipFile('archive.zip', 'r')
zip_file.extractall('/dir/to/extract/files/')
zip_file.close()
Here is a structure for an example zip file:
/dir1/file.jpg
/dir1/file1.jpg
/dir1/file2.jpg
At the end I want this:
/dir/to/extract/file.jpg
/dir/to/extract/file1.jpg
/dir/to/extract/file2.jpg
But it should ignore only if the zip file has a top-level folder with all files inside it, so when I extract a zip with this structure:
/dir1/file.jpg
/dir1/file1.jpg
/dir1/file2.jpg
/dir2/file.txt
/file.mp3
It should stay like this:
/dir/to/extract/dir1/file.jpg
/dir/to/extract/dir1/file1.jpg
/dir/to/extract/dir1/file2.jpg
/dir/to/extract/dir2/file.txt
/dir/to/extract/file.mp3
Any ideas?

If I understand your question correctly, you want to strip any common prefix directories from the items in the zip before extracting them.
If so, then the following script should do what you want:
import sys, os
from zipfile import ZipFile
def get_members(zip):
parts = []
# get all the path prefixes
for name in zip.namelist():
# only check files (not directories)
if not name.endswith('/'):
# keep list of path elements (minus filename)
parts.append(name.split('/')[:-1])
# now find the common path prefix (if any)
prefix = os.path.commonprefix(parts)
if prefix:
# re-join the path elements
prefix = '/'.join(prefix) + '/'
# get the length of the common prefix
offset = len(prefix)
# now re-set the filenames
for zipinfo in zip.infolist():
name = zipinfo.filename
# only check files (not directories)
if len(name) > offset:
# remove the common prefix
zipinfo.filename = name[offset:]
yield zipinfo
args = sys.argv[1:]
if len(args):
zip = ZipFile(args[0])
path = args[1] if len(args) > 1 else '.'
zip.extractall(path, get_members(zip))

Read the entries returned by ZipFile.namelist() to see if they're in the same directory, and then open/read each entry and write it to a file opened with open().

This might be a problem with the zip archive itself. In a python prompt try this to see if the files are in the correct directories in the zip file itself.
import zipfile
zf = zipfile.ZipFile("my_file.zip",'r')
first_file = zf.filelist[0]
print file_list.filename
This should say something like "dir1"
repeat the steps above substituting and index of 1 into filelist like so first_file = zf.filelist[1] This time the output should look like 'dir1/file1.jpg' if this is not the case then the zip file does not contain directories and will be unzipped all to one single directory.

Based on the #ekhumoro's answer I come up with a simpler funciton to extract everything on the same level, it is not exactly what you are asking but I think can help someone.
def _basename_members(self, zip_file: ZipFile):
for zipinfo in zip_file.infolist():
zipinfo.filename = os.path.basename(zipinfo.filename)
yield zipinfo
from_zip="some.zip"
to_folder="some_destination/"
with ZipFile(file=from_zip, mode="r") as zip_file:
os.makedirs(to_folder, exist_ok=True)
zip_infos = self._basename_members(zip_file)
zip_file.extractall(path=to_folder, members=zip_infos)

Basically you need to do two things:
Identify the root directory in the zip.
Remove the root directory from the paths of other items in the zip.
The following should retain the overall structure of the zip while removing the root directory:
import typing, zipfile
def _is_root(info: zipfile.ZipInfo) -> bool:
if info.is_dir():
parts = info.filename.split("/")
# Handle directory names with and without trailing slashes.
if len(parts) == 1 or (len(parts) == 2 and parts[1] == ""):
return True
return False
def _members_without_root(archive: zipfile.ZipFile, root_filename: str) -> typing.Generator:
for info in archive.infolist():
parts = info.filename.split(root_filename)
if len(parts) > 1 and parts[1]:
# We join using the root filename, because there might be a subdirectory with the same name.
info.filename = root_filename.join(parts[1:])
yield info
with zipfile.ZipFile("archive.zip", mode="r") as archive:
# We will use the first directory with no more than one path segment as the root.
root = next(info for info in archive.infolist() if _is_root(info))
if root:
archive.extractall(path="/dir/to/extract/", members=_members_without_root(archive, root.filename))
else:
print("No root directory found in zip.")

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Mapping articles from folder into a list - python

Related

How to use os.system to convert all files in a folder at once using external python script

Need a python script which outputs a list of files with the same content in a single directory

How to get the files with the biggest size in the folders, change their name and save to a different folder

Python zip a sub folder and not the entire folder path

Extract files from zip without keep the top-level folder with python zipfile

Categories

Resources