Read xml files directly from a zip file using Python

I have the following zip file structure:
some_file.zip/folder/folder/files.xml
So there are many XML files within a subfolder of the zip file.
So far I have managed to unpack the zip file using the following code:
import os.path
import zipfile
with zipfile.ZipFile('some_file.zip') as zf:
    for member in zf.infolist():
        # Path traversal defense copied from
        # http://hg.python.org/cpython/file/tip/Lib/http/server.py#l789
        words = member.filename.split('/')
        path = "output"
        for word in words[:-1]:
            drive, word = os.path.splitdrive(word)
            head, word = os.path.split(word)
            if word in (os.curdir, os.pardir, ''): continue
            path = os.path.join(path, word)
        zf.extract(member, path)
However, I do not need to extract the files; I want to read them directly from the zip file. I would either read and process each file inside a for loop, or store each file in some kind of Python data structure. Is this possible?

As Robin Davis has written, zf.open() will do the trick. Here is a small example:
import zipfile
with zipfile.ZipFile('some_file.zip', 'r') as zf:
    for name in zf.namelist():
        # skip directory entries
        if name.endswith('/'): continue
        if 'folder2/' in name:
            with zf.open(name) as f:
                # here you do your magic with f: parsing, etc.
                # this will print out the file contents (as bytes)
                print(f.read())
As the OP requested in a comment, only files from "folder2" are processed.

zf.open() returns a file-like object without extracting the member to disk.
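
Since the question is about XML, here is a minimal sketch of passing that file-like object straight to a parser. It assumes the standard-library xml.etree.ElementTree parser is acceptable; the helper name read_xml_roots, the folder_prefix parameter, and the sample archive built in memory are hypothetical and only there to make the example runnable.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_xml_roots(zip_source, folder_prefix=""):
    """Return {member name: root element} for every .xml member under folder_prefix."""
    roots = {}
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            # Skip directory entries and anything that is not an XML file.
            if name.endswith("/") or not name.lower().endswith(".xml"):
                continue
            if not name.startswith(folder_prefix):
                continue
            # zf.open() yields a binary file-like object; ET.parse reads from it directly.
            with zf.open(name) as f:
                roots[name] = ET.parse(f).getroot()
    return roots


# Build a small archive in memory so the sketch runs without any files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("folder/folder/a.xml", "<item id='1'>first</item>")
    zf.writestr("folder/folder/b.xml", "<item id='2'>second</item>")
    zf.writestr("folder/readme.txt", "not xml")

roots = read_xml_roots(buffer, folder_prefix="folder/folder/")
for name, root in roots.items():
    print(name, root.tag, root.attrib["id"], root.text)
```

With a real archive you would pass the path, for example read_xml_roots('some_file.zip', 'folder/folder/'). Keeping the parsed roots in a dictionary covers the "save each file in some kind of data structure" part of the question; for very large archives it may be better to process each file inside the loop instead.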

Related

Copy and rename pictures based on xml nodes

I'm trying to copy all pictures from one directory (including subdirectories) to another target directory. Whenever the exact picture name is found in one of the XML files, the tool should grab all the information (attributes in the parent and child nodes), create subdirectories based on that node information, and rename the picture file.
The part that extracts all the information from the nodes is already done.
from bs4 import BeautifulSoup as bs
path_xml = r"path\file.xml"
content = []
with open(path_xml, "r") as file:
    content = file.readlines()
content = "".join(content)
def get_filename(_content):
    bs_content = bs(_content, "html.parser")
    # some code
    picture_path = rf'{pm_1}{pm_2}\{pm_3}\{pm_4}\{pm_5}_{pm_6}_{pm_7}\{pm_8}\{pm_9}.jpg'
get_filename(content)
So in the end I get a string value with the directory path and the file name I want.
Now I am struggling to open all XML files in one directory instead of just one file. I tried this:
import os
dir_xml = r"path"
res = []
for path in os.listdir(dir_xml):
    if os.path.isfile(os.path.join(dir_xml, path)):
        res.append(path)
with open(res, "r") as file:
    content = file.readlines()
but it gives me this error: TypeError: expected str, bytes or os.PathLike object, not list
How can I read through all XML files instead of just one? I have hundreds of XML files, so that will take a while :D
And another question: how can I create directories based on a string?
Let's say the value of picture_path is AB\C\D\E_F_G\H\I.jpg
I would need another directory path as the destination for the created folders, and a function that creates folders based on that string. How can I do that?

To read all XML files in a directory, you can modify your code as follows:
import os
dir_xml = r"path"
for path in os.listdir(dir_xml):
    if path.endswith(".xml"):
        with open(os.path.join(dir_xml, path), "r") as file:
            content = file.readlines()
        content = "".join(content)
        get_filename(content)
This code uses the os.listdir() function to get a list of all files in the directory specified by dir_xml. It then uses a for loop to iterate over the list of files, checking if each file ends with the .xml extension. If it does, it opens the file, reads its content, and passes it to the get_filename function.
To create directories based on a string, you can use the os.makedirs function. For example:
import os
picture_path = r'AB\C\D\E_F_G\H\I.jpg'
dest_path = r'path_to_destination'
os.makedirs(os.path.join(dest_path, os.path.dirname(picture_path)), exist_ok=True)
In this code, os.path.join is used to combine the dest_path and the directory portion of picture_path into a full path. os.path.dirname is used to extract the directory portion of picture_path. The os.makedirs function is then used to create the directories specified by the path, and the exist_ok argument is set to True to allow the function to succeed even if the directories already exist.
Finally, you can use the shutil library to copy the picture file to the destination and rename it, like this:
import shutil
src_file = os.path.join(src_path, picture_path)
dst_file = os.path.join(dest_path, picture_path)
shutil.copy(src_file, dst_file)
Here, src_file is the full path to the source picture file and dst_file is the full path to the destination. The shutil.copy function is then used to copy the file from the source to the destination.
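
The two steps above can be combined into one helper. The following is only a sketch under stated assumptions: the function name copy_with_new_path, the temporary directory, and the fake image file are hypothetical and exist solely to make the example runnable. It also uses pathlib.PureWindowsPath to split the backslash-separated string, which is a substitution for os.path.dirname so that the same string is handled the same way on non-Windows systems.

```python
import os
import shutil
import tempfile
from pathlib import PureWindowsPath


def copy_with_new_path(src_file, dest_root, picture_path):
    """Copy src_file to dest_root/picture_path, creating missing folders."""
    # PureWindowsPath splits on backslashes on every platform, so the
    # Windows-style string from the question also works on Linux or macOS.
    parts = PureWindowsPath(picture_path).parts
    dst_file = os.path.join(dest_root, *parts)
    os.makedirs(os.path.dirname(dst_file), exist_ok=True)
    shutil.copy(src_file, dst_file)
    return dst_file


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical source picture, created here only for the demonstration.
    src_file = os.path.join(tmp, "original.jpg")
    with open(src_file, "wb") as f:
        f.write(b"fake image bytes")
    dest_root = os.path.join(tmp, "sorted")
    dst_file = copy_with_new_path(src_file, dest_root, r"AB\C\D\E_F_G\H\I.jpg")
    copied_ok = os.path.isfile(dst_file)
    relative = os.path.relpath(dst_file, dest_root).split(os.sep)
    print(relative)
```

Because the destination file name comes from picture_path, the copy is renamed in the same step. If file metadata such as timestamps should be preserved, shutil.copy2 can be used in place of shutil.copy.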

You can use os.walk() for recursive search of files:
import os
dir_xml = r"path"
for root, dirs, files in os.walk(dir_xml): #topdown=False
    for name in files:
        if name.endswith(".xml"):
            print(f"file path: {root}\n XML-Files: {name}")
            # join with root, because name is only the bare file name
            with open(os.path.join(root, name), 'r') as file:
                content = file.readlines()

How to read pairwise csv and json files having same names inside a folder using python?

Consider a folder containing files like these:
abc.csv
abc.json
bcd.csv
bcd.json
efg.csv
efg.json
and so on, i.e. pairs of CSV and JSON files that share the same name. For each pair I have to read both files, perform the same operation, and then proceed to the next pair. How do I go about this?
What I have in mind, as pseudocode, is:
for files in folder_name:
    df_csv=pd.read_csv('abc.csv')
    df_json=pd.read_json('abc.json')
    # some script to execute
    # now read the next pair and repeat for all files

Did you think of something like this?
import os
# collects filenames in the folder
filelist = os.listdir()
# iterates through filenames in the folder
for file in filelist:
    # pairs the .csv files with the .json files
    if file.endswith(".csv"):
        with open(file) as csv_file:
            pre, ext = os.path.splitext(file)
            secondfile = pre + ".json"
            with open(secondfile) as json_file:
                # do something
                ...

You can use the glob module to extract the file names matching a pattern:
import glob
import os.path
for csvfile in glob.iglob('*.csv'):
    jsonfile = csvfile[:-3] + 'json'
    # optionally check that the file exists
    if not os.path.exists(jsonfile):
        # show error message
        ...
        continue
    # do smth. with csvfile
    # do smth. else with jsonfile
    # and proceed to next pair

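If every base name in the directory has both a .csv and a .json file, you could do the following:
import os
import pandas as pd
folder = './path/to/dir'
# collect the unique base names, e.g. 'abc' from 'abc.csv' and 'abc.json'
for f_name in sorted({os.path.splitext(x)[0] for x in os.listdir(folder)}):
    df_csv = pd.read_csv(os.path.join(folder, f"{f_name}.csv"))
    df_json = pd.read_json(os.path.join(folder, f"{f_name}.json"))
    # execute the rest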
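
As a rough sketch of the same pairing idea with pathlib, the helper below yields only those CSV files that actually have a same-named JSON partner. The name iter_pairs, the temporary folder, and the sample file contents are hypothetical; the standard-library csv and json modules stand in for pandas here only so the example runs without extra dependencies.

```python
import csv
import json
import tempfile
from pathlib import Path


def iter_pairs(folder):
    """Yield (csv_path, json_path) for every CSV that has a same-named JSON file."""
    for csv_path in sorted(Path(folder).glob("*.csv")):
        json_path = csv_path.with_suffix(".json")
        if json_path.exists():
            yield csv_path, json_path


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical sample data, created only for the demonstration.
    for stem in ("abc", "bcd"):
        (folder / f"{stem}.csv").write_text("id,value\n1,10\n", encoding="utf-8")
        (folder / f"{stem}.json").write_text(json.dumps({"name": stem}), encoding="utf-8")
    # A CSV without a partner is skipped by iter_pairs.
    (folder / "efg.csv").write_text("id,value\n", encoding="utf-8")

    results = []
    for csv_path, json_path in iter_pairs(folder):
        with csv_path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        meta = json.loads(json_path.read_text(encoding="utf-8"))
        results.append((csv_path.stem, len(rows), meta["name"]))

print(results)
```

Inside the loop, pd.read_csv(csv_path) and pd.read_json(json_path) could replace the standard-library parsing if DataFrames are needed.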

Get only the txt file you want from the folder containing the txt file - Python

I have a folder with several .txt files. The names of the files are:
my_file1.txt
my_file2.txt
my_file3.txt
my_file4.txt
Only the trailing number differs.
import pickle
my_list = []
with open("/Users/users_a/Desktop/website-basic/sub_domain/sub_domain01.txt", "rb") as f1, \
     open("/Users/users_a/Desktop/website-basic/sub_domain/sub_domain02.txt", "rb") as f2, \
     open("/Users/users_a/Desktop/website-basic/sub_domain/sub_domain03.txt", "rb") as f3:
    my_list.append(pickle.load(f1))
    my_list.append(pickle.load(f2))
    my_list.append(pickle.load(f3))
print(my_list)
This way I load each file and append it to my_list to build a list to work with. As the number of files grows, the code becomes too long and cumbersome.
Is there an easier and more Pythonic way to load only the desired .txt files?

You can use os.listdir():
import os
import pickle
my_list = []
path = "/Users/users_a/Desktop/website-basic/sub_domain"
for file in os.listdir(path):
    if file.endswith(".txt"):
        # pickle data is binary, so open the file in "rb" mode
        with open(f"{path}/{file}", "rb") as f:
            my_list.append(pickle.load(f))
Here file is the name of a file in path.
I suggest using os.path.join() instead of hard-coding the file paths.
If your folder contains only the files you want to load, you can drop the extension check:
for file in os.listdir(path):
    with open(f"{path}/{file}", "rb") as f:
        my_list.append(pickle.load(f))
Edit for my_file[number].txt
If you only want files in the form of my_file[number].txt, use:
import os
import re
import pickle
my_list = []
path = "/Users/users_a/Desktop/website-basic/sub_domain"
for file in os.listdir(path):
    # escape the dot and anchor the end so only names like my_file12.txt match
    if re.match(r"my_file\d+\.txt$", file):
        with open(f"{path}/{file}", "rb") as f:
            my_list.append(pickle.load(f))
Online regex demo https://regex101.com/r/XJb2DF/1
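
One detail the loops above leave open is ordering: os.listdir() returns names in arbitrary order, and a plain alphabetical sort would put my_file10.txt before my_file2.txt. The sketch below sorts by the captured number instead. The helper name load_numbered_pickles, the temporary folder, and the sample pickles are hypothetical and only make the example runnable.

```python
import pickle
import re
import tempfile
from pathlib import Path

# The capture group holds the number so files can be sorted numerically.
PATTERN = re.compile(r"my_file(\d+)\.txt")


def load_numbered_pickles(folder):
    """Load every my_file<number>.txt pickle in folder, ordered by number."""
    matches = []
    for path in Path(folder).iterdir():
        m = PATTERN.fullmatch(path.name)
        if m:
            matches.append((int(m.group(1)), path))
    loaded = []
    for _, path in sorted(matches):
        # Pickle data is binary, so the file is opened in "rb" mode.
        with path.open("rb") as f:
            loaded.append(pickle.load(f))
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical sample files, written in a deliberately shuffled order.
    for number in (10, 2, 1):
        with (folder / f"my_file{number}.txt").open("wb") as f:
            pickle.dump({"n": number}, f)
    (folder / "notes.txt").write_text("ignored")
    my_list = load_numbered_pickles(folder)

print(my_list)
```

As with any use of pickle.load, this should only be run on files from a trusted source.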

How to open a file only using its extension?

I have a Python script which opens a specific text file located in a specific directory (the working directory) and performs some actions.
(Assume that if there is a text file in the directory then it will always be no more than one such .txt file)
with open('TextFileName.txt', 'r') as f:
    for line in f:
        # perform some string manipulation and calculations
        # write some results to a different text file
        with open('results.txt', 'a') as r:
            r.write(someResults)
My question is how I can have the script locate the text (.txt) file in the directory and open it without explicitly providing its name (i.e. without giving the 'TextFileName.txt'). So, no arguments for which text file to open would be required for this script to run.
Is there a way to achieve this in Python?

You could use os.listdir to get the files in the current directory, and filter them by their extension:
import os
txt_files = [f for f in os.listdir('.') if f.endswith('.txt')]
if len(txt_files) != 1:
    raise ValueError('should be only one txt file in the current directory')
filename = txt_files[0]

You can also use glob, which is arguably simpler than os.listdir:
import glob
# the wildcard matches every file name ending in .txt; glob returns them as a list
text_file = glob.glob('*.txt')
if len(text_file) != 1:
    raise ValueError('should be only one txt file in the current directory')
filename = text_file[0]
glob resolves a relative pattern against the current working directory.
You can change the working directory with:
import os
os.chdir(r'cur_working_directory')

Since Python 3.4 you can use the pathlib library. Its glob method makes it easy to filter by extension:
from pathlib import Path
path = Path(".") # current directory
extension = ".txt"
# the default of None avoids a StopIteration error when nothing matches
file_with_extension = next(path.glob(f"*{extension}"), None) # first matching file, or None
if file_with_extension:
    with open(file_with_extension):
        ...
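
The question assumes there is exactly one .txt file, so it may be worth making that assumption explicit. The following is a minimal sketch; the helper name find_single_file, the temporary folder, and the sample files are hypothetical and only serve to make the example runnable.

```python
import tempfile
from pathlib import Path


def find_single_file(folder, extension=".txt"):
    """Return the only file in folder with the given extension, or raise ValueError."""
    matches = sorted(Path(folder).glob(f"*{extension}"))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one {extension} file, found {len(matches)}"
        )
    return matches[0]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical directory content for the demonstration.
    (folder / "TextFileName.txt").write_text("line one\nline two\n", encoding="utf-8")
    (folder / "image.png").write_bytes(b"")
    txt_path = find_single_file(folder)
    found_name = txt_path.name
    lines = txt_path.read_text(encoding="utf-8").splitlines()

    # A second .txt file makes the choice ambiguous, so the helper refuses to guess.
    (folder / "results.txt").write_text("", encoding="utf-8")
    try:
        find_single_file(folder)
        ambiguous = False
    except ValueError:
        ambiguous = True

print(found_name, lines, ambiguous)
```

The second half of the demo shows a practical pitfall for the script in the question: if results.txt is written to the same directory, a later run finds two .txt files. Writing the results elsewhere or giving them a different extension avoids that.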

Reading files of the same typology in a folder

I need a little help to finish my program.
I have a folder with 20 files of the same kind, each containing strings with corresponding values.
Is there a way to create a function that opens all the files in this way:
file1 = [line.strip() for line in open("/Python34/elez/file1.txt", "r")]?
I hope I explained it well.
Thanks!

from os import listdir
from os.path import join, isfile
def contents(filepath):
    # return the whole file as one string
    with open(filepath) as f:
        return f.read()
directory = '/Python34/elez'
all_file_contents = [contents(join(directory, filename))
                     for filename in listdir(directory)
                     if isfile(join(directory, filename))]

Hi Gulliver, this is how I would do it:
import os
all_fields = []  # one list to keep the lines from all files
for file in os.listdir('./'):  # list every entry in the current directory
    if not os.path.isfile(file):  # skip subdirectories
        continue
    with open(file, 'r') as f:  # the with statement closes the file automatically
        fields = [line.strip() for line in f]  # read and strip every line
    all_fields.extend(fields)  # store in the big list
For more information about using the with statement to open and read files, please refer to the answer "Correct way to write to files?"
