Getting file name while reading files from local system using pyspark

Additional update:
I tried the same code on files stored in HDFS and it works there, but when I run it against my local file system I get an error. Caused by: java.io.FileNotFoundException: File file:/root/cd/parsed_cd_5.xml does not exist
Original question and initial update
I am using ElementTree to parse XML files. The code ran like a charm in plain Python, but when I try to run the same code with Spark I get the error below.
Error:
File "/root/sparkCD.py", line 82, in
    for filename in glob.glob(os.path.join(path, '*.xml')):
File "/usr/lib64/python2.6/posixpath.py", line 67, in join
    elif path == '' or path.endswith('/'):
From the error it is clear that the issue is with "for filename in glob.glob(os.path.join(path, '*.xml'))", but I don't know how to achieve the same thing in PySpark.
Since I can't share my full code, I will only share the snippet where I get the error, alongside the Python code where I do not get it.
Python:
path = '/root/cd'
for filename in glob.glob(os.path.join(path, '*.xml')):
    tree = ET.parse(filename)
    doc = tree.getroot()
Pyspark:
path = sc.textFile("file:///root/cd/")
for filename in glob.glob(os.path.join(path, '*.xml')):
    tree = ET.parse(filename)
    doc = tree.getroot()
How can I resolve this issue? All I want is the name of the file that I am currently processing, which is in the cd directory on my local system, using PySpark.
Forgive me if this sounds stupid to you.
Update:
I tried the suggestion given below, but I am not getting the file name.
Below is my code:
filenme = sc.wholeTextFiles("file:///root/cd/")
nameoffile = filenme.map(lambda (name, text): name.split("/").takeRight(1)(0)).take(0)
print (nameoffile)
The result I am getting is:
PythonRDD[22] at RDD at PythonRDD.scala:43
Update:
I have written the code below instead of wholeTextFiles, but I am getting the same error. I also want to point out that, according to my question, I want to get the name of my file, so textFile will not help me with that. I tried running the code you suggested, but I am getting the same result.
path = sc.textFile("file:///root/cd/")
print (path)

If the input directory contains many small files, wholeTextFiles would help.
>>> pairRDD = sc.wholeTextFiles('<path>')
>>> pairRDD.map(lambda x: x[0]).collect()  # list all file names
Each record in pairRDD has the absolute file path as its key and the entire file content as its value.
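A minimal sketch of how those (path, content) pairs could be turned into a file name plus a parsed XML root, assuming Python 3. The helper names file_name_from_uri and parse_pair are hypothetical, and the Spark call is left as a comment because it needs a running SparkContext (sc); the helpers themselves are plain Python.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def file_name_from_uri(uri):
    """Return the last path component of a URI such as 'file:/root/cd/a.xml'."""
    return posixpath.basename(urlparse(uri).path)


def parse_pair(pair):
    """Turn one (uri, content) record into (file name, root tag)."""
    uri, content = pair
    root = ET.fromstring(content)  # content is the whole file as one string
    return file_name_from_uri(uri), root.tag


# Assumed Spark usage (not executed here; needs a SparkContext named sc):
#   results = sc.wholeTextFiles("file:///root/cd/").map(parse_pair).collect()

# Local check with a made-up record shaped like a wholeTextFiles entry.
sample = ("file:/root/cd/parsed_cd_5.xml", "<catalog><cd/></catalog>")
result = parse_pair(sample)
print(result)
```

Because the file content arrives as a string, ET.fromstring takes the place of ET.parse here. If the job runs on more than one machine, a file:// path generally has to exist at the same location on every worker, which may be related to the FileNotFoundException mentioned in the question.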

Not a full solution, but this appears to be a clear problem with your code.
In Python you have:
path = '/root/cd'
Here path holds the location that you are interested in.
In PySpark, however, you do this:
path = sc.textFile("file:///root/cd/")
Now path is an RDD of the text lines in the files at that location, not the location itself.
If you pass that to os.path.join and glob.glob, it makes sense that they try to do something strange (and thus fail).
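For comparison, a small sketch of the plain-Python loop with path kept as a string. A temporary directory stands in for '/root/cd', and the file names and XML content are made up for the demonstration.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in for '/root/cd': a temporary directory with two small XML files.
path = tempfile.mkdtemp()
for name in ("a.xml", "b.xml"):
    with open(os.path.join(path, name), "w") as handle:
        handle.write("<cd><title>%s</title></cd>" % name)

# path is still a plain string, so os.path.join and glob behave as expected.
parsed = {}
for filename in sorted(glob.glob(os.path.join(path, "*.xml"))):
    doc = ET.parse(filename).getroot()
    parsed[os.path.basename(filename)] = doc.find("title").text

print(parsed)
```

The file name is available inside the loop simply because glob yields it; this runs only on the machine that executes the script, not on Spark workers.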

Related

Modify all files in specified directory (including subfolders) and saving them in new directory while preserving folder structure (Python)

(I'm new to Python, so please excuse the probably trivial question. I tried my best looking for similar issues but surprisingly couldn't find anyone with the same question.)
I'm trying to build a simple static site generator in Python. The script should take all .txt files in a specific directory (including subfolders), paste the content of each into a template .html file, and then save all the newly generated .html files into a new directory while recreating the folder structure of the original directory.
So far I have the code that does the conversion for a single file, but I'm unsure how to do it for multiple files in a directory.
with open('template/page.html', 'r') as template:
    templatedata = template.read()
with open('content/content.txt', 'r') as content:
    contentdata = content.read()
pagedata = templatedata.replace('!PlaceholderContent!', contentdata)
with open('www/content.html', 'w') as output:
    output.write(pagedata)

To manipulate files and directories, you will need some system functionality from the built-in module os.
import os
The functionality in the os module includes:
Listing the content of a directory:
path_to_template_dir = 'template/'
template_files = os.listdir(path_to_template_dir)
print(template_files)
# Outputs : ['page.html']
Creating a directory (if it does not already exist):
path_to_output_dir = 'www/'
try:
    os.mkdir(path_to_output_dir)
except FileExistsError as e:
    print('Directory exists:', path_to_output_dir)
Since you know the names of the directories you want to use, these two functions give you the names of the files you want to read and generate. You can then concatenate each file name to the name of its directory to build the final file path as a string, which you can open() for reading and/or writing.
It's hard to give a perfect code example for your question, since the logic of how you want to manipulate each template and content file is missing, but here is an example of writing a file inside the newly created directory:
path_to_output_file = path_to_output_dir + 'content.html'
with open(path_to_output_file, 'w') as output:
    output.write('Content')
And an example for reading all the template files inside the template/ directory and then printing them to the screen.
for template_file in template_files:
    path_to_template_file = path_to_template_dir + template_file
    with open(path_to_template_file, 'r') as template:
        print(template.read())
In the end, manipulating files is all about creating the path string you want to read from or write to, and then accessing it.
Any other functionality you might need (for example, checking whether a path is a file with os.path.isfile() or a directory with os.path.isdir()) can be found in the os module.
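The answer above lists a single directory; for subfolders, one option is os.walk. The following is a sketch under assumptions, not the asker's actual generator: the function name build_site and the demonstration layout are made up, and only the '!PlaceholderContent!' marker comes from the question.

```python
import os
import tempfile

PLACEHOLDER = "!PlaceholderContent!"


def build_site(content_dir, template_path, output_dir):
    """Render every .txt file under content_dir into a mirrored .html tree."""
    with open(template_path, "r") as template:
        template_data = template.read()

    written = []
    for current_dir, _subdirs, file_names in os.walk(content_dir):
        # Folder path relative to the content root, e.g. 'blog' or '.'.
        relative_dir = os.path.relpath(current_dir, content_dir)
        target_dir = os.path.normpath(os.path.join(output_dir, relative_dir))
        os.makedirs(target_dir, exist_ok=True)

        for file_name in file_names:
            if not file_name.endswith(".txt"):
                continue
            with open(os.path.join(current_dir, file_name), "r") as content:
                page = template_data.replace(PLACEHOLDER, content.read())
            target = os.path.join(target_dir, file_name[:-4] + ".html")
            with open(target, "w") as output:
                output.write(page)
            written.append(os.path.relpath(target, output_dir))
    return sorted(written)


# Small demonstration in a temporary directory (layout is made up).
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "content", "blog"))
os.makedirs(os.path.join(base, "template"))
with open(os.path.join(base, "template", "page.html"), "w") as f:
    f.write("<body>!PlaceholderContent!</body>")
with open(os.path.join(base, "content", "index.txt"), "w") as f:
    f.write("Home")
with open(os.path.join(base, "content", "blog", "post.txt"), "w") as f:
    f.write("First post")

pages = build_site(
    os.path.join(base, "content"),
    os.path.join(base, "template", "page.html"),
    os.path.join(base, "www"),
)
print(pages)
```

os.makedirs with exist_ok=True creates each mirrored folder without needing a try/except, and os.path.join avoids depending on trailing slashes in the directory names.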

How to iterate through multiple excel files using python

I am trying to develop a python script that will iterate through several Excel .xlsx files, search each file for a set of values and save them to a new .xlsx template.
The issue I'm having is when I'm trying to get a proper list of files in the folder I'm looking at. I'm saving these filenames in a list variable 'fileList' to manage iteration.
When I run the code os.chdir(sourcepath), I'm constantly getting a FileNotFoundError: [WinError 2] The system cannot find the file specified: C:\\Users\\username\\PycharmProjects\\projectName\\venv\\Site List\\siteListfolder
I think this has to do with the '\\' that is displaying in the error, but when I run a print(sourcepath) in this code, the path is properly displayed, with just one '\' between each subdirectory instead of two.
I need to be able to get the list of files in the siteListfolder, and be able to iterate through them using this kind of logic:
priCLLI = sys.argv[1]
secCLLI = sys.argv[2]
sourcepath = os.path.join(homepath, 'Site List', f'{priCLLI}_{secCLLI}')
siteListfolder = os.listdir(sourcepath)
for file in siteListfolder:
    for row in file:
        <script does its work>
'siteListfolder = os.listdir(sourcepath)' is generating the error
Thanks to all in advance for supporting this kind of forum.

import os
directory = 'your/path/directory'  # folder that holds the workbooks
Source_Workbook = []
for filename in os.listdir(directory):
    # keep only Excel workbooks
    if filename.endswith(".xlsx"):
        Source_Workbook.append(filename)
print(Source_Workbook)
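A variation on the same idea, sketched with pathlib so that the existence check and the full paths come together. The function name list_workbooks and the demonstration file names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_workbooks(directory):
    """Return full paths of the .xlsx files directly inside directory."""
    folder = Path(directory)
    if not folder.is_dir():
        # os.listdir would also fail here; this message names the bad path.
        raise FileNotFoundError("Not an existing directory: %s" % folder)
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == ".xlsx")


# Demonstration with empty placeholder files (names are made up).
demo = Path(tempfile.mkdtemp())
for name in ("site_a.xlsx", "site_b.xlsx", "notes.txt"):
    (demo / name).touch()

workbooks = list_workbooks(demo)
print([p.name for p in workbooks])
```

Checking is_dir() first makes it easier to see whether the folder built from the command-line arguments really exists. Note that iterating over a file name (for row in file) only yields its characters; reading rows would require opening each workbook with a spreadsheet library such as openpyxl.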

File based strings/variables to set file path etc in python operation

I am trying to create part of a program that will take the values found in two CFG files and use them to determine what file type to search for as well as what folder location to use. The code I found online sort of suits my needs; however, I would like to avoid a hard-coded file path. Here is the code I have modified so far:
import glob
location = open("config.cfg", encoding = 'cp1252')
location = location.read()
filetype = open("filetype.cfg", encoding = 'cp1252')
filetype = filetype.read()
fileset = [file for file in glob.glob(location + filetype, recursive=True)]
print(location)
print(filetype)
for file in fileset:
    print(file)
The config.cfg contains one line, which is the file path to a folder with 3 sample JPG files in it.
C:/test
The filetype.cfg contains one line as well, which is the file type to search for
"**/*.jpg"
I've gotten to the point where this code throws no errors, but it doesn't work as intended either: it seems to read the config files properly, but it doesn't list the files in the folder. config.cfg contains the folder path, i.e. C:/test, while filetype.cfg contains "**/*.jpg", which is the type of file I would like to search for. I found the original code here: https://www.techbeamers.com/python-list-all-files-directory/ (look under the 'glob' method).
The original (fully working) code from the link above:
import glob
location = 'c:/test/temp/'
fileset = [file for file in glob.glob(location + "**/*.py", recursive=True)]
for file in fileset:
    print(file)
Using Python 3.8 64bit on Windows 10.

Moved from an edit to the question by the OP to an answer.
Remove the quotes around "**/*.jpg" in the filetype.cfg file:
**/*.jpg
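The quotes matter because read() returns the file's text literally, quotes and trailing newline included. As an alternative to editing the file, a sketch that cleans the values after reading them; the helper name read_setting and the temporary demonstration layout are assumptions made for this example.

```python
import glob
import os
import tempfile


def read_setting(path):
    """Read a one-line config file, dropping whitespace and wrapping quotes."""
    with open(path, encoding="cp1252") as handle:
        return handle.read().strip().strip('"\'')


# Demonstration layout in a temporary directory (all names are made up).
base = tempfile.mkdtemp()
images = os.path.join(base, "test")
os.makedirs(os.path.join(images, "sub"))
for rel in ("a.jpg", os.path.join("sub", "b.jpg"), "c.txt"):
    open(os.path.join(images, rel), "w").close()

with open(os.path.join(base, "config.cfg"), "w", encoding="cp1252") as f:
    f.write(images + "\n")
with open(os.path.join(base, "filetype.cfg"), "w", encoding="cp1252") as f:
    f.write('"**/*.jpg"\n')

location = read_setting(os.path.join(base, "config.cfg"))
filetype = read_setting(os.path.join(base, "filetype.cfg"))

# os.path.join supplies the separator, so the folder needs no trailing slash.
fileset = sorted(glob.glob(os.path.join(location, filetype), recursive=True))
print([os.path.relpath(p, location) for p in fileset])
```

Joining with os.path.join instead of string concatenation also keeps the '**' pattern as its own path component, which is what lets the recursive search descend into subfolders.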

How to fix "No such file or directory" error with csv creation in Python

I'm trying to make a new .csv file, but I'm getting a "No such file or directory" error in the with open(...) portion of the code.
I modified the with open(...) portion of the code to exclude the directory, substituting a plain string name, and it worked just fine. The document was created alongside all my PyCharm scratches on the C: drive.
I believe it's worth noting that I'm running python on my C: Drive while the directory giving me issues exists on the D: Drive. Not sure if that actually makes a difference, but i
path = r"D:\Folder_Location\\"
plpath = pathlib.PurePath(path)
files = []
csv_filename = r"D:\Folder_Location\\"+str(plpath.name)+".csv"
#Create New CSV
with open(csv_filename, mode='w',newline='') as c:
    writer = csv.writer(c)
    writer.writerow(['Date','Name'])
I expected the code to create a new .csv file that would then be used by the rest of the script in the specific folder location, but instead I got the following error:
File "C:/Users/USER/.PyCharm2018.2/config/scratches/file.py", line 14, in <module>
    with open(csv_filename, mode='w',newline='') as c:
FileNotFoundError: [Errno 2] No such file or directory: '[INTENDED FILE NAME]'
Process finished with exit code 1
The error code correctly builds the file name, but then says that it can't find the location, leading me to believe, again, that it's not the code itself but an issue with the separate drives (speculating). Also, line 14 is where the with open(...) starts.
EDIT: I tested a theory, and moved the folder to the C: drive, updated the path with just a copy and paste from the new location (still using the \ at the end of the file path in Python), and it worked. The new .csv file is now there. So why would the Drive make a difference? Permission issue for Python?

A raw string cannot end with a single backslash '\', so what you use in your code, path = r"D:\Folder_Location\\", is valid, but you don't actually need any backslashes at the end of your path.
I ran some tests similar to yours and everything went well; I only got the same error when I used a non-existent directory.
This is what I got:
FileNotFoundError: [Errno 2] No such file or directory: 'E:\\python\\myProgects\\abc\\\\sample3.txt'
So my bet is that the path assigned in path = r"D:\Folder_Location\\" does not exist, or that it refers to a file rather than a folder.
To make sure, just run this:
import os
path = r"D:\Folder_Location\\"
print(os.path.isdir(path))  # prints True if the folder exists
A better approach:
file_name = str(plpath.name) + ".csv"
path = r"D:\Folder_Location"
csv_filename = os.path.join(path, file_name)
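The same idea can be sketched with pathlib alone, combining the directory check and the path building. The function name create_csv_in is hypothetical, and a temporary folder stands in for D:\Folder_Location so the example runs anywhere.

```python
import csv
import tempfile
from pathlib import Path


def create_csv_in(folder):
    """Create '<folder name>.csv' inside folder and write a header row."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError("Expected an existing folder: %s" % folder)
    csv_path = folder / (folder.name + ".csv")
    with open(csv_path, mode="w", newline="") as handle:
        csv.writer(handle).writerow(["Date", "Name"])
    return csv_path


# Demonstration with a temporary stand-in for D:\Folder_Location.
demo_folder = Path(tempfile.mkdtemp()) / "Folder_Location"
demo_folder.mkdir()
created = create_csv_in(demo_folder)
print(created.name)
```

The / operator inserts the separator itself, so no trailing backslash is needed in the folder string, and the explicit is_dir() check reports a missing folder before open() is reached.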

Python - Sort files in directory and use latest file in code

Long time reader, first time poster. I am very new to python and I will try to ask my question properly.
I have posted a snippet of the .py code I am using below. I am attempting to get the latest modified file in the current directory to be listed and then pass it along later in the code.
This is the error I get in my log file when I attempt to run the file:
WindowsError: [Error 2] The system cannot find the file specified: '05-30-2012_1500.wav'
So it appears that it is in fact pulling a file from the directory, but that's about it. And actually, the file that it pulls up is not the most recently modified file in that directory.
latest_page = max(os.listdir("/"), key=os.path.getmtime)
cause = channel.FilePlayer.play(latest_page)

os.listdir returns the names of files, not full paths to those files. Generally, when you use os.listdir(SOME_DIR), you then need os.path.join(SOME_DIR, fname) to get a path you can use to work with the file.
This might work for you:
files = [os.path.join("/", fname) for fname in os.listdir("/")]
latest = max(files, key=os.path.getmtime)
cause = channel.FilePlayer.play(latest)
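A slightly more defensive sketch of the same approach, which also skips subdirectories and handles an empty folder. The function name latest_file is hypothetical, and the demonstration files get explicit modification times so the result does not depend on timing.

```python
import os
import tempfile


def latest_file(directory):
    """Return the full path of the most recently modified regular file."""
    candidates = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
    ]
    files = [path for path in candidates if os.path.isfile(path)]
    if not files:
        return None
    return max(files, key=os.path.getmtime)


# Demonstration: two files with explicitly set modification times.
demo = tempfile.mkdtemp()
old_path = os.path.join(demo, "05-29-2012_0900.wav")
new_path = os.path.join(demo, "05-30-2012_1500.wav")
for path, mtime in ((old_path, 1_000_000_000), (new_path, 1_100_000_000)):
    open(path, "w").close()
    os.utime(path, (mtime, mtime))

newest = latest_file(demo)
print(os.path.basename(newest))
```

With bare names from os.listdir, os.path.getmtime resolves each name against the current working directory rather than the listed one, which would explain both the "cannot find the file" error and the unexpected choice of file in the question.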
