Translate Scala code to rename and move a CSV file - Spark - PySpark - Python

I am using the Scala code below to rename a CSV file to a TXT file and move it. I need to translate this code to Python/PySpark but I am having problems (I'm not well versed in Python). I would highly appreciate your help. Thanks in advance!
//Prepare to rename file
import org.apache.hadoop.fs._
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)
//Create variables
val table_name = dbutils.widgets.get("table_name") // getting table name
val filePath = "dbfs:/mnt/datalake/" + table_name + "/" // path where original csv file name is located
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName // getting original csv file name
val newfilename = table_name + ".txt" // renaming and transforming csv into txt
val curatedfilePath = "dbfs:/mnt/datalake/" + newfilename // curated path + new file name
//Move to curated folder
dbutils.fs.mv(filePath + fileName, curatedfilePath)
Here is the Python code:
%python
#Create variables
table_name = dbutils.widgets.get("table_name") # getting table name
filePath = "dbfs:/mnt/datalake/" + table_name + "/" # path where original csv file name is located
newfilename = table_name + ".txt" # transforming csv into txt
curatedfilePath = "dbfs:/mnt/datalake/" + newfilename # curated path + new file name
#Save CSV file
df_curated.coalesce(1).replace("", None).write.mode("overwrite").save(filePath,format='csv', delimiter='|', header=True, nullValue=None)
# getting original csv file name
for f in filePath:
    if f[1].startswith("part-00000"):
        original_file_name = f[1]
#move to curated folder
dbutils.fs.mv(filePath + fileName, curatedfilePath)
I am having a problem with the "getting original file name" part. It throws the following error:
IndexError: string index out of range
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<command-3442953727364942> in <module>()
     11 # getting original csv file name
     12 for f in filePath:
---> 13     if f[1].startswith("part-00000"):
     14         original_file_name = f[1]
     15 
IndexError: string index out of range

In the Scala code, you're using hadoop.fs.globStatus to list the part files from the folder where you save the DataFrame. Note that your Python loop "for f in filePath:" iterates over the characters of the path string, so each f is a single character and f[1] raises the IndexError.
In Python you can do the same by accessing hadoop.fs via the JVM like this:
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
part_files = Path(filePath).getFileSystem(conf).globStatus(Path(filePath + "/part*"))
file_name = part_files[0].getPath().getName()
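Putting it together with your existing variables, a minimal sketch of the full translation might look like this (assuming df_curated has already been written to filePath, as in your notebook):
%python
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
# list the part files, like globStatus in the Scala version
part_files = Path(filePath).getFileSystem(conf).globStatus(Path(filePath + "part*"))
original_file_name = part_files[0].getPath().getName()
# move to curated folder, renaming csv to txt
dbutils.fs.mv(filePath + original_file_name, curatedfilePath)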

Related

Pyinstaller exe fails execution and is slow

I used PyInstaller to pack the following script into an exe file with pyinstaller -F.
For clarity: the script concatenates CSV files into one and exports the result to a new CSV file.
# import necessary libraries
import pandas as pd
import os
import glob
from datetime import datetime
#Set filename
file_name = 'daglig_LCR_RLI'
# in the folder
path = os.path.dirname(os.path.abspath(__file__))
# Delete CSV file
# first check whether file exists or not
# calling remove method to delete the csv file
# in remove method you need to pass file name and type
del_file = path+"\\" + file_name +'.csv'
## If file exists, delete it ##
if os.path.isfile(del_file):
    os.remove(del_file)
    print("File deleted")
else: ## Show an error ##
    print("File not found: " + del_file)
# use glob to get all the csv files
csv_files = glob.glob(os.path.join(path, "*.csv"))
df_list= list()
#format columns
dict_conv = {'line_item': lambda x: str(x),
             'column_item': lambda x: str(x)}
# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f, sep=";", converters=dict_conv, encoding='latin1') #test latin1
    df_list.append(df)
    #print the location and filename
    print('Location:', f)
    print('File Name:', f.split("\\")[-1])
#combine the data frames from the list
RLI_combined = pd.concat(df_list, axis=0)
#Write date to approval_text
now = datetime.now()
# dd/mm/YY
print_date = now.strftime("%d/%m/%Y")
RLI_combined.loc[:, 'approval_text'] = print_date
#replace value_text with n/a
RLI_combined.loc[:, 'value_text'] = "n/a"
#Sum columns
m = RLI_combined['column_item'].isin(['0030', '0050', '0080'])
RLI_combined_sum = RLI_combined[~m].copy()
RLI_combined_sum['amount'] = RLI_combined_sum.groupby(['report_name', 'line_item', 'column_item'])['amount'].transform('sum')
RLI_combined_sum = RLI_combined_sum.drop_duplicates(['report_name', 'line_item', 'column_item'])
RLI_combined = pd.concat([RLI_combined_sum, RLI_combined[m]])
#export to csv
RLI_combined.to_csv(path + "//" + file_name + '.csv', index=False, sep=";", encoding='latin1')
#Make log
# Create the directory
directory = "Log"
parent_dir = path
# Path
path_log = os.path.join(parent_dir, directory)
try:
    os.mkdir(path_log)
    print('Log folder dannet')       # Danish: "Log folder created"
except OSError as error:
    print('Log folder eksisterer')   # Danish: "Log folder exists"
#export to csv
log_name = now.strftime("%d-%m-%Y_%H-%M-%S")
print(log_name)
RLI_combined.to_csv(path + "//" + 'Log' +"//" + file_name+'_' + log_name + '.csv', index=False, sep=";", encoding='latin1')
Everything works as intended when I don't use PyInstaller. If I run the exe file, after 10 seconds of nothing it writes the following:
What am I doing wrong that causes the error? And could I improve performance so the exe file isn't so slow?
I hope you can point me in the right direction.
I believe I found the solution to the problem. I use Anaconda, and PyInstaller pulls in all the libraries installed on the machine.
So using a clean install of Python and only installing the necessary libraries might fix the problem.
The error seems to be a NumPy error, and the script isn't importing that library directly.
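A minimal sketch of that clean-environment approach, using the standard venv module (the script name daglig_LCR_RLI.py is hypothetical; adjust to your file):
python -m venv build_env
build_env\Scripts\activate
pip install pandas pyinstaller
pyinstaller -F daglig_LCR_RLI.py
Building from an environment that contains only pandas (plus its dependencies) and PyInstaller keeps the bundle small, which should shrink the exe and speed up its startup.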

Folder Compare using Python

I have to compare hundreds of files in two folders (directories). There is a way to derive the second file's name from the first file's name and vice versa. I was asked to develop a script so that we can do this task quickly. These were the requirements:
a) An HTML report showing the differences
b) A txt file showing the basic information, i.e., count, header, and trailer info.
I have written the following script using Python, but after processing 14 files there is no further movement.
#Take two folders as input and compare the same files in them using pandas and sqlite
#!/usr/bin/env python3
# Path: folder_compare.py
import os
import pandas as pd
import sqlite3
import logging
import difflib
import sys
#function to write the message sent to the txt file passed as an argument
def write_to_txt(file_name, message):
    #path to the file
    d_path = 'C:/Upgrade/File-Compare/Differences/' + os.path.basename(file_name)
    os.makedirs(d_path, exist_ok=True)
    file_path = d_path + '/' + file_name + '.txt'
    #Create the file if it does not exist
    if not os.path.exists(file_path):
        open(file_path, 'w').close()
    f = open(file_path, 'a')
    f.write(message)
    f.close()
def convert_windows_path_to_python(path):
    path = path.replace("\\", "/")
    return path
#get the folders as input from the user
fol1 = input("Enter the first folder path: ")
fol2 = input("Enter the second folder path: ")
folder1 = convert_windows_path_to_python(fol1)
folder2 = convert_windows_path_to_python(fol2)
#function to derive the second file name from the first file name
def get_file_name(file_name):
    #file_name = file_name.split('.')
    #file_name = file_name[0].replace('BZ1CV','BZ1DV') + '.' + file_name[1]
    file_name = file_name.replace('BZ1CV','BZ1DV')
    return file_name
#function to compare the two files and write the difference to a html using html.table
def compare_files(file1, file2):
    #read the two files
    f1 = pd.read_table(file1, encoding='unicode_escape', header=None)
    f2 = pd.read_table(file2, encoding='unicode_escape', header=None)
    #Get the filesize of the two files
    f1_size = os.path.getsize(file1)
    f2_size = os.path.getsize(file2)
    d_path = 'C:/Upgrade/File-Compare/Differences/' + os.path.basename(file1)
    os.makedirs(d_path, exist_ok=True)
    #if either file is larger than 10MB, compare the files using pandas concat and drop_duplicates
    if f1_size > 10485760 or f2_size > 10485760:
        #rows that appear in only one of the two files survive drop_duplicates(keep=False)
        difference = pd.concat([f1, f2]).drop_duplicates(keep=False)
        difference.to_html(d_path + '_diff.html')
    #otherwise compare the files using difflib.HtmlDiff
    else:
        first_file_lines = open(file1).readlines()
        second_file_lines = open(file2).readlines()
        diff = difflib.HtmlDiff().make_file(first_file_lines, second_file_lines, file1, file2, context=True, numlines=0)
        diff_report = open(d_path + '_diff.html', 'w')
        diff_report.writelines(diff)
        diff_report.close()
    logging.info('The files are compared successfully')
#Now start logging findings of the files
#Count the number of rows in the two data frames and log the rowcount of both the data frames in a log file with the name as the first file name and extension as .txt
#Loop through the files in the folder1 and compare them with the files in the folder2
for file in os.listdir(folder1):
    file1 = folder1 + '/' + file
    file2 = folder2 + '/' + get_file_name(file)
    #if the second file does not exist in folder2, log the error and continue
    if not os.path.isfile(file2):
        logging.error('File not found: ' + os.path.basename(file2))
        continue
    f1 = pd.read_table(file1, encoding='unicode_escape', header=None)
    f2 = pd.read_table(file2, encoding='unicode_escape', header=None)
    #Get the first row (header) of each data frame and write both headers to a text file named after the first file
    f1_header = f1.iloc[0]
    f2_header = f2.iloc[0]
    write_to_txt(os.path.basename(file1), 'The headers of the first file are: ' + str(f1_header) + '\n')
    write_to_txt(os.path.basename(file1), 'The headers of the second file are: ' + str(f2_header) + '\n')
    #Get the rowcount of each data frame and write both rowcounts to the same text file
    f1_rowcount = f1.shape[0]
    f2_rowcount = f2.shape[0]
    write_to_txt(os.path.basename(file1), 'The rowcount of the first file (including header and trailer rows) is: ' + str(f1_rowcount) + '\n')
    write_to_txt(os.path.basename(file1), 'The rowcount of the second file (including header and trailer rows) is: ' + str(f2_rowcount) + '\n')
    #Get the last row (trailer) of each data frame and write both trailers to the same text file
    f1_footer = f1.iloc[-1]
    f2_footer = f2.iloc[-1]
    write_to_txt(os.path.basename(file1), 'The trailer of the first file is: ' + str(f1_footer) + '\n')
    write_to_txt(os.path.basename(file1), 'The trailer of the second file is: ' + str(f2_footer) + '\n')
    compare_files(file1, file2)

Saving a file by appending a suffix to an existing filename

I would like to save a file by adding a suffix to the existing filename; by "extension" I do not mean changing .csv to .HTML.
What I mean is: if I have an existing file file1.csv,
I would like to save the other file as file1_processed.csv.
I tried doing this
data = pd.read_csv("file1.csv")
df = x
df.to_csv(os.path.basename(data) + '_' + 'processed' + '.csv')
however, it gives an error
TypeError: expected str, bytes or os.PathLike object, not DataFrame
filename = "file1"
data = pd.read_csv(filename + ".csv")
# ...
df.to_csv(filename + '_processed.csv')
data holds the contents of file1.csv, not the file name. You need to assign the file name to a variable if the name of the file is not always the same:
file_name = "file1.csv"
data = pd.read_csv(file_name )
df = x
name, extension = os.path.splitext(os.path.basename(file_name))
df.to_csv(name + '_' + 'processed' + extension)
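An equivalent sketch using pathlib instead of os.path, if you prefer it (same behavior; df is assumed to be your processed DataFrame):
from pathlib import Path

p = Path("file1.csv")
out = p.with_name(p.stem + "_processed" + p.suffix)  # file1_processed.csv
df.to_csv(out)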

Python looping over folders and subfolders gets the CSV file names, but read_csv returns file not found

I am trying to loop over folders and subfolders to access and read CSV files before transforming them into JSON. Here is the code I am working on:
cursor = conn.cursor()
try:
    # Specify the folder containing needed files
    folderPath = 'C:\\Users\\myUser\\Desktop\\toUpload' # Or using input()
    fwdPath = 'C:/Users/myUser/Desktop/toUpload'
    for countries in os.listdir(folderPath):
        for sectors in os.listdir(folderPath+'\\'+countries):
            for file in os.listdir(folderPath+'\\'+countries+'\\'+sectors):
                data = pd.DataFrame()
                filename, _ext = os.path.splitext(os.path.basename(folderPath+'\\'+countries+'\\'+file))
                print(file + ' ' + filename + ' ' + sectors + ' ' + countries)
                data = pd.read_csv(file)
                # cursor.execute('SELECT * FROM SECTORS')
                # print(list(cursor))
finally:
    cursor.close()
    conn.close()
The following print line returns the file, its filename without the extension, and then the sectors and countries folder names:
print(file + ' ' + filename+ ' ' + sectors + ' ' + countries)
myfile.csv myfile WASHSector CTRYIrq
Now when it comes to reading the CSV, it takes a very long time, and at the end I get the following error:
[Errno 2] File myfile.csv does not exist
You need to give pd.read_csv the full path of the file, so change it to:
data = pd.read_csv(folderPath+'\\'+countries+'\\'+sectors + '\\' +file)
Before reading the csv file, you should compose the whole path to the file; otherwise, pandas won't be able to read it.
import os
# ...
path = os.path.join(folderPath, countries, sectors, file)
data = pd.read_csv(path)
Also, instead of using three nested for loops, I recommend using the os.walk method. It will automatically recurse through directories:
>>> folderPath = 'C:\\Users\\myUser\\Desktop\\toUpload'
>>> for root, _, files in os.walk(folderPath):
>>> ... for f in files:
>>> ... pd.read_csv(os.path.join(root, f))

Adding Multiple .xls files to a Single .xls file, using the file name to name tabs

I have multiple directories, each of which containing any number of .xls files.
I'd like to take the files in any given directory and combine them into one .xls file, using the file names as the tab names.
For example if there are the files NAME.xls, AGE.xls, LOCATION.xls, I'd like to combine them into a new file with the data from NAME.xls on a tab called NAME, the data from AGE.xls on a tab called AGE and so on.
Each source .xls file only has one column of data with no headers.
This is what I have so far and, well, it's not working.
Any help would be greatly appreciated (I'm fairly new to Python and I've never had to do anything like this before).
wkbk = xlwt.Workbook()
xlsfiles = glob.glob(os.path.join(path, "*.xls"))
onlyfiles = [f for f in listdir(path) if isfile(join(path, f))]
tabNames = []
for OF in onlyfiles:
    if str(OF)[-4:] == ".xls":
        sheetName = str(OF)[:-4]
        tabNames.append(sheetName)
    else:
        pass
for TN in tabNames:
    outsheet = wkbk.add_sheet(str(TN))
    data = pd.read_excel(path + "\\" + TN + ".xls", sheet_name="data")
    data.to_excel(path + "\\" + "Combined" + ".xls", sheet_name=str(TN))
Here is a small helper function - it supports both .xls and .xlsx files:
import pandas as pd
try:
    from pathlib import Path
except ImportError:  # Python 2
    from pathlib2 import Path

def merge_excel_files(dir_name, out_filename='result.xlsx', **kwargs):
    p = Path(dir_name)
    with pd.ExcelWriter(out_filename) as xls:
        _ = [pd.read_excel(f, header=None, **kwargs)
                 .to_excel(xls, sheet_name=f.stem, index=False, header=None)
             for f in p.glob('*.xls*')]
Usage:
merge_excel_files(r'D:\temp\xls_directory', 'd:/temp/out.xls')
merge_excel_files(r'D:\temp\xlsx_directory', 'd:/temp/out.xlsx')
Can you try the following:
import pandas as pd
import glob

path = 'YourPath\ToYour\Files\\' # Note the \\ at the end
# Create a list with only .xls files
list_xls = glob.glob1(path, "*.xls")
# Create a writer for pandas
writer = pd.ExcelWriter(path + "Combined.xls", engine='xlwt')
# Loop over all the files
for xls_file in list_xls:
    # Read the xls file and the sheet named "data"
    df_data = pd.read_excel(io=path + xls_file, sheet_name="data")
    # Are the sheets containing data in all your xls files named "data"?
    # Write the data into a sheet named after the file
    df_data.to_excel(writer, sheet_name=xls_file[:-4])
# Save and close your Combined.xls
writer.save()
writer.close()
Let me know if it works for you; I have never tried engine='xlwt', as I work with .xlsx files rather than .xls.
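For what it's worth, a minimal sketch of the same loop writing .xlsx instead, which avoids the xlwt engine entirely (assumes openpyxl is installed for writing, and xlrd for reading the .xls sources):
import glob
import pandas as pd

path = 'YourPath\\ToYour\\Files\\'
with pd.ExcelWriter(path + "Combined.xlsx", engine="openpyxl") as writer:
    for xls_file in glob.glob1(path, "*.xls"):
        # read the sheet named "data" from each source file
        df_data = pd.read_excel(path + xls_file, sheet_name="data")
        # write it to a sheet named after the file
        df_data.to_excel(writer, sheet_name=xls_file[:-4])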
