Convert files from different paths using Python

I'm trying to convert Excel files from different paths, but it only converts the file in the last path in the path list.
What is the proper way to loop through the paths in the list to get the files converted?
import pandas as pd
import glob, os
import csv, json
import openpyxl
from pathlib import Path
list_path = Path("excel_files/PLM", "excel_files/PTR", "excel_files/TMR")
for xlsx_file in glob.glob(os.path.join(list_path,"*.xlsx*")):
    data_xls = pd.read_excel(xlsx_file, 'Relatório - DADOS', index_col=None, engine = 'openpyxl')
    csv_file = os.path.splitext(xlsx_file)[0]+".csv"
    data_xls.to_csv(csv_file, encoding='utf-8', index=False)

Path("excel_files/PLM", "excel_files/PTR", "excel_files/TMR") returns a single path, not a list of paths:
>>> Path("excel_files/PLM", "excel_files/PTR", "excel_files/TMR")
PosixPath('excel_files/PLM/excel_files/PTR/excel_files/TMR')
I'm not sure why it finds any files at all, to be honest.
Instead, you will probably have to do another loop - something like:
for path in ["excel_files/PLM", "excel_files/PTR", "excel_files/TMR"]:
    for xlsx_file in glob.glob(os.path.join(path, "*.xlsx*")):
        ...
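To spell out the nested loop a bit more, here is a minimal sketch that assumes the three directories and the sheet name from the question. The helper names find_xlsx_files and convert_all are made up for illustration, and pandas is imported inside convert_all only so that the path handling can be tried on its own.

```python
from pathlib import Path

def find_xlsx_files(directories):
    """Return (xlsx_path, csv_path) pairs for every .xlsx file in the given directories."""
    pairs = []
    for directory in directories:
        # Path.glob only matches inside this one directory (no recursion)
        for xlsx_file in sorted(Path(directory).glob("*.xlsx")):
            pairs.append((xlsx_file, xlsx_file.with_suffix(".csv")))
    return pairs

def convert_all(directories, sheet_name="Relatório - DADOS"):
    """Write one CSV next to each Excel file found in the directories."""
    import pandas as pd  # imported here so the path logic works without pandas installed
    for xlsx_file, csv_file in find_xlsx_files(directories):
        data = pd.read_excel(xlsx_file, sheet_name=sheet_name, engine="openpyxl")
        data.to_csv(csv_file, encoding="utf-8", index=False)

# Hypothetical call, using the directories from the question:
# convert_all(["excel_files/PLM", "excel_files/PTR", "excel_files/TMR"])
```

Keeping the directories in a plain list, rather than passing them all to Path(), is what makes each one get visited in turn.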

Related

Convert Excel to CSV using Python

I would like to name the new CSV files after their corresponding xlsx files.
import pandas as pd
for filename in my_path.glob('*.xlsx'):
    read_file = pd.read_excel(str(filename), sheet_name='My Excel sheet name')
    read_file.to_csv('XLSX NAME SHOULD = CSV NAME.csv', index=None, header=True)
To get the filename with its path but without the extension, use os.path.splitext:
from os import path
file_path = "/path/to/file.txt"
path.splitext(file_path)
# -> ("/path/to/file", ".txt")
To get the filename without the path:
from os import path
file_path = "/path/to/file.txt"
path.basename(file_path)
# -> "file.txt"
So to change the extension from xlsx to csv:
from os import path
file_path = "/path/to/file.xlsx"
filename = path.splitext(file_path)[0] + '.csv'
# -> "/path/to/file.csv"
And if you need to change the path to save the file in another folder, then you can use basename first.
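A rough sketch of that last step, assuming you want the CSV to keep the workbook's base name. The function name csv_path_in_folder and the folder "/other/folder" are made up for illustration.

```python
from os import path

def csv_path_in_folder(xlsx_path, target_folder):
    """Build the CSV path for xlsx_path inside target_folder, keeping the base name."""
    file_name = path.basename(xlsx_path)   # drop the original directory
    stem = path.splitext(file_name)[0]     # drop the .xlsx extension
    return path.join(target_folder, stem + ".csv")

# Hypothetical paths, only to show the call
print(csv_path_in_folder("/path/to/file.xlsx", "/other/folder"))
```

The result can be passed straight to to_csv, provided the target folder already exists.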

Reading all files in a folder with relative urls in both windows and linux

I can read a CSV with a relative path using the code below.
import pandas as pd
file_path = './Data Set/part-0000.csv'
df = pd.read_csv(file_path)
But when there are multiple files and I use glob, the file paths come back with a mix of forward slashes and backslashes, so I am unable to read the files because the paths are wrong.
allPaths = glob.glob(path)
The file path looks like this for path = "./Data Set/UserIdToUrl/*":
"./Data Set/UserIdToUrl\\part-0000.csv"
The file path looks like this for path = ".\\Data Set\\UserIdToUrl\\*":
".\\Data Set\\UserIdToUrl\\part-0000.csv"
If I use
normalPath = os.path.normpath(path)
normalPath is missing the leading ./ or .\\, as shown below.
'Data Set\UserIdToUrl\part-00000.csv'
Either of the paths below could work. What is the best way to do this so that it works on both Windows and Linux?
".\\Data Set\\UserIdToUrl\\part-0000.csv"
or
"./Data Set/UserIdToUrl/part-0000.csv"
Please ask clarifying questions, if any. Thanks in advance for comments and answers.
More Info:
I guess the problem occurs only on Windows, not on Linux.
Below is the shortest program that shows the issue. Assume there are files matching './Data Set/UserIdToUrl/*'; the path is correct, since I can read a file when passing its path directly, as in pd.read_csv('./Data Set/UserIdToUrl/filename.csv').
import os
import glob
import pandas as pd
path = "./Data Set/UserIdToUrl/*"
allFiles = glob.glob(path)
np_array_list = []
for file_ in allFiles:
    normalPath = os.path.normpath(file_)
    print(file_)
    print(normalPath)
    df = pd.read_csv(file_,index_col=None, header=0)
    np_array_list.append(df.to_numpy())  # as_matrix() was removed in pandas 1.0
Update2
I just googled glob library. Its definition says 'glob — Unix style pathname pattern expansion'. I guess, I need some utility function that could work in both unix and windows.
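You can use abspath:
import os
import pandas as pd
data_dir = os.path.abspath('./Data Set/')
for file in os.listdir(data_dir):
    if file.endswith('.csv'):
        df = pd.read_csv(os.path.join(data_dir, file))  # join with data_dir; abspath(file) would resolve against the cwd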
Try this:
import pandas as pd
from pathlib import Path
dir_path = 'Data Set'
datas = []
for p in Path(dir_path).rglob('*.csv'):
    df = pd.read_csv(p)
    datas.append(df)
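For what it's worth, the mixed separators in the question should be harmless on Windows, since both / and \ are treated as separators there, and the dropped leading ./ does not change which file a relative path points to. A small sketch using pathlib.PureWindowsPath, which applies Windows parsing rules on any operating system, shows that the two spellings from the question parse to the same path:

```python
from pathlib import PureWindowsPath

# The two spellings from the question, parsed with Windows rules
mixed = PureWindowsPath("./Data Set/UserIdToUrl\\part-0000.csv")
backslashes = PureWindowsPath(".\\Data Set\\UserIdToUrl\\part-0000.csv")

# Both parse to the same components, so they compare equal
same_file = mixed == backslashes
print(same_file)

# The leading "." is dropped, much like os.path.normpath does
print(mixed.parts)

# as_posix() gives a forward-slash spelling if one is needed for display
print(mixed.as_posix())
```

If reading still fails, the cause is probably something other than the slashes, for example a directory among the glob results.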

How can I open multiple json files in Python with a for loop?

For a data challenge at school we need to open a lot of json files with python. There are too many to open manually. Is there a way to open them with a for loop?
This is the way I open one of the json files and make it a dataframe (it works).
file_2016091718 = '/Users/thijseekelaar/Downloads/airlines_complete/airlines-1474121577751.json'
json_2016091718 = pd.read_json(file_2016091718, lines=True)
Here is a screenshot of what the folder containing the data looks like (click here)
Yes, you can use os.listdir to list all the JSON files in your directory, build the full path of each one with os.path.join, and open each file through that path:
import os
import pandas as pd
base_dir = '/Users/thijseekelaar/Downloads/airlines_complete'
# Get all files in the directory
data_list = []
for file in os.listdir(base_dir):
    # If the file is a JSON file, construct its full path, open it, and append its data to the list
    if file.endswith('.json'):
        json_path = os.path.join(base_dir, file)
        json_data = pd.read_json(json_path, lines=True)
        data_list.append(json_data)
print(data_list)
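Try this:
import os
# os.walk yields (root, subdirs, files) for every directory under the starting one
for root, subdirs, files in os.walk('your/json/dir/'):
    for file in files:
        # file is only a name, so join it with root to get a usable path
        with open(os.path.join(root, file), 'r') as f:
            pass  # your stuff here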
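If the goal is one DataFrame rather than a list of them, something along these lines might work. It is only a sketch: the helper names list_json_files and load_json_lines are made up, and it assumes every file is in the same JSON-lines format as the one read in the question.

```python
from pathlib import Path

def list_json_files(base_dir):
    """Return the .json files directly inside base_dir, sorted by name."""
    return sorted(
        p for p in Path(base_dir).iterdir()
        if p.is_file() and p.suffix == ".json"
    )

def load_json_lines(base_dir):
    """Read every JSON-lines file in base_dir and stack them into one DataFrame."""
    import pandas as pd  # imported lazily so list_json_files works without pandas
    frames = [pd.read_json(p, lines=True) for p in list_json_files(base_dir)]
    # pd.concat raises ValueError on an empty list, so an empty folder is an error here
    return pd.concat(frames, ignore_index=True)

# Hypothetical call, using the folder from the question:
# airlines = load_json_lines('/Users/thijseekelaar/Downloads/airlines_complete')
```

Checking the suffix instead of testing 'json' in file avoids picking up names that merely contain the letters json.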

How to convert multiple .xlsx files into .csv using pandas and glob

I have a folder JanuaryDataSentToResourcePro that contains multiple .xlsx files.
I want to iterate through the folder, convert all of them into .csv, and keep the same file names.
For that I'm trying to use glob, but I'm getting an error: TypeError: 'module' object is not callable
import glob
excel_files = glob('*xlsx*')
for excel in excel_files:
    out = excel.split('.')[0]+'.csv'
    df = pd.read_excel(r'''C:\Users\username\Documents\TestFolder\JanuaryDataSentToResourcePro\ResourceProDailyDataset_01_01_2018.xlsx''', 'ResourceProDailyDataset')
    df.to_csv(out)
I am new to python. Does it look right?
UPDATE:
import pandas as pd
import glob
excel_files = glob.glob("*.xlsx")
for excel in excel_files:
    out = excel.split('.')[0]+'.csv'
    df = pd.read_excel(excel, 'ResourceProDailyDataset')
    df.to_csv(out)
But it is still not converting the .xlsx files to .csv.
The glob module should be used like this:
import glob
f = glob.glob("*.xlsx")
glob is a module, not a function, so it cannot be called directly; glob.glob is the function.
========================================
import glob
import os
import pandas as pd
excel_files = glob.glob('C:/Users/username/Documents/TestFolder/JanuaryDataSentToResourcePro/*.xlsx') # assume the path
for excel in excel_files:
    out = os.path.splitext(excel)[0]+'.csv' # splitext only strips the final extension, so dots elsewhere in the path are safe
    df = pd.read_excel(excel) # if only the first sheet is needed.
    df.to_csv(out)
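One thing to watch for: the question's code builds the output name with excel.split('.')[0], which cuts at the first dot anywhere in the path, not at the extension. A small sketch of the difference, using a made-up path that contains extra dots (the name csv_name is also just for illustration):

```python
import os

def csv_name(excel_path):
    """Swap the extension of excel_path for .csv, leaving other dots alone."""
    return os.path.splitext(excel_path)[0] + ".csv"

# Hypothetical path with dots in a folder name and in the file name
example = "C:/Users/first.last/reports/data_01.01.2018.xlsx"

print(example.split(".")[0] + ".csv")  # -> C:/Users/first.csv
print(csv_name(example))               # -> C:/Users/first.last/reports/data_01.01.2018.csv
```

With relative names like the ones returned by glob.glob("*.xlsx") the two approaches only differ when the file name itself contains extra dots.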

Python Renaming Multiple Files in Directory

Beginner question: I am trying to rename all .xlsx files within a directory. I understand how to replace a character in string with another, but how about removing? More specifically, I have multiple files in a directory: 0123_TEST_01, 0456_TEST_02. etc. I am trying to remove the prefix in the file name, which would result in the following: TEST_01, TEST_02.
I am attempting to use os.rename and throw it into a loop, but am unsure if I should use len() and some math to try and return the correct naming convention. The below code is where I currently stand. Please let me know if this does not make sense. Thanks.
import os
import shutil
import glob
src_files = os.listdir('C:/Users/acars/Desktop/b')
for file_name in src_files:
    os.rename(fileName, filename.replace())
Just split once on an underscore and use the second element. glob will also find all your xlsx files for you and return the full paths:
from os import path, rename
from glob import glob
src_files = glob('C:/Users/acars/Desktop/b/*.xlsx')
pth = 'C:/Users/acars/Desktop/b/'
for file_name in src_files:
    rename(file_name, path.join(pth, path.basename(file_name).split("_",1)[1]))
If you only have xlsx files and you do not use glob, you need to join the paths yourself:
from os import listdir, path, rename
pth = 'C:/Users/acars/Desktop/b'
src_files = listdir(pth)
for file_name in src_files:
    new = file_name.split("_", 1)[1]
    file_name = path.join(pth, file_name)
    rename(file_name, path.join(pth, new))
Just split the file name by underscore, ignore the first part, and join it back again.
>>> file_name = '0123_TEST_01'
>>> '_'.join(file_name.split('_')[1:])
'TEST_01'
Your code will look like this:
for file_name in src_files:
    # assumes the current working directory is the folder that contains the files
    os.rename(file_name, '_'.join(file_name.split('_')[1:]))
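Putting the pieces together, here is a rough pathlib sketch that does not depend on the current working directory. The function names strip_prefix and rename_without_prefix are made up for illustration, and the sketch assumes the files carry an .xlsx extension and that names without an underscore should be left alone.

```python
from pathlib import Path

def strip_prefix(name):
    """Drop everything up to and including the first underscore, if there is one."""
    prefix, sep, rest = name.partition("_")
    # Keep the original name when there is no underscore or nothing after it
    return rest if sep and rest else name

def rename_without_prefix(folder):
    """Rename every .xlsx file in folder so that its leading prefix is removed."""
    for src in sorted(Path(folder).glob("*.xlsx")):
        dst = src.with_name(strip_prefix(src.name))
        # Skip unchanged names and never overwrite an existing file
        if dst != src and not dst.exists():
            src.rename(dst)

# Hypothetical call, using the folder from the question:
# rename_without_prefix('C:/Users/acars/Desktop/b')
```

Using partition instead of split means a name without an underscore does not raise an IndexError.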
