FileNotFoundError in iterating over files under a directory - python

import os
import pandas as pd
FILES = os.listdir("/CADEC/original")
for file in FILES:
if file.startswith("ARTHROTEC."):
print(file)
ARTHROTEC.1.ann
ARTHROTEC.10.ann
ARTHROTEC.100.ann
ARTHROTEC.101.ann
ARTHROTEC.102.ann
ARTHROTEC.103.ann
ARTHROTEC.104.ann
ARTHROTEC.105.ann
ARTHROTEC.106.ann
ARTHROTEC.107.ann
ARTHROTEC.108.ann
ARTHROTEC.109.ann
ARTHROTEC.11.ann
ARTHROTEC.110.ann
ARTHROTEC.111.ann
ARTHROTEC.112.ann
ARTHROTEC.113.ann
ARTHROTEC.114.ann
ARTHROTEC.115.ann
...
I want to extract data from all the files starting with certain letters under a directory. As shown above, when I iterate over the directory and print every matching file name, I get the expected list of names (strings). Meanwhile, data = pd.read_csv("/CADEC/original/ARTHROTEC.1.ann", sep='\t', header=None) works perfectly well. However, running the following code just returns an error. Why is the file not found, and what should I do to fix this?
for file in FILES:
    if file.startswith("ARTHROTEC."):
        data = pd.read_csv(file, sep='\t', header=None)
FileNotFoundError: [Errno 2] File ARTHROTEC.1.ann does not exist: 'ARTHROTEC.1.ann'

os.listdir only returns the file names in the directory, not their paths; pandas needs a path (absolute or relative) to each file, unless the file is in the same directory as the code.
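The smallest fix, then, is to join the directory back onto each name before calling read_csv. A runnable sketch, using a throwaway temp directory with made-up .ann contents standing in for /CADEC/original:

```python
import os
import tempfile

import pandas as pd

# throwaway stand-in for /CADEC/original (file contents are hypothetical)
base = tempfile.mkdtemp()
for name in ("ARTHROTEC.1.ann", "ARTHROTEC.2.ann", "other.txt"):
    with open(os.path.join(base, name), "w") as f:
        f.write("T1\tDrug 0 9\tARTHROTEC\n")

# os.listdir gives bare names; join each one back onto the directory
matched = [
    os.path.join(base, name)
    for name in os.listdir(base)
    if name.startswith("ARTHROTEC.")
]

# every entry is now a real path, so read_csv can open it
data = pd.concat(pd.read_csv(p, sep="\t", header=None) for p in matched)
```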
You will be better off learning the pathlib module, which treats paths as objects with methods instead of plain strings.
Path.glob produces a generator of Path objects matching the pattern.
Python 3's pathlib Module: Taming the File System
pathlib may take some getting used to, but the methods for extracting specific parts of a path, like .suffix for the file extension or .stem for the file name without it, make it worthwhile.
import pandas as pd
from pathlib import Path
# create the path object and get the matching files with .glob
files = Path('/CADEC/original').glob('ARTHROTEC*.ann')
# create a list of dataframes, 1 dataframe for each file
df_list = [pd.read_csv(file, sep='\t', header=None) for file in files]
# alternatively, create a dict of dataframes with the file name as the key;
# .glob returns a one-shot generator, so recreate it if it was already consumed
files = Path('/CADEC/original').glob('ARTHROTEC*.ann')
df_dict = {file.stem: pd.read_csv(file, sep='\t', header=None) for file in files}
Example
Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] on win32
In [1]: import os
   ...: from pathlib import Path

In [2]: os.listdir('e:/PythonProjects/stack_overflow/t-files')
Out[2]:
['.ipynb_checkpoints',
'03900169.txt',
'142233.0.txt',
'153431.2.txt',
'17371271.txt',
'274301.5.txt',
'42010316.txt',
'429237.7.txt',
'570651.4.txt',
'65500027.txt',
'688599.3.txt',
'740103.5.txt',
'742537.6.txt',
'87505504.txt',
'90950222.txt',
't1.txt',
't2.txt',
't3.txt']
In [3]: list(Path('e:/PythonProjects/stack_overflow/t-files').glob('*'))
Out[3]:
[WindowsPath('e:/PythonProjects/stack_overflow/t-files/.ipynb_checkpoints'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/03900169.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/142233.0.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/153431.2.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/17371271.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/274301.5.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/42010316.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/429237.7.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/570651.4.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/65500027.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/688599.3.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/740103.5.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/742537.6.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/87505504.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/90950222.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/t1.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/t2.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/t3.txt')]

Related

Convert files from different paths using Python

I'm trying to convert Excel files from different paths, but it only converts the file in the last path in the path list.
What is the proper way to loop through the paths in the list to get the files converted?
import pandas as pd
import glob, os
import csv, json
import openpyxl
from pathlib import Path
list_path = Path("excel_files/PLM", "excel_files/PTR", "excel_files/TMR")
for xlsx_file in glob.glob(os.path.join(list_path, "*.xlsx*")):
    data_xls = pd.read_excel(xlsx_file, 'Relatório - DADOS', index_col=None, engine='openpyxl')
    csv_file = os.path.splitext(xlsx_file)[0] + ".csv"
    data_xls.to_csv(csv_file, encoding='utf-8', index=False)
Path("excel_files/PLM", "excel_files/PTR", "excel_files/TMR") returns a single path, not a list of paths:
>>> Path("excel_files/PLM", "excel_files/PTR", "excel_files/TMR")
PosixPath('excel_files/PLM/excel_files/PTR/excel_files/TMR')
I'm not sure why it finds any files at all, to be honest.
Instead, you will probably have to do another loop - something like:
for path in ["excel_files/PLM", "excel_files/PTR", "excel_files/TMR"]:
    for xlsx_file in glob.glob(os.path.join(path, "*.xlsx*")):
        ...
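If you'd rather not nest the loops, the per-directory globs can also be flattened into a single iterator with itertools.chain. A sketch using throwaway temp directories (with empty placeholder files) in place of the real excel_files folders:

```python
from itertools import chain
from pathlib import Path
import tempfile

# throwaway directories standing in for excel_files/PLM, PTR and TMR
root = Path(tempfile.mkdtemp())
dirs = [root / name for name in ("PLM", "PTR", "TMR")]
for d in dirs:
    d.mkdir()
    (d / f"report_{d.name}.xlsx").touch()  # empty placeholder file

# one flat iterator over the matches from every directory
xlsx_files = sorted(chain.from_iterable(d.glob("*.xlsx") for d in dirs))
print([p.name for p in xlsx_files])
```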

Merge Excel files with pandas in python

I'm almost done with merging Excel files with pandas in Python, but when I give the path it won't work. I get the error "No such file or directory: 'file1.xlsx'". When I leave the path empty it works, but I want to decide which folder it takes the files from. And I saved the files in the folder 'excel'.
cwd = os.path.abspath('/Users/Viktor/downloads/excel')  # if I leave it empty and have files in /Viktor it works, but the desired Excel files are in /excel
print(cwd)
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file), ignore_index=True)
df.head()
df.to_excel(r'/Users/Viktor/Downloads/excel/resultat/merged.xlsx')
pd.read_excel(file) looks for the file relative to the path where the script is executed. If you execute it in '/Users/Viktor/', try:
import os
import pandas as pd

cwd = os.path.abspath('/Users/Viktor/downloads/excel')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
        df = df.append(pd.read_excel('downloads/excel/' + file), ignore_index=True)
df.head()
df.to_excel(r'/Users/Viktor/downloads/excel/resultat/merged.xlsx')
How about actually changing the current working directory with
os.chdir(cwd)
Just printing the path doesn't help.
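Joining the folder onto each bare name (instead of relying on the working directory) also makes the loop independent of where the script is run. A runnable sketch with a temp folder standing in for /Users/Viktor/downloads/excel; plain CSVs and pd.concat are used here so it runs without openpyxl installed, so swap in read_excel for real .xlsx files:

```python
import os
import tempfile

import pandas as pd

# throwaway folder standing in for /Users/Viktor/downloads/excel
cwd = tempfile.mkdtemp()
for i in (1, 2):
    with open(os.path.join(cwd, f"part{i}.csv"), "w") as f:
        f.write(f"a,b\n{i},{i * 10}\n")

frames = []
for file in os.listdir(cwd):
    if file.endswith(".csv"):
        # join the folder back onto the bare name from os.listdir
        frames.append(pd.read_csv(os.path.join(cwd, file)))
df = pd.concat(frames, ignore_index=True)
```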
Use pathlib
Path.glob() to find all the files
Use Path.rglob() if you want to include subdirectories
Use pandas.concat to combine the dataframes created with pd.read_excel in a list comprehension
from pathlib import Path
import pandas as pd
# path to files
p = Path('/Users/Viktor/downloads/excel')
# find the xlsx files
files = p.glob('*.xlsx')
# create the dataframe (ignore_index belongs to concat, not read_excel)
df = pd.concat([pd.read_excel(file) for file in files], ignore_index=True)
# save the file
df.to_excel(r'/Users/Viktor/Downloads/excel/resultat/merged.xlsx')

Reading all files in a folder with relative urls in both windows and linux

I can read a CSV with a relative path using the code below.
import pandas as pd
file_path = './Data Set/part-0000.csv'
df = pd.read_csv(file_path )
but when there are multiple files I am using glob, and the file paths come back with mixed forward and backward slashes, so I am unable to read the files due to the wrong path.
allPaths = glob.glob(path)
file path looks like below for path = "./Data Set/UserIdToUrl/*"
"./Data Set/UserIdToUrl\\part-0000.csv"
file path looks like below for path = ".\\Data Set\\UserIdToUrl\\*"
".\\Data Set\\UserIdToUrl\\part-0000.csv"
If I am using
normalPath = os.path.normpath(path)
the normalized path is missing the relative ./ or .\\ prefix, like below:
'Data Set\UserIdToUrl\part-00000.csv'
Either of the forms below could work; what is the best way to do this so that it works on both Windows and Linux?
".\\Data Set\\UserIdToUrl\\part-0000.csv"
or
"./Data Set/UserIdToUrl/part-0000.csv"
Please ask for clarification if anything is unclear. Thanks in advance for comments and answers.
More Info:
I guess the problem only occurs on Windows, not on Linux.
Below is the shortest program that shows the issue. Assume there are files under './Data Set/UserIdToUrl/' and that the path is correct, since I can read a file when providing its path directly to pd.read_csv('./Data Set/UserIdToUrl/filename.csv').
import os
import glob
import pandas as pd
path = "./Data Set/UserIdToUrl/*"
allFiles = glob.glob(path)
np_array_list = []
for file_ in allFiles:
    normalPath = os.path.normpath(file_)
    print(file_)
    print(normalPath)
    df = pd.read_csv(file_, index_col=None, header=0)
    np_array_list.append(df.as_matrix())
Update 2
I just googled the glob library. Its definition says 'glob — Unix style pathname pattern expansion'. I guess I need some utility function that works on both Unix and Windows.
You can use abspath to build the directory's full path once, then join each file name onto it (note that os.path.abspath(file) on a bare name would resolve against the current working directory, not the data folder):
base = os.path.abspath('./Data Set/')
for file in os.listdir(base):
    if file.endswith('.csv'):
        df = pandas.read_csv(os.path.join(base, file))
Try this:
import pandas as pd
from pathlib import Path
dir_path = 'Data Set'
datas = []
for p in Path(dir_path).rglob('*.csv'):
    df = pd.read_csv(p)
    datas.append(df)

FileNotFoundError occurred when trying to read multiple text files into a single pandas dataframe

I tried to read multiple text files from a local directory into one single pandas dataframe. Since the original text files come with an extra file extension, I renamed them; after that, I tried to read all the text files into a single dataframe with read_csv and concat from pandas. The problem is, I am able to read a single text file with pandas, but when I tried to read a list of text files from a local directory into a single dataframe, I got the following error:
folder = 'fakeNewsDatasets[Rada]/fakeNewsDataset/fake'
allfiles=os.listdir(folder)
print(allfiles)
['biz01.txt',
'biz02.txt',
'biz03.txt',
'biz04.txt',
'biz05.txt',
'biz06.txt']
then I tried to read those text files into a single dataframe as follows:
dfs=pd.concat([pd.read_csv(file, header = None, sep = '\n', skip_blank_lines = True) for file in allfiles], axis=1)
FileNotFoundError: [Errno 2] File b'biz02.txt' does not exist: b'biz02.txt'
I don't understand why this problem occurred, because reading a single text file into a pandas dataframe works well for me.
df = pd.read_csv('biz01.txt', header = None, sep = '\n', skip_blank_lines = True)
df=df.T
df.columns = ['headline', 'text']
Can anyone help me resolve this issue? How can I fix this error? Any better ideas?
Use glob(); it would be easier:
import glob
allfiles=glob.glob('C:\\folder1\\*.csv')
Otherwise you have to join the path with each file name from for file in allfiles when reading the file with pd.read_csv().
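That join can go straight into the comprehension. A sketch with throwaway files and made-up contents standing in for the fake-news folder (the default comma separator is used here just to keep the example runnable everywhere):

```python
import os
import tempfile

import pandas as pd

# throwaway stand-in for the fakeNewsDataset/fake folder
folder = tempfile.mkdtemp()
for name in ("biz01.txt", "biz02.txt"):
    with open(os.path.join(folder, name), "w") as f:
        f.write("some headline\n\nsome body text\n")

# prepend the folder to each bare name inside the comprehension
dfs = pd.concat(
    [
        pd.read_csv(os.path.join(folder, f), header=None, skip_blank_lines=True)
        for f in sorted(os.listdir(folder))
    ],
    axis=1,
)
```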
Another option:
import os
import pandas as pd
data_set = pd.DataFrame()
for root, dirs, files in os.walk(""):
    for file in files:
        if file.endswith('.txt'):
            df = pd.read_csv(os.path.join(root, file), header=None)
            data_set = pd.concat([data_set, df])
data_set.to_csv("/tx.txt", index=False, header=False)

How to read a file that is in another directory?

My project:
et -> datacollector
   -> eventprocessor -> multilang -> resources -> python -> tenderevent -> rules -> Table.py
   -> target -> inpout -> Read.csv
Table.py
import pandas as pd
df_LFB1 = pd.read_csv('Read.csv', sep = ',', usecols = [1,2,7,59])
Now I want to use the Read.csv file above; how should I give the directory of Read.csv in pd.read_csv?
import os
os.getcwd()
Out[42]: '/Users/Documents'
## os.path.abspath(__file__) ## inside script
If I have the 'Read.csv' file in my current working directory '/Users/Documents', I can read the file as below.
df_LFB1 = pd.read_csv('Read.csv', sep = ',', usecols = [1,2,7,59])
and if my file is not in the current working directory but in some other directory, let's say the et directory is in /home/project,
df_LFB1 = pd.read_csv(r'/home/project/et/eventprocessor/target/inpout/Read.csv',
                      sep=',', usecols=[1, 2, 7, 59])
The above statement will read the file.
Note: when you provide an absolute path to the file, it does not matter where your script resides.
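The same idea expressed with pathlib, which joins path pieces with the / operator and works unchanged on Windows and Linux. A sketch against a throwaway directory tree standing in for /home/project (with made-up CSV contents, so the usecols argument is omitted):

```python
from pathlib import Path
import tempfile

import pandas as pd

# throwaway tree standing in for /home/project/et/eventprocessor/target/inpout
project = Path(tempfile.mkdtemp())
inpout = project / "et" / "eventprocessor" / "target" / "inpout"
inpout.mkdir(parents=True)
(inpout / "Read.csv").write_text("a,b\n1,2\n")

# build the path once with the / operator; pandas accepts Path objects directly
csv_path = inpout / "Read.csv"
df_LFB1 = pd.read_csv(csv_path)
```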
