Sort files inside folder by element - python

I'm working with a Python script that takes some CSV files inside a folder and merges the data inside these files, but the problem is when sorting the files.
I found a similar useful question and I try to use the answers for it, but they didn't work.
The reality is I can obtain the final file, but the sort method doesn't work as I expect. I'm using the numeric element in the name of each file that I want to sort, I also include an image from my console:
How can I resolve this issue?
my code is the following:
import pandas as pd
import os
import glob
import numpy as np
import re
from os import listdir
#files = glob.glob1('./separa_0-60/', '*' + '.csv')
# if you want sort files according to the digits included in the filename, you can do as following:
#data_files = sorted(files, key=lambda x:float(re.findall("(\d+)",x)[0]))
#data_files = sorted(glob.glob('./separa_0-60/resultados_nodos_*.csv'))
data_files = sorted(glob.glob('./separa_0-60/resultados_nodos_*.csv'), key=lambda x: float(re.findall("(\d+)",x)[0]))
#print(files)
print(data_files)
mergeddata = pd.concat(pd.read_csv(datafile, sep=';')
for datafile in data_files)
keep_col = [
"node_code",
"throughput[Mbps]",
"node_code.1",
"throughput[Mbps].1"
]
mergeddata2 = mergeddata[keep_col]
print(mergeddata2)
mergeddata2.to_csv('resul_nodos_final_separa0-60.csv', index=False)
I very much appreciate all the help, regards!

The problem is that the directory name "separa_0-60" has digits in it. The first result from your findall is that "0". Better to do a more specific search on the file name.
data_files = sorted(glob.glob('./separa_0-60/resultados_nodos_*.csv'),
key=lambda x: float(re.search(r"resultados_nodos_(\d+).csv$", x).group(1)))

Related

How import multiple text files to one dataframe in python?

I found here how to import multiple text files to one data frame. However, it gives an error. Files are with the names as footballseason1,footballseason2,footballseason3 ... (until footballseason5000)
import pandas as pd
import datetime as dt
import os, glob
os.chdir("~/Downloads/data")
filenames = [i for i in glob.glob("*.txt")]
FileNotFoundError: [Errno 2] No such file or directory: '~/Downloads/data'
However, if I try to import one file, everything is working and the directory is found
df = pd.read_csv("~/Downloads/data/footballseason1.txt", sep=",")
Could you help to fix the problem? and are there any ways to do it without changing directory and simply do all the steps using the path where all files are located?
Python's os does not understand ~ by default, so it needs to be expanded manually:
filenames = [i for i in glob.glob(os.path.expanduser("~/Downloads/data/*.txt"))]
You can use python's list comprehension and pd.concat like below
df = pd.concat([pd.read_csv(i, sep=',') for i in glob.glob("~/Downloads/data/*.txt", recursive=True)])
Via pathlib ->
import pandas as pd
from pathlib import Path
inp_path = Path("~/Downloads/data")
df = pd.concat([
pd.read_csv(txt_file, sep=',') for txt_file in inp_path.glob('*.txt')
])
With added check - >
import pandas as pd
from pathlib import Path
inp_path = Path("~/Downloads/data")
if inp_path.exists():
df = pd.concat([
pd.read_csv(txt_file, sep=',') for txt_file in inp_path.glob('*.txt')
])
else:
print('input dir doesn\'t exist please check path')
Importing Data from Multiple files
Now let’s see how can we import data from multiple files from a specific directory. There are many ways to do so, but I personally believe this is an easier and simpler way to use and also to understand especially for beginners.
1)First, we are going to import the OS and glob libraries. We need them to navigate through different working directories and getting their paths.
import os
import glob
2) We also need to import the pandas library as we need to work with data frames.
import pandas as pd
3) Let’s change our working directory to the directory where we have all the data files.
os.chdir(r"C:\Users\HARISH\Path_for_our_files")
4) Now we need to create a for loop which iterates through all the .csv file in the current working directory
filenames = [i for i in glob.glob("*.csv")]

How to import all csv files in one folder and make the filename the variable name in pandas?

I would like to automatically import all csv files that are in one folder as dataframes and set the dataframe's variable name to the respective filename.
For example, in the folder are the following three files: data1.csv, data2.csv and data3.csv
How can I automatically import all three files having three dataframes (data1, data2 and data3) as the result?
If you want to save dataframe as variable with own file name. But it is not secure. This could cause code injection.
import pandas
import os
path = "path_of_directory"
files = os.listdir(path) # Returns list of files in the folder which is specifed path
for file in files:
if file.endswith(".csv"):# Checking wheter file endswith .csv
# os.sep returns the separtor of operator system
exec(f"{file[:-4]} = pandas.read_csv({path}+{os.sep}+{file})")
You can loop over the directory using pathlib and build a dictionary of name->DataFrame, eg:
import pathlib
import pandas as pd
dfs = {path.stem: pd.read_csv(path) for path in pathlib.Path('thepath/').glob(*.csv')}
Then access as dfs['test1'] etc...
Since the answer that was given includes an exec command, and munir.aygun already warned you what could go wrong with that approach. Now I want to show you the way to do it as Justin Ezequiel or munir.aygun already suggested:
import os
import glob
import pandas as pd
# Path to your data
path = r'D:\This\is\your\path'
# Get all .csv files at your path
allFiles = glob.glob(path + "/*.csv")
# Read in the data from files and safe to dictionary
dataStorage = {}
for filename in allFiles:
name = os.path.basename(filename).split(".")[0]
dataStorage[name] = pd.read_csv(filename)
# Can be used then like this (for printing here)
if "data1" in dataStorage:
print(dataStorage["data1"])
Hope this can still be helpful.

Python iterate over multiple files

I have a series of files that are in the following format:
file_1991.xlsx
file_1992.xlsx
# there are some gaps in the file numbering sequence
file_1995.xlsx
file_1996.xlsx
file_1997.xlsx
For each file I want to do something like:
import pandas as pd
data_1995 = pd.read_excel(open(directory + 'file_1995', 'rb'), sheetname = 'Sheet1')
do some work on the data, and save it as another file:
output_1995 = pd.ExcelWriter('output_1995.xlsx')
data_1995.to_excel(output_1995,'Sheet1')
Instead of doing all these for every single file, how can I iterate through multiple files and repeat the same operation across multiple files? In other words, I would like to iterate over all the files (they mostly following a numerical sequence in their names, but there are some gaps in the sequence).
Thanks for the help in advance.
You can use os.listdir or glob module to list all files in a directory.
With os.listdir, you can use fnmatch to filter files like this (can use a regex too);
import fnmatch
import os
for file in os.listdir('my_directory'):
if fnmatch.fnmatch(file, '*.xlsx'):
pd.read_excel(open(file, 'rb'), sheetname = 'Sheet1')
""" Do your thing to file """
Or with glob module (which is a shortcut for the fnmatch + listdir) you can do the same like this (or with a regex):
import glob
for file in glob.glob("/my_directory/*.xlsx"):
pd.read_excel(open(file, 'rb'), sheetname = 'Sheet1')
""" Do your thing to file """
You should use Python's glob module: https://docs.python.org/3/library/glob.html
For example:
import glob
for path in glob.iglob(directory + "file_*.xlsx"):
pd.read_excel(path)
# ...
I would recommend glob.
Doing glob.glob('file_*') returns a list which you can iterate on and do work.
Doing glob.iglob('file_*') returns a generator object which is an iterator.
The first one will give you something like:
['file_1991.xlsx','file_1992.xlsx','file_1995.xlsx','file_1996.xlsx']
If you know how your file names can be constructed, you might try to open a file with the 'r' attribute, so that open(..., 'r') fails if the file is non existent.
yearly_data = {}
for year in range(1990,2018):
try:
f = open('file_%4.4d.xlsx'%year, 'r')
except FileNotFoundError:
continue # to the next year
yearly_data[year] = ...
f.close()

Read in all csv files from a directory using Python

I hope this is not trivial but I am wondering the following:
If I have a specific folder with n csv files, how could I iteratively read all of them, one at a time, and perform some calculations on their values?
For a single file, for example, I do something like this and perform some calculations on the x array:
import csv
import os
directoryPath=raw_input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x=csvfile[:,2] #Creates the array that will undergo a set of calculations
I know that I can check how many csv files there are in a given folder (check here):
import glob
for files in glob.glob("*.csv"):
print files
But I failed to figure out how to possibly nest the numpy.genfromtxt() function in a for loop, so that I read in all the csv files of a directory that it is up to me to specify.
EDIT
The folder I have only has jpg and csv files. The latter are named eventX.csv, where X ranges from 1 to 50. The for loop I am referring to should therefore consider the file names the way they are.
That's how I'd do it:
import os
directory = os.path.join("c:\\","path")
for root,dirs,files in os.walk(directory):
for file in files:
if file.endswith(".csv"):
f=open(file, 'r')
# perform calculation
f.close()
Using pandas and glob as the base packages
import glob
import pandas as pd
glued_data = pd.DataFrame()
for file_name in glob.glob(directoryPath+'*.csv'):
x = pd.read_csv(file_name, low_memory=False)
glued_data = pd.concat([glued_data,x],axis=0)
I think you look for something like this
import glob
for file_name in glob.glob(directoryPath+'*.csv'):
x = np.genfromtxt(file_name,delimiter=',')[:,2]
# do your calculations
Edit
If you want to get all csv files from a folder (including subfolder) you could use subprocess instead of glob (note that this code only works on linux systems)
import subprocess
file_list = subprocess.check_output(['find',directoryPath,'-name','*.csv']).split('\n')[:-1]
for i,file_name in enumerate(file_list):
x = np.genfromtxt(file_name,delimiter=',')[:,2]
# do your calculations
# now you can use i as an index
It first searches the folder and sub-folders for all file_names using the find command from the shell and applies your calculations afterwards.
According to the documentation of numpy.genfromtxt(), the first argument can be a
File, filename, or generator to read.
That would mean that you could write a generator that yields the lines of all the files like this:
def csv_merge_generator(pattern):
for file in glob.glob(pattern):
for line in file:
yield line
# then using it like this
numpy.genfromtxt(csv_merge_generator('*.csv'))
should work. (I do not have numpy installed, so cannot test easily)
Here's a more succinct way to do this, given some path = "/path/to/dir/".
import glob
import pandas as pd
pd.concat([pd.read_csv(f) for f in glob.glob(path+'*.csv')])
Then you can apply your calculation to the whole dataset, or, if you want to apply it one by one:
pd.concat([process(pd.read_csv(f)) for f in glob.glob(path+'*.csv')])
The function below will return a dictionary containing a dataframe for each .csv file in the folder within your defined path.
import pandas as pd
import glob
import os
import ntpath
def panda_read_csv(path):
pd_csv_dict = {}
csv_files = glob.glob(os.path.join(path, "*.csv"))
for csv_file in csv_files:
file_name = ntpath.basename(csv_file)
pd_csv_dict['pd_' + file_name] = pd.read_csv(csv_file, sep=";", encoding='mac_roman')
locals().update(pd_csv_dict)
return pd_csv_dict
You can use pathlib glob functionality to list all .csv in a path, and pandas to read them.
Then it's only a matter of applying whatever function you want (which, if systematic, can also be done within the list comprehension)
import pands as pd
from pathlib import Path
path2csv = Path("/your/path/")
csvlist = path2csv.glob("*.csv")
csvs = [pd.read_csv(g) for g in csvlist ]
Another answer using list comprehension:
from os import listdir
files= [f for f in listdir("./") if f.endswith(".csv")]
You need to import the glob library and then use it like following:
import glob
path='C:\\Users\\Admin\\PycharmProjects\\db_conection_screenshot\\seclectors_absent_images'
filenames = glob.glob(path + "\*.png")
print(len(filenames))

How do you get a directory listing sorted by creation date in python?

What is the best way to get a list of all files in a directory, sorted by date [created | modified], using python, on a windows machine?
I've done this in the past for a Python script to determine the last updated files in a directory:
import glob
import os
search_dir = "/mydir/"
# remove anything from the list that is not a file (directories, symlinks)
# thanks to J.F. Sebastion for pointing out that the requirement was a list
# of files (presumably not including directories)
files = list(filter(os.path.isfile, glob.glob(search_dir + "*")))
files.sort(key=lambda x: os.path.getmtime(x))
That should do what you're looking for based on file mtime.
EDIT: Note that you can also use os.listdir() in place of glob.glob() if desired - the reason I used glob in my original code was that I was wanting to use glob to only search for files with a particular set of file extensions, which glob() was better suited to. To use listdir here's what it would look like:
import os
search_dir = "/mydir/"
os.chdir(search_dir)
files = filter(os.path.isfile, os.listdir(search_dir))
files = [os.path.join(search_dir, f) for f in files] # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))
Update: to sort dirpath's entries by modification date in Python 3:
import os
from pathlib import Path
paths = sorted(Path(dirpath).iterdir(), key=os.path.getmtime)
(put #Pygirl's answer here for greater visibility)
If you already have a list of filenames files, then to sort it inplace by creation time on Windows (make sure that list contains absolute path):
files.sort(key=os.path.getctime)
The list of files you could get, for example, using glob as shown in #Jay's answer.
old answer
Here's a more verbose version of #Greg Hewgill's answer. It is the most conforming to the question requirements. It makes a distinction between creation and modification dates (at least on Windows).
#!/usr/bin/env python
from stat import S_ISREG, ST_CTIME, ST_MODE
import os, sys, time
# path to the directory (relative or absolute)
dirpath = sys.argv[1] if len(sys.argv) == 2 else r'.'
# get all entries in the directory w/ stats
entries = (os.path.join(dirpath, fn) for fn in os.listdir(dirpath))
entries = ((os.stat(path), path) for path in entries)
# leave only regular files, insert creation date
entries = ((stat[ST_CTIME], path)
for stat, path in entries if S_ISREG(stat[ST_MODE]))
#NOTE: on Windows `ST_CTIME` is a creation date
# but on Unix it could be something else
#NOTE: use `ST_MTIME` to sort by a modification date
for cdate, path in sorted(entries):
print time.ctime(cdate), os.path.basename(path)
Example:
$ python stat_creation_date.py
Thu Feb 11 13:31:07 2009 stat_creation_date.py
There is an os.path.getmtime function that gives the number of seconds since the epoch
and should be faster than os.stat.
import os
os.chdir(directory)
sorted(filter(os.path.isfile, os.listdir('.')), key=os.path.getmtime)
Here's my version:
def getfiles(dirpath):
a = [s for s in os.listdir(dirpath)
if os.path.isfile(os.path.join(dirpath, s))]
a.sort(key=lambda s: os.path.getmtime(os.path.join(dirpath, s)))
return a
First, we build a list of the file names. isfile() is used to skip directories; it can be omitted if directories should be included. Then, we sort the list in-place, using the modify date as the key.
Here's a one-liner:
import os
import time
from pprint import pprint
pprint([(x[0], time.ctime(x[1].st_ctime)) for x in sorted([(fn, os.stat(fn)) for fn in os.listdir(".")], key = lambda x: x[1].st_ctime)])
This calls os.listdir() to get a list of the filenames, then calls os.stat() for each one to get the creation time, then sorts against the creation time.
Note that this method only calls os.stat() once for each file, which will be more efficient than calling it for each comparison in a sort.
In python 3.5+
from pathlib import Path
sorted(Path('.').iterdir(), key=lambda f: f.stat().st_mtime)
Without changing directory:
import os
path = '/path/to/files/'
name_list = os.listdir(path)
full_list = [os.path.join(path,i) for i in name_list]
time_sorted_list = sorted(full_list, key=os.path.getmtime)
print time_sorted_list
# if you want just the filenames sorted, simply remove the dir from each
sorted_filename_list = [ os.path.basename(i) for i in time_sorted_list]
print sorted_filename_list
from pathlib import Path
import os
sorted(Path('./').iterdir(), key=lambda t: t.stat().st_mtime)
or
sorted(Path('./').iterdir(), key=os.path.getmtime)
or
sorted(os.scandir('./'), key=lambda t: t.stat().st_mtime)
where m time is modified time.
Here's my answer using glob without filter if you want to read files with a certain extension in date order (Python 3).
dataset_path='/mydir/'
files = glob.glob(dataset_path+"/morepath/*.extension")
files.sort(key=os.path.getmtime)
# *** the shortest and best way ***
# getmtime --> sort by modified time
# getctime --> sort by created time
import glob,os
lst_files = glob.glob("*.txt")
lst_files.sort(key=os.path.getmtime)
print("\n".join(lst_files))
sorted(filter(os.path.isfile, os.listdir('.')),
key=lambda p: os.stat(p).st_mtime)
You could use os.walk('.').next()[-1] instead of filtering with os.path.isfile, but that leaves dead symlinks in the list, and os.stat will fail on them.
For completeness with os.scandir (2x faster over pathlib):
import os
sorted(os.scandir('/tmp/test'), key=lambda d: d.stat().st_mtime)
this is a basic step for learn:
import os, stat, sys
import time
dirpath = sys.argv[1] if len(sys.argv) == 2 else r'.'
listdir = os.listdir(dirpath)
for i in listdir:
os.chdir(dirpath)
data_001 = os.path.realpath(i)
listdir_stat1 = os.stat(data_001)
listdir_stat2 = ((os.stat(data_001), data_001))
print time.ctime(listdir_stat1.st_ctime), data_001
Alex Coventry's answer will produce an exception if the file is a symlink to an unexistent file, the following code corrects that answer:
import time
import datetime
sorted(filter(os.path.isfile, os.listdir('.')),
key=lambda p: os.path.exists(p) and os.stat(p).st_mtime or time.mktime(datetime.now().timetuple())
When the file doesn't exist, now() is used, and the symlink will go at the very end of the list.
This was my version:
import os
folder_path = r'D:\Movies\extra\new\dramas' # your path
os.chdir(folder_path) # make the path active
x = sorted(os.listdir(), key=os.path.getctime) # sorted using creation time
folder = 0
for folder in range(len(x)):
print(x[folder]) # print all the foldername inside the folder_path
folder = +1
Here is a simple couple lines that looks for extention as well as provides a sort option
def get_sorted_files(src_dir, regex_ext='*', sort_reverse=False):
files_to_evaluate = [os.path.join(src_dir, f) for f in os.listdir(src_dir) if re.search(r'.*\.({})$'.format(regex_ext), f)]
files_to_evaluate.sort(key=os.path.getmtime, reverse=sort_reverse)
return files_to_evaluate
Add the file directory/folder in path, if you want to have specific file type add the file extension, and then get file name in chronological order.
This works for me.
import glob, os
from pathlib import Path
path = os.path.expanduser(file_location+"/"+date_file)
os.chdir(path)
saved_file=glob.glob('*.xlsx')
saved_file.sort(key=os.path.getmtime)
print(saved_file)
Turns out os.listdir sorts by last modified but in reverse so you can do:
import os
last_modified=os.listdir()[::-1]
Maybe you should use shell commands. In Unix/Linux, find piped with sort will probably be able to do what you want.

Categories

Resources