Assume I have a CSV file data.csv located in the directory 'C:\Users\rp603\OneDrive\Documents\Python Scripts\Basics\tutorials\Revision\datasets'. Using this code, I can access my csv file:
## read the csv file from a particular folder
import pandas as pd
import glob
files = glob.glob(r"C:\Users\rp603\OneDrive\Documents\Python Scripts\Basics\tutorials\Revision\datasets\*.csv")
df = pd.DataFrame()
for f in files:
    csv = pd.read_csv(f)
    df = df.append(csv)
But as you can see, the CSV file path is long. Is there any way to do the same operation while shortening the path as well as the lines of code?
use the "dot" notation for a relative path (it does not depend on the programming language)
# example for a "shorter" version of the path
import os
my_current_position = '.'  # where you launch the program from
file_path = r"C:\Users\rp603\OneDrive\Documents\Python Scripts\Basics\tutorials\Revision\datasets\data.csv"  # from above
print(os.path.relpath(file_path, my_current_position))
Remark: relpath is order-sensitive (the first argument is the target path, the second is the start directory).
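For illustration, a minimal sketch of that order sensitivity (hypothetical POSIX-style paths):
import os
# relpath(path, start) answers: "how do I get from start to path?"
print(os.path.relpath('/home/user/datasets', '/home/user'))   # -> 'datasets'
print(os.path.relpath('/home/user', '/home/user/datasets'))   # -> '..'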
You can use a context manager to open the file. Not shorter, but more elegant:
with open(file, 'r') as fd:
    data_table = pd.read_csv(fd)
If you put your script in the same directory as the datasets folder, you can simply do:
import glob
files = glob.glob("datasets/*.csv")
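Putting it together, a minimal sketch of the shortened version (assuming the script sits next to the datasets folder; note pd.concat is used here instead of the deprecated DataFrame.append):
import glob
import pandas as pd

# relative pattern instead of the long absolute path
files = glob.glob("datasets/*.csv")
# one concat call replaces the df.append loop
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)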
I've been searching for a way to merge all CSV files in a folder. They all have the same headers, but different names. I've found some videos on YouTube about merging, and some questions here on Stack Overflow that touch on the matter. The problem is that these tutorials focus on files with the same name, such as: sales1, sales2, etc.
In my case, all files in the directory are CSVs and are located in 'D:\XXXX\XXXX\output'
The code I have used is:
import pandas as pd
# set files path
amazon = r'D:\XXXX\XXXX\output\amazonbooks.csv'
bookcrossing = r'D:\XXXX\XXXX\output\bookcrossing.csv'
# merge files
dataFrame = pd.concat(
    map(pd.read_csv, [amazon, bookcrossing]), ignore_index=True)
print(dataFrame)
If the code could merge all the files in the output folder (since all of them are .csv), instead of naming each one of them, it would be better.
I'd be glad if anyone can help me with this problem, or can guide me on how to solve it.
If the goal is to append the files into a single result, you don't really need any CSV processing at all. Just write out each file's contents minus the header line (except for the first file). glob will return the file names with path that match the pattern "*.csv".
from glob import glob
import os
import shutil

csv_dir = r'D:\XXXX\XXXX\output'
result_csv = r'D:\XXXX\XXXX\combined.csv'
first_hdr = True

# all .csv files in the directory have the same header
with open(result_csv, "w", newline="") as result_file:
    for filename in glob(os.path.join(csv_dir, "*.csv")):
        with open(filename) as in_file:
            header = in_file.readline()
            if first_hdr:
                result_file.write(header)
                first_hdr = False
            shutil.copyfileobj(in_file, result_file)
(assuming all csvs have equal number of columns)
Try something like this:
import os
import pandas as pd

path = r'D:\XXXX\XXXX\output'
csvs = [file for file in os.listdir(path) if file.endswith('.csv')]
result_df = pd.concat([pd.read_csv(os.path.join(path, file)) for file in csvs])
I'm writing a script to import a csv using pandas, then upload it to a SQL server. It all works when I have a test file with one name, but that won't be the case in production. I need to figure out how to import with only a semi-known filename (the filename will always follow the same convention, but there will be differences in the filenames such as dates/times). I should, however, note that there will only ever be one csv file in the folder it's checking. I have tried the wildcard in the import path, but that didn't work.
After it's imported, I also then need to return the filename.
Thanks!
Look into the os module:
import os
files = os.listdir("./")
csv_files = [filename for filename in files if filename.endswith(".csv")]
csv_files is a list of all the files that end with .csv
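Since the question says there will only ever be one CSV in the folder, a minimal sketch building on this (the folder path "./" is an assumption):
import os
import pandas as pd

folder = "./"  # the folder being checked (assumed)
csv_files = [f for f in os.listdir(folder) if f.endswith(".csv")]
filename = csv_files[0]  # per the question, exactly one CSV exists
df = pd.read_csv(os.path.join(folder, filename))
print(filename)  # the filename is still available after the import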
I would like to automatically import all csv files that are in one folder as dataframes and set the dataframe's variable name to the respective filename.
For example, in the folder are the following three files: data1.csv, data2.csv and data3.csv
How can I automatically import all three files having three dataframes (data1, data2 and data3) as the result?
If you want to save each dataframe as a variable named after its own file name, you can do it like this. But it is not secure: it could cause code injection.
import pandas
import os

path = "path_of_directory"
files = os.listdir(path)  # returns the list of files in the folder at the specified path
for file in files:
    if file.endswith(".csv"):  # checking whether the file ends with .csv
        # os.sep is the path separator of the operating system
        exec(f"{file[:-4]} = pandas.read_csv(r'{path + os.sep + file}')")
You can loop over the directory using pathlib and build a dictionary of name->DataFrame, eg:
import pathlib
import pandas as pd
dfs = {path.stem: pd.read_csv(path) for path in pathlib.Path('thepath/').glob('*.csv')}
Then access as dfs['test1'] etc...
Since the answer that was given includes an exec command, and munir.aygun has already warned you what could go wrong with that approach, I now want to show you the way to do it that Justin Ezequiel and munir.aygun already suggested:
import os
import glob
import pandas as pd

# Path to your data
path = r'D:\This\is\your\path'

# Get all .csv files at your path
allFiles = glob.glob(path + "/*.csv")

# Read in the data from files and save to dictionary
dataStorage = {}
for filename in allFiles:
    name = os.path.basename(filename).split(".")[0]
    dataStorage[name] = pd.read_csv(filename)

# Can be used then like this (for printing here)
if "data1" in dataStorage:
    print(dataStorage["data1"])
Hope this can still be helpful.
I have a series of files that are in the following format:
file_1991.xlsx
file_1992.xlsx
# there are some gaps in the file numbering sequence
file_1995.xlsx
file_1996.xlsx
file_1997.xlsx
For each file I want to do something like:
import pandas as pd
data_1995 = pd.read_excel(open(directory + 'file_1995.xlsx', 'rb'), sheet_name='Sheet1')
do some work on the data, and save it as another file:
output_1995 = pd.ExcelWriter('output_1995.xlsx')
data_1995.to_excel(output_1995, 'Sheet1')
output_1995.close()  # actually write the file to disk
Instead of doing all of this for every single file, how can I iterate through multiple files and repeat the same operations on each? In other words, I would like to iterate over all the files (they mostly follow a numerical sequence in their names, but there are some gaps in the sequence).
Thanks for the help in advance.
You can use os.listdir or glob module to list all files in a directory.
With os.listdir, you can use fnmatch to filter files like this (you can use a regex too):
import fnmatch
import os

for file in os.listdir('my_directory'):
    if fnmatch.fnmatch(file, '*.xlsx'):
        # join with the directory: listdir returns bare file names
        pd.read_excel(open(os.path.join('my_directory', file), 'rb'), sheet_name='Sheet1')
        """ Do your thing to file """
Or with glob module (which is a shortcut for the fnmatch + listdir) you can do the same like this (or with a regex):
import glob

for file in glob.glob("/my_directory/*.xlsx"):
    pd.read_excel(open(file, 'rb'), sheet_name='Sheet1')
    """ Do your thing to file """
You should use Python's glob module: https://docs.python.org/3/library/glob.html
For example:
import glob

for path in glob.iglob(directory + "file_*.xlsx"):
    pd.read_excel(path)
    # ...
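To also write each result back out, as described in the question, here is a sketch (the processing step is a placeholder, and the output name is derived from the input name):
import os
import glob
import pandas as pd

for path in glob.iglob(os.path.join(directory, "file_*.xlsx")):
    data = pd.read_excel(path, sheet_name='Sheet1')
    # ... do some work on data (placeholder) ...
    # build output_1995.xlsx from file_1995.xlsx, mirroring the question
    year = os.path.basename(path).replace('file_', '').replace('.xlsx', '')
    data.to_excel(f'output_{year}.xlsx', sheet_name='Sheet1')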
I would recommend glob.
Doing glob.glob('file_*') returns a list which you can iterate on and do work.
Doing glob.iglob('file_*') returns a generator object which is an iterator.
The first one will give you something like:
['file_1991.xlsx','file_1992.xlsx','file_1995.xlsx','file_1996.xlsx']
If you know how your file names are constructed, you might try to open each file with the 'r' mode, so that open(..., 'r') fails if the file does not exist.
yearly_data = {}
for year in range(1990, 2018):
    try:
        f = open('file_%4.4d.xlsx' % year, 'r')
    except FileNotFoundError:
        continue  # to the next year
    yearly_data[year] = ...
    f.close()
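A variant of the same idea using pandas directly (a sketch; pd.read_excel opens the file itself, so no explicit open/close is needed):
import os
import pandas as pd

yearly_data = {}
for year in range(1990, 2018):
    file_name = 'file_%4.4d.xlsx' % year
    if not os.path.exists(file_name):
        continue  # skip the gaps in the sequence
    yearly_data[year] = pd.read_excel(file_name, sheet_name='Sheet1')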
I hope this is not trivial but I am wondering the following:
If I have a specific folder with n csv files, how could I iteratively read all of them, one at a time, and perform some calculations on their values?
For a single file, for example, I do something like this and perform some calculations on the x array:
import numpy
directoryPath = input('Directory path for native csv file: ')
csvfile = numpy.genfromtxt(directoryPath, delimiter=",")
x = csvfile[:, 2]  # creates the array that will undergo a set of calculations
I know that I can check how many csv files there are in a given folder:
import glob
for files in glob.glob("*.csv"):
    print(files)
But I failed to figure out how to possibly nest the numpy.genfromtxt() function in a for loop, so that I read in all the csv files of a directory that it is up to me to specify.
EDIT
The folder I have only has jpg and csv files. The latter are named eventX.csv, where X ranges from 1 to 50. The for loop I am referring to should therefore consider the file names the way they are.
That's how I'd do it:
import os

directory = os.path.join("c:\\", "path")
for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith(".csv"):
            # join root and file: os.walk yields bare file names
            f = open(os.path.join(root, file), 'r')
            # perform calculation
            f.close()
Using pandas and glob as the base packages
import glob
import pandas as pd

glued_data = pd.DataFrame()
for file_name in glob.glob(directoryPath + '*.csv'):
    x = pd.read_csv(file_name, low_memory=False)
    glued_data = pd.concat([glued_data, x], axis=0)
I think you are looking for something like this:
import glob
for file_name in glob.glob(directoryPath + '*.csv'):
    x = np.genfromtxt(file_name, delimiter=',')[:, 2]
    # do your calculations
Edit
If you want to get all csv files from a folder (including subfolders) you could use subprocess instead of glob (note that this code only works on Linux systems):
import subprocess

# text=True makes check_output return a str instead of bytes
file_list = subprocess.check_output(['find', directoryPath, '-name', '*.csv'], text=True).split('\n')[:-1]
for i, file_name in enumerate(file_list):
    x = np.genfromtxt(file_name, delimiter=',')[:, 2]
    # do your calculations
    # now you can use i as an index
It first searches the folder and sub-folders for all file names using the find command from the shell, and applies your calculations afterwards.
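As a cross-platform alternative (a sketch, not part of the original answer), glob itself can search subfolders via the recursive flag:
import glob
import os

# '**' with recursive=True descends into subfolders on any OS
file_list = glob.glob(os.path.join(directoryPath, '**', '*.csv'), recursive=True)
for i, file_name in enumerate(file_list):
    x = np.genfromtxt(file_name, delimiter=',')[:, 2]
    # do your calculations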
According to the documentation of numpy.genfromtxt(), the first argument can be a
File, filename, or generator to read.
That would mean that you could write a generator that yields the lines of all the files like this:
import glob

def csv_merge_generator(pattern):
    for file_name in glob.glob(pattern):
        with open(file_name) as f:
            for line in f:
                yield line
# then using it like this
numpy.genfromtxt(csv_merge_generator('*.csv'), delimiter=',')
should work. (I do not have numpy installed, so cannot test easily)
Here's a more succinct way to do this, given some path = "/path/to/dir/".
import glob
import pandas as pd
pd.concat([pd.read_csv(f) for f in glob.glob(path+'*.csv')])
Then you can apply your calculation to the whole dataset, or, if you want to apply it one by one:
pd.concat([process(pd.read_csv(f)) for f in glob.glob(path+'*.csv')])
The function below will return a dictionary containing a dataframe for each .csv file in the folder within your defined path.
import pandas as pd
import glob
import os
import ntpath

def panda_read_csv(path):
    pd_csv_dict = {}
    csv_files = glob.glob(os.path.join(path, "*.csv"))
    for csv_file in csv_files:
        file_name = ntpath.basename(csv_file)
        pd_csv_dict['pd_' + file_name] = pd.read_csv(csv_file, sep=";", encoding='mac_roman')
    return pd_csv_dict
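A hypothetical usage example (the folder and file names are assumptions; note the function reads with sep=";" and mac_roman encoding, so adjust those to your data):
dfs = panda_read_csv(r"/path/to/folder")
# each frame is keyed by 'pd_' plus the file name, e.g. data1.csv -> 'pd_data1.csv'
print(dfs['pd_data1.csv'].head())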
You can use pathlib glob functionality to list all .csv in a path, and pandas to read them.
Then it's only a matter of applying whatever function you want (which, if systematic, can also be done within the list comprehension)
import pandas as pd
from pathlib import Path

path2csv = Path("/your/path/")
csvlist = path2csv.glob("*.csv")
csvs = [pd.read_csv(g) for g in csvlist]
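For instance, a sketch of applying a function inside the comprehension, as mentioned above (my_calculation is a hypothetical placeholder):
def my_calculation(df):
    return df.describe()  # placeholder per-file computation

results = [my_calculation(pd.read_csv(g)) for g in path2csv.glob("*.csv")]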
Another answer using list comprehension:
from os import listdir
files = [f for f in listdir("./") if f.endswith(".csv")]
You need to import the glob library and then use it as follows:
import glob

path = 'C:\\Users\\Admin\\PycharmProjects\\db_conection_screenshot\\seclectors_absent_images'
filenames = glob.glob(path + "\\*.png")
print(len(filenames))