I have a directory with csvs whose filenames represent the id of a row of a database.
I would like to read in this directory into a Pandas dataframe and join it to an existing dataframe.
Is there any way in Python to read the results of an 'ls' command into a pandas Dataframe?
I've tried getting a string of the filenames with the code below but I'm having trouble figuring out how to get it into a dataframe after.
import os
files = ''
for root, dirs, files in os.walk("."):
for filename in files:
files += filename
You are able to walk the files, now you just need to read the csv and concat them onto a dataframe.
import os
import pandas as pd
df = None
for root, dirs, files in os.walk('.'):
for filename in files:
if not df:
df = pd.read_csv(filename)
df['filename'] = filename
continue
tmp = pd.read_csv(filename)
tmp['filename'] = filename
df = pd.concat(df, tmp)
Related
I have a folder. In this folder are 48 xlsx files, but the count of the relevant files are 22. Them name of these 22 files have no structure, the only thing in common is that the filenames start with data. I would love to access this files and read them all into a dataframe. Doing this manually with the code line
df = pd.read_excel(filename, engine='openpyxl')
takes too long
The table structure is similar but not always exactly the same. How can I manage to solve this problem
import os
import pandas as pd
dfs = {}
def get_files(extension, location):
xlsx_list = []
for root, dirs, files in os.walk(location):
for t in files:
if t.endswith(extension):
xlsx_list.append(t)
return xlsx_list
file_list = get_files('.xlsx', '.')
index = 0
for filename in file_list:
index += 1
df = pd.read_excel(filename, engine='openpyxl')
dfs[filename] = df
print(dfs)
each element in dfs like dfs['file_name_here.xlsx'] accesses the data frame output from the read_excel.
EDIT: that you can add additional criteria to filter through the XLSX files at the line if t.endswith(extension): you can check out the beginning of the file like if t.startswith('data'): too. Maybe combine them if t.startswith('data') and t.endswith(extension):
I have problem No objects to concatenate. I can not import .csv files from main and its subdirectories to concatenate them into one DataFrame. I am using pandas. Old answers did not help me so please do not mark as duplicated.
Folder structure is like that
main/*.csv
main/name1/name1/*.csv
main/name1/name2/*.csv
main/name2/name1/*.csv
main/name3/*.csv
import pandas as pd
import os
import glob
folder_selected = 'C:/Users/jacob/Documents/csv_files'
not works
frame = pd.concat(map(pd.read_csv, glob.iglob(os.path.join(folder_selected, "/*.csv"))))
not works
csv_paths = glob.glob('*.csv')
dfs = [pd.read_csv(folder_selected) for folder_selected in csv_paths]
df = pd.concat(dfs)
not works
all_files = []
all_files = glob.glob (folder_selected + "/*.csv")
file_path = []
for file in all_files:
df = pd.read_csv(file, index_col=None, header=0)
file_path.append(df)
frame = pd.concat(file_path, axis=0, ignore_index=False)
You need to search the subdirectories recursively.
folder = 'C:/Users/jacob/Documents/csv_files'
path = folder+"/**/*.csv"
Using glob.iglob
df = pd.concat(map(pd.read_csv, glob.iglob(path, recursive=True)))
Using glob.glob
csv_paths = glob.glob(path, recursive=True)
dfs = [pd.read_csv(csv_path) for csv_path in csv_paths]
df = pd.concat(dfs)
Using os.walk
file_paths = []
for base, dirs, files in os.walk(folder):
for file in fnmatch.filter(files, '*.csv'):
file_paths.append(os.path.join(base, file))
df = pd.concat([pd.read_csv(file) for file in file_paths])
Using pathlib
from pathlib import Path
files = Path(folder).rglob('*.csv')
df = pd.concat(map(pd.read_csv, files))
Check Dask Library as following, which reads many files to one df
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
Read their docs
https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files
Python’s pathlib is a tool for such tasks
from pathlib import Path
FOLDER_SELECTED = 'C:/Users/jacob/Documents/csv_files'
path = Path(FOLDER_SELECTED) / Path("main")
# grab all csvs in main and subfolders
df = pd.concat(pd.read_csv(f.name) for f in path.rglob("*.csv"))
Note:
If the CSV need preprocing, you can create a read_csv function to deal with issues and place it in place of pd.read_csv
I am trying read in a folder of CSV files, process them one by one to remove duplicates, and then add them to a master dataframe which will then finally be output to a CSV. I have this...
import pandas as pd
import os
import sys
output = pd.DataFrame(columns=['col1', 'col2'])
for root, dirs, files in os.walk("sourcefolder", topdown=False):
for name in files:
data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
output.append(data)
output.to_csv("output.csv", index=False, encoding='utf8')
But my output CSV is empty apart fom the column names. Anyone any idea where I am going wrong?
Pandas dataframes don't act like a list so you can't use append like that try:
import pandas as pd
import os
import sys
output = pd.DataFrame(columns=['col1', 'col2'])
for root, dirs, files in os.walk("sourcefolder", topdown=False):
for name in files:
data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
output = output.append(data)
output_df.to_csv("output.csv", index=False, encoding='utf8')
Alternatively you can make output a list of dataframes and then use pd.concat to create a consolidated dataframe at the end, depending on the volume of data this could be more efficient
The built in pandas method concat is also pretty good. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat
import pandas as pd
import os
import sys
output = pd.DataFrame(columns=['col1', 'col2'])
for root, dirs, files in os.walk("sourcefolder", topdown=False):
for name in files:
data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
output = pd.concat([output, data], columns=output.columns)
output_df.to_csv("output.csv", index=False, encoding='utf8')
I'm trying to merge a bunch of xlsx files into a single pandas dataframe in python. Furthermore, I want to include a column that lists the source file for each row. My code is as follows:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import glob
import os
# get the path for where the xlsx files are
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
# create new dataframe
df = pd.DataFrame()
# read data from files and add into dataframe
for f in files_xlsx:
data = pd.read_excel(f, 'Sheet 1')
df['Source_file'] = f
df = df.append(data)
however, when I look at the 'Source_file' column it lists the final file it reads as the name for every row. I've spent way more time than I should trying to fix this. What am I doing wrong?
within your for loop you are writing over each iteration of df so you'll only get back the final file,
what you need to do is delcare a list before hand and append do that,
since you called glob lets use that as well.
files = glob.glob(os.path.join(os.getcwd()) + '\*.xlsx')
dfs = [pd.read_excel(f,sheet_name='Sheet1') for f in files]
df = pd.concat(dfs)
if you want to add the filename into the df too then,
files = glob.glob(os.path.join(os.getcwd()) + '\*.xlsx')
dfs = [pd.read_excel(f,sheet_name='Sheet1') for f in files]
file_names = [os.path.basename(f) for f in files]
df = pd.concat(dfs,keys=file_names)
Using Pathlib module (recommended Python 3.4+)
from pathlib import Path
files = [f for f in Path.cwd().glob('*.xlsx')]
dfs = [pd.read_excel(f,sheet_name='Sheet1') for f in files]
file_names = [f.stem for f in files]
df = pd.concat(dfs,keys=file_names)
or as a one liner :
df = pd.concat([pd.read_excel(f) for f in Path.cwd().glob('*.xlsx')],keys=[f.stem for f in Path.cwd().glob('*.xlsx')],sort=False)
I am trying to open all files from a folder, store them in a dataframe and append each csv file with another csv file called Append.csv and am trying to write all the files with their names to a different folder.
For example I have 5 csv files that are saved in a folder called CSV FILES FOLDER. These files are F1.csv, F2.csv, F3.csv, F4.csvand F5.csv. What I am trying to do is open each file using pandas and I do this in a for loop, Append.csv and now store it in a different folder called NEW CSV FILES FOLDER as :
F1_APPENDED.csv
F2_APPENDED.csv
F3_APPENDED.csv
F4_APPENDED.csv
In other words, the _APPENDED is added with each file and then the file with the new name having _APPENDED is saved.
I have already defined the path for this folder but cant save it. The code is as below :
import pandas as pd
import glob
import os.path
import pathlib
path =r'C:\Users\Ahmed Ismail Khalid\Desktop\CSV FILES FOLDER'
allFiles = glob.glob(path + "/*.csv")
path1 = r'C:\Users\Ahmed Ismail Khalid\Desktop\Different Folder\Bitcoin Prices Hourly Based.csv'
outpath = r'C:\Users\Ahmed Ismail Khalid\Desktop\NEW CSV FILES FOLDER'
for f in allFiles:
file = open(f, 'r')
df1 = pd.read_csv(path1)
df2 = pd.read_csv(f)
output = pd.merge(df1, df2, how="inner", on="created_at")
df3 = output.created_at.value_counts().rename_axis('created_at').reset_index(name='count')
df3 = df3.sort_values(by=['created_at'])
#print(df3,'\n\n')
df3.to_csv(outpath+f, encoding='utf-8',index=False)
#print(f,'\n\n')
How can I do this? I tried to look up the official documentation but couldn't understand anything
Any and all help would be appreciated
Thanks
Here, I added a line in the for loop where you can get just the file name. You can use that instead of the full path to the file when you write the file and indicate the output .csv filename.
import pandas as pd
import glob
import os.path
import pathlib
path =r'C:\Users\Ahmed Ismail Khalid\Desktop\CSV FILES FOLDER'
allFiles = glob.glob(path + "/*.csv")
path1 = r'C:/Users/Ahmed Ismail Khalid/Desktop/Different Folder/Bitcoin Prices Hourly Based.csv'
# You need to have a slash at the end so it knows it's a folder
outpath = r'C:/Users/Ahmed Ismail Khalid/Desktop/NEW CSV FILES FOLDER/'
for f in allFiles:
file = open(f, 'r')
_, fname = os.path.split(f)
fname, ext = os.path.splittext(fname)
df1 = pd.read_csv(path1)
df2 = pd.read_csv(f)
output = pd.merge(df1, df2, how="inner", on="created_at")
df3 = output.created_at.value_counts().rename_axis('created_at').reset_index(name='count')
df3 = df3.sort_values(by=['created_at'])
#print(df3,'\n\n')
df3.to_csv(outpath+fname+'_appended.csv', encoding='utf-8',index=False)
#print(f,'\n\n')