Create and assign recursively to dataframes in Pandas - python

I want to read csv files from a directory and assign each to a different dataframe. I have tried to do so like this:
path = r'C:\Users\A\Documents\Dash'
files = glob.glob(path + "/*.csv")
for file in files:
    f'df{file}' = pd.read_csv(file, sep=',')
But of course I can't assign to a literal like that, and I don't see another way to do it. I don't really care whether each dataframe is numbered or named after its csv file.

You could do it this way:
for index, file in enumerate(files):
    vars()['df' + str(index)] = pd.read_csv(file, sep=',')

print(df0)
print(df1)

files = glob.glob(path + "/*.csv")
## using map & zip:
df_list = list(map(lambda x: pd.read_csv(x, sep=","), files)) # result in list
df_dict = dict(zip(files, df_list)) # result in dict
## using for loop:
# result in list
df_list = list()
for file in files:
    df_list.append(pd.read_csv(file, sep=','))
# result in dict
df_dict = dict()
for file in files:
    df_dict[file] = pd.read_csv(file, sep=',')
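If the full paths make awkward keys, a variant (just a sketch, assuming the os module is imported) keys the dict on each file's base name instead:
import os

df_dict = {os.path.basename(file): pd.read_csv(file, sep=',') for file in files}
# e.g. df_dict['sales.csv'] would hold the frame read from ...\Dash\sales.csv
# ('sales.csv' is a hypothetical filename)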

Related

Adding file names from list to dataframe - Python

I previously used the script below to find all csv files in a folder and append them to a dataframe. Now I want to append specified files to a new dataframe.
#define path for all CSV files
path = r'C:filepath'
csv_files = glob.glob(os.path.join(path, "*.csv"))
li = []
#removes rows with missing data and appends file to data frame
for csv in csv_files:
    df = pd.read_csv(csv, index_col=None, header=0)
    df = df.loc[(df['A'].notna()) & (df['B'].notna()) & (df['C'].notna())]
    li.append(df)
What I would like to do is add something like:
file_list = ['name1', 'name2', 'name3']
To add only the files in the file list to the df.
I think I got it, thanks in large part to gtomer.
for file in file_list:
    try:
        df = pd.read_csv(path + file + '.csv', index_col=None, header=0)
        df = df.loc[(df['A'].notna()) & (df['B'].notna()) & (df['C'].notna())]
        li.append(df)
    except:
        print(file)
Once you have a list, you can loop through the items in the list and perform your desired action:
file_list = ['name1', 'name2', 'name3']
for csv in file_list:
    df = pd.read_csv(csv, index_col=None, header=0)
    df = df.loc[(df['A'].notna()) & (df['B'].notna()) & (df['C'].notna())]
    li.append(df)
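Note that this loop assumes the entries in file_list are readable paths as-is; if they are bare names like 'name1', the full path has to be built first. A minimal sketch, assuming the same path variable and the '.csv' extension from the accepted snippet:
import os

for name in file_list:
    csv_path = os.path.join(path, name + '.csv')  # assumption: files live under path with a .csv extension
    df = pd.read_csv(csv_path, index_col=None, header=0)
    li.append(df)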

How to create variables and read several excel files in a loop with pandas?

L = [('X1', "A"), ('X2', "B"), ('X3', "C")]
for i in range(len(L)):
    path = os.path.join(L[i][1] + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    ''.join(L[i][0]) = pd.read_excel(xls, 'Sheet1')
File "<ipython-input-1-6220ffd8958b>", line 6
''.join(L[i][0])=pd.read_excel(xls,'Sheet1')
^
SyntaxError: can't assign to function call
I have a problem with pandas: I can't create several dataframes for several excel files, because I don't know how to create the variables.
I'll need a result that looks like this :
X1 will have dataframe of A.xlsx
X2 will have dataframe of B.xlsx
.
.
.
Solved :
d = {}
for i, value in L:
    path = os.path.join(value + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    df = pd.read_excel(xls, 'Sheet1')
    key = 'df-' + str(i)
    d[key] = df
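With the dictionary built, each frame is fetched by its key, which here is 'df-' plus the first tuple element:
print(d['df-X1'].head())  # the frame read from A.xlsx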
Main pull:
I would approach this by reading everything into 1 dataframe (loop over files, and concat):
import os
import pandas as pd

files = []  # generate list for files to go into
path_of_directory = "path/to/folder/"
for dirname, dirnames, filenames in os.walk(path_of_directory):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))

output_data = []  # blank list for building up dfs
for name in files:
    df = pd.read_excel(name)
    df['name'] = os.path.basename(name)
    output_data.append(df)

total = pd.concat(output_data, ignore_index=True, sort=True)
Then:
From there you can interrogate the combined frame using total.loc[total['name'] == 'choice'].
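For instance, to pull out only the rows that came from one particular workbook (the filename here is hypothetical):
subset = total.loc[total['name'] == 'report.xlsx']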
Or (in keeping with your question):
You could then split into a dictionary of dataframes, based on this column. This is generally the better approach:
import copy

def split_by_column(df, column):
    # return a dict with one dataframe per unique value of column
    dictionary = {}
    df[column] = df[column].astype(str)
    col_values = df[column].unique()
    for value in col_values:
        key_name = 'df' + str(value)
        dictionary[key_name] = copy.deepcopy(df)
        dictionary[key_name] = dictionary[key_name][df[column] == value]
        dictionary[key_name].reset_index(inplace=True, drop=True)
    return dictionary
The reason for this approach is discussed in Create new dataframe in pandas with dynamic names also add new column, which basically says that dynamically naming dataframes is bad and the dict approach is best.
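Calling the helper on the combined frame then gives one frame per source file (split_by_column is just the name given to the snippet above, not a pandas function):
frames = split_by_column(total, 'name')
print(frames.keys())  # keys look like 'df<value>', e.g. 'dfreport.xlsx' (hypothetical)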
This might help.
files_xls = ['all your excel filenames go here']
frames = []
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    frames.append(data)
df = pd.concat(frames, ignore_index=True)  # DataFrame.append was removed in pandas 2.0; concat a list instead
print(df)

Name a dataframe based on csv file name?

Trying to batch analyze a folder full of .csv files, then save them out again based on the .csv name. However, I'm having trouble extracting just the file name and assigning it to the dataframe (df).
import glob
import pandas as pd
path = r'csv_in'
allFiles = glob.glob(path + '/*.csv')
for file_ in allFiles:
    df = pd.read_csv(file_, header=0)
    df.name = file_
    print(df.name)
The print result I get is "csv_in/*.csv".
The result I'm looking for is just the csv name, "*.csv"
Create a new column with [], using os.path.basename together with os.path.normpath:
import os

for file_ in allFiles:
    df = pd.read_csv(file_, header=0)
    df['name'] = os.path.basename(os.path.normpath(file_))
    # if you need to remove the extension (csv):
    # df['name'] = os.path.splitext(os.path.basename(file_))[0]
    print(df.name)

Reading text files from subfolders and folders and creating a dataframe in pandas for each file text as one observation

I have the following architecture of the text files in the folders and subfolders.
I want to read them all and create a df. I am using this code, but it doesn't work well for me: the text is not what I expected, and the number of files read doesn't match my count.
l = [pd.read_csv(filename,header=None, encoding='iso-8859-1') for filename in glob.glob("2018_01_01/*.txt")]
main_df = pd.concat(l, axis=1)
main_df = main_df.T
for i in range(2):
    l = [pd.read_csv(filename, header=None, encoding='iso-8859-1', quoting=csv.QUOTE_NONE) for filename in glob.glob(str(foldernames[i+1]) + '/' + '*.txt')]
    df = pd.concat(l, axis=1)
    df = df.T
    main_df = pd.merge(main_df, df)
Assuming those directories contain txt files whose contents all share the same structure:
import os
import pandas as pd

rows = []
path = '/path/to/directory/of/directories/'
for directory in os.listdir(path):
    dir_path = os.path.join(path, directory)  # listdir yields bare names, so rebuild the full path
    if os.path.isdir(dir_path):
        for filename in os.listdir(dir_path):
            with open(os.path.join(dir_path, filename)) as f:
                rows.append({'observation': f.read()})
df = pd.DataFrame(rows, columns=['observation'])  # build once; DataFrame.append was removed in pandas 2.0
Once all your files have been iterated, df should be the DataFrame containing all the information in your different txt files.
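If the files can sit more than one level deep, a recursive glob is an alternative; a sketch using pathlib (rglob walks every subfolder, and the encoding mirrors the question's):
from pathlib import Path
import pandas as pd

rows = [{'observation': p.read_text(encoding='iso-8859-1')} for p in Path(path).rglob('*.txt')]
df = pd.DataFrame(rows)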
You can do that using a for loop. But before that, you need to give sequenced names to all the files, like 'fil_0' within 'fol_0', 'fil_1' within 'fol_1', 'fil_2' within 'fol_2', and so on. That facilitates the use of a for loop:
import pandas as pd

dataframes = []
for var in range(1000):
    name = "fol_" + str(var) + "/fil_" + str(var) + ".txt"
    dataframes.append(pd.read_csv(name))  # if you need to use all the files at once
    # otherwise, read the files one by one:
    # df = pd.read_csv(name)
It will automatically create dataframes for each file.
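If renaming thousands of files is not practical, matching the existing names with a pattern works too; a sketch assuming the fol_*/fil_*.txt layout described above:
import glob

dataframes = [pd.read_csv(name) for name in sorted(glob.glob('fol_*/fil_*.txt'))]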

Using date as index in output file

I have several excel files whose filenames differ only by date. I have to concatenate all these files, with the dates from the filenames as the index. I have written the following code:
path = r"C:\\Users\\atcs\\Desktop\\data science\\files\\1-Danny Jones KPI's\\Source\\"
fileName = glob.glob(os.path.join(path, "*.xlsx"))
df = (pd.read_excel(f, header=None, sheetname = "YTD Summary_4") for f in fileName)
k = (re.search("([0-9]{1,2}\-[0-9]{1,2}\-[0-9]{4})", fileName))
concatenated_df = pd.concat(df, index=k)
concatenated_df.to_csv('tableau7.csv')
What I have done here is first define a directory and then assign all the xlsx files to fileName. I read the files into dataframes, used a regular expression to get the date from each filename and assigned it to the variable k, and then concatenated the dataframes to get the output csv file. But the code gives an error: TypeError: expected string or bytes-like object. Can somebody help me with what I am doing wrong?
You can use:
#simplify: add *.xlsx to the path
path = r"C:\\Users\\atcs\\Desktop\\data science\\files\\1-Danny Jones KPI's\\Source\\*.xlsx"
fileName = glob.glob(path)
#create list of DataFrames dfs
dfs = [pd.read_excel(f, header=None, sheet_name="YTD Summary_4") for f in fileName]  #sheet_name replaces the old sheetname argument
#add parameter keys for filenames, remove second level of multiindex
concatenated_df = pd.concat(dfs, keys=fileName).reset_index(level=1, drop=True)
#extract dates and convert to DatetimeIndex
pat = r'([0-9]{1,2}\-[0-9]{1,2}\-[0-9]{4})'
concatenated_df.index = pd.to_datetime(concatenated_df.index.str.extract(pat, expand=False))
print(concatenated_df)
A little mod:
path = r"C:\\Users\\atcs\\Desktop\\data science\\files\\1-Danny Jones KPI's\\Source\\*.xlsx"
fileName = glob.glob(path)
l = []
for f in fileName:
    df = pd.read_excel(f, header=None, sheet_name="YTD Summary_4")
    df['date'] = f
    l.append(df)
concatenated_df = pd.concat(l).set_index('date')
concatenated_df.to_csv('tableau7.csv')
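To turn that filename index into real dates before writing the csv, the same extraction used in the first answer applies:
pat = r'([0-9]{1,2}\-[0-9]{1,2}\-[0-9]{4})'
concatenated_df.index = pd.to_datetime(concatenated_df.index.str.extract(pat, expand=False))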
