Using date as index in output file - python

I have several Excel files whose filenames are differentiated by date. I have to concatenate all these files, using the dates from the filenames as the index. I have written the following code:
path = r"C:\\Users\\atcs\\Desktop\\data science\\files\\1-Danny Jones KPI's\\Source\\"
fileName = glob.glob(os.path.join(path, "*.xlsx"))
df = (pd.read_excel(f, header=None, sheetname = "YTD Summary_4") for f in fileName)
k = (re.search("([0-9]{1,2}\-[0-9]{1,2}\-[0-9]{4})", fileName))
concatenated_df = pd.concat(df, index=k)
concatenated_df.to_csv('tableau7.csv')
What I have done here: first I defined a directory, then assigned all the xlsx files in it to fileName. I read the files into dataframes, used a regular expression to get the date from the filenames and assigned it to the variable k, and then concatenated the files to get the output csv file. But the code gives an error: TypeError: expected string or bytes-like object. Can somebody tell me what I am doing wrong?

The TypeError comes from k = re.search(...): re.search expects a single string, but fileName is a list. pd.concat also has no index parameter. You can use:
# simpler: put *.xlsx directly in the glob pattern
path = r"C:\Users\atcs\Desktop\data science\files\1-Danny Jones KPI's\Source\*.xlsx"
fileName = glob.glob(path)
# create a list of DataFrames, one per file
# (the keyword is sheet_name in pandas 0.21+; older versions used sheetname)
dfs = [pd.read_excel(f, header=None, sheet_name="YTD Summary_4") for f in fileName]
# pass keys= to label each frame with its filename,
# then drop the second level of the resulting MultiIndex
concatenated_df = pd.concat(dfs, keys=fileName).reset_index(level=1, drop=True)
# extract the date from each filename and convert to a DatetimeIndex
pat = r'(\d{1,2}-\d{1,2}-\d{4})'
concatenated_df.index = pd.to_datetime(concatenated_df.index.str.extract(pat, expand=False))
print(concatenated_df)
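For comparison, this is why the original re.search call fails and how the same pattern behaves when applied per file. The filenames below are made up purely for illustration:
import re
# hypothetical filenames, just to demonstrate the pattern
names = [r"C:\Source\KPI 01-15-2018.xlsx", r"C:\Source\KPI 02-15-2018.xlsx"]
# re.search(pat, names) raises TypeError: expected string or bytes-like object,
# because it needs one string; applying it to each element works:
dates = [re.search(r"\d{1,2}-\d{1,2}-\d{4}", n).group(0) for n in names]
print(dates)   # ['01-15-2018', '02-15-2018']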

A small modification:
path = r"C:\Users\atcs\Desktop\data science\files\1-Danny Jones KPI's\Source\*.xlsx"
fileName = glob.glob(path)
l = []
for f in fileName:
    df = pd.read_excel(f, header=None, sheet_name="YTD Summary_4")
    df['date'] = f   # store the source file path on every row
    l.append(df)
concatenated_df = pd.concat(l).set_index('date')
concatenated_df.to_csv('tableau7.csv')
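Note that 'date' here holds the full file path. If only the date itself is wanted in the index, the regex from the first answer can be applied afterwards, for example:
# reduce the full path in the index to just the date portion
concatenated_df.index = concatenated_df.index.str.extract(r'(\d{1,2}-\d{1,2}-\d{4})', expand=False)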

Related

How to loop through a folder of csv files and read the header of each, then output to a folder

I'm a newbie in python and need help with this piece of code. I did a lot of search to get to this stage but couldn't fix it on my own. Thanks in advance for your help.
What I'm trying to do: I have to compare 100+ csv files in a folder, and not all of them have the same number of columns or the same column names. So I'm trying to use Python to read the headers of each file and write them to a csv file in an output folder.
I got to this point but not sure if I'm on the right path even:
import pandas as pd
import glob
path = r'C:\Users\user1\Downloads\2016GAdata' # use your path
all_files = glob.glob(path + "/*.csv")
list1 = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    list1.append(df)
frame = pd.concat(list1, axis=0, ignore_index=True)
print(frame)
thanks for your help!
You can create a dictionary whose keys are the filenames and whose values are the dataframe columns. Building a dataframe from this dictionary gives the filenames as the index and the column names as the values.
d = {}
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    d[filename] = df.columns
frame = pd.DataFrame.from_dict(d, orient='index')
which gives, for example:
           0     1     2       3
file1  Fruit  Date  Name  Number
file2  Fruit  Date  Name    None
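Since only the headers are needed, the files don't have to be fully read: nrows=0 loads just the header row. A shorter sketch of the same idea, finishing with the write-out the question asks for (the output filename is illustrative):
# read only the header row of each file
d = {filename: pd.read_csv(filename, nrows=0).columns for filename in all_files}
frame = pd.DataFrame.from_dict(d, orient='index')
frame.to_csv('headers.csv')   # hypothetical output name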

Create and assign recursively to dataframes in Pandas

I want to read csv files from a directory and assign each to a different dataframe. I have tried to do so like this:
path = r'C:\Users\A\Documents\Dash'
files = glob.glob(path + "/*.csv")
for file in files:
    f'df{file}' = pd.read_csv(file, sep=',')
But of course I can't assign to a literal, and I don't see another way to do this. I don't really care whether each dataframe is numbered or named after its csv file.
You could do it this way:
for index, file in enumerate(files):
    # writing into vars() works at module scope, but a dict (next answer) is generally safer
    vars()['df' + str(index)] = pd.read_csv(file, sep=',')
print(df0)
print(df1)
files = glob.glob(path + "/*.csv")

## using map & zip:
df_list = list(map(lambda x: pd.read_csv(x, sep=","), files))  # result in a list
df_dict = dict(zip(files, df_list))  # result in a dict keyed by filename

## using a for loop:
# result in a list
df_list = list()
for file in files:
    df_list.append(pd.read_csv(file, sep=','))
# result in a dict
df_dict = dict()
for file in files:
    df_dict[file] = pd.read_csv(file, sep=',')
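Either way, each dataframe can then be looked up by its filename instead of by a generated variable name, e.g.:
# usage: fetch the frame for the first globbed file
first_df = df_dict[files[0]]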

Importing multiple excel files into Python, merge and apply filename to a new column

I have a for loop that imports all of the Excel files in the directory and merges them into a single dataframe. However, I want to create a new column where each row holds the filename of the Excel file it came from.
Here is my import and merge code:
path = os.getcwd()
files = os.listdir(path)
df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    df = df.append(data)
For example, if the first Excel file is named "file1.xlsx", I want all rows from that file to have the value file1.xlsx in col3 (a new column). If the second Excel file is named "file2.xlsx", I want all rows from that file to have the value file2.xlsx. Note that there is no real pattern to the Excel file names; I just use those names as an example.
Many thanks
Create the new column inside the loop:
df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    data['col3'] = f   # tag every row with its source filename
    df = df.append(data)   # note: DataFrame.append was removed in pandas 2.0; prefer the concat version below
Another possible solution with list comprehension:
dfs = [pd.read_excel(f, 'Sheet1', header = None, names = ['col1','col2']).assign(col3 = f)
for f in files]
df = pd.concat(dfs)
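Since files comes from os.listdir here, the values stored in col3 are bare names already; if the paths came from glob instead, os.path.basename would trim them (a small variation, not part of the original answers):
data['col3'] = os.path.basename(f)   # 'file1.xlsx' rather than a full path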

Merge files with similar name in to one dataframe

I have a list of files stored in directory such as
filenames=[
abc_1.txt
abc_2.txt
abc_3.txt
bcd_1.txt
bcd_2.txt
bcd_3.txt
]
pattern=[abc]
I want to read multiple txt files into one dataframe, such that all files starting with abc end up in one dataframe, all files starting with bcd in another, and so on.
My code:
filenames = os.listdir(file_path)
expnames = []
for files in filenames:
    expnames.append(files.rsplit('_', 1)[0])
## expnames=[abc, bcd]
dfs = []
for exp in expnames:
    for files in filenames:
        if files.startswith(exp):
            dfs.append(pd.read_csv(file_path + files, sep=',', header=None))
big_frame = pd.concat(dfs, ignore_index=True)
My output contains duplicate rows because of the nested for loops. Can someone help with this?
This will store your desired outputs in a list of dataframes called list_of_dfs, and then create a MultiIndex dataframe final from them, with the file prefixes (e.g. ['abc', 'bcd']) as the keys for the outermost index level:
import pandas as pd
import os

filenames = os.listdir(file_path)
prefixes = list(set(i.split('_')[0] for i in filenames))
list_of_dfs = [
    pd.concat([pd.read_csv(os.path.join(file_path, file), header=None)
               for file in filenames if file.startswith(prefix)],
              ignore_index=True)
    for prefix in prefixes]
final = pd.concat(list_of_dfs, keys=prefixes)
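The prefix level of the MultiIndex can then be used to pull out one group at a time, e.g.:
# usage: all rows that came from the abc_* files
abc_rows = final.loc['abc']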
The same thing written as an explicit loop:
file_path = '/home/iolie/Downloads/test/'
filenames = os.listdir(file_path)
prefixes = list(set(i.split('_')[0] for i in filenames))
list_of_dfs = []
for prefix in prefixes:
    # gather every frame whose filename starts with this prefix
    frames = [pd.read_csv(os.path.join(file_path, file), header=None)
              for file in filenames if file.startswith(prefix)]
    list_of_dfs.append(pd.concat(frames, ignore_index=True))
final = pd.concat(list_of_dfs, keys=prefixes)

Reading text files from subfolders and folders and creating a dataframe in pandas for each file text as one observation

I have the following layout of text files in folders and subfolders.
I want to read them all and create a df. I am using the code below, but it doesn't work well for me: the text read in is not what I expect, and the number of files doesn't match my count.
l = [pd.read_csv(filename, header=None, encoding='iso-8859-1') for filename in glob.glob("2018_01_01/*.txt")]
main_df = pd.concat(l, axis=1)
main_df = main_df.T
for i in range(2):
    l = [pd.read_csv(filename, header=None, encoding='iso-8859-1', quoting=csv.QUOTE_NONE) for filename in glob.glob(str(foldernames[i+1]) + '/' + '*.txt')]
    df = pd.concat(l, axis=1)
    df = df.T
    main_df = pd.merge(main_df, df)
Assuming those directories contain txt files whose contents have the same structure in all of them:
import os
import pandas as pd

path = '/path/to/directory/of/directories/'
rows = []
for directory in os.listdir(path):
    # os.listdir returns bare names, so rebuild the full path before testing it
    dir_path = os.path.join(path, directory)
    if os.path.isdir(dir_path):
        for filename in os.listdir(dir_path):
            with open(os.path.join(dir_path, filename)) as f:
                rows.append({'observation': f.read()})
# build the frame once at the end (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(rows, columns=['observation'])
Once all the files have been read, df should be the DataFrame containing all the information from your different txt files.
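For a flat directory-of-directories layout like this one, glob can collect the same files in a single call (the pattern assumes exactly one level of nesting):
import glob
# every txt file exactly one directory below path
txt_files = glob.glob(os.path.join(path, '*', '*.txt'))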
You can do that with a for loop. But before that, you need to give sequenced names to all the files, like 'fil_0.txt' inside 'fol_0', 'fil_1.txt' inside 'fol_1', 'fil_2.txt' inside 'fol_2', and so on. That makes the loop straightforward:
import pandas as pd

dataframes = []
for var in range(1000):
    name = "fol_" + str(var) + "/fil_" + str(var) + ".txt"
    dataframes.append(pd.read_csv(name))   # collect all the files in one list
    # otherwise, to use the files one by one:
    # df = pd.read_csv(name)
This builds a dataframe from each file automatically.
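If renaming everything isn't practical, os.walk visits every subfolder regardless of naming, so no convention is needed (a sketch; the top-level path is a placeholder):
import os
import pandas as pd

dataframes = []
for root, dirs, files in os.walk('path/to/top'):   # placeholder root directory
    for name in files:
        if name.endswith('.txt'):
            dataframes.append(pd.read_csv(os.path.join(root, name)))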
