Merge files with a similar naming convention into a dataframe - python

I have a list of files stored in a directory, such as:
filenames=[
abc_1.txt
abc_2.txt
abc_3.txt
bcd_1.txt
bcd_2.txt
bcd_3.txt
]
pattern=[abc]
I want to read multiple txt files into dataframes such that all files starting with abc end up in one dataframe, all files starting with bcd in another, and so on.
My code:
file_path = '/home/iolie/Downloads/test/'
filenames = os.listdir(file_path)
prefixes = list(set(i.split('_')[0] for i in filenames))
for prefix in prefixes:
    print('Reading files with prefix:', prefix)
    for file in filenames:
        if file.startswith(prefix):
            print('Reading files:', file)
            list_of_dfs = [pd.concat([pd.read_csv(os.path.join(file_path, file), header=None)], ignore_index=True)]
final = pd.concat(list_of_dfs)
This code doesn't append but overwrites the dataframe. Can someone help with this?

A better idea than creating an arbitrary number of unlinked dataframes is to output a dictionary of dataframes, where the key is the prefix:
from collections import defaultdict
import pandas as pd

filenames = ['abc_1.txt', 'abc_2.txt', 'abc_3.txt',
             'bcd_1.txt', 'bcd_2.txt', 'bcd_3.txt']

# Group the filenames by their prefix (the part before the first underscore)
dd = defaultdict(list)
for fn in filenames:
    dd[fn.split('_')[0]].append(fn)

# One concatenated DataFrame per prefix, keyed by the prefix
dict_of_dfs = {}
for k, v in dd.items():
    dict_of_dfs[k] = pd.concat([pd.read_csv(fn) for fn in v], ignore_index=True)
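If you later need everything in a single frame, the dictionary combines naturally with pd.concat, which uses the dict keys as the outer index level; a minimal sketch building on the dict_of_dfs above:

# Combine the per-prefix frames into one DataFrame with a MultiIndex;
# the outer index level is the prefix ('abc', 'bcd', ...).
combined = pd.concat(dict_of_dfs)
abc_rows = combined.loc['abc']   # only the rows that came from the abc_* files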

Related

How to import multiple excel files and manipulate them individually

I have to analyze 13 different Excel files and I want to read them all in Jupyter at once, instead of reading them all individually. I also want to be able to access the contents individually. So far I have this:
path = r"C:\Users\giova\PycharmProjects\DAEB_prijzen\data"
filenames = glob.glob(path + "\*.xlsx")

df_list = []
for file in filenames:
    df = pd.read_excel(file, usecols=['Onderhoudsordernr.', 'Oorspronkelijk aantal', 'Bedrag (LV)'])
    print(file)
    print(df)
    df_list.append(df)
When I run the code it seems to produce one big list, with some data missing, which I don't want. Can anyone help? :(
This looks like a problem that can be solved with a for loop and a dictionary.
Read the path location of your files:

import os
import pandas as pd

path = 'C:/your path'
paths = os.listdir(path)

Initialize an empty dictionary and fill it with one DataFrame per file (note that read_excel needs the full path, so the directory is joined back on):

my_files = {}
for i, p in enumerate(paths):
    my_files[i] = pd.read_excel(os.path.join(path, p))

Then you can access your files individually by simply calling the key in the dictionary:

my_files[i]

where i = 0, 1, ..., 12 (enumerate starts at 0).
Alternatively, if you want to assign a name to each file, you can either create a list of names or derive them from the filepath through some slice/regex function on the strings.
Assuming the first case:

names = ['excel1', ...]
for name, p in zip(names, paths):
    my_files[name] = pd.read_excel(os.path.join(path, p))
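For the second case, one possible sketch derives the key from each filename instead of hard-coding a list; the .xlsx filter and the os.path.splitext call are assumptions added here, not part of the original answer:

import os
import pandas as pd

path = 'C:/your path'
my_files = {}
for p in os.listdir(path):
    if p.endswith('.xlsx'):                        # skip anything that is not an Excel file
        name = os.path.splitext(p)[0]              # filename without its extension becomes the key
        my_files[name] = pd.read_excel(os.path.join(path, p))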

Concat list of dataframes in pandas not working because the list is too long

I am importing a large list of JSON files. They come from one folder for each year.
The files are imported properly and stored as dataframes in a list. My plan was to concatenate the list and export one CSV per year. The problem is that the concatenation is not working because the list of dataframes is too long (it works when I try with a few files). I think I should either build a separate list for each folder so that I can concatenate each list and export it, or find a way to concatenate only the dataframes in the list that have the same year (every dataframe has a column with the year value). I can't do either, so I need help.
My code looks like this:
os.chdir('C:\\Users\\User\\Documents\\Local\\hs')
rootdir = 'C:\\Users\\User\\Documents\\Local\\hs'

data_df = []
files_notloading = []

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        print(os.path.join(subdir, file))
        if 'json' in os.path.join(subdir, file):
            try:
                with open(os.path.join(subdir, file), 'r') as f:
                    data = json.loads(f.read())
                if not data['search']:
                    data['search'] = [{'R': 0}]
                # Normalizing data
                df = pd.json_normalize(data, record_path=['search'],
                                       meta=['month', 'type', 'day', 'year'],
                                       errors='ignore')
                data_df.append(df)
            except:
                files_notloading.append(file)

data_df = pd.concat(data_df)
files_notloading = pd.DataFrame(files_notloading)

for year in data_df['year'].unique():
    file_name = '/Users/User/Documents/data/hs_{0}.csv'.format(year)
    data_df[data_df['year'] == year].to_csv(file_name, index=False)

files_notloading.to_csv(path_or_buf='/Users/User/Documents/data/filesnotloading_hs.csv', index=False)
I was able to find a way to make a list for each folder, concatenate each list, and then export it.
Code:
import os
import json
import pandas as pd

os.chdir('C:\\Users\\User\\Documents\\Local\\hs')
working_dir = "C:\\Users\\User\\Documents\\Local\\hs"
output_dir = "C:\\Users\\User\\Documents\\Local\\hs"

files_notloading = []
for root, dirs, files in os.walk(working_dir):
    # Collect the JSON files of this folder, then build one df per file
    file_list = []
    df_list = []
    for filename in files:
        print(os.path.join(root, filename))
        if filename.endswith('.json'):
            file_list.append(os.path.join(root, filename))
    for file in file_list:
        try:
            with open(file, 'r') as f:
                data = json.loads(f.read())
            if not data['search']:
                data['search'] = [{'R': 0}]
            df = pd.json_normalize(data, record_path=['search'],
                                   meta=['month', 'type', 'day', 'year'],
                                   errors='ignore')
            df_list.append(df)
        except:
            files_notloading.append(file)
    # Concatenate and export one CSV per folder (i.e. per year)
    if df_list:
        final_df = pd.concat(df_list)
        final_file = 'hs_{0}.csv'.format(final_df['year'].iloc[0])
        final_df.to_csv(os.path.join(output_dir, final_file), index=False)

files_notloading = pd.DataFrame(files_notloading, columns=['file'])
files_notloading.to_csv(os.path.join(output_dir, 'hs_files_notloading.csv'), index=False)
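If the folders did not map cleanly onto years, the other idea from the question (concatenating only the dataframes that share a year) could look roughly like this sketch; it assumes each file holds rows for a single year, as the one-folder-per-year layout implies:

from collections import defaultdict
import pandas as pd

# data_df is the list of per-file frames built in the original loop
by_year = defaultdict(list)
for df in data_df:
    by_year[df['year'].iloc[0]].append(df)

# Concatenate each (much smaller) group separately and export one CSV per year
for year, frames in by_year.items():
    pd.concat(frames, ignore_index=True).to_csv('hs_{0}.csv'.format(year), index=False)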

Merge files with similar names into one dataframe

I have a list of files stored in a directory, such as:
filenames=[
abc_1.txt
abc_2.txt
abc_3.txt
bcd_1.txt
bcd_2.txt
bcd_3.txt
]
pattern=[abc]
I want to read multiple txt files into dataframes such that all files starting with abc end up in one dataframe, all files starting with bcd in another, and so on.
My code:
filenames = os.listdir(file_path)
expnames=[]
for files in filenames:
expnames.append(files.rsplit('_',1)[0])
## expnames=[abc, bcd]
dfs = []
for exp in expnames:
for files in filenames:
if files.startswith(exp):
dfs.append(pd.read_csv(file_path+files,sep=',',header=None))
big_frame = pd.concat(dfs, ignore_index=True)
My output contains duplicate rows due to the multiple for loops. Can someone help with this?
This will store your desired outputs in a list of dataframes called list_of_dfs and then create a MultiIndex dataframe final from them with the file prefixes (e.g. ['abc','bcd']) as the keys for the outermost index level:
import pandas as pd
import os

filenames = os.listdir(file_path)
prefixes = list(set(i.split('_')[0] for i in filenames))

# One concatenated DataFrame per prefix
list_of_dfs = [pd.concat([pd.read_csv(os.path.join(file_path, file), header=None)
                          for file in filenames if file.startswith(prefix)],
                         ignore_index=True)
               for prefix in prefixes]

final = pd.concat(list_of_dfs, keys=prefixes)
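Once final exists, the rows that came from a single prefix can be pulled back out of the MultiIndex, for example:

abc_df = final.loc['abc']   # only the rows read from the abc_* files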

Reading text files from folders and subfolders and creating a pandas dataframe where each file's text is one observation

I have the following structure of text files in folders and subfolders.
I want to read them all and create a dataframe. I am using this code, but it doesn't work well for me: the text is not what I expected and the number of files doesn't match my count.
l = [pd.read_csv(filename, header=None, encoding='iso-8859-1') for filename in glob.glob("2018_01_01/*.txt")]
main_df = pd.concat(l, axis=1)
main_df = main_df.T

for i in range(2):
    l = [pd.read_csv(filename, header=None, encoding='iso-8859-1', quoting=csv.QUOTE_NONE)
         for filename in glob.glob(str(foldernames[i+1]) + '/' + '*.txt')]
    df = pd.concat(l, axis=1)
    df = df.T
    main_df = pd.merge(main_df, df)
Assuming those directories contain txt files in which the information has the same structure in all of them:
import os
import pandas as pd

path = '/path/to/directory/of/directories/'
observations = []

for directory in os.listdir(path):
    dir_path = os.path.join(path, directory)      # listdir returns bare names, so join the parent back on
    if os.path.isdir(dir_path):
        for filename in os.listdir(dir_path):
            with open(os.path.join(dir_path, filename)) as f:
                observations.append(f.read())     # the whole text of one file = one observation

# Build the DataFrame once at the end (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame({'observation': observations})
Once all your files have been iterated, df should be the DataFrame containing all the information in your different txt files.
You can do that using a for loop. But before that, you need to give sequential names to all the files, like 'fil_0' within 'fol_0', 'fil_1' within 'fol_1', 'fil_2' within 'fol_2' and so on. That would facilitate the use of a for loop:
import pandas as pd

dataframes = []
for var in range(1000):
    name = "fol_" + str(var) + "/fil_" + str(var) + ".txt"
    dataframes.append(pd.read_csv(name))  # if you need to use all the files at once
    # otherwise
    df = pd.read_csv(name)  # you can use the files one by one
It will automatically create dataframes for each file.
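If renaming every file is not an option, a recursive glob over the existing folder names is a possible alternative; this is only a sketch, assuming the script runs from the top-level directory and keeping the read_csv options used in the question:

import glob
import pandas as pd

dataframes = []
for name in glob.glob('**/*.txt', recursive=True):   # walks all subfolders
    dataframes.append(pd.read_csv(name, header=None, encoding='iso-8859-1'))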

Reading multiple files into separate data frames in PYTHON

I want to know if there's a way in Python to read multiple CSV files from a folder and assign each to a separate data frame named after the file. The code below will throw an error, but I pasted it to show the idea:
import glob

for filename in glob.glob('*.csv'):
    index = filename.find(".csv")
    if "test" in filename:
        filename[:index]) = pd.read_csv(filename)
I believe you need to create a dictionary of DataFrames with the filenames as keys:
d = {}
for filename in glob.glob('*.csv'):
    if "test" in filename:
        d[filename[:-4]] = pd.read_csv(filename)
Which is the same as:
d = {f[:-4]: pd.read_csv(f) for f in glob.glob('*.csv') if "test" in f}
If you want only the name of the file (without the directory part), it is possible to use:
d = {os.path.basename(f).split('.')[0]:pd.read_csv(f) for f in glob.glob('*.csv') if "test" in f}
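Whichever variant you choose, each DataFrame is then retrieved by its key; for example (the filename test_data.csv is hypothetical):

df = d['test_data']   # DataFrame read from test_data.csv (hypothetical filename)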
