Merge files with similar names into one dataframe - python

I have a list of files stored in a directory such as
filenames=[
abc_1.txt
abc_2.txt
abc_3.txt
bcd_1.txt
bcd_2.txt
bcd_3.txt
]
pattern=[abc]
I want to read multiple txt files into one dataframe such that all files starting with abc will be in one dataframe, then all filenames starting with bcd, etc.
My code:
filenames = os.listdir(file_path)
expnames = []
for files in filenames:
    expnames.append(files.rsplit('_', 1)[0])
## expnames=[abc, bcd]
dfs = []
for exp in expnames:
    for files in filenames:
        if files.startswith(exp):
            dfs.append(pd.read_csv(file_path + files, sep=',', header=None))
big_frame = pd.concat(dfs, ignore_index=True)
My output contains duplicate rows because of the nested for loops. Can someone help with this?

This will store your desired outputs in a list of dataframes called list_of_dfs and then create a MultiIndex dataframe final from them with the file prefixes (e.g. ['abc','bcd']) as the keys for the outermost index level:
import pandas as pd
import os
filenames = os.listdir(file_path)
prefixes = list(set(i.split('_')[0] for i in filenames))
list_of_dfs = [pd.concat([pd.read_csv(os.path.join(file_path, file), header=None)
                          for file in filenames if file.startswith(prefix)],
                         ignore_index=True)
               for prefix in prefixes]
final = pd.concat(list_of_dfs, keys=prefixes)
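With the prefixes on the outer index level, one group can then be pulled out directly, e.g. (assuming an 'abc' prefix exists):
abc_frame = final.loc['abc']  # all rows that came from the abc_* files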

file_path = '/home/iolie/Downloads/test/'
filenames = os.listdir(file_path)
prefixes = list(set(i.split('_')[0] for i in filenames))
for prefix in prefixes:
    for file in filenames:
        if file.startswith(prefix):
            list_of_dfs = [pd.concat([pd.read_csv(os.path.join(file_path, file), header=None)], ignore_index=True)]
final = pd.concat(list_of_dfs)

Related

Reading several datasets

I need to read 8 datasets:
df01, df02, df03, df04, df05, df06, df07, df08
This is my current approach:
#Set up file paths
filepath_2_df01= "folderX/df01.csv"
filepath_2_df02= "folderX//df02.csv"
filepath_2_df03= "folderX//df03.csv"
filepath_2_df04= "folderX//df04.csv"
filepath_2_df05= "folderY/df05.csv"
filepath_2_df06= "folderY/df06.csv"
filepath_2_df07= "folderY/df07.csv"
filepath_2_df08= "folderY/df08.csv"
#Read files
df01= pd.read_csv(filepath_2_df01)
df02= pd.read_csv(filepath_2_df02)
df03= pd.read_csv(filepath_2_df03)
df04= pd.read_csv(filepath_2_df04)
df05= pd.read_csv(filepath_2_df05)
df06= pd.read_csv(filepath_2_df06)
df07= pd.read_csv(filepath_2_df07)
df08= pd.read_csv(filepath_2_df08)
Is there a more concise way of doing that?
You can use glob for this:
import glob
import pandas as pd

dfs = []
for file in glob.glob('folderX/*.csv'):
    dfs.append(pd.read_csv(file))

for df in dfs:
    print(df)
Use glob:
from glob import glob
import pandas as pd

# get the list of filenames starting with df in a specific folder; with recursive=True,
# ** will search subfolders as well (folderX and folderY, in your case)
filenames = glob('/**/df*.csv', recursive=True)
dataframes = [pd.read_csv(f) for f in filenames]  # using a list comprehension, get a list of dataframes
master_df = pd.concat(dataframes)  # concatenate the list of dataframes into one master dataframe
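If you still want to address each dataset by its name (df01, df02, ...), one option is a dictionary keyed by the file stem instead of eight separate variables; a minimal sketch, assuming the folderX/folderY layout from the question:
import os
from glob import glob
import pandas as pd

# collect every df*.csv from both folders and key each dataframe by its file stem
paths = glob('folderX/df*.csv') + glob('folderY/df*.csv')
datasets = {os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p) for p in paths}

print(datasets['df01'].head())  # access an individual dataset by name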

Efficient ways to combine multiple .csv files using python

I currently have about 700 '.csv' files and want to combine them into one. Each file has three columns: 'date', 'time' and 'var'. I have to combine them based on two columns: date and name. I currently read them as dataframes. After combining them, the final file should have the columns date, name, var1, var2, var3, ..., var700. I currently use the pandas merge function, but it is super slow, as the data is large. Is there any efficient way to combine the files? My current code is as follows:
for filename in os.listdir(signal_path):
    filepath = os.path.join(signal_path, filename)
    _temp = pd.read_pickle(filepath)
    data.merge(_temp, how='left', on=['date', 'name'])
I have attached sample data; each file has a different length.
This code will combine all the csv files if the files are in the same path:
import pandas as pd
from pathlib import Path
dir = Path("../relevant_directory")
df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)
Use this:
import pandas as pd
import os

dir = '../relevant_directory/'
first = True
for folder, subfolders, files in os.walk(dir):
    for f in files:
        file = os.path.join(folder, f)
        if file.split('.')[-1] == 'csv':
            if first:
                data = pd.read_csv(file)
                first = False
            else:
                df = pd.read_csv(file)
                data = pd.merge(data, df, on=['date', 'name'])
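Repeatedly merging into one growing frame gets slower with every file. A usually faster pattern for this shape of data is to index each file on ['date', 'name'], rename its value column, and concatenate once along the columns; a rough sketch, assuming each file has 'date', 'name' and 'var' columns and that the varN column names below are made up for illustration:
import os
import pandas as pd

frames = []
for i, filename in enumerate(os.listdir(signal_path)):
    filepath = os.path.join(signal_path, filename)
    _temp = pd.read_pickle(filepath)                   # or pd.read_csv, depending on the files
    series = _temp.set_index(['date', 'name'])['var']  # value column keyed by date and name
    frames.append(series.rename(f'var{i + 1}'))        # hypothetical per-file column name

# one concat (outer join on the index) instead of ~700 incremental merges
data = pd.concat(frames, axis=1).reset_index()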

Iterating over files in a directory and removing rows from them based on another file

I am looking for a way to iterate over 30 files in a directory and remove rows from them based on IDs in another file. The files contain two columns - an ID and a value, without column names. The other file contains just a column with the IDs ("id") that should be removed ("ids_toberemoved"). After the 30 files are cleaned I want to export them to another folder.
This is what I have so far:
import pandas as pd
import os
ids_toberemoved = pd.read_csv('F:\\ids.csv')
myPath = "F:\\Other"
filesList = []
for path, subdirs, files in os.walk(myPath):
    for name in files:
        filesList.append(os.path.join(name))

dataframes = []
for filename in filesList:
    dataframes.append(pd.read_csv(filename))

for df in dataframes:
    df_cleaned = df.merge(ids_toberemoved, left_index=True, right_on=['id'],
                          how='left', indicator=True)
    df_cleaned[df_cleaned._merge != 'both']
I am missing something in the step where I iterate over the dataframes and join them with 'ids_toberemoved' in order to delete the rows with the matching IDs. Also, I can't figure out how to store every single file, after the cleaning, in another folder.
Any help appreciated!
Try the following approach:
from pathlib import Path
myPath = Path("F:\\Other")
ids_toberemoved = pd.read_csv('F:\\ids.csv', squeeze=True)
res = pd.concat([pd.read_csv(f, header=None, names=["ID", "val"])
                   .query("ID not in @ids_toberemoved")
                 for f in myPath.glob("*.csv")],
                ignore_index=True)
UPDATE: in order to clean the files and to export them separately as "filename_clean.csv":
_ = [pd.read_csv(f, header=None, names=["ID", "val"])
       .query("ID not in @ids_toberemoved")
       .to_csv(f.with_name(f"{f.stem}_clean{f.suffix}"), index=False)
     for f in myPath.glob("*.csv")]

Merge files with a similar name convention into a dataframe

I have a list of files stored in a directory such as
filenames=[
abc_1.txt
abc_2.txt
abc_3.txt
bcd_1.txt
bcd_2.txt
bcd_3.txt
]
pattern=[abc]
I want to read multiple txt files into one dataframe such that all files starting with abc will be in one dataframe, then all filenames starting with bcd, etc.
My code:
file_path = '/home/iolie/Downloads/test/'
filenames = os.listdir(file_path)
prefixes = list(set(i.split('_')[0] for i in filenames))
for prefix in prefixes:
    print('Reading files with prefix:', prefix)
    for file in filenames:
        if file.startswith(prefix):
            print('Reading files:', file)
            list_of_dfs = [pd.concat([pd.read_csv(os.path.join(file_path, file), header=None)], ignore_index=True)]
final = pd.concat(list_of_dfs)
This code doesn't append but overwrites the dataframe. Can someone help with this?
A better idea than creating an arbitrary number of unlinked dataframes is to output a dictionary of dataframes, where the key is the prefix:
from collections import defaultdict
filenames = ['abc_1.txt', 'abc_2.txt', 'abc_3.txt',
             'bcd_1.txt', 'bcd_2.txt', 'bcd_3.txt']

dd = defaultdict(list)
for fn in filenames:
    dd[fn.split('_')[0]].append(fn)

dict_of_dfs = {}
for k, v in dd.items():
    dict_of_dfs[k] = pd.concat([pd.read_csv(fn) for fn in v], ignore_index=True)
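Each prefix then maps to one concatenated frame, e.g.:
abc_df = dict_of_dfs['abc']  # all rows read from abc_1.txt, abc_2.txt and abc_3.txt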

Reading text files from folders and subfolders and creating a pandas dataframe with each file's text as one observation

I have the following structure of text files in folders and subfolders.
I want to read them all and create a df. I am using this code, but it doesn't work well for me: the text is not what I expected and the number of files read does not match my count.
import csv
import glob
import pandas as pd

l = [pd.read_csv(filename, header=None, encoding='iso-8859-1') for filename in glob.glob("2018_01_01/*.txt")]
main_df = pd.concat(l, axis=1)
main_df = main_df.T
for i in range(2):
    l = [pd.read_csv(filename, header=None, encoding='iso-8859-1', quoting=csv.QUOTE_NONE) for filename in glob.glob(str(foldernames[i+1]) + '/' + '*.txt')]
    df = pd.concat(l, axis=1)
    df = df.T
    main_df = pd.merge(main_df, df)
Assuming those directories contain txt files in which the information has the same structure in all of them:
import os
import pandas as pd

df = pd.DataFrame(columns=['observation'])
path = '/path/to/directory/of/directories/'
for directory in os.listdir(path):
    dirpath = os.path.join(path, directory)  # build the full path so this also works outside that working directory
    if os.path.isdir(dirpath):
        for filename in os.listdir(dirpath):
            with open(os.path.join(dirpath, filename)) as f:
                observation = f.read()
            current_df = pd.DataFrame({'observation': [observation]})
            df = df.append(current_df, ignore_index=True)
Once all your files have been iterated, df should be the DataFrame containing all the information in your different txt files.
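A shorter variant of the same idea, assuming the txt files sit inside the subdirectories, each whole file should become one observation, and the iso-8859-1 encoding from the question applies:
from pathlib import Path
import pandas as pd

path = Path('/path/to/directory/of/directories/')
observations = [p.read_text(encoding='iso-8859-1') for p in sorted(path.rglob('*.txt'))]
df = pd.DataFrame({'observation': observations})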
You can do that using a for loop. But before that, you need to give sequential names to all the files, like 'fil_0' within 'fol_0', 'fil_1' within 'fol_1', 'fil_2' within 'fol_2' and so on. That would facilitate the use of a for loop:
import pandas as pd

dataframes = []
for var in range(1000):
    name = "fol_" + str(var) + "/fil_" + str(var) + ".txt"
    dataframes.append(pd.read_csv(name))  # if you need to use all the files at once
    # otherwise
    df = pd.read_csv(name)  # you can use the files one by one
It will automatically create dataframes for each file.
