Assume that, using glob, we read a folder which contains several csv files such as:
/C/share\AA_12345_1.csv
/C/share\AA_12345_2.csv
/C/share\AA_12345_3.csv
/C/share\BB_13_1.csv
/C/share\BB_13_2.csv
The goal is to append the csv files by filename group; for example, append
/C/share\AA_12345_1.csv
...
/C/share\AA_12345_3.csv
into one dataframe for /C/share\AA_12345, and
/C/share\BB_13_1.csv
/C/share\BB_13_2.csv
into another dataframe for /C/share\BB_13.
My current approach is using
res = [list(i) for j, i in groupby(lof, lambda a: a.partition('\.*_\d*_?[0-9]_*(?=_)')[0])]
to get groups like [['/C/share\AA_12345_1.csv', '/C/share\AA_12345_2.csv', '/C/share\AA_12345_3.csv'], ['/C/share\BB_13_1.csv', '/C/share\BB_13_2.csv']]
and then for each group read the csv and append.
However, the result is still one big list [/C/share\AA_12345_1.csv,...,/C/share\BB_13_2.csv]
Any idea/pointer on how to move forward?
Many thanks in advance!
Here's a different approach to get groups of similar files, using a real regex via re.search (str.partition treats its separator as a literal string, which is why the pattern above never matched anything):
import re
import itertools

key_func = lambda x: re.search(r"\w*\D_\d*", x).group()
res = [list(j) for i, j in itertools.groupby(d, key_func)]  # d contains the list of csv files
for i in res:
    print(i)
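To finish the original goal, read each group's csv files and append them into a single dataframe. A minimal sketch (assuming pandas is available and d is the list of paths as above; note that itertools.groupby only merges consecutive items with equal keys, so sort the list by the key first):

import re
import itertools
import pandas as pd

key_func = lambda x: re.search(r"\w*\D_\d*", x).group()
d.sort(key=key_func)  # groupby only groups consecutive equal keys
frames = {}
for prefix, group in itertools.groupby(d, key_func):
    # append every csv in the group into one dataframe keyed by the common prefix
    frames[prefix] = pd.concat((pd.read_csv(f) for f in group), ignore_index=True)

frames['AA_12345'] would then hold the appended rows of the three AA_12345_*.csv files.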
If I have, for example, 3 txt files that look as follows:
file1.txt:
a 10
b 20
c 30
file2.txt:
d 40
e 50
f 60
file3.txt:
g 70
h 80
i 90
I would like to read the data from these files and create a single Excel file that places the files' contents side by side, column by column.
Specifically, in my case I have 100+ txt files that I read using glob and a loop.
Thank you
There's a bit of logic involved in getting the output you need.
First, process the input files into separate lists. You might need to adjust this logic depending on the actual contents of the files, since you need to be able to recover the columns; for the samples provided, my logic works.
I added a safety check to see if the input files have the same number of rows. If they don't, it will seriously mess up the resulting Excel file, so you'll need to add some handling for a length mismatch.
For writing the Excel file, it's very easy using pandas in combination with openpyxl. There are likely more elegant solutions, but I'll leave those to you.
I'm referencing some SO answers in the code for further reading.
requirements.txt
pandas
openpyxl
main.py
# we use pandas for easy saving as XLSX
import pandas as pd
filelist = ["file01.txt", "file02.txt", "file03.txt"]
def load_file(filename: str) -> list:
    result = []
    with open(filename) as infile:
        # splitlines() is OS agnostic and removes the EOL characters
        for line in infile.read().splitlines():
            # split() splits on whitespace by default
            result.append(line.split())
    return result
loaded_files = []
for filename in filelist:
    loaded_files.append(load_file(filename))
# you will want to check if the files have the same number of rows
# it will break stuff if they don't, you could fix it by appending empty rows
# stolen from:
# https://stackoverflow.com/a/10825126/9267296
len_first = len(loaded_files[0]) if loaded_files else None
if not all(len(i) == len_first for i in loaded_files):
    print("length mismatch")
    exit(419)
# generate empty list of lists so we don't get index error below
# stolen from:
# https://stackoverflow.com/a/33990699/9267296
result = [ [] for _ in range(len(loaded_files[0])) ]
for f in loaded_files:
    for index, row in enumerate(f):
        result[index].extend(row)
        result[index].append('')
# trim the last empty column
result = [line[:-1] for line in result]
# write as excel file
# stolen from:
# https://stackoverflow.com/a/55511313/9267296
# note that there are some other options on this SO question, but this one
# is easily readable
df = pd.DataFrame(result)
with pd.ExcelWriter("output.xlsx") as writer:
    df.to_excel(writer, sheet_name="sheet_name_goes_here", index=False)
Result: each row of the spreadsheet combines the matching rows of the three files, separated by an empty column, e.g. the first row reads a 10, then d 40, then g 70.
I have, for example, 4 csv files, plus many other files following the naming convention below, including some that don't have 'kd' in their name. I want to select the files with 'kd' and do the following:
kd_2020_2.csv
kd_2020_2_modified.csv
kd_2021_2.csv
kd_2021_2_modified.csv
pp_2012_2.csv
I want to group the two files that have the same name except for the 'modified' portion and then read those files and do some comparison (therefore, kd_2020_2.csv and kd_2020_2_modified.csv will be grouped together and so on).
So far, I got
import pandas as pd
import os
import glob
import difflib
os.chdir('C:\\New_folder')
FileList = glob.glob('*.csv')
print(FileList)
files = [f for f in FileList if 'kd' in f]
file_name = [files[i].split('.')[0] for i in range(len(files))]
for i in range(len(file_name)):
    if difflib.ndiff(file_name[i], file_name[i+1]) == 'modified':
        df[i] = pd.read_csv(FileList[i])
        df[i+1] = pd.read_csv(FileList[i+1])
It is going out of range since I am doing (i+1). Also, this is not what I intend to do: I want to compare each file name with all the other file names and read only the two files whose names match except for the 'modified' part. Thank you for your help.
You can use itertools.groupby to create groups based on the first 9 characters of the filenames:
from itertools import groupby
file_groups = [list(i) for j, i in groupby(FileList, lambda a: a[:9])]
This will output a list of groups (note that the last group contains a single file):
[['kd_2020_2.csv', 'kd_2020_2_modified.csv'], ['kd_2021_2.csv', 'kd_2021_2_modified.csv'], ['pp_2012_2.csv']]
You can then iterate over the list, load each pair, and process it:
for i in file_groups:
    if len(i) == 2:  # skip groups without a modified counterpart, such as pp_2012_2.csv
        df1 = pd.read_csv(i[0])
        df2 = pd.read_csv(i[1])
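If the shared prefix isn't always exactly 9 characters, a key function that strips the optional '_modified' suffix is more robust. A sketch (not part of the original answer, assuming the naming pattern shown above):

import os
from itertools import groupby

def base_name(filename):
    stem, _ = os.path.splitext(filename)  # e.g. 'kd_2020_2_modified'
    return stem.replace('_modified', '')  # -> 'kd_2020_2'

FileList.sort(key=base_name)  # groupby only groups consecutive items
file_groups = [list(g) for _, g in groupby(FileList, key=base_name)]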
Excuse my question, I know this is trivial, but for some reason I am not getting it right. Reading dataframes one by one is highly inefficient, especially if you have a lot of dataframes to read. Remember DRY - DO NOT REPEAT YOURSELF
So here is my approach:
files = ["company.csv", "house.csv", "taxfile.csv", "reliablity.csv", "creditloan.csv", "medicalfunds.csv"]
DataFrameName = ["company_df", "house_df", "taxfile_df", "reliablity_df", "creditloan_df", "medicalfunds_df"]
for file in files:
    for df in DataFrameName:
        df = pd.read_csv(file)
This only gives me df as one of the frames; I am not sure which one, but I guess the last. How can I read through the csv files and store each one under the corresponding dataframe name in DataFrameName?
My goal:
To have 6 dataframes loaded in the workspace, named as in DataFrameName.
For example, company_df holds the data from "company.csv".
You could set up
DataFrameDic = {"company": [], "house": [], "taxfile": [], "reliablity": [], "creditloan": [], "medicalfunds": []}
for key in DataFrameDic:
    DataFrameDic[key] = pd.read_csv(key + '.csv')
This should leave you with a dictionary of dataframes.
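You can then grab any frame by its key, for example:

DataFrameDic["company"].head()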
Something like this:
files = [
"company.csv",
"house.csv",
"taxfile.csv",
"reliablity.csv",
"creditloan.csv",
"medicalfunds.csv",
]
DataFrameName = [
"company_df",
"house_df",
"taxfile_df",
"reliablity_df",
"creditloan_df",
"medicalfunds_df",
]
dfs = {}
for name, file in zip(DataFrameName, files):
    dfs[name] = pd.read_csv(file)
zip lets you iterate two lists at the same time, so you can get both the name and the filename.
You'll end up with a dict of DataFrames.
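You can then access each frame by the name you gave it, e.g. dfs["company_df"]. The same loop can also be written as a dict comprehension:

dfs = {name: pd.read_csv(file) for name, file in zip(DataFrameName, files)}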
Using pathlib we can create a generator expression, then build a dictionary with the file stem as the key and the dataframe as the value.
With pathlib we can use the .glob method to grab all the csv's in a target path.
Replace "/tmp/files" with the path to your files; if you're using Windows, use a raw string or escape the backslashes.
from pathlib import Path
import pandas as pd

trg_files = (f for f in Path("/tmp/files").glob("*.csv"))
dataframe_dict = {f"{file.stem}_df": pd.read_csv(file) for file in trg_files}
print(dataframe_dict.keys())
# dict_keys(['company_df', ...])
print(dataframe_dict['company_df'])
Dictionaries are the way to go, since you can name their contents dynamically.
names = ["company", "house", "taxfile", "reliablity", "creditloan", "medicalfunds"]
dataframes = {}
for name in names:
    dataframes[f"{name}_df"] = pd.read_csv(f"{name}.csv")
The fact that you have a nice naming convention lets us easily append the _df or .csv part to the name when needed.
I have multiple CSV files that I want to compare. The file contents are the same except for some additional changes, which I want to list.
For eg:
files = ['1.csv', '2.csv', '3.csv']
I want to compare 1.csv and 2.csv, get the difference and store somewhere, then compare 2.csv and 3.csv, store the diff somewhere.
for dirs in glob.glob(INPUT_PATH + "*"):
    if os.path.isdir(dirs):
        for files in glob.glob(dirs + '*/' + '/*.csv'):
            ## lists all the csv files, but how to read them to get the difference?
You can use pandas to read each csv as a dataframe into a list, then compare them from that list:
import pandas as pd
dfList = []
dfList.append(pd.read_csv('FilePath'))  # repeat for each file path
dfList[0] contains the content of the first csv file, and so on.
So to compare the first and second csv files, compare dfList[0] and dfList[1].
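For the comparison itself, one option (a sketch continuing the snippet above, assuming both csv files share the same columns) is an outer merge with an indicator column, which flags the rows that appear in only one of the frames:

merged = dfList[0].merge(dfList[1], how='outer', indicator=True)
# '_merge' is 'left_only' for rows only in the first file,
# 'right_only' for rows only in the second
diff = merged[merged['_merge'] != 'both']
print(diff)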
The first function compares 2 files, and the second function creates an additional file with the difference between the 2 files.
def compare(file_compared, file_master):
    """
    A = [100,200,300]
    B = [400,500,100]
    compare(A,B) = [200,300]
    """
    file_compared_list = []
    file_master_list = []
    with open(file_compared, 'r') as fc:
        for line in fc:
            file_compared_list.append(line.strip())
    with open(file_master, 'r') as fm:
        for line in fm:
            file_master_list.append(line.strip())
    return list(set(file_compared_list) - set(file_master_list))

def create_file(filename):
    diff = compare("file1.csv", "file2.csv")
    with open(filename, 'w') as f:
        for element in diff:
            f.write(element + '\n')  # re-add the newline stripped above

create_file("test.csv")
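Note that the set difference is one-directional: compare("file1.csv", "file2.csv") returns only the lines of file1 that are missing from file2. To capture changes in both directions, call it twice with the arguments swapped:

added = compare("file2.csv", "file1.csv")    # lines only in file2
removed = compare("file1.csv", "file2.csv")  # lines only in file1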
I've searched thoroughly and can't quite find the guidance I am looking for on this issue, so I hope this question is not redundant. I have several .csv files that represent raster images. I'd like to perform some statistical analysis on them, so I am trying to create a Pandas dataframe for each file so I can slice 'em, dice 'em, and plot 'em... but I am having trouble looping through the list of files to create a DF with a meaningful name for each one.
Here is what I have so far:
import glob
import os
from pandas import *
#list of .csv files
#I'd like to turn each file into a dataframe
dataList = glob.glob(r'C:\Users\Charlie\Desktop\Qvik\textRasters\*.csv')
#name that I'd like to use for each data frame
nameList = []
for raster in dataList:
    path_list = raster.split(os.sep)
    name = path_list[6][:-4]
    nameList.append(name)
#zip these lists into a dict
dataDct = {}
for k, v in zip(nameList, dataList):
    dataDct[k] = dataDct.get(k, "") + v
dataDct
So now I have a dict where the key is the name I want for each dataframe and the value is the path for read_csv(path):
{'Aspect': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\Aspect.csv',
'Curvature': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\Curvature.csv',
'NormalZ': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\NormalZ.csv',
'Slope': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\Slope.csv',
'SnowDepth': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\SnowDepth.csv',
'Vegetation': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\Vegetation.csv',
'Z': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\Z.csv'}
My instinct was to try variations of this:
for k, v in dataDct.iteritems():
    k = read_csv(v)
but that leaves me with a single dataframe, 'k', filled with the data from the last file read in by the loop.
I'm probably missing something fundamental here but I am starting to spin my wheels on this so I'd thought I'd ask y'all...any ideas are appreciated!
Cheers.
Are you trying to get all of the data frames separately in a dictionary, one data frame per key? If so, this will leave you with the dict you showed, but with each key holding the data from the corresponding file.
dataDct = {}
for k, v in zip(nameList, dataList):
    dataDct[k] = read_csv(v)
So now, you could do this for example:
dataDct['SnowDepth'][['cola','colb']].plot()
It's unclear why you're overwriting your object here; I think you want either a list or a dict of the dfs:
df_list = []
for k, v in dataDct.iteritems():  # on Python 3, use .items()
    df_list.append(read_csv(v))
or
df_dict = {}
for k, v in dataDct.iteritems():
    df_dict[k] = read_csv(v)