I have 100 excel files formatted as dataframes like this with varying names:
Key  Name
1    X
2    Y
And:
Key  Name
1    Z
2    A
I have one main file formatted like this:
Index  Key
0      1
1      2
I'd like to merge the 100 files onto the one main dataframe in a 'messy' way, creating something that looks like this:
Key  Name  Name
1    X     Z
2    Y     A
How would I write a loop to accomplish this?
You can do it this way:
import glob
import pandas as pd

path = r"C:\Users\.......\testing_read"
files = glob.glob(path + "/*.csv")

content = []
for filename in files:
    df = pd.read_csv(filename, sep=";", index_col='Key')
    content.append(df)

df = pd.concat(content, axis=1)
which returns what you wanted.
Name Name
Key
1 X Z
2 Y A
However, this is not a very good idea if you end up with 100 columns named Name.
If I were you I'd do this:
path = r"C:\Users\.....\testing_read"
files = glob.glob(path + "/*.csv")

content = []
for filename in files:
    df = pd.read_csv(filename, sep=";", index_col='Key')
    content.append(df)

df = pd.concat(content)
which returns:
Name
Key
1 X
2 Y
1 Z
2 A
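As an aside, if you do want the wide layout from the question but with column names you can tell apart, you can pass the file names as keys to pd.concat so every Name column is labelled with its source file. A minimal sketch, assuming the bare file name (without extension) is a reasonable label:
import os
import glob
import pandas as pd

path = r"C:\Users\.......\testing_read"
files = glob.glob(path + "/*.csv")

content = []
keys = []
for filename in files:
    content.append(pd.read_csv(filename, sep=";", index_col='Key'))
    # use the bare file name (no directory, no extension) as the label
    keys.append(os.path.splitext(os.path.basename(filename))[0])

# columns become a MultiIndex like ('file1', 'Name'), ('file2', 'Name'), ...
df = pd.concat(content, axis=1, keys=keys)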
I join 437 tables and end up with 3 columns for state, because my coworkers feel like giving it a different name each day ("state", "state:" and "State"). Is there a way to join those 3 columns into just 1 column called "state"?
Also, my code uses append, and I just saw it is deprecated. Will it work the same using concat? Is there any way to make it give the same results as append?
I tried:
excl_merged.rename(columns={"state:": "state", "State": "state"})
but it doesn't do anything.
The code I use:
# importing the required modules
import glob
import pandas as pd

# specifying the path to the Excel files
path = "X:/.../Admission_merge"

# Excel files in the path
file_list = glob.glob(path + "/*.xlsx")

# list of excel files we want to merge.
# pd.read_excel(file_path) reads the excel
# data into a pandas dataframe.
excl_list = []
for file in file_list:
    excl_list.append(pd.read_excel(file))  # if I use .concat, will it give the columns in the same order?

# create a new dataframe to store the
# merged excel file.
excl_merged = pd.DataFrame()
for excl_file in excl_list:
    # appends the data into the excl_merged
    # dataframe.
    excl_merged = excl_merged.append(excl_file, ignore_index=True)

# exports the dataframe into an excel file with
# the specified name.
excl_merged.to_excel('X:/.../Admission_MERGED/total_admission_2021-2023.xlsx', index=False)
print("Merge finished")
Any suggestions on how I can improve it? Also, is there a way to remove unnamed empty columns?
Thanks a lot.
You can use pd.concat:
import pandas as pd

excl_list = ['state1.xlsx', 'state2.xlsx', 'state3.xlsx']
state_map = {'state:': 'state', 'State': 'state'}

data = []
for excl_file in excl_list:
    df = pd.read_excel(excl_file)
    # Case where the header row is empty and the real header is in the first row
    if df.columns[0].startswith('Unnamed'):
        df.columns = df.iloc[0]
        df = df.iloc[1:]
    df = df.rename(columns=state_map)
    data.append(df)

excl_merged = pd.concat(data, ignore_index=True)
# Output
ID state
0 A a
1 B b
2 C c
3 D d
4 E e
5 F f
6 G g
7 H h
8 I i
file1.xlsx:
ID State
0 A a
1 B b
2 C c
file2.xlsx:
ID state
0 D d
1 E e
2 F f
file3.xlsx:
ID state:
0 G g
1 H h
2 I i
If you have empty columns, you can drop them before appending to the data list: data.append(df.dropna(how='all', axis=1)).
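As a side note, the rename in the question appears to do nothing because DataFrame.rename returns a new DataFrame by default, so the result has to be assigned back (or inplace=True passed). A minimal sketch:
# rename returns a new DataFrame; assign it back for the change to persist
excl_merged = excl_merged.rename(columns={"state:": "state", "State": "state"})
# or, equivalently, modify in place
excl_merged.rename(columns={"state:": "state", "State": "state"}, inplace=True)
Note that renaming before concatenating, as in the loop above, is what actually collapses the three variants into a single state column; renaming after the concat would just leave several columns that all happen to be called state.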
I need to merge multiple Excel files based on a specific column. Every file has two columns, id and value, and I need to merge all the values from all files into one file, next to each other. I tried this code, but it stacked everything instead:
import os
import pandas as pd

cwd = os.path.abspath('/path/')
files = os.listdir(cwd)

df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel('/path/' + file), ignore_index=True)

df.head()
df.to_excel('/path/merged.xlsx')
but I got all the values stacked in a single column like this:
1 504.0303111
2 1587.678968
3 1437.759643
4 1588.387983
5 1059.194416
1 642.4925851
2 459.3774304
3 1184.210851
4 1660.24336
5 1321.414708
and I need the values stored like this:
1 504.0303111 1 670.9609316
2 1587.678968 2 459.3774304
3 1437.759643 3 1184.210851
4 1588.387983 4 1660.24336
5 1059.194416 5 1321.414708
One way is to append the DataFrames to a list in a loop and concatenate along the columns after the loop:
import os
import pandas as pd

cwd = os.path.abspath('/path/')
files = os.listdir(cwd)

tmp = []
for file in files:
    if file.endswith('.xlsx'):
        tmp.append(pd.read_excel('/path/' + file))

df = pd.concat(tmp, axis=1)
df.to_excel('/path/merged.xlsx')
But I feel like the following code would work better for you, since it doesn't duplicate the id columns and only adds the value columns as new columns to a DataFrame df in the loop:
cwd = os.path.abspath('/path/')
files = [file for file in os.listdir(cwd) if file.endswith('.xlsx')]

df = pd.read_excel('/path/' + files[0])
for i, file in enumerate(files[1:], 1):
    df[f'value{i}'] = pd.read_excel('/path/' + file).iloc[:, 1]

df.to_excel('/path/merged.xlsx')
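If the files are not guaranteed to list the ids in the same order, a safer variant is to align the rows on the id column with merge instead of relying on position. A sketch, assuming the key column is literally named id:
import os
import pandas as pd

cwd = os.path.abspath('/path/')
files = [file for file in os.listdir(cwd) if file.endswith('.xlsx')]

# start from the first file, then merge each remaining file on the id column,
# suffixing the value columns so they stay distinguishable
merged = pd.read_excel('/path/' + files[0])
for i, file in enumerate(files[1:], 1):
    nxt = pd.read_excel('/path/' + file)
    merged = merged.merge(nxt, on='id', how='outer', suffixes=('', f'_{i}'))

merged.to_excel('/path/merged.xlsx', index=False)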
I want to combine all the csv files in one folder. This works as intended:
import os
import glob
import pandas as pd
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], axis = 1)
#export to csv
combined_csv.to_csv( "combined.matrix", index=False)
However, I would like to add the filename without extension as a header.
File1.csv
A,B
1,2
3,4
File2.csv
A,B
5,6
combined.matrix
File1,File1,File2,File2
A,B,A,B
1,2,5,6
3,4,,
Try the below code:
import pandas as pd
all_filenames = ['File1.csv', 'File2.csv']
headers = []
for i in all_filenames:
    headers.append(i.replace('.csv', ''))

combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], keys=headers, axis=1)
This creates a header list with the file names excluding the extension, then passes that list to the keys argument of pd.concat.
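Because keys is used, the columns of combined_csv form a two-level MultiIndex (file name on top, original column name underneath), so writing it out produces the two header rows shown in the desired combined.matrix. A minimal usage sketch:
# columns are now ('File1', 'A'), ('File1', 'B'), ('File2', 'A'), ('File2', 'B')
# to_csv writes both header levels, giving the File1,File1,File2,File2 / A,B,A,B layout
combined_csv.to_csv("combined.matrix", index=False)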
The basic idea is that you can include the file names somewhere in the DataFrame itself (in this case in the column names; you could probably include them in a row as well), since you are exporting it to csv for further processing anyway.
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# This takes the value ["file1.csv", "file2.csv"]
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], axis = 1)
# This looks like
# A B A B
# 1 2 5 6
# 3 4 nan nan
As the column names are fixed (A and B) and you are more interested in the file names, you can change the columns with:
combined_csv.columns = sorted(all_filenames * (len(combined_csv.columns) // len(all_filenames)))
# This evaluates to sorted(["file1.csv", "file2.csv"] * (4 // 2)), which is equal to ["file1.csv", "file1.csv", "file2.csv", "file2.csv"]
And now your dataframe would look like this, which indicates which column is from which file:
# file1.csv file1.csv file2.csv file2.csv
# 1 2 5 6
# 3 4 nan nan
which you can then export to combined.matrix.
import os
import pandas as pd

parent_dir = 'YOUR_PARENT_DIRECTORY_PATH'
ext = 'csv'

combined_csv = pd.DataFrame()
for root, dirs, files in os.walk(parent_dir):
    for f in files:
        path = os.path.join(root, f)
        filename, extension = os.path.splitext(f)
        if extension == f'.{ext}':
            new_df = pd.read_csv(path)
            new_cols = []
            for c in new_df.columns:
                new_cols.append(f'{filename}{c}')
            new_df.columns = new_cols
            combined_csv = pd.concat([combined_csv, new_df], axis=1)

combined_csv.to_csv("combined.matrix", index=False)
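Note that calling pd.concat inside the loop copies the growing DataFrame on every iteration, which gets slow with many files. A variant of the same walk-based approach that collects the frames in a list and concatenates once at the end (a sketch, using the same parent_dir placeholder):
import os
import pandas as pd

parent_dir = 'YOUR_PARENT_DIRECTORY_PATH'
ext = 'csv'

frames = []
for root, dirs, files in os.walk(parent_dir):
    for f in files:
        filename, extension = os.path.splitext(f)
        if extension == f'.{ext}':
            new_df = pd.read_csv(os.path.join(root, f))
            # prefix every column with the file name so the source stays visible
            new_df.columns = [f'{filename}{c}' for c in new_df.columns]
            frames.append(new_df)

combined_csv = pd.concat(frames, axis=1)
combined_csv.to_csv("combined.matrix", index=False)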
I have a lot of csv files to open, and I need to add an extra column with the name of each file. For example, I have x.csv, y.csv, z.csv and so on. Inside a csv file it looks like this:
X Z
1 3
4 5
4 6
And it should look like this
X Z name
1 3 x
4 5 x
4 6 x
4 5 y
4 5 y
1 2 y
My code is below but it returns only 1 value...
import pandas as pd
import os
import rglob
file_list = rglob.rglob("path", "*")
li = []
for path in file_list:
    df = pd.read_csv(path, index_col=None, header=0)
    file_name = os.listdir('path')[0]
    df["file_name"] = file_name
    li.append(df)
Any idea how I could fix it?
Best regards
Your os.listdir is wrong: os.listdir returns a list of the files in a directory. You should be using os.path.basename or pathlib.Path.name.
With pathlib:
import pandas as pd
from pathlib import Path
file_list = Path("path").rglob("*.csv")
li = []
for path in file_list:
    df = pd.read_csv(path, index_col=None, header=0)
    df["file_name"] = path.name
    li.append(df)
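To end up with a single DataFrame like the expected output, the list still has to be concatenated once after the loop. A minimal follow-up:
# stack all per-file frames into one, renumbering the rows
combined = pd.concat(li, ignore_index=True)
If the extension should be dropped (the expected output shows x rather than x.csv), path.stem gives the file name without its suffix and can be used in place of path.name inside the loop.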
I am trying to read multiple excel files into a data frame, but I can't seem to find a way to keep the file name as a column to reference where it came from. Also, I need to filter on the name of the excel file and the date created before I do read_excel (there are so many files that I do not want to read them if I don't need to). This is what I have:
res = []
for root, dirs, files in os.walk('.../Minutes/', topdown=True):
    if len(files) > 0:
        res.extend(zip([root]*len(files), files))

df = pd.DataFrame(res, columns=['Path', 'File_Name'])
df['FullDir'] = df.Path + '\\' + df.File_Name

list_ = []
for f in df["FullDir"]:
    data = pd.read_excel(f, sheet_name=1)
    list_.append(data)

df2 = pd.concat(list_)
df2
What I would like as an output
A B filename File Date Created
0 a a File1 1-1-2018
1 b b File1 1-1-2018
2 c c File2 2-1-2018
3 a a File2 2-1-2018
Any help would be greatly appreciated!!
You can use concat with keys, then reset_index:
res = []
for root, dirs, files in os.walk('.../Minutes/', topdown=True):
    if len(files) > 0:
        res.extend(zip([root]*len(files), files))

df = pd.DataFrame(res, columns=['Path', 'File_Name'])
df['FullDir'] = df.Path + '\\' + df.File_Name
Assuming the above code works as expected:
list_ = []
for f in df["FullDir"]:
    data = pd.read_excel(f, sheet_name=1)
    list_.append(data)

df2 = pd.concat(list_, keys=df.File_Name.values.tolist()).reset_index(level=0)
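The question also asks to keep the file's creation date and to filter on name and date before reading. One way to sketch that, with a hypothetical name substring and cutoff date, using os.path.getctime for the creation time:
import os
import datetime
import pandas as pd

# hypothetical filters: only read files whose name contains this substring
# and that were created on or after this date
name_filter = 'Minutes'
cutoff = datetime.datetime(2018, 1, 1)

df['Created'] = df['FullDir'].map(
    lambda p: datetime.datetime.fromtimestamp(os.path.getctime(p)))
wanted = df[df['File_Name'].str.contains(name_filter) & (df['Created'] >= cutoff)]

list_ = []
for f, name, created in zip(wanted['FullDir'], wanted['File_Name'], wanted['Created']):
    data = pd.read_excel(f, sheet_name=1)
    data['filename'] = name
    data['File Date Created'] = created
    list_.append(data)

df2 = pd.concat(list_, ignore_index=True)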