merge excel files into one based on specific columns - python

I need to merge multiple Excel files on a specific column. Every file has two columns, id and value, and I need to put the values from all files next to each other in one file. I tried this code:
cwd = os.path.abspath('/path/')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel('/path/' + file), ignore_index=True)
df.head()
df.to_excel('/path/merged.xlsx')
but all the values ended up stacked in a single column:
1 504.0303111
2 1587.678968
3 1437.759643
4 1588.387983
5 1059.194416
1 642.4925851
2 459.3774304
3 1184.210851
4 1660.24336
5 1321.414708
whereas I need the values stored like this:
1 504.0303111 1 670.9609316
2 1587.678968 2 459.3774304
3 1437.759643 3 1184.210851
4 1588.387983 4 1660.24336
5 1059.194416 5 1321.414708

One way is to collect the DataFrames in a list inside the loop and concatenate them along the columns after the loop:
import os
import pandas as pd

cwd = os.path.abspath('/path/')
files = os.listdir(cwd)
tmp = []
for file in files:
    if file.endswith('.xlsx'):
        tmp.append(pd.read_excel(os.path.join(cwd, file)))
# axis=1 places the frames side by side instead of stacking rows
df = pd.concat(tmp, axis=1)
df.to_excel('/path/merged.xlsx')
But I think the following would work better for you, since it doesn't duplicate the id column: it reads the first file, then adds only the value column of each remaining file as a new column of df:
cwd = os.path.abspath('/path/')
files = [file for file in os.listdir(cwd) if file.endswith('.xlsx')]
df = pd.read_excel('/path/' + files[0])
for i, file in enumerate(files[1:], 1):
    # column 1 is the value column; rows are matched by position
    df[f'value{i}'] = pd.read_excel('/path/' + file).iloc[:, 1]
df.to_excel('/path/merged.xlsx')
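Both snippets above line rows up by position, which is only safe if every file lists the same ids in the same order. If that is not guaranteed, joining on the id column may be more robust. This is a minimal sketch, assuming the columns are really named `id` and `value` as described in the question; the helper name `merge_on_id` is hypothetical, and small in-memory frames stand in for the `pd.read_excel` results:

```python
from functools import reduce

import pandas as pd


def merge_on_id(frames):
    """Join frames on 'id', renaming each 'value' column to value0, value1, ..."""
    renamed = [
        frame.rename(columns={"value": f"value{i}"})
        for i, frame in enumerate(frames)
    ]
    # An outer merge keeps ids that are missing from some of the files.
    return reduce(lambda left, right: left.merge(right, on="id", how="outer"), renamed)


# Stand-ins for pd.read_excel(...) results; the second one has its rows in a different order.
first = pd.DataFrame({"id": [1, 2, 3], "value": [504.03, 1587.68, 1437.76]})
second = pd.DataFrame({"id": [3, 1, 2], "value": [1184.21, 642.49, 459.38]})

merged = merge_on_id([first, second])
print(merged)
```

With real files you would build the list with something like `[pd.read_excel(p) for p in paths]` and pass it to the same function.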

Related

Pandas Concat vs append and join columns --> ("state", "state:", "State")

I join 437 tables and end up with three columns for state, because my coworkers give it a different name each day ("state", "state:" and "State"). Is there a way to combine those three columns into a single column called "state"?
Also, my code uses append, which I just saw is deprecated. Will it work the same with concat? Is there a way to make concat give the same results as append?
I tried:
excl_merged.rename(columns={"state:": "state", "State": "state"})
but it doesn't do anything.
The code I use:
# importing the required modules
import glob
import pandas as pd
# specifying the path to csv files
path = "X:/.../Admission_merge"
# csv files in the path
file_list = glob.glob(path + "/*.xlsx")
# list of excel files we want to merge.
# pd.read_excel(file_path) reads the excel
# data into pandas dataframe.
excl_list = []
for file in file_list:
    excl_list.append(pd.read_excel(file))  # use .concat will it give the columns in the same order?
# create a new dataframe to store the
# merged excel file.
excl_merged = pd.DataFrame()
for excl_file in excl_list:
    # appends the data into the excl_merged
    # dataframe.
    excl_merged = excl_merged.append(
        excl_file, ignore_index=True)
# exports the dataframe into excel file with
# specified name.
excl_merged.to_excel('X:/.../Admission_MERGED/total_admission_2021-2023.xlsx', index=False)
print("Merge finished")
Any suggestions on how I can improve it? Also, is there a way to remove unnamed empty columns?
Thanks a lot.
You can use pd.concat:
import pandas as pd

excl_list = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']
state_map = {'state:': 'state', 'State': 'state'}
data = []
for excl_file in excl_list:
    df = pd.read_excel(excl_file)
    # Case where first row is empty
    if df.columns[0].startswith('Unnamed'):
        df.columns = df.iloc[0]
        df = df.iloc[1:]
    # rename returns a new DataFrame, so the result must be assigned
    df = df.rename(columns=state_map)
    data.append(df)
excl_merged = pd.concat(data, ignore_index=True)
# Output
  ID state
0  A     a
1  B     b
2  C     c
3  D     d
4  E     e
5  F     f
6  G     g
7  H     h
8  I     i
file1.xlsx:
  ID State
0  A     a
1  B     b
2  C     c
file2.xlsx:
  ID state
0  D     d
1  E     e
2  F     f
file3.xlsx:
  ID state:
0  G      g
1  H      h
2  I      i
If you have empty columns, you can drop them as you append to the data list: data.append(df.dropna(how='all', axis=1)).
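With 437 files, new spellings of the header may keep appearing, so an explicit rename map can fall behind. One alternative is to normalize every header before concatenating. This is only a sketch: the helper name `normalize_columns` is hypothetical, the frames are in-memory stand-ins for the Excel files, and note that it lower-cases all headers, so `ID` becomes `id` as well:

```python
import pandas as pd


def normalize_columns(df):
    """Return a copy whose headers are stripped, lower-cased, and without a trailing colon."""
    out = df.copy()
    out.columns = out.columns.astype(str).str.strip().str.rstrip(":").str.lower()
    return out


# Stand-ins for three files with differently spelled headers.
frames = [
    pd.DataFrame({"ID": ["A", "B"], "State": ["a", "b"]}),
    pd.DataFrame({"ID": ["C", "D"], "state": ["c", "d"]}),
    pd.DataFrame({"ID": ["E", "F"], "state:": ["e", "f"]}),
]

merged = pd.concat([normalize_columns(f) for f in frames], ignore_index=True)
print(merged)
```

Because the headers match after normalization, `pd.concat` stacks the values into a single `state` column instead of creating three partly empty ones.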

How to "join"? multiple dataframes to one in Python

I have 100 excel files formatted as dataframes like this with varying names:
Key  Name
1    X
2    Y
And:
Key  Name
1    Z
2    A
I have one main file formatted like this:
Index  Key
0      1
1      2
I'd like to merge the 100 files to the one dataframe in a 'messy' way.
Creating something that looks like this:
Key  Name  Name
1    X     Z
2    Y     A
How would I write a loop to accomplish this?
You can do it this way:
import glob
import pandas as pd

path = r"C:\Users\.......\testing_read"
files = glob.glob(path + "/*.csv")
content = []
for filename in files:
    df = pd.read_csv(filename, sep=";", index_col='Key')
    content.append(df)
# axis=1 aligns the frames on Key and places them side by side
df = pd.concat(content, axis=1)
which returns what you wanted:
    Name Name
Key
1      X    Z
2      Y    A
However, this is not a very good idea if you end up with 100 columns named Name.
If I were you, I'd stack the files on top of each other instead:
import glob
import pandas as pd

path = r"C:\Users\.....\testing_read"
files = glob.glob(path + "/*.csv")
content = []
for filename in files:
    df = pd.read_csv(filename, sep=";", index_col='Key')
    content.append(df)
# the default axis=0 stacks the rows of every file
df = pd.concat(content)
which returns:
    Name
Key
1      X
2      Y
1      Z
2      A
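If you do want the wide layout, the duplicate column names can be avoided by labelling each frame with its source when concatenating. A rough sketch, assuming each file has already been read with `index_col='Key'`; the labels `file_a` and `file_b` are made up and would normally be derived from the file names:

```python
import pandas as pd

# Stand-ins for two of the files, already indexed by Key.
frames = {
    "file_a": pd.DataFrame({"Name": ["X", "Y"]}, index=pd.Index([1, 2], name="Key")),
    "file_b": pd.DataFrame({"Name": ["Z", "A"]}, index=pd.Index([1, 2], name="Key")),
}

# Passing a dict makes its keys the outer level of the column MultiIndex.
wide = pd.concat(frames, axis=1)
# Flatten ("file_a", "Name") to "Name_file_a" so every column name is unique.
wide.columns = [f"{col}_{source}" for source, col in wide.columns]
print(wide)
```

Each column can then be traced back to the file it came from, which the plain `axis=1` version cannot do.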

How to read several files over a loop as a table (panda) and pick one column from each table and append it together

I have several data files, each with two columns. Column 1 holds the same data in every file, while column 2 changes from file to file. I want to build a matrix or table of the form below and then carry on with other functions. Would np.loadtxt be easier or better than pandas?
column_1  col_2(file1)  col_3(file2)  ...  col_n(file-n)
1.        1             3             ...
2.        3             32
3         4             2
4         5             9
5         2             5
For now I have this:
for i in range(0,3):
    file = file_name + '%d' %i+'.dat'
    print(file)
    f=open(file, 'r')
    tble = pd.read_table(f, sep='\s+',skiprows= 15, header=None)
    time=tble[0]
    inten=tble[1]
but merge, append don't seem to work
tble['inten'] = pd.Series(inten, index=tble.index)
I would read each data file into its own dataframe and then concatenate the second columns:
import pandas as pd

tbls = []
for i in range(0, 3):
    file = file_name + '%d' % i + '.dat'
    print(file)
    # passing the path lets pandas open and close the file itself
    tble = pd.read_table(file, sep=r'\s+', skiprows=15, header=None)
    tbls.append(tble)
# keep both columns of the first table, then only column 1 of the others
df = pd.concat([tbls[0]] + [tble.iloc[:, 1] for tble in tbls[1:]], axis=1)
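A variation on the same idea is to use the shared first column as the index and name each data column after its file, so the result is already labelled. The sketch below uses made-up file contents held in strings (via `io.StringIO`) in place of the real `.dat` files, and leaves out the `skiprows=15` header handling:

```python
import io

import pandas as pd

# Hypothetical file contents: two whitespace-separated columns, no header row.
samples = {
    "file0": "1 3\n2 32\n3 2\n",
    "file1": "1 7\n2 8\n3 9\n",
}

columns = []
for name, text in samples.items():
    table = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, index_col=0)
    # Keep the single data column as a Series named after its source.
    columns.append(table[1].rename(name))

# concat aligns the Series on the index, i.e. on the shared first column.
matrix = pd.concat(columns, axis=1)
print(matrix)
print(matrix.to_numpy())
```

If later steps need a plain NumPy array, `matrix.to_numpy()` provides one, so starting from pandas does not rule out NumPy-based processing.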

Adding a column to dataframe while reading csv files [pandas]

I'm reading multiple csv files and combining them into a single dataframe like below:
pd.concat([pd.read_csv(f, encoding='latin-1') for f in glob.glob('*.csv')],
          ignore_index=False, sort=False)
Problem:
For every csv file that gets concatenated, I want to add a column to the dataframe that doesn't exist in any of the files and is derived from the csv file name. Any help will be appreciated.
glob.glob returns a list of plain strings, so you can add a column to each individual dataframe in a loop.
Assuming you have files df1.csv and df2.csv in your directory:
import glob
import pandas as pd

files = glob.glob('df*.csv')
dfs = []
for file in files:
    df = pd.read_csv(file)
    # tag every row with the file it came from
    df['filename'] = file
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df
   a  b filename
0  1  2  df1.csv
1  3  4  df1.csv
2  5  6  df2.csv
3  7  8  df2.csv
I have multiple csv files in my local directory. Each filename contains some digits, some of which identify the year the file covers. I need to add a year column to each file as I concatenate them, taking the year from the filename. I use a regex to extract the two-digit year and prefix it with '20' (for example, '20' + '11' gives '2011'), then set the column's data type to int32.
import glob
import re
import pandas as pd

pd.concat(
    [
        pd.read_csv(f)
        .assign(year='20' + re.search('[a-z]+(?P<year>[0-9]{2})', f).group('year'))
        .astype({'year': 'int32'})
        for f in glob.glob('stateoutflow*[0-9].csv')
    ],
    ignore_index=True
)
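The one-liner above raises an AttributeError if a file name does not match the pattern, because `re.search` returns `None`. Pulling the extraction into a small function makes that case explicit and easy to check on its own. This is a sketch: the function name `year_from_filename` and the sample file names are made up, and it keeps the answer's assumption that all years fall in the 2000s:

```python
import re

# Same pattern as above: lowercase letters followed by a two-digit year.
YEAR_PATTERN = re.compile(r"[a-z]+(?P<year>[0-9]{2})")


def year_from_filename(filename):
    """Return the four-digit year encoded in the file name, or None if absent."""
    match = YEAR_PATTERN.search(filename)
    if match is None:
        return None
    # Assumption: every file is from the 2000s, as in the answer above.
    return 2000 + int(match.group("year"))


print(year_from_filename("stateoutflow1112.csv"))
print(year_from_filename("notes.csv"))
```

Inside the list comprehension this would be used as `pd.read_csv(f).assign(year=year_from_filename(f))`, after deciding how to treat files for which it returns `None`.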

Create a new dataframe out of dozens of df.sum() series

I have several pandas DataFrames of the same format, with five columns.
I would like to sum the values of each of these dataframes using df.sum(). This creates a Series for each DataFrame, with one entry per column (five in total).
My problem is how to take these Series and create another DataFrame, with one column holding the filename and the other columns holding the five sums from df.sum().
import pandas as pd
import glob
batch_of_dataframes = glob.glob("*.txt")
newdf = []
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf.append(df)
newdf = pd.concat(newdf, ignore_index=True)
Unfortunately, this approach doesn't work. 'df['filename'] = str(filename)' throws a TypeError, and creating the new dataframe newdf doesn't work correctly either.
How would one do this correctly?
How do you take a number of pandas.Series objects and create a DataFrame?
Try in this order:
1. Create an empty list, say list_of_series.
2. For every file:
   - load it into a data frame, then save the sum in a series s
   - add an element to s: s['filename'] = your_filename
   - append s to list_of_series
3. Finally, concatenate (and transpose if needed):
final_df = pd.concat(list_of_series, axis = 1).T
Code
Preparation:
import glob
import numpy as np
import pandas as pd

l_df = [pd.DataFrame(np.random.rand(3,5), columns = list("ABCDE")) for _ in range(5)]
for i, df in enumerate(l_df):
    df.to_csv(str(i)+'.txt', index = False)
The *.txt files are comma-separated and contain headers.
! cat 1.txt
A,B,C,D,E
0.18021800981245173,0.29919271590063656,0.09527248614484807,0.9672038093199938,0.07655003742768962
0.35422759068109766,0.04184770882952815,0.682902924462214,0.9400817219440063,0.8825581077493059
0.3762875793116358,0.4745731412494566,0.6545473610147845,0.7479829630649761,0.15641907539706779
The rest is quite similar to what you did, except that I add the file name to the summed series rather than to the data frame; otherwise sum() concatenates the file-name string once per row:
files = glob.glob('*.txt')
print(files)
# ['3.txt', '0.txt', '4.txt', '2.txt', '1.txt']
list_of_series = []
for f in files:
    df = pd.read_csv(f)
    s = df.sum()
    s['filename'] = f
    list_of_series.append(s)
final_df = pd.concat(list_of_series, axis = 1).T
print(final_df)
          A         B          C        D        E  filename
0    1.0675   2.20957    1.65058  1.80515  2.22058     3.txt
1  0.642805   1.36248  0.0237625  1.87767  1.63317     0.txt
2   1.68678   1.26363   0.835245  2.05305  1.01829     4.txt
3   1.22748   2.09256   0.785089  1.87852  2.05043     2.txt
4  0.910733  0.815614    1.43272  2.65527  1.11553     1.txt
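One side effect of storing the file name inside each Series is that the Series becomes object dtype, so the columns A to E of final_df are object dtype too. If you want the sums to stay numeric, a possible alternative is to keep the file names as dictionary keys and turn them into a column only at the end. A minimal sketch with small in-memory frames standing in for the files (the file names are placeholders):

```python
import pandas as pd

# Stand-ins for frames read from two files; the file names are hypothetical.
frames = {
    "0.txt": pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}),
    "1.txt": pd.DataFrame({"A": [5.0, 6.0], "B": [7.0, 8.0]}),
}

# One Series of column sums per file; after transposing, the dict keys are the row index.
sums = pd.DataFrame({name: frame.sum() for name, frame in frames.items()}).T
# Move the file names from the index into a regular 'filename' column.
final_df = sums.rename_axis("filename").reset_index()
print(final_df)
print(final_df.dtypes)
```

With real files the dictionary would be built as `{f: pd.read_csv(f) for f in glob.glob('*.txt')}`.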
To answer this specific question:
@ThomasTu How do I go from a list of Series with 'Filename' as a
column to a dataframe? I think that's the problem---I don't understand
this
It's essentially what you have now, but instead of appending to an empty list, you add each row to an initially empty dataframe. DataFrame.append has no inplace option and was removed in pandas 2.0, so the version below reassigns newdf with pd.concat on each iteration. The file name is added after summing, which avoids summing a string column.
import pandas as pd
import glob
batch_of_dataframes = glob.glob("*.txt")
newdf = pd.DataFrame()
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    s = df.sum()
    s['filename'] = str(filename)
    # turn the Series into a one-row frame and add it to newdf
    newdf = pd.concat([newdf, s.to_frame().T], ignore_index=True)
