I join 437 tables and get 3 columns for state, as my coworkers feel like giving it a different name each day ("state", "state:" and "State"). Is there a way to merge those 3 columns into just 1 column called "state"?
Also, my code uses append, and I just saw it's deprecated. Will it work the same using concat? Is there any way to make it give the same results as append?
I tried:
excl_merged.rename(columns={"state:": "state", "State": "state"})
but it doesn't do anything.
The code I use:
# importing the required modules
import glob
import pandas as pd

# specifying the path to the Excel files
path = "X:/.../Admission_merge"

# Excel files in the path
file_list = glob.glob(path + "/*.xlsx")

# list of Excel files we want to merge;
# pd.read_excel(file_path) reads the Excel
# data into a pandas dataframe.
excl_list = []
for file in file_list:
    excl_list.append(pd.read_excel(file))  # would .concat give the columns in the same order?

# create a new dataframe to store the
# merged Excel data.
excl_merged = pd.DataFrame()
for excl_file in excl_list:
    # appends the data into the excl_merged
    # dataframe.
    excl_merged = excl_merged.append(excl_file, ignore_index=True)

# exports the dataframe into an Excel file with
# the specified name.
excl_merged.to_excel('X:/.../Admission_MERGED/total_admission_2021-2023.xlsx', index=False)
print("Merge finished")
Any suggestions on how I can improve it? Also, is there a way to remove unnamed empty columns?
Thanks a lot.
You can use pd.concat:
excl_list = ['state1.xlsx', 'state2.xlsx', 'state3.xlsx']
state_map = {'state:': 'state', 'State': 'state'}

data = []
for excl_file in excl_list:
    df = pd.read_excel(excl_file)
    # Case where the first row is empty
    if df.columns[0].startswith('Unnamed'):
        df.columns = df.iloc[0]
        df = df.iloc[1:]
    df = df.rename(columns=state_map)
    data.append(df)
excl_merged = pd.concat(data, ignore_index=True)
Output:
ID state
0 A a
1 B b
2 C c
3 D d
4 E e
5 F f
6 G g
7 H h
8 I i
where the input files are:
file1.xlsx:
ID State
0 A a
1 B b
2 C c
file2.xlsx:
ID state
0 D d
1 E e
2 F f
file3.xlsx:
ID state:
0 G g
1 H h
2 I i
If you have empty columns, you can drop them before appending to the data list with data.append(df.dropna(how='all', axis=1)).
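A side note on the rename attempt from the question: DataFrame.rename returns a new DataFrame and leaves the original untouched unless you assign the result back (or pass inplace=True), which is why it appeared to do nothing:

excl_merged = excl_merged.rename(columns={"state:": "state", "State": "state"})  # assign the result back

Even with the assignment, renaming after the merge would leave three separate columns that all happen to be named "state". That is why the loop above renames per file, before pd.concat, so the values land in a single column.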
Related
I need to merge multiple Excel files based on a specific column. Every file has two columns, id and value, and I need to merge all values from all files into one file, next to each other. I tried this code, but it merged all the columns:
cwd = os.path.abspath('/path/')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel('/path/' + file), ignore_index=True)
df.head()
df.to_excel('/path/merged.xlsx')
but got all the values stacked in a single column, like:
1 504.0303111
2 1587.678968
3 1437.759643
4 1588.387983
5 1059.194416
1 642.4925851
2 459.3774304
3 1184.210851
4 1660.24336
5 1321.414708
and I need the values stored next to each other, like:
1 504.0303111 1 670.9609316
2 1587.678968 2 459.3774304
3 1437.759643 3 1184.210851
4 1588.387983 4 1660.24336
5 1059.194416 5 1321.414708
One way is to append the DataFrames to a list in the loop and concatenate them along the columns after the loop:
cwd = os.path.abspath('/path/')
files = os.listdir(cwd)

tmp = []
for file in files:
    if file.endswith('.xlsx'):
        tmp.append(pd.read_excel('/path/' + file))
df = pd.concat(tmp, axis=1)
df.to_excel('/path/merged.xlsx')
But I feel like the following code would work better for you, since it doesn't duplicate the id columns and only adds the value columns as new columns to a DataFrame df in the loop:
cwd = os.path.abspath('/path/')
files = [file for file in os.listdir(cwd) if file.endswith('.xlsx')]

df = pd.read_excel('/path/' + files[0])
for i, file in enumerate(files[1:], 1):
    df[f'value{i}'] = pd.read_excel('/path/' + file).iloc[:, 1]
df.to_excel('/path/merged.xlsx')
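If the rows are not guaranteed to be in the same order in every file, aligning on the shared column with pd.merge is safer than positional concatenation. A minimal sketch, assuming the shared column is literally named id:

import os
import pandas as pd

cwd = os.path.abspath('/path/')
files = [file for file in os.listdir(cwd) if file.endswith('.xlsx')]

merged = pd.read_excel('/path/' + files[0])
for i, file in enumerate(files[1:], 1):
    # align each new value column on id rather than on row position
    merged = merged.merge(pd.read_excel('/path/' + file),
                          on='id', how='outer', suffixes=('', f'_{i}'))
merged.to_excel('/path/merged.xlsx', index=False)

With how='outer', ids that are missing from some files are kept and padded with NaN instead of being dropped.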
I have an excel file containing a column of 'Usernames' and I want to copy-paste that data into the adjacent column in the same sheet and call it 'Passwords'. All this must be done in a Python program.
You can try pandas.ExcelWriter:
import pandas as pd

# read first, then write back to the same file through the writer
df = pd.read_excel('testsheet.xlsx')
df['Passwords'] = df['Usernames']
with pd.ExcelWriter('testsheet.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, index=False)
Alternatively, you can try a simpler solution:
df = pd.read_excel('testsheet.xlsx')
df['Passwords'] = df['Usernames']
df.to_excel("testsheet.xlsx", index=False)
You can use pandas.DataFrame.copy. Suppose this is your dataframe:
import pandas as pd
df = pd.read_excel('username.xlsx')
df
This gives you:
username
0 a
1 b
2 c
3 d
Then create another dataframe:
df1 = df.copy()
Then, after copying the content, create a column named 'password' in df and set it to the copied column:
df['password'] = df1['username']
df
This gives you:
username password
0 a a
1 b b
2 c c
3 d d
Then save it to Excel:
df.to_excel('username.xlsx', index=False)
I'm reading multiple csv files and combining them into a single dataframe like below:
pd.concat([pd.read_csv(f, encoding='latin-1') for f in glob.glob('*.csv')],
          ignore_index=False, sort=False)
Problem:
I want to add a column (one that doesn't exist in any csv) to the dataframe, based on the csv file name, for every csv file that gets concatenated. Any help will be appreciated.
glob.glob returns plain strings (the file paths), so you can just add a column to every individual dataframe in a loop.
Assuming you have files df1.csv and df2.csv in your directory:
import glob
import pandas as pd

files = glob.glob('df*csv')

dfs = []
for file in files:
    df = pd.read_csv(file)
    df['filename'] = file
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df
a b filename
0 1 2 df1.csv
1 3 4 df1.csv
2 5 6 df2.csv
3 7 8 df2.csv
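The loop can also be collapsed into a single expression with DataFrame.assign, which adds the column to each frame before concatenating:

import glob
import pandas as pd

df = pd.concat(
    [pd.read_csv(file).assign(filename=file) for file in glob.glob('df*csv')],
    ignore_index=True
)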
I have multiple csv files in my local directory. Each filename contains some numbers, and some of those numbers identify the year the file is for. I need to add a year column to each file as I concatenate it, pulling the year information out of the filename and inserting it into that column. I'm using a regex to extract the two-digit year and prepend '20' to it, like 20 + 11 = 2011. Then I set the column's data type to int32.
import glob
import re

import pandas as pd

pd.concat(
    [
        pd.read_csv(f)
          .assign(year='20' + re.search('[a-z]+(?P<year>[0-9]{2})', f).group('year'))
          .astype({'year': 'int32'})
        for f in glob.glob('stateoutflow*[0-9].csv')
    ],
    ignore_index=True
)
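For reference, the named group in the pattern captures the two digits that follow the run of letters in the file name. A quick check with a sample name that matches the glob pattern above:

import re

m = re.search('[a-z]+(?P<year>[0-9]{2})', 'stateoutflow11.csv')
print('20' + m.group('year'))  # prints 2011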
I have several pandas DataFrames of the same format, with five columns.
I would like to sum the values of each of these dataframes using df.sum(). This creates a Series for each DataFrame, still indexed by the five columns.
My problem is how to take these Series and create another DataFrame, with one column holding the filename and the other columns holding the five sums from df.sum().
import pandas as pd
import glob

batch_of_dataframes = glob.glob("*.txt")

newdf = []
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf.append(df)
newdf = pd.concat(newdf, ignore_index=True)
This approach doesn't work, unfortunately: df['filename'] = str(filename) throws a TypeError, and the new dataframe newdf doesn't come out right.
How would one do this correctly?
How do you take a number of pandas.Series objects and create a DataFrame?
Try in this order:
Create an empty list, say list_of_series.
For every file: load it into a data frame, then save the sum in a Series s; add the filename as an element of s with s['filename'] = your_filename; append s to list_of_series.
Finally, concatenate (and transpose if needed):
final_df = pd.concat(list_of_series, axis=1).T
Code
Preparation:
import glob

import numpy as np
import pandas as pd

l_df = [pd.DataFrame(np.random.rand(3, 5), columns=list("ABCDE")) for _ in range(5)]
for i, df in enumerate(l_df):
    df.to_csv(str(i) + '.txt', index=False)
The *.txt files are comma-separated and contain headers.
! cat 1.txt
A,B,C,D,E
0.18021800981245173,0.29919271590063656,0.09527248614484807,0.9672038093199938,0.07655003742768962
0.35422759068109766,0.04184770882952815,0.682902924462214,0.9400817219440063,0.8825581077493059
0.3762875793116358,0.4745731412494566,0.6545473610147845,0.7479829630649761,0.15641907539706779
And, indeed, the rest is quite similar to what you did (I append the file name to the summed Series, not to the data frame; otherwise the filename string gets concatenated with itself by sum()):
files = glob.glob('*.txt')
print(files)
['3.txt', '0.txt', '4.txt', '2.txt', '1.txt']

list_of_series = []
for f in files:
    df = pd.read_csv(f)
    s = df.sum()
    s['filename'] = f
    list_of_series.append(s)
final_df = pd.concat(list_of_series, axis=1).T
print(final_df)
A B C D E filename
0 1.0675 2.20957 1.65058 1.80515 2.22058 3.txt
1 0.642805 1.36248 0.0237625 1.87767 1.63317 0.txt
2 1.68678 1.26363 0.835245 2.05305 1.01829 4.txt
3 1.22748 2.09256 0.785089 1.87852 2.05043 2.txt
4 0.910733 0.815614 1.43272 2.65527 1.11553 1.txt
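One caveat with this approach: because each Series mixes floats with a filename string, the transposed frame stores every column with object dtype, so you may want to convert the numeric columns back afterwards, e.g.:

final_df[list('ABCDE')] = final_df[list('ABCDE')].astype(float)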
To answer this specific question:
@ThomasTu: How do I go from a list of Series with 'Filename' as a column to a dataframe? I think that's the problem; I don't understand this.
It's essentially what you have now, but instead of appending to an empty list, you append to an empty dataframe. (DataFrame.append has no inplace keyword, so newdf does have to be reassigned on each iteration.)
import pandas as pd
import glob

batch_of_dataframes = glob.glob("*.txt")

newdf = pd.DataFrame()
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf = newdf.append(df, ignore_index=True)
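Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea has to be written with pd.concat, here combined with the put-the-filename-on-the-Series fix from the answer above:

import glob
import pandas as pd

batch_of_dataframes = glob.glob("*.txt")

rows = []
for filename in batch_of_dataframes:
    s = pd.read_csv(filename).sum()
    s['filename'] = filename  # set the name on the summed Series, not on the frame
    rows.append(s)
newdf = pd.concat(rows, axis=1).T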
Hello all… a question on using pandas to combine Excel spreadsheets.
The problem is that the sequence of columns is lost when they are combined; the more files there are to combine, the worse the format gets.
It also gives an error message if the number of files is big:
ValueError: column index (256) not an int in range(256)
What I am using is below:
import os
import pandas as pd

df = pd.DataFrame()
for f in ['c:\\1635.xls', 'c:\\1644.xls']:
    data = pd.read_excel(f, 'Sheet1')
    data.index = [os.path.basename(f)] * len(data)
    df = df.append(data)
df.to_excel('c:\\CB.xls')
(Screenshots of the original and combined files were attached here.)
What's the best way to combine a large number of such similar Excel files?
Thanks.
I usually use xlrd and xlwt:
#!/usr/bin/env python
# encoding: utf-8
import xlwt
import xlrd
import os

current_file = xlwt.Workbook()
write_table = current_file.add_sheet('sheet1', cell_overwrite_ok=True)

# write the header row once
key_list = [u'City', u'Country', u'Received Date', u'Shipping Date', u'Weight', u'1635']
for title_index, text in enumerate(key_list):
    write_table.write(0, title_index, text)

file_list = ['1635.xlsx', '1644.xlsx']
i = 1
for name in file_list:
    data = xlrd.open_workbook(name)
    table = data.sheets()[0]
    nrows = table.nrows
    for row in range(nrows):
        if row == 0:
            # skip each input file's own header row
            continue
        for index, context in enumerate(table.row_values(row)):
            write_table.write(i, index, context)
        i += 1
current_file.save(os.getcwd() + '/result.xls')
Instead of data.index = [os.path.basename(f)] * len(data) you should use df.reset_index().
For example:
1.xlsx:
a b
1 1
2 2
3 3
2.xlsx:
a b
4 4
5 5
6 6
code:
df = pd.DataFrame()
for f in [r"C:\Users\Adi\Desktop\1.xlsx", r"C:\Users\Adi\Desktop\2.xlsx"]:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.reset_index(inplace=True, drop=True)
df.to_excel('c:\\CB.xls')
cb.xls:
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
If you don't want the dataframe's index to be in the output file, you can use df.to_excel('c:\\CB.xls', index=False).
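As for the ValueError: column index (256) not an int in range(256) from the question: the legacy .xls format is capped at 256 columns (and 65,536 rows), while .xlsx allows 16,384 columns, so writing the combined frame to .xlsx sidesteps the error:

df.to_excel('c:\\CB.xlsx', index=False)  # .xlsx, unlike .xls, is not limited to 256 columns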