merge multiple excel files into one dataframe - python

I have a lot of Excel files (200+), all of which have the same format.
The file paths are saved in this list:
dir_list = ['all', 'files']
I want to combine all of them into one single df.
Below is what I want to select from each Excel file into the new df:
used_col = ['Dimension', 'Length', 'Customer']
df_x = pd.read_excel(file, sheet_name='Tabelle1', skiprows=3, skipinitialspace=True, usecols=used_col)
How can I do that?

You are close; you need pd.concat to build a single df from all the files.
import pandas as pd

used_col = ['Dimension', 'Length', 'Customer']
tmp = []
for file in dir_list:
    # skipinitialspace is a read_csv parameter; recent pandas versions raise
    # a TypeError if it is passed to read_excel, so it is dropped here
    df_x = pd.read_excel(file, sheet_name='Tabelle1', skiprows=3, usecols=used_col)
    tmp.append(df_x)
final_df = pd.concat(tmp)
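As an aside (not part of the original answer): with 200+ files you probably don't want to type the paths out by hand. glob can build dir_list for you; a minimal sketch, assuming the files sit in a folder named data (both the folder name and the pattern are assumptions):
import glob

# hypothetical folder and extension; adjust the pattern to your layout
dir_list = glob.glob('data/*.xlsx')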

Related

How to combine multiple csv as columns in python?

I have 10 .txt (CSV) files that I want to merge into a single CSV file for later analysis. When I use DataFrame.append, it always stacks the files below each other.
I use the following code:
master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.txt'):
        files = pd.read_csv(file, sep='\t', skiprows=[1])
        master_df = master_df.append(files)
The output is:
[screenshot: the files stacked below each other]
What I need is to insert the columns of each file side by side, as follows:
[screenshot: the required side-by-side output]
could you please help with this?
To merge DataFrames side by side, you should use pd.concat with axis=1.
frames = []
for file in os.listdir(os.getcwd()):
    if file.endswith('.txt'):
        files = pd.read_csv(file, sep='\t', skiprows=[1])
        frames.append(files)
# axis=0 reproduces your original stacked behaviour;
# axis=1 puts the frames side by side
master_df = pd.concat(frames, axis=1)
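One caveat worth flagging (an assumption about your files, not something the original answer states): pd.concat with axis=1 aligns rows by index label, not by position. If the files carry differing or non-default indexes, reset them first; a minimal sketch continuing from the loop above:
# reset each frame's index so concat pairs rows by position, not by label
frames = [f.reset_index(drop=True) for f in frames]
master_df = pd.concat(frames, axis=1)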

Merging csv files into one (columnwise) in Python

I have many .csv files like this (with one column):
[screenshot: a single-column .csv file]
I'd like to merge them into one .csv file, so that each column contains the data of one of the csv files. The headings should be like this (when converted to a spreadsheet):
[screenshot: the first value is the number of minutes extracted from the file name, the second is the first word in the file name after "export_", and the third is the whole file name]
I'd like to work in Python.
Could someone please help me with this? I am new to Python.
Thank you very much.
I tried to join only two files, but I have no idea how to do it with more files without writing everything out manually. Also, I don't know how to extract the headings from the file names:
import pandas as pd

file_list = ['export_Control 37C 4h_Single Cells_Single Cells_Single Cells.csv',
             'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv']
df = pd.DataFrame()
for file in file_list:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df], axis=1)
print(df)
df.to_csv('output2.csv', index=False)
Assuming that your .csv files all have a header and the same number of rows, you can use the code below to put the single-column .csv files one beside the other in a single output file.
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
csv_files = [file for file in os.listdir(csv_path) if file.endswith('.csv')]
list_of_dfs = []
for file in csv_files:
    temp = pd.read_csv(os.path.join(csv_path, file), header=0, names=['Header'])
    # build the three header rows from the file name,
    # e.g. 'export_Control 37C 0 min_...' -> '0', 'Control', full file name
    time_number = pd.DataFrame([[file.split('_')[1].split()[2]]], columns=['Header'])
    file_title = pd.DataFrame([[file.split('_')[1].split()[0]]], columns=['Header'])
    file_name = pd.DataFrame([[file]], columns=['Header'])
    out = pd.concat([time_number, file_title, file_name, temp]).reset_index(drop=True)
    list_of_dfs.append(out)
final = pd.concat(list_of_dfs, axis=1, ignore_index=True)
final.columns = ['Column' + str(col + 1) for col in final.columns]
final.to_csv(os.path.join(csv_path, 'output.csv'), index=False)
final
For example, with three .csv files, running the code above yields:
[screenshot: output in Jupyter]
[screenshot: output in Excel]
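Note that file.split('_')[1].split()[2] grabs whatever the third space-separated token happens to be, which breaks if the time is written as '4h' in one name and '0 min' in another. A regular expression is more robust; a minimal sketch, where minutes_from_name is a hypothetical helper and the pattern is an assumption about your naming scheme:
import re

def minutes_from_name(name):
    # hypothetical helper: look for '<number> min' or '<number>h' in the name
    m = re.search(r'(\d+)\s*(min|h)', name)
    if m is None:
        return None
    value, unit = int(m.group(1)), m.group(2)
    return value * 60 if unit == 'h' else value

print(minutes_from_name('export_Control 37C 4h_Single Cells.csv'))    # 240
print(minutes_from_name('export_Control 37C 0 min_Single Cells.csv')) # 0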

Merge files with a for loop

I have over two thousand csv files in a folder, as follows:
University_2010_USA.csv, University_2011_USA.csv, Education_2012_USA.csv, Education_2012_Mexico.csv, Education_2012_Argentina.csv,
and
Results_2010_USA.csv, Results_2011_USA.csv, Results_2012_USA.csv, Results_2012_Mexico.csv, Results_2012_Argentina.csv,
I would like to match the first set of csv files with the second set based on the "year" (2012, etc.) and "country" (Mexico, etc.) in the file name. Is there a way to do so quickly? Both sets of csv files have the same column names, and I'm looking at the following code:
df0 = pd.read_csv('University_2010_USA.csv')
df1 = pd.read_csv('Results_2010_USA.csv')
new_df = pd.merge(df0, df1, on=['year','country','region','sociodemographics'])
So basically, I'd need help to write a for-loop that iterates over the datasets... Thanks!
Try this:
import pandas as pd
from pathlib import Path

university = []
results = []
for file in Path('/path/to/data/folder').glob('*.csv'):
    # Determine the properties from the file's name,
    # e.g. 'Results_2012_Mexico.csv' -> ('Results', '2012', 'Mexico')
    file_type, year, country = file.stem.split('_')
    # (add 'Education' here if those files belong with the University set)
    if file_type not in ['University', 'Results']:
        continue
    # Make the data frame, with 2 extra columns using the properties
    # we extracted from the file's name
    tmp = pd.read_csv(file).assign(
        year=int(year),
        country=country
    )
    if file_type == 'University':
        university.append(tmp)
    else:
        results.append(tmp)

df = pd.merge(
    pd.concat(university),
    pd.concat(results),
    on=['year', 'country', 'region', 'sociodemographics']
)
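If you want to verify that the pairing worked (a suggestion, not part of the original answer), pandas' indicator flag shows which rows found a partner; a minimal sketch continuing from the code above:
# how='outer' keeps unmatched rows; the _merge column tells you
# whether each row came from the left side, the right side, or both
check = pd.merge(
    pd.concat(university),
    pd.concat(results),
    on=['year', 'country', 'region', 'sociodemographics'],
    how='outer',
    indicator=True
)
print(check['_merge'].value_counts())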

dropping columns in multiple excel spreadsheets

Is there a way in Python to drop columns in multiple Excel files? I.e., I have a folder with several xlsx files. Each file has about 5 columns (date, value, latitude, longitude, region). I want to drop all columns except date and value in each Excel file.
Let's say you have a folder with multiple excel files:
from itertools import chain
from pathlib import Path

import pandas as pd

folder = Path('excel_files')
xlsx_only_files = list(folder.rglob('*.xlsx'))

def process_files(xls_file):
    # stem is a pathlib attribute that gives just the file name,
    # without the parent or the suffix
    filename = xls_file.stem
    # sheet_name=None reads the file in as a dictionary,
    # with the sheet names as the keys;
    # usecols reads in only the relevant columns
    df = pd.read_excel(xls_file, usecols=['date', 'value'], sheet_name=None)
    df_cleaned = [data.assign(sheetname=sheetname, filename=filename)
                  for sheetname, data in df.items()]
    return df_cleaned

# process_files returns a list of frames per file, so flatten before concatenating
combo = [process_files(xlsx) for xlsx in xlsx_only_files]
final = pd.concat(chain.from_iterable(combo), ignore_index=True)
Let me know how it goes
I suggest you define the columns you want to keep as a list and then select them as a new dataframe.
# after opening the Excel file with
df = pd.read_excel(...)
keep_cols = ['date', 'value']
df = df[keep_cols]  # keep only the selected columns; returns a DataFrame
df.to_excel(...)
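Since the question is about a whole folder, here is a minimal loop sketch around that idea; the folder name excel_files and the overwrite-in-place behaviour are assumptions:
from pathlib import Path
import pandas as pd

keep_cols = ['date', 'value']
for path in Path('excel_files').glob('*.xlsx'):  # hypothetical folder name
    df = pd.read_excel(path)
    # overwrite each file with only the selected columns
    df[keep_cols].to_excel(path, index=False)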

How to read all excel files under a directory as a pandas dataframe

I have a couple of Excel sheets (read with pd.read_excel) under a directory and would like to read each one into a pandas DataFrame and add them to a list, so my list should end up holding multiple DataFrames. How can I do that?
My method for this:
import os

import pandas as pd

data = os.listdir('data')
dfs = []
for file in data:
    path = 'data' + '/' + file
    temp = pd.read_excel(path)
    dfs.append(temp)
# DataFrame.append was removed in pandas 2.0, so build a list and concat instead
df = pd.concat(dfs, ignore_index=True)
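One caveat (an assumption about the folder's contents): os.listdir returns every entry in the directory, so a stray non-Excel file will make read_excel fail. A minimal guard, assuming the files are .xlsx:
import os
import pandas as pd

dfs = []
for file in os.listdir('data'):
    if not file.endswith('.xlsx'):  # skip anything that is not an Excel file
        continue
    dfs.append(pd.read_excel(os.path.join('data', file)))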
