Good morning.
I'm starting out with Python and I have a problem.
I need to find all the .xls files in a folder (they all have the same header) and merge them all into a single DataFrame, so I need to tell pandas that the first line (the header) of each file should be ignored.
The current code I'm using is this:
os.chdir("file folder path")
fileLista = glob.glob('*.xls')
df = list()
for arquivo in fileLista:
    df = df.append(pd.read_excel(arquivo))
Company = pd.concat(df)
Company.columns = Company.columns.str.strip()
I am using glob to return all the files with the .xls extension,
df.append to collect each file that was returned into a list of DataFrames,
pd.concat to combine them into a single DataFrame,
and str.strip to remove the spaces in the column headers.
When I run the code it returns this error:
"'NoneType' object is not iterable"
Can anyone help me with this mistake?
What about this instead?
fileLista = glob.glob('*.xls')
Company = pd.DataFrame()
for arquivo in fileLista:
    df = pd.read_excel(arquivo)
    Company = pd.concat([Company, df])
Company.columns = Company.columns.str.strip()
This should do what you want.
import pandas as pd
import glob

all_data = pd.DataFrame()
for f in glob.glob("C:/your_path_here/*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
print(all_data)
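Note that DataFrame.append has since been deprecated and was removed in pandas 2.0, and concatenating inside a loop copies the data repeatedly. A minimal sketch of the same idea (placeholder path as above) that collects the frames in a list and concatenates once:

import glob
import pandas as pd

frames = []
for f in glob.glob("C:/your_path_here/*.xlsx"):
    frames.append(pd.read_excel(f))  # one DataFrame per file
all_data = pd.concat(frames, ignore_index=True)  # single concat at the end
print(all_data)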
Here is another option to consider.
import pandas as pd
# filenames
excel_names = ["C:/your_path_here/Book1.xlsx", "C:/your_path_here/Book2.xlsx", "C:/your_path_here/Book3.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None, index_col=None) for x in excels]
# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
# Results go to the default directory if not assigned somewhere else.
# C:\Users\Excel\.spyder-py3
I have a list of .csv files stored in a local folder and I'm trying to concatenate them into one single dataframe.
Here is the code I'm using :
import pandas as pd
import os
folder = r'C:\Users\_M92\Desktop\myFolder'
df = pd.concat([pd.read_csv(os.path.join(folder, f), delimiter=';') for f in os.listdir(folder)])
display(df)
Only one problem: it happens that one of the files is sometimes empty (0 columns, 0 rows), and in this case pandas throws an EmptyDataError: No columns to parse from file at line 6.
Do you have any suggestions on how to bypass the empty csv files?
And, while we're at it, is there a more efficient/simpler way to concatenate csv files?
Ideally, I would also like to add a column (to the dataframe df) carrying the name of each .csv file.
You can check if a file is empty with:
import os
os.stat(FILE_PATH).st_size == 0
In your use case:
import os
df = pd.concat([
    pd.read_csv(os.path.join(folder, f), delimiter=';')
    for f in os.listdir(folder)
    if os.stat(os.path.join(folder, f)).st_size != 0
])
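To also carry the source filename, as requested, one possibility (a minimal sketch building on the same filter; the column name origin is arbitrary) is to tag each frame with assign before concatenating:

import os
import pandas as pd

df = pd.concat([
    pd.read_csv(os.path.join(folder, f), delimiter=';').assign(origin=f)  # add filename column
    for f in os.listdir(folder)
    if os.stat(os.path.join(folder, f)).st_size != 0
])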
Personally, I would filter the files for content first, then merge them using a basic try-except.
import pandas as pd
import os
folder = r'C:\Users\_M92\Desktop\myFolder'
data = []
for f in os.listdir(folder):
    try:
        temp = pd.read_csv(os.path.join(folder, f), delimiter=';')
        # adding original filename column as per request
        temp['origin'] = f
        data.append(temp)
    except pd.errors.EmptyDataError:
        continue
df = pd.concat(data)
display(df)
Here is where I need your help.
I have multiple xlsx files and I am looking for the same column information inside each one. Until now everything was working fine, but some *.xlsx files do not contain the data, and instead of skipping them and looking over the others, my Python script just stops.
import glob
import pandas as pd
# Setup variables
xlsx_input = 'D:\\script\\bdd\\xlsx\\*.xlsx'
csv_output = 'D:\\script\\bdd\\csv\\'
# Save all file matches: xlsx_files
xlsx_files = glob.glob(xlsx_input, recursive=True)
# Create an empty list: frames
frames = []
# Iterate over xlsx_files
for file in xlsx_files:
    # Read xlsx into a DataFrame
    df = pd.read_excel(file, usecols=['ref_01', 'ref_02', 'ref_03'])
    # Append df to frames
    frames.append(df)
# Concatenate frames into a single DataFrame
excel_output = pd.concat(frames)
# Write CSV file
excel_output.to_csv(csv_output + "bdd_export.csv", encoding='utf-8-sig', index=None)
Any help would be greatly appreciated.
Cheers !
OK, I have found how to do it.
Just by changing the read to this: df = pd.read_excel(file, usecols=lambda c: c in ['ref_01', 'ref_02', 'ref_03'])
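For context (my note, not the original poster's): when usecols is given a callable, pandas keeps only the columns for which it returns True, so a file that lacks one of the columns no longer raises an error; after pd.concat the missing values simply show up as NaN. A minimal sketch of the full loop under that assumption:

import glob
import pandas as pd

wanted = ['ref_01', 'ref_02', 'ref_03']
frames = []
for file in glob.glob('D:\\script\\bdd\\xlsx\\*.xlsx'):
    # keep only the wanted columns; any that are absent are silently skipped
    frames.append(pd.read_excel(file, usecols=lambda c: c in wanted))
excel_output = pd.concat(frames)  # missing columns become NaN here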
I'm having a hard time loading multiple line delimited JSON files into a single pandas dataframe. This is the code I'm using:
import os, json
import pandas as pd
import numpy as np
import glob
pd.set_option('display.max_columns', None)
temp = pd.DataFrame()
path_to_json = '/Users/XXX/Desktop/Facebook Data/*'
json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)
for file in file_list:
    data = pd.read_json(file, lines=True)
    temp.append(data, ignore_index=True)
It looks like all the files are loading when I look through file_list, but cannot figure out how to get each file into a dataframe. There are about 50 files with a couple lines in each file.
Change the last line to:
temp = temp.append(data, ignore_index = True)
The reason we have to do this is that the append does not happen in place: the append method does not modify the original data frame, it just returns a new data frame with the result of the append operation.
Edit:
Since writing this answer I have learned that you should never use DataFrame.append inside a loop because it leads to quadratic copying (see this answer).
What you should do instead is first create a list of data frames and then use pd.concat to concatenate them all in a single operation. Like this:
dfs = []  # an empty list to store the data frames
for file in file_list:
    data = pd.read_json(file, lines=True)  # read data frame from json file
    dfs.append(data)  # append the data frame to the list
temp = pd.concat(dfs, ignore_index=True)  # concatenate all the data frames in the list
This alternative should be considerably faster.
If you need to flatten the JSON, Juan Estevez's approach won't work as is. Here is an alternative:
import json
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        json_data = pd.json_normalize(json.loads(f.read()))
    dfs.append(json_data)
df = pd.concat(dfs, sort=False)  # or sort=True depending on your needs
Or if your JSON files are line-delimited (not tested):
import json
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        for line in f.readlines():
            json_data = pd.json_normalize(json.loads(line))
            dfs.append(json_data)
df = pd.concat(dfs, sort=False)  # or sort=True depending on your needs
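A possible refinement (my addition, not from the original answer): pd.json_normalize also accepts a list of dicts, so you can parse all the lines of a file and normalize them in one call instead of building one single-row frame per line:

import json
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        records = [json.loads(line) for line in f if line.strip()]  # one dict per non-empty line
    dfs.append(pd.json_normalize(records))
df = pd.concat(dfs, ignore_index=True, sort=False)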
from pathlib import Path
import pandas as pd
paths = Path("/home/data").glob("*.json")
df = pd.DataFrame([pd.read_json(p, typ="series") for p in paths])
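This one-liner assumes each file holds a single JSON object (hence typ="series"). If the files are line-delimited, as in the question, a close variant (a sketch, not tested against the original data) would be:

from pathlib import Path
import pandas as pd

paths = Path("/home/data").glob("*.json")
# one frame per file; each file may contain several records
df = pd.concat((pd.read_json(p, lines=True) for p in paths), ignore_index=True)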
I combined Juan Estevez's answer with glob. Thanks a lot.
import pandas as pd
import glob
def readFiles(path):
    files = glob.glob(path)
    dfs = []  # an empty list to store the data frames
    for file in files:
        data = pd.read_json(file, lines=True)  # read data frame from json file
        dfs.append(data)  # append the data frame to the list
    df = pd.concat(dfs, ignore_index=True)  # concatenate all the data frames in the list
    return df
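Usage then looks something like this (the path pattern is a placeholder):

df = readFiles('/Users/XXX/Desktop/Facebook Data/*/*.json')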
Maybe you should state whether the json files were created with pandas pd.to_json() or in another way.
I used data which was not created with pd.to_json(), and I think it is not possible to use pd.read_json() in my case. Instead, I programmed a customized for-each loop to write everything into the DataFrames.
I have 300 raw data files (.xlsm) and want to extract the useful data and turn it into csv files as input for a subsequent neural network. For now I am trying to implement this with 10 files as an example. I have successfully extracted the information I need, but I don't know how to convert the results into csv files with the same names as the source files. For a single file I can use df.to_csv, but how do I do it for all the files? With a for loop?
import glob
import pandas as pd
import numpy as np
import csv
import os
excel_files = glob.glob('../../Versuch/Versuche/RohBeispiel/*.xlsm')
directory = '/Beispiel'
for files in excel_files:
    data = pd.read_excel(files)
    # getting the list of rows and columns you need
    list_of_dfs = pd.DataFrame(data.values[0:600, 12:26],
                               columns=data.columns[12:26]).drop(['Sauberkeit', 'Temparatur'], axis=1)
    # converting pandas dataframe columns to numeric: string into float
    cols = ['KonzA', 'KonzB', 'KonzC', 'TempA',
            'TempB', 'TempC', 'Modul1', 'Modul2',
            'Modul3', 'Modul4', 'Modul5', 'Modul6']
    list_of_dfs[cols] = list_of_dfs[cols].apply(pd.to_numeric, errors='coerce', axis=1)
    # filling down through missing data, column by column
    for fec in list_of_dfs[cols]:
        list_of_dfs[fec].fillna(method='ffill', inplace=True)
    csvfilename = files.split('/')[-1].split('.')[0] + '.csv'
    newtempfile = os.path.join(directory, csvfilename)
    print(newtempfile)
    print(list_of_dfs.head(2))
problem is solved.
folder_name = 'Beispiel'
csvfilename = files.split('/')[-1].split('.')[0] + '.csv' # change into csv files
newtempfile = os.path.join(folder_name, csvfilename)
# Verify if directory exists
if not os.path.exists(folder_name):
    os.makedirs(folder_name)  # If not, create it
print(newtempfile)
list_of_dfs.to_csv(newtempfile, index=False)
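One caveat (my note, not part of the fix above): files.split('/') only works when the glob pattern uses forward slashes. A slightly more portable way to build the csv name is os.path.basename plus os.path.splitext:

import os

csvfilename = os.path.splitext(os.path.basename(files))[0] + '.csv'  # strip directory and extension
newtempfile = os.path.join(folder_name, csvfilename)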
The easiest way of doing this is to get the filename from the excel and then use the os.path.join() method to save it to the directory you want.
directory = "C:/Test"
for files in excel_files:
    csvfilename = os.path.basename(files).replace('.xlsm', '.csv')
    newtempfile = os.path.join(directory, csvfilename)
Since you already have the excel df you want to push into the csv file, just add the above code to the loop and write the output to newtempfile, and that should do it:
df.to_csv(newtempfile)
Hope this helps. :)
Updated Code:
cols = ['KonzA', 'KonzB', 'KonzC', 'TempA',
        'TempB', 'TempC', 'Modul1', 'Modul2',
        'Modul3', 'Modul4', 'Modul5', 'Modul6']
excel_files = glob.glob('../../Versuch/Versuche/RohBeispiel/*.xlsm')
for file in excel_files:
    data = pd.read_excel(file, usecols=cols)  # import only the columns you need into the dataframe
    csvfilename = os.path.basename(file).replace('.xlsm', '.csv')
    newtempfile = os.path.join(directory, csvfilename)
    # converting pandas dataframe columns to numeric: string into float
    data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)
    data[cols].fillna(method='ffill', inplace=True)
    data.to_csv(newtempfile, index=False)
My python code works correctly in the below example. My code combines a directory of CSV files and matches the headers. However, I want to take it a step further - how do I add a column that appends the filename of the CSV that was used?
import pandas as pd
import glob
globbed_files = glob.glob("*.csv")  # creates a list of all csv files
data = []  # pd.concat takes a list of dataframes as an argument
for csv in globbed_files:
    frame = pd.read_csv(csv)
    data.append(frame)
bigframe = pd.concat(data, ignore_index=True)  # don't want pandas to try and align row indexes
bigframe.to_csv("Pandas_output2.csv")
This should work:
import os
for csv in globbed_files:
    frame = pd.read_csv(csv)
    frame['filename'] = os.path.basename(csv)
    data.append(frame)
frame['filename'] creates a new column named filename and os.path.basename() turns a path like /a/d/c.txt into the filename c.txt.
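A small variation (not part of Mike's answer): if you want the filename without its extension, pathlib's Path.stem does that in one step inside the same loop:

from pathlib import Path

frame['filename'] = Path(csv).stem  # '/a/d/c.txt' becomes 'c'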
Mike's answer above works perfectly. In case any googlers run into the following error:
>>> TypeError: cannot concatenate object of type "<type 'str'>";
only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
It's possibly because the separator is not correct. I was using a custom csv file, so the separator was ^. Because of that I needed to include the separator in the pd.read_csv call.
import os
for csv in globbed_files:
    frame = pd.read_csv(csv, sep='^')
    frame['filename'] = os.path.basename(csv)
    data.append(frame)
The files variable contains the list of all csv files in your present directory, such as
['FileName1.csv', 'FileName2.csv']. You also need to remove ".csv"; you can use the .split() function for that. Below is the simple logic. This will work for you.
files = glob.glob("*.csv")
for i in files:
    df = pd.read_csv(i)
    df['New Column name'] = i.split(".")[0]
    df.to_csv(i.split(".")[0] + ".csv")