My ultimate goal is to merge the contents of a folder full of .xlsx files into one big file.
I thought the below code would suffice, but it only does the first file, and I can't figure out why it stops there. The files are small (~6 KB), so it shouldn't be a matter of waiting. If I print f_list, it shows the complete list of files. So, where am I going wrong? To be clear, there is no error returned, it just does not do the entire for loop. I feel like there should be a simple fix, but being new to Python and coding, I'm having trouble seeing it.
I'm doing this with Anaconda on Windows 8.
import pandas as pd
import glob

f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")  # creates my file list
all_data = pd.DataFrame()                            # creates my DataFrame

for f in f_list:                        # basic for loop to go through file list but doesn't
    df = pd.read_excel(f)               # reads .xlsx file
    all_data = all_data.append(df)      # appends file contents to DataFrame

all_data.to_excel("output.xlsx")        # creates new .xlsx
Edit with new information:
After trying some of the suggested changes, I noticed the output claims the files are empty, except for one of them, which is slightly larger than the others. If I read them into the DataFrame, it claims the DataFrame is empty. If I read them into the dict, it claims there are no values associated with them. Could this have something to do with file size? Many, if not most, of these files have 3-5 rows with 5 columns. The one it does see has 12 rows.
I strongly recommend reading the DataFrames into a dict:
sheets = {f: pd.read_excel(f) for f in f_list}
For one thing this is very easy to debug: just inspect the dict in the REPL.
Another is that you can then concat these into one DataFrame efficiently in one pass:
pd.concat(sheets.values())
Note: This is significantly faster than append, which has to allocate a temporary DataFrame at each append-call.
Another possible issue is that your glob may not be picking up all the files; you should check that it is by printing f_list.
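Putting these pieces together, a minimal end-to-end sketch might look like this (path and output name reused from the question; append is avoided entirely):

import glob
import pandas as pd

f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")   # path from the question
print(f_list)                                          # sanity-check what the glob picked up

# Read every workbook into a dict keyed by filename, then concatenate once.
sheets = {f: pd.read_excel(f) for f in f_list}
all_data = pd.concat(sheets.values())

all_data.to_excel("output.xlsx")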
# Purpose: read every CSV file in the directory, filter the rows whose column says 'fail', then copy those rows into a new CSV file.
# import necessary libraries
from sqlite3 import Row
import pandas as pd
import os
import glob
import csv
# Writing to a CSV
# using Python
import csv
# the path to your csv file directory.
mycsvdir = 'C:\Users\'' #this is where all the csv files will be housed.
# use glob to get all the csv files
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
# loop through the files and read them in with pandas
dataframes = [] # a list to hold all the individual pandas DataFrames
for csvfile in csv_files:
    # read the csv file
    df = pd.read_csv(csvfile)
    dataframes.append(df)
    #print(row['roi_id'], row['result']) #roi_id is the column label for the first cell on the csv, result is the Jth column label
dataframes = dataframes[dataframes['result'].str.contains('fail')]
# print out to a new csv file
dataframes.to_csv('ROI_Fail.csv') #rewrite this to mirror the variable you want to save the failed rows in.
I tried running this script but I'm getting a couple of errors. First off, I know my indentation is off (newbie over here), and I'm getting a big error under my for loop saying that "csv_files" is not defined. Any help would be greatly appreciated.
There are two issues here:
The first one is kind of easy - The variable in the for loop should be csvfiles, not csv_files.
The second one (Which will show up when you fix the one above) is that you are treating a list of dataframes as a dataframe.
The object "dataframes" in your script is a list to which you are appending the dataframes created from the CSV files. As such, you cannot index them by the column name as you are trying to do.
If your dataframes have the same layout I'd recommend using pd.concat to join all the dataframes into a single one, and then filtering the rows as you did here.
full_dataframe = pd.concat(dataframes, axis=0)
full_dataframe = full_dataframe[full_dataframe['result'].str.contains('fail')]
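Putting it together, the corrected script might look something like this (a sketch only; the directory path is a placeholder and the 'result' column name is taken from your code):

import glob
import os
import pandas as pd

mycsvdir = 'C:/Users/...'                      # placeholder: directory holding the csv files
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))

dataframes = []                                # list of individual DataFrames
for csvfile in csvfiles:                       # note: csvfiles, not csv_files
    dataframes.append(pd.read_csv(csvfile))

# Concatenate the list into a single DataFrame, then keep only the failed rows.
full_dataframe = pd.concat(dataframes, axis=0)
failed_rows = full_dataframe[full_dataframe['result'].str.contains('fail')]
failed_rows.to_csv('ROI_Fail.csv', index=False)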
As a tip for further posts I'd recommend you also post the full traceback from your program. It helps us understand exactly what error you had when executing your code.
I have 9 Excel files named "1_mock" through "10_mock" (the sequence skips "5_mock").
These files just have a single sheet.
I want to combine those 9 files into one workbook with 9 separate sheets, rather than concatenating them into a single sheet.
I've found this online merger https://products.aspose.app/cells/merger. Though it's convenient, I have to stop my code and switch to that site every time, which is not efficient at all. So I'm hoping there is a way to do this step in code as well.
Thanks!
You just have to read all the files first and then write them to the same output file with to_excel, passing a different sheet_name parameter each time. You can try something like this:
import pandas as pd

files_list = ["1_mock.xlsx", "2_mock.xlsx", "3_mock.xlsx", .....]
df_list = [pd.read_excel(file_name) for file_name in files_list]

with pd.ExcelWriter('output.xlsx') as writer:
    for i in range(len(files_list)):
        df_list[i].to_excel(writer, sheet_name=files_list[i].replace(".xlsx", ""))
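If you prefer not to type out the nine names, the list could also be built programmatically, skipping the missing "5_mock" (filenames assumed from the question):

# Builds ["1_mock.xlsx", ..., "10_mock.xlsx"] while leaving out "5_mock.xlsx".
files_list = ["{}_mock.xlsx".format(i) for i in range(1, 11) if i != 5]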
There are around 10k .csv files named data0, data1, and so on in sequence. I want to combine them and have a master sheet in one file, or at least a couple of sheets, using Python, because I think there is a limit of around 1,070,000 rows per Excel sheet?
import pandas as pd
import os

master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.csv'):
        master_df = master_df.append(pd.read_csv(file))

master_df.to_csv('master file.CSV', index=False)
A few things to note:
Please check your CSV file content first. Columns can easily be mismatched when a CSV contains text (for example a ; inside a field). You can also try changing the separator, encoding, or parser engine:
df = pd.read_csv(csvfilename, sep=';', encoding='utf-8', engine='python')
If you want to combine everything into one sheet, you can concat into one DataFrame first, then call to_excel:
df = pd.concat([df, sh_tmp], axis=0, sort=False)   # sh_tmp: the next DataFrame to add
Note: concat or append is a straightforward way to combine data. However, 10k files can become a performance problem. If you hit performance issues, collect the DataFrames in a list and concatenate them once, instead of calling append/concat inside the loop (see the sketch after these notes).
Excel has a maximum row limit (1,048,576). 10k files would easily exceed it. You might write the output to a CSV file instead, or split it across multiple .xlsx files.
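As mentioned in the performance note above, a minimal sketch of collecting the frames in a list and concatenating once (file discovery kept as in the question's code):

import os
import pandas as pd

frames = []
for file in os.listdir(os.getcwd()):
    if file.endswith('.csv'):
        frames.append(pd.read_csv(file))

# A single concat at the end is much cheaper than appending inside the loop;
# ignore_index gives the result a continuous RangeIndex.
master_df = pd.concat(frames, axis=0, ignore_index=True)
master_df.to_csv('master file.CSV', index=False)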
---- Update on the 3rd point ----
You can try grouping the data first (1,000,000 rows per group), then writing each group to its own sheet:
row_limit = 1000000

# Assumes master_df has a continuous RangeIndex (use reset_index(drop=True) after combining if it doesn't).
master_df['group'] = master_df.index // row_limit

writer = pd.ExcelWriter(path_out)   # path_out: path of the output .xlsx file
for gr in range(0, master_df['group'].max() + 1):
    master_df.loc[master_df['group'] == gr].to_excel(writer, sheet_name='Sheet' + str(gr), index=False)
writer.save()
I am trying to merge a number of large data sets using Dask in Python to avoid loading issues. I want to save the merged result as a .csv file. The task proves harder than imagined:
I put together a toy example with just two data sets.
The code I then use is the following:
import dask.dataframe as dd
import glob
import os

os.chdir('C:/Users/Me/Working directory')
file_list = glob.glob("*.txt")

dfs = []
for file in file_list:
    ddf = dd.read_table(file, sep=';')
    dfs.append(ddf)

dd_all = dd.concat(dfs)
If I use dd_all.to_csv('*.csv') I simply get the two original data sets written back out as separate files.
If I use dd_all.to_csv('name.csv') I get an error saying the file does not exist.
(FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Me\\Working directory\\name.csv\\1.part')
Using dd_all.compute() I can check that the merged data set has been created successfully.
You are misunderstanding how Dask works - the behaviour you see is as expected. In order to be able to write from multiple workers in parallel, it is necessary for each worker to be able to write to a separate file; there is no way to know the length of the first chunk before writing it has finished, for example. To write to a single file is therefore necessarily a sequential operation.
The default operation, therefore, is to write one output file for each input partition, and this is what you see. Since Dask can read from these in parallel, it does raise the question of why you would want to create a single output file at all.
For the second method without the "*" character, Dask is assuming that you are supplying a directory, not a file, and is trying to write two files within this directory, which doesn't exist.
If you really wanted to write a single file, you could do one of the following:
use the repartition method to make a single output partition and then to_csv (see the sketch after this list)
write the separate files and concatenate them after the fact (taking care of the header line)
iterate over the partitions of your dataframe in sequence to write to the same file.
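For instance, the first option might look like this (a sketch only, reusing the dfs list from the question's code):

import dask.dataframe as dd

# Collapse the concatenated frame into a single partition so that
# to_csv has only one piece to write.
dd_single = dd.concat(dfs).repartition(npartitions=1)
dd_single.to_csv('merged-*.csv')   # with one partition this yields exactly one file, merged-0.csv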
I'm trying to merge a single data column from 40 almost identical CSV files with pandas. The files contain info about Windows processes in CSV form, generated by the Windows 'tasklist' command.
What I want to do is merge the memory information from these files into a single file, using the PID as the key. However, there are some random, insignificant processes that appear every now and then and cause inconsistency among the CSV files: one file might have 65 rows and another 75. Those random processes are not significant, their changing PIDs should not matter, and they should simply be dropped when merging the files.
This is how I first tried to do it:
# CSV files have following columns:
# Image Name, PID, Session Name, Session #, Mem Usage
file1 = pd.read_csv("tasklist1.txt")
file1 = file1.drop(file1.columns[[2, 3]], axis=1)

for i in range(2, 41):
    filename = "tasklist" + str(i) + ".txt"
    filei = pd.read_csv(filename)
    filei = filei.drop(filei.columns[[0, 2, 3]], axis=1)
    file1 = file1.merge(filei, on='PID')

file1.to_csv("Final.txt", index=False)
From the first CSV file I just drop the Session Name and Session # columns, but keep the Image Name column as the label for each row. From the following CSV files I keep only the PID and Mem Usage columns, and try to merge the ever-growing merged frame with the data from the next file.
The problem is that when the loop reaches the 5th iteration, it cannot merge the files anymore, and I get the "Reindexing only valid with uniquely valued Index objects" error.
So I can merge the 1st file with the 2nd to 4th inside the first loop. If I then create a second loop where I merge the 5th file with the 6th to 8th files, and then merge these two merged files together, all the data from files 1 to 8 merges perfectly fine.
Any suggestions on how to perform this kind of chained merge without creating x additional loops? At this point I'm experimenting with 40 files and could brute-force the whole process with nested loops, but that isn't an effective way of merging in the first place, and it's unacceptable if I need to scale this to even more files.
Duplicate column names will cause this error.
So you can add the suffixes parameter to the merge call:
suffixes : 2-length sequence (tuple, list, ...)
    Suffix to apply to overlapping (value) column names in the left and right side, respectively.
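For example, giving each incoming Mem Usage column a suffix tied to the loop counter keeps every column name unique, so the chained merge no longer breaks (a sketch based on the code in the question):

import pandas as pd

file1 = pd.read_csv("tasklist1.txt")
file1 = file1.drop(file1.columns[[2, 3]], axis=1)

for i in range(2, 41):
    filei = pd.read_csv("tasklist" + str(i) + ".txt")
    filei = filei.drop(filei.columns[[0, 2, 3]], axis=1)
    # Keep the left-hand column names as they are and tag the incoming
    # Mem Usage column with the file number, e.g. "Mem Usage_2".
    file1 = file1.merge(filei, on='PID', suffixes=('', '_' + str(i)))

file1.to_csv("Final.txt", index=False)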