Stacking .csv files vertically with Pandas in Python

So I have been trying to merge .csv files with Pandas, and to write a couple of functions to automate it, but I keep having an issue.
My problem is that I want to stack the .csv files one after the other (same number of columns, different numbers of rows), but instead of getting a bigger csv with the same number of columns, I get a bigger csv with the correct number of rows and more columns than there are supposed to be.
The code I'm using is this one:
import os
import pandas as pd

def stackcsv(content_folder):
    global combined_csv
    combined_csv = []
    entries = os.listdir(content_folder)
    for i in entries:
        csv_path = os.path.join(content_folder, i)
        solo_csv = pd.read_csv(csv_path, index_col=None)
        combined_csv.append(solo_csv)
    csv_final = pd.concat(combined_csv, axis=0, ignore_index=True)
    return csv_final.to_csv("final_data.csv", index=None, header=None)
I have 3 .csv files with a size of 20000x17 each, and I want to merge them into one of 60000x17. I suppose my error must be in the arguments of index, header, index_col, etc.
Thanks in advance.

So after modifying the code, it worked. First of all, as Serge Ballesta said, it is necessary to tell read_csv that there is no header. Finally, using sort=False, the function works perfectly. This is the final code that I have used, and the final .csv is 719229 rows × 17 columns long. Thanks to everybody!
import os
import pandas as pd

def stackcsv(content_folder):
    global combined_csv
    combined_csv = []
    entries = os.listdir(content_folder)
    for i in entries:
        csv_path = os.path.join(content_folder, i)
        solo_csv = pd.read_csv(csv_path, index_col=None, header=None)
        combined_csv.append(solo_csv)
    csv_final = pd.concat(combined_csv, axis=0, sort=False)
    return csv_final.to_csv("final_data.csv", header=None)

If the files have no header you must tell read_csv so. If you don't, the first line of each file is read as a header line. As a result the DataFrames have different column names and concat will add new columns. So you should read with:
solo_csv = pd.read_csv(csv_path, index_col=None, header=None)
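To make the symptom concrete, here is a minimal sketch (the frames and values are made up, not from the question): when the first data row of each file is mistaken for a header, the two DataFrames end up with different column names, and concat unions them, padding with NaN:

import pandas as pd

# Hypothetical illustration of the bug: each frame's "header" is really its first data row.
a = pd.DataFrame([[1, 2]], columns=['10', '20'])  # names came from file 1's first row
b = pd.DataFrame([[3, 4]], columns=['30', '40'])  # names came from file 2's first row

print(pd.concat([a, b], ignore_index=True))
# The result has the union of columns ['10', '20', '30', '40'], half NaN:
# more columns than expected, which is exactly the symptom in the question.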
Alternatively, there is no reason to decode them at all; you could just concatenate the files' raw contents sequentially:
import os

def stackcsv(content_folder):
    with open("final_data.csv", "w") as fdout:
        entries = os.listdir(content_folder)
        for i in entries:
            csv_path = os.path.join(content_folder, i)
            with open(csv_path) as fdin:
                while True:
                    chunk = fdin.read(65536)  # copy in bounded chunks
                    if len(chunk) == 0:
                        break
                    fdout.write(chunk)
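A brief usage note: this copies bytes verbatim, so it is only equivalent to the pandas version when every file is headerless (otherwise each file's header line would be copied into the middle of the output) and each file ends with a newline. A hypothetical call:

stackcsv("content_folder")  # folder name is made up; writes final_data.csv in the current directory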

Set the sort parameter to False in the pandas concat function:
csv_final = pd.concat(combined_csv, axis=0, ignore_index=True, sort=False)
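For context, a minimal sketch (column names invented) of what sort changes: when the frames' columns are not aligned, older pandas versions sorted the unioned columns alphabetically and emitted a FutureWarning, while sort=False preserves the order in which columns first appear:

import pandas as pd

x = pd.DataFrame([[1]], columns=['b'])
y = pd.DataFrame([[2]], columns=['a'])

# sort=False keeps ['b', 'a'] (order of first appearance);
# sort=True would sort the union to ['a', 'b'].
print(pd.concat([x, y], ignore_index=True, sort=False).columns.tolist())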

Related

How to concatenate a list of csv files (including empty ones) using Pandas

I have a list of .csv files stored in a local folder and I'm trying to concatenate them into one single dataframe.
Here is the code I'm using:
import pandas as pd
import os
folder = r'C:\Users\_M92\Desktop\myFolder'
df = pd.concat([pd.read_csv(os.path.join(folder, f), delimiter=';') for f in os.listdir(folder)])
display(df)
Only one problem: it happens that one of the files is sometimes empty (0 cols, 0 rows), and in this case pandas throws an EmptyDataError: No columns to parse from file in line 6.
Do you have any suggestions how to bypass the empty csv file?
And, why not, how to concatenate csv files in a more efficient/simpler way?
Ideally, I would also like to add a column (to the dataframe df) to carry the name of each .csv.
You can check if a file is empty with:
import os
os.stat(FILE_PATH).st_size == 0
In your use case:
import os
df = pd.concat([
    pd.read_csv(os.path.join(folder, f), delimiter=';')
    for f in os.listdir(folder)
    if os.stat(os.path.join(folder, f)).st_size != 0
])
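For what it's worth, the same emptiness check can be written with pathlib; a small sketch with equivalent semantics (the helper name is made up):

from pathlib import Path

def is_empty_file(path):
    # zero-byte files are what trigger pandas' EmptyDataError here
    return Path(path).stat().st_size == 0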
Personally, I would filter the files for content while merging them, using a basic try-except.
import pandas as pd
import os

folder = r'C:\Users\_M92\Desktop\myFolder'

data = []
for f in os.listdir(folder):
    try:
        temp = pd.read_csv(os.path.join(folder, f), delimiter=';')
        # adding original filename column as per request
        temp['origin'] = f
        data.append(temp)
    except pd.errors.EmptyDataError:
        continue

df = pd.concat(data)
display(df)

Is there a way to export a list of 100+ dataframes to excel?

So this is kind of weird, but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to excel just overwrites the data in the file. I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, and I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
wb.save(outfile_path)  # without saving, nothing is written back to disk
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am, however, committed to this monstrosity.
Edit after reading comments:
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will list the entirety of the program below so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what are actually good Python practices and what are not.
import os
import pandas as pd
from openpyxl import load_workbook

# the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
# the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
# the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

# establishing the lists that I will store looped data into
file_list = []
main_df = []
master_list = []

# open the file path to store the directory in files
files = os.listdir(in_path)

# database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

# searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

# read the files into a dataframe; main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    # get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    # lists to store where headers are located in the dataframe
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                # main compare: if str and it matches the search params, then do...
                # (insensitive_compare is a helper defined elsewhere in the project)
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  # store data headers
                        row_list.append(number)  # store the row number where it is in that dataframe
                        column_list.append(name)  # store the column name where it is in that dataframe
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    # turns the dataframe into a set of booleans that are True
    # where there's something there
    na_finder = df.notna()
    # create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            # I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                # store the actual data into my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into excel like this:
So the comment from Ashish really helped me: all of the dataframes had different column titles, so my 100+ dataframes eventually concat'd to a dataframe that is 569x52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export it using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output to excel ended up looking something like this:
Hopefully this can help someone else out too.
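As an aside (not part of the original answer, just a hedged sketch): if the 100+ dataframes should stay separate instead of being concatenated, pandas can write each one to its own sheet of a single workbook through ExcelWriter, which avoids appending rows with openpyxl entirely:

import pandas as pd

# Sketch: main_df and outfile_path as in the code above.
with pd.ExcelWriter(outfile_path) as writer:
    for n, frame in enumerate(main_df):
        frame.to_excel(writer, sheet_name='frame_{}'.format(n))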

How to concatenate multiple csv files into one based on column names without having to type every column header in code

I am relatively new to python (about a week's experience) and I can't seem to find the answer to my problem.
I am trying to merge hundreds of csv files in my folder Data into a single csv file, based on column name.
The solutions I have found require me to type out either each file name or each column header, which would take days.
I used this code to create one csv file, but the column names move around and therefore the data is not in the same columns over the whole DataFrame:
import pandas as pd
import glob
import os

def concatenate(indir=r"C:\\Users\ge\Documents\d\de",
                outfile=r"C:\Users\ge\Documents\d"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=None)
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
Is there a quick method to do this, as I have less than a week to run statistics on the dataset?
Any help would be appreciated.
Here is one memory-efficient way to do that.
from pathlib import Path
import csv

indir = Path(r'C:\\Users\gerardchurch\Documents\Data\dev_en')
outfile = Path(r"C:\\Users\gerardchurch\Documents\Data\output.csv")

def find_header_from_all_files(indir):
    columns = set()
    print("Looking for column names in", indir)
    for f in indir.glob('*.csv'):
        with f.open() as sample_csv:
            sample_reader = csv.DictReader(sample_csv)
            try:
                first_row = next(sample_reader)
            except StopIteration:
                print("File {} doesn't contain any data. Double check this".format(f))
                continue
            else:
                columns.update(first_row.keys())
    return columns

columns = find_header_from_all_files(indir)
print("The columns are:", sorted(columns))

with outfile.open('w') as outf:
    wr = csv.DictWriter(outf, fieldnames=list(columns))
    wr.writeheader()
    for inpath in indir.glob('*.csv'):
        print("Parsing", inpath)
        with inpath.open() as infile:
            reader = csv.DictReader(infile)
            wr.writerows(reader)
print("Done, find the output at", outfile)
This should handle the case when one of the input csvs doesn't contain all of the columns.
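That works because csv.DictWriter fills any missing field with its restval, an empty string by default (rows with unexpected extra fields would instead raise a ValueError unless extrasaction='ignore' is passed). A tiny self-contained sketch:

import csv
import io

buf = io.StringIO()
wr = csv.DictWriter(buf, fieldnames=['a', 'b', 'c'])  # union of all headers
wr.writeheader()
wr.writerow({'a': 1, 'c': 3})  # 'b' is absent, so it is written as an empty cell
print(buf.getvalue())
# a,b,c
# 1,,3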
I am not sure if I understand your problem correctly, but this is one of the ways that you can merge your files without giving any column names:
import pandas as pd
import glob
import os
def concatenate(indir):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    output_file = pd.concat([pd.read_csv(filename) for filename in fileList])
    output_file.to_csv("_output.csv", index=False)

concatenate(indir=r"C:\\Users\gerardchurch\Documents\Data\dev_en")

How to handle another Excel file in Python

Good morning.
I'm starting with Python and I have a problem.
I need to find all .xls files (all have the same header) and merge them all into a single DataFrame, so I need to say that the first line of each file should be ignored.
The current code I'm using is this:
os.chdir("file folder path")
fileLista = glob.glob('*.xls')
df = list()
for arquivo in fileLista:
    df = df.append(pd.read_excel(arquivo))
Company = pd.concat(df)
Company.columns = Company.columns.str.strip()
I am using glob to return all the files with the .xls extension,
df.append to merge all the files that have been returned and put them inside a DataFrame,
the Company concat to form a single file,
and the Company strip to remove the spaces in the column headers.
When I run the code it returns this error:
"'NoneType' object is not iterable"
Can anyone help me with this mistake?
What about this instead?
fileLista = glob.glob('*.xls')
Company = pd.DataFrame()
for arquivo in fileLista:
    df = pd.read_excel(arquivo)
    Company = pd.concat([Company, df])
Company.columns = Company.columns.str.strip()
This should do what you want.
import pandas as pd
import numpy as np
import glob

glob.glob("C:/your_path_here/*.xlsx")
all_data = pd.DataFrame()
for f in glob.glob("C:/your_path_here/*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
print(all_data)
Here is another option to consider.
import pandas as pd
# filenames
excel_names = ["C:/your_path_here/Book1.xlsx", "C:/your_path_here/Book2.xlsx", "C:/your_path_here/Book3.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]
# concatenate them..
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
# Results go to the default directory if not assigned somewhere else.
# C:\Users\Excel\.spyder-py3

python pandas automatic excel lookup system

I know this is a lot of code and there is a lot to do, but I am really stuck and don't know how to continue after getting the function to match identical entries between the files. I am pretty sure you know how a lookup in Excel works; this program does basically the same. I tried to comment the important parts, and I hope you can give me some help on how I can continue this project. Thank you very much!
import pandas as pd
import xlrd

File1 = pd.read_excel("Excel_test.xlsx", usecols=[0], header=None, index=False)  # the two excel files with the columns that should be compared
File2 = pd.read_excel("Excel_test02.xlsx", usecols=[0], header=None, index=False)
fullFile1 = pd.read_excel("Excel_test.xlsx", header=None, index=False)  # the full excel files
fullFile2 = pd.read_excel("Excel_test02.xlsx", header=None, index=False)

i = 0
writer = pd.ExcelWriter("output.xlsx")

def loadingTime():  # just a loader that shows the percentage of the matching process
    global i
    loading = (i / len(File1)) * 100
    loading = round(loading, 2)
    print(str(loading) + "%/100%")

def matcher():
    global i
    while i < len(File1):  # walks through the column that should be compared, row by row
        for o in range(len(File2)):  # runs through the column in the second file
            matching = File1.iloc[i].str.lower() == File2.iloc[o].str.lower()  # matches the column contents of the two files
            if matching.bool() == True:
                print("Match")
                """
                df.append(File1.iloc[i])  # the whole row of the matched column should be appended to a DataFrame, arranged like the excel file
                df.append(File2.iloc[o])
                """
        i += 1

matcher()
df.to_excel(writer, "Sheet")
writer.save()  # after the two files have been compared, write a file containing both excel contents, correctly arranged
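No answer is recorded here, but one hedged way to continue: the nested while/for comparison is essentially what pandas.merge does in a single vectorized call. A minimal sketch under the same assumptions as the code above (the lookup key sits in column 0 of both files):

import pandas as pd

# Read the full files; with header=None the columns are numbered 0, 1, 2, ...
f1 = pd.read_excel("Excel_test.xlsx", header=None)
f2 = pd.read_excel("Excel_test02.xlsx", header=None)

# Case-insensitive key, matching the .str.lower() comparison above.
f1[0] = f1[0].str.lower()
f2[0] = f2[0].str.lower()

# Keep only rows whose column-0 values match, pulling the remaining
# columns of both files into one row, like an Excel VLOOKUP.
merged = f1.merge(f2, on=0, how='inner', suffixes=('_file1', '_file2'))
merged.to_excel("output.xlsx", index=False)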
