I need to concatenate CSV files with the same column headers in Python. The files with the following names should be concatenated in ascending order of filename:
AB201602.csv
AB201603.csv
AB201604.csv
AB201605.csv
AB201606.csv
AB201607.csv
AB201608.csv
AB201610.csv
AB201612.csv
I would like to keep the column headers from the first file only. Any ideas?
I tried the code below, but it combined the CSV files in random filename order and truncated half of the column header names. Thanks.
import csv
import glob

csvfiles = glob.glob('/home/c/*.csv')  # glob returns files in arbitrary order
wf = csv.writer(open('/home/c/output.csv', 'wb'), delimiter=',')
for files in csvfiles:
    rd = csv.reader(open(files, 'r'), delimiter=',')
    rd.next()  # skips the header row of every file, including the first
    for row in rd:
        print(row)
        wf.writerow(row)
Using @Gokul's comment and pandas:
import glob
import pandas as pd

csvfiles = sorted(glob.glob('/home/c/*.csv'))  # sorted() gives ascending filename order
# DataFrame.append was removed in pandas 2.0; collect the frames and concat once
frames = [pd.read_csv(f) for f in csvfiles]
df = pd.concat(frames, ignore_index=True)
df.to_csv('newfile.csv', index=False)  # a single header row is written
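If you would rather stay with the csv module, sorting the glob result and copying the header from the first file only fixes both problems. A minimal Python 3 sketch (paths taken from your question):

import csv
import glob

csvfiles = sorted(glob.glob('/home/c/*.csv'))  # ascending filename order

with open('/home/c/output.csv', 'w', newline='') as out:
    wf = csv.writer(out)
    for i, name in enumerate(csvfiles):
        with open(name, newline='') as src:
            rd = csv.reader(src)
            header = next(rd)
            if i == 0:
                wf.writerow(header)  # keep the header from the first file only
            for row in rd:
                wf.writerow(row)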
I have a list of .csv files stored in a local folder and I'm trying to concatenate them into one single dataframe.
Here is the code I'm using:
import pandas as pd
import os
folder = r'C:\Users\_M92\Desktop\myFolder'
df = pd.concat([pd.read_csv(os.path.join(folder, f), delimiter=';') for f in os.listdir(folder)])
display(df)
Only one problem: one of the files is sometimes empty (0 columns, 0 rows), and in that case pandas throws EmptyDataError: No columns to parse from file on line 6.
Do you have any suggestions for how to skip the empty CSV file?
And, while we're at it, is there a more efficient/simpler way to concatenate CSV files?
Ideally, I would also like to add a column (to the dataframe df) to carry the name of each .csv.
You can check if a file is empty with:
import os
os.stat(FILE_PATH).st_size == 0
In your use case:
import os
df = pd.concat([
    pd.read_csv(os.path.join(folder, f), delimiter=';')
    for f in os.listdir(folder)
    if os.stat(os.path.join(folder, f)).st_size != 0
])
Personally, I would filter the files on their content rather than their size, merging them with a basic try/except.
import pandas as pd
import os

folder = r'C:\Users\_M92\Desktop\myFolder'
data = []
for f in os.listdir(folder):
    try:
        temp = pd.read_csv(os.path.join(folder, f), delimiter=';')
        # adding the original filename column, as requested
        temp['origin'] = f
        data.append(temp)
    except pd.errors.EmptyDataError:
        continue  # skip files with no parsable columns
df = pd.concat(data)
display(df)
I want to read multiple .csv files and append the datetime part of their filename to the column header. Each csv file contains data acquired at a specific datetime. Each filename has the following format:
yyyy-mm-dd_hh-mm-ss_someothertext
Each file contains only one column of data.
I successfully import multiple files as a list of dataframes as follows:
import pandas as pd
import glob
path = r'C:\Users\...' #path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
I then concatenate the files into one dataframe such that each column contains the data from one of the files:
frame = pd.concat(li, axis=1, ignore_index=True)
However, this is where I lose the filename information. The column headers are now just a series of numbers. My question is: how can I append the datetime portion of each filename to its respective column header in frame?
The closest I have got is appending the whole filename, not just the datetime part, in a roundabout way: transposing frame, adding the whole filename as a new column, transposing back, then setting the filename row as the header row...
import os

frame = pd.DataFrame.transpose(frame)
frame['filename'] = os.path.basename(filename)
frame = pd.DataFrame.transpose(frame)
frame = frame.reset_index(drop=True)  # reset_index returns a new frame; assign it back
frame.columns = frame.iloc[6628]  # row 6628 is where the filename row ends up after transposing
This seems terribly inefficient though and ends up with the whole filename as the header rather than just the datetime part.
This would be my suggested approach, squeezing each DataFrame to a Series and using a regex:
import re
import os
import glob
import pandas as pd
path = r'C:\Users\....'  # raw string so the backslashes are not treated as escapes
files = glob.glob(os.path.join(path, '*.csv'))
li = []
for file in files:
    name = os.path.basename(file)  # get the filename
    date = re.search(r'\d{4}-\d{2}-\d{2}', name).group(0)  # extract yyyy-mm-dd from the filename
    # read the file, squeeze the single column to a Series, rename it to the date
    li.append(pd.read_csv(file, index_col=None, header=0, squeeze=True).rename(date))
frame = pd.concat(li, axis=1, ignore_index=False)
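One caveat: the squeeze= keyword of read_csv was deprecated in pandas 1.4 and removed in 2.0. On newer versions, squeeze after reading instead; this is a drop-in replacement for the append line above:

li.append(
    pd.read_csv(file, index_col=None, header=0)
      .squeeze('columns')  # single-column DataFrame -> Series
      .rename(date)
)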
I'm trying to merge multiple CSVs into one big file.
The script is working, but I would like to have only the first header, not one for each CSV inside the big file.
How can I do that? Shouldn't it work with header=None?
import os
import glob
import pandas

def concatenate(inDir=r'myPath', outFile=r'outPath'):
    os.chdir(inDir)
    fileList = glob.glob('*.csv')  # generate a list of csv files using glob
    dfList = []
    for filename in fileList:
        print(filename)
        # header=0 (the default) parses each file's header instead of reading it as data,
        # which is why header=None gave you one header row per file
        df = pandas.read_csv(filename)
        dfList.append(df)
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.to_csv(outFile, index=None)  # outFile, not outfile; writes the header once
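If you would rather keep header=None and treat every row as raw data, a hedged alternative is to skip the header line of each file after the first and suppress the integer column labels on output (concatenate_raw is a hypothetical name):

import os
import glob
import pandas

def concatenate_raw(inDir=r'myPath', outFile=r'outPath'):
    os.chdir(inDir)
    dfList = []
    for filename in glob.glob('*.csv'):
        skip = 0 if not dfList else 1  # first file keeps its header line as a plain row
        dfList.append(pandas.read_csv(filename, header=None, skiprows=skip))
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.to_csv(outFile, index=None, header=False)  # don't write the 0,1,2,... labels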
Hello!
I would like to combine many CSV files (the total will be around 120-150) horizontally into one CSV file by taking one column from each file (in this case, the column called "grid"). All the files have the same columns and number of rows (they are constructed the same way) and are stored in the same directory. I've tried the csv module and pandas. I don't want to list all 120 files by hand; I need a script that does it automatically. I'm stuck and out of ideas...
Some input CSV files (data) and the merged CSV file I would like to get:
https://www.dropbox.com/transfer/AAAAAHClI5b6TPzcmW2dmuUBaX9zoSKYD1ZrFV87cFQIn3PARD9oiXQ
This is what my code looks like when I use the csv module:
import os
import glob
import csv

os.chdir(r'\csv_files_direction')  # raw string so the backslash isn't treated as an escape
extension = 'csv'
files = [i for i in glob.glob('*.{}'.format(extension))]
out_merg = r'\merged_csv_file_direction'
with open(out_merg, 'wt') as out:
    writer = csv.writer(out)
    for file in files:
        with open(file) as csvfile:
            data = csv.reader(csvfile, delimiter=';')
            result = []
            for row in data:
                a = row[3]  # the column I need
                result.append(a)
Using this code I receive values only from the last CSV; the rest are missing. What I want is one specific column from each CSV file in the directory.
And Pandas:
import os
import glob
import pandas as pd

os.chdir(r'\csv_files_direction')
extension = 'csv'
files = [i for i in glob.glob('*.{}'.format(extension))]
out_merg = r'\merged_csv_file_direction'
in_names = [pd.read_csv(f, delimiter=';', usecols=['grid']) for f in files]
Using pandas I receive the data from all the CSVs as a list that can be indexed, e.g. in_names[1].
I confess this is my first try with pandas, and I have no idea what the next step should be.
I will really appreciate any help!
Thanks in advance,
Mateusz
For the csv part, I think you need another list defined OUTSIDE the loop.
Something like:
import os
import glob
import csv

dirname = os.path.dirname(os.path.realpath(__file__))  # folder this script lives in

extension = 'csv'
files = [i for i in glob.glob('*.{}'.format(extension))]
out_merg = 'merged_csv_file_direction'

result = []  # defined OUTSIDE the loop, so it survives all the files
for file in files:
    with open(file) as csvfile:
        data = csv.reader(csvfile, delimiter=';')
        col = []
        for row in data:
            a = row[3]  # the column you need
            col.append(a)
        result.append(col)

# write the collected columns side by side: zip(*result) turns columns into rows
with open(out_merg, 'wt', newline='') as out:
    writer = csv.writer(out)
    writer.writerows(zip(*result))
NOTE: I have also changed the way the script locates the folder. Now you can run the file directly from the folder that contains the two folders (one holding the input data, the other for saving the output).
Regarding the pandas part, you can loop again, but this time you need to CONCAT the dataframes you created with in_names = [pd.read_csv(f, delimiter=';', usecols=['grid']) for f in files].
I think you can use:
import os
import glob
import pandas as pd

os.chdir(r'\csv_files_direction')
extension = 'csv'
files = [i for i in glob.glob('*.{}'.format(extension))]
out_merg = r'\merged_csv_file_direction'
in_names = [pd.read_csv(f, delimiter=';', usecols=['grid']) for f in files]
result = pd.concat(in_names, axis=1)  # axis=1 places the 'grid' columns side by side
result.to_csv(out_merg, index=False)
Tell me if it works
I am trying to come up with a script that will read all CSV files larger than 62 bytes, write two of their columns to a separate Excel file, and build a list.
The following is one of the csv files:
FileUUID Table RowInJSON JSONVariable Error Notes SQLExecuted
ff3ca629-2e9c-45f7-85f1-a3dfc637dd81 lng02_rpt_b_calvedets 1 Duplicate entry 'ETH0007805440544' for key 'nosameanimalid' INSERT INTO lng02_rpt_b_calvedets(farmermobile,hh_id,rpt_b_calvedets_rowid,damidyesno,damid,calfdam_id,damtagid,calvdatealv,calvtype,calvtypeoth,easecalv,easecalvoth,birthtyp,sex,siretype,aiprov,othaiprov,strawidyesno,strawid) VALUES ('0974502779','1','1','0','ETH0007805440544','ETH0007805470547',NULL,'2017-09-16','1',NULL,'1',NULL,'1','2','1',NULL,NULL,NULL,NULL,NULL,'0',NULL,NULL,NULL,NULL,NULL,NULL,'0',NULL,'Tv',NULL,NULL,'Et','23',NULL,'5',NULL,NULL,NULL,'0','0')
This is my attempt at solving the problem:
import os
import glob
import csv

path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*.csv')):
    output = infile + '.out'
    with open(infile, 'r') as source:
        readr = csv.reader(source)
        with open(output, 'w', newline='') as result:
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
Please point me in the right direction or suggest an alternative solution.
pandas does a lot of what you are trying to achieve:
import pandas as pd
# Read a csv file to a dataframe
df = pd.read_csv("<path-to-csv>")
# Filter two columns
columns = ["FileUUID", "Table"]
df = df[columns]
# Combine multiple dataframes
df_combined = pd.concat([df1, df2, df3, ...])
# Output dataframe to excel file
df_combined.to_excel("<output-path>", index=False)
To loop through all CSV files larger than 62 bytes, you can use glob.glob() and os.stat():
import os
import glob
import pandas as pd

dataframes = []
for csvfile in glob.glob("<csv-folder-path>/*.csv"):
    if os.stat(csvfile).st_size > 62:  # st_size is in bytes
        dataframes.append(pd.read_csv(csvfile))
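Putting the two pieces together (the csvs/ folder and the 62-byte threshold are from your question; the output filename and the assumption that the files are comma-separated are mine, so adjust sep= and the column list as needed; to_excel also needs an engine such as openpyxl installed):

import os
import glob
import pandas as pd

dataframes = []
for csvfile in glob.glob("csvs/*.csv"):
    if os.stat(csvfile).st_size > 62:  # skip files of 62 bytes or less
        df = pd.read_csv(csvfile)      # pass sep='\t' here if the files are tab-delimited
        dataframes.append(df[["FileUUID", "Table"]])

pd.concat(dataframes).to_excel("combined.xlsx", index=False)  # hypothetical output name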
Use the standard csv module. Don't re-invent the wheel.
https://docs.python.org/3/library/csv.html