I am trying to read in a folder of CSV files, process them one by one to remove duplicates, and then add them to a master dataframe, which will finally be written out to a CSV. I have this...
import pandas as pd
import os
import sys

output = pd.DataFrame(columns=['col1', 'col2'])
for root, dirs, files in os.walk("sourcefolder", topdown=False):
    for name in files:
        data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
        output.append(data)
output.to_csv("output.csv", index=False, encoding='utf8')
But my output CSV is empty apart from the column names. Does anyone have any idea where I am going wrong?
Pandas dataframes don't act like a list, so you can't use append like that: DataFrame.append doesn't modify the dataframe in place, it returns a new one that you have to assign back. Try:
import pandas as pd
import os

output = pd.DataFrame(columns=['col1', 'col2'])
for root, dirs, files in os.walk("sourcefolder", topdown=False):
    for name in files:
        data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
        output = output.append(data)
output.to_csv("output.csv", index=False, encoding='utf8')
Alternatively, you can make output a list of dataframes and then use pd.concat to create a consolidated dataframe at the end; depending on the volume of data this can be more efficient. It also sidesteps DataFrame.append, which is deprecated and was removed in pandas 2.0.
The built-in pandas function concat is also pretty good: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat
import pandas as pd
import os

output = pd.DataFrame(columns=['col1', 'col2'])
for root, dirs, files in os.walk("sourcefolder", topdown=False):
    for name in files:
        data = pd.read_csv(os.path.join(root, name), usecols=[1], skiprows=1)
        output = pd.concat([output, data], ignore_index=True)
output.to_csv("output.csv", index=False, encoding='utf8')
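Note that the original question also wants duplicates removed before the data lands in the master dataframe, and neither snippet above does that. A minimal sketch, assuming a "duplicate" means a fully identical row (the helper name dedupe is mine, not from the question):

```python
import pandas as pd

def dedupe(df):
    """Drop rows that are exact duplicates, keeping the first occurrence."""
    return df.drop_duplicates(keep='first').reset_index(drop=True)
```

Call it on each dataframe as it is read, or once on the combined result if duplicates can span files.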
Related
I have an Excel file that needs to be refreshed automatically every week. It has to be extended with data from other Excel files. The problem is that these files have different names each time.
So in my opinion I cannot use code like:
import pandas as pd
NG = 'NG.xlsx'
df = pd.read_excel(NG)
because the filename is not always "NG" like in this case.
Do you have any ideas?
Best Greetz
You could read all the files in your folder by doing this, because it allows you to ignore name changes:
import glob
import pandas as pd

# get data file names
path = r"C:\.......\folder_with_excel"
filenames = glob.glob(path + "/*.xlsx")

# read each workbook's Sheet1 into a list, then concatenate once
dfs = []
for filename in filenames:
    xl_file = pd.ExcelFile(filename)
    dfs.append(xl_file.parse('Sheet1'))
combined = pd.concat(dfs, ignore_index=True)
Alternatively:
import os
import pandas as pd

path = os.getcwd()
files = os.listdir(path)  # list all the files in your directory
files_xlsx = [f for f in files if f.endswith('.xlsx')]  # keep only the xlsx files

frames = []
for f in files_xlsx:
    info = pd.read_excel(f, sheet_name='<sheet name>')  # drop sheet_name if you don't need it
    frames.append(info)
df = pd.concat(frames, ignore_index=True)
I'm trying to merge multiple CSVs into one big file.
The script works, but I would like to have only the first header, and not one for each CSV within bigfile.
How can I do that? Shouldn't it work with header=None?
import os
import glob
import pandas

def concatenate(inDir=r'myPath', outFile=r"outPath"):
    os.chdir(inDir)
    fileList = glob.glob("*.csv")  # generate a list of csv files using glob
    dfList = []
    for filename in fileList:
        print(filename)
        df = pandas.read_csv(filename, header=None)
        dfList.append(df)
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.to_csv(outFile, index=None)  # export the dataframe to a csv file
I have a directory with csvs whose filenames represent the id of a row of a database.
I would like to read in this directory into a Pandas dataframe and join it to an existing dataframe.
Is there any way in Python to read the results of an 'ls' command into a pandas Dataframe?
I've tried getting a string of the filenames with the code below but I'm having trouble figuring out how to get it into a dataframe after.
import os

files = ''
for root, dirs, files in os.walk("."):
    for filename in files:
        files += filename
You are able to walk the files; now you just need to read each csv and concat them onto a dataframe.
import os
import pandas as pd

df = None
for root, dirs, files in os.walk('.'):
    for filename in files:
        path = os.path.join(root, filename)  # read_csv needs the full path, not just the name
        if df is None:
            df = pd.read_csv(path)
            df['filename'] = filename
            continue
        tmp = pd.read_csv(path)
        tmp['filename'] = filename
        df = pd.concat([df, tmp], ignore_index=True)
I am working with lots of csv files and need to add a column. I tried glob, for example:
import glob
import numpy as np
import pandas as pd

filenames = sorted(glob.glob('./DATA1/*2018*.csv'))
filenames = filenames[0:10]

for f in filenames:
    df = pd.read_csv(f, header=None, index_col=None)
    df.columns = ['Date','Signal','Data','Code']
    # this is what I should add to all csv files
    df["ID"] = df["Data"].str.slice(0,2)
and I need a way to save each file back to csv (not concatenated) under a different name, such as "file01edited.csv", after I add the column.
Use to_csv with f-strings to change the file names:
for f in filenames:
    df = pd.read_csv(f, names=['Date','Signal','Data','Code'], index_col=None)
    # this is what I should add to all csv files
    df["ID"] = df["Data"].str.slice(0,2)
    # python 3.6+
    df.to_csv(f'{f[:-4]}edited.csv', index=False)
    # python below 3.6
    # df.to_csv('{}edited.csv'.format(f[:-4]), index=False)
I have this code
import pandas as p
import csv
df = p.read_csv('interview1.csv')
df2 = df[['Participant', 'Translation']] # selects two of the columns in your file
df2.to_csv('out.csv')
How do I read multiple files and then write them all to 'out.csv'? So basically, instead of reading only interview1, I read interview2 through interview7 into out.csv as well.
Simply open the output file in append mode, writing the header only for the first file:
import pandas as p

csv_list = ['interview1.csv', 'interview2.csv', ...]
for i, itw in enumerate(csv_list):
    df = p.read_csv(itw)
    df2 = df[['Participant', 'Translation']]  # same two columns as before
    df2.to_csv('out.csv', mode='a', header=(i == 0), index=False)
Use this to read all the .csv data from a folder and combine it together:
import pandas as pd
import glob
import os
path = r'file path'
all_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
concatenated_df.to_csv("combined-data_new.csv")