I have a script that parses all the Excel files in one directory and concatenates them into a single file.
Right now I write the CSV from a dataframe as follows: I start with an empty list, append the data that the function cutpaste extracts from each file, concatenate the list into a new dataframe, and write that dataframe to a final CSV file.
files is the variable that holds all the Excel files from the given directory.
# Create new CSV file
df_list = []
for file in files:
    df = pd.read_excel(io=file, sheet_name=sheet)
    new_file = cutpaste(df)
    df_list.append(new_file)
df_final = pd.concat(df_list)
df_final.to_csv('Energy.csv', header=True, index=False)
What I need now is a way of changing my code so that I can write any new Excel files that don't already exist in Energy.csv to Energy.csv.
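One possible way to do this, sketched under a few assumptions: every row written to Energy.csv carries the name of the file it came from, so a later run can tell which files were already processed. The source_file column, the append_new_files helper, and the load callback are names made up for this sketch. load stands in for whatever reads and parses one file (for example, read_excel followed by cutpaste).

```python
from pathlib import Path

import pandas as pd


def append_new_files(files, out_csv, load, source_col="source_file"):
    """Append rows from files not yet recorded in out_csv; return the new names.

    Hypothetical helper: `load` is any callable that turns one path into a
    DataFrame, and `source_col` is an extra column added by this sketch.
    """
    out_csv = Path(out_csv)
    has_data = out_csv.exists() and out_csv.stat().st_size > 0

    # Names of files whose rows are already in the CSV.
    seen = set()
    if has_data:
        seen = set(pd.read_csv(out_csv, usecols=[source_col])[source_col])

    new_frames, new_names = [], []
    for file in files:
        name = Path(file).name
        if name in seen:
            continue  # already written on a previous run (or listed twice)
        seen.add(name)
        df = load(file).copy()
        df[source_col] = name
        new_frames.append(df)
        new_names.append(name)

    if new_frames:
        # Append mode; write the header only when the CSV is new or empty.
        pd.concat(new_frames, ignore_index=True).to_csv(
            out_csv, mode="a", header=not has_data, index=False
        )
    return new_names
```

With the names from the question, the call might look like append_new_files(files, 'Energy.csv', load=lambda f: cutpaste(pd.read_excel(f, sheet_name=sheet))). This assumes every file yields the same columns in the same order, since appended rows are written without a header. An Energy.csv produced by the old script has no source_file column, so it would need to be rebuilt once.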
Related
I have a folder of Excel files, many of which have 3-4 tabs worth of data that I just want as individual Excel files. For example, let's say I have an Excel file with three tabs: "employees", "summary", and "data". I would want this to create 3 new Excel files out of this: employees.xlsx, summary.xlsx, and data.xlsx.
I have code that will loop through a folder and identify all of the tabs, but I am struggling to figure out how to export the data from each sheet into its own Excel file. I have gotten to the point where I can loop through the folder, open each Excel file, and find the name of each sheet. Here's what I have so far.
import pandas as pd
import os

# filenames
files = os.listdir()
excel_names = list(filter(lambda f: f.endswith('.xlsx'), files))
excels = [pd.ExcelFile(name, engine='openpyxl') for name in excel_names]
sh = [x.sheet_names for x in excels] # I am getting all of the sheet names here
for s in sh:
    for x in s:
        # this is where I want to start exporting each sheet as its own spreadsheet
        # df.to_excel("output.xlsx", header=False, index=False) # I want to eventually export it obviously, this is a placeholder
        pass
import pandas as pd
import glob

# get the file names using glob
# (this assumes that the files are in the current working directory)
excel_names = glob.glob('*.xlsx')
# iterate through the excel file names
for excel in excel_names:
    # read the excel file with sheet_name as None
    # this will create a dict
    dfs = pd.read_excel(excel, sheet_name=None)
    # iterate over the dict keys (which are the sheet names)
    for key in dfs.keys():
        # use f-strings (only available in python 3) to assign
        # the new file name as the sheet_name
        dfs[key].to_excel(f'{key}.xlsx', index=False)
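One thing to watch with the loop above: if two workbooks both contain a sheet called, say, "summary", the second summary.xlsx silently overwrites the first. A minimal sketch of one workaround is to prefix each output with the workbook's name and write into a separate folder. The helper names and the split folder are assumptions of this sketch, not part of the original answer.

```python
from pathlib import Path

import pandas as pd


def sheet_output_path(workbook, sheet_name, out_dir):
    """Build '<workbook stem>_<sheet name>.xlsx' inside out_dir."""
    return Path(out_dir) / f"{Path(workbook).stem}_{sheet_name}.xlsx"


def split_workbooks(folder=".", out_dir="split"):
    """Write every sheet of every .xlsx in folder to its own workbook."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for workbook in sorted(Path(folder).glob("*.xlsx")):
        # sheet_name=None returns a dict of {sheet name: DataFrame}
        sheets = pd.read_excel(workbook, sheet_name=None)
        for sheet_name, df in sheets.items():
            target = sheet_output_path(workbook, sheet_name, out_dir)
            df.to_excel(target, index=False)
            written.append(target)
    return written
```

Writing into a subfolder also keeps the generated files from being picked up as inputs if the script is run a second time.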
I have several Excel files in the same folder, and each of them contains the same sheets. I need to select the last sheet of each file and combine them all by column name into a single table. The columns are named the same in all files. I think I need to read the last sheet of each file into a dataframe and then concatenate them, but I do not know how.
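Just do what Recessive said: use a for loop to read the Excel files one by one, keep the sheet you need from each, and concatenate the results after the loop:
import os
import pandas as pd

frames = []
for file in os.listdir(filepath):
    if not file.endswith('.xlsx'):
        continue
    with pd.ExcelFile(os.path.join(filepath, file)) as book:
        # read the last sheet; select specific columns here if needed
        frames.append(book.parse(book.sheet_names[-1]))
# concatenate the per-file dataframes into one dataframe
combined = pd.concat(frames, ignore_index=True)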
So I'm working on a project that analyzes Covid-19 data from this entire year. I have multiple CSV files in a given directory, and I am trying to merge the contents of all the files from each month into a single, comprehensive CSV file. What I have so far is shown below. The error message that appears is 'EmptyDataError: No columns to parse from file.' If I delete df = pd.read_csv('./csse_covid_19_daily_reports_us/' + file) and simply run print(file), it lists all the correct files that I am trying to merge. However, when I try to merge all the data into one, I get that error message. What gives?
import pandas as pd
import os

df = pd.read_csv('./csse_covid_19_daily_reports_us/09-04-2020.csv')
files = [file for file in os.listdir('./csse_covid_19_daily_reports_us')]
all_data = pd.DataFrame()
for file in files:
    df = pd.read_csv('./csse_covid_19_daily_reports_us/' + file)
    all_data = pd.concat([all_data, df])
all_data.head()
Folks, I have resolved this issue. Instead of sifting through files with files = [file for file in os.listdir('./csse_covid_19_daily_reports_us')], I have instead used files=[f for f in os.listdir("./") if f.endswith('.csv')]. This filtered out some garbage files that were not .csv, thus allowing me to compile all data into a single csv.
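For anyone hitting the same error, here is a small sketch of that fix which also guards against empty files. It assumes the reports all live in one folder; load_reports is a name invented for this example. Collecting the frames in a list and concatenating once at the end avoids re-copying all_data on every iteration.

```python
from pathlib import Path

import pandas as pd


def load_reports(folder):
    """Read every .csv in folder and stack the rows into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):  # ignores non-CSV files
        try:
            frames.append(pd.read_csv(path))
        except pd.errors.EmptyDataError:
            # A zero-byte file has no columns to parse; skip it.
            continue
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

A call such as load_reports('./csse_covid_19_daily_reports_us') would then replace the loop in the question.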
I have written a script that extracts text from multiple CSVs. Can someone help me extend it so that it reads CSV data from different zipped files and creates multiple CSVs (one for each zipped file) at one location?
For example, if I have 10 CSVs in zipped folder z1 and 5 in zipped folder z2, I want to extract the files from each zipped file and write the results to one location. In this case the results would be z1.csv (with concatenated data from the 10 CSVs) and z2.csv (with concatenated data from the 5 CSVs).
I am using the following script:
import glob
import os
import pandas as pd

allfiles = glob.glob(os.path.join(input_fldr, "*.csv"))
a = []
b = []
for file_ in allfiles:
    dirname, filename = os.path.split(file_)
    f = open(file_, 'r', encoding='UTF-8')
    lines = f.readlines()
    f.close()
    for line in lines:
        if line.startswith('Hello'):
            a.append(filename)
            b.append(line)
df_a = pd.DataFrame(a, columns=list("A"))
df_b = pd.DataFrame(b, columns=list("B"))
df = pd.concat([df_a, df_b], axis=1)
The Code
The code I came up with, which does roughly what I believe you want, is this (all the files you need for this example are available here):
import zipfile
import pandas as pd
virtual_csvs = []
with zipfile.ZipFile("test3.zip", "r") as f:
    for name in f.namelist():
        if name.endswith(".csv"):
            data = f.open(name)
            virtual_csvs.append(pd.read_csv(data, header=None))
pd.concat(virtual_csvs, axis=1).to_csv('test4.csv', header=False, index=False)
Code Breakdown
virtual_csvs = []
We start by creating a list that will store all of the pandas DataFrames, much like your list [df_a, df_b].
with zipfile.ZipFile("test3.zip", "r") as f:
This opens the zip file "test3.zip" (replace it with your zip file's name) in read mode and binds it to the variable f.
for name in f.namelist():
This iterates over every file name in the zip file, assigning each one to the variable name.
if name.endswith(".csv"):
This line is rather self-explanatory - if the file has an extension of .csv, run the following code.
data = f.open(name)
The f.open(name) call opens the archive member name as a binary file-like object. For a file on disk the equivalent would be open(name, 'rb').
virtual_csvs.append(pd.read_csv(data, header=None))
pd.read_csv(data, header=None) loads that file into a pandas DataFrame (header=None tells pandas the file has no header row, so the first line is read as data)
virtual_csvs.append adds the DataFrame to the virtual_csvs list
The final line of this code:
pd.concat(virtual_csvs, axis=1).to_csv('test4.csv', header=False, index=False)
concatenates all of the CSV files into one larger file ('test4.csv').
pd.concat(virtual_csvs, axis=1) joins all the DataFrames in virtual_csvs side by side, column-wise (this returns a pd.DataFrame)
to_csv('test4.csv', header=False, index=False) writes that DataFrame to a CSV file named 'test4.csv'.
header=False leaves the column names out of the output
index=False leaves the row index out of the output
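To get one output file per archive, as the question asks (z1.csv from z1.zip, z2.csv from z2.zip), the same idea can be wrapped in a loop over the zip files. This is only a sketch: the function name and the folder arguments are made up here, and it stacks the member CSVs as rows (the default axis=0) rather than side by side, on the assumption that "concatenated" means appended one after another.

```python
import zipfile
from pathlib import Path

import pandas as pd


def zips_to_csvs(zip_dir, out_dir):
    """For each .zip in zip_dir, write <zip name>.csv with all member CSVs stacked."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        frames = []
        with zipfile.ZipFile(zip_path, "r") as zf:
            for name in zf.namelist():
                if name.lower().endswith(".csv"):
                    # read the member straight from the archive, no extraction
                    with zf.open(name) as member:
                        frames.append(pd.read_csv(member, header=None))
        if frames:
            out_path = out_dir / f"{zip_path.stem}.csv"
            pd.concat(frames, ignore_index=True).to_csv(
                out_path, header=False, index=False
            )
            written.append(out_path)
    return written
```

Archives that contain no CSV members are skipped, so no empty output file is created for them.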
I'm new to Python. I have tried to apply this code to merge multiple CSV files, but it doesn't work. Basically, I have files that contain stock prices with the header date,open,High,low,Close,Adj Close Volume..., but each CSV file has a different name: Apl.csv, VIX.csv, FCHI.csv, etc.
I would like to merge all these CSV files into one, but I would also like to add a new column that gives the name of the CSV file each row came from. For example:
stock_id,date,open,High,low,Close,Adj Close Volume with stock_id = apl, Vix, etc.
I used this code, but I got stuck at line 4.
Here is the code:
files = os.listdir()
file_list = list()
for file in os.listdir():
if file.endswith(".csv")
df=pd.read_csv(file,sep=";")
df['filename'] = file
file_list.append(df)
all_days = pd.concat(file_list, axis=0, ignore_index=True)
all_days.to_csv("all.csv")
Could someone help me sort this out?
In Python, the indentation level matters, and you need a colon at the end of an if statement. I can't speak to the method you're trying, but you can clean up the syntax with this:
import os
import pandas as pd

files = os.listdir()
file_list = list()
for file in os.listdir():
    if file.endswith(".csv"):
        df=pd.read_csv(file,sep=";")
        df['filename'] = file
        file_list.append(df)
all_days = pd.concat(file_list, axis=0, ignore_index=True)
all_days.to_csv("all.csv")
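The question also asks for a stock_id column holding the file name without its extension (apl, Vix, and so on). A rough sketch of that, assuming the files are semicolon-separated as in the code above. merge_with_stock_id is a hypothetical helper name. If all.csv is written into the same folder it will be picked up as an input on the next run, so the sketch leaves the output path to the caller.

```python
from pathlib import Path

import pandas as pd


def merge_with_stock_id(folder, sep=";"):
    """Stack every .csv in folder, tagging rows with the file's stem."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(path, sep=sep)
        # path.stem is the file name without the extension, e.g. "Apl"
        df.insert(0, "stock_id", path.stem)
        frames.append(df)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, axis=0, ignore_index=True)
```

The result can then be saved with something like merge_with_stock_id('prices').to_csv('all.csv', index=False), where prices is a placeholder folder name.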
I'm relatively new to Python. Here is what I'd like to do: I have a folder with multiple CSV files (2018.csv, 2017.csv, 2016.csv, etc.), 500 CSV files to be precise. Each CSV contains the headers "date", "Code", "Cur", "Price", etc. I'd like to concatenate all 500 CSV files into one dataframe. Here is my code for one CSV file, but it's very slow. I want to do it for all 500 files and concatenate them into one dataframe:
import pandas as pd

DB_2017 = pd.read_csv("C:/folder/2018.dat", sep=",", header=None).iloc[:, [0, 4, 5, 6]]
DB_2017.columns = ["date", "Code", "Cur", "Price"]
DB_2017['Code'] = DB_2017['Code'].map(lambda x: x.lstrip('#').rstrip('#'))
DB_2017['Cur'] = DB_2017['Cur'].map(lambda x: x.lstrip('#').rstrip('#'))
DB_2017['date'] = DB_2017['date'].apply(lambda x: pd.Timestamp(str(x)[:10]))
DB_2017['Price'] = pd.to_numeric(DB_2017.Price.replace(',', ';'))
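A sketch of how the per-file steps could be wrapped in a function and applied to the whole folder. Several things here are assumptions rather than facts from the post: that the wanted columns sit at positions 0, 4, 5 and 6, that the first ten characters of the date field form a date pandas can parse, that Price is already plainly numeric, and the names load_one and load_all. Reading only the needed columns with usecols and using vectorized string methods instead of per-row lambdas is usually faster, though how much depends on the data.

```python
from pathlib import Path

import pandas as pd

COLUMNS = ["date", "Code", "Cur", "Price"]


def load_one(path):
    """Read one headerless file, keeping columns 0, 4, 5 and 6 (assumed positions)."""
    df = pd.read_csv(path, sep=",", header=None, usecols=[0, 4, 5, 6], dtype=str)
    df.columns = COLUMNS
    # Vectorized equivalents of the lstrip/rstrip lambdas.
    df["Code"] = df["Code"].str.strip("#")
    df["Cur"] = df["Cur"].str.strip("#")
    # Keep the first 10 characters (assumed to be the date part) and parse them.
    df["date"] = pd.to_datetime(df["date"].str[:10])
    df["Price"] = pd.to_numeric(df["Price"])
    return df


def load_all(folder, pattern="*.csv"):
    """Apply load_one to every matching file and concatenate once at the end."""
    frames = [load_one(p) for p in sorted(Path(folder).glob(pattern))]
    return pd.concat(frames, ignore_index=True)
```

If the files use the .dat extension shown in the snippet, pattern="*.dat" can be passed instead.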