combining multiple files into a single file with DataFrame - python

I have been able to generate several CSV files through an API. Now I am trying to combine all of the CSVs into a single master file so that I can work on it, but it does not work. The code below is what I have attempted. What am I doing wrong?
import glob
import pandas as pd

# "files" was never defined in the original snippet - build it with glob
files = glob.glob("*.csv")

master_df = pd.DataFrame()
for file in files:
    df = pd.read_csv(file)
    master_df = pd.concat([master_df, df])
    del df
master_df.to_csv("./master_df.csv", index=False)

Although it is hard to tell what the precise problem is without more information (i.e., the error message and your pandas version), I believe it is that in the first iteration master_df and df do not have the same columns: master_df is an empty DataFrame, whereas df has whatever columns are in your CSV. If this is indeed the problem, then I'd suggest storing all your DataFrames (each of which represents one CSV file) in a single list, and then concatenating all of them at once. Like so:
import pandas as pd
df_list = [pd.read_csv(file) for file in files]
pd.concat(df_list, sort=False).to_csv("./master_df.csv", index=False)
I don't have time to find/generate a set of CSV files and test this right now, but I am fairly sure this should do the job (assuming pandas version 0.23 or compatible).
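For completeness, a self-contained sketch of this approach (the two sample files are generated inline purely for illustration; with real data, the file list would come from glob over your API output):

```python
import glob

import pandas as pd

# Two tiny sample files so the sketch is runnable end to end; in practice
# the CSVs would already exist (e.g. generated by the API).
pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv("part1.csv", index=False)
pd.DataFrame({"a": [5], "b": [6]}).to_csv("part2.csv", index=False)

# Build the file list, read each file once, and concatenate a single time.
files = sorted(glob.glob("part*.csv"))
df_list = [pd.read_csv(f) for f in files]
master_df = pd.concat(df_list, ignore_index=True, sort=False)
master_df.to_csv("master_df.csv", index=False)
```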

Related

how to stack specific sheets from multiple excel workbooks using python

I am trying to stack multiple workbooks by sheet in python. Currently my folder contains over 50 individual workbooks which are separated by date and usually contain up to 3 sheets, although not always.
The sheets include "Pay", "Tablet", "SIMO".
I want to try to stack all data from the "Tablet" sheet into a new workbook and have been using the following code.
import os
import pandas as pd

path = r"path_location"
files = os.listdir(path)
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file, sheet_name='Tablet'), ignore_index=True)
df.head()
df.to_csv('Tablet_total.csv')
However, after checking the files I realised this is not pulling from all workbooks that have the sheet "Tablet". I suspect this may be because not all workbooks have this sheet, but in case I missed anything I'd greatly appreciate some ideas as to what I may be doing wrong.
Also, as a final request, the sheet "Tablet" across all workbooks has unnecessary columns in the beginning.
I have tried incorporating df.drop(index=df.index[:7], axis=0, inplace=True) into my loop yet this only removes the first 7 rows from the first iteration. Again any support with this would be greatly appreciated.
Many thanks
First, I would check that you do not have any .xls files or other excel file suffixes with:
import os

path = r"path_location"
files = os.listdir(path)
print({
    file.split('.')[1]
    for file in files
})
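A slightly more robust variant, since a filename such as my.data.xlsx contains more than one dot and file.split('.')[1] would then report the wrong piece, is os.path.splitext (the filenames below are made up for illustration):

```python
import os

# Sample filenames; with real data this would be os.listdir(path).
files = ["report.xlsx", "old.xls", "my.data.xlsx", "notes.txt"]

# splitext always returns the final extension, leading dot included.
print({os.path.splitext(file)[1] for file in files})
```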
Then, I would check that you don't have any sheets with trailing white spaces or capitalization issues with:
import os
import pandas

path = r"path_location"
files = os.listdir(path)
print({
    sheet_name
    for file in files
    if file.endswith('.xlsx')
    for sheet_name in pandas.ExcelFile(os.path.join(path, file)).sheet_names
})
I would use pandas.concat() with a list comprehension to concatenate the sheets. I would also add a check to ensure that the workbook has a sheet named 'Tablet'. Finally, if you want to skip the first seven columns, you should a) do it on each DataFrame as it's read in, before it is concatenated with the others, and b) first include all the rows and then specify the columns with .iloc[:, 7:]:
import os
import pandas

path = r"path_location"
files = os.listdir(path)
df = pandas.concat([
    pandas.read_excel(os.path.join(path, file), sheet_name='Tablet').iloc[:, 7:]
    for file in files
    if file.endswith('.xlsx') and 'Tablet' in pandas.ExcelFile(os.path.join(path, file)).sheet_names
])
df.head()
Check whether you have Excel files with other extensions, such as .xlsm or .xlsb.
In order to remove the seven rows at each iteration, you need to read each workbook into a temporary DataFrame and drop the rows from it before combining:
df_tmp = pd.read_excel(file, sheet_name='Tablet')
df_tmp.drop(index=df_tmp.index[:7], axis=0, inplace=True)
Since append is deprecated, use concat() instead.
pandas.DataFrame.append
pd.concat([df, df_tmp], ignore_index=True)
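Putting the suggestions above together, a hedged end-to-end sketch (the helper name stack_tablet_sheets is mine, not from the original code; note os.path.join, since os.listdir returns bare filenames):

```python
import os

import pandas as pd


def stack_tablet_sheets(path, skip=7):
    """Concatenate the 'Tablet' sheet of every .xlsx workbook under *path*,
    dropping the first *skip* rows of each sheet (switch to .iloc[:, skip:]
    instead if it is leading columns you need to discard)."""
    frames = []
    for file in os.listdir(path):
        if not file.endswith('.xlsx'):
            continue
        full = os.path.join(path, file)  # os.listdir returns bare names
        if 'Tablet' not in pd.ExcelFile(full).sheet_names:
            continue  # skip workbooks without the sheet
        df_tmp = pd.read_excel(full, sheet_name='Tablet')
        frames.append(df_tmp.iloc[skip:])
    return pd.concat(frames, ignore_index=True)


# df = stack_tablet_sheets(r"path_location")
# df.to_csv('Tablet_total.csv', index=False)
```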

Pandas - import CSV files in folder, change column name if it contains a string, concat into one dataframe

I have a folder with about 40 CSV files containing data by month. I want to combine them all together; however, I have one column in these CSV files that is denoted as either 'implementationstatus' or 'implementation'. When I try to concat using pandas, obviously this is a problem. I want to change 'implementationstatus' to 'implementation' in each CSV file as it is imported. I could run a loop over each CSV file, change the column name, export it, and then run my code again with everything changed, but that just seems prone to error or unexpected behavior.
Instead, I just want to import all the CSVs, change the column name 'implementationstatus' to 'implementation' IF APPLICABLE, and then concatenate into one DataFrame. My code is below.
import pandas as pd
import os
import glob

path = 'c:/mydata'
filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths), join='inner', ignore_index=True)
df.columns = df.columns.str.replace('implementationstatus', 'implementation')  # I know this doesn't work, but I am trying to demonstrate what I want to do
If you want to change the column name, please try this:
import glob

import pandas as pd

filenames = glob.glob('c:/mydata/*.csv')
all_data = []
for file in filenames:
    df = pd.read_csv(file)
    if 'implementationstatus' in df.columns:
        df = df.rename(columns={'implementationstatus': 'implementation'})
    all_data.append(df)
df_all = pd.concat(all_data, axis=0)
You can use a combination of the header and names parameters of the pd.read_csv function to solve it.
You must pass to names a list containing the names of all columns in the CSV files. This lets you standardize all the names at read time.
From pandas docs:
names: array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
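A small sketch of that approach; the file contents and the canonical column list here are invented for illustration and must match the real number of columns in your files:

```python
import pandas as pd

# A tiny sample file whose header uses the unwanted variant name.
with open('monthly.csv', 'w') as f:
    f.write('country,implementationstatus,date\nUS,done,2020-01\n')

# header=0 discards the file's own header row; names= applies the
# canonical names instead, so every file ends up with the same schema.
canonical = ['country', 'implementation', 'date']
df = pd.read_csv('monthly.csv', header=0, names=canonical)
print(df.columns.tolist())  # ['country', 'implementation', 'date']
```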

How to write pandas dataframe into xlsb file in python pandas

I need to write a pandas DataFrame (df_new, in my case) into an .xlsb file which has some formulas. I am stuck on the code below and do not know what to do next:
with open_workbook('NiSource SLA_xlsb.xlsb') as wb:
    with wb.get_sheet("SL Dump") as sheet:
Can anyone suggest how to write a DataFrame into an xlsb file?
You could try reading the xlsb file as a DataFrame and then concatenating the two.
import pandas as pd

# Reading .xlsb requires the pyxlsb engine (pip install pyxlsb)
originaldf = pd.read_excel('./NiSource SLA_xlsb.xlsb', engine='pyxlsb')
twodflist = [originaldf, df_new]
existingdf = pd.concat(twodflist)
existingdf = existingdf.reset_index(drop=True)
# Note: pandas' Excel writers may not support .xlsb output; .xlsx is a safer target
existingdf.to_excel(r'PATH\filename.xlsb')
Change the path to wherever you want the output to go to and change filename to what you want the output to be named. Let me know if this works.

Python Script to copy CSV files and save as XLSX files

I have a series of CSV files in a specific folder on my computer. I need to write Python code to pick up those CSV files and extract them into another designated folder on my drive as XLSX. In each file, columns L, M, N are formatted as Date. Columns AA and AF are formatted as Number. Other columns can be stored as text or General.
Here is some code I got stuck at:
from openpyxl import Workbook
import csv

wb = Workbook()
ws = wb.active
with open('test.csv', 'r') as f:
    for row in csv.reader(f):
        ws.append(row)
wb.save('name.xlsx')
Using pandas this task should be quite simple.
import pandas as pd
df = pd.read_csv('test.csv')
df.to_excel('test.xlsx')
You can do that for any amount of files by changing the strings to the appropriate filenames.
Edit
I am not sure if you can save with the desired column types directly. You may be able to change that using another package, or even pandas itself. In pandas you can call pd.to_datetime or pd.to_numeric on a Series to change its type. You can also specify dtype when importing. Hope that helps!
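To illustrate the conversions mentioned above (the column names L and AA mirror the question's spreadsheet columns, but the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'L': ['2021-01-05', '2021-02-10'],   # date-like text column
    'AA': ['100', '250'],                # number-like text column
})

# Convert text columns to proper dtypes before writing the workbook.
df['L'] = pd.to_datetime(df['L'])
df['AA'] = pd.to_numeric(df['AA'])

print(df.dtypes)
# df.to_excel('name.xlsx')  # the converted dtypes carry through to Excel
```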
The solution should be something like this:
import os

import pandas as pd

dpath = 'path//to//folder'
frames = []
for filename in os.listdir(dpath):
    df = pd.read_csv(os.path.join(dpath, filename))
    df = df[['a', 'b']]  # select the required columns based on your requirement
    df["a"] = pd.to_numeric(df["a"])  # convert the column's dtype based on your need
    frames.append(df)
df1 = pd.concat(frames)
df1.to_excel('test.xlsx')

way to generate a specified number dataframe of new csv file from existing csv file in python using pandas

I have a large DataFrame in a CSV file (sample1). From it I have to generate a new CSV file containing only the first 100 rows. I wrote code for it, but I am getting a KeyError: the label [100] is not in the index.
I have tried the below; any help would be appreciated.
import pandas as pd

data_frame = pd.read_csv("C:/users/raju/sample1.csv")
data_frame1 = data_frame[:100]
data_frame.to_csv("C:/users/raju/sample.csv")
The correct syntax is with iloc:
data_frame.iloc[:100]
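The distinction matters because .iloc is purely positional, so it works regardless of what the index labels are; a quick sketch with made-up data:

```python
import pandas as pd

# Positional slicing works no matter what the index labels are, and - like
# Python slicing - it does not raise when the frame is shorter than the slice.
df = pd.DataFrame({'v': range(5)}, index=list('abcde'))
print(df.iloc[:3])         # first three rows
print(len(df.iloc[:100]))  # 5 - no KeyError even though there is no label 100
```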
A more efficient way is to use the nrows argument, whose purpose is exactly to read only a portion of a file. This way you avoid wasting resources and time parsing rows you don't need:
import pandas as pd

data_frame = pd.read_csv("C:/users/raju/sample1.csv", nrows=100)  # nrows counts data rows; the header is not included
data_frame.to_csv("C:/users/raju/sample.csv")
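A self-contained sketch of the nrows approach, with a small generated file standing in for sample1.csv:

```python
import pandas as pd

# A small stand-in for sample1.csv.
pd.DataFrame({'x': range(1000)}).to_csv('sample1.csv', index=False)

# nrows counts data rows only - the header row is parsed separately -
# so nrows=100 yields exactly the first 100 rows of data.
data_frame = pd.read_csv('sample1.csv', nrows=100)
data_frame.to_csv('sample.csv', index=False)
print(len(data_frame))  # 100
```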
