I have following N number of invoice data in Excel and I want to create CSV of that file so that it can be imported whenever needed...so how can I archive this?
Here is a screenshot:
Assuming you have a Folder "excel" full of Excel Files within your Project-Directory and you also have another folder "csv" where you intend to put your generated CSV Files, you could pretty much easily batch-convert all the Excel Files in the "excel" Directory into "csv" using Pandas.
It will be assumed that you already have Pandas installed on your System. Otherwise, you could do that via: pip install pandas. The fairly commented Snippet below illustrates the Process:
# IMPORT DATAFRAME FROM PANDAS AS WELL AS PANDAS ITSELF
from pandas import DataFrame
import pandas as pd
import os
# OUR GOAL IS:::
# LOOP THROUGH THE FOLDER: excelDir.....
# AT EACH ITERATION IN THE LOOP, CHECK IF THE CURRENT FILE IS AN EXCEL FILE,
# IF IT IS, SIMPLY CONVERT IT TO CSV AND SAVE IT:
for fileName in os.listdir(excelDir):
#DO WE HAVE AN EXCEL FILE?
if fileName.endswith(".xls") or fileName.endswith(".xlsx"):
#IF WE DO; THEN WE DO THE CONVERSION USING PANDAS...
targetXLFile = os.path.join(excelDir, fileName)
targetCSVFile = os.path.join(csvDir, fileName) + ".csv"
# NOW, WE READ "IN" THE EXCEL FILE
dFrame = pd.read_excel(targetXLFile)
# ONCE WE DONE READING, WE CAN SIMPLY SAVE THE DATA TO CSV
pd.DataFrame.to_csv(dFrame, path_or_buf=targetCSVFile)
Hope this does the Trick for you.....
Cheers and Good-Luck.
Instead of putting total output into one csv, you could go with following steps.
Convert your excel content to csv files or csv-objects.
Each object will be tagged with invoice id and save into dictionary.
your dictionary data structure could be like {'invoice-id':
csv-object, 'invoice-id2': csv-object2, ...}
write custom function which can reads your csv-object, and gives you
name,product-id, qty, etc...
Hope this helps.
Related
#Purpose: to read CSV files from every every csv files in the directory. Filter the rows with the column that say 'fail" from the csv file. Then copy and paste those rows onto a new CSV file.
# import necessary libraries
from sqlite3 import Row
import pandas as pd
import os
import glob
import csv
# Writing to a CSV
# using Python
import csv
# the path to your csv file directory.
mycsvdir = 'C:\Users\'' #this is where all the csv files will be housed.
# use glob to get all the csv files
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
# loop through the files and read them in with pandas
dataframes = [] # a list to hold all the individual pandas DataFrames
for csvfile in csv_files:
# read the csv file
df = pd.read_csv(csvfile)
dataframes.append(df)
#print(row['roi_id'], row['result']) #roi_id is the column label for the first cell on the csv, result is the Jth column label
dataframes = dataframes[dataframes['result'].str.contains('fail')]
# print out to a new csv file
dataframes.to_csv('ROI_Fail.csv') #rewrite this to mirror the variable you want to save the failed rows in.
I tried running this script but im getting a couple of errors. First off, i know my indentation is off(newbie over here), and im getting a big error under my for loop saying that "csv_files" is not defined. Any help would be greatly appreciated.
There are two issues here:
The first one is kind of easy - The variable in the for loop should be csvfiles, not csv_files.
The second one (Which will show up when you fix the one above) is that you are treating a list of dataframes as a dataframe.
The object "dataframes" in your script is a list to which you are appending the dataframes created from the CSV files. As such, you cannot index them by the column name as you are trying to do.
If your dataframes have the same layout I'd recommend using df.concat to join all dataframes into a single one, and then filtering the rows as you did here.
full_dataframe = pd.concat(dataframes, axis=0)
full_dataframe = full_dataframe[full_dataframe['result'].str.contains('fail')]
As a tip for further posts I'd recommend you also post the full traceback from your program. It helps us understand exactly what error you had when executing your code.
The following code coverts .dat files into data frames with the use of its dictionary file in .dct format. It works well. But my problem is that I was unable to automate this process, creating a loop that takes the pairs of these files from lists is a little bit tricky, atleast for me. I could really use some help with that.
try:
from statadict import parse_stata_dict
except ImportError:
!pip install statadict
import pandas as pd
from statadict import parse_stata_dict
dict_file = '2015_2017_FemPregSetup.dct'
data_file = '2015_2017_FemPregData.dat'
stata_dict = parse_stata_dict(dict_file)
stata_dict
nsfg = pd.read_fwf(data_file,
names=stata_dict.names,
colspecs=stata_dict.colspecs)
# nsfg is now a pandas DataFrame
These are the lists of files that I would like to convert into data frames. Every .dat file has its own dictionary file:
dat_name = ['2002FemResp.dat',
'2002Male.dat'...
dct_name = ['2002FemResp.dct',
'2002Male.dct'...
Assuming both lists have the same length and you will want to save the csv dataframe you could try:
c=0
for dat,dct in zip(dat_name, dct_name):
c+=1
stata_dict = parse_stata_dict(dct)
pd.read_fwf(dat, names=stata_dict.names, colspecs=stata_dict.colspecs).to_csv(r'path_name\file_name_{}.csv'.format(c))
# don't forget the '.csv'!
Also consider that if you are not using windows you need to use '/' rather than '\' in your path (or you can use os.path.join() to avoid this issue.
I have a large set of data that I am trying to extract from multiple excel files that have multiple sheets using python and then write that data into a new excel file. I am new with python and have tried to use various tutorials to come up with code that can help me automate the process. However, I have reached a point where I am stuck and need some guidance on how to write the data that I extract to a new excel file. If someone could point me in the write direction, it would be greatly appreciated. See code below:
import os
from pandas.core.frame import DataFrame
path = r"Path where all excel files are located"
os.chdir(path)
for WorkingFile in os.listdir(path):
if os.path.isfile(WorkingFile):
DataFrame = pd.read_excel(WorkingFile, sheet_name = None, header = 12, skipfooter = 54)
DataFrame.to_excel(r'Empty excel file where to write all the extracted data')
When I execute the code I get an error "AttributeError: 'dict' object has no attribute 'to_excel'. So I am not sure how to rectify this error, any help would be appreciated.
Little bit more background on what I am trying to do. I have a folder with about 50 excel files, each file might have multiple sheets. The data I need is located on a table that consists of one row and 14 columns and is in the same location on each file and each sheet. I need to pull that data and compile it into a single excel file. When I run the code above and and a print statement, it is showing me the exact data I want but when I try to write it to excel it doesn't work.
Thanks for help in advance!
Not sure why you're importing DataFrame instead of pandas. Looks like your code is incomplete. Below code will clear the doubts you have. (Not include any conditions for excluding non excel files/dir etc )
import pandas as pd
import os
path = "Dir path to excel files" #Path
df = pd.DataFrame() # Initialize empty df
for file in os.listdir(path):
data = pd.read_excel(path + file) # Read each file from dir
df = df.append(data, ignore_index=True) # and append to empty df
# process df
df.to_excel("path/file.xlsx")
I'm generating an csv file that is opened by excel and converted to xlsx manually.
The csv contains some path to .txt files.
Is it possible to build the file path in such way that when the csv is converted to xlsx , they became clickable hyperlinks ?
Thanks.
I would be interested to understand your workflow a bit better, but to try and help with your specific request:
The HYPERLINK solution proposed in the comments looks like a good one
If you are able to implement that upstream in the csv generation step then great
If not and/or you are interested in automating the conversion process, consider using the pandas library:
Create a DataFrame object from a csv using the pandas.read_csv method
Convert your paths to HYPERLINKs
Write back to xlsx using the pandas.DataFrame.to_excel method
E.g. if you have a file original.csv and the relevant column header is file_paths:
import pandas as pd
df = pd.read_csv('original.csv')
df['file_paths'] = '=HYPERLINK("' + df['file_paths'] + '")'
df.to_excel('new.xlsx', index=False)
Hope that helps!
Jon
I have a folder with a large number of Excel workbooks. Is there a way to convert every file in this folder into a CSV file using Python's xlrd, xlutiles, and xlsxWriter?
I would like the newly converted CSV files to have the extension '_convert.csv'.
OTHERWISE...
Is there a way to merge all the Excel workbooks in the folder to create one large file?
I've been searching for ways to do both, but nothing has worked...
Using pywin32, this will find all the .xlsx files in the indicated directory and open and resave them as .csv. It is relatively easy to figure out the right commands with pywin32...just record an Excel macro and perform the open/save manually, then look at the resulting macro.
import os
import glob
import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
for f in glob.glob('tmp/*.xlsx'):
fullname = os.path.abspath(f)
xl.Workbooks.Open(fullname)
xl.ActiveWorkbook.SaveAs(Filename=fullname.replace('.xlsx','.csv'),
FileFormat=win32com.client.constants.xlCSVMSDOS,
CreateBackup=False)
xl.ActiveWorkbook.Close(SaveChanges=False)
I will give a try with my library pyexcel:
from pyexcel import Book, BookWriter
import glob
import os
for f in glob.glob("your_directory/*.xlsx"):
fullname = os.path.abspath(f)
converted_filename = fullname.replace(".xlsx", "_converted.csv")
book = Book(f)
converted_csvs = BookWriter(converted_filename)
converted_csvs.write_book_reader(book)
converted_csvs.close()
If you have a xlsx that has more than 2 sheets, I imagine you will have more than 2 csv files generated. The naming convention is: "file_converted_%s.csv" % your_sheet_name. The script will save all converted csv files in the same directory where you had xlsx files.
In addition, if you want to merge all in one, it is super easy as well.
from pyexcel.cookbook import merge_all_to_a_book
import glob
merge_all_to_a_book(glob.glob("your_directory/*.xlsx"), "output.xlsx")
If you want to do more, please read the tutorial
Look at openoffice's python library. Although, I suspect openoffice would support MS document files.
Python has no native support for Excel file.
Sure. Iterate over your files using something like glob and feed them into one of the modules you mention. With xlrd, you'd use open_workbook to open each file by name. That will give you back a Book object. You'll then want to have nested loops that iterate over each Sheet object in the Book, each row in the Sheet, and each Cell in the Row. If your rows aren't too wide, you can append each Cell in a Row into a Python list and then feed that list to the writerow method of a csv.writer object.
Since it's a high-level question, this answer glosses over some specifics like how to call xlrd.open_workbook and how to create a csv.writer. Hopefully googling for examples on those specific points will get you where you need to go.
You can use this function to read the data from each file
import xlrd
def getXLData(Filename, min_row_len=1, get_datemode=False, sheetnum=0):
Data = []
book = xlrd.open_workbook(Filename)
sheet = book.sheets()[sheetnum]
rowcount = 0
while rowcount < sheet.nrows:
row = sheet.row_values(rowcount)
if len(row)>=min_row_len: Data.append(row)
rowcount+=1
if get_datemode: return Data, book.datemode
else: return Data
and this function to write the data after you combine the lists together
import csv
def writeCSVFile(filename, data, headers = []):
import csv
if headers:
temp = [headers]
temp.extend(data)
data = temp
f = open(filename,"wb")
writer = csv.writer(f)
writer.writerows(data)
f.close()
Keep in mind you may have to re-format the data, especially if there are dates or integers in the Excel files since they're stored as floating point numbers.
Edited to add code calling the above functions:
import glob
filelist = glob.glob("*.xls*")
alldata = []
headers = []
for filename in filelist:
data = getXLData(filename)
headers = data.pop(0) # omit this line if files do not have a header row
alldata.extend(data)
writeCSVFile("Output.csv", alldata, headers)