Expected a list of dataframes, got just one dataframe - python

I am trying to convert the sheets of an Excel file into CSVs. Starting with the following code, I want to read the file first, but I only get the first sheet and the rest are lost:
import pandas as pd

def accept_xcl_file(file):
    xcl_file = pd.ExcelFile(file)
    sheets = xcl_file.sheet_names
    file = xcl_file.parse(sheet_names=sheets)
    return file, sheets
file, sheet = accept_xcl_file('Companies.xlsx')
sheet gives the expected output:
['companies',
'fruits',
'vehicles',
'sales',
'P&L',
'price',
'clubs',
'countries',
'housing',
'life-expectancy']
file['fruits'] raises a KeyError when I try to index the file with it, but the 'companies' key returns the correct data. Going by the documentation, I should expect a DataFrame or a dict of DataFrames.
Any help?

The root problem is the keyword: ExcelFile.parse takes sheet_name (singular), not sheet_names, so your misspelled argument has no effect and only the default first sheet is parsed. You don't need ExcelFile at all here; read_excel with sheet_name=None imports every sheet at once.
Try this instead of your code:
import pandas as pd

file = pd.read_excel('Companies.xlsx', sheet_name=None)
# file is a dict object
# keys are the sheet names as strings
# values are the pd.DataFrame objects containing sheet data
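From that dict, writing one CSV per sheet (what the question is ultimately after) is a short loop. A minimal sketch, using a stand-in dict of DataFrames since 'Companies.xlsx' is not available here; in real use `sheets` would come from `pd.read_excel('Companies.xlsx', sheet_name=None)`:

```python
import pandas as pd

# Stand-in for the dict returned by pd.read_excel(..., sheet_name=None)
sheets = {
    "companies": pd.DataFrame({"name": ["Acme", "Globex"], "founded": [1990, 1989]}),
    "fruits": pd.DataFrame({"fruit": ["apple", "pear"], "price": [1.2, 0.8]}),
}

# Write each sheet out as its own CSV file, named after the sheet
for sheet_name, frame in sheets.items():
    frame.to_csv(f"{sheet_name}.csv", index=False)
```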

Related

Extract list to a csv file

I need to export a DataFrame that I extract from a website to a .csv file. I can generate the values, but I can't export to .csv because of the following error:
AttributeError: 'list' object has no attribute 'to_csv'
Code:
import pandas as pd
url = "https://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-ptBR.asp?Data=23/01/2023&Data1=20230123&slcTaxa=PRE"
df = pd.read_html(io=url, flavor='html5lib', encoding='latin1')
print(df)
df.to_csv(r'C:/Users/xport_dataframe.csv', index=False, header=True)
You have not made a dataframe; you just used pandas to create a list from HTML. Use df = pd.DataFrame(#list goes here) to create a dataframe, and then you can use df.to_csv(...
read_html returns a list of dataframes (as explained here).
You need to concatenate this list into a Pandas dataframe prior to exporting it to csv:
import pandas as pd
url = "https://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-ptBR.asp?Data=23/01/2023&Data1=20230123&slcTaxa=PRE"
list_of_dfs = pd.read_html(io=url, flavor='html5lib', encoding='latin1')
print(list_of_dfs)
pd.concat(list_of_dfs).to_csv(r'C:/Users/xport_dataframe.csv', index=False, header=True)
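The failure mode can be sketched without the website. The two small frames below are placeholders standing in for the tables read_html would return from the page; the point is that read_html gives back a *list* of DataFrames, and a list has no .to_csv method:

```python
import pandas as pd

# Placeholder frames standing in for read_html's return value
tables = [
    pd.DataFrame({"col_a": [1, 2], "col_b": [10.0, 20.0]}),
    pd.DataFrame({"col_a": [3], "col_b": [30.0]}),
]

assert not hasattr(tables, "to_csv")  # this is why the AttributeError occurs

combined = pd.concat(tables, ignore_index=True)  # one DataFrame
combined.to_csv("xport_dataframe.csv", index=False)  # now .to_csv works
```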

Import several sheets from the same excel into one dataframe in pandas

I have one Excel file with several identically structured sheets in it (same headers and number of columns) (sheet names: 01, 02, ..., 12).
How can I get this into one dataframe?
Right now I would load them all separately with:
df1 = pd.read_excel('path.xls', sheet_name='01')
df2 = pd.read_excel('path.xls', sheet_name='02')
...
and would then concatenate them.
What is the most pythonic way to do it and get one dataframe with all the sheets directly? Also assuming I do not know every sheet name in advance.
read the file as:
collection = pd.read_excel('path.xls', sheet_name=None)
combined = pd.concat([value.assign(sheet_source=key)
                      for key, value in collection.items()],
                     ignore_index=True)
sheet_name=None ensures all the sheets are read in.
collection is a dictionary, with the sheet names as keys and the actual data as the values. combined uses the pandas concat method to get you one dataframe. I added the extra column sheet_source in case you need to track where the data for each row comes from.
You can read more about it in the pandas documentation.
you can use:
df_final = pd.concat([pd.read_excel('path.xls', sheet_name="{:02d}".format(sheet)) for sheet in range(1, 13)], axis=0)
Note range(1, 13), since the sheets are named 01 through 12.
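The sheet_name=None pattern from the first answer can be exercised with a stand-in dict; in real use `collection` would come from `pd.read_excel('path.xls', sheet_name=None)`:

```python
import pandas as pd

# Stand-in for the dict of DataFrames that sheet_name=None returns
collection = {
    "01": pd.DataFrame({"a": [1, 2]}),
    "02": pd.DataFrame({"a": [3]}),
}

# One frame, with sheet_source recording which sheet each row came from
combined = pd.concat(
    [frame.assign(sheet_source=name) for name, frame in collection.items()],
    ignore_index=True,
)
```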

Write multiple dataframes into a single flat text file to input into Pandas

I have an excel file with multiple sheets that I convert into a dictionary of dataframes where the key represents the sheet's name:
xl = pd.ExcelFile(r"D:\Python Code\PerformanceTable.xlsx")
pacdict = {name: pd.read_excel(xl, name) for name in xl.sheet_names}
I would like to replace this input Excel file with a flat text file -- but would still like to end up with the same outcome of a dictionary of dataframes.
Any suggestions on how I might format the text file so it still contains data for multiple, named tables/sheets and can be read into the above format? Preferably still making pandas' built-in functionality do the heavy lifting.
Loop through each sheet. Create a new column called "sheet_source". Concatenate the sheet dataframes to a master dataframe. Lastly export to CSV file.
# create a master dataframe to store the sheets
df_master = pd.DataFrame()

# loop through each dict key
for each_df_key in pacdict.keys():
    # dataframe for each sheet
    sheet_df = pacdict[each_df_key]
    # add column for sheet name
    sheet_df['sheet_source'] = each_df_key
    # concatenate each sheet to the master
    df_master = pd.concat([df_master, sheet_df])

# after the for-loop, export the master dataframe to CSV
df_master.to_csv('new_dataframe.csv', index=False)
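The reverse direction the question asks about (flat CSV back to a dict of DataFrames) can be sketched by grouping on the sheet_source column added at export time. The df_master below simulates the concatenated data; the real one would come from the loop above:

```python
import pandas as pd

# Simulated master frame with the sheet_source tracking column
df_master = pd.DataFrame({
    "value": [1, 2, 3],
    "sheet_source": ["Sheet1", "Sheet1", "Sheet2"],
})
df_master.to_csv("new_dataframe.csv", index=False)

# Read the flat file back and split it into one DataFrame per original sheet
flat = pd.read_csv("new_dataframe.csv")
rebuilt = {
    name: group.drop(columns="sheet_source").reset_index(drop=True)
    for name, group in flat.groupby("sheet_source")
}
```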

python efficient way to append all worksheets in multiple excel into pandas dataframe

I have around 20+ xlsx files, and each xlsx file might contain a different number of worksheets. But thank god, all the columns are the same in all worksheets and all xlsx files. By referring to here, I got some idea. I have been trying a few ways to import and append all Excel files (all worksheets) into a single dataframe (around 4 million rows of records).
Note: I did check here as well, but it only covers the file level; mine goes from the file level down to the worksheet level.
I have tried the code below:
# import all necessary packages
import pandas as pd
import glob
import sys

# set source path
source_dataset_path = "C:/Users/aaa/Desktop/Sample_dataset/"
source_dataset_list = glob.iglob(source_dataset_path + "Sales transaction *")

out_df = pd.DataFrame()  # create empty output dataframe (once, before the loop)
for file in source_dataset_list:
    sys.stdout.write(str(file))
    sys.stdout.flush()
    xls = pd.ExcelFile(file)
    for sheet in xls.sheet_names:
        # view the excel file's sheet names
        sys.stdout.write(str(sheet))
        sys.stdout.flush()
        df = pd.read_excel(file, sheet_name=sheet)
        # this appends the rows of one dataframe to another
        out_df = out_df.append(df)
Question:
My approach is to first read each Excel file and get the list of sheets inside it, then load the sheets and append them all. The looping seems not very efficient, especially since each append gets slower as the data size grows.
Is there a more efficient way to import and append all sheets from multiple Excel files?
Use sheet_name=None in read_excel to return an OrderedDict of DataFrames created from all sheet names, then join them together with concat and finally append each result to the output DataFrame:
out_df = pd.DataFrame()
for f in source_dataset_list:
    df = pd.read_excel(f, sheet_name=None)
    cdf = pd.concat(df.values())
    out_df = out_df.append(cdf, ignore_index=True)
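For current pandas versions, where DataFrame.append has been removed (pandas 2.0), the same idea can be sketched by collecting every sheet's frame in a list and calling concat once at the end, which avoids repeated copying. The `reader` parameter is an illustration convenience so the function can be exercised without real Excel files; by default it reads all sheets of a workbook:

```python
import pandas as pd

def combine_workbooks(paths, reader=None):
    # reader maps a path to a dict of sheet-name -> DataFrame
    if reader is None:
        reader = lambda p: pd.read_excel(p, sheet_name=None)
    frames = []
    for path in paths:
        frames.extend(reader(path).values())  # gather all sheets of all files
    return pd.concat(frames, ignore_index=True)  # one concat at the end
```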
Another solution:
cdf = [pd.read_excel(excel_names, sheet_name=None).values()
       for excel_names in source_dataset_list]
out_df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
If I understand your problem correctly, setting sheet_name=None in pd.read_excel does the trick.
import os
import pandas as pd

path = "C:/Users/aaa/Desktop/Sample_dataset/"
dfs = [
    pd.concat(pd.read_excel(path + x, sheet_name=None))
    for x in os.listdir(path)
    if x.endswith(".xlsx") or x.endswith(".xls")
]
df = pd.concat(dfs)
I have a pretty straightforward solution if you want to read all the sheets:
import pandas as pd

df = pd.concat(pd.read_excel(path + file_name, sheet_name=None),
               ignore_index=True)

Not getting back the column names after reading into an xlsx file

Hello, I have xlsx files and merged them into one dataframe using pandas. It worked, but instead of getting back the column names I had in the xlsx files, I got numbers as columns, and the column titles became a row, like this:
Output: 1 2 3
COLTITLE1 COLTITLE2 COLTITLE3
When they should be like this:
Output: COLTITLE1 COLTITLE2 COLTITLE3
The column titles are no longer column titles; they have become a row. How can I get back the rightful column names that I had in the xlsx files? Just for clarity, the column names are the same in both xlsx files. Help would be appreciated; here's my code below:
# import modules
from IPython.display import display
import pandas as pd
import numpy as np
pd.set_option("display.max_rows", 999)
pd.set_option('max_colwidth',100)
%matplotlib inline
# filenames
file_names = ["data/OrderReport.xlsx", "data/OrderReport2.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in file_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# concatenate them
combined = pd.concat(frames)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
I hope I understood your question correctly. The culprit is header=None: it tells pandas the sheet has no header row, so the row of column titles is read in as ordinary data and the columns get default integer names. Drop it (the default, header=0, uses the first row as the column names):
frames = [x.parse(x.sheet_names[0], index_col=None) for x in excels]
Also note that the final to_excel call passes header=False, which strips the column names again on the way out; leave header at its default of True to keep them.
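For reference, the effect of header=None can be shown in a few lines with read_csv (ExcelFile.parse treats the argument the same way): the title row becomes ordinary data and the columns fall back to integer names.

```python
import io
import pandas as pd

raw = "COLTITLE1,COLTITLE2\n1,2\n"

# header=None: columns are integers, titles pushed into row 0
no_header = pd.read_csv(io.StringIO(raw), header=None)

# default (header=0): first row becomes the column names
default = pd.read_csv(io.StringIO(raw))
```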
