I have loaded an Excel file into Python (Google Colab), but I was wondering if there is a way of extracting the name of the Excel (.xlsm) file. Please check the attached image.
import pandas as pd
import io
from google.colab import files
uploaded = files.upload()
df = pd.read_excel(io.BytesIO(uploaded['202009 Testing - September - Diamond Plod Day & Night MKY021.xlsm']),sheet_name='1 D',header=8,usecols='BE,BH',nrows=4)
df1 = pd.read_excel(io.BytesIO(uploaded['202009 Testing - September - Diamond Plod Day & Night MKY021.xlsm']),sheet_name='1 D',header=3)
df=df.assign(PlodDate='D5')
df['PlodDate']=df1.iloc[0,3]
df=df.assign(PlodShift='D6')
df['PlodShift']=df1.iloc[1,3]
df =df.rename({'Qty.2':'Loads','Total (L)':'Litres'},axis=1)
df = df.reindex(columns=['PlodDate','PlodShift','Loads','Litres','DataSource'])
df=df.assign(DataSource='Name of the Source File')
df
Instead of DataSource='Name of the Source File', I want the name of the active Excel file.
Output should be:
Datasource='202009 Testing - September - Diamond Plod Day & Night MKY021'
As I have a file for every month, I just want code that takes the name of the active Excel file whenever I run it.
I tried this code, but it did not work in Google Colab.
import os
os.listdir('.')
I have not used Google Colab, but I once had a similar problem extracting the sheet names of an Excel file. The solution turned out to be very simple:
import pandas as pd
excel_file = pd.ExcelFile("excel_file_name.xlsx")
sheet_names = excel_file.sheet_names
So, basically the idea is that you want to open the whole Excel file instead of a specific sheet of it. This can be done with pd.ExcelFile(...). Once you have your Excel file "open", you can get the names via some_excel_file.sheet_names. This is especially useful when you want to loop over all the sheets in an Excel file. For example, the code can look like this:
excel_file = pd.ExcelFile("excel_file_name.xlsx")
sheet_names = excel_file.sheet_names
for sheet_name in sheet_names:
    df = excel_file.parse(sheet_name)  # do some operations here for this sheet
This is not a complete answer, as I am not sure about Google Colab, but I hope it gives you an idea of what you can do with the sheet names.
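Since the question ultimately wants the file name rather than the sheet names: in Colab, files.upload() returns a dict keyed by the uploaded file names, so the name can be recovered without hard-coding it. A minimal sketch, assuming a single uploaded file (the dict below stands in for the real upload):

```python
import os

# stand-in for the dict that files.upload() returns in Colab;
# its keys are the uploaded file names
uploaded = {'202009 Testing - September - Diamond Plod Day & Night MKY021.xlsm': b''}

file_name = list(uploaded.keys())[0]          # full name including extension
source_name = os.path.splitext(file_name)[0]  # drop the .xlsm extension
print(source_name)
```

df = df.assign(DataSource=source_name) would then fill the column with the current month's file name.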
Related
I have a large set of data that I am trying to extract from multiple Excel files, each with multiple sheets, using Python, and then write that data into a new Excel file. I am new to Python and have tried various tutorials to come up with code that can help me automate the process. However, I have reached a point where I am stuck and need some guidance on how to write the extracted data to a new Excel file. If someone could point me in the right direction, it would be greatly appreciated. See the code below:
import os
from pandas.core.frame import DataFrame
path = r"Path where all excel files are located"
os.chdir(path)
for WorkingFile in os.listdir(path):
    if os.path.isfile(WorkingFile):
        DataFrame = pd.read_excel(WorkingFile, sheet_name=None, header=12, skipfooter=54)
        DataFrame.to_excel(r'Empty excel file where to write all the extracted data')
When I execute the code I get the error "AttributeError: 'dict' object has no attribute 'to_excel'". I am not sure how to rectify this error; any help would be appreciated.
A little more background on what I am trying to do. I have a folder with about 50 Excel files, and each file might have multiple sheets. The data I need is located in a table that consists of one row and 14 columns and is in the same location in each file and each sheet. I need to pull that data and compile it into a single Excel file. When I run the code above and add a print statement, it shows exactly the data I want, but when I try to write it to Excel it doesn't work.
Thanks for help in advance!
Not sure why you're importing DataFrame instead of pandas; your code looks incomplete. The AttributeError itself comes from sheet_name=None: with that argument, read_excel returns a dict of DataFrames (one per sheet), and a dict has no to_excel method. The code below should clear up your doubts. (It does not include any conditions for excluding non-Excel files, directories, etc.)
import pandas as pd
import os
path = "Dir path to excel files" #Path
df = pd.DataFrame() # Initialize empty df
for file in os.listdir(path):
    data = pd.read_excel(os.path.join(path, file))  # read each file from the directory
    df = pd.concat([df, data], ignore_index=True)   # and append it to the combined df
# process df
df.to_excel("path/file.xlsx")
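If each workbook has several sheets, sheet_name=None returns a dict of {sheet name: DataFrame} that you can walk and concatenate. A sketch with stand-in data in place of a real read_excel call:

```python
import pandas as pd

# stand-in for pd.read_excel(file, sheet_name=None), which returns
# a dict mapping each sheet name to its DataFrame
sheets = {
    'Sheet1': pd.DataFrame({'a': [1], 'b': [2]}),
    'Sheet2': pd.DataFrame({'a': [3], 'b': [4]}),
}

frames = []
for name, frame in sheets.items():
    frames.append(frame.assign(Sheet=name))  # record which sheet each row came from

combined = pd.concat(frames, ignore_index=True)
print(combined)
```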
I have a folder on my PC which I clean every Monday, downloading the new attachments (3 files in total: previous month, current month, and next month) from Outlook (I have code for that). I have one master Excel file with 12 sheets: January, February, March ... December. The files I download from Outlook have the same names as the sheet names in the master file. What I would like to do is this: I want to take the data from each Outlook file and paste it into the corresponding sheet. So if I have a file called December.xlsb, I want to take all the data from its first sheet and paste it into the master file on the sheet named December.
The master file and the Outlook attachments are in different directories. Preferably I would like to do it with pandas, but I am open to other solutions too.
I'm not really sure how to do this or where to start. I will surely need a for loop and os.listdir, I guess. Any help is appreciated.
As you say, pandas is a good choice for this, as it is good at manipulating Excel files.
Start by reading the new Excel file from wherever it is saved.
Then add the DataFrame as a sheet in the master Excel workbook:
import pandas as pd

month = input("Enter month: ")
new_file = rf"C:\New Files\{month}.xlsb"  # raw f-string, so the backslashes are not treated as escapes
master_file = r"C:\Master Files\master.xlsx"  # openpyxl cannot write .xlsb, so keep the master as .xlsx
df = pd.read_excel(new_file, engine='pyxlsb')  # reading .xlsb requires the pyxlsb package
# if_sheet_exists='replace' overwrites the month sheet that already exists in the master
with pd.ExcelWriter(master_file, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    df.to_excel(writer, sheet_name=month, index=False)
Firstly, I ask the admins not to close the topic. The last time I opened one, it was closed because there were similar topics, but those are not the same. Thanks in advance.
Every day I receive 15-20 Excel files with a huge number of worksheets (more than 200). Fortunately, the worksheet names and counts are the same for all the files. I want to merge all the Excel files into one file with multiple worksheets. I am new to Python; I have watched and read a lot about the options, but could not find a way. Thanks for your support.
Example of what I tried: I have two files, each with two sheets (the actual sheet count is huge, as mentioned above), and I want to merge both files into one file, sum.xlsx, with the same two sheets.
Data1.xlsx
Data2.xlsx
sum.xlsx
import os
import openpyxl
import pandas as pd

files = os.listdir(r'C:\Python\db')  # open the folder where the files are located
os.chdir(r'C:\Python\db')  # change the working directory
df = pd.DataFrame()  # create an empty data frame
wb = openpyxl.load_workbook(r'C:\Python\db\Data1.xlsx')  # load one file to extract the list of sheet names
sh_name = wb.sheetnames  # extract the sheet names into a list
for i in sh_name:
    for f in files:
        data = pd.read_excel(f, sheet_name=i)
        df = df.append(data)
    df.to_excel('sum.xlsx', index=False, sheet_name=i)
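Two pitfalls in the attempt above: df is never reset between sheets, so rows accumulate across sheets, and each df.to_excel('sum.xlsx', ...) call rewrites the whole file from scratch, leaving only the last sheet. Writing every sheet through a single pd.ExcelWriter fixes the latter, and pd.concat replaces the removed DataFrame.append. A minimal sketch with stand-in data (file and sheet names are placeholders):

```python
import pandas as pd

# stand-in for pd.read_excel(f, sheet_name=i); in practice this would read
# sheet `sheet_name` from workbook `file_name`
def read_sheet(file_name, sheet_name):
    return pd.DataFrame({'file': [file_name], 'sheet': [sheet_name]})

files = ['Data1.xlsx', 'Data2.xlsx']
sheet_names = ['Sheet1', 'Sheet2']

# one ExcelWriter keeps every sheet in the same workbook
with pd.ExcelWriter('sum.xlsx') as writer:
    for i in sheet_names:
        parts = [read_sheet(f, i) for f in files]  # the same sheet from every file
        pd.concat(parts, ignore_index=True).to_excel(writer, sheet_name=i, index=False)
```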
I used Jupyter Notebook, and there using the 'sep' parameter was pretty simple. But now I'm slowly migrating to Google Colab, and while I can find the file and build the DataFrame with pd.read_csv(), I can't seem to separate the columns with the 'sep' parameter!
I mounted the Drive and located the file:
import pandas as pd
from google.colab import drive
drive.mount('/content/gdrive')
with open('/content/gdrive/My Drive/wordpress/cousins.csv', 'r') as f:
    f.read()
Then I built the DataFrame:
df = pd.read_csv('/content/gdrive/My Drive/wordpress/cousins.csv',sep=";")
The DataFrame is built, but it is not separated into columns! Below is a screenshot:
Built DataFrame
Last edit: it turns out the problem was with the data I was trying to use, because it also didn't work in Jupyter. There is no problem with 'sep' the way it was being used!
PS: I also tried sep='.' and sep=',' to see if they would work, and nothing did.
I downloaded the data as a CSV table from Football-Reference, pasted it into Excel, and saved it as CSV (UTF-8). An example of the file can be found here:
Pastebin Example File
This works for me:
My data:
a,b,c
5,6,7
8,9,10
You don't need sep for a comma-separated file.
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
# suppose I have data in my Google Drive in the file path
# GoogleColaboratory/data/so/a.csv
# The folder GoogleColaboratory is in my Google Drive.
df = pd.read_csv('drive/My Drive/GoogleColaboratory/data/so/a.csv')
df.head()
Instead of
df = pd.read_csv('/content/gdrive/My Drive/wordpress/cousins.csv', sep=";")
Use
df = pd.read_csv('/content/gdrive/My Drive/wordpress/cousins.csv', delimiter=";")
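A quick self-contained check (using an in-memory string instead of the Drive path) confirms that a semicolon delimiter splits the columns as expected:

```python
import io
import pandas as pd

data = "a;b;c\n1;2;3\n"
df = pd.read_csv(io.StringIO(data), delimiter=';')  # equivalent to sep=';'
print(df.columns.tolist())
```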
I am trying to read an Excel sheet into a df using the pandas read_excel method. The Excel file contains 6-7 different sheets, and 2-3 of them are very large. I only want to read one sheet out of the file.
If I copy the sheet out into its own file, the read time drops by 90%.
I have read that xlrd, which is used by pandas, always loads the whole workbook into memory. I cannot change the format of the input.
Can you please suggest a way to improve the performance?
It's quite simple. Just do this.
import pandas as pd
xls = pd.ExcelFile('C:/users/path_to_your_excel_file/Analysis.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)
# etc.
df2 = pd.read_excel(xls, 'Sheet2')
print(df2)
import pandas as pd
df = pd.read_excel('YourFile.xlsx', sheet_name = 'YourSheet_Name')
Whatever sheet you want to read, just pass the sheet name and the path to your Excel file.
Use openpyxl in read-only mode. See http://openpyxl.readthedocs.io/en/default/pandas.html
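A minimal self-contained sketch of the read-only approach (the demo file is created on the spot; in practice you would point load_workbook at your own large file):

```python
import openpyxl
import pandas as pd

# build a tiny workbook so the sketch is self-contained
wb_out = openpyxl.Workbook()
ws_out = wb_out.active
ws_out.title = 'Data'
ws_out.append(['a', 'b'])
ws_out.append([1, 2])
wb_out.save('demo.xlsx')

# read_only=True streams rows instead of loading the whole workbook into memory
wb = openpyxl.load_workbook('demo.xlsx', read_only=True)
ws = wb['Data']
rows = ws.values            # generator of row tuples
header = next(rows)         # the first row holds the column names
df = pd.DataFrame(rows, columns=header)
wb.close()                  # read-only workbooks should be closed explicitly
print(df)
```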