import pandas as pd
import re
file_name = "example.xlsx"  # name of the Excel file
sheet = "sheet"  # name of the sheet
df = pd.read_excel(file_name, sheet_name=sheet, usecols="A:F")
select_rows = df.iloc[516-2:]  # select rows from Excel row 516 onwards
My question is: if I want to refer to row 516 onwards (using Excel's row numbering), why do I have to subtract 2, as the code does? I know that the index in pandas starts from zero, which would mean subtracting 1, not 2.
@Samuel You are already 'minus one' because of the zero-based index in pandas. What isn't clear until you read the pandas documentation for pd.read_excel is that there is a parameter called header that defaults to 0, i.e. the first row (row 1 in Excel) is consumed and used as your header for column names, which accounts for the second offset. To demonstrate, modify the line where you create df by adding the argument header=None (code snippet below), then run your code and inspect the results.
df = pd.read_excel(file_name, sheet_name=sheet, usecols="A:F", header=None)
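For example, here is a minimal sketch (assuming the same example.xlsx layout as above) showing the two offsets side by side:
import pandas as pd
file_name = "example.xlsx"
sheet = "sheet"
# With the default header=0, Excel row 1 becomes the column names,
# so Excel row 516 lands at positional index 516 - 2 = 514.
df_default = pd.read_excel(file_name, sheet_name=sheet, usecols="A:F")
row_516_default = df_default.iloc[516 - 2]
# With header=None, no row is consumed as the header,
# so Excel row 516 lands at positional index 516 - 1 = 515.
df_no_header = pd.read_excel(file_name, sheet_name=sheet, usecols="A:F", header=None)
row_516_no_header = df_no_header.iloc[516 - 1]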
Very new to Python and Pandas... but the issue is that my final output file isn't excluding any duplicates in the 'Customer Number' column. Any suggestions on why this is happening would be appreciated!
import pandas as pd
import numpy as np  # numpy provides np.nan, used below to replace missing values
from openpyxl import load_workbook
from openpyxl.styles import Font
df_1 = pd.read_excel('PRT Tracings 2020.xlsx', sheet_name='Export')  # read the 2020 Excel file, 'Export' sheet
df_2 = pd.read_excel('PRT Tracings 2021.xlsx', sheet_name='Export')  # read the 2021 Excel file, same sheet name
df_3 = pd.read_excel('PRT Tracings YTD 2022.xlsx', sheet_name='Export')  # read the 2022 YTD Excel file, which only has one sheet
df_all = pd.concat([df_1, df_2, df_3], sort=False)  # combine the three frames; sort=False keeps the columns in the same order
to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet
df_all = df_all.replace(np.nan, 'N/A', regex=True)  # replace missing values with 'N/A'
remove = ['ORDERNUMBER', 'ORDER_TYPE', 'ORDERDATE', 'Major Code Description', 'Product_Number_And_Desc', 'Qty', 'Order_$', 'Order_List_$']  # unwanted columns to drop
df_all.drop(columns=remove, inplace=True)
df_all.drop_duplicates(subset=['Customer Number'], keep=False)  # intended to remove all duplicate 'Customer Number' rows
to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet
wb = load_workbook('Combined_PRT_Tracings.xlsx') #we are using this to have openpyxl read the data, from the spreadsheet already created
ws = wb.active #this workbook is active
wb.save('Combined_PRT_Tracings.xlsx')
You should assign the return value of df_all.drop_duplicates to a variable, or pass inplace=True, so that the de-duplicated result is actually kept. By default, drop_duplicates returns a new DataFrame and leaves the original unchanged, which is meant to prevent undesired changes to the original data.
Try:
df_all = df_all.drop_duplicates(subset='Customer Number', keep=False)
Or the equivalent:
df_all.drop_duplicates(subset='Customer Number', keep=False, inplace=True)
That will remove all rows that have a duplicate 'Customer Number'. If you want to keep the first or last occurrence of each duplicate, change keep to 'first' or 'last'.
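For instance, here is a small illustration of how keep affects the result (using a made-up two-column frame, not your data):
import pandas as pd
toy = pd.DataFrame({
    'Customer Number': [101, 101, 202],
    'Value': ['a', 'b', 'c'],
})
toy.drop_duplicates(subset='Customer Number', keep=False)    # only customer 202 remains
toy.drop_duplicates(subset='Customer Number', keep='first')  # keeps rows 'a' and 'c'
toy.drop_duplicates(subset='Customer Number', keep='last')   # keeps rows 'b' and 'c'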
I'm using Pandas to edit an Excel file which other people are using. But when I save it using df.to_excel, Pandas adds an ugly-looking black border to the cells in the header and in the index. I want it to be written in a plain format, the way a CSV file would look if I opened it in Excel. It would be even better if it were written back using the same styles it was read in with.
Is there any way to make df.to_excel write without styling, or with the original styles?
Thanks.
Try this trick:
import io
(pd.read_csv(io.StringIO(df.to_csv()), header=None)
   .to_excel("output.xlsx", header=None, index=None))
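Another workaround that is often mentioned relies on an internal pandas attribute, so it is unofficial and may change between versions: resetting the default header style before writing.
import pandas.io.formats.excel
# Unofficial/internal attribute: clears the bold, bordered default header format
pandas.io.formats.excel.ExcelFormatter.header_style = None
df.to_excel("output.xlsx")  # df is your existing DataFrame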
If you still want the index and header values, but without styling, you can use this (requires openpyxl):
def insert_dataframe(df, sheet, start_row=1, start_col=1):
    """Insert a dataframe into an openpyxl sheet at the given (row, col) position.

    Parameters
    ----------
    df : pandas.DataFrame
        Any dataframe
    sheet : openpyxl.worksheet.worksheet.Worksheet
        The sheet where the dataframe should be inserted
    start_row : int
        The row where the dataframe should be inserted (default is 1)
    start_col : int
        The column where the dataframe should be inserted (default is 1)
    """
    # iterate over the index names and write each one as a header cell
    for name_idx, name in enumerate(df.index.names):
        label_col = start_col + name_idx
        sheet.cell(row=start_row, column=label_col, value=name)
        # for each name, write the index values as rows in that column
        value_row = start_row + 1
        for i_value in list(df.index.values):
            if isinstance(df.index, pd.MultiIndex):
                val = i_value[name_idx]
            else:
                val = i_value
            sheet.cell(row=value_row, column=label_col, value=val)
            # go to the next row
            value_row += 1

    row_idx = 0
    col_idx = label_col + 1
    # write the column labels and the values beneath them
    for label, content in df.items():
        sheet.cell(row=start_row, column=col_idx, value=label)
        for row_idx, value_ in enumerate(content):
            sheet.cell(row=start_row + row_idx + 1, column=col_idx, value=value_)
        col_idx += 1
Gist: https://gist.github.com/Aer0naut/094ff1b6838b2177a4222591ace8f6bf
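A short usage sketch (the DataFrame, file name, and sheet here are made up for illustration):
import pandas as pd
from openpyxl import Workbook
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
wb = Workbook()
ws = wb.active
insert_dataframe(df, ws)         # writes plain values only, no pandas header styling
wb.save("unstyled_output.xlsx")  # hypothetical output file name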
We have a data system that creates tables of data as Excel files. I'm trying to import one of these Excel files into a pandas DataFrame.
In the Excel file, row 1 is metadata I don't want, and row 2 is the column header. By default, pandas correctly uses column 1 (a lot number) as the index, but the second column, a production date, for whatever reason has no header in row 2.
So pandas seems to be creating a MultiIndex by default. Is there a way to suppress this behavior? It seems to be happening because there is no column header in row 2, column 2 (cell B2). If I manually edit the Excel file to add a label, it imports as I want.
import pandas as pd
xlsx01 = pd.ExcelFile("C:/Users/maherp/Desktop/JunkFiles/Book1.xlsx")
df_01 = pd.read_excel(xlsx01, header=1)
I get an error that I cannot decipher when I try:
df_01 = pd.read_excel(xlsx01, header=1, index_col=0)
As suggested by @Peej1226, here is the final solution which worked:
df_01 = pd.read_excel(xlsx01, sheet_name='Discrete', skiprows=1, header=0, index_col=0)
I have been searching for how to append/insert/concat a row from one Excel file to another that has merged cells, but I was not able to find what I am looking for.
What I need to get is this:
and append to the very first row of this:
I tried using pandas append() but it destroyed the arrangement of columns.
df = pd.DataFrame()
for f in ['merge1.xlsx', 'test1.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel('test3.xlsx')
Is there a way pandas could do this? I just need to literally insert the header into the top row.
Although I am still trying to find a way myself, it would actually be fine if this question turned out to be a duplicate, as long as I can find answers or advice.
You can use pd.read_excel to read in the workbook with the data you want, in your case 'test1.xlsx'. You can then use openpyxl.load_workbook() to open the existing workbook with the header, in your case 'merge1.xlsx'. Finally, you can save the new workbook under a new name ('test3.xlsx') without changing the two existing workbooks.
Below I've provided a fully reproducible example of how you can do this. To make this example fully reproducible, I create 'merge1.xlsx' and 'test1.xlsx'.
Please note that if your 'merge1.xlsx' only contains the header that you want and nothing else, you can use the two lines I've left commented out below. That would simply append your data from 'test1.xlsx' under the header in 'merge1.xlsx', and you could then drop the two for loops at the end. Otherwise, as in my example, it's a bit more involved.
In creating 'test3.xlsx', we loop through each row and determine how many columns there are using len(df3.columns). In my example this is two, but the code would also work for a greater number of columns.
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows
df1 = pd.DataFrame()
writer = pd.ExcelWriter('merge1.xlsx', engine='xlsxwriter')  # xlsxwriter engine, needed for merge_range/write below
df1.to_excel(writer, sheet_name='Sheet1')
ws = writer.sheets['Sheet1']
ws.merge_range('A1:C1', 'This is a merged cell')
ws.write('A3', 'some string I might not want in other workbooks')
writer.save()
df2 = pd.DataFrame({'col_1': [1,2,3,4,5,6], 'col_2': ['A','B','C','D','E','F']})
writer = pd.ExcelWriter('test1.xlsx')
df2.to_excel(writer, sheet_name='Sheet1')
writer.save()
df3 = pd.read_excel('test1.xlsx')
wb = load_workbook('merge1.xlsx')
ws = wb['Sheet1']
#for row in dataframe_to_rows(df3):
# ws.append(row)
column = 2
for item in list(df3.columns.values):
    ws.cell(row=2, column=column).value = str(item)
    column = column + 1

for row_index, row in df3.iterrows():
    ws.cell(row=row_index+3, column=1).value = row_index  # comment out to remove the index
    for i in range(0, len(df3.columns)):
        ws.cell(row=row_index+3, column=i+2).value = row[i]
wb.save("test3.xlsx")
Expected Output of the 3 Workbooks:
I read an Excel Sheet into a pandas DataFrame this way:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")
The first cell's value of each column is selected as the column name for the DataFrame. I want to specify my own column names; how do I do this?
This thread is 5 years old and outdated now, but it still shows up at the top of the list in a generic search, so I am adding this note. Pandas now (v0.22) has a keyword to specify column names when parsing Excel files. Use:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet 1", header=None, names=['A', 'B', 'C'])
If header=None is not set, pandas considers the first row to be the header and drops it during parsing. If there is indeed a header, but you don't want to use it, you have two choices: either (1) use the names kwarg only; or (2) use names together with header=None and skiprows=1. I personally prefer the second option, since it clearly records that the input file is not in the format I want and that I am doing something to work around it.
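A small sketch of the two options, assuming a file whose first row is a header you want to discard:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
# Option 1: rely on names alone; the file's header row is consumed and replaced.
df_opt1 = xl.parse("Sheet 1", names=['A', 'B', 'C'])
# Option 2: make the intent explicit: no header, and skip the file's own header row.
df_opt2 = xl.parse("Sheet 1", header=None, names=['A', 'B', 'C'], skiprows=1)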
I think setting them afterwards is the only way in this case, so if you have for example four columns in your DataFrame:
df.columns = ['W','X','Y','Z']
If you know in advance what the headers in the Excel file are, it's probably better to rename them; this would rename W into A, and so on:
df = df.rename(columns={'W': 'A', 'X': 'B'})  # and so on for the remaining columns
As Ram said, this post comes up at the top of the search results and may be useful to some...
In pandas 0.24.2 (maybe earlier as well), read_excel itself can ignore the source headers, apply your own column names, and offers a few other useful controls:
DID = pd.read_excel(file1, sheet_name=0, header=None, usecols=[0, 1, 6], names=['A', 'ID', 'B'], dtype={2:str}, skiprows=10)
# for example....
# usecols => read only specific col indexes
# dtype => specifying the data types
# skiprows => skip number of rows from the top.
Call .parse with the header=None keyword argument.
df = xl.parse("Sheet1", header=None)
In case the Excel sheet only contains the data, without headers:
df = pd.read_excel("the excel file", header=None, names=["A", "B", "C"])
In case the Excel sheet already contains header names, use skiprows to skip that line:
df = pd.read_excel("the excel file", header=None, names=["A", "B", "C"], skiprows=1)