I have an Excel table that contains:
ID product
03/1/2021
16/1/2022
12/2/2022
14/3/2023
A
4
1
2
5
B
6
1
3
C
7
6
In the same sheet I have two drop-down lists, one for the year and one for the month.
If I select, for example, year = 2021 and month = 1 in the drop-down lists,
it should return something like this:
| ID product | 03/1/2021 |
| A | 4 |
| B | 6 |
| C | |
and then it should calculate the sum of those cells: sum = 10 in this case.
Here is my code:
# import load_workbook
import pandas as pd
import numpy as np
from openpyxl import load_workbook
from openpyxl.worksheet.datavalidation import DataValidation
from openpyxl import Workbook
from openpyxl.styles import PatternFill
# set file path
filepath= r'test.xlsx'
wb=load_workbook(filepath)
ws=wb["sheet1"]
# Generate 10 years (2020 to 2029) in column MK, rows 1 to 10
for i in range(10):
    ws['MK{}'.format(i + 1)].value = str(2020 + i)
data_val = DataValidation(type="list",formula1='=MK1:MK10')
ws.add_data_validation(data_val)
# drop-down list with all the values from column MK
data_val.add(ws["E2"])
# Generate the 12 month numbers in column MN, rows 1 to 12
for number in range(1, 13):
    ws['MN{}'.format(number)].value = str(number)
data_vals = DataValidation(type="list",formula1='=MN1:MN12')
ws.add_data_validation(data_vals)
# drop-down list with all the values from column MN
data_vals.add(ws["E3"])
# add a color to the cell 'year' and 'month'
ws['E2'].fill = PatternFill(start_color='FFFFFF00', end_color='FFFFFF00', fill_type = 'solid')
ws['E3'].fill = PatternFill(start_color='FFFFFF00', end_color='FFFFFF00', fill_type = 'solid')
# save workbook
wb.save(filepath)
Any suggestions?
thank you for your help.
Assuming your excel file looks like below:
Final Code looks like below:
import pandas as pd
import xlrd
file = r'C:\path\test_exl.xlsx'
sheetname='Sheet1'
n=2  # number of lines to skip before the table starts
df = pd.read_excel(file,skiprows=[*range(n)],index_col=[0])
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_name(sheetname)
year = worksheet.cell(0,0).value   # A1 holds the selected year
month = worksheet.cell(1,0).value  # A2 holds the selected month
datetime_cols= pd.to_datetime(df.columns,dayfirst=True,errors='coerce')
out = (df.loc[:,(datetime_cols.year == year) & (datetime_cols.month == month)]
       .reset_index())
print(out)
ID Product 03-01-2021
0 A 4.0
1 B 6.0
2 C NaN
Breakdown:
You can first read the table into pandas using pd.read_excel:
file = r'C:\path\test_exl.xlsx'
sheetname='Sheet1'
n=2 #change n to how many lines to skip to read the table.
#In the above image my dataframe starts at line 3 onwards so I put n=2
df = pd.read_excel(file,skiprows=[*range(n)],index_col=[0])
Then access the cell values using xlrd:
import xlrd
workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_name(sheetname)
year = worksheet.cell(0,0).value #A1 is 0,0
month = worksheet.cell(1,0).value #A2 is 1,0 and so on..
#print(year,month) gives 2021 and 1
Then convert the column labels to datetime and filter:
datetime_cols= pd.to_datetime(df.columns,dayfirst=True,errors='coerce')
out = (df.loc[:,(datetime_cols.year == year) & (datetime_cols.month == month)]
       .reset_index())
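The question also asks for the sum of the selected cells, which the code above stops short of. A minimal sketch of that last step, assuming the table is already in a DataFrame indexed by product ID. The data is rebuilt in memory rather than read from Excel, the placement of the blank cells outside the first date column is a guess, and `year` and `month` are hypothetical stand-ins for the two drop-down values:

```python
import numpy as np
import pandas as pd

# Sample data standing in for the Excel table. Where the blanks (NaN) fall
# in the last three columns is an assumption made for illustration.
df = pd.DataFrame(
    {
        "03/1/2021": [4, 6, np.nan],
        "16/1/2022": [1, 1, 7],
        "12/2/2022": [2, 3, 6],
        "14/3/2023": [5, np.nan, np.nan],
    },
    index=pd.Index(["A", "B", "C"], name="ID product"),
)

year, month = 2021, 1  # hypothetical drop-down selections

# Parse the column labels as day-first dates, then keep the matching columns.
dates = pd.to_datetime(df.columns, dayfirst=True, errors="coerce")
mask = (dates.year == year) & (dates.month == month)
selected = df.loc[:, mask]

# Sum every selected cell; blank (NaN) cells are skipped by default.
total = selected.sum().sum()
print(selected.reset_index())
print(total)
```

If several columns fall in the chosen month, `selected.sum().sum()` adds them all together; `selected.sum()` alone would give one total per date column instead.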
Related
I'm a beginner in Python.
I have an Excel file.
| Name | Reg Date |
|Annie | 2021-07-01 |
|Billy | 2021-07-02 |
|Cat | 2021-07-03 |
|David | 2021-07-04 |
|Eric | 2021-07-04 |
|Annie | 2021-07-01 |
|Bob | 2021-07-05 |
|David | 2021-07-04 |
I found duplicate rows in excel.
Code:
import openpyxl as xl
import pandas as pd
import numpy as np
dt = pd.read_excel('C:/Users/Desktop/Student.xlsx')
dt['Duplicate'] = dt.duplicated()
DuplicateRows=[dt.duplicated(['Name', 'Reg Date'], keep=False)]
print(DuplicateRows)
Output:
Name Reg Date Duplicate
1 Annie 2021-07-01 False
6 Annie 2021-07-01 True
4 David 2021-07-04 False
8 David 2021-07-04 True
Above I have two questions... Please teach me.
Q1: How to update Duplicate value from False to True?
Q2: When Duplicate is True, how to fill background color of rows save in Student.xlsx?
The duplicated() function itself has no option to swap True and False, but you can use the NOT operator (~) to do this. I also used PatternFill to color the relevant rows and write them to student.xlsx. Please refer to the code below.
Code
import pandas as pd
import numpy as np
from openpyxl.styles import PatternFill
from openpyxl import Workbook
dt = pd.read_excel("myinput.xlsx", sheet_name="Sheet1")
dt['Duplicate'] = dt.duplicated()
dt['Duplicate'] = ~dt[dt.duplicated(['Name', 'Reg Date'], keep=False)].Duplicate
dt['Duplicate'] = dt['Duplicate'].replace(np.nan, False)
print(dt)
wb = Workbook()
ws = wb.active
# Write heading
for i in range(0, dt.shape[1]):
    cell_ref = ws.cell(row=1, column=i+1)
    cell_ref.value = dt.columns[i]  # sheet column i+1 gets the i-th column name
# Write and color data rows
for row in range(dt.shape[0]):
    for col in range(dt.shape[1]):
        print(row, col, dt.iat[row, col])
        ws.cell(row+2,col+1).value = str(dt.iat[row, col])
        if dt.iat[row, 2] == True:
            ws.cell(row+2,col+1).fill = PatternFill(start_color='FFD970', end_color='FFD970', fill_type="solid") # change hex code to change color
wb.save('student.xlsx')
Output sheet
Updated Req
Hi - Please find below the updated code. This will read the data from Student.xlsx Sheet1 and update Sheet2 with the updated colored sheet. If there is NO sheet by that name, it will error out. If there is data in those specific cells in Sheet2, that data will be overwritten, not any other data. Only Name and Reg Date will be written. Post writing to the excel sheet, the dataframe dt's Duplicate column will also be deleted and will only contain the other two columns. Hope this is clear and is what you are looking for...
import pandas as pd
import numpy as np
from openpyxl.styles import PatternFill
from openpyxl import load_workbook
dt = pd.read_excel("Student.xlsx", sheet_name="Sheet1")
dt['Duplicate'] = dt.duplicated()
dt['Duplicate'] = ~dt[dt.duplicated(['Name', 'Reg Date'], keep=False)].Duplicate
dt['Duplicate'] = dt['Duplicate'].replace(np.nan, False)
wb = load_workbook("Student.xlsx")
ws = wb['Sheet2'] # Make sure this sheet exists, else, it will give an error
# Write heading (skip the last column, Duplicate)
for i in range(0, dt.shape[1]-1):
    ws.cell(row = 1, column = i+1).value = dt.columns[i]
# Write data
for row in range(dt.shape[0]):
    for col in range(dt.shape[1] - 1):
        print(row, col, dt.iat[row, col])
        ws.cell(row+2,col+1).value = str(dt.iat[row, col])
        if dt.iat[row, 2] == True:
            ws.cell(row+2,col+1).fill = PatternFill(start_color='FFD970', end_color='FFD970', fill_type="solid") # change hex code to change color
wb.save("Student.xlsx")
dt = dt.drop(columns=['Duplicate']) # Finally remove the Duplicate column; drop() returns the new frame
print(dt)
Output Excel
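If the goal in Q1 is simply to have every row of a duplicate group flagged True (the first occurrence included), `duplicated(..., keep=False)` already produces that without the inversion step. A minimal sketch, with the question's table rebuilt in memory instead of read from Student.xlsx:

```python
import pandas as pd

# The table from the question, built in memory for illustration.
dt = pd.DataFrame(
    {
        "Name": ["Annie", "Billy", "Cat", "David", "Eric", "Annie", "Bob", "David"],
        "Reg Date": ["2021-07-01", "2021-07-02", "2021-07-03", "2021-07-04",
                     "2021-07-04", "2021-07-01", "2021-07-05", "2021-07-04"],
    }
)

# keep=False marks every member of a duplicate group, not only the repeats.
dt["Duplicate"] = dt.duplicated(["Name", "Reg Date"], keep=False)

# Show only the rows that belong to a duplicate group.
print(dt[dt["Duplicate"]])
```

The resulting boolean column can drive the same PatternFill loop as above, since that loop only checks whether the third column is True.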
Let's say I have a .txt file like that:
#D=H|ID|STRINGIDENTIFIER
#D=T|SEQ|DATETIME|VALUE
H|879|IDENTIFIER1
T|1|1569972384|7
T|2|1569901951|9
T|3|1569801600|8
H|892|IDENTIFIER2
T|1|1569972300|109
T|2|1569907921|101
T|3|1569803600|151
And I need to create a dataframe like this:
IDENTIFIER SEQ DATETIME VALUE
879_IDENTIFIER1 1 1569972384 7
879_IDENTIFIER1 2 1569901951 9
879_IDENTIFIER1 3 1569801600 8
892_IDENTIFIER2 1 1569972300 109
892_IDENTIFIER2 2 1569907921 101
892_IDENTIFIER2 3 1569803600 151
What would be the possible code?
A basic way to do it might be to process the text file and convert it into a CSV before using the read_csv function in pandas. Assuming the file you want to process is as consistent as the example:
import pandas as pd
with open('text.txt', 'r') as file:
    fileAsRows = file.read().split('\n')
pdInput = 'IDENTIFIER,SEQ,DATETIME,VALUE\n' #addHeader
for row in fileAsRows:
    cols = row.split('|') #breakup row
    if row.startswith('H'): #get identifier info from H row
        Identifier = cols[1]+'_'+cols[2]
    if row.startswith('T'): #get other info from T row
        Seq = cols[1]
        DateTime = cols[2]
        Value = cols[3]
        tempList = [Identifier,Seq, DateTime, Value]
        pdInput += (','.join(tempList)+'\n')
# open in "w" mode so re-running the script does not append a second copy
with open("pdInput.csv", "w") as file:
    file.write(pdInput)
## import into pandas
df = pd.read_csv("pdInput.csv")
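The intermediate CSV is not strictly needed. A minimal sketch of the same idea that collects the rows in a list and builds the DataFrame directly, assuming the input is as regular as the example; the sample text is held in a string (`raw`, a name introduced here) instead of being read from text.txt:

```python
import io

import pandas as pd

# Sample input from the question, held in memory instead of a file.
raw = """#D=H|ID|STRINGIDENTIFIER
#D=T|SEQ|DATETIME|VALUE
H|879|IDENTIFIER1
T|1|1569972384|7
T|2|1569901951|9
T|3|1569801600|8
H|892|IDENTIFIER2
T|1|1569972300|109
T|2|1569907921|101
T|3|1569803600|151
"""

records = []
identifier = None
for line in io.StringIO(raw):
    cols = line.strip().split("|")
    if cols[0] == "H":
        # Header row: remember the identifier for the T rows that follow.
        identifier = f"{cols[1]}_{cols[2]}"
    elif cols[0] == "T":
        # Data row: attach the most recent identifier.
        records.append([identifier, int(cols[1]), int(cols[2]), int(cols[3])])

df = pd.DataFrame(records, columns=["IDENTIFIER", "SEQ", "DATETIME", "VALUE"])
print(df)
```

The two `#D=` description lines are skipped automatically because their first field is neither `H` nor `T`.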
I am trying to take a dictionary in Python and place it into an Excel worksheet where the keys are displayed in the header row of the sheet and the values are in the columns below them. I am close, but I am missing something small and cannot figure it out. Here is my code. Caution: I use way too many imports.
import os
import re
import openpyxl
from openpyxl.utils import get_column_letter, column_index_from_string
import xlsxwriter
import pprint
from openpyxl.workbook import Workbook
from openpyxl.worksheet.copier import WorksheetCopy
workbook = xlsxwriter.Workbook('dicExcel.xlsx')
worksheet = workbook.add_worksheet()
d = {'a':['Alpha','Bata','Gamma'], 'b':['1','2','3'], 'c':['1.0','2.0','3.0']}
row = 0
col = 1
for key in d.keys():
    row += 1
    worksheet.write(row, col, key)
    for item in d[key]:
        worksheet.write(row, col + 1, item)
        row += 1
workbook.close()
I think this is what you are trying to do:
import xlsxwriter
workbook = xlsxwriter.Workbook('dicExcel.xlsx')
worksheet = workbook.add_worksheet()
d = {'a':['Alpha','Bata','Gamma'], 'b':['1','2','3'], 'c':['1.0','2.0','3.0']}
row = 0
col = 0
for key in d.keys():
    row = 0
    worksheet.write(row, col, key)
    row += 1
    for item in d[key]:
        worksheet.write(row, col, item)
        row += 1
    col += 1
workbook.close()
This puts the data in this format:
a c b
Alpha 1.0 1
Bata 2.0 2
Gamma 3.0 3
Is this what you wanted?
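The same layout can also be built row by row. A minimal sketch, using only the standard library, that transposes the dictionary's value lists with `zip` so each spreadsheet row is ready to write; it assumes all lists have the same length, and `rows` is a name introduced here for illustration:

```python
d = {'a': ['Alpha', 'Bata', 'Gamma'], 'b': ['1', '2', '3'], 'c': ['1.0', '2.0', '3.0']}

# One header cell per key. On Python 3.7+ dicts keep insertion order,
# so the columns come out as a, b, c.
header = list(d.keys())

# zip(*values) turns the per-key columns into per-row tuples.
body = [list(r) for r in zip(*d.values())]

rows = [header] + body
for r in rows:
    print(r)
```

Each entry of `rows` could then be written in one call with XlsxWriter's `worksheet.write_row(row_index, 0, r)` instead of one `write` per cell.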
One option is to use the pandas package, you won't need too many imports.
import pandas as pd
d = {'a':['Alpha','Bata','Gamma'], 'b':['1','2','3'], 'c':['1.0','2.0','3.0']}
df = pd.DataFrame(d)
df will look like this:
a b c
0 Alpha 1 1.0
1 Bata 2 2.0
2 Gamma 3 3.0
To write the DataFrame back in to an Excel file:
df.to_excel("Path to write Excel File + File Name")
Hello all, a question about using pandas to combine Excel spreadsheets.
The problem is that the sequence of columns is lost when the files are combined. The more files there are to combine, the worse the format gets.
It also gives an error message if the number of files is large:
ValueError: column index (256) not an int in range(256)
What I am using is below:
import os
import pandas as pd
df = pd.DataFrame()
for f in ['c:\\1635.xls', 'c:\\1644.xls']:
    data = pd.read_excel(f, 'Sheet1')
    data.index = [os.path.basename(f)] * len(data)
    df = df.append(data)
df.to_excel('c:\\CB.xls')
The original files and combined look like:
What's the best way to combine a large number of such similar Excel files?
Thanks.
I usually use xlrd and xlwt:
#!/usr/bin/env python
# encoding: utf-8
import xlwt
import xlrd
import os
current_file = xlwt.Workbook()
write_table = current_file.add_sheet('sheet1', cell_overwrite_ok=True)
key_list = [u'City', u'Country', u'Received Date', u'Shipping Date', u'Weight', u'1635']
for title_index, text in enumerate(key_list):
    write_table.write(0, title_index, text)
file_list = ['1635.xlsx', '1644.xlsx']
i = 1
for name in file_list:
    data = xlrd.open_workbook(name)
    table = data.sheets()[0]
    nrows = table.nrows
    for row in range(nrows):
        if row == 0:
            continue
        for index, context in enumerate(table.row_values(row)):
            write_table.write(i, index, context)
        i += 1
current_file.save(os.getcwd() + '/result.xls')
Instead of data.index = [os.path.basename(f)] * len(data) you should use df.reset_index().
For example:
1.xlsx:
a b
1 1
2 2
3 3
2.xlsx:
a b
4 4
5 5
6 6
code:
df = pd.DataFrame()
for f in [r"C:\Users\Adi\Desktop\1.xlsx", r"C:\Users\Adi\Desktop\2.xlsx"]:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.reset_index(inplace=True, drop=True)
df.to_excel('c:\\CB.xls')
cb.xls:
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
If you don't want the dataframe's index to be in the output file, you can use df.to_excel('c:\\CB.xls', index=False).
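On recent pandas versions `DataFrame.append` is no longer available (it was removed in pandas 2.0), so the loop above needs `pd.concat` there. A minimal sketch of the equivalent step, with two in-memory frames standing in for the sheets that would be read from 1.xlsx and 2.xlsx:

```python
import pandas as pd

# In-memory stand-ins for the frames returned by pd.read_excel.
frames = [
    pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]}),
    pd.DataFrame({"a": [4, 5, 6], "b": [4, 5, 6]}),
]

# ignore_index=True renumbers the rows 0..n-1, which replaces the
# separate reset_index(drop=True) call.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Collecting the frames in a list and concatenating once also avoids copying the growing result on every loop iteration, which matters when there are many files.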
I have many excel sheets in which data is arranged in table as shown below
I want to convert each of this table to another list-like format as shown below.
In this table:
Dir -> Name of the tab
Year -> Same for entire table
DOM -> Day of the month
DOW -> Day of the week.
Hour -> Column label in original table
Traffic Count -> Value in the original table
There are close to 1000 such sheets. The data in each sheet is at the same location. What is the best way to do this? Should I write a VBA script, or is there anything in Excel I can use to make my life easier?
I solved the problem using python and the xlrd module
import xlrd
import numpy as np
from os import listdir, chdir
import calendar
import re
direction = {'N':0,'S':1,'E':2,'W':3}
rend = [0, 40, 37, 40, 39, 40, 39, 40, 40, 39, 40, 39, 40]
# Initialize the matrix to store final result
aData = np.matrix(np.zeros((0,7)))
# get all the xlsx files from the directory.
filenames = [ f for f in listdir('.') if f.endswith('.xlsx') ]
# for each .xlsx in the current directory
for file in filenames:
    # The file names are in the format gdot_39446_yyyy_mm.xlsx
    # yyyy is the year and mm is the month number with 0 -Jan and 11 - Dec
    # I extracted the month and year info from the file name
    ms = re.search(r'.+_.+_.+_([0-9]+)\.',file,re.M).group(1)
    month = int(ms) + 1
    year = int(file[11:15])
    # open the workbook
    workbook = xlrd.open_workbook(file)
    # the workbook has three sheets. I want information from
    # sheet2 and sheet3 (indexed by 1 and 2 resp.)
    for i in range(1,3):
        sheet = workbook.sheet_by_index(i)
        di = sheet.name[-1]
        data = [[sheet.cell_value(r,c) for c in range(2,26)] for r in range(9,rend[month])]
        mData = np.matrix(data)
        mData[np.where(mData=='')] = 0 # some cells are blank. Insert 0 in those cells
        n,d = mData.shape
        rows = n * d
        rData = np.matrix(np.zeros((rows,7)))
        rData[:,0].fill(direction[di])
        rData[:,1].fill(year)
        rData[:,2].fill(month)
        for j in range(rows):
            # // is integer division, so the day of month stays an int on Python 3
            rData[j,3] = (j//24) + 1
            rData[j,4] = calendar.weekday(year,month,(j//24) + 1)
            rData[j,5] = j%24
        for j in range(n):
            rData[j*24:((j+1)*24),6] = mData[j,:].T
        aData = np.vstack((aData,rData))
np.savetxt("alldata.csv",aData, delimiter=',', fmt='%s')
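The core reshaping step can be seen in isolation without xlrd or NumPy. A minimal sketch using only the standard library: it flattens a day-by-hour grid into one record per hour, assuming each grid row is one day of the month with 24 hourly counts. The function name `to_long` and the sample numbers are made up for illustration:

```python
import calendar


def to_long(grid, direction, year, month):
    """Flatten a day-by-hour grid into (dir, year, month, DOM, DOW, hour, count) rows."""
    rows = []
    for day_index, hourly in enumerate(grid):
        dom = day_index + 1
        dow = calendar.weekday(year, month, dom)  # Monday is 0, Sunday is 6
        for hour, count in enumerate(hourly):
            # Blank cells ('') are stored as 0, as in the script above.
            rows.append((direction, year, month, dom, dow, hour, count or 0))
    return rows


# Two made-up days of 24 hourly counts each; '' marks a blank cell.
grid = [[10] * 24, [''] + [5] * 23]
rows = to_long(grid, 'N', 2013, 1)
print(rows[0])
print(rows[24])
```

Keeping the direction as a letter rather than a numeric code is a choice made here for readability; the script above maps it to 0 to 3 so that everything fits in a numeric matrix.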