I am trying to check a column of an Excel file for values in a given format and, if there is a match, append it to a list. Here is my code:
from openpyxl import load_workbook
import re

# Open the file and read the column with PBSIDs.
PBSID = []
wb = load_workbook(filename="FILE_PATH", data_only=True)
sheet = wb.active
for col in sheet["E"]:
    if re.search("\d{3}[-\.\s]??\d{5}", str(col)):
        PBSID.append(col.value)
print(PBSID)
Column E of the Excel file contains IDs like 431-00456 that I would like to append to the list named PBSID.
Expected result: PBSID list populated with the IDs matching the regex mask XXX-XXXXX.
Actual result: Output is an empty list ("[]").
Am I missing something? (I know there are more elegant ways of doing this, but I am relatively new to Python and very open to criticism.)
Thanks!
Semantically, I think the loop variable should be named cell:
for cell in sheet["E"]:
since sheet["E"] already refers to column 'E', and iterating over it yields the individual cells of that column.
Without seeing the exact data in a cell, I think what's happening here is that str(col) converts the whole Cell object, not its contents: for an openpyxl Cell it produces the object's repr, something like "<Cell 'Sheet1'.E2>", and your regular expression will never find a match in that (hence the empty list). Also write the pattern as a raw string (r"\d{3}[-\.\s]??\d{5}") so the backslashes reach the regex engine intact.
Notice too that you refer to the whole cell object in str(col) but to its value in PBSID.append(col.value). Refer to the same object in both places: match against str(col.value) and then append col.value.
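Under those assumptions, the regex part of the fix can be checked without Excel at all; the repr string below is just an illustrative stand-in for what str() returns for a Cell:

```python
import re

# Raw string so the backslashes reach the regex engine intact.
PATTERN = re.compile(r"\d{3}[-\.\s]??\d{5}")

# str() on an openpyxl Cell yields its repr, e.g. "<Cell 'Sheet1'.E2>",
# which never matches the ID pattern -- the cell's value does.
cell_repr = "<Cell 'Sheet1'.E2>"
cell_value = "431-00456"

assert PATTERN.search(cell_repr) is None
assert PATTERN.search(str(cell_value)) is not None
```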
I want to use openpyxl to work with an excel file.
Why doesn't the dot change into a comma? Here is a minimal reproducible example:
my ExcelFile:
111.11
111.12
My Code:
import openpyxl

def someFunction():
    wb = openpyxl.load_workbook('test.xlsx')
    ws = wb.active
    for cell in ws['A']:
        cell.number_format = 'Comma'
        print(cell.number_format)
        print(cell.value)
    wb.save('test.xlsx')

someFunction()
Note: I tried different number_format values, like #,##0, and it didn't work either.
Expected Output:
Comma
111,11
Comma
111,12
Actual Output:
Comma
111.11
Comma
111.12
I don't have a direct answer for you but I have been troubleshooting this for some time and have some takeaways that may help.
You are definitely targeting and affecting the excel sheet.
If you input the desired output 111,111 by hand and then ask Excel what format that cell has, you'll find it isn't 'Comma' but #,##0. Using this format instead of 'Comma' you can take a value of 111111 and display it as 111,111... However, you cannot take the value 111.111 and turn it into 111,111 with this format: #,##0 has no decimal places, so the cell simply shows the rounded 111 and hides the fractional part.
When using your method and checking the resulting Excel file, the cell comes up as a text error and actually contains a date. Something strange is happening there.
I tried a lot of steps, including running the code outside a function, formatting the cells as numbers before switching to the comma format, targeting individual cells, and so on, all to no avail. The 'Comma' format always seems to produce an error, so perhaps it's simply not supported?
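One more observation: a number format only changes how Excel displays a value, and the decimal separator itself normally comes from the viewer's locale, not from the format string. If the goal is literally to see 111,11, one workaround (a sketch done in Python rather than through Excel formatting, so the cell would then hold text, not a number) is to render the value as a string:

```python
# A number format never rewrites the stored value, so to get a literal
# comma as the decimal separator you can format the float as text instead.
def decimal_comma(value, places=2):
    return f"{value:.{places}f}".replace(".", ",")

assert decimal_comma(111.11) == "111,11"
assert decimal_comma(111.12) == "111,12"
```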
def check_duplication(excelfile, col_Date, col_Name):
    list_rows = []
Above is a bit of the code.
How do I make lists in Python from an excel file? I want to collect every row from the sheet that contains the values of Date and Name and turn them into a list. The reason I want a list is that later I want to compare the rows within it to check for duplicates.
Dataframe Method
To compare excel content, you do not need to make a list. But if you want to make a list, one starting point may be making a dataframe, which you can inspect in python. To make a dataframe, use:
import pandas as pd

doc_path = r"the_path_of_excel_file"
sheets = pd.read_excel(doc_path, sheet_name=None, engine="openpyxl", header=None)
These lines read all of the excel document's sheets, without headers. You may change the parameters.
(For more information: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html)
Assume Sheet1 is the sheet that holds our data. Since sheet_name=None makes read_excel return a dict keyed by sheet name:
d_frame = sheets["Sheet1"]
list_rows = [d_frame.iloc[i, :] for i in range(d_frame.shape[0])]
I assume you want to use all columns; the code above gives you the list of rows.
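Once the rows are in a DataFrame, pandas can also do the duplicate check directly. A small sketch with made-up data (the Date/Name column names are assumptions, not taken from your file):

```python
import pandas as pd

# Hypothetical stand-in for the sheet's Date and Name columns.
d_frame = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02", "2021-01-01"],
    "Name": ["Alice", "Bob", "Alice"],
})

# duplicated() flags every repeat of a (Date, Name) pair
# after its first occurrence.
dupes = d_frame.duplicated(subset=["Date", "Name"])
assert dupes.tolist() == [False, False, True]
```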
I thought this would be a pretty simple task, but it is turning out to be much more complicated than I expected. I'm trying to read a simple Excel spreadsheet containing a table of values, then perform calculations on the values and output a new sheet.
First question: which library do people recommend, Pandas or Openpyxl? I'm currently using openpyxl and struggling to get the value of an individual cell. Here's some code:
collectionOrder = np.empty([numRows, 2], dtype='object')
numCountries = 0
for i in burndownData.iter_rows():
    elemnt = burndownData.cell(row=i, column=1)
    print("elemnt=", elemnt.value)
    if not (np.isnan(burndownData.cell(row=i, column=1).value)):
        collectionOrder[int(burndownData.cell(row=i, column=1).value)][0] = burndownData.cell(row=i, column=1).value
        collectionOrder[int(burndownData.cell(row=i, column=1).value)][1] = i
        numCountries = numCountries + 1
But when I first try to use the cell reference (burndownData.cell(row=i, column=1)), I get the following error:
Exception has occurred: TypeError
'<' not supported between instances of 'tuple' and 'int'
  File "C:\Users\cpeddie\Documents\projects\generateBurndownReport.py", line 59, in run
    elemnt = burndownData.cell(row=i,column=1)
  File "C:\Users\cpeddie\Documents\projects\generateBurndownReport.py", line 96, in <module>
    run()
Everything I have seen on the web says this is the way to get the value of an individual cell. What am I doing wrong? Is there an easier way?
Thanks....
Unless you're doing something more complicated than gathering some cells, numpy or pandas is usually unnecessary overhead. openpyxl works well enough on its own.
You have two options for iterating through a worksheet but you're trying to mix them, hence the error.
One option is to simply query every cell's value using the cell method of the worksheet object with the row and column keyword arguments. The row and column indexes are 1-based integers, like this:
burndownData.cell(row=1, column=1).value
The other option is iterating the sheet and indexing the row as a list:
for row in burndownData.iter_rows():
    elemnt = row[0].value
This will get you column A of row 1, column A of row 2, and so on. (Because it indexes into a plain Python tuple, the index is zero-based.)
What you were doing above:
for i in burndownData.iter_rows():
    elemnt = burndownData.cell(row=i, column=1)
generates an error because i is a tuple of openpyxl Cell objects, and the row argument to cell expects an integer.
Update: I should add that there's a third way to reference cells, using the spreadsheet's 'A1'-style syntax (column letter plus row number), like this:
burndownData['B9'].value
but unless you're only going to select a few specific cells, translating back and forth between letters and numbers seems clumsy.
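If you need both a row number and the per-row indexing, you can enumerate iter_rows(). The shape of what it yields (one tuple per row) can be sketched with plain tuples, no workbook needed:

```python
# iter_rows() yields one tuple of Cell objects per row; plain tuples have
# the same shape, which is enough to show the indexing.
rows = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]

col_a = []
for i, row in enumerate(rows, start=1):  # i is the 1-based row number
    col_a.append(row[0])                 # row[0] is column A (zero-based index)

assert col_a == ["a1", "a2", "a3"]
```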
I am trying to read an excel file, convert it into csv and load its head:
df = pd.read_excel("final.xlsx", sheet_name="NewCustomerList")
# df = df.to_csv()
print(df.head(3))
Without converting to csv, the results look like this:
Note: The data and information in this document is reflective of a hypothetical situation and client. \
0 first_name
1 Chickie
2 Morly
However, if I uncomment the conversion, I get an error that:
'str' object has no attribute 'head'
I am guessing it's because of the first line of the data. How else can I convert this properly and read it?
to_csv() is used to serialize a table: given a path it writes to disk and returns None, but called with no arguments it returns the whole CSV as a single string. Either way it does not return a DataFrame, so your commented line rebinds df to a plain string, which is exactly why df.head(3) then fails with 'str' object has no attribute 'head'.
If you just want to display the table on screen in a specific format, perhaps take a look at to_string()
If you absolutely MUST have each row of your df as a comma-separated string then try a list comprehension:
my_csv_list = [','.join(map(str, row)) for row in df.itertuples()]
Beware of the csv format, if any datapoint contains a comma then you are in for a nightmare when decoding back to a table.
According to the documentation, Pandas' to_csv() method returns either None or a string.
You would then need something like the approach in this answer to turn the string back into a dataframe and use its head.
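The behaviour is easy to verify: with no path argument, to_csv() hands back the CSV text and leaves the DataFrame itself untouched. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Chickie", "Morly"]})

# No path: to_csv() returns the CSV as a string instead of writing a file.
csv_text = df.to_csv(index=False)
assert isinstance(csv_text, str)
assert csv_text.splitlines() == ["first_name", "Chickie", "Morly"]

# The original DataFrame still supports head() -- only the rebinding broke it.
assert df.head(1).shape == (1, 1)
```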
I am trying to write a new dataframe into a CSV file, but nothing is being written to the file (it's empty) and no error is raised.
I feel the issue is with this line, since I am trying to write a single value to the column:
order['trade_type'] = trade_type
Any idea what's wrong here?
def write_transaction_to_file(trade_type):
    order = pd.DataFrame()
    order['trade_type'] = trade_type
    order.to_csv("open_pos.csv", sep='\t', index=False)

write_transaction_to_file('SELL')
Your code creates an empty DataFrame, without even column names.
Now look at order['trade_type'] = trade_type. If order contained some rows, there were a column named 'trade_type' (the string), and trade_type (the variable) were a scalar, then every row of order would receive this value in that column. But since order contains no rows, there is nothing to write to.
Instead you can append a row, e.g.:
order = order.append({'trade_type': trade_type}, ignore_index=True)
(Note that DataFrame.append was deprecated and then removed in pandas 2.0; on recent versions use pd.concat([order, pd.DataFrame([{'trade_type': trade_type}])], ignore_index=True) instead.)
The rest of the code is OK; the output file name as an ordinary string is also fine.
Another solution: just create a DataFrame with a single row and a single named column, filled with your variable:
order = pd.DataFrame([[trade_type]], columns=['trade_type'])
Then write it to the CSV file as before.
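Putting that second approach back into the original function looks like this (written to an in-memory buffer here so the sketch checks itself; pass the real filename to write the actual file):

```python
import io
import pandas as pd

def write_transaction_to_file(trade_type, target="open_pos.csv"):
    # One row, one named column, filled with the scalar value.
    order = pd.DataFrame([[trade_type]], columns=['trade_type'])
    order.to_csv(target, sep='\t', index=False)
    return order

buf = io.StringIO()
order = write_transaction_to_file('SELL', target=buf)
assert order.shape == (1, 1)
assert buf.getvalue().splitlines() == ["trade_type", "SELL"]
```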