Search entire excel sheet with Pandas for word(s) - python

I am trying to replicate the Find function (Ctrl+F) in Python with Pandas. I want to search an entire sheet (all rows and columns) to see whether any cell on the sheet contains a word, and then print the row in which the word was found. I'd like to do this across multiple sheets as well.
I've imported the sheet:
pdTestDataframe = pd.read_excel(TestFile, sheet_name="Sheet Name",
                                keep_default_na=False, na_values=[""])
I then tried to build a list of columns so that I could index into the values of all the cells, but it still excludes many of the cells in the sheet. The attempted code is below.
columnList = []
for i, data in enumerate(pdTestDataframe.columns):
    columnList.append(pdTestDataframe.columns[i])
    for j, data1 in enumerate(pdTestDataframe.index):
        print(pdTestDataframe[columnList[i]][j])
I want to make sure that no matter the formatting of the excel sheet, all cells with data inside can be searched for the word(s). Would love any help I can get!

Pandas has a different way of thinking about this. Calling df[df.text_column.str.contains('whatever')] shows all the rows in which the text is contained in one specific column. To search the entire dataframe, build one boolean mask per column and combine them. Each column is cast to str first, because the .str accessor raises an error on non-string columns:
import numpy as np
mask = np.column_stack([df[col].astype(str).str.contains("whatever", na=False) for col in df])
df.loc[mask.any(axis=1)]
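To cover the "multiple sheets" part of the question, here is a minimal sketch rather than a definitive implementation. The helper names find_rows and find_in_sheets are made up for this example. It assumes the sheets arrive as a {sheet_name: DataFrame} dict, which is what pd.read_excel(path, sheet_name=None) returns; small in-memory frames stand in for a real workbook here.

```python
import re
import pandas as pd

def find_rows(df, word, case=False):
    """Return the rows of df in which any cell contains `word` (hypothetical helper)."""
    pattern = re.escape(word)  # treat the search word literally, not as a regex
    # One boolean column per original column; astype(str) lets numbers be searched too
    mask = df.apply(
        lambda col: col.astype(str).str.contains(pattern, case=case, na=False)
    )
    return df[mask.any(axis=1)]

def find_in_sheets(sheets, word):
    """Search every sheet in a {sheet_name: DataFrame} dict and keep only sheets with hits."""
    hits = {}
    for name, df in sheets.items():
        found = find_rows(df, word)
        if not found.empty:
            hits[name] = found
    return hits

# Stand-in for: sheets = pd.read_excel("workbook.xlsx", sheet_name=None)
sheets = {
    "Sheet1": pd.DataFrame({"a": ["apple pie", "banana"], "b": [1, 2]}),
    "Sheet2": pd.DataFrame({"c": ["cherry", None], "d": [3.5, "Apple"]}),
}
hits = find_in_sheets(sheets, "apple")
for name, rows in hits.items():
    print(name)
    print(rows)
```

Note that the column headers themselves are not searched; reading the sheets with header=None keeps the first row as ordinary data if that matters for your files.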

Related

openpyxl - How to retrieve multiple columns of one row from an Excel file?

I'm looking for the correct way to print cell values from one desired row, but I only want the cell values that are in specific columns. e.g. columns 'A','C','F' etc.
Below is the code I currently have, I get no errors with this but am at a loss with how to progress with my desired outcome.
path_file = 'Readfrom.xlsx'
spreadsheet = load_workbook(path_file)
spreadsheet = spreadsheet.active
desire_columns = spreadsheet['A']
disire_row = spreadsheet['5']
for cell in disire_row:
    print(cell.value)
from openpyxl import load_workbook
path_file = 'Readfrom.xlsx'
wb = load_workbook(path_file)  # Workbook object (renamed from spreadsheet)
ws = wb.active  # Active worksheet (renamed from spreadsheet)
desired_columns = ['A', 'C', 'F']  # Letters of the columns you want
desired_row = 5  # Number of the row you want
for column in desired_columns:  # Iterate over the column letters
    cell = ws[f"{column}{desired_row}"]  # Build the cell reference, e.g. "A5"
    print(cell.value)
Reading through your code again step by step should make the fix clear: build each cell reference from a column letter and the row number.
I have also renamed your variables; I hope you don't mind. Use conventional names where possible so that other programmers can easily understand your code. The spreadsheet variable is especially confusing because it is first assigned the workbook and then reassigned to a worksheet. That becomes a problem if you later decide to save the workbook (remember to back up the Excel file first) or need to work with multiple sheets in the workbook.
Also read up on "for loops" to further your study.
https://www.w3schools.com/python/python_for_loops.asp
You can also read data from Excel using pandas:
import pandas as pd
df = pd.read_excel('tmp.xlsx', index_col=0)
# A cell value can be accessed by column name and row position
# ('ColumnName' and rowIndex are placeholders for your own values)
print(df['ColumnName'].iloc[rowIndex])
# The same can be done in a loop
for index, row in df.iterrows():
    print(row['ColumnName'])
Refer to this link for further details:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
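To tie the pandas approach back to the original question (several columns of one row), a small sketch follows. The helper row_values and the sample data are invented for illustration; with a real file you could load only the wanted columns via pd.read_excel(path, usecols="A,C,F"), and keep in mind that with a header in Excel row 1, Excel row 5 is at position 3 in the dataframe.

```python
import pandas as pd

def row_values(df, row_pos, columns):
    """Return the values of `columns` from the row at integer position `row_pos` (hypothetical helper)."""
    return df.iloc[row_pos][columns].tolist()

# Stand-in data; a real sheet would come from pd.read_excel(...)
df = pd.DataFrame({
    "A": [10, 20, 30],
    "B": ["x", "y", "z"],
    "C": [1.5, 2.5, 3.5],
})

values = row_values(df, 1, ["A", "C"])
print(values)
```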

How to read excel data starting from specific col

I have an excel workbook with multiple sheets and I am trying to import/read the data starting from an empty col.
The data looks like this (the second column has an empty header):
A   | (empty) | C
One | Two     | Three
and I am trying to get the data
(empty) | C
Two     | Three
I can't use usecols as the position of this empty col changes in each sheet I have in the workbook.
I have tried this but didn't work out for me.
df = df[~df.header.shift().eq('').cummax()]
I would appreciate any suggestions or hints. Many thanks in advance!
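Assuming that you want to start from the first empty header, and that it is the second column (pandas names a column with an empty header 'Unnamed: N', where N is its zero-based position), then:
df = df[df.columns[list(df.columns).index('Unnamed: 1'):]]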
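Since the position of the empty column changes from sheet to sheet, the hard-coded 'Unnamed: 1' can be replaced by a lookup. This is only a sketch: from_first_unnamed is a made-up helper name, and it assumes the sheet was read with pandas' default header handling so that empty headers show up as 'Unnamed: N'.

```python
import pandas as pd

def from_first_unnamed(df):
    """Keep the columns from the first 'Unnamed: N' header onward (hypothetical helper)."""
    for pos, name in enumerate(df.columns):
        if str(name).startswith("Unnamed:"):
            return df.iloc[:, pos:]
    return df  # no empty header found: leave the frame unchanged

# Stand-in for one sheet of the workbook
df = pd.DataFrame([["One", "Two", "Three"]], columns=["A", "Unnamed: 1", "C"])
result = from_first_unnamed(df)
print(result)
```

The same function can then be applied to each sheet in turn, for example over the dict returned by pd.read_excel(path, sheet_name=None).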

Pandas Find Index Value

I'm trying to find the index value for "Rental Income" in the dataframe below. The dataframe is being uploaded from an Excel document and needs to be cleaned.
Is there a way to find the index value for "Rental Income" without giving the column name or row name? The formatting is different in every Excel file, so the column and row names change with each file. But if I can search the full dataframe at once, I can use the result as the anchor point.
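Try where followed by stack, passing the value you are searching for:
out = df.where(df.eq('Rental Income')).stack()
idx = out.index.get_level_values(0)  # row labels of the matches
col = out.index.get_level_values(1)  # column labels of the matches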
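An alternative sketch that does not rely on stack dropping the non-matching cells: compare the whole frame to the value and ask NumPy for the positions of the True cells. The helper name locate and the sample frame are assumptions made for this example, not taken from the asker's data.

```python
import numpy as np
import pandas as pd

def locate(df, value):
    """Return (row_label, column_label) pairs for every cell equal to `value` (hypothetical helper)."""
    rows, cols = np.where(df.eq(value).to_numpy())
    return [(df.index[r], df.columns[c]) for r, c in zip(rows, cols)]

# Stand-in for a messy sheet loaded from Excel
df = pd.DataFrame({
    "Unnamed: 0": [None, "Rental Income", "Other Income"],
    "Unnamed: 1": ["2020", 1000, 50],
})

hits = locate(df, "Rental Income")
print(hits)
```

Each pair can then serve as the anchor point, for example df.loc[row_label:] to keep everything from that row down.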

Copy data from one row in Dataframe A to a specific row in Dataframe B in python/pandas

I've been struggling with this assignment for quite a while now and I can't seem to find the correct solution. My problem is as follows: I have imported two worksheets (A & B) from an Excel file and assigned them to two dataframe variables. Worksheets A & B have rows with similar names; however, the cells in worksheet B are empty, whereas the cells in worksheet A contain numbers. My assignment is to copy the rows from worksheet A to worksheet B. This needs to be done in a specific order. For example: in worksheet A the row sales revenue is at index 1, but in worksheet B the row sales revenue is at index 5. Does anybody know how to do this? I'm an absolute beginner with python/pandas. I have included a screenshot of the situation.
You need to merge the two tables using a left join:
# Drop the all-NaN columns by keeping only the row-name column
# (pandas normally names a headerless first column 'Unnamed: 0', with a space; adjust to your data)
df_data3 = df_data3[['Unnamed: 0']]
df_data3 = pd.merge(df_data3, df_data2, on='Unnamed: 0', how='left')
See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more options, and experiment with it a bit to better understand how it works.
It is similar to a SQL join.
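Here is a small self-contained sketch of that left join. The frames sheet_a and sheet_b and the column names "name" and "2020" are invented stand-ins for the two worksheets. A left merge keeps the key order of the left frame, which is what puts the rows in worksheet B's order.

```python
import pandas as pd

# Worksheet A: row names with numbers
sheet_a = pd.DataFrame({
    "name": ["costs", "sales revenue", "profit"],
    "2020": [40, 100, 60],
})
# Worksheet B: same row names in a different order, cells still empty
sheet_b = pd.DataFrame({
    "name": ["profit", "sales revenue", "costs"],
    "2020": [None, None, None],
})

# Keep only B's row-name column, then pull the numbers in from A
filled = sheet_b[["name"]].merge(sheet_a, on="name", how="left")
print(filled)
```

Row names that exist in B but not in A would come out as NaN, so the names need to match exactly (including stray spaces).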

Read Excel sectioned data, transform, then output to raw format for database

I don't know if this is possible; I haven't come across it on the web. In Excel I have formatted crosstab data, sectioned by location/city, all in the same spreadsheet for thousands of rows. Simple example below.
Example
I want to run a Python Excel parser that takes this formatted data and flattens it into a raw data format so that I can load it into a database table. Is this possible? The desired result would look something like this.
Target output Example
Pandas has a method to read Excel files, which is rather neat, as you get a dataframe out of it and that probably makes it easier for scanning and customized parsing.
import pandas as pd
# file_path, sheet_name and column_name are placeholders for your own values
# Reads the Excel file
xl = pd.ExcelFile(file_path)
# Parses the desired sheet
df = xl.parse(sheet_name)
# To host all your table title indices
tbl_title = []
# To locate the title of your tables, I think you can do a sampling of that column to ascertain all the row numbers that contain the table titles
for i, n in enumerate(df.loc[:, column_name]):
    if n == 'P':  # The first column in your table header as the cue
        tbl_title.append(i - 1)  # This would be the row index for Frisco, Dallas etc.
Once you have the indices of all your table titles, you can just create another table reader function to iterate over the dataframe at the specific rows.
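A rough sketch of such a reader follows. The example images are not available, so the layout is an assumption: each section is a title row with the location in the first column, then a header row whose first cell is 'P', then data rows up to the next title. The function name unpivot_sections, the output column names and the sample frame are all hypothetical, and the sheet is assumed to be read with header=None so that every row is plain data.

```python
import pandas as pd

def unpivot_sections(df, header_cue="P"):
    """Flatten stacked per-location tables into one long table (hypothetical helper)."""
    first_col = df.iloc[:, 0]
    # Row positions of every section header, found via the cue in the first column
    header_rows = [i for i, v in enumerate(first_col) if v == header_cue]
    records = []
    for k, h in enumerate(header_rows):
        title = first_col.iloc[h - 1]  # the location sits one row above its header
        # Data runs up to the next section's title row, or to the end of the sheet
        end = header_rows[k + 1] - 1 if k + 1 < len(header_rows) else len(df)
        header = df.iloc[h].tolist()
        for r in range(h + 1, end):
            row = df.iloc[r].tolist()
            if pd.isna(row[0]):
                continue  # skip blank spacer rows
            for col_name, value in zip(header[1:], row[1:]):
                records.append({
                    "location": title,
                    header[0]: row[0],
                    "column": col_name,
                    "value": value,
                })
    return pd.DataFrame(records)

# Stand-in for: raw = pd.read_excel(file_path, header=None)
raw = pd.DataFrame([
    ["Frisco", None, None],
    ["P", "Q1", "Q2"],
    ["Widgets", 1, 2],
    ["Dallas", None, None],
    ["P", "Q1", "Q2"],
    ["Widgets", 3, 4],
    ["Gadgets", 5, 6],
])
long = unpivot_sections(raw)
print(long)
```

The resulting long table has one row per location, item and column, which is usually the shape a database table expects; the real cue and offsets would need adjusting to the actual sheet.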
