Openpyxl, pandas, or both? - Python

I'm trying to process an Excel file so that I can use each row and column for specific operations later on.
My problem is as follows:
Using openpyxl makes it easy for me to load the file and iterate over the rows:
# reading the Excel file
from openpyxl import load_workbook

path = r'Datasets/Chapter 1/Table B1.1.xlsx'
wb = load_workbook(path)  # loading the workbook
ws = wb.active  # grab the active worksheet

# setting the doc headers
for h in ws.iter_rows(max_row=1, values_only=True):  # getting the first row (headers) of the table
    header = list(h)
for sh in ws.iter_rows(min_row=1, max_row=2, values_only=True):  # ends with the second row (sub-headers)
    sub_header = list(sh)

# removing all of the None values
header = list(filter(None, header))
sub_header = list(filter(None, sub_header))

# creating a list of all the rows in the Excel file
row_list = []
for row in ws.iter_rows(min_row=3):  # start from the third row; the first two are headers
    row = [cell.value for cell in row]  # creating a list from each row
    row = list(filter(None, row))  # removing the None values from each row
    row_list.append(row)

# creating a list of all the columns in the Excel file
colm = []
for col in ws.iter_cols(min_row=3, min_col=1):  # same, but column by column
    col = [cell.value for cell in col]  # creating a list from each column
    col = list(filter(None, col))  # removing the None values from each column
    colm.append(col)
But at the same time (as far as I've read in the docs), I can't easily visualize the data or do direct operations on the rows or columns.
Pandas, on the other hand, is more efficient for direct operations on rows and columns, but I've read that iterating over a DataFrame to collect the rows in a list is not recommended, and even if it were done with df.iloc[2:] it would not give me the same result (saving each row in its own list, since the headers would always be there). However, unlike openpyxl, doing direct operations on columns is much easier with something like df[col1] - df[col2], using the column names, which is something I need to do (just putting all the column values in a list won't do it for me).
So my question is whether there is a way to do what I want using only one of them, or whether using both isn't that bad, keeping in mind I'd have to load the Excel file twice.
Thanks in advance!

There is no problem with reading an Excel file once using openpyxl and then loading the rows into pandas:
pandas.DataFrame(row_list, columns=header)
You are right that iterating over a DataFrame using indexes is quite slow, but you have other options: apply(), iterrows(), itertuples().
Link: Different ways to iterate over rows in pandas DataFrame
I would also like to point out that your code probably does not do what you intend.
list(filter(None, header)) filters out not only None but every falsy value, such as 0 or "".
Such filtering also shifts the columns. For example, if you have a row [1, None, 3] and columns ['a', 'b', 'c'], filtering out None gives [1, 3], which will then be matched to columns 'a' and 'b'.
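For completeness, a minimal sketch of the combined workflow, building on the row_list and header from the question (this assumes the filter(None, ...) calls are skipped so every row keeps the same length as the header, and 'col1'/'col2' are placeholders for whatever column names are actually in your header):
import pandas as pd

df = pd.DataFrame(row_list, columns=header)
df['difference'] = df['col1'] - df['col2']  # direct column arithmetic by name
print(df.head())                            # quick way to visualize the data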

Related

How to build a dataframe row by row, where each row comes from a different csv?

I have searched through perhaps a dozen variations of the question "How to build a dataframe row by row", but none of the solutions have worked for me. Thus, though this is a frequently asked question, my case is unique enough to be a valid question. I think the problem might be that I am grabbing each row from a different csv. This code demonstrates that I am successfully making dataframes in the loop:
onlyfiles = list_of_csvs
for idx, f in enumerate(onlyfiles):
    row = pd.read_csv(mypath + f, sep="|").iloc[0:1]
But the rows are individual dataframes and cannot be combined (so far). I have attempted the following:
df = pd.DataFrame()
for idx, f in enumerate(onlyfiles):
    row = pd.read_csv(path + f, sep="|").iloc[0:1]
    df.iloc(idx) = row
Which returns
df.loc(idx) = row
^
SyntaxError: can't assign to function call
I think the problem is that each row, or dataframe, has its own headers. I've also tried df.loc(idx) = row[1], but that doesn't work either (where we grab row[:] when idx = 0). Neither iloc(idx) nor loc(idx) works.
In the end, I want one dataframe that has the header (column names) from the first data frame, and then n rows where n is the number of files.
Try pd.concat().
Note, you can read just the first line from each file directly, instead of reading in the whole file and then limiting it to the first row: pass the parameter nrows=1 to pd.read_csv.
onlyfiles = list_of_csvs
df_joint = pd.DataFrame()
for f in onlyfiles:
    df_ = pd.read_csv(mypath + f, sep="|", nrows=1)
    df_joint = pd.concat([df_joint, df_])
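An equivalent sketch that collects the single-row frames first and concatenates once at the end (mypath and list_of_csvs are assumed to be defined as in the question):
frames = [pd.read_csv(mypath + f, sep="|", nrows=1) for f in list_of_csvs]
df_joint = pd.concat(frames, ignore_index=True)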

Trying to split a tsv file into two by looking for an empty line

I'm trying to use pandas to split a tsv file that looks something like this:
x y
x y
[empty row]
x y z a b c
x y z a b c
into 2 separate dataframes, one containing the half before the empty line and one containing the rest of the file. This is because I can't read the whole file into one dataframe, as the two portions have a different number of columns.
Is there a way I can establish the empty row as a "stopping point" for the first dataframe, and read the rest of the tsv file into another dataframe?
Currently, I'm working around this by just skipping lines using pd.read_csv(file_name, skiprows = 3, delimiter = '\t'), but this is not a very robust approach.
Thanks!
Try this:
First, read your file as a whole into a df. Do not skip blank lines; this way the blank line is read in as a row of NaN values.
df = pd.read_csv(filename, delimiter='\t', skip_blank_lines=False)
Now, identify the empty row and create separate groups in the df.
df['emptyrow'] = df.isnull().all(axis=1)
df['group'] = (df['emptyrow'] != df['emptyrow'].shift()).cumsum()
groups = df.groupby(by='group')
With this, we have groups within df which can be accessed using groups.get_group(key).
Also, we can have a dict of data frames for each group key.
split_dfs = {}
for grp in groups.groups.keys():
    split_dfs[grp] = groups.get_group(grp).drop(['emptyrow', 'group'], axis=1)
Now, split_dfs is a dict of dfs each with a subset of the original df based on the group we created.
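If the two halves really do have different numbers of columns, reading the whole file with one set of columns may not parse cleanly, so another option is to locate the blank line first and then read each half separately (a sketch, assuming the file has no header row and exactly one completely empty separator line):
import pandas as pd

with open(file_name) as fh:
    lines = fh.read().splitlines()
blank_idx = lines.index('')  # position of the empty separator row

df_top = pd.read_csv(file_name, delimiter='\t', header=None, nrows=blank_idx)
df_bottom = pd.read_csv(file_name, delimiter='\t', header=None, skiprows=blank_idx + 1)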

Openpyxl : need the max number of rows in a column that has data in Excel

I need the last row in a particular column that contains data in Excel. In openpyxl sheet.max_row or max_column gets us the maximum row or column in the whole sheet. But what I want is for a particular column.
My scenario is where I have to get some values from database and append it to the end of a particular column in Excel sheet.
In the screenshot (not shown here), if I ask for the last cell containing data in column 'C', it should return 10.
------------- Solution 1 --------------------
import pandas as pd

# lt is the dataframe containing the data to be loaded into the Excel file
for index, i in enumerate(lt):
    panda_xl_rd = pd.read_excel('file.xlsx', "sheet_Name")  # pandas DataFrame
    # Get the row number of the last record in the column: dropna() removes the
    # NaN values (otherwise we would get the whole sheet's max length), and +2
    # moves to the next empty row right after the last one with data.
    max = len(panda_xl_rd.iloc[:, (col - 1)].dropna()) + 2
    cellref = sheet.cell(row=max + index, column=col)
    cellref.value = i
    del panda_xl_rd
------------------------Solution 2 ----------------------
https://stackoverflow.com/a/52816289/10003981
------------------------Solution 3 ----------------------
https://stackoverflow.com/a/52817637/10003981
Maybe solution 3 is the most concise one!
"Empty" is a relative concept so your code should be clear about this. The methods in openpyxl are guaranteed to return orthogonal result sets: the length of rows and columns will always be the same.
Using this, we can deduce the highest row in a column whose cell value is not None.
max_row_for_c = max((c.row for c in ws['C'] if c.value is not None))
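If column 'C' could be entirely empty, passing a default keeps max() from raising on an empty sequence (a small variation of the same line):
max_row_for_c = max((c.row for c in ws['C'] if c.value is not None), default=0)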
Question: I want the last row containing data in column 'C'; it should return 10.
Simply count the cells whose value is not empty:
Documentation Accessing many cells
PSEUDOCODE
for cell in Column('C'):
    if not cell.value is empty:
        count += 1
Comment: What if we have an empty cell in between?
Count the rows in sync with the column range and use a maxRowWithData variable. This also works when there is no empty cell in between.
PSEUDOCODE
for row index, cell in enumerate Column('C'):
    if not cell.value is empty:
        maxRowWithData = row index
Note: The cell index of openpyxl is 1-based!
Documentation: enumerate(iterable, start=0)
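A runnable version of the two pseudocode fragments might look like this (a sketch; ws is an openpyxl worksheet, e.g. wb.active, and we look at column 'C'):
# count the non-empty cells in column 'C'
count = sum(1 for cell in ws['C'] if cell.value is not None)

# remember the last row in column 'C' that holds data (tolerates empty cells in between)
max_row_with_data = 0
for cell in ws['C']:
    if cell.value is not None:
        max_row_with_data = cell.row  # openpyxl rows are 1-based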
Why not just find the length of column 'C'? The result would be the same output: 10.
When you access the column 'C' values, openpyxl gives them to you as a tuple of cells, so just take the length of that tuple, which comes out to 10.
import openpyxl

wb = openpyxl.load_workbook('example.xlsx')
current_sheet = wb['sheet1']
column_c = current_sheet['C']
print(len(column_c))
wb.close()
The accepted answer is not correct: if there is an empty cell in between two cells with values, it will fail. The following is a correct way.
import openpyxl as xl
import os

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
Dir_Name = os.path.join(BASE_DIR, 'Your_Project_Folder_Name_Here')
xl_file_path = os.path.join(Dir_Name, 'Your_Excel_File_Name_Here.xlsx')

wb_obj = xl.load_workbook(xl_file_path)
sheet_obj = wb_obj.active

number_of_rows = sheet_obj.max_row
last_row_index_with_data = 0
while True:
    if sheet_obj.cell(number_of_rows, 1).value is not None:
        last_row_index_with_data = number_of_rows
        break
    else:
        number_of_rows -= 1

print("last row index having values", last_row_index_with_data)
In this way we check from the bottom of the sheet upwards; the first cell we find with a value other than None gives the row index we need.
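Note that the loop above checks column 1 (column 'A'); a hedged variant for column 'C' that also stops cleanly if the whole column is empty:
number_of_rows = sheet_obj.max_row
last_row_index_with_data = 0
while number_of_rows > 0:
    if sheet_obj.cell(row=number_of_rows, column=3).value is not None:  # column 'C'
        last_row_index_with_data = number_of_rows
        break
    number_of_rows -= 1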
I think I just found a way using pandas:
import pandas as pd

# lt is the dataframe containing the data to be loaded into the Excel file
for index, i in enumerate(lt):
    panda_xl_rd = pd.read_excel('file.xlsx', "sheet_Name")  # pandas DataFrame
    # Get the row number of the last record in the column: dropna() removes the
    # NaN values (otherwise we would get the whole sheet's max length), and +2
    # moves to the next empty row right after the last one with data.
    max = len(panda_xl_rd.iloc[:, (col - 1)].dropna()) + 2
    cellref = sheet.cell(row=max + index, column=col)
    cellref.value = i
    del panda_xl_rd

Removing duplicates for a column including rows adjacent and append duplicates to the above

I want to delete duplicates in column D and delete the rows adjacent to where the duplicate existed. I also want to remove the gaps, so the remaining rows append to those above. I have represented this below in a table. The data constantly changes in row size. We have used VBA traditionally, but we are now using Python and have to change this part of the job.
What data does: https://ibb.co/gwh0Hb
Expected result/What I am trying to achieve: https://ibb.co/f08Dnb
The following code removes duplicates and places them in one column; however, the rows beside the duplicates are not deleted and the columns are not appended.
Below code -
import openpyxl

wb1 = openpyxl.load_workbook('C:/Users/Documents/dwa.xlsx')
ws1 = wb1.active  # keep naming convention consistent
wb2 = openpyxl.load_workbook('C:/Users/Documents/123.xlsx')
ws2 = wb2.active  # keep naming convention consistent

col_e = 6  # easier to remember
values = set()  # no duplicates by default; faster 'in' searching
for row in ws1.iter_rows(row_offset=1):  # if you have a header
    if row[col_e].value not in values:
        values.add(row[col_e].value)
    else:
        row[col_e].value = ''
wb2.save('C:/Users/Documents/123.xlsx')
I have attempted to add values.add(row[col_c].value) as well as other column values; however, I am yet to have any success with this.
IIUC, here is a solution using pandas:
import pandas as pd
df = pd.read_excel('remove_duplicates.xlsx')
# Identifying duplicates only by column 'C4'
# Further details https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
df.drop_duplicates(['C4'],keep='first', inplace=True)
The input Excel file looks like this (screenshot omitted):
And the output will be like this (screenshot omitted):
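To persist the result, the de-duplicated frame can be written back out to a new workbook (a small sketch; the output filename here is just a placeholder):
df.to_excel('deduplicated_output.xlsx', index=False)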

How to write into csv using the results generated from a DataFrame in python?

I am reading data from a tsv file into a DataFrame using the pandas module in Python.
df = pandas.DataFrame.from_csv(filename, sep='\t')
The file has around 5000 columns (4999 test parameters and 1 result / output value).
I iterate through the entire tsv file and check if the result value matches the value that is expected. I then write this row inside another csv file.
expected_value = 'some_value'
with open(file_to_write, 'w') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter='\t')
    for index, row in df.iterrows():
        result = row['RESULT']
        if expected_value.lower() in str(result).lower():
            csvwriter.writerow(row)
But in the output csv file the result is not proper, i.e. the individual column values are not going into their respective columns/cells; they are getting appended as rows. How do I write this data correctly to the csv file?
The suggested answers work well; however, I need to check for multiple conditions. I have a list with some values:
vals = ['hello', 'foo', 'bar']
One of the columns, for all the rows, has values that look like 'hello,foo,bar'. I need to do two checks: one, whether any value in the vals list is present in that column; or two, whether the result value matches the expected value. I have written the following code:
df = pd.DataFrame.from_csv(filename, sep='\t')
for index, row in df.iterrows():
    csv_vals = row['COL']
    values = str(csv_vals).split(",")
    if len(set(vals).intersection(set(values))) > 0 or expected_value.lower() in str(row['RESULT_COL']).lower():
        print(row['RESULT_COL'])
You should create a dataframe where you have a column 'RESULT' and one 'EXPECTED'.
Then you can filter the rows where both match and output only these to csv using:
df.loc[df['EXPECTED'] == df['RESULT']].to_csv(filename)
You can filter the values like this:
df[df['RESULT'].str.lower().str.contains(expected_value.lower())].to_csv(filename)
This will work for filtering values that contain your expected_value as you did in your code.
If you want to get exact match you can use:
df.loc[df['Result'].str.lower() == expected_value.lower()].to_csv(filename)
As you suggested in comment, for multiple criteria you will need something like this:
expected_values = [expected_value1, expected_value2, expected_value3]
df[df['Result'].isin(expected_values)]
UPDATE:
And to filter on multiple criteria and to filter desired column:
df.loc[df.isin(vals).any(axis=1) & (df['Result'].str.lower() == expected_value.lower())].to_csv(filename)
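The question's own loop combines the two checks with an OR; a vectorized sketch of that same condition, using the question's column names 'COL' and 'RESULT_COL' (both assumed to exist in df):
in_vals = df['COL'].astype(str).str.split(',').apply(lambda xs: bool(set(xs) & set(vals)))
matches_result = df['RESULT_COL'].astype(str).str.lower() == expected_value.lower()
df[in_vals | matches_result].to_csv(filename, sep='\t', index=False)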
