Counting rows with missing values doesn't work - python

I don't know why, but the code to calculate rows with missing values doesn't work.
Can somebody please help?
[screenshot: Excel file showing the data]
[screenshot: code in the IDE]
In Excel, the rows that have missing values total 156, but I can't reproduce that number in Python using the code below:

(kidney_df.isna().sum(axis=1) > 0).sum()

count = 0
for i in kidney_df.isnull().sum(axis=1):
    if i > 0:
        count = count + 1

kidney_df.isna().sum().sum()

kidney_df is a whole dataframe; do you want to count every empty cell, or just the empty cells in one column? Based on the formula in your image, it seems you are interested only in column 'Z'. You can specify that by using .iloc[] (index location) or by specifying the column name (not visible in your image), like so:
kidney_df.iloc[:, 26].isnull().sum()
Explanation:
.iloc[]  # index location
:        # all rows, i.e. from the first row to the last
26       # the column index of column 'Z' in the sheet (note that .iloc is 0-based)
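For completeness, the two counts this thread keeps mixing up, rows with at least one missing value versus total missing cells, can be seen side by side on a toy frame (the data below is invented for illustration, not the asker's kidney_df):

```python
import numpy as np
import pandas as pd

# invented toy frame standing in for kidney_df
kidney_df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [np.nan, np.nan, 6.0, 7.0],
})

# rows containing at least one missing value
# (this is what an Excel per-row count like the asker's 156 measures)
rows_with_na = int(kidney_df.isna().any(axis=1).sum())

# total number of missing cells, summed over the whole frame
total_na_cells = int(kidney_df.isna().sum().sum())

print(rows_with_na, total_na_cells)  # 3 4
```

If the Python row count still differs from the Excel figure, it is worth checking whether the Excel formula also treats empty strings or placeholder values as missing, since pandas only counts true NaN/None cells.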

Related

How do I search a pandas dataframe to get the row with a cell matching a specified value?

I have a dataframe that might look like this:
print(df_selection_names)
                             name
0  fatty red meat, like prime rib
0                         grilled
I have another dataframe, df_everything, with columns called name, suggestion and a lot of other columns. I want to find all the rows in df_everything with a name value matching the name values from df_selection_names so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and searching for values within a row including
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
    # get the cell value using the at() function
    # in the 'name' column and row i
    sel = df_selection_names.at[i, 'name']
    # this line finds the rows in df_everything whose 'name' equals sel
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be
way off, but I'm feeling pretty confused about everything at this point!
pandas.Series.isin does this in one vectorized step, with no loop at all:
https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin

df_everything[df_everything['name'].isin(df_selection_names["name"])]
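As a sketch of how isin ties the two frames together (the frames below are invented stand-ins for the question's data, including made-up suggestion values):

```python
import pandas as pd

# invented stand-ins for the question's frames
df_selection_names = pd.DataFrame(
    {"name": ["fatty red meat, like prime rib", "grilled"]}
)
df_everything = pd.DataFrame({
    "name": ["grilled", "steamed", "fatty red meat, like prime rib"],
    "suggestion": ["a smoky rub", "a light sauce", "horseradish"],
})

# one vectorized membership test instead of a row-by-row loop
matches = df_everything[df_everything["name"].isin(df_selection_names["name"])]

# print one "suggestion is suggested for name" line per match
for _, r in matches.iterrows():
    print(f"{r['suggestion']} is suggested for {r['name']}")
```

This also sidesteps the ValueError risk of the original loop: df_selection_names has a duplicated index (both rows are labelled 0), so `.at[i, 'name']` cannot address them one at a time, while isin never touches the index.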

Data cleanup in Python using an Excel file

I'm trying to do some cleanup in Python using an Excel file.
I have a bunch of columns, but those of interest are:

dat.columns = ['H', 'Test', 'ProductNumber', 'PM', 'THNewHRequired', 'THRealignment', 'THRecommendedRealignement']
Basically, for each row whose product number contains 'TH', I look at the columns 'ProductNumber' and 'Test' to perform a comparison:
1. If the value of a row (row 'x', for example) in 'Test' is equal to a value in the 'ProductNumber' column (the left part should have the same length as the row in 'Test'), then I assign:
. to row 'x' of the 'PM' column, the 'PM' value of the corresponding 'ProductNumber' row;
. to row 'x' of the 'THRealignment' column, the value 'Yes';
. to row 'x' of the 'THRecommendedRealignement' column, the value of column 'H' of the corresponding 'ProductNumber' row.
2. Otherwise, I assign the value 'Yes' to row 'x' of the 'THNewHRequired' column.
In other words, for the rows that contain 'TH', the columns 'THRecommendedRealignement', 'THRealignment' and 'THNewHSKURequired', which are initially empty, have to be filled according to statement 1 or statement 2.
What I want to do is: if I have row 1 in column 'Test', I compare that row 1 with all the rows of 'ProductNumber'; if there is a match, I apply statement 1, otherwise statement 2. Then I move on to row 2 of column 'Test', and so on until row n.
I wrote the code below in a Jupyter notebook, but it doesn't seem to be working. There is no error message; I just don't get the desired result. Normally, the columns 'THRecommendedRealignement' and 'THRealignment' (or 'THNewHSKURequired') should be filled wherever the product number contains 'TH', but after running the code the dataframe remains the same, with no changes at all.
for product in dat['ProductNumber']:
    if "|TH" in dat['ProductNumber']:
        j = 0
        for test in dat["Test"]:
            k = 0
            for product in dat['ProductNumber']:
                if test[j] == product[k]:
                    dat['THRecommendedRealignement'][j] = dat['H'][k]
                    dat['THRealignment'][j] = 'Yes'
                    dat['PM'][k] = dat['PM'][j]
                else:
                    dat['THNewHSKURequired'] = 'Yes'
                k += 1
            j += 1
[screenshot: input data]
So given this input, the output should be the image below. Note that the highlighted rows are the ones that should have been modified by the code.
[screenshot: expected output]
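No answer is shown above, but fill rules like the ones described can usually be expressed without nested loops, using lookup Series keyed on ProductNumber plus boolean masks. The frame below is invented, the column names follow the question, and exact string equality between 'Test' and 'ProductNumber' is a simplifying assumption (the question hints at a prefix match), so treat this as a sketch rather than a drop-in fix:

```python
import pandas as pd

# invented sample data; column names follow the question
dat = pd.DataFrame({
    "H": ["H1", "H2", "H3"],
    "Test": ["TH-200", "XX-999", "AB-300"],
    "ProductNumber": ["TH-100", "TH-200", "AB-300"],
    "PM": ["Alice", "Bob", "Carol"],
    "THNewHSKURequired": ["", "", ""],
    "THRealignment": ["", "", ""],
    "THRecommendedRealignement": ["", "", ""],
})

# lookup tables keyed by ProductNumber
h_by_product = dat.set_index("ProductNumber")["H"]
pm_by_product = dat.set_index("ProductNumber")["PM"]

# rows whose product number contains 'TH'...
is_th = dat["ProductNumber"].str.contains("TH", na=False)
# ...and whose Test value matches some ProductNumber (statement 1)
matched = is_th & dat["Test"].isin(dat["ProductNumber"])

dat.loc[matched, "THRealignment"] = "Yes"
dat.loc[matched, "THRecommendedRealignement"] = dat.loc[matched, "Test"].map(h_by_product)
dat.loc[matched, "PM"] = dat.loc[matched, "Test"].map(pm_by_product)
# statement 2: TH rows with no match get the flag instead
dat.loc[is_th & ~matched, "THNewHSKURequired"] = "Yes"
```

Besides avoiding the O(n²) loop, this also avoids the chained-assignment pattern (`dat['col'][j] = ...`) in the original code, which pandas does not guarantee will write back into the dataframe.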

iterrows() loop is only reading last value and only modifying first row

I have a dataframe test. My goal is to search the column t1 for specific strings and, if one matches a specific string exactly, put that string in the next column over, called t1_selected. The only thing is, I can't get iterrows() to go over the entire dataframe and report the results in the respective rows.
for index, row in test.iterrows():
    if any(['ABCD_T1w_MPR_vNav_passive' in row['t1']]):
        #x = ast.literal_eval(row['t1'])
        test.loc[i, 't1_selected'] = str(['ABCD_T1w_MPR_vNav_passive'])
I am only trying to get ABCD_T1w_MPR_vNav_passive into the 4th row under t1_selected, while all the other rows get 'not found'. The first entry in t1_selected is from the last row under t1, which I didn't include in the screenshot because the dataframe has over 200 rows.
I tried to initialize an empty list and append the output of

import ast
x = ast.literal_eval(row['t1'])

to see if I could put x in there, but the same issue occurred.
Is there anything I am missing?
for index, row in test.iterrows():
    if any(['ABCD_T1w_MPR_vNav_passive' in row['t1']]):
        #x = ast.literal_eval(row['t1'])
        test.loc[index, 't1_selected'] = str(['ABCD_T1w_MPR_vNav_passive'])

Here index is the row being written to. With i, the target row never changed.
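The same result can usually be had without iterrows() at all, via a vectorized numpy.where. The toy frame below is made up, and str.contains is used as a stand-in for the question's matching test:

```python
import numpy as np
import pandas as pd

# invented toy frame standing in for the question's `test` dataframe
test = pd.DataFrame(
    {"t1": ["foo", "bar", "ABCD_T1w_MPR_vNav_passive_extra", "baz"]}
)

target = "ABCD_T1w_MPR_vNav_passive"

# one pass over the column: matching rows get the string, the rest 'not found'
test["t1_selected"] = np.where(
    test["t1"].str.contains(target, regex=False),
    str([target]),
    "not found",
)
print(test["t1_selected"].tolist())
```

Because every row is assigned in one operation, there is no loop index to get wrong, which is exactly the bug (`i` vs `index`) the accepted fix addresses.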

How to return the string of a header based on the max value of a cell in Openpyxl

Good morning guys! Quick question about openpyxl:
I am working with Python, editing an .xlsx document and generating various stats. Part of my script generates the max values over a cell range:
temp_list = []
temp_max = []
for row in sheet.iter_rows(min_row=3, min_col=10, max_row=508, max_col=13):
    print(row)
    for cell in row:
        temp_list.append(cell.value)
    print(temp_list)
    temp_max.append(max(temp_list))
    temp_list = []
I would also like to print the header string of the column that contains the max value for the desired cell range. My data structure looks like this:
[screenshot: data structure]
Any idea on how to do so?
Thanks!
This seems like a typical INDEX/MATCH Excel problem.
Have you tried retrieving the index for the max value in each temp_list?
You can use a function like numpy.argmax() to get the index of the max value within temp_list, then use that index to locate the header and append its string to a new list called, say, max_headers, which will hold all the header strings in order of appearance.
It would look something like this:

for cell in row:
    temp_list.append(cell.value)
i_max = np.argmax(temp_list)
max_headers.append(sheet.cell(row=1, column=10 + i_max).value)

And so on and so forth. Of course, for that to work, numpy has to be imported and the max_headers list defined beforehand. Note that np.argmax() returns a 0-based position while openpyxl columns are 1-based, so the column argument needs the min_col offset (10 in your snippet).
First, thanks Bernardo for the hint. I found a decently working solution but still have a little issue; perhaps someone can be of assistance.
Let me amend my initial statement: here is the code I am working with now:
temp_list = []
headers_list = []
# Index starts at 1 // here we set the rows/columns containing the data to be analyzed
for row in sheet.iter_rows(min_row=3, min_col=27, max_row=508, max_col=32):
    for cell in row:
        temp_list.append(cell.value)
    for cell in row:
        if cell.value == max(temp_list):
            print(str(cell.column))
            print(cell.value)
            print(sheet.cell(row=1, column=cell.column).value)
            headers_list.append(sheet.cell(row=1, column=cell.column).value)
        else:
            print('keep going.')
    temp_list = []
This works, but with one little issue: if a row contains its max value twice (e.g. 25, 9, 25, 8, 9), the loop prints two headers instead of one. My question is:
how can I get this loop to take into account only the first match of the max value in a row?
You probably want something like this:
headers = [c for c in next(ws.iter_rows(min_col=27, max_col=32, min_row=1, max_row=1, values_only=True))]
for row in ws.iter_rows(min_row=3, min_col=27, max_row=508, max_col=32, values_only=True):
    mx = max(row)
    idx = row.index(mx)
    col = headers[idx]
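Because values_only=True yields plain tuples, tuple.index() naturally returns only the first occurrence of the max, which is exactly the tie-breaking the follow-up question asked for. A minimal stand-alone illustration (the values mirror the 25, 9, 25, 8, 9 example; the header names are invented):

```python
# stand-ins for one header row and one data row from iter_rows(values_only=True)
headers = ("A", "B", "C", "D", "E")
row = (25, 9, 25, 8, 9)  # the max value 25 appears twice

mx = max(row)
idx = row.index(mx)   # index of the FIRST occurrence only
col = headers[idx]
print(col)  # A
```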

How to locate a row in a dataframe without headers

I noticed that when using .loc in pandas dataframe, it not only finds the row of data I am looking for but also includes the header column names of the dataframe I am searching within.
So when I try to append the .loc row of data, it includes the data + column headers - I don't want any column headers!
##1st dataframe
df_futures.head(1)
    date      max   min
    19990101  2000  1900
##2nd dataframe
df_cash.head(1)
    date$     max$  min$
    19990101  50    40
##if date is found in dataframe 2, I will collect the row of data
data_to_track = []
for ii in range(len(df_futures['date'])):
    ##date I will try to find in df2
    date_to_find = df_futures['date'][ii]
    ##append the row of data to my list
    data_to_track.append(df_cash.loc[df_cash['Date$'] == date_to_find])
I want the for loop to return just 19990101 50 40.
It currently returns 0 19990101 50 40, plus the headers date$, max$, min$.
I agree with the other comments regarding the clarity of the question. However, if what you want is just a string that contains a particular row's data, then you could use the to_string() method of pandas.
In your case,
Instead of this:
df_cash.loc[df_cash['Date$'] == date_to_find]
You could get a string that includes only the row data:
df_cash[df_cash['Date$'] == date_to_find].to_string(header=None)
Also notice that I dropped the .loc part, which outputs the same result.
If your dataframe has multiple columns and you don't want them joined into one string (that may bring data-type issues and is potentially problematic if you want to separate them later on), you could use the list() constructor instead:
list(df_cash[df_cash['Date$'] == date_to_find].iloc[0])
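As a quick check of the two suggestions side by side, here is a small sketch with an invented df_cash (column names follow the question):

```python
import pandas as pd

# invented stand-in for the question's df_cash
df_cash = pd.DataFrame({
    "Date$": [19990101, 19990102],
    "max$": [50, 55],
    "min$": [40, 42],
})

date_to_find = 19990101

# as a single string, with the headers suppressed
as_string = df_cash[df_cash["Date$"] == date_to_find].to_string(header=None)

# as a list, preserving the individual values for later use
row_values = list(df_cash[df_cash["Date$"] == date_to_find].iloc[0])
print(row_values)  # [19990101, 50, 40]
```

The string form still carries the index label (the leading 0 the asker saw); passing index=False to to_string() as well would drop that too, while the list form sidesteps the whole formatting question.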
