Data cleanup in Python using an Excel file - python

I'm trying to clean up data from an Excel file in Python.
I have a bunch of columns, but the ones of interest are: dat.columns=['H', 'Test', 'ProductNumber', 'PM', 'THNewHRequired', 'THRealignment', 'THRecommendedRealignement']
Basically, for each row whose product number contains TH, I look at the 'ProductNumber' and 'Test' columns to perform a comparison:
1. if the value in 'Test' for a given row (row 'x', for example) is equal to the left part of a value in the 'ProductNumber' column (a left part of the same length as the value in 'Test'), then I assign to:
. row 'x' of the 'PM' column, the 'PM' value of the corresponding row of the ProductNumber column,
. row 'x' of the 'THRealignment' column, the value 'Yes',
. row 'x' of the 'THRecommendedRealignement' column, the value of column H of the corresponding row of the ProductNumber column.
2. else, I assign the value 'Yes' to row 'x' of the 'THNewHRequired' column.
In other words, for the rows that contain 'TH', the columns THRecommendedRealignement, THRealignment and 'THNewHSKURequired', which are initially empty, have to be filled according to statement 1 or statement 2.
What I want to do is: take row 1 of column Test and compare it with all the rows of ProductNumber, so that if there is a match, I apply statement 1; if not, I apply statement 2. Then I continue with row 2 of column Test, and so on until row n of column Test.
I wrote the code below in a Jupyter notebook, but it doesn't seem to be working. There is no error message; I just don't get the desired result. Normally the columns (THRecommendedRealignement and THRealignment) or ('THNewHSKURequired') should be filled wherever the product number contains 'TH'. But after running the code, the dataframe remains the same, with no changes at all.
for product in dat['ProductNumber']:
    if "|TH" in dat['ProductNumber']:
        j=0
        for test in dat["Test"]:
            k=0
            for product in dat['ProductNumber']:
                if test[j]==product[k]:
                    dat['THRecommendedRealignement'][j]=dat['H'][k]
                    dat['THRealignment'][j]='Yes'
                    dat['PM'][k]=dat['PM'][j]
                else:
                    dat['THNewHSKURequired']='Yes'
                k+=1
            j+=1
input
So if we have this input, the output should be the image below. Note that the highlighted rows are the ones that should have been modified by the code.
Output
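One likely reason nothing changes: `"|TH" in dat['ProductNumber']` tests the index labels of the Series, not its cell values, so the condition is False and the body never runs. Below is a minimal sketch of the logic described above, not a definitive solution. It rests on several assumptions: the sample data is invented, 'TH' is matched as a plain substring, a "match" means another row's ProductNumber starts with the Test value, the first match wins, and the flag column is named 'THNewHRequired' as in the column list.

```python
import pandas as pd

# Hypothetical sample data; the real workbook is not shown in the question.
dat = pd.DataFrame({
    'H': ['h0', 'h1', 'h2'],
    'Test': ['', 'AB12', 'ZZ99'],
    'ProductNumber': ['AB12-001', 'TH-500', 'TH-600'],
    'PM': ['pm0', 'pm1', 'pm2'],
})
for col in ['THNewHRequired', 'THRealignment', 'THRecommendedRealignement']:
    dat[col] = ''  # the target columns start out empty

# Rows whose product number contains 'TH' (checks the values, not the index).
has_th = dat['ProductNumber'].str.contains('TH', regex=False)

for x in dat.index[has_th.to_numpy()]:
    test_value = dat.at[x, 'Test']
    # Other rows whose ProductNumber begins with this row's Test value.
    is_match = dat['ProductNumber'].str.startswith(test_value) & (dat.index != x)
    matches = dat.index[is_match.to_numpy()]
    if test_value and len(matches) > 0:
        k = matches[0]  # assumption: use the first matching row
        dat.at[x, 'PM'] = dat.at[k, 'PM']
        dat.at[x, 'THRealignment'] = 'Yes'
        dat.at[x, 'THRecommendedRealignement'] = dat.at[k, 'H']
    else:
        dat.at[x, 'THNewHRequired'] = 'Yes'

print(dat)
```

Writing through `dat.at[row, column]` also avoids the chained assignment `dat[column][row] = ...`, which may not modify the original dataframe.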

Related

python counting rows with missing values doesn't work

I don't know why, but the code to count rows with missing values doesn't work.
Can somebody please help?
excel file showing data
code in IDE
In Excel, there are 156 rows with missing values in total, but I can't get this number in Python
using the code below:
(kidney_df.isna().sum(axis=1) > 0).sum()
count=0
for i in kidney_df.isnull().sum(axis=1):
    if i>0:
        count=count+1
kidney_df.isna().sum().sum()
kidney_df is a whole dataframe. Do you want to count every empty cell, or just the empty cells in one column? Based on the formula in your image, it seems you are interested only in column 'Z'. You can select it with .iloc[] (index location) or by column name (not visible in your image), like so:
kidney_df.iloc[:, 26].isnull().sum()
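Explanation:
.iloc[] # index location (selection by integer position)
: # all rows, from the first to the last (note that '0:-1' would leave out the last row)
26 # the zero-based column position; if Excel column 'A' is the first dataframe column, column 'Z' is position 25, so adjust this to your data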
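To make the difference between these counts concrete, here is a small sketch on made-up data (the column names and values are assumptions, not the asker's file). It shows rows with at least one missing value, the total number of missing cells, and the missing cells in a single column selected by position.

```python
import numpy as np
import pandas as pd

# Invented stand-in for kidney_df; the real file is not shown.
kidney_df = pd.DataFrame({
    'age': [48, np.nan, 62, 51],
    'bp': [80, 50, np.nan, 70],
    'sg': [1.02, np.nan, 1.01, 1.005],
})

# Rows that contain at least one missing value.
rows_with_missing = int(kidney_df.isna().any(axis=1).sum())

# Every missing cell in the whole dataframe.
total_missing_cells = int(kidney_df.isna().sum().sum())

# Missing cells in one column, selected by zero-based position (here the last).
missing_in_last_col = int(kidney_df.iloc[:, -1].isna().sum())

print(rows_with_missing, total_missing_cells, missing_in_last_col)
```

If the row count still differs from Excel, one possible cause (an assumption, since the data isn't visible) is that some empty-looking cells hold placeholder text such as '?' or spaces, which isna() does not treat as missing.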

How do I search a pandas dataframe to get the row with a cell matching a specified value?

I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion and a lot of other columns. I want to find all the rows in df_everything with a name value matching the name values from df_selection_names so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and to search for values within a row, including:
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
    # get the cell value using at() function
    # in 'name' column and i-1 row
    sel = df_selection_names.at[i, 'name']
    # this line finds the row 'sel' in df_everything
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be way off, but I'm feeling pretty confused about everything at this point!
https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin
df_everything[df_everything['name'].isin(df_selection_names["name"])]
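As a rough illustration of the isin() approach, the sketch below builds two small dataframes and prints one sentence per matching row. The suggestion values and the extra 'raw' row are invented for the example. Note that df_selection_names is given the repeated index label 0 shown in the question; with that index, .at[0, 'name'] returns a Series rather than a single string, which is a plausible (unconfirmed) source of the ValueErrors.

```python
import pandas as pd

# Selection dataframe, with the duplicated index label from the question.
df_selection_names = pd.DataFrame(
    {'name': ['fatty red meat, like prime rib', 'grilled']},
    index=[0, 0],
)

# Hypothetical contents for df_everything; only the column names come from the question.
df_everything = pd.DataFrame({
    'name': ['grilled', 'raw', 'fatty red meat, like prime rib'],
    'suggestion': ['Zinfandel', 'Sauvignon Blanc', 'Cabernet Sauvignon'],
})

# Keep the rows whose name appears anywhere in the selection.
matches = df_everything[df_everything['name'].isin(df_selection_names['name'])]

# One sentence per matching name/suggestion pair, in df_everything order.
lines = [
    f"{suggestion} is suggested for {name}"
    for name, suggestion in zip(matches['name'], matches['suggestion'])
]
for line in lines:
    print(line)
```

Because isin() compares values only, the index of df_selection_names does not matter here.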

iterrows() loop is only reading last value and only modifying first row

I have a dataframe test. My goal is to search in the column t1 for specific strings, and if it matches exactly a specific string, put that string in the next column over called t1_selected. Only thing is, I can't get iterrows() to go over the entire dataframe, and to report results in respective rows.
for index, row in test.iterrows():
    if any(['ABCD_T1w_MPR_vNav_passive' in row['t1']]):
        #x = ast.literal_eval(row['t1'])
        test.loc[i, 't1_selected'] = str(['ABCD_T1w_MPR_vNav_passive'])
I am only trying to get ABCD_T1w_MPR_vNav_passive into the 4th row under t1_selected, while all the other rows should say "not found". The first entry in t1_selected comes from the last row under t1, which I didn't include in the screenshot because the dataframe has over 200 rows.
I tried to initialize an empty list to append output of
import ast
x = ast.literal_eval(row['t1'])
to see if I can put x in there, but the same issue occurred.
Is there anything I am missing?
for index, row in test.iterrows():
    if any(['ABCD_T1w_MPR_vNav_passive' in row['t1']]):
        #x = ast.literal_eval(row['t1'])
        test.loc[index, 't1_selected'] = str(['ABCD_T1w_MPR_vNav_passive'])
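Here index, the loop variable from iterrows(), is the label of the row being written to. i was never updated inside the loop, so every assignment went to the same row.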
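The same result can probably be reached without a loop. The sketch below is one way to do it, assuming the t1 column holds strings (as the ast.literal_eval attempt suggests) and that a substring match is good enough; the sample rows are invented.

```python
import pandas as pd

target = 'ABCD_T1w_MPR_vNav_passive'

# Made-up rows; each t1 cell is the string form of a list.
test = pd.DataFrame({
    't1': [
        "['other_scan']",
        "['ABCD_T1w_MPR_vNav_passive', 'x']",
        "['another']",
    ]
})

# Boolean mask: True where the target text occurs in the t1 string.
mask = test['t1'].str.contains(target, regex=False)

# Default every row to 'not found', then overwrite only the matching rows.
test['t1_selected'] = 'not found'
test.loc[mask, 't1_selected'] = str([target])

print(test)
```

Since the mask is aligned with the dataframe's own index, each result lands in its own row and no loop counter is needed.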

Finding the column name of row which has highest value while comprehending the row based on max column value

I'm pretty new to Python. I'm trying to define a function for the set of data below.
Sample data
I'm first looking for the max value in cell 3. Based on this max value, I'm checking which of the last 8 columns of the data has the highest value in the same row.
For example, as per the given data, the max value in cell 3 is 1470758. Now I'm checking which of the columns from cell 9 to cell 16 has the highest value in the row holding this max value. In the case of this sample data, the answer should be cell 10, with a value of 7201. So the output should be cell 10.
Here's my code:
def winner(filename):
    data=pd.read_csv(filename, sep=',')
    maxC=data.npop.max()
    while data.loc[data['npop']]==maxC:
        data3=data.iloc[:,-8:].max()
        #missing code
winner("demo.csv")
Please help. I didn't understand what I should be writing in the missing code section.
A line-by-line explanation of the code is given in the comments.
Try this:
import pandas as pd

def winner(filename):
    df = pd.read_csv(filename, sep=',')  # Read the csv into a dataframe.
    column_names = list(df.columns.values)  # Get the list of column names.
    max_col3_index = df['col3'].idxmax()  # Index label of the row with the max value in the `col3` column.
    row_data = df.loc[max_col3_index, column_names[-8:]]  # Series of the last 8 columns' values in that row.
    final_column_name = row_data.idxmax()  # Name of the column holding the max value in that series.
    print(final_column_name)
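To try the idea without a CSV file, here is a small sketch on an in-memory dataframe. The column names c1 to c3 and most of the values are made up; only 1470758 and 7201 come from the question. This variant takes a dataframe and returns the column name instead of printing it, which is a design choice for the example and not part of the original answer.

```python
import pandas as pd

# Invented data: 'npop' is the key column, the last three columns are the candidates.
df = pd.DataFrame({
    'npop': [1200, 1470758, 900],
    'c1': [10, 500, 30],
    'c2': [40, 7201, 5],
    'c3': [70, 80, 9000],
})

def winner(frame, key='npop', n_last=3):
    # Label of the row where the key column is largest.
    top_row = frame[key].idxmax()
    # Within that row, the name of the largest of the last n_last columns.
    return frame.loc[top_row, frame.columns[-n_last:]].idxmax()

result = winner(df)
print(result)
```

Although c3 holds the largest value overall (9000), only the row with the maximum npop is considered, so the result is c2.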

Compare values in a row and write result in new column

My dataset looks like this:
Paste_Values AB_IDs AC_IDs AD_IDs
AE-1001-4 AB-1001-0 AC-1001-3 AD-1001-2
AE-1964-7 AB-1964-2 AC-1964-7 AD-1964-1
AE-2211-1 AB-2211-1 AC-2211-3 AD-2211-2
AE-2182-4 AB-2182-6 AC-2182-7 AD-2182-5
I need to compare each value in the Paste_Values column with the other three values in the same row.
For example:
AE-1001-4 is split into two parts, AE and 1001-4; we need to check whether 1001-4 is present in the other columns or not.
If it's not present, we need to create a new column and put the same AE-1001-4 in it.
If 1001-4 matches a value in the other columns, we need to change it into 'AE-1001-5' and put that in the new column.
After:
If there is no match, I need to write the value of Paste_Values as is in a newly created column named new_paste_value.
If there is a match (same value) in other columns within the same row, then I need to change the last digit of the value from the Paste_Values column, so that the whole value is not the same as any other whole value in the row, and write that newly generated value in the new_paste_value column.
I need to do this with every row in the data frame.
So the result should look like:
Paste_Values AB_IDs AC_IDs AD_IDs new_paste_value
AE-1001-4 AB-1001-0 AC-1001-3 AD-1001-2 AE-1001-4
AE-1964-7 AB-1964-2 AC-1964-7 AD-1964-1 AE-1964-3
AE-2211-1 AB-2211-1 AC-2211-3 AD-2211-2 AE-2211-4
AE-2182-4 AB-2182-6 AC-2182-4 AD-2182-5 AE-2182-1
How can I do it?
Start by defining a function to be applied to each row of your DataFrame:
def fn(row):
    rr = row.copy()
    v1 = rr.pop('Paste_Values') # First value
    if not rr.str.contains(f'{v1[3:]}$').any():
        return v1 # No match
    v1a = v1[3:-1] # Central part of v1
    for ch in '1234567890':
        if not rr.str.contains(v1a + ch + '$').any():
            return v1[:-1] + ch
    return '????' # No candidate found
A bit of explanation:
The row argument is actually a Series, with index values taken from
column names.
So rr.pop('Paste_Values') removes the first value, which is saved in v1
and the rest remains in rr.
Then v1[3:] extracts the "rest" of v1 (without "AE-")
and str.contains checks each element of rr whether it
contains this string at the end position.
With this explanation, the rest of this function should be quite
understandable. If not, execute each individual instruction and
print their results.
And the only thing to do is to apply this function to your DataFrame,
substituting the result to a new column:
df['new_paste_value'] = df.apply(fn, axis=1)
To run a test, I created the following DataFrame:
df = pd.DataFrame(data=[
    ['AE-1001-4', 'AB-1001-0', 'AC-1001-3', 'AD-1001-2'],
    ['AE-1964-7', 'AB-1964-2', 'AC-1964-7', 'AD-1964-1'],
    ['AE-2211-1', 'AB-2211-1', 'AC-2211-3', 'AD-2211-2'],
    ['AE-2182-4', 'AB-2182-6', 'AC-2182-4', 'AD-2182-5']],
    columns=['Paste_Values', 'AB_IDs', 'AC_IDs', 'AD_IDs'])
I received no error on this data. Perform a test on the above data.
Maybe the source of your error is in some other place?
Maybe your DataFrame contains also other (float) columns,
which you didn't include in your question.
If this is the case, run my function on a copy of your DataFrame,
with these "other" columns removed.
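One way to act on that suggestion, sketched under assumptions: the float column named score is invented to stand in for the unknown extra columns, and fn is rewritten here with str.endswith instead of a regular expression, which should behave the same for these IDs. The function is applied only to the four ID columns, so other columns never reach the string operations.

```python
import pandas as pd

def fn(row):
    rr = row.copy()
    v1 = rr.pop('Paste_Values')           # value to check, e.g. 'AE-1001-4'
    if not rr.str.endswith(v1[3:]).any():
        return v1                         # no other ID ends with '1001-4'
    stem = v1[3:-1]                       # e.g. '1001-'
    for ch in '1234567890':
        if not rr.str.endswith(stem + ch).any():
            return v1[:-1] + ch           # first last digit not used in the row
    return '????'                         # every digit is already taken

df = pd.DataFrame(data=[
    ['AE-1001-4', 'AB-1001-0', 'AC-1001-3', 'AD-1001-2', 0.5],
    ['AE-1964-7', 'AB-1964-2', 'AC-1964-7', 'AD-1964-1', 1.5],
    ['AE-2211-1', 'AB-2211-1', 'AC-2211-3', 'AD-2211-2', 2.5],
    ['AE-2182-4', 'AB-2182-6', 'AC-2182-4', 'AD-2182-5', 3.5]],
    columns=['Paste_Values', 'AB_IDs', 'AC_IDs', 'AD_IDs', 'score'])

# Apply the function to the ID columns only; 'score' (hypothetical) is left out.
id_cols = ['Paste_Values', 'AB_IDs', 'AC_IDs', 'AD_IDs']
df['new_paste_value'] = df[id_cols].apply(fn, axis=1)

print(df)
```

Selecting the columns first keeps the original DataFrame intact while still writing the result back to it as a new column.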
