How to delete specific values from a column in a dataset (Python)?

I have a data set as below:
I want to remove 'Undecided' from my ['Grad Intention'] column. For this, I created a copy of the DataFrame and used the following code:
df_copy=df_copy.drop([df_copy['Grad Intention'] =='Undecided'], axis=1)
However, this is giving me an error.
How can I remove the row with 'Undecided'? Also, what's wrong with my code?

You could simply use:
df = df[df['Grad Intention'] != 'Undecided']
or
df.drop(df[df['Grad Intention'] == 'Undecided'].index, inplace = True)
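As for what goes wrong in the original call: drop expects index or column labels, whereas [df_copy['Grad Intention'] == 'Undecided'] is a list holding a boolean Series, and axis=1 tells pandas to look among the column labels rather than the rows. A minimal sketch of both working options follows; the sample data and the Name column are assumptions, since the original dataset is not shown.

```python
import pandas as pd

# Hypothetical sample data; only the 'Grad Intention' column name comes from the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Grad Intention": ["Yes", "Undecided", "No", "Undecided"],
})

# Option 1: keep only the rows whose value is not 'Undecided'.
filtered = df[df["Grad Intention"] != "Undecided"]

# Option 2: drop the matching rows by their index labels (rows are the default axis).
dropped = df.drop(df[df["Grad Intention"] == "Undecided"].index)

print(filtered)
```

Both options return a new DataFrame here and leave df untouched, which avoids the need for inplace=True.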

Related

Changing column values for a value in an adjacent column in the same dataframe using Python

I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that in column "FBgn", there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with FBgn values provided in the adjacent column called "## FlyBase_FBgn". However, I want to keep the FBgn values in column "FBgn". Maybe keep in mind that I am showing only a portion of the dataframe (reality: 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!

With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
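To see the masked assignment at work, here is a small sketch on made-up rows shaped like the ones described in the question (the ID values are invented for illustration). It restricts the right-hand side to the same mask, which makes the intent explicit, although pandas would also align the full column by index.

```python
import pandas as pd

# Made-up rows; only the two column names come from the question.
df = pd.DataFrame({
    "FBgn": ["FBtr0000001", "FBgn0000002", "FBtr0000003"],
    "## FlyBase_FBgn": ["FBgn0000011", "FBgn0000002", "FBgn0000033"],
})

# True for the rows whose "FBgn" value contains "FBtr".
mask = df["FBgn"].str.contains("FBtr")

# Overwrite only those rows with the value from the adjacent column.
df.loc[mask, "FBgn"] = df.loc[mask, "## FlyBase_FBgn"]

print(df)
```

If the column may hold missing values, str.contains("FBtr", na=False) keeps the mask strictly boolean.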

Welcome to Stack Overflow. Next time, please provide more information, including your code; it is always helpful.
Please see the code below; I think you need something similar.
import pandas as pd
#ignore the dict1, I just wanted to recreate your df
dict1= {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'], "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1) #recreating your dataframe
#print df
print(df)
#function to replace the values
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df['FBgn'][i]:
            #assign through .loc to avoid chained assignment
            df.loc[i, 'FBgn'] = df['FBtr'][i]
    return df
df = replace_values(df)
#print new df
print(df)

How to access the second to last row of a csv file using python Pandas?

Want to know if I can access the second to last row of this csv file?
Am able to access the very last using:
pd.DataFrame(file1.iloc[-1:,:].values)
But want to know how I can access the one right before the last?
Here is the code I have so far:
import pandas as pd
import csv
url1 = r"https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/country_data/Austria.csv"
file1 = pd.read_csv(url1)
df1 = pd.DataFrame(file1.iloc[:,:].values)
df1 = pd.DataFrame(file1.iloc[-1:,:].values)
Austria_date = df1.iloc[:,1]
Austria_cum = df1.iloc[:, 4].map('{:,}'.format)
if ( Austria_cum.iloc[0] == 'nan' ):
Essentially, I am checking whether the value in that column of the last row is 'nan', which it is; in that case I want to get the data from the row right before the last. How would this be done?
Thank you

As simple as that :
df1.iloc[-2,:]

To access the data frame at any index you can use the following
df.iloc[i]
For your ask you need to set the index as -2, which would look like this:
df.iloc[-2]

Just use negative indices like in your example, but pull out the second to last with -2 instead of the last:
df.iloc[-2,:]
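Putting the answers together with the NaN check from the question, a sketch could look like the following. The small DataFrame stands in for the downloaded CSV, and its column names are assumptions. pd.isna tests the value directly, so there is no need to format it and compare against the string 'nan'.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the downloaded CSV; column names are assumptions.
df = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-02", "2021-03-03"],
    "total_vaccinations": [1000.0, 1200.0, np.nan],
})

second_to_last = df.iloc[-2]  # a Series holding the second-to-last row

# Fall back to the previous row only when the last value is missing.
last = df.iloc[-1]
row = second_to_last if pd.isna(last["total_vaccinations"]) else last

# Alternative: take the last row that actually has a value in that column.
last_valid = df.dropna(subset=["total_vaccinations"]).iloc[-1]

print(row)
```

The dropna variant also covers the case where several trailing rows are empty.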

Making a new column based on 2 other columns

I am trying to calculate a new column labeled in the code as "Sulphide-S(calc)-C_%S". This column can be calculated from one of two options (see below in the code). The two source columns won't both be filled at the same time, so I want to calculate from whichever column has data present. Presently I have this, but the second equation overwrites the first.
df["Sulphide-S(calc)-C_%S"] = df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"]
df.head()
df["Sulphide-S(calc)-C_%S"] = df["Total-S_%S"]- df["Sulphate-S_%S"]
df.head()

You can use the apply function in pandas to create a new column based on other columns; it returns a Series that you can add to your original dataframe. Without knowing what your dataframe looks like, the following code assumes that an empty cell is stored as NaN; adjust the condition if your data marks missing values differently.
import pandas as pd
def create_sulfide_col(row):
    # use whichever sulphate column has a value in this row
    if pd.notna(row["Sulphate-S(HCL Leachable)_%S"]):
        val = row["Total-S_%S"] - row["Sulphate-S(HCL Leachable)_%S"]
    else:
        val = row["Total-S_%S"] - row["Sulphate-S_%S"]
    return val
df["Sulphide-S(calc)-C_%S"] = df.apply(create_sulfide_col, axis='columns')

If I'm understanding what you're saying correctly, the second equation overwrites the first because they have the same column name. Try changing the column name in one or both of the "Sulphide-S(calc)-C_%S" to something else like "Sulphide-S(calc)-C_%S_A" and "Sulphide-S(calc)-C_%S_B":
df["Sulphide-S(calc)-C_%S_A"] = df["Total-S_%S"] - df["Sulphate-S(HCL Leachable)_%S"]
df.head()
df["Sulphide-S(calc)-C_%S_B"] = df["Total-S_%S"]- df["Sulphate-S_%S"]
df.head()
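If the goal is a single result column, one possible vectorized sketch is to merge the two sulphate columns first and subtract once. This assumes missing entries are stored as NaN and that each row has only one of the two columns filled; the numbers below are made up.

```python
import numpy as np
import pandas as pd

# Made-up values; in each row only one of the two sulphate columns is filled.
df = pd.DataFrame({
    "Total-S_%S": [2.0, 3.0, 5.0],
    "Sulphate-S(HCL Leachable)_%S": [0.5, np.nan, 1.0],
    "Sulphate-S_%S": [np.nan, 1.0, np.nan],
})

# Take the HCL-leachable value where present, otherwise the plain sulphate value.
sulphate = df["Sulphate-S(HCL Leachable)_%S"].fillna(df["Sulphate-S_%S"])

df["Sulphide-S(calc)-C_%S"] = df["Total-S_%S"] - sulphate

print(df)
```

Rows where both sulphate columns are empty stay NaN in the result, which makes them easy to spot afterwards.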

Pandas Dataframe subset not working as expected

This seemingly simple exercise is throwing me off track; I'm sure it's something simple that I'm overlooking.
Let's say I have a dataframe
datas = pd.DataFrame({'age':[10,20,30],
                      'name':['John','Mark','Lisa']})
I now want to subset the dataframe by the name 'Mark' so I did:
if (datas['name']=='Mark').any():
    datas.loc[datas['name'] == 'Mark']
else:
    print('no')
Expected result is
age name
20 Mark
but I get the original dataframe back again, please assist.
I've looked at several posts but none seems to help.
Posts example I looked at: Check if string is in a pandas dataframe

I think you need to assign the result back if you want to overwrite the original DataFrame with the subset:
datas = datas.loc[datas['name'] == 'Mark']
Or assign it to a new variable, e.g. df1:
df1 = datas.loc[datas['name'] == 'Mark']
If you then process the data further and assign the output to a new variable such as df1, use DataFrame.copy to prevent a SettingWithCopyWarning:
df1 = datas.loc[datas['name'] == 'Mark'].copy()
Without the copy, if you modify values in df1 later, you will find that the modifications do not propagate back to the original data (datas), and that pandas raises a warning.
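A short sketch of the copy-based subset, reusing the dataframe from the question; the new age value 21 is arbitrary and only there to show that the original stays unchanged.

```python
import pandas as pd

datas = pd.DataFrame({"age": [10, 20, 30], "name": ["John", "Mark", "Lisa"]})

# Explicit copy: df1 is an independent DataFrame, not a slice of datas.
df1 = datas.loc[datas["name"] == "Mark"].copy()

# Modifying the copy leaves the original data alone.
df1["age"] = 21

print(df1)
print(datas)
```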

Did you mean to print the subset? Right now your code doesn't change anything.
if (datas['name']=='Mark').any():
    print( datas.loc[datas['name'] == 'Mark'] )
else:
    print('no')

You can change your dataset even in one line:
datas = datas[datas['name']=='Mark']

Python Pandas dataframe modify column value based on function that cleans string value and assign to new column

I have a certain data to clean, it's some keys where the keys have six leading zeros that I want to get rid of, and if the keys are not ending with "ABC" or it's not ending with "DEFG", then I need to clean the currency code in the last 3 indexes. If the key doesn't start with leading zeros, then just return the key as it is.
To achieve this I wrote a function that deals with string as below:
def cleanAttainKey(dirtyAttainKey):
    if dirtyAttainKey[0] != "0":
        return dirtyAttainKey
    else:
        dirtyAttainKey = dirtyAttainKey.strip("0")
    if dirtyAttainKey[-3:] != "ABC" and dirtyAttainKey[-3:] != "DEFG":
        dirtyAttainKey = dirtyAttainKey[:-3]
    cleanAttainKey = dirtyAttainKey
    return cleanAttainKey
Now I build a dummy data frame to test it but it's reporting errors:
data frame
df = pd.DataFrame({'dirtyKey':["00000012345ABC","0000012345DEFG","0000023456DEFGUSD"],'amount':[100,101,102]},
columns=["dirtyKey","amount"])
I need to get a new column called "cleanAttainKey" in the df, then modify each value in the "dirtyKey" using the "cleanAttainKey" function, then assign the cleaned key to the new column "cleanAttainKey", however it seems pandas doesn't support this type of modification.
# add a new column in df called cleanAttainKey
df['cleanAttainKey'] = ""
# I want to clean the keys and get into the new column of cleanAttainKey
dirtyAttainKeyList = df['dirtyKey'].tolist()
for i in range(len(df['cleanAttainKey'])):
    df['cleanAttainKey'][i] = cleanAttainKey(vpAttainKeyList[i])
I am getting the below error message:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The result should be the same as the df2 below:
df2 = pd.DataFrame({'dirtyKey':["00000012345ABC","0000012345DEFG","0000023456DEFGUSD"],'amount':[100,101,102],
'cleanAttainKey':["12345ABC","12345DEFG","23456DEFG"]},
columns=["dirtyKey","cleanAttainKey","amount"])
df2
Is there any better way to modify the dirty keys and get a new column with the clean keys in Pandas?
Thanks

Here is the culprit:
df['cleanAttainKey'][i] = cleanAttainKey(vpAttainKeyList[i])
When you take an extract of a dataframe (here df['cleanAttainKey']), pandas reserves the right to return either a copy or a view. That does not matter if you are only reading the data, but it means that you should never modify the data through such an extract.
The idiomatic way is to use loc (or iloc, at, or iat):
df.loc[i, 'cleanAttainKey'] = cleanAttainKey(vpAttainKeyList[i])
(above assumes a natural range index...)
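The loop can also be avoided altogether by applying the cleaning function to the whole column. The sketch below is one possible rewrite rather than the asker's exact function: the name clean_attain_key is new, lstrip is used so that only leading zeros are removed, and endswith replaces the three-character slice, which can never equal the four-character "DEFG".

```python
import pandas as pd

def clean_attain_key(dirty_key: str) -> str:
    """Strip leading zeros, then drop a trailing 3-letter currency code
    unless the key ends with "ABC" or "DEFG"."""
    if not dirty_key.startswith("0"):
        return dirty_key
    key = dirty_key.lstrip("0")          # remove leading zeros only
    if not key.endswith(("ABC", "DEFG")):
        key = key[:-3]                   # cut the currency code
    return key

df = pd.DataFrame({
    "dirtyKey": ["00000012345ABC", "0000012345DEFG", "0000023456DEFGUSD"],
    "amount": [100, 101, 102],
})

# Build the new column in one assignment; no element-wise writes are needed.
df["cleanAttainKey"] = df["dirtyKey"].apply(clean_attain_key)

print(df)
```

Assigning a whole column at once sidesteps the copy-versus-view question, because nothing is written through an extract of the dataframe.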
