Pandas Dataframe subset not working as expected - python

This seemingly simple exercise is throwing me off track; I'm sure I'm overlooking something simple.
Let's say I have a dataframe
datas = pd.DataFrame({'age':[10,20,30],
'name':['John','Mark','Lisa']})
I now want to subset the dataframe by the name 'Mark' so I did:
if (datas['name'] == 'Mark').any():
    datas.loc[datas['name'] == 'Mark']
else:
    print('no')
Expected result is
age name
20 Mark
but I get the original dataframe back again, please assist.
I've looked at several posts but none seems to help.
Posts example I looked at: Check if string is in a pandas dataframe

I think you need to assign the result back if you want to overwrite the original DataFrame with the subset:
datas = datas.loc[datas['name'] == 'Mark']
Or assign to new variable, e.g. df1:
df1 = datas.loc[datas['name'] == 'Mark']
If you then process the data further and assign the output to a new variable such as df1, use DataFrame.copy to prevent SettingWithCopyWarning:
df1 = datas.loc[datas['name'] == 'Mark'].copy()
Without the copy, if you modify values in df1 later, you will find that the modifications do not propagate back to the original data (datas), and that pandas issues a warning.
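As a minimal sketch using the dataframe from the question, the following shows a copied subset being modified while the original stays unchanged. The replacement value 99 is only an illustration.

```python
import pandas as pd

datas = pd.DataFrame({'age': [10, 20, 30],
                      'name': ['John', 'Mark', 'Lisa']})

# The boolean mask selects the rows; .copy() makes df1 independent of datas
df1 = datas.loc[datas['name'] == 'Mark'].copy()

# Modifying the copy leaves the original dataframe untouched
df1['age'] = 99

print(df1)
print(datas)
```

Note that the selected row keeps its original index label (1), so the printed subset shows that label next to the values.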

Did you mean to print the subset? Right now your code doesn't change anything.
if (datas['name'] == 'Mark').any():
    print(datas.loc[datas['name'] == 'Mark'])
else:
    print('no')

You can change your dataset even in one line:
datas = datas[datas['name']=='Mark']

Related

How not to iterate over pandas Dataframe

I have a pretty simple problem I could solve just by iterating over rows of a dataframe. But I read it's never a good practice, so I'm wondering how to avoid this step.
Dummy DataFrame
In this example I'd like to automatically give a new name to fruits that are special, according to a conventional rule (as shown in the code below).
This default name should only be applied if the fruit is special and 'Logic name' is still unknown.
In python I would write something like this:
for idx in range(len(df['Fruit'])):
    if df.loc[idx, 'Logic name'] == 'unknown' and df.loc[idx, 'Special'] == 'yes':
        df.loc[idx, 'Logic name'] = df.loc[idx, 'color'] + df.loc[idx, 'Fruit'][2:]
The final result is this
Final Dataframe
How would you avoid iteration in this case?
Use numpy.where with a condition on "special"
import numpy as np
df['Logic name'] = np.where(df['Special'].eq('yes') & df['Logic name'].eq('unknown'),
                            df['color'] + df['Fruit'].str[2:],
                            df['Logic name'])
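A runnable sketch of the same idea follows. The sample rows are made up, since the original dataframe was only shown as an image; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names are taken from the question
df = pd.DataFrame({
    'Fruit': ['apple', 'banana', 'cherry'],
    'color': ['red', 'yellow', 'red'],
    'Special': ['yes', 'no', 'yes'],
    'Logic name': ['unknown', 'unknown', 'known_name'],
})

# Rows that are special and still have no logic name
mask = df['Special'].eq('yes') & df['Logic name'].eq('unknown')

# Where the mask is True, build the default name; otherwise keep the existing one
df['Logic name'] = np.where(mask,
                            df['color'] + df['Fruit'].str[2:],
                            df['Logic name'])
print(df)
```

Only the first row satisfies both conditions, so it is the only one that gets a new name.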

How to delete specific values from a column in a dataset (Python)?

I have a data set as below:
I want to remove 'Undecided' from my ['Grad Intention'] column. For this, I created a copy DataFrame and using the code as follows:
df_copy=df_copy.drop([df_copy['Grad Intention'] =='Undecided'], axis=1)
However, this is giving me an error.
How can I remove the row with 'Undecided'? Also, what's wrong with my code?
You could simply use:
df = df[df['Grad Intention'] != 'Undecided']
or
df.drop(df[df['Grad Intention'] == 'Undecided'].index, inplace = True)
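As for what went wrong in the question's code: drop expects index or column labels rather than a boolean Series wrapped in a list, and axis=1 targets columns rather than rows. The sketch below shows both working options on made-up data, since the real data set was not shown; the 'Student' column is a hypothetical addition.

```python
import pandas as pd

# Hypothetical data standing in for the data set in the question
df = pd.DataFrame({
    'Student': ['A', 'B', 'C'],
    'Grad Intention': ['Yes', 'Undecided', 'No'],
})

# Option 1: keep only the rows that do not match
kept = df[df['Grad Intention'] != 'Undecided']

# Option 2: drop rows by the index labels of the matching rows
dropped = df.drop(df[df['Grad Intention'] == 'Undecided'].index)

print(kept)
print(dropped)
```

Both options return a new dataframe here and leave df itself unchanged.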

Changing column values for a value in an adjacent column in the same dataframe using Python

I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that in column "FBgn", there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with FBgn values provided in the adjacent column called "## FlyBase_FBgn". However, I want to keep the FBgn values in column "FBgn". Maybe keep in mind that I am showing only a portion of the dataframe (reality: 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
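A small sketch of this approach, using invented identifiers because the original table was only shown as an image. The assignment relies on pandas aligning rows by index label, so only the masked rows are overwritten.

```python
import pandas as pd

# Hypothetical values; only the column names are taken from the question
df = pd.DataFrame({
    'FBgn': ['FBtr0001', 'FBgn0002', 'FBtr0003'],
    '## FlyBase_FBgn': ['FBgn1111', 'FBgn0002', 'FBgn3333'],
})

# True for rows whose "FBgn" value is actually an FBtr identifier
mask = df['FBgn'].str.contains('FBtr')

# Overwrite only those rows with the value from the adjacent column
df.loc[mask, 'FBgn'] = df.loc[mask, '## FlyBase_FBgn']
print(df)
```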
Welcome to Stack Overflow. Next time, please provide more information, including your code; it is always helpful.
Please see the code below; I think you need something similar:
import pandas as pd
# Ignore dict1; it only recreates your dataframe
dict1 = {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'],
         "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1)
print(df)
# Replace each FBtr value in "FBgn" with the value from the "FBtr" column
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df['FBgn'][i]:
            # .loc avoids chained assignment, which may not update df
            df.loc[i, 'FBgn'] = df['FBtr'][i]
    return df
df = replace_values(df)
# Print the updated dataframe
print(df)

How do I check for a specific ticker string inside my dataframe with a column name of Name and return if it is found in the dataframe?

I need to check if the input of 'ticker' is inside of this dataframe with 'Name' as the column name and if I do stock_final.query("Name == 'AMZN'"), it works. I am unsure what the value of ticker is because it is input from a user. I need to correct this my_tick function to return the ticker if it is in the dataframe, otherwise have an error message.
You can do this:
stock_final.query("Name == '{}'".format(ticker))
But if I understand correctly what you're really asking is whether a given ticker is in the column called "Name". That can be done better like this:
(stock_final["Name"] == ticker).any()
If df.query() does not find a match in the column, it returns an empty dataframe. The result is still a dataframe, but you can check whether it is empty with the df.empty property, for example:
#some dataframe df
df.empty
This returns True if the dataframe is empty and False otherwise.
In your case, I recommend this:
def my_tick(ticker):
    if not stock_final.query("Name == '{}'".format(ticker)).empty:
        return str(ticker)
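The question also asks for an error message when the ticker is missing. One possible sketch is below; the contents of stock_final are invented, and raising ValueError is an assumption about how the error should be reported.

```python
import pandas as pd

# Hypothetical stand-in for the stock_final dataframe from the question
stock_final = pd.DataFrame({'Name': ['AMZN', 'AAPL', 'MSFT']})

def my_tick(ticker):
    """Return ticker if it appears in the Name column, else raise ValueError."""
    if (stock_final['Name'] == ticker).any():
        return str(ticker)
    # Raising an exception is one choice; printing a message would also work
    raise ValueError(f"Ticker {ticker!r} not found in stock_final")

print(my_tick('AMZN'))
```

Comparing the column directly avoids building a query string from user input, which matters if the ticker could contain a quote character.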

Python Pandas dataframe modify column value based on function that cleans string value and assign to new column

I have a certain data to clean, it's some keys where the keys have six leading zeros that I want to get rid of, and if the keys are not ending with "ABC" or it's not ending with "DEFG", then I need to clean the currency code in the last 3 indexes. If the key doesn't start with leading zeros, then just return the key as it is.
To achieve this I wrote a function that deals with string as below:
def cleanAttainKey(dirtyAttainKey):
    # Keys without a leading zero are returned unchanged
    if dirtyAttainKey[0] != "0":
        return dirtyAttainKey
    # Remove only the leading zeros
    cleanedKey = dirtyAttainKey.lstrip("0")
    # Drop the 3-character currency code unless the key ends with "ABC" or "DEFG"
    if not cleanedKey.endswith(("ABC", "DEFG")):
        cleanedKey = cleanedKey[:-3]
    return cleanedKey
Now I build a dummy data frame to test it but it's reporting errors:
data frame
df = pd.DataFrame({'dirtyKey':["00000012345ABC","0000012345DEFG","0000023456DEFGUSD"],'amount':[100,101,102]},
columns=["dirtyKey","amount"])
I need to get a new column called "cleanAttainKey" in the df, then modify each value in the "dirtyKey" using the "cleanAttainKey" function, then assign the cleaned key to the new column "cleanAttainKey", however it seems pandas doesn't support this type of modification.
# add a new column in df called cleanAttainKey
df['cleanAttainKey'] = ""
# I want to clean the keys and get into the new column of cleanAttainKey
dirtyAttainKeyList = df['dirtyKey'].tolist()
for i in range(len(df['cleanAttainKey'])):
    df['cleanAttainKey'][i] = cleanAttainKey(dirtyAttainKeyList[i])
I am getting the below error message:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The result should be the same as the df2 below:
df2 = pd.DataFrame({'dirtyKey':["00000012345ABC","0000012345DEFG","0000023456DEFGUSD"],'amount':[100,101,102],
'cleanAttainKey':["12345ABC","12345DEFG","23456DEFG"]},
columns=["dirtyKey","cleanAttainKey","amount"])
df2
Is there any better way to modify the dirty keys and get a new column with the clean keys in Pandas?
Thanks
Here is the culprit:
df['cleanAttainKey'][i] = cleanAttainKey(dirtyAttainKeyList[i])
When you extract part of a dataframe, pandas reserves the right to return either a copy or a view. That does not matter if you are only reading the data, but it means you should never modify the extract.
The idiomatic way is to use loc (or iloc, at, or iat):
df.loc[i, 'cleanAttainKey'] = cleanAttainKey(dirtyAttainKeyList[i])
(above assumes a natural range index...)
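The loop can also be avoided entirely by applying the cleaning function to the column. The sketch below is one way to do that, not the answer's method: clean_key is a hypothetical helper that uses lstrip and endswith, which is an assumption about the intended rules based on the expected df2 output.

```python
import pandas as pd

def clean_key(key):
    # Hypothetical helper: keys without a leading zero are returned unchanged
    if not key.startswith("0"):
        return key
    # Remove only the leading zeros
    key = key.lstrip("0")
    # Drop the 3-character currency code unless the key ends with "ABC" or "DEFG"
    if not key.endswith(("ABC", "DEFG")):
        key = key[:-3]
    return key

df = pd.DataFrame({'dirtyKey': ["00000012345ABC", "0000012345DEFG", "0000023456DEFGUSD"],
                   'amount': [100, 101, 102]})

# apply calls clean_key once per value and returns a new Series,
# so there is no element-wise assignment and no SettingWithCopyWarning
df['cleanAttainKey'] = df['dirtyKey'].apply(clean_key)
print(df)
```

Because the whole column is assigned at once, this does not depend on the dataframe having a default range index.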
