I had no pandas/Python experience this time last week, so I have had a steep learning curve trying to transfer a complex, multi-step process from Excel into pandas. Sorry if the following is unclear.
I merged 2 dataframes. I have a column, let's call it 'new_ID', with new ID names from originaldf1, some of which say 'no match was found'. For the 'no match was found' entries, I would like to get the old ID number from originaldf2, which is another column in currentdf; let's call this column 'old_ID'. So, I would like to do something like an Excel VLOOKUP where I say: "if there is 'no match was found' in col 'new_ID', give me the ID that is in col 'old_ID' in that same row". The output I would like is just a list of all the old IDs where no match was found.
I've tried a few solutions that I found on here, but all just give me blank outputs. I'm assuming this is because they aren't searching each individual instance of "no match was found". For example, I tried:
deletes = mydf.loc[mydf['new_ID'] == "no match was found", ['old_ID']]
This comes out with just the column header, then all blank.
Is what I'm trying to do possible in pandas? Or maybe I'm stuck in Excel ways of thinking, and there is a better/different way?
Welcome to Python. What you are trying to do is a straightforward task in pandas. Each column of a pandas DataFrame is a Series object, basically a list of values. You are trying to find which row numbers (aka indices) satisfy this criterion: new_id == "no match was found". This can be done by pulling the column out of the dataframe and applying a lambda function. I would recommend pasting this code into a new file and playing around with it to see how it works.
import pandas as pd
# Create test data frame
df = pd.DataFrame(columns=('new_id','old_id'))
df.loc[0] = (1, None)
df.loc[1] = ("no match", 4)
df.loc[2] = (3, None)
df.loc[3] = ("no match", 4)
print("\nHere is our test dataframe:")
print(df)
print("\nShow the values of the 'new_id' that meet our criteria:")
print(df['new_id'][lambda x: x == "no match"])
# Pull the index from these rows
indices = df['new_id'][lambda x: x == "no match"].index.tolist()
print("\nIndices:\n", indices)
print("\nShow only the rows of the data frame that match 'indices':")
print(df.loc[indices]['old_id'])
A couple of notes about this code:
df.loc[] refers to a specific row of a data frame by label. df.loc[2] refers to the row labelled 2, here the 3rd row (pandas data frames use a zero-based integer index by default).
The lambda here receives the whole 'new_id' Series as 'x' and returns a boolean Series by evaluating x == "no match" against every value at once. Placing it inside the square brackets [] applies that True/False result as a mask to our Series, so only the rows with True are returned.
After the lambda function, .index.tolist() converts the index of the filtered Series to a plain list.
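A note for the original question: the same idea can be written as a single boolean mask with .loc (a sketch, assuming the merged frame is called mydf with the 'new_ID' and 'old_ID' columns from the question):
# True on exactly the rows where no match was found
mask = mydf['new_ID'] == "no match was found"
# Pull the old IDs on those rows and turn them into a plain Python list
old_ids = mydf.loc[mask, 'old_ID'].tolist()
print(old_ids)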
Working off your example, I'm going to assume all new_ID entries are numbers unless there is no match.
So if your dataframe looks like this (assuming the second column has values; I didn't know what they were, so I put 0's):
     new_ID  originaldf2
0         1            0
1         2            0
2         3            0
3  no match            4
Next we can check whether your new_ID column has an ID or not by testing if it contains a number, using str.isnumeric():
has_id = df1.new_ID.str.isnumeric()
has_id
0     True
1     True
2     True
3    False
Name: new_ID, dtype: bool
Then finally we'll use where().
What where() does is take the first argument, cond, which we pass the has_id boolean filter, and check whether each entry is True or False. If True, it keeps the original value; if False, it falls back to the argument given in other, which in this case we assigned to the second column of our dataframe.
df1.where(has_id, df1.iloc[:, 1], axis=0)
     new_ID  originaldf2
0         1            0
1         2            0
2         3            0
3         4            4
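Since the question ultimately wants a plain list of the old IDs where no match was found, the same mask can be inverted (a sketch; it assumes the old IDs sit in the dataframe's second column, as above):
# ~has_id is True exactly where new_ID is not numeric, i.e. "no match"
old_ids = df1.loc[~has_id, df1.columns[1]].tolist()
print(old_ids)  # e.g. [4]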
I just can't figure out what "==" means on the second line:
- It is not a test; there is no if statement...
- It is not a variable declaration...
I've never seen this before; the thing is, data.categ == cat is a pandas Series and not a test...
for cat in data["categ"].unique():
    subset = data[data.categ == cat]  # Build the subsample
    print("-" * 20)
    print('Category: ' + cat)
    print("mean:\n", subset['montant'].mean())
    print("median:\n", subset['montant'].median())
    print("mode:\n", subset['montant'].mode())
    print("variance:\n", subset['montant'].var())
    print("std:\n", subset['montant'].std())
    plt.figure(figsize=(5, 5))
    subset["montant"].hist(bins=30)  # Build the histogram
    plt.show()  # Display the histogram
It is testing each element of data.categ for equality with cat. That produces a vector of True/False values. This is passed as an indexer to data[], which returns the rows from data that correspond to the True values in the vector.
To summarize, the whole expression returns the subset of rows from data where the value of data.categ equals cat.
(It seems the whole operation could be done more elegantly using data.groupby('categ').apply(someFunc), as sketched below.)
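A rough sketch of that groupby idea (assuming the data, categ, and montant names from the question; the aggregation list and plotting loop are illustrations, not the original author's code):
import matplotlib.pyplot as plt

# One summary table with the per-category statistics computed at once
stats = data.groupby('categ')['montant'].agg(['mean', 'median', 'var', 'std'])
print(stats)

# One histogram per category, mirroring the original loop
for cat, subset in data.groupby('categ'):
    plt.figure(figsize=(5, 5))
    subset['montant'].hist(bins=30)
    plt.title('Category: ' + str(cat))
    plt.show()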
It creates a boolean Series, with True at the indexes where data.categ equals cat. With this boolean mask you can filter your dataframe; in other words, subset will hold all the records where categ is the value stored in cat.
This is an example using numeric data
np.random.seed(0)
a = np.random.choice(np.arange(2), 5)
b = np.random.choice(np.arange(2), 5)
df = pd.DataFrame(dict(a = a, b = b))
df[df.a == 0].head()
# a b
# 0 0 0
# 2 0 0
# 4 0 1
df[df.a == df.b].head()
# a b
# 0 0 0
# 2 0 0
# 3 1 1
Yes, it is a test. Boolean expressions are not restricted to if statements.
It looks as if data is a pandas DataFrame. An expression used as a DataFrame index is how pandas denotes a selector or filter. This says to select every row in which the field categ matches the variable cat (apparently a pre-defined variable). This collection of rows becomes a new DataFrame, subset.
data.categ == cat will return a boolean Series that is used to filter your dataframe, keeping only the rows where the boolean is True.
Booleans are used in many situations, not only in if statements.
Here you are comparing data.categ with cat, the element currently being iterated over from data["categ"].unique(). The rows where they are equal form the subset used in the rest of the loop body.
I'm trying to select rows from a pandas dataframe by applying a condition to a column (in the form of a logical expression).
A sample data frame looks like:
id userid code
0 645382311 12324234234
1 645382311 -2434234242
2 645382312 32536365654
3 645382312 12324234234
...
For example, I expect the following results when applying these logical expressions to the 'code' column:
case 1: (12324234234 OR -2434234242) AND NOT 32536365654
case 2: (12324234234 AND -2434234242) OR NOT 32536365654
must give a result for both cases:
userid: 645382311
The logic above says:
For case 1 - give me only those userids that have at least one of the values (12324234234 OR -2434234242) and do not have 32536365654 anywhere in the data frame.
For case 2 - I need only those userids that either have both codes (12324234234 AND -2434234242) or have any codes except 32536365654.
A statement like the one below returns an empty DataFrame:
flt = df[(df.code == 12324234234) & (df.code == -2434234242)]
print("flt: ", flt)
Result (and it makes sense, since no single row can equal both codes at once):
flt: Empty DataFrame
Would appreciate any hints on how to handle such cases.
As a simple approach, I would transform your sample table into a boolean presence matrix, which would then allow you to perform the logic you need:
import pandas
sample = pandas.DataFrame([[645382311, 12324234234], [645382311, -2434234242], [645382312, 32536365654], [645382312, 12324234234]], columns=['userid', 'code'])
# Add a column of True values
sample['value'] = True
# Pivot to a boolean presence matrix and remove the MultiIndex;
# astype(bool) guarantees ~, & and | behave as logical operators
presence = sample.pivot(index='userid', columns='code').fillna(False)['value'].astype(bool)
# Perform desired boolean tests
case1 = (presence[12324234234] | presence[-2434234242]) & ~(presence[32536365654])
case2 = (presence[12324234234] & presence[-2434234242]) | ~(presence[32536365654])
The case variables will contain the boolean test result for each userid.
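To turn those boolean results into the userids the question asks for (a short sketch using the presence matrix built above):
# case1/case2 are boolean Series indexed by userid, so the matching
# userids are just the index values where the test is True
print(case1[case1].index.tolist())  # [645382311]
print(case2[case2].index.tolist())  # [645382311]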
I'm trying to replace the values in one column of a dataframe. The column ('female') only contains the values 'female' and 'male'.
I have tried the following:
w['female']['female']='1'
w['female']['male']='0'
But I receive an exact copy of the previous results.
I would ideally like to get some output which resembles the following loop element-wise.
if w['female'] =='female':
w['female'] = '1';
else:
w['female'] = '0';
I've looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.
Any help will be appreciated.
If I understand right, you want something like this:
w['female'] = w['female'].map({'female': 1, 'male': 0})
(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I'm not sure why you'd want that.)
The reason your code doesn't work is because using ['female'] on a column (the second 'female' in your w['female']['female']) doesn't mean "select rows where the value is 'female'". It means to select rows where the index is 'female', of which there may not be any in your DataFrame.
You can edit a subset of a dataframe by using loc:
df.loc[<row selection>, <column selection>]
In this case:
w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
w.female.replace(to_replace=dict(female=1, male=0), inplace=True)
See pandas.DataFrame.replace() docs.
Slight variation:
w.female.replace(['male', 'female'], [1, 0], inplace=True)
This should also work:
w.female[w.female == 'female'] = 1
w.female[w.female == 'male'] = 0
This is very compact:
w['female'][w['female'] == 'female']=1
w['female'][w['female'] == 'male']=0
Another good one:
w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)
You can also use apply with .get i.e.
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
w = pd.DataFrame({'female':['female','male','female']})
print(w)
DataFrame w:
female
0 female
1 male
2 female
Using apply to replace values from the dictionary:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)
Result:
female
0 1
1 0
2 1
Note: apply with a dictionary's .get should only be used if all the possible values of the column are defined in the dictionary; otherwise it will produce None for values not in the dictionary. A workaround is sketched below.
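A small sketch of that workaround (my assumption, not part of the original answer): use dict.get with the original value as the default, so unmapped values pass through unchanged.
# Values missing from the mapping are returned unchanged
mapping = {'male': 0, 'female': 1}
w['female'] = w['female'].apply(lambda v: mapping.get(v, v))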
Using Series.map with Series.fillna
If your column contains more strings than just female and male, Series.map alone will fail, since it returns NaN for the other values.
That's why we have to chain it with fillna:
Example why .map fails:
df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})
female
0 male
1 female
2 female
3 male
4 other
5 other
df['female'].map({'female': '1', 'male': '0'})
0 0
1 1
2 1
3 0
4 NaN
5 NaN
Name: female, dtype: object
For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:
df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])
0 0
1 1
2 1
3 0
4 other
5 other
Name: female, dtype: object
Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:
w['female'] = pd.get_dummies(w['female'],drop_first = True)
This gives you a data frame with two columns, one for each value that occurs in w['female'], of which you drop the first (because you can infer it from the one that is left). The new column is automatically named as the string that you replaced.
This is especially useful if you have categorical variables with more than two possible values. This function creates as many dummy variables needed to distinguish between all cases. Be careful then that you don't assign the entire data frame to a single column, but instead, if w['female'] could be 'male', 'female' or 'neutral', do something like this:
w = pd.concat([w, pd.get_dummies(w['female'], drop_first=True)], axis=1)
w.drop('female', axis = 1, inplace = True)
Then you are left with two new columns giving you the dummy coding of 'female' and you got rid of the column with the strings.
w.replace({'female':{'female':1, 'male':0}}, inplace = True)
The above code will replace 'female' with 1 and 'male' with 0, only in the column 'female'
There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
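A minimal sketch of factorize (note that codes are assigned by order of first appearance, so the coding may be the reverse of the 0/1 mapping you want):
import pandas as pd

w = pd.DataFrame({'female': ['male', 'female', 'male']})
codes, uniques = pd.factorize(w['female'])
print(codes)    # [0 1 0] -- 'male' appeared first, so it gets code 0
print(uniques)  # Index(['male', 'female'], dtype='object')
w['female'] = codes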
w.female = np.where(w.female=='female', 1, 0)
if someone is looking for a numpy solution. This is useful for replacing values based on a condition; both the if and the else branches are inherent in np.where(). The solutions that use df.replace() may not be feasible if the column includes many unique values in addition to 'male', all of which should be replaced with 0.
Another solution is to use df.where() and df.mask() in succession; this is needed because neither of them implements an else branch on its own.
w.female.where(w.female=='female', 0, inplace=True) # replace where condition is False
w.female.mask(w.female=='female', 1, inplace=True) # replace where condition is True
dic = {'female':1, 'male':0}
w['female'] = w['female'].replace(dic)
.replace takes a dictionary as an argument, in which you can map whatever values you want or need.
I think the answers should point out which type of object you get from each of the methods suggested above: a Series or a DataFrame.
When you select a column with w.female or w['female'] you get back a Series; double brackets, as in w[['female']], give you a one-column DataFrame.
Both Series and DataFrame have a .replace method, but some methods, like map, exist only on Series, so it helps to check which object you are working with before choosing among apply, map, replace and so on.
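A quick sketch to verify those types, and that .replace works on both (hypothetical frame w):
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male']})
print(type(w['female']))    # <class 'pandas.core.series.Series'>
print(type(w[['female']]))  # <class 'pandas.core.frame.DataFrame'>

# Both objects support .replace
print(w['female'].replace({'female': 1, 'male': 0}))
print(w[['female']].replace({'female': 1, 'male': 0}))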
To answer the question more generically, so it applies to more use cases than just what the OP asked, consider this solution. I used jfs's solution to help me. Here, we create two functions that feed each other and can be used whether you know the exact replacements or not.
import numpy as np
import pandas as pd
class Utility:

    @staticmethod
    def rename_values_in_column(column: pd.Series, name_changes: dict = None) -> pd.Series:
        """
        Renames the distinct values in a column. If no dictionary is provided for the
        exact name changes, the values are left as they are.
        :param column: The column in your dataframe you would like to alter.
        :param name_changes: A dictionary of the old values to the new values you would like to change.
            Ex. {1234: "User A"} would change all occurrences of 1234 to the string "User A"
            and leave the other values as they were. By default, this is an empty dictionary.
        :return: The same column with the replaced values.
        """
        name_changes = name_changes if name_changes else {}
        new_column = column.replace(to_replace=name_changes)
        return new_column

    @staticmethod
    def create_unique_values_for_column(column: pd.Series, except_values: list = None) -> dict:
        """
        Creates a dictionary where the key is the existing column item and the value is the
        new item to replace it, named <column_name>_<count>. The returned dictionary can then
        be passed to rename_values_in_column above to rename all the distinct values in a column.
        Ex. column "statement" with values ["I", "am", "old"] would return
            {"I": "statement_1", "am": "statement_2", "old": "statement_3"}
        If you would like a value to remain the same, list it in except_values.
        Ex. with except_values = ["I", "am"], the same column would return
            {"old": "statement_3"}
        :param column: A pandas Series for the column with the values to replace.
        :param except_values: A list of values you do not want to have changed.
        :return: A dictionary that maps the old values to their respective new values.
        """
        except_values = except_values if except_values else []
        column_name = column.name
        distinct_values = np.unique(column)
        name_mappings = {}
        count = 1
        for value in distinct_values:
            if value not in except_values:
                name_mappings[value] = f"{column_name}_{count}"
            count += 1
        return name_mappings
For the OP's use case, it is simple enough to just use
w["female"] = Utility.rename_values_in_column(w["female"], name_changes = {"female": 0, "male":1}
However, it is not always so easy to know all of the different unique values within a data frame that you may want to rename. In my case, the string values for a column are hashed values so they hurt the readability. What I do instead is replace those hashed values with more readable strings thanks to the create_unique_values_for_column function.
df["user"] = Utility.rename_values_in_column(
df["user"],
Utility.create_unique_values_for_column(df["user"])
)
This will change my user column values from ["1a2b3c", "a12b3c", "1a2b3c"] to ["user_1", "user_2", "user_1"]. Much easier to compare, right?
If you have only two classes, you can use the equality operator. For example:
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b']})
df['col1'].eq('a').astype(int)
# (df['col1'] == 'a').astype(int)
Output:
0 1
1 1
2 1
3 0
Name: col1, dtype: int64