Compare pandas Series of a DataFrame as a whole and not element-wise - python

Problem:
Accessing the same row of a DataFrame in two ways, I would like to check whether the resulting objects are the same as a whole.
Data:
Data link for copy and paste:
import pandas as pd

API_link_to_data = 'https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv'
energyDF = pd.read_csv(API_link_to_data)
row3_LOC = energyDF.loc[[3], :]
row3_ILOC = energyDF.iloc[[3], :]
This code compares element-wise:
row3_LOC == row3_ILOC
and returns a one-row DataFrame of booleans.
What I would like to get is a single True, since row3_LOC and row3_ILOC are the same.
Thanks

If you check, both row3_LOC and row3_ILOC are in turn DataFrames.
print(type(row3_LOC))
print(type(row3_ILOC))
results in:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
You can check if they are equal using row3_ILOC.equals(row3_LOC). Refer to the equals function.
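For reference, a minimal runnable sketch of that check (reusing the data from the question):
import pandas as pd

API_link_to_data = 'https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv'
energyDF = pd.read_csv(API_link_to_data)

row3_LOC = energyDF.loc[[3], :]
row3_ILOC = energyDF.iloc[[3], :]

# equals() compares the whole object and returns a single boolean;
# unlike ==, it treats NaNs in the same location as equal
print(row3_ILOC.equals(row3_LOC))  # expected: True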

You can compare two series using all():
(row3_LOC == row3_ILOC).all()
As soon as one of the values doesn't match, you will get False. You may also be interested in .any(), which checks whether at least one value is True.
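Note that row3_LOC and row3_ILOC are one-row DataFrames rather than Series, so a single .all() still returns one boolean per column; a second .all() reduces that to one value. A minimal sketch, reusing the objects from the question:
comparison = row3_LOC == row3_ILOC
print(comparison.all())        # one boolean per column
print(comparison.all().all())  # a single boolean for the whole row
# caveat: any NaN in the row compares as False here (NaN != NaN),
# which the fillna approach in the next answer works around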

Fill the NaNs with a placeholder such as 'NULL' first, because NaN == NaN evaluates to False in an element-wise comparison:
energyDF = energyDF.fillna('NULL')
energyDF.loc[[3], :] == energyDF.iloc[[3], :]

   Date  Consumption  Wind  Solar  Wind+Solar
3  True         True  True   True        True
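To collapse that element-wise result into the single True the question asks for, the same reduction as above can be applied; a short sketch, assuming energyDF from the question:
filled = energyDF.fillna('NULL')
print((filled.loc[[3], :] == filled.iloc[[3], :]).all().all())  # expected: True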

Related

Performing the equivalent of a vlookup within a merged df in Pandas

I had no pandas/Python experience this time last week, so I have had a steep learning curve trying to transfer a complex, multi-step process that was being done in Excel into pandas. Sorry if the following is unclear.
I merged 2 dataframes. I have a column, let's call it 'new_ID', with new ID names from originaldf1, some of which say 'no match was found'. For the 'no match was found' entries I would like to get the old ID number from originaldf2, which is another column in currentdf, let's call this col 'old_ID'. So, I would like to do something like an Excel VLOOKUP where I say: "if there is 'no match was found' in col 'new_ID', give me the ID that is in col 'old_ID', in that same row". The output I would like is just a list of all the old IDs where no match was found.
I've tried a few solutions that I found on here, but all just give me blank outputs. I'm assuming this is because they aren't searching each individual instance of "no match was found". For example I tried:
deletes = mydf.loc[mydf['new_ID'] == "no match was found", ['old_ID']]
This returns just the column header, with all the values blank.
Is what I'm trying to do possible in pandas? Or maybe I'm stuck in Excel ways of thinking, and there is a better/different way!?
Welcome to Python. What you are trying to do is a straightforward task in pandas. Each column of a pandas DataFrame is a Series object, basically a list of values. You are trying to find which row labels (aka indices) satisfy this criterion: new_id == "no match was found". This can be done by pulling the column out of the dataframe and applying a lambda function. I would recommend pasting this code into a new file and playing around to see how it works.
import pandas as pd

# Create a test data frame
df = pd.DataFrame(columns=('new_id', 'old_id'))
df.loc[0] = (1, None)
df.loc[1] = ("no match", 4)
df.loc[2] = (3, None)
df.loc[3] = ("no match", 4)

print("\nHere is our test dataframe:")
print(df)

print("\nShow the values of 'new_id' that meet our criteria:")
print(df['new_id'][lambda x: x == "no match"])

# Pull the index labels of these rows into a plain list
indices = df['new_id'][lambda x: x == "no match"].index.tolist()
print("\nIndices:\n", indices)

print("\nShow only the 'old_id' values of the rows that match 'indices':")
print(df.loc[indices]['old_id'])
A couple of notes about this code:
df.loc[] refers to a specific row of a data frame. df.loc[2] refers to the 3rd row here, since pandas data frames are zero-indexed by default.
Indexing a Series with a lambda applies that function to the Series and uses the result to select rows. In this case each value of 'new_id' is referred to as 'x' and checked with x == "no match", which produces a boolean Series of True/False values; only the rows where it is True are returned.
After the lambda filter, .index.tolist() converts the index labels of the remaining rows into a plain Python list.
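The list of old IDs the question asks for can also be pulled directly with a boolean-mask .loc lookup (essentially the asker's attempt with the selection completed); a sketch using the test frame above:
# old_id values of the rows where new_id == "no match"
old_ids = df.loc[df['new_id'] == "no match", 'old_id'].tolist()
print(old_ids)  # [4, 4] for the test frame above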
Working off your example, I'm going to assume all new_ID entries are numbers unless there is no match.
So if your dataframe looks like this (assuming the 2nd column has some values; I didn't know, so I put 0's):
     new_ID  originaldf2
0         1            0
1         2            0
2         3            0
3  no match            4
Next we can check whether your new_ID column holds an ID or not by seeing if it contains a number, using str.isnumeric():
has_id = df1.new_ID.str.isnumeric()
has_id
0     True
1     True
2     True
3    False
Name: new_ID, dtype: bool
Then finally we'll use where().
What this does is take the first argument, cond, to which we've passed the has_id boolean filter, and check whether it is True or False. If True, it keeps the original value; if False, it falls back to the argument passed as other, which in this case is the second column of our dataframe.
df1.where(has_id, df1.iloc[:, 1], axis=0)
     new_ID  originaldf2
0         1            0
1         2            0
2         3            0
3         4            4
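For completeness, a self-contained sketch of this idea, rebuilding the example frame and using Series.where on just the new_ID column (the column names mirror the tables above):
import pandas as pd

df1 = pd.DataFrame({'new_ID': ['1', '2', '3', 'no match'],
                    'originaldf2': [0, 0, 0, 4]})

has_id = df1.new_ID.str.isnumeric()   # True where new_ID is a number

# keep new_ID where it is numeric, otherwise fall back to the old ID column
df1['new_ID'] = df1['new_ID'].where(has_id, df1['originaldf2'])
print(df1)

# just the old IDs where no match was found
print(df1.loc[~has_id, 'originaldf2'].tolist())  # [4]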

Bool and missing values in pandas

I am trying to figure out whether a column in a pandas dataframe is boolean or not (and if so, whether it has missing values and so on).
In order to test the function that I created, I tried to create a dataframe with a boolean column with missing values. However, I would say that missing values are handled exclusively as 'untyped' in Python, and there are some weird behaviours:
> boolean = pd.Series([True, False, None])
> print(boolean)
0 True
1 False
2 None
dtype: object
So the moment you put None into the list, the Series is regarded as object, because pandas cannot mix the types bool and NoneType back into bool. The same thing happens with math.nan and numpy.nan. The weirdest things happen when you try to force pandas into an area it does not want to go to :-)
> boolean = pd.Series([True, False, np.nan]).astype(bool)
> print(boolean)
0 True
1 False
2 True
dtype: bool
So np.nan is being cast to True?
Questions:
Given a data table where one column is of type 'object' but in fact it is a boolean column with missing values: how do I figure that out? After filtering for the non-missing values it is still of type 'object'... do I need to implement a try-catch-cast of every column into every imaginable data type in order to see the true nature of columns?
I guess there is a logical explanation of why np.nan is being cast to True, but isn't this an unwanted behaviour of pandas/Python itself? Should I file a bug report?
Q1: I would start by combining
np.any(pd.isna(boolean))
to identify whether there are any missing values in the column, with
set(boolean)
to identify whether only True, False and None values are inside. Combined with filtering (and, if you prefer, typecasting) you should be done.
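A minimal sketch of that combination, using the object-dtype series from the question:
import numpy as np
import pandas as pd

boolean = pd.Series([True, False, None])

has_missing = np.any(pd.isna(boolean))      # True: at least one missing value
distinct = set(boolean.dropna())            # {True, False} after filtering
looks_boolean = distinct <= {True, False}   # only boolean values remain

print(has_missing, distinct, looks_boolean)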
Q2: see comment of #WeNYoBen
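As for Q2, the usual explanation (a sketch, not something pandas-specific): np.nan is a non-zero float, and astype(bool) simply applies Python truthiness element-wise, so this is expected behaviour rather than a bug to report.
import numpy as np

print(bool(np.nan))  # True  -- NaN is a non-zero float, hence truthy
print(bool(0.0))     # False -- only zero is falsy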
I've hit the same problem. I came up with the following solution:
from pandas import Series

def is_boolean_series(col: Series):
    val = col[~col.isna()].iloc[0]
    return type(val) == bool
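A quick usage sketch of that helper:
import pandas as pd

print(is_boolean_series(pd.Series([True, False, None])))  # True: first non-missing value is a bool
print(is_boolean_series(pd.Series([1, 2, None])))         # False: first non-missing value is numeric
# note: the helper raises an IndexError if every value in the column is missing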

Text data stored differently

My problem is that I have 2 values which should be the same, but they have a strange difference and I don't know where it's coming from.
The context: I imported 3 files using pd.read_csv. I grouped the values with groupby on a date field and aggregated the offending variable using nunique, just to keep a record of the count.
Then Tableau actually counted a different number of unique records. I found a pair of records that pandas says are different, while Tableau sees them as equal.
Take a look:
df
           A
0  100000306
1  100000306

x1 = df.iloc[[0], 0]
str(x1.values)
"['100000306']"

x2 = df.iloc[[1], 0]
str(x2.values)
'[100000306]'
Why is this happening and what can I do so pandas knows they are the same value?
You have different types in one column:
df.applymap(type)
A
0 <class 'str'>
1 <class 'int'>
Notice when you print df.A it will show object
df.A
0 100000306
1 100000306
Name: A, dtype: object
Welcome to Stack Overflow!
I'm not sure what other processing steps you have done with your data, but it seems the value stored at [0, 0] is a string '100000306' as opposed to an integer 100000306. What you could do is use pandas.to_numeric() to convert the values in your column to numeric values where possible:
df['A'] = pd.to_numeric(df['A'])
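A minimal sketch of the fix on a two-row frame like the one in the question (the literal values are assumed from the printout):
import pandas as pd

df = pd.DataFrame({'A': ['100000306', 100000306]})  # mixed str/int, dtype object
df['A'] = pd.to_numeric(df['A'])                    # one numeric dtype for the column

print(df['A'].nunique())               # 1 -- both rows now count as the same value
print(df.iloc[0, 0] == df.iloc[1, 0])  # True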

How to get the index of a value in a pandas series

What's the code to get the index of a value in a pandas Series data structure?
import pandas as pd

animals = pd.Series(['bear', 'dog', 'mammoth', 'python'],
                    index=['canada', 'germany', 'iran', 'brazil'])
What's the code to extract the index of "mammoth"?
You can just use boolean indexing:
In [8]: animals == 'mammoth'
Out[8]:
canada False
germany False
iran True
brazil False
dtype: bool
In [9]: animals[animals == 'mammoth'].index
Out[9]: Index(['iran'], dtype='object')
Note, indexes aren't necessarily unique for pandas data structures.
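To illustrate the non-unique case, a small sketch with a duplicated value (the second 'mammoth' row is invented for the example):
import pandas as pd

animals = pd.Series(['bear', 'mammoth', 'mammoth', 'python'],
                    index=['canada', 'iran', 'siberia', 'brazil'])

print(animals[animals == 'mammoth'].index)
# Index(['iran', 'siberia'], dtype='object')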
You have two options:
1) If you make sure that the value is unique, or just want the first match, take the first element of the matching index:
animals[animals == 'mammoth'].index[0]  # retrieves index of first occurrence of value
2) If you would like to get all indices matching that value, use boolean indexing as per #juanpa.arrivillaga's post:
animals[animals == 'mammoth'].index  # retrieves indices of all matching values
You can also pick out any particular occurrence of the value by indexing the result like a list:
animals[animals == 'mammoth'].index[1]  # retrieves index of second occurrence of value

Searching a Pandas DataFrame column for empty values gives contradictory results

I'm trying to clean test data from Kaggle's Titanic dataset, specifically the columns - sex, fare, pclass, and age. In order to do this, I'd like to find out if any of these columns have empty values. I load the data:
import pandas as pd
test_address = r'path_to_data\test.csv'
test = pd.read_csv(test_address)
When I try to look for empty values in the columns,
True in test['Sex'].isna()
outputs True.
However,
test['Sex'].isna().value_counts()
outputs
False 418
Name: Sex, dtype: int64
which should mean there aren't any empty values (I could confirm this by visually scanning the csv file). These commands on test['Pclass'] have similar outputs.
Why is the 'True in' command giving me the wrong answer?
The operator in, when applied to a Series, checks if its left operand is in the index of the right operand. Since there is a row #1 (the numeric representation of True) in the series, the operator evaluates to True.
For the same reason False in df['Sex'].isna() is True, but False in df['Sex'][1:].isna() is False (there is no row #0 in the latter slice).
You should check if True in df['Sex'].isna().values.
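A small sketch of the difference (a toy column stands in for test['Sex']):
import pandas as pd

sex = pd.Series(['male', 'female', 'male'])  # no missing values

print(True in sex.isna())          # True  -- `in` checks the index labels (0, 1, 2)
print(True in sex.isna().values)   # False -- checks the actual boolean values
print(sex.isna().any())            # False -- an even simpler way to ask the question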
