Scenario: I have a dataframe with some NaN values scattered around. It has multiple columns; the ones of interest are "bid" and "ask".
What I want to do: I want to remove all rows where the bid column value is nan AND the ask column value is nan.
Question: What is the best way to do it?
What I already tried:
ab_df = ab_df[ab_df.bid != 'nan' and ab_df.ask != 'nan']
ab_df = ab_df[ab_df.bid.empty and ab_df.ask.empty]
ab_df = ab_df[ab_df.bid.notnull and ab_df.ask.notnull]
But none of them work.
You need the vectorized logical operators & or | (Python's and and or compare scalars, not pandas Series). To check for NaN values, use isnull and notnull:
To remove all rows where the bid column value is nan AND the ask column value is nan, keep the opposite:
ab_df[ab_df.bid.notnull() | ab_df.ask.notnull()]
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ask": [np.nan, np.nan, 2, 1],
    "bid": [np.nan, 1, 2, np.nan],
})
df[df.bid.notnull() | df.ask.notnull()]
# ask bid
#1 NaN 1.0
#2 2.0 2.0
#3 1.0 NaN
If you need both columns to be non-missing:
df[df.bid.notnull() & df.ask.notnull()]
# ask bid
#2 2.0 2.0
Another option is dropna with the thresh parameter, which keeps only the rows that have at least thresh non-missing values among the subset columns:
df.dropna(subset=['ask', 'bid'], thresh=1)
# ask bid
#1 NaN 1.0
#2 2.0 2.0
#3 1.0 NaN
df.dropna(subset=['ask', 'bid'], thresh=2)
# ask bid
#2 2.0 2.0
ab_df = ab_df.loc[~ab_df.bid.isnull() | ~ab_df.ask.isnull()]
All this time I've been using that because I convinced myself that .notnull() didn't exist. TIL.
ab_df = ab_df.loc[ab_df.bid.notnull() | ab_df.ask.notnull()]
The key is & rather than and, and | rather than or.
I made a mistake earlier using &; that is wrong here because you want the rows where either bid is not null OR ask is not null, and & would give you only the rows where both are not null.
I think you can use ab_df.dropna() as well, but I'll have to look it up.
EDIT
Oddly, df.dropna() doesn't seem to support dropping based on NAs in a specific column. I would have thought it did.
Based on the other answer I now see it does. It's Friday afternoon, ok?
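For completeness, a minimal sketch of the dropna route for the original question, assuming ab_df holds real NaN values rather than the string 'nan': how='all' drops only the rows where every column in the subset is missing, i.e. where bid AND ask are both NaN.
ab_df = ab_df.dropna(subset=['bid', 'ask'], how='all')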
Related
I have a Pandas Data Frame in Python like the one below:
NR
--------
910517196
921122192
NaN
And using the code below I try to calculate age based on column NR in the above Data Frame (it does not matter exactly how the code works, I know it is correct; briefly, I take the first 6 characters to calculate the age, because for example 910517 is 1991-05-17 :)):
df["age"] = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format = '%y%m%d')) / np.timedelta64(1, 'Y')
My problem is: the above code fails because some of the values in column "NR" are NaN, so I need to run the calculation only on the non-NaN values.
My question is: how can I modify my code so that it only uses the rows of column "NR" that are not NaN?
As a result I need something like below: simply, I want to temporarily disregard the NaN rows and, wherever there is a NaN in column NR, put a NaN in the calculated age column as well:
NR age
------------------
910517196 | 30
921122192 | 29
NaN | NaN
How can I do that in Python Pandas?
df['age'] = np.where(df['NR'].notnull(), 'your_calculation', np.nan)
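A sketch of how that pattern might be applied to the question's code. The frame and column name are rebuilt from the question, the ABT_DATE value below is made up, and the year conversion is approximated with 365.25 days instead of the question's np.timedelta64(1, 'Y') (which some newer pandas versions no longer accept):
import numpy as np
import pandas as pd

ABT_DATE = pd.Timestamp("2021-06-01")  # hypothetical reference date
df = pd.DataFrame({"NR": ["910517196", "921122192", np.nan]})

# parse the first 6 characters as a birth date; the NaN row becomes NaT
birth = pd.to_datetime(df["NR"].str[:6], format="%y%m%d", errors="coerce")

# compute the age only where NR is present, NaN elsewhere
df["age"] = np.where(df["NR"].notnull(),
                     (ABT_DATE - birth).dt.days / 365.25,
                     np.nan)
print(df)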
I have a dataframe which is sparse and looks something like this:
Conti_mV_XSCI_140|Conti_mV_XSCI_12|Conti_mV_XSCI_76|Conti_mV_XSCO_11|Conti_mV_XSCO_203|Conti_mV_XSCO_75
1 | nan | nan | 12 | nan | nan
nan | 22 | nan | nan | 13 | nan
nan | nan | 9 | nan | nan | 31
As you can see, XSCI is present in 3 header names; the only difference is a random number (_140, _12, _76) appended to the end, which makes them distinct.
This is not correct. The column names should just be Conti_mV_XSCI and Conti_mV_XSCO.
And the final column (without any random number) should hold the values from all three columns it was spread across (for example, xsci was spread across xsci_140, xsci_12, xsci_76).
The final dataframe should look something like this -
Conti_mV_XSCI| Conti_mV_XSCO
1 | 12
22 | 13
9 | 31
If you notice, the first value of XSCI comes from XSCI_140, the second value comes from the second column with XSCI, and so on. The same holds for XSCO.
The issue is, I have to do this for all the columns starting with a certain prefix, like "Conti_mV", "IDD_PowerUp_mA", etc.
My issue:
I am having a hard time cleaning up the header names, because as soon as I remove the random number from the end, pandas complains about duplicate column names; it is also not elegant.
It would be a great help if anyone can help me. Please comment if anything is not clear here.
I need a new dataframe with one column (where there were 3), combining the data from them.
Thanks.
First, if necessary, convert all columns to numeric:
df = df.apply(pd.to_numeric, errors='coerce')
If you need to group by the column names split from the right side (dropping the trailing _<number>) and sum the values within each group:
df = df.groupby(lambda x: x.rsplit('_', 1)[0], axis=1).sum()
print (df)
Conti_mV_XSCI Conti_mV_XSCO
0 1.0 12.0
1 22.0 13.0
2 9.0 31.0
If you need to filter the columns manually:
df['Conti_mV_XSCI'] = df.filter(like='XSCI').sum(axis=1)
df['Conti_mV_XSCO'] = df.filter(like='XSCO').sum(axis=1)
EDIT: One idea for summing only the columns whose names start with one of the prefixes in a list:
cols = ['IOZH_Pat_uA', 'IOZL_Pat_uA', 'Power_Short_uA', 'IDDQ_uA']
for c in cols:
    # the ^ anchors the match at the start of the column name
    columns = df.filter(regex=f'^{c}')
    df[c] = columns.sum(axis=1)
    df = df.drop(columns.columns, axis=1)
print (df)
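For illustration, a minimal sketch of that loop on a small made-up frame (the column names and values below are hypothetical):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "IOZH_Pat_uA_1": [1.0, np.nan],
    "IOZH_Pat_uA_2": [np.nan, 2.0],
    "IDDQ_uA_7": [3.0, np.nan],
})

for c in ["IOZH_Pat_uA", "IDDQ_uA"]:
    # columns whose names start with the prefix c
    columns = df.filter(regex=f"^{c}")
    # collapse them into one column, then drop the originals
    df[c] = columns.sum(axis=1)
    df = df.drop(columns.columns, axis=1)

print(df)
#    IOZH_Pat_uA  IDDQ_uA
# 0          1.0      3.0
# 1          2.0      0.0
# note: a group that is all NaN sums to 0.0, since sum skips NaN by default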
try:
df['Conti_mV_XSCI'] = df.filter(regex='XSCI').sum(axis=1)
df['Conti_mV_XSCO'] = df.filter(regex='XSCO').sum(axis=1)
edit:
you can fillna with zeroes before the above operations.
df=df.fillna(0)
This will add a column Conti_mV_XSCI holding the first non-NaN entry among the columns whose names begin with Conti_mV_XSCI:
from math import isnan

df['Conti_mV_XSCI'] = df.filter(regex="Conti_mV_XSCI.*").apply(
    lambda row: [v for v in row if not isnan(v)][0], axis=1)
You can use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
(df.pivot_longer(names_to=".value",
                 names_pattern=r"(.+)_\d+")
   .dropna())
Conti_mV_XSCI Conti_mV_XSCO
0 1.0 12.0
4 22.0 13.0
8 9.0 31.0
pivot_longer looks for column names that match the names_pattern regex; the part captured by the group becomes the output column name (that is what the .value placeholder means), so the trailing _<number> is stripped and the matching columns are stacked under the shared header.
I have the following data set:
Survived Not Survived
0 NaN 22.0
1 38.0 NaN
2 26.0 NaN
3 35.0 NaN
4 NaN 35.0
.. ... ...
886 NaN 27.0
887 19.0 NaN
888 NaN NaN
889 26.0 NaN
890 NaN 32.0
I want to remove all the rows which contain NaN, so I wrote the following code (the dataset's name is titanic_feature_data):
titanic_feature_data = titanic_feature_data.dropna()
And when I try to display the new dataset I get the following result:
Empty DataFrame
Columns: [Survived, Not Survived]
Index: []
What's the problem, and how can I fix it?
By using titanic_feature_data.dropna(), you are removing all rows with at least one missing value. From the data you printed in your question, it looks like every row contains at least one missing value. If that is indeed the case for your whole dataset, it makes total sense that your dataframe is empty after dropna(), right?
Having said that, perhaps you are looking to drop rows that have a missing value for one particular column, for example column Not Survived. Then you could use:
titanic_feature_data.dropna(subset=['Not Survived'])
Also, if you are confused about why certain rows are dropped, I recommend checking for missing values explicitly first, without dropping them. That way you can see which instances would have been dropped:
incomplete_rows = titanic_feature_data.isnull().any(axis=1)
incomplete_rows is a boolean Series which indicates whether a row contains any missing value. You can use this Series to subset your dataframe and see which rows contain missing values (presumably all of them, given your example):
titanic_feature_data.loc[incomplete_rows, :]
I've been banging my head against a wall on this for a couple of hours, and would appreciate any help I could get.
I'm working with a large data set (over 270,000 rows), and am trying to find an anomaly within two columns that should have paired values.
From the snippet of output below, I'm looking at the Alcohol_Category_ID and Alcohol_Category_Name columns. The ID column has a numeric string value that should pair up 1:1 with a string descriptor in the Name column (e.g., "1031100.0" == "100 PROOF VODKA").
As you can see, both columns have the same count of non-null values. However, there are 72 unique IDs and only 71 unique Names. I take this to mean that one Name is incorrectly associated with two different IDs.
County Alcohol_Category_ID Alcohol_Category_Name Vendor_Number \
count 269843 270288 270288 270920
unique 99 72 71 116
top Polk 1031080.0 VODKA 80 PROOF 260
freq 49092 35366 35366 46825
first NaN NaN NaN NaN
last NaN NaN NaN NaN
mean NaN NaN NaN NaN
std NaN NaN NaN NaN
min NaN NaN NaN NaN
25% NaN NaN NaN NaN
50% NaN NaN NaN NaN
75% NaN NaN NaN NaN
max NaN NaN NaN NaN
My trouble is in actually isolating out where this duplication is occurring so that I can hopefully replace the erroneous ID with its correct value. I am having a dog of a time with this.
My dataframe is named i_a.
I've been trying to examine the pairings of values between these two columns with groupby and count statements like this:
i_a.groupby(["Alcohol_Category_Name", "Alcohol_Category_ID"]).Alcohol_Category_ID.count()
However, I'm not sure how to whittle it down from there. And there are too many pairings to make this easy to do visually.
Can someone recommend a way to isolate out the Alcohol_Category_Name associated with more than one Alcohol_Category_ID?
Thank you so much for your consideration!
EDIT: After considering the advice of Dmitry, I found the solution by continually paring down duplicates until I homed in on the value of interest, like so:
#Finding all unique pairings of Category IDs and Names
subset = i_a.drop_duplicates(["Alcohol_Category_Name", "Alcohol_Category_ID"])
#Now, determine which of the category names appears more than once (thus paired with more than one ID)
subset[subset["Alcohol_Category_Name"].duplicated()]
Thank you so much for your help. It seems really obvious in retrospect, but I could not figure it out for the life of me.
I think this snippet meets your needs:
> df = pd.DataFrame({'a':[1,2,3,1,2,3], 'b':[1,2,1,1,2,1]})
So df.a has 3 unique values mapping to 2 uniques in df.b.
> df.groupby('b')['a'].nunique()
b
1 2
2 1
That shows that df.b=1 maps to 2 uniques in a (and that df.b=2 maps to only 1).
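Applied to the question's frame (i_a and the column names below are taken from the question), the same idea flags any Name that maps to more than one ID; a sketch:
counts = i_a.groupby("Alcohol_Category_Name")["Alcohol_Category_ID"].nunique()
print(counts[counts > 1])  # the Name(s) paired with more than one ID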
Please be advised, I am a beginning programmer and a beginning python/pandas user. I'm a behavioral scientist and learning to use pandas to process and organize my data. As a result, some of this might seem completely obvious and it may seem like a question not worthy of the forum. Please have tolerance! To me, this is days of work, and I have indeed spent hours trying to figure out the answer to this question already. Thanks in advance for any help.
My data look like this. The "real" Actor and Recipient data are always 5-digit numbers, and the "Behavior" data are always letter codes. My problem is that I also use this format for special lines, denoted by markers like "date" or "s" in the Actor column. These markers indicate that the "Behavior" column holds this special type of data, and not actual Behavior data. So, I want to replace the markers in the Actor column with NaN values, and grab the special data from the behavior column to put in another column (in this example, the empty Activity column).
follow Activity Actor Behavior Recipient1
0 1 NaN date 2.1.3.2012 NaN
1 1 NaN s ss.hx NaN
2 1 NaN 50505 vo 51608
3 1 NaN 51608 vr 50505
4 1 NaN s ss.he NaN
So far, I have written some code in pandas to select out the "s" lines into a new dataframe:
def get_act_line(group):
    return group.loc[group.Actor == 's']

result = trimdata.groupby('follow').apply(get_act_line)
I've copied over the Behavior column in this dataframe to the Activity column, and replaced the Actor and Behavior values with NaN:
result.Activity = result.Behavior
result.Behavior = np.nan
result.Actor = np.nan
result.head()
So my new dataframe looks like this:
follow follow Activity Actor Behavior Recipient1
1 2 1 ss.hx NaN NaN NaN
34 1 hf.xa NaN NaN f.53702
74 1 hf.fe NaN NaN NaN
10 1287 10 ss.hf NaN NaN db
1335 10 fe NaN NaN db
What I would like to do now is to combine this dataframe with the original, replacing all of the values in these selected rows, but maintaining values for the other rows in the original dataframe.
This may seem like a simple question with an obvious solution, or perhaps I have gone about it all wrong to begin with!
I've worked through Wes McKinney's book, I've read the documentation on different types of merges, mapping, joining, transformations, concatenations, etc. I have browsed the forums and have not found an answer that helps me to figure this out. Your help will be very much appreciated.
One way you can do this (though there may be more optimal or elegant ways) is:
mask = (df['Actor'] == 's')
df['Activity'] = df[mask]['Behavior']
df.loc[mask, 'Behavior'] = np.nan
where df is equivalent to your results dataframe. This should return (my column orders are slightly different):
Activity Actor Behavior Recipient1 follow
0 NaN date 2013-04-01 00:00:00 NaN 1
1 ss.hx NaN ss.hx NaN 1
2 NaN 50505 vo 51608 1
3 NaN 51608 vr 50505 1
4 ss.he NaN ss.hx NaN 1
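If you also want the marker itself removed from the Actor column, as the question describes, a small extension of the same mask idea (a sketch using the question's column names; the marker list is taken from the question's description):
import numpy as np

# rows whose Actor holds a marker rather than a real 5-digit ID
mask = df['Actor'].isin(['s', 'date'])
df.loc[mask, 'Activity'] = df.loc[mask, 'Behavior']
df.loc[mask, ['Actor', 'Behavior']] = np.nan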
References:
Explanation of df.ix from another Stack Overflow post (note that .ix has since been removed from pandas; .loc and .iloc replace it).