So I am operating on a rather large set of data. I am using a pandas DataFrame to handle this data and am stuck on an efficient way to parse the data into two formatted lists.
Here is my DataFrame object:
fet1 fet2 fet3 fet4 fet5
stim1 True True False False False
stim2 True False False False True
stim3 ...................................
stim4 ...................................
stim5 ............................. so on
I am trying to parse each row and create two lists. List one should have the column names of all the True values. List two should have the column names of the False values.
Example for stim1:
list_1=[fet1,fet2]
list_2=[fet3,fet4,fet5]
I know I can brute-force this and iterate over the rows (roughly the loop sketched below). Or I can transpose, convert to a dictionary, and parse that way. I can also create SparseSeries objects and then create sets, but then I have to reference the column names separately.
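For reference, the brute-force iteration I mean would look roughly like this (df is the DataFrame above, shown for a single stimulus only):
list_1, list_2 = [], []
for col, value in df.loc['stim1'].items():
    # append each column name to the list matching its boolean value
    (list_1 if value else list_2).append(col)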
The only problem I am running into is that I always end up with quadratic O(n^2) run time.
Is there a more efficient way to do this as a built in functionality from Pandas?
Thanks for your help.
Is this what you want?
>>> df
fet1 fet2 fet3 fet4 fet5
stim1 True True False False False
stim2 True False False False True
>>> def func(row):
...     return [
...         row.index[row == True],
...         row.index[row == False]
...     ]
>>> df.apply(func, axis=1)
stim1 [[fet1, fet2], [fet3, fet4, fet5]]
stim2 [[fet1, fet5], [fet2, fet3, fet4]]
dtype: object
This may or may not be faster. I do not think a more succinct solution is possible.
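If you then need the two plain lists for a single stimulus, a small usage sketch (not part of the answer above) would be:
result = df.apply(func, axis=1)
list_1, list_2 = result['stim1']              # two Index objects
list_1, list_2 = list(list_1), list(list_2)   # ['fet1', 'fet2'], ['fet3', 'fet4', 'fet5']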
Fast (not row-by-row) operations can get this far.
In [126]: (np.array(df.columns)*~df)[~df]
Out[126]:
fet1 fet2 fet3 fet4 fet5
stim1 NaN NaN fet3 fet4 fet5
stim2 NaN fet2 fet3 fet4 NaN
But at this point, because the rows might have variable length, the array structure must be broken and each row must be considered individually.
In [122]: (np.array(df.columns)*df)[df].apply(lambda x: Series([x.dropna()]), 1)
Out[122]:
0
stim1 [fet1, fet2]
stim2 [fet1, fet5]
In [125]: (np.array(df.columns)*~df)[~df].apply(lambda x: Series([x.dropna()]), 1)
Out[125]:
0
stim1 [fet3, fet4, fet5]
stim2 [fet2, fet3, fet4]
The slowest step is probably the Series constructor. I'm pretty sure there's no way around it though.
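If the per-row Series construction is the bottleneck, one alternative sketch (assuming the same df as above) is to drop down to NumPy and build plain Python lists, which avoids the constructor entirely:
cols = df.columns.values
values = df.values                                  # 2-D boolean array
true_lists = [list(cols[row]) for row in values]    # column names where True
false_lists = [list(cols[~row]) for row in values]  # column names where False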
My code looks like this:
a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])
c = a|b
print(c)
I don't know what the result of a logical operation should be when one element is np.nan, but I expect it to be the same whatever the order. However, I got this result:
0
0 False
1 True
Why? Is this about short-circuiting in pandas? I searched the pandas docs but did not find an answer.
My pandas version is 1.1.3
This is behaviour that is tied to np.nan, not pandas. Take the following examples:
print(True or np.nan)
print(np.nan or True)
Output:
True
nan
When performing the operation, the dtype ends up mattering, and the way np.nan behaves inside NumPy is what leads to this strange behaviour.
To get around this quirk, you can fill NaN values with False, for example, or some other token value that evaluates to False (using pandas.DataFrame.fillna()).
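A minimal sketch of that fix, reusing the frames from the question:
import numpy as np
import pandas as pd

a = pd.DataFrame([np.nan, True])
b = pd.DataFrame([True, np.nan])

# Replace NaN with False and force a boolean dtype before the OR,
# so the result no longer depends on operand order
c = a.fillna(False).astype(bool) | b.fillna(False).astype(bool)
print(c)  # True in both rows, whichever way round a and b appear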
While reading a CSV file exported from Excel into a pandas DataFrame, I got a value containing a symbol, such as 2$3.74836730957, where it should be 243.74836730957 (it seems the 4 was mistaken for a $). Is there any way to find such values and change them into NaN values in the DataFrame?
You can use pd.to_numeric to report boolean values that denote whether a particular column contains only numerical values. To check all columns you can do:
df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
And the output would look like:
A True
B False
C True
D False
...
dtype: bool
Now if you want to know which specific row(s) are not numerical you can use np.isreal:
df.applymap(np.isreal)
A B C D
item
r1 True True True True
r2 True True True True
r3 True True True False
...
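To actually replace the malformed entries with NaN, as the question asks, one option is to coerce every column; this is only a sketch, since it assumes all of your columns are meant to be numeric:
# errors='coerce' turns values that cannot be parsed as numbers, such as
# '2$3.74836730957', into NaN while leaving valid numbers untouched
df_clean = df.apply(lambda s: pd.to_numeric(s, errors='coerce'))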
I have a DataFrame named Mj_rank, with date (a datetime) as the index, which looks like this:
A B C ...
date
2016-01-29 False False True
2016-01-30 False False True
2016-02-01 True True True
....
2017-12-29 False True True
Currently the data is daily, but I would like to resample it into a new df that contains the value at every 6-month interval.
Therefore I did:
Mj_rank_s = Mj_rank.resample('6M').asfreq().tail()
which gives me this error:
ValueError: cannot reindex from a duplicate axis
Strangely enough, if I use other methods like max() or min() it works fine, but not asfreq().
I tried different approaches based on existing Stack Overflow suggestions, like adding the following in front, but it didn't work:
Mj_rank = Mj_rank.reset_index()
Mj_rank['date'] = pd.to_datetime(Mj_rank['date'])
Mj_rank = Mj_rank.set_index('date')
Thanks a lot!
Edit:
Thanks to @jezrael, who pointed out I had problems with duplicates, found using
Mj_rank[Mj_rank.index.duplicated(keep=False)]
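A minimal sketch of one possible fix, assuming the duplicated dates can simply be dropped (keeping the first occurrence):
# drop duplicate index entries, then resample as before
Mj_rank = Mj_rank[~Mj_rank.index.duplicated(keep='first')]
Mj_rank_s = Mj_rank.resample('6M').asfreq().tail()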
This is my code where this_proteinDF.loc[:, 'p-value'] contains masked values:
this_proteinDF.loc[:, 'p-value'] = this_proteinDF.loc[:, 'p-value'].apply(lambda x: np.nan if x is np.ma.masked else x)
This should change every masked value into np.nan; however, when I inspect the values after this operation, I get a float64 0.0 instead of NaN.
What's going on here and how do I fix this?
It seems there are several points of misunderstanding here (on both our parts).
np.ma.masked is a NumPy constant that is supposed to mean that something is masked. I do not think it means what you think it means.
I'm not sure when you actually want the 'p-value' series masked.
I think it will be helpful to you, and maybe others, to show the following.
mask method
Use pandas to mask the values you'd like:
df = pd.DataFrame({'p-value': [.001, .01, .1, .2, .05, .5]})
df
Check what's not significant
df['p-value'].gt(.05)
0 False
1 False
2 True
3 True
4 False
5 True
Name: p-value, dtype: bool
mask turns the values where the condition is True into np.nan:
df.loc[:, 'is_sig'] = df['p-value'].mask(df['p-value'].gt(.05))
df
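With the example frame above, the new column should come out roughly like this (NaN wherever the p-value exceeds .05):
   p-value  is_sig
0    0.001   0.001
1    0.010   0.010
2    0.100     NaN
3    0.200     NaN
4    0.050   0.050
5    0.500     NaN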
I want to do an element-wise OR operation on two pandas Series of boolean values. np.nans are also included.
I have tried three approaches and realized that the expression "np.nan or False" can be evaluated as True, False, or np.nan depending on the approach.
These are my example Series:
series_1 = pd.Series([True, False, np.nan])
series_2 = pd.Series([False, False, False])
Approach #1
Using the | operator of pandas:
In [5]: series_1 | series_2
Out[5]:
0 True
1 False
2 False
dtype: bool
Approach #2
Using the logical_or function from numpy:
In [6]: np.logical_or(series_1, series_2)
Out[6]:
0 True
1 False
2 NaN
dtype: object
Approach #3
I define a vectorized version of logical_or which is supposed to be evaluated row by row over the arrays:
@np.vectorize
def vectorized_or(a, b):
    return np.logical_or(a, b)
I use vectorized_or on the two series and convert its output (which is a numpy array) into a pandas Series:
In [8]: pd.Series(vectorized_or(series_1, series_2))
Out[8]:
0 True
1 False
2 True
dtype: bool
Question
I am wondering about the reasons for these results.
This answer explains np.logical_or and says np.logical_or(np.nan, False) should be True, but why does this only work when vectorized and not in Approach #2? And how can the results of Approach #1 be explained?
First difference: | is np.bitwise_or; this explains the difference between #1 and #2.
Second difference: since series_1.dtype is object (non-homogeneous data), operations are done row by row in the first two cases.
When using vectorize ( #3):
The data type of the output of vectorized is determined by calling
the function with the first element of the input. This can be avoided
by specifying the otypes argument.
For vectorized operations, you leave object mode: the data are first converted according to the first element (bool here, and bool(nan) is True), and the operations are done afterwards.
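A quick demonstration of that conversion (illustrative only, not from the question):
import numpy as np

print(bool(np.nan))   # True: NaN is truthy when cast to bool, so the NaN in
                      # series_1 becomes True before logical_or is applied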