Finding order of conditions met in dataframe - python

Say I have a set of data like so in a pandas.DataFrame:
   A     B     C
1  0.96  1.2   0.75
2  0.94  1.3   0.72
3  0.92  1.15  0.68
4  0.90  1.0   0.73
...
and I'd like to figure out the order in which the data meets conditions. If I were looking for A decreasing, B decreasing, and C increasing in the example above, I would get ABC, as A is first to meet its condition, B is second, and C is third.
Right now I'm running through a loop trying to figure this out, but is there a better way to do this leveraging the capabilities of Pandas?

Here is one way to do that. It makes the assumption, matching the context of your question, that each condition can be expressed as the current value being less than or greater than the previous value.
Code:
def met_condition_at(test_df, tests):
    # for each column, apply the conditional test and then cumsum()
    deltas = [getattr(test_df.diff()[col], test)(0).cumsum()
              for col, test in zip(test_df.columns, tests)]
    # the first time the condition is true, cumsum() == 1
    return (pd.concat(deltas, axis=1) == 1).idxmax()
How?
We take the .diff() of each column
We then apply the test to see where the diff changes sign
We then .cumsum() on the Boolean result and find where it == 1
The index where the cumsum first equals 1 is the index where the column first changed direction
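For intuition, here is a minimal sketch of those intermediates for column A of the sample data (assuming the df built in the Test Code below):

# df['A'] is 0.96, 0.94, 0.92, 0.90
d = df['A'].diff()            # NaN, -0.02, -0.02, -0.02
hits = d.lt(0).cumsum()       # 0, 1, 2, 3
first = (hits == 1).idxmax()  # 1 -- the first row where A decreased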
Test Code:
from io import StringIO

import pandas as pd

df = pd.read_fwf(StringIO(u"""
    A       B       C
    0.96    1.2     0.75
    0.94    1.3     0.72
    0.92    1.15    0.68
    0.90    1.0     0.73"""), header=1)
print(df)
tests = ('lt', 'lt', 'gt')
print(met_condition_at(df, tests))
print(''.join(met_condition_at(df, tests).sort_values().index.values))
Results:
A B C
0 0.96 1.20 0.75
1 0.94 1.30 0.72
2 0.92 1.15 0.68
3 0.90 1.00 0.73
A    1
B    2
C    3
dtype: int64
ABC


Merging multiple dataframes into one, with each dataframe's name as a header over its many columns, creating a 3D dataframe

I have multiple dataframes df1, df2, df3, and so on up to df10. Each dataframe has 135 columns and looks like this:
time  a  b  c  d  e  f  g
1     2  3  4  5  6  7  8
I want to arrange them in one dataframe, stacked side by side, but with each df's name as the header: one heading df1 holding all those column names (time, a, b, ...) with their values under it, and so on. Following the example here, Constructing 3D Pandas DataFrame, I tried the following code:
import numpy as np
import pandas as pd

list1 = ['df1', 'df2', 'df3', 'df4', 'df5',
         'df6', 'df7', 'df8', 'df9', 'df10']
list2 = []
for df in list1:
    for i in range(135):
        list2.append(df)
A = np.array(list2)
B = np.array([df1.columns] * 10)
C = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10], axis=1)
C = C.values.tolist()
C = np.array(C)
df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(zip(A, B)))
print(df)
But each time I get the error:
TypeError: unhashable type: 'numpy.ndarray'
I have a time column, where the times are in hh:mm format: 01:00, 01:01 and so on. I tried dropping that column from the dataframes but get the same error. How can I fix this? Can anyone help?
You could use the keys argument of the pandas concat command (using the correct range with an f-string to create a relevant nomenclature, or your already defined list1). From the docs:
keys sequence, default None
If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.
import pandas as pd
import numpy as np

# setup
np.random.seed(12345)
all_df_list = []
for i in range(3):
    d = {
        'time': (pd.timedelta_range(start='00:01:00', periods=5, freq='1s')
                 + pd.Timestamp("00:00:00")).strftime("%M:%S"),
        'a': np.random.rand(5),
        'b': np.random.rand(5),
        'c': np.random.rand(5),
    }
    all_df_list.append(pd.DataFrame(d).round(2))

# code
dfc = pd.concat(all_df_list, axis=1,
                keys=[f'df{i}' for i in range(1, 4)])  # use the correct 'range' or your already defined 'list1'
dfc = dfc.set_index(dfc.df1.time)
dfc = dfc.drop('time', axis=1, level=1)
print(dfc)
        df1               df2               df3
        a     b     c     a     b     c     a     b     c
time
01:00   0.93  0.60  0.75  0.66  0.64  0.73  0.03  0.53  0.82
01:01   0.32  0.96  0.96  0.81  0.72  0.99  0.80  0.60  0.50
01:02   0.18  0.65  0.01  0.87  0.47  0.68  0.90  0.05  0.81
01:03   0.20  0.75  0.11  0.96  0.33  0.79  0.02  0.90  0.10
01:04   0.57  0.65  0.30  0.72  0.44  0.17  0.49  0.73  0.22
Extracting columns a and b from df2
In [190]: dfc.df2[['a','b']]
Out[190]:
a b
time
01:00 0.66 0.64
01:01 0.81 0.72
01:02 0.87 0.47
01:03 0.96 0.33
01:04 0.72 0.44
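If you instead want one inner column across every frame, a cross-section on the inner level of the column MultiIndex also works (a sketch against the dfc built above):

# column 'a' from df1, df2 and df3, side by side
dfc.xs('a', axis=1, level=1)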

How to normalize the columns of a DataFrame using sklearn.preprocessing.normalize?

Is there a way to normalize the columns of a DataFrame using sklearn's normalize? I think that by default it normalizes rows.
For example, if I had df:
A B
1000 10
234 3
500 1.5
I would want to get the following:
A B
1 1
0.234 0.3
0.5 0.15
Why do you need sklearn?
Just use pandas:
>>> df / df.max()
A B
0 1.000 1.00
1 0.234 0.30
2 0.500 0.15
You can use div after getting the max:
df.div(df.max(), axis=1)
Out[456]:
A B
0 1.000 1.00
1 0.234 0.30
2 0.500 0.15
sklearn defaults to normalizing rows with the L2 norm. Both of these defaults need to be changed to get your desired normalization by the maximum value along columns:
from sklearn import preprocessing
preprocessing.normalize(df, axis=0, norm='max')
#array([[1. , 1. ],
# [0.234, 0.3 ],
# [0.5 , 0.15 ]])
From the documentation
axis : 0 or 1, optional (1 by default) axis used to normalize the data
along. If 1, independently normalize each sample, otherwise (if 0)
normalize each feature.
So just change the axis. Having said that, sklearn is overkill for this task; it can be achieved easily with pandas.
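If you do go the sklearn route and want a DataFrame back (normalize returns a plain numpy array), a minimal sketch:

import pandas as pd
from sklearn import preprocessing

normed = pd.DataFrame(preprocessing.normalize(df, axis=0, norm='max'),
                      columns=df.columns, index=df.index)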

Selecting random values from dataframe without replacement

I am following the answer from the link:
If I have a dataframe df as:
Month Day mnthShape
1 1 1.01
1 1 1.09
1 1 0.96
1 2 1.01
1 1 1.09
1 2 0.96
1 3 1.01
1 3 1.09
1 3 1.78
I want to get the following from df:
Month Day mnthShape
1 1 1.01
1 2 1.01
1 1 0.96
where the mnthShape values are selected at random from the index without replacement, i.e. if the query is df.loc[(1, 1)], it should look at all the values for (1, 1) and select one of them at random to display above. If another df.loc[(1, 1)] appears, it should select randomly again, but without replacement.
I know I need to modify the code to use the following:
apply(np.random.choice, replace=False)
But not sure how to do it.
Edit:
Every time I do df.loc[(1, 1)], it should give a new value, without replacement. I intend to do df.loc[(1, 1)] multiple times; in the previous question, it was just one time.
If you're trying to sample from the dataset without replacement, it probably makes sense to do this all in one go, rather than iteratively pulling samples from the dataset.
Pulling N samples from each month/day combo requires that each combo have at least N rows, so that N can be pulled without replacement. Assuming this is true, you could write a function to sample N values from a subset of the data:
import numpy as np

def select_n(subset, n=2):
    # choose n distinct row positions within this group
    choices = np.random.choice(len(subset), size=n, replace=False)
    return (
        subset
        .mnthShape
        .iloc[choices]
        .reset_index(drop=True)
        .rename_axis('choice'))
To apply this across the whole dataset:
In [34]: df.groupby(['Month', 'Day']).apply(select_n)
Out[34]:
choice        0     1
Month Day
1     1    1.09  0.96
      2    0.96  1.01
      3    1.09  1.01
If you really need to pull these one at a time, you'll still need to generate the samples all at once to guarantee that they're drawn without replacement, but you could generate the sample indices separately from subsetting the data:
In [48]: indices = np.random.choice(3, size=2, replace=False)
In [49]: df[((df.Month == 1) & (df.Day == 2))].iloc[indices[0]]
Out[49]:
Month 1.00
Day 2.00
mnthShape 1.01
Name: 3, dtype: float64
In [50]: df[((df.Month == 1) & (df.Day == 2))].iloc[indices[1]]
Out[50]:
Month 1.00
Day 2.00
mnthShape 0.96
Name: 5, dtype: float64
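If the lookups really do have to happen one at a time, another sketch is to pre-shuffle each (Month, Day) group once and pop values from it, so repeated draws for the same key come without replacement (pools is a hypothetical helper name here, and next() will raise StopIteration once a group is exhausted):

import numpy as np

pools = {key: iter(np.random.permutation(grp.mnthShape.values))
         for key, grp in df.groupby(['Month', 'Day'])}

next(pools[(1, 1)])  # first draw for Month=1, Day=1
next(pools[(1, 1)])  # a different row from the same group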

Pandas - Using `.rolling()` on multiple columns

Consider a pandas DataFrame which looks like the one below
A B C
0 0.63 1.12 1.73
1 2.20 -2.16 -0.13
2 0.97 -0.68 1.09
3 -0.78 -1.22 0.96
4 -0.06 -0.02 2.18
I would like to use the function .rolling() to perform the following calculation for t = 0,1,2:
Select the rows from t to t+2
Take the 9 values contained in those 3 rows, from all the columns. Call this set S
Compute the 75th percentile of S (or other summary statistics about S)
For instance, for t = 1 we have
S = { 2.2 , -2.16, -0.13, 0.97, -0.68, 1.09, -0.78, -1.22, 0.96 } and the 75th percentile is 0.97.
I couldn't find a way to make it work with .rolling(), since it apparently takes each column separately. I'm now relying on a for loop, but it is really slow.
Do you have any suggestion for a more efficient approach?
One solution is to stack the data, multiply your window size by the number of columns, and slice the result by the number of columns. Also, since you want a forward-looking window, reverse the order of the stacked DataFrame:
wsize = 3
cols = len(df.columns)
(df.stack(dropna=False)[::-1]
   .rolling(window=wsize * cols)
   .quantile(0.75)[cols - 1::cols]
   .reset_index(-1, drop=True)
   .sort_index())
Output:
0 1.12
1 0.97
2 0.97
3 NaN
4 NaN
dtype: float64
In the case of many columns and a small window:
import pandas as pd
import numpy as np

wsize = 3
df2 = pd.concat([df.shift(-x) for x in range(wsize)], axis=1)
s_quant = df2.quantile(0.75, axis=1)
# Only necessary if you need to enforce sufficient data.
s_quant[df2.isnull().any(axis=1)] = np.nan
Output: s_quant
0 1.12
1 0.97
2 0.97
3 NaN
4 NaN
Name: 0.75, dtype: float64
You can use numpy's ravel, though you may still need a for loop:
for i in range(0, 3):
    print(df.iloc[i:i + 3].values.ravel())
If your t steps by 3, you can use numpy's reshape to create an n*9 array.
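A minimal sketch of that non-overlapping reshape approach (assuming the row count is a multiple of the window size, which the 5-row sample above is not):

import numpy as np

wsize = 3
# each row of blocks holds one window's wsize * ncols values
blocks = df.values.reshape(-1, wsize * df.shape[1])
np.percentile(blocks, 75, axis=1)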

Python boxplot out of columns of different lengths

I have the following dataframe in Python (the actual dataframe is much bigger, just presenting a small sample):
A B C D E F
0 0.43 0.52 0.96 1.17 1.17 2.85
1 0.43 0.52 1.17 2.72 2.75 2.94
2 0.43 0.53 1.48 2.85 2.83
3 0.47 0.59 1.58 3.14
4 0.49 0.80
I convert the dataframe to numpy using df.values and then pass that to boxplot.
When I try to make a boxplot out of this pandas dataframe, the number of values picked from each column is restricted to the least number of values in a column (in this case, column F). Is there any way I can boxplot all values from each column?
NOTE: I use df.dropna to drop the rows in each column with missing values. However, this is resizing the dataframe to the lowest common denominator of column length, and messing up the plotting.
import prettyplotlib as ppl
import numpy as np
import pandas
import matplotlib as mpl
from matplotlib import pyplot

# pandas.DataFrame.from_csv is deprecated; read_csv is the equivalent
df = pandas.read_csv(csv_data, index_col=False)
df = df.dropna()
labels = ['A', 'B', 'C', 'D', 'E', 'F']
fig, ax = pyplot.subplots()
ppl.boxplot(ax, df.values, xticklabels=labels)
pyplot.show()
The right way to do it, rather than reinventing the wheel, is to use pandas' own .boxplot(), which handles the NaNs correctly:
In [31]:
print df
A B C D E F
0 0.43 0.52 0.96 1.17 1.17 2.85
1 0.43 0.52 1.17 2.72 2.75 2.94
2 0.43 0.53 1.48 2.85 2.83 NaN
3 0.47 0.59 1.58 NaN 3.14 NaN
4 0.49 0.80 NaN NaN NaN NaN
[5 rows x 6 columns]
In [32]:
_=plt.boxplot(df.values)
_=plt.xticks(range(1,7),labels)
plt.savefig('1.png') #keeping the nan's and plot by plt
In [33]:
_=df.boxplot()
plt.savefig('2.png') #keeping the nan's and plot by pandas
In [34]:
_=plt.boxplot(df.dropna().values)
_=plt.xticks(range(1,7),labels)
plt.savefig('3.png') #dropping the nan's and plot by plt
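If you do want to keep using plt.boxplot while dropping NaNs properly, pass each column separately after its own dropna, instead of calling dropna on the whole frame (a sketch):

_ = plt.boxplot([df[c].dropna().values for c in df.columns])
_ = plt.xticks(range(1, len(labels) + 1), labels)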
