Python boxplot out of columns of different lengths - python

I have the following dataframe in Python (the actual dataframe is much bigger, just presenting a small sample):
      A     B     C     D     E     F
0  0.43  0.52  0.96  1.17  1.17  2.85
1  0.43  0.52  1.17  2.72  2.75  2.94
2  0.43  0.53  1.48  2.85  2.83
3  0.47  0.59  1.58        3.14
4  0.49  0.80
I convert the dataframe to numpy using df.values and then pass that to boxplot.
When I try to make a boxplot out of this pandas dataframe, the number of values picked from each column is restricted to the least number of values in a column (in this case, column F). Is there any way I can boxplot all values from each column?
NOTE: I use df.dropna to drop the rows that have missing values. However, this shrinks the dataframe to the length of the shortest column and messes up the plotting.
import prettyplotlib as ppl
import numpy as np
import pandas
import matplotlib as mpl
from matplotlib import pyplot
df = pandas.DataFrame.from_csv(csv_data,index_col=False)
df = df.dropna()
labels = ['A', 'B', 'C', 'D', 'E', 'F']
fig, ax = pyplot.subplots()
ppl.boxplot(ax, df.values, xticklabels=labels)
pyplot.show()

The right way to do it, rather than reinventing the wheel, is to use pandas' own .boxplot(), which handles the NaNs correctly:
In [31]:
print df
      A     B     C     D     E     F
0  0.43  0.52  0.96  1.17  1.17  2.85
1  0.43  0.52  1.17  2.72  2.75  2.94
2  0.43  0.53  1.48  2.85  2.83   NaN
3  0.47  0.59  1.58   NaN  3.14   NaN
4  0.49  0.80   NaN   NaN   NaN   NaN
[5 rows x 6 columns]
In [32]:
_=plt.boxplot(df.values)
_=plt.xticks(range(1,7),labels)
plt.savefig('1.png') #keeping the nan's and plot by plt
In [33]:
_=df.boxplot()
plt.savefig('2.png') #keeping the nan's and plot by pandas
In [34]:
_=plt.boxplot(df.dropna().values)
_=plt.xticks(range(1,7),labels)
plt.savefig('3.png') #dropping the nan's and plot by plt
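If you would rather keep using plt.boxplot, another option (not shown above) is to drop the NaNs per column on the original frame and pass a list of arrays, so every box keeps all of that column's values:
import matplotlib.pyplot as plt
# one cleaned array per column; each box keeps every non-NaN value of its column
data = [df[col].dropna().values for col in df.columns]
_ = plt.boxplot(data)
_ = plt.xticks(range(1, len(df.columns) + 1), df.columns)
plt.savefig('4.png') #dropping the nan's per column and plot by plt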

Related

Merging multiple dataframes into one, with each dataframe's name as a header containing many columns, creating a 3D dataframe

I have multiple dataframes df1, df2, df3, and so on up to df10. Each dataframe has 135 columns and looks like this:
time  a  b  c  d  e  f  g
   1  2  3  4  5  6  7  8
I want to arrange them side by side in one dataframe, with each dataframe's name as the top-level header. That is, one heading df1 spanning all of its column names (time, a, b, ...) with their values under it, and so on, following the example in Constructing 3D Pandas DataFrame.
I tried the following code:
list1 = ['df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9', 'df10']
list2 = []
for df in list1:
    for i in range(135):
        list2.append(df)
A = np.array(list2)
B = np.array([df1.columns]*10)
C = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10], axis=1)
C = C.values.tolist()
C = np.array(C)
df = pd.DataFrame(data=C.T, columns=pd.MultiIndex.from_tuples(zip(A, B)))
print(df)
But each time I get this error:
TypeError: unhashable type: 'numpy.ndarray'
I have a time column in which the times are in hh:mm format (01:00, 01:01, and so on). I tried dropping that column from the dataframes but get the same error. How can I fix this? Can anyone help?
You could use the keys argument of the pandas concat command (build the names with an f-string over the correct range, or use your already defined list1):
keys sequence, default None
If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.
import pandas as pd
import numpy as np

# setup
np.random.seed(12345)
all_df_list = []
for i in range(3):
    d = {
        'time': (pd.timedelta_range(start='00:01:00', periods=5, freq='1s')
                 + pd.Timestamp("00:00:00")).strftime("%M:%S"),
        'a': np.random.rand(5),
        'b': np.random.rand(5),
        'c': np.random.rand(5),
    }
    all_df_list.append(pd.DataFrame(d).round(2))

# code
dfc = pd.concat(all_df_list, axis=1,
                keys=[f'df{i}' for i in range(1, 4)])  # use the correct 'range' or your already defined 'list1'
dfc = dfc.set_index(dfc.df1.time)
dfc = dfc.drop('time', axis=1, level=1)
print(dfc)
        df1               df2               df3
          a     b     c     a     b     c     a     b     c
time
01:00  0.93  0.60  0.75  0.66  0.64  0.73  0.03  0.53  0.82
01:01  0.32  0.96  0.96  0.81  0.72  0.99  0.80  0.60  0.50
01:02  0.18  0.65  0.01  0.87  0.47  0.68  0.90  0.05  0.81
01:03  0.20  0.75  0.11  0.96  0.33  0.79  0.02  0.90  0.10
01:04  0.57  0.65  0.30  0.72  0.44  0.17  0.49  0.73  0.22
Extracting columns a and b from df2
In [190]: dfc.df2[['a','b']]
Out[190]:
          a     b
time
01:00  0.66  0.64
01:01  0.81  0.72
01:02  0.87  0.47
01:03  0.96  0.33
01:04  0.72  0.44
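And if you want the same inner-level column from every sub-dataframe rather than all columns of one, a cross-section on the inner level of the column MultiIndex works (a small sketch on the dfc built above):
# column 'a' from df1, df2 and df3 at once
print(dfc.xs('a', axis=1, level=1))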

Finding intersection of Pandas dataframes within range

A project I'm working on requires merging two dataframes together along some line with a delta. Basically, I need to take a dataframe with a non-linear 2D line and find the data points within the other that fall along that line, plus or minus a delta.
Dataframe 1 (Line that we want to find points along)
import pandas as pd
df1 = pd.read_csv('path/to/df1/data.csv')
df1
x y
0 0.23 0.54
1 0.27 0.95
2 0.78 1.59
...
97 0.12 2.66
98 1.74 0.43
99 0.93 4.23
Dataframe 2 (Dataframe we want to filter, leaving points within some delta)
df2 = pd.read_csv('path/to/df2/data.csv')
df2
x y
0 0.21 0.51
1 0.27 0.35
2 3.45 1.19
...
971 0.94 2.60
982 1.01 1.33
993 0.43 2.43
Finding the coarse line
DELTA = 0.03
coarse_line = find_coarse_line(df1, df2, DELTA)
coarse_line
x y
0 0.21 0.51
1 0.09 2.68
2 0.23 0.49
...
345 1.71 0.45
346 0.96 0.40
347 0.81 1.62
I've tried using df.loc((df['x'] >= BOTLEFT_X) & (df['x'] >= BOTLEFT_Y) & (df['x'] <= TOPRIGHT_X) & (df['y'] <= TOPRIGHT_Y)) among many, many other Pandas functions and whatnot but have yet to find anything that works, much less anything efficient (with datasets >2 million points).
I have taken an approach using merge(), where x and y are placed into bins derived from the good curve df1:
generated a uniform line, y = x^2
randomised it by a small amount to generate df1
randomised it by a large amount to generate df2, also generating three times as many co-ordinates
took df1 as the reference for good ranges of x and y co-ordinates, split into bins using pd.cut(); a bin count of 1/3 of the total number of co-ordinates works well
standardised these back into arrays for reuse in pd.cut() when merging
You can see from the scatter plots that it does a reasonable job of finding and keeping the points in df2 that are close to the curve.
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3, sharey=True, sharex=False, figsize=[20, 5])
linex = [i for i in range(100)]
liney = [i**2 for i in linex]
df1 = pd.DataFrame({"x": [l*random.uniform(0.95, 1.05) for l in linex],
                    "y": [l*random.uniform(0.95, 1.05) for l in liney]})
df1.plot("x", "y", kind="scatter", ax=ax[0])
df2 = pd.DataFrame({"x": [l*random.uniform(0.5, 1.5) for l in linex*3],
                    "y": [l*random.uniform(0.5, 1.5) for l in liney*3]})
df2.plot("x", "y", kind="scatter", ax=ax[1])

# use bins on x and y axis - both need to be within range to find
bincount = len(df1)//3
xc = pd.cut(df1["x"], bincount).unique()
yc = pd.cut(df1["y"], bincount).unique()
xc = np.sort([intv.left for intv in xc] + [xc[-1].right])
yc = np.sort([intv.left for intv in yc] + [yc[-1].right])

dfm = (df2.assign(xb=pd.cut(df2["x"], xc, duplicates="drop"),
                  yb=pd.cut(df2["y"], yc, duplicates="drop"))
       .query("~(xb.isna() | yb.isna())")  # exclude rows where df2 falls outside the range of df1
       .merge(df1.assign(xb=pd.cut(df1["x"], xc, duplicates="drop"),
                         yb=pd.cut(df1["y"], yc, duplicates="drop")),
              on=["xb", "yb"],
              how="inner",
              suffixes=("_l", "_r"))
       )
dfm.plot("x_l", "y_l", kind="scatter", ax=ax[2])
print(f"graph 2 pairs:{len(df2)} graph 3 pairs:{len(dfm)}")

Pandas - Using `.rolling()` on multiple columns

Consider a pandas DataFrame which looks like the one below
A B C
0 0.63 1.12 1.73
1 2.20 -2.16 -0.13
2 0.97 -0.68 1.09
3 -0.78 -1.22 0.96
4 -0.06 -0.02 2.18
I would like to use the function .rolling() to perform the following calculation for t = 0,1,2:
Select the rows from t to t+2
Take the 9 values contained in those 3 rows, from all the columns. Call this set S
Compute the 75th percentile of S (or other summary statistics about S)
For instance, for t = 1 we have
S = { 2.2 , -2.16, -0.13, 0.97, -0.68, 1.09, -0.78, -1.22, 0.96 } and the 75th percentile is 0.97.
I couldn't find a way to make it work with .rolling(), since it apparently takes each column separately. I'm now relying on a for loop, but it is really slow.
Do you have any suggestion for a more efficient approach?
One solution is to stack the data, multiply your window size by the number of columns, and slice the result by the number of columns. Also, since you want a forward-looking window, reverse the order of the stacked DataFrame:
wsize = 3
cols = len(df.columns)
df.stack(dropna=False)[::-1].rolling(window=wsize*cols).quantile(0.75)[cols-1::cols].reset_index(-1, drop=True).sort_index()
Output:
0 1.12
1 0.97
2 0.97
3 NaN
4 NaN
dtype: float64
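To see why the [cols-1::cols] slice picks the right entries, the one-liner can be unpacked step by step (my decomposition, same logic as above):
wsize, cols = 3, len(df.columns)
stacked = df.stack(dropna=False)[::-1]               # reverse so the rolling window looks forward
rolled = stacked.rolling(window=wsize * cols).quantile(0.75)
# each original row contributes `cols` stacked entries; the entry at offset
# cols-1 (stepping by cols) is the one whose window spans rows t .. t+wsize-1
result = rolled[cols - 1::cols].reset_index(-1, drop=True).sort_index()
print(result)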
In the case of many columns and a small window:
import pandas as pd
import numpy as np
wsize = 3
df2 = pd.concat([df.shift(-x) for x in range(wsize)], axis=1)
s_quant = df2.quantile(0.75, axis=1)
# Only necessary if you need to enforce sufficient data.
s_quant[df2.isnull().any(axis=1)] = np.nan
Output: s_quant
0 1.12
1 0.97
2 0.97
3 NaN
4 NaN
Name: 0.75, dtype: float64
You can use numpy's ravel. You may still need a for loop, though:
for i in range(0, 3):
    print(df.iloc[i:i+3].values.ravel())
If your t steps in threes, you can use numpy's reshape to create an n*9 array instead.
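On NumPy 1.20 or newer there is also sliding_window_view, which hands you every t to t+2 block directly and avoids both the Python loop and the stack/slice bookkeeping; a sketch under that assumption, rebuilding the 5x3 frame from the question:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({'A': [0.63, 2.20, 0.97, -0.78, -0.06],
                   'B': [1.12, -2.16, -0.68, -1.22, -0.02],
                   'C': [1.73, -0.13, 1.09, 0.96, 2.18]})
wsize = 3
# windows[t] holds the wsize x n_cols block of rows t .. t+wsize-1
windows = sliding_window_view(df.to_numpy(), window_shape=wsize, axis=0)
q = np.quantile(windows.reshape(windows.shape[0], -1), 0.75, axis=1)
# re-align with the original index; the last wsize-1 rows have no full window
result = pd.Series(np.nan, index=df.index)
result.iloc[:len(q)] = q
print(result)   # 1.12, 0.97, 0.97, NaN, NaN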

Finding order of conditions met in dataframe

Say I have a set of data like so in a pandas.DataFrame:
A B C
1 0.96 1.2 0.75
2 0.94 1.3 0.72
3 0.92 1.15 0.68
4 0.90 1.0 0.73
...
and I'd like to figure out the order in which the data meets conditions. If I were looking for A decreasing, B decreasing, and C increasing in the example above, I would get ABC, as A is first to meet its condition, B is second, and C is third.
Right now I'm running through a loop trying to figure this out, but is there a better way to do this leveraging the capabilities of Pandas?
Here is one way to do that. It assumes, in line with the context of your question, that each condition can be expressed as the current value being less than or greater than the previous one.
Code:
def met_condition_at(test_df, tests):
    # for each column apply the conditional test and then cumsum()
    deltas = [getattr(test_df.diff()[col], test)(0).cumsum()
              for col, test in zip(test_df.columns, tests)]
    # the first time the condition is true, cumsum() == 1
    return (pd.concat(deltas, axis=1) == 1).idxmax()
How?
We take the .diff() of each column
We then apply the test to see when the diff changes signs
We then .cumsum() on the Boolean result and find when it is == 1
The index when == 1 is the index when it first changed direction
Test Code:
import pandas as pd
from io import StringIO

df = pd.read_fwf(StringIO(u"""
A     B     C
0.96  1.2   0.75
0.94  1.3   0.72
0.92  1.15  0.68
0.90  1.0   0.73"""), header=1)
print(df)
tests = ('lt', 'lt', 'gt')
print(met_condition_at(df, tests))
print(''.join(met_condition_at(df, tests).sort_values().index.values))
Results:
A B C
0 0.96 1.20 0.75
1 0.94 1.30 0.72
2 0.92 1.15 0.68
3 0.90 1.00 0.73
A 1
B 2
C 3
dtype: int64
ABC
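One caveat worth adding (my note, not part of the answer above): idxmax returns the very first index for a column whose condition is never met, so if that can happen in your data you may want to blank out such columns, e.g.:
tests = ('lt', 'lt', 'gt')
# same construction as inside met_condition_at, kept separate so the mask can be reused
hits = pd.concat([getattr(df.diff()[col], test)(0).cumsum()
                  for col, test in zip(df.columns, tests)], axis=1) == 1
order = hits.idxmax().where(hits.any())  # NaN for columns whose condition never occurs
print(order)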

How to replace NaNs by average of preceding and succeeding values in pandas DataFrame?

If I have some missing values and I would like to replace all NaNs with the average of the preceding and succeeding values, how can I do that?
I know I can use pandas.DataFrame.fillna with the method='ffill' or method='bfill' options to replace the NaN values with the preceding or succeeding values, but I would like to apply the average of those two values across the dataframe rather than iterating over rows and columns.
Try DataFrame.interpolate(). Example from the pandas docs:
In [65]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
....: 'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
....:
In [66]: df
Out[66]:
A B
0 1.0 0.25
1 2.1 NaN
2 NaN NaN
3 4.7 4.00
4 5.6 12.20
5 6.8 14.40
In [67]: df.interpolate()
Out[67]:
A B
0 1.0 0.25
1 2.1 1.50
2 3.4 2.75
3 4.7 4.00
4 5.6 12.20
5 6.8 14.40
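It is worth noting how this relates to the question: for a single interior NaN, linear interpolation is exactly the average of the neighbouring values (A at index 2 becomes (2.1 + 4.7) / 2 = 3.4); only runs of consecutive NaNs, like B at indices 1 and 2, get evenly spaced values instead of one shared average. A tiny check:
import numpy as np
import pandas as pd

s = pd.Series([2.1, np.nan, 4.7])
print(s.interpolate())   # the middle value becomes (2.1 + 4.7) / 2 = 3.4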
Maybe late, but I just had the same question and the (single) answer on this page did not satisfy my expectations, so I am answering now.
Your post states that you want to replace the NaNs with averages; however, interpolation was not the correct answer for me, because it fills the empty cells along a linear equation. If you want to fill them with the averages of the preceding and succeeding rows, this code helped me:
dfb = df.fillna(method='bfill')
dff = df.fillna(method='ffill')
dfmeans = (dfb+dff)/2
dfmeans
For the dataframe of the example above, the result looks like
A B
0 1.0 0.250
1 2.1 2.125
2 3.4 2.125
3 4.7 4.000
4 5.6 12.200
5 6.8 14.400
As you can see, at index 2 of column A both approaches produce 3.4, because there the interpolation is (2.1 + 4.7)/2, but in column B the values differ.
For a one-line script and its application to time series, you can see this post: Average between values with unevenly distributed time in Pandas DataFrame
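If you prefer the bfill/ffill averaging as a single expression, it collapses to one line (a minimal sketch; cells with no valid value above or below stay NaN, exactly as in the three-line version above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
# average of the nearest valid value below (bfill) and above (ffill)
dfmeans = (df.bfill() + df.ffill()) / 2
print(dfmeans)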
