I have two pandas DataFrames in a Panel and would like to create a third DataFrame that ranks the first one row-wise, but only includes cells where the corresponding element of the second DataFrame is True. Some sample data to illustrate:
p['x']
A B C D E
2015-12-31 0.957941 -0.686432 1.087717 1.363008 -1.528369
2016-01-31 0.079616 0.524744 1.675234 0.665511 0.023160
2016-02-29 -0.300144 -0.705346 -0.141015 1.341883 0.855853
2016-03-31 0.435728 1.046326 -0.422501 0.536986 -0.656256
p['y']
A B C D E
2015-12-31 True False True False NaN
2016-01-31 True True True False NaN
2016-02-29 False True True True NaN
2016-03-31 NaN NaN NaN NaN NaN
I have managed to do this with a few ugly hacks, but I am still stuck on the fact that rank won't let me use method='first' on non-numeric data. I want to force incremental integer ranks (even when there are duplicates) and NaN for any cell that didn't have True in the boolean DataFrame.
Output should be of the form:
A B C D E
2015-12-31 2.0 NaN 1.0 NaN NaN
2016-01-31 3.0 2.0 1.0 NaN NaN
2016-02-29 NaN 3.0 2.0 1.0 NaN
2016-03-31 NaN NaN NaN NaN NaN
My hacked attempt is below. It works, although there is surely a better way to replace False with NaN. However, it breaks once I add method='first', and that is necessary because I may have duplicated values.
# I first had to hack a replacement of False with NaN.
# np.nan did not evaluate correctly in replace(), and I
# wasn't sure how else to specify pandas NaN, so I grabbed
# an existing NaN out of the frame instead.
rank = p['y'].replace(False, p['y'].iloc[3, 0])
# eliminate the elements without a corresponding True
rank = rank * p['x']
# then this works
p['rank'] = rank.rank(axis=1, ascending=False)
# but this doesn't
p['rank'] = rank.rank(axis=1, ascending=False, method='first')
Any help would be much appreciated, thanks!
Build a float frame that keeps p['x'] only where p['y'] is True, then rank it; once the data is numeric, method='first' works:
pd.DataFrame(np.where(p['y'] == True, p['x'], np.nan),
             p.major_axis, p.minor_axis).rank(1, ascending=False, method='first')
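Side note: Panel is deprecated in later pandas versions, but the same masking works on plain DataFrames with where, and method='first' then succeeds because the masked frame is numeric. A minimal sketch, assuming x and y are DataFrames shaped like p['x'] and p['y'] above:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randn(3, 4), columns=list('ABCD'))
y = pd.DataFrame({'A': [True, True, False], 'B': [False, True, True],
                  'C': [True, True, True], 'D': [np.nan, False, True]})

masked = x.where(y.fillna(False).astype(bool))  # NaN wherever y is not True
print(masked.rank(axis=1, ascending=False, method='first'))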
I am having trouble using pandas DataFrame.append(), as it doesn't work the way it is described in help(pandas.DataFrame.append) or online in the various sites, blogs, answered questions, etc.
This is exactly what I am doing:
import pandas as pd
import numpy as np
dataset = pd.DataFrame.from_dict({"0": [0,0,0,0]}, orient="index", columns=["time", "cost", "mult", "class"])
row = [3, 1, 3, 1]
dataset = dataset.append(row, sort=True)
This is the result I am trying to get:
time cost mult class
0 0 0 0 0
1 3 1 3 1
What I am getting instead is:
0 class cost mult time
0 NaN 0.0 0.0 0.0 0.0
0 3.0 NaN NaN NaN NaN
1 1.0 NaN NaN NaN NaN
2 3.0 NaN NaN NaN NaN
3 1.0 NaN NaN NaN NaN
I have tried all sorts of things, but some examples (online and in the documentation) can't be reproduced, since .append() no longer takes a "columns" parameter:
append(self, other, ignore_index: 'bool' = False, verify_integrity:
'bool' = False, sort: 'bool' = False) -> 'DataFrame'
Append rows of other to the end of caller, returning a new object.
other : DataFrame or Series/dict-like object, or list of these
The data to append.
ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
verify_integrity : bool, default False
If True, raise ValueError on creating index with duplicates.
sort : bool, default False
Sort columns if the columns of self and other are not aligned.
I have tried every combination of those parameters, but it keeps producing new rows with the values in a new, separate column, and it also changes the order of the columns I defined in the initial dataset. (I have tried various things with .concat as well, but it gave similar problems even with axis=0.)
Since even the examples in the documentation don't show this result despite having the same code structure, if anyone could enlighten me on what is happening, why, and how to fix it, that would be great.
In response to the answer, I had already tried:
row = pd.Series([3, 1, 3, 1])
row = row.to_frame()
dataset = dataset.append(row, ignore_index=True)
0 class cost mult time
0 NaN 0.0 0.0 0.0 0.0
1 3.0 NaN NaN NaN NaN
2 1.0 NaN NaN NaN NaN
3 3.0 NaN NaN NaN NaN
4 1.0 NaN NaN NaN NaN
Alternatively:
row = pd.Series([3, 1, 3, 1])
dataset = dataset.append(row, ignore_index=True)
time cost mult class 0 1 2 3
0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 3.0 1.0 3.0 1.0
Without ignore_index, this second case raises the following error:
TypeError: Can only append a Series if ignore_index=True or if the
Series has a name
One option is to just explicitly turn the list into a pd.Series:
In [46]: dataset.append(pd.Series(row, index=dataset.columns), ignore_index=True)
Out[46]:
time cost mult class
0 0 0 0 0
1 3 1 3 1
You can also do it natively with a dict:
In [47]: dataset.append(dict(zip(dataset.columns, row)), ignore_index=True)
Out[47]:
time cost mult class
0 0 0 0 0
1 3 1 3 1
The issue you're having is that other needs to be a DataFrame, a Series (or another dict-like object), or a list of DataFrames or Series, not a list of integers.
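Worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same row-append is usually written with pd.concat. A sketch using the names from the question:
import pandas as pd

row = [3, 1, 3, 1]
# wrap the list in a one-row DataFrame aligned to the existing columns
new = pd.DataFrame([row], columns=dataset.columns)
dataset = pd.concat([dataset, new], ignore_index=True)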
Could anybody help me fill missing values with the most common value, but in grouped form? Here I want to fill the missing values of the cylinders column using rows with the same car model.
I tried this:
sh_cars['cylinders'] = sh_cars['cylinders'].fillna(
    sh_cars.groupby('model')['cylinders'].agg(pd.Series.mode))
and other variations, but every time I got error messages.
Thanks in advance.
I think the problem is that some (or all) groups contain only NaNs, so an error is raised. A possible solution is a custom function with GroupBy.transform, which returns a Series the same size as the original DataFrame:
import numpy as np
import pandas as pd

data = {'model': ['a', 'a', 'a', 'a', 'b', 'b', 'a'],
        'cylinders': [2, 9, 9, np.nan, np.nan, np.nan, np.nan]}
sh_cars = pd.DataFrame(data)

# take the group's mode if the group has any non-NaN value, else NaN
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = sh_cars.groupby('model')['cylinders'].transform(f)
sh_cars['new'] = sh_cars['cylinders'].fillna(s)
print(sh_cars)
model cylinders new
0 a 2.0 2.0
1 a 9.0 9.0
2 a 9.0 9.0
3 a NaN 9.0
4 b NaN NaN
5 b NaN NaN
6 a NaN 9.0
To replace the original column instead:
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = sh_cars.groupby('model')['cylinders'].transform(f)
sh_cars['cylinders'] = sh_cars['cylinders'].fillna(s)
print(sh_cars)
model cylinders
0 a 2.0
1 a 9.0
2 a 9.0
3 a 9.0
4 b NaN
5 b NaN
6 a 9.0
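For reference, the if/else guard in the lambda is what prevents the error: Series.mode drops NaNs, so an all-NaN group produces an empty Series and .iat[0] alone would raise IndexError. A quick check:
import numpy as np
import pandas as pd

print(pd.Series([np.nan, np.nan]).mode())       # empty Series, so .iat[0] would fail
print(pd.Series([9, 9, np.nan]).mode().iat[0])  # 9.0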
I'm trying to compute a new column in a DataFrame from the values of two others, but if a value is missing I want to fall back to a different expression.
df_merge["3"] = df_merge.apply(lambda row: row["1"] + row["2"]
if pd.isnull(row["1"]) or pd.isnull(row["2"])
else (row["1"] + row["2"])/2,
axis=1)
loc 1 2 3
0 135200 0.391 0.224 0.3075
1 135210 0.400 0.220 0.3100
95 136150 NaN 0.505 NaN
96 136160 NaN 0.527 NaN
This is what I got. So if 1 or 2 is null I want to use the first expression, else the last one.
However, the first expression never seems to be applied. For example, if I test:
pd.isnull(df_merge.iloc[96,3])
it evaluates to True, so why isn't the first expression used in that instance?
I also tried:
df_merge["3"].fillna(value=df_merge["1"] + df_merge["2"],inplace=True)
Which did exactly nothing.
Sincerely,
Fredrik
The simplest approach here is to take the mean across rows, because mean in pandas omits NaNs by default (unless all values are NaN, as in the second row):
import numpy as np
import pandas as pd

df_merge = pd.DataFrame({'1': [np.nan, np.nan, 1, 2],
                         '2': [5, np.nan, np.nan, 4]})

# mean skips NaNs, so a row with a single value returns that value
df_merge["3"] = df_merge[["1", "2"]].mean(axis=1)
print(df_merge)
1 2 3
0 NaN 5.0 5.0
1 NaN NaN NaN
2 1.0 NaN 1.0
3 2.0 4.0 3.0
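As an aside, the reason the first branch of the apply never seemed to run is that it did run, but row["1"] + row["2"] is itself NaN whenever either operand is NaN, so the branch silently produced NaN. The same propagation explains why the fillna attempt changed nothing: the fill values were NaN exactly where they were needed. A quick check, reusing df_merge from above:
# addition propagates NaN: any row where '1' or '2' is missing yields NaN
print(df_merge['1'] + df_merge['2'])
# 0    NaN
# 1    NaN
# 2    NaN
# 3    6.0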
I have a dataset:
367235 419895 992194
1999-01-11 8 5 1
1999-03-23 NaN 4 NaN
1999-04-30 NaN NaN 1
1999-06-02 NaN 9 NaN
1999-08-08 2 NaN NaN
1999-08-12 NaN 3 NaN
1999-08-17 NaN NaN 10
1999-10-22 NaN 3 NaN
1999-12-04 NaN NaN 4
2000-03-04 2 NaN NaN
2000-09-29 9 NaN NaN
2000-09-30 9 NaN NaN
When I plot it using plt.plot(df, '-o'), each column shows up as isolated markers. What I would like is for the data points from each column to be connected in a line.
I understand that matplotlib does not connect data points that are separated by NaN values. I looked at the options for dealing with missing data, but all of them would essentially misrepresent the data in the dataframe. Each value in the dataframe represents an incident; if I replace the NaNs with scalar values or use the interpolate option, I get a bunch of points that are not actually in my dataset. Here's what interpolate looks like:
df_wanted2 = df.apply(pd.Series.interpolate)
If I use dropna, I'll lose entire rows/columns from the dataframe, and those rows hold valuable data.
Does anyone know a way to connect up my dots? I suspect I need to extract individual arrays from the dataframe and plot them one by one, but that seems like a lot of work (and my actual dataframe is much bigger). Does anyone have a solution?
Use the interpolate method with method='index', which weights the interpolation by the actual index values (here, dates) rather than by position:
df.interpolate('index').plot(marker='o')
Alternative answer: plot each column separately after dropping its NaNs:
for _, c in df.iteritems():
    c.dropna().plot(marker='o')
Extra credit: only interpolate from the first valid index to the last valid index of each column, so leading and trailing NaNs stay out of the plot:
for _, c in df.iteritems():
    fi, li = c.first_valid_index(), c.last_valid_index()
    c.loc[fi:li].interpolate('index').plot(marker='o')
Try iterating through the columns with apply, and drop the missing values inside the applied function:
def make_plot(s):
    s.dropna().plot()

df.apply(make_plot)
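For completeness, the "extract the individual arrays" route the asker suspected is only a few lines of plain matplotlib. A sketch, assuming df is the frame above with the dates as its index:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for col in df.columns:
    s = df[col].dropna()  # keep only the real incidents
    ax.plot(s.index, s.values, '-o', label=str(col))
ax.legend(title='column')
plt.show()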
An alternative would be to outsource the NaN handling to the graphing library Plotly, using its connectgaps option.
import plotly
import pandas as pd
txt = """367235 419895 992194
1999-01-11 8 5 1
1999-03-23 NaN 4 NaN
1999-04-30 NaN NaN 1
1999-06-02 NaN 9 NaN
1999-08-08 2 NaN NaN
1999-08-12 NaN 3 NaN
1999-08-17 NaN NaN 10
1999-10-22 NaN 3 NaN
1999-12-04 NaN NaN 4
2000-03-04 2 NaN NaN
2000-09-29 9 NaN NaN
2000-09-30 9 NaN NaN"""
# split on whitespace; the first line of txt holds the column ids
data_points = [line.split() for line in txt.splitlines()[1:]]
df = pd.DataFrame(data_points)

data = list()
for i in range(1, len(df.columns)):
    data.append(plotly.graph_objs.Scatter(
        x=df.iloc[:, 0].tolist(),
        y=df.iloc[:, i].tolist(),
        mode='lines',  # 'line' is not a valid mode
        connectgaps=True
    ))

fig = dict(data=data)
plotly.plotly.sign_in('user', 'token')
plot_url = plotly.plotly.plot(fig)
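Note: the plotly.plotly interface used here was moved into the separate chart-studio package in Plotly 4; for purely local output, plotly.offline.plot(fig) renders the same figure without the sign-in step.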
I have a dataframe (it is the product of using the pivot function, which is why it has the c and a):
c 367235 419895 992194
a
1999-02-06 NaN 9 NaN
2000-04-03 2 NaN NaN
1999-04-12 NaN NaN 4
1999-08-08 2 NaN NaN
1999-11-01 8 5 1
1999-12-08 NaN 3 NaN
1999-08-17 NaN NaN 10
1999-10-22 NaN 3 NaN
1999-03-23 NaN 4 NaN
2000-09-29 9 NaN NaN
1999-04-30 NaN NaN 1
2000-09-30 9 NaN NaN
I would like to add a new row at the bottom of this dataframe. Each cell in the new row evaluates the column above it: if the column contains the number 9, 8 or 3, the cell evaluates to "TRUE"; otherwise it evaluates to "FALSE". Ultimately, my goal is to delete the columns with a "FALSE" cell using the drop function, creating a dataset like so:
c 367235 419895
a
1999-02-06 NaN 9
2000-04-03 2 NaN
1999-04-12 NaN NaN
1999-08-08 2 NaN
1999-11-01 8 5
1999-12-08 NaN 3
1999-08-17 NaN NaN
1999-10-22 NaN 3
1999-03-23 NaN 4
2000-09-29 9 NaN
1999-04-30 NaN NaN
2000-09-30 9 NaN
TRUE TRUE
My problem: I can write a function that checks whether one of several numbers is in a list, but I cannot get this function to work with .apply.
That is, I found that this works for determining if a group of numbers is in a list:
How to check if one of the following items is in a list?
I tried to modify it as follows for the apply function:
def BIS(i):
    L1 = [9, 8, 3]
    if i in L1:
        return "TRUE"
    else:
        return "FALSE"

df_wanted.apply(BIS, axis=0)
This results in an error:
ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index 367235')
This makes me think that although .apply passes an entire column as input, my function cannot aggregate the truth values of the individual cells into a single truth value for the column. I looked up a.any() and a.bool(), and they look very useful, but I don't know where to put them. For example, this didn't work:
df_wanted.apply.any(BIS, axis = 0)
nor did this
df_wanted.apply(BIS.any, axis=0)
Can anyone point me in the right direction? Many thanks in advance
You can use the .isin() method:
df.loc[:, df.isin(['9','8','3']).any()]
And if you need to append the condition to the data frame:
cond = df.isin(['9','8','3']).any().rename("cond")
df.append(cond).loc[:, cond]
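A minimal sketch of why this works, assuming the pivoted values are strings (which is why the answer checks for '9', '8', '3' rather than integers):
import pandas as pd

df = pd.DataFrame({'367235': ['9', None], '419895': ['2', '4']})
mask = df.isin(['9', '8', '3']).any()  # .any() aggregates down each column
print(mask)                            # 367235: True, 419895: False
print(df.loc[:, mask])                 # keeps only columns containing 9, 8 or 3
The function-based attempt failed because BIS received a whole column (a Series), and "if i in L1" then tried to coerce a Series of comparisons into a single bool, which is exactly the ambiguity the error message describes.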