I'm trying to compute a new column in a DataFrame from the values of two others, but if a value is missing I want to fall back to a different expression.
df_merge["3"] = df_merge.apply(lambda row: row["1"] + row["2"]
if pd.isnull(row["1"]) or pd.isnull(row["2"])
else (row["1"] + row["2"])/2,
axis=1)
       loc      1      2       3
0   135200  0.391  0.224  0.3075
1   135210  0.400  0.220  0.3100
95  136150    NaN  0.505     NaN
96  136160    NaN  0.527     NaN
This is what I got. So if 1 or 2 is null I want to use the first expression, otherwise the last one.
However, the first expression never seems to be used. If I test, for example:
pd.isnull(df_merge.iloc[96,3])
it evaluates to True, so why isn't the first expression used in that instance?
I also tried:
df_merge["3"].fillna(value=df_merge["1"] + df_merge["2"],inplace=True)
Which did exactly nothing.
Sincerely,
Fredrik
The simplest approach here is to take the mean per row, because mean in pandas omits NaNs by default (unless all values in the row are NaN, as in the second row below):
import numpy as np
import pandas as pd

df_merge = pd.DataFrame({'1': [np.nan, np.nan, 1, 2],
                         '2': [5, np.nan, np.nan, 4]})
df_merge["3"] = df_merge[["1", "2"]].mean(axis=1)
print (df_merge)
     1    2    3
0  NaN  5.0  5.0
1  NaN  NaN  NaN
2  1.0  NaN  1.0
3  2.0  4.0  3.0
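For completeness, the reason the original apply never seemed to take the first branch: it does take it, but adding NaN to a number yields NaN, so the fallback returns the same NaN you started with (this is also why the fillna attempt did nothing, since the fill values were themselves NaN). A minimal check:

import numpy as np
import pandas as pd

print(np.nan + 0.505)                    # nan -> the "first expression" returns NaN
print(pd.Series([np.nan, 0.505]).sum())  # 0.505 -> sum/mean skip NaN by default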
I am having trouble using pandas DataFrame.append(), as it doesn't work the way it is described in help(pandas.DataFrame.append) or online in the various sites, blogs, answered questions, etc.
This is exactly what I am doing:
import pandas as pd
import numpy as np
dataset = pd.DataFrame.from_dict({"0": [0, 0, 0, 0]}, orient="index",
                                 columns=["time", "cost", "mult", "class"])
row = [3, 1, 3, 1]
dataset = dataset.append(row, sort=True)
I'm trying to get this result:

  time  cost  mult  class
0  0.0   0.0   0.0    0.0
1    3     1     3      1
What I am getting instead is:

     0  class  cost  mult  time
0  NaN    0.0   0.0   0.0   0.0
0  3.0    NaN   NaN   NaN   NaN
1  1.0    NaN   NaN   NaN   NaN
2  3.0    NaN   NaN   NaN   NaN
3  1.0    NaN   NaN   NaN   NaN
I have tried all sorts of things, but some of the examples (online and in the documentation) can't be reproduced, since .append() no longer accepts the "columns" parameter:
append(self, other, ignore_index: 'bool' = False,
       verify_integrity: 'bool' = False, sort: 'bool' = False) -> 'DataFrame'

Append rows of other to the end of caller, returning a new object.

other : DataFrame or Series/dict-like object, or list of these
    The data to append.
ignore_index : bool, default False
    If True, the resulting axis will be labeled 0, 1, …, n - 1.
verify_integrity : bool, default False
    If True, raise ValueError on creating index with duplicates.
sort : bool, default False
    Sort columns if the columns of self and other are not aligned.
I have tried all combinations of those parameters, but it keeps creating new rows with the values in a separate new column, and it also changes the order of the columns I defined in the initial dataset. (I have also tried various things with .concat, but it gave similar problems even with axis=0.)
Since even the examples in the documentation don't show this result while having the same code structure, it would be great if anyone could enlighten me on what is happening, why, and how to fix it.
In response to the answer, I had already tried:
row = pd.Series([3, 1, 3, 1])
row = row.to_frame()
dataset = dataset.append(row, ignore_index=True)

     0  class  cost  mult  time
0  NaN    0.0   0.0   0.0   0.0
1  3.0    NaN   NaN   NaN   NaN
2  1.0    NaN   NaN   NaN   NaN
3  3.0    NaN   NaN   NaN   NaN
4  1.0    NaN   NaN   NaN   NaN
Alternatively:

row = pd.Series([3, 1, 3, 1])
dataset = dataset.append(row, ignore_index=True)

   time  cost  mult  class    0    1    2    3
0   0.0   0.0   0.0    0.0  NaN  NaN  NaN  NaN
1   NaN   NaN   NaN    NaN  3.0  1.0  3.0  1.0

Without ignore_index, this second case raises the following error:
TypeError: Can only append a Series if ignore_index=True or if the
Series has a name
One option is to just explicitly turn the list into a pd.Series:
In [46]: dataset.append(pd.Series(row, index=dataset.columns), ignore_index=True)
Out[46]:
time cost mult class
0 0 0 0 0
1 3 1 3 1
You can also do it natively with a dict:
In [47]: dataset.append(dict(zip(dataset.columns, row)), ignore_index=True)
Out[47]:
time cost mult class
0 0 0 0 0
1 3 1 3 1
The issue you're having is that other needs to be a DataFrame, a Series (or another dict-like object), or a list of DataFrames or Series, not a list of integers.
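Worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea has to go through pd.concat. A sketch of the equivalent, reusing the dict trick above:

import pandas as pd

dataset = pd.DataFrame.from_dict({"0": [0, 0, 0, 0]}, orient="index",
                                 columns=["time", "cost", "mult", "class"])
row = [3, 1, 3, 1]

# Wrap the row in a one-row DataFrame with matching columns, then concatenate
new_row = pd.DataFrame([dict(zip(dataset.columns, row))])
dataset = pd.concat([dataset, new_row], ignore_index=True)
print(dataset)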
Could anybody help me fill missing values with the most common value, but in grouped form? Here I want to fill the missing values of the cylinders column using rows with the same car model.
I tried this:
sh_cars['cylinders']=sh_cars['cylinders'].fillna(sh_cars.groupby('model')['cylinders'].agg(pd.Series.mode))
and other variants, but I got error messages every time.
Thanks in advance.
I think the problem is that some (or all) groups contain only NaNs, so an error is raised. A possible solution is to use a custom function with GroupBy.transform, which returns a Series the same size as the original DataFrame:
import numpy as np
import pandas as pd

data = {'model': ['a', 'a', 'a', 'a', 'b', 'b', 'a'],
        'cylinders': [2, 9, 9, np.nan, np.nan, np.nan, np.nan]}
sh_cars = pd.DataFrame(data)

f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = sh_cars.groupby('model')['cylinders'].transform(f)
sh_cars['new'] = sh_cars['cylinders'].fillna(s)
print (sh_cars)
  model  cylinders  new
0     a        2.0  2.0
1     a        9.0  9.0
2     a        9.0  9.0
3     a        NaN  9.0
4     b        NaN  NaN
5     b        NaN  NaN
6     a        NaN  9.0
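As an aside, this is why f guards with notna().any(): Series.mode drops NaNs by default, so an all-NaN group yields an empty Series and .iat[0] would raise an IndexError. A quick demonstration:

import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan, np.nan])
print(all_nan.mode())      # empty Series -- mode ignores NaN
# all_nan.mode().iat[0]    # would raise IndexError here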
To replace the original column instead:
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = sh_cars.groupby('model')['cylinders'].transform(f)
sh_cars['cylinders'] = sh_cars['cylinders'].fillna(s)
print (sh_cars)
  model  cylinders
0     a        2.0
1     a        9.0
2     a        9.0
3     a        9.0
4     b        NaN
5     b        NaN
6     a        9.0
I have a single series (index and values) that looks like
1 5.3
2 2.5
3 1.6
4 3.8
5 2.8
...and so on. I would like to take this series and break it into 6 columns of different sizes. So (for example) the first column would have 30 items, the next 31, the next 28, and so on. I have seen plenty of examples for same-sized columns but have not seen a way to make multiple custom-sized columns.
Based on the comments, you can use the index of the series to fill your dataframe:
import pandas as pd

s = pd.Series([5, 2, 1, 3, 2])
df = pd.DataFrame([], index=s.index)
df['col1'] = s.loc[:2]
df['col2'] = s.loc[3:3]
df['col3'] = s.loc[4:]
Result:

   col1  col2  col3
0   5.0   NaN   NaN
1   2.0   NaN   NaN
2   1.0   NaN   NaN
3   NaN   3.0   NaN
4   NaN   NaN   2.0
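To generalize to the 30/31/28-item columns from the question, one option is to cut the underlying array at the cumulative sizes and let pandas NaN-pad the shorter columns. A sketch, with made-up data and made-up sizes for the last three columns (only the first three were given in the question):

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(180))    # hypothetical data
sizes = [30, 31, 28, 30, 31, 30]      # six custom column sizes, summing to len(s)
chunks = np.split(s.to_numpy(), np.cumsum(sizes)[:-1])
df = pd.DataFrame({f'col{i+1}': pd.Series(c) for i, c in enumerate(chunks)})
print(df.count())                     # 30, 31, 28, ... non-null values per column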
wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]}
I want to delete the row with 'hhh', because all the data in 'a' should be numbers.
The original dataset is huge. Thank you very much.
Option 1
Convert a using pd.to_numeric
df.a = pd.to_numeric(df.a, errors='coerce')
df

     a    b
0  NaN  1.0
1  2.0  2.0
2  3.0  NaN
3  4.0  NaN
4  5.0  5.0
Non-numeric values are coerced to NaN. You can then drop this row -
df.dropna(subset=['a'])

     a    b
1  2.0  2.0
2  3.0  NaN
3  4.0  NaN
4  5.0  5.0
Option 2
Another alternative is using str.isdigit -
df.a.str.isdigit()

0    False
1      NaN
2      NaN
3      NaN
4      NaN
Name: a, dtype: object
Filter as such -
df[df.a.str.isdigit().isnull()]

   a    b
1  2  2.0
2  3  NaN
3  4  NaN
4  5  5.0
Notes -
This won't work for float columns.
If the numbers are stored as strings, then drop the isnull bit -
df[df.a.str.isdigit()]
import pandas as pd
import numpy as np
wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})
#wu = wu[wu.a.str.contains('\d+',na=False)]
#wu = wu[wu.a.apply(lambda x: x.isnumeric())]
wu = wu[wu.a.apply(lambda x: isinstance(x, (int, np.int64)))]
print(wu)
Note that you missed a closing parenthesis when creating your DataFrame.
I tried three ways, but only the third one worked. You can always try the other ones (commented out) if they work for you. Do let me know if it works on the larger dataset.
df = pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})
df.drop(df[df['a'].apply(type) != int].index, inplace=True)
If you just want to view the rows that would be dropped:
df.loc[df['a'].apply(type) != int, :]
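A variant that combines the two answers without mutating column a: build a numeric mask with pd.to_numeric and index with it directly. A sketch, assuming the goal is only to drop the non-numeric rows while keeping the original values:

import numpy as np
import pandas as pd

wu = pd.DataFrame({'a': ['hhh', 2, 3, 4, 5], 'b': [1, 2, np.nan, np.nan, 5]})
# Rows where 'a' cannot be parsed as a number become NaN in the mask and are dropped
wu = wu[pd.to_numeric(wu['a'], errors='coerce').notna()]
print(wu)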
I have two pandas dataframes in a panel and would like to create a third df that ranks the first df (by row), but only includes elements where the corresponding element of the second df is True. Some sample data to illustrate:
p['x']
A B C D E
2015-12-31 0.957941 -0.686432 1.087717 1.363008 -1.528369
2016-01-31 0.079616 0.524744 1.675234 0.665511 0.023160
2016-02-29 -0.300144 -0.705346 -0.141015 1.341883 0.855853
2016-03-31 0.435728 1.046326 -0.422501 0.536986 -0.656256
p['y']
A B C D E
2015-12-31 True False True False NaN
2016-01-31 True True True False NaN
2016-02-29 False True True True NaN
2016-03-31 NaN NaN NaN NaN NaN
I have managed to do this with a few ugly hacks, but I am still stuck on the fact that rank won't let me use method='first' on non-numeric data. I want to force incremental integer ranks (even with duplicates) and NaN for any cell that didn't have True in the boolean df.
Output should be of the form:
A B C D E
2015-12-31 2.0 NaN 1.0 NaN NaN
2016-01-31 3.0 2.0 1.0 NaN NaN
2016-02-29 NaN 3.0 2.0 1.0 NaN
2016-03-31 NaN NaN NaN NaN NaN
My hacked attempt is below. It works, although there should clearly be a better way to replace False with NaN. However, it doesn't work once I add method='first', and that is necessary since I may have duplicated values.
# I first had to hack a replacement of False with NaN.
# np.nan did not evaluate correctly
# I wasn't sure how else to specify pandas NaN
rank = p['y'].replace(False, p['y'].iloc[3, 0])
# eliminate the elements without a corresponding True
rank = rank * p['x']
# then this works
p['rank'] = rank.rank(axis=1, ascending=False)
# but this doesn't
p['rank'] = rank.rank(axis=1, ascending=False, method='first')
Any help would be much appreciated! Thanks.
pd.DataFrame(np.where(p['y'] == True, p['x'], np.nan),
p.major_axis, p.minor_axis).rank(1, ascending=False)
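Because np.where returns a plain float array, the resulting frame is numeric, so method='first' is accepted and handles the duplicate-value case from the question. Note that Panel was removed in pandas 0.25; here is the same idea sketched with two standalone DataFrames standing in for p['x'] and p['y'] (data made up):

import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randn(4, 5), columns=list('ABCDE'))  # stand-in for p['x']
y = x > 0                                                       # stand-in for p['y']

# Keep values only where the mask is True, then rank row-wise
masked = pd.DataFrame(np.where(y, x, np.nan), index=x.index, columns=x.columns)
ranks = masked.rank(axis=1, ascending=False, method='first')
print(ranks)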