Pandas append different from documentation [duplicate]

This question already has answers here:
Appending a list or series to a pandas DataFrame as a row?
(13 answers)
Create a Pandas Dataframe by appending one row at a time
(31 answers)
Closed 1 year ago.
I am having trouble using pandas DataFrame.append(), as it doesn't work the way it is described in help(pandas.DataFrame.append) or on the various sites, blogs, answered questions, etc. online.
This is exactly what I am doing:
import pandas as pd
import numpy as np
dataset = pd.DataFrame.from_dict({"0": [0,0,0,0]}, orient="index", columns=["time", "cost", "mult", "class"])
row = [3, 1, 3, 1]
dataset = dataset.append(row, sort=True)
This is the result I am trying to get:
time cost mult class
0 0.0 0.0 0.0 0.0
1 3 1 3 1
What I am getting instead is:
0 class cost mult time
0 NaN 0.0 0.0 0.0 0.0
0 3.0 NaN NaN NaN NaN
1 1.0 NaN NaN NaN NaN
2 3.0 NaN NaN NaN NaN
3 1.0 NaN NaN NaN NaN
I have tried all sorts of things, but some examples (online and in the documentation) cannot be reproduced, since .append() no longer accepts a columns parameter:
append(self, other, ignore_index: 'bool' = False, verify_integrity:
'bool' = False, sort: 'bool' = False) -> 'DataFrame'
Append rows of other to the end of caller, returning a new object.
other : DataFrame or Series/dict-like object, or list of these
The data to append.
ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
verify_integrity : bool, default False
If True, raise ValueError on creating index with duplicates.
sort : bool, default False
Sort columns if the columns of self and other are not aligned.
I have tried every combination of those parameters, but it keeps showing me that mess of new rows with the values in separate new columns, and moreover it changes the order of the columns that I defined in the initial dataset. (I have also tried various things with .concat, but it still gave similar problems even with axis=0.)
Since even the examples in the documentation don't show this result despite having the same code structure, if anyone could enlighten me on what is happening and why, and how to fix it, that would be great.
In response to the answer, I had already tried:
row = pd.Series([3, 1, 3, 1])
row = row.to_frame()
dataset = dataset.append(row, ignore_index=True)
0 class cost mult time
0 NaN 0.0 0.0 0.0 0.0
1 3.0 NaN NaN NaN NaN
2 1.0 NaN NaN NaN NaN
3 3.0 NaN NaN NaN NaN
4 1.0 NaN NaN NaN NaN
Alternatively:
row = pd.Series([3, 1, 3, 1])
dataset = dataset.append(row, ignore_index=True)
time cost mult class 0 1 2 3
0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
1 NaN NaN NaN NaN 3.0 1.0 3.0 1.0
Without ignore_index, this second case raises this error:
TypeError: Can only append a Series if ignore_index=True or if the
Series has a name

One option is to just explicitly turn the list into a pd.Series:
In [46]: dataset.append(pd.Series(row, index=dataset.columns), ignore_index=True)
Out[46]:
time cost mult class
0 0 0 0 0
1 3 1 3 1
You can also do it natively with a dict:
In [47]: dataset.append(dict(zip(dataset.columns, row)), ignore_index=True)
Out[47]:
time cost mult class
0 0 0 0 0
1 3 1 3 1
The issue you're having is that other needs to be a DataFrame, a Series (or another dict-like object), or a list of DataFrames or Series, not a list of integers.
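Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions you need pd.concat instead. A minimal sketch of the equivalent, assuming the same dataset and row as above:
# Build a one-row DataFrame with matching columns, then concatenate.
new_row = pd.DataFrame([row], columns=dataset.columns)
dataset = pd.concat([dataset, new_row], ignore_index=True)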

Related

Pandas .iloc indexing coupled with boolean indexing in a Dataframe

I looked into existing threads regarding indexing; none of them addresses the present use case.
I would like to alter specific values in a DataFrame based on their position: the values in the second column, from the first to the fourth row, should become NaN, and the values in the third column, first and second row, should become NaN. Say we have the following `DataFrame`:
df = pd.DataFrame(np.random.standard_normal((7,3)))
print(df)
0 1 2
0 -1.102888 1.293658 -2.290175
1 -1.826924 -0.661667 -1.067578
2 1.015479 0.058240 -0.228613
3 -0.760368 0.256324 -0.259946
4 0.496348 0.437496 0.646149
5 0.717212 0.481687 -2.640917
6 -0.141584 -1.997986 1.226350
And I want to alter df as below with the least amount of code:
0 1 2
0 -1.102888 NaN NaN
1 -1.826924 NaN NaN
2 1.015479 NaN -0.228613
3 -0.760368 NaN -0.259946
4 0.496348 0.437496 0.646149
5 0.717212 0.481687 -2.640917
6 -0.141584 -1.997986 1.226350
I tried using boolean indexing with .loc, but it resulted in an error:
df.loc[(:2,1:) & (2:4,1)] = np.nan
# exception message:
df.loc[(:2,1:) & (2:4,1)] = np.nan
^
SyntaxError: invalid syntax
I also thought about converting the DataFrame to a NumPy ndarray, but then I wouldn't know how to apply boolean indexing in that case.
One way is to spell the requirement out as a mapping and assign, to keep it clear:
d = {1: 4, 2: 2}  # column position -> number of leading rows to set to NaN
for col, val in d.items():
    df.iloc[:val, col] = np.nan
print(df)
0 1 2
0 -1.102888 NaN NaN
1 -1.826924 NaN NaN
2 1.015479 NaN -0.228613
3 -0.760368 NaN -0.259946
4 0.496348 0.437496 0.646149
5 0.717212 0.481687 -2.640917
6 -0.141584 -1.997986 1.226350
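If the positions are fixed and this few, two direct assignments do the same thing without the loop (same df as above):
df.iloc[:4, 1] = np.nan   # first to fourth row of the second column
df.iloc[:2, 2] = np.nan   # first and second row of the third column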

Fill missing values with the most common value in the grouped form

Could anybody help me fill missing values with the most common value, but in grouped form? Here I want to fill the missing values of the cylinders column within cars of the same model.
I tried this :
sh_cars['cylinders']=sh_cars['cylinders'].fillna(sh_cars.groupby('model')['cylinders'].agg(pd.Series.mode))
and other variants, but I got error messages every time.
Thanks in advance.
I think the problem is that some (or all) groups contain only NaNs, so an error is raised. A possible solution is to use a custom function with GroupBy.transform, which returns a Series the same size as the original DataFrame:
data = {'model': ['a', 'a', 'a', 'a', 'b', 'b', 'a'],
        'cylinders': [2, 9, 9, np.nan, np.nan, np.nan, np.nan]}
sh_cars = pd.DataFrame(data)
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = sh_cars.groupby('model')['cylinders'].transform(f)
sh_cars['new'] = sh_cars['cylinders'].fillna(s)
print(sh_cars)
model cylinders new
0 a 2.0 2.0
1 a 9.0 9.0
2 a 9.0 9.0
3 a NaN 9.0
4 b NaN NaN
5 b NaN NaN
6 a NaN 9.0
Replace original column:
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = sh_cars.groupby('model')['cylinders'].transform(f)
sh_cars['cylinders'] = sh_cars['cylinders'].fillna(s)
print(sh_cars)
model cylinders
0 a 2.0
1 a 9.0
2 a 9.0
3 a 9.0
4 b NaN
5 b NaN
6 a 9.0
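If you also want to fill the groups that are entirely NaN (model b here), one possible follow-up, assuming a fallback to the column-wide mode is acceptable:
# Fall back to the overall mode for groups that had no values at all.
sh_cars['cylinders'] = sh_cars['cylinders'].fillna(sh_cars['cylinders'].mode().iat[0])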

Pandas lambda function won't recognize NaN

I'm trying to compute a new column in a DataFrame from the values of two others, but if a value is missing I want to use a different expression.
df_merge["3"] = df_merge.apply(lambda row: row["1"] + row["2"]
if pd.isnull(row["1"]) or pd.isnull(row["2"])
else (row["1"] + row["2"])/2,
axis=1)
loc 1 2 3
0 135200 0.391 0.224 0.3075
1 135210 0.400 0.220 0.3100
95 136150 NaN 0.505 NaN
96 136160 NaN 0.527 NaN
This is what I got. So if 1 or 2 is null I want to use the first expression, otherwise the last one.
However, the first expression never seems to be used. If I test, for example:
pd.isnull(df_merge.iloc[96,3])
It evaluates to True, so why isn't the first expression used in that instance?
I also tried:
df_merge["3"].fillna(value=df_merge["1"] + df_merge["2"],inplace=True)
Which did exactly nothing.
Sincerely,
Fredrik
The simplest option here is to take the mean across the rows, because mean in pandas omits NaNs by default (unless all values are NaN, as in the second row below):
df_merge = pd.DataFrame({'1': [np.nan, np.nan, 1, 2],
                         '2': [5, np.nan, np.nan, 4]})
df_merge["3"] = df_merge[["1", "2"]].mean(axis=1)
print(df_merge)
1 2 3
0 NaN 5.0 5.0
1 NaN NaN NaN
2 1.0 NaN 1.0
3 2.0 4.0 3.0
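For reference, the original apply does take the first branch on those rows, but NaN + 0.527 is still NaN, so it looks as if the condition never fires. A sketch of the apply version with the presumably intended fallback (keep the non-missing value when only one side is NaN):
df_merge["3"] = df_merge.apply(
    lambda row: row[["1", "2"]].sum(min_count=1)  # NaN only if both are missing
    if pd.isnull(row["1"]) or pd.isnull(row["2"])
    else (row["1"] + row["2"]) / 2,
    axis=1)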

Python Pandas: Breaking a list or series into columns of different sizes

I have a single series (its index and values print as two columns) that looks like
1 5.3
2 2.5
3 1.6
4 3.8
5 2.8
...and so on. I would like to take this series and break it into 6 columns of different sizes. So (for example) the first column would have 30 items, the next 31, the next 28, and so on. I have seen plenty of examples for same-sized columns but have not seen a way to make multiple custom-sized columns.
Based on the comments, you can use the index of the series to fill your DataFrame:
s = pd.Series([5, 2, 1, 3, 2])
df = pd.DataFrame([], index=s.index)
df['col1'] = s.loc[:2]
df['col2'] = s.loc[3:3]
df['col3'] = s.loc[4:]
Result:
col1 col2 col3
0 5.0 NaN NaN
1 2.0 NaN NaN
2 1.0 NaN NaN
3 NaN 3.0 NaN
4 NaN NaN 2.0
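For the actual use case (six columns with sizes such as 30, 31, 28, ...), the same idea generalizes with np.split; a sketch assuming sizes holds your six section lengths and s contains sum(sizes) elements:
sizes = [30, 31, 28, 30, 29, 30]  # assumed section lengths
parts = np.split(s.to_numpy(), np.cumsum(sizes)[:-1])
df = pd.DataFrame({f'col{i+1}': pd.Series(p) for i, p in enumerate(parts)})
Unlike the slicing above, this top-aligns every column at row 0 instead of keeping the original index positions.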

Pandas DataFrame: most data in columns are 'float' , I want to delete the row which is 'str'

wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]}
I want to delete the row with 'hhh', because all the other data in 'a' are numbers.
The original data size is huge. Thank you very much.
Option 1
Convert a using pd.to_numeric
df.a = pd.to_numeric(df.a, errors='coerce')
df
a b
0 NaN 1.0
1 2.0 2.0
2 3.0 NaN
3 4.0 NaN
4 5.0 5.0
Non-numeric values are coerced to NaN. You can then drop this row -
df.dropna(subset=['a'])
a b
1 2.0 2.0
2 3.0 NaN
3 4.0 NaN
4 5.0 5.0
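If you would rather keep column a exactly as it is (not converted to float) and only drop the offending rows, the coerced version can serve purely as a boolean mask:
df[pd.to_numeric(df.a, errors='coerce').notna()]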
Option 2
Another alternative is using str.isdigit -
df.a.str.isdigit()
0 False
1 NaN
2 NaN
3 NaN
4 NaN
Name: a, dtype: object
Filter as such -
df[df.a.str.isdigit().isnull()]
a b
1 2 2.0
2 3 NaN
3 4 NaN
4 5 5.0
Notes -
This won't work for float columns
If the numbers are stored as strings, then drop the isnull bit -
df[df.a.str.isdigit()]
import pandas as pd
import numpy as np
wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})
#wu = wu[wu.a.str.contains('\d+',na=False)]
#wu = wu[wu.a.apply(lambda x: x.isnumeric())]
wu = wu[wu.a.apply(lambda x: isinstance(x, (int, np.int64)))]
print(wu)
Note that you missed out a closing parenthesis when creating your DataFrame.
I tried 3 ways, but only the third one worked. You can always try the other ones (commented out) if that works for you. Do let me know if it works on the larger dataset.
df = pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})
df.drop(df[df['a'].apply(type) != int].index, inplace=True)
If you just want to view the rows that would be dropped:
df.loc[df['a'].apply(type) != int, :]
