Pass arguments to a function when using apply on a pandas Series - python

I want to pass an argument (dropna=False) to value_counts when using apply on a pandas DataFrame:
columns = ['a','c']
df = pd.DataFrame({'a':[1,2,2,np.nan], 'b':[2,3,4,3], 'c': [4,np.nan,6,4]})
print (df.apply(pd.Series.value_counts)) #this works
print (df['a'].value_counts(dropna=False)) #this works
print (df.apply(pd.Series.value_counts(value_counts=False))) #combining doesn't
OUT:
a b c
1.0 1.0 NaN NaN
2.0 2.0 1.0 NaN
3.0 NaN 2.0 NaN
4.0 NaN 1.0 2.0
6.0 NaN NaN 1.0
2.0 2
1.0 1
NaN 1
Name: a, dtype: int64
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [109], in <cell line: 5>()
3 print (df.apply(pd.Series.value_counts))
4 print (df['a'].value_counts(dropna=False))
----> 5 print (df.apply(pd.Series.value_counts(value_counts=False)))
TypeError: value_counts() got an unexpected keyword argument 'value_counts'

IIUC, you need to pass dropna=False as an additional argument to apply:
print (df.apply(pd.Series.value_counts, dropna=False))
a b c
1.0 1.0 NaN NaN
2.0 2.0 1.0 NaN
3.0 NaN 2.0 NaN
4.0 NaN 1.0 2.0
6.0 NaN NaN 1.0
NaN 1.0 NaN 1.0
Or use a lambda function:
print (df.apply(lambda x: x.value_counts(dropna=False)))
a b c
1.0 1.0 NaN NaN
2.0 2.0 1.0 NaN
3.0 NaN 2.0 NaN
4.0 NaN 1.0 2.0
6.0 NaN NaN 1.0
NaN 1.0 NaN 1.0
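As a side note, apply forwards any extra keyword arguments to the applied function, which is why the first form works. An equivalent sketch (not from the original answer) binds the argument up front with functools.partial:
from functools import partial

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, np.nan],
                   'b': [2, 3, 4, 3],
                   'c': [4, np.nan, 6, 4]})

# Bind dropna=False before handing the function to apply.
print(df.apply(partial(pd.Series.value_counts, dropna=False)))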

Related

Dataframe compare, combine and merge for rectangular meshgrid

I have two dataframes shown below:
df_1 =
Lon Lat N
0 2 1 1
1 2 2 3
2 2 3 1
3 3 2 2
and
df_2 =
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 NaN
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 NaN
6 3.0 2.0 NaN
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 NaN
10 3.0 3.0 NaN
11 4.0 3.0 NaN
What I want to do is to compare these two dfs and merge them according to Lon and Lat. That is to say, the NaN values in df_2 should be filled with the values from df_1 wherever the corresponding Lon and Lat are identical. The ideal output should be:
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3
6 3.0 2.0 2
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1
10 3.0 3.0 NaN
11 4.0 3.0 NaN
The reason I want to do this is that df_1's Lon and Lat coordinates form a non-rectangular (unstructured) grid, and I need to fill in some NaN values so as to get a rectangular meshgrid and make contourf applicable. It would be highly appreciated if you can suggest better ways to make the contour plot.
I have tried df_2.combine_first(df_1), but it doesn't work.
Thanks!
df_2.drop(columns='N').merge(df_1, on=['Lon', 'Lat'], how='left')
Lon Lat N
0 1.0 1.0 NaN
1 2.0 1.0 1.0
2 3.0 1.0 NaN
3 4.0 1.0 NaN
4 1.0 2.0 NaN
5 2.0 2.0 3.0
6 3.0 2.0 2.0
7 4.0 2.0 NaN
8 1.0 3.0 NaN
9 2.0 3.0 1.0
10 3.0 3.0 NaN
11 4.0 3.0 NaN
If you first create df_2 with all the needed coordinates, you can update it with the other DataFrame by using pandas.DataFrame.update.
For this you need to first set the correct index by using pandas.DataFrame.set_index.
Have a look at this post for more information.
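A minimal sketch of that idea (the frames are rebuilt here for illustration; note that plain df_2.combine_first(df_1) only aligned on the default row index, which is why it did not appear to work):
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'Lon': [2.0, 2.0, 2.0, 3.0],
                     'Lat': [1.0, 2.0, 3.0, 2.0],
                     'N': [1, 3, 1, 2]})

# df_2 already holds every (Lon, Lat) pair of the rectangular grid.
lon, lat = np.meshgrid([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
df_2 = pd.DataFrame({'Lon': lon.ravel(), 'Lat': lat.ravel(), 'N': np.nan})

# Align both frames on (Lon, Lat) and let update fill the NaNs in place.
out = df_2.set_index(['Lon', 'Lat'])
out.update(df_1.set_index(['Lon', 'Lat']))
print(out.reset_index())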

How to add the values of one smaller DataFrame to part of another mixed type DataFrame, but only to rows after some arbitrary row index?

I have two .csv files. The first contains what could be described as a header and a body: the header holds data like the total number of rows, the datetime, which application generated the data, and what line the body starts on. The second file contains a single row.
>>> import pandas as pd
>>> df = pd.read_csv("data.csv", names=list('abcdef'))
>>> df
a b c d e f
0 data start row 5 NaN NaN NaN NaN
1 row count 7 NaN NaN NaN NaN
2 made by foo.exe NaN NaN NaN NaN
3 date 01-01-2000 NaN NaN NaN NaN
4 a b c d e f
5 0.0 1.0 2.0 3.0 4.0 5.0
6 0.0 1.0 2.0 3.0 4.0 5.0
7 0.0 1.0 2.0 3.0 4.0 5.0
8 0.0 1.0 2.0 3.0 4.0 5.0
9 0.0 1.0 2.0 3.0 4.0 5.0
10 0.0 1.0 2.0 3.0 4.0 5.0
11 0.0 1.0 2.0 3.0 4.0 5.0
>>> df2 = pd.read_csv("extra_data.csv")
>>> df2
a b c
0 6.0 5.0 4.0
>>> row = df2.loc[0]
>>>
I am having trouble modifying the 'a', 'b' and 'c' columns and then saving the DataFrame to a new .csv file.
I have tried adding the row by way of slicing and the addition operator but this did not work:
>>> df[5:,'a':'c'] += row
TypeError: '(slice(5, None, None), slice('a', 'c', None))' is an invalid key
>>>
I also tried the answer I found here, but this gave a similar error:
>>> df[5:,row.index] += row
TypeError: '(slice(5, None, None), Index(['a', 'b', 'c'], dtype='object'))' is an invalid key
>>>
I suspect the problem comes from the object dtype, so I tried converting a sub-frame to float:
>>> sub_section = df.loc[5:,['a','b','c']].astype(float)
>>> sub_section
a b c
5 0.0 1.0 2.0
6 0.0 1.0 2.0
7 0.0 1.0 2.0
8 0.0 1.0 2.0
9 0.0 1.0 2.0
10 0.0 1.0 2.0
11 0.0 1.0 2.0
>>> sub_section += row
>>> sub_section
a b c
5 6.0 6.0 6.0
6 6.0 6.0 6.0
7 6.0 6.0 6.0
8 6.0 6.0 6.0
9 6.0 6.0 6.0
10 6.0 6.0 6.0
11 6.0 6.0 6.0
>>> df
a b c d e f
0 data start row 5 NaN NaN NaN NaN
1 row count 7 NaN NaN NaN NaN
2 made by foo.exe NaN NaN NaN NaN
3 date 01-01-2000 NaN NaN NaN NaN
4 a b c d e f
5 0.0 1.0 2.0 3.0 4.0 5.0
6 0.0 1.0 2.0 3.0 4.0 5.0
7 0.0 1.0 2.0 3.0 4.0 5.0
8 0.0 1.0 2.0 3.0 4.0 5.0
9 0.0 1.0 2.0 3.0 4.0 5.0
10 0.0 1.0 2.0 3.0 4.0 5.0
11 0.0 1.0 2.0 3.0 4.0 5.0
>>>
Obviously, in this case df.loc[] returns a copy, so modifying the copy does nothing to df.
How do I modify parts of a DataFrame (dtype=object) and then save the changes?
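One possible approach (a hedged sketch, not from the original thread; it continues the session above and writes to a hypothetical filename): convert the numeric block to float, add the row, and assign the result back through .loc so that df itself is modified.
cols = ['a', 'b', 'c']

# The right-hand side is the converted sub-frame with the row added
# (the Series aligns on the column labels); assigning through .loc
# writes the result back into df.
df.loc[5:, cols] = df.loc[5:, cols].astype(float) + row

# Writing without header or index keeps the original file layout.
df.to_csv("new_data.csv", header=False, index=False)  # hypothetical filename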

Concatenate three Float64 variables into one variable

I have a df with many variables, and I need to concatenate only 3 of its float variables:
v1 v2 v3
0 2.0 NaN 1.0
1 1.0 1.0 1.0
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN 2.0
df.dtypes
v1 float64
v2 float64
v3 float64
dtype: object
I need to concatenate all 3 variables into df['concatenated'] and get this result:
v1 v2 v3 concatenated
0 2.0 NaN 1.0 2.0_NaN_1.0
1 1.0 1.0 1.0 1.0_1.0_1.0
2 NaN NaN 2.0 NaN_NaN_2.0
3 NaN NaN NaN NaN_NaN_NaN
4 NaN NaN 2.0 NaN_NaN_2.0
If the capitalization of your NaNs doesn't matter, this would be sufficient:
df['concatenated'] = df.astype(str).apply('_'.join, axis=1)
>>> df
v1 v2 v3 concatenated
0 2.0 NaN 1.0 2.0_nan_1.0
1 1.0 1.0 1.0 1.0_1.0_1.0
2 NaN NaN 2.0 nan_nan_2.0
3 NaN NaN NaN nan_nan_nan
4 NaN NaN 2.0 nan_nan_2.0
If the capitalization matters, then you have to use replace beforehand:
df['concatenated'] = df.astype(str).replace('nan', 'NaN').apply('_'.join, axis=1)
>>> df
v1 v2 v3 concatenated
0 2.0 NaN 1.0 2.0_NaN_1.0
1 1.0 1.0 1.0 1.0_1.0_1.0
2 NaN NaN 2.0 NaN_NaN_2.0
3 NaN NaN NaN NaN_NaN_NaN
4 NaN NaN 2.0 NaN_NaN_2.0
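Since the real frame has more variables than v1, v2 and v3, it may be safer to select just those three columns before converting to strings (a small variation on the answer above, not from the original post):
cols = ['v1', 'v2', 'v3']
df['concatenated'] = (df[cols].astype(str)
                              .replace('nan', 'NaN')
                              .apply('_'.join, axis=1))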

Generate New DataFrame without NaN Values

I have the following DataFrame:
a b c d e
0 NaN 2.0 NaN 4.0 5.0
1 NaN 2.0 3.0 NaN 5.0
2 1.0 NaN 3.0 4.0 NaN
3 1.0 2.0 NaN 4.0 NaN
4 NaN 2.0 NaN 4.0 5.0
What I am trying to do is generate a new DataFrame without the NaN values.
Every row contains the same number of NaN values.
The final DataFrame should look like this:
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
Does someone know an easy way to do this?
Any help is appreciated.
Using array indexing:
pd.DataFrame(df.values[df.notnull().values].reshape(df.shape[0], 3),
             columns=list('xyz'), dtype=int)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
If the DataFrame has an inconsistent number of values across rows, e.g. the 1st row has 4 non-NaN values while the remaining rows have only 3, then this will do:
a b c d e g
0 NaN 2.0 NaN 4.0 5.0 6.0
1 NaN 2.0 3.0 NaN 5.0 NaN
2 1.0 NaN 3.0 4.0 NaN NaN
3 1.0 2.0 NaN 4.0 NaN NaN
4 NaN 2.0 NaN 4.0 5.0 NaN
pd.DataFrame(df.apply(lambda x: x.values[x.notnull()],axis=1).tolist())
0 1 2 3
0 2.0 4.0 5.0 6.0
1 2.0 3.0 5.0 NaN
2 1.0 3.0 4.0 NaN
3 1.0 2.0 4.0 NaN
4 2.0 4.0 5.0 NaN
Note that the NaNs in the last column cannot be removed here.
Use the justify function (a sketch of justify follows the output below) and select the first 3 columns:
df = pd.DataFrame(justify(df.values, invalid_val=np.nan)[:, :3].astype(int),
                  columns=list('xyz'),
                  index=df.index)
print (df)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
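The justify helper is not defined in the answer; a minimal sketch of the NumPy-based implementation commonly used for it (an assumption, not part of the original post):
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # Push the valid (non-invalid) values of array `a` to one side along `axis`.
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out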
If, as in your example, values increase across columns, you can sort over axis=1:
res = pd.DataFrame(np.sort(df.values, 1)[:, :3],
                   columns=list('xyz'), dtype=int)
print(res)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
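Another possibility (a hedged sketch, not among the original answers): drop the NaNs row by row and name the remaining values directly, assuming every row has exactly three non-NaN values:
res = df.apply(lambda r: pd.Series(r.dropna().to_numpy(), index=list('xyz')),
               axis=1).astype(int)
print(res)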
You can use the pandas DataFrame method df.fillna().
This method replaces NaN/NA values with the parameter you pass in:
df.fillna(value_to_replace_NaN_with)
import numpy as np
import pandas as pd
data = {
    'A': [np.nan, 2.0, np.nan, 4.0, 5.0],
    'B': [np.nan, 2.0, 3.0, np.nan, 5.0],
    'C': [1.0, np.nan, 3.0, 4.0, np.nan],
    'D': [1.0, 2.0, np.nan, 4.0, np.nan],
    'E': [np.nan, 2.0, np.nan, 4.0, 5.0]
}
df = pd.DataFrame(data)
print(df)
A B C D E
0 NaN NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 NaN 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
df = df.fillna(0) # Applying the method with parameter 0
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
If you want to apply this method to a particular column only (starting again from the original df), the syntax is:
df[column_name] = df[column_name].fillna(param)
df['A'] = df['A'].fillna(0)
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
You can also use the pandas replace() method to replace np.nan:
df = df.replace(np.nan, 0)
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
df['A'] = df['A'].replace(np.nan, 0) # Replacing only column A (again starting from the original df)
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0

replace nan in pandas dataframe

Given the dataframe df:
df = pd.DataFrame(data=[[np.nan, 1],
                        [np.nan, np.nan],
                        [1, 2],
                        [2, 3],
                        [np.nan, np.nan],
                        [np.nan, np.nan],
                        [3, 4],
                        [4, 5],
                        [np.nan, np.nan],
                        [np.nan, np.nan]], columns=['A', 'B'])
df
Out[16]:
A B
0 NaN 1.0
1 NaN NaN
2 1.0 2.0
3 2.0 3.0
4 NaN NaN
5 NaN NaN
6 3.0 4.0
7 4.0 5.0
8 NaN NaN
9 NaN NaN
I need to replace the NaNs using the following rules:
1) if the NaN is at the beginning, replace it with the first value that follows it
2) if the NaN is between two valid values, replace it with the average of those values
3) if the NaN is at the end, replace it with the last valid value
The result should be:
A B
0 1.0 1.0
1 1.0 1.5
2 1.0 2.0
3 2.0 3.0
4 2.5 3.5
5 2.5 3.5
6 3.0 4.0
7 4.0 5.0
8 4.0 5.0
9 4.0 5.0
Add the forward-filled and back-filled values, divide by 2, and finally fill the remaining leading and trailing NaNs:
df = df.bfill().add(df.ffill()).div(2).ffill().bfill()
print (df)
A B
0 1.0 1.0
1 1.0 1.5
2 1.0 2.0
3 2.0 3.0
4 2.5 3.5
5 2.5 3.5
6 3.0 4.0
7 4.0 5.0
8 4.0 5.0
9 4.0 5.0
Detail:
print (df.bfill().add(df.ffill()))
A B
0 NaN 2.0
1 NaN 3.0
2 2.0 4.0
3 4.0 6.0
4 5.0 7.0
5 5.0 7.0
6 6.0 8.0
7 8.0 10.0
8 NaN NaN
9 NaN NaN
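A step-by-step view of the same one-liner, with each intermediate kept in its own variable:
fwd = df.ffill()             # carry the previous valid value forward
bwd = df.bfill()             # carry the next valid value backward
avg = fwd.add(bwd).div(2)    # an interior NaN becomes the mean of its two neighbours
out = avg.ffill().bfill()    # leading/trailing NaNs take the nearest valid value
print(out)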
