Inserting blank row pandas dataframe - python

I have a column called 'factor', and each time the value in that column changes I would like to insert a blank row. Is this possible? This is as far as I got:
for i in range(0, end):
    if df2.at[i + 1, 'factor'] != df2.at[i, 'factor']:
        # ... insert a blank row here, somehow

Inserting rows one at a time inside a for loop is inefficient. As an alternative, find the indices where the value changes, build a second dataframe of blank rows at half-step index positions, concatenate, then sort by index:
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 1], [3, 2], [4, 2],
                   [5, 2], [6, 3]], columns=['A', 'B'])

# True on the last row of each run of equal 'B' values
switches = df['B'].ne(df['B'].shift(-1))
idx = switches[switches].index

# blank rows indexed between (and after) the existing integer labels
df_new = pd.DataFrame(index=idx + 0.5)
df = pd.concat([df, df_new]).sort_index()
print(df)
A B
0.0 1.0 1.0
1.0 2.0 1.0
1.5 NaN NaN
2.0 3.0 2.0
3.0 4.0 2.0
4.0 5.0 2.0
4.5 NaN NaN
5.0 6.0 3.0
5.5 NaN NaN
If necessary, you can use reset_index to normalize the index:
print(df.reset_index(drop=True))
A B
0 1.0 1.0
1 2.0 1.0
2 NaN NaN
3 3.0 2.0
4 4.0 2.0
5 5.0 2.0
6 NaN NaN
7 6.0 3.0
8 NaN NaN
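Applied to the asker's problem (a sketch, assuming df2 holds a 'factor' column as in the question), the same idea might look like this:
import pandas as pd

df2 = pd.DataFrame({'factor': list('aaabbccdd')})

# True on the last row of each run of equal 'factor' values
switches = df2['factor'].ne(df2['factor'].shift(-1))
idx = switches[switches].index[:-1]  # drop the final edge so no trailing blank row is added

# blank rows at half-step positions, then concatenate and restore order
blanks = pd.DataFrame(index=idx + 0.5, columns=df2.columns)
df2 = pd.concat([df2, blanks]).sort_index().reset_index(drop=True)
print(df2)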

Alternatively, reindex by the union of the original index and the change-point indices shifted by 0.5:
df2 = pd.DataFrame({'factor':list('aaabbccdd')})
idx = df2.index.union(df2.index[df2['factor'].shift(-1).ne(df2['factor'])] + .5)[:-1]
print (idx)
Float64Index([0.0, 1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0, 6.0, 6.5, 7.0, 8.0], dtype='float64')
df2 = df2.reindex(idx, fill_value='').reset_index(drop=True)
print (df2)
factor
0 a
1 a
2 a
3
4 b
5 b
6
7 c
8 c
9
10 d
11 d
If you want missing values (NaN) instead of empty strings:
df2 = df2.reindex(idx).reset_index(drop=True)
print (df2)
factor
0 a
1 a
2 a
3 NaN
4 b
5 b
6 NaN
7 c
8 c
9 NaN
10 d
11 d
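A version note (my addition, not part of the original answer): in pandas 2.0 and later, Float64Index was removed in favour of a plain Index with float64 dtype. The reindex code above still works unchanged; only the printed repr of idx differs.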

Related

Count NaN windows (and their size) in DataFrame columns

I have HUGE dataframes (millions, tens of millions of rows) with a lot of missing (NaN) values along the columns.
I need to count the windows of NaNs and their sizes, for every column, in the fastest way possible (my code is too slow).
Something like this, going from here:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})
df
Out[65]:
a b c
0 1.0 NaN NaN
1 2.0 2.0 2.0
2 NaN 1.0 1.0
3 NaN 1.0 NaN
4 3.0 3.0 3.0
5 3.0 3.0 3.0
6 NaN NaN NaN
7 4.0 NaN NaN
8 NaN 2.0 2.0
9 NaN NaN 8.0
To here:
result
Out[61]:
a b c
0 2 1 1
1 1 2 1
2 2 1 2
Here's one way to do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})

df_n = df.isnull()  # same as building the boolean frame column by column
pr = {}
for column_name in df_n.columns:
    # first row of each NaN run: NaN here, but not in the previous row
    fst = df_n.index[df_n[column_name] & ~df_n[column_name].shift(1, fill_value=False)]
    # last row of each NaN run: NaN here, but not in the next row
    lst = df_n.index[df_n[column_name] & ~df_n[column_name].shift(-1, fill_value=False)]
    pr[column_name] = [j - i + 1 for i, j in zip(fst, lst)]
df_new = pd.DataFrame(pr)
Output:
a b c
0 2 1 1
1 1 2 1
2 2 1 2
Try this one (example only for column a; do the same for the other columns):
>>> df = df.assign(a_count_sum=0)
>>> df.loc[np.isnan(df["a"]), "a_count_sum"] = df.groupby(np.isnan(df.a)).cumcount() + 1
>>> df
a b c a_count_sum
0 1.0 NaN NaN 0
1 2.0 2.0 2.0 0
2 NaN 1.0 1.0 1
3 NaN 1.0 NaN 2
4 3.0 3.0 3.0 0
5 3.0 3.0 3.0 0
6 NaN NaN NaN 3
7 4.0 NaN NaN 0
8 NaN 2.0 2.0 4
9 NaN NaN 8.0 5
>>> res_1 = df["a_count_sum"][((df["a_count_sum"].shift(-1) == 0) | (np.isnan(df["a_count_sum"].shift(-1)))) & (df["a_count_sum"]!=0)]
>>> res_1
3 2
6 3
9 5
Name: a_count_sum, dtype: int64
>>> res_2 = (-res_1.shift(1).fillna(0)).astype(np.int64)
>>> res_2
3 0
6 -2
9 -3
Name: a_count_sum, dtype: int64
>>> res=res_1+res_2
>>> res
3 2
6 1
9 2
Name: a_count_sum, dtype: int64
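For very large frames, a fully vectorized variant (my own sketch, not taken from the answers above) can avoid per-run Python loops by labelling each NaN run with a cumulative-sum id and grouping on it. It assumes, as the expected output does, that every column has the same number of NaN runs:
import pandas as pd
import numpy as np

def nan_run_lengths(s):
    # length of each consecutive NaN run in s, in order of appearance
    isna = s.isna()
    # a run starts where the value is NaN and the previous one was not
    run_id = (isna & ~isna.shift(fill_value=False)).cumsum()
    sizes = isna.groupby(run_id).sum()
    return sizes[sizes > 0].reset_index(drop=True)  # drop the leading non-NaN group, if any

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})

result = pd.DataFrame({c: nan_run_lengths(df[c]).values for c in df.columns})
print(result)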

pandas - add missing rows on the basis of column values to have linspace

I have a pandas dataframe like
a b c
0 0.5 10 7
1 1.0 6 6
2 2.0 1 7
3 2.5 6 -5
4 3.5 9 7
and I would like to fill in the missing rows with respect to column 'a', given a certain step. In this case, with a step of 0.5, the missing 'a' values are 1.5 and 3.0; I want to add rows for them with the other columns set to null, to obtain the following result:
a b c
0 0.5 10.0 7.0
1 1.0 6.0 6.0
2 1.5 NaN NaN
3 2.0 1.0 7.0
4 2.5 6.0 -5.0
5 3.0 NaN NaN
6 3.5 9.0 7.0
What is the cleanest way to do this with pandas or other libraries like numpy or scipy?
Thanks!
Create the target values with numpy.arange, move 'a' into the index with set_index, then reindex and reset_index:
step = 0.5
idx = np.arange(df['a'].min(), df['a'].max() + step, step)
df = df.set_index('a').reindex(idx).reset_index()
print (df)
a b c
0 0.5 10.0 7.0
1 1.0 6.0 6.0
2 1.5 NaN NaN
3 2.0 1.0 7.0
4 2.5 6.0 -5.0
5 3.0 NaN NaN
6 3.5 9.0 7.0
One simple way to achieve this is to first create the index you want and then merge the remaining information onto it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.5, 1, 2, 2.5, 3.5],
                   'b': [10, 6, 1, 6, 9],
                   'c': [7, 6, 7, -5, 7]})
ls = np.arange(df.a.min(), df.a.max() + 0.5, 0.5)  # + step so the end point 3.5 is included
new_df = pd.DataFrame({'a': ls})
new_df = new_df.merge(df, on='a', how='left')
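One caveat (my addition, not from the original answer): merging on a float column relies on exact binary equality, and values produced by np.arange do not always match floats computed or parsed elsewhere. Rounding both keys first is a common safeguard:
# round both float keys to a fixed precision before merging (illustrative safeguard)
ls = np.arange(df.a.min(), df.a.max() + 0.5, 0.5).round(6)
new_df = pd.DataFrame({'a': ls}).merge(df.assign(a=df.a.round(6)), on='a', how='left')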

How Can I replace NaN in a row with values in another row Pandas

I tried several methods to replace NaN in a row with values in another row, but none of them worked as expected. Here is my Dataframe:
import numpy as np
import pandas as pd

test = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [4, 5, 6, np.nan, np.nan],
        "c": [7, 8, 9, np.nan, np.nan],
        "d": [7, 8, 9, np.nan, np.nan],
    }
)
a b c d
0 1 4.0 7.0 7.0
1 2 5.0 8.0 8.0
2 3 6.0 9.0 9.0
3 4 NaN NaN NaN
4 5 NaN NaN NaN
I need to replace the NaNs in the 4th row with the values from the first row, i.e.:
a b c d
0 1 **4.0 7.0 7.0**
1 2 5.0 8.0 8.0
2 3 6.0 9.0 9.0
3 4 **4.0 7.0 7.0**
4 5 NaN NaN NaN
And the second question is: how can I multiply some of the values in a row by a number? For example, I need to double the values in the second row for the columns ['b', 'c', 'd'], so that the result is:
a b c d
0 1 4.0 7.0 7.0
1 2 **10.0 16.0 16.0**
2 3 6.0 9.0 9.0
3 4 NaN NaN NaN
4 5 NaN NaN NaN
First of all, I suggest you do some reading on indexing and selecting data in pandas.
Regarding the first question, you can use .loc[] with isnull() to perform boolean indexing on the column values:
mask_nans = test.loc[3, :].isnull()
test.loc[3, mask_nans] = test.loc[0, mask_nans]
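Note, in case it is not obvious (my addition): the 3 and 0 here are index labels, not positions; with a non-default index you would use the actual row labels instead.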
And to double the values you can multiply the sliced dataframe by 2 directly, also using .loc[]:
test.loc[1, 'b':] *= 2
a b c d
0 1 4.0 7.0 7.0
1 2 10.0 16.0 16.0
2 3 6.0 9.0 9.0
3 4 4.0 7.0 7.0
4 5 NaN NaN NaN
Indexing with labels
If you wish to filter by a, and a values are unique, consider making it your index to simplify your logic and make it more efficient:
test = test.set_index('a')
test.loc[4] = test.loc[4].fillna(test.loc[1])
test.loc[2] *= 2
Boolean masks
If a is not unique and Boolean masks are required, you can still use fillna with an additional step:
mask = test['a'].eq(4)
test.loc[mask] = test.loc[mask].fillna(test.loc[test['a'].eq(1).idxmax()])
test.loc[test['a'].eq(2)] *= 2
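Putting the label-based variant together end to end (a runnable sketch using the question's frame):
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [4, 5, 6, np.nan, np.nan],
    "c": [7, 8, 9, np.nan, np.nan],
    "d": [7, 8, 9, np.nan, np.nan],
})

test = test.set_index('a')
test.loc[4] = test.loc[4].fillna(test.loc[1])  # fill row a=4 from row a=1
test.loc[2] *= 2                               # double row a=2
print(test.reset_index())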

Split pandas dataframe into multiple dataframes based on null columns

I have a pandas dataframe as follows:
a b c
0 1.0 NaN NaN
1 NaN 7.0 5.0
2 3.0 8.0 3.0
3 4.0 9.0 2.0
4 5.0 0.0 NaN
Is there a simple way to split the dataframe into multiple dataframes based on non-null values?
a
0 1.0
b c
1 7.0 5.0
a b c
2 3.0 8.0 3.0
3 4.0 9.0 2.0
a b
4 5.0 0.0
Using groupby with dropna
for _, x in df.groupby(df.isnull().dot(df.columns)):
    print(x.dropna(axis=1))
a b c
2 3.0 8.0 3.0
3 4.0 9.0 2.0
b c
1 7.0 5.0
a
0 1.0
a b
4 5.0 0.0
We can save them in a dict:
d = {y: x.dropna(axis=1) for y, x in df.groupby(df.isnull().dot(df.columns))}
More info: the dot product builds, for each row, a string made of the names of its null columns; rows with the same string belong together:
df.isnull().dot(df.columns)
Out[1250]:
0 bc
1 a
2
3
4 c
dtype: object
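Why this works (my gloss, not part of the original answer): in the dot product each boolean is multiplied by a column name, and True * 'a' gives 'a' while False * 'a' gives '', so the row-wise sum concatenates exactly the names of the null columns. An equivalent, more explicit spelling might be:
# build the same per-row key of null column names without dot (illustrative)
key = df.isnull().apply(lambda r: ''.join(df.columns[r]), axis=1)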
So here is a possible solution:
import pandas as pd
import numpy as np

def getMap(some_list):
    # encode a row's NaN pattern as a string key, e.g. '011'
    return "".join(["1" if np.isnan(x) else "0" for x in some_list])

df = pd.DataFrame([[1, np.nan, np.nan], [np.nan, 7, 5], [3, 8, 3], [4, 9, 2], [5, 0, np.nan]])
print(df.head())

x = df[[0, 1, 2]].apply(lambda row: row.tolist(), axis=1).tolist()
nullMap = [getMap(y) for y in x]
nullSet = set(nullMap)

# collect the non-NaN values of each row under its pattern key
some_dict = {y: [] for y in nullSet}
for y in x:
    some_dict[getMap(y)].append([z for z in y if not np.isnan(z)])
dfs = [pd.DataFrame(y) for y in some_dict.values()]
for df_part in dfs:
    print(df_part)
This gives the grouping you asked for (shown here with the original column letters for readability; the reconstructed frames themselves carry default integer labels). :)
a
1.0
b c
7.0 5.0
a b c
3.0 8.0 3.0
4.0 9.0 2.0
a b
5.0 0.0

Combine two rows in Dataframe using a Unique Value

I converted a list into a Dataframe and now my data looks like this.
I want to use the unique Business ID to merge two rows in this Dataframe. How can I do this?
Use first in a groupby to get the first non-null value in each column.
Consider the data frame df
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    Bars=[np.nan, 1, 1, np.nan],
    BusID=list('AABB'),
    Nightlife=[1, np.nan, np.nan, 1]
))
df
Bars BusID Nightlife
0 NaN A 1.0
1 1.0 A NaN
2 1.0 B NaN
3 NaN B 1.0
Then
df.groupby('BusID', as_index=False).first()
BusID Bars Nightlife
0 A 1.0 1.0
1 B 1.0 1.0
You could use something like df.groupby('Business ID').sum(). As an example:
df = pd.DataFrame(data={'a': [1, 2, 3, 1],
                        'b': [5, 6, None, None],
                        'c': [None, None, 7, 8]})
df
# a b c
# 0 1 5.0 NaN
# 1 2 6.0 NaN
# 2 3 NaN 7.0
# 3 1 NaN 8.0
new_df = df.groupby('a').sum()
new_df
# b c
# a
# 1 5.0 8.0
# 2 6.0 0.0
# 3 0.0 7.0
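One caveat worth noting (my addition): sum() turns all-missing groups into 0.0, as the 0.0 entries above show, whereas groupby(...).first() keeps NaN. Passing min_count=1 makes sum behave similarly:
# sum with min_count=1 keeps NaN for groups that have no values at all
new_df = df.groupby('a').sum(min_count=1)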
