In an existing dataframe, how can I add a column of new values that start at a specific index, extending the dataframe's index as needed?
As in this example: place the new values starting at index 2 and running through index 6:
Dataframe:
df = pd.DataFrame({
'Col1':[1, 1, 1, 1],
'Col2':[2, 2, 2, 2]
})
df
Output:
Col1 Col2
0 1 2
1 1 2
2 1 2
3 1 2
New values:
new_values = [3, 3, 3, 3, 3]
Desired Result:
Col1 Col2 Col3
0 1 2 NaN
1 1 2 NaN
2 1 2 3
3 1 2 3
4 NaN NaN 3
5 NaN NaN 3
6 NaN NaN 3
First create a new list, prepending as many NaN values as the offset you want.
Then do a concat.
You can set the Series name when you concatenate it, and that becomes the new column name.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Col1':[1, 1, 1, 1],
'Col2':[2, 2, 2, 2]
})
new_values = [3, 3, 3, 3, 3]
offset = 2 # set your offset here
new_values = [np.nan] * offset + new_values # looks like [nan, nan, 3, 3, ... ]
new = pd.concat([df, pd.Series(new_values).rename('Col3')], axis=1)
new now looks like this:
Col1 Col2 Col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
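Note that Col1 and Col2 are upcast to float because NaN is itself a float. If you need integer values back, pandas' nullable integer dtype (available since pandas 0.24) is one option:
new[['Col1', 'Col2']] = new[['Col1', 'Col2']].astype('Int64')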
The answer by @anarchy works nicely; here is a similar approach:
df = pd.DataFrame({
'Col1':[1, 1, 1, 1],
'Col2':[2, 2, 2, 2]
})
specific_index_offset = 2
new_data = [3,3,3,3,3]
new_df = pd.DataFrame(
data=[[e] for e in new_data],
columns=['Col3'],
index=range(specific_index_offset, specific_index_offset + len(new_data))
)
desired_df = pd.concat([df, new_df], axis=1)
You can also make the changes at the list level before building the new column:
n = 2 #offset here
xs = [None] * n
new_values = [3, 3, 3, 3, 3]
new = xs + new_values
#create new df
df2 = pd.DataFrame({'col3':new})
df_final = pd.concat([df, df2], axis=1)
print(df_final)
Output:
Col1 Col2 col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
Another option is to create a Series with the specific index, before concatenating:
new_values = [3, 3, 3, 3, 3]
arr = pd.Series(new_values, index=range(2, 7), name='Col3')
pd.concat([df, arr], axis=1)
Col1 Col2 Col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
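To avoid hard-coding the index bounds, you can derive them from the offset and the list length (reusing the offset variable from the first answer):
offset = 2
arr = pd.Series(new_values, index=range(offset, offset + len(new_values)), name='Col3')
pd.concat([df, arr], axis=1)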
Related
I have a dataframe with duplicate column names. I want to remove the repeated columns, but I need to keep the values.
I want to remove the C and D columns at the end, but move their values into the first C and D columns on the same row.
df = df.loc[:,~df.columns.duplicated(keep='first')]
I tried this code; it removes the duplicate columns and keeps the first, but it also drops the values.
Example
First, a minimal and reproducible example for the answer:
data = [[0, 1, 2, 3, None, None],
[1, None, 3, None, 2, 4],
[2, 3, 4, 5, None, None]]
df = pd.DataFrame(data, columns=list('ABCDBD'))
df
A B C D B D
0 0 1.0 2 3.0 NaN NaN
1 1 NaN 3 NaN 2.0 4.0
2 2 3.0 4 5.0 NaN NaN
Code
df.groupby(level=0, axis=1).first()
result:
A B C D
0 0.0 1.0 2.0 3.0
1 1.0 2.0 3.0 4.0
2 2.0 3.0 4.0 5.0
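Note that groupby(..., axis=1) is deprecated in pandas 2.x. If that matters for you, one equivalent (a sketch, not from the answer above) is to transpose, group on the index, and transpose back:
df.T.groupby(level=0).first().T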
I have HUGE dataframes (millions, even tens of millions of rows) and lots of missing values (NaNs) along the columns.
I need to count the windows of NaNs and their size, for every column, in the fastest way possible (my code is too slow).
Something like this, from here:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2, np.nan, np.nan,3,3,np.nan,4,np.nan,np.nan],\
'b':[np.nan, 2, 1, 1, 3, 3, np.nan, np.nan,2, np.nan],\
'c':[np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan,2, 8]})
df
Out[65]:
a b c
0 1.0 NaN NaN
1 2.0 2.0 2.0
2 NaN 1.0 1.0
3 NaN 1.0 NaN
4 3.0 3.0 3.0
5 3.0 3.0 3.0
6 NaN NaN NaN
7 4.0 NaN NaN
8 NaN 2.0 2.0
9 NaN NaN 8.0
To here:
result
Out[61]:
a b c
0 2 1 1
1 1 2 1
2 2 1 2
Here's one way to do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2, np.nan, np.nan,3,3,np.nan,4,np.nan,np.nan],\
'b':[np.nan, 2, 1, 1, 3, 3, np.nan, np.nan,2, np.nan],\
'c':[np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan,2, 8]})
df_n = pd.DataFrame({'a': df['a'].isnull().values,
                     'b': df['b'].isnull().values,
                     'c': df['c'].isnull().values})  # boolean masks: True where NaN
pr = {}
for column_name, _ in df_n.items():  # iteritems() was removed in pandas 2.0
    # first index of each NaN run: NaN here, but not on the previous row
    fst = df_n.index[df_n[column_name] & ~df_n[column_name].shift(1).fillna(False)]
    # last index of each NaN run: NaN here, but not on the next row
    lst = df_n.index[df_n[column_name] & ~df_n[column_name].shift(-1).fillna(False)]
    pr[column_name] = [j - i + 1 for i, j in zip(fst, lst)]
df_new = pd.DataFrame(pr)
Output:
a b c
0 2 1 1
1 1 2 1
2 2 1 2
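Given the question's emphasis on speed, here is a more vectorized sketch of the same run-length idea, labeling each NaN run with a cumsum (the names run_id and sizes are illustrative, not from the answer above):
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})

mask = df.isna()
# each non-NaN row bumps the counter, so every NaN run shares a single label
run_id = (~mask).cumsum()

# for every column: count NaNs per label, keep only the non-empty runs
sizes = {col: mask[col].groupby(run_id[col]).sum()
                       .pipe(lambda s: s[s > 0])
                       .tolist()
         for col in df.columns}
print(sizes)
# {'a': [2, 1, 2], 'b': [1, 2, 1], 'c': [1, 1, 2]}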
Try this one (the example covers only column a; do the same for the other columns):
>>> df=df.assign(a_count_sum=0)
>>> df["a_count_sum"][np.isnan(df["a"])]=df.groupby(np.isnan(df.a)).cumcount()+1
>>> df
a b c a_count_sum
0 1.0 NaN NaN 0
1 2.0 2.0 2.0 0
2 NaN 1.0 1.0 1
3 NaN 1.0 NaN 2
4 3.0 3.0 3.0 0
5 3.0 3.0 3.0 0
6 NaN NaN NaN 3
7 4.0 NaN NaN 0
8 NaN 2.0 2.0 4
9 NaN NaN 8.0 5
>>> res_1 = df["a_count_sum"][((df["a_count_sum"].shift(-1) == 0) | (np.isnan(df["a_count_sum"].shift(-1)))) & (df["a_count_sum"]!=0)]
>>> res_1
3 2
6 3
9 5
Name: a_count_sum, dtype: int64
>>> res_2 = (-res_1.shift(1).fillna(0)).astype(np.int64)
>>> res_2
3 0
6 -2
9 -3
Name: a_count_sum, dtype: int64
>>> res=res_1+res_2
>>> res
3 2
6 1
9 2
Name: a_count_sum, dtype: int64
I converted a list into a DataFrame, and now my data looks like this.
I want to use the unique Business ID to merge two rows in this DataFrame. How can I do this?
Use first in a groupby to get the first non-null value in each group.
Consider the data frame df
df = pd.DataFrame(dict(
Bars=[np.nan, 1, 1, np.nan],
BusID=list('AABB'),
Nightlife=[1, np.nan, np.nan, 1]
))
df
Bars BusID Nightlife
0 NaN A 1.0
1 1.0 A NaN
2 1.0 B NaN
3 NaN B 1.0
Then
df.groupby('BusID', as_index=False).first()
BusID Bars Nightlife
0 A 1.0 1.0
1 B 1.0 1.0
You could use something like df.groupby('Business ID').sum(). As an example:
df = pd.DataFrame(data = {'a': [1, 2, 3, 1],
'b': [5, 6, None, None],
'c': [None, None, 7, 8]})
df
# a b c
# 0 1 5.0 NaN
# 1 2 6.0 NaN
# 2 3 NaN 7.0
# 3 1 NaN 8.0
new_df = df.groupby('a').sum()
new_df
# b c
# a
# 1 5.0 8.0
# 2 6.0 0.0
# 3 0.0 7.0
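Note that sum() turns groups with no values into 0.0, as in the output above. If you would rather keep NaN there, sum() accepts a min_count parameter:
new_df = df.groupby('a').sum(min_count=1)
#      b    c
# a
# 1  5.0  8.0
# 2  6.0  NaN
# 3  NaN  7.0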
Assuming there is a pandas.DataFrame like:
pd.DataFrame([[np.nan,np.nan],[[1,2],[3,4]],[[11,22],[33,44]]],columns=['A','B'])
What's the easiest way to produce two pandas.DataFrames, each containing the 1st or 2nd element of every value list in the frame (NaN where the cell is NaN)?
pd.DataFrame([[np.nan,np.nan],[1,3],[11,33]],columns=['A','B'])
pd.DataFrame([[np.nan,np.nan],[2,4],[22,44]],columns=['A','B'])
You can use:
#replace NaN with [] - a bit of a hack
df = df.mask(df.isnull(), pd.Series([[]] * len(df.columns), index=df.columns), axis=1)
print (df)
A B
0 [] []
1 [1, 2] [3, 4]
2 [11, 22] [33, 44]
#create a new df from each column, concatenate them together
df3 = pd.concat([pd.DataFrame(df[col].values.tolist()) for col in df],
axis=1,
keys=df.columns)
print (df3)
A B
0 1 0 1
0 NaN NaN NaN NaN
1 1.0 2.0 3.0 4.0
2 11.0 22.0 33.0 44.0
#select each element position with xs
df1 = df3.xs(0, level=1, axis=1)
print (df1)
A B
0 NaN NaN
1 1.0 3.0
2 11.0 33.0
df2 = df3.xs(1, level=1, axis=1)
print (df2)
A B
0 NaN NaN
1 2.0 4.0
2 22.0 44.0
You can do what you need with a function that returns the n-th element of each cell in a row.
Code:
def row_element(elem_num):
    def func(row):
        ret = []
        for item in row:
            try:
                ret.append(item[elem_num])
            except TypeError:
                # NaN cells are floats, not lists
                ret.append(item)
        return ret
    return func
Test Code:
df = pd.DataFrame(
[[np.nan, np.nan], [[1, 2], [3, 4]], [[11, 22], [33, 44]]],
columns=['A', 'B'])
print(df)
print(df.apply(row_element(0), axis=1))
print(df.apply(row_element(1), axis=1))
Results:
A B
0 NaN NaN
1 [1, 2] [3, 4]
2 [11, 22] [33, 44]
A B
0 NaN NaN
1 1.0 3.0
2 11.0 33.0
A B
0 NaN NaN
1 2.0 4.0
2 22.0 44.0
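A more compact alternative (a sketch, not part of the answers above) relies on the .str accessor, whose positional indexing also works on lists and yields NaN for NaN cells:
df1 = df.apply(lambda col: col.str[0])
df2 = df.apply(lambda col: col.str[1])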
There is a DataFrame with some NaN values:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2], 'B': [1, 1, np.nan, 2, 3, np.nan, 3, 4]})
A B
0 1 1.0
1 1 1.0
2 1 NaN <-
3 1 2.0
4 2 3.0
5 2 NaN <-
6 2 3.0
7 2 4.0
Set column 'A' as the index:
df.set_index(['A'], inplace=True)
Now there are two groups with the indices 1 and 2:
B
A
1 1.0
1 1.0
1 NaN <-
1 2.0
2 3.0
2 NaN <-
2 3.0
2 4.0
What is the best way to do fillna() on the DataFrame with the most frequent value from each group?
So, I would like to do a call of something like this:
df.B.fillna(df.groupby('A').B...)
and get:
B
A
1 1.0
1 1.0
1 1.0 <-
1 2.0
2 3.0
2 3.0 <-
2 3.0
2 4.0
I hope there's a way, and that it also works with a MultiIndex.
Group by column A and apply fillna() to B within each group;
drop the missing values from the series, run value_counts, and use idxmax() to pick the most frequent value.
Assuming there are no groups where all values are missing:
df['B'] = df.groupby('A')['B'].transform(lambda x: x.fillna(x.dropna().value_counts().idxmax()))
df
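If you prefer a built-in over value_counts().idxmax(), Series.mode gives the same most frequent value (still assuming no group is all-NaN; on ties, mode()[0] picks the smallest):
df['B'] = df.groupby('A')['B'].transform(lambda x: x.fillna(x.mode()[0]))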