Pandas dataframe - removing repeated/duplicate columns but keeping the values - python

I have a dataframe with duplicate column names. I want to remove the repeated columns but keep their values.
I want to remove the C and D columns at the end, but move their values into the first C and D columns on the same row.
df = df.loc[:,~df.columns.duplicated(keep='first')]
I tried this code; it removes the duplicate columns and keeps the first of each, but it also discards the values held in the removed columns.

Example
First, build a minimal and reproducible example:
data = [[0, 1, 2, 3, None, None],
[1, None, 3, None, 2, 4],
[2, 3, 4, 5, None, None]]
df = pd.DataFrame(data, columns=list('ABCDBD'))
df
A B C D B D
0 0 1.0 2 3.0 NaN NaN
1 1 NaN 3 NaN 2.0 4.0
2 2 3.0 4 5.0 NaN NaN
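To see why the attempted one-liner falls short here (a quick check, not part of the original answer): the 2.0 and 4.0 in row 1 live in the second B and D columns, and dropping those columns wholesale discards them:
df.loc[:, ~df.columns.duplicated(keep='first')]
A B C D
0 0 1.0 2 3.0
1 1 NaN 3 NaN
2 2 3.0 4 5.0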
Code
df.groupby(level=0, axis=1).first()
Result:
A B C D
0 0.0 1.0 2.0 3.0
1 1.0 2.0 3.0 4.0
2 2.0 3.0 4.0 5.0
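Note that the axis=1 argument to groupby is deprecated in recent pandas (2.1+). A sketch of an equivalent, assuming the same df as above, transposes, groups on the row labels, and transposes back:
df.T.groupby(level=0).first().T
groupby(level=0) on the transposed frame collapses the duplicated labels, and first() keeps the first non-null value in each group, so the result matches the table above.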

Related

Add new values to a dataframe from a specific index?

In an existing dataframe, how can I add a column of new values, but have those values start at a specific index, growing the dataframe's index as needed?
As in this example, the new values should start at index 2 and run through index 6:
Dataframe:
df = pd.DataFrame({
'Col1':[1, 1, 1, 1],
'Col2':[2, 2, 2, 2]
})
df
Output:
Col1 Col2
0 1 2
1 1 2
2 1 2
3 1 2
New values:
new_values = [3, 3, 3, 3, 3]
Desired Result:
Col1 Col2 Col3
0 1 2 NaN
1 1 2 NaN
2 1 2 3
3 1 2 3
4 NaN NaN 3
5 NaN NaN 3
6 NaN NaN 3
First create a new list and prepend enough NaN values to cover the offset.
Then do a concat.
You can set the Series name when you concatenate it, and that becomes the new column name.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Col1':[1, 1, 1, 1],
'Col2':[2, 2, 2, 2]
})
new_values = [3, 3, 3, 3, 3]
offset = 2 # set your offset here
new_values = [np.nan] * offset + new_values  # looks like [nan, nan, 3, 3, ...]
new = pd.concat([df, pd.Series(new_values).rename('Col3')], axis=1)
new looks like this:
Col1 Col2 Col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
The answer by @anarchy works nicely; here is a similar approach:
df = pd.DataFrame({
'Col1':[1, 1, 1, 1],
'Col2':[2, 2, 2, 2]
})
specific_index_offset = 2
new_data = [3,3,3,3,3]
new_df = pd.DataFrame(
data=[[e] for e in new_data],
columns=['Col3'],
index=range(specific_index_offset, specific_index_offset + len(new_data))
)
desired_df = pd.concat([df, new_df], axis=1)
You can also build the padded list yourself before creating the new column:
n = 2  # offset here
xs = [None] * n
new_values = [3, 3, 3, 3, 3]
new = xs + new_values
#create new df
df2 = pd.DataFrame({'col3':new})
df_final = pd.concat([df, df2], axis=1)
print(df_final)
Output:
Col1 Col2 col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
Another option is to create a Series with the specific index before concatenating:
new_values = [3, 3, 3, 3, 3]
arr = pd.Series(new_values, index = range(2, 7), name = 'Col3')
pd.concat([df, arr], axis = 1)
Col1 Col2 Col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
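All of the approaches above rest on the same mechanism: pd.concat(..., axis=1) aligns on the index, so row labels present on only one side come out as NaN on the other, which is exactly what produces the leading and trailing gaps in Col3, Col1 and Col2.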

Extending the Value of non-Missing Cells to Subsequent Rows in Pandas

This is what I have:
df=pd.DataFrame({'A':[1,2,3,4,5],'B':[6,np.nan,np.nan,3,np.nan]})
A B
0 1 6.0
1 2 NaN
2 3 NaN
3 4 3.0
4 5 NaN
I would like to extend non-missing values of B to missing values of B underneath, so I have:
A B C
0 1 6.0 6.0
1 2 NaN 6.0
2 3 NaN 6.0
3 4 3.0 3.0
4 5 NaN 3.0
I tried something like this, and it worked last night:
for i in df.index:
    df['C'][i] = np.where(pd.isnull(df['B'].iloc[i]), df['C'][i-1], df.B.iloc[i])
But when I woke up this morning it raised an error saying it didn't recognize 'C'. I couldn't identify the conditions under which it worked and didn't work.
Thanks!
You could use the pandas fillna() method to forward-fill the missing values with the last non-null value. See the pandas documentation for more details.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [6, np.nan, np.nan, 3, np.nan]
})
df['C'] = df['B'].fillna(method='ffill')
df
# A B C
# 0 1 6.0 6.0
# 1 2 NaN 6.0
# 2 3 NaN 6.0
# 3 4 3.0 3.0
# 4 5 NaN 3.0
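Note that the method= argument of fillna() is deprecated in recent pandas (2.1+); the dedicated ffill() method is the equivalent spelling:
df['C'] = df['B'].ffill()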

Adding Columns in Dataframe

I am seeing some unexplained behaviour while adding a new column to a DataFrame:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
# Adding a new column to an existing DataFrame object with column label by passing new series
print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)
print ("Adding a new column using the existing columns in DataFrame:")
print("print df['one']")
print(df['one'])
print("#print(df['two']")
print(df['two'])
print("#print df['three']")
df['three']
print("df['four']=df['one']+df['three']")
df['four']=df['one']+df['three']
#print(df)
print(df)
Actual Result :
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Adding a new column by passing as Series:
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN
Adding a new column using the existing columns in DataFrame:
print(df['one']):
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
print(df['two']):
a 1
b 2
c 3
d 4
Name: two, dtype: int64
print(df['three']):
nothing
df['four']=df['one']+df['three']
one two three four
a 1.0 1 10.0 11.0
b 2.0 2 20.0 22.0
c 3.0 3 30.0 33.0
d NaN 4 NaN NaN
Question:
Why am I not getting anything when printing df['three']?
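A likely explanation: the line df['three'] on its own is a bare expression, and bare expressions are only echoed by the interactive interpreter or a notebook, never by a script. Wrap it in print() to see the column:
print(df['three'])
# a    10.0
# b    20.0
# c    30.0
# d     NaN
# Name: three, dtype: float64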

Combine two rows in Dataframe using a Unique Value

I converted a list into a Dataframe and now my data looks like this.
I want to use the unique Business ID to merge two rows in this Dataframe. How can I do this?
Use first in a groupby to get the first non-null value in each group.
Consider the data frame df
import pandas as pd
import numpy as np

df = pd.DataFrame(dict(
    Bars=[np.nan, 1, 1, np.nan],
    BusID=list('AABB'),
    Nightlife=[1, np.nan, np.nan, 1]
))
df
Bars BusID Nightlife
0 NaN A 1.0
1 1.0 A NaN
2 1.0 B NaN
3 NaN B 1.0
Then
df.groupby('BusID', as_index=False).first()
BusID Bars Nightlife
0 A 1.0 1.0
1 B 1.0 1.0
You could use something like df.groupby('Business ID').sum(). As an example:
df = pd.DataFrame(data = {'a': [1, 2, 3, 1],
'b': [5, 6, None, None],
'c': [None, None, 7, 8]})
df
# a b c
# 0 1 5.0 NaN
# 1 2 6.0 NaN
# 2 3 NaN 7.0
# 3 1 NaN 8.0
new_df = df.groupby('a').sum()
new_df
# b c
# a
# 1 5.0 8.0
# 2 6.0 0.0
# 3 0.0 7.0
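Two caveats on the sum() route: it adds up every value in the group, so it only reproduces first() when each group holds at most one non-null value per column, and by default it turns all-missing groups into 0.0 (the zeros above). If you would rather keep NaN there, GroupBy.sum accepts min_count:
new_df = df.groupby('a').sum(min_count=1)
# b c
# a
# 1 5.0 8.0
# 2 6.0 NaN
# 3 NaN 7.0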

pandas: What is the best way to do fillna() on a (multiindexed) DataFrame with the most frequent value from every group?

There is a DataFrame with some NaN values:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2], 'B': [1, 1, np.nan, 2, 3, np.nan, 3, 4]})
A B
0 1 1.0
1 1 1.0
2 1 NaN <-
3 1 2.0
4 2 3.0
5 2 NaN <-
6 2 3.0
7 2 4.0
Set label 'A' as an index:
df.set_index(['A'], inplace=True)
Now there are two groups with the indices 1 and 2:
B
A
1 1.0
1 1.0
1 NaN <-
1 2.0
2 3.0
2 NaN <-
2 3.0
2 4.0
What is the best way to do fillna() on the DataFrame with the most frequent value from each group?
So, I would like to do a call of something like this:
df.B.fillna(df.groupby('A').B...)
and get:
B
A
1 1.0
1 1.0
1 1.0 <-
1 2.0
2 3.0
2 3.0 <-
2 3.0
2 4.0
I hope there's a way and it also works with multiindex.
Group by column A and transform B within each group: drop the missing values, run value_counts(), use idxmax() to pick out the most frequent value, and fillna() with it.
Assuming there are no groups where all values are missing:
df['B'] = df.groupby('A')['B'].transform(lambda x: x.fillna(x.dropna().value_counts().idxmax()))
df
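A close variant (an alternative spelling, not from the original answer) swaps value_counts().idxmax() for Series.mode(), which ignores NaN by default:
df['B'] = df.groupby('A')['B'].transform(lambda x: x.fillna(x.mode().iat[0]))
Both pick a single most frequent value per group; on ties they may settle on different winners, so don't rely on which tied value is chosen.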
