Combine two rows in Dataframe using a Unique Value - python

I converted a list into a DataFrame, and now my data looks like this.
I want to use the unique Business ID to merge two rows in this DataFrame. How can I do this?

Use first in a groupby to get the first non-null value in each column.
Consider the data frame df:
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    Bars=[np.nan, 1, 1, np.nan],
    BusID=list('AABB'),
    Nightlife=[1, np.nan, np.nan, 1]
))
df
Bars BusID Nightlife
0 NaN A 1.0
1 1.0 A NaN
2 1.0 B NaN
3 NaN B 1.0
Then
df.groupby('BusID', as_index=False).first()
BusID Bars Nightlife
0 A 1.0 1.0
1 B 1.0 1.0
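Applied to the question's data, this would look like the following, assuming the unique column is literally named 'Business ID' (the original frame was only shown as an image, so the exact name is a guess):
df.groupby('Business ID', as_index=False).first()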

You could use something like df.groupby('Business ID').sum(). As an example:
df = pd.DataFrame(data={'a': [1, 2, 3, 1],
                        'b': [5, 6, None, None],
                        'c': [None, None, 7, 8]})
df
# a b c
# 0 1 5.0 NaN
# 1 2 6.0 NaN
# 2 3 NaN 7.0
# 3 1 NaN 8.0
new_df = df.groupby('a').sum()
new_df
# b c
# a
# 1 5.0 8.0
# 2 6.0 0.0
# 3 0.0 7.0
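One caveat worth flagging: sum() treats NaN as 0, which is why the 0.0 entries above appear where a group had no value at all. If you would rather keep NaN when every value in a group is missing, the standard min_count option does that:
new_df = df.groupby('a').sum(min_count=1)
#      b    c
# a
# 1  5.0  8.0
# 2  6.0  NaN
# 3  NaN  7.0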

Related

Pandas dataframe - Removing repeated/duplicate column in dataframe but keep the values

I have a dataframe with duplicate column names. I want to remove the repeated columns, but I need to keep their values.
I want to remove the C and D columns at the end, but move their values onto the same rows in the first C and D columns.
df = df.loc[:,~df.columns.duplicated(keep='first')]
I tried this code, but while it removes the duplicate columns and keeps the first of each, it also drops the values.
Example
First, a minimal and reproducible example for the answer:
data = [[0, 1, 2, 3, None, None],
        [1, None, 3, None, 2, 4],
        [2, 3, 4, 5, None, None]]
df = pd.DataFrame(data, columns=list('ABCDBD'))
df
A B C D B D
0 0 1.0 2 3.0 NaN NaN
1 1 NaN 3 NaN 2.0 4.0
2 2 3.0 4 5.0 NaN NaN
Code
df.groupby(level=0, axis=1).first()
Result:
A B C D
0 0.0 1.0 2.0 3.0
1 1.0 2.0 3.0 4.0
2 2.0 3.0 4.0 5.0
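Note that groupby(..., axis=1) is deprecated in recent pandas (2.1 and later). Under that assumption, a transpose-based equivalent is:
df.T.groupby(level=0).first().T
which groups the transposed frame's duplicated index labels, takes the first non-null value per original row, and transposes back.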

Add new values to a dataframe from a specific index?

In an existing dataframe, how can I add a column with new values in it, but place these new values starting from a specific index, growing the dataframe's index as needed?
As in this example, the new values start at index 2 and run through index 6:
Dataframe:
df = pd.DataFrame({
    'Col1': [1, 1, 1, 1],
    'Col2': [2, 2, 2, 2]
})
df
Output:
Col1 Col2
0 1 2
1 1 2
2 1 2
3 1 2
New values:
new_values = [3, 3, 3, 3, 3]
Desired Result:
Col1 Col2 Col3
0 1 2 NaN
1 1 2 NaN
2 1 2 3
3 1 2 3
4 NaN NaN 3
5 NaN NaN 3
6 NaN NaN 3
First create a new list, prepending as many NaN values as the offset you want.
Then do a concat. You can set the Series name when you concatenate it, and that will be the new column name.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Col1': [1, 1, 1, 1],
    'Col2': [2, 2, 2, 2]
})
new_values = [3, 3, 3, 3, 3]
offset = 2  # set your offset here
new_values = [np.nan] * offset + new_values  # looks like [nan, nan, 3, 3, ...]
new = pd.concat([df, pd.Series(new_values).rename('Col3')], axis=1)
new looks like this:
Col1 Col2 Col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
The answer by @anarchy works nicely; here is a similar approach:
df = pd.DataFrame({
    'Col1': [1, 1, 1, 1],
    'Col2': [2, 2, 2, 2]
})
specific_index_offset = 2
new_data = [3, 3, 3, 3, 3]
new_df = pd.DataFrame(
    data=[[e] for e in new_data],
    columns=['Col3'],
    index=range(specific_index_offset, specific_index_offset + len(new_data))
)
desired_df = pd.concat([df, new_df], axis=1)
You can also build the offset at the list level:
n = 2  # input your offset here
xs = [None] * n
new_values = [3, 3, 3, 3, 3]
new = xs + new_values
# create the new df
df2 = pd.DataFrame({'col3': new})
df_final = pd.concat([df, df2], axis=1)
print(df_final)
Output:
Col1 Col2 col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
Another option is to create a Series with the specific index, before concatenating:
new_values = [3, 3, 3, 3, 3]
arr = pd.Series(new_values, index=range(2, 7), name='Col3')
pd.concat([df, arr], axis=1)
Col1 Col2 Col3
0 1.0 2.0 NaN
1 1.0 2.0 NaN
2 1.0 2.0 3.0
3 1.0 2.0 3.0
4 NaN NaN 3.0
5 NaN NaN 3.0
6 NaN NaN 3.0
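Yet another variation, assuming the final length (7) is known up front, is to grow the frame with reindex and then assign an index-aligned Series:
out = df.reindex(range(7))          # rows 4-6 are added, filled with NaN
out['Col3'] = pd.Series(new_values, index=range(2, 7))  # aligns to indices 2-6
This produces the same result as above, since assignment aligns on the index.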

Create a standard dataframe to receive data from other dataframes in Pandas

I have an algorithm that makes requests to different databases and receives a dataframe from each. However, these databases can differ from one another and may send only a few columns, as in the example below.
Note that the column names of the dataframes are not standardized, and some rows can contain NaN values. In addition, some columns appear in some dataframes but not in others.
As I need to do operations that concatenate the dataframes from the different databases, my idea is to create a standard dataframe that contains all possible columns, initialized with NaN values, as in the example below.
So, at each request, I would just fill the standard dataframe with the columns of the received dataframes. I thought about associating the column names of the standard dataframe with the possible names in the database dataframes through a dictionary:
{'A': ['A_1', 'A_a', 'A_y'], 'B': ['B_3', 'B_b'], 'C': ['C_c', 'C_w'], 'D': ['D_5', 'D_d']}
The idea of the dictionary is to have a practical way to map possible column names onto the standard dataframe's columns, since there may be new names that I have not yet mapped.
In the end, my result would be the following dataframe, in case I requested the three dataframes above. Or a version containing only the first dataframe's data, if I requested just that one.
Could anyone suggest an elegant way to do this?
Let's try creating a mapper that can be used with columns.map:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'A_1': [1, np.nan, 3, 4, np.nan, 6],
                    'B_3': ['a', 'b', 'c', 'd', np.nan, 'f'],
                    'D_5': ['a', 'b', 'c', 'd', np.nan, 'f']})
df2 = pd.DataFrame({'A_a': [1, np.nan, 3, 4, 5, 6],
                    'B_b': ['a', np.nan, 'c', 'd', 'e', 'f'],
                    'C_c': [1, np.nan, 3, 4, np.nan, 6],
                    'D_d': ['a', np.nan, 'c', 'd', np.nan, 'f']})
df3 = pd.DataFrame({'A_y': [1, np.nan, 3, 4, 5, 6],
                    'C_w': [1, 2, 3, np.nan, 5, 6]})
alias_map = {'A': ['A_1', 'A_a', 'A_y'], 'B': ['B_3', 'B_b'],
             'C': ['C_c', 'C_w'], 'D': ['D_5', 'D_d']}
# Turn the alias map inside out so it works with columns.map (alias -> standard name)
mapper = {alias: std for std, aliases in alias_map.items() for alias in aliases}
# List of DFs
dfs = [df1, df2, df3]
# Rename Columns
for df in dfs:
    df.columns = df.columns.map(mapper)
# Have Empty DF First with All Columns
default_df = pd.DataFrame(columns=list(alias_map.keys()))
merged = pd.concat((default_df, *dfs)).reset_index(drop=True)
print(merged)
merged:
A B C D
0 1.0 a NaN a
1 NaN b NaN b
2 3.0 c NaN c
3 4.0 d NaN d
4 NaN NaN NaN NaN
5 6.0 f NaN f
6 1.0 a 1.0 a
7 NaN NaN NaN NaN
8 3.0 c 3.0 c
9 4.0 d 4.0 d
10 5.0 e NaN NaN
11 6.0 f 6.0 f
12 1.0 NaN 1.0 NaN
13 NaN NaN 2.0 NaN
14 3.0 NaN 3.0 NaN
15 4.0 NaN NaN NaN
16 5.0 NaN 5.0 NaN
17 6.0 NaN 6.0 NaN
With just 1 DF
merged = pd.concat((default_df, df1)).reset_index(drop=True)
print(merged)
merged:
A B C D
0 1.0 a NaN a
1 NaN b NaN b
2 3.0 c NaN c
3 4.0 d NaN d
4 NaN NaN NaN NaN
5 6.0 f NaN f
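One caveat with columns.map: any column name missing from mapper is turned into NaN in the resulting Index. If the incoming frames may contain names you have not mapped yet, rename is the safer variant, since it leaves unknown labels untouched (a sketch, same mapper as above):
for df in dfs:
    df.rename(columns=mapper, inplace=True)  # unmapped columns keep their names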
IIUC, you can do it this way:
import pandas as pd
import numpy as np
from functools import reduce
df1 = pd.DataFrame({'A_1': [1, np.nan, 3, 4, np.nan, 6],
                    'B_3': ['a', 'b', 'c', 'd', np.nan, 'f'],
                    'D_5': ['a', 'b', 'c', 'd', np.nan, 'f']})
df2 = pd.DataFrame({'A_a': [1, np.nan, 3, 4, 5, 6],
                    'B_b': ['a', np.nan, 'c', 'd', 'e', 'f'],
                    'C_c': [1, np.nan, 3, 4, np.nan, 6],
                    'D_d': ['a', np.nan, 'c', 'd', np.nan, 'f']})
df3 = pd.DataFrame({'A_y': [1, np.nan, 3, 4, 5, 6],
                    'C_w': [1, 2, 3, np.nan, 5, 6]})
dd = {'A': ['A_1', 'A_a', 'A_y'], 'B': ['B_3', 'B_b'], 'C': ['C_c', 'C_w'], 'D': ['D_5', 'D_d']}
# Invert your custom dictionary
col_dict = {}
for k, v in dd.items():
    for i in v:
        col_dict[i] = k
#Changed due to comment
df_out = pd.concat([i.rename(columns=col_dict) for i in [df1,df2,df3]])
df_out
output:
A B D C
0 1.0 a a NaN
1 NaN b b NaN
2 3.0 c c NaN
3 4.0 d d NaN
4 NaN NaN NaN NaN
5 6.0 f f NaN
0 1.0 a a 1.0
1 NaN NaN NaN NaN
2 3.0 c c 3.0
3 4.0 d d 4.0
4 5.0 e NaN NaN
5 6.0 f f 6.0
0 1.0 NaN NaN 1.0
1 NaN NaN NaN 2.0
2 3.0 NaN NaN 3.0
3 4.0 NaN NaN NaN
4 5.0 NaN NaN 5.0
5 6.0 NaN NaN 6.0
To take just the first dataframe, use slicing notation:
ldfs = [df1,df2,df3]
df_out = pd.concat([i.rename(columns=col_dict) for i in ldfs[0:1]])
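If the goal is to always end up with the full standard set of columns, even when a requested frame lacks some of them, one option is to reindex after the concat (a sketch reusing the dd mapping from above):
df_out = pd.concat([i.rename(columns=col_dict) for i in ldfs[0:1]])
df_out = df_out.reindex(columns=list(dd.keys()))  # missing columns become NaN
reindex(columns=...) adds any absent standard column filled with NaN and fixes the column order to A, B, C, D.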

pandas: What is the best way to do fillna() on a (multiindexed) DataFrame with the most frequent value from every group?

There is a DataFrame with some NaN values:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                   'B': [1, 1, np.nan, 2, 3, np.nan, 3, 4]})
A B
0 1 1.0
1 1 1.0
2 1 NaN <-
3 1 2.0
4 2 3.0
5 2 NaN <-
6 2 3.0
7 2 4.0
Set label 'A' as an index:
df.set_index(['A'], inplace=True)
Now there are two groups with the indices 1 and 2:
B
A
1 1.0
1 1.0
1 NaN <-
1 2.0
2 3.0
2 NaN <-
2 3.0
2 4.0
What is the best way to do fillna() on the DataFrame with the most frequent value from each group?
So, I would like to do a call of something like this:
df.B.fillna(df.groupby('A').B...)
and get:
B
A
1 1.0
1 1.0
1 1.0 <-
1 2.0
2 3.0
2 3.0 <-
2 3.0
2 4.0
I hope there's a way and it also works with multiindex.
Group by column A and apply fillna() to B within each group: drop missing values from the series, run value_counts(), and use idxmax() to pick out the most frequent value.
Assuming there are no groups where all values are missing:
df['B'] = df.groupby('A')['B'].transform(lambda x: x.fillna(x.dropna().value_counts().idxmax()))
df
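As a side note, value_counts() already excludes NaN by default, so the dropna() call is redundant (though harmless). An equivalent sketch using Series.mode(), which also ignores NaN and resolves ties by taking the first (smallest) mode:
df['B'] = df.groupby('A')['B'].transform(lambda x: x.fillna(x.mode().iloc[0]))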

Replace Nulls in DataFrame with Max in Row

Is there a way (more efficient than using a for loop) to replace all the nulls in a pandas DataFrame with the max value in its respective row?
I guess that is what you are looking for:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 0], 'b': [3, 0, 10], 'c':[0, 5, 34]})
a b c
0 1 3 0
1 2 0 5
2 0 10 34
You can use apply to iterate over all rows and replace 0 with the maximum of the row via the replace function, which gives the expected output:
df.apply(lambda row: row.replace(0, max(row)), axis=1)
a b c
0 1 3 3
1 2 5 5
2 34 10 34
If you want to replace NaN instead, which seemed to be your actual goal according to your comment, you can use
import numpy as np

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 10], 'c': [np.nan, 5, 34]})
a b c
0 1.0 3.0 NaN
1 2.0 NaN 5.0
2 NaN 10.0 34.0
df.T.fillna(df.max(axis=1)).T
yielding
a b c
0 1.0 3.0 3.0
1 2.0 5.0 5.0
2 34.0 10.0 34.0
which might be more efficient (have not done the timings) than
df.apply(lambda row: row.fillna(row.max()), axis=1)
Please note that
df.apply(lambda row: row.fillna(max(row)), axis=1)
does not work in every case: Python's built-in max does not skip NaN (it can return NaN when the first element is missing), whereas Series.max does.
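A vectorized alternative that avoids apply entirely, assuming the same NaN frame as above, is to mask the missing cells with the row-wise maxima, aligned along the index:
df.mask(df.isna(), df.max(axis=1), axis=0)
mask is the inverse of where: wherever the condition is True (here, a cell is NaN), the corresponding row maximum is substituted.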
