I have an algorithm that queries different databases and receives a dataframe from each. However, these databases can differ from one another and send only a few columns, as in the example below.
Note that the column names of the dataframes are not standardized and some rows can contain NaN values. In addition, some columns appear in some dataframes but not in others.
Since I need to perform operations that concatenate the dataframes from the different databases, my idea is to create a standard dataframe that contains all possible columns, initialized with NaN values, as in the example below.
Then, at each request, I would just fill the standard dataframe with the columns of the received dataframes. I thought about associating the column names of the standard dataframe with the possible names in the database dataframes through a dictionary:
{'A': ['A_1', 'A_a', 'A_y'], 'B': ['B_3', 'B_b'], 'C': ['C_c', 'C_w'], 'D': ['D_5', 'D_d']}
The idea of the dictionary is that I need a practical way to map the possible column names onto the standard dataframe's columns, since new names that I have not yet mapped may appear.
In the end, my result would be the following dataframe, if I had requested the three dataframes above.
Or the following, if I had requested only the first dataframe.
Could anyone suggest an elegant way to do this?
Let's try creating a mapper that can be used with columns.map:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A_1': [1, np.nan, 3, 4, np.nan, 6],
                    'B_3': ['a', 'b', 'c', 'd', np.nan, 'f'],
                    'D_5': ['a', 'b', 'c', 'd', np.nan, 'f']})
df2 = pd.DataFrame({'A_a': [1, np.nan, 3, 4, 5, 6],
                    'B_b': ['a', np.nan, 'c', 'd', 'e', 'f'],
                    'C_c': [1, np.nan, 3, 4, np.nan, 6],
                    'D_d': ['a', np.nan, 'c', 'd', np.nan, 'f']})
df3 = pd.DataFrame({'A_y': [1, np.nan, 3, 4, 5, 6],
                    'C_w': [1, 2, 3, np.nan, 5, 6]})

alias_map = {'A': ['A_1', 'A_a', 'A_y'], 'B': ['B_3', 'B_b'],
             'C': ['C_c', 'C_w'], 'D': ['D_5', 'D_d']}

# Turn the alias map into something that works with columns.map
mapper = {new_k: new_v for new_v, lst in alias_map.items() for new_k in lst}

# List of DFs
dfs = [df1, df2, df3]

# Rename the columns of each dataframe to the standard names
for df in dfs:
    df.columns = df.columns.map(mapper)

# Start with an empty DF that has all the standard columns
default_df = pd.DataFrame(columns=list(alias_map.keys()))

merged = pd.concat((default_df, *dfs)).reset_index(drop=True)
print(merged)
merged:
A B C D
0 1.0 a NaN a
1 NaN b NaN b
2 3.0 c NaN c
3 4.0 d NaN d
4 NaN NaN NaN NaN
5 6.0 f NaN f
6 1.0 a 1.0 a
7 NaN NaN NaN NaN
8 3.0 c 3.0 c
9 4.0 d 4.0 d
10 5.0 e NaN NaN
11 6.0 f 6.0 f
12 1.0 NaN 1.0 NaN
13 NaN NaN 2.0 NaN
14 3.0 NaN 3.0 NaN
15 4.0 NaN NaN NaN
16 5.0 NaN 5.0 NaN
17 6.0 NaN 6.0 NaN
With just one DataFrame:
merged = pd.concat((default_df, df1)).reset_index(drop=True)
print(merged)
merged:
A B C D
0 1.0 a NaN a
1 NaN b NaN b
2 3.0 c NaN c
3 4.0 d NaN d
4 NaN NaN NaN NaN
5 6.0 f NaN f
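One caveat with this approach: when columns.map is given a dict, any column name missing from the mapper is mapped to NaN and the original name is lost. If unmapped columns should keep their names instead, a small sketch of a fallback (same dfs and mapper as above):
# Keep the original name for columns not present in the mapper
for df in dfs:
    df.columns = df.columns.map(lambda c: mapper.get(c, c))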
IIUC, you can do it this way:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A_1': [1, np.nan, 3, 4, np.nan, 6],
                    'B_3': ['a', 'b', 'c', 'd', np.nan, 'f'],
                    'D_5': ['a', 'b', 'c', 'd', np.nan, 'f']})
df2 = pd.DataFrame({'A_a': [1, np.nan, 3, 4, 5, 6],
                    'B_b': ['a', np.nan, 'c', 'd', 'e', 'f'],
                    'C_c': [1, np.nan, 3, 4, np.nan, 6],
                    'D_d': ['a', np.nan, 'c', 'd', np.nan, 'f']})
df3 = pd.DataFrame({'A_y': [1, np.nan, 3, 4, 5, 6],
                    'C_w': [1, 2, 3, np.nan, 5, 6]})

dd = {'A': ['A_1', 'A_a', 'A_y'], 'B': ['B_3', 'B_b'], 'C': ['C_c', 'C_w'], 'D': ['D_5', 'D_d']}

# Invert your custom dictionary
col_dict = {}
for k, v in dd.items():
    for i in v:
        col_dict[i] = k

df_out = pd.concat([i.rename(columns=col_dict) for i in [df1, df2, df3]])
df_out
output:
A B D C
0 1.0 a a NaN
1 NaN b b NaN
2 3.0 c c NaN
3 4.0 d d NaN
4 NaN NaN NaN NaN
5 6.0 f f NaN
0 1.0 a a 1.0
1 NaN NaN NaN NaN
2 3.0 c c 3.0
3 4.0 d d 4.0
4 5.0 e NaN NaN
5 6.0 f f 6.0
0 1.0 NaN NaN 1.0
1 NaN NaN NaN 2.0
2 3.0 NaN NaN 3.0
3 4.0 NaN NaN NaN
4 5.0 NaN NaN 5.0
5 6.0 NaN NaN 6.0
To use only the first dataframe, take a slice of the list:
ldfs = [df1, df2, df3]
df_out = pd.concat([i.rename(columns=col_dict) for i in ldfs[0:1]])
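If a request returns a dataframe that lacks some standard columns entirely (here df1 has no C), you can still force the result to carry all of them by reindexing against the keys of dd. A minimal sketch, assuming df_out and dd from above:
# Guarantee every standard column exists; absent ones are filled with NaN
df_out = df_out.reindex(columns=list(dd.keys()))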
Related
I have a dataframe with duplicated column names; I want to remove the repeated columns but keep their values.
Specifically, I want to drop the second C and D columns at the end, but move their values into the first C and D columns on the same row.
df = df.loc[:, ~df.columns.duplicated(keep='first')]
I tried this code; it removes the duplicate columns and keeps the first ones, but it also drops the values.
Example
A minimal, reproducible example:
import pandas as pd

data = [[0, 1, 2, 3, None, None],
        [1, None, 3, None, 2, 4],
        [2, 3, 4, 5, None, None]]
df = pd.DataFrame(data, columns=list('ABCDBD'))
df
A B C D B D
0 0 1.0 2 3.0 NaN NaN
1 1 NaN 3 NaN 2.0 4.0
2 2 3.0 4 5.0 NaN NaN
Code
df.groupby(level=0, axis=1).first()
result:
A B C D
0 0.0 1.0 2.0 3.0
1 1.0 2.0 3.0 4.0
2 2.0 3.0 4.0 5.0
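Note: grouping on axis=1 is deprecated in pandas 2.x. A sketch of one equivalent, transposing so the duplicated labels become the row index:
# Same idea without the deprecated axis=1 keyword
df.T.groupby(level=0).first().T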
This is what I have:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, np.nan, np.nan, 3, np.nan]})
A B
0 1 6.0
1 2 NaN
2 3 NaN
3 4 3.0
4 5 NaN
I would like to extend the non-missing values of B downward into the missing values below them, so I have:
   A    B    C
0  1  6.0  6.0
1  2  NaN  6.0
2  3  NaN  6.0
3  4  3.0  3.0
4  5  NaN  3.0
I tried something like this, and it worked last night:
for i in df.index:
    df['C'][i] = np.where(pd.isnull(df['B'].iloc[i]), df['C'][i-1], df.B.iloc[i])
But when I ran it again this morning, it said it didn't recognize 'C'. I couldn't identify the conditions under which it did and didn't work.
Thanks!
You can use the pandas ffill() method to forward-fill the missing values with the last non-null value; see the pandas documentation for more details. (Your loop most likely failed because column 'C' did not exist yet: df['C'][i] raises a KeyError until 'C' has been created.)
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, np.nan, np.nan, 3, np.nan]
})

# ffill() propagates the last valid value forward;
# fillna(method='ffill') is deprecated in recent pandas
df['C'] = df['B'].ffill()
df
# A B C
# 0 1 6.0 6.0
# 1 2 NaN 6.0
# 2 3 NaN 6.0
# 3 4 3.0 3.0
# 4 5 NaN 3.0
I tried several methods to replace the NaN values in one row with values from another row, but none of them worked as expected. Here is my DataFrame:
import numpy as np
import pandas as pd

test = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [4, 5, 6, np.nan, np.nan],
        "c": [7, 8, 9, np.nan, np.nan],
        "d": [7, 8, 9, np.nan, np.nan],
    }
)
a b c d
0 1 4.0 7.0 7.0
1 2 5.0 8.0 8.0
2 3 6.0 9.0 9.0
3 4 NaN NaN NaN
4 5 NaN NaN NaN
I need to replace the NaN values in the fourth row (index 3) with the values from the first row, i.e.,
a b c d
0 1 **4.0 7.0 7.0**
1 2 5.0 8.0 8.0
2 3 6.0 9.0 9.0
3 4 **4.0 7.0 7.0**
4 5 NaN NaN NaN
And the second question is: how can I multiply some of the values in a row by a number? For example, I need to double the values in the second row (index 1) for the columns ['b', 'c', 'd'], so the result is:
a b c d
0 1 4.0 7.0 7.0
1 2 **10.0 16.0 16.0**
2 3 6.0 9.0 9.0
3 4 NaN NaN NaN
4 5 NaN NaN NaN
First of all, I suggest you do some reading on indexing and selecting data in pandas.
Regarding the first question, you can use .loc with isnull() to build a boolean mask over the values of the row:
mask_nans = test.loc[3, :].isnull()
test.loc[3, mask_nans] = test.loc[0, mask_nans]
And to double the values, you can multiply the sliced dataframe directly by 2, again using .loc:
test.loc[1, 'b':] *= 2
a b c d
0 1 4.0 7.0 7.0
1 2 10.0 16.0 16.0
2 3 6.0 9.0 9.0
3 4 4.0 7.0 7.0
4 5 NaN NaN NaN
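As a side note, since two rows selected with .loc are Series aligned by column label, the same fill can be written as a one-liner with fillna. A minimal sketch, assuming the default integer index as above:
# Fill the NaNs of row 3 with the corresponding values from row 0
test.loc[3] = test.loc[3].fillna(test.loc[0])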
Indexing with labels
If you wish to filter by a, and a values are unique, consider making it your index to simplify your logic and make it more efficient:
test = test.set_index('a')
test.loc[4] = test.loc[4].fillna(test.loc[1])
test.loc[2] *= 2
Boolean masks
If a is not unique and boolean masks are required, you can still use fillna with an additional step:
mask = test['a'].eq(4)
test.loc[mask] = test.loc[mask].fillna(test.loc[test['a'].eq(1).idxmax()])
test.loc[test['a'].eq(2)] *= 2
I converted a list into a DataFrame and now my data looks like this.
I want to use the unique Business ID to merge two rows in this DataFrame. How can I do this?
Use first in a groupby to get the first non-null value per column within each group.
Consider the data frame df
df = pd.DataFrame(dict(
    Bars=[np.nan, 1, 1, np.nan],
    BusID=list('AABB'),
    Nightlife=[1, np.nan, np.nan, 1]
))
df
Bars BusID Nightlife
0 NaN A 1.0
1 1.0 A NaN
2 1.0 B NaN
3 NaN B 1.0
Then
df.groupby('BusID', as_index=False).first()
BusID Bars Nightlife
0 A 1.0 1.0
1 B 1.0 1.0
You could use something like df.groupby('Business ID').sum(). As an example:
df = pd.DataFrame(data={'a': [1, 2, 3, 1],
                        'b': [5, 6, None, None],
                        'c': [None, None, 7, 8]})
df
# a b c
# 0 1 5.0 NaN
# 1 2 6.0 NaN
# 2 3 NaN 7.0
# 3 1 NaN 8.0
new_df = df.groupby('a').sum()
new_df
# b c
# a
# 1 5.0 8.0
# 2 6.0 0.0
# 3 0.0 7.0
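One caveat, visible in the output above: sum() collapses all-NaN groups to 0.0. If NaN should be preserved in that case, min_count can be passed to the groupby sum (a sketch, assuming a reasonably recent pandas):
# Require at least one non-null value per group, otherwise produce NaN
new_df = df.groupby('a').sum(min_count=1)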
I am concatenating two data files using pandas. The concat works well, but when I write the data back to csv it loses some coherency:
import os
import numpy as np
import pandas as pd

# Define DataFrame 1
headerList1 = ['A', 'B', 'C', 'D']
b1 = np.array([[0, 'B_foo', 2, 'D_one'],
               [3, 'B_bar', 5, 'D_two'],
               [6, 'B_cat', 8, 'D_one']])
df1 = pd.DataFrame(b1, columns=headerList1)

# Define DataFrame 2
headerList2 = ['C', 'E', 'F', 'G']
b2 = np.array([[12, 'E_foo', 2, 'G_one'],
               [15, 'E_bar', 5, 'G_two'],
               [19, 'E_cat', 8, 'G_one']])
df2 = pd.DataFrame(b2, columns=headerList2)

# Concat DataFrames
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)

# Write to csv
scratchFile = os.path.join(dir, 'scratch.csv')
df3.to_csv(scratchFile, index_label=False, ignore_index=True)
I am looking for:
A B C D E F G
0 B_foo 2 D_one NaN NaN NaN
3 B_bar 5 D_two NaN NaN NaN
6 B_cat 8 D_one NaN NaN NaN
NaN NaN 12 NaN E_foo 2 G_one
NaN NaN 15 NaN E_bar 5 G_two
NaN NaN 19 NaN E_cat 8 G_one
but get:
A B C D E F G
0 0 B_foo 2 D_one Nan Nan
1 3 B_bar 5 D_two Nan Nan
2 6 B_cat 8 D_one Nan Nan
3 Nan Nan 12 Nan E_foo 2 G_one
4 Nan Nan 15 Nan E_bar 5 G_two
5 Nan Nan 19 Nan E_cat 8 G_one
I can almost reach the desired result by removing index_label=False from the to_csv() call, but this adds an undesired index column.
Is there a way to get the desired output without the index column? Also, out of personal interest, why does removing index_label=False disrupt the column organization?
Thanks!
df3.to_csv('df3.csv', index = False)
This worked for me: index=False means that the dataframe index is not included in the csv.
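As for why removing index_label=False disrupts the column organization: with index_label=False, pandas still writes the index values at the start of every data row but omits a header field for them, so the header row is one field shorter than the data rows and the columns appear shifted when the file is read back. index=False skips the index column entirely, keeping header and data aligned. Note also that ignore_index is not a to_csv() parameter (it belongs to concat), so in recent pandas the original call would raise a TypeError; the write reduces to:
# Write without the index column; to_csv has no ignore_index parameter
df3.to_csv(scratchFile, index=False)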