Adding Columns in Dataframe - python

I am seeing unexplained behaviour while adding a new column to a DataFrame:
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)

# Adding a new column to an existing DataFrame by passing a new Series
print("Adding a new column by passing as Series:")
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(df)

print("Adding a new column using the existing columns in DataFrame:")
print("print(df['one']):")
print(df['one'])
print("print(df['two']):")
print(df['two'])
print("print(df['three']):")
df['three']
print("df['four']=df['one']+df['three']")
df['four'] = df['one'] + df['three']
print(df)
Actual result:
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Adding a new column by passing as Series:
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN
Adding a new column using the existing columns in DataFrame:
print(df['one']):
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
print(df['two']):
a 1
b 2
c 3
d 4
Name: two, dtype: int64
print(df['three']):
nothing
df['four']=df['one']+df['three']
one two three four
a 1.0 1 10.0 11.0
b 2.0 2 20.0 22.0
c 3.0 3 30.0 33.0
d NaN 4 NaN NaN
Question:
Why am I not getting anything when printing df['three']?
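The likely cause, briefly sketched here (not part of the original post): df['three'] on its own is a bare expression statement. When the file runs as a script, Python evaluates it and discards the result; only the interactive interpreter echoes bare expressions back. The column itself exists, as the final table shows, so displaying it just takes an explicit print:

df['three']          # evaluated and discarded when run as a script
print(df['three'])   # explicitly displayed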

Related

Pandas dataframe - Removing repeated/duplicate column in dataframe but keep the values

I have a dataframe with duplicated column names. I want to remove the repeated columns but keep their values: drop the trailing C and D columns and move their values onto the same row in the first C and D columns.
df = df.loc[:,~df.columns.duplicated(keep='first')]
I tried this code, but while it removes the duplicate columns and keeps the first of each, it also drops the values.
Example
First, a minimal and reproducible example for the answer:
import pandas as pd

data = [[0, 1, 2, 3, None, None],
        [1, None, 3, None, 2, 4],
        [2, 3, 4, 5, None, None]]
df = pd.DataFrame(data, columns=list('ABCDBD'))
df
A B C D B D
0 0 1.0 2 3.0 NaN NaN
1 1 NaN 3 NaN 2.0 4.0
2 2 3.0 4 5.0 NaN NaN
Code
df.groupby(level=0, axis=1).first()
result:
A B C D
0 0.0 1.0 2.0 3.0
1 1.0 2.0 3.0 4.0
2 2.0 3.0 4.0 5.0
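Note that groupby(..., axis=1) is deprecated in newer pandas releases (2.1 onwards). A sketch of an equivalent approach for those versions, transposing so the duplicated labels land on the index:

# Duplicate column labels become duplicate index labels after transposing;
# take the first non-null value per label, then transpose back
df.T.groupby(level=0).first().T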

Create a standard dataframe to receive data from other dataframes in Pandas

I have an algorithm that makes requests to different databases and receives a dataframe from each. However, these databases can differ from each other and send only a few columns, as in the example below.
Note that the column names of the dataframes are not standardized, and they can contain NaN values in some rows. In addition, some columns appear in some dataframes but not in others.
As I need to do operations that concatenate the dataframes from the different databases, my idea was to create a standard dataframe that contains all possible columns, initialized with NaN values, as in the example below.
So, at each request, I would just fill the standard dataframe with the columns of the received dataframes. I thought about associating the column names of the standard dataframe with the possible names in the databases' dataframes through a dictionary:
{'A': ['A_1', 'A_a', 'A_y'], 'B': ['B_3', 'B_b'], 'C': ['C_c', 'C_w'], 'D': ['D_5', 'D_d']}
The idea of the dictionary is that I need a practical way to keep mapping possible column names onto the columns of the standard dataframe, since there may be new names that I have not yet mapped.
In the end, my result would be the following dataframe, in case I had requested the three dataframes above.
Or the following, if I only requested the first dataframe.
Could anyone suggest an elegant way to do this?
Let's try creating a mapper that can be used with columns.map:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A_1': [1, np.nan, 3, 4, np.nan, 6],
                    'B_3': ['a', 'b', 'c', 'd', np.nan, 'f'],
                    'D_5': ['a', 'b', 'c', 'd', np.nan, 'f']})
df2 = pd.DataFrame({'A_a': [1, np.nan, 3, 4, 5, 6],
                    'B_b': ['a', np.nan, 'c', 'd', 'e', 'f'],
                    'C_c': [1, np.nan, 3, 4, np.nan, 6],
                    'D_d': ['a', np.nan, 'c', 'd', np.nan, 'f']})
df3 = pd.DataFrame({'A_y': [1, np.nan, 3, 4, 5, 6],
                    'C_w': [1, 2, 3, np.nan, 5, 6]})

alias_map = {'A': ['A_1', 'A_a', 'A_y'], 'B': ['B_3', 'B_b'],
             'C': ['C_c', 'C_w'], 'D': ['D_5', 'D_d']}

# Turn the alias map into {old_name: standard_name} so it works with columns.map
mapper = {old: new for new, old_names in alias_map.items() for old in old_names}

# List of DFs
dfs = [df1, df2, df3]

# Rename columns
for df in dfs:
    df.columns = df.columns.map(mapper)

# Start with an empty DF that has all the standard columns
default_df = pd.DataFrame(columns=list(alias_map.keys()))

merged = pd.concat((default_df, *dfs)).reset_index(drop=True)
print(merged)
merged:
A B C D
0 1.0 a NaN a
1 NaN b NaN b
2 3.0 c NaN c
3 4.0 d NaN d
4 NaN NaN NaN NaN
5 6.0 f NaN f
6 1.0 a 1.0 a
7 NaN NaN NaN NaN
8 3.0 c 3.0 c
9 4.0 d 4.0 d
10 5.0 e NaN NaN
11 6.0 f 6.0 f
12 1.0 NaN 1.0 NaN
13 NaN NaN 2.0 NaN
14 3.0 NaN 3.0 NaN
15 4.0 NaN NaN NaN
16 5.0 NaN 5.0 NaN
17 6.0 NaN 6.0 NaN
With just 1 DF
merged = pd.concat((default_df, df1)).reset_index(drop=True)
print(merged)
merged:
A B C D
0 1.0 a NaN a
1 NaN b NaN b
2 3.0 c NaN c
3 4.0 d NaN d
4 NaN NaN NaN NaN
5 6.0 f NaN f
IIUC, you can do it this way:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A_1': [1, np.nan, 3, 4, np.nan, 6],
                    'B_3': ['a', 'b', 'c', 'd', np.nan, 'f'],
                    'D_5': ['a', 'b', 'c', 'd', np.nan, 'f']})
df2 = pd.DataFrame({'A_a': [1, np.nan, 3, 4, 5, 6],
                    'B_b': ['a', np.nan, 'c', 'd', 'e', 'f'],
                    'C_c': [1, np.nan, 3, 4, np.nan, 6],
                    'D_d': ['a', np.nan, 'c', 'd', np.nan, 'f']})
df3 = pd.DataFrame({'A_y': [1, np.nan, 3, 4, 5, 6],
                    'C_w': [1, 2, 3, np.nan, 5, 6]})

dd = {'A': ['A_1', 'A_a', 'A_y'], 'B': ['B_3', 'B_b'], 'C': ['C_c', 'C_w'], 'D': ['D_5', 'D_d']}

# Invert the custom dictionary: {old_name: standard_name}
col_dict = {}
for k, v in dd.items():
    for i in v:
        col_dict[i] = k

# Changed due to comment
df_out = pd.concat([i.rename(columns=col_dict) for i in [df1, df2, df3]])
df_out
output:
A B D C
0 1.0 a a NaN
1 NaN b b NaN
2 3.0 c c NaN
3 4.0 d d NaN
4 NaN NaN NaN NaN
5 6.0 f f NaN
0 1.0 a a 1.0
1 NaN NaN NaN NaN
2 3.0 c c 3.0
3 4.0 d d 4.0
4 5.0 e NaN NaN
5 6.0 f f 6.0
0 1.0 NaN NaN 1.0
1 NaN NaN NaN 2.0
2 3.0 NaN NaN 3.0
3 4.0 NaN NaN NaN
4 5.0 NaN NaN 5.0
5 6.0 NaN NaN 6.0
To process just the first dataframe, use slicing notation:
ldfs = [df1,df2,df3]
df_out = pd.concat([i.rename(columns=col_dict) for i in ldfs[0:1]])
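One caveat worth adding (not part of the original answers): when only some dataframes are requested, columns that never appear (here C, with only df1) will be missing from the result entirely. DataFrame.reindex can guarantee every standard column, in a fixed order:

# Guarantee all standard columns, in order; absent ones are filled with NaN
df_out = df_out.reindex(columns=list(dd.keys()))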

Creating two shifted columns in grouped pandas data-frame

I have looked all over and I still can't find an example of how to create two shifted columns in a Pandas Dataframe within its groups.
I have done it with one column as follows:
data_frame['previous_category'] = data_frame.groupby('id')['category'].shift()
But I have to do it with 2 columns, shifting one upwards and the other downwards.
Any ideas?
It is possible with a custom function and GroupBy.apply, because one column needs to be shifted down and the other shifted up:
import pandas as pd

df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'F': list('aaabbb')
})

def f(x):
    # shift B down (forward) and C up (backward) within each group
    x['B'] = x['B'].shift()
    x['C'] = x['C'].shift(-1)
    return x

df = df.groupby('F').apply(f)
print(df)
B C F
0 NaN 8.0 a
1 4.0 9.0 a
2 5.0 NaN a
3 NaN 2.0 b
4 5.0 3.0 b
5 5.0 NaN b
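An alternative sketch, not from the original answer, that avoids apply entirely: starting from the original df, assign each shifted column directly from a per-group shift.

# Same result without apply: one groupby shift per column and direction
df['B'] = df.groupby('F')['B'].shift()
df['C'] = df.groupby('F')['C'].shift(-1)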
If you want to shift both columns the same way, just pass the list of columns:
df[['B','C']] = df.groupby('F')[['B','C']].shift()
print (df)
B C F
0 NaN NaN a
1 4.0 7.0 a
2 5.0 8.0 a
3 NaN NaN b
4 5.0 4.0 b
5 5.0 2.0 b

Combine two rows in Dataframe using a Unique Value

I converted a list into a Dataframe and now my data looks like this.
I want to use the unique Business ID to merge two rows in this Dataframe. How can I do this?
Use first in a groupby to get the first non-null value per group.
Consider the data frame df
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    Bars=[np.nan, 1, 1, np.nan],
    BusID=list('AABB'),
    Nightlife=[1, np.nan, np.nan, 1]
))
df
Bars BusID Nightlife
0 NaN A 1.0
1 1.0 A NaN
2 1.0 B NaN
3 NaN B 1.0
Then
df.groupby('BusID', as_index=False).first()
BusID Bars Nightlife
0 A 1.0 1.0
1 B 1.0 1.0
You could use something like df.groupby('Business ID').sum(). As an example:
df = pd.DataFrame(data={'a': [1, 2, 3, 1],
                        'b': [5, 6, None, None],
                        'c': [None, None, 7, 8]})
df
# a b c
# 0 1 5.0 NaN
# 1 2 6.0 NaN
# 2 3 NaN 7.0
# 3 1 NaN 8.0
new_df = df.groupby('a').sum()
new_df
# b c
# a
# 1 5.0 8.0
# 2 6.0 0.0
# 3 0.0 7.0
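A difference worth noting (an added remark): sum() turns groups with no valid values into 0.0, as in the b and c columns above, whereas first() keeps NaN. Passing min_count=1 makes sum() keep NaN instead:

# Keep NaN where a group has no valid values instead of producing 0.0
new_df = df.groupby('a').sum(min_count=1)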

I am trying to fill all NaN values in rows with number data types to zero in pandas

I have a DataFrame with a mixture of string and float columns. The float columns all hold whole numbers and were only changed to floats because there were missing values. I want to fill the NaN values with zero in the numeric columns while leaving the NaN values in the string columns alone. Here is what I have currently.
df.select_dtypes(include=['int', 'float']).fillna(0, inplace=True)
This doesn't work, and I think it is because .select_dtypes() returns a copy of the DataFrame, so the in-place .fillna() has no effect on df. Is there a method similar to this that fills the NaNs in only the float columns?
Use either DF.combine_first (does not act inplace):
df.combine_first(df.select_dtypes(include=[np.number]).fillna(0))
or DF.update (modifies inplace):
df.update(df.select_dtypes(include=[np.number]).fillna(0))
The reason fillna fails is that DF.select_dtypes returns a completely new dataframe: although it forms a subset of the original DF, it is not really a part of it. It behaves as a completely new entity in itself, so any modifications done to it will not affect the DF it was derived from.
Note that np.number selects all numeric types.
Your pandas.DataFrame.select_dtypes approach is good; you've just got to cross the finish line:
>>> df = pd.DataFrame({'A': [np.nan, 'string', 'string', 'more string'], 'B': [np.nan, np.nan, 3, 4], 'C': [4, np.nan, 5, 6]})
>>> df
A B C
0 NaN NaN 4.0
1 string NaN NaN
2 string 3.0 5.0
3 more string 4.0 6.0
Don't try to perform the in-place fillna here (there's a time and place for inplace=True, but here is not one). What's returned by select_dtypes is effectively a standalone copy, so filling it in place never reaches the original frame. Create a new dataframe called filled and join the filled (or "fixed") columns back with your original data:
>>> filled = df.select_dtypes(include=['int', 'float']).fillna(0)
>>> filled
B C
0 0.0 4.0
1 0.0 0.0
2 3.0 5.0
3 4.0 6.0
>>> df = df.join(filled, rsuffix='_filled')
>>> df
A B C B_filled C_filled
0 NaN NaN 4.0 0.0 4.0
1 string NaN NaN 0.0 0.0
2 string 3.0 5.0 3.0 5.0
3 more string 4.0 6.0 4.0 6.0
Then you can drop whatever original columns you had to keep only the "filled" ones:
>>> df.drop([x[:x.find('_filled')] for x in df.columns if '_filled' in x], axis=1, inplace=True)
>>> df
A B_filled C_filled
0 NaN 0.0 4.0
1 string 0.0 0.0
2 string 3.0 5.0
3 more string 4.0 6.0
Consider a dataframe like this
col1 col2 col3 id
0 1 1 1 a
1 0 NaN 1 a
2 NaN 1 1 NaN
3 1 0 1 b
You can select the numeric columns and fillna:
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = df.select_dtypes(include=[np.number]).fillna(0)
col1 col2 col3 id
0 1 1 1 a
1 0 0 1 a
2 0 1 1 NaN
3 1 0 1 b
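A more compact sketch of the same idea (an addition to the answers above): fillna also accepts a dict mapping column names to fill values, so the numeric columns can be filled in a single call:

import numpy as np

# fillna with a dict only touches the listed columns
num_cols = df.select_dtypes(include=[np.number]).columns
df = df.fillna({col: 0 for col in num_cols})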
