I want to merge 3 columns into a single column. I have tried changing the column types, but I could not get it to work.
For example, I have 3 columns: A: {1,2,4}, B: {3,4,4}, C: {1,1,1}
Expected output: ABC column {131, 241, 441}
What I tried:
df['ABC'] = df['A'].map(str) + df['B'].map(str) + df['C'].map(str)
df.head()
ABC {13.01.0, 24.01.0, 44.01.0}
The dtype of ABC appears to be object, and I could not change it via str or int:
df['ABC'].apply(str)
Also, I realized that there are NaN values in A, B, C column. Is it possible to merge these even with NaN values?
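(For context: odd values like 13.01.0 appear because a column containing NaN is stored as float, so str() renders 1 as '1.0'. A minimal sketch of the effect:)
import pandas as pd
import numpy as np
s = pd.Series([1, 2, np.nan])
print(s.map(str).tolist())  # ['1.0', '2.0', 'nan']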
# Example
import pandas as pd
import numpy as np
df = pd.DataFrame()
# Considering NaN's in the data-frame
df['colA'] = [1, 2, 4, np.nan, 5]
df['colB'] = [3, 4, 4, 3, np.nan]
df['colC'] = [1,1,1,4,1]
# Using pd.isna() to check for NaN values in the columns
df['colA'] = df['colA'].apply(lambda x: x if pd.isna(x) else str(int(x)))
df['colB'] = df['colB'].apply(lambda x: x if pd.isna(x) else str(int(x)))
df['colC'] = df['colC'].apply(lambda x: x if pd.isna(x) else str(int(x)))
# Fill the NaN values with an empty string
df = df.fillna('')
# Convert all columns to string
df = df.astype(str)
# Concatenating all together
df['ABC'] = df.sum(axis=1)
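With the sample data above, the result should look roughly like this (rows with NaN simply lose that digit):
  colA colB colC  ABC
0    1    3    1  131
1    2    4    1  241
2    4    4    1  441
3         3    4   34
4    5         1   51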
A workaround for your NaN problem could look like this, but note that NaN becomes 0:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 4, np.nan], 'B': [3, 4, 4, 4], 'C': [1, np.nan, 1, 3]})
df = df.fillna(0).astype(int).astype(str)
df['ABC'] = df['A'] + df['B'] + df['C']
Output:
A B C ABC
0 1 3 1 131
1 2 4 0 240
2 4 4 1 441
3 0 4 3 043
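If you would rather drop the NaN digit entirely instead of substituting 0, a sketch in the spirit of the first answer, applied to the original float frame, could be:
df['ABC'] = df[['A', 'B', 'C']].apply(
    lambda col: col.map(lambda v: '' if pd.isna(v) else str(int(v)))
).agg(''.join, axis=1)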
Say I have two DataFrames
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge so that any value in df1 is overwritten if there is a value in df2 at that location, and any new values in df2 are added, including new rows and columns.
The result should be:
A B C
0 1 3 nan
1 2 8 10
2 nan 9 11
I've tried combine_first, but that causes only NaN values to be overwritten.
update has the issue that new rows are created rather than overwritten.
merge has many issues.
I've tried writing my own function
import math
import numpy as np
import pandas as pd

def take_right(df1, df2, j, i):
    try:
        s1 = df1[j][i]
    except KeyError:
        s1 = np.nan
    try:
        s2 = df2[j][i]
    except KeyError:
        s2 = np.nan
    if math.isnan(s2):
        return s1
    else:
        return s2
def combine_df(df1, df2):
    rows = set(df1.index.values.tolist()) | set(df2.index.values.tolist())
    columns = set(df1.columns.values.tolist()) | set(df2.columns.values.tolist())
    df = pd.DataFrame()
    for i in rows:
        for j in columns:
            df = df.insert(int(i), j, take_right(df1, df2, j, i), allow_duplicates=False)
    return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
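(An aside on why the loop above fails: DataFrame.insert modifies the frame in place and returns None, so df = df.insert(...) replaces df with None on the first iteration.)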
One approach is to create an empty output DataFrame with the union of the columns and indices of df1 and df2, and then use the DataFrame.update method to write their values into out_df:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
    columns=df1.columns.union(df2.columns),
    index=df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
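The result should look like this (note the values keep the object dtype of the empty frame; cast with .astype if needed):
     A  B    C
0    1  3  NaN
1    2  8   10
2  NaN  9   11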
Why does combine_first not work? Called with the frames swapped, it gives exactly the desired result:
df = df2.combine_first(df1)
print(df)
Output:
A B C
0 1.0 3 NaN
1 2.0 8 10.0
2 NaN 9 11.0
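Column A comes out as float because NaN cannot live in an int64 column; if that bothers you, a possible follow-up is the nullable integer dtype:
df = df2.combine_first(df1).astype('Int64')  # optional: <NA>-aware integers instead of floats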
I created a dataframe df = pd.DataFrame({'col':[1,2,3,4,5,6]}) and I would like to take some values and put them in another dataframe df2 = pd.DataFrame({'A':[0,0]}) by creating new columns.
I created a new column 'B' with df2['B'] = df.iloc[0:2,0] and everything was fine, but then I created another column C with df2['C'] = df.iloc[2:4,0] and it contained only NaN values. I don't know why, and if I print print(df.iloc[2:4]) everything looks normal.
full code:
import pandas as pd
df = pd.DataFrame({'col':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'A':[0,0]})
df2['B'] = df.iloc[0:2,0]
df2['C'] = df.iloc[2:4,0]
print(df2)
print('\n',df.iloc[2:4])
Output:
A B C
0 0 1 NaN
1 0 2 NaN
col
2 3
3 4
The assignment df2['C'] = df.iloc[2:4,0] does not work as expected because the indexes do not match: pandas aligns on the index, and df2 has index [0, 1] while the slice has index [2, 3]. You can bypass the alignment by using the .values attribute.
import pandas as pd
df = pd.DataFrame({'col':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'A':[0,0]})
df2['B'] = df.iloc[0:2,0]
df2['C'] = df.iloc[2:4,0].values
print(df2)
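An equivalent sketch drops the slice's index before assigning, instead of going through .values:
df2['C'] = df.iloc[2:4, 0].reset_index(drop=True)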
I have a df with many columns. I would like to group by id and transform a subset of those columns, leaving the rest untouched. What is the optimal way to do this? In particular, I would like to z-score columns a and b within each id, while column c should remain untouched. In my actual problem I have many more columns.
The best I can think of is passing a dict of {col_name: function_name} to transform. For some reason this raises a TypeError.
MWE:
import pandas as pd
import numpy as np
np.random.seed(123) #reproducible ex
df = pd.DataFrame(
    data={"a": np.arange(10),
          "b": np.arange(10)[::-1],
          "c": np.random.choice(a=np.arange(10), size=10)},
    index=pd.Index(data=np.random.choice(a=[1, 2, 3], size=10), name="id"))
#create a dict for all columns other than "c" and the function to do the transform
fmap = {k: lambda x: (x - x.mean()) / x.std() for k in df.columns if k != "c"}
df.groupby("id").transform(fmap) #yields error that "dict" is unhashable
Turns out this is a known bug: https://github.com/pandas-dev/pandas/issues/17309
One possible solution is to filter the column names first with difference, because a dict does not work with transform yet:
cols = df.columns.difference(['c'])
print (cols)
Index(['a', 'b'], dtype='object')
fmap = lambda x: (x - x.mean()) / x.std()
df[cols] = df.groupby("id")[cols].transform(fmap)
print (df)
a b c
id
3 -1.000000 1.000000 2
2 -1.091089 1.091089 2
1 -1.134975 1.134975 6
3 0.000000 0.000000 1
1 -0.529655 0.529655 3
2 0.218218 -0.218218 9
3 1.000000 -1.000000 6
2 0.872872 -0.872872 1
1 0.680985 -0.680985 0
1 0.983645 -0.983645 1
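If different columns genuinely need different functions, one workaround sketch (using the fmap dict from the question on the original df) is to apply each entry separately:
g = df.groupby("id")
for col, func in fmap.items():
    df[col] = g[col].transform(func)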
I have a dataframe such as:
label column1
a 1
a 2
b 6
b 4
I would like to make a dataframe with a new column, with the opposite value from column1 where the labels match. Such as:
label column1 column2
a 1 2
a 2 1
b 6 4
b 4 6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
import numpy as np
import pandas as pd

x = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                  'column1': [1, 2, 6, 4]})
y = x.groupby('label').apply(
    lambda g: g.assign(column2=np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True)  # optional: drop the weird MultiIndex
print(y)
You can try the code block below:
# Create the DataFrame
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})
# Group by label
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()
# Concat those groups to create column2
df2 = (pd.concat([b, a])
       .sort_values(by='label')
       .rename(columns={'column1': 'column2'})
       .reset_index()
       .drop('index', axis=1))
# Merge with the original DataFrame on the index ('on' cannot be combined
# with left_index/right_index, so use suffixes to keep a single label column)
df = df.merge(df2, left_index=True, right_index=True,
              suffixes=('', '_y'))[['label', 'column1', 'column2']]
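With the sample data, this should produce:
  label  column1  column2
0     a        1        2
1     a        2        1
2     b        6        4
3     b        4        6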
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:
# Create the DataFrame
df = pd.DataFrame(data={'label': ['a', 'a', 'b', 'b'],
                        'column1': [1, 2, 6, 4]})
# Iterate over the DataFrame, find the row with a matching label and a different value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # Write the value into the new column (set_value was removed from pandas; use .at)
    df.at[index, 'column2'] = newvalue
df.head()
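This should yield the same table as above, except that column2 comes out as float (2.0, 1.0, ...), because assigning scalars into a not-yet-existing column creates it as float; a final df['column2'] = df['column2'].astype(int) would restore integers if needed.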
You can use groupby with apply to create a new Series in reverse order:
df['column2'] = df.groupby('label')["column1"] \
.apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
column1 label column2
0 1 a 2
1 2 a 1
2 6 b 4
3 4 b 6
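A transform-based sketch avoids the reset_index juggling, since transform keeps the original index:
df['column2'] = df.groupby('label')['column1'].transform(lambda s: s.values[::-1])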
I have a big data frame with lots of NaN values, and I want to condense it into a smaller data frame that stores the indexes and values of all the non-NaN, non-zero entries.
dff = pd.DataFrame(np.random.randn(4,3), columns=list('ABC'))
dff.iloc[0:2,0] = np.nan
dff.iloc[2,2] = np.nan
dff.iloc[1:4,1] = 0
The data frame may look like this:
A B C
0 NaN -2.268882 0.337074
1 NaN 0.000000 1.340350
2 -1.526945 0.000000 NaN
3 -1.223816 0.000000 -2.185926
I want a data frame that looks like this:
0 B -2.268882
0 C 0.337074
1 C 1.340350
2 A -1.526945
3 A -1.223816
3 C -2.185926
How can I do this quickly? I have a relatively big data frame, thousands by thousands...
Many thanks!
Replace 0 with np.nan and .stack() the result (see the docs).
If there's a chance that some rows are all np.nan after .replace(), you can do .dropna(how='all') before .stack() to reduce the number of rows to pivot. If the same could apply to columns, add .dropna(how='all', axis=1).
df.replace(0, np.nan).stack()
0 B -2.268882
C 0.337074
1 C 1.340350
2 A -1.526945
3 A -1.223816
C -2.185926
Combine with .reset_index() as needed.
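For example, to get a flat three-column frame (the column names here are made up):
out = df.replace(0, np.nan).stack().reset_index()
out.columns = ['row', 'col', 'value']  # hypothetical names for the two index levels and the data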
To select from a Series with a MultiIndex, use .loc[(level_0, level_1)]:
df.loc[(0, 'B')]  # -2.268882
Details on slicing etc in the docs.
I've come up with a somewhat ugly way of achieving this, but hey, it works. Note that this solution's index runs from 0 and it does not preserve the original row-wise order of 'A', 'B', 'C' as in your question, if that matters.
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(4,3), columns=list('ABC'))
dff.iloc[0:2,0] = np.nan
dff.iloc[2,2] = np.nan
dff.iloc[1:4,1] = 0
dff.iloc[2,1] = np.nan
# Mask: element-wise logical AND of two boolean sequences
mask = lambda y, z: list(map(lambda x: x[0] and x[1], zip(y, z)))

# Create the new frame
new_df = pd.DataFrame()
types = []
vals = []

# Iterate over the columns
for col in dff.columns:
    # Get the non-empty and non-zero values from the current column
    data = dff[col][mask(dff[col].notnull(), dff[col] != 0)]
    # Record the corresponding original column name for each value
    types.extend([col for x in range(len(data))])
    vals.extend(data)

# Populate the DataFrame
new_df['Types'] = pd.Series(types)
new_df['Vals'] = pd.Series(vals)
print(new_df)
# A B C
#0 NaN -1.167975 -1.362128
#1 NaN 0.000000 1.388611
#2 1.482621 NaN NaN
#3 -1.108279 0.000000 -1.454491
# Types Vals
#0 A 1.482621
#1 A -1.108279
#2 B -1.167975
#3 C -1.362128
#4 C 1.388611
#5 C -1.454491
I am looking forward to a more pandas/Python-like answer myself!