I have two pandas DataFrames with the same columns:
df1 = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
df2 = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
I concatenate them into one:
df = pd.concat([df1, df2], ignore_index = False)
The original index values are preserved.
After I perform some data manipulation without changing the index values, how can I reverse the concatenation, so that I end up with a list of the two DataFrames again?
I recommend passing keys to concat, which labels the rows of each source frame in an outer index level:
df = pd.concat([df1, df2], ignore_index = False,keys=['df1','df2'])
df
Out[28]:
              a         b         c         d         e         f
df1 0  0.426246  0.162134  0.231001  0.645908  0.282457  0.715134
    1  0.973173  0.854198  0.419888  0.617750  0.115466  0.565804
    2  0.474284  0.757242  0.452319  0.046627  0.935915  0.540498
    3  0.046215  0.740778  0.204866  0.047914  0.143158  0.317274
    4  0.311755  0.456133  0.704235  0.255057  0.558791  0.319582
df2 0  0.449926  0.330672  0.830240  0.861221  0.234013  0.299515
    1  0.552645  0.620980  0.313907  0.039247  0.356451  0.849368
    2  0.159485  0.620178  0.428837  0.315384  0.910175  0.020809
    3  0.687249  0.824803  0.118434  0.661684  0.013440  0.611711
    4  0.576244  0.915196  0.544099  0.750581  0.192548  0.477207
To convert back, group by the outer index level and drop that level from each group:
df1,df2=[y.reset_index(level=0,drop=True) for _, y in df.groupby(level=0)]
df1
Out[30]:
          a         b         c         d         e         f
0  0.426246  0.162134  0.231001  0.645908  0.282457  0.715134
1  0.973173  0.854198  0.419888  0.617750  0.115466  0.565804
2  0.474284  0.757242  0.452319  0.046627  0.935915  0.540498
3  0.046215  0.740778  0.204866  0.047914  0.143158  0.317274
4  0.311755  0.456133  0.704235  0.255057  0.558791  0.319582
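As a variation on the keys approach, here is a minimal sketch that pulls each frame back out by its key with DataFrame.xs instead of relying on the order in which groupby yields the groups (groupby sorts the keys by default, so the order could differ from the concat order if the keys are not already sorted). The dictionary name parts is only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
df2 = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))

# Label each source frame in an outer index level
df = pd.concat([df1, df2], keys=['df1', 'df2'])

# xs selects the rows for one key and drops that index level,
# so each piece gets its original row index back
parts = {key: df.xs(key, level=0)
         for key in df.index.get_level_values(0).unique()}

print(parts['df1'].equals(df1))
print(parts['df2'].equals(df2))
```

Looking the frames up by key keeps the mapping explicit, which can help when there are many frames.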
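If you prefer to avoid groupby, you can slice the concatenated frame by position, using the lengths of the original frames. This assumes no rows were added, dropped, or reordered in between.
list_dfs = [df1, df2]
df = pd.concat(list_dfs, ignore_index = False)
new_dfs = []
counter = 0
for i in list_dfs:
    # take the next len(i) rows by position
    new_dfs.append(df.iloc[counter:counter+len(i)])
    counter += len(i)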
Related
I have a list of DataFrames, and I want to add a new column to each one containing the name of that DataFrame.
df_all = [df1,df2,df3]
for df in df_all:
    df["Loc"] = df[df].astype(str)
This fails with the error:
Boolean array expected for the condition, not object
Is this possible to achieve?
You can't do this directly: Python objects have no way of knowing the name(s) bound to them.
You could emulate it with:
df_all = [df1, df2, df3]
for i, df in enumerate(df_all, start=1):
    df['Loc'] = f'df{i}'
Alternatively, use a dictionary:
df_all = {'df1': df1, 'df2': df2, 'df3': df3}
for k, df in df_all.items():
    df['Loc'] = k
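If the original frames should stay untouched, a small sketch of the dictionary idea using DataFrame.assign, which returns a new frame rather than modifying the existing one. The frame contents and the names frames and labelled are made up for illustration.

```python
import pandas as pd

# Hypothetical example frames, keyed by the name we want in the column
frames = {
    'df1': pd.DataFrame({'x': [1, 2]}),
    'df2': pd.DataFrame({'x': [3]}),
}

# assign returns a copy with the extra column; the originals are not modified
labelled = {name: frame.assign(Loc=name) for name, frame in frames.items()}

print(labelled['df1'])
```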
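It can be done using the dictionary returned by the built-in locals(), which maps variable names to objects, together with the is operator to match by identity. This is fragile: it only works when each DataFrame is bound to a variable in the current scope.
df1, df2, df3 = pd.DataFrame([1, 1, 1]), pd.DataFrame([2, 2, 2]), pd.DataFrame([3, 3, 3])
df_all = [df1, df2, df3]
# Bind the loop names up front so locals() does not change size while iterating
_df = k = v = None
for _df in df_all:
    for k, v in locals().items():
        if v is _df and k != '_df':
            _df["Loc"] = k
print(*df_all, sep='\n\n')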
   0  Loc
0  1  df1
1  1  df1
2  1  df1

   0  Loc
0  2  df2
1  2  df2
2  2  df2

   0  Loc
0  3  df3
1  3  df3
2  3  df3
Say I have two DataFrames
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge them so that any values in df1 are overwritten if there is a value in df2 at that location, and any new values in df2 are added, including the new rows and columns.
The result should be:
     A  B    C
0    1  3  nan
1    2  8   10
2  nan  9   11
I've tried combine_first, but that only overwrites NaN values.
update has the issue where new rows are created rather than overwritten, and merge has many issues.
I've tried writing my own function:
def take_right(df1, df2, j, i):
    print (df1)
    print (df2)
    try:
        s1 = df1[j][i]
    except:
        s1 = np.NaN
    try:
        s2 = df2[j][i]
    except:
        s2 = np.NaN
    if math.isnan(s2):
        #print(s1)
        return s1
    else:
        # print(s2)
        return s2

def combine_df(df1, df2):
    rows = (set(df1.index.values.tolist()) | set(df2.index.values.tolist()))
    #print(rows)
    columns = (set(df1.columns.values.tolist()) | set(df2.columns.values.tolist()))
    #print(columns)
    df = pd.DataFrame()
    #df.columns = columns
    for i in rows:
        #df[:][i]=[]
        for j in columns:
            df = df.insert(int(i), j, take_right(df1,df2,j,i), allow_duplicates=False)
            # print(df)
    return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
One approach is to create an empty output DataFrame whose columns and index are the unions of those of df1 and df2, and then use the DataFrame.update method to write their values into out_df:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
columns = df1.columns.union(df2.columns),
index = df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
Why doesn't combine_first work? Calling it on df2 gives priority to df2's values and fills the remaining gaps from df1:
df = df2.combine_first(df1)
print(df)
Output:
     A  B     C
0  1.0  3   NaN
1  2.0  8  10.0
2  NaN  9  11.0
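One caveat worth checking before relying on combine_first: it treats a NaN in the calling frame as "no value", so a NaN in df2 does not overwrite a value in df1. A minimal sketch, with df2 modified from the question so that it contains a NaN at a location that df1 also covers (that modification is an assumption made purely for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=[0, 1])
# Same shape as the question's df2, but with a NaN where df1 has B == 4
df2 = pd.DataFrame({'B': [np.nan, 9], 'C': [10, 11]}, index=[1, 2])

# The calling frame (df2) wins wherever it has a non-NaN value
result = df2.combine_first(df1)
print(result)

# B at index 1 comes from df1, because df2 holds NaN there
print(result.loc[1, 'B'])
```

If NaN in df2 should count as a real overwrite, combine_first is not the right tool and the update-based approach has the same limitation, since update also skips NaN values by default.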
I have two large DataFrames that I don't want to copy, but I want to apply the same change to both. How can I do this properly? The example below is similar to what I want to do, on a smaller scale. The loop only rebinds the temporary variable df to each filtered result; I want the two DataFrames themselves to be changed:
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
for df in [df1, df2]:
    df = df[df['a'] < 3]
You can use query with inplace=True, which modifies each DataFrame in place:
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
for df in [df1, df2]:
    df.query('a<3',inplace=True)
df1
   a
0  1
1  2
df2
   a
0  0
1  1
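Another in-place option, sketched here as an alternative to query, is to drop the unwanted rows by label with drop(inplace=True). This assumes the index labels are unique, since drop removes every row that carries a given label; the loop variable name frame is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [0, 1, 5, 7]})

for frame in (df1, df2):
    # Labels of the rows that fail the condition a < 3
    unwanted = frame[frame['a'] >= 3].index
    # Mutates the object that df1 / df2 refer to, so no rebinding is needed
    frame.drop(unwanted, inplace=True)

print(df1)
print(df2)
```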
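I don't think this is the best solution, but it should do the job: replace each element of the list with the filtered frame. Note that the names df1 and df2 still refer to the unfiltered frames; the filtered ones live in dfs.
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
dfs = [df1, df2]
for i, df in enumerate(dfs):
    dfs[i] = df[df['a'] < 3]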
dfs[0]
   a
0  1
1  2
I am trying to convert a DataFrame that has a column of lists of varying length, for example:
d={'A':[1,2,3],'B':[[1,2,3],[3,5],[4]]}
df = pd.DataFrame(data=d)
df
to something like this:
d1={'A':[1,2,3],'B-1':[1,0,0],'B-2':[1,0,0],'B-3':[1,1,0],'B-4':[0,0,1],'B-5':[0,1,0]}
df1 = pd.DataFrame(data=d1)
df1
Thank you for the help
Explode the lists, apply get_dummies, and sum over the original index (use max instead, credit to JonClements, if you want true dummies rather than counts when a list can contain repeated values). Then join the result back. Grouping on index level 0 replaces the older sum(level=0) form, which recent pandas versions no longer accept:
dfB = pd.get_dummies(df['B'].explode()).groupby(level=0).sum().add_prefix('B-')
#dfB = pd.get_dummies(df['B'].explode()).groupby(level=0).max().add_prefix('B-')
df = pd.concat([df['A'], dfB], axis=1)
#   A  B-1  B-2  B-3  B-4  B-5
#0  1    1    1    1    0    0
#1  2    0    0    1    0    1
#2  3    0    0    0    1    0
You can use pop to remove the column you explode so you don't need to specify df[list_of_all_columns_except_B] in the concat:
df = pd.concat([df, pd.get_dummies(df.pop('B').explode()).groupby(level=0).sum().add_prefix('B-')],
               axis=1)
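For reference, a self-contained sketch of the explode-and-get_dummies approach that shows the difference between counts and true dummies. It aggregates with groupby(level=0), and the last list is changed to [4, 4] (an assumption for illustration, not the data from the question) so that a repeated value appears.

```python
import pandas as pd

# Last row repeats the value 4 to show counts vs. indicators
df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2, 3], [3, 5], [4, 4]]})

# One row per list element; the original row label is repeated in the index
dummies = pd.get_dummies(df['B'].explode(), dtype=int)

# Collapse back to one row per original row
counts = dummies.groupby(level=0).sum().add_prefix('B-')   # how many times each value occurs
flags = dummies.groupby(level=0).max().add_prefix('B-')    # 1 if the value occurs at all

result = pd.concat([df[['A']], flags], axis=1)
print(counts)
print(result)
```

Passing dtype=int to get_dummies keeps the output as 0/1 integers regardless of the default dtype of the installed pandas version.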
Suppose I have the following dataset (2 rows, 2 columns, headers are Char0 and Char1):
dataset = [['A', 'B'], ['B', 'C']]
columns = ['Char0', 'Char1']
df = pd.DataFrame(dataset, columns=columns)
I would like to one-hot encode the columns Char0 and Char1, so:
df = pd.concat([df, pd.get_dummies(df["Char0"], prefix='Char0')], axis=1)
df = pd.concat([df, pd.get_dummies(df["Char1"], prefix='Char1')], axis=1)
df.drop(['Char0', "Char1"], axis=1, inplace=True)
which results in a dataframe with column headers Char0_A, Char0_B, Char1_B, Char1_C.
Now I would like each column to have an indicator for each of A, B, C, and D (even though there is currently no 'D' in the dataset). In this case, this would mean 8 columns: Char0_A, Char0_B, Char0_C, Char0_D, Char1_A, Char1_B, Char1_C, Char1_D.
Can somebody help me out?
Use get_dummies on all columns, then call DataFrame.reindex with every possible column name, built with itertools.product:
dataset = [['A', 'B'], ['B', 'C']]
columns = ['Char0', 'Char1']
df = pd.DataFrame(dataset, columns=columns)
vals = ['A','B','C','D']
from itertools import product
cols = ['_'.join(x) for x in product(df.columns, vals)]
print (cols)
['Char0_A', 'Char0_B', 'Char0_C', 'Char0_D', 'Char1_A', 'Char1_B', 'Char1_C', 'Char1_D']
df1 = pd.get_dummies(df, dtype=int).reindex(cols, axis=1, fill_value=0)
print (df1)
   Char0_A  Char0_B  Char0_C  Char0_D  Char1_A  Char1_B  Char1_C  Char1_D
0        1        0        0        0        0        1        0        0
1        0        1        0        0        0        0        1        0
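A possible alternative to reindex, sketched under the assumption that every column shares the same set of allowed values: convert the columns to a categorical dtype with fixed categories first. get_dummies then emits one column per category, including categories that never occur in the data. The names cat_df and encoded are illustrative.

```python
import pandas as pd

df = pd.DataFrame([['A', 'B'], ['B', 'C']], columns=['Char0', 'Char1'])
vals = ['A', 'B', 'C', 'D']

# Declare the full set of allowed values up front, including the unseen 'D'
cat_df = df.astype(pd.CategoricalDtype(categories=vals))

# One indicator column per (column, category) pair, even for unused categories
encoded = pd.get_dummies(cat_df, dtype=int)
print(encoded)
```

This keeps the list of allowed values attached to the data itself, so there is no separate list of column names to keep in sync.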