I have two large DataFrames that I don't want to make copies of, but I do want to apply the same change to both. How can I do this properly? For example, the code below is similar to what I want to do, but on a smaller scale. It only creates a temporary variable df that holds the result for each DataFrame; I want both DataFrames themselves to be changed:
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
for df in [df1, df2]:
    df = df[df['a'] < 3]
We can use query with inplace=True:
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
for df in [df1, df2]:
    df.query('a < 3', inplace=True)
df1
   a
0  1
1  2

df2
   a
0  0
1  1
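If you'd rather avoid query strings, the same in-place filtering can be done with drop (a sketch, not from the original answer):

for df in [df1, df2]:
    # drop the rows that fail the condition; inplace=True mutates the frame itself
    df.drop(df[df['a'] >= 3].index, inplace=True)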
I don't think this is the best solution, but it should do the job.
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame({'a':[0,1,5,7]})
dfs = [df1, df2]
for i, df in enumerate(dfs):
    dfs[i] = df[df['a'] < 3]
dfs[0]
   a
0  1
1  2
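Note that this only replaces the entries of the list dfs; the names df1 and df2 still point at the original, unfiltered frames. A sketch of rebinding the names themselves instead:

# reassign the original names to the filtered results
df1, df2 = (df[df['a'] < 3] for df in (df1, df2))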
Related
I have a list of dataframes, and I want to add a new column to each dataframe containing the name of that dataframe.
df_all = [df1,df2,df3]
for df in df_all:
    df["Loc"] = df[df].astype(str)
This raises:
ValueError: Boolean array expected for the condition, not object
Is this possible to achieve?
You can't do this directly; Python objects have no way of knowing their own name(s).
You could emulate it with:
df_all = [df1, df2, df3]
for i, df in enumerate(df_all, start=1):
    df['Loc'] = f'df{i}'
Alternatively, use a dictionary:
df_all = {'df1': df1, 'df2': df2, 'df3': df3}
for k, df in df_all.items():
    df['Loc'] = k
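A quick self-contained check of the dictionary approach (a minimal sketch with toy frames, since the question's frames aren't shown):

import pandas as pd

df1, df2, df3 = (pd.DataFrame({'val': [i]}) for i in (1, 2, 3))
df_all = {'df1': df1, 'df2': df2, 'df3': df3}
for k, df in df_all.items():
    df['Loc'] = k

print(df1)
#    val  Loc
# 0    1  df1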
It can be done using the locals() dictionary, which maps variable names to object references, together with the is operator to match on identity.
df1, df2, df3 = pd.DataFrame([1, 1, 1]), pd.DataFrame([2, 2, 2]), pd.DataFrame([3, 3, 3])
df_all = [df1, df2, df3]
_df = k = v = None  # pre-declare the loop names so locals() doesn't change size during iteration
for _df in df_all:
    for k, v in locals().items():
        if v is _df and k != '_df':
            _df["Loc"] = k
print(*df_all, sep='\n\n')
   0  Loc
0  1  df1
1  1  df1
2  1  df1

   0  Loc
0  2  df2
1  2  df2
2  2  df2

   0  Loc
0  3  df3
1  3  df3
2  3  df3
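One caveat worth adding: if two variables reference the same DataFrame, the is-based scan matches both names, so the label you get is ambiguous. A minimal sketch:

import pandas as pd

df_x = pd.DataFrame([1])
alias = df_x  # same object, two names

matches = [k for k, v in list(locals().items()) if v is df_x]
print(matches)  # ['df_x', 'alias'] -- both names match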
The goal of the following code is to go through each row in df_label, extract the app1 and app2 names, filter df_all using those two names, concatenate the result, and return it as a dataframe. Here is the code:
def create_dataset(se):
    # extracting the names of applications
    app1 = se.app1
    app2 = se.app2
    # extracting each application from df_all
    df1 = df_all[df_all.workload == app1]
    df1.columns = df1.columns + '_0'
    df2 = df_all[df_all.workload == app2]
    df2.columns = df2.columns + '_1'
    # combining workloads to create the pairs dataframe
    df3 = pd.concat([df1, df2], axis=1)
    display(df3)
    return df3
df_pairs = pd.DataFrame()
df_label.apply(create_dataset, axis=1)
#df_pairs = df_pairs.append(df_label.apply(create_dataset, axis=1))
I would like to append all the dataframes returned from apply. However, while display(df3) shows the correct dataframe, what comes back from apply is not a dataframe anymore but a series: a series with one element, and that element seems to be the whole dataframe. Any ideas what I am doing wrong?
When you select a single column, you'll get a Series instead of a DataFrame, so df1 and df2 will both be series.
However, concatenating them on axis=1 should produce a DataFrame (whereas combining them on axis=0 would produce a series). For example:
df = pd.DataFrame({'a':[1,2],'b':[3,4]})
df1 = df['a']
df2 = df['b']
>>> pd.concat([df1, df2], axis=1)
   a  b
0  1  3
1  2  4
>>> pd.concat([df1, df2], axis=0)
0    1
1    2
0    3
1    4
dtype: int64
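Back to the original goal: df_label.apply(create_dataset, axis=1) returns a Series whose elements are the per-row DataFrames. One way to stack them into df_pairs (a sketch, assuming df_label and create_dataset from the question):

results = df_label.apply(create_dataset, axis=1)
# each element of the resulting Series is a DataFrame; concat flattens them
df_pairs = pd.concat(results.tolist(), ignore_index=True)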
I have two data frames that I would like to compare for equality in a row-wise manner. I am interested in computing the number of rows that have the same values for non-joined attributes.
For example,
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,10,30]})
df2 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,40,40]})
I will be joining these two data frames on columns a and b. There are two rows (the first two) that have the same values for c and d in both data frames.
I am currently using the following approach, where I first join the two data frames and then check each row's values for equality.
df = df1.merge(df2, on=['a','b'])
cols1 = [c for c in df.columns.tolist() if c.endswith("_x")]
cols2 = [c for c in df.columns.tolist() if c.endswith("_y")]
num_rows_equal = 0
for index, row in df.iterrows():
    not_equal = False
    for col1, col2 in zip(cols1, cols2):
        if row[col1] != row[col2]:
            not_equal = True
            break
    if not not_equal:  # row values are equal
        num_rows_equal += 1
num_rows_equal
Is there a more efficient (pythonic) way to achieve the same result?
A shorter way of achieving that:
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,10,30]})
df2 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,40,40]})
df = df1.merge(df2, on=['a','b'])
# slice off the '_x' suffix; c.strip('_x') would strip characters from both ends, not the suffix
comparison_cols = [c[:-2] for c in df.columns.tolist() if c.endswith("_x")]
num_rows_equal = (df1[comparison_cols][df1[comparison_cols] == df2[comparison_cols]].isna().sum(axis=1) == 0).sum()
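Another vectorized option (a sketch) compares the _x and _y columns of the merged frame directly as arrays, which avoids relying on df1 and df2 sharing an aligned index:

df = df1.merge(df2, on=['a', 'b'])
cols1 = sorted(c for c in df.columns if c.endswith('_x'))
cols2 = sorted(c for c in df.columns if c.endswith('_y'))
# count rows where every _x value equals its _y counterpart
num_rows_equal = int((df[cols1].to_numpy() == df[cols2].to_numpy()).all(axis=1).sum())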
Use pd.merge_ordered, merging with how='inner'. From there, you can get your dataframe's shape and, by extension, your number of rows.
df_r = pd.merge_ordered(df1, df2, how='inner')
   a  b   c   d
0  1  2  60  50
1  2  3  20  90

no_of_rows = df_r.shape[0]
# print(no_of_rows)
# 2
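A plain inner merge on all shared columns gives the same count (a sketch; this assumes identical column names in both frames and no duplicate rows, which would inflate the count):

no_of_rows = len(df1.merge(df2, how='inner'))  # joins on all common columns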
Suppose I have two pandas DataFrames with the same columns.
df1 = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
df2 = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
I concatenate them into one:
df = pd.concat([df1, df2], ignore_index = False)
With ignore_index=False, the original index values are kept.
After I perform some data manipulation without changing the index values, how can I reverse back the concatenation, so that I end up with a list of the two data frames again?
I recommend using keys in concat:
df = pd.concat([df1, df2], ignore_index=False, keys=['df1', 'df2'])
df
Out[28]:
a b c d e f
df1 0 0.426246 0.162134 0.231001 0.645908 0.282457 0.715134
1 0.973173 0.854198 0.419888 0.617750 0.115466 0.565804
2 0.474284 0.757242 0.452319 0.046627 0.935915 0.540498
3 0.046215 0.740778 0.204866 0.047914 0.143158 0.317274
4 0.311755 0.456133 0.704235 0.255057 0.558791 0.319582
df2 0 0.449926 0.330672 0.830240 0.861221 0.234013 0.299515
1 0.552645 0.620980 0.313907 0.039247 0.356451 0.849368
2 0.159485 0.620178 0.428837 0.315384 0.910175 0.020809
3 0.687249 0.824803 0.118434 0.661684 0.013440 0.611711
4 0.576244 0.915196 0.544099 0.750581 0.192548 0.477207
Convert back:
df1, df2 = [y.reset_index(level=0, drop=True) for _, y in df.groupby(level=0)]
df1
Out[30]:
a b c d e f
0 0.426246 0.162134 0.231001 0.645908 0.282457 0.715134
1 0.973173 0.854198 0.419888 0.617750 0.115466 0.565804
2 0.474284 0.757242 0.452319 0.046627 0.935915 0.540498
3 0.046215 0.740778 0.204866 0.047914 0.143158 0.317274
4 0.311755 0.456133 0.704235 0.255057 0.558791 0.319582
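With keys in place, you can also pull a single frame back out without the groupby, since the keys become the first level of the index (a sketch):

df1 = df.loc['df1']
df2 = df.loc['df2']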
If you prefer to do it without groupby, you could use this.
list_dfs = [df1, df2]
df = pd.concat(list_dfs, ignore_index = False)
new_dfs = []
counter = 0
for i in list_dfs:
    new_dfs.append(df[counter:counter + len(i)])
    counter += len(i)
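Equivalently (a sketch, assuming the frames were concatenated in list order), the slice boundaries can be computed in one pass:

import numpy as np

bounds = np.cumsum([0] + [len(d) for d in list_dfs])
new_dfs = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]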
Is it possible to append to an empty data frame that doesn't contain any indices or columns?
I have tried to do this, but keep getting an empty dataframe at the end.
e.g.
import pandas as pd
df = pd.DataFrame()
data = ...  # some kind of data here; I have checked the type already, and it is a dataframe
df.append(data)
The result looks like this:
Empty DataFrame
Columns: []
Index: []
This should work:
>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df = df.append(data)
>>> df
A
0 0
1 1
2 2
Since the append doesn't happen in place, you'll have to store the output if you want it:
>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df.append(data) # without storing
>>> df
Empty DataFrame
Columns: []
Index: []
>>> df = df.append(data)
>>> df
A
0 0
1 1
2 2
And if you want to add a row, you can use a dictionary:
df = pd.DataFrame()
df = df.append({'name': 'Zed', 'age': 9, 'height': 2}, ignore_index=True)
which gives you:
   age  height name
0    9       2  Zed
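Since append is deprecated in recent pandas (see the last answer below), the concat equivalent of the dictionary row wraps it in a one-row DataFrame first (a sketch):

row = pd.DataFrame([{'name': 'Zed', 'age': 9, 'height': 2}])
df = pd.concat([df, row], ignore_index=True)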
You can concat the data in this way:
InfoDF = pd.DataFrame()
tempDF = pd.DataFrame(rows, columns=['id', 'min_date'])  # rows: your source data
InfoDF = pd.concat([InfoDF, tempDF])
The answers are very useful, but since pandas.DataFrame.append was deprecated (as already mentioned by various users) and the answers using pandas.concat are not runnable code snippets, I would like to add the following snippet:
import pandas as pd
df = pd.DataFrame(columns=['name', 'age'])
row_to_append = pd.DataFrame([{'name': "Alice", 'age': "25"}, {'name': "Bob", 'age': "32"}])
df = pd.concat([df, row_to_append])
So df is now:
    name age
0  Alice  25
1    Bob  32
pandas.DataFrame.append has been deprecated since version 1.4.0: use concat() instead.
Therefore:
df = pd.DataFrame()        # empty dataframe
df2 = pd.DataFrame(...)    # some dataframe with data
df = pd.concat([df, df2])
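When building a frame in a loop, it is usually faster to collect the pieces in a list and call concat once at the end, rather than concatenating inside the loop (a sketch with hypothetical chunks):

import pandas as pd

pieces = []
for i in range(3):  # hypothetical source of per-iteration data
    pieces.append(pd.DataFrame({'a': [i]}))
df = pd.concat(pieces, ignore_index=True)  # one concat instead of many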