I have a dataframe, where in one column (we'll call it info) all the cells/rows contain another dataframe inside. I want to loop through all the rows in this column and literally stack the nested dataframes on top of each other, because they all have the same columns
How would I go about this?
You could try as follows:
import pandas as pd
length=5
# some dfs
nested_dfs = [pd.DataFrame({'a': [*range(length)],
'b': [*range(length)]}) for x in range(length)]
print(nested_dfs[0])
a b
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
# df with nested_dfs in info
df = pd.DataFrame({'info_col': nested_dfs})
# code to be implemented
lst_dfs = df['info_col'].values.tolist()
df_final = pd.concat(lst_dfs,axis=0, ignore_index=True)
df_final.tail()
a b
20 0 0
21 1 1
22 2 2
23 3 3
24 4 4
This method should be a bit faster than the solution offered by nandoquintana, which also works.
Incidentally, it is ill advised to name a df column info. This is because df.info is actually a function. E.g., normally df['col_name'].values.tolist() can also be written as df.col_name.values.tolist(). However, if you try this with df.info.values.tolist(), you will run into an error:
AttributeError: 'function' object has no attribute 'values'
You also run the risk of overwriting the function if you start assigning values to your column on top of doing something which you probably don't want to do. E.g.:
print(type(df.info))
<class 'method'>
df.info=1
# column is unaffected, you just create an int variable
print(type(df.info))
<class 'int'>
# but:
df['info']=1
# your column now has all 1's
print(type(df['info']))
<class 'pandas.core.series.Series'>
This is the solution that I came up with, although it's not the fastest which is why I am still leaving the question unanswered
df1 = pd.DataFrame()
for frame in df['Info'].tolist():
df1 = pd.concat([df1, frame], axis=0).reset_index(drop=True)
Our dataframe has three columns (col1, col2 and info).
In info, each row has a nested df as value.
import pandas as pd
nested_d1 = {'coln1': [11, 12], 'coln2': [13, 14]}
nested_df1 = pd.DataFrame(data=nested_d1)
nested_d2 = {'coln1': [15, 16], 'coln2': [17, 18]}
nested_df2 = pd.DataFrame(data=nested_d2)
d = {'col1': [1, 2], 'col2': [3, 4], 'info': [nested_df1, nested_df2]}
df = pd.DataFrame(data=d)
We could combine all nested dfs rows appending them to a list (as nested dfs schema is constant) and concatenating them later.
nested_dfs = []
for index, row in df.iterrows():
nested_dfs.append(row['info'])
result = pd.concat(nested_dfs, sort=False).reset_index(drop=True)
print(result)
This would be the result:
coln1 coln2
0 11 13
1 12 14
2 15 17
3 16 18
I have 10 dataframes that have the same structure (same number of rows and columns) and I am trying to find an efficient way of performing several actions such as renaming columns with a for loop. I have tried putting them in a list such as
dfs = [df1, df2, df3]
for i in dfs:
i.rename(columns={'A': 'a1'},inplace=True)
but it doesn't work. Another issue occurs if I try to use a function and then loop such as:
def groupdfs(anydf)
anydf = anydf.groupby("A").sum
for i in dfs:
groupdfs(i)
No changes are happening to the dataframes. I have searched similar old questions but nothing have worked. What is the best way to loop through many dataframes when you want to perform the same changes to each of them?
The first piece of code you wrote should work fine.
data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
dff = pd.DataFrame(data, columns=['c', 'a', 'b'])
dff
c a b
0 1 2 3
1 4 5 6
2 7 8 9
dato = np.array([(11, 12, 13), (41, 15, 16), (17, 18, 9)])
dfg = pd.DataFrame(data, columns=['c', 'a', 'b'])
dfg
c a b
0 1 2 3
1 4 5 6
2 7 8 9
dffs = [dff, dfg]
for i in dffs:
i.rename(columns={'a': 'a1'},inplace=True)
c a1 b
0 1 2 3
1 4 5 6
2 7 8 9
The only thing I can think of is that you should add a line in the end to save changes to files.
For the first part
Since everything is the same, you can create a list with new column names and assign it to all of them like this:
column_names = ['a1', 'a2', 'a3']
for df in [df1, df2, df3]:
df.columns = column_names
Or, if you want to use dictionary to change some columns only you can:
for df in [df1, df2, df3]:
df.rename({'A':'a1'}, axis=1, inplace = True)
Note that axis = 1 stands for columns level
For the second part
There are two issues:
The groupby creates a new DataFrame that has to be assigned to a variable if you want to use it again
Since you are in a function, you have to return that new DataFrame to be assigned to a variable outside the function as below:
def groupdfs(anydf)
return anydf.groupby("A").sum()
for i in dfs:
i = groupdfs(i)
This will replace the old DataFrame with the new groupby one. It's better to create new variables for the new groupby dataframes
I've two pandas data frames that have some rows in common.
Suppose dataframe2 is a subset of dataframe1.
How can I get the rows of dataframe1 which are not in dataframe2?
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
df2
col1 col2
0 1 10
1 2 11
2 3 12
Expected result:
col1 col2
3 4 13
4 5 14
The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.
First, we need to modify the original DataFrame to add the row with data [3, 10].
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 3 10
df2
col1 col2
0 1 10
1 2 11
2 3 12
Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.
df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
df_all
col1 col2 _merge
0 1 10 both
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 3 10 left_only
Create a boolean condition:
df_all['_merge'] == 'left_only'
0 False
1 False
2 False
3 True
4 True
5 True
Name: _merge, dtype: bool
Why other solutions are wrong
A few solutions make the same mistake - they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has the values from both columns from df2 exposes the mistake:
common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
This solution gets the same wrong result:
df1.isin(df2.to_dict('l')).all(1)
One method would be to store the result of an inner merge form both dfs, then we can simply select the rows when one column's values are not in this common:
In [119]:
common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
col1 col2
0 1 10
1 2 11
2 3 12
Out[119]:
col1 col2
3 4 13
4 5 14
EDIT
Another method as you've found is to use isin which will produce NaN rows which you can drop:
In [138]:
df1[~df1.isin(df2)].dropna()
Out[138]:
col1 col2
3 4 13
4 5 14
However if df2 does not start rows in the same manner then this won't work:
df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})
will produce the entire df:
In [140]:
df1[~df1.isin(df2)].dropna()
Out[140]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):
df1[~df1.index.isin(df2.index)]
As already hinted at, isin requires columns and indices to be the same for a match. If match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:
In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
col1 col2
1 2 11
4 5 14
5 3 10
If index should be taken into account, set_index has keyword argument append to append columns to existing index. If columns do not line up, list(df.columns) can be replaced with column specifications to align the data.
pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())
could alternatively be used to create the indices, though I doubt this is more efficient.
Suppose you have two dataframes, df_1 and df_2 having multiple fields(column_names) and you want to find the only those entries in df_1 that are not in df_2 on the basis of some fields(e.g. fields_x, fields_y), follow the following steps.
Step1.Add a column key1 and key2 to df_1 and df_2 respectively.
Step2.Merge the dataframes as shown below. field_x and field_y are our desired columns.
Step3.Select only those rows from df_1 where key1 is not equal to key2.
Step4.Drop key1 and key2.
This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
a bit late, but it might be worth checking the "indicator" parameter of pd.merge.
See this other question for an example:
Compare PandaS DataFrames and return rows that are missing from the first one
This is the best way to do it:
df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(),
how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']
Note that drop duplicated is used to minimize the comparisons. It would work without them as well. The best way is to compare the row contents themselves and not the index or one/two columns and same code can be used for other filters like 'both' and 'right_only' as well to achieve similar results. For this syntax dataframes can have any number of columns and even different indices. Only the columns should occur in both the dataframes.
Why this is the best way?
index.difference only works for unique index based comparisons
pandas.concat() coupled with drop_duplicated() is not ideal because it will also get rid of the rows which may be only in the dataframe you want to keep and are duplicated for valid reasons.
I think those answers containing merging are extremely slow. Therefore I would suggest another way of getting those rows which are different between the two dataframes:
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
DISCLAIMER: My solution works if you're interested in one specific column where the two dataframes differ. If you are interested only in those rows, where all columns are equal do not use this approach.
Let's say, col1 is a kind of ID, and you only want to get those rows, which are not contained in both dataframes:
ids_in_df2 = df2.col1.unique()
not_found_ids = df[~df['col1'].isin(ids_in_df2 )]
And that's it. You get a dataframe containing only those rows where col1 isn't appearent in both dataframes.
You can also concat df1, df2:
x = pd.concat([df1, df2])
and then remove all duplicates:
y = x.drop_duplicates(keep=False, inplace=False)
I have an easier way in 2 simple steps:
As the OP mentioned Suppose dataframe2 is a subset of dataframe1, columns in the 2 dataframes are the same,
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
### Step 1: just append the 2nd df at the end of the 1st df
df_both = df1.append(df2)
### Step 2: drop rows which contain duplicates, Drop all duplicates.
df_dif = df_both.drop_duplicates(keep=False)
## mission accompliched!
df_dif
Out[20]:
col1 col2
3 4 13
4 5 14
5 3 10
you can do it using isin(dict) method:
In [74]: df1[~df1.isin(df2.to_dict('l')).all(1)]
Out[74]:
col1 col2
3 4 13
4 5 14
Explanation:
In [75]: df2.to_dict('l')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}
In [76]: df1.isin(df2.to_dict('l'))
Out[76]:
col1 col2
0 True True
1 True True
2 True True
3 False False
4 False False
In [77]: df1.isin(df2.to_dict('l')).all(1)
Out[77]:
0 True
1 True
2 True
3 False
4 False
dtype: bool
Here is another way of solving this:
df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
Or:
df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]
extract the dissimilar rows using the merge function
df = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
save the dissimilar rows in CSV
df[df['_merge'] == 'left_only'].to_csv('output.csv')
My way of doing this involves adding a new column that is unique to one dataframe and using this to choose whether to keep an entry
df2[col3] = 1
df1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'outer')
df1['Empt'].fillna(0, inplace=True)
This makes it so every entry in df1 has a code - 0 if it is unique to df1, 1 if it is in both dataFrames. You then use this to restrict to what you want
answer = nonuni[nonuni['Empt'] == 0]
How about this:
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5],
'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
records_df2 = set([tuple(row) for row in df2.values])
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]
Easier, simpler and elegant
uncommon_indices = np.setdiff1d(df1.index.values, df2.index.values)
new_df = df1.loc[uncommon_indices,:]
pd.concat([df1, df2]).drop_duplicates(keep=False) will concatenate the two DataFrames together, and then drop all the duplicates, keeping only the unique rows. By default it will keep the first occurrence of the duplicate, but setting keep=False will drop all the duplicates.
Keep in mind that if you need to compare the DataFrames with columns with different names, you will have to make sure the columns have the same name before concatenating the dataframes.
Also, if the dataframes have a different order of columns, it will also affect the final result.
I am not sure if this is possible. I have two dataframes df1 and df2 which are presented like this:
df1 df2
id value id value
a 5 a 3
c 9 b 7
d 4 c 6
f 2 d 8
e 2
f 1
They will have many more entries in reality than presented here. I would like to create a third dataframe df3 based on the values in df1 and df2. Any values in df1 would take precedence over values in df2 when writing to df3 (if the same id is present in both df1 and df2) so in this example I would return:
df3
id value
a 5
b 7
c 9
d 4
e 2
f 2
I have tried using df2 as the base (df2 will have all of the id's present for the whole universe) and then overwriting the value for id's that are present in df1, but cannot find the merge syntax to do this.
You could use combine_first, provided that you first make the DataFrame index id (so that the values get aligned by id):
In [80]: df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Out[80]:
id value
0 a 5.0
1 b 7.0
2 c 9.0
3 d 4.0
4 e 2.0
5 f 2.0
Since you mentioned merging, you might be interested in seeing that
you could merge df1 and df2 on id, and then use fillna to replace NaNs in df1's the value column with values from df2's value column:
df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [3, 7, 6, 8, 2, 1]})
result = pd.merge(df2, df1, on='id', how='left', suffixes=('_x', ''))
result['value'] = result['value'].fillna(result['value_x'])
result = result[['id', 'value']]
print(result)
yields the same result, though the first method is simpler.
With the nice indexing methods in Pandas I have no problems extracting data in various ways. On the other hand I am still confused about how to change data in an existing DataFrame.
In the following code I have two DataFrames and my goal is to update values in a specific row in the first df from values of the second df. How can I achieve this?
import pandas as pd
df = pd.DataFrame({'filename' : ['test0.dat', 'test2.dat'],
'm': [12, 13], 'n' : [None, None]})
df2 = pd.DataFrame({'filename' : 'test2.dat', 'n':16}, index=[0])
# this overwrites the first row but we want to update the second
# df.update(df2)
# this does not update anything
df.loc[df.filename == 'test2.dat'].update(df2)
print(df)
gives
filename m n
0 test0.dat 12 None
1 test2.dat 13 None
[2 rows x 3 columns]
but how can I achieve this:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
[2 rows x 3 columns]
So first of all, pandas updates using the index. When an update command does not update anything, check both left-hand side and right-hand side. If you don't update the indices to follow your identification logic, you can do something along the lines of
>>> df.loc[df.filename == 'test2.dat', 'n'] = df2[df2.filename == 'test2.dat'].loc[0]['n']
>>> df
Out[331]:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
If you want to do this for the whole table, I suggest a method I believe is superior to the previously mentioned ones: since your identifier is filename, set filename as your index, and then use update() as you wanted to. Both merge and the apply() approach contain unnecessary overhead:
>>> df.set_index('filename', inplace=True)
>>> df2.set_index('filename', inplace=True)
>>> df.update(df2)
>>> df
Out[292]:
m n
filename
test0.dat 12 None
test2.dat 13 16
In SQL, I would have do it in one shot as
update table1 set col1 = new_value where col1 = old_value
but in Python Pandas, we could just do this:
data = [['ram', 10], ['sam', 15], ['tam', 15]]
kids = pd.DataFrame(data, columns = ['Name', 'Age'])
kids
which will generate the following output :
Name Age
0 ram 10
1 sam 15
2 tam 15
now we can run:
kids.loc[kids.Age == 15,'Age'] = 17
kids
which will show the following output
Name Age
0 ram 10
1 sam 17
2 tam 17
which should be equivalent to the following SQL
update kids set age = 17 where age = 15
If you have one large dataframe and only a few update values I would use apply like this:
import pandas as pd
df = pd.DataFrame({'filename' : ['test0.dat', 'test2.dat'],
'm': [12, 13], 'n' : [None, None]})
data = {'filename' : 'test2.dat', 'n':16}
def update_vals(row, data=data):
if row.filename == data['filename']:
row.n = data['n']
return row
df.apply(update_vals, axis=1)
Update null elements with value in the same location in other.
Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.
df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1.combine_first(df2)
A B
0 1.0 3.0
1 0.0 4.0
more information in this link
There are probably a few ways to do this, but one approach would be to merge the two dataframes together on the filename/m column, then populate the column 'n' from the right dataframe if a match was found. The n_x, n_y in the code refer to the left/right dataframes in the merge.
In[100] : df = pd.merge(df1, df2, how='left', on=['filename','m'])
In[101] : df
Out[101]:
filename m n_x n_y
0 test0.dat 12 None NaN
1 test2.dat 13 None 16
In[102] : df['n'] = df['n_y'].fillna(df['n_x'])
In[103] : df = df.drop(['n_x','n_y'], axis=1)
In[104] : df
Out[104]:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
If you want to put anything in the iith row, add square brackets:
df.loc[df.iloc[ii].name, 'filename'] = [{'anything': 0}]
I needed to update and add suffix to few rows of the dataframe on conditional basis based on the another column's value of the same dataframe -
df with column Feature and Entity and need to update Entity based on specific feature type
df.loc[df.Feature == 'dnb', 'Entity'] = 'duns_' + df.loc[df.Feature == 'dnb','Entity']