I have a data frame with:
A B C
1 3 6
I want to take two of the columns and create a column D that reads {"A":"1", "C":"6"}.
The new dataframe output would be:
A B C D
1 3 6 {"A":"1", "C":"6"}
I have the following code:
df['D'] = df.apply(lambda x: x.to_json(), axis=1)
but this takes all of the columns, while I only need columns A and C and want to leave B out of the JSON that is created.
Any tips on just targeting the two columns would be appreciated.
It's not exactly what you ask, but you can convert your two columns into a dict; then, if you want to export your data in JSON format, use df['D'].to_json():
df['D'] = df[['A', 'C']].apply(dict, axis=1)
print(df)
# Output
A B C D
0 1 3 6 {'A': 1, 'C': 6}
For example, to export column D as JSON:
print(df['D'].to_json(orient='records', indent=4))
# Output
[
{
"A":1,
"C":6
}
]
Use a subset in the lambda function:
df['D'] = df.apply(lambda x: x[['A','C']].to_json(), axis=1)
Or select the columns before apply:
df['D'] = df[['A','C']].apply(lambda x: x.to_json(), axis=1)
If possible, create dictionaries:
df['D'] = df[['A','C']].to_dict(orient='records')
print (df)
A B C D
0 1 3 6 {'A': 1, 'C': 6}
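The desired output in the question quotes the values as strings; a minimal sketch (assuming the same df as above) that casts the two columns first so the per-row JSON matches it:
# Sketch only: cast A and C to str so the JSON values come out quoted,
# matching the asked-for {"A": "1", "C": "6"}.
df['D'] = df[['A', 'C']].astype(str).apply(lambda row: row.to_json(), axis=1)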
I have two dataframes like this
import pandas as pd
df1 = pd.DataFrame(
{
'A': list('abcaewar'),
'B': list('ghjglmgb'),
'C': list('lkjlytle'),
'ignore': ['stuff'] * 8
}
)
df2 = pd.DataFrame(
{
'A': list('abfu'),
'B': list('ghio'),
'C': list('lkqw'),
'stuff': ['ignore'] * 4
}
)
and I would like to remove all rows in df1 where A, B and C are identical to values in df2, so in the above case the expected outcome is
A B C ignore
0 c j j stuff
1 e l y stuff
2 w m t stuff
3 r b e stuff
One way of achieving this would be
comp_columns = ['A', 'B', 'C']
df1 = df1.set_index(comp_columns)
df2 = df2.set_index(comp_columns)
keep_ind = [
ind for ind in df1.index if ind not in df2.index
]
new_df1 = df1.loc[keep_ind].reset_index()
Does anyone see a more straightforward way of doing this which avoids the reset_index() operations and the loop to identify non-overlapping indices, e.g. by a smart way of masking? Ideally, I don't want to hardcode the columns, but define them in a list as above, as I sometimes need 2, sometimes 3, and sometimes 4 or more columns for the removal.
Use DataFrame.merge with optional parameter indicator=True, then use boolean masking to filter the rows in df1:
df3 = df1.merge(df2[['A', 'B', 'C']], on=['A', 'B', 'C'], indicator=True, how='left')
df3 = df3[df3.pop('_merge').eq('left_only')]
Result:
# print(df3)
A B C ignore
2 c j j stuff
4 e l y stuff
5 w m t stuff
7 r b e stuff
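Since the question asks not to hardcode the columns, the same merge can be driven by a list of keys; a minimal sketch under that assumption:
comp_columns = ['A', 'B', 'C']  # works equally with 2, 3, 4 or more key columns
df3 = df1.merge(df2[comp_columns], on=comp_columns, indicator=True, how='left')
new_df1 = df3[df3.pop('_merge').eq('left_only')]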
I have two different dataframes and I want to get the sorted values of two columns.
Setup
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
'id': range(7),
'c': list('EDBBCCC')
})
df2 = pd.DataFrame({
'id': range(8),
'c': list('EBBCCCAA')
})
Desired Output
# notice that ABCDE appear in alphabetical order
c_first c_second
NaN A
B B
C C
D NaN
E E
What I've tried
pd.concat([df1.c.sort_values().drop_duplicates().rename('c_first'),
df2.c.sort_values().drop_duplicates().rename('c_second')
],axis=1)
How can I get the output in the required format?
Here is one possible way to achieve it:
t1 = df1.c.drop_duplicates()
t2 = df2.c.drop_duplicates()
tmp1 = pd.DataFrame({'id':t1, 'c_first':t1})
tmp2 = pd.DataFrame({'id':t2, 'c_second':t2})
result = pd.merge(tmp1,tmp2, how='outer').sort_values('id').drop('id', axis=1)
result
c_first c_second
4 NaN A
0 B B
1 C C
2 D NaN
3 E E
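For comparison, a hedged alternative sketch that builds the aligned frame directly from the sorted union of values (it assumes the pd and np imports from the setup above):
cats = sorted(set(df1['c']) | set(df2['c']))  # ['A', 'B', 'C', 'D', 'E']
result = pd.DataFrame({
    'c_first':  [c if c in set(df1['c']) else np.nan for c in cats],
    'c_second': [c if c in set(df2['c']) else np.nan for c in cats],
})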
There is a sort argument in the concat function; try adding sort=True. See the docs: https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.concat.html
I am stuck with a seemingly easy problem: dropping unique rows in a pandas dataframe. Basically, the opposite of drop_duplicates().
Let's say this is my data:
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
I would like to drop the rows where A and B are unique, i.e. I would like to keep only rows 1 and 2.
I tried the following:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates()
duplicates = df[~df.index.isin(uniques.index)]
But I only get the row 2, as 0, 1, and 3 are in the uniques!
Solutions for selecting all duplicated rows:
You can use duplicated with subset and the parameter keep=False to select all duplicates:
df = df[df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
1 foo 1 A
2 foo 1 B
Solution with transform:
df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
print (df)
A B C
1 foo 1 A
2 foo 1 B
Slightly modified solutions for selecting all unique rows:
#invert boolean mask by ~
df = df[~df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
0 foo 0 A
3 bar 1 A
df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
print (df)
A B C
0 foo 0 A
3 bar 1 A
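Both masks take the key columns as a list, so nothing needs to be hardcoded; a small sketch that splits the frame in one pass:
keys = ['A', 'B']  # any subset of columns
dup_mask = df.duplicated(subset=keys, keep=False)
duplicates, uniques = df[dup_mask], df[~dup_mask]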
I came up with a solution using groupby. Note that the match has to be made on the key columns, not on the index, because the grouped frame gets a fresh positional index:
grouped = df.groupby(['A', 'B']).size().reset_index().rename(columns={0: 'count'})
dup_keys = grouped[grouped['count'] > 1]
duplicates = df[df.set_index(['A', 'B']).index.isin(dup_keys.set_index(['A', 'B']).index)]
duplicates now has the proper result:
A B C
1 foo 1 A
2 foo 1 B
Also, my original attempt in the question can be fixed by simply adding keep=False in the drop_duplicates method:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates(keep=False)
duplicates = df[~df.index.isin(uniques.index)]
Please prefer @jezrael's answer, though; I think it is the safest, as I am relying on pandas indexes here.
df1 = df.drop_duplicates(['A', 'B'],keep=False)
df1 = pd.concat([df, df1])
df1 = df1.drop_duplicates(keep=False)
This technique is more suitable when you have two datasets dfX and dfY with millions of records. You may first concatenate dfX and dfY and follow the same steps.
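A hedged sketch of that two-dataset case (dfX and dfY are placeholder names; the steps are the same ones shown above, applied to the concatenation):
combined = pd.concat([dfX, dfY], ignore_index=True)
uniq = combined.drop_duplicates(['A', 'B'], keep=False)  # rows whose (A, B) occurs once overall
dups = pd.concat([combined, uniq]).drop_duplicates(keep=False)  # the repeated (A, B) rows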
Let's say I have a data frame with such column names:
['a','b','c','d','e','f','g']
And I would like to change names from 'c' to 'f' (actually add string to the name of column), so the whole data frame column names would look like this:
['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']
Well, firstly I made a function that changes column names with the string I want:
df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)
But now I really want to understand how to implement something like this:
df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)
You can use a list comprehension for that, like:
Code:
new_columns = ['var_{}_equal'.format(c) if c in 'cdef' else c for c in df.columns]
Test Code:
import pandas as pd
df = pd.DataFrame({'a':(1,2), 'b':(1,2), 'c':(1,2), 'd':(1,2)})
print(df)
df.columns = ['var_{}_equal'.format(c) if c in 'cdef' else c
for c in df.columns]
print(df)
Results:
a b c d
0 1 1 1 1
1 2 2 2 2
a b var_c_equal var_d_equal
0 1 1 1 1
1 2 2 2 2
One way is to use a dictionary instead of an anonymous function. The first two variations below assume the columns you need to rename are contiguous.
Contiguous columns by position
d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
Contiguous columns by name
If you need to calculate the numerical indices:
cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)
Specifically identified columns
If you want to provide the columns explicitly:
d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)
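As a quick sketch, checking the by-name variant against the question's columns (a through g):
import pandas as pd

df = pd.DataFrame(columns=list('abcdefg'))
cols = df.columns.get_loc
d = {k: 'var_' + k + '_equal' for k in df.columns[cols('c'):cols('f') + 1]}
print(df.rename(columns=d).columns.tolist())
# ['a', 'b', 'var_c_equal', 'var_d_equal', 'var_e_equal', 'var_f_equal', 'g']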
I have a dataframe with repeated column names which account for repeated measurements.
import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df2 = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df3 = pd.concat([df, df2], axis=1)
df3
A B A B
0 -0.875884 -0.298203 0.877414 1.282025
1 1.605602 -0.127038 -0.286237 0.572269
2 1.349540 -0.067487 0.126440 1.063988
3 -0.142809 1.282968 0.941925 -1.593592
4 -0.630353 1.888605 -1.176436 -1.623352
I'd like to take the mean of cols 'A's and 'B's such that the dataframe shrinks to
A B
0 0.000765 0.491911
1 0.659682 0.222616
2 0.737990 0.498251
3 0.399558 -0.155312
4 -0.903395 0.132627
If I do the typical
df3['A'].mean(axis=1)
I get a Series (with no column name) and I should then build a new dataframe with the means of each col group. Also the .groupby() method apparently doesn't allow you to group by column name, but rather you give the columns and it sorts the indexes. Is there a fancy way to do this?
Side question: why does
df = pd.DataFrame({'A': randn(5), 'B': randn(5), 'A': randn(5), 'B': randn(5)})
not generate a 4-column dataframe but merges same-name cols?
You can use the level keyword, treating your columns as the first level (level 0) of the column index, which has only one level in this case:
In [11]: df3
Out[11]:
A B A B
0 -0.367326 -0.422332 2.379907 1.502237
1 -1.060848 0.083976 0.619213 -0.303383
2 0.805418 -0.109793 0.257343 0.186462
3 2.419282 -0.452402 0.702167 0.216165
4 -0.464248 -0.980507 0.823302 0.900429
In [12]: df3.mean(axis=1, level=0)
Out[12]:
A B
0 1.006291 0.539952
1 -0.220818 -0.109704
2 0.531380 0.038334
3 1.560725 -0.118118
4 0.179527 -0.040039
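Hedged note: recent pandas releases removed the level argument from mean, so on newer versions an equivalent sketch groups the duplicate column names instead:
# Same result on newer pandas: transpose, group the repeated labels, transpose back.
df3.T.groupby(level=0).mean().T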
You've created df3 in a strange way; for this simple case, the following would work:
In [86]:
df = pd.DataFrame({'A': randn(5), 'B': randn(5)})
df2 = pd.DataFrame({'A': randn(5), 'B': randn(5)})
print(df)
print(df2)
A B
0 -0.732807 -0.571942
1 -1.546377 -1.586371
2 0.638258 0.569980
3 -1.017427 1.395300
4 0.666853 -0.258473
[5 rows x 2 columns]
A B
0 0.589185 1.029062
1 -1.447809 -0.616584
2 -0.506545 0.432412
3 -1.168424 0.312796
4 1.390517 1.074129
[5 rows x 2 columns]
In [87]:
(df+df2)/2
Out[87]:
A B
0 -0.071811 0.228560
1 -1.497093 -1.101477
2 0.065857 0.501196
3 -1.092925 0.854048
4 1.028685 0.407828
[5 rows x 2 columns]
To answer your side question: this has nothing to do with pandas and more to do with the dict constructor:
In [88]:
{'A': randn(5), 'B': randn(5), 'A': randn(5), 'B': randn(5)}
Out[88]:
{'B': array([-0.03087831, -0.24416885, -2.29924624, 0.68849978, 0.41938536]),
'A': array([ 2.18471335, 0.68051101, -0.35759988, 0.54023489, 0.49029071])}
dict keys must be unique, so my guess is that in the constructor it just reassigns the values to the pre-existing keys.
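A tiny sketch of that behaviour in plain Python:
>>> {'A': 1, 'B': 2, 'A': 3}
{'A': 3, 'B': 2}  # the later value for 'A' wins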
EDIT
If you insist on having duplicate columns, then you have to create a new dataframe from this, because if you were to update columns 'A' and 'B' in place, the mean would still be duplicated, as the columns are repeated:
In [92]:
df3 = pd.concat([df,df2], axis=1)
new_df = pd.DataFrame()
new_df['A'] = df3['A'].sum(axis=1) / df3['A'].shape[1]
new_df['B'] = df3['B'].sum(axis=1) / df3['B'].shape[1]
new_df
Out[92]:
A B
0 -0.071811 0.228560
1 -1.497093 -1.101477
2 0.065857 0.501196
3 -1.092925 0.854048
4 1.028685 0.407828
[5 rows x 2 columns]
So the above would work with df3, and in fact for an arbitrary number of repeated columns, which is why I am using shape; you could hard-code this to 2 if you knew the columns were only ever duplicated once.
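A hedged generalization of the same idea, averaging every repeated column name in one pass (it assumes, as with df3, that each name appears more than once, so df3[name] selects a DataFrame):
new_df = pd.DataFrame({name: df3[name].mean(axis=1) for name in df3.columns.unique()})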