I would like to drop the [] for a given df
df=pd.DataFrame(dict(a=[1,2,4,[],5]))
Such that the expected output will be
a
0 1
1 2
2 4
3 5
Edit:
or to make thing more interesting, what if we have two columns and some of the cell is with [] to be dropped.
df=pd.DataFrame(dict(a=[1,2,4,[],5],b=[2,[],1,[],6]))
One way is to get the string repr and filter:
df = df[df['a'].map(repr)!='[]']
Output:
a
0 1
1 2
2 4
4 5
For multiple columns, we could apply the above:
out = df[df.apply(lambda c: c.map(repr)).ne('[]').all(axis=1)]
Output:
a b
0 1 2
2 4 1
4 5 6
You can't use equality directly as pandas will try to align a Series and a list, but you can use isin:
df[~df['a'].isin([[]])]
output:
a
0 1
1 2
2 4
4 5
To act on all columns:
df[~df.isin([[]]).any(1)]
output:
a b
0 1 2
2 4 1
4 5 6
Related
I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If possible multiple minimal values per groups and want all min rows use boolean indexing with transform for minimal values per groups:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer worked great if there is / you want one min. In my case there could be multiple mins and I wanted all rows equal to min which .idxmin() doesn't give you. This worked
def filter_group(dfg, col):
return dfg[dfg[col] == dfg[col].min()]
df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort items by the minimum value you want
Drop the duplicates of the column you want to sort with
Resort the data because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record you can sort, then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
This seems to be easy but couldn't find a working solution for it:
I have a dataframe with 3 columns:
df = pd.DataFrame({'A': [0,0,2,2,2],
'B': [1,1,2,2,3],
'C': [1,1,2,3,4]})
A B C
0 0 1 1
1 0 1 1
2 2 2 2
3 2 2 3
4 2 3 4
I want to select rows based on values of column A, then groupby based on values of column B, and finally transform values of column C into sum. something along the line of this (obviously not working) code:
df[df['A'].isin(['2']), 'C'] = df[df['A'].isin(['2']), 'C'].groupby('B').transform('sum')
desired output for above example is:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
I also know how to split dataframe and do it. I am looking more for a solution that does it without the need of split+concat/merge. Thank you.
Is it just
s = df['A'].isin([2])
pd.concat((df[s].groupby(['A','B'])['C'].sum().reset_index(),
df[~s])
)
Output:
A B C
0 2 2 5
1 2 3 4
0 0 1 1
Update: Without splitting, you can assign a new column indicating special values of A:
(df.sort_values('A')
.assign(D=(~df['A'].isin([2])).cumsum())
.groupby(['D','A','B'])['C'].sum()
.reset_index('D',drop=True)
.reset_index()
)
Output:
A B C
0 0 1 1
1 0 1 1
2 2 2 5
3 2 3 4
How can I concatenate two Series and create one DataFrame ?
For example, I have series like:
a=pd.Series([1,2,3])
b=pd.Series([4,5,6])
And, I want to get a data frame like:
pd.DataFrame([[1,4], [2,5], [3,6]])
Shortest would be:
pd.DataFrame([a,b]).T
Or:
pd.DataFrame(zip(a,b))
0 1
0 1 4
1 2 5
2 3 6
Or use concat:
>>> pd.concat([a,b],axis=1)
0 1
0 1 4
1 2 5
2 3 6
>>>
Or join:
>>> a.to_frame().join(b.to_frame(name=1))
0 1
0 1 4
1 2 5
2 3 6
>>>
Another possible faster solution could be,
pd.DataFrame(np.vstack((a,b)).T)
I don't have much experience with working with pandas. I have a pandas dataframe as shown below.
df = pd.DataFrame({ 'A' : [1,2,1],
'start' : [1,3,4],
'stop' : [3,4,8]})
I would like to create a new dataframe that iterates through the rows and appends to resulting dataframe. For example, from row 1 of the input dataframe - Generate a sequence of numbers [1,2,3] and corresponding column to named 1
A seq
1 1
1 2
1 3
2 3
2 4
1 4
1 5
1 6
1 7
1 8
So far, I've managed to identify what function to use to iterate through the rows of the pandas dataframe.
Here's one way with apply:
(df.set_index('A')
.apply(lambda x: pd.Series(np.arange(x['start'], x['stop'] + 1)), axis=1)
.stack()
.to_frame('seq')
.reset_index(level=1, drop=True)
.astype('int')
)
Out:
seq
A
1 1
1 2
1 3
2 3
2 4
1 4
1 5
1 6
1 7
1 8
If you would want to use loops.
In [1164]: data = []
In [1165]: for _, x in df.iterrows():
...: data += [[x.A, y] for y in range(x.start, x.stop+1)]
...:
In [1166]: pd.DataFrame(data, columns=['A', 'seq'])
Out[1166]:
A seq
0 1 1
1 1 2
2 1 3
3 2 3
4 2 4
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
To add to the answers above, here's a method that defines a function for interpreting the dataframe input shown, into a form that the poster wants:
def gen_df_permutations(perm_def_df):
m_list = []
for i in perm_def_df.index:
row = perm_def_df.loc[i]
for n in range(row.start, row.stop+1):
r_list = [row.A,n]
m_list.append(r_list)
return m_list
Call it, referencing the specification dataframe:
gen_df_permutations(df)
Or optionally call it wrapped in a dataframe creation function to return a final dataframe output:
pd.DataFrame(gen_df_permutations(df),columns=['A','seq'])
A seq
0 1 1
1 1 2
2 1 3
3 2 3
4 2 4
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
N.B. the first column there is the dataframe index that can be removed/ignored as requirements allow.
I'm trying to replace a row in a dataframe with the row of another dataframe only if they share a common column.
Here is the first dataframe:
index no foo
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
and the second dataframe:
index no foo
0 2 aaa
1 3 bbb
2 22 3
3 33 4
4 44 5
5 55 6
I'd like my result to be
index no foo
0 0 1
1 1 2
2 2 aaa
3 3 bbb
4 4 5
5 5 6
The result of the inner merge between both dataframes returns the correct rows, but I'm having trouble inserting them at the correct index in the first dataframe
Any help would be greatly appreciated.
Thank you.
This should work as well
df1['foo'] = pd.merge(df1, df2, on='no', how='left').apply(lambda r: r['foo_y'] if r['foo_y'] == r['foo_y'] else r['foo_x'], axis=1)
You could use apply, there is probably a better way than this:
In [67]:
# define a function that takes a row and tries to find a match
def func(x):
# find if 'no' value matches, test the length of the series
if len(df1.loc[df1.no ==x.no, 'foo']) > 0:
return df1.loc[df1.no ==x.no, 'foo'].values[0] # return the first array value
else:
return x.foo # no match so return the existing value
# call apply and using a lamda apply row-wise (axis=1 means row-wise)
df.foo = df.apply(lambda row: func(row), axis=1)
df
Out[67]:
index no foo
0 0 0 1
1 1 1 2
2 2 2 aaa
3 3 3 bbb
4 4 4 5
5 5 5 6
[6 rows x 3 columns]