I want to select rows from my dataframe df where any of the many columns contains a value that's in a list my_list. There are dozens of columns, and there could be more in the future, so I don't want to iterate over each column or enumerate them all in a list.
I don't want this:
# for loop / iteration
for col in df.columns:
    df.loc[df[col].isin(my_list), "indicator"] = 1
Nor this:
# really long indexing
df = df[df.col1.isin(my_list) | df.col2.isin(my_list) | df.col3.isin(my_list) | ... | df.col_N.isin(my_list)]  # ad nauseam
Nor do I want to reshape the dataframe from a wide to a long format.
I'm thinking (hoping) there's a way to do this in one line, applying isin() to many columns at once.
Thanks!
Solution
I ended up using
df[df.isin(my_list).any(axis=1)]
You can use DataFrame.isin(), which is a DataFrame method rather than a Series/string method. Combine it with any(axis=1) to keep rows where any column matches; without the any(axis=1), df[df.isin(my_list)] returns a same-shaped frame with non-matches masked as NaN, not a row subset:
new_df = df[df.isin(my_list).any(axis=1)]
Alternatively you may try:
df[df.apply(lambda x: x.isin(mylist)).any(axis=1)]
Or:
df[df[df.columns].isin(mylist).any(axis=1)]
You don't even need to create a list variable unless it's strictly necessary; you can pass the values in directly, as follows.
df[df[df.columns].isin([3, 12]).any(axis=1)]
Working through your example:
Example DataFrame:
>>> df
   col_1  col_2  col_3
0      1      1     10
1      2      4     12
2      3      7     18
List construct:
>>> mylist
[3, 12]
Solutions:
>>> df[df.col_1.isin(mylist) | df.col_2.isin(mylist) | df.col_3.isin(mylist)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18
>>> df[df.isin(mylist).any(axis=1)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18
Or:
>>> df[df[df.columns].isin(mylist).any(axis=1)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18
Or:
>>> df[df.apply(lambda x: x.isin(mylist)).any(axis=1)]
   col_1  col_2  col_3
1      2      4     12
2      3      7     18
Related
I have a pandas DataFrame:
  Name  Col_1  Col_2  Col_3
0    A      3      5      5
1    B      1      6      7
2    C      3      7      4
3    D      5      8      3
I need to create a Series object with the values of (Col_1-Col_2)/Col_3 using groupby, so basically this:
Name
A (3-5)/5
B (1-6)/7
C (3-7)/4
D (5-8)/3
Repeated names are a possibility, hence the groupby usage. For example:
  Name  Col_1  Col_2  Col_3
0    A      3      5      5
1    B      1      6      7
2    B      3      6      7
The expected result:
Name
A (3-5)/5
B ((1+3)-6)/7
I created a groupby object:
df.groupby('Name')
but it seems like no groupby method fits the bill for what I'm trying to do. How can I tackle this matter?
Let's try:
g = df.groupby('Name')
out = (g['Col_1'].sum()-g['Col_2'].first()).div(g['Col_3'].first())
Or:
(df.groupby('Name')
   .apply(lambda g: (g['Col_1'].sum() - g['Col_2'].iloc[0]) / g['Col_3'].iloc[0])
)
Output:
Name
A -0.400000
B -0.285714
dtype: float64
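The first variant can be checked end to end against the repeated-name example. A self-contained sketch:

```python
import pandas as pd

# The second example above, where Name 'B' repeats
df = pd.DataFrame({'Name': ['A', 'B', 'B'],
                   'Col_1': [3, 1, 3],
                   'Col_2': [5, 6, 6],
                   'Col_3': [5, 7, 7]})

g = df.groupby('Name')
# Sum Col_1 within each group, then subtract and divide by the
# group's first Col_2 and Col_3 values
out = (g['Col_1'].sum() - g['Col_2'].first()).div(g['Col_3'].first())
```

This matches the expected result: A gives (3-5)/5 and B gives ((1+3)-6)/7.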
I have two pandas DataFrames of unequal sizes. For example:
Df1
id value
a 2
b 3
c 22
d 5
Df2
id value
c 22
a 2
Now I want to extract from DF1 those rows which have the same id as in DF2. My first approach is to run 2 for loops, with something like:
x = []
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])
Now this works, but with 400,000 lines in one file and 5,000 in the other, I need an efficient, Pythonic pandas way.
import pandas as pd

data1 = {'id': ['a', 'b', 'c', 'd'],
         'value': [2, 3, 22, 5]}
data2 = {'id': ['c', 'a'],
         'value': [22, 2]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
finaldf = pd.concat([df1, df2], ignore_index=True)
Output after concat:
  id  value
0  a      2
1  b      3
2  c     22
3  d      5
4  c     22
5  a      2
Final output:
finaldf.drop_duplicates()
  id  value
0  a      2
1  b      3
2  c     22
3  d      5
You can concat the dataframes, mark the rows whose id appears in both (duplicated with keep=False), then drop_duplicates to keep just the first occurrence:
m = pd.concat((df1, df2))
m[m.duplicated('id', keep=False)].drop_duplicates()
  id  value
0  a      2
2  c     22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
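Two more standard pandas idioms for this kind of semi-join, shown as a sketch (these are common alternatives, not taken from the answers above):

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b', 'c', 'd'], 'value': [2, 3, 22, 5]})
df2 = pd.DataFrame({'id': ['c', 'a'], 'value': [22, 2]})

# 1) Boolean mask on the 'id' column: vectorized, no Python loops
out_isin = df1[df1['id'].isin(df2['id'])]

# 2) Inner merge on 'id' only, keeping df1's columns
out_merge = df1.merge(df2[['id']].drop_duplicates(), on='id')
```

Both scale far better than the nested loops; isin keeps df1's original index, while merge resets it.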
I have a Dataframe as below:
   col_1  col_2  col_3
0      1      3      2
1      3      5      3
2      3      4      3
I am trying to change the order in which the columns are displayed, using the positional index of the header. I have the column's position stored in a variable called val, with val = 1.
I have tried the below:
cols = list(df.columns.values)
cols.pop(cols.index(val))
df = df[[val] + cols]
Expected output:
   col_2  col_1  col_3
0      3      1      2
1      5      3      3
2      4      3      3
However, I see the sequence has not changed.
val is already a positional index, so pass it straight to list.pop(); cols.index(val) searches for the value 1 among the column names (which are strings) instead of treating it as a position:
val = 1
cols = list(df.columns.values)
extracted = cols.pop(val)
cols.insert(0, extracted)
df = df[cols]
Output:
   col_2  col_1  col_3
0      3      1      2
1      5      3      3
2      4      3      3
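To generalize beyond moving a column to the front, a small helper (the name move_col_to_front is hypothetical, not from the answer) wrapping the same pop/insert trick:

```python
import pandas as pd

def move_col_to_front(df, val):
    """Return df with the column at positional index `val` moved to the front."""
    cols = list(df.columns)
    cols.insert(0, cols.pop(val))  # pop by position, not by name
    return df[cols]

df = pd.DataFrame({'col_1': [1, 3, 3],
                   'col_2': [3, 5, 4],
                   'col_3': [2, 3, 3]})
reordered = move_col_to_front(df, 1)
```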
I am trying to select only one row from a dask.dataframe by using command x.loc[0].compute(). It returns 4 rows with all having index=0. I tried reset_index, but there will still be 4 rows having index=0 after resetting. (I think I did reset correctly because I did reset_index(drop=False) and I could see the original index in the new column).
I read the dask.dataframe documentation, and it says there might be more than one row with index=0 due to how dask structures the chunked data.
So, if I really want only one row by using index=0 for subsetting, how can I do this?
Edit
Probably your problem comes from reset_index. This issue is explained at the end of the answer; the earlier part covers how to solve it.
For example, there is the following dask DataFrame:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6, 7], 'col_2': list('abcdefg')},
                  index=pd.Index([0, 0, 1, 2, 3, 4, 5]))
df = dd.from_pandas(df, npartitions=2)
df.compute()
Out[1]:
col_1 col_2
0 1 a
0 2 b
1 3 c
2 4 d
3 5 e
4 6 f
5 7 g
it has a numerical index with repeated 0 values. As loc is a
Purely label-location based indexer for selection by label
it selects both 0-labeled rows if you do:
df.loc[0].compute()
Out[]:
col_1 col_2
0 1 a
0 2 b
you'll get all the rows labeled 0 (or whatever label you specified).
In pandas there is pd.DataFrame.iloc, which lets us select a row by its numerical position. Unfortunately, in dask you can't do so, because iloc is
Purely integer-location based indexing for selection by position.
Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.
To work around this problem, you can do some indexing tricks:
df.compute()
Out[2]:
index col_1 col_2
x
0 0 1 a
1 0 2 b
2 1 3 c
3 2 4 d
4 3 5 e
5 4 6 f
6 5 7 g
now there's a new index, ranging from 0 to the length of the DataFrame minus 1.
It's possible to slice it with loc and do the following (I assume that selecting label 0 via loc means "select the first row"):
df.loc[0].compute()
Out[3]:
index col_1 col_2
x
0 0 1 a
About the duplicated 0 index label
If you need the original index, it's still here, and it can be accessed through:
df.loc[:, 'index'].compute()
Out[4]:
x
0 0
1 0
2 1
3 2
4 3
5 4
6 5
I guess you got such duplication from reset_index() or similar, because it generates a new 0-based index for each partition; for example, for this table of 2 partitions:
df.reset_index().compute()
Out[5]:
index col_1 col_2
0 0 1 a
1 0 2 b
2 1 3 c
3 2 4 d
0 3 5 e
1 4 6 f
2 5 7 g
I am not really sure how multi-indexing works, so I may simply be trying to do the wrong thing here. If I have a dataframe with
     Value
A B
1 1   5.67
  2   6.87
  3   7.23
2 1   8.67
  2   9.87
  3  10.23
If I want to access the elements where B=2, how would I do that? df.ix[2] gives me A=2. To get a particular value it seems I need df.ix[(1,2)], but then what is the purpose of the B index if you can't access it directly?
You can use xs:
In [11]: df.xs(2, level='B')
Out[11]:
Value
A
1 6.87
2 9.87
alternatively:
In [12]: df.xs(1, level=1)
Out[12]:
Value
A
1 5.67
2 8.67
Just as an alternative, you could use df.loc:
>>> df.loc[(slice(None),2),:]
Value
A B
1 2 6.87
2 2 9.87
The tuple accesses the indexes in order: slice(None) grabs all values from index 'A', and the second position limits based on the second-level index, where 'B' == 2 in this example. The : specifies that you want all columns, but you could subset the columns there as well.
If you only want to return a cross-section, use xs (as mentioned by @Andy Hayden).
However, if you want to overwrite some values in the original dataframe, use pd.IndexSlice (with df.loc) instead. Given a dataframe df:
In [73]: df
Out[73]:
col_1 col_2
index_1 index_2
1 1 5 6
1 5 6
2 5 6
2 2 5 6
if you want to overwrite with 0 all elements in col_1 where index_2 == 2, do:
In [75]: df.loc[pd.IndexSlice[:, 2], 'col_1'] = 0
In [76]: df
Out[76]:
col_1 col_2
index_1 index_2
1 1 5 6
1 5 6
2 0 6
2 2 0 6
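A runnable recap of the IndexSlice assignment, rebuilding the small frame shown above:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([(1, 1), (1, 1), (1, 2), (2, 2)],
                                names=['index_1', 'index_2'])
df = pd.DataFrame({'col_1': [5, 5, 5, 5], 'col_2': [6, 6, 6, 6]}, index=idx)

# Overwrite col_1 with 0 wherever index_2 == 2
df.loc[pd.IndexSlice[:, 2], 'col_1'] = 0
```

Note that slicing a MultiIndex like this relies on the index being sorted, which it is in this example.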