So I have a data frame of 30 columns and I want to filter it for values found in 10 of those columns and return all the rows that match. In the example below, I want to search for values equal to 1 in all df columns that end with "good..."
df[df[[i for i in df.columns if i.endswith('good')]].isin([1])]
df[df[[i for i in df.columns if i.endswith('good')]] == 1]
Both of these find the matching columns, but everything that does not match shows up as NaN. My question is: how can I query specific columns for specific values and get back only the rows that match, instead of NaN for everything else?
You can filter the columns first with str.endswith, select them with [], and compare with eq. Finally, add any to keep rows with at least one 1:
cols = df.columns[df.columns.str.endswith('good')]
df1 = df[df[cols].eq(1).any(axis=1)]
Sample:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[1,1,4,5,5,1],
                   'C good':[7,8,9,4,2,3],
                   'D good':[1,3,5,7,1,0],
                   'E good':[5,3,6,9,2,1],
                   'F':list('aaabbb')})
print(df)
A B C good D good E good F
0 a 1 7 1 5 a
1 b 1 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 1 3 0 1 b
cols = df.columns[df.columns.str.endswith('good')]
print(df[cols].eq(1))
C good D good E good
0 False True False
1 False False False
2 False False False
3 False False False
4 False True False
5 False False True
df1 = df[df[cols].eq(1).any(axis=1)]
print(df1)
A B C good D good E good F
0 a 1 7 1 5 a
4 e 5 2 1 2 b
5 f 1 3 0 1 b
Your solution was really close, only add any:
df1 = df[df[[i for i in df.columns if i.endswith('good')]].isin([1]).any(axis=1)]
print(df1)
A B C good D good E good F
0 a 1 7 1 5 a
4 e 5 2 1 2 b
5 f 1 3 0 1 b
EDIT:
If you need only the rows and columns that contain a 1 and want to remove all the others:
df1 = df.loc[:, df.columns.str.endswith('good')]
df2 = df1.loc[df1.eq(1).any(axis=1), df1.eq(1).any(axis=0)]
print(df2)
D good E good
0 1 5
4 1 2
5 0 1
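As a side note (an alternative sketch, not part of the answer above): DataFrame.filter can pick the matching columns in one step via a regex, which saves building the cols index by hand:

```python
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'B': [1, 1, 4, 5, 5, 1],
                   'C good': [7, 8, 9, 4, 2, 3],
                   'D good': [1, 3, 5, 7, 1, 0],
                   'E good': [5, 3, 6, 9, 2, 1],
                   'F': list('aaabbb')})

# select columns whose name ends with 'good', then keep rows with at least one 1
good = df.filter(regex='good$')
df1 = df[good.eq(1).any(axis=1)]
print(df1)
```

This keeps rows 0, 4 and 5, the same result as the str.endswith approach.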
Related
It's easier to explain by starting from a simple example df:
df1:
A B C D
0 a 6 1 b/5/4
1 a 6 1 a/1/6
2 c 9 3 9/c/3
There are four columns in df1 (A, B, C, D). The task is to count how many of column D's strings (split on '/') appear in the three columns A, B and C. Here is the expected output with more explanation:
df2(expect output):
A B C D E (New column)
0 a 6 1 b/5/4 0 <-- found 0 of column D's strings in columns A, B, C
1 a 6 1 a/1/6 3 <-- found a, 1 and 6, so it returns 3
2 c 9 3 9/c/3 3 <-- found all 3 strings
Does anyone have a good idea for this? Thanks!
You can use a list comprehension with set operations:
df['E'] = [len(set(l).intersection(s.split('/')))
           for l, s in zip(df.drop(columns='D').astype(str).to_numpy().tolist(),
                           df['D'])]
Output:
A B C D E
0 a 6 1 b/5/4 0
1 a 6 1 a/1/6 3
2 c 9 3 9/c/3 3
import pandas as pd

dt = {'A': ['a', 'a', 'c'], 'B': [6, 6, 9], 'C': [1, 1, 3],
      'D': ['b/5/4', 'a/1/6', 'c/9/3']}
E = []
nu_data = pd.DataFrame(data=dt)
for itxid, itx in enumerate(nu_data['D']):
    match = 0
    str_list = itx.split('/')
    for keyid, keys in enumerate(dt):
        if keyid < len(dt) - 1:  # skip the last key, 'D' itself
            for seg_str in str_list:
                if str(dt[keys][itxid]) == seg_str:
                    match += 1
    E.append(match)
nu_data['E'] = E
print(nu_data)
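The same count can also be sketched row-wise with apply, which is slower on large frames but may read more directly (assuming, as in the example, the columns to search are A, B and C):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'c'], 'B': [6, 6, 9],
                   'C': [1, 1, 3], 'D': ['b/5/4', 'a/1/6', '9/c/3']})

# for each row, intersect the stringified A/B/C values with the tokens in D
df['E'] = df.apply(
    lambda r: len({str(r['A']), str(r['B']), str(r['C'])} & set(r['D'].split('/'))),
    axis=1)
print(df)
```

Row 0 finds no overlap (0), rows 1 and 2 find all three tokens (3 each).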
I'm trying to sort the column headers, but only the last 3 columns. In the code below, sort_index works on the whole data frame but not when I select only the last 3 cols.
Note: I can't hard-code the sorting because I don't know the columns headers beforehand.
import pandas as pd
df = pd.DataFrame({
    'Z' : [1,1,1,1,1],
    'B' : ['A','A','A','A','A'],
    'C' : ['B','A','A','A','A'],
    'A' : [5,6,6,5,5],
})
# sorts all cols
df = df.sort_index(axis = 1)
# aim to sort by last 3 cols
#df.iloc[:,1:3] = df.iloc[:,1:3].sort_index(axis=1)
Intended output:
   Z  A  B  C
0  1  5  A  B
1  1  6  A  A
2  1  6  A  A
3  1  5  A  A
4  1  5  A  A
Try with reindex:
out = df.reindex(columns=df.columns[[0]].tolist()+sorted(df.columns[1:].tolist()))
Out[66]:
Z A B C
0 1 5 A B
1 1 6 A A
2 1 6 A A
3 1 5 A A
4 1 5 A A
Method two, insert:
newdf = df.iloc[:,1:].sort_index(axis=1)
newdf.insert(loc=0, column='Z', value=df.Z)
newdf
Out[74]:
Z A B C
0 1 5 A B
1 1 6 A A
2 1 6 A A
3 1 5 A A
4 1 5 A A
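A third variant, if you prefer building the result explicitly: sort the tail columns and glue the first column back on with concat (a sketch on the same df):

```python
import pandas as pd

df = pd.DataFrame({
    'Z': [1, 1, 1, 1, 1],
    'B': ['A', 'A', 'A', 'A', 'A'],
    'C': ['B', 'A', 'A', 'A', 'A'],
    'A': [5, 6, 6, 5, 5],
})

# keep the first column as-is, sort the remaining column labels
out = pd.concat([df.iloc[:, :1], df.iloc[:, 1:].sort_index(axis=1)], axis=1)
print(out.columns.tolist())
```

The resulting column order is Z, A, B, C, with Z left untouched.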
I'm new to Python and Pandas. I have the following DataFrame:
import pandas as pd
df = pd.DataFrame( {'a':['A','A','B','B','B','C','C','C'], 'b':[1,3,1,2,3,1,3,3]})
a b
0 A 1
1 A 3
2 B 1
3 B 2
4 B 3
5 C 1
6 C 3
7 C 3
I would like to create a new DataFrame containing only the groups from column a that have both the values 1 and 2 in column b, that is:
a b
0 B 1
1 B 2
2 B 3
I know we can create groups using df.groupby('a'), and the method df.all() seems to be related to this, but I can't figure it out by myself. It seems like it should be straightforward. Any help?
Use GroupBy.filter + Series.any:
new_df=df.groupby('a').filter(lambda x: x.b.eq(2).any() & x.b.eq(1).any())
print(new_df)
a b
2 B 1
3 B 2
4 B 3
We could also use:
new_df=df[df.groupby('a').transform(lambda x: x.eq(1).any() & x.eq(2).any()).b]
print(new_df)
a b
2 B 1
3 B 2
4 B 3
Another approach:
import numpy as np

s = (pd.DataFrame(df['b'].values == np.array([[1], [2]])).T
       .groupby(df['a'])
       .transform('any')
       .all(axis=1))
df[s]
Output:
a b
2 B 1
3 B 2
4 B 3
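If it helps readability, the same filter condition can be sketched with a set test - {1, 2}.issubset checks that a group's b values contain both required numbers:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'b': [1, 3, 1, 2, 3, 1, 3, 3]})

# keep only groups whose b values include both 1 and 2
new_df = df.groupby('a').filter(lambda g: {1, 2}.issubset(g['b']))
print(new_df)
```

Only group B survives, since A and C are missing the value 2.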
Right now I have two dataframes (data1 and data2)
I would like to print a column of string values in the dataframe called data1, based on whether the ID exists in both data2 and data1.
What I am doing now gives me a boolean list (True or False if the ID exists in the both dataframes but not the column of strings).
print(data2['id'].isin(data1.id).to_string())
yields
0 True
1 True
2 True
3 True
4 True
5 True
Any ideas would be appreciated.
Here is a sample of data1
'user_id', 'id', 'rating', 'unix_timestamp'
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
And data2 contains something like this
'id', 'title', 'release_date',
'video_release_date', 'imdb_url'
37|Nadja (1994)|01-Jan-1994||http://us.imdb.com/M/title-exact?Nadja%20(1994)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
38|Net, The (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Net,%20The%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|1|0|0
39|Strange Days (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Strange%20Days%20(1995)|0|1|0|0|0|0|1|0|0|0|0|0|0|0|0|1|0|0|0
If all values of ids are unique:
I think you need merge with an inner join. From data2 select only the id column; the on parameter should be omitted, because the join then uses all common columns - here only id:
df = pd.merge(data1, data2[['id']])
Sample:
data1 = pd.DataFrame({'id':list('abcdef'),
                      'B':[4,5,4,5,5,4],
                      'C':[7,8,9,4,2,3]})
print(data1)
B C id
0 4 7 a
1 5 8 b
2 4 9 c
3 5 4 d
4 5 2 e
5 4 3 f
data2 = pd.DataFrame({'id':list('frcdeg'),
                      'D':[1,3,5,7,1,0],
                      'E':[5,3,6,9,2,4]})
print(data2)
D E id
0 1 5 f
1 3 3 r
2 5 6 c
3 7 9 d
4 1 2 e
5 0 4 g
df = pd.merge(data1, data2[['id']])
print(df)
B C id
0 4 9 c
1 5 4 d
2 5 2 e
3 4 3 f
If id values are duplicated in one or the other DataFrame, use this answer instead; here are some similar solutions:
df = data1[data1['id'].isin(set(data1['id']) & set(data2['id']))]
ids = set(data1['id']) & set(data2['id'])
df = data1.query('id in @ids')
df = data1[np.in1d(data1['id'], np.intersect1d(data1['id'], data2['id']))]
Sample:
data1 = pd.DataFrame({'id':list('abcdef'),
                      'B':[4,5,4,5,5,4],
                      'C':[7,8,9,4,2,3]})
print(data1)
B C id
0 4 7 a
1 5 8 b
2 4 9 c
3 5 4 d
4 5 2 e
5 4 3 f
data2 = pd.DataFrame({'id':list('fecdef'),
                      'D':[1,3,5,7,1,0],
                      'E':[5,3,6,9,2,4]})
print(data2)
D E id
0 1 5 f
1 3 3 e
2 5 6 c
3 7 9 d
4 1 2 e
5 0 4 f
df = data1[data1['id'].isin(set(data1['id']) & set(data2['id']))]
print(df)
B C id
2 4 9 c
3 5 4 d
4 5 2 e
5 4 3 f
EDIT:
You can use:
df = data2.loc[data2['id'].isin(set(data1['id']) & set(data2['id'])), ['title']]
ids = set(data1['id']) & set(data2['id'])
df = data2.query('id in @ids')[['title']]
df = data2.loc[np.in1d(data2['id'], np.intersect1d(data1['id'], data2['id'])), ['title']]
You can compute the set intersection of the two columns -
ids = set(data1['id']).intersection(data2['id'])
Or,
ids = np.intersect1d(data1['id'], data2['id'])
Next, query/filter out relevant rows.
data1.loc[data1['id'].isin(ids), 'id']
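Putting this together on a tiny made-up sample (the ids and titles below are illustrative stand-ins for the question's data, not the real files):

```python
import pandas as pd

# stand-ins for the question's data1 (ratings) and data2 (movies)
data1 = pd.DataFrame({'user_id': [196, 186, 22],
                      'id': [242, 302, 377],
                      'rating': [3, 3, 1]})
data2 = pd.DataFrame({'id': [242, 377, 999],
                      'title': ['Movie A', 'Movie B', 'Movie C']})

ids = set(data1['id']).intersection(data2['id'])
# titles of data2 rows whose id also appears in data1
titles = data2.loc[data2['id'].isin(ids), 'title']
print(titles.tolist())
```

Only the titles whose id occurs in both frames are printed.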
Currently, I'm using:
csvdata.update(data, overwrite=True)
How can I make it update and overwrite a specific column but not another? Small but simple question - is there a simple answer?
Rather than updating with the entire DataFrame, update with just the sub-DataFrame of the columns you are interested in. For example:
In [11]: df1
Out[11]:
A B
0 1 99
1 3 99
2 5 6
In [12]: df2
Out[12]:
A B
0 a 2
1 b 4
2 c 6
In [13]: df1.update(df2[['B']]) # subset of cols = ['B']
In [14]: df1
Out[14]:
A B
0 1 2
1 3 4
2 5 6
If you want to do it for a single column:
import pandas
import numpy
csvdata = pandas.DataFrame({"a":range(12), "b":range(12)})
other = pandas.Series(list("abcdefghijk")+[numpy.nan])
csvdata["a"].update(other)
print(csvdata)
a b
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
6 g 6
7 h 7
8 i 8
9 j 9
10 k 10
11 11 11
or, as long as the column names match, you can do this:
other = pandas.DataFrame({"a":list("abcdefghijk")+[numpy.nan], "b":list("abcdefghijk")+[numpy.nan]})
csvdata.update(other["a"])
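One detail of update worth knowing: when other is a Series whose name matches a column, only that column is touched, and NaN positions in other leave the original values alone. A minimal sketch with made-up numbers:

```python
import pandas as pd
import numpy as np

csvdata = pd.DataFrame({'a': range(4), 'b': range(4)})
other = pd.Series([10, np.nan, 30, np.nan], name='a')

# only column 'a' is updated; rows where `other` is NaN keep their old value
csvdata.update(other)
print(csvdata)
```

Column a becomes 10, 1, 30, 3 while column b is left untouched.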