Using the condition, select the desired columns in pandas DataFrame - python

I have a DataFrame I created using pandas and want to create a new table based on the original, but filtered by certain conditions.
df = pd.DataFrame(
    [['Y', 'Cat', 'no', 'yes', 6],
     ['Y', 4, 7, 9, 'dog'],
     ['N', 6, 4, 6, 'pig'],
     ['N', 3, 6, 'beer', 8]],
    columns=('Data', 'a', 'b', 'c', 'd')
)
My condition, which doesn't work:
if (df['Data']=='Y') & (df['Data']=='N'):
    df3 = df.loc[:, ['Data', 'a', 'b', 'c']]
else:
    df3 = df.loc[:, ['Data', 'a', 'b']]
I want the new table to contain data matching the following criteria:
If df.Data has both the values 'Y' and 'N', the new table gets columns ('Data', 'a', 'b')
If not, the new table gets columns ('Data', 'a', 'b', 'c')
Data a b
0 Y Cat no
1 Y 4 7
2 N 6 4
3 N 3 6
Data a b c
0 Y Cat no yes
1 Y 4 7 9
2 Y 6 4 6
3 Y 3 6 beer

You are comparing a series with a single character rather than reducing the comparison to one Boolean result. You can, instead, use pd.Series.any, which returns True if any value in the series is True:
if (df['Data']=='Y').any() & (df['Data']=='N').any():
    # do something
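Put together with the sample frame from the question, the fixed condition might look like the sketch below (it keeps the question's original branch bodies):

```python
import pandas as pd

df = pd.DataFrame(
    [['Y', 'Cat', 'no', 'yes', 6],
     ['Y', 4, 7, 9, 'dog'],
     ['N', 6, 4, 6, 'pig'],
     ['N', 3, 6, 'beer', 8]],
    columns=('Data', 'a', 'b', 'c', 'd')
)

# .any() collapses each elementwise comparison to a single Boolean,
# so the if now tests "does 'Y' occur AND does 'N' occur".
if (df['Data'] == 'Y').any() and (df['Data'] == 'N').any():
    df3 = df.loc[:, ['Data', 'a', 'b', 'c']]
else:
    df3 = df.loc[:, ['Data', 'a', 'b']]
```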
An alternative method is to use pd.DataFrame.drop with a conditional expression (note that axis should be passed as a keyword in recent pandas versions):
df = df.drop(['d'] if set(df['Data']) == {'Y', 'N'} else ['c', 'd'], axis=1)
print(df)
Data a b c
0 Y Cat no yes
1 Y 4 7 9
2 N 6 4 6
3 N 3 6 beer

if set(df['Data'].unique()) == {'Y', 'N'}:
    df3 = df[['Data', 'a', 'b', 'c']]
else:
    df3 = df[['Data', 'a', 'b']]


Sort or groupby dataframe in python using given string

I have the following dataframe:
Id Direction Load Unit
1 CN05059815 LoadFWD 0,0 NaN
2 CN05059815 LoadBWD 0,0 NaN
4 ....
....
and this list:
list =['CN05059830','CN05059946','CN05060010','CN05060064' ...]
I would like to sort or group the data by the elements of the list.
For example,
the new data should follow exactly the same order as the list. The first rows would start with CN05059815, which doesn't belong to the list; then come CN05059830, CN05059946, ..., which both belong to the list, followed by the remaining data.
One way is to use Categorical Data. Here's a minimal example:
# sample dataframe
df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
# required ordering
lst = ['D', 'E', 'A', 'B']
# convert to categorical
df['col'] = df['col'].astype('category')
# set order, adding values not in lst to the front
order = list(set(df['col']) - set(lst)) + lst
# attach ordering information to categorical series
df['col'] = df['col'].cat.reorder_categories(order)
# apply ordering
df = df.sort_values('col')
print(df)
col
2 C
5 F
3 D
4 E
0 A
1 B
Consider the approach and example below:
df = pd.DataFrame({
    'col': ['a', 'b', 'c', 'd', 'e']
})
list_ = ['d', 'b', 'a']
print(df)
Output:
col
0 a
1 b
2 c
3 d
4 e
Then, to sort df by the list's ordering (values not in the list sort first):
df.reindex(df['col'].apply(lambda x: list_.index(x) if x in list_ else -1).sort_values().index)
Output:
col
2 c
4 e
3 d
1 b
0 a
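On pandas 1.1+, the same ordering can also be expressed without categoricals via the key argument of sort_values; a sketch using the first answer's sample data (kind='stable' keeps the original order among values outside the list):

```python
import pandas as pd

df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
lst = ['D', 'E', 'A', 'B']

# Rank each value by its position in lst; values not in lst get -1,
# so they sort to the front, mirroring the categorical approach.
df_sorted = df.sort_values(
    'col',
    key=lambda s: s.map(lambda x: lst.index(x) if x in lst else -1),
    kind='stable',
)
```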

Get a subset of data from one row of Dataframe

Let's say I have a dataframe df with columns 'A', 'B', 'C'
Now I just want to extract row 2 of df and only columns 'B' and 'C'. What is the most efficient way to do that?
Can you please tell me why df.ix[2, ['B', 'C']] didn't work?
Thank you!
Consider the dataframe df
df = pd.DataFrame(np.arange(9).reshape(3, 3), list('xyz'), list('ABC'))
df
A B C
x 0 1 2
y 3 4 5
z 6 7 8
If you want to maintain a dataframe
df.loc[df.index[[1]], ['B', 'C']]
B C
y 4 5
If you want a series
df.loc[df.index[1], ['B', 'C']]
B 4
C 5
Name: y, dtype: int64
row_2 = df[['B', 'C']].iloc[1]
OR
# Build one (B, C) tuple per row, then take the tuple at position 1
row_2 = list(df[['B', 'C']].apply(tuple, axis=1))[1]
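As for why df.ix[2, ['B', 'C']] didn't work: .ix was deprecated in pandas 0.20 and removed in 1.0 because it guessed between label-based and position-based indexing. The explicit replacements, sketched on the same sample frame as the first answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), list('xyz'), list('ABC'))

# Label-based: row label 'y', column labels 'B' and 'C'
by_label = df.loc['y', ['B', 'C']]

# Position-based: second row, second and third columns
by_position = df.iloc[1, [1, 2]]
```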

Can I replace some values at once with dataframe?

Currently I do it like this:
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
df['column'] = df['column'].str.replace('A', 'cat').replace('B', 'rabit').replace('C', 'octpath').replace('D', 'spider').replace('E', 'mammoth').replace('F', 'snake').replace('G', 'starfish')
But I think this is long and unreadable.
Do you know a simple solution?
Here is another approach using pandas.Series.replace:
d = {'A':'cat','B':'rabit', 'C':'octpath','D':'spider','E':'mammoth','F':'snake','G':'starfish'}
df['column'] = df['column'].replace(d)
Output:
column
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 snake
6 starfish
7 -
You can define a dict of your replacement values and call map on the column, passing in your dict (na_action='ignore' simply propagates NaN inputs without mapping them). Values not present in the dict are mapped to NaN; since you want to keep those existing values, call fillna and pass in your original column:
In[60]:
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
d = {'A':'cat','B':'rabit', 'C':'octpath','D':'spider','E':'mammoth','F':'snake','G':'starfish'}
df['column'] = df['column'].map(d, na_action='ignore').fillna(df['column'])
df
Out[60]:
column
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 snake
6 starfish
7 -
df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
mapper={'A':'cat','B':'rabit','C':'octpath','D':'spider','E':'mammoth'}
df['column']=df.column.apply(lambda x:mapper.get(x))
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 None
6 None
7 None
In case you want to set a default value:
df['column']=df.column.apply(lambda x:mapper.get(x) if mapper.get(x) is not None else "pandas")
df.column
0 cat
1 rabit
2 octpath
3 spider
4 mammoth
5 pandas
6 pandas
7 pandas
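The same default can be expressed more compactly, since dict.get already accepts a fallback as its second argument, making the if/else in the lambda unnecessary:

```python
import pandas as pd

df = pd.DataFrame({'column': ['A', 'B', 'C', 'D', 'E', 'F', 'G', '-']})
mapper = {'A': 'cat', 'B': 'rabit', 'C': 'octpath',
          'D': 'spider', 'E': 'mammoth'}

# dict.get(key, default) returns the default for unmapped keys
df['column'] = df['column'].apply(lambda x: mapper.get(x, 'pandas'))
```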
Greetings from Shibuya!

Python Pandas lookup and replace df1 value from df2

I have two dataframes, df and df2.
df column FOUR matches df2 column LOOKUP COL.
I need to match df column FOUR with df2 column LOOKUP COL and replace df column FOUR with the corresponding values from df2 column RETURN THIS.
The resulting dataframe could overwrite df, but I have it listed as result below.
NOTE: THE INDEX DOES NOT MATCH ON EACH OF THE DATAFRAMES
df = pd.DataFrame([['a', 'b', 'c', 'd'],
                   ['e', 'f', 'g', 'h'],
                   ['j', 'k', 'l', 'm'],
                   ['x', 'y', 'z', 'w']])
df.columns = ['ONE', 'TWO', 'THREE', 'FOUR']
ONE TWO THREE FOUR
0 a b c d
1 e f g h
2 j k l m
3 x y z w
df2 = pd.DataFrame([['a', 'b', 'd', '1'],
                    ['e', 'f', 'h', '2'],
                    ['j', 'k', 'm', '3'],
                    ['x', 'y', 'w', '4']])
df2.columns = ['X1', 'Y2', 'LOOKUP COL', 'RETURN THIS']
X1 Y2 LOOKUP COL RETURN THIS
0 a b d 1
1 e f h 2
2 j k m 3
3 x y w 4
RESULTING DF
ONE TWO THREE FOUR
0 a b c 1
1 e f g 2
2 j k l 3
3 x y z 4
You can use Series.map. You'll need to create a dictionary or a Series to use in map. A Series makes more sense here but the index should be LOOKUP COL:
df['FOUR'] = df['FOUR'].map(df2.set_index('LOOKUP COL')['RETURN THIS'])
df
Out:
ONE TWO THREE FOUR
0 a b c 1
1 e f g 2
2 j k l 3
3 x y z 4
df['FOUR'] = [df2[df2['LOOKUP COL'] == i]['RETURN THIS'].iloc[0] for i in df['FOUR']]
Something like that should be sufficient to do the trick, though there's probably a more pandas-native way to do it.
Basically, it's a list comprehension: we generate a new array of df2['RETURN THIS'] values by looking each element of df['FOUR'] up in the lookup column. (Note the column name is FOUR, not Four, and .iloc[0] extracts the matched scalar rather than a one-element Series.)
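One pandas-native alternative to looping is a left merge on the lookup column; a sketch on the question's data (merge preserves the left frame's row order, so the looked-up values can be written back positionally):

```python
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'],
                   ['e', 'f', 'g', 'h'],
                   ['j', 'k', 'l', 'm'],
                   ['x', 'y', 'z', 'w']],
                  columns=['ONE', 'TWO', 'THREE', 'FOUR'])
df2 = pd.DataFrame([['a', 'b', 'd', '1'],
                    ['e', 'f', 'h', '2'],
                    ['j', 'k', 'm', '3'],
                    ['x', 'y', 'w', '4']],
                   columns=['X1', 'Y2', 'LOOKUP COL', 'RETURN THIS'])

# Left-join df against the two lookup columns of df2 ...
merged = df.merge(df2[['LOOKUP COL', 'RETURN THIS']],
                  left_on='FOUR', right_on='LOOKUP COL', how='left')
# ... then swap the looked-up values into FOUR, ignoring the indexes.
df['FOUR'] = merged['RETURN THIS'].values
```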

Pandas df manipulation: new column with list of values if other column rows repeated [duplicate]

This question already has answers here:
How to group dataframe rows into list in pandas groupby
(17 answers)
Closed 6 years ago.
I have a df like this:
ID Cluster Product
1 4 'b'
1 4 'f'
1 4 'w'
2 7 'u'
2 7 'b'
3 5 'h'
3 5 'f'
3 5 'm'
3 5 'd'
4 7 's'
4 7 'b'
4 7 'g'
Where ID is the primary and unique key of another df that is the source for this one. Cluster is not a key; different IDs often share the same Cluster value, but it's information I have to carry along.
What I want to obtain is this dataframe:
ID Cluster Product_List_by_ID
1 4 ['b','f','w']
2 7 ['u','b']
3 5 ['h','f','m','d']
4 7 ['s','b','g']
If this is not possible, also a dictionary like this could be fine:
d = {ID:[1,2,3,4], Cluster:[4,7,5,7],
Product_List_by_ID:[['b','f','w'],['u','b'],['h','f','m','d'],['s','b','g']]}
I have tried many ways unsuccessfully; it seems it is not straightforward to insert lists as pandas dataframe values.
Anyway, I think there should be some trick to reach the goal. Sorry if I'm missing something obvious, I am new to coding.
Any suggestions? Thanks
Use groupby:
df.groupby(['ID', 'Cluster']).Product.apply(list)
ID Cluster
1 4 ['b', 'f', 'w']
2 7 ['u', 'b']
3 5 ['h', 'f', 'm', 'd']
4 7 ['s', 'b', 'g']
Name: Product, dtype: object
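To get back a frame with the exact column name from the question, the grouped Series can be reset with a name; a sketch on a trimmed-down sample:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Cluster': [4, 4, 4, 7, 7],
                   'Product': ['b', 'f', 'w', 'u', 'b']})

# reset_index(name=...) turns the grouped Series of lists back into
# a DataFrame whose value column carries the requested name.
out = (df.groupby(['ID', 'Cluster'])['Product']
         .apply(list)
         .reset_index(name='Product_List_by_ID'))
```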
Another solution: first remove the ' characters from column Product, if necessary, with str.strip:
df.Product = df.Product.str.strip("'")
Then groupby with apply; finally, if you need a dictionary, use to_dict with orient='list':
print (df.groupby(['ID', 'Cluster'])
.Product.apply(lambda x: x.tolist())
.reset_index()
.to_dict(orient='list'))
{'Cluster': [4, 7, 5, 7],
'ID': [1, 2, 3, 4],
'Product': [['b', 'f', 'w'], ['u', 'b'],
['h', 'f', 'm', 'd'], ['s', 'b', 'g']]}
