find column value in dataframe

find column value in dataframe - python

Have 2 dataframes.
First has 1 column.
test1: 1,2,3,4,5
Second has 2 columns.
test2: 0 1 1 1 1. test3: 2 2 3 3 4
I need to create new column in First dataframe that with search row value exist in whole dataframe2 (simple ctrl+F).
As result I need to get
test1: 1,2,3,4,5
check: yes,yes,yes,yes,no
UPD
Below code I found, but it shows good result only for first row, don't know if that make sense
first['check'] = second.eq(first['test1'],0).any(1).astype(int)

You can check with isin with values flatten
test1['col2']=test1['col1'].isin(test2.values.ravel())

In [1]: import pandas as pd
...: df1 = pd.DataFrame({'test1': [1,2,3,4,5]})
...: df2 = pd.DataFrame({'test2': [0,1,1,1,1], 'test3': [2,2,3,3,4]})
In [2]: df1
Out[2]:
test1
0 1
1 2
2 3
3 4
4 5
In [3]: df2
Out[3]:
test2 test3
0 0 2
1 1 2
2 1 3
3 1 3
4 1 4
In [4]: df1['check'] = df1['test1'].isin(df2['test2']) \
...: | df1['test1'].isin(df2['test3'])
...: df1
Out[4]:
test1 check
0 1 True
1 2 True
2 3 True
3 4 True
4 5 False

Related

Set value of first item in slice in python pandas

So I would like make a slice of a dataframe and then set the value of the first item in that slice without copying the dataframe. For example:
df = pandas.DataFrame(numpy.random.rand(3,1))
df[df[0]>0][0] = 0
The slice here is irrelevant and just for the example and will return the whole data frame again. Point being, by doing it like it is in the example you get a setting with copy warning (understandably). I have also tried slicing first and then using ILOC/IX/LOC and using ILOC twice, i.e. something like:
df.iloc[df[0]>0,:][0] = 0
df[df[0]>0,:].iloc[0] = 0
And neither of these work. Again- I don't want to make a copy of the dataframe even if it id just the sliced version.
EDIT:
It seems there are two ways, using a mask or IdxMax. The IdxMax method seems to work if your index is unique, and the mask method if not. In my case, the index is not unique which I forgot to mention in the initial post.

I think you can use idxmax for get index of first True value and then set by loc:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print (df)
0
0 1
1 3
2 0
3 0
4 3
print ((df[0] == 0).idxmax())
2
df.loc[(df[0] == 0).idxmax(), 0] = 100
print (df)
0
0 1
1 3
2 100
3 0
4 3
df.loc[(df[0] == 3).idxmax(), 0] = 200
print (df)
0
0 1
1 200
2 0
3 0
4 3
EDIT:
Solution with not unique index:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
0
1 1
2 3
2 0
3 0
4 3
df = df.reset_index()
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.set_index('index')
df.index.name = None
print (df)
0
1 1
2 200
2 0
3 0
4 3
EDIT1:
Solution with MultiIndex:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)), index=[1,2,2,3,4])
print (df)
0
1 1
2 3
2 0
3 0
4 3
df.index = [np.arange(len(df.index)), df.index]
print (df)
0
0 1 1
1 2 3
2 2 0
3 3 0
4 4 3
df.loc[(df[0] == 3).idxmax(), 0] = 200
df = df.reset_index(level=0, drop=True)
print (df)
0
1 1
2 200
2 0
3 0
4 3
EDIT2:
Solution with double cumsum:
np.random.seed(1)
df = pd.DataFrame([4,0,4,7,4], index=[1,2,2,3,4])
print (df)
0
1 4
2 0
2 4
3 7
4 4
mask = (df[0] == 0).cumsum().cumsum()
print (mask)
1 0
2 1
2 2
3 3
4 4
Name: 0, dtype: int32
df.loc[mask == 1, 0] = 200
print (df)
0
1 4
2 200
2 4
3 7
4 4

Consider the dataframe df
df = pd.DataFrame(dict(A=[1, 2, 3, 4, 5]))
print(df)
A
0 1
1 2
2 3
3 4
4 5
Create some arbitrary slice slc
slc = df[df.A > 2]
print(slc)
A
2 3
3 4
4 5
Access the first row of slc within df by using index[0] and loc
df.loc[slc.index[0]] = 0
print(df)
A
0 1
1 2
2 0
3 4
4 5

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(6,1),index=[1,2,2,3,3,3])
df[1] = 0
df.columns=['a','b']
df['b'][df['a']>=0.5]=1
df=df.sort(['b','a'],ascending=[0,1])
df.loc[df[df['b']==0].index.tolist()[0],'a']=0
In this method extra copy of the dataframe is not created but an extra column is introduced which can be dropped after processing. To choose any index instead o the first one you can change the last line as follows
df.loc[df[df['b']==0].index.tolist()[n],'a']=0
to change any nth item in a slice
df
a
1 0.111089
2 0.255633
2 0.332682
3 0.434527
3 0.730548
3 0.844724
df after slicing and labelling them
a b
1 0.111089 0
2 0.255633 0
2 0.332682 0
3 0.434527 0
3 0.730548 1
3 0.844724 1
After changing value of first item in slice (labelled as 0) to 0
a b
3 0.730548 1
3 0.844724 1
1 0.000000 0
2 0.255633 0
2 0.332682 0
3 0.434527 0

So using some of the answers I managed to find a one liner way to do this:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(4, size=(5,1)))
print df
0
0 1
1 3
2 0
3 0
4 3
df.loc[(df[0] == 0).cumsum()==1,0] = 1
0
0 1
1 3
2 1
3 0
4 3
Essentially this is using the mask inline with a cumsum.

Efficiently creating a Pandas DataFrame Column that contains the instance number of a value in another column

Assume that you have a Pandas column with the following information:
>> df
num
0 0
1 1
2 1
3 2
4 3
5 3
The column to the left of the num column is the index column.
I want to create an instance column that tells me what instance of num appears. This is the outcome that I want:
>> df
num instance
0 0 1
1 1 1
2 1 2
3 2 1
4 3 1
5 3 2
Here's the code that I wrote to do this:
>> my_list = []
>> for index, row in df.iterrows():
>> my_list.append(df.loc[index,'num'])
>> # The IF condition is done to prevent my_list from growing too big.
>> if len(my_list)>1:
>> if my_list[len(my_list)-1] == my_list[len(my_list)-2]:
>> del my_list[:len(my_list)-2]
>> my_list['instance'] = len([element for element in my_list if \
element == df.loc[index,'num'])
This code works perfectly for small DataFrames, but it takes an exorbitantly long amount of time to complete when the num column consists of several million lines. Is there a way of creating the instance column in the manner that I'm thinking without using .iterrows()?

try this:
In [11]: df['instance'] = df.groupby('num').cumcount()+1
In [12]: df
Out[12]:
num instance
0 0 1
1 1 1
2 1 2
3 2 1
4 3 1
5 3 2

You can groupby on 'num' column and call rank with param method=dense':
In [5]:
df['instance'] = df.groupby('num').transform(lambda x: x.rank(method='dense'))
df
Out[5]:
num instance
0 0 1
1 1 1
2 1 2
3 2 1
4 3 1
5 3 2

Python Pandas replicate rows in dataframe

If the dataframe looks like:
Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,24924.5,FALSE
1,1,2010-02-12,46039.49,TRUE
1,1,2010-02-19,41595.55,FALSE
1,1,2010-02-26,19403.54,FALSE
1,1,2010-03-05,21827.9,FALSE
1,1,2010-03-12,21043.39,FALSE
1,1,2010-03-19,22136.64,FALSE
1,1,2010-03-26,26229.21,FALSE
1,1,2010-04-02,57258.43,FALSE
And I wanna duplicate rows with IsHoliday equal to TRUE, I can do:
is_hol = df['IsHoliday'] == True
df_try = df[is_hol]
df=df.append(df_try*10)
But is there a better way to do this as I need to duplicate holiday rows 5 times, and I have to append 5 times if using the above way.

You can put df_try inside a list and then do what you have in mind:
>>> df.append([df_try]*5,ignore_index=True)
Store Dept Date Weekly_Sales IsHoliday
0 1 1 2010-02-05 24924.50 False
1 1 1 2010-02-12 46039.49 True
2 1 1 2010-02-19 41595.55 False
3 1 1 2010-02-26 19403.54 False
4 1 1 2010-03-05 21827.90 False
5 1 1 2010-03-12 21043.39 False
6 1 1 2010-03-19 22136.64 False
7 1 1 2010-03-26 26229.21 False
8 1 1 2010-04-02 57258.43 False
9 1 1 2010-02-12 46039.49 True
10 1 1 2010-02-12 46039.49 True
11 1 1 2010-02-12 46039.49 True
12 1 1 2010-02-12 46039.49 True
13 1 1 2010-02-12 46039.49 True

Other way is using concat() function:
import pandas as pd
In [603]: df = pd.DataFrame({'col1':list("abc"),'col2':range(3)},index = range(3))
In [604]: df
Out[604]:
col1 col2
0 a 0
1 b 1
2 c 2
In [605]: pd.concat([df]*3, ignore_index=True) # Ignores the index
Out[605]:
col1 col2
0 a 0
1 b 1
2 c 2
3 a 0
4 b 1
5 c 2
6 a 0
7 b 1
8 c 2
In [606]: pd.concat([df]*3)
Out[606]:
col1 col2
0 a 0
1 b 1
2 c 2
0 a 0
1 b 1
2 c 2
0 a 0
1 b 1
2 c 2

This is an old question, but since it still comes up at the top of my results in Google, here's another way.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':list("abc"),'col2':range(3)},index = range(3))
Say you want to replicate the rows where col1="b".
reps = [3 if val=="b" else 1 for val in df.col1]
df.loc[np.repeat(df.index.values, reps)]
You could replace the 3 if val=="b" else 1 in the list interpretation with another function that could return 3 if val=="b" or 4 if val=="c" and so on, so it's pretty flexible.

Appending and concatenating is usually slow in Pandas so I recommend just making a new list of the rows and turning that into a dataframe (unless appending a single row or concatenating a few dataframes).
import pandas as pd
df = pd.DataFrame([
[1,1,'2010-02-05',24924.5,False],
[1,1,'2010-02-12',46039.49,True],
[1,1,'2010-02-19',41595.55,False],
[1,1,'2010-02-26',19403.54,False],
[1,1,'2010-03-05',21827.9,False],
[1,1,'2010-03-12',21043.39,False],
[1,1,'2010-03-19',22136.64,False],
[1,1,'2010-03-26',26229.21,False],
[1,1,'2010-04-02',57258.43,False]
], columns=['Store','Dept','Date','Weekly_Sales','IsHoliday'])
temp_df = []
for row in df.itertuples(index=False):
if row.IsHoliday:
temp_df.extend([list(row)]*5)
else:
temp_df.append(list(row))
df = pd.DataFrame(temp_df, columns=df.columns)

You can do it in one line:
df.append([df[df['IsHoliday'] == True]] * 5, ignore_index=True)
or
df.append([df[df['IsHoliday']]] * 5, ignore_index=True)

Another alternative to append() is to first replace the values of a column by a list of entries and then explode() (either using ignore_index=True or not, depending on what you want):
df['IsHoliday'] = df['IsHoliday'].apply(lambda x: 5*[x] if (x == True) else x)
df.explode('IsHoliday', ignore_index=True)
The nice thing about this one is that you can already use the list in the apply() call to build copies of rows with modified values in a column, in case you wanted to do that later anyways...

Pandas row manipulation

I'm trying to replace a row in a dataframe with the row of another dataframe only if they share a common column.
Here is the first dataframe:
index no foo
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
and the second dataframe:
index no foo
0 2 aaa
1 3 bbb
2 22 3
3 33 4
4 44 5
5 55 6
I'd like my result to be
index no foo
0 0 1
1 1 2
2 2 aaa
3 3 bbb
4 4 5
5 5 6
The result of the inner merge between both dataframes returns the correct rows, but I'm having trouble inserting them at the correct index in the first dataframe
Any help would be greatly appreciated.
Thank you.

This should work as well
df1['foo'] = pd.merge(df1, df2, on='no', how='left').apply(lambda r: r['foo_y'] if r['foo_y'] == r['foo_y'] else r['foo_x'], axis=1)

You could use apply, there is probably a better way than this:
In [67]:
# define a function that takes a row and tries to find a match
def func(x):
# find if 'no' value matches, test the length of the series
if len(df1.loc[df1.no ==x.no, 'foo']) > 0:
return df1.loc[df1.no ==x.no, 'foo'].values[0] # return the first array value
else:
return x.foo # no match so return the existing value
# call apply and using a lamda apply row-wise (axis=1 means row-wise)
df.foo = df.apply(lambda row: func(row), axis=1)
df
Out[67]:
index no foo
0 0 0 1
1 1 1 2
2 2 2 aaa
3 3 3 bbb
4 4 4 5
5 5 5 6
[6 rows x 3 columns]

Pandas: Create new dataframe that averages duplicates from another dataframe

Say I have a dataframe my_df with column duplicates, e..g
foo bar foo hello
0 1 1 5
1 1 2 5
2 1 3 5
I would like to create another dataframe that averages the duplicates:
foo bar hello
0.5 1 5
1.5 1 5
2.5 1 5
How can I do this in Pandas?
So far I have managed to identify duplicates:
my_columns = my_df.columns
my_duplicates = print [x for x, y in collections.Counter(my_columns).items() if y > 1]
By I don't know how to ask Pandas to average them.

You can groupby the column index and take the mean:
In [11]: df.groupby(level=0, axis=1).mean()
Out[11]:
bar foo hello
0 1 0.5 5
1 1 1.5 5
2 1 2.5 5
A somewhat trickier example is if there is a non numeric column:
In [21]: df
Out[21]:
foo bar foo hello
0 0 1 1 a
1 1 1 2 a
2 2 1 3 a
The above will raise: DataError: No numeric types to aggregate. Definitely not going to win any prizes for efficiency, but here's generic method to do in this case:
In [22]: dupes = df.columns.get_duplicates()
In [23]: dupes
Out[23]: ['foo']
In [24]: pd.DataFrame({d: df[d] for d in df.columns if d not in dupes})
Out[24]:
bar hello
0 1 a
1 1 a
2 1 a
In [25]: pd.concat(df.xs(d, axis=1) for d in dupes).groupby(level=0, axis=1).mean()
Out[25]:
foo
0 0.5
1 1.5
2 2.5
In [26]: pd.concat([Out[24], Out[25]], axis=1)
Out[26]:
foo bar hello
0 0.5 1 a
1 1.5 1 a
2 2.5 1 a
I think the thing to take away is avoid column duplicates... or perhaps that I don't know what I'm doing.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

find column value in dataframe - python

You can check with isin with values flatten test1['col2']=test1['col1'].isin(test2.values.ravel())

Related

Set value of first item in slice in python pandas

Efficiently creating a Pandas DataFrame Column that contains the instance number of a value in another column

Python Pandas replicate rows in dataframe

Pandas row manipulation

Pandas: Create new dataframe that averages duplicates from another dataframe

Categories

Resources