Pandas dataframe self-dependency in data to fill a column - python

I have a dataframe with data as shown in the attached picture.
The value of "relation" is determined from the codeid: leather has codeid=11, which has already appeared against bag, so in relation we put the value bag.
The same happens for shoes.
To do: fill in the value of "relation" by checking the codeid within the dataframe. Any help would be appreciated.
Edit: the same codeid, e.g. 11, can appear more than twice, but "relation" can only hold the value bag, because bag is the first row with codeid=11. I have updated the picture as well.

If you want the later duplicates to reference the name of the first occurrence, use groupby with transform('first') and then set the first and unique rows back to NaN using loc with a duplicated mask:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'name': list('brslp'),
                   'codeid': [11, 12, 13, 11, 13]})
# fill every row with the name of the first row having the same codeid
df['relation'] = df.groupby('codeid')['name'].transform('first')
print (df)
id name codeid relation
0 1 b 11 b
1 2 r 12 r
2 3 s 13 s
3 4 l 11 b
4 5 p 13 s
#mark every occurrence of a codeid except the last one
print (df['codeid'].duplicated(keep='last'))
0 True
1 False
2 True
3 False
4 False
Name: codeid, dtype: bool
#invert the all-duplicates mask with ~ to get rows whose codeid is unique
print (~df['codeid'].duplicated(keep=False))
0 False
1 True
2 False
3 False
4 False
Name: codeid, dtype: bool
#chain the boolean masks together with |
print (df['codeid'].duplicated(keep='last') | ~df['codeid'].duplicated(keep=False))
0 True
1 True
2 True
3 False
4 False
Name: codeid, dtype: bool
#set relation to NaN where the combined mask is True
df.loc[df['codeid'].duplicated(keep='last') |
~df['codeid'].duplicated(keep=False), 'relation'] = np.nan
print (df)
id name codeid relation
0 1 b 11 NaN
1 2 r 12 NaN
2 3 s 13 NaN
3 4 l 11 b
4 5 p 13 s
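Given the edit that a codeid can appear more than twice and only its very first occurrence should stay empty, a slightly shorter variant might be to invert duplicated() directly. A sketch using the same toy data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'name': list('brslp'),
                   'codeid': [11, 12, 13, 11, 13]})

# every row gets the name of the first row sharing its codeid
df['relation'] = df.groupby('codeid')['name'].transform('first')
# duplicated() with the default keep='first' marks later occurrences, so ~ selects the first ones
df.loc[~df['codeid'].duplicated(), 'relation'] = np.nan
print (df)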

I think you want to do something like this:
import pandas as pd
df = pd.DataFrame([['bag', 11, 'null'],
['shoes', 12, 'null'],
['shopper', 13, 'null'],
['leather', 11, 'bag'],
['plastic', 13, 'shoes']], columns = ['name', 'codeid', 'relation'])
def codeid_analysis(rows):
    if rows['codeid'] == 11:
        rows['relation'] = 'bag'
    elif rows['codeid'] == 12:
        rows['relation'] = 'shirt'   # for example; put whatever value you want here
    elif rows['codeid'] == 13:
        rows['relation'] = 'pants'   # for example; put whatever value you want here
    return rows
result = df.apply(codeid_analysis, axis = 1)
print(result)
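If the codeid-to-relation mapping is known up front, the if/elif chain can also be replaced with a dict lookup. A sketch reusing the example values above (the mapping itself is an assumption):
# hypothetical lookup table: codeid -> relation value
codeid_map = {11: 'bag', 12: 'shirt', 13: 'pants'}
df['relation'] = df['codeid'].map(codeid_map)
print(df)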

It is not the optimal solution, since it is costly in memory, but here is my attempt. df1 is created to hold only the rows whose relation is null, since the nulls seem to be the first occurrences. After some cleaning, the two dataframes are merged into one.
import pandas as pd
df = pd.DataFrame([['bag', 11, 'null'],
['shoes', 12, 'null'],
['shopper', 13, 'null'],
['leather', 11, 'bag'],
['plastic', 13, 'shopper'],
['something',13,""]], columns = ['name', 'codeid', 'relation'])
df1=df.loc[df['relation'] == 'null'].copy()#create a df with only null values in relation
df1.drop_duplicates(subset=['name'], inplace=True)#drops the duplicates and retains the first entry
df1=df1.drop("relation",axis=1)#drop the unneeded column
final_df=pd.merge(df, df1, left_on='codeid', right_on='codeid')#merge the two dfs on the columns names
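Because both frames contain a name column, pandas suffixes them as name_x/name_y in final_df; name_y holds the first name recorded for each codeid. A small cleanup sketch (the suffixed column names are assumed from pandas' defaults):
final_df = final_df.drop(columns='relation').rename(columns={'name_x': 'name', 'name_y': 'relation'})
print(final_df[['name', 'codeid', 'relation']])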

Related

Python pandas: Need to know how many people have met two criteria

With this data set I want to find the people (id) who have made payments of both type a and type b, and create a subset of the data containing only those people. (This is just an example set of data; the one I'm using is much larger.)
I've tried grouping by id and then making a subset where type.len >= 2, and then creating another subset based on the condition df.loc[(df.type == 'a') & (df.type == 'b')]. I thought that grouping by id first and then running that df.loc code would work, but it doesn't.
Any help is much appreciated.
Thanks.
Separate the dataframe into two, one with type a payments and the other with type b payments, then merge them,
df_typea = df[(df['type'] == 'a')]
df_typeb = df[(df['type'] == 'b')]
df_merge = pd.merge(df_typea, df_typeb, how='outer', on='id', suffixes=('_a', '_b'))
This will create a separate column for each payment type.
Now, you can find the ids for which both payments have been made,
df_payments = df_merge[(df_merge['type_a'] == 'a') & (df_merge['type_b'] == 'b')]
Note that this will create two records for ids like id 9, for which there are more than two payments. I am assuming that you simply want to check whether any payments of type 'a' and 'b' have been made for each id. In that case, you can simply drop any duplicates,
df_payments_no_duplicates = df_payments['id'].drop_duplicates()
You first split your DataFrame into two DataFrames:
one with type a payments only
one with type b payments only
You then join both DataFrames on id.
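A compact sketch of that idea, reusing the df_typea and df_typeb split from above: an inner join keeps only the ids present in both subsets.
# inner merge: only ids that appear in both the type-a and type-b subsets survive
ids_with_both = pd.merge(df_typea[['id']], df_typeb[['id']], on='id')['id'].drop_duplicates()
print(ids_with_both.tolist())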
You can use groupby to solve this problem. First, group by id and type; then you can group again to see whether each id had both types.
import pandas as pd
df = pd.DataFrame({"id" : [1, 1, 2, 3, 4, 4, 5, 5], 'payment' : [10, 15, 5, 20, 35, 30, 10, 20], 'type' : ['a', 'b', 'a','a','a','a','b', 'a']})
df_group = df.groupby(['id', 'type']).nunique()
#print(df_group)
'''
payment
id type
1 a 1
b 1
2 a 1
3 a 1
4 a 2
5 a 1
b 1
'''
# if the value in this series is 2, the id has both a and b
data = df_group.groupby('id').size()
#print(data)
'''
id
1 2
2 1
3 1
4 1
5 2
dtype: int64
'''
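To complete the step described above, the ids whose group size is 2 can be pulled out of that series, for example:
# ids that have both type 'a' and type 'b' payments
ids_both = data[data == 2].index.tolist()
print(ids_both)   # [1, 5] for this sample data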
You can use groupby and nunique to get the count of unique payment types done.
print (df.groupby('id')['type'].agg(['nunique']))
This will give you:
nunique
id
1 2
2 1
3 1
4 1
5 1
6 2
7 1
8 1
9 2
If you want to list only the rows whose id has both a and b types:
df['count'] = df.groupby('id')['type'].transform('nunique')
print (df[df['count'] > 1])
By using groupby.transform, each row will be populated with the unique count value. Then you can use count > 1 to filter out the rows that have both a and b.
This will give you:
id payment type count
0 1 10 a 2
1 1 15 b 2
7 6 10 b 2
8 6 15 a 2
11 9 35 a 2
12 9 30 a 2
13 9 10 b 2
You may also use the length of the set of 'type' values returned for a given id:
len(set(df[df['id']==1]['type'])) # returns 2
len(set(df[df['id']==2]['type'])) # returns 1
Thus, the following would give you an answer to your question
paid_both = []
for i in set(df['id']):
    if len(set(df[df['id']==i]['type'])) == 2:
        paid_both.append(i)
## paid_both = [1,6,9]  # the ids who paid both
You can then iterate through the unique id values to get the result for all ids: if 2 is returned, that person has made payments of both types (a) and (b).

I want to match two dataframe columns in Python

I have two data frames, df1 (35k records) and df2 (100k records). In df1['col1'] and df2['col3'] I have unique ids. I want to match df1['col1'] with df2['col3']: if they match, I want to add a column df1['Match'] with the value True, and otherwise the value False, so that the flag maps correctly onto the matching and non-matching records.
I am using the .isin() function; I get the correct match and non-match counts, but I am not able to map them correctly.
Match = df1['col1'].isin(df2['col3'])
df1['match'] = Match
I have also tried the merge function, passing the parameter how='right', but did not get the expected results.
You can simply do as follows:
df1['Match'] = df1['col1'].isin(df2['col3'])
For instance:
import pandas as pd
data1 = [1,2,3,4,5]
data2 = [2,3,5]
df1 = pd.DataFrame(data1, columns=['a'])
df2 = pd.DataFrame(data2,columns=['c'])
print (df1)
print (df2)
df1['Match'] = df1['a'].isin(df2['c']) # if matches it returns True else False
print (df1)
Output:
a
0 1
1 2
2 3
3 4
4 5
c
0 2
1 3
2 5
a Match
0 1 False
1 2 True
2 3 True
3 4 False
4 5 True
Use df.loc indexing:
df1['Match'] = False
df1.loc[df1['col1'].isin(df2['col3']), 'Match'] = True
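Since the question also mentions merge: an indicator merge could produce the same flag. A sketch, assuming col1/col3 as named in the question; df2 is deduplicated first so the row count of df1 is preserved:
merged = df1.merge(df2[['col3']].drop_duplicates(), left_on='col1', right_on='col3',
                   how='left', indicator=True)
# '_merge' is 'both' for matching rows and 'left_only' otherwise
df1['Match'] = (merged['_merge'] == 'both').values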

Find greater row between two dataframes of same shape

I have two dataframes of the same shape and am trying to find all the rows in df A where every value is greater than the corresponding value in the same row of df B.
Mini-example:
df_A = pd.DataFrame({'one':[20,7,2],'two':[11,9,1]})
df_B = pd.DataFrame({'one':[1,8,12],'two':[10,5,3]})
I'd like to return only row 0.
one two
0 20 11
I realise that df_A > df_B gets me most of the way, but I just can't figure out how to return only those rows where everything is True.
(I tried merging the two, but that didn't seem to make it simpler.)
IIUIC, you can use all
In [633]: m = (df_A > df_B).all(1)
In [634]: m
Out[634]:
0 True
1 False
2 False
dtype: bool
In [635]: df_A[m]
Out[635]:
one two
0 20 11
In [636]: df_B[m]
Out[636]:
one two
0 1 10
In [637]: pd.concat([df_A[m], df_B[m]])
Out[637]:
one two
0 20 11
0 1 10
Or, if you just need row indices.
In [642]: m.index[m]
Out[642]: Int64Index([0], dtype='int64')
df_A.loc[(df_A > df_B).all(axis=1)]
import pandas as pd
df_A = pd.DataFrame({"one": [20, 7, 2], "two": [11, 9, 1]})
df_B = pd.DataFrame({"one": [1, 8, 12], "two": [10, 5, 3]})
row_indices = (df_A > df_B).apply(min, axis=1)
print(df_A[row_indices])
print()
print(df_B[row_indices])
Output is:
one two
0 20 11
one two
0 1 10
Explanation:
df_A > df_B compares element wise, this is the result:
one two
0 True True
1 False True
2 False False
Python treats True > False, so applying min row-wise (this is why I used axis=1) only yields True if both values in a row are True:
0 True
1 False
2 False
This is now a boolean index to extract rows from df_A resp. df_B.
It can be done in one line of code if you are interested.
df_A[(df_A > df_B)].dropna(axis=0, how='any')
Here df_A[(df_A > df_B)] applies the element-wise comparison as a mask, keeping the value where the comparison is True and NaN where it is False.
one two
0 20.0 11.0
1 NaN 9.0
2 NaN NaN
Then we drop the rows along axis 0 that contain at least one NaN value.

Boolean Comparison across multiple dataframes

I have an issue where I want to compare values across multiple dataframes. Here is a snippet example:
import pandas as pd

data0 = [[1,'01-01'],[2,'01-02']]
data1 = [[11,'02-30'],[12,'02-25']]
data2 = [[8,'02-30'],[22,'02-25']]
data3 = [[7,'02-30'],[5,'02-25']]
df0 = pd.DataFrame(data0,columns=['Data',"date"])
df1 = pd.DataFrame(data1,columns=['Data',"date"])
df2 = pd.DataFrame(data2,columns=['Data',"date"])
df3 = pd.DataFrame(data3,columns=['Data',"date"])
result=(df0['Data']| df1['Data'])>(df2['Data'] | df3['Data'])
What I would like to do, as I hope can be seen, is: if a value in row X of df0 or df1 is greater than the value in row X of df2 or df3, return True; otherwise return False. In the code above, 11 in df1 is greater than both 8 and 7 (df2 and df3 respectively), so the first result should be True; for the second row, neither 2 nor 12 is greater than 22 (df2), so it should be False. However, result gives me
False,False
instead of
True,False
any thoughts or help?
Problem
For your data:
>>> df0['Data']
0 1
1 2
Name: Data, dtype: int64
>>> df1['Data']
0 11
1 12
Name: Data, dtype: int64
you are doing a bitwise OR with |:
>>> df0['Data']| df1['Data']
0 11
1 14
Name: Data, dtype: int64
>>> df2['Data']| df3['Data']
0 15
1 23
Name: Data, dtype: int64
Do this for the single numbers:
>>> 1 | 11
11
>>> 2 | 12
14
This is not what you want.
Solution
You can use np.maximum to find the element-wise maximum of each pair of series:
>>> np.maximum(df0['Data'], df1['Data']) > np.maximum(df2['Data'], df3['Data'])
0 True
1 False
Name: Data, dtype: bool
Your existing solution does not work because the | operator performs a bitwise OR operation on the elements.
df0.Data | df1.Data
0 11
1 14
Name: Data, dtype: int64
This results in you comparing values that are different to the values in your dataframe columns. In summary, your approach does not compare values as you'd expect.
You can make this easy by finding:
the max per row of df0 and df1, and
the max per row of df2 and df3.
Comparing these two arrays gives your result:
import numpy as np

i = np.max([df0.Data, df1.Data], axis=0)
j = np.max([df2.Data, df3.Data], axis=0)
i > j
array([ True, False], dtype=bool)
This approach happens to be extremely scalable for any number of dataframes.
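For example, the same pattern might be written over lists of dataframes. A sketch reusing df0..df3 from the question (assuming they share the same shape and index):
import numpy as np

# group the dataframes whose row-wise maxima are being compared
left = [df0, df1]
right = [df2, df3]

result = np.max([d.Data for d in left], axis=0) > np.max([d.Data for d in right], axis=0)
print(result)   # [ True False]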

Python Pandas mark all but one specific duplicate row

I have a Pandas dataframe that has already been reduced to duplicates only and sorted.
Duplicates are identified by column "HASH" and then sorted by "HASH" and "SIZE"
df_out['is_duplicated'] = df.duplicated(['HASH'], keep=False)  # keep=False: mark all duplicates as True
df_out = df_out.loc[df_out['is_duplicated'] == True]  # keep only duplicate records
df_out = df_out.sort_values(['HASH', 'SIZE'], ascending=[True, False])  # sort by "HASH", then by "SIZE"
Result:
HASH SIZE is_duplicated
1 5 TRUE
1 3 TRUE
1 2 TRUE
9 7 TRUE
9 5 TRUE
I would like to add 2 more columns.
The first column would identify rows with the same "HASH" by an ID: the first set of rows with the same "HASH" would get 1, the next set 2, and so on.
The second column would mark the single row in each group that has the largest "SIZE":
HASH SIZE ID KEEP
1 5 1 TRUE
1 3 1 FALSE
1 2 1 FALSE
9 7 2 TRUE
9 5 2 FALSE
Perhaps use dicts and list comprehension:
import pandas as pd
df = pd.DataFrame([[1,1,1,9,9],[5,3,2,7,5]]).T
df.columns = ['HASH','SIZE']
# map each unique HASH to a sequential ID, in order of appearance
hash_dict = dict(zip(df.HASH.unique(), range(1, df.HASH.nunique() + 1)))
df['ID'] = [hash_dict[k] for k in df.HASH]
# flag the rows whose SIZE equals the maximum SIZE within their HASH group
max_dict = dict(df.groupby('HASH')['SIZE'].max())
df['KEEP'] = [b == max_dict[a] for a, b in zip(df.HASH, df.SIZE)]
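A groupby-based variant may also work here (a sketch, not from the original answer): ngroup numbers each HASH group in order of appearance, and transform('max') flags the largest SIZE per group.
# number each HASH group in order of appearance (sort=False keeps the original order)
df['ID'] = df.groupby('HASH', sort=False).ngroup() + 1
# True only for the rows whose SIZE equals the maximum SIZE of their HASH group
df['KEEP'] = df['SIZE'] == df.groupby('HASH')['SIZE'].transform('max')
print(df)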
