I have the following data frame.
df = pd.DataFrame({'vin':['aaa','bbb','bbb','bbb','ccc','ccc','ddd','eee','eee','fff'],'module':['NORMAL','1ST_PRIORITY','2ND_PRIORITY','HELLO','3RD_PRIORITY','2ND_PRIORITY','2ND_PRIORITY','3RD_PRIORITY','HELLO','ABS']})
If a value in the vin column is unique, the Result column should be 'YES'. If a vin is not unique, the module column should be checked and 'YES' returned only for the row whose module has the highest priority.
I want output like the below data frame.
df = pd.DataFrame({'vin':['aaa','bbb','bbb','bbb','ccc','ccc','ddd','eee','eee','fff'],'module':['NORMAL','1ST_PRIORITY','2ND_PRIORITY','HELLO','3RD_PRIORITY','2ND_PRIORITY','2ND_PRIORITY','3RD_PRIORITY','HELLO','ABS'],
'Result':['YES','YES','NO','NO','NO','YES','YES','YES','NO','YES']})
I have tried the code below; it gives the correct result, but it involves too many steps.
df['count'] = df.groupby('vin').vin.transform('count')
def Check1(df):
    if (df["count"] == 1):
        return 1
    elif ((df["count"] != 1) & (df["module"] == '1ST_PRIORITY')):
        return 1
    elif ((df["count"] != 1) & (df["module"] == '2ND_PRIORITY')):
        return 2
    elif ((df["count"] != 1) & (df["module"] == '3RD_PRIORITY')):
        return 3
    else:
        return 4
df['Sort'] = df.apply(Check1, axis=1)
df = df.sort_values(by=['vin', 'Sort'])
df.drop_duplicates(subset=['vin'], keep='first',inplace = True)
df
Here's the trick: you need a custom order.
from pandas.api.types import CategoricalDtype

# create your custom order
custom_order = CategoricalDtype(
    ['1ST_PRIORITY', '2ND_PRIORITY', '3RD_PRIORITY', 'ABS', 'HELLO', 'NORMAL'],
    ordered=True)

# then assign it to the desired column
df['module'] = df['module'].astype(custom_order)

df['Result'] = ((~df.sort_values('module', ascending=True).duplicated('vin'))
                .replace({True: 'YES', False: 'NO'}))
Result:
  vin        module Result
0 aaa        NORMAL    YES
1 bbb  1ST_PRIORITY    YES
2 bbb  2ND_PRIORITY     NO
3 bbb         HELLO     NO
4 ccc  3RD_PRIORITY     NO
5 ccc  2ND_PRIORITY    YES
6 ddd  2ND_PRIORITY    YES
7 eee  3RD_PRIORITY    YES
8 eee         HELLO     NO
9 fff           ABS    YES
IIUC, you can use duplicated after sort_values:
df['Result'] = ((~df.sort_values('module').duplicated('vin'))
.replace({True: 'YES', False: 'NO'}))
print(df)
# Output
vin module Result
0 aaa NORMAL YES
1 bbb 1ST_PRIORITY YES
2 bbb 2ND_PRIORITY NO
3 bbb HELLO NO
4 ccc 3RD_PRIORITY NO
5 ccc 2ND_PRIORITY YES
6 ddd 2ND_PRIORITY YES
7 eee 3RD_PRIORITY YES
8 eee HELLO NO
9 fff ABS YES
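Note that plain sort_values works here only because the module strings happen to sort into priority order alphabetically ('1ST_PRIORITY' < '2ND_PRIORITY' < '3RD_PRIORITY' < the rest). If your real labels don't sort that way, you can map them to a numeric rank first; a minimal sketch (the rank mapping itself is an assumption, not part of the original answer):

import pandas as pd

df = pd.DataFrame({'vin': ['aaa', 'bbb', 'bbb', 'bbb', 'ccc', 'ccc', 'ddd', 'eee', 'eee', 'fff'],
                   'module': ['NORMAL', '1ST_PRIORITY', '2ND_PRIORITY', 'HELLO', '3RD_PRIORITY',
                              '2ND_PRIORITY', '2ND_PRIORITY', '3RD_PRIORITY', 'HELLO', 'ABS']})

# Hypothetical rank mapping: lower number = higher priority; anything unmapped gets a large rank.
rank = {'1ST_PRIORITY': 1, '2ND_PRIORITY': 2, '3RD_PRIORITY': 3}
order = df['module'].map(rank).fillna(99)

# Sort by rank, keep the first (highest-priority) row per vin as YES, the rest as NO.
df['Result'] = (~df.assign(order=order).sort_values('order').duplicated('vin')
                ).replace({True: 'YES', False: 'NO'})
print(df)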
Starting from a df imported from Excel like this:
Code  Time  Rev
AAA      5    3
AAA      3    2
AAA      6    1
BBB     10    2
BBB      5    1
I want to add a new column, like this, that marks the last revision:
Code  Time  Rev  Last
AAA      5    3   OK
AAA      3    2   NOK
AAA      6    1   NOK
BBB     10    2   OK
BBB      5    1   NOK
The df is already sorted by 'Code' and 'Rev'
df = df.sort_values(['Code', 'Rev'], ascending=[True, False])
My idea was to evaluate the 'Code' column: if the value in Code is equal to the value in the row above, the new column should get NOK. Unfortunately, I am not able to write this in Python.
You can do:
#Create a column called 'Last' with 'NOK' values
df['Last'] = 'NOK'
#Skipping sorting because you say df is already sorted.
#Then locate the first row in each group and change its value to 'OK'
df.loc[df.groupby('Code', as_index=False).nth(0).index, 'Last'] = 'OK'
You can use pandas.groupby.cumcount and set the first row of every group to 'OK'.
import pandas as pd

dict_ = {
    'Code': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'Time': [5, 3, 6, 10, 5],
    'Rev': [3, 2, 1, 2, 1],
}
df = pd.DataFrame(dict_)

df['Last'] = 'NOK'
df.loc[df.groupby('Code').cumcount() == 0, 'Last'] = 'OK'
This gives us the expected output:
df
Code Time Rev Last
0 AAA 5 3 OK
1 AAA 3 2 NOK
2 AAA 6 1 NOK
3 BBB 10 2 OK
4 BBB 5 1 NOK
Or you can fetch the head of each group and set its value to 'OK':
df.loc[df.groupby('Code').head(1).index, 'Last'] = 'OK'
which gives us the same thing
df
Code Time Rev Last
0 AAA 5 3 OK
1 AAA 3 2 NOK
2 AAA 6 1 NOK
3 BBB 10 2 OK
4 BBB 5 1 NOK
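If you prefer not to rely on the frame already being sorted, you can also compare Rev to the group maximum; a small sketch along the same lines (it assumes the largest Rev within each Code is the last revision and that Rev values don't repeat within a group):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Code': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
                   'Time': [5, 3, 6, 10, 5],
                   'Rev': [3, 2, 1, 2, 1]})

# 'OK' where Rev equals the maximum Rev of its Code group, 'NOK' elsewhere.
df['Last'] = np.where(df['Rev'].eq(df.groupby('Code')['Rev'].transform('max')), 'OK', 'NOK')
print(df)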
I'm reading a csv file in Python and trying to split the values in one of the columns so that I can parse out certain values.
My input would look something like this:
ColA
AA_BBB_CCC_DDD
AAA_BBBB_CCC_DDDDDD
AAAA_B_ZZ_CC_DDD
AAA_BBB_CCCC_DDDD
The entries get split on an underscore (_), using the following:
jobs = pd.read_csv("somefile.csv")
jobs["Val1"] = jobs["ColA"].str.split("_", expand=True)[1]
jobs["Val2"] = jobs["ColA"].str.split("_", expand=True)[2]
print(jobs["Val1"])
print(jobs["Val2"])
That works and gives me output like this:
0     BBB
1    BBBB
2       B
3     BBB
Name: Val1, dtype: object
0     CCC
1     CCC
2      ZZ
3    CCCC
Name: Val2, dtype: object
My issue is there are instances where the underscore is actually part of Val1 and shouldn't be dropped. If Val1 is 2 characters or less, then Val1 really needs to be combined with Val2 to get the correct value.
Take the third entry in my example: Val1 would be "B" while Val2 would be "ZZ". As Val1 is only one character, the true value of Val1 should be "B_ZZ".
To try to achieve that I'm doing the following:
if len(jobs["Val1"]) <=2:
    jobs["Val1"] = jobs["Val1"] + "_" + jobs["Val2"]
However that doesn't do anything for me. I get the same result as not including it at all.
If I change the <= value to 5, which is certainly incorrect, it then does the merge. However, it does it on all values, with output looking like this:
0     BBB_CCC
1    BBBB_CCC
2        B_ZZ
3    BBB_CCCC
Name: Val1, dtype: object
0     CCC
1     CCC
2      ZZ
3    CCCC
Name: Val2, dtype: object
I'm not sure what I'm missing here, or if there is a better approach to what I'm trying to achieve.
Sorry for the long-winded note.
Thanks
Where you are trying this:
if len(jobs["Val1"]) <=2:
    jobs["Val1"] = jobs["Val1"] + "_" + jobs["Val2"]
len(jobs["Val1"]) is the length of the whole Series (the number of rows), not the length of each string, which is why the condition doesn't do what you expect.
Instead, you can pass a function that does the per-row check via apply:
def adjust_val1(row):
    if len(row['Val1']) <= 2:
        return row['Val1'] + '_' + row['Val2']
    else:
        return row['Val1']
and then
jobs['Val3'] = jobs.apply(adjust_val1, axis=1)
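For reference, on the sample ColA values from the question this would give (building the frame by hand rather than reading the CSV):

import pandas as pd

jobs = pd.DataFrame({'ColA': ['AA_BBB_CCC_DDD', 'AAA_BBBB_CCC_DDDDDD',
                              'AAAA_B_ZZ_CC_DDD', 'AAA_BBB_CCCC_DDDD']})
jobs['Val1'] = jobs['ColA'].str.split('_', expand=True)[1]
jobs['Val2'] = jobs['ColA'].str.split('_', expand=True)[2]
jobs['Val3'] = jobs.apply(adjust_val1, axis=1)  # adjust_val1 as defined above

print(jobs['Val3'].tolist())
# ['BBB', 'BBBB', 'B_ZZ', 'BBB']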
You can use df['ColA'].str.split with a negative lookbehind regex:
df['ColA'].str.split(r'(?<!_[A-Za-z])_', expand=True)
0 1 2 3
0 AA BBB CCC DDD
1 AAA BBBB CCC DDDDDD
2 AAAA B_ZZ CC DDD
3 AAA BBB CCCC DDDD
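To see what the lookbehind does on its own, you can test the pattern with re.split on one of the sample strings. Note that it only protects an underscore that follows a single-letter piece, which is exactly the B_ZZ case shown here:

import re

# '_' is skipped as a split point when the two characters before it are '_' plus a letter,
# so the underscore inside 'B_ZZ' is preserved.
print(re.split(r'(?<!_[A-Za-z])_', 'AAAA_B_ZZ_CC_DDD'))
# ['AAAA', 'B_ZZ', 'CC', 'DDD']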
Using numpy, maybe this can help:
import numpy as np

jobs["Val3"] = np.where(jobs.Val1.str.len() <= 2, jobs.Val1 + "_" + jobs.Val2, jobs.Val1)
ColA Val1 Val2 Val3
0 AA_BBB_CCC_DDD BBB CCC BBB
1 AAA_BBBB_CCC_DDDDDD BBBB CCC BBBB
2 AAAA_B_ZZ_CC_DDD B ZZ B_ZZ
3 AAA_BBB_CCCC_DDDD BBB CCCC BBB
I would use a function with the apply method:
import pandas as pd
df = pd.DataFrame()
df['col'] = pd.Series(['AA_BBB_CCC_DDD','AAA_BBBB_CCC_DDDDDD','AAAA_B_ZZ_CC_DDD','AAA_BBB_CCCC_DDDD'])
df:
col
0 AA_BBB_CCC_DDD
1 AAA_BBBB_CCC_DDDDDD
2 AAAA_B_ZZ_CC_DDD
3 AAA_BBB_CCCC_DDDD
A function to split as you wish:
def get_val1(x):
    l = x.split('_')
    if len(l[1]) <= 2:
        return l[1] + '_' + l[2]
    else:
        return l[1]
df['val1'] = df['col'].apply(lambda x: get_val1(x))
Result for val1:
0 BBB
1 BBBB
2 B_ZZ
3 BBB
Name: val1, dtype: object
df["val2"] = df["col"].str.split("_", expand=True)[2]
Result:
col val1 val2
0 AA_BBB_CCC_DDD BBB CCC
1 AAA_BBBB_CCC_DDDDDD BBBB CCC
2 AAAA_B_ZZ_CC_DDD B_ZZ ZZ
3 AAA_BBB_CCCC_DDDD BBB CCCC
I have a dataframe with 100s of columns and 1000s of rows but the basic structure is
Index 0 1 2
0 AAA NaN AAA
1 NaN BBB NaN
2 NaN NaN CCC
3 DDD DDD DDD
I would like to add two new columns: the first would be an id, equal to the first value in each row, and the second would be a count of the non-NaN values in each row. It would look like this; to be clear, the non-NaN values within a row will always be the same.
Index id count 0 1 2
0 AAA 2 AAA NaN AAA
1 BBB 1 NaN BBB NaN
2 CCC 1 NaN NaN CCC
3 DDD 3 DDD DDD DDD
Any help in figuring out a way to do this would be greatly appreciated. Thanks
This should work.
df['id'] = df.bfill(axis=1).iloc[:, 0].fillna('All NANs')
df['count'] = df.drop(columns=["id"]).notnull().sum(axis=1)
To maintain the order of columns:
df = df[list(df.columns[-2:]) + list(df.columns[:-2])]
Create the DataFrame:
import numpy as np
import pandas as pd

test_df = pd.DataFrame([['AAA', np.nan, 'AAA'], [np.nan, 'BBB', np.nan],
                        [np.nan, np.nan, 'CCC'], ['DDD', 'DDD', 'DDD']])
Count the non-NaN elements in each row as count
test_df['count'] = test_df.notna().sum(axis=1)
Option-1: Select the first element in the row as id (regardless of NaN value)
test_df['id'] = test_df[0]
Option-2: Select the first non-NaN element as id for each row
test_df['id'] = test_df.apply(lambda x: x[x.first_valid_index()], axis=1)
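Since count was added before id here, the lambda also scans the count column; that only matters for a row with no data at all, but you can restrict the lookup to the original data columns if you prefer (a sketch, assuming those columns are 0, 1 and 2):

data_cols = [0, 1, 2]  # the original data columns, i.e. everything except 'count'
test_df['id'] = test_df[data_cols].apply(lambda x: x[x.first_valid_index()], axis=1)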
I am new to pandas and I am trying to get a list of values that exist in both columns, values that exist only in column A, and values that exist only in column B.
My .csv file looks like this:
A B
AAA ZZZ
BBB BBB
CCC EEE
DDD FFF
EEE AAA
DDD
GGG HHH
JJJ
The columns have different lengths. My desired outcome would be 3 lists, or one csv output with 3 columns: one for items existing in both columns, one for items existing only in column A, and one for items existing only in column B.
IN BOTH IN COLUMN A IN COLUMN B
AAA CCC ZZZ
BBB GGG FFF
DDD JJJ HHH
EEE
(empty one)
I have tried using the .isin() method, but it returns True or False rather than the actual values.
existing_in_both = df_column_a.isin(df_column_b)
And I do not know how I should try to extract values that only exist in either column A or B.
Thank you for your suggestions.
My actual .csv has the following:
id clickout_id timestamp click_id click_type
1 123abc 2019-11-25 c51c56d1 1
1 123dce 2019-11-25 c51c5fs1 12
and other file is looking like this:
timestamp id gid type
2019-11-25 1 c51c56d1 2
2019-11-25 1 c51c5fs1 2
And I am trying to compare click_id from the first file with gid from the second file.
When I print using your answer, I get the header names as results rather than the values from the columns.
Use sets with intersection and difference, then build the new DataFrame from Series, because the outputs have different lengths:
a = set(df.A)
b = set(df.B)
df = pd.DataFrame({'IN BOTH': pd.Series(list(a & b)),
'IN COLUMN A': pd.Series(list(a - b)),
'IN COLUMN B': pd.Series(list(b - a))})
print (df)
IN BOTH IN COLUMN A IN COLUMN B
0 DDD CCC FFF
1 BBB GGG ZZZ
2 AAA JJJ HHH
3 NaN NaN
4 EEE NaN NaN
Or use numpy.intersect1d with numpy.setdiff1d:
import numpy as np

df = pd.DataFrame({'IN BOTH': pd.Series(np.intersect1d(df.A, df.B)),
                   'IN COLUMN A': pd.Series(np.setdiff1d(df.A, df.B)),
                   'IN COLUMN B': pd.Series(np.setdiff1d(df.B, df.A))})
print (df)
IN BOTH IN COLUMN A IN COLUMN B
0 CCC FFF
1 AAA GGG HHH
2 BBB JJJ ZZZ
3 DDD NaN NaN
4 EEE NaN NaN
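If the blank entries above come from empty cells in the CSV, dropping missing values before building the sets keeps them out of the result. The same idea carries over to the two-file case from the edit (the column names below are taken from the samples shown; the file paths are hypothetical):

import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical path: the file with columns A and B

a = set(df['A'].dropna())
b = set(df['B'].dropna())

# Two-file variant: build each set from its own frame.
# a = set(pd.read_csv('clickouts.csv')['click_id'].dropna())
# b = set(pd.read_csv('events.csv')['gid'].dropna())

result = pd.DataFrame({'IN BOTH': pd.Series(sorted(a & b)),
                       'IN COLUMN A': pd.Series(sorted(a - b)),
                       'IN COLUMN B': pd.Series(sorted(b - a))})
print(result)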
I have a csv file. It looks something like this:
name,id,
AAA,1111,
BBB,2222,
CCC,3333,
DDD,2222,
I want to find out whether there is a duplicate in column id. If yes, find out the duplicate. In this case, the answer is 2222.
I have the code to find out whether a duplicate exists. Here it is;
import pandas as pd
csv_file = 'C:/test.csv'
df = pd.read_csv(csv_file)
df['id'].duplicated().any()
The problem is: how can one find out which value is the duplicate?
I am using Python 2.7 and pandas.
I think you can use duplicated (keep is omitted because keep='first' is the default). Or, if you need the values as a list, use tolist:
print df['id'][df.duplicated(subset=['id'])]
3 2222
Name: id, dtype: int64
print df['id'][df.duplicated(subset=['id'])].tolist()
[2222]
You can check duplicated:
print df.duplicated(subset=['id'], keep='first')
0 False
1 False
2 False
3 True
dtype: bool
print df.duplicated(subset=['id'], keep='last')
0 False
1 True
2 False
3 False
dtype: bool
print df.duplicated(subset=['id'], keep=False)
0 False
1 True
2 False
3 True
dtype: bool
And if you need the duplicated rows themselves, use the mask for boolean indexing:
print df[df.duplicated(subset=['id'], keep='first')]
name id
3 DDD 2222
print df[df.duplicated(subset=['id'], keep='last')]
name id
1 BBB 2222
print df[df.duplicated(subset=['id'], keep=False)]
name id
1 BBB 2222
3 DDD 2222
Use drop_duplicates for dropping:
print df.drop_duplicates(subset=['id'], keep='first')
name id
0 AAA 1111
1 BBB 2222
2 CCC 3333
print df.drop_duplicates(subset=['id'], keep='last')
name id
0 AAA 1111
2 CCC 3333
3 DDD 2222
print df.drop_duplicates(subset=['id'], keep=False)
name id
0 AAA 1111
2 CCC 3333
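If you only need the duplicated values themselves, value_counts is another option (a small sketch using the df read above; written with print() but it works the same on Python 2.7):

counts = df['id'].value_counts()
print(counts[counts > 1].index.tolist())
# [2222]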