I have the data frame below.
df = pd.DataFrame({'vin':['aaa','bbb','bbb','bbb','ccc','ccc','ddd','eee','eee','fff'],'module':['NORMAL','1ST_PRIORITY','2ND_PRIORITY','HELLO','3RD_PRIORITY','2ND_PRIORITY','2ND_PRIORITY','3RD_PRIORITY','HELLO','ABS']})
I want the following: if a vin value is unique, the Result column should be 'YES'; if a vin value is not unique, it should check the module column and return 'YES' for the row whose module has the highest priority (and 'NO' for the rest).
I want output like the data frame below.
df = pd.DataFrame({'vin':['aaa','bbb','bbb','bbb','ccc','ccc','ddd','eee','eee','fff'],'module':['NORMAL','1ST_PRIORITY','2ND_PRIORITY','HELLO','3RD_PRIORITY','2ND_PRIORITY','2ND_PRIORITY','3RD_PRIORITY','HELLO','ABS'],
'Result':['YES','YES','NO','NO','NO','YES','YES','YES','NO','YES']})
I have tried the code below; it gives the correct result, but it involves too many steps.
df['count'] = df.groupby('vin').vin.transform('count')
def Check1(df):
    if df["count"] == 1:
        return 1
    elif (df["count"] != 1) & (df["module"] == '1ST_PRIORITY'):
        return 1
    elif (df["count"] != 1) & (df["module"] == '2ND_PRIORITY'):
        return 2
    elif (df["count"] != 1) & (df["module"] == '3RD_PRIORITY'):
        return 3
    else:
        return 4
df['Sort'] = df.apply(Check1, axis=1)
df = df.sort_values(by=['vin', 'Sort'])
df.drop_duplicates(subset=['vin'], keep='first', inplace=True)
df
Here's the trick: you need a custom order on the module column.
from pandas.api.types import CategoricalDtype
# create your custom order
custom_order = CategoricalDtype(
    ['1ST_PRIORITY', '2ND_PRIORITY', '3RD_PRIORITY', 'ABS', 'HELLO', 'NORMAL'],
    ordered=True)
# then assign it to the desired column
df['module'] = df['module'].astype(custom_order)
df['Result'] = ((~df.sort_values('module', ascending=True).duplicated('vin'))
.replace({True: 'YES', False: 'NO'}))
Result:
  vin        module Result
0 aaa        NORMAL    YES
1 bbb  1ST_PRIORITY    YES
2 bbb  2ND_PRIORITY     NO
3 bbb         HELLO     NO
4 ccc  3RD_PRIORITY     NO
5 ccc  2ND_PRIORITY    YES
6 ddd  2ND_PRIORITY    YES
7 eee  3RD_PRIORITY    YES
8 eee         HELLO     NO
9 fff           ABS    YES
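With the ordered dtype in place, a quick sanity check (just a sketch) is to sort by vin and module and confirm that, within each vin group, the priority labels come before the other module names:
# the ordered categorical makes sort_values rank modules by priority, not alphabetically
print(df.sort_values(['vin', 'module'])[['vin', 'module']])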
IIUC, you can use duplicated after sort_values:
df['Result'] = ((~df.sort_values('module').duplicated('vin'))
.replace({True: 'YES', False: 'NO'}))
print(df)
# Output
vin module Result
0 aaa NORMAL YES
1 bbb 1ST_PRIORITY YES
2 bbb 2ND_PRIORITY NO
3 bbb HELLO NO
4 ccc 3RD_PRIORITY NO
5 ccc 2ND_PRIORITY YES
6 ddd 2ND_PRIORITY YES
7 eee 3RD_PRIORITY YES
8 eee HELLO NO
9 fff ABS YES
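As a side note, this plain sort only happens to work because the priority labels sort lexicographically ahead of the other module names (digits sort before letters); a quick check (sketch):
print(sorted(df['module'].unique()))
# ['1ST_PRIORITY', '2ND_PRIORITY', '3RD_PRIORITY', 'ABS', 'HELLO', 'NORMAL']
If the labels were named differently, a custom categorical order like the one above would be the safer choice.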
How to concatenate the text of two rows into one row and keep the other rows the same in pandas
How can I do that with pandas?
original dataframe:
textA TextB
0 a zz
1 bbb zzzzz
2 ccc zzz
desired output is:
textA TextB
0 a bbb zz
1 bbb zzzzz
2 ccc zzz
I mean I just want to add the text of two rows into one specific row, while the other rows keep their original values.
Do you mean something like:
>>> df.loc[0, 'textA'] += ' ' + df.loc[1, 'textA']
>>> df
textA TextB
0 a bbb zz
1 bbb zzzzz
2 ccc zzz
>>>
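If you ever need to fold more than two rows into one cell, the same idea generalises with a join over a label slice (a sketch; the row positions 0 to 2 are just an assumption):
# .loc slicing is inclusive, so this joins the textA values of rows 0, 1 and 2 into row 0
df.loc[0, 'textA'] = ' '.join(df.loc[0:2, 'textA'])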
I have a DataFrame like below
Name Mail-Body
Oliver I am recently doing AAA, BBB and BBB....
Jack Here is my report. It seemed AAA was.. so AAA is..
Jacob How are you doing? Next week we launch our AAA porject...
And with this DataFrame, I would like to perform some data analysis.
But I found out that emails containing strings such as "AAA" and "BBB" many times tend to be just scheduling notifications and the like, so they are pretty much meaningless.
So I would like to drop all the rows whose Mail-Body column contains the same string, such as "AAA" or "BBB", more than 5 times.
Is there any pythonic way to drop all such rows?
Sample:
print (df)
Name Mail-Body
0 Oliver I AAA BBB am recently doing AAA, BBB and BBB
1 Jack AAA AAA. AAA BBB It seemed AAA was.. so AAA is
2 Jacob AAA AAA BBB BBB AAA BBB AAA AAA BBB BBB BBB
3 Bal AAA BBB
If you want to remove rows where AAA appears too many times, keep the rows with fewer than 5 occurrences of AAA, using Series.str.count, Series.lt and boolean indexing:
df0 = df[df['Mail-Body'].str.count('AAA').lt(5)]
print (df0)
Name Mail-Body
0 Oliver I AAA BBB am recently doing AAA, BBB and BBB
3 Bal AAA BBB
If you want to count AAA and BBB together per row (it does not matter how many are AAA and how many are BBB), use the AAA|BBB pattern:
df1 = df[df['Mail-Body'].str.count('AAA|BBB').lt(5)]
print (df1)
Name Mail-Body
3 Bal AAA BBB
If you want to test AAA and BBB separately, chain the masks with | for bitwise OR, i.e. keep rows with fewer than 5 occurrences of AAA or fewer than 5 occurrences of BBB:
df2 = df[df['Mail-Body'].str.count('AAA').lt(5) | df['Mail-Body'].str.count('BBB').lt(5)]
print (df2)
Name Mail-Body
0 Oliver I AAA BBB am recently doing AAA, BBB and BBB
1 Jack AAA AAA. AAA BBB It seemed AAA was.. so AAA is
3 Bal AAA BBB
And if you want to combine the conditions with & for bitwise AND, the solution is:
df3 = df[df['Mail-Body'].str.count('AAA').lt(5) & df['Mail-Body'].str.count('BBB').lt(5)]
print (df3)
Name Mail-Body
0 Oliver I AAA BBB am recently doing AAA, BBB and BBB
3 Bal AAA BBB
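If the keyword list grows, the AAA|BBB idea generalises to any set of words by joining them into a single pattern (a sketch; the keyword list and the threshold of 5 are assumptions taken from the question):
import re

# keep only rows where the keywords appear fewer than 5 times in total
keywords = ['AAA', 'BBB']
pattern = '|'.join(re.escape(k) for k in keywords)  # escape in case of regex metacharacters
df_filtered = df[df['Mail-Body'].str.count(pattern).lt(5)]
print(df_filtered)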
I am new to pandas and I am trying to get a list of values that exist in both columns, values that exist only in column A, and values that exist only in column B.
My .csv file looks like this:
A    B
AAA  ZZZ
BBB  BBB
CCC  EEE
DDD  FFF
EEE  AAA
     DDD
GGG  HHH
JJJ
The columns have different lengths, and my desired outcome would be 3 lists, or one CSV output with 3 columns: one for items existing in both columns, one for items existing only in column A, and one for items existing only in column B.
IN BOTH      IN COLUMN A  IN COLUMN B
AAA          CCC          ZZZ
BBB          GGG          FFF
DDD          JJJ          HHH
EEE
(empty one)
I have tried using the .isin() method, but it returns True or False rather than the actual list.
existing_in_both = df_column_a.isin(df_column_b)
And I do not know how I should try to extract values that only exist in either column A or B.
Thank you for your suggestions.
My actual .csv has the following:
id clickout_id timestamp click_id click_type
1 123abc 2019-11-25 c51c56d1 1
1 123dce 2019-11-25 c51c5fs1 12
and other file is looking like this:
timestamp id gid type
2019-11-25 1 c51c56d1 2
2019-11-25 1 c51c5fs1 2
And I am trying to compare click_id from the first file with gid from the second file.
When I print out using your answer I get the header names as answers rather than the values from the columns.
Use sets with intersection and difference, then build the new DataFrame from Series, because the outputs have different lengths:
a = set(df.A)
b = set(df.B)
df = pd.DataFrame({'IN BOTH': pd.Series(list(a & b)),
                   'IN COLUMN A': pd.Series(list(a - b)),
                   'IN COLUMN B': pd.Series(list(b - a))})
print (df)
  IN BOTH IN COLUMN A IN COLUMN B
0     DDD         CCC         FFF
1     BBB         GGG         ZZZ
2     AAA         JJJ         HHH
3                 NaN         NaN
4     EEE         NaN         NaN
Or use numpy.intersect1d with numpy.setdiff1d:
import numpy as np

df = pd.DataFrame({'IN BOTH': pd.Series(np.intersect1d(df.A, df.B)),
                   'IN COLUMN A': pd.Series(np.setdiff1d(df.A, df.B)),
                   'IN COLUMN B': pd.Series(np.setdiff1d(df.B, df.A))})
print (df)
  IN BOTH IN COLUMN A IN COLUMN B
0                 CCC         FFF
1     AAA         GGG         HHH
2     BBB         JJJ         ZZZ
3     DDD         NaN         NaN
4     EEE         NaN         NaN
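Either way, since the goal was one CSV with the three columns, the resulting frame can be written out directly (a sketch; the file name is an assumption):
# NaN cells are written as empty fields by default
df.to_csv('compared_columns.csv', index=False)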
Say I have the following file test.txt:
Aaa Bbb
Foo 0
Bar 1
Baz NULL
(The separator is actually a tab character, which I can't seem to input here.)
And I try to read it using pandas (0.10.0):
In [523]: pd.read_table("test.txt")
Out[523]:
Aaa Bbb
0 Foo NaN
1 Bar 1
2 Baz NaN
Note that the zero value in the first row has suddenly turned into NaN! I was expecting a DataFrame like this:
Aaa Bbb
0 Foo 0
1 Bar 1
2 Baz NaN
What do I need to change to obtain the latter? I suppose I could use pd.read_table("test.txt", na_filter=False) and subsequently replace 'NULL' values with NaN and change the column dtype. Is there a more straightforward solution?
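(For reference, that workaround would look roughly like the sketch below, although I'm hoping for something cleaner:)
import numpy as np
import pandas as pd

# read without NA filtering, then restore the NULL -> NaN conversion
# and the numeric dtype by hand
df = pd.read_table("test.txt", na_filter=False)
df["Bbb"] = df["Bbb"].replace("NULL", np.nan).astype(float)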
I think this is issue #2599, "read_csv treats zeroes as nan if column contains any nan", which is now closed. I can't reproduce in my development version:
In [27]: with open("test.txt") as fp:
....: for line in fp:
....: print repr(line)
....:
'Aaa\tBbb\n'
'Foo\t0\n'
'Bar\t1\n'
'Baz\tNULL\n'
In [28]: pd.read_table("test.txt")
Out[28]:
Aaa Bbb
0 Foo 0
1 Bar 1
2 Baz NaN
In [29]: pd.__version__
Out[29]: '0.10.1.dev-f7f7e13'
Try:
import pandas as pd
df = pd.read_table("14256839_input.txt", sep=" ", na_values="NULL")
print df
print df.dtypes
This gives me
Aaa Bbb
0 Foo 0
1 Bar 1
2 Baz NaN
Aaa object
Bbb float64
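On a current pandas version you can also be explicit about which strings count as missing, so that only 'NULL' is converted and nothing else gets reinterpreted (a sketch; the tab separator is taken from the question):
import pandas as pd

# disable the built-in NA sentinels and declare only 'NULL' as missing;
# Bbb still ends up float64 with NaN for the NULL cell
df = pd.read_csv("test.txt", sep="\t", keep_default_na=False, na_values=["NULL"])
print(df)
print(df.dtypes)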