How to delete rows that contain the same string again and again - python

I have a DataFrame like below
Name Mail-Body
Oliver I am recently doing AAA, BBB and BBB....
Jack Here is my report. It seemed AAA was.. so AAA is..
Jacob How are you doing? Next week we launch our AAA porject...
With this DataFrame I would like to perform some data analysis.
But I found out that emails containing names such as "AAA" and "BBB" many times over tend to be just scheduling notifications and the like, so they are pretty much meaningless.
So I would like to drop all rows whose Mail-Body column contains the same string, such as "AAA" or "BBB", more than 5 times.
Is there any pythonic way to drop all such rows?

Sample:
print (df)
Name Mail-Body
0 Oliver I AAA BBB am recently doing AAA, BBB and BBB
1 Jack AAA AAA. AAA BBB It seemed AAA was.. so AAA is
2 Jacob AAA AAA BBB BBB AAA BBB AAA AAA BBB BBB BBB
3 Bal AAA BBB
If you want to remove rows where AAA occurs 5 or more times, you need to keep the rows with fewer than 5 occurrences of AAA, using Series.str.count, Series.lt and boolean indexing:
df0 = df[df['Mail-Body'].str.count('AAA').lt(5)]
print (df0)
Name Mail-Body
0 Oliver I AAA BBB am recently doing AAA, BBB and BBB
3 Bal AAA BBB
If you want to count AAA and BBB together per row (it does not matter how many are AAA and how many are BBB), use the AAA|BBB pattern:
df1 = df[df['Mail-Body'].str.count('AAA|BBB').lt(5)]
print (df1)
Name Mail-Body
3 Bal AAA BBB
If you want to test AAA and BBB separately, chain the masks with | for bitwise OR - i.e. keep rows with fewer than 5 occurrences of AAA or fewer than 5 occurrences of BBB:
df2 = df[df['Mail-Body'].str.count('AAA').lt(5) | df['Mail-Body'].str.count('BBB').lt(5)]
print (df2)
Name Mail-Body
0 Oliver I AAA BBB am recently doing AAA, BBB and BBB
1 Jack AAA AAA. AAA BBB It seemed AAA was.. so AAA is
3 Bal AAA BBB
And if you want both conditions to hold, chain the masks with & for bitwise AND:
df3 = df[df['Mail-Body'].str.count('AAA').lt(5) & df['Mail-Body'].str.count('BBB').lt(5)]
print (df3)
Name Mail-Body
0 Oliver I AAA BBB am recently doing AAA, BBB and BBB
3 Bal AAA BBB
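All four filters can be run end-to-end; a minimal, self-contained sketch that rebuilds the sample frame from the question:

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "Name": ["Oliver", "Jack", "Jacob", "Bal"],
    "Mail-Body": [
        "I AAA BBB am recently doing AAA, BBB and BBB",
        "AAA AAA. AAA BBB It seemed AAA was.. so AAA is",
        "AAA AAA BBB BBB AAA BBB AAA AAA BBB BBB BBB",
        "AAA BBB",
    ],
})

# Keep rows with fewer than 5 occurrences of AAA.
df0 = df[df["Mail-Body"].str.count("AAA").lt(5)]
# Keep rows with fewer than 5 occurrences of AAA and BBB combined.
df1 = df[df["Mail-Body"].str.count("AAA|BBB").lt(5)]
# OR / AND combinations of the two per-string masks.
df2 = df[df["Mail-Body"].str.count("AAA").lt(5) | df["Mail-Body"].str.count("BBB").lt(5)]
df3 = df[df["Mail-Body"].str.count("AAA").lt(5) & df["Mail-Body"].str.count("BBB").lt(5)]
```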

Related

VLOOKUP in Python Pandas without using MERGE

I have two DataFrames with one common column as the key, and I want to perform a VLOOKUP-style operation to fetch values from the first DataFrame corresponding to the keys in the second DataFrame.
DataFrame 1
key value
0 aaa 111
1 bbb 222
2 ccc 333
DataFrame 2
key value
0 bbb None
1 ccc 333
2 aaa None
3 aaa 111
Desired Output
key value
0 bbb 222
1 ccc 333
2 aaa 111
3 aaa 111
I do not want to use merge because both of my DFs might have NULL values in the key column, and since pandas merge behaves differently from a SQL join, all such rows might get joined with each other.
I tried the approach below
DF2['value'] = np.where(DF2['key'].isnull(), DF1.loc[DF2['key'].equals(DF1['key'])]['value'], DF2['value'])
but I keep getting a KeyError: False error.
You can use:
df2['value'] = df2['value'].fillna(df2['key'].map(df1.set_index('key')['value']))
print(df2)
# Output
key value
0 bbb 222
1 ccc 333
2 aaa 111
3 aaa 111
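A self-contained sketch of this lookup, rebuilding the two frames from the question. Note that set_index('key') assumes the keys in DataFrame 1 are unique; a duplicated key there would make map raise:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["aaa", "bbb", "ccc"], "value": [111, 222, 333]})
df2 = pd.DataFrame({"key": ["bbb", "ccc", "aaa", "aaa"],
                    "value": [None, 333, None, 111]})

# Build a key -> value lookup from df1, then fill only the missing values.
lookup = df1.set_index("key")["value"]
df2["value"] = df2["value"].fillna(df2["key"].map(lookup))
```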

How to add two text rows to one and keep other rows same as before pandas

How to do that with pandas?
original dataframe:
textA TextB
0 a zz
1 bbb zzzzz
2 ccc zzz
desired output is:
textA TextB
0 a bbb zz
1 bbb zzzzz
2 ccc zzz
I mean I just want to append one row's text to a specific row, and keep the other rows' original values.
Do you mean something like:
>>> df.loc[0, 'textA'] += ' ' + df.loc[1, 'textA']
>>> df
textA TextB
0 a bbb zz
1 bbb zzzzz
2 ccc zzz
>>>
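The same one-liner as a self-contained script, rebuilding the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({"textA": ["a", "bbb", "ccc"],
                   "TextB": ["zz", "zzzzz", "zzz"]})

# Append row 1's textA to row 0's textA in place; the other rows are untouched.
df.loc[0, "textA"] += " " + df.loc[1, "textA"]
```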

Merging a split string in python

I'm reading a CSV file in Python and trying to split the values in one of the columns so that I can parse out certain values.
My input would look something like this:
ColA
AA_BBB_CCC_DDD
AAA_BBBB_CCC_DDDDDD
AAAA_B_ZZ_CC_DDD
AAA_BBB_CCCC_DDDD
The entries should be split on an underscore (_). I am using the following:
jobs = pd.read_csv(somefile.csv)
jobs["Val1"] = jobs["ColA"].str.split("_", expand=True)[1]
jobs["Val2"] = jobs["ColA"].str.split("_", expand=True)[2]
print(jobs["Val1"])
print(jobs["Val2"])
That works and gives me output like this -
0 WRE
1 BBB
2 BBBB
3 B
4 BBB
Name: Val1, dtype: object
0 CMD
1 CCC
2 CCC
3 ZZ
4 CCCC
Name: Val2, dtype: object
My issue is that there are instances where the underscore is actually part of Val1 and shouldn't be dropped. If Val1 is 2 characters or fewer, then Val1 really needs to be combined with Val2 to get the correct value.
For example, take the third entry in my example: Val1 would be "B" while Val2 would be "ZZ". As Val1 is only one character, the true value of Val1 should be "B_ZZ".
To try to achieve that, I'm doing the following:
if len(jobs["Val1"]) <= 2:
    jobs["Val1"] = jobs["Val1"] + "_" + jobs["Val2"]
However, that doesn't do anything for me. I get the same result as not including it at all.
If I change the <= value to 5, which is certainly incorrect, it then does the merge. However, it does it on all rows, with output looking like this -
0 WRE_CMD
1 BBB_CCC
2 BBBB_CCC
3 B_ZZ
4 BBB_CCCC
Name: Val1, dtype: object
0 CMD
1 CCC
2 CCC
3 ZZ
4 CCCC
Name: Val2, dtype: object
I'm not sure what I'm missing here, or if there is a better approach to what I'm trying to achieve.
Sorry for the long winded note.
Thanks
Where you are trying this:
if len(jobs["Val1"]) <= 2:
    jobs["Val1"] = jobs["Val1"] + "_" + jobs["Val2"]
the condition never does what you intend, because len(jobs["Val1"]) is the number of rows in the Series, not the length of each string.
Instead, you can pass a function that does this via apply:
def adjust_val1(row):
    if len(row['Val1']) <= 2:
        return row['Val1'] + '_' + row['Val2']
    else:
        return row['Val1']
and then
jobs['Val3'] = jobs.apply(adjust_val1, axis=1)
You can use df['ColA'].str.split with a negative lookbehind regex:
df['ColA'].str.split(r'(?<!_[A-Za-z])_', expand=True)
0 1 2 3
0 AA BBB CCC DDD
1 AAA BBBB CCC DDDDDD
2 AAAA B_ZZ CC DDD
3 AAA BBB CCCC DDDD
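A runnable sketch of the lookbehind split, using the question's sample column. One caveat: (?<!_[A-Za-z]) only protects segments that are a single character long; a two-character segment followed by an underscore would still be split there.

```python
import pandas as pd

df = pd.DataFrame({"ColA": ["AA_BBB_CCC_DDD", "AAA_BBBB_CCC_DDDDDD",
                            "AAAA_B_ZZ_CC_DDD", "AAA_BBB_CCCC_DDDD"]})

# Split on "_" only when it is not immediately preceded by "_<letter>",
# so the trailing piece of a one-letter segment stays attached (B_ZZ).
parts = df["ColA"].str.split(r"(?<!_[A-Za-z])_", expand=True)
```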
Using numpy, maybe this can help:
jobs["Val3"] = np.where(jobs.Val1.str.len()<=2, jobs.Val1+"_"+jobs.Val2, jobs.Val1)
ColA Val1 Val2 Val3
0 AA_BBB_CCC_DDD BBB CCC BBB
1 AAA_BBBB_CCC_DDDDDD BBBB CCC BBBB
2 AAAA_B_ZZ_CC_DDD B ZZ B_ZZ
3 AAA_BBB_CCCC_DDDD BBB CCCC BBB
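Putting the np.where one-liner into a self-contained script, with the split columns built first as in the question:

```python
import numpy as np
import pandas as pd

jobs = pd.DataFrame({"ColA": ["AA_BBB_CCC_DDD", "AAA_BBBB_CCC_DDDDDD",
                              "AAAA_B_ZZ_CC_DDD", "AAA_BBB_CCCC_DDDD"]})
split = jobs["ColA"].str.split("_", expand=True)
jobs["Val1"] = split[1]
jobs["Val2"] = split[2]

# Vectorised fix-up: where Val1 is 2 characters or fewer, re-join it with Val2.
jobs["Val3"] = np.where(jobs["Val1"].str.len() <= 2,
                        jobs["Val1"] + "_" + jobs["Val2"],
                        jobs["Val1"])
```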
I would try to use a function with the apply method:
import pandas as pd
df = pd.DataFrame()
df['col'] = pd.Series(['AA_BBB_CCC_DDD','AAA_BBBB_CCC_DDDDDD','AAAA_B_ZZ_CC_DDD','AAA_BBB_CCCC_DDDD'])
df:
col
0 AA_BBB_CCC_DDD
1 AAA_BBBB_CCC_DDDDDD
2 AAAA_B_ZZ_CC_DDD
3 AAA_BBB_CCCC_DDDD
A function to split as you wish:
def get_val1(x):
    l = x.split('_')
    if len(l[1]) <= 2:
        return l[1] + '_' + l[2]
    else:
        return l[1]
df['val1'] = df['col'].apply(lambda x: get_val1(x))
Result val1:
0 BBB
1 BBBB
2 B_ZZ
3 BBB
Name: val1, dtype: object
df["val2"] = df["col"].str.split("_", expand=True)[2]
Result:
col val1 val2
0 AA_BBB_CCC_DDD BBB CCC
1 AAA_BBBB_CCC_DDDDDD BBBB CCC
2 AAAA_B_ZZ_CC_DDD B_ZZ ZZ
3 AAA_BBB_CCCC_DDDD BBB CCCC

Applying math to columns where rows hold the same value in pandas

I have 2 dataframes which look like this:
df1
A B
AAA 50
BBB 100
CCC 200
df2
C D
CCC 500
AAA 10
EEE 2100
I am trying to output a dataset where column E would be B - D wherever A = C. Since the A values are not aligned with the C values, I can't seem to find the appropriate method to apply the calculation and compare the right numbers.
There are also values which are not shared between the two datasets; in those cases I want to put the text value 'Not found', so that the output would look like this:
output
A B C D E
AAA 50 AAA 10 B-D
BBB 100 Not found Not found Not found
CCC 200 CCC 500 B-D
Not found Not found EEE 2100 Not found
Thank you for your suggestions.
Use an outer join via DataFrame.merge with the left_on and right_on parameters, then subtract the columns; to keep the subtraction numeric, fill only the key columns and leave the numeric missing values in place:
df = (df1.merge(df2, left_on='A', right_on='C', how='outer')
.fillna({'A':'Not found', 'C':'Not found'})
.assign(E = lambda x: x.B - x.D))
print (df)
A B C D E
0 AAA 50.0 AAA 10.0 40.0
1 BBB 100.0 Not found NaN NaN
2 CCC 200.0 CCC 500.0 -300.0
3 Not found NaN EEE 2100.0 NaN
Last, it is possible to replace all missing values, but then the numeric columns are mixed - strings with numbers - so further processing such as arithmetic operations is problematic:
df = (df1.merge(df2, left_on='A', right_on='C', how='outer')
.assign(E = lambda x: x.B - x.D)
.fillna('Not found'))
print (df)
A B C D E
0 AAA 50 AAA 10 40
1 BBB 100 Not found Not found Not found
2 CCC 200 CCC 500 -300
3 Not found Not found EEE 2100 Not found
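A self-contained sketch of the second variant (subtract first, fill afterwards), rebuilding the question's frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["AAA", "BBB", "CCC"], "B": [50, 100, 200]})
df2 = pd.DataFrame({"C": ["CCC", "AAA", "EEE"], "D": [500, 10, 2100]})

# Outer-join on the key columns, subtract while the values are still
# numeric (NaN propagates through the subtraction), then replace the
# leftover NaNs with the text placeholder.
out = (df1.merge(df2, left_on="A", right_on="C", how="outer")
          .assign(E=lambda x: x.B - x.D)
          .fillna("Not found"))
```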

Extracting existing and non existing values from 2 columns using pandas

I am new to pandas and I am trying to get a list of the values that exist in both columns, the values that exist only in column A, and the values that exist only in column B.
My .csv file looks like this:
A B
AAA ZZZ
BBB BBB
CCC EEE
DDD FFF
EEE AAA
DDD
GGG HHH
JJJ
The columns have different lengths, and my outcome would be 3 lists, or one CSV with 3 columns: one for items existing in both columns, one for items existing only in column A, and one for items existing only in column B.
IN BOTH IN COLUMN A IN COLUMN B
AAA CCC ZZZ
BBB GGG FFF
DDD JJJ HHH
EEE
(empty one)
I have tried using .isin(), but it returns True or False rather than the actual list.
existing_in_both = df_column_a.isin(df_column_b)
And I do not know how I should try to extract values that only exist in either column A or B.
Thank you for your suggestions.
My actual .csv has the following:
id clickout_id timestamp click_id click_type
1 123abc 2019-11-25 c51c56d1 1
1 123dce 2019-11-25 c51c5fs1 12
and other file is looking like this:
timestamp id gid type
2019-11-25 1 c51c56d1 2
2019-11-25 1 c51c5fs1 2
And I am trying to compare click_id from first file and gid from the second file.
When I print out using your answer I get the header names as answers rather than the values from the columns.
Use sets with intersection and difference; then, because the outputs have different lengths, wrap them in Series when building the new DataFrame:
a = set(df.A)
b = set(df.B)
df = pd.DataFrame({'IN BOTH': pd.Series(list(a & b)),
'IN COLUMN A': pd.Series(list(a - b)),
'IN COLUMN B': pd.Series(list(b - a))})
print (df)
IN BOTH IN COLUMN A IN COLUMN B
0 DDD CCC FFF
1 BBB GGG ZZZ
2 AAA JJJ HHH
3 NaN NaN
4 EEE NaN NaN
Or use numpy.intersect1d with numpy.setdiff1d:
df = pd.DataFrame({'IN BOTH': pd.Series(np.intersect1d(df.A, df.B)),
'IN COLUMN A': pd.Series(np.setdiff1d(df.A, df.B)),
'IN COLUMN B': pd.Series(np.setdiff1d(df.B, df.A))})
print (df)
IN BOTH IN COLUMN A IN COLUMN B
0 CCC FFF
1 AAA GGG HHH
2 BBB JJJ ZZZ
3 DDD NaN NaN
4 EEE NaN NaN
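A runnable sketch of the set-based approach. The sets are written out directly here (standing in for set(df.A) / set(df.B), with the blank cells dropped), and sorted() is added so the order is deterministic; plain set order is arbitrary, which is why the printed rows above can appear shuffled.

```python
import pandas as pd

# Stand-ins for set(df.A) and set(df.B) from the question's sample.
a = {"AAA", "BBB", "CCC", "DDD", "EEE", "GGG"}
b = {"ZZZ", "BBB", "EEE", "FFF", "AAA", "DDD", "HHH", "JJJ"}

# Wrap each sorted list in a Series so the unequal lengths pad out with NaN.
out = pd.DataFrame({"IN BOTH": pd.Series(sorted(a & b)),
                    "IN COLUMN A": pd.Series(sorted(a - b)),
                    "IN COLUMN B": pd.Series(sorted(b - a))})
```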
