I have a df in which I want to filter a column with str.startswith and then strip off the matched prefixes. Example:
df = pd.DataFrame(data={'fname': ['Anky', 'Anky', 'Tom', 'Harry', 'Harry', 'Harry'],
                        'lname': ['sur1', 'sur1', 'sur2', 'sur3', 'sur3', 'sur3'],
                        'role': ['', 'abc', 'def', 'ghi', '', 'ijk'],
                        'mobile': ['08511663451212', '', '0851166346', '', '0851166347', ''],
                        'Pmobile': ['085116634512', '1234567890', '8885116634',
                                    '', '+353051166347', '0987654321'],
                        'Isactive': ['Active', '', '', '', 'Active', '']})
By executing the line below:
df['Pmobile'][df['Pmobile'].str.startswith(('08','8','+353'),na=False)]
I get:
0 085116634512
2 8885116634
4 +353051166347
How do I strip only the prefixes I passed to str.startswith(), for example ('08','8','+3538'), and not touch any other digits except those leading ones in the tuple (on the fly)?
I found this most convenient and concise:
df.Pmobile = df.Pmobile.replace(r'^(08|88|\+3538)', '', regex=True)
Note that the alternatives need a group (...) rather than a character class [...], and regex=True is required for Series.replace to treat the pattern as a regular expression.
You can use pandas' replace with regex. Below is sample code:
df.Pmobile.replace(regex={r'^08':'',r'^8':'',r'^[+]353':''})
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html
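For instance, a quick sketch applying that regex mapping to the Pmobile column from the question (expected result shown in the trailing comment):
import pandas as pd

df = pd.DataFrame({'Pmobile': ['085116634512', '1234567890', '8885116634',
                               '', '+353051166347', '0987654321']})
# Each anchored pattern strips only its matching leading prefix;
# values that start with none of them are left untouched.
df['Pmobile'] = df['Pmobile'].replace(regex={r'^08': '', r'^8': '', r'^[+]353': ''})
print(df['Pmobile'].tolist())
# ['5116634512', '1234567890', '885116634', '', '051166347', '987654321']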
I have two lists, one that contains the user input and the other one that contains the mapping.
The user input looks like this:
The mapping looks like this:
I am trying to split the strings in the user input list. Sometimes they enter one record as CO109CO45, but in reality these are two codes that don't belong together. They need to be separated with a comma or space, as in CO109,CO45.
There are many examples with the same behavior, and I was thinking of using a mapping list to match and split. Is this something that can be done? What do you suggest? Thanks in advance for your help!
Use a combination of lookahead and lookbehind regexes in the split.
df = pd.DataFrame({'RCode': ['CO109', 'CO109CO109']})
print(df)
RCode
0 CO109
1 CO109CO109
df.RCode.str.split(r'(?<=\d)(?=\D)')
0 [CO109]
1 [CO109, CO109]
Name: RCode, dtype: object
You can try with regex:
import pandas as pd
l = ['CO2740CO96', 'CO12', 'CO973', 'CO870CO397', 'CO584', 'CO134CO42CO685']
df = pd.DataFrame({'code':l})
df.code = df.code.str.findall(r'[A-Za-z]+\d+')
print(df)
Output:
code
0 [CO2740, CO96]
1 [CO12]
2 [CO973]
3 [CO870, CO397]
4 [CO584]
5 [CO134, CO42, CO685]
I usually use something like this, for an input original_list:
output_list = [
    [
        ('CO' + target).strip(' ,')
        for target in item.split('CO')
        if target  # skip the empty string produced before a leading 'CO'
    ]
    for item in original_list
]
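For example (hypothetical input, assuming every code uses the 'CO' prefix):
original_list = ['CO109CO45', 'CO12', 'CO109, CO45']
# output_list == [['CO109', 'CO45'], ['CO12'], ['CO109', 'CO45']]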
There are probably more efficient ways of doing it, but you don't need the overhead of dataframes/pandas, or the hard-to-read aspects of regexes.
If you have a manageable number of prefixes ("CO", "PR", etc.), you can set up a recursive function splitting on each of them, as in the sketch below, or you can use .find() with the full codes.
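A minimal sketch of that recursive idea; the prefix set ('CO', 'PR') is an assumption for illustration:
def split_codes(text, prefixes=('CO', 'PR')):
    # No prefixes left to split on: keep the remaining chunk if non-empty.
    if not prefixes:
        return [text] if text else []
    head, rest = prefixes[0], prefixes[1:]
    pieces = text.split(head)
    # The first piece never followed a `head` prefix, so recurse on it as-is.
    out = split_codes(pieces[0], rest)
    # Every later piece followed a `head`: recurse on the remaining prefixes,
    # then re-attach `head` to the first sub-code.
    for piece in pieces[1:]:
        sub = split_codes(piece, rest) or ['']
        sub[0] = head + sub[0]
        out.extend(sub)
    return out

print(split_codes('CO109CO45'))  # ['CO109', 'CO45']
print(split_codes('CO109PR12'))  # ['CO109', 'PR12']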
I'm trying to keep the rows in a data frame whose text column contains a specific word. I have tried the following:
df['hello' in df['text_column'].split()]
and received the following error:
'Series' object has no attribute 'split'
Please note that I'm trying to check whether they contain a whole word, not a character sequence, so df[df['text_column'].str.contains('hello')] is not a solution, because in that case 'helloss' or 'sshello' would also be returned as True.
Another answer, in addition to the regex answer mentioned above, is to use split combined with the map function, like below:
df['keep_row'] = df['text_column'].map(lambda x: 'hello' in x.split())
df = df[df['keep_row']]
OR
df = df[df['text_column'].map(lambda x: 'hello' in x.split())]
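If a regex is acceptable after all, a word-boundary pattern gives the same whole-word check without splitting (a sketch with made-up sample rows):
import pandas as pd

df = pd.DataFrame({'text_column': ['hello world', 'helloss', 'sshello', 'say hello']})
# \b matches a word boundary, so substrings like 'helloss' and 'sshello' are excluded.
df = df[df['text_column'].str.contains(r'\bhello\b')]
print(df)
#    text_column
# 0  hello world
# 3    say hello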
I have a column of strings consists of the following values:
'20/25+1'
'9/200E'
'20/50+1'
'20/30 # 8 inches'
'20/60-2+1'
'20/20 !!'
'20/20(slow)'
'20/70-1 "slowly"'
And I only want the first fraction, so I am trying to find a way to get to the following values:
'20/25'
'9/200'
'20/50'
'20/30'
'20/60'
'20/20'
'20/20'
'20/70'
I have tried the following command but it doesn't seem to do the job:
df['colname'].apply(lambda x: x.rstrip(' .*')).unique()
How can I fix it? Thanks in advance!
Assuming that the fraction always starts the column's value, we can use str.extract here as follows:
df['pct'] = df['colname'].str.extract(r'^(\d+/\d+)')
Demo
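A minimal reproduction, rebuilding the column from the sample values in the question:
import pandas as pd

df = pd.DataFrame({'colname': ['20/25+1', '9/200E', '20/50+1', '20/30 # 8 inches',
                               '20/60-2+1', '20/20 !!', '20/20(slow)', '20/70-1 "slowly"']})
# ^ anchors the match at the start; (\d+/\d+) captures the leading fraction.
df['pct'] = df['colname'].str.extract(r'^(\d+/\d+)')
print(df)

            colname    pct
0           20/25+1  20/25
1            9/200E  9/200
2           20/50+1  20/50
3  20/30 # 8 inches  20/30
4         20/60-2+1  20/60
5          20/20 !!  20/20
6       20/20(slow)  20/20
7  20/70-1 "slowly"  20/70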
Currently, the output I am getting is in string format, and I am not sure how to convert that string to a pandas dataframe.
I am getting 3 different tables in my output, all inside one string.
One of the following two solutions would work for me:
Convert that string output to 3 different dataframes, OR
change something in the function so that I get the output as 3 different dataframes.
I have tried using RegEx to convert the string output to a dataframe, but it won't work in my case since I want my output to be dynamic: it should also work if I give another input.
def column_ch(self, sample_count=10):
    report = render("header.txt")
    match_stats = []
    match_sample = []
    any_mismatch = False
    for column in self.column_stats:
        if not column["all_match"]:
            any_mismatch = True
            match_stats.append(
                {
                    "Column": column["column"],
                    "{} dtype".format(self.df1_name): column["dtype1"],
                    "{} dtype".format(self.df2_name): column["dtype2"],
                    "# Unequal": column["unequal_cnt"],
                    "Max Diff": column["max_diff"],
                    "# Null Diff": column["null_diff"],
                }
            )
            if column["unequal_cnt"] > 0:
                match_sample.append(
                    self.sample_mismatch(column["column"], sample_count, for_display=True)
                )
    if any_mismatch:
        for sample in match_sample:
            report += sample.to_string()
            report += "\n\n"
    print("type is", type(report))
    return report
Since you have a string, you can pass your string into a file-like buffer and then read it with pandas read_csv into a dataframe.
Assuming that your string with the dataframe is called dfstring, the code would look like this:
import io
bufdf = io.StringIO(dfstring)
df = pd.read_csv(bufdf, sep=???)
If your string contains multiple dataframes, split it with split and use a loop.
import io
dflist = []
for sdf in dfstring.split('\n\n'):  # '\n\n' seems to be the separator between two dataframes
    bufdf = io.StringIO(sdf)
    dflist.append(pd.read_csv(bufdf, sep=???))
Be careful to pass an appropriate sep parameter; my ??? means that I am not able to tell what a proper value would be. Your fields are separated by spaces, so you could use sep='\s+', but I see that you also have spaces which are not meant to be separators, so this may cause a parsing error.
sep accepts a regex, so to treat 2 consecutive spaces as the separator you could use sep='\s\s+' (this requires the additional parameter engine='python'). But again, be sure that you have at least 2 spaces between two consecutive fields.
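For example, a minimal sketch of the two-space separator approach; the report fragment here is made up for illustration:
import io
import pandas as pd

sdf = ("Column      # Unequal  Max Diff\n"
       "mobile      3          0.5\n"
       "role        1          0.0")
# Two or more whitespace characters act as the delimiter; engine='python'
# is needed because the separator is a multi-character regex.
df = pd.read_csv(io.StringIO(sdf), sep=r'\s\s+', engine='python')
print(df)
#    Column  # Unequal  Max Diff
# 0  mobile          3       0.5
# 1    role          1       0.0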
See here for reference about the io module and StringIO.
Note that in Python 2 this functionality lived in a module with another name (StringIO), but since the latest pandas versions require Python 3, I guess you are using Python 3.
I have a dataframe where one column contains urls. I want to compare it to a list of string values and wherever they match add a number to a new column.
The column looks something like this:
source
www.fox5.com/some_article
www.nyt.com/some_article
www.fox40news.com/some_article
www.cnn.com/another_article
...
I want to compare it to this list:
sources = ['fox', 'yahoo', 'abcnews', 'google', 'cnn', 'nyt', 'nbc',
'washingtonpost', 'wsj', 'huffingtonpost']
and where the sources value is contained in the source column add the corresponding number of the list location to a new column. So the resulting new column would look something like this:
sources sourcenum
www.fox5.com/some_article 1
www.nyt.com/some_article 6
www.fox40news.com/some_article 1
www.cnn.com/another_article 5
... ...
I've tried using a for loop with a count:
count = 1
for x in sources:
    if x in df.source.values:
        df.sourcenum = count
    count += 1
but the output is just all 0's.
I also tried using numpy's where, but that doesn't accept 10 arguments.
The list could be changed to a dictionary like so, if that would work better:
sources = {'fox':1, 'yahoo':2, 'abcnews':3, 'google':4, 'cnn':5, 'nyt':6,
'nbc':7, 'washingtonpost':8, 'wsj':9, 'huffingtonpost':10}
Any help would be appreciated, thanks.
One way is to use a generator expression with enumerate. In the implementation below we cycle through an enumerated sources list; next extracts the first instance of a partial match, and if no partial match exists, 0 is returned.
sources = ['fox', 'yahoo', 'abcnews', 'google', 'cnn', 'nyt', 'nbc',
'washingtonpost', 'wsj', 'huffingtonpost']
def sourcer(x):
    return next((i for i, j in enumerate(sources, 1) if j in x), 0)
df['sourcenum'] = df['source'].apply(sourcer)
print(df)
source sourcenum
0 www.fox5.com/some_article 1
1 www.nyt.com/some_article 6
2 www.fox40news.com/some_article 1
3 www.cnn.com/another_article 5
It looks like regular expressions can help resolve the problem. Python has the re module, though I'm not an expert in Python.
The idea is to compose a pattern from your sources list and match that pattern against the strings; the position of the matched alternative in the list is the number you need. A rough sketch follows.
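Something along those lines, reusing the sources list and sample urls from the question (a sketch of the idea, not a definitive implementation):
import pandas as pd

df = pd.DataFrame({'source': ['www.fox5.com/some_article',
                              'www.nyt.com/some_article',
                              'www.fox40news.com/some_article',
                              'www.cnn.com/another_article']})
sources = ['fox', 'yahoo', 'abcnews', 'google', 'cnn', 'nyt', 'nbc',
           'washingtonpost', 'wsj', 'huffingtonpost']

# Build one alternation pattern from the list, extract the first match,
# then map it back to its 1-based position (0 when nothing matches).
pattern = '(' + '|'.join(sources) + ')'
positions = {s: i for i, s in enumerate(sources, 1)}
df['sourcenum'] = (df['source'].str.extract(pattern, expand=False)
                   .map(positions).fillna(0).astype(int))
print(df)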
You can also use the tldextract package to get the domain name of the url.
Then apply the get_close_matches function from the difflib package to get the closest string,
and finally use .index to get the corresponding index number from the list of sources:
import tldextract
from difflib import get_close_matches
df['sourcenum'] = df['source'].apply(lambda row: sources.index(
    get_close_matches(
        tldextract.extract(row).domain, sources, cutoff=.5)[0]) + 1)
print(df)
Result:
source sourcenum
0 www.fox5.com/some_article 1
1 www.nyt.com/some_article 6
2 www.fox40news.com/some_article 1
3 www.cnn.com/another_article 5
Note: in the code above, cutoff=.5 was passed to get_close_matches because otherwise no close match was found for fox40news.