How to replace only particular dots in the string? - python

How to replace only particular dots in the text string?:
string_expample = '123|4.3|123.54|sdflk|hfghjkkf.ffg..t.s..9.7..tg..3..654..2.fd'
I need to get only dots that are 1 and between 2 digits( 4.3 from |4.3|; 3.5 from 123.54, etc.)
be replaced by commas in the original string, is it possible?
If so, how?
So, the result string must be:
string_final = '123|4,3|123,54|sdflk|hfghjkkf.ffg..t.s..9,7..tg..3...654..2.fd'
Thanks in advance.

import re
string_example = '123|4.3|123.54|sdflk|hfghjkkf.ffg..t.s..4..tg..3...654..2.fd'
string_final = re.sub(r'(\d)\.(\d)', r'\1,\2', string_example)
print(string_final)
123|4,3|123,54|sdflk|hfghjkkf.ffg..t.s..4..tg..3...654..2.fd
We use a regular expression to find "digit . digit" (the digits are captured into groups with parentheses) and replace them with "group 1 , group 2" (the groups are the corresponding digits).

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

eg
Arun,Mishra,108,23,34,45,56,Mumbai
o\p I want is
Arun,Mishra,108.23,34,45,56,Mumbai
Tried to replace the comma with dot but all the demiliters are replaced with comma
tried text.replace(',','.') but replacing all the commas with dot
You can use regex for these kind of tasks:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, 1)
>>> 'Arun,Mishra,108.23,34,45,56,Mumbai'
The search pattern r'(\d+)(,)(\d+)' was to find a comma between two numbers. There are three capture groups, therefore one can use them in the replacement: r\1.\3 (\1 and \3 are first and third groups). The old_str is the string and 1 is to tell the pattern to only replace the first occurrence (thus keep 34, 45).
It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for all/any commas. Once the index of a comma has been identified, examine the characters either side (checking for digits). If such a pattern is observed, modify the string accordingly
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
if s[pos-1].isdigit() and s[pos+1].isdigit():
s = s[:pos] + '.' + s[pos+1:]
break
pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai
Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. Having a generic regex to replace any comma between two subsequent sets of digits might give false matches on other data however so I think explicitly matching the data based on the exact columns you have will be the safest way to do it.
You can take the above regex and code it into your python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)'
, r'\1,\2,\3.\4,\5,\6,\7,\8', inLine, 0)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.
First split the string using s.split() and then replace ',' in 2nd element
after replacing join the string back again.
s= 'Arun,Mishra,108,23,34,45,56,Mumbai '
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
It changes all the commas to dots if dot have numbers before and after itself.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = 0
while commaindex != -1:
commaindex = txt.find(",",commaindex+1)
if txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
txt = txt[0:commaindex] + "." + txt[commaindex+1:len(txt)+1]
print(txt)

Pandas: count dots in a string - same as length?

I'm trying to count the number of dots in an email address using Python + Pandas.
The first record is "addison.shepherd#gmail.com". It should count 2 dots. Instead, it returns 26, the length of the string.
import pandas as pd
url = "http://profalibania.com.br/python/EmailsDoctors.xlsx"
docs = pd.read_excel(url)
docs["PosAt"] = docs["Email"].str.count('.')
Can anybody help me? Thanks in advance!
pandas.Series.str.count takes a regex expression as input. To match a literal period (.), you must escape it:
docs["Email"].str.count('\.')
Just specifying . will use the regex meaning of the period (matching any single character)
The .str.count(..) method [pandas-doc] works with a regular expression [wiki]. This is specified in the documentation:
This function is used to count the number of times a particular regex pattern is repeated in each of the string elements of the Series.
For a regex, the dot means "all characters except new line". You can use a character set (by surrounding it by square brackets):
docs["PosAt"] = docs["Email"].str.count('[.]')
A variant here would be to compare the length of the original email column with the length of that column with all dots removed:
docs["Email"].str.len() - docs["Email"].str.replace("[.]", "").len()

How To Extract Three Letters Followed By Five Digits Using Regex in Python

I have the following dataframe in Python:
abc12345
abc1234
abc1324.
How do I extract only the ones that have three letters followed by five digits?
The desired result would be:
abc12345.
df.column.str.extract('[^0-9](\d\d\d\d\d)$')
I think this works, but is there any better way to modify (\d\d\d\d\d) ?
What if I had like 30 digits. Then I'll have to type \d 30 times, which is inefficient.
You should be able to use:
'[a-zA-Z]{3}\d{5}'
If the strings don't include capital letters this can reduce to:
'[a-z]{3}\d{5}'
Change the values in the {x} to adjust the number of chars to capture.
Or like this following code:
'
import re
s = "abc12345"
p = re.compile(r"\d{5}")
c = p.match(s,3)
print(c.group())
'

Python regular expression match number in string

I used regular expression in python2.7 to match the number in a string but I can't match a single number in my expression, here are my code
import re
import cv2
s = '858 1790 -156.25 2'
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{1,10}')
data = re.findall(re_matchData, s)
print data
and then print:
['858', '1790', '-156.25']
but when I change expression from
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{1,10}')
to
re_matchData = re.compile(r'\-?\d{0,10}\.?\d{1,10}')
then print:
['858', '1790', '-156.25', '2']
is there any confuses between d{1, 10} and d{0,10} ?
If I did wrong, how to correct it ?
Thanks for checking my question !
try this:
r'\-?\d{1,10}(?:\.\d{1,10})?'
use (?:)? to make fractional part optional.
for r'\-?\d{0,10}\.?\d{1,10}', it is \.?\d{1,10} who matched 2.
The first \d{1,10} matches from 1 to 10 digits, and the second \d{1,10} also matches from 1 to 10 digits. In order for them both to match, you need at least 2 digits in your number, with an optional . between them.
You should make the entire fraction optional, not just the ..
r'\-?\d{1,10}(?:\.\d{1,10})?'
I would rather do as follows:
import re
s = '858 1790 -156.25 2'
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{0,10}')
data = re_matchData.findall(s)
print data
Output:
['858', '1790', '-156.25', '2']

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers and and anyNumber is different in each long string. The code should extract the desired string from any string possible and should take into account spaces and any other weird thing that might appear in the long string - should be possible with regex right? -. Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same. Also someDomainName! But there's another http://www. in the string
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
print matches.group(0)
print matches.group(1)
print matches.group(2)
print matches.group(3)
print matches.group(4)
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern, we have captured 5 groups -
One is the complete string that is matched
Rest are in the order of the brackets you see.. (So, you are looking for the second one..) - (\\w*)
If you want, you can capture only the part of the string you are interested in.. So, you can remove the brackets from rest of the pattern that you don't want and just keep (\w*)
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = patter.search("http://www.someDomainName.com/2134")
>>> if matches:
print matches.group(1)
someDomainName
In the above example, you won't have groups - 2, 3 and 4, as in the previous example, as we have captured only 1 group.. And yes group 0 is always captured.. That is the complete string that matches..
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www.(.+).com\/(.+)
matches = matcher.search(yourstring)
if matches:
str1,str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName you can just take the first occurence of the string ".com/" and take everything from that index on
this will avoid you the use of regex which are harder to maintain
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print exp[exp.find('.com/')+5:]

Categories

Resources