I would like to extract the words from ":" to slash - python

I asked this question before, and I am editing it now because I found some lines that don't match the format I gave earlier.
Here's an example of the lines:
data = "09:55:04.125 mta Messages I Doc O:SERVER (NVS:SMTP/me#domain.com) R:NVS:FAXG3.I0.0101 mid:6393"
data2= "09:55:05.045 mta Messages I Doc O:SERVER (NVS:SMTP/me#domain.com) R:ADMIN (NVS:SMTP.0/me#domain.fr) mid:6397"
At first I matched what's between the colon and the slash, but I've noticed that in some lines, like the first one, the type "FAXG3.I0.0101" isn't followed by a slash.
Here's the regex I use:
exp = result = re.findall(r'[\w\.]+(?=:*)',data) # type S & D
The result I want is 'SMTP', 'FAXG3.I0.0101' for the first line and 'SMTP', 'SMTP.0' for the second.
Can someone help me correct my regex to get that?

You just need to change the regex such that it also accepts '.' as a valid character, e.g.:
import re
data = "This is a test message I Res O:Myself (KTP:SMTP/me#domain.com) R:KTP:SMS.CLASS/+345854595 id:21"
result = re.findall(r'[\w\.]+(?=:*/)',data)
print(result)
['SMTP', 'SMS.CLASS']
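The [\w\.]+ part accepts a sequence of one or more characters, each of which is either a word character (\w: letters, digits, or the underscore) or a literal dot. The dot is written as \. here; outside a character class an unescaped . means 'any character'. The lookahead (?=:*/) then requires a slash to follow, optionally preceded by colons.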

That should work:
result = re.findall(r'(?<=:)[\w.]+(?=/)',data)
It reads as: "a sequence of alphanumeric characters (or underscores or dots) that is preceded by a : and followed by a /".
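Neither pattern above returns FAXG3.I0.0101 from the first sample line, because that token is not followed by a slash. As a minimal sketch, assuming the wanted token always comes directly after the literal prefix NVS: (true of the two sample lines, but only an assumption about the rest of the log), one could anchor on that prefix instead of on the slash:

```python
import re

data = "09:55:04.125 mta Messages I Doc O:SERVER (NVS:SMTP/me#domain.com) R:NVS:FAXG3.I0.0101 mid:6393"
data2 = "09:55:05.045 mta Messages I Doc O:SERVER (NVS:SMTP/me#domain.com) R:ADMIN (NVS:SMTP.0/me#domain.fr) mid:6397"

# Assumption: the type always comes right after "NVS:".
# [\w.]+ stops at the first character that is neither a word character
# nor a dot, so it ends at "/" as well as at a space.
TYPE_RE = re.compile(r'NVS:([\w.]+)')

result1 = TYPE_RE.findall(data)
result2 = TYPE_RE.findall(data2)
print(result1)  # ['SMTP', 'FAXG3.I0.0101']
print(result2)  # ['SMTP', 'SMTP.0']
```

If other prefixes than NVS: occur in the real log, the prefix part of the pattern would have to be widened accordingly.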

Related

How can I use regex to remove repeated characters from a string?

I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb"
for i in input:
    input = re.sub("(.)\\1+", "", input)
print(input)
Now I need to let the user specify the value of k, the number of consecutive identical characters to remove.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb"
k = 3
for i in input:
    input = re.sub("(.)\\1+{"+(k-1)+"}", "", input)
print(input)
The for i in input : does not do what you need. i is each character in the input string, and your re.sub is supposed to take the whole input as a char sequence.
If you plan to match a specific amount of chars you should get rid of the + quantifier after \1. The limiting {min,} / {min,max} quantifier should be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb";
k=3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.
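To let the user pick k, the same pattern can be wrapped in a small function. This is only a sketch: the name remove_k_runs and the k < 2 check are assumptions added for illustration, not part of the answer above.

```python
import re

def remove_k_runs(text, k):
    """Delete each non-overlapping block of k identical consecutive characters."""
    if k < 2:
        raise ValueError("k must be at least 2")
    # For k = 3 this builds the pattern (.)\1{2}
    pattern = fr"(.)\1{{{k - 1}}}"
    return re.sub(pattern, "", text)

print(remove_k_runs("abccbcbbb", 3))  # abccbc
print(remove_k_runs("abccbcbbb", 2))  # abbcb
```

Note that a run longer than k keeps its leftover characters: with k = 3, "bbbb" becomes "b".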
If I were you, I would do it as suggested before. But since I've already spent time on this question, here is my handmade solution.
The pattern below creates a named group called "letter". The group is re-evaluated at each position, so first it is a, then b, and so on. A lookahead then checks whether the same letter follows immediately.
So it matches every letter that is directly followed by a copy of itself and replaces it with an empty string, which leaves one letter from each run of repeats.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result
Just to make sure I have the question right: you mean to turn "abccbcbbb" into "abcbcb", removing only sequential duplicate characters? Is there a reason you need to use regex? You could likely do it with a simple loop. This is a really quick and dirty way to do it, but you could just put
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
    # Keep a letter only when it differs from the one before it
    if letter != previous:
        result.append(letter)
    previous = letter
result = "".join(result)
With a method like this, you could make it easier to read and faster with a bit of modification, I'd assume.
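One possible tightening of that loop, assuming the goal is to keep a single character from each run of repeats, is itertools.groupby from the standard library. The function name collapse_repeats is made up for this sketch.

```python
from itertools import groupby

def collapse_repeats(text):
    """Keep one character from each run of identical consecutive characters."""
    # groupby yields one (key, group) pair per run of equal items
    return "".join(key for key, _group in groupby(text))

print(collapse_repeats("abccbcbbb"))  # abcbcb
```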

How to find part of string?

I am working with a string. I could find the part of string I need but not all of it. Which part of my code needs to change?
s = "3D(filters:!!(),refreshInterval:(pause:!!t,value:0),time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))%26_a%3D(description:!%27!%27,filters:!!(),fullScreenMode:!!"
report_time = s[s.find("time:(") + 1:s.find("))")]
Output I need:
>>> report_time
'time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))'
Output I have:
>>> report_time
'ime:(from:!%272019-10-01T20:28:50.088Z!%27,to:now)'
You put the "+1" on the wrong index. The slice needs to start at the first find location and end two characters past the second one, so that both closing parentheses are included (thanks to smac89 for catching the second character).
report_time = s[s.find("time:("):s.find("))") + 2]
Output:
'time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))'
Alternatively use a regular expression, e.g:
import re
re.search(r'(time:\(.*\)\))', s).group(1)
Explanation: group(1) returns the content matched by the first set of parentheses. .* matches any characters in between. The literal parentheses in your search term need to be escaped.
Output:
'time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))'
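Because .* is greedy, the pattern above runs to the last )) in the string, which would over-match if another )) appeared later. A hedged variant, assuming the block you want ends at the first )) after time:(, uses the non-greedy .*? and also handles the case where nothing matches. The helper name extract_time_block is hypothetical.

```python
import re

s = "3D(filters:!!(),refreshInterval:(pause:!!t,value:0),time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))%26_a%3D(description:!%27!%27,filters:!!(),fullScreenMode:!!"

def extract_time_block(text):
    # .*? is non-greedy: it stops at the first "))" after "time:("
    match = re.search(r'time:\(.*?\)\)', text)
    return match.group(0) if match else None

report_time = extract_time_block(s)
print(report_time)
# time:(from:!%272019-10-01T20:28:50.088Z!%27,to:now))
```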

I want to replace a special character with a space

Here is the code I have so far:
dex = tree.xpath('//div[@class="cd-timeline-topic"]/text()')
names = filter(lambda n: n.strip(), dex)
table = str.maketrans(dict.fromkeys('?:,'))
for index, name in enumerate(dex, start=0):
    print('{}.{}'.format(index, name.strip().translate(table)))
The problem is that the output also contains strings with another special character, such as "My name is/Richard". What I need is to replace that character with a space, so that the printed output is "My name is Richard". Can anyone help me?
Thanks!
Your call to dict.fromkeys() does not include the character / in its argument.
If you want to map all the special characters to None, just passing your special characters to dict.fromkeys() should be enough. If you want to replace them with a space, you could then iterate over the dict and set the value to " " for each key.
For example:
special_chars = "?:/"
special_char_dict = dict.fromkeys(special_chars)
for k in special_char_dict:
    special_char_dict[k] = " "
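For completeness, here is a sketch of how such a dict could be turned into a translation table and applied. It relies on the optional second argument of dict.fromkeys() to set every value to a space in one step; the sample string is the one from the question.

```python
special_chars = "?:/"

# dict.fromkeys(keys, value) gives every key the same value, here a space
table = str.maketrans(dict.fromkeys(special_chars, " "))

cleaned = "My name is/Richard".translate(table)
print(cleaned)  # My name is Richard
```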
You can do this by extending your translation table:
dex = ["My Name is/Richard????::,"]
table = str.maketrans({'?':None,':':None,',':None,'/':' '})
for index, name in enumerate(dex, start=0):
    print('{}.{}'.format(index, name.strip().translate(table)))
OUTPUT
0.My Name is Richard
You want to remove most special characters (map them to None) but replace the forward slash with a space. You could handle the forward slash with a separate method, as the other answers here do, or you could extend your translation table as above, mapping the other special characters to None and the forward slash to a space. With this approach, each character can have its own replacement.
Alternatively, you could use the re.sub function in the following way:
import re
s = 'Te/st st?ri:ng,'
out = re.sub(r'\?|:|,|/',lambda x:' ' if x.group(0)=='/' else '',s)
print(out) #Te st string
The arguments of re.sub are as follows. The first is the pattern, which tells re.sub which substrings to replace; ? needs to be escaped because it otherwise has a special meaning in a pattern, and | means "or", so re.sub will look for ? or : or , or /. The second argument is a function that returns the text to use in place of each matched substring: a space for / and an empty string for anything else. The third argument is the string to be changed.
>>> a = "My name is/Richard"
>>> a.replace('/', ' ')
'My name is Richard'
To replace a character or sequence of characters in a string, you can use the .replace() method. So the solution to your question is:
name.replace("/", " ")

fuzzy string split in Python 2.x

Input file:
rep_origin 607..1720
/label=Rep
Region 2643..5020
/label="region"
extra_info and stuff
I'm trying to split by the first column-esque entry. For example, I want to get a list that looks like this...
Desired Output:
['rep_origin 607..1720 /label=Rep', 'Region 2643..5020 /label="region" extra_info and stuff']
I tried splitting by ' ' but that gave me some crazy stuff. If I could add a "fuzzy" search term at the end that matches any alphabet character but NOT whitespace, that would solve the problem. I suppose you could do it with a regex and findall, using something like ' [A-Z]', but I wasn't sure if there was a less complicated way.
Is there a way to add a "fuzzy" search term at the very end of the string.split separator (i.e. original_string.' [alphabet_character]')?
I'm not sure exactly what you're looking for, but the parse function below takes the text from your question and returns a list of sections. Each section is a list of the lines in that section, with leading and trailing whitespace removed.
#!/usr/bin/env python
import re
# This is the input from your question. Section headers are indented
# by 4 spaces; continuation lines are indented further.
INPUT_TEXT = '''\
    rep_origin 607..1720
        /label=Rep
    Region 2643..5020
        /label="region"
        extra_info and stuff'''
# A regular expression that matches the start of a section. A section
# start is a line that has 4 spaces before the first non-space
# character.
match_section_start = re.compile(r'^ {4}[^ ]').match
def parse(text):
    sections = []
    section_lines = None
    def append_section_if_lines():
        if section_lines:
            sections.append(section_lines)
    for line in text.split('\n'):
        if match_section_start(line):
            # We've found the start of a new section. Unless this is
            # the first section, save the previous section.
            append_section_if_lines()
            section_lines = []
        section_lines.append(line.strip())
    # Save the last section.
    append_section_if_lines()
    return sections
sections = parse(INPUT_TEXT)
print(sections)
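If the goal is the flat list of strings shown under "Desired Output", a shorter route might be re.split with a lookahead. This is a sketch under the same assumption as above, namely that a section header is a line indented by exactly 4 spaces and continuation lines are indented further; the sample text and its indentation widths are illustrative.

```python
import re

text = (
    "    rep_origin 607..1720\n"
    "        /label=Rep\n"
    "    Region 2643..5020\n"
    "        /label=\"region\"\n"
    "        extra_info and stuff"
)

# Split only at newlines that are followed by a section header
# (exactly 4 spaces, then a non-space character).
chunks = re.split(r'\n(?= {4}\S)', text)

# Collapse all whitespace inside each chunk to single spaces.
records = [' '.join(chunk.split()) for chunk in chunks]
print(records)
# ['rep_origin 607..1720 /label=Rep', 'Region 2643..5020 /label="region" extra_info and stuff']
```

The lookahead keeps the header's own indentation out of the separator, so nothing from the header line is consumed by the split.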

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any string possible and should take into account spaces and any other weird thing that might appear in the long string (this should be possible with regex, right?). Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same, and so is someDomainName! But there's another http://www. in the string.
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
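A quick way to try that pattern is against a made-up input. The long_string below is hypothetical; it contains a second http://www. URL, as mentioned in the update, to show that only the wanted one is returned.

```python
import re

# Hypothetical input with an unrelated URL and some stray whitespace
long_string = "see http://www.other.org/page and   http://www.someDomainName.com/2134 , thanks"

results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
print(results)  # ['http://www.someDomainName.com/2134']
```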
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
The pattern above has four capturing groups, plus group 0:
Group 0 is the complete string that was matched.
The rest are numbered in the order of their opening parentheses, so the one you are looking for is the second, (\\w*).
If you want, you can capture only the part of the string you are interested in: remove the parentheses from the parts of the pattern you don't need and keep just (\\w*).
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the example above there are no groups 2, 3 and 4 as in the previous example, because only one group is captured. Group 0 is always available: it is the complete string that matched.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in someDomainName, you can just find the first occurrence of the string ".com/" and take everything from that index on.
This avoids regular expressions, which can be harder to maintain.
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/') + 5:])  # 5 == len('.com/')
