Is there a way to remove all letters from a string - python

I have a list of titles with combined dates and descriptions, but I have to reduce this to just a list of dates. Some examples of these titles:
1/16 Stories of Time
5/18 Cock'a'doodle'do
However, some people are really bad at typing and have forgotten the spaces between the dates and the rest of the title. I need to remove everything except for numbers and the slashes between them. Using any method, but preferably regex, is there a simple way to do this? For the record, I do understand how to split and recompile the list for any method that would work on a single string.

You're thinking about this backwards. If you want to extract the date at the start of a line, do that instead of trying to get rid of everything else.
You can use a regex like this: ^\d{1,2}/\d{1,2} which means:
^ start of line
\d digit
{1,2} repeated one or two times
For example:
import re
lines = [
    '1/16 Stories of Time',
    "5/18 Cock'a'doodle'do",
    '6/22Bible']
for line in lines:
    match = re.match(r'^\d{1,2}/\d{1,2}', line)
    if match:
        print(match.group(0))
Output:
1/16
5/18
6/22
(Note that re.match always starts matching from the start of the string, so the ^ is redundant here.)
This is more robust against titles that themselves contain numbers and slashes, for example 4/5 The 39 Steps / The Thirty-Nine Steps -> 4/5.
However, you'll have a problem if someone forgot the space before a title that starts with a number, for example 7/8100 Years of Solitude -> 7/81.
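One way to soften the missing-space problem, sketched here under the assumption that every title starts with a month/day date: check the captured numbers against calendar ranges and fall back to a one-digit day when the two-digit reading is impossible. The helper name extract_date is made up for this example, and the ambiguity does not fully go away (7/112 Angry Men would still come out as 7/11).

```python
import re

# Month and day are captured separately so each can be range-checked.
DATE_RE = re.compile(r'(\d{1,2})/(\d{1,2})')

def extract_date(line):
    """Return the leading M/D date of a title, or None if there is none.

    Hypothetical helper: assumes month/day order and no year.
    """
    match = DATE_RE.match(line)  # re.match only matches at the start
    if not match:
        return None
    month, day = match.group(1), match.group(2)
    if not 1 <= int(month) <= 12:
        return None
    if not 1 <= int(day) <= 31:
        # A two-digit day such as 81 is impossible, so the second digit
        # probably belongs to the title; keep only the first digit.
        day = day[0]
        if day == '0':
            return None
    return f'{month}/{day}'

for title in ['6/22Bible', '7/8100 Years of Solitude', 'No date here']:
    print(title, '->', extract_date(title))
```

The range check only helps when the glued-on digit produces an impossible day, so treat it as a heuristic rather than a fix.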

You can import string to get easy access to a string of all digits, build a set of allowed characters by adding the slash to it, and then drop any character from the date string that's not in there:
import string
allowed = string.digits + "/"  # build a new string instead of modifying string.digits
for character in date_string:
    if character not in allowed:
        date_string = date_string.replace(character, "")
This will convert the date_string 5/18 Cock'a'doodle'do to just 5/18 without using regex at all.

Barmar, in a comment on the original question, had the best answer. To remove everything but the digits and slashes from the string, you can use this one line of code (after import re):
string = re.sub(r'[^\d/]', '', string)
This removes every character that is not a digit or a slash, not just letters. Thank you Barmar; if you want to post this as an answer, I can take this down and flag that instead.
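A small sketch of that one-liner applied to a whole list, mainly to show where it goes wrong. The titles list is made up from examples elsewhere in this thread; note how digits inside the title itself survive and get glued onto the date.

```python
import re

titles = [
    "1/16 Stories of Time",
    "6/22Bible",
    "4/5 The 39 Steps",
]

# Keep only digits and slashes in each title.
cleaned = [re.sub(r"[^\d/]", "", title) for title in titles]
print(cleaned)
```

The last title comes out as 4/539, so this approach is only safe when the rest of the title never contains digits or slashes.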

string = "rk3k3rr3kk____"
print("".join([letter for letter in string if not letter.isalpha()]))
But this is what you actually want, since your data always seems to have a specific format:
string.split(" ")[0]
Okay, okay, okay... this is what you want:
string[:4]
For completeness' sake:
string = " 2/24 4/12 333333 effee24/22"
for i, x in enumerate(string):
    if len(string) <= i + 4:
        break
    if i > 0 and x != " " and not x.isalpha():
        continue
    if not string[i+1].isnumeric():
        continue
    if string[i+2] != "/":
        continue
    if not string[i+3].isnumeric():
        continue
    if not string[i+4].isnumeric():
        continue
    if len(string) == i + 6 and string[i+5] != " " and not string[i+5].isalpha():
        continue
    print(string[i+1:i+5])
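For comparison, a regex sketch of the same "find dates anywhere in the string" idea. It is not a drop-in replacement for the loop above: the loop only accepts a one-digit month followed by a two-digit day, while this pattern (my assumption about what is wanted) accepts one or two digits on both sides and refuses to start or stop in the middle of a longer number.

```python
import re

string = " 2/24 4/12 333333 effee24/22"

# (?<!\d) and (?!\d) stop the match from being cut out of a longer digit run.
DATE_ANYWHERE = re.compile(r"(?<!\d)\d{1,2}/\d{1,2}(?!\d)")

found = DATE_ANYWHERE.findall(string)
print(found)
```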

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

For example, given
Arun,Mishra,108,23,34,45,56,Mumbai
the output I want is
Arun,Mishra,108.23,34,45,56,Mumbai
I tried text.replace(',','.'), but that replaces all the commas, including the delimiters, with dots.
You can use regex for this kind of task:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, count=1)
print(new_str)
# Arun,Mishra,108.23,34,45,56,Mumbai
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. There are three capture groups, so you can use them in the replacement r'\1.\3' (\1 and \3 are the first and third groups). old_str is the input string, and count=1 tells re.sub to replace only the first occurrence (thus keeping 34,45,56 intact).
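A variation on the same idea, in case the capture groups feel noisy: lookbehind and lookahead assertions check for the digits without consuming them, so the replacement is just the dot. This is only a sketch on the sample line; the second call shows why limiting the count matters here.

```python
import re

old_str = "Arun,Mishra,108,23,34,45,56,Mumbai"

# (?<=\d) requires a digit before the comma, (?=\d) a digit after it.
first_only = re.sub(r"(?<=\d),(?=\d)", ".", old_str, count=1)
print(first_only)

# Without count, every comma between digits is replaced.
all_replaced = re.sub(r"(?<=\d),(?=\d)", ".", old_str)
print(all_replaced)
```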
It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for all/any commas. Once the index of a comma has been identified, examine the characters either side (checking for digits). If such a pattern is observed, modify the string accordingly
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai
Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. A generic regex that replaces any comma between two runs of digits might give false matches on other data, so I think explicitly matching the exact columns you have is the safest way to do it.
You can take the above regex and use it in your Python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(
    r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)',
    r'\1,\2,\3.\4,\5,\6,\7,\8', inLine)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.
First split the string using s.split(','), then join the third and fourth elements with a '.', remove the now-redundant fourth element, and join the list back into a string.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
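Since the question is about a text file, here is a hedged sketch of the same split-and-merge idea applied row by row with the csv module. The file is simulated with io.StringIO so the example runs as-is, and the helper name merge_columns and the assumption that the whole and fractional parts always sit in the third and fourth columns are mine.

```python
import csv
import io

def merge_columns(row, first=2):
    """Join row[first] and row[first + 1] with a dot (hypothetical helper)."""
    merged = f"{row[first]}.{row[first + 1]}"
    return row[:first] + [merged] + row[first + 2:]

# Stand-ins for the real input and output files.
source = io.StringIO("Arun,Mishra,108,23,34,45,56,Mumbai\n")
target = io.StringIO()

writer = csv.writer(target, lineterminator="\n")
for row in csv.reader(source):
    writer.writerow(merge_columns(row))

result = target.getvalue()
print(result, end="")
```

With real files, source and target would be replaced by open(...) handles (opened with newline="" as the csv documentation recommends).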
This changes every comma to a dot if the comma has a digit directly before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # Only replace a comma that sits between two digits (and is not at either end).
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex + 1)
print(txt)

how to split string between different separators in python

I want to pick out substrings from <personne01166+30-90>; the output should look like: +30 and -90.
The strings can be like: 'personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75', etc.
I tried using
<string.split('+')[len(string.split('+')) -1 ].split('+')[0]>
but the output must be the two corresponding numbers.
Here is how you can use a list comprehension and re.findall:
import re
s = ['personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75']
print([re.findall(r'[+-]\d+', i) for i in s])
Output:
[['+0', '-30'], ['+0', '+0'], ['+60', '-75']]
re.findall(r'[+-]\d+', i) finds all non-overlapping matches of the pattern [+-]\d+ in the string i.
[+-] matches either + or -. \d+ matches one or more consecutive digits.
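If the matches are needed as numbers rather than strings, int() accepts the leading sign directly, so the findall result can be converted in one step. A minimal sketch; the helper name signed_offsets is made up for this example.

```python
import re

names = ["personne01144+0-30", "personne01146+0+0", "personne01180+60-75"]

def signed_offsets(name):
    """Return every signed number in name as an int (hypothetical helper)."""
    return [int(token) for token in re.findall(r"[+-]\d+", name)]

offsets = {name: signed_offsets(name) for name in names}
for name, values in offsets.items():
    print(name, values)
```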
If you know the interesting part always comes after the first +, then you can simply split twice and convert the pieces to numbers.
numbers = string.split('+', 1)[1]
if '+' in numbers:
    this, that = (int(n) for n in numbers.split('+'))
elif '-' in numbers:
    this, that = (int(n) for n in numbers.split('-'))
    that = -that
else:
    raise ValueError('Could not parse %s' % string)
Perhaps a regex-based approach makes more sense, though;
import re
m = re.search(r'([-+]\d+)([-+]\d+)$', string)
if m:
    this, that = m.groups()

How to split strings with special characters without removing those characters?

I'm writing a function that needs to return an abbreviated version of a str. The returned str must contain the first letter, the number of characters removed, and the last letter. It must be abbreviated per word, not per sentence, and afterwards I need to join the words again in the same format, including the special characters. I tried using re.findall(), but it drops the special characters, so " ".join() leaves them out of the result.
Here's my code:
import re
def abbreviate(wrd):
    return " ".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.findall(r"[\w']+", wrd)])
print(abbreviate("elephant-rides are really fun!"))
The output would be:
e6t r3s are r4y fun
But the output should be:
e6t-r3s are r4y fun!
No need for str.join. Might as well take full advantage of what the re module has to offer.
re.sub accepts a string or a callable object (like a function or lambda), which takes the current match as an input and must return a string with which to replace the current match.
import re
pattern = "\\b[a-z]([a-z]{2,})[a-z]\\b"
string = "elephant-rides are really fun!"
def replace(match):
    return f"{match.group(0)[0]}{len(match.group(1))}{match.group(0)[-1]}"
abbreviated = re.sub(pattern, replace, string)
print(abbreviated)
Output:
e6t-r3s are r4y fun!
Maybe someone else can improve upon this answer with a cuter pattern, or any other suggestions. The way the pattern is written now, it assumes that you're only dealing with lowercase letters, so that's something to keep in mind - but it should be pretty straightforward to modify it to suit your needs. I'm not really a fan of the repetition of [a-z], but that's just the quickest way I could think of for capturing the "inner" characters of a word in a separate capturing group. You may also want to consider what should happen with words/contractions like "don't" or "shouldn't".
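One possible way to handle the two caveats just mentioned (uppercase letters and contractions), offered as a sketch rather than the definitive pattern. It assumes that an apostrophe inside a word counts as one of the removed characters, so "don't" becomes d3t; whether that is the desired behaviour is a judgment call.

```python
import re

# A word is a run of letters, optionally continued by 'letters (don't, shouldn't).
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def abbreviate(text):
    def shorten(match):
        word = match.group(0)
        if len(word) < 4:
            return word  # short words are left alone, as in the question
        return f"{word[0]}{len(word) - 2}{word[-1]}"
    # Everything the pattern does not match (spaces, hyphens, "!") is kept as-is.
    return WORD.sub(shorten, text)

print(abbreviate("Elephant-rides are really fun!"))
print(abbreviate("I don't think so"))
```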
Thank you for viewing my question. After a few more searches and some trial and error, I finally found a way to make my code work without changing it too much. I simply replaced re.findall(r"[\w']+", wrd) with re.split(r'([\W\d\_])', wrd) and changed " ".join() to "".join(): because the split pattern is inside a capturing group, the separators are kept in the list, so the extra space is no longer needed.
import re
def abbreviate(wrd):
    return "".join([i if len(i) < 4 else i[0] + str(len(i[1:-1])) + i[-1] for i in re.split(r'([\W\d\_])', wrd)])
print(abbreviate("elephant-rides are not fun!"))
Output:
e6t-r3s are not fun!

How do I trim a string after a certain character appears a given number of times in Python?

I am trying to scan a string and every time it reads a certain character 3 times, I would like to cut the remaining string
for example:
The string "C:\Temp\Test\Documents\Test.doc" would turn into "C:\Temp\Test\"
Every time the string hits "\" 3 times it should trim the string
here is my code that I am working on
prefix = ["" for x in range(size)]
num = 0
...
...
for char in os.path.realpath(src):
    for x in prefix:
        x = char
        if x =='\': # I get an error here
            num = num + 1
        if num == 3:
            break
    print (num)
print(prefix)
...
...
os.path.realpath(src) is the string with the file path. The "prefix" variable is the string array in which I want to store the trimmed string.
Please let me know what I need to fix or if there is a simpler way to perform this.
Split the string, slice the list to keep the parts you need, and join them back together:
s = r'C:\Temp\Test\Documents\Test.doc'
print('\\'.join(s.split('\\')[:3]) + '\\')
# C:\Temp\Test\
Note that \ (backslash) is the escape character. To get a literal backslash, escape it with another backslash (\\), which removes its special meaning.
In Python the backslash character is used as an escape character: \n is a newline, \t is a tab, and \" lets you put a double quote inside a double-quoted string. If you want a literal backslash, write "\\".
Try:
s = "C:\\Temp\\Test\\Documents\\Test.doc"
answer = '\\'.join(s.split('\\', 3)[:3]) + '\\'
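To make the escaping rules concrete, here is a short sketch comparing escaped and raw string literals. It also explains the error in the question: in '\' the backslash escapes the closing quote, so the string literal never ends.

```python
backslash = "\\"                  # one character: a single backslash
path_escaped = "C:\\Temp\\Test"   # every backslash written twice
path_raw = r"C:\Temp\Test"        # raw string: backslashes are taken literally

print(len(backslash))             # length of the one-backslash string
print(path_escaped == path_raw)   # both spellings give the same string
print(len("\n"), len(r"\n"))      # newline character vs. backslash + n
```

A raw string still cannot end with a single backslash, so "\\" remains the way to write a lone backslash.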
Something like this would do:
x = r"C:\Temp\Test\Documents\Test.doc"
print('\\'.join(x.split("\\")[:3])+"\\")
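An alternative that avoids manual splitting altogether is pathlib, sketched here with PureWindowsPath so it behaves the same on any operating system. Note two differences from the answers above: the first entry of parts is the drive together with its root, and the result has no trailing backslash, so add one yourself if you need it.

```python
from pathlib import PureWindowsPath

path = PureWindowsPath(r"C:\Temp\Test\Documents\Test.doc")

# parts[0] is the drive plus root, followed by one entry per path component.
print(path.parts)

# Rebuild a path from the first three parts: drive, Temp, Test.
trimmed = PureWindowsPath(*path.parts[:3])
print(trimmed)
```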

How to capitalize specific letters in a string given certain rules

I am massaging strings so that the 1st letter of the string and the first letter following either a dash or a slash are capitalized.
So the following string:
test/string - this is a test string
Should look like so:
Test/String - This is a test string
So in trying to solve this problem my 1st idea seems like a bad idea - iterate the string and check every character and using indexing etc. determine if a character follows a dash or slash, if it does set it to upper and write out to my new string.
def correct_sentence_case(test_phrase):
    corrected_test_phrase = ''
    firstLetter = True
    for char in test_phrase:
        if firstLetter:
            corrected_test_phrase += char.upper()
            firstLetter = False
        #elif char == '/':
        else:
            corrected_test_phrase += char
This just seems VERY un-pythonic. What is a pythonic way to handle this?
Something along the lines of the following would be awesome but I can't pass in both a dash and a slash to the split:
corrected_test_phrase = ' - '.join(i.capitalize() for i in test_phrase.split(' - '))
Which I got from this SO:
Convert UPPERCASE string to sentence case in Python
Any help will be appreciated :)
I was able to accomplish the desired transformation with a regular expression:
import re
phrase = 'test/string - this is a test string'
capitalized = re.sub(
    r'(^|[-/])\s*([A-Za-z])', lambda match: match[0].upper(), phrase)
The expression says "anywhere you match either the start of the string, ^, or a dash or slash, followed by optional whitespace and a letter, replace the match with its uppercase form." Uppercasing the whole match is safe because the separator and the whitespace have no case.
If you don't want to go with a messy splitting-joining logic, go with a regex:
import re
string = 'test/string - this is a test string'
print(re.sub(r'(^([a-z])|(?<=[-/])\s?([a-z]))',
             lambda match: match.group(1).upper(), string))
# Test/String - This is a test string
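The regex approach can be wrapped in a small function, which makes it easier to test against other phrases. A minimal sketch: the function name is made up, and the pattern captures the separator, the whitespace and the letter separately so that only the letter is uppercased.

```python
import re

def capitalize_after_separators(phrase):
    """Uppercase the first letter and any letter following '-' or '/'."""
    return re.sub(
        r"(^|[-/])(\s*)([a-z])",
        lambda m: m.group(1) + m.group(2) + m.group(3).upper(),
        phrase,
    )

print(capitalize_after_separators("test/string - this is a test string"))
print(capitalize_after_separators("well-known fact"))
```

The second call shows a side effect worth knowing about: hyphenated words are capitalized too, because the rule does not distinguish a hyphen inside a word from a dash between words.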
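Using a double split. Note that str.capitalize() also lowercases the rest of each piece (which would turn Test/String back into Test/string), so uppercase only the first character instead:
test_phrase = 'test/string - this is a test string'
cap = lambda s: s[:1].upper() + s[1:]  # unlike str.capitalize(), leaves the rest untouched
slash_fixed = '/'.join(cap(part) for part in test_phrase.split('/'))
print(' - '.join(cap(part.strip()) for part in slash_fixed.split(' - ')))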
This is what I use:
import string
last = 'pierre-GARCIA'
if last not in [None, '']:
    last = last.strip()
    if '-' in last:
        last = string.capwords(last, sep='-')
    else:
        last = string.capwords(last, sep=None)
