Change whitespace to underscore at specific positions - python

I have a list of strings like this:
strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32', 'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
What I want is to replace the whitespace inside each cat breed name with an underscore, while keeping the whitespace between .jpg and the first word of the breed, and the whitespace around the numbers.
Expected output:
['pic1.jpg siberian_cat 24 25', 'pic2.jpg siemese_cat 14 32', 'pic3.jpg american_bobtail cat 8 13', 'pic4.jpg cat 9 1']
I tried to construct patterns as follows:
[re.sub(r'(?<!jpg\s)([a-z])\s([a-z])\s([a-z])', r'\1_\2_\3', x) for x in strings ]
However, it adds an underscore between .jpg and the next word.
The problem is that "cat" is not always put at the end of the word combination.

Here is one approach using re.sub with a callback function:
import re
strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32', 'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
output = [re.sub(r'(?<!\S)\w+(?: \w+)* cat\b', lambda x: x.group().replace(' ', '_'), x) for x in strings]
print(output)
This prints:
['pic1.jpg siberian_cat 24 25',
'pic2.jpg siemese_cat 14 32',
'pic3.jpg american_bobtail_cat 8 13',
'pic4.jpg cat 9 1']
Here is an explanation of the regex pattern used:
(?<!\S) assert that what precedes the first word is either whitespace or the start of the string
\w+ match a word, which is then followed by
(?: \w+)* a space and another word, zero or more times
[ ] match a single space (written as a literal space in the pattern)
cat\b followed by 'cat' and a word boundary
In other words, taking the third list element as an example, the regex pattern matches american bobtail cat, then replaces all spaces by underscore in the lambda callback function.
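As a minimal sketch, the same pattern can be compiled once and wrapped in a small helper. The names BREED_RE and join_breed_words are hypothetical and only used for illustration; the pattern itself is the one from the answer above.

```python
import re

# Pattern taken from the answer above; the constant and function names are hypothetical.
BREED_RE = re.compile(r'(?<!\S)\w+(?: \w+)* cat\b')

def join_breed_words(line):
    """Replace the spaces inside a multi-word breed ending in 'cat' with underscores."""
    return BREED_RE.sub(lambda m: m.group().replace(' ', '_'), line)

strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32',
           'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
result = [join_breed_words(s) for s in strings]
print(result)
```

Because the pattern relies on \w+, a breed name containing a character outside \w (a hyphen, for example) would be left unchanged by this sketch.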

Try this:
[re.sub(r'jpg\s((\S+\s)+)cat', "jpg " + "_".join(x.split('jpg')[1].split('cat')[0].strip().split()) + "_cat", x) for x in strings]

Related

Python remove middle initial from the end of a name string

I am trying to remove the middle initial at the end of a name string. An example of how the data looks:
df = pd.DataFrame({'Name': ['Smith, Jake K',
'Howard, Rob',
'Smith-Howard, Emily R',
'McDonald, Jim T',
'McCormick, Erica']})
I am currently using the following code, which works for all names except McCormick, Erica. I first use a regex to find all capital letters. Then, for any row with 3 or more capital letters, I drop the last character with str[:-1] (in an attempt to remove the middle initial and extra space).
df['Cap_Letters'] = df['Name'].str.findall(r'[A-Z]')
df.loc[df['Cap_Letters'].str.len() >= 3, 'Name'] = df['Name'].str[:-1]
This outputs the following:
As you can see, this properly removes the middle initial for all names except McCormick, Erica. The reason is that her name has 3 capital letters but no middle initial, so the 'a' in Erica is incorrectly removed.
You can use Series.str.replace directly:
df['Name'] = df['Name'].str.replace(r'\s+[A-Z]$', '', regex=True)
Output:
0 Smith, Jake
1 Howard, Rob
2 Smith-Howard, Emily
3 McDonald, Jim
4 McCormick, Erica
Name: Name, dtype: object
See the regex demo. Regex details:
\s+ - one or more whitespaces
[A-Z] - an uppercase letter
$ - end of string.
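To see what the pattern does without pandas, here is a small sketch that applies the same regex to plain strings with the standard re module. The name MIDDLE_INITIAL_RE is hypothetical; the sample names are those from the question.

```python
import re

# Same pattern as in the answer above: whitespace, then one uppercase letter at the end.
MIDDLE_INITIAL_RE = re.compile(r'\s+[A-Z]$')

names = ['Smith, Jake K', 'Howard, Rob', 'Smith-Howard, Emily R',
         'McDonald, Jim T', 'McCormick, Erica']

# Only names that end in " <capital letter>" are changed.
cleaned = [MIDDLE_INITIAL_RE.sub('', name) for name in names]
print(cleaned)
# ['Smith, Jake', 'Howard, Rob', 'Smith-Howard, Emily', 'McDonald, Jim', 'McCormick, Erica']
```

Note that an initial followed by a period (such as 'Smith, Jake K.') would not match this pattern as written.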
Another solution (not so pretty) would be to split the name, take the first 2 elements, and join them again:
df['Name'] = df['Name'].str.split().str[0:2].str.join(' ')
# 0 Smith, Jake
# 1 Howard, Rob
# 2 Smith-Howard, Emily
# 3 McDonald, Jim
# 4 McCormick, Erica
# Name: Name, dtype: object
I would use something like this:
def removeMaj(string):
    tab = string.split(',')
    tab[1] = tab[1].lower()  # lower() is a string method, not a built-in function
    string = ",".join(tab)
    return string

Split a python string by particular identifications [duplicate]

I am trying to split a python string when a particular character appears.
For example:
mystring="I want to eat an apple. \n 12345 \n 12 34 56"
The output I want is a list like this:
[["I want to eat an apple"], [12345], [12, 34, 56]]
>>> mystring.split(" \n ")
['I want to eat an apple.', '12345', '12 34 56']
If you specifically want each string inside its own list:
>>> [[s] for s in mystring.split(" \n ")]
[['I want to eat an apple.'], ['12345'], ['12 34 56']]
mystring = "I want to eat an apple. \n 12345 \n 12 34 56"
# split on newlines and strip each line, in case not every separator has the exact form ' \n '
split_list = [line.strip() for line in mystring.split('\n')]  # use [line.strip()] to make each element a list
print(split_list)
Output:
['I want to eat an apple.', '12345', '12 34 56']
Use split(), strip() and re for this question.
First split the string on newlines and strip each part, then extract the numbers from each part with re. If a part contains more than one number, replace it with the list of numbers.
import re
mystring="I want to eat an apple. \n 12345 \n 12 34 56"
l = [i.strip() for i in mystring.split("\n")]
for idx, i in enumerate(l):
    if len(re.findall(r'\d+', i)) > 1:
        l[idx] = re.findall(r'\d+', i)
print(l)
#['I want to eat an apple.', '12345', ['12', '34', '56']]
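If the numbers should end up as integers inside nested lists, as in the output sketched in the question, one possible approach is shown below. This is a sketch under assumptions: the helper name parse_line is hypothetical, a line counts as numeric only when every whitespace-separated token consists of decimal digits, and text lines are kept unchanged (including the trailing period).

```python
mystring = "I want to eat an apple. \n 12345 \n 12 34 56"

def parse_line(line):
    """Return a list of ints if every token is numeric, else the stripped text in a one-element list."""
    tokens = line.split()
    if tokens and all(tok.isdecimal() for tok in tokens):
        return [int(tok) for tok in tokens]
    return [line.strip()]

nested = [parse_line(line) for line in mystring.split('\n')]
print(nested)
# [['I want to eat an apple.'], [12345], [12, 34, 56]]
```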

Remove words until a specific character is reached

I'm new to Python and am having difficulty removing words from a string.
9 - Saturday, 19 May 2012
Above is my string. I would like to remove everything up to the date, leaving
19 May 2012
so I can easily convert it to an SQL date.
Here is the code that I tried:
new_s = re.sub(',', '', '9 - Saturday, 19 May 2012')
But it only removes the "," from the string. Any help?
You can use string.split(',')
and you will get
['9 - Saturday', ' 19 May 2012']
The second element keeps its leading space, which you can remove with strip().
You are missing the .* (matching any number of characters) before the , and a space after it, which you probably also want to remove:
>>> new_s = re.sub('.*, ', '', '9 - Saturday, 19 May 2012')
>>> new_s
'19 May 2012'
Your regex is matching a single comma only hence that is the only thing it removes.
You may use a negated character class i.e. [^,]* to match everything until you match a comma and then match comma and trailing whitespace to remove it like this:
>>> print(re.sub('[^,]*, *', '', '9 - Saturday, 19 May 2012'))
19 May 2012
Regex is great, but for this you could also use .split()
test_string = "9 - Saturday, 19 May 2012"
splt_string = test_string.split(",")
out_String = splt_string[1]
print(out_String)
Outputs:
19 May 2012
If the leading ' ' is a problem, you can remedy this with out_String.lstrip()
try this
a = "9 - Saturday, 19 May 2012"
f = a.find("19 May 2012")
b = a[f:]
print(b)
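Since the stated goal is an SQL date, here is a minimal sketch that combines the splitting step with date parsing from the standard library. It assumes the input always has the form shown in the question, and that the month name is in English (the %B directive depends on the current locale).

```python
from datetime import datetime

raw = "9 - Saturday, 19 May 2012"

# partition splits on the first comma only; the date is whatever follows it.
_, _, date_part = raw.partition(',')
date_text = date_part.strip()

# %d = day of month, %B = full month name (locale-dependent), %Y = four-digit year.
parsed = datetime.strptime(date_text, "%d %B %Y").date()
sql_date = parsed.isoformat()
print(sql_date)
```

isoformat() yields the YYYY-MM-DD form that SQL date literals commonly use.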

regex to parse out certain value that i want

Using https://regex101.com/
My current regex expression is ^.*'(\d\s*.*)'*$
which doesn't seem to be working. What is the right pattern to use?
I want to be able to parse out 4 variables, namely
items, quantity, cost and total.
My code:
import re
str = "xxxxxxxxxxxxxxxxxx"
match = re.match(r"^.*'(\d\s*.*)'*$",str)
print match.group(1)
The following regex matches each ingredient line and stores the wanted information in groups: r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$'
It defines 3 groups, separated from each other by whitespace:
^ marks the start of the string
(\d+) is the first group and looks for at least one digit
\s+ is the first separator between groups and looks for at least one whitespace character
([A-Za-z ]+) is the second group and looks for at least one alphabetical character or space
\s+ is the second separator between groups and looks for at least one whitespace character
(\d+(?:\.\d*)) is the third group and looks for at least one digit, followed by a decimal point and any further digits (as written, the decimal point is required)
$ marks the end of the string
I think the regex that obtains the total does not need an explanation.
Here is some test code using your test data. It should be a good starting point:
import re
TEST_DATA = ['Table: Waiter: kenny',
'======================================',
'1 SAUSAGE WRAPPED WITH B 10.00',
'1 ESCARGOT WITH GARLIC H 12.00',
'1 PAN SEARED FOIE GRAS 15.00',
'1 SAUTE FIELD MUSHROOM W 9.00',
'1 CRISPY CHICKEN WINGS 7.00',
'1 ONION RINGS 6.00',
'----------------------------------',
'TOTAL 59.00',
'CASH 59.00',
'CHANGE 0.00',
'Signature:__________________________',
'Thank you & see you again soon!']
INGREDIENT_RE = re.compile(r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$')
TOTAL_RE = re.compile(r'^TOTAL (.+)$')
ingredients = []
total = None
for string in TEST_DATA:
    match = INGREDIENT_RE.match(string)
    if match:
        ingredients.append(match.groups())
        continue
    match = TOTAL_RE.match(string)
    if match:
        total = match.groups()[0]
        break
print(ingredients)
print(total)
This prints:
[('1', 'SAUSAGE WRAPPED WITH B', '10.00'), ('1', 'ESCARGOT WITH GARLIC H', '12.00'), ('1', 'PAN SEARED FOIE GRAS', '15.00'), ('1', 'SAUTE FIELD MUSHROOM W', '9.00'), ('1', 'CRISPY CHICKEN WINGS', '7.00'), ('1', 'ONION RINGS', '6.00')]
59.00
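The groups come back as strings. As a possible next step, the sketch below converts them to typed values and cross-checks them against the total. It assumes the third column is a line total rather than a unit price (the receipt does not say which), and the dictionary keys are hypothetical names chosen for illustration.

```python
from decimal import Decimal

# The tuples and total exactly as printed above.
ingredients = [('1', 'SAUSAGE WRAPPED WITH B', '10.00'), ('1', 'ESCARGOT WITH GARLIC H', '12.00'),
               ('1', 'PAN SEARED FOIE GRAS', '15.00'), ('1', 'SAUTE FIELD MUSHROOM W', '9.00'),
               ('1', 'CRISPY CHICKEN WINGS', '7.00'), ('1', 'ONION RINGS', '6.00')]
total = '59.00'

# Decimal avoids binary floating-point rounding when adding money amounts.
items = [{'quantity': int(qty), 'item': name, 'cost': Decimal(cost)}
         for qty, name, cost in ingredients]

# Assumption: each cost is already a line total, so the costs are summed directly.
computed_total = sum(entry['cost'] for entry in items)
print(computed_total == Decimal(total))
```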
Edit on Python raw strings:
The r prefix before a Python string literal indicates that it is a raw string, which means that backslash escape sequences (like \t, \n, etc.) are not interpreted.
For example, in a standard string \t is a single tab character. In a raw string it is two characters: \ and t.
r'\t' is equivalent to '\\t'.
More details are in the Python documentation.
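The difference is easy to check directly. The variable names in this short sketch are only illustrative:

```python
tab = '\t'        # standard string: one tab character
raw = r'\t'       # raw string: a backslash followed by the letter t
escaped = '\\t'   # standard string with an escaped backslash

print(len(tab))        # 1
print(len(raw))        # 2
print(raw == escaped)  # True
```

This is why regex patterns are usually written as raw strings: the backslashes reach the re module unchanged.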

python regex add space whenever a number is adjacent to a non-number

I am trying to separate non-numbers from numbers in a Python string. Numbers can include floats.
Examples
Original String    Desired String
'4x5x6'            '4 x 5 x 6'
'7.2volt'          '7.2 volt'
'60BTU'            '60 BTU'
'20v'              '20 v'
'4*5'              '4 * 5'
'24in'             '24 in'
Here is a very good thread on how to achieve just that in PHP:
Regex: Add space if letter is adjacent to a number
I would like to manipulate the strings above in Python.
The following piece of code works for the first example, but not for the others:
new_element = []
result = [re.split(r'(\d+)', s) for s in (unit)]
for elements in result:
    for element in elements:
        if element != '':
            new_element.append(element)
    new_element = ' '.join(new_element)
    break
Easy! Just substitute using a regex backreference. Don't forget to strip the surrounding whitespace.
Please try this code:
import re
the_str = "4x5x6"
print(re.sub(r"([0-9]+(\.[0-9]+)?)", r" \1 ", the_str).strip())  # \1 refers to the first group in ()
I used split, like you did, but modified it like this:
>>> tcs = ['123', 'abc', '4x5x6', '7.2volt', '60BTU', '20v', '4*5', '24in', 'google.com-1.2', '1.2.3']
>>> pattern = r'(-?[0-9]+\.?[0-9]*)'
>>> for test in tcs: print(repr(test), repr(' '.join(segment for segment in re.split(pattern, test) if segment)))
'123' '123'
'abc' 'abc'
'4x5x6' '4 x 5 x 6'
'7.2volt' '7.2 volt'
'60BTU' '60 BTU'
'20v' '20 v'
'4*5' '4 * 5'
'24in' '24 in'
'google.com-1.2' 'google.com -1.2'
'1.2.3' '1.2 . 3'
Seems to have the desired behavior.
Note that you have to filter out the empty strings that re.split puts at the beginning/end of the list before joining the segments. See this question for an explanation.
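As an alternative to splitting, a space can be inserted at every boundary between a number and a non-number with a single re.sub using lookarounds. This is a different technique from the answers above, offered as a sketch: the names BOUNDARY_RE and space_out_numbers are hypothetical, and it treats a dot as part of a number and a minus sign as an ordinary character.

```python
import re

# Zero-width match at a digit -> non-number boundary, or a non-number -> digit boundary.
# Dots and whitespace are excluded so that floats stay intact and no double spaces appear.
BOUNDARY_RE = re.compile(r'(?<=\d)(?=[^\d.\s])|(?<=[^\d.\s])(?=\d)')

def space_out_numbers(text):
    """Insert a single space wherever a number touches a non-number."""
    return BOUNDARY_RE.sub(' ', text)

examples = {'4x5x6': '4 x 5 x 6', '7.2volt': '7.2 volt', '60BTU': '60 BTU',
            '20v': '20 v', '4*5': '4 * 5', '24in': '24 in'}
for original, desired in examples.items():
    print(repr(original), repr(space_out_numbers(original)), space_out_numbers(original) == desired)
```

Unlike the split-based pattern, this sketch does not attach a leading minus sign to the number, so 'google.com-1.2' becomes 'google.com- 1.2'.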
