How can I remove '\n + 1' from a DataFrame? - python

I have the following dataframe:

Senior  Location
False   Warszawa
True    Warszawa\n + 1

I am trying to remove that "\n + 1", which looks like a hidden character to me. At first, I tried:
df['Location']=df['Location'].str.replace('Warszawa\n + 1','Warszawa')
but nothing happened.
I managed to remove those characters manually, with a long chain of splits and replaces, but that is not a viable solution because it gives me weird results later in the program: although I have "Warszawa" in both rows of the df, the two values are treated as different locations.
What I want is this:

Senior  Location
False   Warszawa
True    Warszawa

How can I correctly remove that "\n + 1"? And what character is it?

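The .str.replace method can treat the search pattern as a regex (regular expression); before pandas 2.0 this was the default (regex=True). In a regex, + has a special meaning (one or more of the preceding element), so your pattern does not match the literal text. To tell pandas that you are searching for a literal +, set regex=False.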
df['Location'] = df['Location'].str.replace(r'Warszawa\n + 1','Warszawa', regex = False)
Here you can read more about the parameters:
pandas.Series.str.replace
You will have the same problem if the text you are searching for contains any of the following characters:
., [, ], *, ?
For the complete list, search for "regex special characters".
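A minimal standard-library sketch of why the original call found nothing. It assumes the cell holds a real newline character; the sample value and variable names are illustrative and not taken from the asker's data. Read as a regex, " +" means "one or more spaces", so the pattern never reaches the literal plus sign, whereas re.escape() or a plain string replace treats every character literally.

```python
import re

# Assumed cell value: "Warszawa", a real newline, then " + 1".
cell = "Warszawa\n + 1"
needle = "Warszawa\n + 1"

# Treated as a regex, " +" means "one or more spaces", so nothing matches
# and the text comes back unchanged.
as_regex = re.sub(needle, "Warszawa", cell)

# re.escape() neutralises the special characters, giving a literal match.
as_literal = re.sub(re.escape(needle), "Warszawa", cell)

# Plain str.replace() never interprets regex syntax at all.
as_plain = cell.replace(needle, "Warszawa")

print(repr(as_regex))
print(repr(as_literal))
print(repr(as_plain))
```

Printing repr() of a cell is also a quick way to see which character is really there. If the column turns out to hold a literal backslash followed by n rather than a newline, the same idea applies with a raw-string needle.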

When using str.replace(), the regex parameter was set to True by default in older pandas versions. Since you just want to replace a literal string, you can either do what @Amir Py has done and pass regex=False, or use the replace() method and do an in-place literal string replacement. The regex parameter of replace() is False by default. Note that replace() matches whole cell values rather than substrings, which works here because the whole cell is the text being replaced.
Code:
df['Location'].replace('Warszawa\n + 1', 'Warszawa', inplace=True)
This can also be useful if you have similar issues in other columns of your dataframe. For more information, there is a good question and answer on Stack Overflow:
str.replace v replace

Related

Remove characters after matching two conditions

I have the Python code below and I would like the output to be a string: "P-1888" discarding all numbers after the 2nd "-" and removing the leading 0's after the 1st "-".
So far all I have been able to do in the following code is to remove the trailing 0's:
import re
docket_no = "P-01888-000"
doc_no_rgx1 = re.compile(r"^([^\-]+)\-(0+(.+))\-0[\d]+$")
massaged_dn1 = doc_no_rgx1.sub(r"\1-\2", docket_no)
print(massaged_dn1)
You can use the split() method to split the string on the "-" character and then use the join() method to join the first and second elements of the resulting list with a "-" character. Additionally, you can use the lstrip() method to remove the leading 0's after the 1st "-". Try this.
docket_no = "P-01888-000"
docket_no_list = docket_no.split("-")
docket_no_list[1] = docket_no_list[1].lstrip("0")
massaged_dn1 = "-".join(docket_no_list[:2])
print(massaged_dn1)
The first way is to use capturing groups. You have already defined three of them using parentheses. In your example, the first capturing group gets "P" and the third capturing group gets the number without leading zeros. You can retrieve the captured data with the pattern's match method:
match = doc_no_rgx1.match(docket_no)
print(f'{match.group(1)}-{match.group(3)}') # Outputs 'P-1888'
Second way is to not use regex for such a simple task. You could split your string and reassemble it like this:
parts = docket_no.split('-')
print(f'{parts[0]}-{parts[1].lstrip("0")}')
It seems like a sledgehammer/nut situation, but if you do want to use re then you could use (note the raw string, so that \d is not treated as a string escape):
doc_no_rgx1 = ''.join(re.findall(r'([A-Z]-)0+(\d+)-', docket_no)[0])
I don't think I'd use a regular expression for this purpose. Your usecase can be handled by standard string manipulation so using a regular expression would be overkill. Instead, consider doing this:
docket_nos = "P-01888-000".split('-')[:-1]
docket_nos[1] = docket_nos[1].lstrip('0')
docket_no = '-'.join(docket_nos)
print(docket_no) # P-1888
This might seem a little bit verbose but it does exactly what you're looking for. The first line splits docket_no by '-' characters, producing substrings P, 01888 and 000; and then discards the last substring. The second line strips leading zeros from the second substring. And the third line joins all these back together using '-' characters, producing your desired result of P-1888.
Functionally this is no different than other answers suggesting that you split on '-' and lstrip the zero(s), but personally I find my code more readable when I use multiple assignment to clarify intent vs. using indexes:
def convert_docket_no(docket_no):
    letter, number, *_ = docket_no.split('-')
    return f'{letter}-{number.lstrip("0")}'
_ is used here for a "throwaway" variable, and the * makes it accept all elements of the split list past the first two.
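For comparison, here is a sketch of a single re.sub call that does both jobs at once. The function name normalize_docket is hypothetical, and the pattern assumes the middle part consists only of digits. One design difference from lstrip("0"): because the second group requires at least one digit, an all-zero number is reduced to a single 0 instead of an empty string.

```python
import re

def normalize_docket(docket_no):
    # Group 1: everything before the first dash.
    # 0* skips leading zeros; group 2 keeps the remaining digits (at least one).
    # The second dash and whatever follows it are matched and dropped.
    return re.sub(r'^([^-]+)-0*(\d+)-.*$', r'\1-\2', docket_no)

print(normalize_docket("P-01888-000"))
print(normalize_docket("P-00000-000"))
# A value without a second dash does not match and is returned unchanged.
print(normalize_docket("P-01888"))
```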

Want to replace comma with decimal point in text file where after each number there is a comma in python

eg
Arun,Mishra,108,23,34,45,56,Mumbai
The output I want is
Arun,Mishra,108.23,34,45,56,Mumbai
I tried to replace the comma with a dot, but all the delimiters were replaced with dots.
I tried text.replace(',','.'), but it replaces every comma with a dot.
You can use regex for this kind of task:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, count=1)
print(new_str)  # Arun,Mishra,108.23,34,45,56,Mumbai
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. There are three capture groups, so they can be used in the replacement r'\1.\3' (\1 and \3 are the first and third groups). old_str is the input string, and count=1 tells re.sub to replace only the first occurrence (thus keeping 34,45 as it is).
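To make the effect of the count argument concrete, the sketch below runs the same substitution with and without it. The two-group pattern is a simplification of the three-group one (the comma does not need its own group); variable names are illustrative.

```python
import re

line = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pattern = r'(\d+),(\d+)'

# count=1 stops after the first digit,digit pair.
first_only = re.sub(pattern, r'\1.\2', line, count=1)

# Without count, every non-overlapping pair is rewritten, so 34,45 changes too.
every_pair = re.sub(pattern, r'\1.\2', line)

print(first_only)  # Arun,Mishra,108.23,34,45,56,Mumbai
print(every_pair)  # Arun,Mishra,108.23,34.45,56,Mumbai
```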
It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for all/any commas. Once the index of a comma has been identified, examine the characters either side (checking for digits). If such a pattern is observed, modify the string accordingly
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai
Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. Having a generic regex to replace any comma between two subsequent sets of digits might give false matches on other data however so I think explicitly matching the data based on the exact columns you have will be the safest way to do it.
You can take the above regex and code it into your python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)'
, r'\1,\2,\3.\4,\5,\6,\7,\8', inLine, 0)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.
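The same column-based idea can be sketched with the standard csv module instead of an eight-group regex. The helper name merge_columns and its col parameter are hypothetical; the sketch assumes the two columns to merge are adjacent and that the integer part sits at zero-based index 2, as in the sample line.

```python
import csv
import io

def merge_columns(text, col=2):
    """Join column `col` and the one after it with a dot, row by row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        # Rows that are too short to contain both columns are written unchanged.
        if len(row) > col + 1:
            row[col:col + 2] = [f"{row[col]}.{row[col + 1]}"]
        writer.writerow(row)
    return out.getvalue()

print(merge_columns("Arun,Mishra,108,23,34,45,56,Mumbai\n"), end="")
```

Working by column index avoids false matches on other digit,digit pairs in the row and does not depend on the total number of columns.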
First split the string using s.split(','), then join the 3rd and 4th elements with a '.' and drop the now redundant 4th element.
After that, join the list back into a string.
s= 'Arun,Mishra,108,23,34,45,56,Mumbai '
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
This changes every comma to a dot when the comma has a digit directly before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # Only convert a comma that sits between two digits; the bounds check
    # guards against a comma at the very start or end of the string.
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex+1)
print(txt)
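The "comma between two digits" rule can also be written as one regular expression with lookarounds. This is an alternative sketch, not the approach used in the loop above. Because the lookbehind and lookahead do not consume the digits, chained values such as 1,2,3 are converted in full.

```python
import re

txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."

# (?<=\d) requires a digit just before the comma, (?=\d) a digit just after;
# neither is part of the match, so only the comma itself is replaced.
converted = re.sub(r'(?<=\d),(?=\d)', '.', txt)
chained = re.sub(r'(?<=\d),(?=\d)', '.', '1,2,3')

print(converted)
print(chained)
```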

Remove Characters From A String Until A Specific Format is Reached

So I have the following strings and I have been trying to figure out how to manipulate them in such a way that I get a specific format.
string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo
I want to be able to get rid of any of the last string so I am just left with the month and year, like below:
string1-itd_jan2021
string2itd_mar2021
string3itd_feb2021
string4-itd_mar2021
string5itd_jun2021
string6-itd_feb2021
I thought about using string.split on the - but then realized that this wouldn't work for some strings. I also thought about removing a fixed number of characters by slicing, but the ending varies in length.
Is there a way to do this with regex or any other Python module?
Use str.rsplit with the appropriate maxsplit parameter:
s = s.rsplit("-", 1)[0]
You could also use str.split (even though this is clearly the worse choice):
s = "-".join(s.split("-")[:-1])
Or using regular expressions:
s = re.sub(r'-[^-]*$', '', s)
# "-[^-]*" a "-" followed by any number of non-"-"
With a regex:
import re
re.sub(r'([0-9]{4}).*$', r'\1', s)
Use re.sub like so:
import re
lines = '''string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo'''
for old in lines.split('\n'):
    new = re.sub(r'[-][^-]+$', '', old)
    print('\t'.join([old, new]))
Prints:
string1-itd_jan2021-internal string1-itd_jan2021
string2itd_mar2021-space string2itd_mar2021
string3itd_feb2021-internal string3itd_feb2021
string4-itd_mar2021-moon string4-itd_mar2021
string5itd_jun2021-internal string5itd_jun2021
string6-itd_feb2021-apollo string6-itd_feb2021
Explanation:
r'[-][^-]+$' : Literal dash (-), followed by any character other than a dash ([^-]) repeated 1 or more times, followed by the end of the string ($).
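The rsplit, split/join and re.sub approaches suggested above agree on the sample strings but differ when a string contains no dash at all. The sketch below wraps each in a small function (the function names and the "nodash" sample are illustrative) to show that edge case.

```python
import re

def trim_rsplit(s):
    # Split once from the right and keep the left part.
    return s.rsplit("-", 1)[0]

def trim_split_join(s):
    # Split on every dash, drop the last piece, and glue the rest back.
    return "-".join(s.split("-")[:-1])

def trim_regex(s):
    # Remove the last dash and everything after it.
    return re.sub(r'-[^-]*$', '', s)

for sample in ("string1-itd_jan2021-internal", "string2itd_mar2021-space", "nodash"):
    print(repr(sample), repr(trim_rsplit(sample)),
          repr(trim_split_join(sample)), repr(trim_regex(sample)))
```

With no dash present, rsplit and re.sub return the string unchanged, while the split/join variant returns an empty string.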

Pandas unable to split on multiple asterisk

I'm trying to split on 5x asterisk in Pandas by reading in data that looks like this
"This place is not good ***** less salt on the popcorn!"
My code attempt is trying to read in the reviews column and get the zero index
review = review_raw["reviews"].str.split('*****').str[0]
print(review)
The error
sre_constants.error: nothing to repeat at position 0
My expectation
This place is not good
pandas.Series.str.split
Series.str.split(pat=None, n=-1, expand=False)
Parameters:
pat : str, optional
    String or regular expression to split on. If not specified, split on whitespace.
In a regular expression, * is a quantifier meaning "zero or more occurrences of the preceding element". At the start of your pattern there is nothing for it to repeat, which is why your code fails with "nothing to repeat".
You can either try escaping the character:
>>> df['review'].str.split(r'\*\*\*\*\*').str[0]
0 This place is not good
Name: review, dtype: object
Or you can use a character class with a repeat count:
>>> df['review'].str.split('[*]{5}').str[0]
0 This place is not good
Name: review, dtype: object
A third option would be to use Python's built-in str.split() instead of pandas' Series.str.split():
>>> df['review'].apply(lambda x: x.split('*****')).str[0]
0 This place is not good
Name: review, dtype: object
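The error and the escaping fix can be reproduced with the standard re module alone. This is a sketch outside pandas; the variable names are illustrative. re.escape() produces the same backslash-escaped pattern as the one written by hand above.

```python
import re

text = "This place is not good ***** less salt on the popcorn!"

# An unescaped * has nothing before it to repeat, so compiling the pattern fails.
try:
    re.split("*****", text)
    error_message = None
except re.error as exc:
    error_message = str(exc)

# re.escape() turns each * into \*, i.e. a literal asterisk.
escaped = re.escape("*****")
first_part = re.split(escaped, text)[0].strip()

print(error_message)
print(escaped)
print(first_part)
```

The strip() call removes the space left in front of the asterisks, which matches the expected output in the question.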
Try this code:
def replace_str(string):
    return str(string).replace("*****", ',').split(',')[0]
review = review_raw["reviews"].apply(lambda x: replace_str(x))
If the input string may already contain a ',', the code can be tweaked as shown below. Since ***** is only being swapped for a placeholder, any character that does not occur in the text can be used instead, such as '['.
def replace_str(string):
    return str(string).replace("*****", '[').split('[')[0]
review = review_raw["reviews"].apply(lambda x: replace_str(x))

How do you find all instances of a substring, followed by a certain number of dynamic characters?

I'm trying to find all instances of a specific substring(a!b2 as an example) and return them with the 4 characters that follow after the substring match. These 4 following characters are always dynamic and can be any letter/digit/symbol.
I've tried searching, but it seems like the similar questions that are asked are requesting help with certain characters that can easily split a substring, but since the characters I'm looking for are dynamic, I'm not sure how to write the regex.
When using regex, you can use "." to dynamically match any character. Use {number} to specify how many characters to match, and use parentheses as in (.{number}) to specify that the match should be captured for later use.
>>> import re
>>> s = "a!b2foobar a!b2bazqux a!b2spam and eggs"
>>> print(re.findall("a!b2(.{4})", s))
['foob', 'bazq', 'spam']
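import re
# s is the string to search
print(re.search(r'a!b2(.{4})', s).group(1))
.{4} matches any 4 characters except newlines (unless the re.DOTALL flag is set).
group(0) is the complete match of the searched string, and group(1) is the part captured by the parentheses. You can read about group id here.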
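A small sketch that combines these pieces into a reusable helper. The name find_with_suffix and its parameters are hypothetical. It escapes the substring so that characters such as . or * in it are matched literally, and passes re.DOTALL so that a newline also counts as one of the following characters.

```python
import re

def find_with_suffix(text, prefix, n=4):
    # re.escape() makes the prefix literal; (.{n}) captures the next n characters.
    pattern = re.escape(prefix) + "(.{%d})" % n
    # re.DOTALL lets "." match newlines as well.
    return re.findall(pattern, text, flags=re.DOTALL)

print(find_with_suffix("a!b2foobar a!b2bazqux a!b2spam and eggs", "a!b2"))
print(find_with_suffix("a!b2x\nyz!", "a!b2"))
```

An occurrence followed by fewer than n characters is not reported, because the pattern requires exactly n characters after the prefix.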
If you're only looking for how to grab the following 4 characters using regex, what you probably want is the curly-brace quantifier, which specifies how many times to match: '{}'.
They go into more detail in the post here, but essentially you would use [a-zA-Z0-9]{X,Y} or (.{X,Y}), where X to Y is the number of characters you're looking for (in your case, you would only need {4}).
A more Pythonic way to solve this problem would be to make use of string slicing, and the index function however.
Eg. given an input_string, when you find the substring at index i using index, then you could use input_string[i+len(sub_str):i+len(sub_str)+4] to grab those special characters.
As an example,
input_string = 'abcdefg'
sub_str = 'abcd'
found_index = input_string.index(sub_str)
start_index = found_index + len(sub_str)
symbol = input_string[start_index: start_index + 4]
Outputs (to show it works with <4 as well): efg
Index also allows you to give start and end indexes for the search, so you could also use it in a loop if you wanted to find it for every sub string, with the start of the search index being the previous found index + 1.
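The loop described above might look like the following sketch. It uses str.find rather than str.index so that a missing substring yields -1 instead of raising ValueError; the function name following_chars is illustrative.

```python
def following_chars(text, sub, n=4):
    results = []
    start = text.find(sub)
    while start != -1:
        begin = start + len(sub)
        # Slicing never raises, so a short tail simply gives fewer than n characters.
        results.append(text[begin:begin + n])
        # Resume the search one position after the previous hit.
        start = text.find(sub, start + 1)
    return results

print(following_chars("a!b2foobar a!b2bazqux a!b2spam and eggs", "a!b2"))
print(following_chars("abcdefg", "abcd"))
```

Restarting at the previous index + 1 means overlapping occurrences of the substring are also found, which a non-overlapping regex search would skip.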
