Python regex for filename

I am trying to apply a regex to values in a dataframe.
For example, a value will be ia wt template - tdct-c15-c5.doc
The best logic I can think of is to take everything after the " - " up to the last digit in the string.
I am trying to trim it to tdct-c15-c5
Any help would be appreciated.

Components
To stay flexible, assume your input filenames consist of these chunks:
1. a fixed extension .doc (denoting Word files or documents)
2. some important key (here tdct-c15-c5)
3. the separator: a hyphen, possibly surrounded by spaces (here " - ")
4. some prefix, which does not matter currently (here ia wt template)
All of this is contained in ia wt template - tdct-c15-c5.doc.
Decomposition steps
Chunks (1) and (3) in particular seem to be stable, fixed constants.
So let's work with them:
1. strip the extension (1) from the right and ignore it
2. split the remaining basename at the separator (3) into two parts: prefix (4) and key (2)
The last part (2) is what we want to extract.
Implementation (pure Python only)
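def extract_key(filename):
    # rstrip('.doc') would strip any trailing '.', 'd', 'o' or 'c' characters,
    # not the suffix, so cut the extension off explicitly
    basename = filename[:-len('.doc')] if filename.endswith('.doc') else filename
    # split at the first separator only; the key itself contains hyphens
    prefix, key = basename.split(' - ', 1)  # or re.split(r' ?- ?', basename, maxsplit=1)
    return key

filename = 'ia wt template - tdct-c15-c5.doc'
print('extracted key:', extract_key(filename))
Prints:
extracted key: tdct-c15-c5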
Applied to pandas
Use the function, as suggested by C.Nivis, inside apply() on the column that holds the filenames (called mystr in the answer below):
df['mystr'].apply(extract_key)
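The logic described in the question (everything after the separator up to the last digit) can also be written directly as a regex. This is only a sketch, assuming every value contains the " - " separator followed by at least one digit; the name extract_key_regex is made up for illustration.

```python
import re

# Capture everything after the first " - " up to the last digit in the string.
KEY_PATTERN = re.compile(r" - (.*\d)")

def extract_key_regex(filename):
    """Hypothetical helper: return the key, or None if the pattern does not match."""
    match = KEY_PATTERN.search(filename)
    return match.group(1) if match else None

print(extract_key_regex("ia wt template - tdct-c15-c5.doc"))  # tdct-c15-c5
```

With pandas, the same pattern could probably be passed to Series.str.extract, for example df['mystr'].str.extract(r' - (.*\d)', expand=False), which avoids a Python-level apply.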

I don't know if a regex is the better option here. An apply is pretty readable:
import pandas as pd
mystr = "ia wt template - tdct-c15-c5.doc"
df = pd.DataFrame([[mystr] for i in range(4)], columns=['mystr'])
# take the last space-separated chunk, then split off the extension
# (rstrip('.doc') strips a set of characters, not the suffix)
df.mystr.apply(lambda x: x.split(' ')[-1].rsplit('.', 1)[0])
0    tdct-c15-c5
1    tdct-c15-c5
2    tdct-c15-c5
3    tdct-c15-c5
Name: mystr, dtype: object

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

e.g.
Arun,Mishra,108,23,34,45,56,Mumbai
The output I want is
Arun,Mishra,108.23,34,45,56,Mumbai
I tried to replace the comma with a dot, but then all the delimiters are replaced as well:
text.replace(',','.') replaces every comma with a dot.
You can use a regex for this kind of task:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, count=1)
print(new_str)
# Arun,Mishra,108.23,34,45,56,Mumbai
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. There are three capture groups, so they can be used in the replacement r'\1.\3' (\1 and \3 are the first and third groups). old_str is the input string, and count=1 tells re.sub to replace only the first occurrence (thus keeping 34,45).
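To see why the count matters, here is a small sketch comparing the call with and without it on the same input. It uses a two-group variant of the pattern, which is an illustrative simplification and not the answer's exact code.

```python
import re

line = "Arun,Mishra,108,23,34,45,56,Mumbai"
pattern = r"(\d+),(\d+)"  # a comma with digits on both sides

first_only = re.sub(pattern, r"\1.\2", line, count=1)
every_pair = re.sub(pattern, r"\1.\2", line)  # no count: all non-overlapping matches

print(first_only)  # Arun,Mishra,108.23,34,45,56,Mumbai
print(every_pair)  # Arun,Mishra,108.23,34.45,56,Mumbai
```

Without the count, the pair 34,45 is rewritten as well, which is not what the question asks for.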
It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for all/any commas. Once the index of a comma has been identified, examine the characters either side (checking for digits). If such a pattern is observed, modify the string accordingly
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai
Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. A generic regex that replaces any comma between two adjacent runs of digits might give false matches on other data, so I think explicitly matching the exact columns you have is the safest way to do it.
You can take the above regex and use it in your Python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)',
                 r'\1,\2,\3.\4,\5,\6,\7,\8', inLine)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.
First split the string using s.split(','), then join the third and fourth elements (indices 2 and 3) with a dot,
and finally join the pieces back together with commas.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
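Since the question is about a whole text file, the split-and-join idea can be wrapped in a function and applied line by line. The following is a sketch under the assumption that every line has the same column layout; merge_fields and the second sample row are made up for illustration.

```python
def merge_fields(line, index, joiner="."):
    """Hypothetical helper: join field `index` and the next field with `joiner`."""
    fields = line.rstrip("\n").split(",")
    if index + 1 >= len(fields):
        return line.rstrip("\n")  # too few fields: leave the line unchanged
    # Replace the two adjacent fields with a single merged field.
    fields[index:index + 2] = [fields[index] + joiner + fields[index + 1]]
    return ",".join(fields)

lines = [
    "Arun,Mishra,108,23,34,45,56,Mumbai",
    "Asha,Rao,99,5,12,13,14,Pune",  # made-up second row
]
for line in lines:
    print(merge_fields(line, 2))
```

The same loop would work over an open file object, since iterating a text file yields one line at a time.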
This changes every comma to a dot if the comma has digits directly before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # only replace when the comma has a digit on both sides
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex+1)
print(txt)

How to replace string in python string with specific character?

For example, I have a column named Children in a pandas data frame.
A few of the names are [ tom (peter) , lily, fread, gregson (jaeson 123)] etc.
What code should I write to remove the part of each name that starts with a bracket '(' and ends with a bracket ')'? From the example names above, tom (peter) should become tom and gregson (jaeson 123) should become gregson. There are thousands of names with a bracketed part. The data frame has many columns, but I want to make this edit only in one specific column, named CHILDREN, in my dataframe named DF.
As suggested by @Ruslan S., you can use pandas.Series.str.replace, or you could use re.sub (there are other methods as well):
import pandas as pd
df = pd.DataFrame({"name":["tom (peter)" , "lily", "fread", "gregson (jaeson 123)"]})
# OPTION 1: with str.replace
# (regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default)
df["name"] = df["name"].str.replace(r"\([a-zA-Z0-9\s]+\)", "", regex=True).str.strip()
# OPTION 2: with re.sub
import re
r = re.compile(r"\([a-zA-Z0-9\s]+\)")
df["name"] = df["name"].apply(lambda x: r.sub("", x).strip())
And the result in both cases:
      name
0      tom
1     lily
2    fread
3  gregson
Note that I also use strip to remove leading and trailing whitespace here. For more info on the regular expression syntax, see the re documentation.
You can try:
# to remove text between ( and ); regex=True is needed on pandas 2.0+
df['columnname'] = df['columnname'].str.replace(r'\((.*)\)', '', regex=True)
# to remove text between % and %
df['columnname'] = df['columnname'].str.replace(r'%(.*)%', '', regex=True)
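One thing to keep in mind with a pattern like \(.*\) is that .* is greedy: if a cell contains more than one bracketed part, the match runs from the first ( to the last ). The sketch below uses plain re.sub and a made-up string with two bracketed parts to show the difference from a negated character class; the \s* prefix is an added assumption that the space before the bracket should go too.

```python
import re

text = "tom (peter) and gregson (jaeson 123)"  # made-up value with two bracketed parts

# Greedy: removes everything from the first "(" to the last ")".
greedy = re.sub(r"\(.*\)", "", text)

# Negated class: each bracketed part is matched on its own.
per_group = re.sub(r"\s*\([^)]*\)", "", text)

print(repr(greedy))     # 'tom '
print(repr(per_group))  # 'tom and gregson'
```

For names with a single bracketed part, as in the question, both patterns remove the same text.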

Python: zfill with re.sub padding zeroes

I wrote a script to standardize a bunch of values pulled from a data bank using (mostly) re.sub. I am having a hard time incorporating zfill to pad the numerical values to 5 digits.
Input
FOO5864BAR654FOOBAR
Desired Output
FOO_05864-BAR-00654_FOOBAR
Using re.sub I have so far
FOO_5864-BAR-654_FOOBAR
One option was to do re.sub w/ capturing groups for each possible format [i.e. below], which works, but I don't think that's the correct way to do it.
(\d) sub 0000\1
(\d\d) sub 000\1
(\d\d\d) sub 00\1
(\d\d\d\d) sub 0\1
Assuming your inputs are all of the form letters-numbers-letters-numbers-letters (one or more of each), you just need to zero-fill the second and fourth groups from the match:
import re
s = 'FOO5864BAR654FOOBAR'
pattern = r'(\D+)(\d+)(\D+)(\d+)(\D+)'
m = re.match(pattern, s)
out = '{}_{:0>5}-{}-{:0>5}_{}'.format(*m.groups())
print(out) # -> FOO_05864-BAR-00654_FOOBAR
You could also do this with str.zfill(5), but the str.format method is just much cleaner.
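For completeness, zfill can be combined with re.sub by passing a function as the replacement, which re.sub calls once per match. This is a sketch that starts from the partly formatted string in the question; pad_number is a name chosen here for illustration.

```python
import re

def pad_number(match):
    # zfill left-pads the matched digits with zeros to a width of 5
    return match.group(0).zfill(5)

partly_formatted = "FOO_5864-BAR-654_FOOBAR"
padded = re.sub(r"\d+", pad_number, partly_formatted)
print(padded)  # FOO_05864-BAR-00654_FOOBAR
```

This avoids writing one substitution per digit count, since the function handles runs of any length up to five digits.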

how to exclude sentences containing specific word

I am reading sentences from an Excel file (containing bio data) and want to extract the organizations where people are working. The file also contains sentences which specify where the person is studying.
e.g.:
I am studying in 'x' institution (university)
I am a student in 'y' college
I want to skip these types of sentences.
I am using a regular expression to match these sentences: if a sentence is related to a student, skip it, and write only the other lines to a separate Excel file.
My code is below:
csvdata = pandas.read_csv("filename.csv",",");
for data in csvdata:
    regEX=re.compile('|'.join([r'\bstudent\b',r'\bstudy[ing]\b']),re.I)
    matched_data=re.match(regEX,data)
    if matched_data is not None:
        continue
    else:
        ## write the sentence to excel
But when I check the newly created Excel file, it still contains the sentences that contain 'student' or 'study'.
How can the regular expression be modified to get the desired result?
There are 2 things here:
1) Use re.search (re.match only searches at the string start)
2) The regex should be regEX=re.compile(r"\b(?:{})\b".format('|'.join([r'student',r'study(?:ing)?'])),re.I)
The [ing] is a character class that matches exactly one character, either i, n or g, while you intended to match an optional ing ending. A non-capturing group with a ? quantifier, (?:ing)?, matches one or zero occurrences of ing.
Also, \b(x|y)\b is a more efficient pattern than \bx\b|\by\b, as it involves fewer backtracking steps.
Here is just a demo of what this regex looks like:
import re
pat = r"\b(?:{})\b".format('|'.join([r'student',r'study(?:ing)?']))
print(pat)
# => \b(?:student|study(?:ing)?)\b
regEX=re.compile(pat,re.I)
s = "He is studying here."
mObj = regEX.search(s)
if mObj:
    print(mObj.group(0))
# => studying
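Putting the two fixes together, the filtering step might look like the sketch below. It works on a plain list of sentences, because iterating directly over a pandas DataFrame yields column labels, not cell values; with pandas one would loop over a single column instead. The sample sentences are made up.

```python
import re

SKIP = re.compile(r"\b(?:student|study(?:ing)?)\b", re.I)

sentences = [
    "I am studying in 'x' institution",
    "I am a student in 'y' college",
    "I work at 'z' company",  # made-up example of a sentence to keep
]

# search() looks anywhere in the sentence; match() would only test its start.
kept = [s for s in sentences if not SKIP.search(s)]
print(kept)  # ["I work at 'z' company"]
```

The kept list is what would then be written to the new Excel file.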

Replacing variable text in between two known elements

s = """Comment=This is a comment
Name=Frank J. Lapidus
GenericName=Some name"""
replace_name = "Dr. Jack Shephard"
I have some text in a file and have been trying to figure out how to search and replace a line so Name=Frank J. Lapidus becomes Name=Dr. Jack Shephard
How could I do this in Python? Edited: (BTW, the second element would be a \n just in case you were wondering).
Thanks.
Use str.replace (documented under http://docs.python.org/library/stdtypes.html#string-methods):
>>> s = """Comment=This is a comment
... Name=Frank J. Lapidus
... GenericName=Some name"""
>>> replace_name = "Dr. Jack Shephard"
>>> s.replace("Frank J. Lapidus", replace_name)
'Comment=This is a comment\nName=Dr. Jack Shephard\nGenericName=Some name'
You could use the regular expression functions from the re module. For example like this:
import re
pattern = re.compile(r"^Name=(.*)$", flags=re.MULTILINE)
re.sub(pattern, "Name=%s" % replace_name, s)
(The re.MULTILINE option makes ^ and $ match the beginning and the end of a line, respectively, in addition to the beginning and the end of the string.)
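The effect of the flag can be checked by running the same substitution with and without it. A small sketch using the string from the question:

```python
import re

s = "Comment=This is a comment\nName=Frank J. Lapidus\nGenericName=Some name"
replacement = "Name=Dr. Jack Shephard"

# Without the flag, ^ matches only at the very start of the string ("Comment=...").
without_flag = re.sub(r"^Name=.*$", replacement, s)

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
with_flag = re.sub(r"^Name=.*$", replacement, s, flags=re.MULTILINE)

print(without_flag == s)  # True: nothing was replaced
print(with_flag)
```

Because the pattern is anchored with ^, the GenericName line is left alone even though it contains the text Name=.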
Edited to add: Based on your comments to Emil's answer, it seems you are manipulating Desktop Entry files. Their syntax seems to be quite close to that used by the ConfigParser module (perhaps some differences in the case-sensitivity of section names, and the expectation that comments should be preserved across a parse/serialize cycle).
An example:
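import configparser  # the module is named ConfigParser on Python 2
parser = configparser.RawConfigParser()
parser.optionxform = str # make option names case sensitive
parser.read("/etc/skel/examples.desktop")
parser.set("Desktop Entry", "Name", replace_name)
with open("modified.desktop", "w") as f:
    # keep the key=value form without spaces around "="
    parser.write(f, space_around_delimiters=False)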
As an alternative to the regular expression solution (Jukka's), if you're looking to do many of these replacements and the entire file is structured in this way, convert the entire file into a dictionary and then write it back out again after some replacements:
d = dict(x.split("=") for x in s.splitlines() if x.count("=") == 1)
d["Name"] = replace_name
new_string = "\n".join(x+"="+y for x,y in d.items())
Caveats:
First, this only works if there are no '=' signs in your field names or values (it ignores lines that don't have exactly one = sign).
Second, on Python versions before 3.7 converting to a dict and back will not preserve the order of the fields (from 3.7 on, a dict keeps insertion order), although you can at least sort the dictionary with some additional work.
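If values may contain '=' signs, or lines without '=' have to survive unchanged, a line-by-line variant avoids both caveats. This is only a sketch of that alternative, not the dictionary approach itself; set_field and the sample Comment value are made up for illustration.

```python
def set_field(text, key, value):
    """Hypothetical helper: replace the value of `key` in KEY=VALUE lines."""
    out = []
    for line in text.splitlines():
        # partition splits at the first "=" only, so values may contain "=".
        name, sep, _old_value = line.partition("=")
        if sep and name == key:
            line = name + "=" + value
        out.append(line)  # all other lines are kept as they are, in order
    return "\n".join(out)

s = "Comment=a=b\nName=Frank J. Lapidus\nGenericName=Some name"
print(set_field(s, "Name", "Dr. Jack Shephard"))
```

Since each line is compared by its full key, GenericName is not affected by a replacement of Name.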
