Search pattern not unique? - Regular expression - python

I want to write a function to clean the index column of the dataframe.
1. Delete the whole row that has a high-level ID. For example, delete
East Kootenay (5901) 01010
2. Trim the index to the 7-digit number for low-level IDs. For example, turn
East Kootenay A (5901017) RDA 02020
into 5901017
3. If it has two sets of parentheses, keep only the 7-digit number in the second set. For example,
Sechelt (Part) (5929803) IGD 02020 to 5929803
Capital H (Part 1) (5917054) RDA 01020 to 5917054
Capital H (Part 2) (5917056) RDA 02030 to 5917056
T'Sou-ke 1 (Sooke 1) (5917817) IRI 01010 to 5917817
T'Sou-ke 2 (Sooke 2) (5917818) IRI 00000 to 5917818
An example of code that only works for one set of brackets is
import re
import pandas as pd
def extract_id(s):
    m = re.search('\((.*)\)', s)
    if m:
        i = int(m.group(0)[1:-1])
        return i
if __name__ == '__main__':
    # Read data
    census_subdivision_profile = pd.read_excel('../data/census_subdivision_profile.xlsx', sheetname='Data',
                                               index_col='Geography', encoding='utf-8').T
    print(census_subdivision_profile.head())
    print(census_subdivision_profile.shape)
    census_subdivision_profile.index = census_subdivision_profile.index.map(extract_id)
    print(census_subdivision_profile.index)
To see the full code, see another question I posted earlier
Merge dataframes that have indices that one contains another (but not the same)

I think you intended '\(([^)]*)\)' ... hth

I don't understand the distinction between points 2 and 3. In both cases you're just wanting to extract the 7 digit number in brackets? In that case I'd be more explicit with the regex, like \((\d{7})\)
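The explicit pattern suggested in the comment above can be sketched as follows. This is a minimal illustration, not the asker's final code: the constant name ID_PATTERN is made up here, and returning None for labels without a 7-digit ID is an assumption about how high-level rows should be flagged.

```python
import re

# Matches exactly seven digits enclosed in parentheses, e.g. "(5901017)".
# ID_PATTERN is a hypothetical name chosen for this sketch.
ID_PATTERN = re.compile(r"\((\d{7})\)")

def extract_id(label):
    """Return the 7-digit ID as an int, or None if the label has none."""
    m = ID_PATTERN.search(label)
    return int(m.group(1)) if m else None

labels = [
    "East Kootenay (5901) 01010",              # high-level ID: no 7-digit group
    "East Kootenay A (5901017) RDA 02020",
    "Capital H (Part 1) (5917054) RDA 01020",  # first parenthesis is ignored
    "T'Sou-ke 1 (Sooke 1) (5917817) IRI 01010",
]
ids = [extract_id(label) for label in labels]
```

Because "(Part 1)" and "(5901)" do not contain seven digits, they never match, so points 2 and 3 collapse into one rule. Rows whose result is None are the high-level ones and could then be dropped from the dataframe.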

Related

Need to filter some strings elements but I get TypeError: unsupported operand type(s) for |: 'str' and 'str'

So I'm using pandas to filter a csv and I need to filter on three different string values in a column, but when I use the or operator (|) I get that error. Is there another way to filter on several strings without defining a separate variable for each filter? This is the code:
# What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
bdegree = df[(df["education"] == "Bachelors") & (df["salary"] >= "50K")].count()
mdegree = df[(df["education"] == "Masters") & (df["salary"] >= "50K")].count()
phddegree = df[(df["education"] == "Doctorate") & (df["salary"] >= "50K")].count()
all_degrees = bdegree + mdegree + phddegree
print(all_degrees)
percentaje_of_more50 = (all_degrees / df["education" == "Bachelors"|"Masters"|"Doctorate"].count())*100
print("The percentaje of people with bla bla bla is", percentaje_of_more50["education"].round(1))
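By the way, I am still working on a logic error in this code, so just ignore that :).
The TypeError comes from operator precedence: | binds more tightly than ==, so Python tries to evaluate "Bachelors"|"Masters" first, and | is not defined for two strings. Even with a single string on the right-hand side, == looks for an exact match, and since no one's "education" is literally "Bachelors|Masters|Doctorate", the comparison would return a Series of all False.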
You can use isin instead like:
msk = df["education"].isin(["Bachelors","Masters","Doctorate"])
The above will return a boolean Series, so using the .count method on it will just show the length of it, which is probably not something you want. So you need to use it to filter the relevant rows:
df[msk].count()
Then you can write percentage_of_more50 as:
percentage_of_more50 = (all_degrees / df[msk].count())*100
Note that you can also derive all_degrees using isin as well:
all_degrees = df[df["education"].isin(["Bachelors","Masters","Doctorate"]) & (df['salary']>='50K')].count()
Also, df["salary"] >= "50K" compares strings character by character rather than numerically, so it works as intended only while every salary has the same number of digits (that is, stays below "100K"). For example, "100K" > "50K" evaluates to False, because "1" sorts before "5". One way around this is to left-pad the "salary" column with "0"s until every entry has the same number of characters, using str.zfill:
df['salary'] = df['salary'].str.zfill(5)
Then each entry becomes 5 characters long. For example,
s = pd.Series(['100k','50k']).str.zfill(5)
becomes:
0 0100k
1 0050k
dtype: object
Then you can make the correct comparison.
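Putting the two ideas together, a minimal sketch might look like this. The five rows of data are invented for illustration, and the threshold is padded with zfill as well so that both sides of the comparison have the same length (an assumption about how the comparison would be written).

```python
import pandas as pd

# Invented sample data, only to illustrate the technique.
df = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "Doctorate", "Bachelors"],
    "salary": ["100K", "50K", "60K", "40K", "9K"],
})

# One boolean mask for all three degrees instead of chaining == with |.
msk = df["education"].isin(["Bachelors", "Masters", "Doctorate"])

# Left-pad to a fixed width so string comparison agrees with numeric order.
padded = df["salary"].str.zfill(5)
high = padded >= "50K".zfill(5)

# Share of advanced-degree holders at or above the threshold.
pct = (msk & high).sum() / msk.sum() * 100
```

Summing a boolean Series counts its True values, which avoids the per-column output that .count() on a filtered DataFrame produces.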

pandas and "re" - search for total and partial strings

This is an extended question from this topic. I would like to search strings for both whole and partial matches, using keywords like the following Series "w":
rigour*
*demeanour*
centre*
*arbour
fulfil
This obviously means that I wanted to search for words like rigour and rigours, endemeanour and demeanours, centre and centres, harbour and arbour, and fulfil. So the keywords list I have is a mix of complete and partial strings to find. I would like to apply the search on this DataFrame "df":
ID;name
01;rigour
02;rigours
03;endemeanour
04;endemeanours
05;centre
06;centres
07;encentre
08;fulfil
09;fulfill
10;harbour
11;arbour
12;harbours
What I tried so far is the following:
r = re.compile(r'.*({}).*'.format('|'.join(w.values)), re.IGNORECASE)
then I've built a mask to filter the DataFrame:
mask = [m.group(1) if m else None for m in map(r.search, df['Tweet'])]
in order to get a new column with the Keyword found:
df['keyword'] = mask
What I'm expecting is the following resulting DataFrame:
ID;name;keyword
01;rigour;rigour
02;rigours;rigour
03;endemeanour;demeanour
04;endemeanours;demeanour
05;centre;centre
06;centres;centre
07;encentre;None
08;fulfil;fulfil
09;fulfill;None
10;harbour;arbour
11;arbour;arbour
12;harbours;None
This works when the w list contains no *. However, I ran into several issues when formatting the keyword list w with the * conditions so that re.compile runs correctly.
Any help would be really appreciated.
It looks like your input series w needs to be adjusted to be used as regex pattern like this:
rigour.*
.*demeanour.*
centre.*
\\b.*arbour\\b
\\bfulfil\\b
Note that * in a regex must follow something; it does not work on its own. It means that whatever it follows can be repeated 0 or more times.
Note also that fulfil is a part of fulfill, so if you want a strict match you need to tell the regex this. For example, the word boundary \b makes the pattern match the string only as a whole word.
Here is what your regex might look like to give you the results you need:
s = '({})'.format('|'.join(w.values))
r = re.compile(s, re.IGNORECASE)
r
re.compile(r'(rigour.*|.*demeanour.*|centre.*|\b.*arbour\b|\bfulfil\b)', re.IGNORECASE)
And your code to have the replacement could be done with pandas .where method like this:
df['keyword'] = df.name.where(df.name.str.match(r), None)
df
ID name keyword
0 1 rigour rigour
1 2 rigours rigours
2 3 endemeanour endemeanour
3 4 endemeanours endemeanours
4 5 centre centre
5 6 centres centres
6 7 encentre None
7 8 fulfil fulfil
8 9 fulfill None
9 10 harbour harbour
10 11 arbour arbour
11 12 harbours None
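If the keyword column should hold the bare keyword (as in the expected output in the question) rather than the full matched word, one possible variation is to translate each wildcard keyword into its own anchored pattern and test them one by one. This is a sketch under assumptions: the helper names wildcard_to_regex and find_keyword are made up, plain lists stand in for the Series and the DataFrame column, and * is taken to mean "zero or more word characters".

```python
import re

keywords = ["rigour*", "*demeanour*", "centre*", "*arbour", "fulfil"]
names = ["rigour", "rigours", "endemeanour", "endemeanours", "centre",
         "centres", "encentre", "fulfil", "fulfill", "harbour", "arbour",
         "harbours"]

def wildcard_to_regex(keyword):
    """Turn 'centre*' into a compiled pattern equivalent to centre\\w*."""
    parts = [re.escape(part) for part in keyword.split("*")]
    return re.compile(r"\w*".join(parts), re.IGNORECASE)

# Pair each bare keyword with its compiled pattern.
patterns = [(kw.strip("*"), wildcard_to_regex(kw)) for kw in keywords]

def find_keyword(name):
    """Return the first bare keyword whose pattern covers the whole name."""
    for bare_keyword, pattern in patterns:
        if pattern.fullmatch(name):
            return bare_keyword
    return None

found = [find_keyword(name) for name in names]
```

With a DataFrame, the same function could be applied with df['name'].map(find_keyword). Using fullmatch means no explicit \b or anchors are needed in the keyword list.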

Finding exon/ intron borders in a gene

I would like to go through a gene and get a list of 10bp long sequences containing the exon/intron borders from each feature.type =='mRNA'. It seems like I need to use compoundLocation, and the locations used in 'join' but I can not figure out how to do it, or find a tutorial.
Could anyone please give me an example or point me to a tutorial?
Assuming all the info is in the exact format you show in the comment, and that you're looking for 20 bp on either side of each intron/exon boundary, something like this might be a start:
Edit: If you're actually starting from a GenBank record, then it's not much harder. Assuming that the full junction string you're looking for is in the CDS feature info, then:
for f in record.features:
    if f.type == 'CDS':
        jct_info = str(f.location)
converts the "location" information into a string and you can continue as below.
(There are ways to work directly with the location information without converting to a string - in particular you can use "extract" to pull the spliced sequence directly out of the parent sequence -- but the steps involved in what you want to do are faster and more easily done by converting to str and then int.)
import re
jct_info = "join{[0:229](+), [11680:11768](+), [11871:12135](+), [15277:15339](+), [16136:16416](+), [17220:17471](+), [17547:17671](+)"
jctP = re.compile(r"\[\d+:\d+\]")
jcts = jctP.findall(jct_info)
jcts
['[0:229]', '[11680:11768]', '[11871:12135]', '[15277:15339]', '[16136:16416]', '[17220:17471]', '[17547:17671]']
Now you can loop through the list of start:end values, pull them out of the text and convert them to ints so that you can use them as sequence indexes. Something like this:
# seq is assumed to be the parent sequence (e.g. record.seq)
for jct in jcts:
    start, end = (int(x) for x in jct.strip('[]').split(':'))
    # Slicing never raises IndexError, but a negative lower bound would count
    # from the end of the sequence, so clamp it at 0 (e.g. where start = 0).
    start_20_20 = seq[max(0, start - 20):start + 20]
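The parsing and slicing steps can be combined into one helper. The sketch below is standard-library only and rests on assumptions: the function name junction_windows and the flank parameter are invented here, the location string is shortened, and a repeated "ACGT" string stands in for the real parent sequence.

```python
import re

# Shortened location string in the same format as above.
jct_info = "join{[0:229](+), [11680:11768](+), [11871:12135](+)}"
# Placeholder sequence; in practice this would be the parent sequence.
seq = "ACGT" * 4000

def junction_windows(location_text, sequence, flank=20):
    """Return (position, window) pairs around every start and end coordinate."""
    windows = []
    for start, end in re.findall(r"\[(\d+):(\d+)\]", location_text):
        for pos in (int(start), int(end)):
            lo = max(0, pos - flank)              # avoid negative indices
            hi = min(len(sequence), pos + flank)  # stay inside the sequence
            windows.append((pos, sequence[lo:hi]))
    return windows

windows = junction_windows(jct_info, seq)
```

Note that the first start and the last end of a join are the ends of the feature rather than exon/intron borders, so those two windows may need to be dropped depending on what is wanted.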

Python Basics – slicing a long string and combine the sliced in wanted pieces

Environment: Win 7; Python 2.76
Hello all…I need to pick up some texts from a string, which looks like:
“C-603WallWizard45256CCCylinders:2HorizontalOpposedBore:1-1/4Stroke:1-1/8Length: SingleVerticalBore:1-111Height:6Width:K-720Cooling:AirWeight:6LBS1.5H.P.#54500RPMC-60150ccGasEngineCylinder:4VerticalInlineBore:1Stroke:1Cycle:4Weight:6-1/2LBSLength:10Width: :AirLength16Cooling:AirLength:5Width:4L-233Height:6Weight: 4TheBlackKnightc-609SteamEngineBore:11/16Stroke:11/16Length:3Width:3Height:4TheChallengerC-600Bore:1Stroke:1P-305Weight:18LBSLength:12Width:7Height:8C-606Wall15ccGasEngineJ-142Cylinder:SingleVerticalBore:1Stroke:1-1/8Cooling:1Stroke:1-1/4HP:: /4Stroke:1-7/:6Width:6Height:9Weight:4LBS1.75H.P.#65200RPM”
The wanted are:
I. Combinations of 1 letter + 3 numbers, joint by ‘-’. Such as: C-603, K-720, C-606 etc
II. Combinations of 5 continuous numbers. Such as: 45256, 54500, 60150, 65200 etc
My idea is to:
slice the string into individual characters, like ‘C’, ‘-’, ‘6’, ‘0’, ‘3’, … ‘R’, ‘P’, ‘M’
combine them into 4 digits and 5 digits, like ‘C-60’, ‘-603’, ‘603W’… and ‘C-603W’, ‘-603W’ , ‘603Wa’
pick out the ones that fit criteria I and II
Does this sound like a workable approach? If so, which commands can I use in the process?
Thanks.
Going with regular expressions is one way to do it:
>>> data = '''C-603WallWizard45256CCCylinders:2HorizontalOpposedBore:1-1/4Stroke:1-1/8Length: SingleVerticalBore:1-111Height:6Width:K-720Cooling:AirWeight:6LBS1.5H.P.#54500RPMC-60150ccGasEngineCylinder:4VerticalInlineBore:1Stroke:1Cycle:4Weight:6-1/2LBSLength:10Width: :AirLength16Cooling:AirLength:5Width:4L-233Height:6Weight: 4TheBlackKnightc-609SteamEngineBore:11/16Stroke:11/16Length:3Width:3Height:4TheChallengerC-600Bore:1Stroke:1P-305Weight:18LBSLength:12Width:7Height:8C-606Wall15ccGasEngineJ-142Cylinder:SingleVerticalBore:1Stroke:1-1/8Cooling:1Stroke:1-1/4HP:: /4Stroke:1-7/:6Width:6Height:9Weight:4LBS1.75H.P.#65200RPM'''
>>> import re
>>> one_letter_three_numbers = re.compile(r'[a-z]-\d{3}', re.IGNORECASE)
>>> re.findall(one_letter_three_numbers, data)
['C-603', 'K-720', 'C-601', 'L-233', 'c-609', 'C-600', 'P-305', 'C-606', 'J-142']
>>> five_continuous = re.compile(r'\d{5}', re.IGNORECASE)
>>> re.findall(five_continuous, data)
['45256', '54500', '60150', '65200']
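One caveat with \d{5} is that it also matches the first five digits of any longer run. If only runs of exactly five digits are wanted, lookarounds can enforce that. The sample string below is a shortened, partly invented excerpt (the trailing 1234567 is added only to show the difference).

```python
import re

# Shortened sample; the final 7-digit run is invented to show the edge case.
sample = "C-603WallWizard45256CC#54500RPMC-60150cc1234567"

# Plain \d{5} also picks up the start of the longer run.
loose = re.findall(r"\d{5}", sample)

# (?<!\d) and (?!\d) require that no digit sits directly before or after.
exact_five = re.compile(r"(?<!\d)\d{5}(?!\d)")
strict = exact_five.findall(sample)
```

Here loose ends with '12345', while strict leaves the 7-digit run out entirely.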

2 different blocks of text are merging together. Can I separate them if i know what 1 is?

I've used a number of pdf-->text methods to extract text from pdf documents. For one particular type of PDF I have, neither pyPDF or pdfMiner are doing a good job extracting the text. However, http://www.convertpdftotext.net/ does it (almost) perfectly.
I discovered that the pdf I'm using has some transparent text in it, and it is getting merged into the other text.
Some examples of the blocks of text I get back are:
12324 35th Ed. 01-MAR-12 Last LNM: 14/12 NAD 83 14/12 Corrective Object of Corrective
ChartTitle: Intracoastal Waterway Sandy Hook to Little Egg Harbor Position
C HAActRionT N Y -NJ - S A N D Y H OO K ATcO tionLI T TLE EGG HARBOR. Page/Side: N/A
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true.
Bearings RoEf LlighOCtAT seEc tors aSrehre towwsbuardry th Re ivligher Ct fhroanmn seel Lawighartde.d B Theuoy 5no minal range of lights is expressedf roin mna 4u0tic-24al -mi46les.56 0(NNM ) unless othe0r7w4is-00e n-o05te.d8.8 0 W
to 40-24-48.585N 074-00-05.967W
and
12352 33rd Ed. 01-MAR-11 Last LNM: 03/12 NAD 83 04/12 . . l . . . . Corrective Object of Corrective ChartTitle: Shinnecock Bay to East Rockaway Inlet Position C HAActRionT S H IN N E C OC K B A Y TO AcEtionAS T ROCKAWAY INLET. Page/Side: N/A (Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true. (BTeeamringp) s DoEf LlighETtE s ectors aSretat toew Baoratd Ctheh anlighnet lf Droaym beseacoawanr 3d. The nominal range of lights is expressedf roin mna 4u0tic-37al -mi11les.52 0(NNM ) unless othe0r7w3is-29e n-5o3te.d76. 0 W
and I have discovered that the "ghost text" is ALWAYS the following:
Corrective Object of Corrective Position
Action Action
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are given in degrees clockwise from 000 true.
Bearings of light sectors are toward the light from seaward. The nominal range of lights is expressed in nautical miles (NM) unless otherwise noted.
In the 2nd example I posted, the text I want (with the ghost text removed) is:
12352 33rd Ed. 01-Mar-11 Last LNM:03/12 NAD 83 04/12
Chart Title:Shinnecock Bay to East Rockaway Inlet. Page/Side:N/A
CGD01
(Temp) DELETE State Boat Channel Daybeacon 3 from 40-37-11.520N 073-29-53.760W
This problem occurs just once per document, and does not appear to be totally consistent (as seen above). I am wondering if one of you wizards could think of a way to remove the ghosted text (I don't need/want it) using python. If I had been using pyPDF, I would have used a regex to rip it out during the conversion to text. Unfortunately, since I'm starting out with a text file from the website listed above, the damage has already been done. I'm at a bit of a loss.
Thanks for reading.
EDIT:
The solution to this problem looks like it will be more complex than the rest of the application, so I'm going to withdraw my request for help.
I very much appreciate the thought put into it by those who have contributed.
Given that the ghost text can be split up in seemingly unpredictable ways, I don't think there is a simple automatic way of removing it that would not have false positives. What you need is almost human-level pattern recognition. :-)
What you could try is exploiting the format of these kinds of messages. Roughly:
<number> <number>[rn]d Ed. <date> Last LNM:<mm>/<yy> NAD <date2>
Chart Title:<text>. Page/Side:<N/A or number(s)> CGD<number> <text>
<position>
Using this you could pluck out the nonsense from the predictable elements, and then if you have a list of chart names ('Shinnecock Bay to East Rockaway Inlet') and descriptive words (like 'State', 'Boat', 'Daybeacon') you might be able to reconstruct the original words by finding the smallest Levenshtein distance between mangled words in the two text blocks and those in your word lists.
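The word-list idea can be sketched with a small pure-Python Levenshtein distance. The function names and the tiny vocabulary are assumptions made for illustration; a real word list would come from the chart names and descriptive terms mentioned above.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def closest_word(mangled, vocabulary):
    """Return the vocabulary entry with the smallest distance to mangled."""
    return min(vocabulary, key=lambda word: levenshtein(mangled, word))

# Hypothetical vocabulary and a lightly mangled word.
vocabulary = ["State", "Boat", "Channel", "Daybeacon"]
guess = closest_word("Daybeacn", vocabulary)
```

This only helps once the mangled text has been split into candidate words, which is the hard part for heavily interleaved text.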
If you can install the poppler software, you could try and use pdftotext with the -layout option to keep the formatting from the original PDF as much as possible. That might make your problem disappear.
You could recursively find all possible ways that your Pattern
"Corrective Object of Corrective Position Action ..." can be contained within your mangled text,
Then you can unmangle the text for each of these possible paths, run some sort of spellcheck over them, and choose the one with the fewest spelling mistakes. Or since you know roughly where each substring should appear, you can use that as a heuristic.
Or you could simply use the first path.
Some code for this (a sketch; the number of paths can grow very quickly for long patterns):
def findPaths(mangledText, pattern, path, start=0):
    if len(pattern) == 0:  # end of pattern
        return [path]
    nextLetter = pattern[0]
    # all indices at or after start in mangledText that contain nextLetter
    locations = [i for i in range(start, len(mangledText)) if mangledText[i] == nextLetter]
    allPaths = []
    for loc in locations:
        # continue after loc; indices stay relative to the full text
        paths = findPaths(mangledText, pattern[1:], path + (loc,), loc + 1)
        allPaths.extend(paths)
    return allPaths  # if no locations for the next letter exist, allPaths will be empty
Then you can call it like this (optionally remove all spaces from your search pattern, unless you are certain they are all included in the mangled text)
allPossiblePaths = findPaths ( YourMangledText, "Corrective Object...", () )
then allPossiblePaths should contain a list of all possible ways your pattern could be contained in your mangled text.
Each entry is a tuple with the same length as the pattern, containing the index at which the corresponding letter of the pattern occurs in the search text.
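The "simply use the first path" option does not need the full enumeration: a single left-to-right pass that drops each ghost letter at its first possible position gives the same result. The sketch below assumes the ghost text appears in order and case-sensitively; the name strip_ghost is made up, and the greedy choice can remove the wrong letters when the real text shares letters with the ghost text.

```python
def strip_ghost(mangled_text, ghost_text):
    """Greedily remove ghost_text (ignoring its spaces) from mangled_text.

    Returns the cleaned text and whether the whole ghost text was found.
    """
    needed = [c for c in ghost_text if not c.isspace()]
    kept = []
    i = 0  # index of the next ghost letter still to be removed
    for ch in mangled_text:
        if i < len(needed) and ch == needed[i]:
            i += 1            # treat this character as ghost text
        else:
            kept.append(ch)   # keep it as real text
    return "".join(kept), i == len(needed)

# Fragment taken from the sample text in the question.
cleaned, complete = strip_ghost("C HAActRionT", "Action")
```

The second return value makes it possible to detect documents where the ghost text could not be matched completely and to fall back to another strategy.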
