Get only the first two numbers from a numeric value in Python. How? - python

I have output from the code below, and from that output I need only the major and minor parts of each value. Is there any way to get only the first two numbers (1.1) rather than the full value (1.1.73.4)?
for version in issue["fields"]["fixVersions"]:
    cacheData = json.dumps(version)
    jsonToPython = json.loads(cacheData)
    #lines = jsonToPython.items()
    if jsonToPython['name'][:8] == "Ciagana ":
        matches = re.findall(r"\d+\.\d+\.\d+\.\d+", jsonToPython['name'])
        print matches[0]
Below is the output of the code currently:
Retrieving list of issues
Processing CTPT-2
1.1.73.4
1.1.90.0
Processing CTPT-1
1.5.73.4
Below is the desired output
Retrieving list of issues
Processing CTPT-2
1.1
1.1
Processing CTPT-1
1.5

Regex would work, or a simple split:
short_version_string = '.'.join(version_string.split('.')[:2])
Or this, though it only works in Python 3:
major, minor, *_ = version_string.split('.')
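A minimal sketch of the split approach as a reusable function. The helper name `major_minor` and the sample strings are assumptions for illustration, not part of the question's code.

```python
def major_minor(version_string):
    """Return the first two dot-separated components, e.g. '1.1.73.4' -> '1.1'."""
    return '.'.join(version_string.split('.')[:2])


# Hypothetical sample values shaped like the question's output.
for full_version in ['1.1.73.4', '1.1.90.0', '1.5.73.4']:
    print(major_minor(full_version))
```

Because slicing never raises on short lists, a string with fewer than two components is returned unchanged rather than causing an error.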

Another way, by modifying your regex pattern to have a look-ahead for another period:
text = ['4.4.73.4', '4.4.90.0', '4.5.73.4']
for version in text:
    matches = re.findall(r"\d+\.\d+(?=\.)", version)
    print matches[0]
#4.4
#4.4
#4.5
The pattern is:
\d+\.\d+: One or more digits, followed by a period, followed by one or more digits
(?=\.): A look-ahead that requires another period to follow, without including it in the match
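Applied to names like the ones in the question, the same look-ahead could be sketched as follows. The sample names are assumptions modelled on the "Ciagana " prefix check in the question, and `short_version` is a hypothetical helper. `re.search` is used instead of `re.findall` so that a name without a version yields `None` rather than an `IndexError`.

```python
import re

# One or more digits, a period, one or more digits, then (not consumed) another period.
MAJOR_MINOR = re.compile(r"\d+\.\d+(?=\.)")


def short_version(name):
    """Return 'major.minor' from a name, or None if no dotted version is found."""
    match = MAJOR_MINOR.search(name)
    return match.group(0) if match else None


# Hypothetical fixVersion names.
names = ["Ciagana 1.1.73.4", "Ciagana 1.5.73.4", "Ciagana unversioned"]
print([short_version(n) for n in names])
```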

Related

Find values using regex (includes brackets)

It's my first time with regex and I have some issues, which I hope you can help me resolve. Here is an example of the data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
What I want to retrieve is the date, which is in the last pair of brackets, and the corresponding description. I have no idea how to do it with one regex, so I decided to split it into two.
First part:
I retrieve the value after description: with the following pattern: [\n\r].*description:\s*([^\n\r]*) The result includes the quotes ("9710"), which is fine and needs no changes.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. So far I have only managed to get the values inside any brackets, using the pattern \(([^)]*)\) As a result, it picks up all 3 bracketed groups in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for regex, which would allow me to construct a pattern that retrieves the data in brackets after a specific text.
I could, of course, pick every 3rd result, but unfortunately that doesn't work for the whole dataset.
Does anyone know how to resolve the second part?
Thanks in advance.
You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
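Putting the steps above together, a self-contained sketch might look like this. The sample text is the question's data; escaping the dot in `newDate\.setFullYear` and converting the parts to integers are additions made here, not part of the original answer.

```python
import re

# Sample text copied from the question.
text = """chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );"""

# re.DOTALL lets '.' match newlines, so '.*' can span the lines between the two parts.
pattern = r'description: "([^"]+)".*newDate\.setFullYear\((.*)\);'
res = re.search(pattern, text, re.DOTALL)

description, raw_date = res.groups()
# int() ignores the surrounding whitespace and newlines in each part.
year, month, day = (int(part) for part in raw_date.split(","))
print(description, year, month, day)
```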
import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)
The AND that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the value in group 1 you could match everything that comes before the values you want to capture in group 2.
What you could do is repeatedly match all following lines that do not start with newDate.setFullYear(
Then, when you do encounter that text, match it and capture in group 2 all characters except parentheses.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2, using the regex module from PyPI (the \G anchor is not supported by the built-in re module):
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)

Regular expression not finding all the results

I am trying to clean up text from an actual English dictionary as my source. I have already written a python program which loads the data from a .txt file into a SQL DB in four different columns - id, word, definition. In the next step though, I am trying to define what 'type' of word it is by fetching from the definition of the word strings like n. for noun, adj. for adjective, adv. for adverb, so on and so forth.
Now, using the following regex I am trying to extract all words that end with a '.' like adv./abbr./n./adj. etc. and get a histogram of all such words to see what all the different types can be. Here my assumption is that such words will obviously be more frequent than normal words which end with '.' but even then I plan to check the top results manually to confirm. Here's my code:
for row in cur:
    temp_var = re.findall('\w+[.]+ ',split)
    if len(temp_var) >=1 :
        temp_var = temp_var.pop()
        typ_dict[temp_var] = typ_dict.get(temp_var,0) + 1
for key in typ_dict:
    if typ_dict[key] > 50:
        print(key, typ_dict[key])
After running this code I am not getting the desired result: my counts are way lower than what appears in the definitions. I have tested the word 'Abbr.', which this code shows occurring 125 times, but if you change the regex '\w+[.]+ ' to 'Abbr. ' the result shoots up to 186. I am not sure why my regex is not capturing all the occurrences.
Any idea as to why I am not getting all the matches?
Edit:
Here is the type of text I am working with
Aback - adv. take aback surprise, disconcert. [old english: related to *a2]
Abm - abbr. Anti-ballistic missile
Abnormal - adj. Deviating from the norm; exceptional. abnormality n. (pl. -ies). Abnormally adv. [french: related to *anomalous]
Each entry is broken down into two parts, the word and the rest (the definition), and loaded into a SQL table.
If you are using a dictionary to count items, then the best variant of a dictionary to use is Counter from the collections package. But you have another problem with your code. You check temp_var for length >= 1 but then you only do one pop operation. What happens when findall returns multiple items? You also do temp_var = temp_var.pop(), which would prevent you from popping more items even if you wanted to. So the result is that only the last match in each row is counted.
from collections import Counter
counters = Counter()
for row in cur:
    temp_var = re.findall(r'\w+[.]+ ',split)
    for x in temp_var:
        counters[x] += 1
for key in counters:
    if counters[key] > 50:
        print(key, counters[key])
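To see the effect of counting every match rather than only the last one, here is a small sketch that runs without a database. The `definitions` list is an assumption: it stands in for the rows returned by the cursor and reuses the sample definitions from the question.

```python
import re
from collections import Counter

# Hypothetical rows: definitions copied from the sample text in the question.
definitions = [
    "adv. take aback surprise, disconcert. [old english: related to *a2]",
    "abbr. Anti-ballistic missile",
    "adj. Deviating from the norm; exceptional. abnormality n. (pl. -ies). "
    "Abnormally adv. [french: related to *anomalous]",
]

counters = Counter()
for definition in definitions:
    # Counter.update counts every match in the row, not only the last one.
    counters.update(re.findall(r"\w+\.+ ", definition))

print(counters.most_common())
```

With the original `pop()` logic, the first row would contribute only "disconcert. " and the leading "adv. " would be lost, which is the kind of undercount described in the question.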

Regex to find alpha&numeric words in a text

I've got this text:
due to previous assess c6c587469 and 4ec0f198
nearest and with fill station in the citi
becaus of our satisfact in the d4a29a already
averaging my thoughts on e977f33588f react to
and I want to remove all "alpha&numeric" words
In output, I want
due to previous assess and
nearest and with fill station in the citi
becaus of our satisfact in the already
averaging my thoughts on react to
I tried this, but it doesn't work..
df_colum = df_colum.str.replace('[^A-Za-z0-9\s]+', '')
Any regex expert ?
Thanks
Try using this regex:
df_colum = df_colum.str.replace(r'\w*\d\w*', '', regex=True)
Here's one way without regex:
def parser(x):
    return ' '.join([i for i in x.split() if not any(c.isdigit() for c in i)])
df['text'] = df['text'].apply(parser)
print(df)
text
0 due to previous assess and
1 nearest and with fill station in the citi
2 becaus of our satisfact in the already
3 averaging my thoughts on react to
This one should work:
df_colum = df_colum.str.replace(r'(?:[0-9][^ ]*[A-Za-z][^ ]*)|(?:[A-Za-z][^ ]*[0-9][^ ]*)', '', regex=True)
You can look for where a digit meets a letter \d[a-z] or [a-z]\d then match up to end:
(?i)\b(?:[a-z]+\d+|\d+[a-z]+)\w*\b *
(?i) Enables case-insensitivity
(?:...) Constructs a non-capturing group
\b Means a word boundary
Python code:
re.sub(r"(?i)\b(?:[a-z]+\d+|\d+[a-z]+)\w*\b *", "", text)
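As a rough check, the pattern above can be run against the sample lines from the question with the standard `re` module. The names `MIXED_WORD` and `cleaned` are assumptions for this sketch; `re.IGNORECASE` plays the role of the `(?i)` flag, and `strip()` removes the trailing space left when the removed word was last on the line.

```python
import re

# Sample lines copied from the question.
lines = [
    "due to previous assess c6c587469 and 4ec0f198",
    "nearest and with fill station in the citi",
    "becaus of our satisfact in the d4a29a already",
    "averaging my thoughts on e977f33588f react to",
]

# A word in which letters meet digits (either order), plus any trailing spaces.
MIXED_WORD = re.compile(r"\b(?:[a-z]+\d+|\d+[a-z]+)\w*\b *", re.IGNORECASE)

cleaned = [MIXED_WORD.sub("", line).strip() for line in lines]
for line in cleaned:
    print(line)
```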

Python RegEx for Australian Phone Numbers - False negative - 2 matches in the same substring

I am trying to extract phone numbers from a web page using Python & RegEx
Australian number format
+61 (international code - shown below as 'i')
02, 03, 07 or 08 (state codes - shown below as 's')
1234-5678 (8 digit local number - shown below as 'x')
Common variations of format (in order of commonality):
Format 1: ss xxxx xxxx (e.g. 02 1234 5678)
Format 2: +ii s xxxx xxxx (e.g. +61 2 1234 5678) (note the first 's' digit is removed here)
Format 3: (seen rarely) +ii (s)s xxxx-xxxx (e.g. +61 (0)2 1234 5678)
My RegEx:
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', sample_text))
works well on a simple sample_text:
sample_text = \
"610212345678ABC##610312345678ABC##610712345678ABC##610812345678ABC##0212345678ABC##0312345678ABC##0712345678ABC##0812345678ABC##61212345678ABC##61312345678ABC##61712345678ABC##61812345678ABC##0412345678ABC##61412345678ABC##130012345678ABC##180012345678ABC##"
Result:
['0212345678', '0312345678', '0712345678', '0812345678',
'0212345678', '0312345678', '0712345678', '0812345678',
'61212345678', '61312345678', '61712345678', '61812345678',
'0412345678', '61412345678', '1300123456', '1800123456']
The Goal
Using http://www.outware.com.au/contact as an example ...
The 2 actual numbers on the page are:
+61 (0)3 8684 9912 and +61 (0)2 8064 7043 (both numbers appear twice - once in the main section of the page and once in the footer)
The Problem
#take the visible text of the body element
b = driver.find_element_by_css_selector('body').text
#remove all non-word characters (punctuation and white space)
b = re.sub(r'\W+', '', b)
Result:
"PORTFOLIOINNOVATIONSERVICESCAREERSINSIGHTSNEWSABOUTCONTACTCONTACTOUTWAREMelbourneLe......AFRFast100Nov92017EXPLOREOUTWAREPortfolioInnovationWorkingatOutwareAboutSitemapCONNECTMELBOURNELevel3469LaTrobeStMelbourneVIC3000610386849912SYDNEYLevel41SmailStUltimoNSW2007610280647043"
Now if I apply my regex to this string
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', b))
Result:
[u'0386849912', u'0761028064', u'0386849912', u'0761028064']
I am getting a false positive because I have concatenated a postcode "NSW2007" onto the start of the phone number.
I presume because the regex has parsed the first part of "NSW2007610280647043" matching "0761028064" it doesn't then match "0280647043" which is also part of the same substring
I actually don't mind the false positive (i.e. getting "0761028064") but I do need to solve the false negative (i.e. not getting "0280647043")
I know there's some RegEx gurus here who can help on this. :-)
Please help!!
Don't search/replace any text prior to using the regex. That will make your input unusable. Try this:
(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}
https://regex101.com/r/1Q4HuD/3
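A short sketch of that pattern applied to unmodified text. The `page_text` string is an assumption: it imitates the page footer by combining the two numbers from the question with the postcodes that precede them, plus one example each of formats 1 and 2.

```python
import re

PHONE = re.compile(r"(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}")

# Hypothetical page text using the number formats listed in the question.
page_text = (
    "MELBOURNE VIC 3000 +61 (0)3 8684 9912 "
    "SYDNEY NSW 2007 +61 (0)2 8064 7043 "
    "Alt: 02 1234 5678 or +61 2 1234 5678"
)

print(PHONE.findall(page_text))
```

Because the spaces are still present, the postcode "2007" stays separate from the number that follows it, so the false positive from the question does not arise in this sample.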
It might help to add a negative look-ahead that makes sure the character following the match is not a digit: (?!\d).
This could create a problem, though, if the data following a phone number starts with a digit.
The look-ahead looks like this when added to your regex:
(02\d{8}|03\d{8}|07\d{8}|08\d{8}|612\d{8}|613\d{8}|617\d{8}|618\d{8}|04\d{8}|614\d{8}|1300\d{6}|1800\d{6})(?!\d)
(I removed the square brackets as you do not need them when trying to match a single character)
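To check the look-ahead version against the question's data, this sketch runs it on the tail of the stripped page text quoted in the question. The variable names are assumptions for illustration.

```python
import re

# The pattern from this answer: the question's alternatives plus (?!\d).
PHONE = re.compile(
    r"(02\d{8}|03\d{8}|07\d{8}|08\d{8}|612\d{8}|613\d{8}|617\d{8}|618\d{8}"
    r"|04\d{8}|614\d{8}|1300\d{6}|1800\d{6})(?!\d)"
)

# Tail of the stripped page text from the question.
stripped = "MelbourneVIC3000610386849912SYDNEYLevel41SmailStUltimoNSW2007610280647043"

print(PHONE.findall(stripped))
```

The candidate "0761028064" is rejected because a digit follows it, so the scan continues and reaches "0280647043" at the end of the string.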
This answer should be a comment, it isn't because of my low reputation!
I've seen you're updating the regex and I think this variation can help you. It should match very uncommon formats!
(\+61 )?(?:0|\(0\))?[2378] (?:[\s-]*\d){8}

How to use regular expression extract data not followed by something with pandas

I just want to extract the years, but not the number. How can I define not followed by XXX?
I made the following example, but the result always contains one character more than I expected.
text = ["hi2017", "322017"]
text = pd.Series(text)
myPat = "([^\d]\d{4})"
res = text.str.extract(myPat)
res
Then I get the result:
0 i2017
1 NaN
dtype: object
Actually, I just want to get "2017", but not "i2017", how can I do it?
PS. The "322017" should not be extracted, because it is not a year, but a number
Give this a try:
(?<!\d)(\d{4})(?!\d)
which returns 2017 and is based almost entirely on the comment by @PauloAlmeida
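A quick way to try this pattern on the two sample strings is with the standard `re` module, used here instead of pandas only to keep the sketch self-contained. The name `YEAR` is an assumption.

```python
import re

# Four digits with no digit immediately before or after them.
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")

for sample in ["hi2017", "322017"]:
    match = YEAR.search(sample)
    print(sample, "->", match.group(1) if match else None)
```

In "322017" every run of four digits has another digit on at least one side, so nothing matches, which is the behaviour the question asks for.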
As I understand it, you need only the year, defined as 4 digits preceded by letters.
"(?:[a-z]+)(\d{4})$" works for me (it means one or more lowercase letters followed by 4 digits, where the 4 digits are the last characters of the string).
text = ["hi2017", "322017"]
text = pd.Series(text)
myPat = "(?:[a-z]+)(\d{4})$"
res = text.str.extract(myPat)
Output:
print(res)
'''
0 2017
1 NaN
'''
You want 4-digit numbers where the first digit is either a 1 or a 2. This translates to all the numbers between 1000 to 2999, inclusive.
The regex for this is: (1[0-9]{3})|(2[0-9]{3})
This will get all the numbers between 1000 and 2999, inclusive within a string.
In your case, hi2017 will result in 2017. Additionally, 322017 will result in 2201. This is also a valid year as per your definition.
Regexr is a great online tool http://regexr.com/3ghcq
myPat = "(\d{4})"
