I'm trying to get the domain from a list of URLs. For that, I'm using a regex in a function for pattern matching.
def get_domain(url):
    m = re.search(r"https:\/\/(.*)\/", url)
    result = m.group(1)
    return result;

string_array = ("hTTps://stack0verflow.com/", "hTTps://stackoverfl0w.com/", "hTTps://stackoverfiow.com/")
m = list(map(get_domain, string_array))
The function get_domain works if I use a for-loop to iterate over the list of strings, but whenever I try to use the map function, I get the error below.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-575cbcab950e> in <module>
9 print(get_domain(url))
10
---> 11 m = list(map(get_domain, string_array))
12 ##print(m)
<ipython-input-19-fc11e511d74d> in get_domain(url)
12 def get_domain(url):
13 m = re.search(r"https:\/\/(.*)\/", url)
---> 14 result = m.group(1)
15 return result;
AttributeError: 'NoneType' object has no attribute 'group'
Why does this happen and what am I doing wrong? I've seen a lot of examples online of the map function, and I think I have the syntax down.
This regex captures only the domain name in its first group:
(?:https?:\/\/)?(?:(?:www|ssh)\.)?((?=.*\.)[^\n\/]*)
Don't forget to make it case-insensitive: the URLs in the question start with "hTTps", so a case-sensitive pattern finds no match and re.search returns None.
Example:
import re
arr = ["https://www.exemple.com/?query=blablabla","https://www.exemple.com/aaa","hTTp://www.exemple.com","www.exemple.com/aaa","exemple.com"]
for i in arr:
    m = re.search(r"(?:https?:\/\/)?(?:(?:www|ssh)\.)?((?=.*\.)[^\n\/]*)", i, re.IGNORECASE)
    print(m.group(1))
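A minimal sketch of how the question's own map call could be made robust, assuming the goal is just the host part of each URL. The compiled pattern name DOMAIN and the choice to return None for input that does not match are illustrative assumptions, not part of the original code.

```python
import re

# Case-insensitive so that "hTTps://" is accepted; the group captures
# everything up to the next slash.
DOMAIN = re.compile(r"https?://([^/]+)", re.IGNORECASE)

def get_domain(url):
    match = DOMAIN.search(url)
    # re.search returns None when nothing matches, so guard before .group()
    return match.group(1) if match else None

string_array = ("hTTps://stack0verflow.com/", "hTTps://stackoverfl0w.com/", "hTTps://stackoverfiow.com/")
domains = list(map(get_domain, string_array))
print(domains)
```

With the guard in place, map behaves the same as a for-loop: a URL that does not match yields None instead of raising AttributeError.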
I am trying to write a function that converts an integer into its word form only if the integer is followed by certain words.
I want every number that is followed by a word such as "hours", "hour", "day", "days", or "minutes" to be converted into words, and every other number left unchanged.
For example, I have this: "I am 45, I came here 4 times and I have been waiting for 6 hours."
The result should be: "I am 45, I came here four times and I have been waiting for six hours."
I tried to write code for that, but I am stuck:
I am able to get the result in the previous case, but when I have something like:
"I am 45, I came here 4 times and I have been waiting for 45 hours.", my code returns
"I am forty-five , I came here 4 times and I have been waiting for forty-five hours.", while I don't want the first "45" to be changed.
When I test my code with a single sentence it works, but when I apply it to an entire DataFrame column with the map function, it does not work.
Here is my code and the error I get.
import pandas as pd
from num2words import num2words
import re

text = [[1, "I am writing some very basic english sentences"],
        [2, " i am 45 old and worked 3 times this week for 45 hours " ],
        [3, " i am 75 old and worked 6 times this week for 45 hours "]]
Data = pd.DataFrame(text, columns=["index", "text"])
Data

def remove_numbers(text):
    m = re.findall('\d+\s(?=hour|day|days|hours|hrs|hr|minutes|min|time|times)', text)
    for i in range(len(m)):
        if m[i]:
            t = m[i]
            t2 = num2words(t)
            clean = re.sub(t, t2+' ', text)
            text = clean
    return clean

Data['text'] = pd.DataFrame(Data['text'].map(remove_numbers))
Data['text']
The error I get:
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
<ipython-input-165-b46ce833010e> in <module>
16 return clean
17
---> 18 Data['text'] = pd.DataFrame(Data['text'].map(remove_numbers))
19 Data['text']
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in map(self, arg, na_action)
3907 dtype: object
3908 """
-> 3909 new_values = super()._map_values(arg, na_action=na_action)
3910 return self._constructor(new_values, index=self.index).__finalize__(
3911 self, method="map"
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
935
936 # mapper is a function
--> 937 new_values = map_f(values, mapper)
938
939 return new_values
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-165-b46ce833010e> in remove_numbers(text)
14 clean = re.sub(t, t2+' ', text)
15 text = clean
---> 16 return clean
17
18 Data['text'] = pd.DataFrame(Data['text'].map(remove_numbers))
UnboundLocalError: local variable 'clean' referenced before assignment
Please, can someone help me solve these two issues?
The last error is what's getting you. In your example, text[0][1] produces no matches for m, so the loop body never runs and the function returns clean before it has been assigned.
Try:
def remove_numbers(text):
    m = re.findall('\d+\s(?=hour|day|days|hours|hrs|hr|minutes|min|time|times)', text)
    clean = text
    for i in range(len(m)):
        if m[i]:
            t = m[i]
            t2 = num2words(t)
            clean = re.sub(t, t2+' ', text)
            text = clean
    return clean
UPDATE
I forgot about the first part of the question. You need to apply the regex when substituting the new value as well: when you search for a bare 45, as in text[1][1], re.sub replaces both instances.
Try:
def remove_numbers(text):
    clean = text
    m = re.findall('\d+\s(?=hour|day|days|hours|hrs|hr|minutes|min|time|times)', text)
    for i in range(len(m)):
        if m[i]:
            t = m[i]
            t2 = num2words(t)
            # Tie the exact digits to the unit word so a bare "45" elsewhere is left alone.
            pattern = r'\b' + t.strip() + r'\s(?=hour|day|days|hours|hrs|hr|minutes|min|time|times)'
            clean = re.sub(pattern, t2 + ' ', text)
            text = clean
    return clean
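As an alternative, the find-then-substitute loop can be collapsed into a single re.sub call with a replacement function. This is only a sketch under some assumptions: to_words and SMALL_NUMBERS are hypothetical stand-ins so the example runs without third-party packages (in the question's setting you would pass num2words instead), and the unit list adds a trailing word boundary that the original lookahead does not have.

```python
import re

# Hypothetical stand-in for num2words, covering only the numbers used below.
SMALL_NUMBERS = {3: "three", 4: "four", 6: "six", 45: "forty-five"}

def to_words(number):
    return SMALL_NUMBERS.get(number, str(number))

# A number counts only when it is followed by whitespace and a unit word.
UNIT_NUMBER = re.compile(r"\b(\d+)(?=\s(?:hours?|days?|hrs?|minutes|min|times?)\b)")

def spell_out_numbers(text, convert=to_words):
    # The callback receives each match object, so only matched numbers change.
    return UNIT_NUMBER.sub(lambda match: convert(int(match.group(1))), text)

sentence = "I am 45, I came here 4 times and I have been waiting for 45 hours."
print(spell_out_numbers(sentence))
```

Because re.sub returns the input unchanged when nothing matches, there is no variable left unassigned for rows without numbers, and the function can be passed to Series.map directly.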
I want to convert a number stored as a str into a float or int. The conversion raises an error, so I am trying to remove the comma first. The comma is not removed, though, so I need a way to target it by its position in the string, say the fourth character.
power4 = power[power.get('Number of Customers Affected') != 'Unknown']
power5 = power4[pd.notnull(power4['Number of Customers Affected'])]
power6 = power5[power5.get('NERC Region') == 'RFC']
power7 = power6.get('Number of Customers Affected').loc[1]
power8 = power7.strip(",")
power9 = float(power8)
ValueError Traceback (most recent call last)
<ipython-input-70-32ca4deb9734> in <module>
6 power7 = power6.get('Number of Customers Affected').loc[1]
7 power8 = power7.strip(",")
----> 8 power9 = float(power8)
9
10
ValueError: could not convert string to float: '127,000'
Use replace()
float('127,000'.replace(',',''))
Have you tried pandas.to_numeric? (The comma still has to be removed first.)
import pandas as pd
a = '1234'
type(a)
a = pd.to_numeric(a)
type(a)
In the
power8 = power7.strip(",")
line, do
power8 = power7.replace(',', '')
strip() will not work here because it only removes characters from the start and end of a string. What is required is the string's replace() method. You may also try
''.join(e for e in s if e.isdigit())
Or,
s = ''.join(s.split(','))
Regex can also solve this, or you can have a look at this answer: https://stackoverflow.com/a/266162/9851541
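To illustrate the difference between the two methods, here is a small sketch assuming the value is a plain string such as '127,000' (the variable names are illustrative). Note that the digit-only join shown above would also drop a decimal point or a minus sign, which replace() leaves alone.

```python
raw = "127,000"

# strip() only trims the given characters from both ends of the string,
# so a comma in the middle survives.
stripped = raw.strip(",")

# replace() removes every occurrence, wherever it sits.
cleaned = raw.replace(",", "")

value = float(cleaned)
print(repr(stripped), repr(cleaned), value)
```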
I want to print the elements of a list that match a regex. This is my code:
myresult_tv = [ 'Extinct A Horizon Guide to Dinosaurs WEB h264-WEBTUBE', 'High Noon 2019 04 05 720p HDTV DD5 1 MPEG2-NTb', 'Wyatt Cenacs Problem Areas S02E01 1080p WEBRip x264-eSc', 'Bondi Vet S05E15 720p WEB x264-GIMINI', 'If Loving You Is Wrong S04E03 Randals Stage HDTV x264-CRiMSON', 'Wyatt Cenacs Problem Areas S02E01 WEBRip x264-eSc', 'Bondi Vet S05E15 1080p WEB x264-GIMINI']
li = []
for a in myresult_tv:
    w = re.match(".*\d ", a)
    c =w.group()
    li.append(c)
print(li)
and the result is:
Traceback (most recent call last):
File "azazzazazaaz.py", line 31, in <module>
c =w.group()
AttributeError: 'NoneType' object has no attribute 'group'
You're not checking if the regex matched the element of the list. You should be doing something like this:
match = re.search(pattern, string)
if match:
    process(match)
Since I don't know what your expected output is, I used the same regex as yours. Try this code:
li = []
for a in myresult_tv:
    try:  # try/except in case the regex doesn't match some list elements
        w = re.search("(.*\d )", a)  # search instead of match
        c = w.group()
        li.append(c)
    except AttributeError:  # re.search returned None for this element
        pass
print(li)
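The same idea can be written with an explicit check on the match object instead of try/except. This is a sketch using the question's regex on three of its titles; the names releases and prefixes are illustrative.

```python
import re

releases = [
    "Extinct A Horizon Guide to Dinosaurs WEB h264-WEBTUBE",
    "Bondi Vet S05E15 720p WEB x264-GIMINI",
    "Wyatt Cenacs Problem Areas S02E01 WEBRip x264-eSc",
]

pattern = re.compile(r".*\d ")  # same regex as in the question

prefixes = []
for title in releases:
    match = pattern.match(title)
    if match:  # skip titles that have no digit followed by a space
        prefixes.append(match.group())

print(prefixes)
```

The first title contains no digit followed by a space, which is why the original loop hit None and raised AttributeError.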
from mrjob.job import MRJob
import re

Creation_date=re.compile('CreationDate=\"[0-9]*\"[:17]')

class Part2(MRJob):
    def mapper(self, _, line):
        DateOnly=Creation_date.group(0).split("=")
        if(DateOnly > 2013):
            yield None, 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    Part2.run()
I have written Python code for a MapReduce job where CreationDate="2010-07-28T19:04:21.300". I have to find all the dates where the creation date is at or after 2014-01-01, but I have encountered an error.
Creation_date is just a compiled regex.
You need to match it against your input string before you can call group(0).
A regular expression object (the result of re.compile) does not have a group method:
>>> pattern = re.compile('CreationDate="([0-9]+)"')
>>> pattern.group
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: '_sre.SRE_Pattern' object has no attribute 'group'
To get a match object (which has a group method), you need to match the pattern against the string (line) using the regex.search method (or the regex.match method, depending on your need):
>>> pattern.search('CreationDate="2013"')
<_sre.SRE_Match object at 0x7fac5c64e8a0>
>>> pattern.search('CreationDate="2013"').group(1) # returns a string
'2013'
Creation_date = re.compile('CreationDate="([0-9]+)"')

def mapper(self, _, line):
    date_only = Creation_date.search(line).group(1)
    if int(date_only) > 2013:
        yield None, 1
NOTE: I modified the regular expression to capture the numeric part as a group, and converted the matched string to int (comparing a string with the number 2013 is meaningless, or raises an exception, depending on the Python version).
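Since the question's timestamps look like "2010-07-28T19:04:21.300", a digits-only group followed by a closing quote would not match them. The sketch below captures just the four-digit year and guards against lines without the attribute. It is a plain-Python illustration, not mrjob code; the helper name created_after and the sample line format are assumptions.

```python
import re

# Capture only the four-digit year at the start of the timestamp.
CREATION_YEAR = re.compile(r'CreationDate="(\d{4})-')

def created_after(line, year=2013):
    """Return True if the line has a CreationDate whose year is later than `year`."""
    match = CREATION_YEAR.search(line)
    if match is None:  # the line carries no CreationDate attribute
        return False
    return int(match.group(1)) > year

# Hypothetical input line, shaped after the timestamp given in the question.
sample = '<row Id="1" CreationDate="2014-01-01T00:00:00.000" />'
print(created_after(sample))
```

Inside the mapper, the same check could decide whether to yield None, 1 for the current line.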
I have a string that looks like a path, from which I am trying to extract 022014-101 with a regular expression I got from here.
str1 = "Test 123 <C:\User\Test\xyz\022014-101\more\stuff\022014\1> Text"
Actually, I am reading the string from a text file, so I don't have to escape it, but for testing purposes I used this string instead:
str1 = "<C:\\User\\Test\\xyz\\022014-101\\more\\stuff\\022014\\1>"
Here is the code I tried to match the first occurrence of 022014-101:
import re
p = re.compile('(?<=\\)[\d]{6}[^\\]*')
m = p.match(str1)
print m.group(0) #Line 6
It gave me this error:
Traceback (most recent call last):
File "test12.py", line 6, in <module>
print m.group(0)
AttributeError: 'NoneType' object has no attribute 'group'
How can I get the desired output 022014-101?
EDIT:
That did it:
import re
m = re.search(r'(?<=\\)[\d]{6}[^\\]*', str1)
print m.group(0)
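Two things changed in the working version: re.search scans the whole string, whereas match only tries position 0, and the raw string keeps the backslashes intact for the regex engine. A Python 3 sketch of the same fix follows, assuming the path text from the question; the names pattern and found are illustrative.

```python
import re

# Raw string literal, so the backslashes are kept exactly as in the text file.
str1 = r"Test 123 <C:\User\Test\xyz\022014-101\more\stuff\022014\1> Text"

# Six digits right after a backslash, then everything up to the next backslash.
pattern = re.compile(r"(?<=\\)\d{6}[^\\]*")

# match() only tries the start of the string, where the lookbehind cannot succeed.
print(pattern.match(str1))

# search() scans forward and stops at the first position that matches.
found = pattern.search(str1)
print(found.group(0))
```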