Removing the \n from extracted regex matches - Python

I made a regex for the number of followers on Twitter, and I have to extract the numbers from some text:
import re
# Create a regex for number of followers
followersRegex = re.compile(r'''(
    (\s|-)   # first separator
    \d\d     # first 2 digits
    ,        # separator
    \d\d\d   # hundred thousands
    ,        # separator
    \d\d\d   # hundreds
    )
    ''', re.VERBOSE)
# Extract username/followers from this text
extractedFollowers = followersRegex.findall(text)
allFollowers = []
for followerCount in extractedFollowers:
    allFollowers.append(followerCount[0])
But whenever I run it, this appears:
['\n90,280,191', '\n84,239,451', '\n79,215,375', '\n75,925,596', '\n62,869,696']
How do I remove the \n?

>>> lst = ['\n90,280,191', '\n84,239,451', '\n79,215,375', '\n75,925,596', '\n62,869,696']
>>> [i.replace('\n', '') for i in lst]
# ['90,280,191', '84,239,451', '79,215,375', '75,925,596', '62,869,696']
If you provide more information about the original strings you are applying the regex to, maybe I could help with the regex part.

You can use replace or lstrip.
>>> lst = ['\n90,280,191', '\n84,239,451', '\n79,215,375', '\n75,925,596', '\n62,869,696']
>>> [i.lstrip('\n') for i in lst]
['90,280,191', '84,239,451', '79,215,375', '75,925,596', '62,869,696']
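The \n shows up because the first separator, (\s|-), sits inside the outer capturing group, so it is part of what findall returns. As an alternative to cleaning the list afterwards, the separator can be matched without being captured. This is only a sketch: the name followers_regex, the sample text, and the neutral comments are assumptions, not the asker's data.

```python
import re

# Sketch: the separator is matched by a non-capturing group, so the only
# capturing group left is the number itself.
followers_regex = re.compile(r'''
    (?:\s|-)      # first separator, matched but not captured
    (             # the number is the only capturing group
        \d\d      # first 2 digits
        ,         # separator
        \d\d\d    # next 3 digits
        ,         # separator
        \d\d\d    # last 3 digits
    )
''', re.VERBOSE)

# Assumed sample text with the numbers on separate lines.
sample_text = "\n90,280,191\n84,239,451"

# With exactly one capturing group, findall returns a list of strings.
all_followers = followers_regex.findall(sample_text)
print(all_followers)  # ['90,280,191', '84,239,451']
```

With a single capturing group there is no tuple to index, so the followerCount[0] loop is no longer needed.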

Related

What is the fastest way to match words in text?

I have a list of regexes like:
regex_list = [".+rive.+",".+ll","[0-9]+ blue car.+"......] ## list of length 3000
What is the best way to match all of these regexes against my text?
For example:
text : Hello, Owning 2 blue cars for a single driver
In the output, I want a list of the matched words:
matched_words = ["Hello","4 blue cars","driver"] ##Hello <==>.+llo
Alright, first of all, you will probably want to adjust your regex_list, because as it stands, matching those patterns will give you the entire text back as the match. This is because of .+, which allows any character to follow any number of times. What I have done here is the following:
import re
regex_list = [".rive.",".+ll.","[0-9]+ blue car."]
text = "Hello, Owning 2 blue cars for a single driver"
# Returns all the spans of matched regex items in text
spans = [re.search(regex_item,text).span() for regex_item in regex_list]
# Sorts the spans on first occurrence (so, first element in item for every item in span).
spans.sort()
# Retrieves the text via index of spans in text.
matching_texts = [text[x[0]:x[1]] for x in spans]
print(matching_texts)
I adjusted your regex_list slightly, so it does not match the entire text. Then, I retrieve the spans of all the matches in the text. Additionally, I sort the spans by first occurrence. Lastly, I retrieve the texts via the indexes of the spans and print them out. What you will get is the following:
['Hello', '2 blue cars', 'driver']
NOTE: I am unsure why you would like to match '4 blue cars', because that is not in your text.
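One caveat with the list comprehension above: re.search returns None when a pattern does not occur in the text, and calling .span() on None raises AttributeError. With 3000 patterns that is likely to happen. A minimal sketch that skips non-matching patterns follows; the helper name first_matches and the extra "red car" pattern are made up for illustration.

```python
import re

def first_matches(patterns, text):
    """Return the first match of each pattern, ordered by position in text."""
    found = []
    for pattern in patterns:
        m = re.search(pattern, text)
        if m is not None:  # skip patterns that do not occur in the text
            found.append(m.span())
    found.sort()
    return [text[start:end] for start, end in found]

# The last pattern is a hypothetical one that does not match the text.
regex_list = [".rive.", ".+ll.", "[0-9]+ blue car.", "[0-9]+ red car."]
text = "Hello, Owning 2 blue cars for a single driver"

matching_texts = first_matches(regex_list, text)
print(matching_texts)  # ['Hello', '2 blue cars', 'driver']
```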
You could also try this, which is a multithreaded version of @Lexpj's answer:
from concurrent.futures import ThreadPoolExecutor, as_completed
import re
# list of length 3000
regex_list = [".rive.", ".+ll.", "[0-9]+ blue car."]
my_string = "Hello, Owning 2 blue cars for a single driver "
def test(text, regex):
    # Returns all the spans of matched regex items in text
    spans = [re.search(regex, text).span()]
    # Sorts the spans on first occurrence (so, first element in item for every item in span).
    spans.sort()
    # Retrieves the text via index of spans in text.
    matching_texts = [text[x[0]:x[1]] for x in spans]
    return matching_texts
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(test, my_string, regex)
               for regex in regex_list}
    # as_completed() gives you the threads once finished
    matched = set()
    for f in as_completed(futures):
        # Get the results
        rs = f.result()
        matched = matched.union(set(rs))
print(matched)
Looking at the desired result, your regexes are not correct. You don't want to match .+, but \w+, and also with the second regex, you'll want to match some letters after ll too.
The main idea is then to make one regex for all, by concatenating them with the | symbol:
import re
regex_list = [r"\w+rive\w+", r"\w+ll\w+", r"\d+ blue car\w+"]
regex = re.compile('|'.join(regex_list))
text = "Hello, Owning 2 blue cars for a single driver "
print(regex.findall(text)) # ["Hello","2 blue cars","driver"]
This could still give undesired results when part of your string matches more than one regex in the list. In that case the first one "wins". So when multiple regexes could match the same text, make sure they are ordered by their desired priority.
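To illustrate the ordering point, here is a small sketch with two overlapping patterns. The patterns and the text are made up for the demonstration and are not from the question.

```python
import re

text = "two cars"

# Same two alternatives, joined in a different order.
short_first = re.compile("|".join([r"car", r"cars"]))
long_first = re.compile("|".join([r"cars", r"car"]))

# The leftmost alternative that matches at a given position wins.
print(short_first.findall(text))  # ['car']
print(long_first.findall(text))   # ['cars']
```

Putting the longer or more specific pattern first is one way to express priority; which order is right depends on the results you want.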

Function to extract company register number from text string using Regex

I have a function which extracts the company register number (German: Handelsregisternummer) from a given text. Although my regex for this particular problem matches the correct format (please see demo), I cannot extract the complete company register number.
I want to extract HRB 142663 B but I get HRB 142663.
Most numbers are in the format HRB 123456 but sometimes there is the letter B attached to the end.
import re
def get_handelsregisternummer(string, keyword):
    # https://regex101.com/r/k6AGmq/10
    reg_1 = fr'\b{keyword}[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*)(?: B)?'
    match = re.compile(reg_1)
    handelsregisternummer = match.findall(string) # list of matched words
    if handelsregisternummer: # not empty
        return handelsregisternummer[0]
    else: # no match found
        handelsregisternummer = ""
        return handelsregisternummer
Example text scraped from a website. Because the line breaks are lost, words run together:
text_impressum = """"Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"""
Apply function:
for keyword in ['HRB', 'HRA', 'HR B', 'HR A']:
    handelsregisternummer = get_handelsregisternummer(text_impressum, keyword=keyword)
    if handelsregisternummer: # if list is not empty anymore, then do...
        handelsregisternummer = keyword + " " + handelsregisternummer
        break
if not handelsregisternummer: # if list is empty
    handelsregisternummer = 'not specified'
handelsregisternummer_dict = {'handelsregisternummer':handelsregisternummer}
Afterwards I get:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663'}
But I want this:
handelsregisternummer_dict ={'handelsregisternummer': 'HRB 142663 B'}
You need two capturing groups in the regex: one for the keyword and one for the number. The optional (?: B)? must move inside the second group so that the trailing B is captured rather than just matched:
reg_1 = fr'\b({keyword})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?\s?(\d+(?: \d+)*(?: B)?)'
#            |_________|                                   |___________________|
findall then returns a tuple with both captured groups for each match, so join them with a space:
    if handelsregisternummer: # if list is not empty anymore, then do...
        handelsregisternummer = " ".join(handelsregisternummer)
        break
See the Python demo.
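Put together, the two changes could look like the sketch below. It is not the answerer's exact code: it uses search instead of findall, and re.escape on the keyword is an added precaution. The sample text is the one from the question without its stray leading quote.

```python
import re

def get_handelsregisternummer(text, keyword):
    """Return 'KEYWORD NUMBER[ B]' for the first match, or '' if none."""
    pattern = re.compile(
        rf'\b({re.escape(keyword)})[,:]?(?:[- ](?:Nr|Nummer)[.:]*)?'
        rf'\s?(\d+(?: \d+)*(?: B)?)'
    )
    match = pattern.search(text)
    # groups() holds the keyword and the number, including an optional " B"
    return " ".join(match.groups()) if match else ""

text_impressum = "Berlin, HRB 142663 BVAT-ID.: DE283580648Tax Reference Number:"

result = "not specified"
for keyword in ['HRB', 'HRA', 'HR B', 'HR A']:
    found = get_handelsregisternummer(text_impressum, keyword)
    if found:
        result = found
        break

print({'handelsregisternummer': result})
# {'handelsregisternummer': 'HRB 142663 B'}
```

Note that (?: B)? also matches when the number is followed by a word that merely starts with " B" (for example " Berlin"), so the trailing B may need an extra check depending on the input.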

Regex match but re.match() doesn't return anything

I am trying to parse a .md file using a specific regex pattern in Python. The file is written like this:
## title
## title 2
### first paragraph
[lines]
...
### second
[lines]
...
## third
[lines]
...
## last
[lines]
...
So I used this regular expression to match it:
##(.*)\n+##(.*)\n+###((\n|.)*)###((\n|.)*)##((\n|.)*)##((\n|.)*)
When I try it online, the regex matches:
https://regex101.com/r/8iYBrp/1
But when I use it in Python, it doesn't work, and I can't understand why.
Here is my code:
import re
str = (
r'##(.*)\n+##(.*)\n+###((\n|.)*)###((\n|.)*)##((\n|.)*)##((\n|.)*)')
file_regexp = re.compile(str)
## Retrieve the content of the file (I am sure this part
## returns what I want)
m = file_regexp.match(fileContent)
# m is always None
I already tried adding flags such as re.DOTALL, re.I, re.M and re.S. But when I do this, the script becomes really slow and my computer starts making strange noises.
Does anyone know what I did wrong? Any help is appreciated.
First of all, you assign your regex pattern to a variable named str (which shadows the built-in str), but you use featureStr afterwards. Your resulting match object is empty, because you told it to ignore what it matched. You can name the groups using (?P<name>...) and access them later. Here is a working example:
import re
featureStr = (
    r'##(?P<title>.*)\n+##(?P<title_2>.*)\n+###(?P<first>.*)###(?P<second>.*)##(?P<third>.*)##(.*)')
file_regexp = re.compile(featureStr, re.S)
fileContent = open("markdown.md").read()
m = file_regexp.match(fileContent)
print(m.groupdict())
Which prints:
{'title': ' title', 'title_2': ' title 2', 'first': ' first paragraph\n[lines]\n...\n\n', 'second': ' second\n[lines]\n...\n\n', 'third': ' third \n[lines]\n...\n\n'}
I hope this helps you. Let me know if there are any questions left. Have a nice day!
Correct me if I'm wrong, but if you're interested only in the lines you could just skip the lines starting with #. This could be solved with something like
with open("/path/to/your/file", 'r') as in_file:
    for line in in_file:
        if line.startswith('#'):
            continue
        else:
            pass  # do something with the line here
Why do you need a regex?
Use re.search instead of re.match.
import re
pattern = (r'##(.*?)\n##(.*?)\n+###(.*?)\n+###(.*?)\n+##(.*?)\n+##(.*?)')
file_regexp = re.compile(pattern, re.S)
fileContent = '''
## title
## title 2
### first paragraph
[lines]
...
### second
[lines]
...
## third
[lines]
...
## last
[lines]
...
'''
m = file_regexp.search(fileContent)
print(m.groups())
Output:
(' title', ' title 2', ' first paragraph\n[lines]\n...', ' second\n[lines]\n...', ' third \n[lines]\n...', '')
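The slowdown described in the question is likely caused by the nested ((\n|.)*) groups: once the dot also matches newlines, both alternatives can match the same character, which gives the engine a very large number of ways to backtrack. A different approach, not taken from any of the answers above, is to locate the heading lines with a small multiline pattern and slice out the text between them. The function name split_sections and the sample string are assumptions for illustration.

```python
import re

def split_sections(markdown_text):
    """Return (level, title, body) tuples for every ## or ### heading."""
    heading = re.compile(r'^(#{2,3})[ \t]*(.*)$', re.MULTILINE)
    matches = list(heading.finditer(markdown_text))
    sections = []
    for i, m in enumerate(matches):
        # The body runs from the end of this heading to the next heading.
        if i + 1 < len(matches):
            end = matches[i + 1].start()
        else:
            end = len(markdown_text)
        body = markdown_text[m.end():end].strip()
        sections.append((len(m.group(1)), m.group(2).strip(), body))
    return sections

# Assumed sample shaped like the file in the question.
sample = (
    "## title\n## title 2\n### first paragraph\n[lines]\n...\n"
    "### second\n[lines]\n...\n## third\n[lines]\n...\n## last\n[lines]\n...\n"
)

sections = split_sections(sample)
for level, title, body in sections:
    print(level, repr(title), repr(body))
```

Each heading is matched on its own line, so there is no single large pattern that has to backtrack across the whole file.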

Regex to catch only the certain part of the string

Is there a universal regex to capture only the company names?
Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc
Q4_2018_Control4_Corp
The output should be:
American_Airlines_Group_Inc
Apple_Inc
Alcoa_Inc
Arconic_Inc
Orkla_ASA
AGCO_Corp
Autodesk_Inc
Note:
The name of the company may contain symbols or numbers
You can use this regex,
[a-zA-Z]+(?:_[a-zA-Z]+)*$
Your company names all consist of alphabetical words separated by underscores and running to the end of the string, for which the above regex will work fine.
Here, [a-zA-Z]+ matches the first alphabetical word of the company name, (?:_[a-zA-Z]+)* matches any further alphabetical words preceded by an underscore, and $ ensures the match extends to the end of the string.
Regex Demo
Python code,
import re
arr = ['Q4_2017_American_Airlines_Group_Inc','Q1_2016_Apple_Inc','Q4_2014_Alcoa_Inc','Q3_2015_Arconic_Inc','Q3_2017_Orkla_ASA','Q2_2018_AGCO_Corp','Quarter_3_2018_Autodesk_Inc']
for s in arr:
    m = re.search(r'[a-zA-Z]+(?:_[a-zA-Z]+)*$', s)
    print(s, '-->', m.group())
Prints,
Q4_2017_American_Airlines_Group_Inc --> American_Airlines_Group_Inc
Q1_2016_Apple_Inc --> Apple_Inc
Q4_2014_Alcoa_Inc --> Alcoa_Inc
Q3_2015_Arconic_Inc --> Arconic_Inc
Q3_2017_Orkla_ASA --> Orkla_ASA
Q2_2018_AGCO_Corp --> AGCO_Corp
Quarter_3_2018_Autodesk_Inc --> Autodesk_Inc
Also, if you have a single string of those company names, then you can use following code and use re.findall to list all company names,
import re
s = '''Q4_2017_American_Airlines_Group_Inc
Q1_2016_Apple_Inc
Q4_2014_Alcoa_Inc
Q3_2015_Arconic_Inc
Q3_2017_Orkla_ASA
Q2_2018_AGCO_Corp
Quarter_3_2018_Autodesk_Inc'''
print(re.findall(r'(?m)[a-zA-Z]+(?:_[a-zA-Z]+)*$', s))
Prints,
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']
Edit:
As Chyngyz Akmatov pointed out, the name can contain numbers and, in general, any symbol. In that case the following regex extracts the name properly, assuming the company name starts right after the four-digit year and an underscore.
(?<=\d{4}_).*$
Demo handling any character in company name
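A short sketch of how the lookbehind pattern could be used from Python. The variable names are assumptions; the sample names are taken from the question, including the one that contains a digit.

```python
import re

names = [
    'Q4_2018_Control4_Corp',
    'Quarter_3_2018_Autodesk_Inc',
    'Q1_2016_Apple_Inc',
]

# (?<=\d{4}_) requires a four-digit year and an underscore right before
# the match; .*$ then takes everything up to the end of the string.
company_re = re.compile(r'(?<=\d{4}_).*$')

companies = [company_re.search(name).group() for name in names]
print(companies)  # ['Control4_Corp', 'Autodesk_Inc', 'Apple_Inc']
```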
You can use re.sub:
import re
# content is assumed to hold the newline-separated names from the question
data = [re.sub(r'\w+\d{4}_', '', i) for i in filter(None, content.split('\n'))]
Output:
['American_Airlines_Group_Inc', 'Apple_Inc', 'Alcoa_Inc', 'Arconic_Inc', 'Orkla_ASA', 'AGCO_Corp', 'Autodesk_Inc']
You can also use this regex:
_\d+(?:_\d+)*_(.*)
Code:
import re
lst = ['Q4_2017_American_Airlines_Group_Inc', 'Q1_2016_Apple_Inc', 'Q4_2014_Alcoa_Inc', 'Q3_2015_Arconic_Inc', 'Q3_2017_Orkla_ASA', 'Q2_2018_AGCO_Corp', 'Quarter_3_2018_Autodesk_Inc']
for x in lst:
    print(re.search(r'_\d+(?:_\d+)*_(.*)', x).group(1))
# American_Airlines_Group_Inc
# Apple_Inc
# Alcoa_Inc
# Arconic_Inc
# Orkla_ASA
# AGCO_Corp
# Autodesk_Inc
Assuming the names contain only plain letters and underscores and sit at the end of each line:
grep -o '[A-Za-z][A-Za-z_]*$' names

Removing digits from list elements

I have a list of job titles (12,000 in total) formatted in this way:
Career_List = ['1) ABLE SEAMAN', '2) ABRASIVE GRADER', '3) ABRASIVE GRINDER']
How do I remove the numbers, parentheses, and spaces from the list elements so that I end up with this output:
Career_List_Updated = ['ABLE SEAMAN', 'ABRASIVE GRADER', 'ABRASIVE GRINDER']
I know that I am unable to simply remove the first three characters because I have more than ten items in my list.
Take advantage of the fact that str.lstrip() and the rest of the strip functions accept multiple characters as an argument.
Career_List_Updated = [career.lstrip('0123456789) ') for career in Career_List]
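One thing to keep in mind: lstrip removes every leading character that is in the given set, so a title that itself begins with a digit would lose it. If that can happen in your data, a regex anchored to the "number, parenthesis, space" prefix is stricter. This is a sketch; the '3D ANIMATOR' entry is a made-up example and not from the question.

```python
import re

# The last entry is hypothetical, chosen because the title starts with a digit.
Career_List = ['1) ABLE SEAMAN', '2) ABRASIVE GRADER', '12000) 3D ANIMATOR']

# Remove only a leading run of digits, a closing parenthesis and any spaces.
prefix = re.compile(r'^\d+\)\s*')
Career_List_Updated = [prefix.sub('', career) for career in Career_List]
print(Career_List_Updated)
# ['ABLE SEAMAN', 'ABRASIVE GRADER', '3D ANIMATOR']

# For comparison, lstrip also eats the leading 3 of the last title.
print('12000) 3D ANIMATOR'.lstrip('0123456789) '))  # D ANIMATOR
```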
Split each career at the first space; keep the rest of the line.
Career_List = ['1) ABLE SEAMAN', '2) ABRASIVE GRADER', '3) ABRASIVE GRINDER', '12000) ZEBRA CLEANER']
Career_List_Updated = []
for career in Career_List:
    job = career.split(' ', 1)
    Career_List_Updated.append(job[1])
print(Career_List_Updated)
Output:
['ABLE SEAMAN', 'ABRASIVE GRADER', 'ABRASIVE GRINDER', 'ZEBRA CLEANER']
One-line version:
Career_List_Updated = [career.split(' ', 1)[1]
                       for career in Career_List]
We want to find the first index that STOPS being a bad character and return the rest of the string, as follows.
def strip_bad_starting_characters_from_string(string):
    bad_chars = set(r"'0123456789 )") # set of characters we don't like
    for i, char in enumerate(string):
        if char not in bad_chars:
            # we are at first index past "noise" digits
            return string[i:]
    return "" # every character was a bad character
career_list_updated = [strip_bad_starting_characters_from_string(string) for string in Career_List]
