How to extract data from a dataset using regex in Python?

I have a dataset and I would like to extract the appositive feature from it. Here is a sample of the dataset:
در
همین
حال
،
<coref coref_coref_class="set_0" coref_mentiontype="ne" markable_scheme="coref" coref_coreftype="ident">
نجیب
الله
خواجه
عمری
,
</coref>
<coref coref_coref_class="set_0" coref_mentiontype="np" markable_scheme="coref" coref_coreftype="atr">
سرپرست
وزارت
تحصیلات
عالی
افغانستان
</coref>
گفت
که
در
سه
ماه
گذشته
در
۳۳
ولایت
کشور
<coref coref_coreftype="ident" coref_coref_class="empty" coref_mentiontype="ne" markable_scheme="coref">
خدمات
ملکی
</coref>
از
حدود
۱۴۹
هزار
I want to store the data inside the dataset in two lists. In the find_atr list I store the text of every coref tag that has coref_coreftype="atr". In the find_ident list I want to store the text of every tag with coref_coreftype="ident". The last coref tag in this sample has coref_coref_class="empty", and I don't want to store that data. In the regex I specified that it should only match coref_coref_class="set_.*?", not coref_coref_class="empty", but it still stores the data of the coref_coref_class="empty" tag, where it should only store coref_coref_class="set_.*?".
How can I avoid this? My code:
i_ident = []
j_atr = []
find_ident = re.findall(r'<coref.*?coref_coref_class="set_.*?coref_mentiontype="ne".*?coref_coreftype="ident".*?>(.*?)</coref>', read_dataset, re.S)
ident_list = list(map(lambda x: x.replace('\n', ' '), find_ident))
for i in range(len(ident_list)):
    i_ident.append(str(ident_list[i]))
find_atr = re.findall(r'<coref.*?coref_coreftype="atr".*?>(.*?)</coref>', read_dataset, re.S)
atr_list = list(map(lambda x: x.replace('\n', ' '), find_atr))
#print(coref_list)
for i in range(len(atr_list)):
    j_atr.append(str(atr_list[i]))
print(i_ident)
print()
print(j_atr)

I reduced your dataset file to:
A
<coref coref_coref_class="set_0" coref_mentiontype="ne" markable_scheme="coref" coref_coreftype="ident">
B
</coref>
<coref coref_coref_class="set_0" coref_mentiontype="np" markable_scheme="coref" coref_coreftype="atr">
C
</coref>
D
<coref coref_coreftype="ident" coref_coref_class="empty" coref_mentiontype="ne" markable_scheme="coref">
E
</coref>
F
And tried this code, which is almost the same as the one you provided:
import re
with open("test_dataset.log", "r") as myfile:
    read_dataset = myfile.read()
i_ident = []
j_atr = []
find_ident = re.findall(r'<coref.*?coref_coref_class="set_.*?coref_mentiontype="ne".*?coref_coreftype="ident".*?>(.*?)</coref>', read_dataset, re.S)
ident_list = list(map(lambda x: x.replace('\n', ' '), find_ident))
for i in range(len(ident_list)):
    i_ident.append(str(ident_list[i]))
find_atr = re.findall(r'<coref.*?coref_coreftype="atr".*?>(.*?)</coref>', read_dataset, re.S)
atr_list = list(map(lambda x: x.replace('\n', ' '), find_atr))
#print(coref_list)
for i in range(len(atr_list)):
    j_atr.append(str(atr_list[i]))
print(i_ident)
print()
print(j_atr)
And got this output, which seems right to me:
[' B ']
[' C ']
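If the full file still captures text from the coref_coref_class="empty" tag, the usual suspect is that with re.S each .*? between the attributes can also run across > and even </coref>, so a match may start at one <coref and collect its attributes from a later tag. A sketch of a stricter pattern that pins all three attributes inside a single tag by using [^>]* instead of .*? (the capture group is unchanged):
# [^>]* cannot cross '>', so every attribute must come from the same tag;
# a tag with coref_coref_class="empty" can then never satisfy the pattern
find_ident = re.findall(
    r'<coref[^>]*coref_coref_class="set_[^">]*"[^>]*coref_mentiontype="ne"'
    r'[^>]*coref_coreftype="ident"[^>]*>(.*?)</coref>',
    read_dataset, re.S)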

Related

Align Multiple Matches in SpaCy nlp into a Pandas Dataframe

I have written code that searches a text file for multiple terms, which are 'Capex' and more in my case:
import spacy
import pandas as pd
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
pm = PhraseMatcher(nlp.vocab)
tipe = PhraseMatcher(nlp.vocab)
doc = nlp(text)
sents = [sent for sent in doc.sents]
phrases = ['capex', 'capacity expansion', 'Capacity expansion', 'CAPEX', 'Capacity Expansion', 'Capex']
patterns = [nlp(text) for text in phrases]
pm.add('CAPEX ', None, *patterns)
matches = pm(doc)
Then, after I find where these terms occur in the text file, I try to get the sentence where each term was used. After that I search that sentence further for the Date, Value and Type of 'CAPEX'.
The issue I am facing is that there are multiple instances where a Type of 'CAPEX' ("Greenfield", etc.) is used several times in one sentence, although my code only runs as many times as there are matches of the word 'CAPEX'. Any solution to align all of these into one DataFrame?
def findmatch(doc, phrases, name):
    p = phrases
    pa = [nlp(text) for text in p]
    name = PhraseMatcher(nlp.vocab)
    name.add('Type', None, *pa)
    results = name(doc)
    return results

def getext(matches):
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        text = span.text
        return text

allcapex = pd.DataFrame(columns=['Type', 'Value', 'Date', 'business segment', 'Location', 'source'])
for ind, match in enumerate(matches):
    for sent in sents:
        if matches[ind][1] < sent.end:
            typematches = findmatch(sent, ['Greenfield', 'greenfield', 'brownfield', 'Brownfield', 'de-bottlenecking', 'De-bottlenecking'], 'Type')
            valuematches = findmatch(sent, ['Crore', 'Cr', 'crore', 'cr'], 'Value')
            datematches = findmatch(sent, ['2020', '2021', '2022', '2023', '2024', '2025', 'FY21', 'FY22', 'FY23', 'FY24', 'FY25', 'FY26'], 'Date')
            capextype = getext(typematches)
            capexvalue = getext(valuematches)
            capexdate = getext(datematches)
            allcapex.loc[len(allcapex.index)] = [capextype, capexvalue, capexdate, '', '', sent]
            break
print(allcapex)
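A possible fix (a sketch, not tested against your data, and following the spaCy v2-style PhraseMatcher.add your snippet uses): have the helper collect every matched span in the sentence and join them, instead of returning after the first one. Span.as_doc() turns the sentence into a standalone Doc so the match offsets line up with it:
def find_all_texts(sent_span, phrases, label):
    # sketch: run the matcher over the whole sentence and keep every hit
    sent_doc = sent_span.as_doc()
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add(label, None, *[nlp(p) for p in phrases])
    return '; '.join(sent_doc[s:e].text for _, s, e in matcher(sent_doc))

# e.g. capextype = find_all_texts(sent, ['Greenfield', 'greenfield'], 'Type')
# yields 'Greenfield; greenfield' when both occur in the sentence, so each
# DataFrame row carries all the matches for its sentence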

Split each line in a file based on delimitters

This is the sample data in a file. I want to split each line in the file and add it to a dataframe. In some cases there is more than one child; whenever there is more than one child, a new set of columns has to be added (Child2_Name and DOB).
(P322) Rashmika Chadda 15/05/1995 – Rashmi C 12/02/2024
(P324) Shiva Bhupati 01/01/1994 – Vinitha B 04/08/2024
(P356) Karthikeyan chandrashekar 22/02/1991 – Kanishka P 10/03/2014
(P366) Kalyani Manoj 23/01/1975 - Vandana M 15/05/1995 - Chandana M 18/11/1998
This is the code I have tried, but it only takes "–" into consideration when splitting:
with open("text.txt") as read_file:
file_contents = read_file.readlines()
content_list = []
temp = []
for each_line in file_contents:
temp = each_line.replace("–", " ").split()
content_list.append(temp)
print(content_list)
Current output:
[['(P322)', 'Rashmika', 'Chadda', '15/05/1995', 'Rashmi', 'Chadda', 'Teega', '12/02/2024'], ['(P324)', 'Shiva', 'Bhupati', '01/01/1994', 'Vinitha', 'B', 'Sahu', '04/08/2024'], ['(P356)', 'Karthikeyan', 'chandrashekar', '22/02/1991', 'Kanishka', 'P', '10/03/2014'], ['(P366)', 'Kalyani', 'Manoj', '23/01/1975', '-', 'Vandana', 'M', '15/05/1995', '-', 'Chandana', 'M', '18/11/1998']]
Final output should be like below:
Code   Parent_Name                DOB         Child1_Name  DOB         Child2_Name  DOB
P322   Rashmika Chadda            15/05/1995  Rashmi C     12/02/2024
P324   Shiva Bhupati              01/01/1994  Vinitha B    04/08/2024
P356   Karthikeyan chandrashekar  22/02/1991  Kanishka P   10/03/2014
P366   Kalyani Manoj              23/01/1975  Vandana M    15/05/1995  Chandana M   18/11/1998
I'm not sure if you want it as a list or something else.
To get lists:
import re

# "text" holds the lines read from the file, as in the question
result = []
for t in text[:]:
    # remove the \n at the end of each line
    t = t.strip()
    # remove the parentheses you don't want
    t = t.replace("(", "")
    t = t.replace(")", "")
    # split on the dash separator (the file mixes "–" and "-")
    t = re.split(r" [–-] ", t)
    # reconstruct
    for i, person in enumerate(t):
        person = person.split(" ")
        # remove code
        if i == 0:
            res = [person.pop(0)]
        res.extend([" ".join(person[:2]), person[2]])
    result.append(res)
print(result)
Which would give the below output:
[['P322', 'Rashmika Chadda', '15/05/1995', 'Rashmi C', '12/02/2024'], ['P324', 'Shiva Bhupati', '01/01/1994', 'Vinitha B', '04/08/2024'], ['P356', 'Karthikeyan chandrashekar', '22/02/1991', 'Kanishka P', '10/03/2014'], ['P366', 'Kalyani Manoj', '23/01/1975', 'Vandana M', '15/05/1995', 'Chandana M', '18/11/1998']]
You can organise the data a bit more using a dictionary:
result = {}
for t in text[:]:
    # remove the \n at the end of each line
    t = t.strip()
    # remove the parentheses you don't want
    t = t.replace("(", "")
    t = t.replace(")", "")
    # split on the dash separator (the file mixes "–" and "-")
    t = re.split(r" [–-] ", t)
    for i, person in enumerate(t):
        # split name
        person = person.split(" ")
        if i == 0:
            # remove code
            code = person.pop(0)
            result[code] = {"parent_name": " ".join(person[:2]), "parent_DOB": person[2], "children": []}
        else:
            result[code]['children'].append({f"child{i}_name": " ".join(person[:2]), f"child{i}_DOB": person[2]})
print(result)
Which would give this output:
{'P322': {'children': [{'child1_DOB': '12/02/2024',
'child1_name': 'Rashmi C'}],
'parent_DOB': '15/05/1995',
'parent_name': 'Rashmika Chadda'},
'P324': {'children': [{'child1_DOB': '04/08/2024',
'child1_name': 'Vinitha B'}],
'parent_DOB': '01/01/1994',
'parent_name': 'Shiva Bhupati'},
'P356': {'children': [{'child1_DOB': '10/03/2014',
'child1_name': 'Kanishka P'}],
'parent_DOB': '22/02/1991',
'parent_name': 'Karthikeyan chandrashekar'},
'P366': {'children': [{'child1_DOB': '15/05/1995',
'child1_name': 'Vandana M'},
{'child2_DOB': '18/11/1998', 'child2_name': 'Chandana M'}],
'parent_DOB': '23/01/1975',
'parent_name': 'Kalyani Manoj'}}
In the end, to have an actual table, you would need to use pandas, but that will require you to fix the maximum number of children so that you can pad the empty cells.
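A minimal sketch of that last step, assuming at most two children as in the sample data and the result list from the first snippet:
import pandas as pd

columns = ['Code', 'Parent_Name', 'DOB', 'Child1_Name', 'Child1_DOB', 'Child2_Name', 'Child2_DOB']
# pad each row with empty strings up to the fixed number of columns
padded = [row + [''] * (len(columns) - len(row)) for row in result]
df = pd.DataFrame(padded, columns=columns)
print(df)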

Extract() position list

Hi, I am trying to extract info from a Scrapy item. I tried:
dre = [i.split(', ') for i in response.xpath('normalize-space(//*[contains(@class,"business-address")])').extract()]
ml_item['address'] = dre[0]
output:
'address': ['Calle V Centenario 24', '46900', 'Torrente', 'Valencia']
I need to save the info from this output in different variables, like:
ml_item['cp'] = '46900', ml_item['city'] = 'Torrente'
If dre[0] gives you ['Calle V Centenario 24', '46900', 'Torrente', 'Valencia'], then:
ml_item['cp'] = dre[0][1]
ml_item['city'] = dre[0][2]
or
ml_item['address'] = dre[0]
ml_item['cp'] = ml_item['address'][1]
ml_item['city'] = ml_item['address'][2]
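If some listings come back with fewer address components, plain indexing will raise an IndexError. A defensive variant (my assumption, not part of the scraped page's structure):
address = dre[0] if dre else []
ml_item['address'] = address
# fall back to an empty string when a component is missing
ml_item['cp'] = address[1] if len(address) > 1 else ''
ml_item['city'] = address[2] if len(address) > 2 else ''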

How to arrange html sentences having different structures

I have a few hundred HTML files that look like the below.
<nonDerivativeTable>
<nonDerivativeHolding> #First Holding
<securityTitle>
<value>Stock</value>
</securityTitle>
</nonDerivativeHolding>
<nonDerivativeHolding> #Second Holding
<securityTitle>
<footnoteId id="F1"/>
</securityTitle>
</nonDerivativeHolding>
<nonDerivativeHolding> #Third Holding
<securityTitle>
<value>Option</value>
<footnoteId id="F2"/>
<footnoteId id="F3"/>
</securityTitle>
</nonDerivativeHolding>
</nonDerivativeTable>
The two variables that I would like to extract are security ('Stock' in #First Holding, '' in #Second Holding, and 'Option' in #Third Holding) and security_footnote ('' in #First Holding, 'F1' in #Second Holding, and 'F2; F3' in #Third Holding). But securityTitle and securityTitleFootnote do not always exist.
Also, sometimes there are multiple footnote IDs, just like in the #Third Holding.
I want to write each row using the data in each "Holding" tag, allowing for empty values.
import csv
from bs4 import BeautifulSoup

with open('output.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    soup = BeautifulSoup(doc, 'html.parser')  # let's say doc has the html
    try:
        securityTitles = soup.select('securityTitle > value').text
    except:
        securityTitles = ''
    try:
        securityTitleFootnotes = '; '.join(soup.select('securityTitle > footnoteid').get('id'))
    except:
        securityTitleFootnotes = ''
    for securityTitle, securityTitleFootnote in zip(securityTitles, securityTitleFootnotes):
        writer.writerow([securityTitle, securityTitleFootnote])
I want the result to be a table (the "Want Table" image in the original post).
Note: one of the URLs that I am trying to parse is "https://www.sec.gov/Archives/edgar/data/12927/0001225208-09-018738.txt". The sentences that I uploaded are only part of the data.
Now I see that those are XML... rather than HTML.
You can find the contents for each nonDerivativeHolding, and then apply a custom list of handlers for each:
from bs4 import BeautifulSoup as soup

# contents of each <securitytitle>, one entry per holding
c = [i.securitytitle.contents for i in soup(s, 'html.parser').find_all('nonderivativeholding')]
# handlers: how to extract a result from each tag type
h = [('value', lambda x: x.text), ('footnoteid', lambda x: x['id'])]
# drop the newline strings between tags
results = [[i for i in b if i != '\n'] for b in c]
# for each holding, apply each handler to its matching tags:
# no match -> '', one match -> the value itself, several -> a list
r = [{a: (lambda x: '' if not x else x[0] if len(x) == 1 else x)([b(j) for j in i if j.name == a]) for a, b in h} for i in results]
Output:
[{'value': 'Stock', 'footnoteid': ''}, {'value': '', 'footnoteid': 'F1'}, {'value': 'Option', 'footnoteid': ['F2', 'F3']}]
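An expanded, step-by-step equivalent of those one-liners (a sketch under the same html.parser assumption, with hypothetical variable names):
rows = []
for holding in soup(s, 'html.parser').find_all('nonderivativeholding'):
    title = holding.securitytitle
    values = [v.text for v in title.find_all('value')]
    footnotes = [f['id'] for f in title.find_all('footnoteid')]
    # same convention as above: '' when absent, scalar when single, list when multiple
    rows.append({'value': values[0] if len(values) == 1 else (values or ''),
                 'footnoteid': footnotes[0] if len(footnotes) == 1 else (footnotes or '')})
print(rows)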

How do I convert a str list that has phrases to a int list?

I have a script that allows me to extract the info obtained from Excel into a list. This list contains str values with phrases such as "I like cooking", "My dog´s name is Doug", etc.
So I've tried this code that I found on the Internet, knowing that the int function has a way of transforming an actual phrase into numbers.
The code I used is:
lista=["I like cooking", "My dog´s name is Doug", "Hi, there"]
test_list = [int(i, 36) for i in lista]
Running the code I get the following error:
builtins.ValueError: invalid literal for int() with base 36: "I like cooking"
I've tried the code without the spaces or punctuation and I do get an actual value, but I need to take those characters into consideration.
To expand on the bytearray approach, you could use int.to_bytes and int.from_bytes to actually get an int back, although the integers will be much longer than the ones you show in your example.
def to_int(s):
    return int.from_bytes(bytearray(s, 'utf-8'), 'big', signed=False)

def to_str(s):
    return s.to_bytes((s.bit_length() + 7) // 8, 'big').decode()

lista = ["I like cooking",
         "My dog´s name is Doug",
         "Hi, there"]

encoded = [to_int(s) for s in lista]
decoded = [to_str(s) for s in encoded]
encoded:
[1483184754092458833204681315544679,
28986146900667755422058678317652141643897566145770855,
1335744041264385192549]
decoded:
['I like cooking',
'My dog´s name is Doug',
'Hi, there']
As noted in the comments, converting phrases to integers with int() won't work if the phrase contains whitespace or most non-alphanumeric characters with a few exceptions.
If your phrases all use a common encoding, then you might get something closer to what you want by converting your strings to bytearrays. For example:
s = 'My dog´s name is Doug'
b = bytearray(s, 'utf-8')
print(list(b))
# [77, 121, 32, 100, 111, 103, 194, 180, 115, 32, 110, 97, 109, 101, 32, 105, 115, 32, 68, 111, 117, 103]
From there you would have to figure out whether or not you want to preserve the list of integers representing each phrase or combine them in some way depending on what you intend to do with these numerical string representations.
Since you want to convert your text for an AI, you should do something like this:
import re

def clean_text(text, vocab):
    '''
    normalizes the string
    '''
    chars = {'\'': [u"\u0060", u"\u00B4", u"\u2018", u"\u2019"],
             'a': [u"\u00C0", u"\u00C1", u"\u00C2", u"\u00C3", u"\u00C4", u"\u00C5", u"\u00E0", u"\u00E1", u"\u00E2", u"\u00E3", u"\u00E4", u"\u00E5"],
             'e': [u"\u00C8", u"\u00C9", u"\u00CA", u"\u00CB", u"\u00E8", u"\u00E9", u"\u00EA", u"\u00EB"],
             'i': [u"\u00CC", u"\u00CD", u"\u00CE", u"\u00CF", u"\u00EC", u"\u00ED", u"\u00EE", u"\u00EF"],
             'o': [u"\u00D2", u"\u00D3", u"\u00D4", u"\u00D5", u"\u00D6", u"\u00F2", u"\u00F3", u"\u00F4", u"\u00F5", u"\u00F6"],
             'u': [u"\u00DA", u"\u00DB", u"\u00DC", u"\u00DD", u"\u00FA", u"\u00FB", u"\u00FC", u"\u00FD"]}
    # fold accented characters to their plain equivalents
    for gud in chars:
        for bad in chars[gud]:
            text = text.replace(bad, gud)
    # drop any string that contains a URL
    if 'http' in text:
        return ''
    text = text.replace('&', ' and ')
    text = re.sub(r'\.( +\.)+', '..', text)
    #text = re.sub(r'\.\.+', ' ^ ', text)
    text = re.sub(r',+', ',', text)
    text = re.sub(r'\-+', '-', text)
    text = re.sub(r'\?+', ' ? ', text)
    text = re.sub(r'\!+', ' ! ', text)
    text = re.sub(r'\'+', "'", text)
    text = re.sub(r';+', ':', text)
    text = re.sub(r'/+', ' / ', text)
    text = re.sub(r'<+', ' < ', text)
    text = re.sub(r'>+', ' > ', text)
    text = text.replace('%', '% ')
    text = text.replace(' - ', ' : ')
    text = text.replace(' -', " - ")
    text = text.replace('- ', " - ")
    text = text.replace(" '", " ")
    text = text.replace("' ", " ")
    #for c in ".,:":
    #    text = text.replace(c + ' ', ' ' + c + ' ')
    # collapse runs of spaces
    text = re.sub(r' +', ' ', text.strip(' '))
    # remove every character that is not in the vocab
    for i in text:
        if i not in vocab:
            text = text.replace(i, '')
    return text

def arr_to_vocab(arr, vocabDict):
    '''
    returns a provided array converted with the provided vocab dict; all array
    elements have to be in the vocab, but not all vocab elements have to be in
    the input array; works with strings too
    '''
    try:
        return [vocabDict[i] for i in arr]
    except Exception as e:
        print(e)
        return []

def str_to_vocab(vocab):
    '''
    generates vocab dicts
    '''
    to_vocab = {}
    from_vocab = {}
    for index, i in enumerate(vocab):
        to_vocab[index] = i
        from_vocab[i] = index
    return to_vocab, from_vocab

vocab = sorted([chr(i) for i in range(32, 127)])  # a basic vocab for your model
vocab.insert(0, None)
toVocab, fromVocab = str_to_vocab(vocab)  # converting the vocab into usable form

your_data_str = ["I like cooking", "My dog´s name is Doug", "Hi, there"]  # your data, a list of strings
X = []
for i in your_data_str:
    X.append(arr_to_vocab(clean_text(i, vocab), fromVocab))  # normalizing and converting each string to "ints"

# your data is now almost ready for your model; just pad it to the size of your input with zeros and it's done
print(X)
If you want to know how to convert an "int" string back to a string, tell me.
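For completeness, a minimal sketch of that reverse step, using the toVocab dict built above:
def vocab_to_str(ids):
    # map each index back to its character; index 0 (None) acts as padding
    return ''.join(toVocab[i] for i in ids if toVocab[i] is not None)

print(vocab_to_str(X[0]))  # 'I like cooking'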
