I have a function that creates triples and relationships from text. However, when I build a list from a column that contains text and pass it to the function, only the first row (the first item of the list) is processed. How can the whole list be processed by this function? Maybe a for loop would work?
The following lines contain the list:
rez_dictionary = {'Decent Little Reader, Poor Tablet',
'Ok For What It Is',
'Too Heavy and Poor weld quality,',
'difficult mount',
'just got it installed'}
from transformers import pipeline
triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
# We need to use the tokenizer manually since we need special tokens.
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(rez_dictionary, return_tensors=True, return_text=False)[0]["generated_token_ids"]])
print(extracted_text[0])
If anyone has a suggestion, I look forward to it.
Would it also be possible to get the output adjusted to the following format:
# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    relation, subject, relation, object_ = '', '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets
extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)
You are discarding the other entries of rez_dictionary inside the batch_decode call, because the [0] index keeps only the first result:
triplet_extractor(rez_dictionary, return_tensors=True, return_text=False)[0]["generated_token_ids"]
Use a list comprehension over all results instead:
from transformers import pipeline
rez = ['Decent Little Reader, Poor Tablet',
'Ok For What It Is',
'Too Heavy and Poor weld quality,',
'difficult mount',
'just got it installed']
triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
model_output = triplet_extractor(rez, return_tensors=True, return_text=False)
extracted_text = triplet_extractor.tokenizer.batch_decode([x["generated_token_ids"] for x in model_output])
print("\n".join(extracted_text))
Output:
<s><triplet> Decent Little Reader <subj> Poor Tablet <obj> different from <triplet> Poor Tablet <subj> Decent Little Reader <obj> different from</s>
<s><triplet> Ok For What It Is <subj> film <obj> instance of</s>
<s><triplet> Too Heavy and Poor <subj> weld quality <obj> subclass of</s>
<s><triplet> difficult mount <subj> mount <obj> subclass of</s>
<s><triplet> 2008 Summer Olympics <subj> 2008 <obj> point in time</s>
Regarding the follow-up question about running extract_triplets on every result, a for loop is enough:
for text in extracted_text:
    print(extract_triplets(text))
Output:
[{'head': 'Decent Little Reader', 'type': 'different from', 'tail': 'Poor Tablet'}, {'head': 'Poor Tablet', 'type': 'different from', 'tail': 'Decent Little Reader'}]
[{'head': 'Ok For What It Is', 'type': 'instance of', 'tail': 'film'}]
[{'head': 'Too Heavy and Poor', 'type': 'subclass of', 'tail': 'weld quality'}]
[{'head': 'difficult mount', 'type': 'subclass of', 'tail': 'mount'}]
[{'head': '2008 Summer Olympics', 'type': 'point in time', 'tail': '2008'}]
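To see why the original call only handled the first item, here is a small sketch that does not need the model at all. It assumes the pipeline returns one dict per input text when called with return_tensors=True and return_text=False, as the answer's comprehension implies; model_output below is a hand-written stand-in with made-up token ids, not real output.

```python
# Hypothetical stand-in for the pipeline result: one dict per input text.
# The token ids are invented placeholders, not real model output.
model_output = [
    {"generated_token_ids": [0, 11, 12, 2]},
    {"generated_token_ids": [0, 21, 2]},
    {"generated_token_ids": [0, 31, 32, 33, 2]},
]

# Pattern from the question: [0] selects only the first dict,
# so batch_decode would receive a single sequence.
first_only = [model_output[0]["generated_token_ids"]]

# Pattern from the answer: one sequence per input text.
all_results = [x["generated_token_ids"] for x in model_output]

print(len(first_only), len(all_results))
```

The first list always has length one no matter how many texts were passed in, which matches the behaviour described in the question.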
This is the sample data in a file. I want to split each line of the file and add it to a dataframe. Some parents have more than one child; whenever that happens, a new pair of columns (Child2_Name and DOB) has to be added.
(P322) Rashmika Chadda 15/05/1995 – Rashmi C 12/02/2024
(P324) Shiva Bhupati 01/01/1994 – Vinitha B 04/08/2024
(P356) Karthikeyan chandrashekar 22/02/1991 – Kanishka P 10/03/2014
(P366) Kalyani Manoj 23/01/1975 - Vandana M 15/05/1995 - Chandana M 18/11/1998
This is the code I have tried, but it only takes the "–" separator into consideration:
with open("text.txt") as read_file:
    file_contents = read_file.readlines()
content_list = []
temp = []
for each_line in file_contents:
    temp = each_line.replace("–", " ").split()
    content_list.append(temp)
print(content_list)
Current output:
[['(P322)', 'Rashmika', 'Chadda', '15/05/1995', 'Rashmi', 'Chadda', 'Teega', '12/02/2024'], ['(P324)', 'Shiva', 'Bhupati', '01/01/1994', 'Vinitha', 'B', 'Sahu', '04/08/2024'], ['(P356)', 'Karthikeyan', 'chandrashekar', '22/02/1991', 'Kanishka', 'P', '10/03/2014'], ['(P366)', 'Kalyani', 'Manoj', '23/01/1975', '-', 'Vandana', 'M', '15/05/1995', '-', 'Chandana', 'M', '18/11/1998']]
The final output should look like this:
Code  Parent_Name                DOB         Child1_Name  DOB         Child2_Name  DOB
P322  Rashmika Chadda            15/05/1995  Rashmi C     12/02/2024
P324  Shiva Bhupati              01/01/1994  Vinitha B    04/08/2024
P356  Karthikeyan chandrashekar  22/02/1991  Kanishka P   10/03/2014
P366  Kalyani Manoj              23/01/1975  Vandana M    15/05/1995  Chandana M   18/11/1998
I'm not sure whether you want the result as a list or as something else.
To get lists (file_contents is the list of lines read in your question):
import re
result = []
for t in file_contents:
    # remove the \n at the end of each line
    t = t.strip()
    # remove the parentheses you don't want
    t = t.replace("(", "").replace(")", "")
    # split on the dash separator (your sample uses both "–" and "-")
    t = re.split(r"\s+[–-]\s+", t)
    # reconstruct
    for i, person in enumerate(t):
        person = person.split(" ")
        # the first chunk starts with the code
        if i == 0:
            res = [person.pop(0)]
        res.extend([" ".join(person[:2]), person[2]])
    result.append(res)
print(result)
Which would give the below output:
[['P322', 'Rashmika Chadda', '15/05/1995', 'Rashmi C', '12/02/2024'], ['P324', 'Shiva Bhupati', '01/01/1994', 'Vinitha B', '04/08/2024'], ['P356', 'Karthikeyan chandrashekar', '22/02/1991', 'Kanishka P', '10/03/2014'], ['P366', 'Kalyani Manoj', '23/01/1975', 'Vandana M', '15/05/1995', 'Chandana M', '18/11/1998']]
You can organise the data a bit more by using a dictionary:
import re
result = {}
# file_contents is the list of lines read in your question
for t in file_contents:
    # remove the \n at the end of each line
    t = t.strip()
    # remove the parentheses you don't want
    t = t.replace("(", "").replace(")", "")
    # split on the dash separator (your sample uses both "–" and "-")
    t = re.split(r"\s+[–-]\s+", t)
    for i, person in enumerate(t):
        # split name
        person = person.split(" ")
        if i == 0:
            # the first chunk holds the code and the parent
            code = person.pop(0)
            result[code] = {"parent_name": " ".join(person[:2]), "parent_DOB": person[2], "children": [] }
        else:
            result[code]['children'].append({f"child{i}_name": " ".join(person[:2]), f"child{i}_DOB": person[2]})
print(result)
Which would give this output:
{'P322': {'children': [{'child1_DOB': '12/02/2024',
'child1_name': 'Rashmi C'}],
'parent_DOB': '15/05/1995',
'parent_name': 'Rashmika Chadda'},
'P324': {'children': [{'child1_DOB': '04/08/2024',
'child1_name': 'Vinitha B'}],
'parent_DOB': '01/01/1994',
'parent_name': 'Shiva Bhupati'},
'P356': {'children': [{'child1_DOB': '10/03/2014',
'child1_name': 'Kanishka P'}],
'parent_DOB': '22/02/1991',
'parent_name': 'Karthikeyan chandrashekar'},
'P366': {'children': [{'child1_DOB': '15/05/1995',
'child1_name': 'Vandana M'},
{'child2_DOB': '18/11/1998', 'child2_name': 'Chandana M'}],
'parent_DOB': '23/01/1975',
'parent_name': 'Kalyani Manoj'}}
In the end, to get an actual table you would use pandas. Every row then needs the same number of columns, so you have to work out the maximum number of children and pad the shorter rows with empty cells.
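The padding step could look roughly like this in plain Python. This is only a sketch: pad_rows is a hypothetical helper, the two sample rows are copied from the list output above, and the column names Parent_DOB and ChildN_DOB are my own choice (the question repeats a bare DOB header, which would give duplicate column names).

```python
# Two rows in the shape produced by the list-based approach;
# the second parent has two children, so the rows differ in length.
rows = [
    ['P322', 'Rashmika Chadda', '15/05/1995', 'Rashmi C', '12/02/2024'],
    ['P366', 'Kalyani Manoj', '23/01/1975', 'Vandana M', '15/05/1995',
     'Chandana M', '18/11/1998'],
]

def pad_rows(rows, fill=''):
    """Return (header, padded_rows), padding every row to the widest one."""
    width = max(len(r) for r in rows)
    # 3 fixed columns (code, parent name, parent DOB), then 2 per child.
    n_children = (width - 3) // 2
    header = ['Code', 'Parent_Name', 'Parent_DOB']
    for i in range(1, n_children + 1):
        header += [f'Child{i}_Name', f'Child{i}_DOB']
    padded = [r + [fill] * (width - len(r)) for r in rows]
    return header, padded

header, padded = pad_rows(rows)
print(header)
for r in padded:
    print(r)
```

With pandas installed, header and padded can then be handed over as pandas.DataFrame(padded, columns=header).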
I am currently lost in my quest to script something that parses a firewall configuration and outputs an HTML list.
I can't figure out why the '<br />'.join part does not work.
Another part of my script creates the dictionaries shown below.
The part of the script displayed here checks which value is assigned to a key in the dict "ipsec_encr_int". That value gets assigned to key_interface as a string.
The script then checks whether this key_interface exists as a key in phase1_int_ike and whether it has one or two values assigned.
There are only ikev1 and ikev2. Depending on which one (or both) is present, the key/value pairing should be displayed, but not as a LIST! It should be displayed as several strings with <br /> between them, since the end result should be an HTML table.
I tried running the ".join()" part outside of the function, but got the same result.
Data set:
phase1_int_ike = {'OUTSIDE': ['ikev2', 'ikev1'], 'P2P-Duckburg': ['ikev2'], 'P2P-Darkwing': ['ikev2']}
ipsec_encr_int = {'VPN-TO-Gearloose': ['OUTSIDE'], 'VPN-TO-Ducktales': ['OUTSIDE'], 'VPN-TO-BeagleBoys': ['OUTSIDE'], 'VPN-TO-Duckburg': ['P2P-Duckburg'], 'VPN-TO-Darkwing': ['P2P-Darkwing']}
ipsec = {'VPN-TO-Gearloose': ['OUTSIDE_MAP', 'Map Number: 10', 'PFS: Default DH Group', 'Peer: 123.123.123.126', 'ikev1', 'Phase 2: ESP-AES256-SHA1'], 'VPN-TO-Ducktales': ['OUTSIDE_MAP', 'Map Number: 20', 'PFS: DH group19', 'Peer: 123.123.123.13', 'ikev2', 'Phase 2: IKEV2-AES256-SHA256'], 'VPN-TO-BeagleBoys': ['OUTSIDE_MAP', 'Map Number: 30', 'PFS: DH group5', 'Peer: 123.123.123.250', 'ikev1', 'Phase 2: ESP-AES256-SHA1', 'lifetime 3600 seconds', 'lifetime 4608000 kilobytes'], 'VPN-TO-Duckburg': ['P2P-Duckburg', 'Map Number: 10', 'PFS: DH group19', 'Peer: 123.123.123.27', 'ikev2', 'Phase 2: IKEV2-AES256-SHA256'], 'VPN-TO-Darkwing': ['P2P-Darkwing', 'Map Number: 10', 'PFS: DH group19', 'Peer: 123.123.123.17', 'ikev2', 'Phase 2: IKEV2-AES256-SHA256']}
ikev2_pols = {'ikev2 policy 10': ['encryption: aes-256', 'integrity sha256', 'group 19', 'prf sha256', 'lifetime 86400']}
ikev1_pols = {'ikev1 policy 10': ['authentication: pre-share', 'encryption: aes-256', 'hash: sha', 'group 2', 'lifetime: 86400'], 'ikev1 policy 20': ['authentication: pre-share', 'encryption: aes-256', 'hash: sha', 'group 5', 'lifetime: 28800']}
crypto_int = {'OUTSIDE_MAP': ['OUTSIDE'], 'P2P-Duckburg': ['P2P-Duckburg'], 'P2P-Darkwing': ['P2P-Darkwing']}
Function currently:
def html_ipsec_tbl(key):
    if key in ipsec_encr_int:
        key_interface = ' '.join(ipsec_encr_int[key])
        if key_interface in phase1_int_ike and len(phase1_int_ike[key_interface]) == 2:
            for ikev1_key in ikev1_pols:
                return '<td>' + '<th>' + ikev1_key + '</th>' + '<br />'.join([str(x) for x in ikev1_pols.values()]) + '</td>'
                #print('<th>' + ikev1_key + '</th>')
                #print(*ikev1_pol, sep = "\n")
            for ikev2_key in ikev2_pols:
                return '<td>' + '<th>' + ikev2_key + '</th>' + '<br />'.join([str(x) for x in ikev2_pols.values()]) + '</td>'
                #print('<th>' + ikev2_key + '</th>')
                #print(*ikev2_pol, sep = "\n")
        elif key_interface in phase1_int_ike and len(phase1_int_ike[key_interface]) == 1:
            if 'ikev1' in phase1_int_ike[key_interface]:
                for ikev1_key in ikev1_pols:
                    return '<td>' + '<th>' + ikev1_key + '</th>' + '<br />'.join([str(x) for x in ikev1_pols.values()]) + '</td>'
                    #print('<th>' + ikev1_key + '</th>')
                    #print(*ikev1_pol, sep = "\n")
            elif 'ikev2' in phase1_int_ike[key_interface]:
                for ikev2_key in ikev2_pols:
                    return '<td>' + '<th>' + ikev2_key + '</th>' + '<br />'.join([str(x) for x in ikev2_pols.values()]) + '</td>'
                    #print('<th>' + ikev2_key + '</th>')
                    #print(*ikev2_pol, sep = "\n")
    for value in ipsec[key]:
        if 'Peer:' in value.split(' '):
            peer = value.split(' ')[1]
print(html_ipsec_tbl('VPN-TO-Duckburg'))
Output currently:
<td><th>ikev2 policy 10</th>['encryption: aes-256', 'integrity sha256', 'group 19', 'prf sha256', 'lifetime 86400']</td>
My Goal:
<td><th>ikev2 policy 10</th> <br />encryption: aes-256<br />integrity sha256<br />group 19<br />prf sha256<br />lifetime 86400</td>
Now there is no question that it's possible to code this WAY cleaner. But I'm a total beginner and as long as the .join part works too, I'm fine with my code as it is.
The values in your dictionary are lists of strings, so dict.values() yields a collection of lists. str(x) then turns each whole list into a single string, and .join only sees that one string instead of the elements of the list:
In [38]: [str(x) for x in ikev2_pols.values()]
Out[38]: ["['encryption: aes-256', 'integrity sha256', 'group 19', 'prf sha256', 'lifetime 86400']"]
Instead, you need to flatten out that list:
In [39]: [str(elem) for sublist in ikev2_pols.values() for elem in sublist]
Out[39]:
['encryption: aes-256',
'integrity sha256',
'group 19',
'prf sha256',
'lifetime 86400']
Add that code inside your .join() instead.
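A minimal sketch of how this could look once assembled. Note that it is a small variation on the flattening above: it joins the settings of each policy separately, so that with several policies (as in ikev1_pols) the values are not mixed into one cell. policy_cell is a hypothetical helper name and the markup simply mirrors the question's string.

```python
ikev2_pols = {'ikev2 policy 10': ['encryption: aes-256', 'integrity sha256',
                                  'group 19', 'prf sha256', 'lifetime 86400']}

def policy_cell(policy_name, settings):
    """Build one cell: the policy name, then its settings separated by <br />."""
    # settings is already a list of strings, so join its elements directly
    # instead of calling str() on the whole list.
    body = '<br />'.join(str(s) for s in settings)
    return '<td><th>' + policy_name + '</th>' + body + '</td>'

# One cell per policy; .items() yields (name, list_of_settings) pairs.
cells = [policy_cell(name, settings) for name, settings in ikev2_pols.items()]
print(cells[0])
```

Inside html_ipsec_tbl the same idea would mean joining ikev2_pols[ikev2_key] rather than ikev2_pols.values().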
[EDIT] - Issue not with dictionary itself. Unmodified copies of original file 'census2010.py' do not display the issue.
I'm trying to encode Excel data into a nested dictionary for further analysis.
I expect to be able to read out any key from the dictionary. For example, I expect the following to work:
>>> census2010.allData['AK']['Anchorage']
{'pop': 291826, 'tracts': 55}
What I get is:
census2010.allData['AK']['Anchorage']
Traceback (most recent call last):
File "<input>", line 1, in <module>
KeyError: 'AK'
The only key that works is:
census2010.allData['WY']['Weston']
{'pop': 3894, 'tracts': 1}
I've created the Census2010.py file with the data from the censuspopdata.xlsx file (following the process from Chapter 12 of "Automate the Boring Stuff").
Looking directly at Census2010.py shows all the nested keys, but importing 'census2010.py' and interrogating the dictionary only shows the "final" key.
Here's the script that generates census2010.py (it runs without error):
import openpyxl, pprint, os
print('Opening workbook...')
os.getcwd()
p = os.getcwd()
os.chdir(p + '\\automatestuffdirectorytest\\')
wb = openpyxl.load_workbook('censuspopdata.xlsx')
sheet = wb['Population by Census Tract']
countyData = {}
print('Reading rows...')
for row in range(2, sheet.max_row + 1):
    # Each row in the spreadsheet has data for one census tract.
    state = sheet['B' + str(row)].value
    county = sheet['C' + str(row)].value
    pop = sheet['D' + str(row)].value
    # Make sure the key for this state exists.
    countyData.setdefault(state, {})
    # Make sure the key for this county in this state exists.
    countyData[state].setdefault(county, {'tracts': 0, 'pop': 0})
    # Each row represents one census tract, so increment by one.
    countyData[state][county]['tracts'] += 1
    # Increase the county pop by the pop in this census tract.
    countyData[state][county]['pop'] += int(pop)
print('Writing results...')
resultFile = open('census2010.py', 'w')
resultFile.write('allData = ' + pprint.pformat(countyData))
resultFile.close()
print('Done.')
And here are a few snippets of the resulting dictionary (3143 lines):
allData = {'AK': {'Aleutians East': {'pop': 3141, 'tracts': 1},
'Aleutians West': {'pop': 5561, 'tracts': 2},
'Anchorage': {'pop': 291826, 'tracts': 55}, # ...
--snip--
'Yukon-Koyukuk': {'pop': 5588, 'tracts': 4}}, # ...
--snip--
'WY': {'Albany': {'pop': 36299, 'tracts': 10}, # ...
--snip --
'Weston': {'pop': 7208, 'tracts': 2}}}
But the only key that seems to be found is [WY][Weston]
for i in allData.items():
... print(i)
...
('WY', {'Weston': {'pop': 3894, 'tracts': 1}})
calling the keys only works with ['WY']['Weston']
census2010.allData['WY']['Weston']
{'pop': 3894, 'tracts': 1}
This code might help you.
Note: I am not using the os module.
#! python3
# read_census_excel.txt - Tabulates population and number of census tracts for
# each county.
import openpyxl, pprint
print('Opening workbook')
wb = openpyxl.load_workbook('censuspopdata.xlsx')
# wb[...] replaces the deprecated wb.get_sheet_by_name(...)
sheet = wb['Population by Census Tract']
county_data = {}
print('Reading rows...')
for row in range(2, sheet.max_row + 1):
    # Each row in the spreadsheet has data for one census tract.
    state = sheet['B' + str(row)].value
    county = sheet['C' + str(row)].value
    pop = sheet['D' + str(row)].value
    # Make sure the key for this state exists.
    county_data.setdefault(state, {})
    # Make sure the key for this county in this state exists.
    county_data[state].setdefault(county, {'tracts': 0, 'pop': 0})
    # Each row represents one census tract, so increment by one.
    county_data[state][county]['tracts'] += 1
    # Increase the county pop by the pop in this census tract.
    county_data[state][county]['pop'] += int(pop)
# Open a new text file and write the contents of county_data to it.
print('Writing results...')
result_file = open('census2010.py', 'w')
result_file.write('all_data= ' + pprint.pformat(county_data))
result_file.close()
print('Done')
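If you want to check the tallying logic without the spreadsheet, a small sketch like the following may help. The rows are invented stand-ins for the sheet (not real census figures), and ast.literal_eval is used here only to read the pprint.pformat text back in, which roughly mimics what importing the generated file gives you.

```python
import ast
import pprint

# Hypothetical (state, county, tract population) rows standing in for
# the spreadsheet; the numbers are invented for illustration only.
rows = [
    ('AK', 'Anchorage', 100),
    ('AK', 'Anchorage', 250),
    ('WY', 'Weston', 40),
]

county_data = {}
for state, county, pop in rows:
    # Same setdefault pattern as in the script above.
    county_data.setdefault(state, {})
    county_data[state].setdefault(county, {'tracts': 0, 'pop': 0})
    county_data[state][county]['tracts'] += 1
    county_data[state][county]['pop'] += int(pop)

# The text that would follow 'allData = ' in the generated file.
literal_text = pprint.pformat(county_data)

# Reading the text back should give the same nested dictionary.
round_trip = ast.literal_eval(literal_text)
print(round_trip['AK']['Anchorage'])
print(sorted(round_trip))
```

If every state key survives this round trip, the loop and the pformat step are behaving as expected, and the problem is more likely in which file actually gets imported.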
I have an Excel spreadsheet I'm preparing to migrate to Access and the date column has entries in multiple formats such as: 1963 to 1969, Aug. 1968 to Sept. 1968, 1972, Mar-73, 24-Jul, Oct. 2, 1980, Aug 29, 1980, July 1946, etc. and 'undated'. I'm pulling the column that will be the key (map number) and date column into a csv and writing back to a csv.
I can strip out years that are 4 digit, but not ranges. And I'm stumped how to extract days and 2 digit years short of re-formatting by hand. My code isn't very elegant and probably not best practice:
import csv, xlwt, re
# create new Excel document and add sheet
# from tempfile import TemporaryFile
from xlwt import Workbook
book = Workbook()
sheet1 = book.add_sheet('Sheet 1')
# populate first row with header
sheet1.write(0,0,"Year")
sheet1.write(0,1,"Map")
sheet1.write(0,2,"As Entered")
# count variable for populating sheet
rowCount=0
# open csv file and read
with open('C:\dateTestMSDOs.csv', 'rb') as f:
    reader=csv.reader(f)
    for row in reader:
        map = row[0] # first column is map number
        dateRaw = row[1] # second column is raw date as entered
        # write undated and blank entries
        if dateRaw == 'undated':
            yearStr = '0000'
            rowCount +=1
            sheet1.write(rowCount, 0, yearStr)
            sheet1.write(rowCount, 1, map)
            sheet1.write(rowCount, 2, dateRaw)
            #print rowCount, yearStr, map, dateRaw, '\n'
            yearStr=''
        if dateRaw == '':
            yearStr = 'NoEntry'
            rowCount +=1
            sheet1.write(rowCount, 0, yearStr)
            sheet1.write(rowCount, 1, map)
            sheet1.write(rowCount, 2, dateRaw)
            #print rowCount, yearStr, map, dateRaw, '\n'
            yearStr=''
        # search and write instances of four consecutive digits
        try:
            year = re.search(r'\d\d\d\d', dateRaw)
            yearStr= year.group()
            #print yearStr, map, dateRaw
            rowCount +=1
            sheet1.write(rowCount, 0, yearStr)
            sheet1.write(rowCount, 1, map)
            sheet1.write(rowCount, 2, dateRaw)
            #print rowCount, yearStr, map, dateRaw, '\n'
            yearStr=''
        # if none exist flag for cleaning spreadsheet and print
        except:
            #print 'Nope', map, dateRaw
            rowCount +=1
            yearStr='Format'
            sheet1.write(rowCount, 0, yearStr)
            sheet1.write(rowCount, 1, map)
            sheet1.write(rowCount, 2, dateRaw)
            #print rowCount, yearStr, map, dateRaw, '\n'
            yearStr=''
        yearStr=''
        dateRaw=''
book.save('D:\dateProperty.xls')
print "Done!"
I would like to write day and month to an additional column as well as pull the second 4 digit date of range entries.
You can try using dateutil for this. I think you'd still need to deal with some of the more difficult formats in a different way though. See a sample implementation below:
Code:
import dateutil.parser as dateparser
date_list = ['1963 to 1969',
'Aug. 1968 to Sept. 1968',
'Mar-73',
'24-Jul',
'Oct. 2 1980',
'Aug 29, 1980',
'July 1946',
'undated']
for d in date_list:
    # Test for ' to ' with spaces so that a word such as 'October' is not mistaken for a range.
    if ' to ' in d:
        a, b = d.split(' to ')
        # Get the higher number. Use min to get lower of two.
        print max(dateparser.parse(a.strip()).year, dateparser.parse(b.strip()).year)
    elif d == 'undated':
        print '0000'
    else:
        yr = dateparser.parse(d).year
        print yr
Result:
1969
1968
1973
2014
1980
1980
1946
0000
The only glaring issue I can see is that 24-Jul returns the year 2014, because the parser fills in any missing component (day, month, or year) from the current date. For example, Mar-73 becomes 1973-03-20 if today is the 20th of the month.
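One way to keep those short forms from being silently completed is to catch them with a regex before anything reaches dateutil. This is only a sketch in Python 3 with the standard library: classify_short_date is a hypothetical helper, and treating two-digit years as 19xx is an assumption based on the sample values, not something the question states.

```python
import re

# "Mar-73": month name, dash, two-digit year.
MONTH_YY = re.compile(r'^([A-Za-z]+)\.?-(\d{2})$')
# "24-Jul": day, dash, month name, with no year at all.
DAY_MONTH = re.compile(r'^(\d{1,2})-([A-Za-z]+)\.?$')

def classify_short_date(text, century=1900):
    """Return a dict for the two short forms, or None for anything else.

    century=1900 is an assumption about the data, not a general rule.
    """
    text = text.strip()
    m = MONTH_YY.match(text)
    if m:
        return {'day': None, 'month': m.group(1),
                'year': century + int(m.group(2))}
    m = DAY_MONTH.match(text)
    if m:
        # No year in the entry: leave it empty rather than guess one.
        return {'day': int(m.group(1)), 'month': m.group(2), 'year': None}
    return None

for entry in ['Mar-73', '24-Jul', 'Aug 29, 1980']:
    print(entry, classify_short_date(entry))
```

Entries for which this returns None can still go through the parser as before.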
I'm not entirely sure if this is what you were going for, but I used a "simple" regex search and then went through the sets of groups, calling the function defined for whichever set matched. When a match is found, the function that is called (stored in the regex_groups variable) returns a dictionary with the following keys: start_day, start_month, start_year, end_day, end_month, end_year
Then you can do whatever you'd like with those values. It is definitely not the cleanest solution, but it works as far as I can tell.
#!/usr/local/bin/python2.7
import re
# Crazy regex (raw string, so the backslashes reach the regex engine unchanged)
regex_pattern = r'(?:(\d{4}) to (\d{4}))|(?:(\w+)\. (\d{4}) to (\w+)\. (\d{4}))|(?:(\w+)-(\d{2}))|(?:(\d{2})-(\w+))|(?:(\w+)\. (\d+), (\d{4}))|(?:(\w+) (\d+), (\d{4}))|(?:(\w+) (\d{4}))|(?:(\d{4}))'
date_strings = [
'1963 to 1969',
'Aug. 1968 to Sept. 1968',
'1972',
'Mar-73',
'24-Jul',
'Oct. 2, 1980',
'Aug 29, 1980',
'July 1946',
]
# Here you set the group matching functions that will be called for a matching group
regex_groups = {
    (1,2): lambda group_matches: {
        'start_day': '', 'start_month': '', 'start_year': group_matches[0],
        'end_day': '', 'end_month': '', 'end_year': group_matches[1]
    },
    (3,4,5,6): lambda group_matches: {
        'start_day': '', 'start_month': group_matches[0], 'start_year': group_matches[1],
        'end_day': '', 'end_month': group_matches[2], 'end_year': group_matches[3]
    },
    (7,8): lambda group_matches: {
        'start_day': '', 'start_month': group_matches[0], 'start_year': group_matches[1],
        'end_day': '', 'end_month': '', 'end_year': ''
    },
    # e.g. '24-Jul': group 9 is the day and group 10 is the month; there is no year
    (9,10): lambda group_matches: {
        'start_day': group_matches[0], 'start_month': group_matches[1], 'start_year': '',
        'end_day': '', 'end_month': '', 'end_year': ''
    },
    (11,12,13): lambda group_matches: {
        'start_day': group_matches[1], 'start_month': group_matches[0], 'start_year': group_matches[2],
        'end_day': '', 'end_month': '', 'end_year': ''
    },
    (14,15,16): lambda group_matches: {
        'start_day': group_matches[1], 'start_month': group_matches[0], 'start_year': group_matches[2],
        'end_day': '', 'end_month': '', 'end_year': ''
    },
    (17,18): lambda group_matches: {
        'start_day': '', 'start_month': group_matches[0], 'start_year': group_matches[1],
        'end_day': '', 'end_month': '', 'end_year': ''
    },
    (19,): lambda group_matches: {
        'start_day': '', 'start_month': '', 'start_year': group_matches[0],
        'end_day': '', 'end_month': '', 'end_year': ''
    },
}
for ds in date_strings:
    matches = re.search(regex_pattern, ds)
    start_month = ''
    start_year = ''
    end_month = ''
    end_year = ''
    for regex_group, group_func in regex_groups.items():
        group_matches = [matches.group(sub_group_num) for sub_group_num in regex_group]
        if all(group_matches):
            match_data = group_func(group_matches)
            print
            print 'Matched:', ds
            print '%s to %s' % ('-'.join([match_data['start_day'], match_data['start_month'], match_data['start_year']]), '-'.join([match_data['end_day'], match_data['end_month'], match_data['end_year']]))
            # match_data is a dictionary with keys:
            # * start_day
            # * start_month
            # * start_year
            # * end_day
            # * end_month
            # * end_year
            # If a group doesn't contain one of those items, then it is set to a blank string
Outputs:
Matched: 1963 to 1969
--1963 to --1969
Matched: Aug. 1968 to Sept. 1968
-Aug-1968 to -Sept-1968
Matched: 1972
--1972 to --
Matched: Mar-73
-Mar-73 to --
Matched: 24-Jul
24-Jul- to --
Matched: Oct. 2, 1980
2-Oct-1980 to --
Matched: Aug 29, 1980
29-Aug-1980 to --
Matched: July 1946
-July-1946 to --
You could define all the possible cases of dates using regex, something like:
import re
s = ['1963 to 1969', 'Aug. 1968 to Sept. 1968',
'1972', 'Mar-73', '03-Jun', '24-Jul', 'Oct. 2, 1980', 'Oct. 26, 1980',
'Aug 29 1980', 'July 1946']
def get_year(date):
    mm = re.findall("\d{4}", date)
    if mm:
        return mm
    mm = re.search("\w+-(\d{2})", date)
    if mm:
        return [mm.group(1)]
def get_month(date):
    mm = re.findall("[A-Z][a-z]+", date)
    if mm:
        return mm
def get_day(date):
    d_expr = ["(\d|\d{2})\-[A-Z][a-z]+","[A-Z][a-z]+[\. ]+(\d|\d{2}),"]
    for expr in d_expr:
        mm = re.search(expr, date)
        if mm:
            return [mm.group(1)]
d = {}
m = {}
y = {}
for idx, date in enumerate(s):
    d[idx] = get_day(date)
    m[idx] = get_month(date)
    y[idx] = get_year(date)
print "Year Dict: ", y
print "Month Dict: ", m
print "Day Dict: ", d
As a result you get dictionaries of days, months, and years, which can be used to populate the rows.
Output:
Year Dict: {0: ['1963', '1969'], 1: ['1968', '1968'], 2: ['1972'], 3: ['73'], 4: None, 5: None, 6: ['1980'], 7: ['1980'], 8: ['1980'], 9: ['1946']}
Month Dict: {0: None, 1: ['Aug', 'Sept'], 2: None, 3: ['Mar'], 4: ['Jun'], 5: ['Jul'], 6: ['Oct'], 7: ['Oct'], 8: ['Aug'], 9: ['July']}
Day Dict: {0: None, 1: None, 2: None, 3: None, 4: ['03'], 5: ['24'], 6: ['2'], 7: ['26'], 8: None, 9: None}
Thank you for the innovative suggestions. After consideration we decided to remove day and month from what would be searchable in our database, since only a relatively small amount of our data had that level of detail. Here is the code I use to extract and generate the data I needed from a long and messy list.
import csv, xlwt, re
# create new Excel document and add sheet
from xlwt import Workbook
book = Workbook()
sheet1 = book.add_sheet('Sheet 1')
# populate first row with header
sheet1.write(0,0,"MapYear_(Parsed)")
sheet1.write(0,1,"Map_Number")
sheet1.write(0,2,"As_Entered")
# count variable for populating sheet
rowCount=0
# open csv file and read
yearStr = ''
with open('C:\mapsDateFix.csv', 'rb') as f:
    reader=csv.reader(f)
    for row in reader:
        map = row[0] # first column is map number
        dateRaw = row[1] # second column is raw date as entered
        # write undated and blank entries
        if dateRaw == 'undated':
            yearStr = 'undated'
            rowCount +=1
            sheet1.write(rowCount, 0, yearStr)
            sheet1.write(rowCount, 1, map)
            sheet1.write(rowCount, 2, dateRaw)
            #print rowCount, yearStr, map, dateRaw, '\n'
            #yearStr=''
        if yearStr != 'undated':
            if dateRaw == '':
                yearStr = 'NoEntry'
                rowCount +=1
                sheet1.write(rowCount, 0, yearStr)
                sheet1.write(rowCount, 1, map)
                sheet1.write(rowCount, 2, dateRaw)
                #print rowCount, yearStr, map, dateRaw, '\n'
                #yearStr=''
        # search and write instances of four consecutive digits
        if yearStr != dateRaw:
            try:
                year = re.search(r'\d\d\d\d', dateRaw)
                yearStr= year.group()
                #print yearStr, map, dateRaw
                rowCount +=1
                sheet1.write(rowCount, 0, yearStr)
                sheet1.write(rowCount, 1, map)
                sheet1.write(rowCount, 2, dateRaw)
                #print rowCount, yearStr, map, dateRaw, '\n'
                yearStr=''
            # if none exist flag for cleaning spreadsheet and print
            except:
                #print 'Nope', map, dateRaw
                rowCount +=1
                yearStr='Format'
                sheet1.write(rowCount, 0, yearStr)
                sheet1.write(rowCount, 1, map)
                sheet1.write(rowCount, 2, dateRaw)
                #print rowCount, yearStr, map, dateRaw, '\n'
                yearStr=''
        yearStr=''
        dateRaw=''
book.save('D:\dateProperty.xls')
print "Done!"
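The script above keeps only the first four-digit year of each entry. In case the second year of a range such as "1963 to 1969" is still wanted, as the question originally asked, a small helper along these lines could be dropped in. It is a sketch written for Python 3 (the thread's code is Python 2), and extract_years is a hypothetical name.

```python
import re

def extract_years(date_raw):
    """Return (start_year, end_year) as strings.

    end_year is None when the entry holds a single four-digit year,
    and both are None when no four-digit year is present.
    """
    # \b keeps the match to standalone four-digit runs.
    years = re.findall(r'\b\d{4}\b', date_raw)
    if not years:
        return None, None
    if len(years) == 1:
        return years[0], None
    return years[0], years[-1]

for entry in ['1963 to 1969', 'Aug. 1968 to Sept. 1968', 'Oct. 2, 1980', 'Mar-73']:
    print(entry, extract_years(entry))
```

The second value could then be written to an extra column next to the parsed year.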