Split, edit, and replace values in list - python

I'm having trouble with some email text munging. I have participant sign-ups in a list like this:
body = ['Study: Study 1', 'Date: Friday, March 28, 2014 3:15 PM - 4:00 PM',
        'Location: Some Place', 'Participant: John Doe', 'Study: Study 1',
        'Date: Friday, March 28, 2014 4:00 PM - 4:40 PM',
        'Location: Some Place', 'Participant: Mary Smith']
I'm new to Python, so I'm not sure if there is a specific name for the operation I want. Practically, what I want is to take the list items with the 'Participant:' tag, remove that tag, and split the names into separate list items for first and last names. So, something like this:
body = ['Study: Study 1', 'Date: Friday, March 28, 2014 3:15 PM - 4:00 PM',
        'Location: Some Place', 'John', 'Doe']
I've tried using a list comprehension similar to this one:
[item.split(' ')[1:] for item in body if re.match('Participant:*', item)]
which gives me back a nested list like this:
[['John', 'Doe'],['Mary','Smith']]
But, I have no idea how to make those nested lists with the first and last names into single list items, and no idea how to insert them back into the original list.
Any help is much appreciated!

You can have your cake and eat it too with:
[elem
 for line in body
 for elem in (line.split()[1:] if line.startswith('Participant:') else (line,))]
This is a nested-loop comprehension, where the inner loop iterates either over the split output or over a one-element tuple containing the unsplit list element:
>>> from pprint import pprint
>>> body=['Study: Study 1', 'Date: Friday, March 28, 2014 3:15 PM - 4:00 PM',
... 'Location: Some Place','Participant: John Doe','Study: Study 1',
... 'Date: Friday, March 28, 2014 4:00 PM - 4:40 PM',
... 'Location: Some Place','Participant: Mary Smith']
>>> [elem
... for line in body
... for elem in (line.split()[1:] if line.startswith('Participant:') else (line,))]
['Study: Study 1', 'Date: Friday, March 28, 2014 3:15 PM - 4:00 PM', 'Location: Some Place', 'John', 'Doe', 'Study: Study 1', 'Date: Friday, March 28, 2014 4:00 PM - 4:40 PM', 'Location: Some Place', 'Mary', 'Smith']
>>> pprint(_)
['Study: Study 1',
 'Date: Friday, March 28, 2014 3:15 PM - 4:00 PM',
 'Location: Some Place',
 'John',
 'Doe',
 'Study: Study 1',
 'Date: Friday, March 28, 2014 4:00 PM - 4:40 PM',
 'Location: Some Place',
 'Mary',
 'Smith']
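If you'd rather keep your original comprehension that yields nested pairs, the flattening step can also be made explicit with itertools.chain.from_iterable; a minimal sketch of the same idea, using only the standard library:

from itertools import chain

# Chain one iterable per original line: the split-off names for
# 'Participant:' lines, or a one-element tuple holding the line itself.
flattened = list(chain.from_iterable(
    line.split()[1:] if line.startswith('Participant:') else (line,)
    for line in body
))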

IMHO, this sort of thing is cleanest with a function:
def do_whatever(lst):
    for item in lst:
        if item.startswith('Participant:'):
            head, tail = item.split(':', 1)
            for name in tail.split():
                yield name
        else:
            yield item

body = list(do_whatever(body))
e.g.:
>>> def do_whatever(lst):
...     for item in lst:
...         if item.startswith('Participant:'):
...             head, tail = item.split(':', 1)
...             for name in tail.split():
...                 yield name
...         else:
...             yield item
...
>>> body=['Study: Study 1', 'Date: Friday, March 28, 2014 3:15 PM - 4:00 PM',
... 'Location: Some Place','Participant: John Doe','Study: Study 1',
... 'Date: Friday, March 28, 2014 4:00 PM - 4:40 PM',
... 'Location: Some Place','Participant: Mary Smith']
>>> body = list(do_whatever(body))
>>> body
['Study: Study 1', 'Date: Friday, March 28, 2014 3:15 PM - 4:00 PM', 'Location: Some Place', 'John', 'Doe', 'Study: Study 1', 'Date: Friday, March 28, 2014 4:00 PM - 4:40 PM', 'Location: Some Place', 'Mary', 'Smith']
Sorry for the really bad function name -- I'm not feeling creative at the moment...

Related

Pandas date conversion: unconverted data remains

In Pandas (Jupyter) I have a column with dates in string format:
koncerti.Date.values[:20]
array(['15 September 2010', '16 September 2010', '18 September 2010',
       '20 September 2010', '21 September 2010', '23 September 2010',
       '24 September 2010', '26 September 2010', '28 September 2010',
       '30 September 2010', '1 October 2010', '3 October 2010',
       '5 October 2010', '6 October 2010', '8 October 2010',
       '10 October 2010', '12 October 2010', '13 October 2010',
       '15 October 2010', '17 October 2010'], dtype=object)
I try to convert them to date format with the following statement:
koncerti.Date = pd.to_datetime(koncerti.Date, format='%d %B %Y')
Unfortunately, it produces the following error: ValueError: unconverted data remains: [31]
What does this error mean?
Solution: koncerti.Date = pd.to_datetime(koncerti.Date, format='%d %B %Y', exact=False)
An additional parameter was needed: exact=False. The error means that at least one string contains trailing characters (here '[31]', presumably a footnote marker) that the format did not consume; exact=False allows the format to match anywhere in the string instead of requiring an exact match.
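For illustration, a minimal sketch that reproduces the error and the fix; the trailing '[31]' is an assumed footnote-style leftover, since the actual offending value isn't shown:

import pandas as pd

# Hypothetical data: the second value carries a trailing '[31]'
# that the format '%d %B %Y' cannot consume.
dates = pd.Series(['15 September 2010', '16 September 2010[31]'])

# Strict parsing raises: ValueError: unconverted data remains: [31]
# pd.to_datetime(dates, format='%d %B %Y')

# exact=False lets the format match anywhere in the string,
# so the leftover characters are ignored:
print(pd.to_datetime(dates, format='%d %B %Y', exact=False))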

multiprocessing.Pool.map() 'MaybeEncodingError: Error Sending Results' while trying to scrape a long list of links

This is all of my code (I am working in a Jupyter notebook).
I just tried something new with the worker function and I am still getting a Reason: 'RecursionError('maximum recursion depth exceeded while pickling an object')' error. I tried to use itertools like so:
def worker(links):
    # row_data = [meeting_data_scraper(link) for link in links]
    with multiprocessing.Pool(10) as p:
        row_data = p.map(meeting_data_scraper, links)
    all_data = set(it.chain.from_iterable(row_data))
    return all_data
and this gave me the same error I have been getting. I just do not understand why this is not working. I know I am probably missing something simple but I can't see it and I am losing my mind.
import csv
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
from multiprocessing.pool import Pool, ThreadPool
import itertools as it

# this is where any files or directories/path variables go:
csv_filename = './meetings.csv'

def csv_parser(csv_reader, header: str):
    _header = next(csv_reader)
    headerIndex = _header.index(header)
    # now create an empty list to append the addresses to
    data_list = []
    # loop through CSV file and append to address_list
    for line in csv_reader:
        all_data = line[headerIndex]
        data_list.append(all_data)
    return data_list

# CREATE A CSV WRITER TO WRITE ROW DATA TO A NEW CSV FILE WITH MEETING DETAILS:
def csv_writer(csv_writer, data):
    # Open CSV file to append data to:
    with open('new_meeting_data.csv', 'w') as f:
        csv_writer = csv.DictWriter(f, field_names=list(data.keys()))

# create func to get list of links:
def get_links(csv_filename):
    with open(csv_filename, 'r') as f:
        csv_reader = csv.reader(f)
        link_list = csv_parser(csv_reader, 'link')
    return link_list
# CREATE A FUNCTION THAT WILL EXTRACT ADDRESS DATA FROM EACH LINK IN
# LINK LIST:
def get_address_data(soup):
    try:
        address_tag = soup.address
        address = address_tag.contents[1]
        meeting_name = soup.find(
            'div', class_='fui-card-body').find(class_='weight-300')
        name = meeting_name.contents[1]
        city_tag = meeting_name.find_next('a')
        city = city_tag.contents[0]
        state_tag = city_tag.find_next('a')
        state = state_tag.contents[0]
        return {'name': name, 'address': address, 'city': city, 'state': state}
    except IndexError as ie:
        print(f"Index Error: {ie}")
        try:
            return {'name': name, 'address': address, 'city': city, 'state': 'state'}
        except UnboundLocalError as ule:
            print(f"UnboundLocalError: {ule}")
            try:
                return {'name': name, 'address': address, 'city': 'city', 'state': state}
            except UnboundLocalError as ule:
                print(f"UnboundLocalError: {ule}")
                try:
                    return {'name': name, 'address': 'address', 'city': city, 'state': state}
                except UnboundLocalError as ule:
                    print(f"UnboundLocalError: {ule}")
                    try:
                        return {'name': 'name', 'address': address, 'city': city, 'state': state}
                    except UnboundLocalError as ule:
                        print(f"UnboundLocalError: {ule}")
# CREATE A FUNCTION THAT WILL EXTRACT ALL THE TABLE DATA FROM EACH LINK
# IN THE LINK LIST. THE TABLE DATA WILL THEN NEED TO BE PARSED AND
# CLEANED IF THERE ARE MULTIPLE ITEMS:
def get_table_data(soup):
    try:
        info_table = soup.find('table', class_='table fui-table')
        # obtain all the columns from <th>
        headers = []
        for i in info_table.find_all('th'):
            title = i.text
            headers.append(title.lower())
        # now create a dataframe:
        df = pd.DataFrame(columns=headers)
        # Now create the for loop to fill the dataframe
        # a row is under the <tr> tags and data under the <td> tags
        for j in info_table.find_all('tr')[1:]:
            # if info_table.find_all('tr')[1:] == AttributeError.NoneType:
            #     print("No info table found")
            row_data = j.find_all('td')
            row = [i.text for i in row_data]
            length = len(df)
            df.loc[length] = row
        # data['day'].append(df['day'].to_list())
        # data['time'].append(df['time'].to_list())
        # data['info'].append(df['info'].to_list())
        day = df['day'].to_list()
        time = df['time'].to_list()
        info = df['info'].to_list()
        # now return data
        return {'day': day, 'time': time, 'info': info}
    except AttributeError as ae:
        print(f"info_table.find_all('tr')[1:] raised error: {ae}")
        return {'day': 'day', 'time': 'time', 'info': 'info'}
# CREATE A FUNCTION THAT WILL PARSE THE ROW DATA AND STORE IT
# IN A DICTIONARY. THAT DICTIONARY CAN THEN BE INSERTED INTO
# A LIST OF DICTIONARIES CALLED ROW_LIST BUT TO DO THIS THE PARSER
# HAS TO JOIN LIST ITEMS INTO ONE LONG STRING SO EACH ROW HAS THE
# SAME NUMBER OF COLUMNS:
# THIS WAS INSERTED INTO meeting_data_scraper
def meeting_row_parser(item0, item1):
    """
    :param item0: This is the address data in a dictionary. Use the following keys to access
        the data -> Keys: 'name' - 'address' - 'city' - 'state'
    :param item1: This is the meeting details data in a dictionary. Use the following keys to
        access the data -> Keys: 'day' - 'time' - 'info'
    Create a final dictionary that will be used to store the information in the database as one row.
    I will need to join the list items to create one string with a | separating each item so I can
    split the string when retrieving the data.
    """
    try:
        row = {}
        try:
            row['name'] = item0['name']
            row['address'] = item0['address']
            row['city'] = item0['city']
            row['state'] = item0['state']
            # now add item1 to the row data
            row['day'] = ' | '.join(item1['day'])
            row['time'] = ' | '.join(item1['time'])
            row['info'] = ' | '.join(item1['info'])
            # now return the row data dictionary
            return row
        except Exception as e:
            print(f'{e}')
    except Exception as e:
        print(f'{e}')
    for k, v in row.items():
        if v is not None:
            pass
        else:
            v = str(k)
    return row
# THIS IS THE 'MAIN LOGICAL FUNCTION'. THIS FUNCTION WILL COMBINE THE
# get_address_data, get_table_data, and meeting_row_parser FUNCTIONS.
# THAT WAY I CAN EXECUTE ALL OF THE FUNCTIONS IN ONE CALL.
def meeting_data_scraper(link):
    def get_address_data(soup):
        try:
            address_tag = soup.address
            address = address_tag.contents[1]
            meeting_name = soup.find(
                'div', class_='fui-card-body').find(class_='weight-300')
            name = meeting_name.contents[1]
            city_tag = meeting_name.find_next('a')
            city = city_tag.contents[0]
            state_tag = city_tag.find_next('a')
            state = state_tag.contents[0]
            return {'name': name, 'address': address, 'city': city, 'state': state}
        except IndexError as ie:
            print(f"Index Error: {ie}")
            try:
                return {'name': name, 'address': address, 'city': city, 'state': 'state'}
            except UnboundLocalError as ule:
                print(f"UnboundLocalError: {ule}")
                try:
                    return {'name': name, 'address': address, 'city': 'city', 'state': state}
                except UnboundLocalError as ule:
                    print(f"UnboundLocalError: {ule}")
                    try:
                        return {'name': name, 'address': 'address', 'city': city, 'state': state}
                    except UnboundLocalError as ule:
                        print(f"UnboundLocalError: {ule}")
                        try:
                            return {'name': 'name', 'address': address, 'city': city, 'state': state}
                        except UnboundLocalError as ule:
                            print(f"UnboundLocalError: {ule}")

    def get_table_data(soup):
        try:
            info_table = soup.find('table', class_='table fui-table')
            # obtain all the columns from <th>
            headers = []
            for i in info_table.find_all('th'):
                title = i.text
                headers.append(title.lower())
            # now create a dataframe:
            df = pd.DataFrame(columns=headers)
            # Now create the for loop to fill the dataframe
            # a row is under the <tr> tags and data under the <td> tags
            for j in info_table.find_all('tr')[1:]:
                # if info_table.find_all('tr')[1:] == AttributeError.NoneType:
                #     print("No info table found")
                row_data = j.find_all('td')
                row = [i.text for i in row_data]
                length = len(df)
                df.loc[length] = row
            # data['day'].append(df['day'].to_list())
            # data['time'].append(df['time'].to_list())
            # data['info'].append(df['info'].to_list())
            day = df['day'].to_list()
            time = df['time'].to_list()
            info = df['info'].to_list()
            # now return data
            return {'day': day, 'time': time, 'info': info}
        except AttributeError as ae:
            print(f"info_table.find_all('tr')[1:] raised error: {ae}")
            return {'day': 'day', 'time': 'time', 'info': 'info'}

    def meeting_row_parser(item0, item1):
        """
        :param item0: This is the address data in a dictionary. Use the following keys to access
            the data -> Keys: 'name' - 'address' - 'city' - 'state'
        :param item1: This is the meeting details data in a dictionary. Use the following keys to
            access the data -> Keys: 'day' - 'time' - 'info'
        Create a final dictionary that will be used to store the information in the database as one row.
        I will need to join the list items to create one string with a | separating each item so I can
        split the string when retrieving the data.
        """
        row = {}
        try:
            row['name'] = item0['name']
            row['address'] = item0['address']
            row['city'] = item0['city']
            row['state'] = item0['state']
        except Exception as e:
            print(e)
            row['name'] = 'name'
            row['address'] = 'address'
            row['city'] = 'city'
            row['state'] = 'state'
        # now add item1 to the row data
        row['day'] = ' | '.join(item1['day'])
        row['time'] = ' | '.join(item1['time'])
        row['info'] = ' | '.join(item1['info'])
        # now return the row data dictionary
        return row

    # STEP 1: Get the webpage response obj from requests
    page = requests.get(link)
    # STEP 2: Get Soup object for html parsing
    soup = bs(page.text, "lxml")
    # Create two dicts with the following keys
    address_dict = get_address_data(soup)
    details_dict = get_table_data(soup)
    d = [address_dict, details_dict]
    row_data = meeting_row_parser(d[0], d[1])
    return row_data
Here are the multiprocessing.Pool functions:
# THE MAIN FUNCTION. THIS WILL BE USED WITH THE POOL.MAP METHOD
# TO ATTEMPT TO EXTRACT ALL THE DATA FROM THOSE LINKS QUICKER
# AND MORE EFFICIENTLY. THIS IS THE 'WORKER'
def worker(links):
    # row_data = [meeting_data_scraper(link) for link in links]
    row_data = meeting_data_scraper(links)
    return row_data

# Create a Pool factory function to determine thread or process
def pool_factory(key, n):
    if key == 'proc':
        print('using procs instead of threads')
        return Pool(n)
    else:
        return ThreadPool(n)

def main():
    """
    :description: This will execute the logic for scraping the link_list
        from the meetings.csv file. It also brings together the Pool.map
        method to make scraping easier and faster.
    :csv_filename: in import cell top of file
    :info: To change from using 'processes' just change 'proc'
        in pool_factory to 'thread'
    """
    link_list = get_links(csv_filename)
    pool = pool_factory('proc', 10)
    results = pool.map(worker, link_list[:1000])
    pool.close()
    pool.join()
    data = [len(x) for x in results]
    return data

data = main()
That is the complete codebase. I know it is a lot, but I wanted to make sure everything was there. I have tried different solutions for three days and nothing is working. I tried it without functions, I tried asyncio, and I tried concurrent.futures' ThreadPoolExecutor. Nothing seems to work.
THIS IS THE ERROR MESSAGE:
using procs instead of threads
Index Error: list index out of range
Index Error: list index out of range
Index Error: list index out of range
---------------------------------------------------------------------------
MaybeEncodingError Traceback (most recent call last)
/home/blackbox/projects/github/web-scraper/scraped_meeting_data/meeting_data_nb.ipynb Cell 13' in <cell line: 1>()
----> 1 main()
/home/blackbox/projects/github/web-scraper/scraped_meeting_data/meeting_data_nb.ipynb Cell 12' in main()
12 link_list = get_links(csv_filename)
13 pool = pool_factory('proc', 10)
---> 14 results = pool.map(worker, link_list[:1000])
15 pool.close()
16 pool.join()
File /usr/lib/python3.8/multiprocessing/pool.py:364, in Pool.map(self, func, iterable, chunksize)
359 def map(self, func, iterable, chunksize=None):
360 '''
361 Apply `func` to each element in `iterable`, collecting the results
362 in a list that is returned.
363 '''
--> 364 return self._map_async(func, iterable, mapstar, chunksize).get()
File /usr/lib/python3.8/multiprocessing/pool.py:771, in ApplyResult.get(self, timeout)
769 return self._value
770 else:
--> 771 raise self._value
MaybeEncodingError: Error sending result: '[{'name': ' Contoocook Keep It Simple Group', 'address': ' 24 Maple St ', 'city': 'Hopkinton', 'state': 'New Hampshire', 'day': 'Sunday', 'time': '7:00 pm - 8:00 pm', 'info': 'Wheelchair Access, Speaker, Discussion, Concurrent with Al-Anon'}, {'name': ' Triangle Group Online', 'address': ' 401 Southwest Plaza', 'city': 'Arlington', 'state': 'Texas', 'day': 'Monday | Saturday', 'time': '7:00 pm - 8:00 pm | 9:00 am - 10:00 am', 'info': 'Big Book, Open Meeting | Big Book, Open Meeting'}, {'name': ' AA Happy Thoughts Group', 'address': ' 1330 Hooksett Road ', 'city': 'Hooksett', 'state': 'New Hampshire', 'day': 'Thursday', 'time': '7:00 pm - 8:00 pm', 'info': 'Discussion, Virtual'}, {'name': ' Esperanza De Vivir', 'address': ' 2323 West Lincoln Avenue', 'city': 'Anaheim', 'state': 'California', 'day': 'Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday', 'time': '8:00 pm - 9:00 pm | 8:00 pm - 9:00 pm | 8:00 pm - 9:00 pm | 8:00 pm - 9:00 pm | 8:00 pm - 9:00 pm | 8:00 pm - 9:00 pm | 8:00 pm - 9:00 pm', 'info': 'Spanish | Spanish | Spanish | Spanish | Spanish | Spanish | Spanish'}, {'name': ' Discussion Anaheim', 'address': ' 202 West Broadway', 'city': 'Anaheim', 'state': 'California', 'day': 'Wednesday | Wednesday | Thursday | Friday | Sunday', 'time': '12:00 pm - 1:00 pm | 5:30 pm - 6:30 pm | 5:30 pm - 6:30 pm | 12:00 pm - 1:00 pm | 5:30 pm - 6:30 pm', 'info': 'Wheelchair Access, Open Meeting | Wheelchair Access, Open Meeting | Wheelchair Access, Open Meeting | Wheelchair Access, Open Meeting | Wheelchair Access, Open Meeting'}, {'name': ' In the Hall Group', 'address': ' 50 Westminster Street ', 'city': 'Walpole', 'state': 'New Hampshire', 'day': 'Monday | Wednesday | Friday', 'time': '7:00 pm - 8:00 pm | 7:00 pm - 8:00 pm | 7:00 pm - 8:00 pm', 'info': 'Discussion | Discussion | Discussion'}, {'name': ' Shining Light Group', 'address': ' 162 Village Road ', 'city': 'Newbury', 'state': 'New Hampshire', 'day': 'Monday | Tuesday | Wednesday | Thursday | Friday', 'time': '8:30 am - 9:30 am | 8:30 am - 9:30 am | 8:30 am - 9:30 am | 8:30 am - 9:30 am | 8:30 am - 9:30 am', 'info': 'Wheelchair Access, Discussion | Wheelchair Access, Discussion | Wheelchair Access, Discussion | Wheelchair Access, Discussion | Wheelchair Access, Discussion'}, {'name': ' Carry The Message', 'address': ' 727 South Harbor Boulevard', 'city': 'Anaheim', 'state': 'California', 'day': 'Saturday', 'time': '6:00 am - 7:00 am', 'info': 'Open Meeting'}, {'name': ' Chester Big Book Group', 'address': ' 1 Chester Street ', 'city': 'Chester', 'state': 'New Hampshire', 'day': 'Wednesday', 'time': '8:00 pm - 9:30 pm', 'info': 'Big Book'}, {'name': ' Bills Babes Womens Book Study', 'address': ' 311 West South Street', 'city': 'Anaheim', 'state': 'California', 'day': 'Thursday', 'time': '6:30 pm - 7:30 pm', 'info': 'Closed Meeting, Women'}, {'name': ' Big Book Study', 'address': ' 202 West Broadway', 'city': 'Anaheim', 'state': 'California', 'day': 'Monday', 'time': '8:30 pm - 9:30 pm', 'info': 'Wheelchair Access, Open Meeting'}, {'name': ' Saturday Night Live', 'address': ' 2759 South King Street ', 'city': 'Honolulu', 'state': 'Hawaii', 'day': 'Saturday', 'time': '8:00 pm - 9:00 pm', 'info': 'Discussion, English, Open Meeting, Speaker, Virtual'}, {'name': ' Pali Womens Group', 'address': ' 2747 Pali Highway ', 'city': 'Honolulu', 'state': 'Hawaii', 'day': 'Thursday', 'time': '5:30 pm - 6:30 pm', 'info': 'Discussion, English, Open Meeting, Women, Virtual'}, {'name': ' Ohua 
Group', 'address': ' 310 Paoakalani Avenue ', 'city': 'Honolulu', 'state': 'Hawaii', 'day': 'Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday', 'time': '7:30 pm - 8:30 pm | 7:30 pm - 8:30 pm | 7:30 pm - 8:30 pm | 7:30 pm - 8:30 pm | 7:30 pm - 8:30 pm | 7:30 pm - 8:30 pm', 'info': 'English, LGBTQ, Newcomer, Open Meeting, Virtual | Discussion, English, LGBTQ, Open Meeting, Speaker, Virtual | English, LGBTQ, Open Meeting, Virtual | English, LGBTQ, Open Meeting, Step Meeting, Virtual | Discussion, English, LGBTQ, Literature, Open Meeting, Virtual | Big Book, English, LGBTQ, Open Meeting, Virtual'}, {'name': ' New Beginnings Womens Meeting', 'address': ' 5919 Kalanianaʻole Highway ', 'city': 'Honolulu', 'state': 'Hawaii', 'day': 'Tuesday', 'time': '7:00 pm - 8:00 pm', 'info': 'Big Book, Child-Friendly, English, Newcomer, Women, Virtual'}, {'name': ' Malia Discussion', 'address': ' 766 North King Street ', 'city': 'Honolulu', 'state': 'Hawaii', 'day': 'Thursday', 'time': '7:30 pm - 9:00 pm', 'info': 'Discussion, English, Open Meeting, Speaker, Virtual'}, {'name': ' Huntsville Group', 'address': ' 309 Church Avenue ', 'city': 'Huntsville', 'state': 'Arkansas', 'day': 'Wednesday | Saturday', 'time': '8:00 pm - 9:00 pm | 8:00 pm - 9:00 pm', 'info': 'Closed Meeting | Open Meeting'}, {'name': ' Squad 26 Group', 'address': ' 1415 6th Avenue', 'city': 'Anoka', 'state': 'Minnesota', 'day': 'Thursday', 'time': '7:30 pm - 8:30 pm', 'info': 'English'}, {'name': ' Squad 23 Anoka', 'address': ' 2700 North Ferry Street', 'city': 'Anoka', 'state': 'Minnesota', 'day': 'Sunday', 'time': '7:30 pm - 8:30 pm', 'info': 'Open Meeting, Wheelchair Access'}, {'name': ' Squad 20 Anoka', 'address': ' 2700 North Ferry Street', 'city': 'Anoka', 'state': 'Minnesota', 'day': 'Saturday', 'time': '8:00 am - 9:00 am', 'info': 'Men, Closed Meeting, Wheelchair Access'}, {'name': ' Saturday Night Fever', 'address': ' 1631 Esmeralda Place ', 'city': 'Minden', 'state': 'Nevada', 'day': 'Saturday', 'time': '7:00 pm - 8:00 pm', 'info': 'Open Meeting, Discussion, Wheelchair Access'}, {'name': ' Squad 2 Anoka', 'address': ' 2700 North Ferry Street', 'city': 'Anoka', 'state': 'Minnesota', 'day': 'Tuesday', 'time': '8:00 pm - 9:00 pm', 'info': 'Child-Friendly, Open Meeting, Wheelchair Access'}, {'name': ' Squad 18 Anoka', 'address': ' 2700 North Ferry Street', 'city': 'Anoka', 'state': 'Minnesota', 'day': 'Wednesday', 'time': '6:30 pm - 7:30 pm', 'info': 'Open Meeting, Wheelchair Access'}, {'name': ' Language of the Heart', 'address': ' 850 West 4th Street ', 'city': 'Fallon', 'state': 'Nevada', 'day': 'Wednesday', 'time': '5:15 pm - 6:15 pm', 'info': 'Open Meeting, Discussion, Wheelchair Access'}, {'name': ' Speaker and Birthday Meeting', 'address': ' 457 Esmeralda Street ', 'city': 'Fallon', 'state': 'Nevada', 'day': 'Saturday', 'time': '7:00 pm - 8:00 pm', 'info': 'Birthday, Open Meeting, Speaker, Wheelchair Access'}]'. Reason: 'RecursionError('maximum recursion depth exceeded while pickling an object')'
I just need to be able to go through the entire list of links (which is 33,488), get the page contents, and extract the name, address, and info-table data. I then want to put the extracted data into a dictionary and append that dictionary to a list, so that each dictionary will be like a row and I can then add the data (each dict) to a CSV row.
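For that last CSV step, here is a minimal sketch using the standard library's csv.DictWriter; the sample row is hypothetical, standing in for a meeting_data_scraper result (note the parameter is spelled fieldnames, not field_names):

import csv

# Hypothetical row dicts standing in for meeting_data_scraper results.
rows = [
    {'name': 'Example Group', 'address': '1 Main St', 'city': 'Anytown',
     'state': 'NH', 'day': 'Monday', 'time': '7:00 pm - 8:00 pm',
     'info': 'Open Meeting'},
]

with open('new_meeting_data.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()    # header row from the dict keys
    writer.writerows(rows)  # one CSV row per dict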
I tried a for loop with the meeting_data_scraper function, scraping one link at a time; it ran for 2 1/2 hours and then hit a connection error. This is not the first time I have had issues like this. I can never get my code to scrape big lists.

How to find time format in a text?

I want to extract only the time from text that uses many different date and time formats, such as 'thursday, august 6, 2020 4:32:54 pm', '25 september 2020 04:05 pm' and '29 april 2020 07:42'. So I want to extract, for example, 4:32:54, 07:42, and 04:05. Can you help me with that?
I would try something like this:
times = [
    'thursday, august 6, 2020 4:32:54 pm',
    '25 september 2020 04:05 pm',
    '29 april 2020 07:42',
]
print("\n".join("".join(i for i in t.split() if ":" in i) for t in times))
Output:
4:32:54
04:05
07:42
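If a timestamp can ever be attached to other punctuation rather than cleanly whitespace-separated, a regex may be more robust than splitting on whitespace; a sketch using the standard re module:

import re

times = [
    'thursday, august 6, 2020 4:32:54 pm',
    '25 september 2020 04:05 pm',
    '29 april 2020 07:42',
]

# \d{1,2}:\d{2} matches HH:MM; the optional (?::\d{2}) group adds seconds.
for t in times:
    m = re.search(r'\b\d{1,2}:\d{2}(?::\d{2})?\b', t)
    if m:
        print(m.group())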

Splitting a list of date ranges into two dates

I want to split the following list of dates:
Month = ['1 October 2020 to 31 October 2020',
         '1 October 2020 to 31 October 2020',
         '1 October 2020 to 31 October 2020',
         '1 October 2020 to 31 October 2020',
         '1 October 2020 to 31 October 2020']
The desired output is as follows:
Month = [['1 October 2020', '31 October 2020'],
         ['1 October 2020', '31 October 2020'],
         ['1 October 2020', '31 October 2020'],
         ['1 October 2020', '31 October 2020'],
         ['1 October 2020', '31 October 2020']]
How can I do this using regex?
I used Month.str.split('to'), but this does not work properly because 'October' contains 'to', so it splits 'October' itself into pieces. Therefore, I guess I have to use regex for this. What is the best way to achieve this?
Use ' to ' (with spaces) as the separator instead of just 'to' -- this matches the input format better anyway, since if you split on 'to' you'd also need to strip the whitespace.
>>> [i.split(' to ') for i in Month]
[['1 October 2020', '31 October 2020'], ['1 October 2020', '31 October 2020'], ['1 October 2020', '31 October 2020'], ['1 October 2020', '31 October 2020'], ['1 October 2020', '31 October 2020']]
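And if you specifically want the regex version the question asks about, splitting on 'to' bounded by whitespace behaves the same way; a short sketch:

import re

Month = ['1 October 2020 to 31 October 2020'] * 5

# \s+to\s+ only matches a standalone 'to' surrounded by whitespace,
# so the 'to' inside 'October' is never touched.
result = [re.split(r'\s+to\s+', s) for s in Month]
print(result)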

Selenium vs. the NY Metropolitan Opera

First, obligatory advance apologies - almost a newbie here, and this is my first question; please be kind...
I'm struggling to scrape JavaScript-generated pages; in particular, those of the Metropolitan Opera schedule. For any given month, I would like to create a calendar with just the name of the production and the date and time of each performance. I threw BeautifulSoup and Selenium at it, and I can get tons of info about the composer's love life, etc. - but not these 3 elements. Any help would be greatly appreciated.
Link to a random month in their schedule
One thing that you should look for (in the future) on websites is calls to an API. I opened up Chrome DevTools (F12) and reloaded the page while in the Network tab.
I found two api calls, one for "productions" and one for "events". The "events" response has much more information. This code below makes a call to the "events" endpoint and then returns a subset of that data (specifically, title, date and time according to your description).
I wasn't sure what you wanted to do with that data so I just printed it out. Let me know if the code needs to be updated/modified and I will do my best to help!
I wrote this code using Python 3.6.4
from datetime import datetime

import requests

BASE_URL = 'http://www.metopera.org/api/v1/calendar'

EVENT = """\
Title: {title}
Date: {date}
Time: {time}
---------------\
"""

def get_events(*, month, year):
    params = {
        'month': month,
        'year': year
    }
    r = requests.get('{}/events'.format(BASE_URL), params=params)
    r.raise_for_status()
    return r.json()

def get_name_date_time(*, events):
    result = []
    for event in events:
        d = datetime.strptime(event['eventDateTime'], '%Y-%m-%dT%H:%M:%S')
        result.append({
            'title': event['title'],
            'date': d.strftime('%A, %B %d, %Y'),
            'time': d.strftime('%I:%M %p')
        })
    return result

if __name__ == '__main__':
    events = get_events(month=11, year=2018)
    names_dates_times = get_name_date_time(events=events)
    for event in names_dates_times:
        print(EVENT.format(**event))
Console:
Title: Tosca
Date: Friday, November 02, 2018
Time: 08:00 PM
---------------
Title: Carmen
Date: Saturday, November 03, 2018
Time: 01:00 PM
---------------
Title: Marnie
Date: Saturday, November 03, 2018
Time: 08:00 PM
---------------
Title: Tosca
Date: Monday, November 05, 2018
Time: 08:00 PM
---------------
Title: Carmen
Date: Tuesday, November 06, 2018
Time: 07:30 PM
---------------
Title: Marnie
Date: Wednesday, November 07, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Thursday, November 08, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Friday, November 09, 2018
Time: 08:00 PM
---------------
Title: Marnie
Date: Saturday, November 10, 2018
Time: 01:00 PM
---------------
Title: Carmen
Date: Saturday, November 10, 2018
Time: 08:00 PM
---------------
Title: Mefistofele
Date: Monday, November 12, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Tuesday, November 13, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Wednesday, November 14, 2018
Time: 07:30 PM
---------------
Title: Carmen
Date: Thursday, November 15, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Friday, November 16, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Saturday, November 17, 2018
Time: 01:00 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Saturday, November 17, 2018
Time: 08:00 PM
---------------
Title: Mefistofele
Date: Monday, November 19, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Tuesday, November 20, 2018
Time: 08:00 PM
---------------
Title: Il Trittico
Date: Friday, November 23, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Saturday, November 24, 2018
Time: 01:00 PM
---------------
Title: Mefistofele
Date: Saturday, November 24, 2018
Time: 08:00 PM
---------------
Title: Il Trittico
Date: Monday, November 26, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Tuesday, November 27, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Wednesday, November 28, 2018
Time: 07:30 PM
---------------
Title: La Bohème
Date: Thursday, November 29, 2018
Time: 07:30 PM
---------------
Title: Il Trittico
Date: Friday, November 30, 2018
Time: 07:30 PM
---------------
For reference, here is a link to the full JSON response from the events endpoint. There is a bunch more potentially interesting information you may want but I just grabbed the subset of what you asked for in the description.
