I currently have a dictionary where the values are:
disney_data = {
'title': ['Gus (1976)',
'Johnny Kapahala: Back on Board (2007)',
'The Adventures of Huck Finn (1993)',
'The Simpsons (1989)',
'Atlantis: Milo’s Return (2003)']
}
I would like to split up the title from the year value and have a dictionary like:
new_disney_data = {
'title' : ['Gus',
'Johnny Kapahala: Back on Board',
'The Adventures of Huck Finn',
'The Simpsons',
'Atlantis: Milo’s Return'],
'year' : ['1976',
'2007',
'1993',
'1989',
'2003']
}
I tried using the following, but I know something is off. I'm still relatively new to Python, so any help would be greatly appreciated!
for value in disney_data.values():
new_disney_data['title'].append(title[0,-7])
new_disney_data['year'].append(title[-7,-1])
There are two concepts you can use here:
The first is .split(). This is usually more robust than indexing into the string (for example, if someone added a space after the closing bracket).
The second is a list comprehension.
Using these two, here is one possible solution.
titles = [item.split('(')[0].strip() for item in disney_data['title']]
years = [item.split('(')[1].split(')')[0].strip() for item in disney_data['title']]
new_disney_data = {
'title': titles,
'year': years
}
print(new_disney_data)
Edit: I also used .strip(). This removes leading and trailing whitespace (spaces, tabs, or newlines) from a string.
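One case the split('(') approach does not cover is a title that itself contains an opening bracket. A minimal sketch of a variation, assuming every entry ends with a bracketed year, is to split at the last " (" with str.rpartition (the shortened sample list is only for illustration):

```python
# Sketch: split each "Title (YYYY)" string at the LAST " (" it contains.
disney_data = {
    'title': ['Gus (1976)',
              'The Adventures of Huck Finn (1993)',
              'Atlantis: Milo’s Return (2003)']
}

titles, years = [], []
for item in disney_data['title']:
    # rpartition returns (text before, separator, text after the last match)
    name, _, tail = item.rpartition(' (')
    titles.append(name.strip())
    # drop the closing bracket and any stray whitespace around the year
    years.append(tail.strip().rstrip(')').strip())

new_disney_data = {'title': titles, 'year': years}
print(new_disney_data)
```

Because the split happens at the last separator, brackets earlier in the title are left alone.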
You're not that far off. In your for-loop you iterate over values of the dict, but you want to iterate over the titles. Also the string slicing syntax is [id1:id2]. So this would probably do what you are looking for:
new_disney_data = {"title":[], "year":[]}
for value in disney_data["title"]:
    new_disney_data['title'].append(value[0:-7])
    new_disney_data['year'].append(value[-5:-1])
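To see what the two slices select, here is a small sketch on a single title. It assumes the string ends with exactly " (YYYY)"; the second value, with a trailing space, is a made-up input that shows how fixed offsets break when that assumption fails:

```python
value = 'Gus (1976)'
title_part = value[0:-7]   # everything except the last 7 characters, ' (1976)'
year_part = value[-5:-1]   # the 4 characters just before the closing ')'
print(repr(title_part), repr(year_part))

# Hypothetical input with a trailing space: the fixed offsets no longer line up.
shifted = 'Gus (1976) '
print(repr(shifted[-5:-1]))
```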
new_disney_data = {
'title': [i[:-6].rstrip() for i in disney_data['title']],
'year': [i[-5:-1] for i in disney_data['title']]
}
this code can do it
import re
disney_data = {
'title': ['Gus (1976)',
'Johnny Kapahala: Back on Board (2007)',
'The Adventures of Huck Finn (1993)',
'The Simpsons (1989)',
'Atlantis: Milo’s Return (2003)']
}
disney_data['year'] = []
for index,line in enumerate(disney_data.get('title')):
    match = re.search(r'\d{4}', line)
    if match is not None:
        disney_data['title'][index] = line.split('(')[0].strip()
        disney_data['year'].append(match.group())
print(disney_data)
For every entry in the title list, it searches for four consecutive digits; if a match exists, the digits are added to year, and the parenthesized year is removed from the title.
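One caveat with searching for any four digits is that a title may contain digits of its own. A hedged sketch of a stricter variant anchors the year to the end of the string and captures title and year in one pass. The pattern, the helper name split_title, and the extra sample titles are assumptions for illustration, not part of the answer above:

```python
import re

# Assumed pattern: title text, then a 4-digit year in parentheses at the very end.
YEAR_AT_END = re.compile(r'^(.*?)\s*\((\d{4})\)\s*$')

def split_title(line):
    """Return (title, year); year is None when no trailing '(YYYY)' is found."""
    match = YEAR_AT_END.match(line)
    if match is None:
        return line.strip(), None
    return match.group(1), match.group(2)

print(split_title('Gus (1976)'))
print(split_title('2001: A Space Odyssey (1968)'))  # digits inside the title
print(split_title('Untitled project'))              # no year at all
```

Returning None for a missing year keeps the two result lists the same length if you append both values for every entry.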
Something like this
disney_data = {
'title': ['Gus (1976)',
'Johnny Kapahala: Back on Board (2007)',
'The Adventures of Huck Finn (1993)',
'The Simpsons (1989)',
'Atlantis: Milo’s Return (2003)']
}
new_disney_data = {'title': [], 'year': []}
#split title into two columns title and year in new dict
for title in disney_data['title']:
    new_disney_data['title'].append(title.split('(')[0].strip()) #split title by '(' and drop the trailing space
    new_disney_data['year'].append(title.split('(')[1].split(')')[0]) #split year by ')'
print(disney_data)
print(new_disney_data)
Using split and replace.
def split(data):
    o = {'title' : [], 'year' : []}
    for (t, y) in [d.replace(')','').split(' (') for d in data['title']]:
        o['title'].append(t)
        o['year'].append(y)
    return o
Using Regular Expression
import re
def regex(data):
    # raw string so the backslashes reach the regex engine unchanged
    r = re.compile(r"(.*?) \((\d{4})\)")
    o = {'title' : [], 'year' : []}
    for (t, y) in [r.findall(d)[0] for d in data['title']]:
        o['title'].append(t)
        o['year'].append(y)
    return o
I have a collection of subtitle files that contain dialogues, like this:
1
00:00:02,460 --> 00:00:07,020
JOHN: Great.
2
00:00:07,020 --> 00:00:11,850
How are you today?
JANE: Quite alright.
JOHN: Perfect.
3
00:00:11,850 --> 00:00:17,230
Had a busy day?
4
00:00:17,230 --> 00:00:28,070
JANE: Not so much. And you?
5
00:00:28,070 --> 00:00:32,300
JOHN: Mine was okay too. Gimme a few extra minutes.
I would like to extract only, for example, JANE, and then both, and to have a resulting string or file, like this:
Quite alright
Not so much
And you
And then both speakers combined, like this:
Great
How are you today
Quite alright
Perfect
Had a busy day
Not so much
And you
Mine was okay too
Gimme a few extra minutes
So, the result is one sentence per line with punctuation removed (all except ', which is kept for contractions, e.g., don't).
So far I have managed to strip the punctuation and the numbers/timestamps. I've been using regex (infile is the input file; the first re.sub() tidies up cases where there is no space after a punctuation mark):
for line in infile:
    if not line[0].isnumeric():
        line = re.sub('(?<=[,;:.!?])(?=[a-zA-Z])', r' ', line)
        lines += re.sub(r'[^a-zA-Z\'\ \n]+', r'', line)
Sadly, I haven't found any elegant way to condition and extract lines that belong to one specific speaker. In principle, I would like to be able to choose whether all will be saved to the same string/file, each speaker to a separate string/file (or one speaker only).
You basically just have to keep sniffing for change of speaker and build up a nice array of structured data:
current_speaker = None
dialogue = []
while True:
    the_line = fetchLine(fromWherever)  # fetchLine stands for any line reader
    if the_line is None:
        break
    if the_line == '':
        continue
    if the_line.isnumeric():
        fetchLine(fromWherever) # Get the timeline that follows a block count
        continue # ignore it all for now
    # Actual speaker line.
    m = re.search(r"^(\S+):", the_line)
    if m is not None:
        spk = m.groups()[0]
        current_speaker = spk
        the_line = the_line[len(spk)+2:] # remove name, colon, and 1 space
    dialogue.append({"spk":current_speaker,"text":the_line})
print(dialogue)
[{'spk': 'JOHN', 'text': 'Great.'}, {'spk': 'JOHN', 'text': 'How are you today? '}, {'spk': 'JANE', 'text': 'Quite alright. '}, {'spk': 'JOHN', 'text': 'Perfect.'}, {'spk': 'JOHN', 'text': 'Had a busy day?'}, {'spk': 'JANE', 'text': 'Not so much. And you?'}, {'spk': 'JOHN', 'text': 'Mine was okay too. Gimme a few extra minutes.'}]
After this, it is a simple matter of post-processing the array to turn sentences into more entries or write to a file, etc.
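The post-processing step could look roughly like the sketch below. It assumes the list-of-dicts structure printed above (only a few entries are copied here), and the helper name sentences_for is made up for illustration:

```python
import re

# Sample of the structure built above (copied from the printed output).
dialogue = [
    {'spk': 'JOHN', 'text': 'Great.'},
    {'spk': 'JOHN', 'text': 'How are you today? '},
    {'spk': 'JANE', 'text': 'Quite alright. '},
    {'spk': 'JANE', 'text': 'Not so much. And you?'},
]

def sentences_for(entries, speaker=None):
    """Return one cleaned sentence per item; speaker=None keeps everyone."""
    result = []
    for entry in entries:
        if speaker is not None and entry['spk'] != speaker:
            continue
        # Split at sentence-ending punctuation, then drop everything
        # except letters, apostrophes and spaces.
        for part in re.split(r'[.!?]+', entry['text']):
            cleaned = re.sub(r"[^a-zA-Z' ]+", '', part).strip()
            if cleaned:
                result.append(cleaned)
    return result

print('\n'.join(sentences_for(dialogue, 'JANE')))
```

Calling sentences_for(dialogue) without a speaker returns the combined transcript, and the resulting list can be joined with newlines and written to a file per speaker if needed.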
I have a list of strings as follows :
list_of_words = ['all saints church','churchill college', "great saint mary's church", 'holy trinity church', "little saint mary's church", 'emmanuel college']
And I have a list of dictionaries that contains 'text' as key and a sentence as a value. It is as follows :
"dict_sentences": [
{
"text": "Can you help me book a taxi going from emmanuel college to churchill college?"
},
{
"text": "Yes, I could! What time would you like to depart from Emmanuel College?"
},
{
"text": "I want a taxi to holy trinity church"
},
{
"text": "Alright! I have a yellow Lexus booked to pick you up. The Contact number is 07543493643. Anything else I can help with?"
},
{
"text": "No, that is everything I needed. Thank you!"
},
{
"text": "Thank you! Have a great day!"
}
]
For each sentence in dict_sentences, I want to check if any of the words from list_of_words exists in that sentence and if yes, I want to store it in another dictionary(as I have to further work on it).
For example, in the first sentence in dict_sentences, "Can you help me book a taxi going from emmanuel college to churchill college?", the substring "churchill college" and 'emmanuel college' exists in our list_of_words, so I want to store the word 'churchill college' and 'emmanuel college' in another dictionary like { sent1 : ['churchill college', 'emmanuel college'] }
So the expected output would be :
{ sent1 : ['churchill college', 'emmanuel college'] ,
sent2 : [ 'emmanuel college' ],
sent3 : [ 'holy trinity church' ]
} # ignore the rest of sentences as no word from list_of_words exist in them
The main problem here is checking if given sentence consists of word/group of words (like 'holy trinity church' - 3 words) in the given sentence and if yes, extracting the same. I went through other answers and following code was suggested for checking if a word from a list occurs in a sentence :
if any(word in sentence for word in list_of_words):
    pass
However, this only tells us whether some entry of list_of_words occurs in the sentence; to extract the matching entries I would have to run for loops. I would like to avoid explicit loops because I need a very time-efficient solution: I have around 300 documents, each consisting of 10-15 (or more) such sentences, and list_of_words is also large, around 300 strings. So I need a time-efficient way to check for and extract the entries of list_of_words that occur in a given sentence.
You could use re.findall so there's no nested loop.
import re
output = {}
find_words = re.compile('|'.join(list_of_words)).findall
for i, (s,) in enumerate(map(dict.values, data['dict_sentences']), 1):
    words = find_words(s.lower())
    if words:
        output[f"sent{i}"] = words
{'sent1': ['emmanuel college', 'churchill college'],
'sent2': ['emmanuel college'],
'sent3': ['holy trinity church']}
This can also be done in a dict comprehension using the walrus operator in Python 3.8+, although that may be a little overboard:
find_sent = re.compile('|'.join(list_of_words)).findall
iter_sent = enumerate(map(dict.values, data['dict_sentences']), 1)
output = {f"sent{i}": words for i, (s,) in iter_sent if (words := find_sent(s.lower()))}
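Joining the phrases with '|' works for this data, but a few hedged refinements may be worth considering for a larger list: escaping each phrase with re.escape, trying longer phrases first, adding word boundaries, and matching case-insensitively. The sketch below uses an abbreviated copy of the sample sentences as plain strings, which is an assumption about how the data is unpacked:

```python
import re

list_of_words = ['all saints church', 'churchill college',
                 "great saint mary's church", 'holy trinity church',
                 "little saint mary's church", 'emmanuel college']
sentences = [
    "Can you help me book a taxi going from emmanuel college to churchill college?",
    "Yes, I could! What time would you like to depart from Emmanuel College?",
    "I want a taxi to holy trinity church",
    "Thank you! Have a great day!",
]

# Longest phrases first so a longer phrase wins over a shorter prefix of it;
# re.escape keeps characters such as '.' or '(' literal.
alternatives = sorted(map(re.escape, list_of_words), key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(alternatives) + r')\b', re.IGNORECASE)

output = {}
for i, text in enumerate(sentences, 1):
    found = [m.lower() for m in pattern.findall(text)]
    if found:
        output[f"sent{i}"] = found
print(output)
```

The pattern is compiled once and reused for every sentence, so each sentence is scanned a single time regardless of how many phrases are in the list.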
There might be a more efficient way to do this with something like itertools, but I am not very familiar with it.
test = {"dict_sentences":...} # I'm assuming it's a section of a json or a larger dictionary.
output = {}
j = 1
for sent in test["dict_sentences"]:
    addition = []
    for i in list_of_words:
        if i.upper() in sent["text"].upper():
            addition.append(i)
    if addition:
        output[f"sent{j}"] = addition
    j += 1
You can do a nested dict comprehension and compare the content by transforming both to lower case, for example:
output = {
    f"sent{i+1}": [
        phrase for phrase in list_of_words if phrase.lower() in sentence['text'].lower()
    ] for i,sentence in enumerate(dict_sentences)
}
output_without_empty_matches = { k:v for k,v in output.items() if v }
print(output_without_empty_matches)
>>> {'sent1': ['churchill college', 'emmanuel college'], 'sent2': ['emmanuel college'], 'sent3': ['holy trinity church']}
new_list=[]
new_dict={}
for index, subdict in enumerate(dict_sentences):
    for word in list_of_words:
        if word in subdict['text'].lower():
            key="sent"+str(index+1)
            new_list.append(word)
            new_dict[key]=new_list
    new_list=[]
print(new_dict)
I am trying to loop through two querysets with keys based on dates in the set. Each date has two types of items: Life events and work. The dict should look like this:
Timeline['1980']['event'] = "He was born"
Timeline['1992']['work'] = "Symphony No. 1"
Timeline['1993']['event'] = "He was married"
Timeline['1993']['work'] = "Symphony No. 2"
How do I create this dictionary?
I tried the following:
timeline = defaultdict(list)
for o in opus:
    if o.date_comp_f is not None:
        timeline[o.date]['work'].append(o)
timeline = dict(timeline)
for e in event:
    if e.date_end_y is not None:
        timeline[e.date]['event'].append(e)
timeline = dict(timeline)
I keep getting bad Key errors.
t = {}
t['1980'] = {
'event':'He was born',
'work':'None'
}
Or
t = {}
t['1980'] = {}
t['1980']['event'] = 'He was born'
t['1980']['work'] = 'None'
I am not sure what you want, but I guess you want to initialize a dictionary where you can make such assignments. You may need something like this:
from collections import defaultdict
# Create empty dict for assignment
Timeline = defaultdict(defaultdict)
# store values
Timeline['1980']['event'] = "He was born"
Timeline['1992']['work'] = "Symphony No. 1"
# Create a regular dict for checking if both are equal
TimelineRegular = {'1980':{'event':"He was born"},'1992':{'work':"Symphony No. 1"}}
# check
print(Timeline==TimelineRegular)
Output:
>>> True
timeline = {'1980':{'event':'He was born', 'work':'None'}, '1992':{'event':'None', 'work':'Symphony No. 1'}, '1993':{'event':'He was married', 'work':'Symphony No. 2'}}
With results:
>>> timeline['1980']['event']
'He was born'
>>> timeline['1992']['work']
'Symphony No. 1'
>>> timeline['1993']['event']
'He was married'
>>> timeline['1993']['work']
'Symphony No. 2'
This is a nested dictionary: the outer dictionary maps dates to inner dictionaries, and each inner dictionary maps 'work' or 'event' to the final value.
And to add more:
>>> timeline['2019'] = {'event':'Asked stackoverflow question', 'work':'unknown'}
>>> timeline
{'1980': {'event': 'He was born', 'work': 'None'}, '1992': {'event': 'None', 'work': 'Symphony No. 1'}, '1993': {'event': 'He was married', 'work': 'Symphony No. 2'}, '2019': {'event': 'Asked stackoverflow question', 'work': 'unknown'}}
When you add a new key, you need to make the value your empty dictionary with placeholders for each future key.
timeline['year'] = {'work':'', 'event':''}
or just an empty dictionary, though you may end up with missing keys later
timeline['year'] = {}
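If several works or events can share a year, a nested defaultdict whose inner values are lists avoids creating placeholders by hand. In the question's attempt, defaultdict(list) makes each timeline[date] a list, which cannot be indexed with 'work', and after dict(timeline) a missing date raises KeyError. The sketch below uses namedtuples as hypothetical stand-ins for the two querysets; the field names year, title and description are assumptions:

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-ins for the two querysets; field names are assumptions.
Opus = namedtuple('Opus', 'year title')
Event = namedtuple('Event', 'year description')
opus = [Opus('1992', 'Symphony No. 1'), Opus('1993', 'Symphony No. 2')]
event = [Event('1980', 'He was born'), Event('1993', 'He was married')]

# Outer key: year. Inner key: 'work' or 'event'. Inner value: list of items.
timeline = defaultdict(lambda: defaultdict(list))
for o in opus:
    timeline[o.year]['work'].append(o.title)
for e in event:
    timeline[e.year]['event'].append(e.description)

# Convert to plain dicts once building is finished.
timeline = {year: dict(kinds) for year, kinds in sorted(timeline.items())}
print(timeline)
```

Converting to plain dicts only at the end keeps the auto-creation behaviour available while both loops run.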
I am pretty new to python and I am trying to set up a webscraper that gathers data on characters who have died in the show Game of Thrones. I have gotten the data that I want but I can't seem to get some of the extra fluff out of the data.
I have tried the .strip() method and the .replace() method using .replace(" ", "") but each time nothing changes. Here is a block of my code:
url = "http://time.com/3924852/every-game-of-thrones-death/"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
# Find the characters who have died by searching for the text embedded within the <div> tag with class = "headline"
find_deaths = soup.find_all('div', class_="headline")
# Strip out all the extra fluff at the beginning and end of the text and add it to list
for hit in find_deaths:
    deaths.append(hit.contents)
This code yields items in the list that look like this:
deaths = [['\n Will\n '], ['\n Jon Arryn\n '], ['\n Jory Cassel\n '], ...]
I have tried the following methods to strip out the extra fluff surrounding the data, but neither changes anything in the list.
for item in deaths:
    str(item).strip()
for item in deaths:
    str(item).replace("\n ", "")
I thought that either of the two methods above would strip all the extra fluff out of the items in the list, but neither seems to change anything at all.
Is there another method I could use besides strip and replace that will get rid of the extra fluff in this data?
You should use a list comprehension:
deaths = [s[0].strip() for s in deaths]
However, you have a lot of unnecessary intermediate steps here - you can simply use a list comprehension directly out of find_all:
deaths = [hit.contents[0].strip() for hit in soup.find_all('div', class_="headline")]
With the given website and query, deaths will be
['Will', 'Jon Arryn', 'Jory Cassel', 'Benjen Stark', 'Robert Baratheon', 'Syrio Forel', 'Eddard Stark', 'Viserys Targaryen', 'Drogo', 'Rhaego', 'Mirri Maz Duur', 'Rakharo', 'Yoren', 'Renly Baratheon', 'Rodrik Cassel', 'Irri', 'Maester Luwin', 'Qhorin', 'Pyat Pree', 'Doreah', 'Xaro Xhoan Daxos', 'Hoster Tully', 'Jeor Mormont', 'Craster', 'Kraznys', 'Beric Dondarrion', 'Ros', 'Talisa Stark', 'Robb Stark', 'Catelyn Stark', 'Polliver', 'Tansy', 'Joffrey Baratheon', 'Karl Tanner', 'Locke', 'Rast', 'Lysa Arryn', 'Oberyn Martell', 'The Mountain', 'Grenn', 'Mag the Mighty', 'Pyp', 'Styr', 'Ygritte', 'Jojen Reed', 'Shae', 'Tywin Lannister', 'Mance Rayder', 'Janos Slynt', 'Barristan Selmy', 'Maester Aemon', 'Karsi', 'Shireen Baratheon', 'Hizdahr zo Loraq', 'Selyse Baratheon', 'Stannis Baratheon', 'Myranda', 'Meryn Trant', 'Myrcella Baratheon', 'Jon Snow', 'Areo Hotah', 'Doran Martell', 'Trystane Martell', 'The Flasher', 'Roose Bolton', 'Walda Bolton', 'Unnamed Bolton Child', 'Balon Greyjoy', 'Alliser Thorne', 'Olly', 'Ser Arthur Dayne', 'Osha', 'Khal Moro', 'Three-Eyed Raven', 'Leaf', 'Hodor', 'Aerys II Targaryen, "The Mad King"', 'Brother Ray', 'Lem', 'Brynden Tully (The Blackfish)', 'Lady Crane', 'The Waif', 'Razdal mo Eraz', 'Belicho Paenymion', 'Rickon Stark', 'Jon Umber', 'Wun Weg Wun Dar Wun', 'Ramsay Bolton', 'Grand Maester Pycelle', 'Lancel', 'The High Sparrow', 'Loras Tyrell', 'Mace Tyrell', 'Kevan Lannister', 'Margaery Tyrell', 'Tommen Baratheon', 'Walder Rivers', 'Lothar Frey', 'Walder Frey', 'Lyanna Stark', 'Nymeria Sand', 'Obara Sand', 'Tyene Sand', 'Olenna Tyrell', 'Randyll Tarly', 'Dickon Tarly', 'Thoros of Myr', 'Petyr "Littlefinger" Baelish', 'Ned Umber']
Strings are immutable. strip() and replace() return new strings; they don't change the original.
Use a list comprehension like the one that @Tomothy32 suggested (hit.contents is a list, so take its first element before stripping):
deaths = [hit.contents[0].strip() for hit in soup.find_all('div', class_="headline")]
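A small sketch of the difference, using two entries shaped like the scraped data (the exact whitespace is an assumption): calling strip() without storing its result leaves the list as it was, while collecting the returned strings gives the cleaned names.

```python
deaths = [['\n    Will\n  '], ['\n    Jon Arryn\n  ']]

# Calling strip() and discarding the result leaves the list untouched.
for item in deaths:
    item[0].strip()
unchanged = [item[0] for item in deaths]

# Storing the returned strings is what actually produces cleaned data.
cleaned = [item[0].strip() for item in deaths]
print(cleaned)
```

Note also that str(item) on a one-element list yields the list's printed representation, brackets and quotes included, so stripping that string would not remove the newlines inside it.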
I can't test this from my location, but you should be able to avoid the cleanup entirely by using the already clean string in the name attribute of the elements with class anchor-only:
deaths = [item['name'] for item in soup.select('.anchor-only')]
So I have some lines of text that are stored in a list as follows:
lines = ['1.9 #comment 1.11* 1.5 # another comment',
'1.23',
'3.10.3* #commennnnnt 1.2 ']
I want to create:
[{'1.9': 'comment'},
{'1.11*': ''},
{'1.5': 'another comment'},
{'1.23': ''},
{'3.10.3*': 'commennnnnt'},
{'1.2': ''} ]
In other words, I want to take the list apart and pair each decimal number with either the comment (starting with '#'; we can assume that no numbers occur in it) that appears right after it on the same line, or with an empty string if there is no comment (e.g., the next thing after it is another number).
Specifically, a 'decimal number' is a single digit, followed by a dot and then either one or two digits, optionally followed by a dot and one or two more digits. A '*' may appear at the very end. So something like this(?): r'\d\.\d{1,2}(\.\d{1,2})?\*?'
I've tried a few things with re.split() to get started. For example, splitting the first list item on either the crazy decimal regex or #, before worrying about the dict pairings:
>>> crazy=r'\d\.\d{1,2}(\.\d{1,2})?\*?'
>>> re.split(r'({0})|#'.format(crazy), results[0])
Result:
[u'',
u'1.9',
None,
u' ',
None,
None,
u'comment ',
u'1.11',
None,
u' ',
u'1.5',
None,
u' ',
None,
None,
u' test comment']
This looks like something I can filter and work with, but is there a better way? (also, wow...it seems the parentheses in my crazy regex allow me to keep the decimal number delimiters as desired!)
The following seems to work:
import re
lines = ['1.9 #comment 1.11* 1.5 # another comment',
'1.23',
'3.10.3* #commennnnnt 1.2 ']
entries = re.findall(r'([0-9.]+\*?)\s+((?:[\# ])?[a-zA-Z ]*)', " ".join(lines))
ldict = [{k: v.strip(" #")} for k,v in entries]
print(ldict)
This displays:
[{'1.9': 'comment'}, {'1.11*': ''}, {'1.5': 'another comment'}, {'1.23': ''}, {'3.10.3*': 'commennnnnt'}, {'1.2': ''}]
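A hedged variation that reuses the stricter number pattern from the question and processes one line at a time might look like this. The inner group of the number pattern is made non-capturing so that findall returns just (number, comment) pairs; the comment part relies on the stated assumption that comments contain no digits:

```python
import re

lines = ['1.9 #comment 1.11* 1.5 # another comment',
         '1.23',
         '3.10.3* #commennnnnt 1.2 ']

# Number as described in the question, then an optional '#'-prefixed comment
# that runs until the next digit or '#'.
entry = re.compile(r'(\d\.\d{1,2}(?:\.\d{1,2})?\*?)\s*(?:#\s*([^#\d]*))?')

pairs = []
for line in lines:
    for number, comment in entry.findall(line):
        pairs.append({number: comment.strip()})
print(pairs)
```

Because the whitespace after a number is optional here, a number at the very end of a line is matched even without a trailing space.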