Any comments or solutions are welcome; I could not create the dictionary in one line.
import requests as re
from bs4 import BeautifulSoup
url = re.get('https://toiguru.jp/toeic-vocabulary-list')
soup = BeautifulSoup(url.content, "html.parser")
words = [str(el).replace("<td>", "") for el in soup.find_all("td")]
words = [str(el).replace("</td>", "") for el in words]
words = [str(el).split("<br/>") for el in words]
# The line below raises "IndexError: list index out of range"
words = {str(el[0]): str(el[1]) for el in words}
# From here, I have no idea how to create a dictionary like the one below
# {ENword: translation for ENword}
# e.g.) {'survey':'調査'}, {'interview':'面接'}
words = [str(el).split("<br/>")for el in words]
The code above outputs values as below:
[['survey', '調査'], ['interview', '面接'], ['exhibition', '展示'], ['conference', '会議'], ['available', '利用できる'], ['annual', '年1回の'], ['equipment', '備品/器具'], ['department', '部署'], ['refund', '払い戻す'], ['receipt', '領収書'], ['schedule', '予定, 計画'], ・・・and more・・・]
I want to change the above-mentioned values like this:
{ENword: translation for ENword}
e.g.) {'survey':'調査'}, {'interview':'面接'}
With bs4, I want to create a dictionary variable.
Try the code below.
There seems to be at least one element in words that does not have exactly two items:
words = {el[0]:el[1] for el in words if len(el)==2}
To find the invalid elements with a different format, you can use:
not_good=[[f"index={counter}", f"value={el2}"] for counter,el2 in enumerate(words) if len(el2)!=2]
print(not_good)
#output [['index=474', "value=['neither', 'どちらも…でない', '']"], ['index=475', "value=['']"], ['index=481', "value=['enclose', '同封する', '']"], ['index=701', "value=['']"]]
Ignore ['']:
words = {el[0]: el[1] for el in words if el != ['']}
# {'survey': '調査', 'interview': '面接', ..., 'neither': 'どちらも…でない', ..., 'enclose': '同封する', ...}
Or, as a list of dicts:
words = [{el[0]: el[1]} for el in words if el != ['']]
# [{'survey': '調査'}, {'interview': '面接'}, ..., {'neither': 'どちらも…でない'}, ..., {'enclose': '同封する'}, ...]
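The two filters above can also be combined into one step that tolerates both the empty rows and the rows with a trailing empty string. This is only a sketch: `rows_to_dict` is a hypothetical helper name, and the sample rows are copied from the outputs shown above rather than scraped live.

```python
# Sample rows shaped like the output of the split("<br/>") step; the values
# are taken from the examples above, not fetched from the site.
rows = [
    ['survey', '調査'],
    ['neither', 'どちらも…でない', ''],
    [''],
    ['enclose', '同封する', ''],
]

def rows_to_dict(rows):
    """Map the first non-empty field of each row to the second one."""
    result = {}
    for row in rows:
        # Drop empty or whitespace-only fields such as the trailing ''.
        fields = [field.strip() for field in row if field.strip()]
        if len(fields) >= 2:  # skip rows without both a word and a translation
            result[fields[0]] = fields[1]
    return result

words = rows_to_dict(rows)
print(words)
# {'survey': '調査', 'neither': 'どちらも…でない', 'enclose': '同封する'}
```

Unlike the `len(el)==2` filter, this keeps 'neither' and 'enclose', whose rows have three items because of the trailing empty string.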
Related
I have to compare two lists and print out the matching and non-matching elements. I have tried the code below:
list1 = ["prefencia","santro ne prefence"]
I'm fetching all the text from a webpage using the selenium getText() method. The text is stored in a string variable, which is then assigned to list2:
str = "Centro de prefencia de lilly ac"
list2 = []
list2 = str
for item in list1:
    if item in list2:
        print("match:", item)
    else:
        print("no_match:", item)
Result of the above code:
match:prefencia
It seems the in keyword works like a substring (contains) check here. I want to search for an exact match between each element of list1 and the elements of list2.
At least here you have a problem:
list1 = ["prefencia","santro ne prefence"]
str = "Centro de prefencia de lilly ac"
list2 = []
list2 = str
Do you want list2 to be a list? Right now you set it to an empty list, and on the next line you rebind it to the string str.
How about something like this (my guess at what you want to do):
list1 = ["prefencia","santro ne prefence"]
mystr = "Centro de prefencia de lilly ac"
list2 = mystr.split(' ')  # splits your string into a list of words
for item in list1:
    if item in list2:
        print("match:", item)
    else:
        print("no_match:", item)
But if you split your string into a list of words, you will never get an exact match for multi-word phrases such as "santro ne prefence".
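One way around that limitation is to keep the page text as a single string and search it for each phrase as a whole, requiring that the phrase is not glued to other word characters. This is a sketch of a different technique (regular expressions) than the split-based answer above. `find_exact_matches` is a hypothetical helper, and "pref" is added to the sample list only to show that partial words are rejected.

```python
import re

def find_exact_matches(phrases, text):
    """Split phrases into those found as whole words/phrases and the rest."""
    matched, unmatched = [], []
    for phrase in phrases:
        # (?<!\w) and (?!\w) reject hits that sit inside a longer word.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, text):
            matched.append(phrase)
        else:
            unmatched.append(phrase)
    return matched, unmatched

list1 = ["prefencia", "santro ne prefence", "pref"]
page_text = "Centro de prefencia de lilly ac"

matched, unmatched = find_exact_matches(list1, page_text)
print("match:", matched)        # match: ['prefencia']
print("no_match:", unmatched)   # no_match: ['santro ne prefence', 'pref']
```

A plain `"pref" in page_text` would report a match here, which is the "contains" behaviour the question wanted to avoid.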
I have a list of strings, each of which contains the same two markers. I would like to extract the substring between those two markers for each string in the list.
Example:
With the markers 'XXX' and 'YYY', I want to extract 78665786 and 6866 from:
['XXX78665786YYYjajk', 'XXX6866YYYz6767'....]
You can just loop over your list and grab the substring. You can do something like:
import re
my_list = ['XXX78665786YYYjajk', 'XXX6866YYYz6767']
output = []
for item in my_list:
    output.append(re.search('XXX(.*)YYY', item).group(1))
print(output)
Output:
['78665786', '6866']
import re
l = ['XXX78665786YYYjajk', 'XXX6866YYYz6767'....]
l = [re.search(r'XXX(.*)YYY', i).group(1) for i in l]
This should work
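Two edge cases are worth keeping in mind with the regex answers above: `re.search` returns None when a string lacks the markers, and the greedy `.*` runs to the last 'YYY' in the string. The sketch below uses a non-greedy group and returns None for strings without markers. `between` is a hypothetical helper name, and the last two sample strings are made up to show those cases.

```python
import re

def between(text, start, end):
    """Return the text between the first start marker and the next end marker."""
    # re.escape keeps markers with special regex characters literal;
    # (.*?) is non-greedy, so it stops at the first end marker.
    match = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text)
    return match.group(1) if match else None

items = ['XXX78665786YYYjajk', 'XXX6866YYYz6767', 'no markers here', 'XXX1YYYaYYY']
extracted = [between(s, 'XXX', 'YYY') for s in items]
print(extracted)
# ['78665786', '6866', None, '1']
```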
Another solution would be:
import re
test_string=['XXX78665786YYYjajk','XXX78665783336YYYjajk']
int_val=[int(re.search(r'\d+', x).group()) for x in test_string]
The split() method splits a string into parts.
list1 = ['XXX78665786YYYjajk', 'XXX6866YYYz6767']
list2 = []
for i in list1:
    # take the part after "XXX", then the part of it before "YYY"
    after_start = i.split("XXX")[1]
    list2.append(after_start.split("YYY")[0])
print(list2)  # ['78665786', '6866']
The extracted values are saved into list2.
I have a function that returns a list of elements (lemmatized sentences) and the length of each element. I use it to extract, from each element of that list, the words that are present in a lexicon.
The problem is that the script below returns one flat list of all the matching words, whereas I want the lexicon words for each element of the list separately. In other words, I want a list of lists, where each inner list contains only the words of one element that appear in my lexicon, not one big collection covering all the elements.
My script is below. I tried two things, a list comprehension and a loop, but both give me a flat list of all the words rather than a list of lists:
def polarity_word(texte, listpos, listneg):
    lemme_sent, len_sent = lemmatisation(texte) # list of element(sentences lemmatized)
    list_pos = []
    list_neg = []
    intersection = [w for w in listpos for elt in lemme_sent if w in elt ]
    #other way
    for elt in lemme_sent:
        for w in elt.split():
            if w in listpos:
                list_pos.append([w])
# test data:
lemme_sent =[ 'je vie manger et boire', 'je être bel et lui très beau']
len_sent = [5, 7]
list_pos = ['luire','manger','vie','soleil','boire', 'demain', 'soir', 'bel', 'temps', 'beau']
print(intersection)
expected answer
[['vie', 'manger','boire'],['bel', 'beau']]
instead I have
['vie', 'manger','boire','bel','beau']
def polarity_word(texte, listpos, listneg):
    lemme_sent, len_sent = lemmatisation(texte)
    intersection=[]
    for elt in lemme_sent:
        intersection.append([word for word in listpos if word in elt])
    return intersection
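Note that `word in elt` is a substring test on the whole sentence, and iterating over listpos returns the words in lexicon order. A possible variant, sketched below, splits each sentence into words and keeps those found in the lexicon, which preserves sentence order and avoids partial-word hits. `words_in_lexicon` is a hypothetical name, and the lemmatisation step is left out by passing the test data from the question directly.

```python
def words_in_lexicon(sentences, lexicon):
    """For each sentence, list its words that appear in the lexicon."""
    lexicon_set = set(lexicon)  # set lookup is faster than scanning a list
    return [[w for w in sentence.split() if w in lexicon_set]
            for sentence in sentences]

# Test data from the question.
lemme_sent = ['je vie manger et boire', 'je être bel et lui très beau']
listpos = ['luire', 'manger', 'vie', 'soleil', 'boire',
           'demain', 'soir', 'bel', 'temps', 'beau']

result = words_in_lexicon(lemme_sent, listpos)
print(result)
# [['vie', 'manger', 'boire'], ['bel', 'beau']]
```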
I have a list of lists in which I store sentences as strings. What I want is to get only the words starting with #. To do that, I split the sentences into words and am now trying to pick only the words that start with # and exclude all the others.
# to create the empty list:
lst = []
# to iterate through the columns:
for i in range(0,len(df)):
    lst.append(df['col1'][i].split())
If I am not mistaken, you just need a flat list containing all the words that start with a particular character. For that I would use list flattening (via itertools):
import itertools
first = 'f' #look for words starting with f letter
nested_list = [['This is first sentence'],['This is following sentence']]
flat_list = list(itertools.chain.from_iterable(nested_list))
nested_words = [i.split(' ') for i in flat_list]
words = list(itertools.chain.from_iterable(nested_words))
lst = [i for i in words if i.startswith(first)]  # startswith is also safe for empty strings
print(lst) #output: ['first', 'following']
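Applied to the "#" case from the question, the same idea might look like the sketch below. The sentences are made-up sample data standing in for the values of df['col1'], so no DataFrame is needed to run it.

```python
# Hypothetical sample sentences standing in for the column values.
sentences = [
    "Loving the #sunset at the #beach",
    "No tags here",
    "#python is fun",
]

# One flat list of every word that starts with '#'.
hashtags = [word for sentence in sentences
            for word in sentence.split()
            if word.startswith("#")]
print(hashtags)
# ['#sunset', '#beach', '#python']

# Or one list per sentence, keeping the original grouping.
per_sentence = [[word for word in sentence.split() if word.startswith("#")]
                for sentence in sentences]
print(per_sentence)
# [['#sunset', '#beach'], [], ['#python']]
```

Calling split() with no argument splits on any run of whitespace, so it never produces empty strings.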
I am trying to write code that takes the text of a novel and converts it to a dictionary where the keys are the unique words and the values are the number of occurrences of each word in the text.
For example it could look like: {'the': 25, 'girl': 59...etc}
I have been trying to make the text first into a list and then use the Counter function to make a dictionary of all the words:
source = open('novel.html', 'r', encoding = "UTF-8")
soup = BeautifulSoup(source, 'html.parser')
#make a list of all the words in file, get rid of words that aren't content
mylist = []
mylist.append(soup.find_all('p'))
newlist = filter(None, mylist)
cnt = collections.Counter()
for line in newlist:
    try:
        if line is not None:
            words = line.split(" ")
            for word in line:
                cnt[word] += 1
    except:
        pass
print(cnt)
This code doesn't work because of an error with "NoneType" or it just prints an empty list. I'm wondering if there is an easier way to do what I'm trying to do or how I can fix this code so it won't have this error.
import collections
from bs4 import BeautifulSoup
with open('novel.html', 'r', encoding='UTF-8') as source:
    soup = BeautifulSoup(source, 'html.parser')
cnt = collections.Counter()
for tag in soup.find_all('p'):
    for word in tag.string.split():
        word = ''.join(ch for ch in word.lower() if ch.isalnum())
        if word != '':
            cnt[word] += 1
print(cnt)
The with statement is simply a safer way to open the file.
soup.find_all returns a list of Tag objects.
tag.string.split() gets all the words (separated by whitespace) from the Tag. Note that tag.string is None when a tag contains nested tags; tag.get_text() avoids that case.
word = ''.join(ch for ch in word.lower() if ch.isalnum()) removes punctuation and converts to lowercase so that 'Hello' and 'hello!' count as the same word.
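The normalisation and counting steps can also be done together with a regular expression and Counter.update. This is a sketch of an alternative, not the answer's method. The two strings stand in for the text of the p tags, so the example runs without BeautifulSoup or the novel file.

```python
import re
from collections import Counter

# Made-up paragraph texts standing in for the text of each <p> tag.
paragraphs = ["The girl saw the dog.", "The dog saw the girl!"]

cnt = Counter()
for text in paragraphs:
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # is dropped; lower() makes 'The' and 'the' count as the same word.
    cnt.update(re.findall(r"\w+", text.lower()))

print(dict(cnt))
# {'the': 4, 'girl': 2, 'saw': 2, 'dog': 2}
```

One difference from the character filter above: a word such as "don't" becomes two tokens ("don" and "t") here, whereas the isalnum filter turns it into "dont".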
For the counter, just do:
from collections import Counter
cnt = Counter(mylist)
Are you sure your list is getting items to begin with? After what step are you getting an empty list?
Once you've converted your page to a list, try something like this out:
#create dictionary and fake list
d = {}
x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]
#count the times a unique word occurs and add that pair to your dictionary
for word in set(x):
    count = x.count(word)
    d[word] = count
Output:
{'hello': 2, 'hey': 2, 'hi': 4}
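For comparison, collections.Counter produces the same mapping in a single pass over the list, whereas the loop above calls x.count once per unique word. A short sketch using the same fake list:

```python
from collections import Counter

x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]

# Counter is a dict subclass, so it compares equal to a plain dict.
counts = Counter(x)
print(counts == {'hello': 2, 'hey': 2, 'hi': 4})  # True

# most_common(n) returns the n most frequent (word, count) pairs.
print(counts.most_common(1))  # [('hi', 4)]
```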