I built dictionaries using the .groupdict() method, but I'm having trouble filtering out certain output dictionaries.
For example, my code looks like this (tweet is a list of strings, each containing 5 fields separated by ||):
import re

def somefunction(pattern, tweet):
    pattern = "^(?P<username>.*?)(?:\|{2}[^|]+){2}\|{2}(?P<botprob>.*?)(?:\|{2}|$)"
    for paper in tweet:
        for item in re.finditer(pattern, paper):
            item.groupdict()
This produces an output in the form:
{'username': 'yashrgupta ', 'botprob': ' 0.30794588629999997 '}
{'username': 'sterector ', 'botprob': ' 0.39391528649999996 '}
{'username': 'MalcolmXon ', 'botprob': ' 0.05630123819 '}
{'username': 'ryechuuuuu ', 'botprob': ' 0.08492567222000001 '}
{'username': 'dpsisi ', 'botprob': ' 0.8300337045 '}
But I would like it to only return dictionaries whose botprob is above 0.7. How do I do this?
Specifically, as @WiktorStribizew notes, just skip the iterations you don't want:
pattern = "^(?P<username>.*?)(?:\|{2}[^|]+){2}\|{2}(?P<botprob>.*?)(?:\|{2}|$)"
for paper in tweet:
    for item in re.finditer(pattern, paper):
        item = item.groupdict()
        # botprob is captured as a string, so convert before comparing
        if float(item["botprob"]) < 0.7:
            continue
        print(item)
This could be wrapped in a generator expression to save the explicit continue, but there's enough going on as it is without making it harder to read (in this case).
UPDATE: since you are apparently inside a function:
pattern = "^(?P<username>.*?)(?:\|{2}[^|]+){2}\|{2}(?P<botprob>.*?)(?:\|{2}|$)"
items = []
for paper in tweet:
    for item in re.finditer(pattern, paper):
        item = item.groupdict()
        if float(item["botprob"]) > 0.7:
            items.append(item)
return items
Or using comprehensions:
groupdicts = (item.groupdict() for paper in tweet for item in re.finditer(pattern, paper))
return [item for item in groupdicts if float(item["botprob"]) > 0.7]
I would like it to only return dictionaries whose botprob is above 0.7.
entries = [{'username': 'yashrgupta ', 'botprob': ' 0.30794588629999997 '},
           {'username': 'sterector ', 'botprob': ' 0.39391528649999996 '},
           {'username': 'MalcolmXon ', 'botprob': ' 0.05630123819 '},
           {'username': 'ryechuuuuu ', 'botprob': ' 0.08492567222000001 '},
           {'username': 'dpsisi ', 'botprob': ' 0.8300337045 '}]
filtered_entries = [e for e in entries if float(e['botprob'].strip()) > 0.7]
print(filtered_entries)
Output:
[{'username': 'dpsisi ', 'botprob': ' 0.8300337045 '}]
The output:
{'name': 'Peter', 'surname': ' Abdilla', 'DOB': ' 22/02/1986', 'mobileNo': '79811526', 'locality': ' Zabbar\n'}
{'name': 'John', 'surname': ' Borg', 'DOB': ' 12/04/1982', 'mobileNo': '99887654', 'locality': ' Paola\n'}
The expected output is supposed to be:
{'name': 'Peter', 'surname': ' Abdilla', 'DOB': ' 22/02/1986', 'mobileNo': '79811526', 'locality': ' Zabbar'}
{'name': 'John', 'surname': ' Borg', 'DOB': ' 12/04/1982', 'mobileNo': '99887654', 'locality': ' Paola'}
Each line in a CSV file ends with a newline character, '\n', so when you read the file line by line, the '\n' is included in the string. We just need to replace the '\n' with an empty string, using Python's string replace() method:
while True:
    x = file.readline()
    x = x.replace('\n', '')  # strip the trailing newline
    if x == '':
        break
    else:
        value = x.split(',')
        contact = dict(zip(keys, value))
        filename.append(contact)
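An alternative sketch worth knowing: the stdlib csv module consumes line terminators for you, so the '\n' never appears in the parsed values. The filename and the keys list here are illustrative, matching the question's data layout:

```python
import csv

# Hypothetical sample file laid out like the question's data
with open('contacts.csv', 'w', newline='') as f:
    f.write('Peter, Abdilla, 22/02/1986,79811526, Zabbar\n'
            'John, Borg, 12/04/1982,99887654, Paola\n')

keys = ['name', 'surname', 'DOB', 'mobileNo', 'locality']
contacts = []
with open('contacts.csv', newline='') as f:
    for row in csv.reader(f):
        # the csv module consumes the line terminator itself, so no '\n'
        # ever leaks into the last field
        contacts.append(dict(zip(keys, row)))

print(contacts[0])
# {'name': 'Peter', 'surname': ' Abdilla', 'DOB': ' 22/02/1986', 'mobileNo': '79811526', 'locality': ' Zabbar'}
```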
I'm a web-scraping beginner and am trying to scrape this webpage: https://profiles.doe.mass.edu/statereport/ap.aspx
I'd like to be able to put in some settings at the top (like District, 2020-2021, Computer Science A, Female) and then download the resulting data for those settings.
Here's the code I'm currently using:
import requests
from bs4 import BeautifulSoup

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    r = s.get('https://profiles.doe.mass.edu/statereport/ap.aspx')
    soup = BeautifulSoup(r.text, "lxml")
    data = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    data["ctl00$ContentPlaceHolder1$ddReportType"] = "DISTRICT",
    data["ctl00$ContentPlaceHolder1$ddYear"] = "2021",
    data["ctl00$ContentPlaceHolder1$ddSubject"] = "COMSCA",
    data["ctl00$ContentPlaceHolder1$ddStudentGroup"] = "F",
    p = s.post(url, data=data)
When I print out p.text, I get a page with title '404 - Page Not Found' and this message:
<h2>We are unable to locate information at: <br /><br />http://profiles.doe.mass.edu:80/statereport/ap.aspxp?ASP.NET_SessionId=bxfgao54wru50zl5tkmfml00</h2>
Here's what data looks like before I modify it:
{'__EVENTVALIDATION': '/wEdAFXz4796FFICjJ1Xc5ZOd9SwSHUlrrW+2y3gXxnnQf/b23Vhtt4oQyaVxTPpLLu5SKjKYgCipfSrKpW6jkHllWSEpW6/zTHqyc3IGH3Y0p/oA6xdsl0Dt4O8D2I0RxEvXEWFWVOnvCipZArmSoAj/6Nog6zUh+Jhjqd1LNep6GtJczTu236xw2xaJFSzyG+xo1ygDunu7BCYVmh+LuKcW56TG5L0jGOqySgRaEMolHMgR0Wo68k/uWImXPWE+YrUtgDXkgqzsktuw0QHVZv7mSDJ31NaBb64Fs9ARJ5Argo+FxJW/LIaGGeAYoDphL88oao07IP77wrmH6t1R4d88C8ImDHG9DY3sCDemvzhV+wJcnU4a5qVvRziPyzqDWnj3tqRclGoSw0VvVK9w+C3/577Gx5gqF21UsZuYzfP4emcqvJ7ckTiBk7CpZkjUjM6Z9XchlxNjWi1LkzyZ8QMP0MaNCP4CVYJfndopwFzJC7kI3W106YIA/xglzXrSdmq6/MDUCczeqIsmRQGyTOkQFH724RllsbZyHoPHYvoSAJilrMQf6BUERVN4ojysx3fz5qZhZE7DWaJAC882mXz4mEtcevFrLwuVPD7iB2v2mlWoK0S5Chw4WavlmHC+9BRhT36jtBzSPRROlXuc6P9YehFJOmpQXqlVil7C9OylT4Kz5tYzrX9JVWEpeWULgo9Evm+ipJZOKY2YnC41xTK/MbZFxsIxqwHA3IuS10Q5laFojoB+e+FDCqazV9MvcHllsPv2TK3N1oNHA8ODKnEABoLdRgumrTLDF8Lh+k+Y4EROoHhBaO3aMppAI52v3ajRcCFET22jbEm/5+P2TG2dhPhYgtZ8M/e/AoXht29ixVQ1ReO/6bhLIM+i48RTmcl76n1mNjfimB8r3irXQGYIEqCkXlUHZ/SNlRYyx3obJ6E/eljlPveWNidFHOaj+FznOh264qDkMm7fF78WBO2v0x+or1WGijWDdQtRy9WRKXchYxUchmBlYm15YbBfMrIB7+77NJV+M6uIVVnCyiDRGj+oPXcTYxqSUCLrOMQyzYKJeu8/hWD0gOdKeoYUdUUJq4idIk+bLYy76sI/N2aK+aXZo/JPQ+23gTHzIlyi4Io7O6kXaULPs8rfo8hpkH1qXyKb/rP2VJBNWgyp8jOMx9px+m4/e2Iecd86E4eN4Rk6OIiwqGp+dMdgntXu5ruRHb1awPlVmDw92dL1P0b0XxJW7EGfMzyssMDhs1VT6K6iMUTHbuXkNGaEG1dP1h4ktnCwGqDLVutU6UuzT6i4nfqnvFjGK9+7Ze8qWIl8SYyhmvzmgpLjdMuF9CYMQ2Aa79HXLKFACsSSm0dyiU1/ZGyII2Fvga9o+nVV1jZam3LkcAPaXEKwEyJXfN/DA7P4nFAaQ+QP+2bSgrcw+/dw+86OhPyG88qyJwqZODEXE1WB5zSOUywGb1/Xed7wq9WoRs6v8rAK5c/2iH7YLiJ4mUVDo+7WCKrzO5+Hsyah3frMKbheY1acRmSVUzRgCnTx7jvcLGR9Jbt6TredqZaWZBrDFcntdg7EHd7imK5PqjUld3iCVjdyO+yLKUkMKiFD85G3vEferg/Q/TtfVBqeTU0ohP9d+CsKOmV/dxVYWEtBcfa9KiN6j4N8pP7+3iUOhajojZ8jV98kxT0zPZlzkpqI4SwR6Ys8d2RjIi5K+oQul4pL5u+zZvX0lsLP9Jl7FeVTfBvST67T6ohz8dl9gBfmmbwnT23SyuFSUGd6ZGaKE+9kKYmuImW7w3ePs7C70yDWHpIpxP/IJ4GHb36LWto2g3Ld3goCQ4fXPu7C4iTiN6b5WUSlJJsWGF4eQkJue8=',
'__VIEWSTATE': '/wEPDwUKLTM0NzY4OTQ4NmRkDwwPzTpuna+yxVhQxpRF4n2+zYKQtotwRPqzuCkRvyU=',
'__VIEWSTATEGENERATOR': '2B6F8D71',
'ctl00$ContentPlaceHolder1$btnViewReport': 'View Report',
'ctl00$ContentPlaceHolder1$hfExport': 'ViewReport',
'leftNavId': '11241',
'quickSearchValue': '',
'runQuickSearch': 'Y',
'searchType': 'QUICK',
'searchtext': ''}
Following suggestions from similar questions, I've tried playing around with the parameters, editing data in various ways (to emulate the POST request that I see in my browser when I navigate the site myself), and specifying an ASP.NET_SessionId, but to no avail.
How can I access the information from this website?
This should be what you are looking for. What I did was use bs4 to parse the HTML, find the table, and get its rows; then, to make the data easier to work with, I put it into a dictionary.
import requests
from bs4 import BeautifulSoup

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    r = s.get(url)

soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find_all('table')
rows = table[0].find_all('tr')

data = {}
for row in rows:
    if row.find_all('th'):
        # header row: each <th> becomes a column key
        keys = row.find_all('th')
        for key in keys:
            data[key.text] = []
    else:
        # enumerate is safer than values.index(value), which returns the
        # first match and mis-files columns when two cells hold the same text
        for i, value in enumerate(row.find_all('td')):
            data[keys[i].text].append(value.text)

for key in data:
    print(key, data[key][:10])
    print('\n')
The output:
District Name ['Abington', 'Academy Of the Pacific Rim Charter Public (District)', 'Acton-Boxborough', 'Advanced Math and Science Academy Charter (District)', 'Agawam', 'Amesbury', 'Amherst-Pelham', 'Andover', 'Arlington', 'Ashburnham-Westminster']
District Code ['00010000', '04120000', '06000000', '04300000', '00050000', '00070000', '06050000', '00090000', '00100000', '06100000']
Tests Taken [' 100', ' 109', ' 1,070', ' 504', ' 209', ' 126', ' 178', ' 986', ' 893', ' 97']
Score=1 [' 16', ' 81', ' 12', ' 29', ' 27', ' 18', ' 5', ' 70', ' 72', ' 4']
Score=2 [' 31', ' 20', ' 55', ' 74', ' 65', ' 34', ' 22', ' 182', ' 149', ' 23']
Score=3 [' 37', ' 4', ' 158', ' 142', ' 55', ' 46', ' 37', ' 272', ' 242', ' 32']
Score=4 [' 15', ' 3', ' 344', ' 127', ' 39', ' 19', ' 65', ' 289', ' 270', ' 22']
Score=5 [' 1', ' 1', ' 501', ' 132', ' 23', ' 9', ' 49', ' 173', ' 160', ' 16']
% Score 1-2 [' 47.0', ' 92.7', ' 6.3', ' 20.4', ' 44.0', ' 41.3', ' 15.2', ' 25.6', ' 24.7', ' 27.8']
% Score 3-5 [' 53.0', ' 7.3', ' 93.7', ' 79.6', ' 56.0', ' 58.7', ' 84.8', ' 74.4', ' 75.3', ' 72.2']
I was able to get this working by adapting the code from here. I'm not sure why editing the payload in this way made the difference, so I'd be grateful for any insights!
Here's my working code, using Pandas to parse out the tables:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    response = s.get(url)
    soup = BeautifulSoup(response.content, 'html5lib')

    data = {tag['name']: tag['value']
            for tag in soup.select('input[name^=ctl00]') if tag.get('value')}
    state = {tag['name']: tag['value']
             for tag in soup.select('input[name^=__]')}

    payload = data.copy()
    payload.update(state)
    payload["ctl00$ContentPlaceHolder1$ddReportType"] = "DISTRICT"
    payload["ctl00$ContentPlaceHolder1$ddYear"] = "2021"
    payload["ctl00$ContentPlaceHolder1$ddSubject"] = "COMSCA"
    payload["ctl00$ContentPlaceHolder1$ddStudentGroup"] = "F"

    p = s.post(url, data=payload)

df = pd.read_html(p.text)[0]
# restore the leading zeros that read_html drops from the district codes
df["District Code"] = df["District Code"].astype(str).str.zfill(8)
display(df)  # display() comes from IPython/Jupyter; use print(df) in a plain script
I'm looking to categorize some sentences. To do this, I've created a couple dictionary categories for "Price" and "Product Quality". So far I have the code loop through the words within the category and it displays the word it found.
I'd also like to add the actual category name like "Price" or "Product Quality" depending on the values within those keys.
Is there a way to display the key for each category? Currently it's just displaying both "Price" and "Product Quality" for everything.
Here is the code:
data = ["Great price on the dewalt saw", "cool deal and quality", "love it! and the price percent off", "definitely going to buy"]
words = {'price': ['price', 'compare', '$', 'percent', 'money', '% off'],
         'product_quality': ['quality', 'condition', 'aspect']}

for d in data:
    for word in words.values():
        for s in word:
            if s in d:
                print(id(d), ", ", d, ", ", s, ", ", words.keys())
Here is the output as well:
4398300496 , Great price on the dewalt saw , price , dict_keys(['price', 'product_quality'])
4399544552 , cool deal and quality , quality , dict_keys(['price', 'product_quality'])
4398556680 , love it! and the price percent off , price , dict_keys(['price', 'product_quality'])
4398556680 , love it! and the price percent off , percent , dict_keys(['price', 'product_quality'])
You can use items(), which unpacks into (key, value):
data = ["Great price on the dewalt saw", "cool deal and quality", "love it! and the price percent off", "definitely going to buy"]
words = {'price': ['price', 'compare', '$', 'percent', 'money', '% off'],
         'product_quality': ['quality', 'condition', 'aspect']}

for d in data:
    for category, word in words.items():
        for s in word:
            if s in d:
                print(id(d), ", ", d, ", ", s, ", ", category)
Output:
4338487344 ,  Great price on the dewalt saw ,  price ,  price
4338299376 ,  cool deal and quality ,  quality ,  product_quality
4338487416 ,  love it! and the price percent off ,  price ,  price
4338487416 ,  love it! and the price percent off ,  percent ,  price
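If you want one record per sentence rather than one line per keyword hit, here is a sketch (the categorized name is illustrative) that collects every matching category for each sentence into a dict, using the same items() idiom:

```python
data = ["Great price on the dewalt saw", "cool deal and quality",
        "love it! and the price percent off", "definitely going to buy"]
words = {'price': ['price', 'compare', '$', 'percent', 'money', '% off'],
         'product_quality': ['quality', 'condition', 'aspect']}

# Map each sentence to the set of categories whose keywords appear in it
categorized = {
    d: {category for category, kws in words.items() if any(s in d for s in kws)}
    for d in data
}

print(categorized["cool deal and quality"])    # {'product_quality'}
print(categorized["definitely going to buy"])  # set()
```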
def exportOrders(self):
    file = open("orders.txt", 'w')
    file.write("\"Date\" \"Pair\" \"Amount bought/sold\" \"Pair Price\" \"Profit/Loss\" \"Order Type\"" + '\n')
    for x in self.tradeHistory:
        date = x['date']
        pair = self.currentPair
        amount = x[self.currentPair]
        price = x['price']
        order = x['Order Type']
        if order == "buy":
            spent = x['spent']
            file.write(date + ' ' + pair + ' ' + amount + ' '
                       + price + ' ' + float(-spent) + ' ' + order + ' \n')
        if order == "sell":
            obtained = x['obtained']
            file.write(date + ' ' + pair + ' ' + amount + ' '
                       + price + ' ' + obtained + ' ' + order + ' \n')
    file.close()
self.tradeHistory is a list of dictionaries that store a date, a pair, the amount bought, the price of the pair, the money spent or obtained, and the order type.
My problem is that the first time the program reaches:
if order == "buy":
    spent = x['spent']
    file.write(date + ' ' + pair + ' ' + amount + ' '
               + price + ' ' + str(float(-spent)) + ' ' + order + ' \n')
the for loop breaks out, and orders.txt only contains the header line written by:
file.write("\"Date\" \"Pair\" \"Amount bought/sold\" \"Pair Price\" \"Profit/Loss\" \"Order Type\"" + '\n')
Thank you in advance!
Edit: basically, my self.tradeHistory has the following content:
{'date': 1505161800, 'BTC_ETH': 0.7091196761422075, 'price': 0.07050996, 'spent': 0.05, 'Order Type': 'buy'}
{'date': 1505167200, 'BTC_ETH': 0.7091196761422075, 'price': 0.07079909, 'obtained': 0.050205027771963, 'Order Type': 'sell'}
{'date': 1505236500, 'BTC_ETH': 0.7032346826344071, 'price': 0.07110002, 'spent': 0.05, 'Order Type': 'buy'}
{'date': 1505251800, 'BTC_ETH': 0.7032346826344071, 'price': 0.0707705, 'obtained': 0.04976827010737831, 'Order Type': 'sell'}
{'date': 1505680200, 'BTC_ETH': 0.715374411944349, 'price': 0.06989347, 'spent': 0.05, 'Order Type': 'buy'}
{'date': 1505699100, 'BTC_ETH': 0.715374411944349, 'price': 0.071989, 'obtained': 0.05149908854146174, 'Order Type': 'sell'}
{'date': 1505733300, 'BTC_ETH': 0.6879187705515734, 'price': 0.072683, 'spent': 0.05, 'Order Type': 'buy'}
{'date': 1505745000, 'BTC_ETH': 0.6889021311187427, 'price': 0.07257925, 'spent': 0.05, 'Order Type': 'buy'}
{'date': 1505756700, 'BTC_ETH': 1.3768209016703161, 'price': 0.0732, 'obtained': 0.10078329000226714, 'Order Type': 'sell'}
...
There are 63 items inside the list of dictionaries. My aim is to create a .txt file that looks like
"Date" "Pair" "Amount bought/sold" "Pair Price" "Profit/Loss" "Order Type"
1505161800 BTC_ETH 0.7091196761422075 0.07050996 0.05 buy
1505167200 BTC_ETH 0.7091196761422075 0.07079909 0.05 sell
...
You should not concatenate numbers with strings in Python. Use str.format instead:
file.write(
    '{} {} {} {} {} {}\n'
    .format(date, pair, amount, price, float(-spent), order)
)
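On Python 3.6+, an f-string achieves the same thing; a minimal sketch, with sample values drawn from the question's data:

```python
# Sample values taken from the first entry of self.tradeHistory
date, pair, amount, price, spent, order = (
    1505161800, 'BTC_ETH', 0.7091196761422075, 0.07050996, 0.05, 'buy')

# f-strings convert each interpolated value to str automatically,
# so mixing ints, floats, and strings is safe
line = f"{date} {pair} {amount} {price} {-spent} {order}\n"
print(line, end='')
# 1505161800 BTC_ETH 0.7091196761422075 0.07050996 -0.05 buy
```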
You can also use the csv module for a cleaner implementation:
import csv

def exportOrders(self):
    with open("orders.txt", 'w') as file:
        writer = csv.writer(file, delimiter=' ', quotechar='"')
        writer.writerow([
            'Date', 'Pair', 'Amount bought/sold', 'Pair Price',
            'Profit/Loss', 'Order Type'])
        for x in self.tradeHistory:
            date = x['date']
            pair = self.currentPair
            amount = x[self.currentPair]
            price = x['price']
            order = x['Order Type']
            if order == "buy":
                spent = x['spent']
                writer.writerow([
                    date, pair, amount, price,
                    float(-spent), order])
            if order == "sell":
                obtained = x['obtained']
                writer.writerow([
                    date, pair, amount, price,
                    obtained, order])
Below is the "data" dict
{' node2': {'Status': ' online', 'TU': ' 900', 'Link': ' up', 'Port': ' a0a-180', 'MTU': ' 9000'}, ' node1': {'Status': ' online', 'TU': ' 900', 'Link': ' up', 'Port': ' a0a-180', 'MTU': ' 9000'}}
I am trying to check whether the key node2 is present in the data dict with the code below, but it is not working. Please help.
if 'node2' in data:
    print "node2 Present"
else:
    print "node2Not present"
if 'node2' in data:
    print "node2 Present"
else:
    print "node2Not present"
This is a perfectly appropriate way of determining if a key is inside a dictionary, unfortunately 'node2' is not in your dictionary, ' node2' is (note the space):
if ' node2' in data:
    print "node2 Present"
else:
    print "node2Not present"
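If the stray spaces in the keys are the real problem, a sketch that normalizes them once up front (the dict literal is abbreviated from the question's data):

```python
data = {' node2': {'Status': ' online', 'Link': ' up'},
        ' node1': {'Status': ' online', 'Link': ' up'}}

# Strip whitespace from every key so lookups can use the expected names
data = {key.strip(): value for key, value in data.items()}

print('node2' in data)  # True
```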
Checking whether a key is present in a dictionary:
data = {'node2': {'Status': ' online', 'TU': ' 900', 'Link': ' up', 'Port': ' a0a-180', 'MTU': ' 9000'}, ' node1': {'Status': ' online', 'TU': ' 900', 'Link': ' up', 'Port': ' a0a-180', 'MTU': ' 9000'}}
In Python 2.7:
has_key() tests for the presence of a key in the dictionary (it was removed in Python 3):
if data.has_key('node2'):
    print("found key")
else:
    print("invalid key")
In Python 3.x:
key in d returns True if d has the key, else False:
if 'node2' in data:
    print("found key")
else:
    print("invalid key")
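A related idiom that works on both Python 2 and 3 is dict.get(), which returns None (or a supplied default) instead of raising a KeyError when the key is absent; a small sketch using the question's data shape:

```python
data = {'node2': {'Status': ' online', 'Link': ' up'}}

node = data.get('node2')  # None if the key is missing
if node is not None:
    print("found key")

# A default can be supplied directly, which also makes chained
# lookups into nested dicts safe
status = data.get('node1', {}).get('Status', 'unknown')
print(status)  # unknown
```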