How to search for partial text in a dict? - python

I have a dictionary and would like to be able to search it with a partial value and return the matching text.
df = pd.read_csv('MY_PATH')
d = defaultdict(lambda: 'Error, input not listed')
d.update(df.set_index('msg')['reply'].to_dict())
d[last_msg()]
last_msg() should provide my partial search value, and d is my dictionary.
The index of my dictionary is the column msg from df.
In column msg I have a sample value like Jeff Bezos.
In column reply I have the matching reply Jeff Bezos is the CEO of Amazon.
How can I search for a partial value in column msg and return the matching value from column reply?
I want to search for just Jeff or just Bezos and get the matching reply Jeff Bezos is the CEO of Amazon.
PS. Alternatives to defaultdict may also help improve the code.
EDIT: the last_msg() code extracts text from a Selenium element.
def last_msg():
    try:
        post = driver.find_elements_by_class_name("_12pGw")
        ultimo = len(post) - 1
        texto = post[ultimo].find_element_by_css_selector(
            "span.selectable-text").text
        return texto
    except Exception as e:
        print("Error, input not valid")
When I print(d):
defaultdict(<function <lambda> at 0x0000021F959D37B8>, {'Jeff Bezos': 'Jeff Bezos is the CEO of Amazon', 'Serguey Brin': 'Serguey Brin co-founded Google', nan: nan})
When I print(df):
Unnamed: 0 msg reply
0 0 Jeff Bezos Jeff Bezos is the CEO of Amazon
1 1 Serguey Brin Serguey Brin co-founded Google
2 2 NaN NaN
3 3 NaN NaN

I found a way around my own question using:
d = df.dropna(subset=['msg']).set_index('msg')['reply'].to_dict()  # dropna: 'in' would raise TypeError on NaN keys
...
msg = last_msg()  # call Selenium once, not once per dictionary entry
try:
    x = next(v for k, v in d.items() if msg in k)
except StopIteration:
    x = 'Error, input not listed'
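For reference, next() also takes a default argument, which removes the need for the try/except entirely; a minimal self-contained sketch with the sample data from the question:

d = {'Jeff Bezos': 'Jeff Bezos is the CEO of Amazon',
     'Serguey Brin': 'Serguey Brin co-founded Google'}

def partial_lookup(d, query, default='Error, input not listed'):
    # Return the value of the first key that contains the query substring.
    return next((v for k, v in d.items() if query in k), default)

print(partial_lookup(d, 'Bezos'))  # Jeff Bezos is the CEO of Amazon
print(partial_lookup(d, 'Musk'))   # Error, input not listed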

Related

Python - issue with removing fields from the output

I have an issue with my code. I need the script to remove fields which meet all three conditions:
the CreatedBy is koala,
the Book is PI or SI or II or OT or FG,
and the Category is Cert or CertPlus or Cap or Downside.
Currently my code removes all koala rows and all of those books, and only applies the last condition, so my current output keeps fields only if the category is different. I would like it to remove fields ONLY if all 3 conditions are met at once, not whenever createdby is koala or the book is PI, SI, II, OT or FG, and to show everything else in the range.
If a field is created by koala and the category is Cert but the book does not match, I wish to see this field, but now it is removed.
Likewise, if none of the conditions are met I also want to see those fields (e.g. createdby is Extra, Book is NG and Category is Multiple); now those are also removed from the output.
Example dataset:
In the link below - I wish to remove only those marked red:
current_path = os.path.dirname(os.path.realpath(sys.argv[0]))
a_path, q_path = 0, 0

def assign_path(current_path, a_path=0, q_path=0):
    files = os.listdir(current_path)
    for i in files:
        if re.search('(?i)activity', i):
            a_path = '\\'.join([current_path, i])
        elif re.search('(?i)query', i):
            q_path = '\\'.join([current_path, i])
    return a_path, q_path

a_path, q_path = assign_path(current_path)
if a_path == 0 or q_path == 0:
    files = os.listdir(current_path)
    directories = []
    for i in files:
        if os.path.isdir(i):
            directories.append(i)
    for i in directories:
        if re.search('(?i)input', i):
            a_path, q_path = assign_path('\\'.join([current_path, i]), a_path, q_path)
L = list(range(len(qr)))
L1 = list(range(len(qr2)))
L2 = list(range(len(ac)))
-------------------------------------------------------
qr = pd.read_excel(q_path)
qr2 = pd.read_excel(q_path)
qr_rec = qr2.iloc[[0, 1]]
d = qr2.iloc[0].to_dict()
for i in list(d.keys()):
    d[i] = np.nan
for i in range(len(qr2)):
    if qr2.iloc[i]['LinkageLinkType'] != 'B2B_COUNTER_TRADE'\
            and qr2.iloc[i]['CreatedBy'] == 'koala_'\
            and qr2.iloc[i]['Book'] in {'PI', 'SI', 'II', 'OT', 'FG'}\
            and qr2.iloc[i]['Category'] not in {'Cert', 'CertPlus', 'Cap', 'Downside'}:
        while i in L:
            L.remove(i)
        if qr2.iloc[i]['PrimaryRiskId'] not in list(aID):
            qr_rec = qr_rec.append(qr2.iloc[i], ignore_index=True)
I have added the beginning of the code, which allows me to use the Excel file. I have two files, one of them being a_path (please disregard this one). The issue I have is with the q_path.
Check this out:
pd.read_csv('stackoverflow.csv')
category book createdby
0 Multiple NG panda
1 Cert DG koala
2 Cap PI monkey
3 CertPlus ZZ panda
4 Cap ll joey
5 Cert OT koala
6 Cap FG koala
7 Cert PI koala
8 Block SI koala
9 Cap II koala
df.query("~(category in ['Cert', 'Cap'] and book in ['OT', 'FG', 'PI', 'II'] and createdby=='koala')")
category book createdby
0 Multiple NG panda
1 Cert DG koala
2 Cap PI monkey
3 CertPlus ZZ panda
4 Cap ll joey
8 Block SI koala
pd.DataFrame.query can be used to filter data; the ~ at the beginning is a not operator.
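For the full rule in the question, the same ~ pattern extends to all three conditions; a minimal sketch, assuming the simplified column names used above:

import pandas as pd

df = pd.DataFrame({'category': ['Multiple', 'Cert', 'Cap', 'Cert'],
                   'book': ['NG', 'DG', 'FG', 'PI'],
                   'createdby': ['panda', 'koala', 'koala', 'koala']})

# Drop a row only when ALL three conditions hold at the same time.
mask = ~((df['createdby'] == 'koala')
         & df['book'].isin(['PI', 'SI', 'II', 'OT', 'FG'])
         & df['category'].isin(['Cert', 'CertPlus', 'Cap', 'Downside']))
print(df[mask])  # keeps rows 0 and 1, drops rows 2 and 3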

How to extract only a specific part of a URL in Python and add its value as another column in df for every row?

I have a df containing user and url looking like this.
df
User Url
1 http://www.mycompany.com/Overview/Get
2 http://www.mycompany.com/News
3 http://www.mycompany.com/Accountinfo
4 http://www.mycompany.com/Personalinformation/Index
...
I want to add another column page that only takes the second part of the URL, so I'd have it like this.
user url page
1 http://www.mycompany.com/Overview/Get Overview
2 http://www.mycompany.com/News News
3 http://www.mycompany.com/Accountinfo Accountinfo
4 http://www.mycompany.com/Personalinformation/Index Personalinformation
...
My code below is not working.
slashparts = df['url'].split('/')
df['page'] = slashparts[4]
The error I'm getting:
AttributeError                            Traceback (most recent call last)
<ipython-input-23-0350a98a788c> in <module>()
----> 1 slashparts = df['request_url'].split('/')
      2 df['page'] = slashparts[1]

~\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   4370             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   4371                 return self[name]
-> 4372             return object.__getattribute__(self, name)
   4373
   4374     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'split'
Use the pandas text functions via str, and select the 4th element of each list with str[3], because Python counts from 0:
df['page'] = df['Url'].str.split('/').str[3]
Or, if performance is important, use a list comprehension:
df['page'] = [x.split('/')[3] for x in df['Url']]
print (df)
User Url \
0 1 http://www.mycompany.com/Overview/Get
1 2 http://www.mycompany.com/News
2 3 http://www.mycompany.com/Accountinfo
3 4 http://www.mycompany.com/Personalinformation/I...
page
0 Overview
1 News
2 Accountinfo
3 Personalinformation
I'm attempting to be a little more explicit, to handle cases where http might be missing, and other variations:
pat = r'(?:https?://)?(?:www\.)?(?:\w+\.\w+/)([^/]*)'
df.assign(page=df.Url.str.extract(pat, expand=False))
User Url page
0 1 http://www.mycompany.com/Overview/Get Overview
1 2 http://www.mycompany.com/News News
2 3 www.mycompany.com/Accountinfo Accountinfo
3 1 http://www.mycompany.com/Overview/Get Overview
4 2 mycompany.com/News News
5 3 https://www.mycompany.com/Accountinfo Accountinfo
6 4 http://www.mycompany.com/Personalinformation/I... Personalinformation
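A standard-library alternative is urllib.parse, which avoids a hand-rolled regex for well-formed URLs; a minimal sketch, assuming the Url column from the question:

from urllib.parse import urlparse
import pandas as pd

df = pd.DataFrame({'Url': ['http://www.mycompany.com/Overview/Get',
                           'http://www.mycompany.com/News']})

def first_segment(url):
    # The first non-empty component of the URL path is the page.
    parts = [p for p in urlparse(url).path.split('/') if p]
    return parts[0] if parts else None

df['page'] = df['Url'].map(first_segment)
print(df)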

Join whole words by their tag in Python

Let's say I have these sentences:
His/O name/O is/O Petter/Name Jack/Name and/O his/O brother/O name/O is/O
Jonas/Name Van/Name Dame/Name
How can I get a result like this:
Petter Jack, Jonas Van Dame.
So far I've already tried this, but it still only joins 2 words:
import re
pattern = re.compile(r"\w+/Name")
sent = sentence.split()
for i, w in enumerate(sent):
    if pattern.match(sent[i]) is not None:
        if pattern.match(sent[i + 1]) is not None:
            # ....
            # join sent[i] and sent[i+1] element
            # ....
Try something like this:
pattern = re.compile(r"((\w+\/Name\s*)+)")
names = pattern.findall(your_string)
for name in names:
    print(''.join(name[0].split('/Name')))
I'm thinking about a two-phase solution:
r = re.compile(r'\w+\/Name(?:\ \w+\/Name)*')
result = r.findall(s)
# -> ['Petter/Name Jack/Name', 'Jonas/Name Van/Name Dame/Name']
for m in result:
    print(m.replace('/Name', ''))
# -> Petter Jack
# -> Jonas Van Dame
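Putting the second approach together with the sample sentence gives a self-contained check:

import re

s = ("His/O name/O is/O Petter/Name Jack/Name and/O his/O brother/O "
     "name/O is/O Jonas/Name Van/Name Dame/Name")

# Adjacent word/Name tokens belong to one full name.
name_re = re.compile(r'\w+/Name(?: \w+/Name)*')
names = [m.replace('/Name', '') for m in name_re.findall(s)]
print(', '.join(names) + '.')  # -> Petter Jack, Jonas Van Dame.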

Data Analysis using Python

I have 2 CSV files. One has city name, population and humidity. In the second, cities are mapped to states. I want to get the state-wise total population and average humidity. Can someone help? Here is the example:
CSV 1:
CityName,population,humidity
Austin,1000,20
Sanjose,2200,10
Sacramento,500,5
CSV 2:
State,city name
Ca,Sanjose
Ca,Sacramento
Texas,Austin
I would like to get this output (population summed and humidity averaged per state):
Ca,2700,7.5
Texas,1000,20
The above solution doesn't work because the dictionary will contain only one value per key. I gave up and finally used a loop. The code below works; the input files are shown too.
CSV 1:
state_name,city_name
CA,sacramento
utah,saltlake
CA,san jose
Utah,provo
CA,sanfrancisco
TX,austin
TX,dallas
OR,portland
CSV 2:
city_name population humidity
sacramento 1000 1
saltlake 300 5
san jose 500 2
provo 100 7
sanfrancisco 700 3
austin 2000 4
dallas 2500 5
portland 300 6
def mapping_within_dataframe(self, file1, file2, file3):
    self.csv1 = file1
    self.csv2 = file2
    self.outcsv = file3
    one_state_data = 0
    outfile = csv.writer(open(self.outcsv, 'w'), delimiter=',')  # the path, not the literal string 'self.outcsv'
    state_city = read_csv(self.csv1)
    city_data = read_csv(self.csv2)
    all_state = list(set(state_city.state_name))
    for one_state in all_state:
        one_state_cities = list(state_city.loc[state_city.state_name == one_state, "city_name"])
        one_state_data = 0
        for one_city in one_state_cities:
            one_city_data = city_data.loc[city_data.city_name == one_city, "population"].sum()
            one_state_data = one_state_data + one_city_data
        print one_state, one_state_data
    outfile.writerows(whatever)
def output(file1, file2):
    f = lambda x: x.strip()  # strips newline and whitespace characters
    with open(file1) as cities:
        with open(file2) as states:
            states_dict = {}
            cities_dict = {}
            for line in states:
                line = line.split(',')
                states_dict[f(line[0])] = f(line[1])
            for line in cities:
                line = line.split(',')
                cities_dict[f(line[0])] = (int(f(line[1])), int(f(line[2])))
            for state, city in states_dict.iteritems():
                try:
                    print state, cities_dict[city]
                except KeyError:
                    pass

output(CSV1, CSV2)  # these are the names of the files
This gives the output you wanted. Just make sure the names of cities in both files are the same in terms of capitalization.
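For comparison, pandas can do the whole merge-and-aggregate in a few lines; a minimal sketch assuming the two CSVs from the original question (the file names are hypothetical):

import pandas as pd

cities = pd.read_csv('cities.csv')  # CityName,population,humidity
states = pd.read_csv('states.csv')  # State,city name

# Join cities onto their states, then sum population and average humidity per state.
merged = states.merge(cities, left_on='city name', right_on='CityName')
out = merged.groupby('State').agg({'population': 'sum', 'humidity': 'mean'})
print(out)
#        population  humidity
# Ca           2700       7.5
# Texas        1000      20.0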

HTML scraping using Python: top box office list from IMDb website

URL: http://www.imdb.com/chart/?ref_=nv_ch_cht_2
I want to print the top box office list from the above site (each movie's rank, title, weekend, gross and weeks, in that order).
Example output:
Rank:1
title: godzilla
weekend:$93.2M
Gross:$93.2M
Weeks: 1
Rank: 2
title: Neighbours
This is just a simple way to extract those entities with BeautifulSoup:
from bs4 import BeautifulSoup
import urllib2

url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
rows = page.findAll("tr", {'class': ['odd', 'even']})
for tr in rows:
    for data in tr.findAll("td", {'class': ['titleColumn', 'weeksColumn', 'ratingColumn']}):
        print data.get_text()
P.S. Arrange the output according to your needs.
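The same idea in Python 3, where urllib2 became urllib.request; a sketch reusing the selectors from the answer above (the page markup may have changed since):

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"
page = BeautifulSoup(urlopen(url).read(), 'html.parser')
for tr in page.find_all("tr", class_=["odd", "even"]):
    # Same column classes as above; print each cell's text.
    for td in tr.find_all("td", class_=["titleColumn", "weeksColumn", "ratingColumn"]):
        print(td.get_text(strip=True))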
There is no need to scrape anything. See the answer I gave here: How to scrape data from imdb business page?
The Python script below will give you: 1) the list of Top Box Office movies from IMDb, and 2) the list of cast for each of them.
from lxml.html import parse

def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    bo_total = len(bo_table[0][2])
    if no_of_movies <= bo_total:
        count = no_of_movies
    else:
        count = bo_total
    movies = {}
    for i in range(0, count):
        mo = {}
        mo['url'] = 'http://www.imdb.com' + bo_page.cssselect('td.titleColumn')[i][0].get('href')
        mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
        mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
        mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
        mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
        mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()
        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')
        flag = 0
        mo['cast'] = []
        for cast in m_casttable[0]:
            if flag == 0:
                flag = 1  # skip the header row of the cast table
            else:
                m_starname = cast[1][0][0].text_content().strip()
                mo['cast'].append(m_starname)
        movies[i] = mo
    return movies

if __name__ == '__main__':
    no_of_movies = raw_input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))
    for k, v in bo_movies.iteritems():
        print '#' + str(k+1) + ' ' + v['title'] + ' (' + v['year'] + ')'
        print 'URL: ' + v['url']
        print 'Weekend: ' + v['weekend']
        print 'Gross: ' + v['gross']
        print 'Weeks: ' + v['weeks']
        print 'Cast: ' + ', '.join(v['cast'])
        print '\n'
Output (run in terminal):
parag#parag-innovate:~/python$ python imdb_bo_scraper.py
Enter no. of Box office movies to display:3
#1 Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden
#2 Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski
#3 Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long
