Put all the Row values into cells (Python)

How can I put all the Rows values into cells?
rows = db().select(i.INV_ITEMCODE, n.INV_NAME, orderby=i.INV_ITEMCODE, join=n.on(i.POS_TASKCODE == n.POS_TASKCODE))
for r in rows:
code = str(r.db_i_item.INV_ITEMCODE)
desc = str(r.db_i_name.INV_NAME)
row = [dict(rows=rows)]
cell = [code, desc]
row = [dict(cell=cell, id=str(+1))]
records = []
total = []
result = None
result = dict(records=str(total), total='1', row=row , page='1') #records should get the total cell
return result
The result returns only ONE cell value:
dict: {'records': '[]', 'total': '1', 'page': '1', 'row': [{'cell': ['LUBS001', 'Hav. Fully Synthetic 1L'], 'id': '1'}]}
but the Rows object holds the full query result:
Rows: db_i_item.INV_ITEMCODE,db_i_name.INV_NAME
LUBS001,Hav. Fully Synthetic 1L
LUBS002,Hav. Formula 1L
LUBS003,Hav. SF 1L
LUBS004,Hav. Plus 2T 200ML
LUBS005,Havoline Plus 2T 1L
LUBS006,Havoline Super 4T 1L
LUBS007,Havoline EZY 4T 1L
LUBS008,Delo Sports 1L
LUBS009,Delo Gold Multigrade 1L
LUBS010,Delo Gold Monograde 1L
LUBS011,Delo Silver 1L
LUBS012,Super Diesel 1L
LUBS013,Brake Fluid 250ML
LUBS014,Brake Fluid 500ML
LUBS015,Brake Fluid 1L
LUBS016,Texamatic ATF 1L
LUBS020,Coolant
LUBS21,Delo
PET001,DIESEL
PET002,SILVER
PET003,GOLD
PET004,REGULAR
PET005,KEROSENE
    

Got it :D
items = db(q1 & q2 & search).select(i.INV_ITEMCODE, n.INV_NAME, m.INV_KIND, p.INV_PRICE, m.INV_DRTABLE, p.INV_PRICECODE, p.INV_ITEMCODEX, orderby=o)
ri = 0
rows = []
for ri, r in enumerate(items):
    # Translate the one-letter kind code into a readable label
    if r.db_i_matrix.INV_KIND == 'W':
        kind = 'Wet'
    else:
        kind = 'Dry'
    cell = [str(ri + 1), str(r.db_i_item.INV_ITEMCODE), str(r.db_i_name.INV_NAME), str(kind), str(r.db_i_price.INV_PRICE),
            str(r.db_i_matrix.INV_DRTABLE), str(r.db_i_price.INV_PRICECODE), str(r.db_i_price.INV_ITEMCODEX)]
    records = ri + 1
    # Append one dict per record instead of overwriting the previous one
    rows += [dict(id=str(ri + 1), cell=cell)]
ikind = dict(records=records, totals='1', rows=rows)
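The key point of the fix is that each record is appended to a list instead of overwriting a single variable. A minimal sketch of that pattern without a database, assuming plain (code, name) tuples in place of the web2py Rows object; the function name build_grid and the sample data are hypothetical:

```python
def build_grid(items):
    """Collect every record into a list of {'id', 'cell'} dicts."""
    rows = []
    for index, (code, name) in enumerate(items, start=1):
        cell = [str(index), str(code), str(name)]
        # Append, so earlier records are kept
        rows.append(dict(id=str(index), cell=cell))
    # records is the number of cells collected; it is 0 for an empty result
    return dict(records=len(rows), total='1', page='1', rows=rows)


items = [("LUBS001", "Hav. Fully Synthetic 1L"), ("LUBS002", "Hav. Formula 1L")]
grid = build_grid(items)
print(grid["records"])
print(grid["rows"][1]["cell"])
```

Counting with len(rows) also avoids an undefined records value when the query returns nothing.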

Related

PySpark: connect the array elements of two columns

I am very new to PySpark.
I have a DataFrame with two columns, and each column holds an array of strings.
How can I join each element of the array in the first column with the value at the same position in the array in the other column?
If I convert the DataFrame to a pandas DataFrame in Databricks, the code below works, but it does not keep the arrays in the correct format.
for item in list_x:
    df_head[item] = "x"
    value = df_head['materialText'].values
    headName = df_head['materialTextPart'].values
    value_list = []
    for k in range(len(df_head)):
        # print(k)
        # Skip missing values (NaN is a float); np.float no longer exists in recent NumPy
        if type(value[k]) == float:
            continue
        else:
            value_array = value[k][0:].split(',')
            # print(value_array)
            headName_array = headName[k][1:-2].split(',')
            for m in range(len(headName_array)):
                if (headName_array[m] == item) or (headName_array[m] == ' ' + item) or (headName_array[m] == ' ' + item.replace('s', '')):
                    columnName = item
                    columnValue = df_head.loc[k, columnName]
                    if columnValue == 'x':
                        df_head.loc[k, columnName] = value_array[m]
                    else:
                        df_head.loc[k, columnName] = df_head.loc[k, columnName] + ',' + value_array[m]
    df_head[item] = df_head[item].replace('x', np.nan)
Example of columns:
Row 1:
materialTextPart: ["Fabric:", "Wall bracket:", "Top rail/ Bottom rail:"]
materialText: ["100 % polyester (100% recycled), PET plastic", "Steel, Polycarbonate/ABS plastic, Powder coating", "Aluminium, Powder coating"]
Row 2:
materialTextPart: ["Ticking:", "Filling:", "Ticking, underside:", "Comfort filling:", "Ticking:"]
materialText: ["100 % polyester (100% recycled)", "100 % polyester", "100% polypropylene", "Polyurethane foam 28 kg/cu.m.", "100% polyester"]
As I mentioned in my comment -
from pyspark.sql.functions import concat, zip_with
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
df = spark.createDataFrame(
    data=[
        (["Fabric:", "Wall bracket:", "Top rail/ Bottom rail:"],
         ["100 % polyester (100% recycled), PET plastic", "Steel, Polycarbonate/ABS plastic, Powder coating", "Aluminium, Powder coating"])
    ],
    schema=StructType([StructField("xs", ArrayType(StringType())), StructField("ys", ArrayType(StringType()))])
)
# zip_with pairs the elements at the same position; concat joins each pair
df.select(zip_with("xs", "ys", lambda x, y: concat(x, y)).alias("Array_Elements_Concat")).show(truncate=False)
Output
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Array_Elements_Concat |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Fabric:100 % polyester (100% recycled), PET plastic, Wall bracket:Steel, Polycarbonate/ABS plastic, Powder coating, Top rail/ Bottom rail:Aluminium, Powder coating]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
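To see what zip_with does to each row, the same position-by-position pairing can be sketched in plain Python. This is only an illustration with the built-in zip, not PySpark code, and the variable names are assumptions:

```python
xs = ["Fabric:", "Wall bracket:", "Top rail/ Bottom rail:"]
ys = [
    "100 % polyester (100% recycled), PET plastic",
    "Steel, Polycarbonate/ABS plastic, Powder coating",
    "Aluminium, Powder coating",
]

# Pair the elements at the same index and concatenate each pair
combined = [x + y for x, y in zip(xs, ys)]
for element in combined:
    print(element)
```

One difference to keep in mind: the built-in zip stops at the shorter list, whereas Spark's zip_with pads the shorter array with nulls.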

Buy any X product for X amount?

I was trying to solve this problem, but I could not figure it out.
I have a product dictionary:
product = {
    "shirt": {
        "price": 300,
        "no_reqired_for_dis": {"3": ["shirt", "pents", "tshirt", "shorts"], "discount_price": 250}},
    "pents": {
        "price": 200,
        "no_reqired_for_dis": {"3": ["shirt", "pents", "tshirt", "shorts"], "discount_price": 250}},
    "tshirt": {
        "price": 150,
        "no_reqired_for_dis": {"3": ["shirt", "pents", "tshirt", "shorts"], "discount_price": 250}},
    "shorts": {
        "price": 100,
        "no_reqired_for_dis": {"3": ["shirt", "pents", "tshirt", "shorts"], "discount_price": 250}}
}
What would be the best approach to compute the total?
The discount rule: anyone who buys at least three products gets every group of three items for 250.
For example, if someone buys 11 products in total (shirt = 5, pents = 4, tshirt = 1, shorts = 1), their total should be 250 * 3 + the price of the remaining items. The remaining items should be charged at the lowest prices among the products bought (here, shorts and tshirt).
I have done this:
total_payment = 0
total_product = {"shirt": 5, "pents": 4, "tshirt": 1, "shorts": 1}
total_item = sum(total_product.values())
for key, value in total_product.items():
    # dict.keys() is not subscriptable in Python 3, so convert it to a list first
    min_no_required_for_discount = list(product[key]["no_reqired_for_dis"].keys())
    if total_item < int(min_no_required_for_discount[0]):
        total_payment += value * product[key]["price"]
    else:
        remaining_unit = total_item % 3
        total_pair = (total_item - remaining_unit) // 3
        total_payment += total_pair * 250
Now I am confused about remaining_unit: how do I calculate the price for the remaining units, given that they must be charged at the lowest prices? In the example above remaining_unit will be 2, and the price should be that of shorts and tshirt.
Here is a quick template that might help you start working on this problem.
Notes: use set() to get the difference between item collections quickly, and use print() statements to confirm that each step behaves as expected. Again, this is NOT a complete solution; it just offers a template to get you started quickly.
from pprint import pprint
lowest_price_items = ['tshirt', 'shorts']
discount_price_items = ['shirt', 'pants']
discount_Set = set(discount_price_items)
cart = ['shirt', 'shirt', 'shirt', 'shirt', 'shirt', 'pants', 'pants', 'pants', 'pants', 'tshirt', 'shorts']
cart_Set = set(cart)
low_price_goods = cart_Set - discount_Set
pprint(product)
print(f' products: {product.keys()} ') # first level of prod. dict.
print(product['shirt'].keys()) # 'price' and 'no_requied_for_dis'
# products key1 key2
print(product['shirt']['no_reqired_for_dis']['discount_price']) # 250
tshirt_price = product['tshirt']['price']
print(tshirt_price)
"""
total should be 250 * 3 + remaining item * lowest_price_products (tshirts, shorts) only
"""
total_items = len(cart)
print(total_items)
# modify this to calculate the final price.
if total_items >= 3:
    # 250 per full group of three; the price of the remaining lowest-priced items still has to be added
    final_price = 250 * (total_items // 3)
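To make the rule from the question concrete, here is a minimal sketch of one way to finish the calculation. It assumes, as in the question's example, that every full group of three costs 250 and that the leftover items are charged at the cheapest unit prices in the cart; the function name cart_total and the flat prices dictionary are hypothetical simplifications of the product dictionary.

```python
def cart_total(prices, quantities, group_size=3, group_price=250):
    """Flat price per full group; leftovers charged at the cheapest unit prices."""
    # Expand the cart into one price per unit, cheapest first
    unit_prices = sorted(
        prices[name] for name, qty in quantities.items() for _ in range(qty)
    )
    groups, remaining = divmod(len(unit_prices), group_size)
    return groups * group_price + sum(unit_prices[:remaining])


prices = {"shirt": 300, "pents": 200, "tshirt": 150, "shorts": 100}
quantities = {"shirt": 5, "pents": 4, "tshirt": 1, "shorts": 1}
total = cart_total(prices, quantities)
print(total)  # 3 groups * 250 + shorts (100) + tshirt (150) = 1000
```

With fewer than three items there are no full groups, so the function simply returns the normal sum of the unit prices.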

I am trying to hide data on the basis of certain condition

I am trying to hide the data in a single column: if a value in the column is in exceptionList, it should be skipped and processing should move on to the next value. But somehow I am not able to hide the data, and the code throws an error:
if(x in exceptionList):
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Here is my code:
data = [['NISAMANEE ROWELL', '9198762345','98 Oxford Ave.Elk Grove Village, IL 60007'], ['ALICE BAISDEN', '8756342865', '94 Valley Rd.Miami Gardens, FL 33056'], ['MARC COGNETTI', '9198762345', '221 Summer CircleGreer, SC 29650'], ['JOHNS HOPKINS HEALTHCARE', '9654987642', '8522 Pendergast AvenueVilla Park, IL 60181']]
df = pd.DataFrame(data, columns = ['Name', 'Number', 'Address'])
df
def title_format(inp):
    return inp.str.title()
def new(x):
    #x = input('Enter your column name')
    #x = x.title()
    x = title_format(x)
    print(x)
    exc_list=['Mackesson Inc','Care','Healthcare','Henery Schien','Besse','LLC','CandP','INC','LTD','PHARMACY','PHARMACEUTICAL','HOSPITAL','COMPANY','ELECTRONICS','APP','VOLUNTEERS','SPECIALITIES','APPLIANCE','EXPRESS','MAGAZINE','SUPPLY','ENDOSCOPY','NETWandK','SCHOOL','AT&T','SOLUTIONS','SANITATION','SYSTEMS','COMPOUNDING','CLINIC','UTILITIES','DEPARTMENT','CREATIVE','PIN','employment','consultant','units','label','machine','anesthesia','services','medical','community','plaza','tech','bipolar','brand','commerce','testing','inspection','killer','plus','electric','division','diagnostic','materials','imaging','international','district','chamber','city','products','essentials','life','scissand','leasing','units','health','healthcare','surgical','enterprises','print','radiology','water','screens','telecom']
    exceptionList = [z.title() for z in exc_list]
    if(x in exceptionList):
        return x
    else:
        return x.str.replace(x, 'X' * random.randrange(3, 8))
#new(df.Name.astype(str))
new(df['Name'].astype(str))
As far as I understand what you want, I changed several lines in your code:
import pandas as pd
import random
data = [['NISAMANEE ROWELL', '9198762345','98 Oxford Ave.Elk Grove Village, IL 60007'], ['ALICE BAISDEN', '8756342865', '94 Valley Rd.Miami Gardens, FL 33056'], ['MARC COGNETTI', '9198762345', '221 Summer CircleGreer, SC 29650'], ['Healthcare', '9654987642', '8522 Pendergast AvenueVilla Park, IL 60181']]
df = pd.DataFrame(data, columns = ['Name', 'Number', 'Address'])
def title_format(inp):
    return inp.str.title()
def new(x):
    #x = input('Enter your column name')
    #x = x.title()
    x = title_format(x)
    print(x)
    exc_list=['Mackesson Inc','Care','Healthcare','Henery Schien','Besse','LLC','CandP','INC','LTD','PHARMACY','PHARMACEUTICAL','HOSPITAL','COMPANY','ELECTRONICS','APP','VOLUNTEERS','SPECIALITIES','APPLIANCE','EXPRESS','MAGAZINE','SUPPLY','ENDOSCOPY','NETWandK','SCHOOL','AT&T','SOLUTIONS','SANITATION','SYSTEMS','COMPOUNDING','CLINIC','UTILITIES','DEPARTMENT','CREATIVE','PIN','employment','consultant','units','label','machine','anesthesia','services','medical','community','plaza','tech','bipolar','brand','commerce','testing','inspection','killer','plus','electric','division','diagnostic','materials','imaging','international','district','chamber','city','products','essentials','life','scissand','leasing','units','health','healthcare','surgical','enterprises','print','radiology','water','screens','telecom']
    exceptionList = [z.title() for z in exc_list]
    match = [x1 in exceptionList for x1 in x]
    df.loc[match,'Name'] = ['X' * random.randrange(3, 8) for a in range(sum(match))]
    # return x
    # else:
    # return x.str.replace(x, 'X' * random.randrange(3, 8))
#new(df.Name.astype(str))
new(df['Name'].astype(str))
df
Out[1]:
Name Number Address
0 NISAMANEE ROWELL 9198762345 98 Oxford Ave.Elk Grove Village, IL 60007
1 ALICE BAISDEN 8756342865 94 Valley Rd.Miami Gardens, FL 33056
2 MARC COGNETTI 9198762345 221 Summer CircleGreer, SC 29650
3 XXXXXXX 9654987642 8522 Pendergast AvenueVilla Park, IL 60181
A more efficient way to do the same:
exc_list = [x.title() for x in exc_list]
df['Name'] = df['Name'].map(str.title)
df['match'] = [nn in exc_list for nn in df['Name']]
df.loc[df['match'] == True,'Name'] = ['X' * random.randrange(3, 8) for a in range(sum(df['match']))]
To hide the first 3 characters:
exc_list = [x.title() for x in exc_list]
df['Name'] = df['Name'].map(str.title)
df['match'] = [nn in exc_list for nn in df['Name']]
df['NameIf'] = list(zip(df['Name'], ['XXX' + s[3:] if len(s) > 3 else 'XXX' for s in df['Name']]))
df['Name'] = [n[0][n[1]] for n in list(zip(df['NameIf'],df['match'].astype(int)))]
df = df.drop(['NameIf', 'match'], axis = 1)
df
To hide the whole row:
exc_list = [x.title() for x in exc_list]
df['Name'] = df['Name'].map(str.title)
df['match'] = [nn in exc_list for nn in df['Name']]
hide_row = {c:'XXX' for c in df.columns}
df[df['match'] != True].merge(pd.DataFrame(hide_row, index = df[df['match'] == True].index), how = 'outer')
Short explanation:
# Step 1. this gives you DataFrame without matching
df[df['match'] != True]
Out[3]:
Name Number Address match
0 Nisamanee Rowell 9198762345 98 Oxford Ave.Elk Grove Village, IL 60007 False
1 Alice Baisden 8756342865 94 Valley Rd.Miami Gardens, FL 33056 False
2 Marc Cognetti 9198762345 221 Summer CircleGreer, SC 29650 False
# Step 2. this opposite gives you DataFrame with matching
df[df['match'] == True]
Out[4]:
Name Number Address match
3 Healthcare 9654987642 8522 Pendergast AvenueVilla Park, IL 60181 True
# Step 3. but you take only index from Step 2. And create new dataframe with indexes and 'XXX' columns
hide_row = {c:'XXX' for c in df.columns}
pd.DataFrame(hide_row, index = df[df['match'] == True].index)
Out[5]:
Name Number Address match
3 XXX XXX XXX XXX
# Step 4. And then you just merge two dataframes from step 1 and step 3 by indexes
df[df['match'] != True].merge(pd.DataFrame(hide_row, index = df[df['match'] == True].index), how = 'outer')
Just a small change in your code will work. Mind you, it's not optimal, but it works just fine.
import pandas as pd
import random
data = [['NISAMANEE ROWELL', '9198762345','98 Oxford Ave.Elk Grove Village, IL 60007'], ['ALICE BAISDEN', '8756342865', '94 Valley Rd.Miami Gardens, FL 33056'], ['MARC COGNETTI', '9198762345', '221 Summer CircleGreer, SC 29650'], ['Healthcare', '9654987642', '8522 Pendergast AvenueVilla Park, IL 60181']]
df = pd.DataFrame(data, columns = ['Name', 'Number', 'Address'])
df
def title_format(inp):
    return inp.title()
def new(x):
    #x = input('Enter your column name')
    #x = x.title()
    x = title_format(x)
    print(x)
    exc_list=['Mackesson Inc','Care','Healthcare','Henery Schien','Besse','LLC','CandP','INC','LTD','PHARMACY','PHARMACEUTICAL','HOSPITAL','COMPANY','ELECTRONICS','APP','VOLUNTEERS','SPECIALITIES','APPLIANCE','EXPRESS','MAGAZINE','SUPPLY','ENDOSCOPY','NETWandK','SCHOOL','AT&T','SOLUTIONS','SANITATION','SYSTEMS','COMPOUNDING','CLINIC','UTILITIES','DEPARTMENT','CREATIVE','PIN','employment','consultant','units','label','machine','anesthesia','services','medical','community','plaza','tech','bipolar','brand','commerce','testing','inspection','killer','plus','electric','division','diagnostic','materials','imaging','international','district','chamber','city','products','essentials','life','scissand','leasing','units','health','healthcare','surgical','enterprises','print','radiology','water','screens','telecom']
    exceptionList = [z.title() for z in exc_list]
    if(x in exceptionList):
        return x
    else:
        return x.replace(x, 'X' * random.randrange(3, 8))
#new(df.Name.astype(str))
# apply calls new once per value, so x is a plain string inside the function
df['Name'] = df['Name'].apply(new)
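The reason this version avoids the ValueError is that the function receives one string at a time, so the membership test is an ordinary check on a single value rather than on a whole Series. A minimal sketch of that per-value logic in plain Python, with a shortened exception list; the function name mask_unless_exception is hypothetical:

```python
import random


def mask_unless_exception(name, exceptions):
    """Return the title-cased name if it is an exception, else a run of X's."""
    titled = name.title()
    if titled in exceptions:
        return titled
    return 'X' * random.randrange(3, 8)  # 3 to 7 characters


# Shortened list for illustration; a set makes the lookup fast
exceptions = {z.title() for z in ['Healthcare', 'LLC', 'PHARMACY']}
names = ['ALICE BAISDEN', 'Healthcare']
masked = [mask_unless_exception(n, exceptions) for n in names]
print(masked)
```

The same function could be passed to a column-wise apply, as in the answer above.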

Beautifulsoup + Python HTML UL targeting, creating a list and appending to variables

I'm trying to scrape Autotrader's website to get an Excel sheet of the stats and names.
I'm stuck trying to loop through an HTML 'ul' element without any classes or IDs and organize that info into a Python list, so that I can then append the individual li elements to different fields in my table.
As you can see, I'm able to target the title and price elements, but the 'ul' is really tricky... well, for someone at my skill level.
The specific code I'm struggling with:
for i in range(1, 2):
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    html_soup = BeautifulSoup(response.text, 'html.parser')
    ad_containers = html_soup.find_all('h2', class_ = 'listing-title title-wrap')
    price_containers = html_soup.find_all('section', class_ = 'price-column')
    for container in ad_containers:
        name = container.find('a', class_ ="js-click-handler listing-fpa-link").text
        names.append(name)
        # Trying to loop through the key specs list and assigned each 'li' to a different field in the table
        lis = []
        list_container = container.find('ul', class_='listing-key-specs')
        for li in list_container.find('li'):
            lis.append(li)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
        lis = [] # Clearing dictionary to get ready for next set of data
And the error message I get is the following:
Full code here:
from requests import get
from bs4 import BeautifulSoup
import pandas
# from time import sleep, time
# import random
# Create table fields
names = []
prices = []
year = []
body_type = []
milage = []
engine = []
hp = []
transmission = []
petrol_type = []
for i in range(1, 2):
    # Make a get request
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    # Pause the loop
    # sleep(random.randint(4, 7))
    # Create containers
    html_soup = BeautifulSoup(response.text, 'html.parser')
    ad_containers = html_soup.find_all('h2', class_ = 'listing-title title-wrap')
    price_containers = html_soup.find_all('section', class_ = 'price-column')
    for container in ad_containers:
        name = container.find('a', class_ ="js-click-handler listing-fpa-link").text
        names.append(name)
        # Trying to loop through the key specs list and assigned each 'li' to a different field in the table
        lis = []
        list_container = container.find('ul', class_='listing-key-specs')
        for li in list_container.find('li'):
            lis.append(li)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
        lis = [] # Clearing dictionary to get ready for next set of data
    for pricteainers in price_containers:
        price = pricteainers.find('div', class_ ='vehicle-price').text
        prices.append(price)
test_df = pandas.DataFrame({'Title': names, 'Price': prices, 'Year': year, 'Body Type': body_type, 'Mileage': milage, 'Engine Size': engine, 'HP': hp, 'Transmission': transmission, 'Petrol Type': petrol_type})
print(test_df.info())
# test_df.to_csv('Autotrader_test.csv')
I followed the advice from David in the other answer's comment area.
Code:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
names = []
prices = []
year = []
body_type = []
milage = []
engine = []
hp = []
transmission = []
petrol_type = []
for i in range(1, 2):
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    html_soup = BeautifulSoup(response.text, 'html.parser')
    # Each listing is an article that contains the title, the price and the key specs
    outer = html_soup.find_all('article', class_='search-listing')
    for inner in outer:
        lis = []
        names.append(inner.find_all('a', class_ ="js-click-handler listing-fpa-link")[1].text)
        prices.append(inner.find('div', class_='vehicle-price').text)
        for li in inner.find_all('ul', class_='listing-key-specs'):
            # Keep the last seven specs; spec avoids shadowing the page counter i
            for spec in li.find_all('li')[-7:]:
                lis.append(spec.text)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
test_df = pd.DataFrame.from_dict({'Title': names, 'Price': prices, 'Year': year, 'Body Type': body_type, 'Mileage': milage, 'Engine Size': engine, 'HP': hp, 'Transmission': transmission, 'Petrol Type': petrol_type}, orient='index')
print(test_df.transpose())
Output:
Title Price Year Body Type Mileage Engine Size HP Transmission Petrol Type
0 Citroen C3 1.4 HDi Exclusive 5dr £500 2002 (52 reg) Hatchback 123,065 miles 1.4L 70bhp Manual Diesel
1 Volvo V40 1.6 XS 5dr £585 1999 (V reg) Estate 125,000 miles 1.6L 109bhp Manual Petrol
2 Toyota Yaris 1.3 VVT-i 16v GLS 3dr £700 2000 (W reg) Hatchback 94,000 miles 1.3L 85bhp Automatic Petrol
3 MG Zt-T 2.5 190 + 5dr £750 2002 (52 reg) Estate 95,000 miles 2.5L 188bhp Manual Petrol
4 Volkswagen Golf 1.9 SDI E 5dr £795 2001 (51 reg) Hatchback 153,000 miles 1.9L 68bhp Manual Diesel
5 Volkswagen Polo 1.9 SDI Twist 5dr £820 2005 (05 reg) Hatchback 106,116 miles 1.9L 64bhp Manual Diesel
6 Volkswagen Polo 1.4 S 3dr (a/c) £850 2002 (02 reg) Hatchback 125,640 miles 1.4L 75bhp Manual Petrol
7 KIA Picanto 1.1 LX 5dr £990 2005 (05 reg) Hatchback 109,000 miles 1.1L 64bhp Manual Petrol
8 Vauxhall Corsa 1.2 i 16v SXi 3dr £995 2004 (54 reg) Hatchback 81,114 miles 1.2L 74bhp Manual Petrol
9 Volkswagen Beetle 1.6 3dr £995 2003 (53 reg) Hatchback 128,000 miles 1.6L 102bhp Manual Petrol
The ul is not a child of the h2. It's a sibling.
So you will need to make a separate selection, because it's not part of the ad_containers.
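The child versus sibling distinction can be illustrated with a tiny, made-up fragment. This sketch deliberately uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages; the markup is a hypothetical simplification of a listing, not Autotrader's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing: the ul sits next to the h2, not inside it
snippet = """
<article>
  <h2 class="listing-title"><a>Volvo V40 1.6 XS 5dr</a></h2>
  <ul class="listing-key-specs"><li>1999 (V reg)</li><li>Estate</li></ul>
</article>
""".strip()

article = ET.fromstring(snippet)
h2 = article.find("h2")

# Searching inside the h2 finds nothing, because the ul is not its child
inside_h2 = h2.find("ul")
print(inside_h2)

# Searching from the shared parent reaches the ul and its li elements
specs = [li.text for li in article.find("ul").findall("li")]
print(specs)
```

The same idea applies in BeautifulSoup: select from the element that contains both the title and the list, rather than from the title itself.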

Match word irrespective of the case

Dataset:
> df
Id Clean_Data
1918916 Luxury Apartments consisting 11 towers Well equipped gymnasium Swimming Pool Toddler Pool Health Club Steam Room Sauna Jacuzzi Pool Table Chess Billiards room Carom Table Tennis indoor games
1495638 near medavakkam junction calm area near global hospital
1050651 No Pre Emi No Booking Amount No Floor Rise Charges No Processing Fee HLPROJECT HIGHLIGHTS
The code below successfully returns the matching words and n-grams from the list of values in Category.py:
df['one_word_tokenized_text'] =df["Clean_Data"].str.split()
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df['trigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
df['four_words'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 4)))
token=pd.Series(df["one_word_tokenized_text"])
Lid=pd.Series(df["Id"])
matches= token.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.HealthCare])))
match_list= [[m for m in match.values.ravel() if isinstance(m, str)] for match in matches]
match_df = pd.DataFrame({"ID":Lid,"jc1":match_list})
def match_word(feature, row):
    categories = []
    for bigram in row.bigram:
        joined = ' '.join(bigram)
        if joined in feature:
            categories.append(joined)
    for trigram in row.trigram:
        joined = ' '.join(trigram)
        if joined in feature:
            categories.append(joined)
    for fourwords in row.four_words:
        joined = ' '.join(fourwords)
        if joined in feature:
            categories.append(joined)
    return categories
match_df['Health1'] = df.apply(partial(match_word, HealthCare), axis=1)
match_df['HealthCare'] = match_df[match_df.columns[[1,2]]].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
Category.py
category = [('steam room','IN','HealthCare'),
('sauna','IN','HealthCare'),
('Jacuzzi','IN','HealthCare'),
('Aerobics','IN','HealthCare'),
('yoga room','IN','HealthCare'),]
HealthCare= [e1 for (e1, rel, e2) in category if e2=='HealthCare']
Output:
ID HealthCare
1918916 Jacuzzi
1495638
1050651 Aerobics, Jacuzzi, yoga room
Here, if I write the features in the category list in exactly the same letter case as in the dataset, the code identifies them and returns the value; otherwise it does not.
So I want my code to be case-insensitive and also to detect "Steam Room" and "Sauna" under the health category. I tried the .lower() function, but I am not sure how to apply it.
Edit 2: only Category.py is updated.
Category.py
category = [('steam room','IN','HealthCare'),
('sauna','IN','HealthCare'),
('jacuzzi','IN','HealthCare'),
('aerobics','IN','HealthCare'),
('Yoga room','IN','HealthCare'),
('booking','IN','HealthCare'),
]
category1 = [value[0].capitalize() for index, value in enumerate(category)]
category2 = [value[0].lower() for index, value in enumerate(category)]
test = []
test2 =[]
for index, value in enumerate(category1):
    test.append((value, category[index][1],category[index][2]))
for index, value in enumerate(category2):
    test2.append((value, category[index][1],category[index][2]))
category = category + test + test2
HealthCare = [e1 for (e1, rel, e2) in category if e2=='HealthCare']
Your unaltered dataset
import pandas as pd
from nltk import ngrams, word_tokenize
import Categories
from Categories import *
from functools import partial
data = {'Clean_Data':['Luxury Apartments consisting 11 towers Well equipped gymnasium Swimming Pool Toddler Pool Health Club Steam Room Sauna Jacuzzi Pool Table Chess Billiards room Carom Table Tennis indoor games',
'near medavakkam junction calm area near global hospital',
'No Pre Emi No Booking Amount No Floor Rise Charges No Processing Fee HLPROJECT HIGHLIGHTS '],
'Id' : [1918916, 1495638,1050651]}
df = pd.DataFrame(data)
df['one_word_tokenized_text'] =df["Clean_Data"].str.split()
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df['trigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
df['four_words'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 4)))
token=pd.Series(df["one_word_tokenized_text"])
Lid=pd.Series(df["Id"])
matches= token.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.HealthCare])))
match_list= [[m for m in match.values.ravel() if isinstance(m, str)] for match in matches]
match_df = pd.DataFrame({"ID":Lid,"jc1":match_list})
def match_word(feature, row):
    categories = []
    for bigram in row.bigram:
        joined = ' '.join(bigram)
        if joined in feature:
            categories.append(joined)
    for trigram in row.trigram:
        joined = ' '.join(trigram)
        if joined in feature:
            categories.append(joined)
    for fourwords in row.four_words:
        joined = ' '.join(fourwords)
        if joined in feature:
            categories.append(joined)
    return categories
match_df['Health1'] = df.apply(partial(match_word, HealthCare), axis=1)
match_df['HealthCare'] = match_df[match_df.columns[[1,2]]].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
Output
print(match_df)
+--------+----------------+-------------+------------------------------------+
|ID |jc1 |Health1 |HealthCare |
+--------+----------------+-------------+------------------------------------+
|1918916 |[sauna, jacuzzi]| |['sauna', 'jacuzzi'],['steam room'] |
+--------+----------------+-------------+------------------------------------+
|1495638 | | | |
+--------+----------------+-------------+------------------------------------+
|1050651 |[Booking]       |             |['Booking'],[]                      |
+--------+----------------+-------------+------------------------------------+
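An alternative to listing capitalized and lowercase variants of every feature is to normalize both sides with lower() before comparing. The following is a minimal sketch of that idea; it assumes simple whitespace splitting in place of nltk's word_tokenize, and the function name find_categories is hypothetical.

```python
def find_categories(text, features, max_words=4):
    """Return the features found in text, ignoring letter case."""
    lowered = {feature.lower() for feature in features}
    words = text.lower().split()
    found = []
    # Check every run of 1 to max_words consecutive words
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            candidate = ' '.join(words[start:start + size])
            if candidate in lowered and candidate not in found:
                found.append(candidate)
    return found


health_care = ['steam room', 'sauna', 'Jacuzzi', 'Aerobics', 'yoga room']
text = 'Health Club Steam Room Sauna Jacuzzi Pool Table'
result = find_categories(text, health_care)
print(result)  # ['sauna', 'jacuzzi', 'steam room']
```

Because both the text and the feature list are lowercased, the letter case used in Category.py no longer matters; the matches are returned in lowercase.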
