Extracting a string with the help of a function - Python

I have a clickstream DataFrame with about 4 million rows. It has many columns, two of which are URL and Domain. I have a dictionary and want to use it as a condition. For example: if the domain equals amazon.de and the URL contains the keyword pillow, then the new column should have the value pillow, and so on.
dictionary_keywords = {"amazon.de": "pillow", "rewe.com": "apple"}
ID Domain URL
1 amazon.de www.amazon.de/ssssssss/exapmle/pillow
2 rewe.de www.rewe.de/apple
The expected output should be the new column:
ID Domain URL New_Col
1 amazon.de www.amazon.de/ssssssss/exapmle/pillow pillow
2 rewe.de www.rewe.de/apple apple
I can manually use the .str.contains method, but I need to define a function which takes the dictionary key and value as a condition. Something like this:
df[(df['Domain'] == 'amazon.de') & (df['URL'].str.contains('pillow'))]
But I am not sure; I am new to this.

The way I prefer to solve this kind of problem is to use df.apply() row-wise (axis=1) with a custom function that handles the logic.
import pandas as pd

dictionary_keywords = {"amazon.de": "Pillow", "rewe.de": "Apple"}

df = pd.DataFrame({
    'Domain': ['amazon.de', 'rewe.de'],
    'URL': ['www.amazon.de/ssssssss/exapmle/pillow', 'www.rewe.de/apple']
})

def f(row):
    try:
        url = row['URL'].lower()
        # Take the host part and drop a leading "www." if present.
        # (Note: .strip('www.') would strip a *set* of characters, not the prefix.)
        domain = url.split('/')[0]
        if domain.startswith('www.'):
            domain = domain[len('www.'):]
        if dictionary_keywords[domain].lower() in url:
            return dictionary_keywords[domain]
    except Exception as e:
        print(row.name, e)
    return None  # or False, or np.nan

df['New_Col'] = df.apply(f, axis=1)
Output:
print(df)
Domain URL New_Col
0 amazon.de www.amazon.de/ssssssss/exapmle/pillow Pillow
1 rewe.de www.rewe.de/apple Apple
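With ~4 million rows, a per-row apply can be slow. As a rough alternative (a sketch, not a drop-in: it assumes the Domain column already matches the dictionary keys exactly), the keyword lookup can be done with map, followed by a single pass for the containment check:

import pandas as pd

# Map each row's Domain to its keyword; unknown domains become NaN
keywords = df['Domain'].map(dictionary_keywords)

# Keep the keyword only where the URL actually contains it (case-insensitive)
contains = pd.Series(
    [kw.lower() in url.lower() if isinstance(kw, str) else False
     for kw, url in zip(keywords, df['URL'])],
    index=df.index,
)
df['New_Col'] = keywords.where(contains)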

Related

Trying to add prefixes to url if not present in pandas df column

I am trying to add prefixes to the URLs in my 'Website' column. I can't figure out how to keep each new assignment to the helper column from overwriting everything from the previous one.
for example say I have the following urls in my column:
http://www.bakkersfinedrycleaning.com/
www.cbgi.org
barstoolsand.com
This would be the desired end state:
http://www.bakkersfinedrycleaning.com/
http://www.cbgi.org
http://www.barstoolsand.com
this is as close as I have been able to get:
def nan_to_zeros(df, col):
    new_col = f"nanreplace{col}"
    df[new_col] = df[col].fillna('~')
    return df

df1 = nan_to_zeros(df1, 'Website')

df1['url_helper'] = df1.loc[~df1['nanreplaceWebsite'].str.startswith('http') | ~df1['nanreplaceWebsite'].str.startswith('www'), 'url_helper'] = 'https://www.'
df1['url_helper'] = df1.loc[df1['nanreplaceWebsite'].str.startswith('http'), 'url_helper'] = ""
df1['url_helper'] = df1.loc[df1['nanreplaceWebsite'].str.startswith('www'), 'url_helper'] = 'www'
print(df1[['nanreplaceWebsite', "url_helper"]])
which just gives me a helper column of all 'www' because the last assignment overwrites everything.
Any direction is appreciated.
Data:
{'Website': ['http://www.bakkersfinedrycleaning.com/',
'www.cbgi.org', 'barstoolsand.com']}
IIUC, there are three things to fix here:
1. The leading df1['url_helper'] = shouldn't be there: the chained assignment writes the right-hand value to the entire column, which is why the last line wins.
2. | should be & in the first condition, because 'https://www.' should be added to URLs that start with neither of the strings in the condition. The error becomes apparent if the first condition is checked after the other two.
3. The last condition should add "http://" instead of "www".
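Applying those three fixes to the original .loc approach would give roughly this sketch (using 'http://' to match the desired output above):

s = df1['Website'].fillna('~')

df1['url_helper'] = ''  # default: URLs already starting with 'http' need no prefix
df1.loc[~s.str.startswith('http') & ~s.str.startswith('www'), 'url_helper'] = 'http://www.'
df1.loc[s.str.startswith('www'), 'url_helper'] = 'http://'

df1['fixed Website'] = df1['url_helper'] + s
print(df1[['Website', 'fixed Website']])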
Alternatively, your problem could be solved using np.select: pass the conditions in a list along with a corresponding list of choices, and values are assigned accordingly:
import numpy as np

s = df1['Website'].fillna('~')
df1['fixed Website'] = np.select(
    [~(s.str.startswith('http') | ~s.str.contains('www')),
     ~(s.str.startswith('http') | s.str.contains('www'))],
    ['http://' + s, 'http://www.' + s],
    s)
Output:
Website fixed Website
0 http://www.bakkersfinedrycleaning.com/ http://www.bakkersfinedrycleaning.com/
1 www.cbgi.org http://www.cbgi.org
2 barstoolsand.com http://www.barstoolsand.com

Append dataframe column values to an array by comparing the column values

I have a DataFrame with mail id and result.
Mail id result
0 xyz#gmail.com fail
1 xyz#yahoo.com pass
2 pqr#gmail.com not attempted
3 tuv#gmail.com not attempted
4 123#gmail.com fail
5 ABC#gmail.com not attempted
From the above data, I need to collect the mail ids into lists according to the result.
For example: if the result equals 'fail', the mail ids of the failed ones should go into a list called failed; similarly for 'not attempted'.
failed = ['xyz#gmail.com', '123#gmail.com']
not_attempted = ['pqr#gmail.com', 'tuv#gmail.com', 'ABC#gmail.com']
You can filter values separately:
failed = df.loc[df['result'].eq('fail'), 'Mail id'].tolist()
notattempted = df.loc[df['result'].eq('not attempted'), 'Mail id'].tolist()
Or create a Series of aggregated lists and then select by index (the index labels come from the result column, so the key is 'fail', not 'failed'):
s = df.groupby('result')['Mail id'].agg(list)
failed = s.loc['fail']
notattempted = s.loc['not attempted']
Or equivalently:
failed = s['fail']
notattempted = s['not attempted']
Please try this.
import pandas as pd

# let df be the base DataFrame
result_df = pd.DataFrame()
# get all distinct result values
result_df['idx_str'] = list(set(df['result']))

def separate_by_result(idx_str):
    str_df = df[df['result'] == idx_str]
    return list(str_df['Mail id'])

# show the result as one new column
result_df['result_list'] = result_df['idx_str'].apply(separate_by_result)
result_df will be what you want.
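For what it's worth, a compact variant along the same lines (a sketch) turns the grouped Series straight into a plain dict keyed by result value:

# One dict mapping each result value to its list of mail ids
result_lists = df.groupby('result')['Mail id'].agg(list).to_dict()

failed = result_lists.get('fail', [])
not_attempted = result_lists.get('not attempted', [])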

Create Proper Dataframe from SDMX Response, Python 3.6

I want to prepare a dataset from the data available at http://stat.data.abs.gov.au/Index.aspx?DataSetCode=ATSI_BIRTHS_SUMM
Data API:
http://stat.data.abs.gov.au/restsdmx/sdmx.ashx/GetData/ATSI_BIRTHS_SUMM/1+4+5+7+8+9+10+13+14+15+18+19+20.IM+IB.0+1+2+3+4+5+6+7.A/all
from pandasdmx import Request

Agency_Code = 'ABS'
Dataset_Id = 'ATSI_BIRTHS_SUMM'
ABS = Request(Agency_Code)
data_response = ABS.data(resource_id=Dataset_Id)
print(data_response.url)
DF = data_response.write(data_response.data.obs(with_values=True, with_attributes=True), parse_time=False)
The above gives the error: ValueError: Type names and field names cannot be a keyword: 'None'
DF = data_response.write(data_response.data.series, parse_time=False) works, but the dimension items come out column-wise.
Support Links:
http://stat.data.abs.gov.au/restsdmx/sdmx.ashx/GetDataStructure/all
http://stat.data.abs.gov.au/restsdmx/sdmx.ashx/GetDataStructure/ATSI_BIRTHS_SUMM
http://stat.data.abs.gov.au/Index.aspx?DataSetCode=ATSI_BIRTHS_SUMM
Please suggest a better way to retrieve the data.
Your example
DF = data_response.write(data_response.data.series, parse_time=False)
produces a stacked DataFrame; unstack().reset_index() will give you a "flat" one:
DF.unstack().reset_index()
MEASURE INDIGENOUS_STATUS ASGS_2011 FREQUENCY TIME_PERIOD 0
0 1 IM 0 A 2001 8334.0
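For intuition about the reshaping, here is a minimal self-contained sketch (toy numbers, not the ABS data) of what unstack().reset_index() does to a stacked series:

import pandas as pd

# Stand-in for the stacked result of write(): a Series with a MultiIndex
idx = pd.MultiIndex.from_product(
    [['IM', 'IB'], [2001, 2002]],
    names=['INDIGENOUS_STATUS', 'TIME_PERIOD'])
stacked = pd.Series([8334.0, 8200.0, 120.0, 130.0], index=idx)

# unstack() pivots the innermost index level into columns; reset_index() flattens
flat = stacked.unstack().reset_index()
print(flat)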
Is this what you are looking for?

Coloring Single Column of Pandas Dataframe.to_html()

Before this is marked as a duplicate: I have tried the code from the following topics and none has worked for me thus far:
Colouring one column of pandas dataframe
Format the color of a cell in a panda dataframe according to multiple conditions
how to color selected columns in python dataframe?
I have code that produces three pandas DataFrames that look like this:
RowName Orders Market StartTime StopTime
Status
good A 9 gold 10:00:00 10:09:45
.
.
.
bad B 60 silver 07:54:43 08:02:12
RowName Orders Market StartTime StopTime
Status
good E 19 plat. 10:00:00 10:09:45
.
.
bad F 54 mercury 07:54:43 08:02:12
RowName Orders Market StartTime StopTime
Status
great D 3 alum. 10:00:00 10:09:45
.
.
ok C 70 bronze 07:54:43 08:02:12
where the Status column is set as the index of each frame.
For each frame, I want to highlight the StartTime column with the value #D42A2A (aka red), regardless of what value is in a given cell.
How can this be done?
MOST RECENT FAILED ATTEMPTS:
def column_style(col):
    if col.Name == 'StartTime':
        return pd.Series('bgcolor: #d42a2a', col.index)

def col_color(data):
    color = 'red' if data != '' else 'black'
    return 'color: %s' % color

frame.style.applymap(col_color, subset=['StartTime'])
but this also fails.
NOTE:
I am using VI within a Linux shell.
The entire script is called from IE (Internet Explorer), so the output of the script is HTML.
I am using BS (BeautifulSoup) to scrape data from a few sites and then aggregating the results onto one page.
(After scraping the initial website and creating the required page, call it Page1, I tried to scrape Page1 in the same script and add the HTML lines via the .attrs method, but this "fails", i.e. the web server times out during the run.)
Let's try this:
import pandas as pd
import numpy as np

np.random.seed(24)
df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))],
               axis=1)
df.iloc[0, 2] = np.nan

def highlight_column(s, col):
    # style.apply with the default axis=0 passes one column at a time;
    # color every cell of the column whose name matches col
    return ['background-color: #d42a2a' if s.name == col else '' for v in s.index]

df.style.apply(highlight_column, col='B')
Output: the rendered table shows every cell in column B shaded #d42a2a (the original answer displays this as an image).
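Since the script here ultimately serves HTML to the browser, the styled frame can also be serialized directly; a sketch, assuming a reasonably recent pandas where Styler has to_html (older versions use .render() instead):

# Write the styled table out as an HTML fragment
styled = df.style.apply(highlight_column, col='B')
html = styled.to_html()

with open('report.html', 'w') as f:
    f.write(html)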
If anyone is using BeautifulSoup to parse a website and then using pandas to create a DataFrame that you may want to add styles to, you can do something like the following (this assumes you have already imported BeautifulSoup, scraped your site, and created your DataFrame):
from bs4 import BeautifulSoup

soup = BeautifulSoup(dataframe_name.to_html(), 'html.parser')
cells = []
for table in soup.find_all('table'):
    for tbody in table.find_all('tbody'):
        for td in tbody.find_all('td'):
            cells.append(td)

# select any cell by position and add/update a tag attribute
cells[td_index]['attribute_name'] = 'attribute_value'
This will add all of your table-data cells to a list, and you can select any element from that list and add or update a tag attribute.
(If there is a more efficient way, please comment to help improve.)

Pandas Google Distance Matrix API - Pass coordinates into URL

I am working with the Google Distance Matrix API, where I want to feed coordinates from a dataframe into the API and return the duration and distance between the two points.
Here is my dataframe:
import pandas as pd
import simplejson
import urllib
import numpy as np
Record orig_lat orig_lng dest_lat dest_lng
1 40.7484405 -74.0073127 40.7115242 -74.0145492
2 40.7421218 -73.9878531 40.7727216 -73.9863531
First, I need to combine orig_lat & orig_lng and dest_lat & dest_lng into strings, which are then passed into the URL. So I've tried creating the variables orig_coord & dest_coord, then passing them into the URL and returning values:
orig_coord = df[['orig_lat','orig_lng']].apply(lambda x: '{},{}'.format(x[0],x[1]), axis=1)
dest_coord = df[['dest_lat','dest_lng']].apply(lambda x: '{},{}'.format(x[0],x[1]), axis=1)

for row in df.itertuples():
    url = "http://maps.googleapis.com/maps/api/distancematrix/json?origins={0}&destinations={1}&units=imperial&MYGOOGLEAPIKEY".format(orig_coord,end_coord)
    result = simplejson.load(urllib.urlopen(url))
    df['driving_time_text'] = result['rows'][0]['elements'][0]['duration']['text']
But I get the following error: "TypeError: <lambda>() got an unexpected keyword argument 'axis'"
So my question is: how do I concatenate values from two columns into a string, then pass that string into a URL and output the result?
Thank you in advance!
Hmm, I am not sure how you constructed your data frame. Maybe post those details? But if you can live with referencing tuple elements positionally, this worked for me:
import pandas as pd

data = [{'orig_lat': 40.748441, 'orig_lng': -74.007313, 'dest_lat': 40.711524, 'dest_lng': -74.014549},
        {'orig_lat': 40.742122, 'orig_lng': -73.987853, 'dest_lat': 40.772722, 'dest_lng': -73.986353}]
df = pd.DataFrame(data)

for row in df.itertuples():
    orig_coord = '{},{}'.format(row[1], row[2])
    dest_coord = '{},{}'.format(row[3], row[4])
    url = "http://maps.googleapis.com/maps/api/distancematrix/json?origins={0}&destinations={1}&units=imperial&MYGOOGLEAPIKEY".format(orig_coord, dest_coord)
    print(url)
produces
http://maps.googleapis.com/maps/api/distancematrix/json?origins=40.748441,-74.007313&destinations=40.711524,-74.014549&units=imperial&MYGOOGLEAPIKEY
http://maps.googleapis.com/maps/api/distancematrix/json?origins=40.742122,-73.987853&destinations=40.772722,-73.986353&units=imperial&MYGOOGLEAPIKEY
To update the data frame with the result, since row is a tuple and not writeable, you might want to keep track of the current index as you iterate. Maybe something like this:
data = [{'orig_lat': 40.748441, 'orig_lng': -74.007313, 'dest_lat': 40.711524, 'dest_lng': -74.014549, 'result': -1},
        {'orig_lat': 40.742122, 'orig_lng': -73.987853, 'dest_lat': 40.772722, 'dest_lng': -73.986353, 'result': -1}]
df = pd.DataFrame(data)

i_row = 0
for row in df.itertuples():
    orig_coord = '{},{}'.format(row[1], row[2])
    dest_coord = '{},{}'.format(row[3], row[4])
    url = "http://maps.googleapis.com/maps/api/distancematrix/json?origins={0}&destinations={1}&units=imperial&MYGOOGLEAPIKEY".format(orig_coord, dest_coord)
    # Do stuff to get your result
    df.loc[i_row, 'result'] = result
    i_row += 1
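On current Python 3, the fetch itself could look roughly like this (a sketch, assuming the requests library; API_KEY is a hypothetical placeholder for your own key):

import pandas as pd
import requests

API_KEY = 'YOUR_KEY_HERE'  # hypothetical placeholder, not a real key

durations = []
for row in df.itertuples():
    params = {
        'origins': '{},{}'.format(row.orig_lat, row.orig_lng),
        'destinations': '{},{}'.format(row.dest_lat, row.dest_lng),
        'units': 'imperial',
        'key': API_KEY,
    }
    resp = requests.get('https://maps.googleapis.com/maps/api/distancematrix/json',
                        params=params).json()
    durations.append(resp['rows'][0]['elements'][0]['duration']['text'])

df['driving_time_text'] = durations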
