Handling exceptions with df.apply - python

I am using the tld Python library to grab the first-level domain from proxy request logs using an apply function.
When I run into a strange request that tld doesn't know how to handle, like 'http:1 CON' or 'http:/login.cgi%00', I get an error message like this:
TldBadUrl: Is not a valid URL http:1 con!
TldBadUrl Traceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)
/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()
/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url,
fail_silently, fix_protocol, search_public, search_private, **kwargs)
385 fix_protocol=fix_protocol,
386 search_public=search_public,
--> 387 search_private=search_private
388 )
389
/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
289 return None, None, parsed_url
290 else:
--> 291 raise TldBadUrl(url=url)
292
293 domain_parts = domain_name.split('.')
In the meantime I have been weeding these out using many lines like the following, but there are hundreds or thousands of them in this dataset:
request_2 = request_1[request_1['request'] != 'http:1 CON']
request_2 = request_1[request_1['request'] != 'http:/login.cgi%00']
Dataframe:
request
request_url count
0 https://login.microsoftonline.com 24521
1 https://dt.adsafeprotected.com 11521
2 https://googleads.g.doubleclick.net 6252
3 https://fls-na.amazon.com 65225
4 https://v10.vortex-win.data.microsoft.com 7852222
5 https://ib.adnxs.com 12
The code:
from tld import get_tld
from tld import get_fld
from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd
import numpy as np
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')
#Remove rows where there were null values in the request column
request = request[pd.notnull(request['request'])]
#Reset index
request.reset_index(drop=True)
#Find the urls that contain IP addresses and exclude them from the new dataframe
request_1 = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]
#Reset index
request_1 = request_1.reset_index(drop=True)
#Apply the get_fld lib on the request column
new_fld_column = request_2['request'].apply(get_fld)
Is there any way to keep this error from firing and instead add the rows that would error to a separate dataframe?

If you wrap your function in a try-except clause, you can determine which rows error out by querying the rows that come back as NaN:
import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan
print(df)
request_url count
0 https://login.microsoftonline.com 24521
1 https://dt.adsafeprotected.com 11521
2 https://googleads.g.doubleclick.net 6252
3 https://fls-na.amazon.com 65225
4 https://v10.vortex-win.data.microsoft.com 7852222
5 https://ib.adnxs.com 12
6 http:1 CON 10
7 http:/login.cgi%00 200
df['flds'] = df['request_url'].apply(try_get_fld)
print(df['flds'])
0 microsoftonline.com
1 adsafeprotected.com
2 doubleclick.net
3 amazon.com
4 microsoft.com
5 adnxs.com
6 NaN
7 NaN
Name: flds, dtype: object
faulty_url_df = df[df['flds'].isna()]
print(faulty_url_df)
request_url count flds
6 http:1 CON 10 NaN
7 http:/login.cgi%00 200 NaN
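Alternatively, the traceback above shows that get_fld itself accepts a fail_silently parameter; with fail_silently=True it returns None instead of raising, which isna() also picks up. A minimal sketch of that route:

from tld import get_fld

# return None instead of raising TldBadUrl for malformed requests
df['flds'] = df['request_url'].apply(lambda u: get_fld(u, fail_silently=True))
faulty_url_df = df[df['flds'].isna()]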

Related

How to use the tabulate library?

I am trying to use tabulate with the zip_longest function. So I have it like this:
from __future__ import print_function
from tabulate import tabulate
from itertools import zip_longest
import itertools
import locale
import operator
import re
50 ="['INGBNL2A, VAT number: NL851703884B01 i\nTel, +31 (0}1 80 61 88 \n\nrut ard wegetables\n\x0c']"
fruit_words = ['Appels', 'Ananas', 'Peen Waspeen',
'Tomaten Cherry', 'Sinaasappels',
'Watermeloenen', 'Rettich', 'Peren', 'Peen', 'Mandarijnen', 'Meloenen', 'Grapefruit']
def total_amount_fruit_regex(format_=re.escape):
return r"(\d*(?:\.\d+)*)\s*~?=?\s*(" + '|'.join(
format_(word) for word in fruit_words) + ')'
def total_fruit_per_sort():
number_found = re.findall(total_amount_fruit_regex(), verdi50)
fruit_dict = {}
for n, f in number_found:
fruit_dict[f] = fruit_dict.get(f, 0) + int(n)
result = '\n'.join(f'{key}: {val}' for key, val in fruit_dict.items())
return result
def fruit_list(format_=re.escape):
return "|".join(format_(word) for word in fruit_words)
def findallfruit(regex):
return re.findall(regex, verdi50)
def verdi_total_number_fruit_regex():
return rf"(\d*(?:\.\d+)*)\s*\W+(?:{fruit_list()})"
def show_extracted_data_from_file():
regexes = [
verdi_total_number_fruit_regex(),
]
matches = [findallfruit(regex) for regex in regexes]
fruit_list = total_fruit_per_sort().split("\n")
return "\n".join(" \t ".join(items) for items in zip_longest(tabulate(*matches, fruit_list, headers=['header','header2'], fillvalue='', )))
print(show_extracted_data_from_file())
But then I get this error:
TypeError at /controlepunt140
tabulate() got multiple values for argument 'headers'
So how to improve this?
If you remove the tabulate function, then the format looks like this:
16 Watermeloenen: 466
360 Appels: 688
6 Sinaasappels: 803
75
9
688
22
80
160
320
160
61
So the expected output, with headers, is:
header1 header2
------- -------
16 Watermeloenen: 466
360 Appels: 688
6 Sinaasappels: 803
75
9
688
22
80
160
320
160
61
Like how it works in tabulate.
You should be passing a single table to the tabulate() function; passing multiple lists results in the TypeError: tabulate() got multiple values for argument 'headers' that you are seeing.
Update your return statement:
def show_extracted_data_from_file():
    regexes = [
        verdi_total_number_fruit_regex(),
    ]
    matches = [findallfruit(regex) for regex in regexes]
    fruit_list = total_fruit_per_sort().split("\n")
    return tabulate(zip_longest(*matches, fruit_list), headers=['header1','header2'])
Output:
header1 header2
--------- ------------------
16 Watermeloenen: 466
360 Appels: 688
6 Sinaasappels: 803
75
9
688
22
80
160
320
160
61
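For reference, a minimal, self-contained sketch of the calling convention tabulate() expects: the first positional argument is one table (an iterable of rows), and headers is a keyword argument. The sample values below are taken from the output shown above.

from itertools import zip_longest
from tabulate import tabulate

numbers = ['16', '360', '6', '75']
fruits = ['Watermeloenen: 466', 'Appels: 688', 'Sinaasappels: 803']
# zip_longest pairs the two columns row by row, padding the shorter one
rows = zip_longest(numbers, fruits, fillvalue='')
print(tabulate(rows, headers=['header1', 'header2']))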

Converting XML file to Data Frame in Python

I have to convert an XML file from a URL link to a dataframe.
I have written code which gives a dictionary from the XML file, but I am not able to convert it into a dataframe. Please suggest if there is any other way that is suitable for this XML file.
import requests
import xmltodict
import xml.etree.ElementTree as ET
import pandas as pd
xml_data=requests.get('http://wbes.nrldc.in/xml/FullSchedule-(130)-19-01-2021.xml')
root = ET.fromstring(xml_data.text)
root = ET.tostring(root, encoding='utf8', method='xml')
data_dict = dict(xmltodict.parse(root))
Consider parsing the data with DOM tools like etree (or the feature-rich, third-party lxml) and then build a list of dictionaries at the repeating <FullSchedule> element to be passed into the DataFrame constructor:
import urllib.request as rq
import xml.etree.ElementTree as et
import pandas as pd
url = "https://wbes.nrldc.in/xml/FullSchedule-(184)-30-01-2021.xml"
doc = rq.urlopen(url)
tree = et.fromstring(doc.read()) # NOTE: TAKES SEVERAL MINUTES DUE TO SIZE
data = [{t.tag: t.text.strip() if t.text is not None else None
         for t in fs.findall("*")}
        for fs in tree.findall(".//FullSchedule")]
df = pd.DataFrame(data)
df.shape
# (1152, 21)
df.columns
# Index(['Buyer', 'Seller', 'ScheduleName', 'ScheduleSubTypeName', 'ScheduleDate',
# 'ScheduleAmount', 'BuyerAmount', 'SellerAmount', 'PocInjectionLoss',
# 'PocDrawalLoss', 'StateInjectionLoss', 'StateDrawalLoss',
# 'DiscomInjectionLoss', 'DiscomDrawalLoss', 'Trader', 'LinkName',
# 'OffBarTotal', 'OffBarAllocatedFromPool', 'Open', 'Combined',
# 'ApprovalNo'], dtype='object')
Because <Buyer> and <Seller> contain nested elements, they are blank above, hence consider additional parsing and compilation. The only difference from the code above is the findall XPath.
data = [{t.tag: t.text.strip() if t.text is not None else None
         for t in fs.findall("*")}
        for fs in tree.findall(".//FullSchedule/Buyer")]
df = pd.DataFrame(data)
print(df)
# Acronym ParentState WBESParentStateAcronym
# 0 HARYANA HARYANA HARYANA_STATE
# 1 JK&LADAKH JAMMU AND KASHMIR JK&LADAKH_UT
# 2 UPPCL UTTAR PRADESH UTTARPRADESH_STATE
# 3 JK&LADAKH JAMMU AND KASHMIR JK&LADAKH_UT
# 4 UPPCL UTTAR PRADESH UTTARPRADESH_STATE
# ... ... ...
# 1147 CHANDIGARH CHANDIGARH CHANDIGARH_UT
# 1148 PUNJAB PUNJAB PUNJAB_STATE
# 1149 DELHI DELHI DELHI_UT
# 1150 HARYANA HARYANA HARYANA_STATE
# 1151 CHANDIGARH CHANDIGARH CHANDIGARH_UT
data = [{t.tag: t.text.strip() if t.text is not None else None
         for t in fs.findall("*")}
        for fs in tree.findall(".//FullSchedule/Seller")]
df = pd.DataFrame(data)
print(df)
# Acronym ParentState WBESParentStateAcronym
# 0 KAMENG None None
# 1 KAPS None None
# 2 VSTPS V None None
# 3 SOLAPUR None None
# 4 LARA-I None None
# ... ... ...
# 1147 NAPP None None
# 1148 BHAKRA None None
# 1149 CHAMERA3 None None
# 1150 RAPPC None None
# 1151 BHAKRA None None
By the way, pandas.read_xml() is in the works (authored by me) and uses the above algorithm, so the above may soon be handled with the code below. See the GitHub issues post.
url = "https://wbes.nrldc.in/xml/FullSchedule-(184)-30-01-2021.xml"
fs_df = pd.read_xml(url, xpath=".//FullSchedule", parser="lxml")
fs_df = pd.read_xml(url, xpath=".//FullSchedule", parser="etree")
buyer_df = pd.read_xml(url, xpath=".//FullSchedule/Buyer")
seller_df = pd.read_xml(url, xpath=".//FullSchedule/Seller")
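If you prefer to stay with the xmltodict approach from the question, a rough sketch like the following could also produce a DataFrame. Note that 'FullScheduleRoot' is a placeholder for whatever the document's actual root tag is (inspect it with list(data_dict.keys())).

import requests
import xmltodict
import pandas as pd

url = "https://wbes.nrldc.in/xml/FullSchedule-(184)-30-01-2021.xml"
data_dict = xmltodict.parse(requests.get(url).text)
# 'FullScheduleRoot' is hypothetical: substitute the real root tag of this document
records = data_dict['FullScheduleRoot']['FullSchedule']
df = pd.json_normalize(records)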

building panda dataframe from cloudant data, error: If using all scalar values, you must pass an index

I'm just starting with pandas, and none of the answers I found for this error message resolve my error. I'm trying to build a dataframe from a dictionary constructed from an IBM Cloudant query. I'm using a Jupyter notebook. The specific error message is: If using all scalar values, you must pass an index
The section of code where I think my error lies is here:
def read_high_low_temp(location):
    USERNAME = "*************"
    PASSWORD = "*************"
    client = Cloudant(USERNAME, PASSWORD, url = "https://**********")
    client.connect()
    my_database = client["temps"]
    query = Query(my_database, selector={'_id': {'$gt': 0}, 'l':location, 'd':dt.datetime.now().strftime("%m-%d-%Y")}, fields=['temp','t','d'], sort=[{'temp': 'desc'}])
    temp_dict={}
    temp_dict=query(limit=1000, skip=5)['docs']
    df = pd.DataFrame(columns = ['Temperature','Time','Date'])
    df.set_index('Time', inplace= True)
    for row in temp_dict:
        value_list.append(row['temp'])
        temp_df=pd.DataFrame({'Temperature':row['temp'],'Time':row['t'], 'Date':row['d']}, index=['Time'])
        df=df.append(temp_df)
    message="the highest temp in the " + location + " is: " + str(max(value_list)) + " the lowest " + str(min(value_list))
    return message, df
My data (output from Jupyter) looks like this:
Temperature Time Date
Time 51.6 05:07:18 12-31-2020
Time 51.6 04:59:00 12-31-2020
Time 51.5 04:50:31 12-31-2020
Time 51.5 05:15:38 12-31-2020
Time 51.5 05:03:09 12-31-2020
... ... ... ...
Time 45.3 11:56:34 12-31-2020
Time 45.3 11:52:22 12-31-2020
Time 45.3 11:14:15 12-31-2020
Time 45.2 10:32:05 12-31-2020
Time 45.2 10:36:22 12-31-2020
[164 rows x 3 columns]
My full code looks like this:
import numpy as np
import pandas as pd
import seaborn as sns
import os, shutil, glob, time, subprocess, re, sys, sqlite3, logging
#import RPi.GPIO as GPIO
from datetime import datetime
import datetime as dt
import cloudant
from cloudant.client import Cloudant
from cloudant.query import Query
from cloudant.result import QueryResult
from cloudant.error import ResultException
import seaborn as sns
def read_high_low_temp(location):
    USERNAME = "******"
    PASSWORD = "******"
    client = Cloudant(USERNAME, PASSWORD, url = "********")
    client.connect()
    # location='Backyard'
    my_database = client["temps"]
    query = Query(my_database, selector={'_id': {'$gt': 0}, 'l':location, 'd':dt.datetime.now().strftime("%m-%d-%Y")}, fields=['temp','t','d'], sort=[{'temp': 'desc'}])
    temp_dict={}
    temp_dict=query(limit=1000, skip=5)['docs']
    df = pd.DataFrame(columns = ['Temperature','Time','Date'])
    df.set_index('Time')
    for row in temp_dict:
        temp_df=pd.DataFrame({'Temperature':row['temp'],'Time':row['t'], 'Date':row['d']}, index=['Time'])
        df=df.append(temp_df)
    message="the highest temp in the " + location + " is: " + str(max(value_list)) + " the lowest " + str(min(value_list))
    return message, df
print ("Cloudant Jupyter Query test\nThe hour = ",dt.datetime.now().hour)
msg1, values=read_high_low_temp("Backyard")
print (msg1)
print(values)
sns.lineplot(values)
The full error message from Jupyter is:
C:\Users\ustl02870\AppData\Local\Programs\Python\Python37\lib\site-packages\seaborn\_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-34956d8dafb0> in <module>
53
54 #df = sns.load_dataset(values)
---> 55 sns.lineplot(values)
56 #print (values)
~\AppData\Local\Programs\Python\Python37\lib\site-packages\seaborn\_decorators.py in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f
48
~\AppData\Local\Programs\Python\Python37\lib\site-packages\seaborn\relational.py in lineplot(x, y, hue, size, style, data, palette, hue_order, hue_norm, sizes, size_order, size_norm, dashes, markers, style_order, units, estimator, ci, n_boot, seed, sort, err_style, err_kws, legend, ax, **kwargs)
686 data=data, variables=variables,
687 estimator=estimator, ci=ci, n_boot=n_boot, seed=seed,
--> 688 sort=sort, err_style=err_style, err_kws=err_kws, legend=legend,
689 )
690
~\AppData\Local\Programs\Python\Python37\lib\site-packages\seaborn\relational.py in __init__(self, data, variables, estimator, ci, n_boot, seed, sort, err_style, err_kws, legend)
365 )
366
--> 367 super().__init__(data=data, variables=variables)
368
369 self.estimator = estimator
~\AppData\Local\Programs\Python\Python37\lib\site-packages\seaborn\_core.py in __init__(self, data, variables)
602 def __init__(self, data=None, variables={}):
603
--> 604 self.assign_variables(data, variables)
605
606 for var, cls in self._semantic_mappings.items():
~\AppData\Local\Programs\Python\Python37\lib\site-packages\seaborn\_core.py in assign_variables(self, data, variables)
666 self.input_format = "long"
667 plot_data, variables = self._assign_variables_longform(
--> 668 data, **variables,
669 )
670
~\AppData\Local\Programs\Python\Python37\lib\site-packages\seaborn\_core.py in _assign_variables_longform(self, data, **kwargs)
924 # Construct a tidy plot DataFrame. This will convert a number of
925 # types automatically, aligning on index in case of pandas objects
--> 926 plot_data = pd.DataFrame(plot_data)
927
928 # Reduce the variables dictionary to fields with valid data
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
527
528 elif isinstance(data, dict):
--> 529 mgr = init_dict(data, index, columns, dtype=dtype)
530 elif isinstance(data, ma.MaskedArray):
531 import numpy.ma.mrecords as mrecords
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
285 arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
286 ]
--> 287 return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
288
289
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity)
78 # figure out the index, if necessary
79 if index is None:
---> 80 index = extract_index(arrays)
81 else:
82 index = ensure_index(index)
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\construction.py in extract_index(data)
389
390 if not indexes and not raw_lengths:
--> 391 raise ValueError("If using all scalar values, you must pass an index")
392
393 if have_series:
ValueError: If using all scalar values, you must pass an index
I resolved my problem with help/direction from #Ena; as it turned out, I made several mistakes. In layman's terms: 1) I was trying to plot a tuple when it should have been a dataframe, 2) my data was in a dictionary and I was iterating through it trying to build a tuple when I should have used the built-in pandas tools to build a dataframe straight from the dictionary, 3) my code should have been written so as NOT to have scalar values and therefore NOT need an index, and finally 4) I was trying to use a tuple as the data for my seaborn plot when it should have been a dataframe. Here is the code that now works.
#!/usr/bin/env python
# coding: utf-8
import numpy as np
import pandas as pd
import seaborn as sns
import os, shutil, glob, time, subprocess, sys
from datetime import datetime
import datetime as dt
from matplotlib import pyplot as plt
import cloudant
from cloudant.client import Cloudant
from cloudant.query import Query
from cloudant.result import QueryResult
from cloudant.error import ResultException
import seaborn as sns
def read_high_low_temp(location):
    USERNAME = "****************"
    PASSWORD = "*****************"
    client = Cloudant(USERNAME, PASSWORD, url = "**************************")
    client.connect()
    my_database = client["temps"]
    query = Query(my_database, selector={'_id': {'$gt': 0}, 'l':location, 'd':dt.datetime.now().strftime("%m-%d-%Y")}, fields=['temp','t','d'], sort=[{'t': 'asc'}])
    temp_dict={}
    temp_dict=query(limit=1000, skip=5)['docs']
    df = pd.DataFrame(temp_dict)
    value_list=[]
    for row in temp_dict:
        value_list.append(row['temp'])
    message="the highest temp in the " + location + " is: " + str(max(value_list)) + " the lowest " + str(min(value_list))
    return message, df
msg1, values=read_high_low_temp("Backyard")
g=sns.catplot(x='t', y='temp', data=values, kind='bar',color="darkblue",height=8.27, aspect=11.7/8.27)
print("the minimum temp is:", values['temp'].min(), " the maximum temp is:", values['temp'].max())
plt.xticks(rotation=45)
g.set(xlabel='Time', ylabel='Temperature')
plt.ylim(values['temp'].min()-1, values['temp'].max()+1)
plt.savefig("2021-01-01-temperature graph.png")
g.set_xticklabels(step=10)
The problem is that you assigned "Time" as an index everywhere. Look at how the data frame is laid out in the seaborn.lineplot documentation: https://seaborn.pydata.org/generated/seaborn.lineplot.html
Can you try without this df.set_index('Time') part?
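For context, here is a minimal sketch of where that ValueError comes from, and why building the frame straight from the list of query documents avoids it:

import pandas as pd

# A dict of scalars has no length, so pandas needs an explicit index:
# pd.DataFrame({'Temperature': 51.6, 'Time': '05:07:18'})            # ValueError
pd.DataFrame({'Temperature': 51.6, 'Time': '05:07:18'}, index=[0])   # works
# A list of dicts (one per document, as the Cloudant query returns) needs no index:
pd.DataFrame([{'Temperature': 51.6, 'Time': '05:07:18', 'Date': '12-31-2020'},
              {'Temperature': 51.5, 'Time': '04:59:00', 'Date': '12-31-2020'}])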

OverflowError when trying to convert generators to lists

I'm trying to extract dates from txt files using datefinder.find_dates, which returns a generator object. Everything works fine until I try to convert the generator to a list, when I get the following error.
I have been looking around but I can't figure out a solution to this; I'm not sure I really understand the problem either.
import datefinder
import glob
path = "some_path/*.txt"
files = glob.glob(path)
dates_dict = {}
for name in files:
    with open(name, encoding='utf8') as f:
        dates_dict[name] = list(datefinder.find_dates(f.read()))
Returns :
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-53-a4b508b01fe8> in <module>()
1 for name in files:
2 with open(name, encoding='utf8') as f:
----> 3 dates_dict[name] = list(datefinder.find_dates(f.read()))
C:\ProgramData\Anaconda3\lib\site-packages\datefinder\__init__.py in
find_dates(self, text, source, index, strict)
29 ):
30
---> 31 as_dt = self.parse_date_string(date_string, captures)
32 if as_dt is None:
33 ## Dateutil couldn't make heads or tails of it
C:\ProgramData\Anaconda3\lib\site-packages\datefinder\__init__.py in
parse_date_string(self, date_string, captures)
99 # otherwise self._find_and_replace method might corrupt
them
100 try:
--> 101 as_dt = parser.parse(date_string, default=self.base_date)
102 except ValueError:
103 # replace tokens that are problematic for dateutil
C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in
parse(timestr, parserinfo, **kwargs)
1354 return parser(parserinfo).parse(timestr, **kwargs)
1355 else:
-> 1356 return DEFAULTPARSER.parse(timestr, **kwargs)
1357
1358
C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in
parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
651 raise ValueError("String does not contain a date:",
timestr)
652
--> 653 ret = self._build_naive(res, default)
654
655 if not ignoretz:
C:\ProgramData\Anaconda3\lib\site-packages\dateutil\parser\_parser.py in
_build_naive(self, res, default)
1222 cday = default.day if res.day is None else res.day
1223
-> 1224 if cday > monthrange(cyear, cmonth)[1]:
1225 repl['day'] = monthrange(cyear, cmonth)[1]
1226
C:\ProgramData\Anaconda3\lib\calendar.py in monthrange(year, month)
122 if not 1 <= month <= 12:
123 raise IllegalMonthError(month)
--> 124 day1 = weekday(year, month, 1)
125 ndays = mdays[month] + (month == February and isleap(year))
126 return day1, ndays
C:\ProgramData\Anaconda3\lib\calendar.py in weekday(year, month, day)
114 """Return weekday (0-6 ~ Mon-Sun) for year (1970-...), month(1- 12),
115 day (1-31)."""
--> 116 return datetime.date(year, month, day).weekday()
117
118
OverflowError: Python int too large to convert to C long
Can someone explain this clearly?
Thanks in advance
RE-EDIT: After taking the remarks into consideration, I found a minimal, readable and verifiable example. The error occurs on:
import datefinder
generator = datefinder.find_dates("466990103060049")
for s in generator:
    pass
This looks to be a bug in the library you are using. It is trying to parse the string as a year, but the year is too big for Python's datetime to handle. The library that datefinder relies on (dateutil) documents that it raises an OverflowError in this situation, but datefinder ignores this possibility.
One quick and dirty hack just to get it working would be to do:
>>> datefinder.ValueError = ValueError, OverflowError
>>> list(datefinder.find_dates("2019/02/01 is a date and 466990103060049 is not"))
[datetime.datetime(2019, 2, 1, 0, 0)]
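If monkey-patching feels too fragile, a plainer workaround (a sketch, assuming it is acceptable to record no dates for a file that trips the bug) is to catch the OverflowError around the list() call in your original loop:

import datefinder
import glob

dates_dict = {}
for name in glob.glob("some_path/*.txt"):
    with open(name, encoding='utf8') as f:
        try:
            dates_dict[name] = list(datefinder.find_dates(f.read()))
        except (ValueError, OverflowError):
            # the file contained a string that dateutil cannot parse as a date
            dates_dict[name] = []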

How to extract only a specific part of url in Python and add its value as another column in df for every row?

I have a df containing user and url looking like this.
df
User Url
1 http://www.mycompany.com/Overview/Get
2 http://www.mycompany.com/News
3 http://www.mycompany.com/Accountinfo
4 http://www.mycompany.com/Personalinformation/Index
...
I want to add another column, page, that only takes the second part of the url, so I'd have it like this:
user url page
1 http://www.mycompany.com/Overview/Get Overview
2 http://www.mycompany.com/News News
3 http://www.mycompany.com/Accountinfo Accountinfo
4 http://www.mycompany.com/Personalinformation/Index Personalinformation
...
My code below is not working.
slashparts = df['url'].split('/')
df['page'] = slashparts[4]
The error I'm getting
AttributeError Traceback (most recent call last)
<ipython-input-23-0350a98a788c> in <module>()
----> 1 slashparts = df['request_url'].split('/')
2 df['page'] = slashparts[1]
~\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
4370 if
self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'split'
Use pandas text functions with str and, to select the 4th element of the split lists, use str[3], because Python counts from 0:
df['page'] = df['Url'].str.split('/').str[3]
Or, if performance is important, use a list comprehension:
df['page'] = [x.split('/')[3] for x in df['Url']]
print (df)
User Url \
0 1 http://www.mycompany.com/Overview/Get
1 2 http://www.mycompany.com/News
2 3 http://www.mycompany.com/Accountinfo
3 4 http://www.mycompany.com/Personalinformation/I...
page
0 Overview
1 News
2 Accountinfo
3 Personalinformation
I'm attempting to be a little more explicit, to handle cases where http might be missing and other variations:
pat = r'(?:https?://)?(?:www\.)?(?:\w+\.\w+\/)([^/]*)'
df.assign(page=df.Url.str.extract(pat, expand=False))
User Url page
0 1 http://www.mycompany.com/Overview/Get Overview
1 2 http://www.mycompany.com/News News
2 3 www.mycompany.com/Accountinfo Accountinfo
3 1 http://www.mycompany.com/Overview/Get Overview
4 2 mycompany.com/News News
5 3 https://www.mycompany.com/Accountinfo Accountinfo
6 4 http://www.mycompany.com/Personalinformation/I... Personalinformation
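Another option (a sketch, assuming every URL carries an http:// or https:// scheme so urlparse can separate host from path) is to use the standard library's urllib.parse instead of a regex:

from urllib.parse import urlparse

def first_path_segment(url):
    # path is e.g. '/Overview/Get'; keep the first non-empty piece
    parts = [p for p in urlparse(url).path.split('/') if p]
    return parts[0] if parts else None

df['page'] = df['Url'].apply(first_path_segment)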
