Classify column with URLs into categories based on terms - python

I'm still a newbie in Python and I'm having a hard time trying to code something.
I have a list of more than 80k URLs, and this is the only thing in my .xls file; the URLs look like this:
https://domainexample.com/user-query/credit-card-debit-balance/
https://domainexample.com/user-query/second-invoice-current-debt/
https://domainexample.com/user-query/query-balances/
https://domainexample.com/user-query/where-is-client-portal/
https://domainexample.com/user-query/i-want-to-change-my-password/
https://domainexample.com/user-query/second-invoice-internet/
https://domainexample.com/user-query/print-payment-invoice/
I want to write code that reads this Excel file and, based on certain categories I already wrote, puts the URLs in other columns.
So, whenever the code finds "password" it will put that URL in the "password" column, and when it finds "user" it will put the URL in the "user" column.
It would look like this:
debt
https://domainexample.com/user-query/second-invoice-current-debt/
password
https://domainexample.com/user-query/i-want-to-change-my-password/
payment
https://domainexample.com/user-query/print-payment-invoice/
The code doesn't necessarily need to change the column of the URLs; if it can create a second column and write which category each URL belongs to, that would also be great.
There is no need for the code to fetch the URLs, just to read the Excel file, treating the URLs as plain text.
If anyone can help me, thanks a lot!

Try this, where df is your dataframe and 'url_column' is the column with all your URLs:
df.loc[df['url_column'] =='url.com/what-is-a-car', 'car'] = 'url.com/'+'car'
df.loc[df['url_column'] =='url.com/what-is-a-bike', 'bike'] = 'url.com/'+'bike'
df.loc[df['url_column'] =='url.com/what-is-a-van', 'van'] = 'url.com/'+'van'
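If you want keyword matching rather than exact-URL matching, here is a minimal sketch; the file name 'urls.xls', the column name 'url_column', and the keyword list are assumptions to adjust to your sheet:
import pandas as pd

df = pd.read_excel('urls.xls')  # hypothetical file name
keywords = ['debt', 'password', 'payment', 'user']

def categorize(url):
    # Return the first keyword found in the URL, or 'other' if none match
    for keyword in keywords:
        if keyword in url:
            return keyword
    return 'other'

df['category'] = df['url_column'].apply(categorize)
df.to_excel('urls_categorized.xlsx', index=False)
This writes a second column with each URL's category, which matches the "2nd column" variant described above.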

Related

python parsing SOAP protocol response headers and values

I'm writing the question one more time because the first time the problem was not clear.
I'm trying to extract output from a SOAP protocol response using Python.
The output is structured with a lot of subcategories, and this is where I can't extract the information properly.
The output I have is very long, so here is a short example from soapenv:Body.
The original SOAP body:
<mc:ocb-rule id="boic">
  <mc:id>boic</mc:id>
  <mc:ocb-conditions>
    <mc:rule-deactivated>true</mc:rule-deactivated>
    <mc:international>true</mc:international>
  </mc:ocb-conditions>
  <mc:cb-actions>
    <mc:allow>false</mc:allow>
  </mc:cb-actions>
</mc:ocb-rule>
As you can see, I also used the command
xmltodict.parse(soap_response)
to turn the output into a dictionary:
OrderedDict([('#id', 'boic'),
             ('mc:id', 'boic'),
             ('mc:ocb-conditions',
              OrderedDict([('mc:rule-deactivated', 'true'),
                           ('mc:international', 'true')])),
             ('mc:cb-actions',
              OrderedDict([('mc:allow', 'false')]))])
If there is an easier way, I would appreciate guidance.
As I mentioned, my goal is ultimately to give each category its own value, and when there is a subcategory, to add a separate column for it.
In the end I need to put all the values into a DataFrame and display all the categories and their values, for example:
[table example image]
I really hope that this time I was able to explain myself properly.
Thanks in advance.
I am trying to parse the SOAP response and insert all the headers and values into a data frame.
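One possible approach (a minimal sketch, assuming the body shown above; the helper name flatten and the dotted-key separator are my own choices): parse with xmltodict, recursively flatten the nested dictionaries, then build a one-row DataFrame.
import pandas as pd
import xmltodict

soap_body = (
    '<mc:ocb-rule id="boic"><mc:id>boic</mc:id>'
    '<mc:ocb-conditions><mc:rule-deactivated>true</mc:rule-deactivated>'
    '<mc:international>true</mc:international></mc:ocb-conditions>'
    '<mc:cb-actions><mc:allow>false</mc:allow></mc:cb-actions></mc:ocb-rule>'
)

def flatten(d, parent_key='', sep='.'):
    # Recursively merge nested dicts into one level with dotted keys
    items = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

flat = flatten(xmltodict.parse(soap_body))
df = pd.DataFrame([flat])
print(df.T)  # one row per flattened key, easier to read
Each subcategory becomes its own column (e.g. mc:ocb-rule.mc:ocb-conditions.mc:rule-deactivated), which matches the goal of a separate column per subcategory.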

Delete rows based on multiple values as fast as possible

I have a dataframe with a column named Url. There are around 500K URLs. I want to delete all the rows where the Url contains some domain name like amazon.com, ebay.com, bestbuy.com and so on.
Those URLs can look like the ones below:
https://www.homedepot.com/b/Plumbing-Water-Pumps-Sump-Pumps-Submersible-Sump-Pumps/N-5yc1vZbqj6
https://images.homedepot-static.com/catalog/pdfImages/ba/ba1bd2c2-82ea-4510-85c8-333392e70a23.pdf
https://us.amazon.com/Simer-A5000-Hole-Battery-System/dp/B000DZKXC2
https://www.amazon.com/Hydromatic-DPS41-Diaphragm-Switch-Range/dp/B009S0NS0C
So the domain can be present as a subdomain too. A Url may or may not include http, https, www, or a top-level country domain like .co.uk, .co.nz and so on.
So I need a universal solution that deletes any URL whose domain name is present in the exclude_sites list.
I already created a function for it which works fine for smaller data sets, but it couldn't clean the 500K rows even after running for 5 hours straight.
Here is the function I am using:
exclude_sites = ['amazon.com', 'amzn.to', 'ebay.com', 'walmart.com', 'sears.com',
                 'costco.com', 'youtube.com', 'lowes.com', 'homedepot.com', 'bestbuy.com']

def exclude_urls(exclude_sites, df):
    i = []
    for row in df['Url']:
        if any(url in row for url in exclude_sites):
            i.extend(df.index[df['Url'] == row])
    # reset index
    return i

df = df.drop(list(exclude_urls(exclude_sites, df)))
What might be the fastest way to delete the rows whose domain appears in the exclude_sites list? Thank you in advance.
exclude_sites_escaped = [x.replace('.', r'\.') for x in exclude_sites]
df = df[~df['Url'].str.contains('|'.join(exclude_sites_escaped), regex=True)]
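A single vectorized str.contains pass avoids re-scanning the whole column once per row, which is the likely reason the original loop ran for hours. If substring matching is too loose (e.g. 'amazon.com' would also match a URL containing 'amazon.com.evil.com'), a stricter sketch using the third-party tldextract package (an assumption; it must be installed separately) compares the registered domain itself:
import pandas as pd
import tldextract

exclude_sites = {'amazon.com', 'amzn.to', 'ebay.com', 'homedepot.com'}

def registered_domain(url):
    # tldextract splits subdomain / domain / suffix, handling .co.uk etc.
    ext = tldextract.extract(url)
    return f"{ext.domain}.{ext.suffix}"

df = df[~df['Url'].map(registered_domain).isin(exclude_sites)]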

Is it possible to create a unique Django path for each row in a pandas dataframe?

I have a Django project that takes the information in each row of a pandas dataframe and displays it on a page. I'd like to give each row of the dataframe its own page with its own unique path. There are too many rows in the dataframe to hard-code each one in the context of the views.py file. How would I go about creating a unique URL path for each row? Is this even possible?
You probably want to embed the row as a parameter in the URL, like:
http://example.com/pandasdata/<row>
In your view you would extract the row number from the URL and pull only that row from the pandas dataframe to display.
In your application's urls.py file, add the following code:
path('pandasdata/<int:rowid>', views.row)
In your application's views.py file, add the following code:
def row(request, rowid):
    # Add code to extract the row and display it here
Make sure the name is the same in both places (rowid).
The answer above is exactly correct and just what I was looking for. I'd just like to elaborate on this for anyone stumbling across this question in the future.
As stated in the answer, in the urls.py file add the path that allows you to extract row information. Something like:
path("data/<int:row_id>/", views.app)
Then in your views.py file, you'll be able to access the row information like this:
def func(request, row_id):
    df1 = df.iloc[row_id]
    return render(request, "data/app.html", {
        "col1": df1['col1'],
        "col2": df1['col2'],
    })
Now when you visit a path like http://example.com/data/100/, it will load the row with that index from the dataframe, along with whatever column values you have put in the context. If the number is outside the range of rows in your dataframe, it will raise an error.
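To fail more gracefully on an out-of-range index, one option (a sketch building on the view above, raising Http404 instead of letting the bare IndexError escape) is:
from django.http import Http404
from django.shortcuts import render

def func(request, row_id):
    # Guard against indexes past the end of the dataframe
    if row_id < 0 or row_id >= len(df):
        raise Http404("No such row")
    row = df.iloc[row_id]
    return render(request, "data/app.html", {
        "col1": row['col1'],
        "col2": row['col2'],
    })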
Thanks to WombatPM for the original answer!

webbrowser.open on DataFrame entry

I have a pandas DataFrame like this:
listings_df = pd.DataFrame({'prices': prices,
                            'listing_links': listing_links,
                            'photo_links': photo_links,
                            'listing_names': listing_names})
The photo_links list contains URLs to photos. Say I want to get a link straight from the dataframe and open it in a web browser like this:
link_to_open = listings_df.loc[1:1,'photo_links']
webbrowser.open(link_to_open)
However, the link does not open and I get a 404 error, because the link appears in the dataframe (or at least prints) in a shortened version:
https://a0.muscache.com/im/pictures/70976075/b
versus the original link as it is stored in the photo_links list:
https://a0.muscache.com/im/pictures/70976075/b20d9efc_original.jpg?aki_policy=large
The question is, how do I access full link from within dataframe?
link_to_open = listings_df.loc[1:1,'photo_links'] returns a Series object, and its string representation truncates long values; that is where the shortened link comes from. The full string is still stored.
Try this instead:
link_to_open = listings_df.loc[1, 'photo_links']
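For completeness, a minimal sketch of the fixed flow; the truncation is display-only, and you can also widen the printed output with pandas' display.max_colwidth option:
import webbrowser
import pandas as pd

pd.set_option('display.max_colwidth', None)  # optional: print full strings

# .loc[1, 'photo_links'] returns the scalar string, not a one-element Series
link_to_open = listings_df.loc[1, 'photo_links']
webbrowser.open(link_to_open)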

Technique to add csv data for varying fields

I am trying to scrape results from a site (no captcha, simple roll-no authentication, and the pattern of roll numbers is known to me). The problem is that they present the results in a table format and different students have different subjects. The code I wrote so far in Python is:
for row in rows:
    col = row.findAll('td')             # BeautifulSoup objects
    sub = col[1].text.encode('utf-8')   # header (subject name)
    subjectname.append(sub)
    marks = col[4].text.encode('utf-8')
    markall.append(marks)
csvwriter.writerows([subjectname,])
csvwriter.writerows([markall,])
I want to generate a .csv file so that I can do some data analysis on it. The problem is that I want a table with one column per subject and the marks under it, but the scraper doesn't know when it has hit a different subject and appends the marks of whatever subject it finds in that row/column pair.
How do I approach this?
Here's a visual representation of the problem.
So if I have Subject A at column 1, I want to get marks only of Subject A and not of any other subject. Do I need to create a list for all marks? (One per-subject approach is sketched after the edit below.)
Edit: here's the HTML table markup: https://jsfiddle.net/rpmgku7m/
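One way to handle the varying subjects (a sketch, assuming each student's result table yields (subject, mark) pairs as in the loop above; rows_per_student and the output file name are illustrative): collect each student's marks in a dict keyed by subject name, then let csv.DictWriter align the columns and fill the gaps.
import csv

results = []                    # one dict per student
for rows in rows_per_student:   # hypothetical: the table rows of one student
    student = {}
    for row in rows:
        col = row.findAll('td')
        student[col[1].text.strip()] = col[4].text.strip()
    results.append(student)

# Union of all subjects seen, so every subject gets its own column
fieldnames = sorted({subject for student in results for subject in student})

with open('marks.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(results)
This way a student who never took Subject A simply gets an empty cell in that column, instead of a mark from a different subject landing there.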
