Delete rows based on multiple values as fast as possible - python

I have a dataframe containing a column named Url with around 500K URLs. I want to delete all the rows whose URL contains one of a set of domain names like amazon.com, ebay.com, bestbuy.com and so on.
The URLs can look like the ones below:
https://www.homedepot.com/b/Plumbing-Water-Pumps-Sump-Pumps-Submersible-Sump-Pumps/N-5yc1vZbqj6
https://images.homedepot-static.com/catalog/pdfImages/ba/ba1bd2c2-82ea-4510-85c8-333392e70a23.pdf
https://us.amazon.com/Simer-A5000-Hole-Battery-System/dp/B000DZKXC2
https://www.amazon.com/Hydromatic-DPS41-Diaphragm-Switch-Range/dp/B009S0NS0C
So the domain can also appear as part of a subdomain, and the URL may or may not include http, https, www, or a country-specific top-level domain such as .co.uk, .co.nz and so on.
So I need a universal solution that deletes any URL whose domain name is present in the exclude-sites list.
I already created a function for this which works fine for a smaller data set, but it couldn't clean the 500K rows even after running for 5 straight hours.
Here is the function I am using:
exclude_sites = ['amazon.com', 'amzn.to', 'ebay.com', 'walmart.com', 'sears.com', 'costco.com', 'youtube.com', 'lowes.com', 'homedepot.com', 'bestbuy.com']

def exclude_urls(exclude_sites, df):
    i = []
    for row in df['Url']:
        if any(url in row for url in exclude_sites):
            i.extend(df.index[df['Url'] == row])
    # reset index
    return i

df = df.drop(list(exclude_urls(exclude_sites, df)))
What would be the fastest way to drop the rows whose URL contains a domain from the exclude_sites list? Thank you in advance.

exclude_sites_escaped = [x.replace('.', r'\.') for x in exclude_sites]
df = df[~df['Url'].str.contains('|'.join(exclude_sites_escaped), regex=True)]
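For illustration, here is a minimal, self-contained sketch of that vectorized approach; the toy DataFrame is made up for the example, and re.escape is used instead of the manual replace so any regex metacharacter in a domain is handled:

import re
import pandas as pd

exclude_sites = ['amazon.com', 'amzn.to', 'ebay.com', 'homedepot.com']

# Toy data standing in for the 500K-row frame
df = pd.DataFrame({'Url': [
    'https://us.amazon.com/Simer-A5000/dp/B000DZKXC2',
    'https://www.homedepot.com/b/Plumbing',
    'https://www.example.org/keep-this-row',
]})

# Build one alternation pattern; str.contains then scans the whole column
# in vectorized code instead of looping row by row in Python.
pattern = '|'.join(re.escape(site) for site in exclude_sites)
df = df[~df['Url'].str.contains(pattern, regex=True)]
print(df)  # only the example.org row remains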

Related

Is it possible to create a unique Django path for each row in a pandas dataframe?

I have a Django project that takes information in each row of a pandas dataframe and displays the contents on the page. I'd like to separate each row of the dataframe into its own page with its own unique path. There are too many rows in the dataframe to hard code each row in the context of the views.py file. How would I go about creating a unique url path for each row? Is this even possible?
You probably want to do something like embed the row number as a parameter in the URL,
like
http://example.com/pandasdata/<row>
In your view you would extract the row number from the URL, pull only that row from the pandas dataframe, and display it.
In your application's urls.py file, add the following:
path('pandasdata/<int:rowid>', views.row)
In your application's views.py file, add the following:
def row(request, rowid):
    # Add code to extract the row and display it here
Make sure the parameter name is the same in both places (rowid).
The answer above is exactly correct and just what I was looking for. I'd just like to elaborate on it for anyone stumbling across this question in the future.
As stated in the answer, add a path in the urls.py file that allows you to extract the row information. Something like:
path("data/<int:row_id>/", views.app)
Then in your views.py file, you'll be able to access the row information like this:
def func(request, row_id):
    df1 = df.iloc[row_id]
    return render(request, "data/app.html", {
        "col1": df1['col1'],
        "col2": df1['col2'],
    })
Now when you visit a path like http://example.com/data/100/, it will load the row with that index from the dataframe, along with whatever column values you have set in the context. If the number is outside the range of rows in your dataframe, it will raise an error; see the sketch below for one way to handle that.
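As a rough sketch (the view name, template path and the module-level df are carried over from the example above and are only placeholders), you could catch an out-of-range index and return a 404 instead:

from django.http import Http404
from django.shortcuts import render

def func(request, row_id):
    # Guard against indexes that are not present in the dataframe
    if row_id < 0 or row_id >= len(df):
        raise Http404("No such row")
    df1 = df.iloc[row_id]
    return render(request, "data/app.html", {
        "col1": df1['col1'],
        "col2": df1['col2'],
    })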
Thanks to WombatPM for the original answer!

Classify column with URLs into categories base on terms

I'm still a newbie in Python and having a hard time trying to code something.
I have a list with more than 80k URLs, and this is the only thing in my .xls. The URLs look like this:
https://domainexample.com/user-query/credit-card-debit-balance/
https://domainexample.com/user-query/second-invoice-current-debt/
https://domainexample.com/user-query/query-balances/
https://domainexample.com/user-query/where-is-client-portal/
https://domainexample.com/user-query/i-want-to-change-my-password/
https://domainexample.com/user-query/second-invoice-internet/
https://domainexample.com/user-query/print-payment-invoice/
I want to write code that will read this Excel file and, based on certain categories I already wrote, put the URLs in other columns.
So whenever the code finds "password" it will put that URL in the column "password", and when it finds "user" it will put the URL in the column "user".
It would look like this:
debt
https://domainexample.com/user-query/second-invoice-current-debt/
password
https://domainexample.com/user-query/i-want-to-change-my-password/
payment
https://domainexample.com/user-query/print-payment-invoice/
The code doesn't necessarily need to move the URLs to another column; if it can create a second column and write which category each URL belongs to, that would also be great.
There is no need for the code to fetch the URLs, just to read the Excel file and treat the URLs as plain text.
If anyone can help me, thanks a lot!
Try this, where df is your dataframe and 'url_column' is the column with all your URLs:
df.loc[df['url_column'] =='url.com/what-is-a-car', 'car'] = 'url.com/'+'car'
df.loc[df['url_column'] =='url.com/what-is-a-bike', 'bike'] = 'url.com/'+'bike'
df.loc[df['url_column'] =='url.com/what-is-a-van', 'van'] = 'url.com/'+'van'
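A more scalable variant, sketched below under the assumption that the categories are simple substrings of the URL and that the column is called 'url_column' (both names are just placeholders), tags every row in one pass with str.contains:

import pandas as pd

df = pd.DataFrame({'url_column': [
    'https://domainexample.com/user-query/i-want-to-change-my-password/',
    'https://domainexample.com/user-query/second-invoice-current-debt/',
    'https://domainexample.com/user-query/print-payment-invoice/',
]})

categories = ['password', 'debt', 'payment']

# Write the matched term into a new "category" column; URLs that match
# nothing stay NaN, and a URL matching several terms keeps the last one.
for term in categories:
    df.loc[df['url_column'].str.contains(term, case=False), 'category'] = term

print(df)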

Python - How to read a csv file separated by commas which has commas within the values?

The file has a URL which contains commas. For example:
~oref=https://tuclothing.tests.co.uk/c/Girls/Girls_Underwear_Socks&Tights?INITD=GNav-CW-GrlsUnderwear&title=Underwear,+Socks+&+Tights
Between Underwear and +Socks there is a comma which is making my life difficult.
Is there a way to indicate to the reader (pandas, the csv module, etc.) that the whole URL is just one value?
This is a bigger sample with columns and values:
Event Time,User ID,Advertiser ID,TRAN Value,Other Data,ORD Value,Interaction Time,Conversion ID,Segment Value 1,Floodlight Configuration,Event Type,Event Sub-Type,DBM Auction ID,DBM Request Time,DBM Billable Cost (Partner Currency),DBM Billable Cost (Advertiser Currency),
1.47E+15,CAESEKoMzQamRFTrkbdTDT5F-gM,2934701,,~oref=https://tuclothing.tests.co.uk/c/NewIn/NewIn_Womens?q=%3AnewArrivals&page=2&size=24,4.60E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEKQhGXdLq0FitBKF5EPPfgs,2934701,,~oref=https://tuclothing.tests.co.uk/c/Women/Women_Accessories?INITD=GNav-WW-Accesrs&q=%3AnewArrivals&title=Accessories&mkwid=sv5biFf2y_dm&pcrid=90361315613&pkw=leather%20bag&pmt=e&med=Search&src=Google&adg=Womens_Accessories&kw=leather+bag&cmp=TU_Women_Accessories&adb_src=4,4.73E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEEpNRaLne21k6juip9qfAos,2934701,,num=16512910;~oref=https://tuclothing.tests.co.uk/,1,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEJ3a2YRrPSSeeRUFHDSoXNQ,2934701,,~oref=https://tuclothing.tests.co.uk/c/Girls/Girls_Underwear_Socks&Tights?INITD=GNav-CW-GrlsUnderwear&title=Underwear,+Socks+&+Tights,8.12E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,0,0,0
1.47E+15,CAESEGmwaNjTvIrQ3MoIvqiRC8U,2934701,,~oref=https://tuclothing.tests.co.uk/login/checkout,1.75E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESEM3G-Nh6Q0OhboLyOhtmtiI,2934701,,~oref=https://3984747.fls.doubleclick.net/activityi;~oref=http%3A%2F%2Fwww.tests.co.uk%2Fshop%2Fgb%2Fgroceries%2Ffrozen-%2Fbeef--pork---lamb,3.74E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
1.47E+15,CAESENlK7oc-ygl637Y2is3a90c,2934701,,~oref=https://tuclothing.tests.co.uk/,5.10E+12,1.47E+15,1,0,940892,CONVERSION,POSTCLICK,,,0,0,
It looks like, in this case, the only comma you are having issues with is located in a URL. You could run your csv file through a preprocessor that strips out the commas in your URLs, or URL-encodes them.
Personally, I would opt for the URL-encoding method, which converts the comma to %2C. This way you don't have a comma in the URL when you start reading your csv row values, yet the URL still retains its working link to the reference/destination page.
If you had this issue with other fields (not a URL), or in other unknown/random locations in the csv row, then the solution would not be easy at all. But since you know exactly where the issue occurs each time, you can look for that character and replace it when it is found in that particular field.
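As a rough sketch of that preprocessing idea, assuming the extra commas only ever appear in the fifth field (the Other Data / ~oref field) and that the file is called report.csv (both assumptions are just for illustration), you can rejoin the split-up URL pieces and encode the commas before handing the text to pandas:

import io
import pandas as pd

with open("report.csv") as f:
    lines = f.read().splitlines()

n_cols = lines[0].count(",") + 1   # expected field count, taken from the header
url_col = 4                        # zero-based index of the field holding the URL

fixed_rows = [lines[0]]
for line in lines[1:]:
    parts = line.split(",")
    extra = len(parts) - n_cols
    if extra > 0:
        # Re-join the pieces that belong to the URL field, encoding the commas.
        parts[url_col:url_col + extra + 1] = ["%2C".join(parts[url_col:url_col + extra + 1])]
    fixed_rows.append(",".join(parts))

df = pd.read_csv(io.StringIO("\n".join(fixed_rows)))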

Some simple examples of Smartsheet API using the Python SDK

I am a newbie to the Smartsheet Python SDK. Using the sample code from the Smartsheet API docs as a starting point:
action = smartsheet.Sheets.list_sheets(include_all=True)
sheets = action.data
This code returns a response just fine.
I am now looking for some simple examples to iterate over the sheets, i.e.:
for sheet in sheets:
then select a sheet by name,
then iterate over the rows in the selected sheet and select a row:
for row in rows:
then retrieve a cell value from the selected row of the selected sheet.
I just need some simple samples to get started. I have searched far and wide and have been unable to find any simple examples of how to do this.
Thanks!
As Scott said, a sheet could return a lot of data, so make sure that you use filters judiciously. Here is an example of some code I wrote to pull two rows but only one column in each row:
action = smartsheet.Sheets.get_sheet(SHEET_ID, column_ids=COL_ID, row_numbers="2,4")
Details on the available filters can be found here.
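As a rough usage sketch (SHEET_ID and COL_ID stand in for real IDs), the rows and cells of the filtered result can then be walked like any other sheet:

action = smartsheet.Sheets.get_sheet(SHEET_ID, column_ids=COL_ID, row_numbers="2,4")
for row in action.rows:
    for cell in row.cells:
        print(cell.value)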
UPDATE: more code added in order to follow site etiquette and provide a complete answer.
The first thing I did while learning the API is display a list of all my sheets and their corresponding sheetId.
action = MySS.Sheets.list_sheets(include_all=True)
for single_sheet in action.data:
    print single_sheet.id, single_sheet.name
From that list I determined the sheetId for the sheet I want to pull data from. In my example, I actually needed to pull the primary column, so I used this code to determine the Id of the primary column (and also saved the non-primary column Ids in a list because at the time I thought I might need them):
PrimaryCol = 0
NonPrimaryCol = []
MyColumns = MySS.Sheets.get_columns(SHEET_ID)
for MyCol in MyColumns.data:
    if MyCol.primary:
        print "Found primary column", MyCol.id
        PrimaryCol = MyCol.id
    else:
        NonPrimaryCol.append(MyCol.id)
Lastly, keeping in mind that retrieving an entire sheet could return a lot of data, I used a filter to return only the data in the primary column:
MySheet = MySS.Sheets.get_sheet(SHEET_ID, column_ids=PrimaryCol)
for MyRow in MySheet.rows:
    for MyCell in MyRow.cells:
        print MyRow.id, MyCell.value
Below is a very simple example. Most of this is standard Python, but one somewhat non-intuitive thing may be the fact that the sheet objects in the list returned from smartsheet.Sheets.list_sheets don't include the rows & cells. As that could be a lot of data, list_sheets only returns information about each sheet, which you can use to retrieve a sheet's complete data by calling smartsheet.Sheets.get_sheet.
To better understand things such as this, be sure to keep the Smartsheet REST API reference handy. Since the SDK is really just calling this API under the covers, you can often find more information by look at that documentation as well.
action = smartsheet.Sheets.list_sheets(include_all=True)
sheets = action.data
for sheetInfo in sheets:
    if sheetInfo.name == 'WIP':
        sheet = smartsheet.Sheets.get_sheet(sheetInfo.id)
        for row in sheet.rows:
            if row.row_number == 2:
                for c in range(0, len(sheet.columns)):
                    print row.cells[c].value
I started working with the Python API because of our use of Smartsheet. Since we used Smartsheet to back some of our RIO2016 Olympic Games operations, every now and then we had to delete the oldest sheets to stay within our licence limits. Doing that by hand was a chore: log in, select each sheet among 300, check every field, and so on. Thanks to the Smartsheet API 2.0, we could easily find out how many sheets we had used so far, get all the 'modified' dates, sort by that column from the oldest to the most recent, and then write the result to a CSV file on disk. I am not sure this is the best approach, but it worked as I expected. I use Idle-python2.7 on Debian 8.5. Here you are:
# -*- coding: utf-8 -*-
#!/usr/bin/python
'''
Create an instance of the Smartsheet object,
then populate a list of sheets with their name and modified date.
A token is necessary to access Smartsheet.
We create and return a list of all objects with the aforesaid fields.
'''
# The library
import smartsheet, csv
'''
Token var. This token can be obtained in
Account -> Settings -> Apps... -> API
from a valid Smartsheet account.
'''
xMyToken = 'xxxxxxxxxxxxxxxxxxxxxx'
# Smartsheet client object
xSheet = smartsheet.Smartsheet(xMyToken)
# The result of listing all sheets
xResult = xSheet.Sheets.list_sheets(include_all=True)
# The list
xList = []
'''
For each sheet element we keep two fields, namely the name and the date of
modification. As most of our vocabulary has special characters, we encode
each spreadsheet name as utf-8. So for each sheet read from the result:
'''
for sheet1 in xResult.data:
    xList.append((sheet1._name.encode('utf-8'), sheet1._modified_at))
# sort the list by the 'modified at' attribute
xNList = sorted(xList, key=lambda x: x[1])
# print the list
for key, value in xNList:
    print key, value
# Finally write to disk
with open("listofsmartsh.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(xNList)
Hope you enjoy.
regards

Quandl data, API call

Recently I have been reading a stock prices database on Quandl, using an API call to extract the data. But I am really confused by the example I have.
import requests
api_url = 'https://www.quandl.com/api/v1/datasets/WIKI/%s.json' % stock
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=3))
raw_data = session.get(api_url)
Can anyone explain that to me?
1) For api_url, if I open that webpage it says 404 not found. So if I want to use another database, how do I prepare this api_url? What does '% stock' mean?
2) Here requests seems to be used to extract the data. What is the format of raw_data? How do I know the column names? How do I extract the columns?
To expand on my comment above:
% stock is a string formatting operation, replacing %s in the preceding string with the value referenced by stock. Further details can be found here.
raw_data actually references a Response object (part of the requests module - details found here).
To expand on your code:
import requests
#Set the stock we are interested in, AAPL is Apple stock code
stock = 'AAPL'
#Your code
api_url = 'https://www.quandl.com/api/v1/datasets/WIKI/%s.json' % stock
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=3))
raw_data = session.get(api_url)
# Probably want to check that requests.Response is 200 - OK here
# to make sure we got the content successfully.
# requests.Response has a function to return json file as python dict
aapl_stock = raw_data.json()
# We can then look at the keys to see what we have access to
aapl_stock.keys()
# column_names Seems to be describing the individual data points
aapl_stock['column_names']
# A big list of data, lets just look at the first ten points...
aapl_stock['data'][0:10]
Edit to answer question in comment
So aapl_stock['column_names'] shows Date and Open as the first and second values respectively. This means they correspond to positions 0 and 1 in each element of the data.
Therefore, to get the date of each of the first ten items use [row[0] for row in aapl_stock['data'][0:10]], and to get the open value of the first 78 items use [row[1] for row in aapl_stock['data'][0:78]].
To get a list of every value in the dataset, where each element is a list with the values for Date and Open, you could add something like aapl_date_open = [row[0:2] for row in aapl_stock['data']].
If you are new to python I seriously recommend looking at the list slice notation, a quick intro can be found here
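If you want the data in tabular form, a short sketch (assuming the response really contains the 'column_names' and 'data' keys inspected above) is to load it straight into a pandas DataFrame:

import pandas as pd

# Build a DataFrame from the keys inspected above.
df = pd.DataFrame(aapl_stock['data'], columns=aapl_stock['column_names'])
print(df[['Date', 'Open']].head(10))   # first ten Date/Open rows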
