Extracting many URLs in a Python dataframe

I have a dataframe whose text column contains one or more URL(s):
user_id   text
1         blabla... http://amazon.com ...blabla
1         blabla... http://nasa.com ...blabla
2         blabla... https://google.com ...blabla ...https://yahoo.com ...blabla
2         blabla... https://fnac.com ...blabla ...
3         blabla....
I want to transform this dataframe into a count of URL(s) per user_id:
user_id   count_URL
1         2
2         3
3         0
Is there a simple way to perform this task in Python?
My code so far:
URL = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
for i in range(data.shape[0]):
    for j in range(0, 8):
        URL.iloc[i, j] = re.findall("(?P<url>https?://[^\s]+)", str(data.iloc[i]))
Thank you
Lionel

In general, the definition of a URL is much more complex than what you have in your example. Unless you are sure you have very simple URLs, you should look up a good pattern.
import re
URLPATTERN = r'(https?://\S+)' # Lousy, but...
First, extract the URLs from each string and count them:
df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
Next, group the counts by user id:
df.groupby('user_id')['urlcount'].sum()
#user_id
#1    2
#2    3
#3    0
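A side note: if you only need the counts, the intermediate lists can be skipped entirely; Series.str.count applies the same pattern and returns the per-row match count directly. A sketch assuming the same df and URLPATTERN as above:
df['urlcount'] = df['text'].str.count(URLPATTERN)
df.groupby('user_id')['urlcount'].sum()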

Here is another way to do it:
#read data
import pandas as pd
data = pd.read_csv("data.csv")

#divide data into URL and user_id and cast each to a pandas DataFrame
URL = pd.DataFrame(data.loc[:, "text"].values)
user_id = pd.DataFrame(data.loc[:, "user_id"].values)

#count the number of appearances of "http" in each row of data
sub = "http"
count_URL = []
for val in URL.iterrows():
    counter = val[1][0].count(sub)
    count_URL.append(counter)

#list to DataFrame
count_URL = pd.DataFrame(count_URL)

#concatenate the two dataframes and apply the code of @DyZ to group by and count the number of URLs
finalDF = pd.concat([user_id, count_URL], axis=1)
finalDF.columns = ["user_id", "urlcount"]
data = finalDF.groupby('user_id').sum()['urlcount']
print(data.head())
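For what it's worth, the counting loop above can likely be collapsed into a vectorized one-liner, with the same caveat as the original: counting the bare substring "http" also matches any non-URL occurrence of the word. A sketch assuming the same data.csv with text and user_id columns:
data = pd.read_csv("data.csv")
data['urlcount'] = data['text'].str.count("http")
print(data.groupby('user_id')['urlcount'].sum())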

Related

Extract values within the quote signs into two separate columns with Python

How can I extract the values within the quote signs into two separate columns with Python? The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'",
                   "'FRH03';'29340'", "'FRH05';'29350'", "'FRG02';'29360'"],
                  columns=['postcode'])
df
          postcode
0  'FRH02';'29290'
1  'FRH01';'29300'
2  'FRT02';'29310'
3  'FRH03';'29340'
4  'FRH05';'29350'
5  'FRG02';'29360'
I would like to get an output like the one below:
postcode1  postcode2
FRH02      29290
FRH01      29300
FRT02      29310
FRH03      29340
FRH05      29350
FRG02      29360
I have tried several str.extract calls but haven't been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd

df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'",
                   "'FRH03';'29340'", "'FRH05';'29350'", "'FRG02';'29360'"],
                  columns=['postcode'])

# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')

# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)

# Delete the original column
del df['postcode']
print(df)
Output:
  postcode1 postcode2
0     FRH02     29290
1     FRH01     29300
2     FRT02     29310
3     FRH03     29340
4     FRH05     29350
5     FRG02     29360
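Since the question mentions str.extract, the same result should also be reachable with a single extract using two capture groups, assuming the quoting is consistent on every row (a sketch, not tested against malformed input):
df[['postcode1', 'postcode2']] = df['postcode'].str.extract(r"'([^']*)';'([^']*)'")
df = df.drop(columns=['postcode'])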
You can use Series.str.split:
p1 = []
p2 = []
for row in df['postcode'].str.split(';'):
    p1.append(row[0])
    p2.append(row[1])
df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2

Remove unwanted strings from Pandas column

I have a dataframe:
ID  Website
1   www.yah.com/?trk
2   www.gle.com
I want to clean the unwanted part from the website URL by deleting '?trk' or replacing it with ''.
My final dataframe should be:
ID  Website
1   www.yah.com
2   www.gle.com
How can I do this, given that I might have other suffixes, not only '?trk'?
If you want to replace '?trk' only and not the '/', you can use str.replace (Series.replace only matches whole cell values, so the .str accessor is needed to target a substring):
df['Website'] = df['Website'].str.replace('?trk', '', regex=False)
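Since the question says other suffixes besides '?trk' may appear, a more general option, assuming everything from the first '?' onward is junk, is a regex that strips any trailing query string:
df['Website'] = df['Website'].str.replace(r'/?\?.*$', '', regex=True)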
Check str.split:
df['Website'] = df['Website'].str.split('/').str[0]
df
Out[169]:
   ID      Website
0   1  www.yah.com
1   2  www.gle.com

Filter only links in a dataframe based on certain domain name

I have a pandas dataframe with 5 columns. I need to filter the dataframe on the link column against a list of domain names, counting the number of matching rows for each domain in turn.
Suppose I have the following dataframe:
url_id | link
------------------------------------------------------------------------
1 | http://www.example.com/somepath
2 | http://www.somelink.net/example
3 | http://other.someotherurls.ac.uk/thisissomelink.net&part/sample
4 | http://part.example.com/directory/files
I want to filter the dataframe based on a domain name from the list below and count the number of each result:
domains = ['example.com', 'other.com', 'somelink.net' , 'sample.com']
Following is the expected output:
domain | no_of_links
--------------------------
example.com | 2
other.com | 0
somelink.net | 1
sample.com | 0
This is my code:
from tld import get_tld
import pandas as pd

def urlparsing(row):
    url = row['link']
    res = get_tld(url, as_object=True)
    return res.fld

link = {"url_id": [1, 2, 3, 4],
        "link": ["http://www.example.com/somepath",
                 "http://www.somelink.net/example",
                 "http://other.someotherurls.ac.uk/thisissomelink.net&part/sample",
                 "http://part.example.com/directory/files"]}
domains = ['example.com', 'other.com', 'somelink.net', 'sample.com']
df_link = pd.DataFrame(link)

ref_dom = []
for dom in domains:
    ddd = df_link[(df_link.apply(lambda row: urlparsing(row), axis=1)).str.contains(dom, regex=False)]
    ref_dom.append([dom, len(ddd)])
pd.DataFrame(ref_dom, columns=['domain', 'no_of_links'])
Basically, my code works. However, when the dataframe is very big (more than 5 million rows) and the list of domain names has more than a hundred thousand entries, the process takes all day.
If you have an alternative way to make it faster, please let me know. Any help would be appreciated. Thank you.
You can do it using a regex and the findall function of the .str accessor:
domains = ['example.com', 'other.com', 'somelink.net', 'sample.com']
pat = "|".join([rf"https?://(?:\w*\.)?({domain})"
                for domain in map(lambda x: x.replace(".", r"\."), domains)])
match = df["link"].str.findall(pat).explode().explode()
match = match[match.str.len() > 0]
match.groupby(match).count()
Result:
link
example.com     2
somelink.net    1
Name: link, dtype: int64
For pandas before 0.25 (no Series.explode):
domains = ['example.com', 'other.com', 'somelink.net', 'sample.com']
pat = "|".join([rf"https?://(?:\w*\.)?({domain})"
                for domain in map(lambda x: x.replace(".", r"\."), domains)])
match = df["link"].str.findall(pat) \
    .apply(lambda x: "".join([domain for match in x for domain in match]).strip())
match = match[match.str.len() > 0]
match.groupby(match).count()
To also get the domains with 0 links, you can join the result with a frame containing all of the domains.
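A concrete way to do that, assuming the match Series from the first snippet: reindex the grouped counts against the full domain list so the missing domains become 0.
counts = match.groupby(match).count()
result = counts.reindex(domains, fill_value=0).rename_axis('domain').rename('no_of_links')
print(result)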

Overwriting values in column created with Python for loop

I'm building an automated MLB schedule from a base URL and a loop through a list of team names as they appear in the URL. Using pd.read_html I get each team's schedule. The only thing I'm missing is, for each team's page, the team name itself, which I'd like as a new column 'team_name'. I have a small sample of my goal at the end of this post.
Below is what I have so far; if you run this, the printout does exactly what I need for just one team.
import pandas as pd

url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners']
df = pd.DataFrame()
for team in team_list:
    new_url = url_base + team
    df = df.append(pd.read_html(new_url)[1])
    df['team_name'] = team
print(df[['team_name', 'Opponent']])
The trouble is, when I have all 30 teams in team_list, the value of team_name keeps getting overwritten, so that all 4000+ records list the same team name (the last one in team_list). I've tried dynamically assigning only certain rows the team value by using
df['team_name'][a:b] = team
where a, b are the starting and ending rows on the dataframe for the index team; but this gives KeyError: 'team_name'. I've also tried using placeholder series and dataframes for team_name, then merging with df later, but get duplication errors. On a larger scale, what I'm looking for is this:
    team_name         opponent
0   seattle-mariners  new-york-yankees
1   seattle-mariners  new-york-yankees
2   seattle-mariners  boston-red-sox
3   seattle-mariners  boston-red-sox
4   seattle-mariners  san-diego-padres
5   seattle-mariners  san-diego-padres
6   cincinatti-reds   new-york-yankees
7   cincinatti-reds   new-york-yankees
8   cincinatti-reds   boston-red-sox
9   cincinatti-reds   boston-red-sox
10  cincinatti-reds   san-diego-padres
11  cincinatti-reds   san-diego-padres
The original code df['team_name'] = team rewrites team_name for the entire df on every pass through the loop. The code below instead creates a placeholder, df_team, sets team_name on it, and only then collects it into the combined frame.
url_base = "https://www.teamrankings.com/mlb/team/"
team_list = ['seattle-mariners', 'houston-astros']
Option A: for loop
df_list = list()
for team in team_list:
    new_url = url_base + team
    df_team = pd.read_html(new_url)[1]
    df_team['team_name'] = team
    df_list.append(df_team)
df = pd.concat(df_list)
Option B: list comprehension:
df_list = [pd.read_html(url_base + team)[1].assign(team_name=team) for team in team_list]
df = pd.concat(df_list)
df.head()
df.tail()
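One small follow-up on either option: pd.concat keeps each page's original row labels, so the combined frame has a repeating index. Passing ignore_index=True yields the clean 0..n-1 index shown in the goal output:
df = pd.concat(df_list, ignore_index=True)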

Adding headers to a table I have scraped

I have been following an online tutorial, but rather than using the tutorial data, which comes with headers, I want to use the code below.
The problem I have is that my table has no headers, so it uses the first row as the header. How can I set defined headers of "Ride" and "Queue Time"?
Thanks
import requests
import lxml.html as lh
import pandas as pd

url = 'http://www.ridetimes.co.uk/'
page = requests.get(url)
doc = lh.fromstring(page.content)
tr_elements = doc.xpath('//tr')
col = []
i = 0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
print(col)
How about trying this:
>>> pd.DataFrame(col, columns=["Ride", "Queue Time"])
               Ride Queue Time
0  Spinball Whizzer         []
1            0 mins         []
If I am correct, then this is the answer.
Use pandas to get the table, then just assign the column names:
import pandas as pd
url='http://www.ridetimes.co.uk/'
df = pd.read_html(url)[0]
df.columns = ['Ride', 'Queue Time']
Output:
print(df)
               Ride              Queue Time
0  Spinball Whizzer                  0 mins
1           Nemesis                  5 mins
2          Oblivion                  5 mins
3        Wicker Man                  5 mins
4        The Smiler                 10 mins
5              Rita                 20 mins
6          TH13TEEN                 25 mins
7         Galactica   Currently Unavailable
8        Enterprise   Currently Unavailable
Consider using the same source the page itself uses to update values, which returns JSON. You add a random number to the URL to prevent cached results being served. This returns all group types, not just Thrill.
import requests
import random
import pandas as pd
i = random.randint(1, 1000000000000000000)
r = requests.get('http://ridetimes.co.uk/queue-times-new.php?r=' + str(i)).json()  #to prevent cached results being served
df = pd.DataFrame([(item['ride'], item['time']) for item in r], columns=['Ride', 'Queue Time'])
print(df)
If you want only the Thrill group, amend to this line:
df = pd.DataFrame([(item['ride'], item['time']) for item in r if item['group'] == 'Thrill'], columns=['Ride', 'Queue Time'])
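A minor variation on the cache-busting, assuming the same endpoint: requests can build the query string for you via params, which avoids the manual concatenation:
r = requests.get('http://ridetimes.co.uk/queue-times-new.php', params={'r': i}).json()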
