Extracting multiple URLs - Python

I want to extract several links from texts (comments) that are stored in a pandas dataframe. My goal is to add the extracted URLs as a new column of the original dataset. Using the following method on my text, I am able to extract the URLs, store them in the variable URL and turn that into another pandas dataframe. I am not sure whether this is an efficient way to extract the necessary information.
URL = (ALL.textOriginal.str.extractall("(?P<URL>(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300})))").reset_index('match', drop=True))
URL_df = pd.DataFrame(data=URL)
URL_df.drop([1],axis=1)
gives me:
596 https://www.tag24.de/nachrichten
596 http://www.tt.com/panorama
596 http://www.wz.de/lokales
666 https://www.svz.de/regionales
666 https://www.watson.ch/Leben
... ...
The dataframe contains only the indices and the hyperlinks. The problem with this method is that some of the indices are duplicated, because one comment can contain several URLs, all of which are extracted. I tried different ways to solve this problem, such as:
pd.concat([ALL, URL_df.drop], axis=1).sort_index()
I also tried to store the URLs directly in the original dataframe by applying:
ALL['URL'] = ALL.textOriginal.str.extractall("(?P<URL>(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300})))").reset_index('match', drop=True)
but I only received this error message:
"incompatible index of the inserted column with frame index"
As I said before, my goal is to store the extracted URLs in a new column, like this:
text URL
"blablabla link1, link2, link3" [https://www.tag24.de/nachrichten, http://www.tt.com/panorama, http://www.wz.de/lokales]
"blablabla link1, link2" [https://www.svz.de/regionales, https://www.watson.ch/Leben]
... ...

I think you need findall:
pat = "(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300}))"
ALL['URL'] = ALL.textOriginal.str.findall(pat)
print (ALL)
textOriginal \
0 https://www.tag24.de/nachrichten http://www.tt...
1 https://www.svz.de/regionales https://www.wats...
URL
0 [https://www.tag24.de/nachrichten, http://www....
1 [https://www.svz.de/regionales, https://www.wa... ]
Another solution uses extractall, which returns a MultiIndex, so it is necessary to groupby the first level and create lists:
pat = "(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300}))"
ALL['URL'] = ALL.textOriginal.str.extractall(pat).groupby(level=0)[0].apply(list)
print (ALL)
textOriginal \
0 https://www.tag24.de/nachrichten http://www.tt...
1 https://www.svz.de/regionales https://www.wats...
URL
0 [https://www.tag24.de/nachrichten, http://www....
1 [https://www.svz.de/regionales, https://www.wa...
Setup:
ALL = pd.DataFrame({'textOriginal': ['https://www.tag24.de/nachrichten http://www.tt.com/panorama http://www.wz.de/lokales', 'https://www.svz.de/regionales https://www.watson.ch/Leben']})
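Note the difference when a comment contains no URL at all: findall gives an empty list for that row, while the extractall/groupby variant has no entry for that index, so the assigned column gets NaN there. A minimal sketch of both behaviours, reusing the pattern above:
import pandas as pd

pat = "(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300}))"

ALL = pd.DataFrame({'textOriginal': ['see https://www.tag24.de/nachrichten',
                                     'no link in this comment']})

# findall: rows without a match get an empty list
ALL['URL_findall'] = ALL.textOriginal.str.findall(pat)

# extractall + groupby: rows without a match are simply absent, so they end up as NaN
ALL['URL_extractall'] = ALL.textOriginal.str.extractall(pat).groupby(level=0)[0].apply(list)

print(ALL[['URL_findall', 'URL_extractall']])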

Let's say you have a dataframe with two columns, 'Indice' and 'Link', where Indice is not unique. You can aggregate all the links with the same Indice in the following way:
myAggregateDF = myDF.groupby('Indice')['Link'].apply(list).to_frame()
In this way, you will obtain a new dataframe with two columns, 'Indice' and 'Link', where 'Link' is a list of the previous links.
Be aware, though, that this method is not particularly efficient: groupby is memory hungry, which can be a problem with large dataframes.
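For concreteness, a minimal sketch of that aggregation on toy data (Indice and Link are just the placeholder column names used above):
import pandas as pd

myDF = pd.DataFrame({'Indice': [596, 596, 666],
                     'Link': ['https://www.tag24.de/nachrichten',
                              'http://www.tt.com/panorama',
                              'https://www.svz.de/regionales']})

# collapse duplicated indices: one row per Indice, with its links collected into a list
myAggregateDF = myDF.groupby('Indice')['Link'].apply(list).to_frame()
print(myAggregateDF)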

Related

Split and create data from a column to many columns

I have a pandas data frame in which the values of one of its columns look like this:
print(VCF['INFO'].iloc[0])
Results (sorry, I can't copy and paste this data as I am working from a cluster without an internet connection).
I need to create new columns named END, SVTYPE and SVLEN, with their info as the values of those columns. Following the example, this would be:
END        SVTYPE  SVLEN
224015456  DEL     -223224913
The rest of the info contained in the INFO column I do not need so far.
The information contained in this column is huge, but as far as I can see there are no more something=value pairs (as shown in the picture in the original question).
Simply use .str.extract:
extracted = df['INFO'].str.extract('END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
Output:
>>> extracted
END SVTYPE SVLEN
0 224015456 DEL -223224913
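If you also want those values attached to the original dataframe, and as numbers rather than strings, a small follow-up sketch (assuming the frame is called df, as above):
# join the extracted columns back onto the original frame
df = df.join(extracted)

# the captured groups are strings, so convert the numeric ones if needed
df[['END', 'SVLEN']] = df[['END', 'SVLEN']].apply(pd.to_numeric)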

Ignoring multiple commas while reading csv in pandas

I'm trying to read multiple files whose names start with 'site_%'; for example, file names like site_1, site_a.
Each file has data like:
Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php
I need two columns in my pandas df: Login_id and Web.
I am facing an error when I try to read records like line 2.
df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)
I am facing the following error:
ValueError: Columns must be same length as key.
Please let me know where I am making a mistake, and any good approach to solve the problem. Thanks.
Solution 1: use split with argument n=1 and expand=True.
result= df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns= ['Login_id', 'Web']
That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).
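A minimal sketch of that concat step, assuming the original frame is df_0 as in the question:
import pandas as pd

result = df_0['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']

# keep any other columns of the original frame next to the two new ones
df_0 = pd.concat([df_0, result], axis=1)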
EDIT Solution 2: there is a nicer regex-based solution which uses a pandas function:
result= df['Login_id, Web'].str.extract('^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)
This splits the field and uses the names of the matching groups to create columns with their content. The output is:
Login_id URL
0 1 http://www.x1.com
1 2 http://www.x1.com,as.php
Solution 3: conventional version with regex:
You could do something customized, e.g. with a regex:
import re
sp_re= re.compile('([^,]*),(.*)')
aux_series= df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id']= aux_series.str[0]
df['URL']= aux_series.str[1]
The result on your example data is:
Login_id, Web Login_id URL
0 1,http://www.x1.com 1 http://www.x1.com
1 2,http://www.x1.com,as.php 2 http://www.x1.com,as.php
Now you could drop the column 'Login_id, Web'.
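That drop could look like this (a minimal sketch, assuming the frame is still called df):
# the combined column is no longer needed once Login_id and URL exist
df = df.drop(columns=['Login_id, Web'])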

Match pattern of urls in a pandas column

I am currently working on a dataframe which contains a large number of links.
For now I want to filter the links against a list of websites.
So I wrote an array which contains the xxx-value of every website:
www.xxx.de/com/whatever
What I want to do is check every column entry against the values in the array.
list = ['forbes','bloomberg',...]
map = df['URL'].match(list)
df['URL'] = df.apply(map)
Something along those lines. I am just not sure how to work with the links in the column, since I have never worked with strings before.
Links are in the following format:
www.forbes.com/.../...
Is there any easy way without using urlparse or similar to do this job?
Thanks a lot for your help!
I believe you need str.extract for the new column:
df = pd.DataFrame({'URL': ['www.forbes.com/.../...',
                           'www.bloomberg.com/something',
                           'www.webpage.com/something']})
L = ['forbes','bloomberg']
df['new'] = df['URL'].str.extract("(" + "|".join(L) +")", expand=False)
print (df)
URL new
0 www.forbes.com/.../... forbes
1 www.bloomberg.com/something bloomberg
2 www.webpage.com/something NaN
But if you only want to filter rows, use contains:
df = df[df['URL'].str.contains("|".join(L))]
print (df)
URL
0 www.forbes.com/.../...
1 www.bloomberg.com/something
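One caution: the list entries are joined directly into the regex, so if they could ever contain regex metacharacters (e.g. dots or plus signs), it is safer to escape them first; a small sketch:
import re

L = ['forbes', 'bloomberg']
pat = "|".join(re.escape(x) for x in L)

df['new'] = df['URL'].str.extract("(" + pat + ")", expand=False)
df_filtered = df[df['URL'].str.contains(pat)]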

Taking dictionary/list, and mapping to dataframe with x number of matching columns

I have a table that contains 2 columns.
Column 1 | Column 2
----------------------------
unique_number | '123 Main St. Suite 100 Chicago, IL'
I've been exploring address parsing using https://parserator.datamade.us/api-docs and ideally would like to parse the address, and put the results into new columns.
import usaddress
addr='123 Main St. Suite 100 Chicago, IL'
Two options for returning parsed results, and I plan on using whichever is easier to add to a dataframe:
usaddress.parse(addr): the parse method will split your address string into components and label each component (returns a list).
usaddress.tag(addr): the tag method will try to be a little smarter; it will merge consecutive components, strip commas, and return an address type (returns an OrderedDict plus the address type).
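For reference, a rough sketch of what the two calls return (the exact labels depend on the address):
import usaddress

addr = '123 Main St. Suite 100 Chicago, IL'

# parse: a flat list of (token, label) pairs
print(usaddress.parse(addr))
# e.g. [('123', 'AddressNumber'), ('Main', 'StreetName'), ...]

# tag: an OrderedDict of merged components plus an address type
components, address_type = usaddress.tag(addr)
print(address_type)      # e.g. 'Street Address'
print(dict(components))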
There are 26 different tags available for an address using this parser.
However, not all addresses will contain all of these tags.
I need to grab the full address for each row, parse it, map the parsed results to each matching column in that same row.
(Screenshots in the original question show what the tag data and the parse data look like when loaded with from_records; the resulting index isn't exactly ideal.)
I can't quite figure out the logic of the row-by-row calculations and mapping the results.
First, create a column of json responses from the parsing service
df['json_response'] = df['address'].apply(usaddress.parse)
Next, combine all the jsons into a single json string
json_combined = json.dumps(list(df['json_response']))
Finally, parse the combined json string back into a dataframe
df_parsed = pd.io.json.json_normalize(json.loads(json_combined))
Now you should have a structured dataframe with all required columns which you can df.join with your original dataframe to produce a single unified dataset.
Just a note: depending on the structure of the returned json, you may need to pass further arguments to the pandas.io.json.json_normalize function. The example on the linked page is a good starting point.
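For intuition, json_normalize turns a list of flat dicts into a dataframe with one column per key; a toy sketch with made-up keys:
import json
import pandas as pd

records = [{'AddressNumber': '123', 'StreetName': 'Main'},
           {'AddressNumber': '456', 'StreetName': 'Oak'}]

json_combined = json.dumps(records)
df_parsed = pd.io.json.json_normalize(json.loads(json_combined))
print(df_parsed)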
Super late in posting this solution, but I wanted to in case anyone else ran into the same problem.
Address csv file headings:
name, address
Imports:
import pandas as pd
import numpy as np
import json
import itertools
import usaddress
def address_func(address):
    try:
        return usaddress.tag(address)
    except:
        return [{'AddressConversion': 'Error'}]
# import file
file = pd.read_excel('addresses.xlsx')
# apply function
file['tag_response'] = file['Full Address'].apply(address_func)
# copy values to new column
file['tags'] = file.apply(lambda row: row['tag_response'][0], axis=1)
# dump json
tags_combined = json.dumps(list(file['tags']))
# create dataframe of parsed info
df_parsed = pd.io.json.json_normalize(json.loads(tags_combined))
# merge dataframes on index
merged = file.join(df_parsed)

how to locate row in dataframe without headers

I noticed that when using .loc on a pandas dataframe, it not only finds the row of data I am looking for but also includes the column headers of the dataframe I am searching within.
So when I try to append the .loc row of data, it includes the data + column headers - I don't want any column headers!
##1st dataframe
df_futures.head(1)
date max min
19990101 2000 1900
##2nd dataframe
df_cash.head(1)
date$ max$ min$
1999101 50 40
##if date is found in dataframe 2, I will collect the row of data
data_to_track = []
for ii in range(len(df_futures['date'])):
    ##date I will try to find in df2
    date_to_find = df_futures['date'][ii]
    ##append the row of data to my list
    data_to_track.append(df_cash.loc[df_cash['Date$'] == date_to_find])
I want the for loop to return just 19990101 50 40
It currently returns 0 19990101 50 40, date$, max$, min$
I agree with other comments regarding the clarity of the question. However, if what you want is just a string that contains a particular row's data, then you could use the to_string() method of pandas.
In your case,
Instead of this:
df_cash.loc[df_cash['Date$'] == date_to_find]
You could get a string that includes only the row data:
df_cash[df_cash['Date$'] == date_to_find].to_string(header=None)
Also notice that I dropped the .loc part; it produces the same result.
If your dataframe has multiple columns and you don't want them joined into one string (which may cause data type issues and is potentially problematic if you want to separate them later on), you could use list() instead:
list(df_cash[df_cash['Date$'] == date_to_find].iloc[0])
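A tiny sketch of the difference on a one-row frame, using df_cash and Date$ as in the question:
import pandas as pd

df_cash = pd.DataFrame({'Date$': [19990101], 'max$': [50], 'min$': [40]})
date_to_find = 19990101

row = df_cash[df_cash['Date$'] == date_to_find]

print(row.to_string(header=None))   # one plain string, no column headers
print(list(row.iloc[0]))            # [19990101, 50, 40] keeps the values separate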
