Match pattern of URLs in a pandas column - Python

I am currently working on a dataset which contains a large number of links.
I want to filter the links against a list of websites, so I wrote an array which contains the xxx-value of every website:
www.xxx.de/com/whatever
What I want to do is check every column entry against the values in the array.
list = ['forbes','bloomberg',...]
map = df['URL'].match(list)
df['URL'] = df.apply(map)
Something along those lines. I am just not sure how to work with the links in the column, since I have never worked with strings like this before.
Links are in the following format:
www.forbes.com/.../...
Is there an easy way to do this without using urlparse or similar?
Thanks a lot for your help!

I believe you need extract for a new column:
import pandas as pd

df = pd.DataFrame({'URL': ['www.forbes.com/.../...',
                           'www.bloomberg.com/something',
                           'www.webpage.com/something']})
L = ['forbes', 'bloomberg']
df['new'] = df['URL'].str.extract("(" + "|".join(L) + ")", expand=False)
print(df)
                           URL        new
0       www.forbes.com/.../...     forbes
1  www.bloomberg.com/something  bloomberg
2    www.webpage.com/something        NaN
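Note that a bare alternation matches the site names anywhere in the string, so a hypothetical notforbes.xyz/... would also match. A tighter sketch, assuming the www.name.tld layout shown in the question:
# Anchor the site name between "www." and the following dot; the URL
# layout (www.xxx.de/com/...) is an assumption taken from the question.
df['new'] = df['URL'].str.extract(r"www\.(" + "|".join(L) + r")\.", expand=False)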
But if you want only to filter rows, use contains:
df = df[df['URL'].str.contains("|".join(L))]
print(df)
                           URL
0       www.forbes.com/.../...
1  www.bloomberg.com/something
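If the column can contain missing values or mixed casing, a hedged variant of the filter (re.escape guards against regex metacharacters in the site names, and na=False treats NaN entries as non-matches):
import re

pattern = "|".join(re.escape(s) for s in L)
df = df[df['URL'].str.contains(pattern, case=False, na=False)]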

Related

Find which columns contain a certain value for each row in a dataframe

I have a dataframe, df, shown below. Each row is a story and each column is a word that appears in the corpus of stories. A 0 means the word is absent in the story while a 1 means the word is present.
I want to find which words are present in each story (i.e. col val == 1). How can I go about finding this (preferably without for-loops)?
Thanks!
Assuming you are just trying to look at one story, you can filter for the story (let's say story 34972) and transpose the dataframe with:
df_34972 = df[df.index == 34972].T
and then you can send the words whose value equals 1 to a list:
[*df_34972[df_34972[34972] == 1].index]
If you are trying to do this for all stories, then you can do this, but it will be a slightly different technique. From the link that SammyWemmy provided, you can melt() the dataframe and filter for 1 values for each story. From there you could .groupby('story_column') which is 'index' (after using reset_index()) in the example below:
df = df.reset_index().melt(id_vars='index')
df = df[df['value'] == 1]
df.groupby('index')['variable'].apply(list)
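A minimal end-to-end sketch of that pipeline, using a hypothetical two-story matrix (the word columns are made up for illustration):
import pandas as pd

# Hypothetical story-by-word matrix: rows are stories, columns are words.
df = pd.DataFrame({'wolf': [1, 0], 'moon': [0, 1], 'sea': [1, 1]},
                  index=[34972, 34973])

melted = df.reset_index().melt(id_vars='index')
melted = melted[melted['value'] == 1]
print(melted.groupby('index')['variable'].apply(list))
# 34972    [wolf, sea]
# 34973    [moon, sea]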

Ignoring multiple commas while reading csv in pandas

I'm trying to read multiple files whose names start with 'site_%', for example site_1, site_a.
Each file has data like :
Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php
I need two columns in my pandas df: Login_id and Web.
I am facing an error when I try to read records like row 2.
df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)
I am facing the following error :
ValueError: Columns must be same length as key.
Please let me know where I am making a mistake, and any good approach to solve the problem. Thanks
Solution 1: use split with argument n=1 and expand=True.
result = df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns = ['Login_id', 'Web']
That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).
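A minimal sketch of that concat step, assuming the raw file was read into df with the single combined column 'Login_id, Web' as in the question:
# Replace the combined column with the two new ones.
df = pd.concat([df.drop(columns=['Login_id, Web']), result], axis=1)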
EDIT Solution 2: there is a nicer regex-based solution which uses a pandas function:
result = df['Login_id, Web'].str.extract(r'^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)
This splits the field and uses the names of the matching groups to create columns with their content. The output is:
  Login_id                       URL
0        1         http://www.x1.com
1        2  http://www.x1.com,as.php
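If the second column should be called Web (as in the file header) rather than URL, a one-line rename is enough:
result = result.rename(columns={'URL': 'Web'})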
Solution 3: conventional version with regex:
You could do something customized, e.g with a regex:
import re

sp_re = re.compile('([^,]*),(.*)')
aux_series = df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id'] = aux_series.str[0]
df['URL'] = aux_series.str[1]
The result on your example data is:
                Login_id, Web Login_id                       URL
0         1,http://www.x1.com        1         http://www.x1.com
1  2,http://www.x1.com,as.php        2  http://www.x1.com,as.php
Now you could drop the column 'Login_id, Web'.
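For completeness, that drop is a one-liner (column name as in the question):
df = df.drop(columns=['Login_id, Web'])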

looping to pull the first 2 substrings of a column in python

I am attempting to pull a substring from a column, in the following way:
target_column:
PE123
DD123-HP123
HP123
373627HP23
I would like to pull the first two letters of every record, except in cases where there is no letter among the first two characters. In that case, pull any letters found in the rest of the string. So in the case of 373627HP23, it will pull HP.
But the problem is with something like DD123-HP123. My loop is pulling the HP instead of the DD.
import re

for index, row in df.iterrows():
    target_value = row['target_column']
    predefined_code = ['HP']
    for code in re.findall("[a-zA-Z]+", target_value):
        if len(code) != 1 and code not in predefined_code:
            possible_code = code
What is wrong with my code here?
What is the best code to write a loop so that in the case of something like DD123-HP123, it will pull the DD and not the HP?
I believe you can use extract to return the first matched pattern:
df['new'] = df['target_column'].str.extract(r"([a-zA-Z]+)", expand=False)
print(df)
  target_column new
0         PE123  PE
1   DD123-HP123  DD
2         HP123  HP
3    373627HP23  HP
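If strictly two-letter codes are wanted even when the leading run of letters is longer, a hedged follow-up that slices the extracted match (whether longer runs should be truncated is an assumption, since all of the question's examples are two-letter codes):
df['new'] = df['target_column'].str.extract(r"([a-zA-Z]+)", expand=False).str[:2]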

Extracting multiple URLs - Python

I want to extract several links from texts (comments) which are stored in a pandas dataframe. My goal is to add the extracted URLs to a new column of the original dataset. By applying the following method to my text I am able to extract the URLs, store them in the variable URL and transform it into another pandas dataframe. I am not sure whether this is an efficient way to extract the necessary information.
URL = (ALL.textOriginal.str.extractall(r"(?P<URL>(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300})))").reset_index('match', drop=True))
URL_df = pd.DataFrame(data=URL)
URL_df.drop([1],axis=1)
gives me:
596 https://www.tag24.de/nachrichten
596 http://www.tt.com/panorama
596 http://www.wz.de/lokales
666 https://www.svz.de/regionales
666 https://www.watson.ch/Leben
... ...
The dataframe contains only the indices and the hyperlinks. The problem with this method is that some of the indices are duplicated, because one comment can contain several URLs, all of which are extracted. I tried different ways to solve this problem, such as:
pd.concat([ALL, URL_df.drop], axis=1).sort_index()
I also tried to store the URLs directly in the original dataframe by applying:
ALL['URL'] = ALL.textOriginal.str.extractall(r"(?P<URL>(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300})))").reset_index('match', drop=True)
but I only received this error message:
"incompatible index of the inserted column with frame index"
As I said before my goal is to store the extracted URLs in a new column like:
text URL
"blablabla link1, link2, link3" [https://www.tag24.de/nachrichten, http://www.tt.com/panorama, http://www.wz.de/lokales]
"blablabla link1, link2" [https://www.svz.de/regionales, https://www.watson.ch/Leben]
... ...
I think you need findall:
pat = "(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300}))"
ALL['URL'] = ALL.textOriginal.str.findall(pat)
print(ALL)
                                        textOriginal  \
0  https://www.tag24.de/nachrichten http://www.tt...
1  https://www.svz.de/regionales https://www.wats...

                                                 URL
0  [https://www.tag24.de/nachrichten, http://www....
1  [https://www.svz.de/regionales, https://www.wa...
Another solution with extractall, which returns a MultiIndex, so it is necessary to groupby the first level, creating lists:
pat = "(https?://(?:www)?(?:[\w-]{2,255}(?:\.\w{2,6}){1,2})(?:/[\w&%?#-]{1,300}))"
ALL['URL'] = ALL.textOriginal.str.extractall(pat).groupby(level=0)[0].apply(list)
print(ALL)
                                        textOriginal  \
0  https://www.tag24.de/nachrichten http://www.tt...
1  https://www.svz.de/regionales https://www.wats...

                                                 URL
0  [https://www.tag24.de/nachrichten, http://www....
1  [https://www.svz.de/regionales, https://www.wa...
Setup:
ALL = pd.DataFrame({'textOriginal': ['https://www.tag24.de/nachrichten http://www.tt.com/panorama http://www.wz.de/lokales', 'https://www.svz.de/regionales https://www.watson.ch/Leben']})
Let's say you have a dataframe with two columns, 'Indice' and 'Link', where Indice is not unique. You can aggregate all the links with the same Indice in the following way:
myAggregateDF = myDF.groupby('Indice')['Link'].apply(list).to_frame()
In this way, you will obtain a new dataframe indexed by 'Indice', where 'Link' holds a list of the previous links.
Pay attention though, this method is not efficient. Groupby is memory hungry and this can be a problem with large dataframes.
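A minimal sketch of that aggregation, with hypothetical 'Indice' and 'Link' values shaped like the extractall output above:
import pandas as pd

myDF = pd.DataFrame({'Indice': [596, 596, 666],
                     'Link': ['https://www.tag24.de/nachrichten',
                              'http://www.tt.com/panorama',
                              'https://www.svz.de/regionales']})
myAggregateDF = myDF.groupby('Indice')['Link'].apply(list).to_frame()
# Indice 596 -> [https://www.tag24.de/nachrichten, http://www.tt.com/panorama]
# Indice 666 -> [https://www.svz.de/regionales]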

How to feed array of user_ids to flickr.people.getInfo()?

I have been working on extracting Flickr users' location (not latitude and longitude but the person's country) by using their user_ids. I have made a dataframe (Here's the dataframe) consisting of photo id, owner and a few other columns. My attempt was to feed each owner to a flickr.people.getInfo() query by iterating over the owner column of the dataframe. Here is my attempt:
for index, row in df.iterrows():
    A = np.array(df["owner"])
    for i in range(len(A)):
        B = flickr.people.getInfo(user_id=A[i])
Unfortunately, it yields only one result. After careful examination I've found that it belongs to the last user in the dataframe. My dataframe has 250 observations; I don't know how I could extract the others.
Any help is appreciated.
It seems like you forgot to store the results while iterating over the dataframe. I haven't used the API, but I think this snippet should do it.
result_dict = {}
for idx, owner in df['owner'].iteritems():
    result_dict[owner] = flickr.people.getInfo(user_id=owner)
The results are stored in a dictionary where the user id is the key.
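Since some of the 250 lookups may fail or hit rate limits, a hedged variant that keeps going on errors (the exact exception class depends on the flickrapi version, so a broad except is used; items() replaces the older iteritems() on pandas 2.0+):
result_dict = {}
for idx, owner in df['owner'].items():
    try:
        result_dict[owner] = flickr.people.getInfo(user_id=owner)
    except Exception:
        # Record the failure and continue with the remaining users.
        result_dict[owner] = None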
EDIT:
Since the result is JSON, you can use the read_json function to parse it.
Example:
import json

result_list = []
for idx, owner in df['owner'].iteritems():
    result_list.append(pd.read_json(json.dumps(flickr.people.getInfo(user_id=owner)),
                                    orient='index'))
# You may have to set the orient parameter.
# Options are: 'split', 'records', 'index'.
Note: I switched the dictionary to a list, since it is more convenient.
Afterwards you can concatenate the resulting pandas Series together like this:
df = pd.concat(result_list, axis=1).transpose()
I added the transpose() since you probably want the ID as the index.
Afterwards you should be able to sort by the column 'location'.
Hope that helps.
The canonical way to achieve that is to use an apply. It will be much more efficient.
import pandas as pd
import numpy as np

np.random.seed(0)

# A function to simulate the call to the API
def get_user_info(id):
    return np.random.randint(id, id + 10)

# Some test data
df = pd.DataFrame({'id': [0, 1, 2], 'name': ['Pierre', 'Paul', 'Jacques']})

# Here the call is made for each ID
df['info'] = df['id'].apply(get_user_info)

#    id     name  info
# 0   0   Pierre     5
# 1   1     Paul     1
# 2   2  Jacques     5
Note, another way to write the same thing is
df['info'] = df['id'].map(lambda x: get_user_info(x))
Before calling the method, run the following lines first:
from flickrapi import FlickrAPI
flickr = FlickrAPI(FLICKR_KEY, FLICKR_SECRET, format='parsed-json')
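Putting the pieces together for the original goal (the country), a sketch that applies the real call per owner; the response layout ('person' -> 'location' -> '_content') is an assumption based on the parsed-json format and may need adjusting:
def get_location(owner_id):
    info = flickr.people.getInfo(user_id=owner_id)
    # Navigate defensively, since some users expose no location.
    return info.get('person', {}).get('location', {}).get('_content')

df['location'] = df['owner'].apply(get_location)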
