How to do a conditional assignment to a column in Python?

I have a regex-based CASE formula in a Google Data Studio dashboard that creates a "Channel" column in the data:
CASE
WHEN REGEXP_MATCH(business_partner, ".*Accounting.*|.*Ecosystem.*|.*Platform.*|.*Agency.*") THEN "Partner"
WHEN REGEXP_MATCH(utm_source, '.*Facebook.*') THEN "Facebook"
WHEN REGEXP_MATCH(utm_source, '.*Google*') AND NOT REGEXP_MATCH(utm_campaign,".*branding.*") THEN "Google"
WHEN REGEXP_MATCH(utm_campaign,".*branding.*") THEN "Branding"
ELSE "Others"
END
How can I replicate this logic in Python? Something like df['channel'] = ..., producing a column like:

channel
Facebook
Google
Partner
I did a lot of research on the internet, but didn't find anything very conclusive.
Here is a sample of the data:

utm_source    utm_campaign    business_partner
facebook      conversion
Google        Search
Google        Branding
Direct                        Agency
facebook      traffic
Google        Display

Here is an easy, straightforward solution using np.select (see the NumPy documentation):

import io

import numpy as np
import pandas as pd

# the sample data from the question, rewritten as comma-separated values
data_string = io.StringIO("""utm_source,utm_campaign,business_partner
facebook,conversion,
Google,Search,
Google,Branding,
Direct,,Agency
facebook,traffic,
Google,Display,""")
df = pd.read_csv(data_string)

# one condition per channel, evaluated in order
conditions = [df['utm_source'].str.lower() == 'facebook',
              df['utm_source'].str.lower() == 'google',
              df['business_partner'].str.lower() == 'agency']
channels = ['Facebook', 'Google', 'Partner']

# np.select picks the label of the first matching condition; unmatched rows get the default
df['channel'] = np.select(conditions, channels, default='Others')

# Branding overrides whatever was assigned above
df['channel'] = np.where(df['utm_campaign'] == 'Branding', 'Branding', df['channel'])
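If you want to mirror the original REGEXP_MATCH conditions more closely, a sketch along these lines should work (it assumes the same df and column names as above; adjust the patterns to your data):

import numpy as np

# Case-insensitive regex checks mirroring the Data Studio CASE statement;
# fillna('') guards against missing values in the text columns.
is_partner  = df['business_partner'].fillna('').str.contains(
    'Accounting|Ecosystem|Platform|Agency', case=False, regex=True)
is_facebook = df['utm_source'].fillna('').str.contains('Facebook', case=False)
is_branding = df['utm_campaign'].fillna('').str.contains('branding', case=False)
is_google   = df['utm_source'].fillna('').str.contains('Google', case=False) & ~is_branding

conditions = [is_partner, is_facebook, is_google, is_branding]
channels   = ['Partner', 'Facebook', 'Google', 'Branding']

# As in the CASE statement, the first matching condition wins; ELSE -> "Others"
df['channel'] = np.select(conditions, channels, default='Others')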

Related

Why does searching with subjArea and subjCode fetch different results with the Scopus Serial Search API?

I am trying to retrieve all journals within a given Scopus subject area, say 'Medicine', using the Python package pybliometrics.
According to the Scopus search (online), there are 13,477 journals in this category.
I access the Serial Title API of Scopus via pybliometrics.scopus.SerialSearch(); for the Medicine category, the parameters are subjArea='MEDI' and subjCode='2700'. The list of all codes associated with the Scopus subject categories is listed here.
With subjCode='2700' I am not able to get more than 5,000 journals, while with subjArea='MEDI' I can retrieve more than 5,000 documents, but not more than 10,000.
I do not understand why searching with subjArea and subjCode fetches different results for me. Can anyone help me understand why this could be happening?
I am adding my code for both these search queries for better understanding:
import pandas as pd
from pybliometrics.scopus import SerialSearch


def search_by_subject_area(subject_area):
    print("Searching journals by subject area....")
    df = pd.DataFrame()
    i = 0
    # the i < 10000 limit is added, otherwise Scopus raises a 500 error
    while i > -1 and i < 10000:
        s = SerialSearch(query={"subj": f"{subject_area}"}, start=f"{i}", refresh=True)
        if s.get_results_size() == 0:
            break
        i += s.get_results_size()
        df_new = pd.DataFrame(s.results)
        df = pd.concat([df, df_new], axis=0, ignore_index=True)
    print(i, " journals obtained!")
    return df


def search_by_subject_code(code):
    print("------------------------------------------------\n Searching journals by subject codes....")
    df = pd.DataFrame()
    i = 0
    while i > -1:
        s = SerialSearch(query={"subjCode": f"{code}"}, start=f"{i}", refresh=True)
        if s.get_results_size() == 0:
            break
        i += s.get_results_size()
        df_new = pd.DataFrame(s.results)
        df = pd.concat([df, df_new], axis=0, ignore_index=True)
    print(i, " journals obtained!")
    return df


if __name__ == '__main__':
    search_by_subject_area(subject_area='MEDI')
    search_by_subject_code('2700')
Certain Scopus APIs, including the Serial Search API, are restricted: they do not allow more than 5,000 results per query.
Some other Search APIs have pagination enabled, which lets you cycle through a potentially unlimited number of results.
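One practical workaround, if you need the full journal list for an area, is to split the request into one query per subject code, which keeps each result set well under the cap. A minimal sketch, reusing search_by_subject_code from above and assuming the Medicine ASJC codes run from 2700 to 2748 (verify the exact range against the official code list):

import pandas as pd

# Hypothetical ASJC code range for the Medicine area; check the official code list.
medicine_codes = range(2700, 2749)

frames = [search_by_subject_code(code) for code in medicine_codes]
all_medicine = pd.concat(frames, ignore_index=True)
print(len(all_medicine), "rows collected")
# Note: a journal can carry several subject codes, so deduplicate on an
# identifier column such as the ISSN before using the combined frame.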

Scraper To Copy Articles In Bulk

I'm working on an AI project, and one of the steps is to get ~5,000 articles from an online outlet.
I'm a beginner programmer, so please be kind. I've found a site that is very easy to scrape from, in terms of URL structure - I just need a scraper that can take an entire article from a site (we will be analyzing the articles in bulk, with AI).
The div containing the article text is the same across the entire site: "col-md-12 description-content-wrap".
Does anyone know a simple Python script that would go through a CSV of URLs, pull the text from the div listed above for each article, and output it as plain text? I've found a few solutions, but none are 100% what I need.
Ideally all 5,000 articles would be output in one file, but if they need to be separate, that's fine too. Thanks in advance!
I did something a little bit similar to this about a week ago. Here is the code that I came up with.
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
from pandas import DataFrame

# collect all CNBC finance links from the landing page
resp = urllib.request.urlopen("https://www.cnbc.com/finance/")
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))

substring = 'https://www.cnbc.com/'
links = []
for link in soup.find_all('a', href=True):
    if link['href'].find(substring) == 0:
        links.append(link['href'])

# convert the list to a data frame and name the column
df = DataFrame(links, columns=['review'])

# score each entry with VADER sentiment
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
df['sentiment'] = df['review'].apply(lambda x: sid.polarity_scores(x))

def convert(x):
    if x < 0:
        return "negative"
    elif x > .2:
        return "positive"
    else:
        return "neutral"

df['result'] = df['sentiment'].apply(lambda x: convert(x['compound']))

df_final = pd.merge(df['review'], df['result'], left_index=True, right_index=True)
df_final.to_csv('C:\\Users\\ryans\\OneDrive\\Desktop\\out.csv')
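That said, for the exact task in the question (a CSV of URLs, pulling the text out of the col-md-12 description-content-wrap div, one combined plain-text output), a minimal sketch along these lines should work; the urls.csv file with a url column and the output path are assumptions:

import csv
import requests
from bs4 import BeautifulSoup

# assumed input: a CSV file with a column named "url"
with open('urls.csv', newline='', encoding='utf-8') as f:
    urls = [row['url'] for row in csv.DictReader(f)]

with open('articles.txt', 'w', encoding='utf-8') as out:
    for url in urls:
        resp = requests.get(url, timeout=30)
        soup = BeautifulSoup(resp.text, 'html.parser')
        # per the question, the article body lives in this div on every page
        div = soup.select_one('div.col-md-12.description-content-wrap')
        if div is None:
            continue  # skip pages that don't match the expected layout
        out.write(div.get_text(separator='\n', strip=True))
        out.write('\n\n' + '=' * 80 + '\n\n')  # separator between articles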

Specify amount of Tweets Tweepy returns

I am building a word cloud from public Tweets. I have connected to the API via Tweepy and have successfully gotten it to return Tweets related to my search term, but for some reason I can only get it to return 15 Tweets.
import pandas as pd

# subject of the word cloud
search_term = 'ENTER SEARCH TERM HERE'

# data frame with the user id, screen name and tweet text for our search term
# (`api` is the authenticated tweepy.API instance created earlier)
df = pd.DataFrame(
    [tweet.user.id, tweet.user.name, tweet.text]
    for tweet in api.search(q=search_term, lang="en")
)

# renaming the columns of the data frame
df.rename(columns={0: 'user id', 1: 'screen name', 2: 'text'}, inplace=True)
df
By default, the standard search API that API.search uses returns up to 15 Tweets per page.
You need to specify the count parameter, up to a maximum of 100, if you want to retrieve more per request.
If you want more than 100 or a guaranteed amount, you'll need to look into paginating using tweepy.Cursor.
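For example, a sketch reusing api and search_term from the question, and assuming Tweepy 3.x, where the method is API.search (in Tweepy 4.x it was renamed to API.search_tweets):

import tweepy
import pandas as pd

# up to 100 Tweets in a single request via the count parameter
results = api.search(q=search_term, lang="en", count=100)

# for more than 100, paginate with tweepy.Cursor; .items(n) caps the total collected
tweets = [
    [tweet.user.id, tweet.user.name, tweet.text]
    for tweet in tweepy.Cursor(api.search, q=search_term, lang="en", count=100).items(500)
]
df = pd.DataFrame(tweets, columns=['user id', 'screen name', 'text'])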

Google Trends Category Search

I'm trying to extract/download Google Trends Series Data by category and/or subcategory with Python based on this list in the following link: https://github.com/pat310/google-trends-api/wiki/Google-Trends-Categories
This list of categories contains codes that are used in the (unofficial) API of Google Trends, named pytrends.
However, I'm not able to search only by category, because a keyword/search term is required. In the case below, we have category 47 (Autos & Vehicles) and keywords ['BMW', 'Peugeot'].
import pytrends
from pytrends.request import TrendReq

pytrend = TrendReq(hl='en-US', tz=360)

keywords = ['BMW', 'Peugeot']
pytrend.build_payload(
    kw_list=keywords,
    cat=47,                  # Autos & Vehicles
    timeframe='today 3-m',
    geo='FR',
    gprop='')

data = pytrend.interest_over_time()
data = data.drop(labels=['isPartial'], axis='columns')
image = data.plot(title='BMW vs. Peugeot in the last 3 months on Google Trends')
fig = image.get_figure()
I found this as a possible solution, but I haven't tried it because it's in R:
https://github.com/PMassicotte/gtrendsR/issues/89
I don't know if there is an API that makes it possible to extract series by category while ignoring the keyword/search term; let me know if one exists. I believe an option would be to download the data directly from the Google Trends website, filling in just the category field, like this example where we can see the series for the "Autos & Vehicles" category:
https://trends.google.com/trends/explore?cat=47&date=all&geo=SG
You can search by category alone by passing an empty string in the kw_list array:

keywords = ['']
pytrend.build_payload(kw_list=keywords, cat=47,
                      timeframe='today 3-m', geo='FR', gprop='')
data = pytrend.interest_over_time()
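If this behaves like a regular keyword query, interest_over_time() should return one series for the (empty) keyword plus the isPartial flag; a small follow-up sketch to label and plot it (the shape of the returned frame is an assumption here):

# drop the isPartial flag and give the remaining keyword column a readable name
data = data.drop(columns=['isPartial'])
data.columns = ['Autos & Vehicles (cat 47)']
data.plot(title='Autos & Vehicles category on Google Trends, last 3 months')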

Reworking Python for loop to ETL Workflow - Using Luigi, Airflow, etc

I'm currently experimenting with different Python workflow techniques, and I have a nested for loop that I want to convert into an automated workflow. I've been trying to use Luigi, but I am unable to figure out a workflow that takes in two datasets that depend on each other and outputs both as CSV chunks. Every Luigi example I've seen so far takes in data in one step, aggregates, then writes the output.
In my example, I want to bring in a daily NBA scoreboard from the API and store it in a CSV, then use that scoreboard data to bring in the stats (or box score) of each of the games for that day and store those in separate CSVs.
The basic nested for loop that achieves what I want is the following:
import os
import re

import requests
import pandas as pd
from pandas.io.json import json_normalize  # in newer pandas: from pandas import json_normalize

# create the output directories if they don't exist yet
for path in ('data/', 'data/games/', 'data/boxscores/'):
    if not os.path.exists(path):
        os.makedirs(path)

dates = ['2020-03-01', '2020-03-02']  # and so on...

for date in dates:
    print('*** DATE: {}'.format(date))
    date = re.sub('-', '', date)
    print('*** MODIFIED DATE: {}'.format(date))

    # daily scoreboard: one CSV per date
    response = requests.get('http://data.nba.net/json/cms/noseason/scoreboard/{}/games.json'.format(date))
    df = pd.read_json(response.text)
    games_df = json_normalize(df[df.index == 'games']['sports_content'][0]['game'])
    games_df.to_csv('data/games/games_{}.csv'.format(date), index=False)

    # box score: one CSV per game id found in the scoreboard
    for id in games_df.id.tolist():
        print('*** GAME ID: {}'.format(id))
        response = requests.get("http://data.nba.net/10s/prod/v1/{0}/{1}_boxscore.json".format(date, id))
        df = pd.read_json(response.text)
        df = json_normalize(df[df.index == 'activePlayers']['stats'][0])
        boxscore = df[['personId', 'firstName', 'lastName', 'teamId',
                       'min', 'points', 'fgm', 'fga',
                       'ftm', 'fta', 'tpm', 'tpa', 'offReb', 'defReb',
                       'totReb', 'assists', 'pFouls', 'steals', 'turnovers', 'blocks', 'plusMinus']]
        boxscore['min'] = boxscore['min'].str.split(":", expand=True)[0]
        boxscore.to_csv('data/boxscores/boxscore_{}.csv'.format(id), index=False)
I want to avoid bare for loops, and I like how Luigi's framework avoids re-building days that have already been built. Does anyone have suggestions or links on how to build a pipeline like this? I'm open to switching from Luigi to Airflow if it is more intuitive.
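For reference, a minimal Luigi sketch of this two-step dependency might look roughly like the following; the task names, the marker-file pattern for the per-game outputs, and the date handling are illustrative assumptions, not a tested pipeline:

import luigi
import requests
import pandas as pd
from pandas.io.json import json_normalize


class FetchScoreboard(luigi.Task):
    """Download the daily scoreboard and store it as a CSV."""
    date = luigi.Parameter()  # e.g. '20200301'

    def output(self):
        return luigi.LocalTarget('data/games/games_{}.csv'.format(self.date))

    def run(self):
        url = 'http://data.nba.net/json/cms/noseason/scoreboard/{}/games.json'.format(self.date)
        df = pd.read_json(requests.get(url).text)
        games_df = json_normalize(df[df.index == 'games']['sports_content'][0]['game'])
        with self.output().open('w') as f:
            games_df.to_csv(f, index=False)


class FetchBoxscores(luigi.Task):
    """Download one box score per game listed in that day's scoreboard."""
    date = luigi.Parameter()

    def requires(self):
        return FetchScoreboard(date=self.date)

    def output(self):
        # a marker file; the individual box score CSVs are written alongside it
        return luigi.LocalTarget('data/boxscores/_done_{}.txt'.format(self.date))

    def run(self):
        games_df = pd.read_csv(self.input().path)
        for game_id in games_df['id'].tolist():
            url = 'http://data.nba.net/10s/prod/v1/{0}/{1}_boxscore.json'.format(self.date, game_id)
            df = pd.read_json(requests.get(url).text)
            stats = json_normalize(df[df.index == 'activePlayers']['stats'][0])
            # the column filtering from the original loop would go here
            stats.to_csv('data/boxscores/boxscore_{}.csv'.format(game_id), index=False)
        with self.output().open('w') as f:
            f.write('done\n')


if __name__ == '__main__':
    luigi.build([FetchBoxscores(date=d) for d in ['20200301', '20200302']],
                local_scheduler=True)

Because each task declares an output(), Luigi skips dates whose files already exist, which is the re-build avoidance mentioned above.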
