Scraping data with PRAW - How can I improve my code?

I have this code:
import datetime

import pandas as pd

# reddit is an authenticated praw.Reddit instance
posts = []
subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful', 'RenewableEnergy', 'Bitcoin', 'Android', 'programming',
                 'gaming', 'tech', 'google', 'hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft', 'AskTechnology', 'realtech',
                 'homeautomation', 'HomeKit', 'singularity', 'technews', 'Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess',
                 'growmybusiness', 'venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')
for sub_name in subs:
    for submission in reddit.subreddit(sub_name).hot(limit=1):
        date = submission.created
        date = datetime.datetime.fromtimestamp(date)
        if date >= targeted_date and reddit.subreddit(sub_name).subscribers >= 35000:
            posts.append([date, submission.subreddit, reddit.subreddit(sub_name).subscribers,
                          submission.title, submission.selftext])
df = pd.DataFrame(posts, columns=['date', 'subreddit', 'subscribers', 'title', 'text'])
df
Runtime with limit = 16 (~500 rows): 905.9099962711334 s
Which gives me these results:
date subreddit subscribers title text
0 2021-11-08 09:18:22 Bitcoin 3546142 Please upgrade your node to enable Taproot.
1 2021-09-19 17:01:03 homeautomation 1333753 Looking for developers interested in helping t... A while back I opened sourced all of my source...
2 2021-11-11 11:00:17 Entrepreneur 1036934 Thank you Thursday! - November 11, 2021 **Your opportunity to thank the** /r/Entrepren...
3 2021-11-08 01:36:05 oculus 396752 [Weekly] What VR games have you been enjoying ... Welcome to the weekly recommendation thread! :...
4 2021-06-17 19:25:01 microsoft 141810 Microsoft: Official Support Thread Microsoft: Official Support Thread\n\nMicrosof...
5 2021-11-12 11:02:14 investing 1946917 Daily General Discussion and spitballin thread... Have a general question? Want to offer some c...
6 2021-11-12 04:16:13 tech 413040 Mars rover scrapes at rock to 'look at somethi...
7 2021-11-12 12:00:15 wallstreetbets 11143628 Daily Discussion Thread for November 12, 2021 Your daily trading discussion thread. Please k...
8 2021-04-17 14:50:02 singularity 134940 Re: The Discord Link Expired, so here's a new ...
9 2021-11-12 11:40:04 programming 3682438 It's probably time to stop recommending Clean ...
10 2021-09-10 10:26:07 software 149655 What I do/install on every Windows PC - Softwa... Hello, I have to spend a lot of time finding s...
11 2021-11-12 13:00:18 Android 2315799 Daily Superthread (Nov 12 2021) - Your daily t... Note 1. Check [MoronicMondayAndroid](https://o...
12 2021-11-11 23:32:33 CryptoCurrency 3871810 Live Recording: Kevin O’Leary Talks About Cryp...
13 2021-11-02 20:53:21 productivity 874076 Self-promotion/shout out thread This is the place to share your personal blogs...
14 2021-11-12 14:57:19 RenewableEnergy 97364 Northvolt produces first fully recycled batter...
15 2021-11-12 08:00:16 gaming 30936297 Free Talk Friday! Use this post to discuss life, post memes, or ...
16 2021-11-01 05:01:23 startups 884574 Share Your Startup - November 2021 - Upvote Th... [r/startups](https://www.reddit.com/r/startups...
17 2021-11-01 09:00:11 HomeKit 107076 Monthly Buying Megathread - Ask which accessor... Looking for lights, a thermostat, a plug, or a...
18 2021-11-01 13:00:13 dataisbeautiful 16467198 [Topic][Open] Open Discussion Thread — Anybody... Anybody can post a question related to data vi...
19 2021-11-12 12:29:47 technews 339611 Peter Jackson sells visual effects firm for $1...
20 2021-10-07 19:15:14 NFT 221897 Join our official —and the #1 NFT— Discord Ser...
21 2020-12-01 12:11:36 google 1622449 Monthly Discussion and Support Thread - Decemb... Have a question you need answered? A new Googl...
The issue is that it's taking way too much time. As you can see, I set limit = 1 and it takes approx 1 min to run. Yesterday, I set the limit to 300 in order to analyze the data, and it ran for about 2 hours.
My question: Is there a way to change the code organization in order to limit the run time?
The code below used to run way faster, but I wanted to add a subscriber-number column and had to add a second for loop:
posts = []
subs = reddit.subreddit('Futurology+wallstreetbets+DataIsBeautiful+RenewableEnergy+Bitcoin+Android+programming+gaming+tech+google+hardware+oculus+software+startups+linus+microsoft+AskTechnology+realtech+homeautomation+HomeKit+singularity+technews+Entrepreneur+investing+BusinessHub+CareerSuccess+growmybusiness+venturecapital+ladybusiness+productivity+NFT+CryptoCurrency')
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')
for submission in subs.new(limit=500):
    date = submission.created
    date = datetime.datetime.fromtimestamp(date)
    posts.append([date, submission.subreddit, submission.title, submission.selftext])
df = pd.DataFrame(posts, columns=['date', 'subreddit', 'title', 'text'])
df
Runtime with limit = 500 (500 rows): 7.630232095718384 s
I know they aren't doing exactly the same thing, but the only reason I tried to implement this new code was to add the new 'subscribers' column, which seems to work differently from the other calls.
Any suggestions/improvement to suggest?
Last one: does anyone know a way to retrieve a list of all subreddits on a specific subject (such as technology)? I found this page that lists subreddits: https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits/#wiki_technology
Thanks :)

Improving your existing code by reducing conversions and server calls (with explanations at the end):
posts = []
subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful', 'RenewableEnergy', 'Bitcoin', 'Android', 'programming',
                 'gaming', 'tech', 'google', 'hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft', 'AskTechnology', 'realtech',
                 'homeautomation', 'HomeKit', 'singularity', 'technews', 'Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess',
                 'growmybusiness', 'venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))
# convert the target date into epoch format once
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S').timestamp()
for sub_name in subs:
    subscriber_number = reddit.subreddit(sub_name).subscribers
    if subscriber_number < 35000:
        # fewer subscribers than this would have made the original condition False,
        # so skip gathering the posts entirely
        continue
    for submission in reddit.subreddit(sub_name).hot(limit=1):
        date = submission.created  # reddit uses epoch timestamps
        if date >= targeted_date:
            posts.append([date, submission.subreddit, subscriber_number,
                          submission.title, submission.selftext])
df = pd.DataFrame(posts, columns=['date', 'subreddit', 'subscribers', 'title', 'text'])
df
By separating your AND condition, you are able to skip entirely those loops that would have evaluated to false.
Instead of converting each submission's date to a human-readable date inside the for loop, converting the target date once into the epoch format that Reddit uses increases speed: it removes the per-submission conversion operations and turns the check into a simple numeric comparison.
By storing the subscriber count in a variable, you reduce the number of server calls made to retrieve that information; the loop looks the number up in memory instead.
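For your last question: PRAW exposes a subreddit search that you could use to build such a list programmatically. A minimal sketch, assuming the same authenticated reddit instance (the query string and limit are just examples):

# search for subreddits matching a topic; keep only reasonably large ones
tech_subs = []
for subreddit in reddit.subreddits.search("technology", limit=50):
    if subreddit.subscribers and subreddit.subscribers >= 35000:
        tech_subs.append(subreddit.display_name)
print(tech_subs)

Each result is a regular Subreddit object, so you can feed tech_subs straight into the loop above.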

Related

How to find the maximum date value with conditions in python?

I have a three-column dataframe as follows. I want to calculate the three-month return per day for every fund, so I need to get the date with recorded NAV data three months earlier. Should I use the max() function with filter() to deal with this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code   date         NAV
fund 1      2021-01-04   1.0000
fund 1      2021-01-05   1.0001
fund 1      2021-01-06   1.0023
...         ...          ...
fund 2      2020-02-08   1.0000
fund 2      2020-02-09   0.9998
fund 2      2020-02-10   1.0001
...         ...          ...
fund 3      2022-05-04   2.0021
fund 3      2022-05-05   2.0044
fund 3      2022-05-06   2.0305
I tried to combine the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in Excel, I know I could use the following functions to solve the problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with Python, I don't know what to do. I only learnt it three days ago. Please help me.
This picture is what I want if it were in Excel. The yellow area is the original data, the white part is the intermediate calculation, and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use the pct_change(periods=7) function to get the same results as in this picture. But here is the tricky part: the row 7 rows before is not necessarily the data from 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check whether the data used for the division exists first.
What you need is an implementation of the sliding-window maximum (for your example, 1 week / 7 days).
I could recreate your example as follows (to create the data frame you have):
import datetime
from random import randint

import pandas as pd

# build the rows as dicts and construct the frame once
# (DataFrame.append was removed in pandas 2.0)
rows = []
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
for i in range(10):
    rows.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
df = pd.DataFrame(rows, columns=["fund code", "date", "NAV"])
This will look like your example, with non-continuous dates and two different funds.
The sliding-window maximum (for a variable window length in days) looks like this:
from collections import deque

class max_queue:
    def __init__(self, win=7):
        self.win = win
        self.queue = deque()
        self.date = None

    def append(self, date, value):
        # drop smaller values from the back; they can never become the max
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        # drop entries from the front that have fallen out of the window
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the timeframe you are interested in:
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        # reset the window when a new fund starts
        mq = max_queue(7)
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
Results will look like this, with an added max column. This assumes that each fund's rows are contiguous in the dataframe; you could instead keep a separate max_queue per fund (e.g., in a dict) to drop that assumption.
Using a max queue to keep track of only the window maximum gives the correct complexity, O(n), for a solution. That is important if you are dealing with huge datasets, and especially with bigger date ranges (months instead of a week).
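As a side note (this is not from the original answer): if what you ultimately need is the NAV at the latest recorded date at least 91 days earlier, a vectorized sketch with pandas' merge_asof, assuming the df built above, could look like this:

df = df.sort_values("date").reset_index(drop=True)
# for each row, find the most recent row of the same fund dated <= date - 91 days
lookup = df.rename(columns={"date": "past_date", "NAV": "past_NAV"})
target = df.assign(cutoff=df["date"] - pd.Timedelta(days=91))
merged = pd.merge_asof(target.sort_values("cutoff"), lookup.sort_values("past_date"),
                       left_on="cutoff", right_on="past_date",
                       by="fund code", direction="backward")
merged["return_3m"] = merged["NAV"] / merged["past_NAV"] - 1  # NaN where no old data exists

merge_asof does one backward-looking lookup per row, so the existence check the question asks about comes for free: rows without recorded data 91 days back simply get NaN.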

Python - Count number of records occurring between two dates by matching/comparing variables from multiple data frames

I have two data frames: one containing social media (SM) posts, the other containing timestamped responses for survey completion plus the timestamp 60 days prior to each survey completion (inaccurate in the example data frames for the example's sake; I don't think it affects the methods. The column is calculated with -timedelta(days=60) using pd.datetime).
social_media:
post    postedTime            FIPS
'aaa'   2019-02-09 20:28:26   01010
'bbb'   2019-01-11 14:11:30   01010
'ccc'   2017-11-23 09:10:11   99999
survey:
completionTime        60daysprior           FIPS
2019-01-08 11:28:26   2018-12-09 20:28:26   01010
2019-01-04 07:30:21   2018-11-11 14:11:30   01010
2017-07-14 07:30:21   2017-09-23 09:10:11   88888
2019-06-21 11:43:17   2019-04-23 09:10:11   77777
Each row contains a respondent's county FIPS code, and the time stamp for their survey response. Each FIPS may be associated with one or multiple survey respondents. Data was deidentified, so there is no respondent ID, only FIPS and the time stamp. So, I assume there will be double-counting if a county has more than one respondent completing the survey in the same time-frame - this is not a big concern.
My goal is to count the number of SM posts created within the past 60 days of each survey completion date per each respondent's survey completion time. I have the following code which seems to work, but takes a very long time because the datasets are very large.
import pandas as pd
import datetime
import itertools
full_count = []
for i, surveyrow in survey.iterrows():
    count = 0
    for j, SMrow in social_media.iterrows():
        if surveyrow['FIPS'] == SMrow['FIPS']:
            if surveyrow['completionTime'] >= SMrow['postedTime']:
                if surveyrow['60daysprior'] <= SMrow['postedTime']:
                    count += 1
    full_count.append(str(count))
survey['count'] = full_count
What can I do to optimize this code? I am using Python 3 and pandas. Thank you!!
Update: This code improves the speed, but still takes a great deal of time:
survey_fips = survey['FIPS']
social_media_fips = social_media['FIPS']
survey_tm_finish = survey['completionTime']
social_media_postedTime = social_media['postedTime']
survey_tm_finish_minus60 = survey['60daysprior']
full_count = []
for i, j, k in zip(survey_fips, survey_tm_finish, survey_tm_finish_minus60):
    count = 0
    for a, b in zip(social_media_fips, social_media_postedTime):
        if i == a and j >= b and k <= b:
            count += 1
    full_count.append(str(count))
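Since this is essentially a conditional count per survey row, a vectorized sketch (not from the original post; it assumes the survey and social_media frames above, with the date columns already parsed by pd.to_datetime) can replace both loops:

# merge each survey row with all SM posts from the same county, then count
# the posts whose timestamp falls inside that row's 60-day window
merged = survey.reset_index().merge(social_media, on='FIPS', how='left')
in_window = ((merged['postedTime'] >= merged['60daysprior'])
             & (merged['postedTime'] <= merged['completionTime']))
survey['count'] = in_window.groupby(merged['index']).sum()

The trade-off is memory: the merge materializes every (survey row, same-county post) pair, so for very large counties you may want to process one FIPS chunk at a time.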

How to find last 24 hours data from pandas data frame

I have data with two columns: one is description and the other is publishedAt. I applied the sort function on the publishedAt column and got the output in descending order of date. Here is a sample of my data frame:
description publishedAt
13 Bitcoin price has failed to secure momentum in... 2018-05-06T15:22:22Z
16 Brian Kelly, a long-time contributor to CNBC’s... 2018-05-05T15:56:48Z
2 The bitcoin price is less than $100 away from ... 2018-05-05T13:14:45Z
12 Mati Greenspan, a senior analyst at eToro and ... 2018-05-04T16:05:37Z
52 A Singaporean startup developing ‘smart bankno... 2018-05-04T14:02:30Z
75 Cryptocurrencies are set to make a comeback on... 2018-05-03T08:10:19Z
76 The bitcoin price is hovering near its best le... 2018-04-30T16:26:57Z
74 In today’s climate of ICOs with 100 billion to... 2018-04-30T12:03:31Z
27 Investment guru Warren Buffet remains unsold o... 2018-04-29T17:22:19Z
22 The bitcoin price has increased by around $400... 2018-04-28T12:28:35Z
68 Bitcoin futures volume reached an all-time hig... 2018-04-27T16:32:44Z
14 Biotech-company-turned-cryptocurrency-investme... 2018-04-27T14:25:15Z
67 The bitcoin price has rebounded to $9,200 afte... 2018-04-27T06:24:42Z
Now I want the descriptions from the last 3 hours, 6 hours, 12 hours and 24 hours.
How can I find them?
Thanks
As a simple solution within pandas, you can use the DataFrame.last(offset) function. Be sure to set the publishedAt column as the dataframe's DatetimeIndex. A similar function to get rows at the start of a dataframe is DataFrame.first(offset).
Here is an example using the provided offsets:
df.last('24h')
df.last('12h')
df.last('6h')
df.last('3h')
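For instance, the index setup that these calls rely on might look like this (a sketch assuming the sample frame above, with publishedAt stored as ISO strings; note that last() slices relative to the final index entry, not the current time, and is deprecated in recent pandas in favor of explicit slicing):

import pandas as pd

# parse the timestamps and make them a sorted DatetimeIndex
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df = df.set_index('publishedAt').sort_index()
print(df.last('24h')['description'])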
Assuming that the dataframe is called df and that publishedAt has already been converted to datetimes with pd.to_datetime:
import datetime as dt
df[df['publishedAt'] >= (dt.datetime.now() - dt.timedelta(hours=3))]['description']  # hours = 6, 12, 24
If you need the intervals exclusive, i.e. the descriptions within the last 6 hours but not the ones within 3 hours, you'll need array-like logical operators from numpy, like numpy.logical_and(arr1, arr2), in the first bracket.
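A sketch of that exclusive-interval variant (my reading of the suggestion above, not code from the original answer):

import datetime as dt
import numpy as np

now = dt.datetime.now()
# posts from more than 3 but at most 6 hours ago
mask = np.logical_and(df['publishedAt'] >= now - dt.timedelta(hours=6),
                      df['publishedAt'] < now - dt.timedelta(hours=3))
df[mask]['description']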

I need help making pandas perform better with dataframe interactions

I'm a newbie and have been studying pandas for a few days, and started my first project with it. I wanted to use it to create a product stock prediction timeline for the current month.
Basically, I get the stock and predicted daily reduction and trace a line from today to the end of the month with the predicted stock. Also, if there is a purchase order to be delivered on day XYZ, I add the delivery amount on that day.
I have a dataframe that contains the stock for today and the predicted daily reduction for this month:
ITEM STOCK DAILY_DEDUCTION
A 1000 20
B 2000 15
C 800 8
D 10000 100
And another dataframe that contains pending purchase orders and amount that will be delivered.
ITEM DATE RECEIVING_AMOUNT
A 2018-05-16 20
B 2018-05-23 15
A 2018-05-17 8
D 2018-05-29 100
I created this loop to iterate through the dataframe and do the following:
subtract the DAILY_DEDUCTION for the item
if the date is the same as a purchase order date, then add the RECEIVING_AMOUNT
df_dates = pd.date_range(start=today, end=endofmonth, freq='D')
temptable = []
for row in df_stock.itertuples(index=True):
    predicted_stock = getattr(row, "STOCK")
    item = getattr(row, "ITEM")
    for date in df_dates:
        date_format = date.strftime('%Y-%m-%d')
        predicted_stock = predicted_stock - getattr(row, "DAILY_DEDUCTION")
        order_qty = df_purchase_orders.loc[(df_purchase_orders['DATE'] == date_format)
                                           & (df_purchase_orders['ITEM'] == item), 'RECEIVING_AMOUNT']
        if len(order_qty.index) > 0:
            predicted_stock = predicted_stock + order_qty.item()
        lista = [date_format, item, int(predicted_stock)]
        temptable.append(lista)
And... well, it did the job, but it's quite slow. I run this on 100k rows, give or take, and was hoping to find some insight on how I can solve this problem in a way that performs better.
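One way to avoid the per-row lookups entirely is to build the whole timeline with a cross join and cumulative sums. A hedged sketch under the question's names (df_stock, df_purchase_orders, today, endofmonth); it pre-aggregates orders so multiple deliveries for the same item and day do not duplicate rows:

import pandas as pd

dates = pd.date_range(start=today, end=endofmonth, freq='D')
# one row per (item, day)
timeline = df_stock.merge(pd.DataFrame({'DATE': dates}), how='cross')
# cumulative deduction: day 1 removes one DAILY_DEDUCTION, day 2 removes two, ...
elapsed = timeline.groupby('ITEM').cumcount() + 1
timeline['PREDICTED'] = timeline['STOCK'] - elapsed * timeline['DAILY_DEDUCTION']
# deliveries: sum orders per (item, day), align onto the timeline, add cumulatively
orders = (df_purchase_orders
          .assign(DATE=pd.to_datetime(df_purchase_orders['DATE']))
          .groupby(['ITEM', 'DATE'], as_index=False)['RECEIVING_AMOUNT'].sum())
received = timeline.merge(orders, on=['ITEM', 'DATE'], how='left')['RECEIVING_AMOUNT'].fillna(0)
timeline['PREDICTED'] += received.groupby(timeline['ITEM']).cumsum()

This should produce the same [date, item, predicted_stock] rows as temptable, computed in a few vectorized passes instead of a Python double loop.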

Calculating customer lifetime using Pandas

I'm performing a Cohort analysis using python, and I am having trouble creating a new column that sums up the total months a user has stayed with us.
I know the math behind the answer, all I have to do is:
subtract the year when they canceled our service from when they started it
Multiply that by 12.
Subtract the month when they canceled our service from when they started it.
Add those two numbers together.
So in Excel, it looks like this:
=(YEAR(C2)-YEAR(B2))*12+(MONTH(C2)-MONTH(B2))
C is the date the customer canceled, and B is when they started.
The problem is that I am very new to Python and pandas, and I am having trouble translating that function into Python.
What I have tried so far:
df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
df.head()
It returns the error 'Series' object is not callable, and I have a general understanding of what that means.
I then tried:
def LTVCalc(Plan_Start_Date, Plan_Cancel_Date):
    df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
    df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
    df.head()
But that didn't add the Column 'Lifetime' to the DataFrame.
Anyone able to help a rookie?
I think you need to first convert with to_datetime and then use dt.year and dt.month:
df = pd.DataFrame({
    'Plan_Cancel_Date': ['2018-07-07', '2019-03-05', '2020-10-08'],
    'Plan_Start_Date': ['2016-02-07', '2017-01-05', '2017-08-08']
})
#print (df)
#if necessary convert to datetimes
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year)*12 +
df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month)
print (df)
Plan_Cancel_Date Plan_Start_Date Lifetime
0 2018-07-07 2016-02-07 29
1 2019-03-05 2017-01-05 26
2 2020-10-08 2017-08-08 38
