I have this code:
posts = []
subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful','RenewableEnergy', 'Bitcoin', 'Android', 'programming',
'gaming','tech', 'google','hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft', 'AskTechnology', 'realtech',
'homeautomation', 'HomeKit','singularity', 'technews','Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess',
'growmybusiness','venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')
for sub_name in subs:
    for submission in reddit.subreddit(sub_name).hot(limit = 1):
        date = submission.created
        date = datetime.datetime.fromtimestamp(date)
        if date >= targeted_date and reddit.subreddit(sub_name).subscribers >= 35000:
            posts.append([date, submission.subreddit, reddit.subreddit(sub_name).subscribers,
                          submission.title, submission.selftext])
df = pd.DataFrame(posts, columns = ['date', 'subreddit','subscribers', 'title', 'text'])
df
Runtime with limit = 16 (~500 rows): 905.9099962711334 s
Which gives me these results:
date subreddit subscribers title text
0 2021-11-08 09:18:22 Bitcoin 3546142 Please upgrade your node to enable Taproot.
1 2021-09-19 17:01:03 homeautomation 1333753 Looking for developers interested in helping t... A while back I opened sourced all of my source...
2 2021-11-11 11:00:17 Entrepreneur 1036934 Thank you Thursday! - November 11, 2021 **Your opportunity to thank the** /r/Entrepren...
3 2021-11-08 01:36:05 oculus 396752 [Weekly] What VR games have you been enjoying ... Welcome to the weekly recommendation thread! :...
4 2021-06-17 19:25:01 microsoft 141810 Microsoft: Official Support Thread Microsoft: Official Support Thread\n\nMicrosof...
5 2021-11-12 11:02:14 investing 1946917 Daily General Discussion and spitballin thread... Have a general question? Want to offer some c...
6 2021-11-12 04:16:13 tech 413040 Mars rover scrapes at rock to 'look at somethi...
7 2021-11-12 12:00:15 wallstreetbets 11143628 Daily Discussion Thread for November 12, 2021 Your daily trading discussion thread. Please k...
8 2021-04-17 14:50:02 singularity 134940 Re: The Discord Link Expired, so here's a new ...
9 2021-11-12 11:40:04 programming 3682438 It's probably time to stop recommending Clean ...
10 2021-09-10 10:26:07 software 149655 What I do/install on every Windows PC - Softwa... Hello, I have to spend a lot of time finding s...
11 2021-11-12 13:00:18 Android 2315799 Daily Superthread (Nov 12 2021) - Your daily t... Note 1. Check [MoronicMondayAndroid](https://o...
12 2021-11-11 23:32:33 CryptoCurrency 3871810 Live Recording: Kevin O’Leary Talks About Cryp...
13 2021-11-02 20:53:21 productivity 874076 Self-promotion/shout out thread This is the place to share your personal blogs...
14 2021-11-12 14:57:19 RenewableEnergy 97364 Northvolt produces first fully recycled batter...
15 2021-11-12 08:00:16 gaming 30936297 Free Talk Friday! Use this post to discuss life, post memes, or ...
16 2021-11-01 05:01:23 startups 884574 Share Your Startup - November 2021 - Upvote Th... [r/startups](https://www.reddit.com/r/startups...
17 2021-11-01 09:00:11 HomeKit 107076 Monthly Buying Megathread - Ask which accessor... Looking for lights, a thermostat, a plug, or a...
18 2021-11-01 13:00:13 dataisbeautiful 16467198 [Topic][Open] Open Discussion Thread — Anybody... Anybody can post a question related to data vi...
19 2021-11-12 12:29:47 technews 339611 Peter Jackson sells visual effects firm for $1...
20 2021-10-07 19:15:14 NFT 221897 Join our official —and the #1 NFT— Discord Ser...
21 2020-12-01 12:11:36 google 1622449 Monthly Discussion and Support Thread - Decemb... Have a question you need answered? A new Googl...
The issue is that it takes far too long. As you can see, I set limit = 1 and it takes approximately 1 minute to run. Yesterday I set the limit to 300 in order to analyze the data, and it ran for about 2 hours.
My question: Is there a way to reorganize the code in order to reduce the run time?
The code below used to run much faster, but I wanted to add a column with the subscriber count, and had to add a second for loop:
posts = []
subs = reddit.subreddit('Futurology+wallstreetbets+DataIsBeautiful+RenewableEnergy+Bitcoin+Android+programming+gaming+tech+google+hardware+oculus+software+startups+linus+microsoft+AskTechnology+realtech+homeautomation+HomeKit+singularity+technews+Entrepreneur+investing+BusinessHub+CareerSuccess+growmybusiness+venturecapital+ladybusiness+productivity+NFT+CryptoCurrency')
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')
for subreddit in subs.new(limit = 500):
    date = subreddit.created
    date = datetime.datetime.fromtimestamp(date)
    posts.append([date, subreddit.subreddit, subreddit.title, subreddit.selftext])
df = pd.DataFrame(posts, columns = ['date', 'subreddit', 'title', 'text'])
df
Runtime with limit = 500 (500 rows): 7.630232095718384 s
I know they aren't doing exactly the same thing, but the only reason I tried this new code is to add the new 'subscribers' column, which seems to work differently from the other calls.
Any suggestions or improvements?
One last thing: does anyone know a way to retrieve a list of all subreddits on a specific subject (such as technology)? I found this page that lists subreddits: https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits/#wiki_technology
Thanks :)
Improving your existing code by reducing conversions and server calls (with explanations at the end):
posts = []
subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful','RenewableEnergy', 'Bitcoin', 'Android', 'programming',
'gaming','tech', 'google','hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft', 'AskTechnology', 'realtech',
'homeautomation', 'HomeKit','singularity', 'technews','Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess',
'growmybusiness','venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))
# convert target date into epoch format
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S').timestamp()
for sub_name in subs:
    subscriber_number = reddit.subreddit(sub_name).subscribers
    if subscriber_number < 35000: # if the subscribers are less than this skip gathering the posts as this would have resulted in false originally
        continue
    for submission in reddit.subreddit(sub_name).hot(limit = 1):
        date = submission.created # reddit uses epoch time timestamps
        if date >= targeted_date:
            posts.append([date, submission.subreddit, subscriber_number,
                          submission.title, submission.selftext])
df = pd.DataFrame(posts, columns = ['date', 'subreddit','subscribers', 'title', 'text'])
df
By separating the two halves of your AND condition, you skip the inner loop entirely for any subreddit that would have failed the check anyway.
Instead of converting each submission's date to a human-readable datetime inside the for loop, convert the target date once into the epoch format that Reddit uses. This removes the per-post conversion and leaves only a comparison between two numbers.
By storing the number of subscribers in a variable, you reduce the number of calls needed to retrieve that information; later uses read the number from memory.
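To make the three points above concrete without needing Reddit credentials, here is a minimal sketch. It does not use PRAW at all: fetch_subscribers is a hypothetical stand-in for the slow lookup, and the submissions dictionary is made-up sample data. The only thing it is meant to show is the shape of the loop and how many slow lookups it triggers.

```python
import datetime

lookups = {"count": 0}  # counts how often the slow lookup runs

def fetch_subscribers(sub_name):
    """Hypothetical stand-in for a network request such as reading .subscribers."""
    lookups["count"] += 1
    return {"gaming": 30936297, "tinysub": 1200}[sub_name]

# Made-up sample data: subreddit -> list of (created epoch seconds, title)
submissions = {
    "gaming": [(1636704000.0, "Free Talk Friday!"), (1500000000.0, "An old post")],
    "tinysub": [(1636704000.0, "Hello")],
}

# Convert the target date to epoch seconds once, outside the loops
target = datetime.datetime.strptime(
    "01-09-19 12:00:00", "%d-%m-%y %H:%M:%S"
).timestamp()

posts = []
for sub_name, items in submissions.items():
    subscribers = fetch_subscribers(sub_name)  # one lookup per subreddit
    if subscribers < 35000:
        continue  # skip the whole subreddit before touching its posts
    for created, title in items:
        if created >= target:  # plain number comparison, no conversion
            posts.append([created, sub_name, subscribers, title])

print(posts)
print("slow lookups:", lookups["count"])
```

With two subreddits the slow lookup runs twice, regardless of how many posts each subreddit returns.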
I am trying to import an Excel file whose headers are on the second row:
selected_columns=['Student name','Age','Faculty']
data = pd.read_excel(path_in + 'Results\\' + 'Survey_data.xlsx', header = 1,usecols = selected_columns).rename(columns={'Student Name':'First name'}).drop_duplicates()
Currently, the Excel file looks something like this:
Student name Surname Faculty Major Scholarship Age L1 TFM Date Failed subjects
Ana Ruiz Economics Finance N 20 N 0
Linda Peterson Mathematics Mathematics Y 22 N 2021-12-04 0
Gregory Olsen Engineering Industrial Engineering N 21 N 0
Ana Watson Business Marketing N 22 N 0
I have tried including the last column in the selected_columns list, but it returns the same error. I would greatly appreciate it if someone could tell me why Python is not reading all the lines.
Thanks in advance.
I have multiple .log files that look similar to the example down below. I can't seem to import them correctly in python. If I use space as delimiter then most of the columns break into multiple ones. The first row shows the names of the columns I want to add. I tried turning the log files into .csv but it did not help.
pd.read_csv("A.log",delimiter=r"\s+")
Date Time Employee Floor Department Field
12/1/2020 08:03:10.429 Engineer(LSA_800 90) 0 Service Mechanichal engineering, electrical engineering, telecomunication
12/1/2020 08:03:10.642 Engineer(LSA_800 50) 2 Service Civil engineering
12/1/2020 08:03:10.674 Assistant(Junior Postion) 0 Administration Administration %
12/1/2020 08:03:10.856 Assistant(Senior Position) 2 Administration Administration %
12/1/2020 08:03:10.901 Project Manager(Senior Position) 3 Project Management USA Project management PR Communication
This will preprocess the file into an acceptable CSV format:
columns = [1,11,25,74,78,131]
colpairs = [(a-1,b-1) for a,b in zip(columns,columns[1:]+[999])]
for ln in open('log.txt'):
    parts = [ ln[a:b].rstrip() for a,b in colpairs ]
    print( '"' + '","'.join(parts) + '"' )
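A variation on the same idea, as a sketch only: slice each line at fixed offsets, but let the csv module do the quoting so that fields containing commas or quotes are escaped properly. The offsets and padding widths below are assumptions made for a small three-column sample, not the real positions in your log files.

```python
import csv
import io

# Hypothetical 0-based start offsets for three fixed-width columns
starts = [0, 10, 24]
spans = list(zip(starts, starts[1:] + [None]))  # last column runs to end of line

# Build a small fixed-width sample (the padding widths are assumptions)
records = [
    ("Date", "Time", "Employee"),
    ("12/1/2020", "08:03:10.429", "Engineer(LSA_800 90)"),
    ("12/1/2020", "08:03:10.674", "Assistant(Junior Postion)"),
]
raw = "\n".join(d.ljust(10) + t.ljust(14) + e for d, t, e in records)

out = io.StringIO()
writer = csv.writer(out)  # handles quoting and escaping for us
for line in raw.splitlines():
    writer.writerow([line[a:b].strip() for a, b in spans])

csv_text = out.getvalue()
print(csv_text)
```

For real files you would replace the in-memory sample with an open file handle and adjust starts to the actual column positions.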
I am trying to read the messages from my dataset, but the values under the class label are not read the way they appear in the CSV file.
messages = pandas.read_csv('bitcoin_reddit.csv', delimiter='\t',
                           names=["title","class"])
print (messages)
Under the class label, pandas reads only NaN.
A sample of my CSV file:
title,url,timestamp,class
"It's official! 1 Bitcoin = $10,000 USD",https://v.redd.it/e7io27rdgt001,29/11/2017 17:25,0
The last 3 months in 47 seconds.,https://v.redd.it/typ8fdslz3e01,4/2/2018 18:42,0
It's over 9000!!!,https://i.imgur.com/jyoZGyW.gifv,26/11/2017 20:55,1
Everyone who's trading BTC right now,http://cdn.mutually.com/wp-content/uploads/2017/06/08-19.jpg,7/1/2018 12:38,1
I hope James is doing well,https://i.redd.it/h4ngqma643101.jpg,1/12/2017 1:50,1
Weeeeeeee!,https://i.redd.it/iwl7vz69cea01.gif,17/1/2018 1:13,0
Bitcoin.. The King,https://i.redd.it/4tl0oustqed01.jpg,1/2/2018 5:46,1
Nothing can increase by that much and still be a good investment.,https://i.imgur.com/oWePY7q.jpg,14/12/2017 0:02,1
"This is why I want bitcoin to hit $10,000",https://i.redd.it/fhzsxgcv9nyz.jpg,18/11/2017 18:25,1
Bitcoin Doesn't Give a Fuck.,https://v.redd.it/ty2y74gawug01,18/2/2018 15:19,-1
Working Hard or Hardly Working?,https://i.redd.it/c2o6204tvc301.jpg,12/12/2017 12:49,1
The separator in your csv file is a comma, not a tab. And since , is the default, there is no need to define it.
However, names= defines custom names for the columns. Your header already provides these names, so passing the column names you are interested in to usecols is all you need then:
>>> pd.read_csv(file, usecols=['title', 'class'])
title class
0 It's official! 1 Bitcoin = $10,000 USD 0
1 The last 3 months in 47 seconds. 0
2 It's over 9000!!! 1
3 Everyone who's trading BTC right now 1
4 I hope James is doing well 1
5 Weeeeeeee! 0
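If you want to see the delimiter problem in isolation, here is a small sketch that uses only the standard library csv module instead of pandas (that swap is my choice, to keep the example dependency-free). It parses two of your rows with a tab delimiter and then with the default comma.

```python
import csv
import io

sample = """title,url,timestamp,class
"It's official! 1 Bitcoin = $10,000 USD",https://v.redd.it/e7io27rdgt001,29/11/2017 17:25,0
It's over 9000!!!,https://i.imgur.com/jyoZGyW.gifv,26/11/2017 20:55,1
"""

# With a tab delimiter every line stays in one piece, so there is
# nothing left over to fill a second column such as "class".
tab_rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
print([len(row) for row in tab_rows])

# With the default comma delimiter each line splits into four fields.
comma_rows = list(csv.reader(io.StringIO(sample)))
print([len(row) for row in comma_rows])

# Picking columns by header name, similar in spirit to usecols.
picked = [(row["title"], row["class"])
          for row in csv.DictReader(io.StringIO(sample))]
print(picked)
```

The quoted title keeps its embedded comma when the file is parsed with the comma delimiter, which is why the quotes are there in the first place.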
I have a data structure, we'll call it an inventory, in CSV that looks similar to:
ResID,Building,Floor,Room,Resource
1.1.1.1,Central Park,Ground,Admin Office,Router
1.1.2.1,Central Park,Ground,Machine Closet,Router
1.3.1.1,Central Park,Mezzanine,Dungeon,Whip
2.1.3.1,Chicago,Roof,Pidgeon Nest,Weathervane
1.13.4.1,Central Park,Secret/Hidden Floor,c:\room,Site-to-site VPN for 1.1.1.1
1.2.1.1,Central Park,Balcony,Restroom,TP
And I am trying to get it to output in a sorted CSV, and in the format of a text file following the format:
1 Central Park
1.1 Ground
1.1.1 Admin Office
1.1.1.1 Router
1.1.2 Machine Closet
1.1.2.1 Router
1.2 Balcony
1.2.1 Restroom
1.2.1.1 TP
1.3 Mezzanine
1.3.1 Dungeon
1.3.1.1 Whip
1.13 Secret/Hidden Floor
1.13.4 c:\room
1.13.4.1 Site-to-site VPN for 1.1.1.1
2 Chicago
2.1 Roof
2.1.3 Pidgeon Nest
2.1.3.1 Weathervane
I envision a data structure similar to:
Building = {
    1 : 'Central Park',
    2 : 'Chicago'
}
Floor = {
    1 : {
        1 : 'Ground',
        2 : 'Balcony',
        3 : 'Mezzanine',
        13: 'Secret/Hidden Floor'
    },
    2 : {
        1 : 'Roof'
    }
}
Room = {
    1 : {
        1 : {
            1 : 'Admin Office',
            2 : 'Machine Closet'
        },
        2 : {
            1 : 'Restroom'
        },
        3 : {
            1 : 'Dungeon'
        }
... Hopefully by now you get the idea.
My complication is that I do not know if this is the best way to represent the data and then iterate over it as:
for buildingID in buildings:
    for floorID in floors[buildingID]:
        for roomID in rooms[buildingID][floorID]:
            for resource in resources[buildingID][floorID][roomID]:
                do stuff...
Or if there is a MUCH more sane way to represent the data in script, but I need the full document heading numbers AND names intact, and this is the only way I could visualize to do it at my skill level.
I am also at a loss for an effective way to generate this information and build it into the data structure from a CSV in this format.
This may seem trivial to some, but I am not a programmer by trade, and really only dabble on an infrequent basis.
My ultimate goal is to be able to ingest the CSV into a sane data structure, sort it appropriately in ascending numerical order, and generate line entries in the text structure shown above, listing each building, floor, room, and resource only once and in context with each other. After that it should be trivial for me to handle the output to text or back to sorted CSV.
Any recommendations would be GREATLY appreciated.
EDIT: SOLUTION
Leveraging my accepted answer below I was able to generate the following code. Thank you to the guy that deleted his answer and comments that simplified my sorting process too!
import csv
def getPaddedKey(line):
    # zero-pad every part of the ID so that a string sort matches numeric order
    keyparts = line[0].split(".")
    keyparts = map(lambda x: x.rjust(5, '0'), keyparts)
    return '.'.join(keyparts)
def outSortedCSV(reader):
    with open(fOutName, 'w') as fOut:
        writer = csv.writer(fOut, delimiter=',')
        head = next(reader)
        writer.writerow(head)
        writer.writerows(sorted(reader, key=getPaddedKey))
s = set()
fInName = 'fIn.csv'
fOutName = 'fOut.csv'
with open(fInName, 'r') as fIn:
    reader = csv.reader(fIn, delimiter=',')
    outSortedCSV(reader)
    fIn.seek(0)
    next(fIn)
    for row in reader:
        ids = row[0].split('.') # split the id
        for i in range(1, 5):
            s.add(('.'.join(ids[:i]), row[i])) # add a tuple with initial part of id and name
for e in sorted(list(s), key=getPaddedKey):
    print(e[0] + ' ' + e[1])
If you have no reason to build your proposed structure, you could simply add for each line the building, floor, room and resource along with its id to a set (to automatically eliminate duplicates). Then you convert the set to a list, sort it and you are done.
Possible Python code, assuming rd is a csv.reader on the inventory (*):
next(rd) # skip the headers line
s = set()
for row in rd:
    ids = row[0].split('.') # split the id
    for i in range(1, 5):
        s.add(('.'.join(ids[:i]), row[i])) # add a tuple with initial part of id and name
l = list(s) # convert to a list
l.sort(key=lambda t: [int(p) for p in t[0].split('.')]) # sort numerically, so that 1.2 comes before 1.13
You now have a list of 2-tuples [('1', 'Central Park'), ('1.1', 'Ground'), ('1.1.1', 'Admin Office'), ...]. You can use it to build a new csv or just print it as text:
for i in l:
    print(" ".join(i))
(*) In Python 3, you would use:
with open(inventory_path, newline = '') as fd:
    rd = csv.reader(fd)
    ...
while in Python 2, it would be:
with open(inventory_path, "rb") as fd:
    rd = csv.reader(fd)
    ...
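Putting the pieces together, here is a self-contained sketch of the set-based approach for Python 3. It reads a shortened copy of the inventory from an in-memory string (an assumption, so the example runs without a file) and sorts on the integer parts of each ID, because a plain string sort would place 1.13 before 1.2. The helper names numeric_key and outline are mine.

```python
import csv
import io

# Shortened copy of the inventory, held in memory for the example
inventory = """ResID,Building,Floor,Room,Resource
1.1.1.1,Central Park,Ground,Admin Office,Router
1.13.4.1,Central Park,Secret/Hidden Floor,c:\\room,Site-to-site VPN for 1.1.1.1
1.2.1.1,Central Park,Balcony,Restroom,TP
2.1.3.1,Chicago,Roof,Pidgeon Nest,Weathervane
"""

def numeric_key(entry):
    """Sort key: the ID split into integers, so 1.2 sorts before 1.13."""
    return [int(part) for part in entry[0].split(".")]

def outline(rows):
    """Collect unique (partial id, name) pairs and return them sorted."""
    seen = set()
    for row in rows:
        ids = row[0].split(".")
        for depth in range(1, 5):
            seen.add((".".join(ids[:depth]), row[depth]))
    return sorted(seen, key=numeric_key)

reader = csv.reader(io.StringIO(inventory))
next(reader)  # skip the header line
entries = outline(reader)
for res_id, name in entries:
    print(res_id, name)
```

The set removes the repeated building and floor entries, so each heading is printed once.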
extract the id's
ids = ['Building_id', 'Floor_id', 'Room_id', 'Resource_id']
labels = ['ResID', 'Building', 'Floor', 'Room', 'Resource']
df2 = df.join(pd.DataFrame(list(df['ResID'].str.split('.')), columns=ids))
df2
ResID Building Floor Room Resource Building_id Floor_id Room_id Resource_id
0 1.1.1.1 Central Park Ground Admin Office Router 1 1 1 1
1 1.1.2.1 Central Park Ground Machine Closet Router 1 1 2 1
2 1.3.1.1 Central Park Mezzanine Dungeon Whip 1 3 1 1
3 2.1.3.1 Chicago Roof Pidgeon Nest Weathervane 2 1 3 1
4 1.13.4.1 Central Park Secret/Hidden Floor c:\room Site-to-site VPN for 1.1.1.1 1 13 4 1
5 1.2.1.1 Central Park Balcony Restroom TP 1 2 1 1
iterate over this
little helper method
def pop_list(list_):
    while list_:
        yield list_[-1], list_.copy()
        list_.pop()
for (id_, remaining_ids), (label, remaining_labels) in zip(pop_list(ids), pop_list(labels)):
    print(label, ': ', df2.groupby(remaining_ids)[label].first())
returns
Resource : Building_id Floor_id Room_id Resource_id
1 1 1 1 Router
2 1 Router
13 4 1 Site-to-site VPN for 1.1.1.1
2 1 1 TP
3 1 1 Whip
2 1 3 1 Weathervane
Name: Resource, dtype: object
Room : Building_id Floor_id Room_id
1 1 1 Admin Office
2 Machine Closet
13 4 c:\room
2 1 Restroom
3 1 Dungeon
2 1 3 Pidgeon Nest
Name: Room, dtype: object
Floor : Building_id Floor_id
1 1 Ground
13 Secret/Hidden Floor
2 Balcony
3 Mezzanine
2 1 Roof
Name: Floor, dtype: object
Building : Building_id
1 Central Park
2 Chicago
Name: Building, dtype: object
Explanation
for (id_, remaining_ids), (label, remaining_labels) in zip(pop_list(ids), pop_list(labels)):
    print((id_, remaining_ids), (label, remaining_labels))
returns
('Resource_id', ['Building_id', 'Floor_id', 'Room_id', 'Resource_id']) ('Resource', ['ResID', 'Building', 'Floor', 'Room', 'Resource'])
('Room_id', ['Building_id', 'Floor_id', 'Room_id']) ('Room', ['ResID', 'Building', 'Floor', 'Room'])
('Floor_id', ['Building_id', 'Floor_id']) ('Floor', ['ResID', 'Building', 'Floor'])
('Building_id', ['Building_id']) ('Building', ['ResID', 'Building'])
So this just iterates over the different levels in your building structure
res = df2.groupby(remaining_ids)[label].first()
builds, for each level in your structure, a Series of the items at that level, indexed by a (Multi)Index of the nested IDs down to that level. This is the information you want for your eventual data structure; it just needs to be transformed into a nested dict.
Building_id Floor_id
1 1 Ground
13 Secret/Hidden Floor
2 Balcony
3 Mezzanine
2 1 Roof
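One possible way to do that transformation, sketched in plain Python. The input here is a hand-written dict with tuple keys that mirrors the Floor level shown above; how you obtain such a dict from res is left open, and the function name nest is hypothetical.

```python
# Hand-written flat mapping for the Floor level;
# the tuple keys are (Building_id, Floor_id) as strings
floors_flat = {
    ("1", "1"): "Ground",
    ("1", "13"): "Secret/Hidden Floor",
    ("1", "2"): "Balcony",
    ("1", "3"): "Mezzanine",
    ("2", "1"): "Roof",
}

def nest(flat):
    """Turn {(id1, id2, ...): name} into nested dicts keyed by integer IDs."""
    nested = {}
    for key, name in flat.items():
        level = nested
        for part in key[:-1]:
            # walk down, creating intermediate dicts as needed
            level = level.setdefault(int(part), {})
        level[int(key[-1])] = name
    return nested

floors = nest(floors_flat)
print(floors)
```

Converting the keys to int is a choice on my part so that later sorting is numeric; keep them as strings if you prefer to match the IDs exactly as they appear in ResID.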
to text (no nesting)
res.index = res.index.to_series().apply('.'.join)
print(res)
1.1 Ground
1.13 Secret/Hidden Floor
1.2 Balcony
1.3 Mezzanine
2.1 Roof
Name: Floor, dtype: object