Python snscrape: How to scrape tweet URL/link using snscrape?

What parameter should we use to include the URL/link of tweets? I have here the date, username, and content. Another question also is how can we transform the date in the dataframe into GMT+8? The timezone is in UTC. Please see code below for reference:
import snscrape.modules.twitter as sntwitter
import pandas as pd

query = "(from:elonmusk) until:2023-01-28 since:2023-01-27"
tweets = []
limit = 100000

for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(tweets) == limit:
        break
    else:
        tweets.append([tweet.date, tweet.username, tweet.content])

df = pd.DataFrame(tweets, columns=['Date', 'Username', 'Tweet'])

# Save to csv
df.to_csv('tweets.csv')
df

get_items() returns search results one at a time as typed objects (a generator), so you have to count the tweets yourself inside the for loop. The code below works. 100K tweets is possible but takes too long, so I reduced the limit to 1K tweets.
import snscrape.modules.twitter as sntwitter
import pandas as pd

query = 'from:elonmusk since:2022-08-01 until:2023-01-28'
limit = 1000
tweets = sntwitter.TwitterSearchScraper(query).get_items()
index = 0
df = pd.DataFrame(columns=['Date', 'URL', 'Tweet'])

for tweet in tweets:
    if index == limit:
        break
    URL = "https://twitter.com/{0}/status/{1}".format(tweet.user.username, tweet.id)
    df2 = {'Date': tweet.date, 'URL': URL, 'Tweet': tweet.rawContent}
    df = pd.concat([df, pd.DataFrame.from_records([df2])])
    index = index + 1

# Converting time zone from UTC to GMT+8
# Note: the POSIX-style zone 'Etc/GMT+8' is actually UTC-8 (the sign is reversed),
# which is what the output below shows; for UTC+8 use 'Etc/GMT-8' or a named zone
# such as 'Asia/Singapore'.
df['Date'] = df['Date'].dt.tz_convert('Etc/GMT+8')
print(df)
df.to_csv('tweets.csv')
Each item yielded by get_items() looks like the object below; you only need to extract the required fields:
tweet.date -> Date
https://twitter.com/{tweet.user.username}/status/{tweet.id} -> URL
tweet.rawContent -> Tweet
{
    "_type": "snscrape.modules.twitter.Tweet",
    "url": "https://twitter.com/elonmusk/status/1619164489710178307",
    "date": "2023-01-28T02:44:31+00:00",
    "rawContent": "#tn_daki #ShitpostGate Yup",
    "renderedContent": "#tn_daki #ShitpostGate Yup",
    "id": 1619164489710178307,
    "user": {
        "_type": "snscrape.modules.twitter.User",
        "username": "elonmusk",
        "id": 44196397,
        "displayname": "Mr. Tweet",
        "rawDescription": "",
        "renderedDescription": "",
        "descriptionLinks": null,
        "verified": true,
        "created": "2009-06-02T20:12:29+00:00",
        "followersCount": 127536699,
        "friendsCount": 176,
        "statusesCount": 22411,
        "favouritesCount": 17500,
        "listedCount": 113687,
        "mediaCount": 1367,
        "location": "",
        "protected": false,
        "link": null,
        "profileImageUrl": "https://pbs.twimg.com/profile_images/1590968738358079488/IY9Gx6Ok_normal.jpg",
        "profileBannerUrl": "https://pbs.twimg.com/profile_banners/44196397/1576183471",
        "label": null,
        "url": "https://twitter.com/elonmusk"
    }
    ... cut off
Result
>python get-data.py
Date URL Tweet
0 2023-01-27 15:29:36-08:00 https://twitter.com/elonmusk/status/1619115435... #farzyness No way
0 2023-01-27 15:14:05-08:00 https://twitter.com/elonmusk/status/1619111533... #mtaibbi Please correct your bs #PolitiFact &a...
0 2023-01-27 14:52:55-08:00 https://twitter.com/elonmusk/status/1619106207... #WallStreetSilv A quarter of all taxes just to...
0 2023-01-27 13:28:26-08:00 https://twitter.com/elonmusk/status/1619084945... #nudubabba #mikeduncan Yeah, whole thing
0 2023-01-27 13:12:16-08:00 https://twitter.com/elonmusk/status/1619080876... #TaraBull808 That’s way more monkeys than the ...
.. ... ... ...
0 2022-12-14 11:14:53-08:00 https://twitter.com/elonmusk/status/1603106271... #Jason Advertising revenue next year will be l...
0 2022-12-14 04:08:43-08:00 https://twitter.com/elonmusk/status/1602999020... #Balyx_ He would be welcome
0 2022-12-14 03:42:47-08:00 https://twitter.com/elonmusk/status/1602992493... #NorwayMFA #TwitterSupport #jonasgahrstore #AH...
0 2022-12-14 03:35:14-08:00 https://twitter.com/elonmusk/status/1602990594... #AvidHalaby Wow
0 2022-12-14 03:35:03-08:00 https://twitter.com/elonmusk/status/1602990549... #AvidHalaby Live & learn …
[1000 rows x 3 columns]
Reference
Converting time zone pandas dataframe
Tweet URL format
Detailed information is here
Example:
URL = "https://twitter.com/elonmusk/status/1619111533216403456"
It is saved into the CSV file as:
0,2023-01-27 15:14:05-08:00,https://twitter.com/elonmusk/status/1619111533216403456,#mtaibbi Please correct your bs #PolitiFact & #snopes
This matches the tweet content in the pandas Tweet column.
You can also add more columns, such as followersCount, friendsCount, statusesCount, favouritesCount, listedCount, mediaCount, replyCount, retweetCount, likeCount and viewCount.
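As a rough sketch (not part of the original answer), those extra columns could be filled inside the same loop; the attribute names below follow the JSON dump above, but treat them as assumptions and verify them against your snscrape version:
# Sketch only: extend df2 with extra per-user and per-tweet counters.
df2 = {
    'Date': tweet.date,
    'URL': URL,
    'Tweet': tweet.rawContent,
    'FollowersCount': tweet.user.followersCount,
    'FriendsCount': tweet.user.friendsCount,
    'StatusesCount': tweet.user.statusesCount,
    'FavouritesCount': tweet.user.favouritesCount,
    'ListedCount': tweet.user.listedCount,
    'MediaCount': tweet.user.mediaCount,
    'ReplyCount': tweet.replyCount,      # assumed attribute names
    'RetweetCount': tweet.retweetCount,
    'LikeCount': tweet.likeCount,
    'ViewCount': tweet.viewCount,
}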

Related

Creating multiple dataframes using a loop or function

I'm trying to extract the hash rate for 3 cryptocurrencies and I have attached my code below. I want to pass three URLs and get back three different dictionaries holding the values. I'm stuck and don't understand how to go about it; I have tried using loops but it is not working out for me.
url = {'Bitcoin': 'https://bitinfocharts.com/comparison/bitcoin-hashrate.html#3y',
       'Ethereum': 'https://bitinfocharts.com/comparison/ethereum-hashrate.html#3y',
       'Litecoin': 'https://bitinfocharts.com/comparison/litecoin-hashrate.html'}

for ele in url:
    #### requesting the page and extracting the script which has date and values
    session = requests.Session()
    page = session.get(ele[i])
    soup = BeautifulSoup(page.content, 'html.parser')
    values = str(soup.find_all('script')[4])
    values = values.split('d = new Dygraph(document.getElementById("container"),')[1]
    # create an empty dict to append date and hashrates
    dict([("crypto_1 %s" % i, []) for i in range(len(url))])
    # run a loop over all the dates and adding to dictionary
    for i in range(values.count('new Date')):
        date = values.split('new Date("')[i+1].split('"')[0]
        value = values.split('"),')[i+1].split(']')[0]
        dict([("crypto_1 %s" % i)[date] = value
You can use the following example to get data from all 3 URLs and create a dataframe/dictionary from it:
import re
import requests
import pandas as pd

url = {
    "Bitcoin": "https://bitinfocharts.com/comparison/bitcoin-hashrate.html#3y",
    "Ethereum": "https://bitinfocharts.com/comparison/ethereum-hashrate.html#3y",
    "Litecoin": "https://bitinfocharts.com/comparison/litecoin-hashrate.html",
}

data = []
for name, u in url.items():
    html_doc = requests.get(u).text
    for date, hash_rate in re.findall(
        r'\[new Date\("(.*?)"\),(.*?)\]', html_doc
    ):
        data.append(
            {
                "Name": name,
                "Date": date,
                "Hash Rate": float("nan")
                if hash_rate == "null"
                else float(hash_rate),
            }
        )

df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])

# here save df to CSV

# this will create a dictionary, where the keys are crypto names and values
# are dicts with keys Date/HashRate:
out = {}
for name, g in df.groupby("Name"):
    out[name] = g[["Date", "Hash Rate"]].to_dict(orient="list")

print(out)
Prints:
{
"Bitcoin": {
"Date": [
Timestamp("2009-01-03 00:00:00"),
Timestamp("2009-01-04 00:00:00"),
Timestamp("2009-01-05 00:00:00"),
...
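Where the code above says "here save df to CSV", a minimal sketch (the file names are assumptions) could write one combined file or one file per crypto:
df.to_csv("hash_rates.csv", index=False)                       # combined file
for name, g in df.groupby("Name"):
    g.to_csv(f"{name.lower()}_hashrate.csv", index=False)      # one file per crypto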

Databricks - Pyspark - Handling nested json with a dynamic key

I have an example json data file which has the following structure:
{
    "Header": {
        "Code1": "abc",
        "Code2": "def",
        "Code3": "ghi",
        "Code4": "jkl"
    },
    "TimeSeries": {
        "2020-11-25T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        },
        "2020-11-26T03:00:00+00:00": {
            "UnitPrice": 1000,
            "Amount": 10000
        }
    }
}
When I parse this into databricks with command:
df = spark.read.json("/FileStore/test.txt")
I get as output 2 objects: Header and TimeSeries. With the TimeSeries I want to be able to flatten the structure so it has the following schema:
Date
UnitPrice
Amount
As the date field is a key, I am currently only able to access it via iterating through the column names and then using this in the dot-notation dynamically:
def flatten_json(data):
    columnlist = data.select("TimeSeries.*")
    count = 0
    for name in data.select("TimeSeries.*"):
        df1 = data.select("Header.*").withColumn("Timeseries", lit(columnlist.columns[count])).withColumn("join", lit("a"))
        df2 = data.select("TimeSeries." + columnlist.columns[count] + ".*").withColumn("join", lit("a"))
        if count == 0:
            df3 = df1.join(df2, on=['join'], how="inner")
        else:
            df3 = df3.union(df1.join(df2, on=['join'], how="inner"))
        count = count + 1
    return df3
This is far from ideal. Does anyone know a better method to create the described dataframe?
The idea:
Step 1: Extract Header and TimeSeries separately.
Step 2: For each field in the TimeSeries object, extract the Amount and UnitPrice, together with the name of the field, stuff them into a struct.
Step 3: Merge all these structs into an array column, and explode it.
Step 4: Extract Timeseries, Amount and UnitPrice from the exploded column.
Step 5: Cross join with the Header row.
import pyspark.sql.functions as F

header_df = df.select("Header.*")
timeseries_df = df.select("TimeSeries.*")

fieldNames = enumerate(timeseries_df.schema.fieldNames())
cols = [
    F.struct(
        F.lit(name).alias("Timeseries"),
        F.col(name).getItem("Amount").alias("Amount"),
        F.col(name).getItem("UnitPrice").alias("UnitPrice"),
    ).alias("ts_" + str(idx))
    for idx, name in fieldNames
]

combined = F.explode(F.array(cols)).alias("comb")
timeseries = timeseries_df.select(combined).select('comb.Timeseries', 'comb.Amount', 'comb.UnitPrice')
result = header_df.crossJoin(timeseries)
result.show(truncate=False)
Output:
+-----+-----+-----+-----+-------------------------+------+---------+
|Code1|Code2|Code3|Code4|Timeseries |Amount|UnitPrice|
+-----+-----+-----+-----+-------------------------+------+---------+
|abc |def |ghi |jkl |2020-11-25T03:00:00+00:00|10000 |1000 |
|abc |def |ghi |jkl |2020-11-26T03:00:00+00:00|10000 |1000 |
+-----+-----+-----+-----+-------------------------+------+---------+
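If you need the key column as an actual timestamp named Date, as in the desired schema, one hedged option is a cast; this is a sketch on top of the code above, assuming your Spark version can cast the offset-bearing ISO-8601 string directly:
# Sketch: rename the key column and cast the ISO-8601 string to a timestamp.
result = result.withColumn("Date", F.to_timestamp("Timeseries")).drop("Timeseries")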

Writing data from an API to a CSV

Basically I have this code that is working for me and its purpose is to download an entire series from an API about how many times a stock ticker is mentioned on the wallstreetbets sub.
This is the code:
import requests

tickers = open("ticker_list.txt", "r")
for ticker in tickers:
    ticker = ticker.strip()
    url = "https://XXX SENSIBLE INFO/historical/wallstreetbets/" + ticker
    headers = {'XXX (SENSIBLE INFO'}
    r = requests.get(url, headers=headers)
    print(r.content)
The .txt file is a simple list of about 8000 stock symbols.
Here are the first lines of the output, just as an example:
b'[{"Date": "2018-08-10", "Ticker": "AA", "Mentions": 1}, {"Date": "2018-08-28", "Ticker": "AA", "Mentions": 1}, {"Date": "2018-09-07", "Ticker": "AA", "Mentions": 1}, etc...
b'[{"Date": "2020-12-07", "Ticker": "AACQ", "Mentions": 1}, {"Date": "2020-12-08", "Ticker": "AACQ", "Mentions": 1}, {"Date": "2020-12-22", "Ticker": "AACQ", "Mentions": 1},... etc...
b'[{"Date": "2018-08-08", "Ticker": "AAL", "Mentions": 1}, {"Date": "2018-08-20", "Ticker": "AAL", "Mentions": 1}, {"Date": "2018-09-11", "Ticker": "AAL", "Mentions": 1}, .... etc
What I want to do now is to store all the data in a csv file so that the resulting table would be interpreted like this:
            AA    AACQ   AAL    ......
1/1/2018    3     3      7      ...
2/1/2018    45    89     3      ....
3/1/2018    21    4      2      ......
....
(where the numbers in the middle represent the mentions per date per ticker; to simplify I just put random numbers here, but they need to be the same numbers I got in the output as "Mentions")
Alternatively, if it's easier, I could create a separate CSV file for every ticker, with the date in the first column and the number of mentions in the second column.
The data that is being returned from the site is in JSON format, so this could be converted into a Python data structure using r.json(). Next, two things will help you here. Firstly a Counter can be used to keep track of all of the Mentions in your json data, and a defaultdict can be used to build a per date entry for each ticker. The set all_tickers can be used to keep track of all the tickers seen in the data and then be used to form the header for your output CSV file.
For example:
from collections import defaultdict, Counter
from datetime import datetime
import requests
import csv

dates = defaultdict(Counter)
all_tickers = set()

tickers = open("ticker_list.txt")

for ticker in tickers:
    ticker = ticker.strip()
    url = f"https://XXX SENSIBLE INFO/historical/wallstreetbets/{ticker}"
    headers = {'XXX (SENSIBLE INFO'}
    r = requests.get(url, headers=headers)

    for row in r.json():
        all_tickers.add(row['Ticker'])
        date = datetime.strptime(row['Date'], '%Y-%m-%d')    # convert to datetime format
        dates[date][row['Ticker']] += row['Mentions']

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=['Date', *sorted(all_tickers)])
    csv_output.writeheader()

    for date, values in sorted(dates.items(), key=lambda x: x[0]):
        row = {'Date': date.strftime('%d/%m/%Y')}    # Create an output date format of day/month/year
        row.update(values)
        csv_output.writerow(row)
This should produce the output you need.
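For the alternative mentioned in the question (one CSV per ticker), a minimal sketch reusing the dates structure built above could look like this (the file names are assumptions):
# Sketch only: regroup the per-date counters into per-ticker rows and write one file each.
per_ticker = defaultdict(list)
for date, values in sorted(dates.items()):
    for tick, mentions in values.items():
        per_ticker[tick].append((date.strftime('%d/%m/%Y'), mentions))

for tick, rows in per_ticker.items():
    with open(f'{tick}.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Date', 'Mentions'])
        writer.writerows(rows)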

How to access certain values of a json site via python

This is the code i have so far:
import json
import requests
import time

endpoint = "https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json"
req = requests.get(endpoint)
reqJson = json.loads(req.text)

for id in reqJson['product']:
    name = (id['title'])
    print(name)
Feel free to visit the link. I'm trying to grab all the "id" values and print them out; they will be used later to send to my Discord.
I tried with the code above but I have no idea how to actually get those values, and I don't know which variable to use in the for ... in reqJson statement.
If anyone could help me out and guide me to print all of the ids, that would be awesome.
for product in reqJson['product']['title']:
    ProductTitle = product['title']
    print(title)
I see from the link you provided that the only ids that are in a list are actually part of the variants list under product. All the other ids are not part of a list and have therefore no need to iterate over. Here's an excerpt of the data for clarity:
{
    "product": {
        "id": 232418213909,
        "title": "Nike Air Max 1 \/ Cool Grey",
        ...
        "variants": [
            {
                "id": 3136193822741,
                "product_id": 232418213909,
                "title": "8",
                ...
            },
            {
                "id": 3136193855509,
                "product_id": 232418213909,
                "title": "8.5",
                ...
            },
            {
                "id": 3136193789973,
                "product_id": 232418213909,
                "title": "9",
                ...
            },
            ...
        ],
        "image": {
            "id": 3773678190677,
            "product_id": 232418213909,
            "position": 1,
            ...
        }
    }
}
So what you need to do should be to iterate over the list of variants under product instead:
import json
import requests

endpoint = "https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json"
req = requests.get(endpoint)
reqJson = json.loads(req.text)

for product in reqJson['product']['variants']:
    print(product['id'], product['title'])
This outputs:
3136193822741 8
3136193855509 8.5
3136193789973 9
3136193757205 9.5
3136193724437 10
3136193691669 10.5
3136193658901 11
3136193626133 12
3136193593365 13
And if you simply want the product id and product name, they would be reqJson['product']['id'] and reqJson['product']['title'], respectively.
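Since the question mentions sending the ids to Discord, here is a rough sketch (not part of the original answer) that collects the variant ids and posts them to a Discord webhook; the webhook URL is a placeholder you would replace with your own:
ids = [variant['id'] for variant in reqJson['product']['variants']]

webhook_url = "https://discord.com/api/webhooks/<id>/<token>"    # placeholder webhook URL
requests.post(webhook_url, json={"content": f"Variant ids: {ids}"})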

What is the data format returned by the AdWords API TargetingIdeaPage service?

When I query the AdWords API to get search volume data and trends through their TargetingIdeaSelector using the Python client library the returned data looks like this:
(TargetingIdeaPage){
   totalNumEntries = 1
   entries[] =
      (TargetingIdea){
         data[] =
            (Type_AttributeMapEntry){
               key = "KEYWORD_TEXT"
               value =
                  (StringAttribute){
                     Attribute.Type = "StringAttribute"
                     value = "keyword phrase"
                  }
            },
            (Type_AttributeMapEntry){
               key = "TARGETED_MONTHLY_SEARCHES"
               value =
                  (MonthlySearchVolumeAttribute){
                     Attribute.Type = "MonthlySearchVolumeAttribute"
                     value[] =
                        (MonthlySearchVolume){
                           year = 2016
                           month = 2
                           count = 2900
                        },
                        ...
                        (MonthlySearchVolume){
                           year = 2015
                           month = 3
                           count = 2900
                        },
                  }
            },
      },
}
This isn't JSON and appears to just be a messy Python list. What's the easiest way to flatten the monthly data into a Pandas dataframe with a structure like this?
Keyword        | Year | Month | Count
keyword phrase | 2016 | 2     | 10
The output is a sudsobject. I found that this code does the trick:
import suds.sudsobject as sudsobject
import pandas as pd
a = [sudsobject.asdict(x) for x in output]
df = pd.DataFrame(a)
Addendum: This was once correct, but newer versions of the API (I tested v201802) now return zeep objects. However, zeep.helpers.serialize_object should do the same trick.
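A minimal sketch of that zeep-based variant, assuming output is the same list of entries as in the code above:
from zeep.helpers import serialize_object
import pandas as pd

# Sketch only: serialize each zeep object to a plain dict, then build the dataframe.
a = [serialize_object(x, target_cls=dict) for x in output]
df = pd.DataFrame(a)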
Here's the complete code I used to query the TargetingIdeaService (with a selector using requestType STATS), and the method I used to parse the data into a usable dataframe; note the section starting "Parse results to pandas dataframe", as this takes the output given in the question above and converts it to a dataframe. Probably not the fastest or best, but it works! Tested with Python 2.7.
"""This code pulls trends for a set of keywords, and parses into a dataframe.
The LoadFromStorage method is pulling credentials and properties from a
"googleads.yaml" file. By default, it looks for this file in your home
directory. For more information, see the "Caching authentication information"
section of our README.
"""
from googleads import adwords
import pandas as pd
adwords_client = adwords.AdWordsClient.LoadFromStorage()
PAGE_SIZE = 10
# Initialize appropriate service.
targeting_idea_service = adwords_client.GetService(
    'TargetingIdeaService', version='v201601')
# Construct selector object and retrieve related keywords.
offset = 0
stats_selector = {
    'searchParameters': [
        {
            'xsi_type': 'RelatedToQuerySearchParameter',
            'queries': ['donald trump', 'bernie sanders']
        },
        {
            # Language setting (optional).
            # The ID can be found in the documentation:
            # https://developers.google.com/adwords/api/docs/appendix/languagecodes
            'xsi_type': 'LanguageSearchParameter',
            'languages': [{'id': '1000'}],
        },
        {
            # Location setting
            'xsi_type': 'LocationSearchParameter',
            'locations': [{'id': '1027363'}]  # Burlington, Vermont
        }
    ],
    'ideaType': 'KEYWORD',
    'requestType': 'STATS',
    'requestedAttributeTypes': ['KEYWORD_TEXT', 'TARGETED_MONTHLY_SEARCHES'],
    'paging': {
        'startIndex': str(offset),
        'numberResults': str(PAGE_SIZE)
    }
}
stats_page = targeting_idea_service.get(stats_selector)
##########################################################################
# Parse results to pandas dataframe
stats_pd = pd.DataFrame()
if 'entries' in stats_page:
    for stats_result in stats_page['entries']:
        stats_attributes = {}
        for stats_attribute in stats_result['data']:
            # print(stats_attribute)
            if stats_attribute['key'] == 'KEYWORD_TEXT':
                kt = stats_attribute['value']['value']
            else:
                for i, val in enumerate(stats_attribute['value'][1]):
                    data = {'keyword': kt,
                            'year': val['year'],
                            'month': val['month'],
                            'count': val['count']}
                    data = pd.DataFrame(data, index=[i])
                    stats_pd = stats_pd.append(data, ignore_index=True)

print(stats_pd)
