How to use premium full archive search - python

I want to get 1000 tweets (without retweets) for the period 07/08/2006 00:00 to 07/08/2006 23:59 using the premium full archive endpoint. The API returns a maximum of 500 tweets per request. How can I get 1000 tweets without running my code twice? Also, how can I export the tweets in CSV format, including all the keys?
I am new to Python. I tried to get the tweets, but as I said in the summary description I am only getting 500 tweets (including retweets). Also, when I save the tweets to the CSV, every even row is empty.
For example:

| created_at | id_str | source       | user  |
|------------|--------|--------------|-------|
| 2008       | 949483 | www.none.com | John  |
| empty      | empty  | empty        | empty |
| 2009       | 74332  | www.non2.com | Marc  |
| empty      | empty  | empty        | empty |
My questions are:
How can I get 1000 tweets (excluding retweets), without duplicates, by running the code only once? And how can I save all the keys of the output to a CSV without the empty even rows?
from TwitterAPI import TwitterAPI
import csv

SEARCH_TERM = '#nOne'
PRODUCT = 'fullarchive'
LABEL = 'dev-environment'

api = TwitterAPI("consumer_key",
                 "consumer_secret",
                 "access_token_key",
                 "access_token_secret")

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                {'query': SEARCH_TERM,
                 'fromDate': '200608070000',
                 'toDate': '200608072359',
                 'maxResults': 500})

csvFile = open('data.csv', 'w', encoding='UTF-8')
csvWriter = csv.writer(csvFile)

for item in r:
    csvWriter.writerow([item['created_at'],
                        item['id_str'],
                        item['source'],
                        item['user']['screen_name'],
                        item['user']['location'],
                        item['geo'],
                        item['coordinates'],
                        item['text'] if 'text' in item else item])
I expect to get 1000 unique tweets (excluding retweets) in CSV format by running the code once.
Thanks

If you are using the TwitterAPI package, you should take advantage of the TwitterPager class which uses the next element in the returned JSON to get the next page of tweets. Look at this simple example to understand the usage.
In your case, you would simply replace this:
r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                {'query': SEARCH_TERM,
                 'fromDate': '200608070000',
                 'toDate': '200608072359',
                 'maxResults': 500})
...with this:
from TwitterAPI import TwitterPager
from TwitterAPI import TwitterPager

r = TwitterPager(api, 'tweets/search/%s/:%s' % (PRODUCT, LABEL),
                 {'query': SEARCH_TERM,
                  'fromDate': '200608070000',
                  'toDate': '200608072359',
                  'maxResults': 500}).get_iterator()
By default, TwitterPager waits 5 seconds between requests. In the Sandbox environment you should be able to reduce this to 2 seconds without exceeding the rate limit. To change the wait to 2 seconds you would call get_iterator with a parameter, like this:
get_iterator(wait=2)
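Putting the pieces together, here is a minimal sketch of how the pager could feed the CSV export from the question. The retweet check on retweeted_status and the newline='' argument (which stops Python's csv module from writing the blank even rows on Windows) are additions beyond what the answer above shows, so treat them as suggestions rather than part of the accepted solution.

import csv
from TwitterAPI import TwitterAPI, TwitterPager

SEARCH_TERM = '#nOne'
PRODUCT = 'fullarchive'
LABEL = 'dev-environment'

api = TwitterAPI("consumer_key", "consumer_secret",
                 "access_token_key", "access_token_secret")

pager = TwitterPager(api, 'tweets/search/%s/:%s' % (PRODUCT, LABEL),
                     {'query': SEARCH_TERM,
                      'fromDate': '200608070000',
                      'toDate': '200608072359',
                      'maxResults': 500})

# newline='' prevents the csv module from inserting the blank rows seen in the question
with open('data.csv', 'w', encoding='UTF-8', newline='') as csvFile:
    csvWriter = csv.writer(csvFile)
    count = 0
    for item in pager.get_iterator(wait=2):
        if 'retweeted_status' in item:      # one way to drop retweets client-side
            continue
        csvWriter.writerow([item['created_at'],
                            item['id_str'],
                            item['source'],
                            item['user']['screen_name'],
                            item['text'] if 'text' in item else item])
        count += 1
        if count >= 1000:                   # stop once 1000 original tweets are written
            break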

Related

The DataFrame saved only the last query when scraping tweets by emoji search; did the for loop not work correctly?

I wrote a Python script using snscrape to collect Saudi tweets based on:
Location: GeneralListEmojis
Time interval: TimeInterval
Emoji: EmojisList
I have a list for each of these search criteria. For example, I want to collect tweets from Riyadh in the four time intervals that include the emojis in EmojisList.
I want to save all the results from all the cities into one dataframe called EmojisTweets.
I made another dataframe, called NoEmojiTweet, that records the cases where a specific emoji did not appear in the tweets of some city.
I start the code with an empty EmojisTweets dataframe, which is supposed to be filled iteratively in the inner for loop: I define a dataframe EmojisTweets_i that holds the tweets of the current search and concatenate it with the main dataframe EmojisTweets.
The problem is that, at the end of the scraping process, the EmojisTweets dataframe contains only the last search loop, for the last emoji in EmojisList!
The same problem appears when using pd.append on NoEmojiTweet.
What is the problem in my code that causes this failure?
import itertools
import pandas as pd
import snscrape.modules.twitter as sntwitter
from datetime import datetime

# ******************[ Start : By Emojis ]************************
now = datetime.now()
print("EMOJIS: Starting Time =", now.strftime("%H:%M:%S"))
GeneralListEmojis=[('Riyadh','250','2000'), ('Jeddah','300','2000'), ('Dammam','200','1000'), ('Abha','150','500'), ('Jizan','150','500'), ('24.470901, 39.612236','100','500'), ('Buraidah','100','500'), ('Tabuk','100','500'), ('Hail','100','500'), ('Najran','50','500'), ('29.5,39.52236','50','500'), ('20.0125, 41.465278','50','500'), ('Arar','50','500')]
TimeInterval=['since:2019-01-01 until:2019-12-31', 'since:2020-01-01 until:2020-12-31', 'since:2021-01-01 until:2021-12-31', 'since:2022-01-01 until:2022-12-14']
EmojisList = ['💦', '🐖', '🐷', '🐽', '👞', '🐕', '🐶', '💩', '🐄', '🐮', '🐑', '🐏', '👎', '😡', '🤬', '👺', '👿', '😠']
EmojisTweets=pd.DataFrame()
NoEmojiTweet=pd.DataFrame(columns=['City', 'Emoji', 'TimeInterval'])
for c in GeneralListEmojis:
    City = c[0]
    km = c[1]
    tw = c[2]
    for Ti in TimeInterval:
        for E in EmojisList:
            EmojisTweets_i = pd.DataFrame(itertools.islice(
                sntwitter.TwitterSearchScraper('"{}" near:"{}" within:{}km {}'.format(E, City, km, Ti)).get_items(),
                int('{}'.format(tw))))
            if not EmojisTweets_i.empty:
                EmojisTweets_i['user_location'] = EmojisTweets_i['user'].apply(lambda x: x['location'])
                EmojisTweets_i['City'] = '{}'.format(City)  # to chunk the df based on the city column
                EmojisTweets_i['Search'] = 'ByEmojis'  # to record the type of search for this tweet when concatenating all the files
                EmojisTweets = pd.concat([EmojisTweets, EmojisTweets_i])
            else:
                NewNoEmoji_Row = {'City': '{}'.format(City), 'Emoji': '{}'.format(E), 'TimeInterval': '{}'.format(Ti)}
                NoEmojiTweet = NoEmojiTweet.append(NewNoEmoji_Row, ignore_index=True)

# All the Saudi regional tweets | searching by emojis
EmojisTweets.to_csv('EmojisTweets.csv')
# File for each city with the missing emoji during the time interval:
NoEmojiTweet.to_csv('NoEmojiTweet.csv')

now = datetime.now()
print("EMOJIS: Ending Time =", now.strftime("%H:%M:%S"))
I ran the code and it worked, but when I investigated the collected tweets in EmojisTweets, I found that it had saved only the last round of the inner for loop. So the dataframe contains all the tweets from all the cities in GeneralListEmojis, for all the time intervals in TimeInterval, but only for the angry emoji '😠', which is the last emoji in EmojisList:
EmojisList = ['💦', '🐖', '🐷', '🐽', '👞', '🐕', '🐶', '💩', '🐄', '🐮', '🐑', '🐏', '👎', '😡', '🤬', '👺', '👿', '😠']
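For reference, a minimal self-contained sketch of the accumulate-with-pd.concat pattern the code relies on, with dummy dataframes standing in for the scraper output; when the reassignment sits inside the innermost loop like this, every iteration's rows survive into the final frame:

import pandas as pd

EmojisTweets = pd.DataFrame()
for E in ['A', 'B', 'C']:  # stand-ins for the emoji list
    EmojisTweets_i = pd.DataFrame({'Emoji': [E], 'text': ['tweet with ' + E]})
    # reassigning the accumulator keeps the rows from earlier iterations
    EmojisTweets = pd.concat([EmojisTweets, EmojisTweets_i])

print(EmojisTweets)  # contains rows for A, B and C, not only the last value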

How to get lots of tweets and filter by multiple strings in tweepy

I have an Academic account on Twitter and would like to get all the tweets in a month from a particular country that contain one or more of a set of strings.
My attempt so far using v2 of the API is:
import tweepy

client = tweepy.Client(bearer_token=secret_token)

query = 'covid -is:retweet place_country:GB'
# Replace with time period of your choice
start_time = '2020-03-01T00:00:00Z'
# Replace with time period of your choice
end_time = '2020-04-01T00:00:00Z'

tweets = client.search_all_tweets(query=query,
                                  tweet_fields=['context_annotations', 'created_at', 'geo'],
                                  start_time=start_time, end_time=end_time,
                                  place_fields=['place_type', 'geo'],
                                  expansions='geo.place_id', max_results=500)
I believe (but could be wrong) that this returns 500 tweets that have GB as their geolocation, are not retweets, were created in March 2020, and contain the string "covid".
How can I replace "covid" with a set of strings that it should be searching for? For example ["covid", "coronavirus"].
How can I get it to return many more than 500 tweets? In theory I can extract more than a million tweets a month with the Academic account, but search_all_tweets has a limit of 500 per request.
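One way this is commonly handled, sketched here as a suggestion rather than an authoritative answer: the v2 query syntax allows grouping alternatives with OR, and tweepy's Paginator (tweepy v4 assumed) can page through search_all_tweets past the 500-per-request cap. secret_token is the bearer token from the snippet above, and the 10,000-tweet cap is illustrative.

import tweepy

client = tweepy.Client(bearer_token=secret_token, wait_on_rate_limit=True)

# Group the alternative search strings with OR inside parentheses
query = '(covid OR coronavirus) -is:retweet place_country:GB'

tweets = []
# Paginator keeps requesting the next page; flatten() yields individual tweets
for tweet in tweepy.Paginator(client.search_all_tweets, query=query,
                              tweet_fields=['context_annotations', 'created_at', 'geo'],
                              start_time='2020-03-01T00:00:00Z',
                              end_time='2020-04-01T00:00:00Z',
                              expansions='geo.place_id',
                              max_results=500).flatten(limit=10000):
    tweets.append(tweet)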

How to retrieve tweets from a week ago using Tweepy in API 3.9

import tweepy, re, json, matplotlib.pyplot as plt, seaborn as sns, pandas as pd
from textblob import TextBlob

consumer_key = "key"
consumer_secret = "key"
access_key = "key"
secret_key = "key"

autenticacion = tweepy.OAuthHandler(consumer_key, consumer_secret)
autenticacion.set_access_token(access_key, secret_key)

# Variable where I call the API
api = tweepy.API(autenticacion, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

results = tweepy.Cursor(api.search, q='Panamá', tweet_mode="extended", lang="en",
                        since='2020-11-12', until='2020-11-18').items(2000)
term = 'Panamá'

json_data = [r._json for r in results]
df = pd.io.json.json_normalize(json_data)
I'm trying to extract a thousand tweets spread from a week back up to today's date, but all the tweets I extract are from today's date. How do I go about extracting tweets with dates from a week ago to today, distributed evenly?
The Twitter API itself does not support an even distribution of results on a per-day basis across a week, so you would have to implement this yourself. The count parameter only supports a maximum of 100 results per page. At the moment, your code is basically asking for 100 Tweets looking back from now.
You could try the following:
break down your 1000 results by 7 days (to make this easier, let's make it 100 per day, so 700 Tweets total)
create a seven-pass loop around your second block of code, and for each iteration search for 100 results, each time setting the since and until values to the same day, so '2020-11-12' to '2020-11-12', '2020-11-13' to '2020-11-13', etc. (see the sketch below)
in each loop iteration, append your data to the dataframe
Also note that the line where you have term = 'Panamá' is apparently unused in the code above.
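A minimal sketch of that per-day loop, reusing the api object from the question; the dates, the per-day count and the use of pd.json_normalize are illustrative rather than part of the original answer:

import pandas as pd
import tweepy
from datetime import date, timedelta

frames = []
start = date(2020, 11, 12)
for offset in range(7):  # one pass per day
    day = (start + timedelta(days=offset)).isoformat()
    results = tweepy.Cursor(api.search, q='Panamá', tweet_mode="extended", lang="en",
                            since=day, until=day).items(100)
    frames.append(pd.json_normalize([r._json for r in results]))

df = pd.concat(frames, ignore_index=True)  # roughly 700 tweets, up to 100 per day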

Can I speed up my reading and processing of many .csv files in python?

I am currently working with a dataset consisting of 90 .csv files. There are three types of .csv files (30 of each type).
Each csv has 20k to 30k rows on average and 3 columns (timestamp in Linux format, integer, integer).
Here's an example of the header and a row:
Timestamp id1 id2
151341342 324 112
I am currently using 'os' to list all files in the directory.
The process for each CSV file is as follows:
Read it through pandas into a dataframe
Iterate the rows of the file and for each row convert the timestamp to a readable format
Use the converted timestamp and integers to create a relationship-type object and add it to a list of relationships
The list will later be looped over to create the relationships in my neo4j database.
The problem I am having is that the process takes too much time. I have asked and searched for ways to do it faster (I got answers like PySpark and threads) but I did not find something that really fits my needs. I am really stuck, as with my resources it takes around 1 hour and 20 minutes to do all of that processing for one of the big .csv files (meaning one with around 30k rows).
Converting to readable format:
ts = int(row['Timestamp'])
formatted_ts = datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
And I pass the parameters to the Relationship function of py2neo to create my relationships. Later that list will be looped over.
node1 = graph.evaluate('MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row["id1"]))
node2 = graph.evaluate('MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row['id2']))
rels.append(Relationship(node1, rel_type, node2, date=date, time=time))
time to compute row: 0:00:00.001000
time to create relationship: 0:00:00.169622
time to compute row: 0:00:00.001002
time to create relationship: 0:00:00.166384
time to compute row: 0:00:00
time to create relationship: 0:00:00.173672
time to compute row: 0:00:00
time to create relationship: 0:00:00.171142
I calculated the time for the two parts of the process as shown above. Each step is fast and there really does not seem to be a problem except the size of the files. This is why the only thing that comes to mind is that parallelism would help process those files faster (by processing, say, 4 files at the same time instead of one).
Sorry for not posting everything.
I am really looking forward to replies.
Thank you in advance
That sounds fishy to me. Processing csv files of that size should not be that slow.
I just generated a 30k-line csv file of the type you described (3 columns filled with random numbers of the size you specified).
import random

with open("file.csv", "w") as fid:
    fid.write("Timestamp;id1;id2\n")
    for i in range(30000):
        ts = int(random.random() * 1000000000)
        id1 = int(random.random() * 1000)
        id2 = int(random.random() * 1000)
        fid.write("{};{};{}\n".format(ts, id1, id2))
Just reading the csv file into a list using plain Python takes well under a second. Printing all the data takes about 3 seconds.
from datetime import datetime

def convert_date(string):
    ts = int(string)
    formatted_ts = datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    split_ts = formatted_ts.split()
    date = split_ts[0]
    time = split_ts[1]
    return date

with open("file.csv", "r") as fid:
    header = fid.readline()
    lines = []
    for line in fid.readlines():
        line_split = line.strip().split(";")
        line_split[0] = convert_date(line_split[0])
        lines.append(line_split)

for line in lines:
    print(line)
Could you elaborate on what you do after reading the data, especially "create a relationship-type of object and add it on a list of relationships"?
That could help pinpoint your timing issue. Maybe there is a bug somewhere?
You could try timing different parts of your code to see which one takes the longest (see the sketch below).
Generally, what you describe should be possible within seconds, not hours.
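A minimal sketch of that kind of timing, assuming the per-file work is split into functions; the lambda below is only a throwaway example, so swap in your own read/convert/relationship steps:

import time

def timed(label, func, *args):
    # Run func, report how long it took, and pass its result through
    start = time.perf_counter()
    result = func(*args)
    print("{}: {:.3f} s".format(label, time.perf_counter() - start))
    return result

# Example with a dummy step standing in for one stage of the real pipeline
rows = timed("build dummy rows", lambda: [[i, i * 2] for i in range(30000)])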

How to import data from a .txt file into arrays in python

I am trying to import data from a .txt file that contains four columns separated by tabs and is several thousand lines long. This is how the start of the document looks:
Data info
File name: D:\(path to file)
Start time: 6/26/2019 15:39:54.222
Number of channels: 3
Sample rate: 1E6
Store type: fast on trigger
Post time: 20
Global header information: from DEWESoft
Comments:
Events
Event Type Event Time Comment
1 storing started at 7.237599
2 storing stopped at 7.257599
Data1
Time Incidente Transmitida DI 6
s um/m um/m -
0 2.1690152 140.98599 1
1E-6 2.1690152 140.98599 1
2E-6 4.3380303 145.32402 1
3E-6 4.3380303 145.32402 1
4E-6 -2.1690152 145.32402 1
I have several of these files that I want to loop through and store in a cell/list so that each cell/list item contains the four columns. After that I just use that cell/list to plot the data with a loop.
I saw that the pandas library was suitable, but I don't understand how to use it.
import pandas as pd

fileNames = (["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
              "Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
              "Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"])
folderName = 'AuxeticsSHPB\\'  # Source folder for all files above

# Loop through each source document
for i in range(0, len(fileNames)):
    print('File location: ' + folderName + fileNames[i])
    # Get data from source as arrays, cut out the first 20 lines
    temp = pd.read_csv(folderName + fileNames[i], sep='\t', lineterminator='\r',
                       skiprows=[19], error_bad_lines=False)
    # Store data in list/cell
    # data[i] = temp  # sort it
This is something I tried that didn't work; I don't really know how to proceed. I know there is some documentation on this problem but I am new to this and need some help.
An error I get when trying the above:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 4
So it was an easy fix: I just had to change skiprows=[19] to skiprows=19 (a list tells read_csv to skip only those specific row numbers, while a plain integer skips that many rows from the top, which is what is needed to get past the header block here).
The code now looks like this and works.
import pandas as pd

fileNames = ["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
             "Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
             "Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"]
folderName = 'AuxeticsSHPB\\'  # Source folder for all files above

# Preallocation
data = []
for i in range(0, len(fileNames)):
    temp = pd.read_csv(folderName + fileNames[i], sep='\t', lineterminator='\r',
                       skiprows=19)
    data.append(temp)
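As a sketch of the plotting loop mentioned in the question: the column names below are taken from the sample header shown earlier (Time, Incidente, Transmitida), and the choice of what to plot is illustrative, so adjust it to whatever read_csv actually produces for these files.

import matplotlib.pyplot as plt

# One figure per file, assuming the columns come out named as in the sample header
for name, df in zip(fileNames, data):
    plt.figure()
    plt.plot(df['Time'], df['Incidente'], label='Incidente')
    plt.plot(df['Time'], df['Transmitida'], label='Transmitida')
    plt.title(name)
    plt.xlabel('Time (s)')
    plt.ylabel('um/m')
    plt.legend()
plt.show()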
