I'm teaching myself to work with APIs and figured I'd make a Reddit bot. I'm trying to adapt code I used in a different script. That script used requests to fetch data, converted the response to JSON, loaded it into a pandas DataFrame, and then wrote a CSV.
I want to do roughly the same thing here, but I don't know how to get the Reddit data into the DataFrame. What I've tried below throws errors.
#!/usr/bin/python
import praw
import pandas as pd
reddit = praw.Reddit('my_bot')
subreddit = reddit.subreddit("askreddit")
for submission in subreddit.hot(limit=5):
    print("Title: ", submission.title)
    print("Score: ", submission.score)
    print("Link: ", submission.url)
    print("---------------------------------\n")
csv_file = f"/home/robothead/scripts/python/reddit/reddit-data.csv"
# start with empty dataframe
df = pd.DataFrame()
#j_data = subreddit.json()
#parse_data = j_data['data']
# append to the dataframe
#df = df.append(pd.DataFrame.from_dict(pd.json_normalize(parse_data), orient='columns'))
# append to the dataframe
df = df.append(pd.DataFrame.from_dict(pd(submission), orient='columns'))
# write the whole CSV at once
df.to_csv(csv_file, index=False, encoding='utf-8')
error:
Traceback (most recent call last):
File "bot.py", line 21, in <module>
df = df.append(pd.DataFrame.from_dict(pd(submission), orient='columns'))
TypeError: 'module' object is not callable
This is how I've done it in the past:
df = pd.DataFrame([ vars(post) for post in subreddit.hot(limit=5) ])
vars converts a praw Submission to a dict, and the pandas DataFrame constructor can take a list of dictionaries. This works well when the dicts share the same keys, which is the case here. Of course, you get a giant DataFrame with ALL the columns, some of which even contain praw objects (that you can work with!). You'll probably want to pare that down to just the columns you want before writing to a file.
Edit:
Just so there's no confusion, here is the full script example:
#!/usr/bin/python
import praw
import pandas as pd
reddit = praw.Reddit('my_bot')
subreddit = reddit.subreddit("askreddit")
df = pd.DataFrame([ vars(post) for post in subreddit.hot(limit=5) ])
df = df[["title","score","url"]]
csv_file = "/home/robothead/scripts/python/reddit/reddit-data.csv"
df.to_csv(csv_file, index=False, encoding='utf-8')
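If you'd rather avoid the giant intermediate DataFrame entirely, you can build the rows yourself with only the fields you need. A minimal sketch along the same lines (same bot config and subreddit as above):

#!/usr/bin/python
import praw
import pandas as pd

reddit = praw.Reddit('my_bot')
subreddit = reddit.subreddit("askreddit")

# one small dict per submission, holding just the fields we care about
rows = [{"title": s.title, "score": s.score, "url": s.url}
        for s in subreddit.hot(limit=5)]

df = pd.DataFrame(rows)
df.to_csv("reddit-data.csv", index=False, encoding="utf-8")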
Related
I'm currently struggling to figure out how to use pandas to pull data from the OpenCorporates API and write it to a CSV file. I'm not quite sure where I'm going wrong.
import pandas as pd
df = pd.read_json('https://api.opencorporates.com/companies/search?q=pwc')
data = df['companies']['company'][0]
result = {'name': data['timestamp'],
          'company_number': data[0]['company_number'],
          'jurisdiction_code': data[0]['jurisdiction_code'],
          'incorporation_date': data[0]['incorporation_date'],
          'dissolution_date': data[0]['dissolution_date'],
          'company_type': data[0]['company_type'],
          'registry_url': data[0]['registry_url'],
          'branch': data[0]['branch'],
          'opencorporates_url': data[0]['opencorporates_url'],
          'previous_names': data[0]['previous_names'],
          'source': data[0]['source'],
          'url': data[0]['url'],
          'registered_address': data[0]['registered_address'],
          }
df1 = pd.DataFrame(result, columns=['name', 'company_number', 'jurisdiction_code', 'incorporation_date', 'dissolution_date', 'company_type', 'registry_url', 'branch', 'opencorporates_url', 'previous_names', 'source', 'url', 'registered_address'])
df1.to_csv('company.csv', index=False, encoding='utf-8')
Get the JSON data with requests and then use pd.io.json.json_normalize to flatten the response.
import requests
import pandas as pd
from pandas.io.json import json_normalize

json_data = requests.get('https://api.opencorporates.com/companies/search?q=pwc').json()

df = None
for row in json_data["results"]["companies"]:
    if df is None:
        df = json_normalize(row["company"])
    else:
        df = pd.concat([df, json_normalize(row["company"])])
You then write the DataFrame to a csv using the df.to_csv() method as described in the question.
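As an aside, collecting the flattened frames in a list and concatenating once is usually tidier than growing the DataFrame inside the loop. A sketch assuming the same response shape (newer pandas versions expose this as pd.json_normalize):

import requests
import pandas as pd

json_data = requests.get('https://api.opencorporates.com/companies/search?q=pwc').json()

# flatten each nested company dict, then concatenate in a single call
frames = [pd.json_normalize(row["company"])
          for row in json_data["results"]["companies"]]
df = pd.concat(frames, ignore_index=True)
df.to_csv('company.csv', index=False, encoding='utf-8')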
It might be easier for you to access the OpenCorporates database in bulk.
OpenCorporates provides access for commercial users under a closed licence, and as open data for journalists, academics and NGOs who are able to share the results under a share-alike, open data licence. The licence is available here: https://opencorporates.com/info/licence
I'm trying to make an API call and save the response as a DataFrame.
The problem is that I need the data from the 'result' column.
I haven't managed to extract it.
I'm basically just trying to save the API call as a CSV file so I can work with it.
P.S. When I do this with a "JSON to CSV converter" from the web, it does what I want (example: https://konklone.io/json/).
import requests
import pandas as pd
import json
res = requests.get("http://api.etherscan.io/api?module=account&action=txlist"
                   "&address=0xddbd2b932c763ba5b1b7ae3b362eac3e8d40121a"
                   "&startblock=0&endblock=99999999&sort=asc&apikey=YourApiKeyToken")
j = res.json()
j
df = pd.DataFrame(j)
df.head()
Try this:
import requests
import pandas as pd
import json
res = requests.get("http://api.etherscan.io/api?module=account&action=txlist&address=0xddbd2b932c763ba5b1b7ae3b362eac3e8d40121a&startblock=0&endblock=99999999&sort=asc&apikey=YourApiKeyToken")
j = res.json()
# print(j)
filename = "temp.csv"
df = pd.DataFrame(j['result'])
print(df.head())
df.to_csv(filename)
Looks like you need:
df = pd.DataFrame(j["result"])
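Putting it together, a hedged sketch of the full flow; it assumes Etherscan's usual envelope, where 'status' and 'message' describe the call and 'result' holds the list of transaction dicts:

import requests
import pandas as pd

res = requests.get("http://api.etherscan.io/api?module=account&action=txlist"
                   "&address=0xddbd2b932c763ba5b1b7ae3b362eac3e8d40121a"
                   "&startblock=0&endblock=99999999&sort=asc&apikey=YourApiKeyToken")
j = res.json()

# only build the frame if the API reports success
if j.get("status") == "1":
    df = pd.DataFrame(j["result"])
    df.to_csv("temp.csv", index=False)
else:
    print("API error:", j.get("message"))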
I am using Python 3.6 and trying to load a JSON file (350 MB) into a pandas DataFrame using the code below. However, I get the following error:
data_json_str = "[" + ",".join(data) + "]"
TypeError: sequence item 0: expected str instance, bytes found
How can I fix the error?
import pandas as pd
# read the entire file into a python array
with open('C:/Users/Alberto/nutrients.json', 'rb') as f:
    data = f.readlines()
# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)
# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual business JSON objects
# separated by a comma
data_json_str = "[" + ",".join(data) + "]"
# now, load it into pandas
data_df = pd.read_json(data_json_str)
From your code, it looks like you're loading a JSON file which has JSON data on each separate line. read_json supports a lines argument for data like this:
data_df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
Note
Remove lines=True if you have a single JSON object instead of individual JSON objects on each line.
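For concreteness, here is a minimal sketch of what lines=True expects, using a tiny hypothetical file:

import pandas as pd

# hypothetical sample file: one JSON object per line (JSON Lines format)
with open("nutrients_sample.json", "w") as f:
    f.write('{"name": "apple", "kcal": 52}\n')
    f.write('{"name": "banana", "kcal": 89}\n')

df = pd.read_json("nutrients_sample.json", lines=True)
print(df)
#      name  kcal
# 0   apple    52
# 1  banana    89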
Using the json module you can parse the json into a python object, then create a dataframe from that:
import json
import pandas as pd

with open('C:/Users/Alberto/nutrients.json', 'r') as f:
    data = json.load(f)

df = pd.DataFrame(data)
If you open the file as binary ('rb'), you will get bytes. How about:
with open('C:/Users/Alberto/nutrients.json', 'rU') as f:
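Note that 'rU' is deprecated in Python 3; plain text mode gives the same result, since the file is then decoded to str and ",".join(data) no longer sees bytes:

# text mode ('r') decodes each line to str instead of bytes
with open('C:/Users/Alberto/nutrients.json', 'r') as f:
    data = f.readlines()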
Also, as noted in this answer, you can use pandas directly:
df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
If you want to convert it into an array of JSON objects, I think this will do what you want:
import json

data = []
with open('nutrients.json', errors='ignore') as f:
    for line in f:
        data.append(json.loads(line))

print(data[0])
The easiest way to read a JSON file with pandas is:
pd.read_json("sample.json", lines=True, orient='columns')
To deal with nested JSON like this:
[[{"value1": 1}, {"value2": 2}], [{"value3": 3}, {"value4": 4}], ...]
use basic Python indexing:
value1 = df['column_name'][0][0].get('value1')
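A small self-contained sketch of that lookup, with hypothetical values:

import pandas as pd

# hypothetical nested data: each cell holds a list of single-key dicts
data = [[{"value1": 1}, {"value2": 2}], [{"value3": 3}, {"value4": 4}]]
df = pd.DataFrame({"column_name": data})

# row 0 -> first dict in that row's list -> key 'value1'
value1 = df["column_name"][0][0].get("value1")
print(value1)  # 1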
Please see the code below:
#call the pandas library
import pandas as pd
#set the file location as URL or filepath of the json file
url = 'https://www.something.com/data.json'
#load the json data from the file to a pandas dataframe
df = pd.read_json(url, orient='columns')
#display the top 10 rows from the dataframe (this is to test only)
df.head(10)
Please review the code and modify it to suit your needs. I have added comments explaining each line. Hope this helps!
I am having a bit of trouble trying to load a JSON file into a pandas data frame. I have gone through different iterations of code listed below with no luck:
import pandas as pd
fileName = "ipl2015_auction_data.json"
jsonData = pd.read_json(fileName, orient="records")
ERROR: ValueError: arrays must all be same length
import pandas as pd
fileName = "ipl2015_auction_data.json"
#jsonData = pd.read_json(fileName, orient="records")
with open(fileName, "r") as jsonFile:
    data = jsonFile.read()

df = pd.DataFrame(data)
print df.head()
ERROR: pandas.core.common.PandasError: DataFrame constructor not properly called!
I can't figure out what I am doing wrong. Here is the JSON data I am trying to load.
SOLUTION:
It turns out there was an issue with my pandas installation. Re-installing it did the trick.
I am trying the Kaggle challenge here, and unfortunately I am stuck at a very basic step.
I am trying to read the datasets into a pandas dataframe by executing following command:
test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")
The problem is that this file, as you would find out, has over 300,000 records, but I am reading in only 7,945.
print (test.shape)
(7945, 21)
Now I have double checked the file and I cannot find anything special about line number 7945. Any pointers why this could be happening?
I think it is better to use the function read_csv with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False (link).
import pandas as pd
import csv
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)
print (test.shape)
#(381422, 22)
But some data (problematic) will be skipped.
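If you want a rough idea of how much was dropped, you can compare the parsed frame against a raw line count; a hypothetical check, assuming one record per physical line under QUOTE_NONE:

# count physical lines in the raw file and compare with the parsed frame
with open("output/Emails.csv") as f:
    raw_lines = sum(1 for _ in f)

print(raw_lines - 1 - len(test))  # minus 1 for the header row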
If you want to skip the email body data, you can use:
import pandas as pd
import csv
test = pd.read_csv(
"output/Emails.csv",
quoting=csv.QUOTE_NONE,
sep=',',
error_bad_lines=False,
header=None,
names=[
"Id", "DocNumber", "MetadataSubject", "MetadataTo", "MetadataFrom",
"SenderPersonId", "MetadataDateSent", "MetadataDateReleased",
"MetadataPdfLink", "MetadataCaseNumber", "MetadataDocumentClass",
"ExtractedSubject", "ExtractedTo", "ExtractedFrom", "ExtractedCc",
"ExtractedDateSent", "ExtractedCaseNumber", "ExtractedDocNumber",
"ExtractedDateReleased", "ExtractedReleaseInPartOrFull",
"ExtractedBodyText", "RawText"])
print (test.shape)
#delete row with NaN in column MetadataFrom
test = test.dropna(subset=['MetadataFrom'])
#delete headers in data
test = test[test.MetadataFrom != 'MetadataFrom']
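As a quick sanity check after the cleanup (assuming the column names above):

# every remaining row should be real data, not a repeated header line
assert (test['MetadataFrom'] != 'MetadataFrom').all()
print(test.shape)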