Variable not defined during data analysis - python

I'm new to programming and I've looked at previous answers to this question but none seem relevant to this specific query.
I'm learning to analyse data with python.
This is the code:
import pandas as pd
import os
os.chdir('/Users/Benjy/Documents/Python/Data Analysis Python')
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header = None, names = unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header = None, names = rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header = None, names = mnames)
data = pd.merge(pd.merge(ratings, users), movies)
mean_ratings=data.pivot_table('ratings',rows='title', cols='gender',aggfunc='mean')
I keep getting an error saying mean_ratings is not defined...but surely it is defined in the last line of code above?

I think this will work: the column created from rnames is called 'rating' (not 'ratings'), and current pandas uses index= and columns= rather than rows= and cols=, so the original pivot_table line raises before mean_ratings is ever assigned, which is why the later reference says it is not defined. Try:
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
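For completeness, a sketch of the two lines most likely to trip things up, assuming the ml-1m files are laid out as above (passing engine='python' is an optional addition on my part; it just silences the warning pandas gives for a multi-character separator):

ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
print(mean_ratings.head())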

How to setup Pandas groupby into subplots of tables?

So I currently have the code below.
I've managed to separate the items into categories using groupby, but now I would like to put them into a subplot of tables.
import pandas as pd
from pandas import DataFrame

## open the comma-separated file and the columns Name, In stock, Committed, Reorder point
file = pd.read_csv('Katana/InventoryItems-2022-01-06-09_10.csv',
                   usecols=['Name', 'In stock', 'Committed', 'Reorder point', 'Category'])
## take the columns and put them into lists
Name = file['Name'].tolist()
InStock = file['In stock'].tolist()
Committed = file['Committed'].tolist()
ReorderPT = file['Reorder point'].tolist()
Category = file['Category'].tolist()
## take the lists and change them into the appropriate type of data
inStock = [int(float(i)) for i in InStock]
commited = [int(float(i)) for i in Committed]
reorderpt = [int(float(i)) for i in ReorderPT]
## have the lists with the correct data type and arrange them
inventory = {'Name': Name,
             'In stock': inStock,
             'Committed': commited,
             'Reorder point': reorderpt,
             'Category': Category
             }
## take the inventory arrangement and display it as a table
frame = DataFrame(inventory)
grouped = frame.groupby(frame.Category)
df_elec = grouped.get_group('Electronics')
df_bedp = grouped.get_group('Bed Packaging')
df_fil = grouped.get_group('Filament')
df_fast = grouped.get_group('Fasteners')
df_kit = grouped.get_group('Kit Packaging')
df_pap = grouped.get_group('Paper')
Try something along the lines of:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=6, ncols=1)
for ax, data in zip(axs, [df_elec, df_bedp, df_fil, df_fast, df_kit, df_pap]):
    data.plot(ax=ax, table=True)
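If you only want the tables, with no chart drawn above each one, a possible alternative is to render each group with matplotlib's ax.table and hide the axes; this is a sketch under the assumption that each df_* frame is small enough to fit on one panel:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=6, ncols=1, figsize=(8, 18))
groups = [df_elec, df_bedp, df_fil, df_fast, df_kit, df_pap]
for ax, df_group in zip(axs, groups):
    ax.axis('off')  # hide the axes so only the table is visible
    ax.table(cellText=df_group.values, colLabels=df_group.columns, loc='center')
plt.tight_layout()
plt.show()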

How to get replies or quotes on a specific tweet

I am getting the tweets and the corresponding id of that user in an object obj, and I want to know why I don't get other information like conversation_id. I want to use it to get the replies and the quotes. That's the solution I found on the internet, but I didn't know how to make it work.
Does anyone know how to extract the conversation_id or other parameters like geo.place_id? I am using tweepy, but if anyone has a solution using another library that gets the same result, that would also be helpful. Thanks for your help!!!
You can try the code if you create a separate config file and define your tokens there; I can't share mine for security reasons.
import tweepy
import config

users_name = ['derspiegel', 'zeitonline']
tweet_tab = []

def getClient():
    client = tweepy.Client(bearer_token=config.BEARER_TOKEN,
                           consumer_key=config.API_KEY,
                           consumer_secret=config.API_KEY_SECRET,
                           access_token=config.ACCESS_TOKEN,
                           access_token_secret=config.ACCESS_TOKEN_SECRET)
    return client  # return the authenticated client so callers can use it

def searchTweets(client):
    for i in users_name:
        client = getClient()
        user = client.get_user(username=i)
        userId = user.data.id
        tweets = client.get_users_tweets(userId,
                                         expansions=[
                                             'author_id', 'referenced_tweets.id', 'referenced_tweets.id.author_id',
                                             'in_reply_to_user_id', 'attachments.media_keys',
                                             'entities.mentions.username', 'geo.place_id'],
                                         tweet_fields=[
                                             'id', 'text', 'author_id', 'created_at', 'conversation_id', 'entities',
                                             'public_metrics', 'referenced_tweets'
                                         ],
                                         user_fields=[
                                             'id', 'name', 'username', 'created_at', 'description', 'public_metrics',
                                             'verified'
                                         ],
                                         place_fields=['full_name', 'id'],
                                         media_fields=['type', 'url', 'alt_text', 'public_metrics'])
        if tweets is not None and len(tweets) > 0:
            obj = {}
            obj['id'] = userId
            obj['text'] = tweets
            tweet_tab.append(obj)
    return tweet_tab

searchTweets(getClient())
print("tableau final", tweet_tab)
My guess is that you need to put the ids into a list that the function can iterate over. Create the id list and try:
def get_tweets_from_timelines():
    tweets_timelines_list = []
    for id in range(0, len(ids), 1):
        one_id = (ids[id:id+1])
        one_id = ' '.join(one_id)
        for tweet in tweepy.Paginator(client.get_users_tweets, id=one_id, max_results=100,
                                      tweet_fields=['attachments', 'author_id', 'context_annotations', 'created_at',
                                                    'entities', 'conversation_id', 'possibly_sensitive',
                                                    'public_metrics', 'referenced_tweets', 'reply_settings',
                                                    'source', 'withheld'],
                                      user_fields=['created_at', 'description', 'entities', 'profile_image_url',
                                                   'protected', 'public_metrics', 'url', 'verified', 'withheld'],
                                      expansions=['referenced_tweets.id', 'in_reply_to_user_id',
                                                  'attachments.media_keys'],
                                      media_fields=['preview_image_url'],
                                      ):
            tweets_timelines_list.append(tweet)
    return tweets_timelines_list
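A minimal usage sketch for context; the client and ids globals the function relies on are my assumptions, built from the users_name list in your code:

client = getClient()
ids = [str(client.get_user(username=name).data.id) for name in users_name]

pages = get_tweets_from_timelines()
print(len(pages), "result pages collected")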

PRAW Loop With HTTP Exceptions

I am using a Python script with PRAW to loop through a list of subreddits and pull their posts. The list is quite long, however, and occasionally a subreddit on it will have been deleted, resulting in an HTTP exception (403, 404, etc.). My code is below; does anyone know a line or two I can add to skip the subreddits that raise errors?
import os
from collections import namedtuple
from datetime import datetime

import pandas as pd
import praw

# reddit = praw.Reddit(...)  # authenticated Reddit instance, created elsewhere

df = pd.read_csv('reddits.csv', sep=',')
df.head()

Submission = namedtuple('Submission', ['time', 'score', 'title', 'text', 'author', 'comments', 'url', 'domain',
                                       'permalink', 'ups', 'downs', 'likes', 'crosspost', 'duplicates', 'views'])

data = []
for i in df.reddits:
    subreddit = reddit.subreddit(i)
    for submission in subreddit.new(limit=10):
        time = datetime.utcfromtimestamp(submission.created_utc)
        score = submission.score
        title = submission.title
        text = submission.selftext
        author = submission.author
        comments = submission.num_comments
        url = submission.url
        domain = submission.domain
        permalink = submission.permalink
        ups = submission.ups
        downs = submission.downs
        likes = submission.likes
        crosspost = submission.num_crossposts
        duplicates = submission.num_duplicates
        views = submission.view_count
        data.append(Submission(time, score, title, text, author, comments, url, domain, permalink, ups, downs,
                               likes, crosspost, duplicates, views))

    df = pd.DataFrame(data)
    os.chdir('wd')
    filename = i + str(datetime.now()) + '.csv'
    df.to_csv(filename, index=False, encoding='utf-8')
You need to catch the exception, then you can continue
df = pd.read_csv('reddits.csv', sep=',')
df.head()

Submission = namedtuple('Submission', ['time', 'score', 'title', 'text', 'author', 'comments', 'url', 'domain',
                                       'permalink', 'ups', 'downs', 'likes', 'crosspost', 'duplicates', 'views'])

data = []
for i in df.reddits:
    try:
        subreddit = reddit.subreddit(i)
    except HTTPError as e:
        print(f"Got {e} retrieving {i}")
        continue  # control passes back to the next iteration of the outer loop

    for submission in subreddit.new(limit=10):
        submission = Submission(
            datetime.utcfromtimestamp(submission.created_utc),
            submission.score,
            submission.title,
            submission.selftext,
            submission.author,
            submission.num_comments,
            submission.url,
            submission.domain,
            submission.permalink,
            submission.ups,
            submission.downs,
            submission.likes,
            submission.num_crossposts,
            submission.num_duplicates,
            submission.view_count,
        )
        data.append(submission)

    df = pd.DataFrame(data)
    os.chdir('wd')
    filename = i + str(datetime.now()) + '.csv'
    df.to_csv(filename, index=False, encoding='utf-8')
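One caveat, based on how PRAW usually behaves rather than on your traceback: reddit.subreddit(i) is lazy, so the 403/404 often only surfaces when the listing is actually fetched in the inner loop, and it arrives as a prawcore exception rather than a plain HTTPError. A sketch of wrapping the fetch itself:

import prawcore

for i in df.reddits:
    try:
        for submission in reddit.subreddit(i).new(limit=10):
            ...  # build the Submission tuple exactly as above
    except (prawcore.exceptions.Forbidden, prawcore.exceptions.NotFound) as e:
        print(f"Skipping {i}: {e}")
        continue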
Also, unrelated: i is not a good name for that value; it traditionally stands for "index", which is not what it holds here. e would be the corresponding generic name, standing for "element", but a descriptive name such as subreddit_name would be the idiomatic choice in Python.

Python Error Logging to Check for Duplicate Rows and Duplicate Columns

The following code reads an existing MS Excel spreadsheet, creates a column map, then exports the results to another Excel spreadsheet. The sheet and associated columns are actually much larger, but to keep the context simple I have pared down the number of columns in the map.
You will note that in the process I am creating 2 NULL columns and dropping any duplicate rows. I am struggling with a proper try/except statement (or statements) that will validate that I am not overwriting existing columns with the created NULL columns, and that there are no duplicate rows. I know that I am not, but I need the error log report for audit purposes. Following is a simple mock-up of the code; this is as far as I have gotten. I am still fairly new to exception handling and would appreciate your help. Thanks in advance.
import logging
import os
from datetime import datetime

import pandas as pd

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s] %(message)s",
    handlers=[logging.StreamHandler()])

os.chdir(r'M:\Loans')

col_map = {'Loan #': 'LoanNo',
           'Last Name': 'LastName',
           'Purchase Price': 'PurchasePrice',
           'Loan Amt': 'LoanAmt',
           'Property Address': 'PropertyAddress',
           'City': 'City',
           'State': 'State',
           'Zip Code': 'ZipCode',
           'Interest Rate': 'InterestRate',
           'UPBCurrent': 'UPBCurrent',
           'NextDueDateAtPurchase': 'NextDueDateAtPurchase',
           'CurrentAdvanceRate': 'CurrentAdvanceRate',
           'Comments': 'Comments',
           'CurrentAdvanceAmount': 'CurrentAdvanceAmount',
           'SecondRoundCurrentAdvanceRate': 'SecRoundCurrentAdvRate',
           'SecondRoundCurrentAvanceAmount': 'SecRoundCurrentAdvAmount',
           }

for f in os.listdir():
    logging.info('Reading in file {}'.format(f))
    df = pd.read_excel(r'M:\Loans\Loan Blotter XYZ OLD.xlsx')
    df['UPBCurrent'] = None
    df['NextDueDateAtPurchase'] = None
    df = df[col_map.keys()]
    df.drop_duplicates(inplace=True)
    df.columns = [col_map[col] for col in df.columns]
    df['Channel'] = 'Whole Loans'
    df['DateCreated'] = datetime.today().date()
    df.to_excel(r'M:\Err Log.xlsx', index=False)
To check that you won't overwrite existing cols:

null_cols = ['UPBCurrent', 'NextDueDateAtPurchase']
for null_col in null_cols:
    if null_col in df.columns:
        logging.error("{} will be overwritten.".format(null_col))
    else:
        logging.info("Adding null column {}.".format(null_col))
        df[null_col] = None

To check that drop_duplicates will work:

try:
    df.drop_duplicates(inplace=True)
except Exception:
    logging.error("Failed to drop duplicate rows.")

Unable to retrieve data from Nasdaq in Python

I am planning to do some financial research and learning using data from the NASDAQ.
I want to retrieve data from Nasdaq such that the header has the following:
Stock Symbol
Company Name
Last Sale
Market Capitalization
IPO Year
Sector
Industry
Last Update
And I used the following Python code to get the list of companies and ticker names:
import pandas as pd
import json

PACKAGE_NAME = 'nasdaq-listings'
PACKAGE_TITLE = 'Nasdaq Listings'

nasdaq_listing = 'ftp://ftp.nasdaqtrader.com/symboldirectory/nasdaqlisted.txt'  # Nasdaq only

def process():
    nasdaq = pd.read_csv(nasdaq_listing, sep='|')
    nasdaq = _clean_data(nasdaq)
    # Create a few other data sets
    nasdaq_symbols = nasdaq[['Symbol', 'Company Name']]  # Nasdaq w/ 2 columns
    # (dataframe, filename) datasets we will put in schema & create csv
    datasets = [(nasdaq, 'nasdaq-listed'), (nasdaq_symbols, 'nasdaq-listed-symbols')]
    for df, filename in datasets:
        df.to_csv('data/' + filename + '.csv', index=False)
    with open("datapackage.json", "w") as outfile:
        json.dump(_create_datapackage(datasets), outfile, indent=4, sort_keys=True)

def _clean_data(df):
    # TODO: do I want to save the file creation time (last row)
    df = df.copy()
    # Remove test listings
    df = df[df['Test Issue'] == 'N']
    # Create new column w/ just the company name
    df['Company Name'] = df['Security Name'].apply(lambda x: x.split('-')[0])  # nasdaq file uses - to separate stock type
    # df['Company Name'] = TODO, remove stock type for otherlisted file (no separator)
    # Move Company Name to 2nd col
    cols = list(df.columns)
    cols.insert(1, cols.pop(-1))
    df = df.loc[:, cols]
    return df

def _create_file_schema(df, filename):
    fields = []
    for name, dtype in zip(df.columns, df.dtypes):
        if str(dtype) == 'object' or str(dtype) == 'boolean':  # does datapackage.json use boolean type?
            dtype = 'string'
        else:
            dtype = 'number'
        fields.append({'name': name, 'description': '', 'type': dtype})
    return {
        'name': filename,
        'path': 'data/' + filename + '.csv',
        'format': 'csv',
        'mediatype': 'text/csv',
        'schema': {'fields': fields}
    }

def _create_datapackage(datasets):
    resources = []
    for df, filename in datasets:
        resources.append(_create_file_schema(df, filename))
    return {
        'name': PACKAGE_NAME,
        'title': PACKAGE_TITLE,
        'license': '',
        'resources': resources,
    }

process()
Now, for each of these symbols, I want to get the other data listed above.
Is there any way I could do this?
Have you taken a look at pandas-datareader? You may be able to get the other data from there. It has multiple data sources, such as Google and Yahoo Finance:
http://pandas-datareader.readthedocs.io/en/latest/remote_data.html#remote-data-google
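A minimal sketch of pulling price data with it for one symbol; the ticker, source, and dates are placeholders, and whether a given source (e.g. Yahoo) still responds depends on its current availability:

import pandas_datareader.data as web
from datetime import datetime

start = datetime(2017, 1, 1)
end = datetime(2017, 12, 31)
# fetch daily prices for one ticker; loop over your nasdaq_symbols frame to cover them all
prices = web.DataReader('AAPL', 'yahoo', start, end)
print(prices.head())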
