ValueError: labels ['timestamp'] not contained in axis - python

I am learning machine learning and I came across this code.
I am trying to run the file "Recommender-Systems.py" from the above source. But it throws an error
ValueError: labels ['timestamp'] not contained in axis. How can it be removed?
Here's a Dropbox link to the u.data file.

Your data is missing the headers, so they're being wrongly inferred from the first row.
You need to change Recommender-Systems.py a little and set the headers manually.
The right headers are listed in the README file of your data set.
Change your file to something like this:
## Explore the data (line 27)
data = pd.read_table('u.data', header=None)  # header=None avoids inferring the header from the first row
data.columns = ['userID', 'itemID',
                'rating', 'timestamp']  # Manually set the columns.
data = data.drop('timestamp', axis=1)  # Continue with regular work.
...
## Load user information (line 75)
users_info = pd.read_table('u.user', sep='|', header=None)
users_info.columns = ['userID', 'age', 'gender',
                      'occupation', 'zipcode']
users_info = users_info.set_index('userID')
...
## Load movie information (line 88)
movies_info = pd.read_table('u.item', sep='|', header=None)
movies_info.columns = ['movieID', 'movie title', 'release date',
                       'video release date', 'IMDb URL', 'unknown',
                       'Action', 'Adventure', 'Animation', "Children's",
                       'Comedy', 'Crime', 'Documentary', 'Drama',
                       'Fantasy', 'Film-Noir', 'Horror', 'Musical',
                       'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
                       'War', 'Western']
movies_info = movies_info.set_index('movieID')#.drop(low_count_movies)
This should work (but I'm not sure if I got all the right names for the columns).
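As a side note, the `names=` argument can do the `header=None` and column-assignment steps in one call. A quick sketch on a toy tab-separated snippet, assuming u.data has the same four-column layout (I'm using `pd.read_csv` with `sep='\t'`, which is equivalent to `read_table` here):

```python
import io
import pandas as pd

# Toy stand-in for u.data: tab-separated, no header row (assumed layout).
raw = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

# names= sets the headers and implies header=None in a single call.
data = pd.read_csv(io.StringIO(raw), sep="\t",
                   names=["userID", "itemID", "rating", "timestamp"])
data = data.drop("timestamp", axis=1)  # the label now exists, so drop works
print(data.columns.tolist())
```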

Related

Why are there no duplicates in pandas dataframe.index?

I just wrote a program for college using pandas to structure some unstructured data. I definitely made it harder than it should be, but I ended up finding something interesting.
Here is the data I parsed:
Center/Daycare
825 23rd Street South
Arlington, VA 22202
703-979-BABY (2229)
22.
Maria Teresa Desaba, Owner/Director; Tony Saba, Org. >Director.
Website: www.mariateresasbabies.com
Serving children 6 wks to 5yrs full-time.
National Science Foundation Child Development Center
23.
4201 Wilson Blvd., Suite 180 22203
703-292-4794
Website: www.brighthorizons.com 112 children, ages 6 wks - 5 yrs.
7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years.
Here is the code (aggressively commented for school); it is mostly irrelevant but included for completeness' sake:
import csv
import pandas as pd

lines = []
"""opening the raw data from a text file"""
with open('raw_data.txt') as f:
    lines = f.readlines()
"""removing new line characters"""
for i in range(len(lines)):
    lines[i] = lines[i].rstrip('\n')
df = pd.DataFrame(lines, columns=['info'],
                  index=['business type', 'address', 'location',
                         'phone number', 'unknown', 'owner', 'website', 'description',
                         'null', 'business type', 'unknown', 'address', 'phone number',
                         'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
"""here I am taking every column and adding corresponding values from the original dataframe
extra data frames could be garbage collected but this serves for demonstration"""
df.index = df.index.astype('str')
df1 = df[df.index.str.contains('bus')]
df2 = df[df.index.str.contains('address')]
df3 = df[df.index.str.contains('location')]
df4 = df[df.index.str.contains('number')]
df5 = df[df.index.str.contains('know')]
df6 = df[df.index.str.contains('owner')]
df7 = df[df.index.str.contains('site')]
df8 = df[df.index.str.contains('descript')]
df9 = df[df.index.str.contains('null')]
for i in range(len(df1)):
    df['business type'][i] = df1['info'][i]
for i in range(len(df2)):
    df['address'][i] = df2['info'][i]
for i in range(len(df3)):
    df['location'][i] = df3['info'][i]
for i in range(len(df4)):
    df['phone number'][i] = df4['info'][i]
for i in range(len(df5)):
    df['unknown'][i] = df5['info'][i]
for i in range(len(df6)):
    df['owner'][i] = df6['info'][i]
for i in range(len(df7)):
    df['website'][i] = df7['info'][i]
for i in range(len(df8)):
    df['description'][i] = df8['info'][i]
for i in range(len(df9)):
    df['null'][i] = df9['info'][i]
"""dropping unnecessary columns"""
df.drop(columns='info', inplace=True)
df.drop(columns='null', inplace=True)
df.drop(columns='unknown', inplace=True)
"""changing the index values to int to make it easier to drop unused rows"""
idx = []
for i in range(0, len(df)):
    idx.append(i)
df.index = idx
"""dropping unused rows"""
for i in range(2, 15):
    df.drop([i], inplace=True)
"""writing to csv and printing to console"""
df.to_csv("new.csv", index=False)
print(df.to_string())
I'm just curious why, when I create more columns using the name of each index item here:
df = pd.DataFrame(lines, columns=['info'],
                  index=['business type', 'address', 'location',
                         'phone number', 'unknown', 'owner', 'website', 'description',
                         'null', 'business type', 'unknown', 'address', 'phone number',
                         'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
the result doesn't contain any duplicates.
when I add
print(df.columns)
I get the output
Index(['info', 'business type', 'address', 'location', 'phone number',
       'unknown', 'owner', 'website', 'description', 'null'],
      dtype='object')
I'm just generally curious why there are no duplicates, as I'm sure that could be problematic in certain situations. Also, pandas is interesting, I hardly understand it, and I would like to know more. If you feel extra enthusiastic, any info on a more efficient way to do this would be greatly appreciated; if not, no worries, I'll eventually read the docs.
The pandas DataFrame is designed for tabular data in which all the entries in any one column have the same type (e.g. integer or string). One row usually represents one instance, sample, or individual. So the natural way to parse your data into a DataFrame is to have two rows, one for each institution, and define the columns as what you have called index (perhaps with the address split into several columns), e.g. business type, street, city, state, post code, phone number, etc.
So there would be one row per institution, and the index would be used to assign a unique identifier to each of them. That's why it's desirable for the index to contain no duplicates.
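A minimal sketch of that layout, one row per institution (the field values below are illustrative stand-ins taken from your sample, and the column split is just one possible choice):

```python
import pandas as pd

# One row per institution; each column holds one field.
records = [
    {"business type": "Center/Daycare",
     "address": "825 23rd Street South",
     "phone number": "703-979-BABY (2229)",
     "website": "www.mariateresasbabies.com"},
    {"business type": "National Science Foundation Child Development Center",
     "address": "4201 Wilson Blvd., Suite 180 22203",
     "phone number": "703-292-4794",
     "website": "www.brighthorizons.com"},
]
df = pd.DataFrame.from_records(records)  # default RangeIndex 0, 1 is unique
print(df.shape)  # one row per institution, one column per field
```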

Plotly displays discrete colors instead of a colorscale

I'm learning Plotly Choropleth maps by doing some very basic examples. I'm plotting countries' GDP on a world map. Instead of a colorscale, from lower to higher GDP, I get a map with a discrete color for every country.
I suspect it might have to do with the GDP in the original dataset being a string, e.g. '23,350,230'. I have converted it to float, and confirmed the conversion worked.
fig = px.choropleth(df, locations="Code",
color="GDP",
hover_name="Country",
color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
I have also tried using other values for color_continuous_scale, including ones from here, and removing the parameter altogether; the result was still the same map with discrete colors.
Please tell me what I'm doing wrong, thank you!
EDIT
To reproduce the issue:
The dataset is from Kaggle and can be downloaded here. Its formatting is not great, with many empty/redundant rows and 3 empty columns, so I have done some steps to preprocess the data. By the way, the preprocessing is pretty rough, so any suggestions on how I could improve it are very welcome!
df = pd.read_csv("gdp-ppp.csv", encoding="ISO-8859-1")
df = df.drop(['Unnamed: 2', 'Unnamed: 5', 'Unnamed: 6'], axis=1)
df = df.drop(df.index[0:4])
df = df.drop(df.index[195:])
df = df.drop(df.index[-4:])
df.columns = ['Code', 'Rank', 'Country', 'GDP']
i = 4
for gdp in df["GDP"]:
    gdp = gdp.replace(",", "")
    df["GDP"][i] = float(gdp)
    i += 1
for gdp in df["GDP"]:
    if type(gdp) != type(1.1):
        print(gdp)
This seems to work, the print(gdp) in the last loop is never called, and the dataframe looks nice and clean. So that's when I use the code above to create the choropleth map, which is created, and the data is displayed correctly in the bar on the left, but the coloring is discrete. Here's the screenshot of the map I get.
Your suspicion is correct, plotly is seeing GDP as a string and thus using discrete colors. Use str.replace to remove the comma from the csv data (and then convert to float). Something like:
df["GDP"] = df["GDP"].str.replace(",","").astype(float)
This would come right after df.columns = ['Code', 'Rank', 'Country', 'GDP'], and then remove the for loops.
Complete code:
import pandas as pd
import plotly.express as px

df = pd.read_csv("gdp-csv-.csv", encoding="ISO-8859-1")
df = df.drop(['Unnamed: 2', 'Unnamed: 5', 'Unnamed: 6'], axis=1)
df = df.drop(['Unnamed: 9', 'Unnamed: 10', 'Unnamed: 7', 'Unnamed: 8'], axis=1)
df = df.drop(df.index[0:4])
df = df.drop(df.index[195:])
df = df.drop(df.index[-4:])
df.columns = ['Code', 'Rank', 'Country', 'GDP']
df["GDP"] = df["GDP"].str.replace(",", "").astype(float)
fig = px.choropleth(df, locations="Code",
                    color="GDP",
                    hover_name="Country",
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
btw, a cleaner way to bring the csv in would be to specify the columns with usecols and the rows with skiprows, see here:
df = pd.read_csv("gdp-csv-.csv", encoding="ISO-8859-1", usecols=[0, 1, 3, 4], skiprows=4,
                 skipfooter=122, engine='python')
df.columns = ['Code', 'Rank', 'Country', 'GDP']
df["GDP"] = df["GDP"].str.replace(",", "").astype(float)
EDIT: added skipfooter to pd.read_csv
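To see the string-to-float conversion in isolation (toy values, not the Kaggle file):

```python
import pandas as pd

# Comma-formatted numbers load as strings (dtype 'object').
gdp = pd.Series(["23,350,230", "1,234", "567"])

# Vectorized clean-up: strip the thousands separators, then cast to float.
gdp_float = gdp.str.replace(",", "").astype(float)
print(gdp_float.dtype)  # float64 — now plotly treats it as continuous
```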

How to write metachars within pandas DataFrame column names

I have a CSV file that I am parsing with pandas using ISO-8859-1 encoding. I'm trying to create a DataFrame df_cols that prints only selected columns, but it gives an error on execution because some column names contain metachars like / and ' (for example 'Card Holder's Name', 'CVV/CVV2'), and so it fails to produce the output.
#!/grid/common/pkgs/python/v3.6.1/bin/python3
##### Pandas Display Setting for the complete output on the terminal ####
import pandas as pd
pd.set_option('display.height', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('expand_frame_repr', True)
df_list = pd.read_csv('/docs/Credit_Card.csv', encoding='ISO-8859-1')
df_cols = df_list[['Card Type Full Name', 'Issuing Bank', 'Card Number', 'Card Holder's Name', 'CVV/CVV2', 'Issue Date', 'Expiry Date','Credit Limit']]
print(df_cols)
Try putting the column name in triple quotation marks:
"""Card Holder's Name"""
Try escaping the single quote character with \
df_cols = df_list[['Card Type Full Name', 'Issuing Bank', 'Card Number', 'Card Holder\'s Name', 'CVV/CVV2', 'Issue Date', 'Expiry Date','Credit Limit']]
As suggested by d_kennetz, we can read only the desired columns, by name or by index position, when creating the DataFrame df_list itself, which reduces the time and memory needed compared to reading the whole CSV.
As mentioned, there are two ways to select the columns: the first, by name, requires extra care with special/metachars, whereas the second, by index position, doesn't need any of that care, which makes it slightly more useful for avoiding this glitch.
df_list = pd.read_csv('/docs/Credit_Card.csv', encoding='ISO-8859-1',usecols=['Card Type Full Name', 'Issuing Bank', 'Card Number', 'Card Holder\'s Name', 'CVV/CVV2', 'Issue Date', 'Expiry Date','Credit Limit'])
OR
df_list = pd.read_csv('/docs/Credit_Card.csv', encoding='ISO-8859-1',usecols=[1, 2, 3, 4, 5, 6, 7, 10])
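A self-contained illustration of both usecols styles on a toy CSV (the column names and values here are made up; double quotes around the name sidestep the apostrophe entirely):

```python
import io
import pandas as pd

csv = "id,Card Holder's Name,CVV/CVV2\n1,Jane Doe,123\n"

# By name: a double-quoted string needs no escaping of the apostrophe.
by_name = pd.read_csv(io.StringIO(csv), usecols=["Card Holder's Name", "CVV/CVV2"])

# By position: no need to spell out the names at all.
by_pos = pd.read_csv(io.StringIO(csv), usecols=[1, 2])

print(by_name.columns.tolist())
assert by_name.equals(by_pos)  # both select the same two columns
```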

How can I graph this as a stacked bar chart?

I have the code below; it produces my data frame exactly how I want it, but I can't seem to graph it via a grouped bar chart.
I'd like to have the department on the X axis, with Completed on the Y axis and the Remaining information stacked on top.
import pandas as pd
import matplotlib
data = pd.read_excel('audit.xls', skiprows=2, index_col = 'Employee Department')
data.rename(columns = {'Curriculum Name':'Curriculum','Organization Employee Number':'Employee_Number', 'Employee Name': 'Employee','Employee Email':'Email', 'Employee Status':'Status', 'Date Assigned':'Assigned','Completion Date':'Completed'}, inplace=True)
data.drop(['Employee_Number', 'Employee','Assigned', 'Status', 'Manager Name', 'Manager Email', 'Completion Status','Unnamed: 1', 'Unnamed: 5', 'Unnamed: 6'], axis=1, inplace=True)
new_data = data.query('Curriculum ==["CARB Security Training","OIS Support Training","Legal EO Training"]')
new_data2 = new_data.groupby('Employee Department').count().eval('Remaining = Email - Completed', inplace=False)
new_data2
I assume I need to convert it to a pivot table somehow, since that's how it is in Excel.
Have you tried something like this: new_data2[['Completed','Remaining']].plot.bar(stacked=True)
The following example works for me:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  columns=['Email', 'Completed', 'Remaining'], index=['A', 'B', 'C'])
df[['Completed', 'Remaining']].plot.bar(stacked=True)
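If you want to sanity-check the groupby/count/eval step from the question in isolation, here's a toy version (the departments, emails, and dates are invented):

```python
import pandas as pd

# Toy audit records: one row per assigned training; None = not yet completed.
audit = pd.DataFrame({
    "Employee Department": ["IT", "IT", "HR", "HR", "HR"],
    "Email":     ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com"],
    "Completed": ["2020-01-01", None, "2020-02-01", "2020-03-01", None],
})

# count() tallies non-null cells: Email = assigned, Completed = finished.
counts = (audit.groupby("Employee Department")
               .count()
               .eval("Remaining = Email - Completed"))
print(counts)
# Then: counts[["Completed", "Remaining"]].plot.bar(stacked=True)
```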

Pandas reading and sorting a file's content

I am reading a file from SIPRI. It reads in to pandas and dataframe is created and I can display it but when I try to sort by a column, I get a KeyError. Here is the code and the error:
import os
import pandas as pd
os.chdir('C:\\Users\\Student\\Documents')
#Find the top 20 countries in military spending by sorting
data = pd.read_excel('SIPRI-Milex-data-1949-2016.xls',
                     header=0, index_col=0, sheetname='Current USD')
data.sort_values(by = '2016', ascending = False)
KeyError: '2016'
You get the KeyError because the column '2016' is not present in the dataframe; based on the Excel file, the column label is an integer, not a string. The data must also be cleaned before you can sort it.
You can skip the top 5 rows and the bottom 8 rows to get the countries, then replace all the string placeholders and missing values with NaN. The following code does that:
import numpy as np

data = pd.read_excel('./SIPRI-Milex-data-1949-2016.xlsx', header=0, index_col=0,
                     sheetname='Current USD', skiprows=5, skip_footer=8)
data = data.replace(r'\s+', np.nan, regex=True).replace('xxx', np.nan)
new_df = data.sort_values(2016, ascending=False)
top_20 = new_df[:20].index.tolist()
Output:
['USA', 'China, P.R.', 'Russian Federation', 'Saudi Arabia', 'India', 'France', 'UK', 'Japan', 'Germany', 'Korea, South', 'Italy', 'Australia', 'Brazil', 'Israel', 'Canada', 'Spain', 'Turkey', 'Iran', 'Algeria', 'Pakistan']
Well this could be helpful, I guess:
data = pd.read_excel('SIPRI-Milex-data-1949-2016.xlsx', skiprows=5, index_col = 0, sheetname = 'Current USD')
data.dropna(inplace=True)
data.sort_values(by=2016, ascending=False, inplace=True)
And to get Top20 you can use:
data[data[2016].apply(lambda x: isinstance(x, (int, float)))][:20]
I downloaded the file, and it looks like 2016 is not a column header by itself, so you need to modify the dataframe a bit to make the country row the header.
The next thing is that you need to write data.sort_values(by=2016, ascending=False), i.e. treat the column name as an integer instead of a string.
data = pd.read_excel('SIPRI-Milex-data-1949-2016.xlsx',
                     header=0, index_col=0, sheetname='Current USD')
data = data[4:]
data.columns = data.iloc[0]
data.sort_values(by=2016, ascending=False)
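The string-vs-integer label distinction can be checked on a toy frame (the year columns and values are made up):

```python
import pandas as pd

# Column labels here are integers, as happens when a year row becomes the header.
df = pd.DataFrame({2015: [3, 1], 2016: [10, 20]}, index=["A", "B"])

df.sort_values(by=2016, ascending=False)  # works: 2016 is a column label
try:
    df.sort_values(by="2016")             # the string '2016' is not a label
except KeyError as e:
    print("KeyError:", e)
```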
