I have the code below. It produces my data frame exactly how I want it, but I can't seem to graph it as a grouped bar chart.
I'd like the department on the X axis and, on the Y axis, the completed count with the remaining count stacked on top.
import pandas as pd
import matplotlib
data = pd.read_excel('audit.xls', skiprows=2, index_col = 'Employee Department')
data.rename(columns = {'Curriculum Name':'Curriculum','Organization Employee Number':'Employee_Number', 'Employee Name': 'Employee','Employee Email':'Email', 'Employee Status':'Status', 'Date Assigned':'Assigned','Completion Date':'Completed'}, inplace=True)
data.drop(['Employee_Number', 'Employee','Assigned', 'Status', 'Manager Name', 'Manager Email', 'Completion Status','Unnamed: 1', 'Unnamed: 5', 'Unnamed: 6'], axis=1, inplace=True)
new_data = data.query('Curriculum ==["CARB Security Training","OIS Support Training","Legal EO Training"]')
new_data2 = new_data.groupby('Employee Department').count().eval('Remaining = Email - Completed', inplace=False)
new_data2
I assume I need to convert it to a pivot table somehow, since that's how it is in Excel.
Have you tried something like this: new_data2[['Completed','Remaining']].plot.bar(stacked=True)
The following example works for me:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(1,10).reshape(3,3), columns=['Email', 'Completed', 'Remaining'], index=['A', 'B', 'C'])
df[['Completed', 'Remaining']].plot.bar(stacked=True)
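To connect this back to the question's groupby step, here is a minimal sketch with made-up rows (the departments, emails and dates are invented for illustration). It relies on count() counting only non-null cells, so Email gives the number assigned and Completed gives the number finished:

```python
import pandas as pd

# Made-up sample rows; a missing completion date means "not completed yet".
records = pd.DataFrame({
    "Employee Department": ["IT", "IT", "HR", "HR", "HR"],
    "Email": ["a@x.org", "b@x.org", "c@x.org", "d@x.org", "e@x.org"],
    "Completed": ["2020-01-05", None, "2020-01-07", None, None],
})

# count() ignores nulls, so Remaining = assigned - completed per department.
summary = (
    records.groupby("Employee Department")
    .count()
    .eval("Remaining = Email - Completed")
)
print(summary)

# With matplotlib installed, the stacked chart would then be:
# summary[["Completed", "Remaining"]].plot.bar(stacked=True)
```

Dropping stacked=True from the plot call should give side-by-side (grouped) bars instead of stacked ones.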
I want to print a data frame as a PNG image, and used the following approach.
import pandas as pd
import dataframe_image as dfi
data = {'Type': ['Type 1', 'Type 2', 'Type 3', 'Total'], 'Value': [20, 21, 19, 60]}
df = pd.DataFrame(data)
dfi.export(df, 'table.png')
However, I also want to print a date stamp above the table on the image, with the intention of creating a series of images on consecutive days. If possible, I would also like to format the table with a horizontal line above the final 'Total' row to mark it as the sum of the values.
Is this possible with the above package, or is there a better approach?
You can add the line df.index.name = pd.Timestamp('now').replace(microsecond=0) to show the timestamp in the first row of the table.
To draw the horizontal line above the 'Total' row, you can use .style.set_table_styles:
import pandas as pd
import dataframe_image as dfi
data = {'Type': ['Type 1', 'Type 2', 'Type 3'], 'Value': [20, 21, 19]}
df = pd.DataFrame(data)
df.index.name = pd.Timestamp('now').replace(microsecond=0)
df.loc[len(df)] = ['Total',df['Value'].sum()]
test = df.style.set_table_styles([{'selector' : '.row3','props' : [('border-top','3px solid black')]}])
dfi.export(test, 'table.png')
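The '.row3' selector above is tied to a table with exactly three data rows. As a rough sketch (the variable names total_row and table_styles are mine, not part of either package), the selector could be derived from the frame's length so that it keeps pointing at the 'Total' row when the number of rows changes:

```python
import pandas as pd

data = {"Type": ["Type 1", "Type 2", "Type 3"], "Value": [20, 21, 19]}
df = pd.DataFrame(data)
df.loc[len(df)] = ["Total", df["Value"].sum()]

# Position of the last row, i.e. the 'Total' row appended above.
total_row = len(df) - 1
table_styles = [
    {"selector": f".row{total_row}", "props": [("border-top", "3px solid black")]}
]
print(table_styles[0]["selector"])

# The styling and export steps are the same as in the answer
# (they need jinja2 and dataframe_image installed):
# styled = df.style.set_table_styles(table_styles)
# dfi.export(styled, "table.png")
```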
I just wrote a program for college using pandas to structure some unstructured data. I definitely made it harder than it should be, but I ended up finding something interesting.
Here is the data I parsed:
Center/Daycare
825 23rd Street South
Arlington, VA 22202
703-979-BABY (2229)
22.
Maria Teresa Desaba, Owner/Director; Tony Saba, Org. >Director.
Website: www.mariateresasbabies.com
Serving children 6 wks to 5yrs full-time.
National Science Foundation Child Development Center
23.
4201 Wilson Blvd., Suite 180 22203
703-292-4794
Website: www.brighthorizons.com 112 children, ages 6 wks - 5 yrs.
7:00 a.m. – 6:00 p.m. Summer Camp for children 5 - 9 years.
Here is the code (aggressively commented for school). It is mostly irrelevant, but included for completeness' sake:
import csv
import pandas as pd
lines = []
"""opening the raw data from a text file"""
with open('raw_data.txt') as f:
    lines = f.readlines()
    f.close()
"""removing new line characters"""
for i in range(len(lines)):
    lines[i] = lines[i].rstrip('\n')
df = pd.DataFrame(lines, columns=['info'], index=['business type', 'address', 'location',
                                                  'phone number', 'unknown', 'owner', 'website', 'description',
                                                  'null', 'business type', 'unknown', 'address', 'phone number',
                                                  'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
"""here I am taking every column and adding corresponding values from the original dataframe
extra data frames could be garbage collected but this serves for demonstration"""
df.index = df.index.astype('str')
df1 = df[df.index.str.contains('bus')]
df2 = df[df.index.str.contains('address')]
df3 = df[df.index.str.contains('location')]
df4 = df[df.index.str.contains('number')]
df5 = df[df.index.str.contains('know')]
df6 = df[df.index.str.contains('owner')]
df7 = df[df.index.str.contains('site')]
df8 = df[df.index.str.contains('descript')]
df9 = df[df.index.str.contains('null')]
for i in range(len(df1)):
    df['business type'][i] = df1['info'][i]
for i in range(len(df2)):
    df['address'][i] = df2['info'][i]
for i in range(len(df3)):
    df['location'][i] = df3['info'][i]
for i in range(len(df4)):
    df['phone number'][i] = df4['info'][i]
for i in range(len(df5)):
    df['unknown'][i] = df5['info'][i]
for i in range(len(df6)):
    df['owner'][i] = df6['info'][i]
for i in range(len(df7)):
    df['website'][i] = df7['info'][i]
for i in range(len(df8)):
    df['description'][i] = df8['info'][i]
for i in range(len(df9)):
    df['null'][i] = df9['info'][i]
"""dropping unnecessary columns"""
df.drop(columns='info', inplace=True)
df.drop(columns='null', inplace=True)
df.drop(columns='unknown', inplace=True)
"""changing the index values to int to make easier to drop unused rows"""
idx = []
for i in range(0, len(df)):
    idx.append(i)
df.index = idx
"""dropping unused rows"""
for i in range(2, 15):
    df.drop([i], inplace=True)
"""writing to csv and printing to console"""
df.to_csv("new.csv", index=False)
print(df.to_string())
I'm just curious why the set of columns I create from the name of each index[i] item here
df = pd.DataFrame(lines, columns=['info'], index=['business type', 'address', 'location',
                                                  'phone number', 'unknown', 'owner', 'website', 'description',
                                                  'null', 'business type', 'unknown', 'address', 'phone number',
                                                  'website', 'description'])
"""creating more columns with the value at each index. This doesn't contain any duplicates"""
for i in df.index:
    df[i] = ''
doesn't contain any duplicates.
when I add
print(df.columns)
I get the output
Index(['info', 'business type', 'address', 'location', 'phone number',
'unknown', 'owner', 'website', 'description', 'null'],
dtype='object')
I'm just generally curious why there are no duplicates, as I'm sure that could be problematic in certain situations. Pandas is also interesting, I hardly understand it, and I would like to know more. Also, if you feel extra enthusiastic, any info on a more efficient way to do this would be greatly appreciated; if not, no worries, I'll eventually read the docs.
The pandas DataFrame is designed for tabular data in which all the entries in any one column have the same type (e.g. integer or string). One row usually represents one instance, sample, or individual. So the natural way to parse your data into a DataFrame is to have two rows, one for each institution, and define the columns as what you have called index (perhaps with the address split into several columns), e.g. business type, street, city, state, post code, phone number, etc.
So there would be one row per institution, and the index would be used to assign a unique identifier to each of them. That's why it's desirable for the index to contain no duplicates.
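On the narrower question of why the loop leaves no duplicate columns: assigning with df[label] = value creates the column the first time a label is seen and simply overwrites it the next time. A small sketch with a cut-down, made-up index (only three labels, one repeated) shows this, followed by the one-row-per-institution layout described above, using a few fields from the question's data:

```python
import pandas as pd

# Cut-down version of the question's frame: the index repeats 'address'.
df = pd.DataFrame({"info": ["a", "b", "c"]},
                  index=["address", "phone number", "address"])
print(df.index.is_unique)  # False

for label in df.index:
    # The second 'address' assigns to the existing column
    # instead of appending a new one.
    df[label] = ""
print(list(df.columns))  # ['info', 'address', 'phone number']

# One row per institution, one column per field.
records = [
    {"business type": "Center/Daycare",
     "address": "825 23rd Street South",
     "location": "Arlington, VA 22202",
     "phone number": "703-979-BABY (2229)"},
    {"business type": "National Science Foundation Child Development Center",
     "address": "4201 Wilson Blvd., Suite 180 22203",
     "phone number": "703-292-4794"},
]
tidy = pd.DataFrame(records)  # missing fields become NaN
print(tidy.shape)  # (2, 4)
```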
I'm learning Plotly Choropleth maps by doing some very basic examples. I'm plotting countries' GDP on a world map. Instead of a colorscale, from lower to higher GDP, I get a map with a discrete color for every country.
I suspect it might have to do with the GDP in the original dataset being a string, e.g. '23,350,230'. I have converted it to float, and confirmed the conversion worked.
fig = px.choropleth(df, locations="Code",
                    color="GDP",
                    hover_name="Country",
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
I have also tried using other values for color_continuous_scale, including ones from here, and removing the parameter altogether; the result was still the same map with discrete colors.
Please tell me what I'm doing wrong, thank you!
EDIT
To reproduce the issue:
The dataset is from Kaggle and can be downloaded here. Its formatting is not great, with many empty/redundant rows and 3 empty columns, so I have done some steps to preprocess the data. By the way, the preprocessing is pretty rough, so if you have any suggestions on how I could improve it, they are very welcome!
df = pd.read_csv("gdp-ppp.csv", encoding = "ISO-8859-1")
df = df.drop(['Unnamed: 2', 'Unnamed: 5', 'Unnamed: 6'], axis=1)
df = df.drop(df.index[0:4])
df = df.drop(df.index[195:])
df = df.drop(df.index[-4:])
df.columns = ['Code', 'Rank', 'Country', 'GDP']
i = 4
for gdp in df["GDP"]:
    gdp = gdp.replace(",", "")
    df["GDP"][i] = float(gdp)
    i += 1
for gdp in df["GDP"]:
    if type(gdp) != type(1.1):
        print(gdp)
This seems to work: the print(gdp) in the last loop is never called, and the dataframe looks nice and clean. That's when I use the code above to create the choropleth map. The map is created and the data is displayed correctly in the bar on the left, but the coloring is discrete. Here's the screenshot of the map I get.
Your suspicion is correct, plotly is seeing GDP as a string and thus using discrete colors. Use str.replace to remove the comma from the csv data (and then convert to float). Something like:
df["GDP"] = df["GDP"].str.replace(",","").astype(float)
This would come right after df.columns = ['Code', 'Rank', 'Country', 'GDP'], and then remove the for loops.
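One likely reason the loop in the question does not help, even though every cell ends up as a float, is that the column itself keeps its generic object dtype. The sketch below uses made-up values and assumes the column starts out as object dtype (which I force explicitly here); it compares a cell-by-cell replacement with the vectorised conversion:

```python
import pandas as pd

# Made-up values in the same "1,234" string format as the CSV.
df = pd.DataFrame({
    "Code": ["AAA", "BBB"],
    "GDP": pd.Series(["23,350,230", "1,000"], dtype=object),
})

# Cell-by-cell replacement, similar in spirit to the loop in the question:
# each value becomes a float, but the column dtype is unchanged.
looped = df.copy()
for i in looped.index:
    looped.loc[i, "GDP"] = float(looped.loc[i, "GDP"].replace(",", ""))
print(looped["GDP"].dtype)  # object

# Vectorised conversion: the whole column becomes a numeric dtype.
df["GDP"] = df["GDP"].str.replace(",", "").astype(float)
print(df["GDP"].dtype)  # float64
```

Checking df.dtypes before plotting is a quick way to see which case you are in.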
Complete code:
import pandas as pd
import plotly.express as px
df = pd.read_csv("gdp-csv-.csv", encoding = "ISO-8859-1")
df = df.drop(['Unnamed: 2', 'Unnamed: 5', 'Unnamed: 6'], axis=1)
df = df.drop(['Unnamed: 9', 'Unnamed: 10', 'Unnamed: 7', 'Unnamed: 8'], axis=1)
df = df.drop(df.index[0:4])
df = df.drop(df.index[195:])
df = df.drop(df.index[-4:])
df.columns = ['Code', 'Rank', 'Country', 'GDP']
df["GDP"] = df["GDP"].str.replace(",","").astype(float)
fig = px.choropleth(df, locations="Code",
                    color="GDP",
                    hover_name="Country",
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
By the way, a cleaner way to bring the CSV in would be to specify the columns with usecols and the rows with skiprows:
df = pd.read_csv("gdp-csv-.csv", encoding = "ISO-8859-1", usecols=[0,1,3,4], skiprows=4,
                 skipfooter=122, engine='python')
df.columns = ['Code', 'Rank', 'Country', 'GDP']
df["GDP"] = df["GDP"].str.replace(",","").astype(float)
EDIT: added skipfooter to pd.read_csv
I'm working with worldbank data and I'm trying to create some graphs representing time, but the data I have now looks like this:
As I don't think there's a way to change it to a datetime, I think the only way is to replace all these year columns with one column called 'Year', with the column names I have right now as its values and the current values in a separate column.
Is there a nice function in Python that allows that, or would I have to iterate through the entire dataframe?
Edit to include some code:
df2 = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
'Country Code': ['ABW', 'AFG', 'AGO'],
'1960':[65.66, 32.29, 33.25],
'1961': [66.07, 32.74, 33.57],
'1962': [66.44, 33.18, 33.91],
'1963': [66.79, 33.62, 34.27],
'1964': [66.11, 34.06, 34.65],
'1965': [67.44, 34.49, 35.03]}).set_index('Country Name')
You can try taking the transpose of the dataframe, so that the year values become rows; you can then store them in a column named Year and use it in the plots.
You can try something like this :
import pandas as pd
from matplotlib import pyplot as plt
df1 = pd.DataFrame({'Country Name' : ['Aruba', 'Afghanistan', 'Angola'],
                    'Country Code' : ['ABW', 'AFG', 'AGO'],
                    '1960' : [65.66, 32.29, 33.25],
                    '1961' : [66.07, 32.74, 33.57],
                    '1962' : [66.44, 33.18, 33.91],
                    '1963' : [66.79, 33.62, 34.27],
                    '1964' : [66.11, 34.06, 34.65],
                    '1965' : [67.44, 34.49, 35.03]})
df2 = df1.transpose()
df2.columns = df1['Country Name']
df2 = df2[2:]
df2['Year'] = df2.index.values
plt.plot(df2['Year'], df2['Aruba'], label='Aruba')  # label= gives plt.legend() something to show
plt.plot(df2['Year'], df2['Afghanistan'], label='Afghanistan')
plt.plot(df2['Year'], df2['Angola'], label='Angola')
plt.legend()
plt.show()
Output : Plot Output
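As an alternative to the transpose, and closer to the "one Year column plus one value column" layout the question asks for, here is a minimal sketch using DataFrame.melt. The column name "Value" is my own placeholder, and only three of the year columns are included to keep it short:

```python
import pandas as pd

df = pd.DataFrame({
    "Country Name": ["Aruba", "Afghanistan", "Angola"],
    "Country Code": ["ABW", "AFG", "AGO"],
    "1960": [65.66, 32.29, 33.25],
    "1961": [66.07, 32.74, 33.57],
    "1962": [66.44, 33.18, 33.91],
})

# Keep the identifying columns and stack every year column into
# a (Year, Value) pair, one row per country and year.
long_df = df.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="Value",  # placeholder name for the measured quantity
)
# The year labels were column names (strings); make them integers.
long_df["Year"] = long_df["Year"].astype(int)
print(long_df.head())
```

From this long format, a per-country line plot can be drawn by grouping on "Country Name", or by pivoting back with Year as the index.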
I am learning machine learning and I came across this code.
I am trying to run the file "Recommender-Systems.py" from the above source. But it throws an error
ValueError: labels ['timestamp'] not contained in axis. How can it be removed?
Here's a dropbox link of u.data file.
Your data file has no header row, so pandas wrongly infers the column names from the first row of data.
You need to change Recommender-Systems.py slightly and set the headers manually.
The correct header names are listed in the README file of your data set.
Change your file to something like this:
## Explore the data (line 27)
data = pd.read_table('u.data', header=None)  # header=None stops pandas from treating the first row as column names
data.columns = ['userID', 'itemID',
                'rating', 'timestamp']  # Set the column names manually.
data = data.drop('timestamp', axis=1)  # Continue with the regular work.
...
## Load user information (line 75)
users_info = pd.read_table('u.user', sep='|', header=None)
users_info.columns = ['userID', 'age', 'gender',
                      'occupation', 'zipcode']
users_info = users_info.set_index('userID')
...
## Load movie information (line 88)
movies_info = pd.read_table('u.item', sep='|', header=None)
movies_info.columns = ['movieID', 'movie title', 'release date',
                       'video release date', 'IMDb URL', 'unknown',
                       'Action', 'Adventure', 'Animation', "Children's",
                       'Comedy', 'Crime', 'Documentary', 'Drama',
                       'Fantasy', 'Film-Noir', 'Horror', 'Musical',
                       'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
                       'War', 'Western']
movies_info = movies_info.set_index('movieID')#.drop(low_count_movies)
This should work (but I'm not sure if I got all the right names for the columns).
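To see the effect without the real file, here is a small sketch that reads two sample rows in the tab-separated u.data layout from an in-memory string (the rows are for illustration only). It uses read_csv with sep="\t", which should behave like read_table here, and passes the names directly instead of assigning data.columns afterwards:

```python
import io
import pandas as pd

# Two illustrative rows: userID, itemID, rating, timestamp (tab-separated).
raw = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

# Default behaviour: the first data row is consumed as the header,
# so there is no 'timestamp' column to drop and one row is lost.
inferred = pd.read_csv(io.StringIO(raw), sep="\t")
print(list(inferred.columns))

# With header=None and explicit names, every row is data.
ratings = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["userID", "itemID", "rating", "timestamp"],
)
ratings = ratings.drop(columns="timestamp")
print(ratings)
```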