I've done a lot of data cleaning on my dataset so that my data frame contains all the information I need. I'm now at the point where I would like to visualize it, but the result I'm getting is useless.
import pandas as pd

# combine the two dataframes
test = soccer_data_2.append(soccer_data_3)
test = test.sort_values(by='date')
test = pd.DataFrame(test, columns=['date', 'Country', 'Result'])
# convert the date format to only keep the year
test['date'] = pd.to_datetime(test['date'])
test['year'] = test['date'].dt.year
# group my results by year, country and result
test.set_index('year', inplace=True)
test = test.groupby(['year', 'Country', 'Result']).Result.count()
print(test)
year Country Result
1873 England Win 1
Scotland Win 1
1874 England Win 1
Scotland Win 1
1876 England Win 1
..
2019 Yorkshire Win 2
Zambia Win 5
Zimbabwe Win 7
2020 Costa Rica Win 1
United States Win 1
I would like to find a way to plot a time series and see the evolution of victories for each country over time.
Try this:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
countries = df.Country.unique()
for country in countries:
    # count the rows (wins) per year for this country
    grouped = df[df.Country == country].groupby('year').count()
    years = list(grouped.index)
    wins = list(grouped.Result)
    plt.plot(years, wins, label=country)
plt.legend()
plt.show()
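A more compact variant (just a sketch, assuming test is the grouped Series printed in the question) is to take the 'Win' cross-section, unstack Country into columns, and let pandas draw one line per country:
import matplotlib.pyplot as plt

# select the Win counts and pivot countries into columns (years stay as the index)
wins = test.xs('Win', level='Result').unstack('Country')
wins.plot(figsize=(10, 10))   # one line per country
plt.ylabel('Wins per year')
plt.legend(loc='upper left', ncol=2)
plt.show()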
Convert all date columns to datetime objects, then import matplotlib with
import matplotlib.pyplot as plt
then, to plot the data:
test = df.groupby(['year', 'Country', 'Result'])
for groups, data in test:
    plt.plot(data['year'], data['<the column that you would like to plot>'])
I have the dataset shown below. I am trying to sort it so that the columns are in this order: Week End, Australia, Germany, France, etc...
I have tried using loc and assigning each subset to a variable, but when I create a new DataFrame from them it raises an error. Any help would be appreciated.
This is the data before any changes:
Region          Week End    Value
Australia       2014-01-11  1.480510
Germany         2014-01-11  1.481258
France          2014-01-11  0.986507
United Kingdom  2014-01-11  1.973014
Italy           2014-01-11  0.740629
This is my desired output:
Week End    Australia  Germany   France    United Kingdom  Italy
2014-01-11  1.480510   1.481258  0.986507  1.973014        0.740629
What I've tried:
cols = (['Region','Week End','Value'])
df = GS.loc[GS['Brand'].isin(rows)]
df = df[cols]
AUS = df.loc[df['Region'] == 'Australia']
JPN = df.loc[df['Region'] == 'Japan']
US = df.loc[df['Region'] == 'United States of America']
I think that you could actually just do:
df.pivot(index="Week End", columns="Region", values="Value")
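For reference, a minimal sketch of that call applied to the sample data above (assuming it is already loaded into df). pivot typically returns the regions as sorted column labels, so selecting the columns afterwards restores the order from the desired output:
wide = df.pivot(index="Week End", columns="Region", values="Value")
wide = wide[["Australia", "Germany", "France", "United Kingdom", "Italy"]]  # restore column order
wide = wide.reset_index()  # turn 'Week End' back into a regular column
print(wide)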
User 965311532's answer is much more concise, but an alternative approach using dictionaries would be:
new_df = {'Week End': df['Week End'][0]}
new_df.update({region: value for region, value in zip(df['Region'], df['Value'])})
new_df = pd.DataFrame(new_df, index = [0])
As user 965311532 pointed out, the above code will not work if there are more dates. In this case, we could use pandas groupby:
dates = []
for date, group in df.groupby('Week End'):
    date_df = {'Week End': date}
    # use the group for this date, not the whole dataframe
    date_df.update({region: value for region, value in zip(group['Region'], group['Value'])})
    date_df = pd.DataFrame(date_df, index=[0])
    dates.append(date_df)
new_df = pd.concat(dates)
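As a small footnote, pandas.pivot_table handles the multiple-dates case in one call as well; since every (date, region) pair here is unique, the default mean aggregation simply passes the values through:
new_df = df.pivot_table(index='Week End', columns='Region', values='Value').reset_index()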
Dataframe:
Horror films released in 2019
Title Director Country Year
3 from Hell Rob Zombie United States 2019
Bliss Joe Begos United States 2019
Bedeviled The Vang Brothers United States 2016
Creep 2 Patrick Brice United States 2017
Brightburn David Yarovesky United States 2019
Delirium Dennis Iliadis Ireland 2018
Child's Play Lars Klevberg United States 2019
The Conjuring 2 James Wan United States 2016
Bloodlands Steven Kastrissios Albania 2017
Bird Box Susanne Bier United States 2017
I need to plot a histogram showing the number of titles released over the years, using the pandas plot function.
code:
df = pd.read_csv(filename)
grouped = df.groupby('Year').count()[['Title']]
new_df = grouped.reset_index()
xtick = new_df['Year'].tolist()
width = new_df.Year[1] - new_df.Year[0]
new_df.iloc[:, 1:2].plot(kind='bar', width=width)
I cannot figure out a way to label the x axis with values from the Year column, and I'm also unsure whether my approach is correct.
Thanks in advance :)
It sounds like you want a bar chart, not a histogram, because you have discrete/categorical variables (years). And you say "kind=bar" in your plot statement, so you are on the right track. Try this to see if it works for you. I forced the y-axis to be integers since you are looking for counts, but that is optional.
import pandas as pd
import matplotlib.pyplot as plt
title = [ 'Movie1','Movie2','Movie3',
'Movie4','Movie5','Movie6',
'Movie7','Movie8','Movie9',
]
year = [2019,2019,2018,
2017,2019,2018,
2019,2017,2018
]
df = pd.DataFrame(list(zip(title, year)),
columns =['Title', 'Year']
)
print(df)
group = df.groupby('Year').count()[['Title']]\
.rename(columns={'Title': 'No. of Movies'})\
.reset_index()
print(group)
ax = group.plot.bar(x='Year', rot=0)
ax.yaxis.get_major_locator().set_params(integer=True)
plt.show()
Title Year
0 Movie1 2019
1 Movie2 2019
2 Movie3 2018
3 Movie4 2017
4 Movie5 2019
5 Movie6 2018
6 Movie7 2019
7 Movie8 2017
8 Movie9 2018
Year No. of Movies
0 2017 2
1 2018 3
2 2019 4
The API offers a few different ways to do this (not a great thing, IMO). Here is one way to get what you want:
df = pd.read_csv(filename)
group = df.groupby('Year').count()[['Title']]
df2 = group.reset_index()
df2.plot(kind='bar', x="Year", y="Title")
Or, even more concisely:
df.value_counts("Year").plot(kind="bar")
Note that in the second case, you're creating a bar plot from a Series object.
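One caveat with that one-liner: value_counts orders the result by count, so if you want the years in chronological order on the x axis, sort the index first (a minor tweak):
df.value_counts("Year").sort_index().plot(kind="bar")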
You can simply do
df.groupby('Year').Title.count().plot(kind='bar')
Output
I'm working on a World Happiness Report project that includes datasets from 2015 to 2019. I concatenated them into a final dataframe to get the average of each parameter (economy, health, etc.) for every country across that time span. But what I forgot to add was the region that each country is in (e.g. England - Western Europe). How could I add the 'Region' column to my final dataframe and make sure that each region matches its respective country?
Not sure if this is what you are looking for.
You may want to do something like this:
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
Or you can also use a merge statement. The assumption is that for each country, you have a region that it can map to.
df = pd.merge(df, region_df, how='left', on=['Country'])
Make sure both frames are indexed on Country before you merge to get the best performance.
data setup
import pandas as pd
c = ['Country','Happiness Score','Other_fields']
d = [['Denmark', 7.5460,1.25],
['Norway',7.5410,1.50],
['Finland',7.5378,1.85]]
region_cols = ['Country','Region']
region_data = [['Denmark','R1'],['Norway','R2'],['Finland','R3']]
df = pd.DataFrame(data = d, columns = c)
region_df = pd.DataFrame(data = region_data, columns = region_cols)
Based on the lookup DataFrame, you can do a map to check for Country and assign Region to df.
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
print (df)
Your result will be as follows:
Base DataFrame:
Country Happiness Score Other_fields
0 Denmark 7.5460 1.25
1 Norway 7.5410 1.50
2 Finland 7.5378 1.85
Lookup DataFrame:
Country Region
0 Denmark R1
1 Norway R2
2 Finland R3
Updated DataFrame:
Country Happiness Score Other_fields Region
0 Denmark 7.5460 1.25 R1
1 Norway 7.5410 1.50 R2
2 Finland 7.5378 1.85 R3
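For completeness, the merge approach mentioned earlier gives the same updated DataFrame on this sample data (a sketch):
df = pd.merge(df, region_df, how='left', on=['Country'])  # adds the matching Region for each Country
print(df)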
I want to create multiple (two in this case) boxplots based on data in a dataframe.
I have the following dataframe:
Country Fund R^2 Style
0 Austria BG EMCore Convertibles Global CHF R T 0.739131 Allocation
1 Austria BG EMCore Convertibles Global R T 0.740917 Allocation
2 Austria BG Trend A T 0.738376 Fixed Income
3 Austria Banken Euro Bond-Mix A 0.71161 Fixed Income
4 Austria Banken KMU-Fonds T 0.778276 Allocation
5 Brazil Banken Nachhaltigkeitsfonds T 0.912808 Allocation
6 Brazil Banken Portfolio-Mix A 0.857019 Allocation
7 Brazil Banken Portfolio-Mix T 0.868856 Fixed Income
8 Brazil Banken Sachwerte-Fonds T 0.730626 Fixed Income
9 Brazil Banken Strategie Wachstum T 0.918684 Fixed Income
I want to create a boxplot chart for each country, summarized by Style and showing the distribution of R^2.
I was thinking of a groupby operation, but somehow I can't manage to produce a separate chart for each country.
Thanks in advance
Here you go. Description is in the code.
=^..^=
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
data = StringIO("""
Country R^2 Style
Austria 0.739131 Allocation
Austria 0.740917 Allocation
Austria 0.738376 Fixed_Income
Austria 0.71161 Fixed_Income
Austria 0.778276 Allocation
Brazil 0.912808 Allocation
Brazil 0.857019 Allocation
Brazil 0.868856 New_Style
Brazil 0.730626 Fixed_Income
Brazil 0.918684 Fixed_Income
Brazil 0.618684 New_Style
""")
# load data into data frame
df = pd.read_csv(data, sep=' ')
# group data by Country
grouped_data = df.groupby(['Country'])
# create list of grouped data frames
df_list = []
country_list = []
for item in list(grouped_data):
    df_list.append(item[1])
    country_list.append(item[0])
# plot a box chart for each Country
for df in df_list:
    country = df['Country'].unique()
    df = df.drop(['Country'], axis=1)
    df = df[['Style', 'R^2']]
    columns_names = list(set(df['Style']))
    # pivot rows into columns so each Style becomes its own column
    df = df.assign(g=df.groupby('Style').cumcount()).pivot(index='g', columns='Style', values='R^2')
    # plot box
    df.boxplot(column=columns_names)
    plt.title(country[0])
    plt.show()
Output:
I came up with a solution myself.
df = ...  # the dataframe from the original question
uniquenames = df.Country.unique()
# create a dictionary of dataframes with the countries as keys
diction = {elem: pd.DataFrame for elem in uniquenames}
# fill the dictionary with values
for key in diction.keys():
    diction[key] = df[:][df.Country == key]
# plot the data
for i in diction.keys():
    diction[i].boxplot(column="R^2", by="Style",
                       figsize=(15, 6), patch_artist=True, fontsize=12)
    plt.xticks(rotation=90)
    plt.title(i, fontsize=12)
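A slightly shorter variant of the same idea (a sketch), iterating over the groupby directly instead of building a dictionary first:
for country, sub in df.groupby('Country'):
    sub.boxplot(column="R^2", by="Style",
                figsize=(15, 6), patch_artist=True, fontsize=12)
    plt.xticks(rotation=90)
    plt.title(country, fontsize=12)
    plt.show()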
Use seaborn for this kind of task. Here are a couple of options:
Use seaborn's boxplot
import seaborn as sns
sns.set()
# Note - the data is stored in a data frame df
sns.boxplot(x='Country', y='R^2', hue='Style', data=df)
Alternatively, you can use seaborn's FacetGrid.
g = sns.FacetGrid(df, col="Country", row="Style")
g = g.map(sns.boxplot, 'R^2', orient='v')
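If you would rather have exactly one panel per country with the styles side by side, seaborn's catplot is another option (a sketch):
sns.catplot(data=df, x='Style', y='R^2', col='Country', kind='box')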
I am new to Python and pandas, and I am struggling to figure out how to pull out the 10 counties with the most water used for irrigation in 2014.
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv('info.csv')  # read the csv
data['Year'] = pd.to_datetime(data['Year'], format='%Y')  # convert string to datetime
data.index = data['Year']  # make year the index
del data['Year']  # delete the duplicate year column
This is what the data looks like (this is only part of the data):
County WUCode RegNo Year SourceCode SourceID Annual CountyName
1 IR 311 2014 WELL 1 946 Adams
1 IN 311 2014 INTAKE 1 268056 Adams
1 IN 312 2014 WELL 1 48 Adams
1 IN 312 2014 WELL 2 96 Adams
1 IR 312 2014 INTAKE 1 337968 Adams
3 IR 315 2014 WELL 5 81900 Putnam
3 PS 315 2014 WELL 6 104400 Putnam
I have a couple of questions:
I am not sure how to pull out only the "IR" rows of the WUCode column with pandas, and I am not sure how to print out a table with the 10 counties with the highest water usage for irrigation in 2014.
I have been able to use the .loc function to pull out the information I need, with something like this:
data.loc['2014', ['CountyName', 'Annual', 'WUCode']]
From here I am kind of lost. Help would be appreciated!
import numpy as np
import pandas as pd
import string
df = pd.DataFrame(data={"Annual": np.random.randint(20, 1000000, 1000),
                        "Year": np.random.randint(2012, 2016, 1000),
                        "CountyName": np.random.choice(list(string.ascii_letters), 1000)},
                  columns=["Annual", "Year", "CountyName"])
Say df looks like:
Annual Year CountyName
0 518966 2012 s
1 44511 2013 E
2 332010 2012 e
3 382168 2013 c
4 202816 2013 y
For the year 2014...
df[df['Year'] == 2014]
Group by CountyName...
df[df['Year'] == 2014].groupby("CountyName")
Look at Annual...
df[df['Year'] == 2014].groupby("CountyName")["Annual"]
Get the sum...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum()
Sort the result descending...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False)
Take the top 10...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False).head(10)
This example prints out (your actual result may vary since my data was random):
CountyName
Q 5191814
y 4335358
r 4315072
f 3985170
A 3685844
a 3583360
S 3301817
I 3231621
t 3228578
u 3164965
This may work for you:
res = df[df['WUCode'] == 'IR'].groupby(['Year', 'CountyName'])['Annual'].sum()\
.reset_index()\
.sort_values('Annual', ascending=False)\
.head(10)
# Year CountyName Annual
# 0 2014 Adams 338914
# 1 2014 Putnam 81900
Explanation
Filter by WUCode, as required, and groupby Year and CountyName.
Use reset_index so your result is a dataframe rather than a series.
Use sort_values and extract top 10 via pd.DataFrame.head.
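As an aside, nlargest can replace the sort_values/head pair if you prefer (a sketch of the same query):
res = (df[df['WUCode'] == 'IR']
       .groupby(['Year', 'CountyName'])['Annual'].sum()
       .reset_index()
       .nlargest(10, 'Annual'))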