I have a dataframe with multiple repeating series, over time.
I want to create visualizations that compare series over time, as well as with one another.
What is the best way to structure my data to accomplish this?
So far I have been trying to make smaller dataframes from this, either by dropping years or by selecting only one series, using a variety of indexes, lists, or calls that refer to the multiple year columns (.Series, .loc, .drop, etc.).
I always seem to encounter the same issues when making the actual graphs, usually relating to the year columns.
My best result has been making simple bar graphs with countries on the x-axis and GDP from only 2018 on the y-axis.
I would eventually like to have countries represented by color in 3D Plotly graphs, where a series like GDP is z (depth), years are y, and some other series like GNI is x.
For now I am just aiming to make a scatterplot.
I am also fine with using matplotlib, seaborn, or whatever makes the most sense here.
Columns: [country, series, 1994, 1995, 1996, ...]

Country  Series   1994         1995         1996 ...
USA      GDP      3.12         4.13
USA      Export%  25.5         32
USA      GNI      867,123,111  989,666,123
UK       GDP      2.87         etc.
UK       Export%  43.1
UK       GNI      981,125,555
China    GDP      5.98
China    Export%  NaN
China    GNI      787,123,447
...
import plotly.express as px

df1 = df.loc[df['series'] == 'GDP']
time = df1['1994':'1996']  # this slices rows by index label, not the year columns
gdp_time = px.scatter(df1, x=time, y='series', color="country")
gdp_time.show()

# Desired graph (pseudocode):
gdp_time = px.scatter(df1, x=years, y=GDP, color=Countries)
gdp_time.show()
I find it hard to believe that I can't simply create a series that references the multiple year columns as a single 'time'.
What am I missing?
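A minimal sketch of one common fix, assuming the wide frame above is called df: melt the year columns into a single long-format 'year' column, then plot from the long frame. The names 'year' and 'value' are just illustrative choices.

import pandas as pd
import plotly.express as px

# Wide -> long: one row per (country, series, year).
year_cols = [c for c in df.columns if c not in ('country', 'series')]
long_df = df.melt(id_vars=['country', 'series'], value_vars=year_cols,
                  var_name='year', value_name='value')

# GDP over time, one color per country.
gdp = long_df[long_df['series'] == 'GDP']
fig = px.scatter(gdp, x='year', y='value', color='country')
fig.show()

The same long frame works for the eventual 3D idea (px.scatter_3d with x, y, z drawn from the melted columns), since every variable is then a single column.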
Related
I have a pandas dataframe of US state temperature data, grouped first by State and then by Year. I have already selected the first and last years of entries by subsetting the original dataframe. I want to create a new dataframe that shows the difference in AvgTemperature between the first year (1995) and the last year (2019) for all 50 states.
State    Year  AvgTemperature
Alabama  1995  63.66
Alabama  2019  66.32
Alaska   1995  35.97
...      ...   ...
I want to have a result that I can plot to show which states have changed the most over time, preferably in the format simply of State as column 1 and Temperature_Change as column 2.
You can pivot, compute the diff, and plot as a bar chart:
(df.pivot(index='State', columns='Year', values='AvgTemperature')
   .diff(axis=1)      # 2019 minus 1995 for each state
   .iloc[:, -1]       # keep only the last (difference) column
   .rename('diff')
   .plot.bar()
)
NB. I used dummy data for Alaska in 2019.
Output: a bar chart of the per-state temperature change.
Try this:
(df.sort_values(['State', 'Year'])
   .groupby('State')
   .apply(lambda g: g.iloc[-1]['AvgTemperature'] - g.iloc[0]['AvgTemperature']))
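If you want the result in exactly the two-column shape asked for (State, Temperature_Change), sorted so the biggest movers come first, a small follow-up along these lines should work; the column name Temperature_Change is taken from the question, and the key= argument of sort_values needs pandas 1.1 or newer:

change = (
    df.sort_values(['State', 'Year'])
      .groupby('State')['AvgTemperature']
      .agg(lambda s: s.iloc[-1] - s.iloc[0])   # 2019 minus 1995 per state
      .rename('Temperature_Change')
      .reset_index()
      .sort_values('Temperature_Change', key=abs, ascending=False)
)
change.plot.bar(x='State', y='Temperature_Change', figsize=(14, 5))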
Dataframe: 50 countries, 80 features (varying widely in scale), over 25 years.
The variance between feature values, and between countries within the same feature, is wide.
I am trying to accurately impute the missing values across the whole dataframe, all at once.
I tried SimpleImputer with the mean strategy, but this gives a single mean for the entire feature column and ignores any yearly time trend for the specific country.
This led to imputed values that were wildly inaccurate for smaller countries, since they reflected the mean of that feature column across all the larger countries as well.
And if the feature was trending downward across all countries, that trend was ignored because the column mean was so much larger than the smaller countries' values.
TL;DR:
Currently:
     Year  x1   x2      x3   ...
USA  1990  4    581000  472
USA  1991  5    723000  389
etc...
CHN  1990  5    482000  393
CHN  1991  7    623000  512
etc...
CDR  1990  1    NaN     97
CDR  1991  NaN  91000   NaN
etc...
How can I impute the missing values most accurately and efficiently, so that the imputation accounts for the scale of the country and feature while also respecting the yearly time trend?
Goal:
     Year  x1   x2       x3   ...
USA  1990  3    581000   472
USA  1991  5    723000   389
etc...
CHN  1990  5    482000   393
CHN  1991  7    623000   512
etc...
CDR  1990  1    (87000)  97
CDR  1991  (3)  91000    (95)
etc...
Here 3, 87000, and 95 would be suitable values: they follow the general time trend that the other countries show, but are scaled to the other values of the same feature for that specific country (in this case CDR).
Using SimpleImputer, these values would be much higher and far less logical.
I know imputation is never perfect, but it can surely be made more accurate in this case.
If there is a noticeable trend over the years for that country, how can I reflect that while keeping the imputed values to a scale that matches the feature for the particular country?
You can try the following techniques:
- Random forest imputation (you can refer to this paper).
- Backward/forward filling (though it will only consider the year); see the sketch below.
- Log returns.
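For the backward/forward-filling idea, doing it per country and per feature keeps every imputed value on that country's own scale and follows its own time trend. A minimal sketch, assuming the data is in long format with Country and Year as ordinary columns (reset the index first if they are index levels); the column names are taken from the example above:

import pandas as pd

# Everything except the identifiers is treated as a numeric feature column.
feature_cols = df.columns.difference(['Country', 'Year'])

df = df.sort_values(['Country', 'Year'])

# Linear interpolation within each country follows that country's own trend;
# ffill/bfill then cover gaps at the very start or end of a country's series.
df[feature_cols] = (
    df.groupby('Country')[feature_cols]
      .transform(lambda s: s.interpolate().ffill().bfill())
)

This only uses each country's own history; if you also want to exploit relationships between features, something like scikit-learn's IterativeImputer applied per country (or to country-scaled data) is a common next step, at the cost of a heavier model.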
I'm new to pandas and I would like to play with random text data. I am trying to add 2 new columns to a DataFrame df, each filled with a key (newcol1) and value (newcol2) randomly selected from a dictionary.
countries = {'Africa':'Ghana','Europe':'France','Europe':'Greece','Asia':'Vietnam','Europe':'Lithuania'}  # note: duplicate 'Europe' keys collapse, so only 'Europe': 'Lithuania' survives
My df already has 2 columns and I'd like something like this:
Year Approved Continent Country
0 2016 Yes Africa Ghana
1 2016 Yes Europe Lithuania
2 2017 No Europe Greece
I can certainly use a for or while loop to fill df['Continent'] and df['Country'], but I sense .apply() and np.random.choice may provide a simpler, more pandorable solution for that.
Yep, you're right. You can use np.random.choice with map:
df
Year Approved
0 2016 Yes
1 2016 Yes
2 2017 No
import numpy as np

df['Continent'] = np.random.choice(list(countries), len(df))
df['Country'] = df['Continent'].map(countries)
df
Year Approved Continent Country
0 2016 Yes Africa Ghana
1 2016 Yes Asia Vietnam
2 2017 No Europe Lithuania
You choose len(df) keys at random from the dictionary's key list, then use the dictionary as a mapper to find the country equivalent of each picked key.
You could also try using DataFrame.sample():
df.join(
    pd.DataFrame(list(countries.items()), columns=["continent", "country"])
      .sample(len(df), replace=True)
      .reset_index(drop=True)
)
This can be made faster if your continent-country map is already a dataframe.
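A sketch of that speed-up, with a hypothetical pre-built country_map frame so the DataFrame construction happens once rather than on every call:

country_map = pd.DataFrame(list(countries.items()), columns=["continent", "country"])

df.join(country_map.sample(len(df), replace=True).reset_index(drop=True))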
If you're on Python 3.6 or newer, another method would be to use random.choices():

from random import choices

df.join(
    pd.DataFrame(choices([*countries.items()], k=len(df)), columns=["continent", "country"])
)
random.choices() is similar to numpy.random.choice() except that you can pass a list of key-value tuple pairs whereas numpy.random.choice() only accepts 1-D arrays.
I have a dataframe that looks like this; I have made Continent my index field. I want it to show up a little differently: I would like the dataframe to have just 3 continents, with all the countries that fall under each continent rolled up into a count.
Continent Country
Oceania Australia 53 154.3 203.6 209.9
Europe Austria 28.2 49.3 59.7 59.9
Europe Belgium 33.2 70.3 83.4 82.8
Europe Denmark 18.6 26.0 38.9 36.1
Asia Japan 382.9 835.5 1028.1 1049.0
So my output would look like the following, showing just the number of countries under each continent. I would also like that, when everything is combined into num_Countries, it also gives the mean of all the values for that continent, so it's all rolled into one row per continent:
Continent num_Countries mean
Oceania 1 209.9
Europe 3 328.2
Asia 1 382.9
I have tried to create these columns; I can get the new columns to be created, but when I do, they come up as NaN values. And for the continents, I can't get the groupby() function to work the way I want: it doesn't roll all of the countries up into just the continents, it displays the full list of continents and countries.
You can use a pivot table for this. (I labeled the unlabeled columns with 1 to 4)
df.pivot_table(index="Continent",
               aggfunc={"Country": "count", "1": "mean"})
The following groups by 'Continent' and applies a function that counts the number of countries and finds the mean of means (I assumed this is what you wanted since you have 4 columns of numeric data for a number of countries per continent).
def f(group):
    # count the countries and take the mean of the numeric columns' means
    return pd.DataFrame([{'num_Countries': group.Country.count(),
                          'mean': group.mean(numeric_only=True).mean()}])

grouped = df.groupby('Continent')
result = grouped.apply(f).reset_index(level=1, drop=True)
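An equivalent without apply, using named aggregation: a sketch that assumes the four unlabeled numeric columns are everything except Country, and again takes the mean of the per-continent column means:

value_cols = df.columns.drop('Country')   # the four unlabeled numeric columns

result = df.groupby('Continent').agg(num_Countries=('Country', 'count'))
result['mean'] = df.groupby('Continent')[value_cols].mean().mean(axis=1)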
I have a dataframe with two keys. I'm looking to do a stacked bar plot of the number of items within key2 (meaning taking the count values from a fully populated column of data).
A small portion of the dataframe I have is:
Sector            industry
Basic Industries  Agricultural Chemicals        17
                  Aluminum                       3
                  Containers/Packaging           1
                  Electric Utilities: Central    2
                  Engineering & Construction    12
Name: Symbol, dtype: int64
Key1 is Sector, Key2 is industry. I want the values in Symbol (the counted column) to be represented as industry stackings within a bar for Basic Industries.
I know that if I do df.reset_index() I'll have columns with (non-unique) Sectors and industries plus an integer count. Is there a way to simply assign that column 1, 2, 3 data to pandas plot or matplotlib to make a stacked bar chart?
Alternatively, is there a way to easily specify using both keys in the aforementioned dataframe?
I'm looking for both guidance on approach from more experienced people as well as help with the actual syntax.
I just added a new Sector to improve the example.
                                                Symbol
Sector             industry
Basic Industries   Agricultural Chemicals          17
                   Aluminum                          3
                   Containers/Packaging              1
                   Electric Utilities: Central       2
                   Engineering & Construction       22
Basic Industries2  Agricultural Chemicals            7
                   Aluminum                           8
                   Containers/Packaging              11
                   Electric Utilities: Central        7
                   Engineering & Construction         4
Assuming your dataframe is indexed by ["Sector", "industry"], you first need to reset_index, then pivot your dataframe, and finally make the stacked plot.
(df.reset_index()
   .pivot_table(index="industry", columns="Sector", values="Symbol")
   .T
   .plot(kind='bar', stacked=True, figsize=(14, 6)))
Another way, instead of reset_index, you can use this:
df.unstack().Symbol.plot(kind='bar', stacked=True)
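Note that the unstack variant assumes the second layout, where Symbol is a column of a DataFrame indexed by (Sector, industry). If you are starting from the Series in the first snippet, a sketch along the same lines would be:

# df here is the Symbol counts Series with a (Sector, industry) MultiIndex.
df.unstack(level='industry').plot(kind='bar', stacked=True)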