Python pandas pivot matplotlib - python

I created the following python pandas pivot table.
df_pv = pd.pivot_table(df,index=["Fiscal_Week"],columns=["Year"],values=["Category","Sales","Traffic"],
aggfunc={"Category":len,"Sales":np.sum,"Traffic":np.sum},fill_value=0)
            Category           Sales                Traffic
Year        2014 2015 2016     2014   2015   2016   2014 2015 2016
Fiscal_Week
FW01           4    3    4    35678 654654  47547    567  231  765
FW02           2    6    7     6565   4686  34554    297  464  564
FW03           4    4    5     5867  56856  34346    287   45  324
FW04           2    5    3     8568  45745   3564    546  765  978
FW05           2    5    5     5685   3464   4754    325  235  654
FW06           4    3    2    56765  35663   3643    456  935  936
FW07           1    6    2     8686   2454   2463    324  728  598
FW08           6    2    3    34634  34543   4754    198  436  234
I would like to create the following two plots:
Scatterplot: number of campaigns by sales, with each year in its own color.
The second graph should be Traffic by Fiscal Week.
I tried this unsuccessfully:
df_pv.plot(x="Fiscal_Week", y="Sales")
KeyError: 'Fiscal_Week'
Is there a better way - for example to not pivot, but within the graph specify the aggregations?

You're trying to use the index as if it were a regular column, which doesn't work: Fiscal_Week is the index of the pivot table, not one of its columns.
Ways to overcome this:
Reset the index with reset_index()
Use the index explicitly: .plot(x=df_pv.index, y="Sales")
Use the index implicitly: .plot(y="Sales", use_index=True)
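As a minimal sketch of the index-based option, the snippet below rebuilds a small pivot table and selects the Traffic block so that each year becomes one column, still indexed by fiscal week. The sample values and the helper name plot_traffic are made up for illustration and are not taken from the question's data.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's source frame.
df = pd.DataFrame({
    "Fiscal_Week": ["FW01", "FW01", "FW02", "FW02"],
    "Year": [2014, 2015, 2014, 2015],
    "Sales": [100, 200, 150, 250],
    "Traffic": [10, 20, 30, 40],
})

df_pv = pd.pivot_table(
    df, index="Fiscal_Week", columns="Year",
    values=["Sales", "Traffic"], aggfunc="sum", fill_value=0,
)

# "Fiscal_Week" is the index, not a column, which explains the KeyError.
fiscal_week_is_column = "Fiscal_Week" in df_pv.columns

# Selecting the top-level "Traffic" key leaves one column per year,
# still indexed by fiscal week.
traffic = df_pv["Traffic"]


def plot_traffic(frame):
    """Draw one line per year, using the index (fiscal week) as x."""
    return frame.plot(use_index=True)
```

Calling plot_traffic(traffic) followed by matplotlib's plt.show() should then draw Traffic by Fiscal Week with one line per year.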

Related

How to join all columns in dataframe? [duplicate]

I would like to combine all the columns of the data frame into a single column.
Here is what the dataframe looks like:
0 1 2
0 123 321 231
1 232 321 231
2 432 432 432
The dataframe is named task_ba.
I would like it to look like this:
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
Easiest and fastest option, use the underlying numpy array:
df2 = pd.DataFrame(df.values.ravel(order='F'))
NB. If you prefer a series, use pd.Series instead
Output:
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
You can use pd.DataFrame.melt() and then drop the variable column:
>>> df
0 1 2
0 123 321 231
1 232 321 231
2 432 432 432
>>> df.melt().drop("variable", axis=1) # Drops the 'variable' column
value
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
Or if you want 0 as your column name:
>>> df.melt(value_name=0).drop("variable", axis=1)
0
0 123
1 232
2 432
3 321
4 321
5 432
6 231
7 231
8 432
You can learn all this (and more!) in the official documentation.
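A quick way to convince yourself that both answers produce the same column-by-column order is to run them on the question's frame and compare. This is only a sketch; the variable names via_ravel and via_melt are illustrative.

```python
import pandas as pd

# The frame from the question.
task_ba = pd.DataFrame([[123, 321, 231],
                        [232, 321, 231],
                        [432, 432, 432]])

# Column-major ("Fortran") flattening of the underlying array.
via_ravel = pd.Series(task_ba.to_numpy().ravel(order="F"))

# melt() stacks the columns one after another into a "value" column.
via_melt = task_ba.melt()["value"]

print(via_ravel.tolist())
print(via_melt.tolist())
```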

Dataframe structures in pandas/seaborn. Do I need to be careful about observations vs variables?

I'm learning how to use Python for data analysis, and my first few dataframes come from video games I play.
The dataframe I'm currently working with uses the header row for the player names (8 players).
All the statistics are in the first column.
Is it better practice to have these positions reversed, i.e. should all the players be in the first column instead of the first row?
Arctic Shat Sly Snky Nanm Zax zack Sorn Cort
Statistics
Assists 470 415 388 182 212 92 40 5 4
Avg Damage Dealt 203.82 167.37 165.2 163.45 136.3 85.08 114.96 128.72 26.71
Boosts 1972 1807 1790 668 1392 471 103 7 33
Damage Dealt 236222.66 239680.08 164373.73 74696.195 99904.48 27991.652 13910.629 901.01385 1228.7041
Days 206 234 218 78 157 94 29 3 10
Head Shot Kills 395 307 219 119 130 29 12 0 0
Headshot % 26.37% 18.65% 18.96% 23.85% 19.58% 16.11% 17.14% 0% 0%
Heals 3139 4385 2516 1326 2007 749 382 15 78
K/D 1.36 1.2 1.22 1.13 0.95 0.58 0.59 0.57 0.07
Kills 1498 1646 1155 499 664 180 70 4 3
Longest Kill 461.77765 430.9177 410.534 292.18732 354.3065 287.72366 217.98175 110.25433 24.15225
Longest Time Survived 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Losses 1117 1376 959 448 709 320 119 7 46
Max Kill Streaks 4 4 4 3 4 3 3 1 1
Most Survival Time 2051.842 2180.98 1984.259 1948.513 2064.065 1979.101 2051.846 1486.288 1670.048
Revives 281 455 155 104 221 83 19 2 2
Ride Distance 1610093.4 2157408.8 1572710 486170.5 714986.3 524297 204585.53 156.07877 63669.613
Road Kills 1 4 5 4 0 0 0 0 0
Round Most Kills 9 8 9 7 9 6 5 2 1
Rounds Played 1159 1432 995 457 733 329 121 7 46
Suicides 16 42 14 6 10 4 4 0 2
Swim Distance 2830.028 4966.6914 2703.0044 1740.3292 2317.7866 1035.3792 395.86472 0 92.01848
Team Kills 22 47 23 9 15 4 5 0 2
Time Survived 969792.2 1284232.6 930141.94 328190.22 637273.3 284434.3 109724.04 4580.869 37748.414
Top10s 531 654 509 196 350 187 74 2 28
Vehicle Destroys 23 9 29 4 15 3 1 0 0
Walk Distance 1545281.6 1975185 1517812 505191 1039509.8 461860.53 170913.25 9665.322 63900.125
Weapons Acquired 5043 7226 4683 1551 2909 1514 433 23 204
Wins 55 63 48 17 32 19 3 0 3
dBNOs 1489 1575 1058 488 587 179 78 5 8
Yes, it is better to transpose.
The current best practice is to have one row for each observation (in your case, each player) and one column for each variable (each statistic).
This is called "tidy data" (from the paper published by Hadley Wickham). Tidy data serves as a set of guidelines for data scientists, much like normalization rules do for relational database people.
Most frameworks, programs, and data structures are also built around this layout. For instance, in pandas, once your data is transposed, checking the average number of head shot kills takes just df['Head Shot Kills'].mean().
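As a small illustration, the sketch below transposes a three-player, three-statistic excerpt of the table from the question. The variable names wide and tidy are arbitrary. Note that text columns such as "Headshot %" would need converting to numbers before they could be averaged the same way.

```python
import pandas as pd

# Excerpt of the question's table: statistics as rows, players as columns.
wide = pd.DataFrame(
    {
        "Arctic": [470, 395, 1498],
        "Shat": [415, 307, 1646],
        "Sly": [388, 219, 1155],
    },
    index=pd.Index(["Assists", "Head Shot Kills", "Kills"], name="Statistics"),
)

# Transpose so that each player is a row and each statistic is a column.
tidy = wide.T
tidy.index.name = "Player"

# Per-statistic summaries are now one-liners.
mean_headshots = tidy["Head Shot Kills"].mean()
print(tidy)
print(mean_headshots)
```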

How to mask a data frame by the month?

I have a dataframe df1 with a column dates that contains dates. I want to plot the data for just a certain month. The dataframe looks like:
Unnamed: 0 Unnamed: 0.1 dates DPD weekday
0 0 1612 2007-06-01 23575.0 4
1 3 1615 2007-06-04 28484.0 0
2 4 1616 2007-06-05 29544.0 1
3 5 1617 2007-06-06 29129.0 2
4 6 1618 2007-06-07 27836.0 3
5 7 1619 2007-06-08 23434.0 4
6 10 1622 2007-06-11 28893.0 0
7 11 1623 2007-06-12 28698.0 1
8 12 1624 2007-06-13 27959.0 2
9 13 1625 2007-06-14 28534.0 3
10 14 1626 2007-06-15 23974.0 4
.. ... ... ... ... ...
513 721 2351 2009-06-09 54658.0 1
514 722 2352 2009-06-10 51406.0 2
515 723 2353 2009-06-11 48255.0 3
516 724 2354 2009-06-12 40874.0 4
517 727 2357 2009-06-15 77085.0 0
518 728 2358 2009-06-16 77989.0 1
519 729 2359 2009-06-17 75209.0 2
520 730 2360 2009-06-18 72298.0 3
521 731 2361 2009-06-19 60037.0 4
522 734 2364 2009-06-22 69348.0 0
523 735 2365 2009-06-23 74086.0 1
524 736 2366 2009-06-24 69187.0 2
525 737 2367 2009-06-25 68912.0 3
526 738 2368 2009-06-26 57848.0 4
527 741 2371 2009-06-29 72718.0 0
528 742 2372 2009-06-30 72306.0 1
And I just want to have June 2007 for example.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['month'] = pd.PeriodIndex(df1.dates, freq='M')
nov_mask=df1['month'] == 2007-06
plot_data= df1[nov_mask].pivot(index='dates', values='DPD')
plot_data.plot()
plt.show()
I don't know what's wrong with my code. The error shows that there is something wrong with 2007-06 where I define nov_mask. I think the data type is wrong, but I have tried a lot and nothing works.
You don't need PeriodIndex if you just want to get June 2007 data. I have no access to IPython right now but this should point you in the right direction.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['year'] = df1['dates'].dt.year
df1['month'] = df1['dates'].dt.month
june_mask = (df1['year'] == 2007) & (df1['month'] == 6)
filtered = df1[june_mask]
# ... Do something with filtered.
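For reference, here is a self-contained sketch of two ways to select June 2007: comparing year and month separately, and comparing monthly periods against a pd.Period object rather than the bare expression 2007-06. The four-row frame is a stand-in for DPD.csv, and the July row is invented so the mask has something to exclude.

```python
import pandas as pd

# Stand-in for DPD.csv; the 2007-07-02 row is hypothetical.
df1 = pd.DataFrame({
    "dates": ["2007-06-01", "2007-06-04", "2007-07-02", "2009-06-09"],
    "DPD": [23575.0, 28484.0, 30000.0, 54658.0],
})
df1["dates"] = pd.to_datetime(df1["dates"])

# Option 1: compare year and month separately.
june_mask = (df1["dates"].dt.year == 2007) & (df1["dates"].dt.month == 6)

# Option 2: convert to monthly periods and compare against a Period.
period_mask = df1["dates"].dt.to_period("M") == pd.Period("2007-06", freq="M")

# Series of DPD values indexed by date, ready for .plot().
june_2007 = df1[june_mask].set_index("dates")["DPD"]
print(june_2007)
```

Calling june_2007.plot() followed by plt.show() should then plot only the June 2007 values.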

Read 4 lines of data into one row of pandas data frame

I have txt file with such values:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of data frame.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files and so I'll keep appending rows to this data frame.
I believe we need some kind of regex, but I'm not able to figure it out. For now this is what I have:
df = pd.read_csv(f,sep=",| ", header = None)
But this treats the comma and the space as separators, whereas I want the newline to be treated as a separator too.
First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
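The list-based approach could look like the sketch below. It uses in-memory io.StringIO buffers in place of real files (an assumption made so the example is self-contained) and flattens each file with NumPy's ravel() instead of pd.concat, which gives the same row-by-row order.

```python
import io

import pandas as pd

file_text = (
    "108,612,620,900\n"
    "168,960,680,1248\n"
    "312,264,768,564\n"
    "516,1332,888,1596\n"
)

# Two in-memory stand-ins for files on disk (hypothetical).
files = [io.StringIO(file_text), io.StringIO(file_text)]

rows = []
for f in files:
    block = pd.read_csv(f, header=None)     # 4 x 4 frame shaped like the file
    rows.append(block.to_numpy().ravel())   # flatten row by row into 16 values

# Build the final frame once, after all files are loaded.
result = pd.DataFrame(rows)
print(result)
```

Building the DataFrame once at the end avoids repeatedly copying a growing frame, which is what appending row by row would do.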

DataFrame to DataPanel in Pandas / Python

I have a data frame that looks like this:
Name Permits_13 Score_13 Permits_14 Score_14 Permits_15 Score_15
0 P.S. 015 ROBERTO CLEMENTE 12.0 284 22 279 32 283
1 P.S. 019 ASHER LEVY 18.0 296 51 301 55 308
2 P.S. 020 ANNA SILVER 9.0 294 9 290 10 293
3 P.S. 034 FRANKLIN D. ROOSEVELT 3.0 294 4 292 1 296
4 P.S. 064 ROBERT SIMON 3.0 287 15 288 17 291
5 P.S. 110 FLORENCE NIGHTINGALE 0.0 313 3 306 4 308
6 P.S. 134 HENRIETTA SZOLD 4.0 290 12 292 17 288
7 P.S. 137 JOHN L. BERNSTEIN 4.0 276 12 273 17 274
8 P.S. 140 NATHAN STRAUS 13.0 282 37 284 59 284
9 P.S. 142 AMALIA CASTRO 7.0 290 15 285 25 284
10 P.S. 184M SHUANG WEN 5.0 327 12 327 9 327
I would like to transform it into a data panel structure, as in the answer to the question "Fixed effect in Pandas or Statsmodels", so I can use PanelOLS with fixed effects.
My first attempt was to do this transformation:
df1 = df.ix[:,['Permits_13', 'Score_13']].T
df2 = df.ix[:,['Permits_14', 'Score_14']].T
df3 = df.ix[:,['Permits_15', 'Score_15']].T
pf = pandas.Panel({'df1':df1,'df2':df2,'df3':df3})
However, this doesn't seem to be the correct way, since I end up with no information about time. Here, columns ending with 13, 14 and 15 represent observations for the years 2013, 2014 and 2015, in that order.
Do I have to create a data frame for each one of the rows in the original data?
This is my first trial using Pandas, and any help would be appreciated.
The docstring of DataFrame.to_panel() says:
Transform long (stacked) format (DataFrame) into wide (3D, Panel)
format.
Currently the index of the DataFrame must be a 2-level MultiIndex.
This may be generalized later
So that means you need to do:
Stack your dataframe (as it's currently "wide", not "long")
Pick two columns that together uniquely define the index of your dataframe
Set those columns as your index
Call to_panel()
So that's:
df.stack().set_index(['first_col', 'other_col']).to_panel()
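Panel and to_panel() have since been removed from pandas, so on a recent version the "long" frame with a 2-level MultiIndex is usually the end product itself. One way to get there, sketched here on two rows of the question's data, is pd.wide_to_long. This swaps in a different technique from the stack() call above, and the level name "Year" is chosen for illustration.

```python
import pandas as pd

# Two rows from the question's frame.
df = pd.DataFrame({
    "Name": ["P.S. 015 ROBERTO CLEMENTE", "P.S. 019 ASHER LEVY"],
    "Permits_13": [12.0, 18.0], "Score_13": [284, 296],
    "Permits_14": [22, 51], "Score_14": [279, 301],
    "Permits_15": [32, 55], "Score_15": [283, 308],
})

# Split "Permits_13" style names into a stub ("Permits") and a suffix (13);
# the suffix becomes the second index level.
long_df = pd.wide_to_long(
    df, stubnames=["Permits", "Score"], i="Name", j="Year", sep="_"
).sort_index()

print(long_df)
```

The result has one row per (Name, Year) pair and one column each for Permits and Score, which is the entity/time layout that panel estimators generally expect.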
