selecting a column from pandas pivot table - python

I have the below pivot table which I created from a dataframe using the following code:
table = pd.pivot_table(df, values='count', index=['days'], columns=['movements'], aggfunc=np.sum)
movements 0 1 2 3 4 5 6 7
days
0 2777 51 2
1 6279 200 7 3
2 5609 110 32 4
3 4109 118 101 8 3
4 3034 129 109 6 2 2
5 2288 131 131 9 2 1
6 1918 139 109 13 1 1
7 1442 109 153 13 10 1
8 1085 76 111 13 7 1
9 845 81 86 8 8
10 646 70 83 1 2 1 1
As you can see, the pivot table has 8 columns (0-7), and I want to plot only some specific columns instead of all of them. I could not manage to select columns. Let's say I want to plot column 0 and column 2 against the index. What should I use for y to select column 0 and column 2?
plt.plot(x=table.index, y=??)
I tried y = table.value['0', '2'] and y=table['0','2'], but neither works.

You cannot select both columns for y that way. If you need those two columns in a single plot, you can use:
plt.plot(table['0'])
plt.plot(table['2'])
If the column names are integers, then:
plt.plot(table[0])
plt.plot(table[2])
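Passing a list of labels selects several pivot columns in one step. The sketch below builds a small table whose values are borrowed from the first three rows of the question (the long-format df itself is an assumption) and selects columns 0 and 2. The plot_selected helper is hypothetical and only imports matplotlib when it is called.

```python
import pandas as pd

# Long-format data loosely mirroring the question; the exact rows are assumed.
df = pd.DataFrame({
    "days":      [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "movements": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "count":     [2777, 51, 2, 6279, 200, 7, 5609, 110, 32],
})
table = pd.pivot_table(df, values="count", index=["days"],
                       columns=["movements"], aggfunc="sum")

# A list of labels returns a DataFrame with just those columns.
# The labels are integers here, so use 0 and 2 rather than '0' and '2'.
selected = table[[0, 2]]


def plot_selected(frame):
    """Hypothetical helper: one line per column, with the index on the x-axis."""
    import matplotlib.pyplot as plt
    frame.plot()
    plt.show()
```

Calling plot_selected(selected) would then draw both columns against days in a single figure.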

Related

Average of every x rows with a step size of y per each subset using pandas

I have a pandas data frame like this:
Subset Position Value
1 1 2
1 10 3
1 15 0.285714
1 43 1
1 48 0
1 89 2
1 132 2
1 152 0.285714
1 189 0.133333
1 200 0
2 1 0.133333
2 10 0
2 15 2
2 33 2
2 36 0.285714
2 72 2
2 132 0.133333
2 152 0.133333
2 220 3
2 250 8
2 350 6
2 750 0
How can I get the mean of the values for every "x" rows, with a step size of "y", per subset in pandas?
For example, the mean of every 5 rows (step size = 2) of the Value column in each subset, like this:
Subset Start_position End_position Mean
1 1 48 1.2571428
1 15 132 1.0571428
1 48 189 0.8838094
2 1 36 0.8838094
2 15 132 1.2838094
2 36 220 1.110476
2 132 350 3.4533332
Is this what you were looking for?
import pandas as pd

df = pd.DataFrame({'Subset': [1]*10+[2]*12,
                   'Position': [1,10,15,43,48,89,132,152,189,200,1,10,15,33,36,72,132,152,220,250,350,750],
                   'Value': [2,3,.285714,1,0,2,2,.285714,.1333333,0,0.133333,0,2,2,.285714,2,.133333,.133333,3,8,6,0]})
window = 5
step_size = 2
rows = []  # one dict per complete window
for subset in df.Subset.unique():
    subset_df = df[df.Subset == subset].reset_index(drop=True)
    # Slide over the rows of this subset only
    for i in range(0, len(subset_df), step_size):
        window_rows = subset_df.iloc[i:i + window]
        if len(window_rows) < window:
            continue  # skip a trailing window that is too short
        rows.append({'Subset': subset,
                     'Start_position': window_rows.Position.iloc[0],
                     'End_position': window_rows.Position.iloc[-1],
                     'Mean': window_rows.Value.mean()})
# Collect rows in a list; DataFrame.append is not available in pandas 2.0+
averaged_df = pd.DataFrame(rows, columns=['Subset', 'Start_position', 'End_position', 'Mean'])
Some notes about the code:
It treats all rows of a subset as contiguous, whatever their order in the original df (1,1,2,1,2,2 will behave as if it were 1,1,1,2,2,2)
If there is a group left that's smaller than a window, it will skip it (e.g. 1, 132, 200, 0.60476 is not included)
One version-specific answer would be to use pandas.api.indexers.FixedForwardWindowIndexer, introduced in pandas 1.1.0:
>>> window=5
>>> step=2
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window)
>>> df2 = df.join(df.Position.shift(-(window-1)), lsuffix='_start', rsuffix='_end')
>>> df2 = df2.assign(Mean=df2.pop('Value').rolling(window=indexer).mean()).iloc[::step]
>>> df2 = df2[df2.Position_start.lt(df2.Position_end)].dropna()
>>> df2['Position_end'] = df2['Position_end'].astype(int)
>>> df2
Subset Position_start Position_end Mean
0 1 1 48 1.257143
2 1 15 132 1.057143
4 1 48 189 0.883809
10 2 1 36 0.883809
12 2 15 132 1.283809
14 2 36 220 1.110476
16 2 132 350 3.453333
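A per-group variant is also possible. The following is only a sketch, assuming the column names from the question; windowed_means is a hypothetical helper name. It computes an ordinary trailing rolling mean inside each subset and keeps every step-th complete window, so windows never span two subsets.

```python
import pandas as pd

df = pd.DataFrame({
    "Subset": [1] * 10 + [2] * 12,
    "Position": [1, 10, 15, 43, 48, 89, 132, 152, 189, 200,
                 1, 10, 15, 33, 36, 72, 132, 152, 220, 250, 350, 750],
    "Value": [2, 3, 0.285714, 1, 0, 2, 2, 0.285714, 0.133333, 0,
              0.133333, 0, 2, 2, 0.285714, 2, 0.133333, 0.133333, 3, 8, 6, 0],
})


def windowed_means(frame, window=5, step=2):
    """Hypothetical helper: mean of every `window` rows, advancing by `step`, per subset."""
    parts = []
    for subset, g in frame.groupby("Subset", sort=False):
        if len(g) < window:
            continue  # no complete window in this subset
        g = g.reset_index(drop=True)
        # Trailing rolling mean: entry k covers rows k-window+1 .. k.
        means = g["Value"].rolling(window).mean()
        ends = list(range(window - 1, len(g), step))
        starts = [e - window + 1 for e in ends]
        parts.append(pd.DataFrame({
            "Subset": subset,
            "Start_position": g["Position"].iloc[starts].to_numpy(),
            "End_position": g["Position"].iloc[ends].to_numpy(),
            "Mean": means.iloc[ends].to_numpy(),
        }))
    return pd.concat(parts, ignore_index=True)


result = windowed_means(df, window=5, step=2)
```

On the sample data this yields the same seven windows as the desired output in the question.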

Combining dataframe with first 2 columns of another dataframe without changing index position

I originally have a 'monthly' DataFrame with months (1-11) as column index and number of disease cases as values.
I have another 'disease' DataFrame with the first 2 columns as 'Country' and 'Province'.
I want to combine the 'monthly' DataFrame with those 2 columns, and the 2 columns should still be the first 2 columns in the combined 'monthly' DataFrame (same index position).
In other words, the original 'monthly' DataFrame is:
1 2 3 4 5 6 7 8 9 10 11
0 1 5 8 0 9 9 8 18 82 89 81
1 0 1 9 19 8 12 29 19 91 74 93
The desired output is:
Country Province 1 2 3 4 5 6 7 8 9 10 11
0 Afghanistan Afghanistan 1 5 8 0 9 9 8 18 82 89 81
1 Argentina Argentina 0 1 9 19 8 12 29 19 91 74 93
I was able to append the 2 columns into the 'monthly' DataFrame by this code:
monthly['Country'] = disease['Country']
monthly['Province'] = disease['Province']
However, this puts the 2 columns at the end of the 'monthly' DataFrame.
1 2 3 4 5 6 7 8 9 10 11 Country Province
0 1 5 8 0 9 9 8 18 82 89 81 Afghanistan Afghanistan
1 0 1 9 19 8 12 29 19 91 74 93 Argentina Argentina
How should I improve the code without using the insert() function? Can I use iloc to specify the index position?
Thanks for your help in advance!
Use concat, selecting the first 2 columns by position with DataFrame.iloc; here the first : means all rows:
df = pd.concat((disease.iloc[:, :2], monthly), axis=1)
Or by column names:
df = pd.concat((disease[['Country','Province']], monthly), axis=1)
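As a small runnable illustration of the positional variant, here is a sketch with cut-down frames. The shortened 'monthly' data and the extra Cases column in 'disease' are assumptions made for the example; both frames share the same default index, which concat uses to align the rows.

```python
import pandas as pd

# Shortened stand-ins for the frames in the question.
monthly = pd.DataFrame([[1, 5, 8], [0, 1, 9]], columns=[1, 2, 3])
disease = pd.DataFrame({
    "Country": ["Afghanistan", "Argentina"],
    "Province": ["Afghanistan", "Argentina"],
    "Cases": [10, 20],  # hypothetical extra column that should not be copied
})

# First two columns of disease on the left, all of monthly on the right.
combined = pd.concat((disease.iloc[:, :2], monthly), axis=1)
```

The column order of the result follows the order of the frames passed to concat, so no reordering step is needed afterwards.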

Get row numbers based on column values from numpy array

I am new to numpy and need some help in solving my problem.
I read records from a binary file using dtypes, then I am selecting 3 columns
df = pd.DataFrame(np.array([(124,90,5),(125,90,5),(126,90,5),(127,90,0),(128,91,5),(129,91,5),(130,91,5),(131,91,0)]), columns = ['atype','btype','ctype'] )
which gives
atype btype ctype
0 124 90 5
1 125 90 5
2 126 90 5
3 127 90 0
4 128 91 5
5 129 91 5
6 130 91 5
7 131 91 0
'atype' is of no interest to me for now.
But what I want is the row numbers where each of these appears in the 2nd and 3rd columns:
(x,90,5)
(x,90,0)
(x,91,5)
(x,91,0)
etc.
There are 7 variables like 90,91,92,93,94,95,96 and correspondingly there will be values of either 5 or 0 in the 3rd column.
There are about 1 million entries, so is there any way to find these without a for loop?
Using pandas you could try the following.
df[(df['btype'].between(90, 96)) & (df['ctype'].isin([0, 5]))]
Using your example, if some of the values are changed such that df is
atype btype ctype
0 124 90 5
1 125 90 5
2 126 0 5
3 127 90 100
4 128 91 5
5 129 0 5
6 130 91 5
7 131 91 0
then using the solution above, the following is returned.
atype btype ctype
0 124 90 5
1 125 90 5
4 128 91 5
6 130 91 5
7 131 91 0
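The filter above returns the matching rows; to get the row numbers for each (btype, ctype) combination, something like the following sketch may help. It uses the data from the question; the variable names rows_90_5 and rows_by_pair are made up for the example. Both steps are vectorized, so no explicit Python loop over the rows is needed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([(124, 90, 5), (125, 90, 5), (126, 90, 5), (127, 90, 0),
              (128, 91, 5), (129, 91, 5), (130, 91, 5), (131, 91, 0)]),
    columns=["atype", "btype", "ctype"],
)

# Row positions for one specific (btype, ctype) pair.
mask = (df["btype"] == 90) & (df["ctype"] == 5)
rows_90_5 = np.flatnonzero(mask.to_numpy())

# Row positions for every pair that occurs, keyed by (btype, ctype).
rows_by_pair = df.groupby(["btype", "ctype"]).indices
```

rows_by_pair is a dict, so rows_by_pair[(91, 0)] gives the positions of the rows where btype is 91 and ctype is 0.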

Converting a Dataframe into Series of Pandas Series

How can I convert a dataframe into a series of Pandas Series?
My dataframe is below, and I want to plot it as a stacked bar chart.
city soru_id value1 value2 value3 value4 value5
0 1 2 147 119 69 92 106
1 2 2 31 20 12 14 26
2 3 2 37 22 24 18 19
3 4 2 10 13 7 13 10
4 5 2 38 48 18 30 27
5 6 2 401 409 168 354 338
.. ... ... ... ... ... ... ...
76 77 2 12 7 3 12 8
77 78 2 4 2 1 12 3
78 79 2 3 None 1 None None
79 80 2 12 7 4 4 7
80 81 2 18 13 7 10 2
[81 rows x 7 columns]
Therefore I need to get the dataframe into a form like the one below to plot it. I cannot loop inside the DataFrame() constructor. How can I plot it?
df = pd.DataFrame([pd.Series([value1,value2,value3,value4,value5]),
pd.Series([value1,value2,value3,value4,value5]),
pd.Series([value1,value2,value3,value4,value5]),
..................................
..................................
pd.Series([value1,value2,value3,value4,value5]),
pd.Series([value1,value2,value3,value4,value5])
], index=[index0,index2,index3,....,index79,index80])
df.plot.bar(stacked=True)
plt.show()
You don't need to change anything about your dataframe, just index the dataframe based on your value columns and plot that.
cols = [col for col in df.columns if col.startswith('value')]
df[cols].plot.bar(stacked=True)
plt.show()
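Because the printed frame contains None entries, the value columns may be stored as objects rather than numbers. The sketch below, which uses a few rows in the shape of the question's frame (only three value columns, chosen for brevity), converts them with pd.to_numeric before plotting and labels the bars by city. The plot_stacked helper is hypothetical and only imports matplotlib when it is called.

```python
import pandas as pd

# A few rows in the shape of the question's frame; None marks a missing count.
df = pd.DataFrame({
    "city": [1, 2, 79],
    "soru_id": [2, 2, 2],
    "value1": [147, 31, 3],
    "value2": [119, 20, None],
    "value3": [69, 12, 1],
}, dtype=object)

cols = [col for col in df.columns if col.startswith("value")]

# Coerce to numbers so None becomes NaN, and use city as the bar label.
values = df.set_index("city")[cols].apply(pd.to_numeric, errors="coerce")


def plot_stacked(frame):
    """Hypothetical helper: one stacked bar per row of the frame."""
    import matplotlib.pyplot as plt
    frame.plot.bar(stacked=True)
    plt.show()
```

Calling plot_stacked(values) would draw one bar per city, with missing values left out of the stack.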

How to find out if there was weekend between days?

I have two data frames. One representing when an order was placed and arrived, while the other one represents the working days of the shop.
Days are taken as days of the year, i.e. 32 = 1st February.
orders = DataFrame({'placed':[100,103,104,105,108,109], 'arrived':[103,104,105,106,111,111]})
Out[25]:
arrived placed
0 103 100
1 104 103
2 105 104
3 106 105
4 111 108
5 111 109
calendar = DataFrame({'day':['100','101','102','103','104','105','106','107','108','109','110','111','112','113','114','115','116','117','118','119','120'], 'closed':[0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0]})
Out[21]:
closed day
0 0 100
1 1 101
2 1 102
3 0 103
4 0 104
5 0 105
6 0 106
7 0 107
8 1 108
9 1 109
10 0 110
11 0 111
12 0 112
13 0 113
14 0 114
15 1 115
16 1 116
17 0 117
18 0 118
19 0 119
20 0 120
What I want to do is to compute the difference between placed and arrived
x = orders['arrived'] - orders['placed']
Out[24]:
0 3
1 1
2 1
3 1
4 3
5 2
dtype: int64
and subtract one for each day between placed and arrived (inclusive) on which the shop was closed.
For example, in the first row the order is placed on day 100 and arrives on day 103, so the days used are 100, 101, 102 and 103. The difference between 103 and 100 is 3. However, since 101 and 102 are days on which the shop is closed, I want to subtract 1 for each: 3 - 1 - 1 = 1. Finally, I want to append this result to the orders df.
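One possible way to compute this without looping over the orders is a cumulative count of closed days. This is only a sketch, assuming the calendar covers consecutive days; the column names closed_days and working_diff are made up for the example. The calendar's day column holds strings in the question, so it is converted to integers first.

```python
import pandas as pd

orders = pd.DataFrame({"placed": [100, 103, 104, 105, 108, 109],
                       "arrived": [103, 104, 105, 106, 111, 111]})
calendar = pd.DataFrame({
    "day": [str(d) for d in range(100, 121)],
    "closed": [0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

# Running total of closed days, indexed by integer day of the year.
closed_cum = (calendar.assign(day=calendar["day"].astype(int))
                      .set_index("day")["closed"]
                      .cumsum())

# Closed days in [placed, arrived] = total up to arrived - total before placed.
# Days before the calendar starts count as zero closed days.
before_placed = closed_cum.reindex((orders["placed"] - 1).to_numpy(),
                                   fill_value=0).to_numpy()
upto_arrived = closed_cum.reindex(orders["arrived"].to_numpy()).to_numpy()

orders["closed_days"] = upto_arrived - before_placed
orders["working_diff"] = orders["arrived"] - orders["placed"] - orders["closed_days"]
```

For the first order this gives 2 closed days (101 and 102) and a working difference of 3 - 2 = 1, matching the example in the question.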
