Converting a Dataframe into Series of Pandas Series - python

How can I convert a DataFrame into a series of pandas Series?
My DataFrame is below, and I want to plot it as a stacked bar chart.
city soru_id value1 value2 value3 value4 value5
0 1 2 147 119 69 92 106
1 2 2 31 20 12 14 26
2 3 2 37 22 24 18 19
3 4 2 10 13 7 13 10
4 5 2 38 48 18 30 27
5 6 2 401 409 168 354 338
.. ... ... ... ... ... ... ...
76 77 2 12 7 3 12 8
77 78 2 4 2 1 12 3
78 79 2 3 None 1 None None
79 80 2 12 7 4 4 7
80 81 2 18 13 7 10 2
[81 rows x 7 columns]
So I think I need to get the DataFrame into the form below in order to plot it, but I cannot loop inside the DataFrame() constructor. How can I plot it?
df = pd.DataFrame([pd.Series([value1,value2,value3,value4,value5]),
pd.Series([value1,value2,value3,value4,value5]),
pd.Series([value1,value2,value3,value4,value5]),
..................................
..................................
pd.Series([value1,value2,value3,value4,value5]),
pd.Series([value1,value2,value3,value4,value5])
], index=[index0,index2,index3,....,index79,index80])
df.plot.bar(stacked=True)
plt.show()

You don't need to change anything about your DataFrame; just select the value columns and plot that selection.
import matplotlib.pyplot as plt
# Keep every column whose name starts with 'value'
cols = [col for col in df.columns if col.startswith('value')]
df[cols].plot.bar(stacked=True)
plt.show()
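The sample in the question shows None in some value columns, which usually means those columns have object dtype, and plotting may then complain about non-numeric data. A minimal sketch, assuming a small frame built from a few of the question's rows (`to_plot_frame` and `plot_stacked` are hypothetical helper names, not pandas functions), that coerces the value columns to numbers first:

```python
import pandas as pd

# A few rows copied from the question; dtype=object mimics the None entries shown there.
df = pd.DataFrame(
    {
        "city": [1, 2, 79],
        "soru_id": [2, 2, 2],
        "value1": [147, 31, 3],
        "value2": [119, 20, None],
        "value3": [69, 12, 1],
        "value4": [92, 14, None],
        "value5": [106, 26, None],
    },
    dtype=object,
)


def to_plot_frame(frame):
    """Return only the 'value*' columns, coerced to numeric and indexed by city."""
    cols = [c for c in frame.columns if c.startswith("value")]
    numeric = frame[cols].apply(pd.to_numeric, errors="coerce")
    numeric.index = frame["city"]
    return numeric


def plot_stacked(frame):
    """Draw one stacked bar per row (matplotlib is imported only when plotting)."""
    import matplotlib.pyplot as plt

    frame.plot.bar(stacked=True)
    plt.show()


values = to_plot_frame(df)
```

Calling `plot_stacked(values)` then draws the chart; the None entries have become NaN, so every column is numeric.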

Related

How to create MultiIndex DataFrame from other pandas DataFrames

I have two DataFrames from different dates and want to join them into one.
Let's say this is the data from 25 Sep:
hour columnA columnB
0 12 24
1 45 87
2 10 58
3 12 13
4 12 20
Here is the data from 26 Sep:
hour columnA columnB
0 54 89
1 45 3
2 33 97
3 12 13
4 78 47
Now I want to join both DataFrames and get a MultiIndex DataFrame like this:
25sep hour columnA columnB
0 12 24
1 45 87
2 10 58
3 12 13
4 12 20
26sep hour columnA columnB
0 54 89
1 45 3
2 33 97
3 12 13
4 78 47
I read the docs about MultiIndex but am not sure how to apply it to my situation.
Use pandas.concat
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
>>> df = pd.concat([df1.set_index('hour'), df2.set_index('hour')],
...                keys=["25sep", "26sep"])
>>> df
            columnA  columnB
      hour
25sep 0          12       24
      1          45       87
      2          10       58
      3          12       13
      4          12       20
26sep 0          54       89
      1          45        3
      2          33       97
      3          12       13
      4          78       47
Let us try passing a dict to concat, so that the dict keys become the outer index level:
out = pd.concat({ y : x.set_index('hour') for x, y in zip([df1,df2],['25sep','26sep'])})
out
            columnA  columnB
      hour
25sep 0          12       24
      1          45       87
      2          10       58
      3          12       13
      4          12       20
26sep 0          54       89
      1          45        3
      2          33       97
      3          12       13
      4          78       47
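For reference, a self-contained sketch of the concat approach using the data from the question. The outer level name `date` is an assumption added here for illustration; the `names` argument is optional.

```python
import pandas as pd

# Data copied from the question.
df1 = pd.DataFrame({"hour": [0, 1, 2, 3, 4],
                    "columnA": [12, 45, 10, 12, 12],
                    "columnB": [24, 87, 58, 13, 20]})
df2 = pd.DataFrame({"hour": [0, 1, 2, 3, 4],
                    "columnA": [54, 45, 33, 12, 78],
                    "columnB": [89, 3, 97, 13, 47]})

# keys= labels each input frame; names= labels the two index levels
# ("date" is a made-up name for the outer level).
combined = pd.concat(
    [df1.set_index("hour"), df2.set_index("hour")],
    keys=["25sep", "26sep"],
    names=["date", "hour"],
)

# Select one day's block, or a single cell by (date, hour).
sep26 = combined.loc["26sep"]
value = combined.loc[("25sep", 1), "columnA"]
```

Selecting with `combined.loc["26sep"]` returns that day's rows indexed by hour, so the per-day frames remain easy to recover.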

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of the columns only where the columns match a string (in this case, all columns with _CAP at the end of their name). And store the sum of the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, the solution doesn't work for me since they are summing up columns that have the same exact name so a simple groupby can accomplish the result whereas I am trying to sum columns with specific string matches only.
Code to recreate above sample dataset:
data1 = [['1', 33,25,16,50], ['2', 34,22,11,66],
['3', 22,12,15,42],['4', 46,45,66,54],
['5',16,6,23,75], ['6', 21,42,433,50]]
df = pd.DataFrame(data1, columns = ['ID', 'Length','Width','Range_CAP','Capacity_CAP'])
Use filter to pick the matching columns, then sum across each row:
df['CAP_SUM'] = df.filter(like='CAP').sum(axis=1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other column names contain CAP somewhere other than at the end, match the suffix with a regex instead:
df.filter(regex='_CAP$').sum(axis=1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
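The difference between the two filters only shows up when some other column name contains CAP. A small sketch, assuming a hypothetical extra column `CAPTION_LEN` that is not in the original data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2"],
    "Range_CAP": [16, 11],
    "Capacity_CAP": [50, 66],
    "CAPTION_LEN": [5, 7],  # hypothetical column whose name contains 'CAP' but does not end in '_CAP'
})

# like= is a plain substring match, so CAPTION_LEN is included in the sum.
loose = df.filter(like="CAP").sum(axis=1)

# regex= with a '$' anchor keeps only the names that end in '_CAP'.
strict = df.filter(regex="_CAP$").sum(axis=1)
```

Here `loose` adds the extra column into each row total, while `strict` matches the sums shown in the question.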
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(axis=1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(axis=1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    # Add only the columns whose name contains '_CAP'
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]

Combining dataframe with first 2 columns of another dataframe without changing index position

I originally have a 'monthly' DataFrame with months (1-11) as column index and number of disease cases as values.
I have another 'disease' DataFrame with the first 2 columns as 'Country' and 'Province'.
I want to combine the 'monthly' DataFrame with those 2 columns, and the 2 columns should still be the first 2 columns in the combined 'monthly' DataFrame (same index position).
In other words, the original 'monthly' DataFrame is:
1 2 3 4 5 6 7 8 9 10 11
0 1 5 8 0 9 9 8 18 82 89 81
1 0 1 9 19 8 12 29 19 91 74 93
The desired output is:
Country Province 1 2 3 4 5 6 7 8 9 10 11
0 Afghanistan Afghanistan 1 5 8 0 9 9 8 18 82 89 81
1 Argentina Argentina 0 1 9 19 8 12 29 19 91 74 93
I was able to append the 2 columns into the 'monthly' DataFrame by this code:
monthly['Country'] = disease['Country']
monthly['Province'] = disease['Province']
However, this puts the 2 columns at the end of the 'monthly' DataFrame.
1 2 3 4 5 6 7 8 9 10 11 Country Province
0 1 5 8 0 9 9 8 18 82 89 81 Afghanistan Afghanistan
1 0 1 9 19 8 12 29 19 91 74 93 Argentina Argentina
How should I improve the code without using the insert() function? Can I use iloc to specify the index position?
Thanks for your help in advance!
Use concat, selecting the first 2 columns by position with DataFrame.iloc. Here the first : means "take all rows":
df = pd.concat((disease.iloc[:, :2], monthly), axis=1)
Or select them by column name:
df = pd.concat((disease[['Country','Province']], monthly), axis=1)
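A self-contained sketch of the positional version, using the values from the question. The `Cases` column is a hypothetical third column, added only to show that iloc takes just the first two. Note that concat with axis=1 aligns rows by index label, so this assumes both frames share the same index.

```python
import pandas as pd

disease = pd.DataFrame({
    "Country": ["Afghanistan", "Argentina"],
    "Province": ["Afghanistan", "Argentina"],
    "Cases": [10, 20],  # hypothetical extra column that should NOT be carried over
})
monthly = pd.DataFrame(
    [[1, 5, 8, 0, 9, 9, 8, 18, 82, 89, 81],
     [0, 1, 9, 19, 8, 12, 29, 19, 91, 74, 93]],
    columns=range(1, 12),  # months 1-11 as column labels
)

# All rows (:) and the first two columns (:2) of disease, followed by monthly.
combined = pd.concat([disease.iloc[:, :2], monthly], axis=1)
```

Because the two selected columns come first in the list passed to concat, they end up as the first two columns of the result.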

Labeling by period

My dataset:
name day value
A 7 88
A 15 101
A 21 121
A 29 56
B 21 131
B 30 78
B 35 102
C 8 80
C 16 101
...
I am trying to plot the values for these days, but there are too many unique day values, so I want to label them.
I want the labels to be consistent.
Is there a quick way to label the days by cutting them into 7-day periods (weeks)?
For example, days 1-7 = week 1, days 8-14 = week 2, and so on.
The output I want:
name day value week
A 7 88 1
A 15 101 3
A 21 121 3
A 29 56 5
B 21 131 3
B 30 78 5
B 35 102 5
C 8 80 2
C 16 101 3
Thank you for reading.
Subtract 1, then use integer division by 7, and finally add 1:
df['week'] = (df['day'] - 1) // 7 + 1
print (df)
name day value week
0 A 7 88 1
1 A 15 101 3
2 A 21 121 3
3 A 29 56 5
4 B 21 131 3
5 B 30 78 5
6 B 35 102 5
7 C 8 80 2
8 C 16 101 3
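The subtraction matters at the week boundaries: for day 7, (7 - 1) // 7 + 1 = 0 + 1 = 1, whereas 7 // 7 + 1 would already give 2. For day 8, (8 - 1) // 7 + 1 = 1 + 1 = 2. A self-contained sketch with the data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "day": [7, 15, 21, 29, 21, 30, 35, 8, 16],
    "value": [88, 101, 121, 56, 131, 78, 102, 80, 101],
})

# Shift days to start at 0, floor-divide into 7-day blocks, then number weeks from 1.
df["week"] = (df["day"] - 1) // 7 + 1
```

The whole column is computed in one vectorized expression, so no loop over rows is needed.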

selecting a column from pandas pivot table

I have the pivot table below, which I created from a DataFrame using the following code:
table = pd.pivot_table(df, values='count', index=['days'], columns=['movements'], aggfunc=np.sum)
movements 0 1 2 3 4 5 6 7
days
0 2777 51 2
1 6279 200 7 3
2 5609 110 32 4
3 4109 118 101 8 3
4 3034 129 109 6 2 2
5 2288 131 131 9 2 1
6 1918 139 109 13 1 1
7 1442 109 153 13 10 1
8 1085 76 111 13 7 1
9 845 81 86 8 8
10 646 70 83 1 2 1 1
As you can see, the pivot table has 8 columns (0-7), and I want to plot only some specific columns instead of all of them. I could not manage to select columns. Let's say I want to plot column 0 and column 2 against the index. What should I use for y to select column 0 and column 2?
plt.plot(x=table.index, y=??)
I tried y = table.value['0', '2'] and y = table['0','2'], but neither works.
You cannot select two columns that way. If you need both columns in a single plot, you can plot each column in turn:
plt.plot(table['0'])
plt.plot(table['2'])
If the column names are integers, then:
plt.plot(table[0])
plt.plot(table[2])
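Another option is to pass a list of labels inside the brackets, which selects several columns at once, and let DataFrame.plot draw one line per column against the index. A minimal sketch, assuming a tiny made-up frame with integer `movements` values (`plot_columns` is a hypothetical helper name):

```python
import pandas as pd

# Small long-format frame; the counts come from the first rows of the pivot table above.
df = pd.DataFrame({
    "days": [0, 0, 0, 1, 1, 1],
    "movements": [0, 1, 2, 0, 1, 2],
    "count": [2777, 51, 2, 6279, 200, 7],
})

table = pd.pivot_table(df, values="count", index="days",
                       columns="movements", aggfunc="sum")

# A list inside the brackets selects multiple columns (integer labels here).
selected = table[[0, 2]]


def plot_columns(frame):
    """Plot every column of frame against its index (matplotlib imported lazily)."""
    import matplotlib.pyplot as plt

    frame.plot()
    plt.show()
```

Calling `plot_columns(selected)` draws columns 0 and 2 on the same axes. If the column labels are strings, the selection would be `table[['0', '2']]` instead.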
