I have basically two DataFrames from different dates and want to join them into one
let's say this is data from 25 Sep
hour columnA columnB
0 12 24
1 45 87
2 10 58
3 12 13
4 12 20
here is data from 26sep
hour columnA columnB
0 54 89
1 45 3
2 33 97
3 12 13
4 78 47
now I want to join both DataFrames and get MultiIndex DataFrame like this
25sep hour columnA columnB
0 12 24
1 45 87
2 10 58
3 12 13
4 12 20
26sep hour columnA columnB
0 54 89
1 45 3
2 33 97
3 12 13
4 78 47
I read the docs about MultiIndex but am not sure how to apply it to my situation.
Use pandas.concat
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
>>> df = pd.concat([df1.set_index('hour'), df2.set_index('hour')],
keys=["25sep", "26sep"])
>>> df
columnA columnB
hour
25sep 0 12 24
1 45 87
2 10 58
3 12 13
4 12 20
26sep 0 54 89
1 45 3
2 33 97
3 12 13
4 78 47
Let us try
out = pd.concat({ y : x.set_index('hour') for x, y in zip([df1,df2],['25sep','26sep'])})
columnA columnB
hour
25sep 0 12 24
1 45 87
2 10 58
3 12 13
4 12 20
26sep 0 54 89
1 45 3
2 33 97
3 12 13
4 78 47
Related
I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of the columns only where the columns match a string (in this case, all columns with _CAP at the end of their name). And store the sum of the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, the solution doesn't work for me since they are summing up columns that have the same exact name so a simple groupby can accomplish the result whereas I am trying to sum columns with specific string matches only.
Code to recreate above sample dataset:
data1 = [['1', 33,25,16,50], ['2', 34,22,11,66],
['3', 22,12,15,42],['4', 46,45,66,54],
['5',16,6,23,75], ['6', 21,42,433,50]]
df = pd.DataFrame(data1, columns = ['ID', 'Length','Width','Range_CAP','Capacity_CAP'])
Let us do filter
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If have other CAP in front
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask where the values are True if and only if the column name ends with CAP. As an alternative use filter, with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
if i.find('_CAP') != -1:
df['sum'] = df['sum'] + df[i]
else:
pass
I have a group of data like below
ID Type value_1 value_2
1 A 12 89
2 A 13 78
3 A 11 92
4 A 9 79
5 B 15 83
6 B 34 91
7 B 2 87
8 B 3 86
9 B 7 85
10 C 9 83
11 C 3 85
12 C 2 87
13 C 12 88
14 C 11 82
I want to get the top 3 member of each Type according to the value_1 . The only solution occurs to me is that: first , get each Type data into a dataframe and sorted according to the value_1 and get the top 3; Then, merge the result together.
But is ther any simple method to solve it ? For easy discuss , I have codes below
#coding:utf-8
import pandas as pd
_data = [
["1","A",12,89],
["2","A",13,78],
["3","A",11,92],
["4","A",9,79],
["5","B",15,83],
["6","B",34,91],
["7","B",2,87],
["8","B",3,86],
["9","B",7,85],
["10","C",9,83],
["11","C",3,85],
["12","C",2,87],
["13","C",12,88],
["14","C",11,82]
]
head= ["ID","type","value_1","value_2"]
df = pd.DataFrame(_data, columns=head)
Then we using groupby tail with sort_values
newdf=df.sort_values(['type','value_1']).groupby('type').tail(3)
newer
ID type value_1 value_2
2 3 A 11 92
0 1 A 12 89
1 2 A 13 78
8 9 B 7 85
4 5 B 15 83
5 6 B 34 91
9 10 C 9 83
13 14 C 11 82
12 13 C 12 88
Sure! DataFrame.groupby can split a dataframe into different parts by the group fields and apply function can apply UDF on each group.
df.groupby('type', as_index=False, group_keys=False)\
.apply(lambda x: x.sort_values('value_1', ascending=False).head(3))
How can i convert a dataframe into series of Pandas series.
My dataframe below and i want it to plot with stacked bar chart.
city soru_id value1 value2 value3 value4 value5
0 1 2 147 119 69 92 106
1 2 2 31 20 12 14 26
2 3 2 37 22 24 18 19
3 4 2 10 13 7 13 10
4 5 2 38 48 18 30 27
5 6 2 401 409 168 354 338
.. ... ... ... ... ... ... ...
76 77 2 12 7 3 12 8
77 78 2 4 2 1 12 3
78 79 2 3 None 1 None None
79 80 2 12 7 4 4 7
80 81 2 18 13 7 10 2
[81 rows x 7 columns]
Therefore i need to get the dataframe into form like below to plot it. I can not loop into DataFrame() method. How can i plot it.
df = pd.DataFrame([pd.Series([value1,value2,value3,value4,value5]),
pd.Series([value1,value2,value3,value4,value5]),
pd.Series([value1,value2,value3,value4,value5]),
..................................
..................................
pd.Series([value1,value2,value3,value4,value5]),
pd.Series([value1,value2,value3,value4,value5])
], index=[index0,index2,index3,....,index79,index80])
df.plot.bar(stacked=True)
plt.show()
You don't need to change anything about your dataframe, just index the dataframe based on your value columns and plot that.
cols = [col for col in df.columns if col.startswith('value')]
df[cols].plot.bar(stacked=True)
plt.show()
I have two different dataframes with the same column names:
eg.
0 1 2
0 10 13 17
1 14 21 34
2 68 32 12
0 1 2
0 45 56 32
1 9 22 86
2 55 64 19
I would like to append the second frame to the right of the first one while continuing the column names from the first frame. The output would look like this:
0 1 2 3 4 5
0 10 13 17 45 56 32
1 14 21 34 9 22 86
2 68 32 12 55 64 19
What is the most efficient way of doing this?
Thanks.
Use pd.concat first and then reset the columns.
In [1108]: df_out = pd.concat([df1, df2], axis=1)
In [1109]: df_out.columns = list(range(len(df_out.columns)))
In [1110]: df_out
Out[1110]:
0 1 2 3 4 5
0 10 13 17 45 56 32
1 14 21 34 9 22 86
2 68 32 12 55 64 19
Why not join:
>>> df=df.join(df_,lsuffix='_')
>>> df.columns=range(len(df.columns))
>>> df
0 1 2 3 4 5
0 10 13 17 45 56 32
1 14 21 34 9 22 86
2 68 32 12 55 64 19
>>>
join is your friend, i use lsuffix (could be rsuffix too) to ignore error for saying duplicate columns.
Question is pretty self explanatory, how would you insert a dataframe with a couple of values in to a bigger dataframe at a given point (between index's 10 and 11). Meaning that .append cant be used
You can use concat with sliced df by loc:
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(100, size=(5,6)), columns=list('ABCDEF'))
print (df1)
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
df2 = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#inserted between 4 and 5 index values
print (pd.concat([df1.loc[:4], df2, df1.loc[4:]], ignore_index=True))
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
5 1 4 7 1 5 7
6 2 5 8 3 3 4
7 3 6 9 5 6 3
8 27 4 31 1 13 83