Dataframe pandas how to pass list as columns - python

I have two lists, such as:
list_columns = ['a','b','c','d','e','f','g','h','k','l','m','n']
and a list of values
list_values = [11,22,33,44,55,66,77,88,99,100, 111, 222]
I want to create a Pandas dataframe using list_columns as columns.
I tried with df = pd.DataFrame(list_values, columns=list_columns)
but it doesn't work
I get this error: ValueError: Shape of passed values is (1, 12), indices imply (12, 12)

A dataframe is a two-dimensional object. To reflect this, you need to feed a nested list. Each sublist, in this case the only sublist, represents a row.
df = pd.DataFrame([list_values], columns=list_columns)
print(df)
# a b c d e f g h k l m n
# 0 11 22 33 44 55 66 77 88 99 100 111 222
If you supply an index with length greater than 1, Pandas broadcasts for you:
df = pd.DataFrame([list_values], columns=list_columns, index=[0, 1, 2])
print(df)
# a b c d e f g h k l m n
# 0 11 22 33 44 55 66 77 88 99 100 111 222
# 1 11 22 33 44 55 66 77 88 99 100 111 222
# 2 11 22 33 44 55 66 77 88 99 100 111 222

If I understand your question correctly just wrap list_values in brackets so it's a list of lists
list_columns = ['a','b','c','d','e','f','g','h','k','l','m','n']
list_values = [[11,22,33,44,55,66,77,88,99,100, 111, 222]]
pd.DataFrame(list_values, columns=list_columns)
a b c d e f g h k l m n
0 11 22 33 44 55 66 77 88 99 100 111 222

from your list you can do like below:
df = pd.DataFrame(list_values)
df=df.T
df.columns=list_columns
>>df
a b c d e f g h k l m n
0 11 22 33 44 55 66 77 88 99 100 111 222

Related

Add Sum to all grouped rows in pandas dataframe

I have a dataframe and i want to group its "First" and "Second" column and then to produce the expected output as mentioned below:
df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
print(df)
Output>
First Second Value_1 Value_2
0 a q 17 70
1 b e 44 47
2 c e 5 56
3 a e 23 58
4 b e 10 76
5 a q 11 67
6 b q 21 84
7 c q 42 67
8 b e 36 53
9 c q 16 63
When i Grouped this DataFrame using groupby, I am getting below output:
def func(arr,columns):
return arr.sort_values(by = columns).drop(columns, axis = 1)
df.groupby(['First','Second']).apply(func, columns = ['First','Second'])
Value_1 Value_2
First Second
a e 3 23 58
q 0 17 70
5 11 67
b e 1 44 47
4 10 76
8 36 53
q 6 21 84
c e 2 5 56
q 7 42 67
9 16 63
However i want below output:
Expected output:
Value_1 Value_2
First Second
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130
It's not necessary to print "All" string but to print the sum of all grouped rows.
df = pd.DataFrame({'First':list('abcababcbc'), 'Second':list('qeeeeqqqeq'),'Value_1':np.random.randint(4,50,10),'Value_2':np.random.randint(40,90,10)})
First Second Value_1 Value_2
0 a q 4 69
1 b e 20 74
2 c e 13 82
3 a e 9 41
4 b e 11 79
5 a q 32 77
6 b q 6 75
7 c q 39 62
8 b e 26 80
9 c q 26 42
def lambda_t(x):
df = x.sort_values(['First','Second']).drop(['First','Second'],axis=1)
df.loc['all'] = df.sum()
return df
df.groupby(['First','Second']).apply(lambda_t)
Value_1 Value_2
First Second
a e 3 9 41
all 9 41
q 0 4 69
5 32 77
all 36 146
b e 1 20 74
4 11 79
8 26 80
all 57 233
q 6 6 75
all 6 75
c e 2 13 82
all 13 82
q 7 39 62
9 26 42
all 65 104
You can try this:
reset the index in your group by:
d1 = df.groupby(['First','Second']).apply(func, columns = ['First','Second']).reset_index()
Then group by 'First' and 'Second' and sum the values columns.
d2 = d.groupby(['First', 'Second']).sum().reset_index()
Create the 'level_2' column in the new dataframe and concatenate with the initial one to get the desired result
d2.loc[:,'level_2'] = 'All'
pd.concat([d1,d2],0).sort_values(by = ['First', 'Second'])
Not sure about your function; however, you could chunk it into two steps:
Create an indexed dataframe, where you append the First and Second columns to the existing index:
df.index = df.index.astype(str).rename("Total")
indexed = df.set_index(["First", "Second"], append=True).reorder_levels(
["First", "Second", "Total"]
)
indexed
Value_1 Value_2
First Second Total
a q 0 17 70
b e 1 44 47
c e 2 5 56
a e 3 23 58
b e 4 10 76
a q 5 11 67
b q 6 21 84
c q 7 42 67
b e 8 36 53
c q 9 16 63
Create an aggregation, grouped by First and Second:
summary = (
df.groupby(["First", "Second"])
.sum()
.assign(Total="All")
.set_index("Total", append=True)
)
summary
Value_1 Value_2
First Second Total
a e All 23 58
q All 28 137
b e All 90 176
q All 21 84
c e All 5 56
q All 58 130
Combine indexed and summary dataframes:
pd.concat([indexed, summary]).sort_index(level=["First", "Second"])
Value_1 Value_2
First Second Total
a e 3 23 58
All 23 58
q 0 17 70
5 11 67
All 28 137
b e 1 44 47
4 10 76
8 36 53
All 90 176
q 6 21 84
All 21 84
c e 2 5 56
All 5 56
q 7 42 67
9 16 63
All 58 130

Pandas Multi/Hierarchical Indexing - performing arithmetic operation between columns

I have a dataframe like this:
arr = np.random.randint(10, 99, (4,4))
df = pd.DataFrame(arr)
df.columns = pd.MultiIndex.from_product([['X','Y'],['A','B']])
And it looks like this:
X Y
A B A B
0 76 78 29 24
1 34 80 83 56
2 56 44 40 30
3 16 38 45 93
For all rows where A < B in X, I want to do A - B in Y. How do I do that?
I did this to filter and select A and B from Y
df[df['X']['A'] < df['X']['B']].loc[:, ('Y', ['A', 'B'])]
Y
A B
0 29 24
1 83 56
3 45 93
But I am lost on how to do A - B.
Thanks.
Assuming you want to subtract and update A with the result, you can do so by indexing as:
m = (df[('X','A')] < df[('X','B')])
df.loc[m,('Y','A')] = df.loc[m,('Y','A')] - df.loc[m,('Y','B')]
print(df)
X Y
A B A B
0 77 67 55 87
1 36 85 26 50
2 77 14 62 89
3 88 33 82 44
You can select columns by tuples for MultiIndex like:
np.random.seed(20)
arr = np.random.randint(10, 99, (4,4))
df = pd.DataFrame(arr)
df.columns = pd.MultiIndex.from_product([['X','Y'],['A','B']])
print (df)
X Y
A B A B
0 25 38 19 30
1 85 32 81 44
2 50 95 36 93
3 26 72 26 17
mask = df[('X','A')].lt(df[('X','B')])
print (mask)
0 True
1 False
2 True
3 True
dtype: bool
s = df.loc[mask, ('Y','A')].sub(df.loc[mask, ('Y','B')])
print (s)
0 -11
2 -57
3 9
dtype: int32

Looping a function with Pandas DataFrames

I have some function that takes a DataFrame and an integer as arguments:
func(df, int)
The function returns a new DataFrame, e.g.:
df2 = func(df,2)
I'd like to write a loop for integers 2-10, resulting in 9 DataFrames. If I do this manually it would look like this:
df2 = func(df,2)
df3 = func(df2,3)
df4 = func(df3,4)
df5 = func(df4,5)
df6 = func(df5,6)
df7 = func(df6,7)
df8 = func(df7,8)
df9 = func(df8,9)
df10 = func(df9,10)
Is there a way to write a loop that does this?
This type of thing is what lists are for.
data_frames = [df]
for i in range(2, 11):
data_frames.append(func(data_frames[-1], i))
It's a sign of brittle code when you see variable names like df1, df2, df3, etc. Use lists when you have a sequence of related objects to build.
To clarify, this data_frames is a list of DataFrames that can be concatenated with data_frames = pd.concat(data_frames, sort=False), resulting in one DataFrame that combines the original df with everything that results from the loop, correct?
Yup, that's right. If your goal is one final data frame, you can concatenate the entire list at the end to combine the information into a single frame.
Do you mind explaining why data_frames[-1], which takes the last item of the list, returns a DataFrame? Not clear on this.
Because as you're building the list, at all times each entry is a data frame. data_frames[-1] evaluates to the last element in the list, which in this case, is the data frame you most recently appended.
You may try using itertools.accumulate as follows:
sample data
df:
a b c
0 75 18 17
1 48 56 3
import itertools
def func(x, y):
return x + y
dfs = list(itertools.accumulate([df] + list(range(2, 11)), func))
[ a b c
0 75 18 17
1 48 56 3, a b c
0 77 20 19
1 50 58 5, a b c
0 80 23 22
1 53 61 8, a b c
0 84 27 26
1 57 65 12, a b c
0 89 32 31
1 62 70 17, a b c
0 95 38 37
1 68 76 23, a b c
0 102 45 44
1 75 83 30, a b c
0 110 53 52
1 83 91 38, a b c
0 119 62 61
1 92 100 47, a b c
0 129 72 71
1 102 110 57]
dfs is the list of result dataframes where each one is the adding of 2 - 10 to the previous result
If you want concat them all into one dataframe, Use pd.concat
pd.concat(dfs)
Out[29]:
a b c
0 75 18 17
1 48 56 3
0 77 20 19
1 50 58 5
0 80 23 22
1 53 61 8
0 84 27 26
1 57 65 12
0 89 32 31
1 62 70 17
0 95 38 37
1 68 76 23
0 102 45 44
1 75 83 30
0 110 53 52
1 83 91 38
0 119 62 61
1 92 100 47
0 129 72 71
1 102 110 57
You can use exec with a formatted string:
for i in range(2, 11):
exec("df{0} = func(df{1}, {0})".format(i, i - 1 if i > 2 else ''))

Getting aggregate from Pandas for other categories

I am trying to perform an aggregate calculation, but I want the calculation to apply to every other category.
So,
df.groupby(['index']).agg({data : [func1,func2]})
Will perform the aggregate calculations func1 and func2 on the data grouped by index, but I want to perform the calculations on all the data that isn't in the index.
For example:
index data
A 1
A 2
A 1
B 2
B 2
B 4
B 4
C 1
C 3
D 4
D 1
I would want the results for A to be performed on the data in B,C,D.
Is there a novel way to accomplish this?
Well, I actually think I figured this out. Basically, I created a new dataframe and re-index'd it.
value
original_index
A 44
A 65
A 88
B 69
B 11
B 52
C 56
C 42
C 85
D 66
D 77
D 9
Loop through each index and and copy everything not in that index to a new dataframe. Then concat them all together.
l = []
for i in df.index.unique():
d = df[~df.index.isin([i])].copy()
d['new_index'] = i
d.drop('original_index',axis=0,inplace=True)
d.set_index('new_index',inplace=True)
l.append(d)
df2 = pd.concat(l,axis=0)
Ouput:
value
new_index
A 69
A 11
A 52
A 56
A 42
A 85
A 66
A 77
A 9
B 44
B 65
B 88
B 56
B 42
B 85
B 66
B 77
B 9
C 44
C 65
C 88
C 69
C 11
C 52
C 66
C 77
C 9
D 44
D 65
D 88
D 69
D 11
D 52
D 56
D 42
D 85
Now we can apply our groupby function on the new index and it will return results from values that were originally not in the index.
group_df = df2.groupby(['new_index']).agg({'value' :[func1,func2]})[['value']]
It works, but I'm sure there must be a better way.

Python DataFrame Loop through an element and assign a value

Here, in my code, the correlation matrix is a dataframe and diag is a list.
When I run the following code (CholDC part at the bottom), it returns numpy.float64 object is not iterable.
What do I need to do to make this code work?
def CholDC (correl, diag):
for column in correl:
j = 0
for j in correl[str(column)][j]:
Sum = correl[str(column)][j]
k = int(column)-1
if k >= 1:
Sum = Sum - correl[str(column)][k]*correl[str(j)][k]
else:
Sum = Sum
if int(column) == j:
if Sum <= 0:
print ("Should be PSD")
else:
diag[int(column)] = np.sqrt(Sum)
else:
correl[str(j)][int(column)] = Sum / diag[int(column)]
diag = []
df_correl = pd.DataFrame(df_correlation)
CholDC(df_correl, diag)
To loop through a dataframe, you need to use iterrows(). See the example below:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100, size=(10, 4)), columns=list('ABCD'))
print(df)
for index, row in df.iterrows():
print(row['B'], row['C'])
#dataframe output
A B C D
0 53 60 63 44
1 17 12 20 55
2 85 28 76 99
3 39 75 69 30
4 2 85 21 3
5 22 5 45 33
6 78 65 22 38
7 14 99 0 67
8 18 70 53 19
9 54 25 96 7
#output from loop
60 63
12 20
28 76
75 69
85 21
5 45
65 22
99 0
70 53
25 96
So use iterrows() in your code instead of for column in correl.

Categories

Resources