Add a name to pandas dataframe index - python

As the picture shows, how can I add a name to the index of a pandas DataFrame? After it is added, it should look like this:

You need to set the index name:
df.index.name = 'code'
Or use rename_axis:
df = df.rename_axis('code')
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10,size=(5,5)),columns=list('ABCDE'),index=list('abcde'))
print (df)
A B C D E
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
df.index.name = 'code'
print (df)
A B C D E
code
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
df = df.rename_axis('code')
print (df)
A B C D E
code
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
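As a quick follow-up sketch (using a minimal two-row frame, not the sample above): once the index has a name, reset_index turns it into a regular column under that name, which is often why the name is set in the first place.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df.index.name = 'code'       # equivalent to df = df.rename_axis('code')

out = df.reset_index()       # the index name becomes the new column's label
print(out.columns.tolist())  # ['code', 'A']
```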

Related

Pandas: take the value from the row below

Here is a model of the real data:
C S E D
1 1 3 0 0
2 1 5 0 0
3 1 6 0 0
4 2 1 0 0
5 2 3 0 0
6 2 7 0 0
C - category, S - start, E - end, D - delta
Using pandas, column E needs to be filled with the value of column S from the next row (id = id + 1) within the same category; for the last row of each category, E equals that row's own S.
It turns out:
C S E D
1 1 3 5 0
2 1 5 6 0
3 1 6 6 0
4 2 1 3 0
5 2 3 7 0
6 2 7 7 0
Then subtract S from E and put the result in D. That part is, in principle, easy; the difficulty is filling in column E.
The result is this:
C S E D
1 1 3 5 2
2 1 5 6 1
3 1 6 6 0
4 2 1 3 2
5 2 3 7 4
6 2 7 7 0
Use DataFrameGroupBy.shift, replace the resulting missing values (the last row of each group) with the original S values via Series.fillna, and then subtract to get column D:
df['E'] = df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int)
df['D'] = df['E'] - df['S']
Or, if DataFrame.assign is needed, use a lambda for D so it sees the freshly computed E column:
df = df.assign(E=df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int),
               D=lambda x: x['E'] - x['S'])
print (df)
C S E D
1 1 3 5 2
2 1 5 6 1
3 1 6 6 0
4 2 1 3 2
5 2 3 7 4
6 2 7 7 0
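For reference, a self-contained version of the solution above, with the question's table rebuilt as a DataFrame (values taken from the sample data shown earlier):

```python
import pandas as pd

# Rebuild the question's data
df = pd.DataFrame({'C': [1, 1, 1, 2, 2, 2],
                   'S': [3, 5, 6, 1, 3, 7],
                   'E': 0,
                   'D': 0})

# Next row's S within each category; the last row of each category keeps its own S
df['E'] = df.groupby('C')['S'].shift(-1).fillna(df['S']).astype(int)
df['D'] = df['E'] - df['S']
print(df)
```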

Pandas: Add rows in the groups of a dataframe

I have a data frame as follows:
df = pd.DataFrame({"date": [1, 2, 5, 6, 2, 3, 4, 5, 1, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
                   "variable": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D", "D"]})
date variable
0 1 A
1 2 A
2 5 A
3 6 A
4 2 B
5 3 B
6 4 B
7 5 B
8 1 C
9 3 C
10 4 C
11 5 C
12 6 C
13 1 D
14 2 D
15 3 D
16 4 D
17 5 D
18 6 D
In this data frame, there are 4 values in the variable column: A, B, C, D. My goal is for each variable to have the dates 1 through 6 in the date column.
But currently, a few dates are missing for some variables. I tried grouping them and filling each gap with a counter, but sometimes more than one date is missing (for example, in variable A, the dates 3 and 4 are missing). Also, the counter made my code terribly slow, as I have a couple of thousand rows.
Is there a faster and smarter way to do this without using a counter?
The desired output should be as follows:
date variable
0 1 A
1 2 A
2 3 A
3 4 A
4 5 A
5 6 A
6 1 B
7 2 B
8 3 B
9 4 B
10 5 B
11 6 B
12 1 C
13 2 C
14 3 C
15 4 C
16 5 C
17 6 C
18 1 D
19 2 D
20 3 D
21 4 D
22 5 D
23 6 D
itertools.product
from itertools import product
pd.DataFrame([*product(
    range(df.date.min(), df.date.max() + 1),
    sorted({*df.variable})
)], columns=df.columns)
date variable
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 B
6 2 C
7 2 D
8 3 A
9 3 B
10 3 C
11 3 D
12 4 A
13 4 B
14 4 C
15 4 D
16 5 A
17 5 B
18 5 C
19 5 D
20 6 A
21 6 B
22 6 C
23 6 D
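Note that product as called above emits rows in date-major order, while the desired output is grouped by variable. Swapping the argument order (a small sketch with the variables hard-coded) yields the variable-major ordering directly:

```python
from itertools import product
import pandas as pd

variables = ['A', 'B', 'C', 'D']
# variable first, date second -> rows grouped by variable
full = pd.DataFrame(list(product(variables, range(1, 7))),
                    columns=['variable', 'date'])
full = full[['date', 'variable']]  # restore the original column order
print(full.head(6))
```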
Using groupby + reindex
(df.groupby('variable', as_index=False)
   .apply(lambda g: g.set_index('date').reindex([1, 2, 3, 4, 5, 6]).ffill().bfill())
   .reset_index(level=1))
Output:
date variable
0 1 A
0 2 A
0 3 A
0 4 A
0 5 A
0 6 A
1 1 B
1 2 B
1 3 B
1 4 B
1 5 B
1 6 B
2 1 C
2 2 C
2 3 C
2 4 C
2 5 C
2 6 C
3 1 D
3 2 D
3 3 D
3 4 D
3 5 D
3 6 D
This is more of a workaround, but it should work:
df.groupby('variable', as_index=False).agg({'date': lambda x: list(range(1, 7))}).explode('date')

Pad dataframe discontinuous column

I have the following dataframe:
Name B C D E
1 A 1 2 2 7
2 A 7 1 1 7
3 B 1 1 3 4
4 B 2 1 3 4
5 B 3 1 3 4
What I'm trying to do is obtain a new dataframe in which, for rows with the same "Name", the values in the "B" column are continuous. In this example, the rows with "Name" = A would have to be padded with B values ranging from 1 to 7, and the values for columns C, D, E in the new rows should be 0.
Name B C D E
1 A 1 2 2 7
2 A 2 0 0 0
3 A 3 0 0 0
4 A 4 0 0 0
5 A 5 0 0 0
6 A 6 0 0 0
7 A 7 1 1 7
8 B 1 1 3 4
9 B 2 1 3 4
10 B 3 1 3 4
What I've done so far is to turn the B column values for the same "Name" into continuous values:
new_idx = df_.groupby('Name').apply(lambda x: np.arange(x.index.min(), x.index.max() + 1)).apply(pd.Series).stack()
and reindexing the original (having set B as the index) df using this new Series, but I'm having trouble reindexing using duplicates. Any help would be appreciated.
You can use:
def f(x):
    a = np.arange(x.index.min(), x.index.max() + 1)
    return x.reindex(a, fill_value=0)

new_idx = (df.set_index('B')
             .groupby('Name')
             .apply(f)
             .drop(columns='Name')
             .reset_index()
             .reindex(columns=df.columns))
print (new_idx)
Name B C D E
0 A 1 2 2 7
1 A 2 0 0 0
2 A 3 0 0 0
3 A 4 0 0 0
4 A 5 0 0 0
5 A 6 0 0 0
6 A 7 1 1 7
7 B 1 1 3 4
8 B 2 1 3 4
9 B 3 1 3 4
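Here is a self-contained sketch of the same idea, with the question's table rebuilt as a DataFrame. It is a variation on the answer above, not the exact code: the value columns are selected inside the groupby, so the grouping column never has to be dropped afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B', 'B'],
                   'B': [1, 7, 1, 2, 3],
                   'C': [2, 1, 1, 1, 1],
                   'D': [2, 1, 3, 3, 3],
                   'E': [7, 7, 4, 4, 4]})

def pad(x):
    # Make B continuous within the group, filling new rows with zeros
    a = np.arange(x.index.min(), x.index.max() + 1)
    return x.reindex(a, fill_value=0)

out = (df.set_index('B')
         .groupby('Name')[['C', 'D', 'E']]
         .apply(pad)
         .reset_index()
         .reindex(columns=df.columns))
print(out)
```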

TypeError: unhashable type: 'dict' when using pandas Multi Index

I have tried to add one dataframe, which has 2 rows and about 200 columns, on top of another dataframe, but I got TypeError: unhashable type: 'dict'.
This is the code I'm using (df is the first dataframe, with 2 rows and about 200 columns, that I am trying to add to the finaldata dataframe):
finaldata.columns = pd.MultiIndex.from_arrays([df.values[0], finaldata.columns])
When I check type of dataframes with type(), I got pandas.core.frame.DataFrame
It seems you need iloc to select the first and second rows of df by position:
finaldata.columns = pd.MultiIndex.from_arrays([df.iloc[0], df.iloc[1], finaldata.columns])
Sample:
df = pd.DataFrame({'a': [2, 3],
                   'b': [5, 6],
                   'c': [1, 5],
                   'd': [4, 5],
                   'e': [1, 5],
                   'f': [8, 9]})
print (df)
a b c d e f
0 2 5 1 4 1 8
1 3 6 5 5 5 9
finaldata = pd.DataFrame({'A': [1, 2, 3],
                          'B': [4, 5, 6],
                          'C': [7, 8, 9],
                          'D': [1, 3, 5],
                          'E': [5, 3, 6],
                          'F': [7, 4, 3]})
print (finaldata)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
names = ['first','second','third']
finaldata.columns = pd.MultiIndex.from_arrays([df.iloc[0],
                                               df.iloc[1],
                                               finaldata.columns], names=names)
print (finaldata)
first 2 5 1 4 1 8
second 3 6 5 5 5 9
third A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
Another solution with numpy.concatenate:
a = np.concatenate([df.values, np.array(finaldata.columns).reshape(-1,df.shape[1])]).tolist()
print (a)
[[2, 5, 1, 4, 1, 8], [3, 6, 5, 5, 5, 9], ['A', 'B', 'C', 'D', 'E', 'F']]
names = ['first','second','third']
finaldata.columns = pd.MultiIndex.from_arrays(a, names=names)
print (finaldata)
first 2 5 1 4 1 8
second 3 6 5 5 5 9
third A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
EDIT:
The solution is very similar; you only need to reindex the columns first:
df = pd.DataFrame({'A': [2, 3],
                   'B': [5, 6],
                   'C': [1, 5],
                   'D': [4, 5],
                   'E': [1, 5],
                   'F': [8, 9]})
print (df)
A B C D E F
0 2 5 1 4 1 8
1 3 6 5 5 5 9
finaldata = pd.DataFrame({'A': [1, 2, 3],
                          'B': [4, 5, 6],
                          'E': [7, 8, 9],
                          'F': [1, 3, 5]})
print (finaldata)
A B E F
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
df1 = df.reindex(columns=finaldata.columns)
print (df1)
A B E F
0 2 5 1 8
1 3 6 5 9
names = ['first','second','third']
finaldata.columns = pd.MultiIndex.from_arrays([df1.iloc[0],
                                               df1.iloc[1],
                                               finaldata.columns], names=names)
print (finaldata)
first 2 5 1 8
second 3 6 5 9
third A B E F
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
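Once the MultiIndex is in place, each level can be read back with get_level_values. A minimal sketch (a small two-column frame, not the sample above):

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 3], 'b': [5, 6]})
finaldata = pd.DataFrame({'A': [1, 2], 'B': [4, 5]})

finaldata.columns = pd.MultiIndex.from_arrays(
    [df.iloc[0], df.iloc[1], finaldata.columns],
    names=['first', 'second', 'third'])

# Read individual levels back out of the MultiIndex by name
print(finaldata.columns.get_level_values('third').tolist())  # ['A', 'B']
print(finaldata.columns.get_level_values('first').tolist())  # [2, 5]
```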

add new column to pandas DataFrame with value depended on previous row

I have an existing pandas DataFrame, and I want to add a new column, where the value of each row will depend on the previous row.
for example:
df1 = pd.DataFrame(np.random.randint(10, size=(4, 4)), columns=['a', 'b', 'c', 'd'])
df1
Out[31]:
a b c d
0 9 3 3 0
1 3 9 5 1
2 1 7 5 6
3 8 0 1 7
and now I want to create column e, where for each row i the value of df1['e'][i] would be: df1['e'][i] = df1['d'][i] - df1['d'][i-1]
desired output:
df1:
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
how can I achieve this?
You can use sub with shift:
df['e'] = df.d.sub(df.d.shift(), fill_value=0)
print (df)
a b c d e
0 9 3 3 0 0.0
1 3 9 5 1 1.0
2 1 7 5 6 5.0
3 8 0 1 7 1.0
If you need to convert to int:
df['e'] = df.d.sub(df.d.shift(), fill_value=0).astype(int)
print (df)
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
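Series.diff is shorthand for the same subtraction; a small sketch with only column d reproduced:

```python
import pandas as pd

df = pd.DataFrame({'d': [0, 1, 6, 7]})

# diff() computes d - d.shift(); the first row is NaN, so fill it with 0
df['e'] = df['d'].diff().fillna(0).astype(int)
print(df['e'].tolist())  # [0, 1, 5, 1]
```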
