I have the DataFrame df:
import pandas as pd
import numpy as np

b = np.array([0,1,2,2,0,1,2,2,3,4,4,4,5,6,0,1,0,0]).reshape(-1,1)
c = np.array(['a','a','a','a','b','b','b','b','b','b','b','b','b','b','c','c','d','e']).reshape(-1,1)
df = pd.DataFrame(np.hstack([b,c]), columns=['Start','File'])
df
Out[22]:
Start File
0 0 a
1 1 a
2 2 a
3 2 a
4 0 b
5 1 b
6 2 b
7 2 b
8 3 b
9 4 b
10 4 b
11 4 b
12 5 b
13 6 b
14 0 c
15 1 c
16 0 d
17 0 e
I would like to rename the index to index_File, so that the index values become 0_a, 1_a, ..., 17_e.
You can use set_index, with or without inplace=True:
df.set_index(df.File.radd(df.index.astype(str) + '_'))
Start File
File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
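As mentioned, the same call also works in place; a minimal sketch of that variant, which mutates df and returns None:
# same index labels as above, but modifying df in place
df.set_index(df.File.radd(df.index.astype(str) + '_'), inplace=True)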
At the expense of a few more characters of code, we can speed up the radd call and take care of the unnecessary index name:
df.set_index(df.File.values.__radd__(df.index.astype(str) + '_'))
Start File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
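The __radd__ call just prepends the '0_', '1_', ... prefixes to the raw File values element-wise, sidestepping index alignment and the Series name. If you find it hard to read, a plain list comprehension builds the same labels (a sketch, not part of the original answer):
# build the '0_a', '1_a', ... labels explicitly and use them as the index
new_labels = pd.Index([f'{i}_{f}' for i, f in zip(df.index, df['File'])])
df = df.set_index(new_labels)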
You can assign to the index directly: first convert the default index to str using astype, then concatenate the strings as usual:
In[41]:
df.index = df.index.astype(str) + '_' + df['File']
df
Out[41]:
Start File
File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
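Note that because the assigned values come from the File column, the new index inherits the name File (shown above the index in the output). If you do not want that name, one way (a small sketch) is to drop it afterwards:
# remove the leftover 'File' index name
df = df.rename_axis(None)
# or, in place: df.index.name = None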
Related
I have a dataframe df:
(A,B) (B,C) (D,B) (E,F)
0 3 0 1
1 1 3 0
2 2 4 2
I want to split each column into separate columns, one for each name in the header, as shown below:
A B B C D B E F
0 0 3 3 0 0 1 1
1 1 1 1 3 3 0 0
2 2 2 2 4 4 2 2
and then add columns with the same name together:
A B C D E F
0 3 3 0 1 1
1 5 1 3 0 0
2 8 2 4 2 2
How can I achieve this using pandas?
With pandas, you can use this:
out = (
    df
    .T
    .reset_index()
    .assign(col=lambda x: x.pop("index").str.strip("()").str.split(","))
    .explode("col")
    .groupby("col", as_index=False).sum()
    .set_index("col")
    .T
    .rename_axis(None, axis=1)
)
# Output:
print(out)
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
Assuming the column names are tuples like ('A', 'B'), you can do:
pd.concat([pd.DataFrame([df[i].tolist()] * len(i), index=list(i)) for i in df.columns]).sum(level=0).T
result:
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
If a FutureWarning occurs (sum(level=...) is deprecated), use the following code instead:
pd.concat([pd.DataFrame([df[i].tolist()] * len(i), index=list(i)) for i in df.columns]).groupby(level=0).sum().T
It gives the same result.
Create a MultiIndex in the columns with Series.str.findall, then use concat with the levels removed:
df.columns = df.columns.str.findall(r'(\w+)').map(tuple)
df = (pd.concat([df.droplevel(x, axis=1) for x in range(df.columns.nlevels)], axis=1)
        .groupby(level=0, axis=1)
        .sum())
print (df)
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
To write the output to a file without the index, use:
df.to_csv('file.csv', index=False)
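In recent pandas versions groupby(..., axis=1) is deprecated, so if the snippet above raises a warning you can group on the transposed frame instead; a rough sketch of that replacement for the second statement:
# same aggregation without axis=1: transpose, group the duplicate labels on level 0, transpose back
df = (pd.concat([df.droplevel(x, axis=1) for x in range(df.columns.nlevels)], axis=1)
        .T
        .groupby(level=0)
        .sum()
        .T)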
You can use findall to extract the variables from the header, then melt and explode, and finally pivot_table:
out = (df
       .reset_index().melt('index')
       .assign(variable=lambda d: d['variable'].str.findall(r'(\w+)'))
       .explode('variable')
       .pivot_table(index='index', columns='variable', values='value', aggfunc='sum')
       .rename_axis(index=None, columns=None)
)
Output:
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
Reproducible input:
df = pd.DataFrame({'(A,B)': [0, 1, 2],
                   '(B,C)': [3, 1, 2],
                   '(D,B)': [0, 3, 4],
                   '(E,F)': [1, 0, 2]})
Printing/saving without the index:
print(out.to_string(index=False))
A B C D E F
0 3 3 0 1 1
1 5 1 3 0 0
2 8 2 4 2 2
# as file
out.to_csv('yourfile.csv', index=False)
How can I add a name to the index of a pandas DataFrame, so that it is displayed above the index values?
You need to set the index name:
df.index.name = 'code'
Or use rename_axis:
df = df.rename_axis('code')
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10,size=(5,5)),columns=list('ABCDE'),index=list('abcde'))
print (df)
A B C D E
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
df.index.name = 'code'
print (df)
A B C D E
code
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
df = df.rename_axis('code')
print (df)
A B C D E
code
a 8 8 3 7 7
b 0 4 2 5 2
c 2 2 1 0 8
d 4 0 9 6 2
e 4 1 5 3 4
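rename_axis can also name both axes at once if you ever need a label on the columns axis as well; a small sketch (the columns name 'cols' here is just an example):
# name the index 'code' and the columns axis 'cols' in one call
df = df.rename_axis(index='code', columns='cols')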
I have an existing pandas DataFrame, and I want to add a new column, where the value of each row will depend on the previous row.
for example:
df1 = pd.DataFrame(np.random.randint(10, size=(4, 4)), columns=['a', 'b', 'c', 'd'])
df1
Out[31]:
a b c d
0 9 3 3 0
1 3 9 5 1
2 1 7 5 6
3 8 0 1 7
Now I want to create a column e where, for each row i, df1['e'][i] = df1['d'][i] - df1['d'][i-1].
desired output:
df1:
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
how can I achieve this?
You can use sub with shift:
df1['e'] = df1.d.sub(df1.d.shift(), fill_value=0)
print (df1)
a b c d e
0 9 3 3 0 0.0
1 3 9 5 1 1.0
2 1 7 5 6 5.0
3 8 0 1 7 1.0
If you need to convert to int:
df1['e'] = df1.d.sub(df1.d.shift(), fill_value=0).astype(int)
print (df1)
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
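An equivalent way (a sketch, not from the answer above) is diff, which computes d[i] - d[i-1] directly; the first row has no predecessor and comes back as NaN, so fill it with the row's own d value to match fill_value=0:
# diff gives d[i] - d[i-1]; filling the first NaN with d[0] reproduces fill_value=0
df1['e'] = df1['d'].diff().fillna(df1['d']).astype(int)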
I have a dataframe with a column called product_type such as:
df1.product_type.unique()
>> ["prod_1", "prod_2", "prod_3"]
df1.product_type.dtype
>> dtype('O')
I am looking for the most efficient way to replace those with the numerical values [1, 2, 3].
Thanks
Use factorize to encode a new column:
In [2]:
df = pd.DataFrame({'a':list('abcdbcbccc')})
df
Out[2]:
a
0 a
1 b
2 c
3 d
4 b
5 c
6 b
7 c
8 c
9 c
In [5]:
df['code'] = df['a'].factorize()[0] + 1
df
Out[5]:
a code
0 a 1
1 b 2
2 c 3
3 d 4
4 b 2
5 c 3
6 b 2
7 c 3
8 c 3
9 c 3
so in your case:
df1['product_type'] = df1['product_type'].factorize()[0] + 1
should work
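pd.factorize also returns the unique labels, which is handy if you later need to translate the codes back; a small sketch (the column name product_type_code is just an example):
# codes are 0-based in order of first appearance; uniques[i] is the label behind code i
codes, uniques = pd.factorize(df1['product_type'])
df1['product_type_code'] = codes + 1
code_to_label = dict(enumerate(uniques, start=1))  # e.g. {1: 'prod_1', 2: 'prod_2', 3: 'prod_3'}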
Cast the column as a category, and then get the codes.
df1 = pd.DataFrame({'product_type': ['prod_1'] * 3 + ['prod_2'] * 3 + ['prod_3'] * 3})
df1['product_type_code'] = df1.product_type.astype('category').cat.codes
>>> df1
product_type product_type_code
0 prod_1 0
1 prod_1 0
2 prod_1 0
3 prod_2 1
4 prod_2 1
5 prod_2 1
6 prod_3 2
7 prod_3 2
8 prod_3 2
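Both factorize and cat.codes assign the numbers by order of appearance or category order. If you need full control over which number each product gets, an explicit mapping is an option (a sketch with an assumed mapping dict):
# explicit, order-independent mapping; values missing from the dict become NaN
mapping = {'prod_1': 1, 'prod_2': 2, 'prod_3': 3}
df1['product_type_num'] = df1['product_type'].map(mapping)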
This is a simple problem but for some reason I am not able to find an easy solution.
I have a hierarchically indexed Series, for example:
import itertools
import numpy as np

s = pd.Series(data=np.random.randint(0, 3, 45),
              index=pd.MultiIndex.from_tuples(list(itertools.product('pqr', [0,1,2], 'abcde')),
                                              names=['Index1', 'Index2', 'Index3']),
              name='P')
s = s.map({0:'A', 1:'B', 2:'C'})
So it looks like
Index1 Index2 Index3
p 0 a A
b A
c C
d B
e C
1 a B
b C
c C
d B
e B
q 0 a B
b C
c C
d C
e C
1 a A
b A
c B
d C
e A
I want to do a frequency count by value so that the output looks like
Index1 Index2 P
p 0 A 2
B 1
C 2
1 A 0
B 3
C 2
q 0 A 0
B 1
C 4
1 A 3
B 1
C 1
You can apply value_counts to the Series groupby:
In [11]: s.groupby(level=[0, 1]).value_counts() # equiv .apply(pd.value_counts)
Out[11]:
Index1 Index2
p 0 C 2
A 2
B 1
1 B 3
A 2
2 A 3
B 1
C 1
q 0 A 3
B 1
C 1
1 B 2
C 2
A 1
2 C 3
B 1
A 1
r 0 A 3
B 1
C 1
1 B 3
C 2
2 B 3
C 1
A 1
dtype: int64
If you want to include the 0s (which the above won't), you could use crosstab:
In [21]: ct = pd.crosstab(index=[s.index.get_level_values(0), s.index.get_level_values(1)],
                          columns=s.values,
                          rownames=s.index.names[:2],
                          colnames=s.index.names[2:3])
In [22]: ct
Out[22]:
Index3 A B C
Index1 Index2
p 0 2 1 2
1 2 3 0
2 3 1 1
q 0 3 1 1
1 1 2 2
2 1 1 3
r 0 3 1 1
1 0 3 2
2 1 3 1
In [23]: ct.stack()
Out[23]:
Index1 Index2 Index3
p 0 A 2
B 1
C 2
1 A 2
B 3
C 0
2 A 3
B 1
C 1
q 0 A 3
B 1
C 1
1 A 1
B 2
C 2
2 A 1
B 1
C 3
r 0 A 3
B 1
C 1
1 A 0
B 3
C 2
2 A 1
B 3
C 1
dtype: int64
Which may be slightly faster...
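Another way to get the zero counts back in (a sketch, not part of the original answer) is to unstack the groupby/value_counts result with fill_value=0 and stack it again:
# every (Index1, Index2, value) combination appears; missing combinations count as 0
counts = (s.groupby(level=[0, 1])
           .value_counts()
           .unstack(fill_value=0)
           .stack())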