Pandas reshape dataframe every N rows to columns - python

I have a dataframe as follows:
df1 = pd.DataFrame(np.arange(24).reshape(6, -1), columns=['a','b','c','d'])
and I want to take the rows in sets of 3 and place each subsequent set alongside the first as new columns (so the 6x4 frame becomes 3x8).
NumPy's reshape doesn't give the intended answer:
pd.DataFrame(np.reshape(df1.values, (3, -1)), columns=['a','b','c','d','e','f','g','h'])
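For reference, a plain reshape reads the values in row-major order, so the first output row would be 0-7 rather than pairing row 0 with row 3; that is why the answers below split, swap, or unstack the block axis first. A quick check of what the plain reshape produces (output follows from NumPy's row-major semantics):
np.reshape(df1.values, (3, -1))
# array([[ 0,  1,  2,  3,  4,  5,  6,  7],
#        [ 8,  9, 10, 11, 12, 13, 14, 15],
#        [16, 17, 18, 19, 20, 21, 22, 23]])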

In [258]: df = pd.DataFrame(np.hstack(np.split(df1, 2)))
In [259]: df
Out[259]:
0 1 2 3 4 5 6 7
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
In [260]: import string
In [261]: df.columns = list(string.ascii_lowercase[:len(df.columns)])
In [262]: df
Out[262]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
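A minimal sketch generalizing this split-and-hstack approach to blocks of any size N (my own helper, not from the answer above; it assumes len(df1) is divisible by N and the result has at most 26 columns so the letter names fit):
import string
import numpy as np
import pandas as pd

def widen_every_n(df, n):
    # Split into consecutive blocks of n rows, then place the blocks side by side.
    blocks = np.split(df.to_numpy(), len(df) // n)
    out = pd.DataFrame(np.hstack(blocks))
    out.columns = list(string.ascii_lowercase[:out.shape[1]])
    return out

widen_every_n(df1, 3)   # same 3x8 result as above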

Create a 3D array with reshape, then horizontally stack its blocks:
a = np.hstack(np.reshape(df1.values,(-1, 3, len(df1.columns))))
df = pd.DataFrame(a,columns=['a','b','c','d','e','f','g','h'])
print (df)
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23

This uses the reshape/swapaxes/reshape idiom for rearranging sub-blocks of NumPy arrays: reshape to (blocks, rows_per_block, columns), swap the first two axes so the block index sits next to the columns, then flatten back to 2D.
In [26]: pd.DataFrame(df1.values.reshape(2,3,4).swapaxes(0,1).reshape(3,-1), columns=['a','b','c','d','e','f','g','h'])
Out[26]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23

If you want a pure pandas solution:
df1.set_index([df1.index % 3, df1.index // 3])\
   .unstack()\
   .sort_index(level=1, axis=1)\
   .set_axis(list('abcdefgh'), axis=1, inplace=False)
Output:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23

Related

CSV & Pandas: Unnamed columns and multi-index

I have a set of data:
,,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8
The desired output I'm trying to achieve is a frame with hierarchical column headers (country → store → measure) and a two-level row index (period and M/F).
I know that I can read the CSV and remove any NaN rows with:
df = pd.read_csv("Stores.csv",skipinitialspace=True)
df.dropna(how="all", inplace=True)
My 2 main issues are:
1. How do I group the unnamed columns so that they are just the countries "England" and "France"?
2. How do I set up an index so that each of the 3 stores falls under the relevant country?
I believe I can use hierarchical indexing for the headings, but all the examples I've come across use nice, clean data frames, unlike my CSV. I'd be very grateful if someone could point me in the right direction, as I'm fairly new to pandas.
Thank you.
You can try this:
from io import StringIO
import pandas as pd
import numpy as np
test=StringIO(""",,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8""")
df = pd.read_csv(test, index_col=[0,1], header=[0,1,2], skiprows=lambda x: x%2 == 1)
# Rebuild the column MultiIndex: blank header cells are read in as 'Unnamed: ...',
# so replace them with NaN and forward-fill from the last named cell.
df.columns = pd.MultiIndex.from_frame(
    df.columns.to_frame()
              .apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x))
              .ffill())
# Forward-fill the blanks in the two index columns as well.
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())
print(df)
Output:
0 England ... France
1 Store 1 Store 2 Store 3 ... Store 1 Store 2 Store 3
2 F P M D F P M D F P ... M D F P M D F P M D
0 1 ...
Year 1 M 0 5 7 9 2 18 5 10 4 9 ... 18 11 10 19 18 20 3 17 19 13
F 0 13 14 11 0 6 8 6 2 12 ... 12 18 6 17 16 14 0 4 2 5
Year 2 M 5 10 6 6 1 20 5 18 4 9 ... 15 19 2 18 16 13 1 19 5 12
F 1 11 14 15 0 9 9 2 2 12 ... 18 14 9 18 13 14 0 9 2 10
Evening M 4 10 6 5 3 13 19 5 4 9 ... 10 18 3 11 20 11 4 18 17 20
F 4 12 12 13 0 9 3 8 2 12 ... 11 18 1 13 13 10 0 6 2 8
[6 rows x 24 columns]
You'll have to set the (multi) index and headers yourself:
df = pd.read_csv("Stores.csv", header=None)
df.dropna(how='all', inplace=True)
df.reset_index(inplace=True, drop=True)
# getting headers as a product of [England, France], [Store1, Store2, Store3] and [F, P, M, D]
headers = pd.MultiIndex.from_product([df.iloc[0].dropna().unique(),
                                      df.iloc[1].dropna().unique(),
                                      df.iloc[2].dropna().unique()])
df.drop([0, 1, 2], inplace=True)    # removing the header rows
df[0].ffill(inplace=True)           # filling NaN values in the first index column
df.set_index([0, 1], inplace=True)  # setting the MultiIndex
df.columns = headers
print(df)
Output:
England ... France
Store 1 Store 2 Store 3 ... Store 1 Store 2 Store 3
F P M D F P M D F P M ... P M D F P M D F P M D
0 1 ...
Year 1 M 0 5 7 9 2 18 5 10 4 9 6 ... 14 18 11 10 19 18 20 3 17 19 13
F 0 13 14 11 0 6 8 6 2 12 14 ... 17 12 18 6 17 16 14 0 4 2 5
Year 2 M 5 10 6 6 1 20 5 18 4 9 6 ... 13 15 19 2 18 16 13 1 19 5 12
F 1 11 14 15 0 9 9 2 2 12 14 ... 17 18 14 9 18 13 14 0 9 2 10
Evening M 4 10 6 5 3 13 19 5 4 9 6 ... 17 10 18 3 11 20 11 4 18 17 20
F 4 12 12 13 0 9 3 8 2 12 14 ... 18 11 18 1 13 13 10 0 6 2 8
[6 rows x 24 columns]
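Once the columns are a MultiIndex, you can select by country or by country and store. A small usage sketch (not from the answers above; the level number in xs assumes the innermost header level holds the F/P/M/D labels):
england = df['England']                      # all of England's columns
england_store1 = df[('England', 'Store 1')]  # a single store within a country
f_only = df.xs('F', axis=1, level=2)         # every 'F' column across countries and stores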

Conditional Cumulative Count pandas while preserving values before first change

I work with pandas and I am trying to create a column whose value counts up and is reset on a condition, based on the Time column.
Input data:
Out[73]:
ID Time Job Level Counter
0 1 17 a
1 1 18 a
2 1 19 a
3 1 20 a
4 1 21 a
5 1 22 b
6 1 23 b
7 1 24 b
8 2 10 a
9 2 11 a
10 2 12 a
11 2 13 a
12 2 14 b
13 2 15 b
14 2 16 b
15 2 17 c
16 2 18 c
I want to create a new column 'Counter' that, within each ID group, stays equal to Time before the first change in Job Level (or if there is no change at all), and restarts from zero every time a change in Job Level is encountered.
What I would like to have:
ID Time Job Level Counter
0 1 17 a 17
1 1 18 a 18
2 1 19 a 19
3 1 20 a 20
4 1 21 a 21
5 1 22 b 0
6 1 23 b 1
7 1 24 b 2
8 2 10 a 10
9 2 11 a 11
10 2 12 a 12
11 2 13 a 13
12 2 14 b 0
13 2 15 b 1
14 2 16 b 2
15 2 17 c 0
16 2 18 c 1
This is what I tried:
df = df.sort_values(['ID']).reset_index(drop=True)
df['Counter'] = df.groupby('ID')['Job Level'].apply(lambda x: x.shift() != x)
def func(group):
    group.loc[group.index[0], 'Counter'] = group.loc[group.index[0], 'Time']
    return group
df = df.groupby('ID').apply(func)
df['Counter'] = df['Counter'].replace(True, 'a')
df['Counter'] = np.where(df.Counter == False, df['Time'], df['Counter'])
df['Counter'] = df['Counter'].replace('a', 0)
This does not create a cumulative count after the first change while preserving the Time values before it.
Use GroupBy.cumcount for the counter; for rows belonging to the first group within each ID, use the values from the Time column instead:
#if need test consecutive duplicates
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
print (df)
ID Time Job Level Counter
0 1 17 a 17
1 1 18 a 18
2 1 19 a 19
3 1 20 a 20
4 1 21 a 21
5 1 22 b 0
6 1 23 b 1
7 1 24 b 2
8 2 10 a 10
9 2 11 a 11
10 2 12 a 12
11 2 13 a 13
12 2 14 b 0
13 2 15 b 1
14 2 16 b 2
15 2 17 c 0
16 2 18 c 1
Or:
#if each groups are unique
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
The difference shows up on modified data, where the same Job Level reappears non-consecutively within an ID:
print (df)
ID Time Job Level
12 2 14 b
13 2 15 b
14 2 16 b
15 2 17 c
16 2 18 c
10 2 12 a
11 2 18 a
12 2 19 b
13 2 20 b
#if need test consecutive duplicates
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter1'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter2'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
print (df)
ID Time Job Level Counter1 Counter2
12 2 14 b 14 14
13 2 15 b 15 15
14 2 16 b 16 16
15 2 17 c 0 0
16 2 18 c 1 1
10 2 12 a 0 0
11 2 18 a 1 1
12 2 19 b 0 19
13 2 20 b 1 20
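A hedged, self-contained sketch that wraps the consecutive-duplicates variant into a reusable helper (the function and argument names are mine, not from the answer; it assumes each ID's rows are contiguous, as in the example):
import numpy as np
import pandas as pd

def counter_from_first_run(df, id_col='ID', time_col='Time', level_col='Job Level'):
    # Label consecutive runs of the same job level.
    runs = df[level_col].ne(df[level_col].shift()).cumsum()
    # Rows in the first run of each ID keep their Time value...
    first_run = runs.groupby(df[id_col]).transform('first').eq(runs)
    # ...all later runs restart a zero-based counter.
    return np.where(first_run, df[time_col], df.groupby([id_col, runs]).cumcount())

df['Counter'] = counter_from_first_run(df)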

Pandas df.isna().sum() not showing all column names

I have some simple code in Databricks:
import pandas as pd
data_frame = pd.read_csv('/dbfs/some_very_large_file.csv')
data_frame.isna().sum()
Out[41]:
A 0
B 0
C 0
D 0
E 0
..
T 0
V 0
X 0
Z 0
Y 0
Length: 287, dtype: int64
How can I see all the column names (A to Y) along with their NA counts? I tried setting pd.set_option('display.max_rows', 287) and pd.set_option('display.max_columns', 287), but this doesn't seem to work here. Also, the isna() and sum() methods do not have any arguments that would let me manipulate the output, as far as I can tell.
Pandas truncates long output according to its display options: if the object to be displayed exceeds display.max_rows, it is centrally truncated and only a handful of rows are shown. To view the entire result, change the display options.
To display all rows of df:
pd.set_option('display.max_rows',None)
Ex:
>>> df
A B C
0 4 8 8
1 13 17 13
2 19 13 2
3 9 9 16
4 14 19 19
.. .. .. ..
7 7 2 2
8 5 7 2
9 18 12 17
10 10 5 11
11 5 3 18
[12 rows x 3 columns]
>>> pd.set_option('display.max_rows',None)
>>> df
A B C
0 4 8 8
1 13 17 13
2 19 13 2
3 9 9 16
4 14 19 19
5 3 17 12
6 9 13 17
7 7 2 2
8 5 7 2
9 18 12 17
10 10 5 11
11 5 3 18
Documentation:
pandas.set_option
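If you would rather not change the global setting, two alternatives (a sketch, not part of the answer above; both use standard pandas APIs):
print(data_frame.isna().sum().to_string())          # render the full Series as plain text

with pd.option_context('display.max_rows', None):   # override the option only inside this block
    print(data_frame.isna().sum())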

Concatenate dataframes along columns in a pandas dataframe

I want to concatenate two DataFrames along columns. Both have the same number of rows.
df1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
df2
D E F
0 13 14 15
1 16 17 18
2 19 20 21
3 22 23 24
Expected:
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
I have done:
df_combined = pd.concat([df1,df2], axis=1)
But df_combined has new rows, with NaN values in some columns...
I can't find my error. So, what do I have to do? Thanks in advance!
In this case, merge() works.
pd.merge(df1, df2, left_index=True, right_index=True)
output
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
This works only if both dataframes have the same indices.
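The extra NaN rows from pd.concat usually mean the two frames carry different index labels, so they do not align. If only the row order matters, resetting both indices before concatenating should give the same result (a sketch, assuming equal row counts):
df_combined = pd.concat([df1.reset_index(drop=True),
                         df2.reset_index(drop=True)], axis=1)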

Permute groups in Pandas

Say I have a Pandas DataFrame whose data look like
import numpy as np
import pandas as pd
n = 30
df = pd.DataFrame({'a': np.arange(n),
                   'b': np.random.choice([0, 1, 2], n),
                   'c': np.arange(n)})
Question: how do I permute the groups (grouped by the b column)?
Not a permutation within each group, but a permutation at the group level.
Example
Before
a b c
1 0 1
2 0 2
3 1 3
4 1 4
5 2 5
6 2 6
After
a b c
3 1 3
4 1 4
1 0 1
2 0 2
5 2 5
6 2 6
Basically, before the permutation df['b'].unique() == [0, 1, 2], and after the permutation df['b'].unique() == [1, 0, 2].
Here's an answer inspired by the accepted answer to this SO post, which uses a temporary Categorical column as a sorting key to do custom sort orderings. In this answer, I produce all permutations, but you can just take the first one if you are looking for only one.
import itertools
df_results = list()
orderings = itertools.permutations(df["b"].unique())
for ordering in orderings:
    df_2 = df.copy()
    df_2["b_key"] = pd.Categorical(df_2["b"], [i for i in ordering])
    df_2.sort_values("b_key", inplace=True)
    df_2.drop(["b_key"], axis=1, inplace=True)
    df_results.append(df_2)
for df in df_results:
    print(df)
The idea here is that we create a new categorical variable each time, with a slightly different enumerated order, then sort by it. We discard it at the end once we no longer need it.
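If you only need one random permutation of the groups rather than all of them, a minimal sketch of the same Categorical trick (np.random.permutation picks the group order; b_key is just a temporary column name of my choosing):
order = np.random.permutation(df['b'].unique())
df_shuffled = (df.assign(b_key=pd.Categorical(df['b'], categories=order, ordered=True))
                 .sort_values('b_key', kind='stable')
                 .drop(columns='b_key'))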
If I understood your question correctly, you can do it this way:
n = 30
df = pd.DataFrame({'a': np.arange(n),
                   'b': np.random.choice([0, 1, 2], n),
                   'c': np.arange(n)})
order = pd.Series([1,0,2])
cols = df.columns
df['idx'] = df.b.map(order)
index = df.index
df = df.reset_index().sort_values(['idx', 'index'])[cols]
Step by step:
In [103]: df['idx'] = df.b.map(order)
In [104]: df
Out[104]:
a b c idx
0 0 2 0 2
1 1 0 1 1
2 2 1 2 0
3 3 0 3 1
4 4 1 4 0
5 5 1 5 0
6 6 1 6 0
7 7 2 7 2
8 8 0 8 1
9 9 1 9 0
10 10 0 10 1
11 11 1 11 0
12 12 0 12 1
13 13 2 13 2
14 14 0 14 1
15 15 2 15 2
16 16 1 16 0
17 17 2 17 2
18 18 1 18 0
19 19 1 19 0
20 20 0 20 1
21 21 0 21 1
22 22 1 22 0
23 23 1 23 0
24 24 2 24 2
25 25 0 25 1
26 26 0 26 1
27 27 0 27 1
28 28 1 28 0
29 29 1 29 0
In [105]: df.reset_index().sort_values(['idx', 'index'])
Out[105]:
index a b c idx
2 2 2 1 2 0
4 4 4 1 4 0
5 5 5 1 5 0
6 6 6 1 6 0
9 9 9 1 9 0
11 11 11 1 11 0
16 16 16 1 16 0
18 18 18 1 18 0
19 19 19 1 19 0
22 22 22 1 22 0
23 23 23 1 23 0
28 28 28 1 28 0
29 29 29 1 29 0
1 1 1 0 1 1
3 3 3 0 3 1
8 8 8 0 8 1
10 10 10 0 10 1
12 12 12 0 12 1
14 14 14 0 14 1
20 20 20 0 20 1
21 21 21 0 21 1
25 25 25 0 25 1
26 26 26 0 26 1
27 27 27 0 27 1
0 0 0 2 0 2
7 7 7 2 7 2
13 13 13 2 13 2
15 15 15 2 15 2
17 17 17 2 17 2
24 24 24 2 24 2
