Pandas reshape dataframe every N rows to columns - python
I have a dataframe as follows:
df1=pd.DataFrame(np.arange(24).reshape(6,-1),columns=['a','b','c','d'])
and I want to take sets of 3 rows and convert them to columns in the following order.
NumPy reshape doesn't give the intended answer:
pd.DataFrame(np.reshape(df1.values,(3,-1)),columns=['a','b','c','d','e','f','g','h'])
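That attempt keeps NumPy's default row-major order, so each output row simply takes the next 8 consecutive values instead of placing rows 3-5 beside rows 0-2; it produces:
    a   b   c   d   e   f   g   h
0   0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14  15
2  16  17  18  19  20  21  22  23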
In [258]: df = pd.DataFrame(np.hstack(np.split(df1, 2)))
In [259]: df
Out[259]:
0 1 2 3 4 5 6 7
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
In [260]: import string
In [261]: df.columns = list(string.ascii_lowercase[:len(df.columns)])
In [262]: df
Out[262]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
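In the same spirit, a sketch generalized to blocks of N rows (not part of the original answer; it assumes len(df1) is a multiple of N):
N = 3
out = pd.DataFrame(np.hstack(np.split(df1, len(df1) // N)))
out.columns = list(string.ascii_lowercase[:len(out.columns)])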
Create a 3D array by reshape and stack its blocks side by side with hstack:
a = np.hstack(np.reshape(df1.values,(-1, 3, len(df1.columns))))
df = pd.DataFrame(a,columns=['a','b','c','d','e','f','g','h'])
print (df)
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
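For reference, np.hstack treats the first axis of the 3D array as a sequence of (3, 4) blocks and concatenates them along the column axis, which is what places the second block of rows beside the first; a quick check of the shapes:
a3 = np.reshape(df1.values, (-1, 3, len(df1.columns)))
print(a3.shape)             # (2, 3, 4): two blocks of three rows
print(np.hstack(a3).shape)  # (3, 8): the two blocks side by side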
This uses the reshape/swapaxes/reshape idiom for rearranging sub-blocks of NumPy arrays.
In [26]: pd.DataFrame(df1.values.reshape(2,3,4).swapaxes(0,1).reshape(3,-1), columns=['a','b','c','d','e','f','g','h'])
Out[26]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
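A sketch of the same idiom wrapped in a small helper for an arbitrary block size (the function name is hypothetical, and it assumes the row count divides evenly by n_rows):
def blocks_to_columns(frame, n_rows):
    # (n_blocks, n_rows, n_cols) -> swap block and row axes -> one row per position within a block
    n_cols = frame.shape[1]
    arr = frame.values.reshape(-1, n_rows, n_cols).swapaxes(0, 1).reshape(n_rows, -1)
    return pd.DataFrame(arr)

print(blocks_to_columns(df1, 3))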
If you want a pure pandas solution:
df1.set_index([df1.index % 3, df1.index // 3])\
.unstack()\
.sort_index(level=1, axis=1)\
.set_axis(list('abcdefgh'), axis=1, inplace=False)
Output:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
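Note that pandas 2.0 removed the inplace argument from set_axis, so on recent versions the same chain can be written without it (a sketch, assuming the df1 from the question):
(df1.set_index([df1.index % 3, df1.index // 3])
    .unstack()
    .sort_index(level=1, axis=1)
    .set_axis(list('abcdefgh'), axis=1))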
Related
CSV & Pandas: Unnamed columns and multi-index
I have a set of data:

,,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8

The desired output I'm trying to achieve is:

I know that I can read the CSV and remove any NaN rows with:

df = pd.read_csv("Stores.csv", skipinitialspace=True)
df.dropna(how="all", inplace=True)

My 2 main issues are:
How do I group the unnamed columns so that they are just the countries "England" and "France"?
How do I set up an index so that each of the 3 stores falls under the relevant country?
I believe that I can use hierarchical indexing for the headings, but all the examples I've come across use nice, clean dataframes unlike my CSV. I'd be very grateful if someone could point me in the right direction as I'm fairly new to pandas. Thank you.
You can try this:

from io import StringIO
import pandas as pd
import numpy as np

test = StringIO(""",,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8""")

df = pd.read_csv(test, index_col=[0, 1], header=[0, 1, 2], skiprows=lambda x: x % 2 == 1)
df.columns = pd.MultiIndex.from_frame(df.columns
                                        .to_frame()
                                        .apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x))
                                        .ffill())
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())
print(df)

Output:

0         England                                 ...  France
1         Store 1          Store 2        Store 3 ... Store 1          Store 2        Store 3
2               F   P   M   D  F   P  M   D  F  P ...   M   D   F   P   M   D  F   P   M   D
0       1                                         ...
Year 1  M       0   5   7   9  2  18  5  10  4  9 ...  18  11  10  19  18  20  3  17  19  13
        F       0  13  14  11  0   6  8   6  2 12 ...  12  18   6  17  16  14  0   4   2   5
Year 2  M       5  10   6   6  1  20  5  18  4  9 ...  15  19   2  18  16  13  1  19   5  12
        F       1  11  14  15  0   9  9   2  2 12 ...  18  14   9  18  13  14  0   9   2  10
Evening M       4  10   6   5  3  13 19   5  4  9 ...  10  18   3  11  20  11  4  18  17  20
        F       4  12  12  13  0   9  3   8  2 12 ...  11  18   1  13  13  10  0   6   2   8

[6 rows x 24 columns]
You'll have to set the (multi) index and headers yourself:

df = pd.read_csv("Stores.csv", header=None)
df.dropna(how='all', inplace=True)
df.reset_index(inplace=True, drop=True)

# getting headers as a product of [England, France], [Store1, Store2, Store3] and [F, P, M, D]
headers = pd.MultiIndex.from_product([df.iloc[0].dropna().unique(),
                                      df.iloc[1].dropna().unique(),
                                      df.iloc[2].dropna().unique()])

df.drop([0, 1, 2], inplace=True)    # removing header rows
df[0].ffill(inplace=True)           # filling nan values for first index col
df.set_index([0, 1], inplace=True)  # setting multiindex
df.columns = headers
print(df)

Output:

          England                                    ... France
          Store 1          Store 2        Store 3    ... Store 1          Store 2        Store 3
                F   P   M   D  F   P  M   D  F   P  M ...   P   M   D  F   P   M   D  F   P   M   D
0       1                                            ...
Year 1  M       0   5   7   9  2  18  5  10  4   9  6 ...  14  18  11 10  19  18  20  3  17  19  13
        F       0  13  14  11  0   6  8   6  2  12 14 ...  17  12  18  6  17  16  14  0   4   2   5
Year 2  M       5  10   6   6  1  20  5  18  4   9  6 ...  13  15  19  2  18  16  13  1  19   5  12
        F       1  11  14  15  0   9  9   2  2  12 14 ...  17  18  14  9  18  13  14  0   9   2  10
Evening M       4  10   6   5  3  13 19   5  4   9  6 ...  17  10  18  3  11  20  11  4  18  17  20
        F       4  12  12  13  0   9  3   8  2  12 14 ...  18  11  18  1  13  13  10  0   6   2   8

[6 rows x 24 columns]
Conditional Cumulative Count pandas while preserving values before first change
I work with Pandas and I am trying to create a column whose value increases and, in particular, resets on a condition based on the Time column.

Input data:

Out[73]:
    ID  Time Job Level  Counter
0    1    17         a
1    1    18         a
2    1    19         a
3    1    20         a
4    1    21         a
5    1    22         b
6    1    23         b
7    1    24         b
8    2    10         a
9    2    11         a
10   2    12         a
11   2    13         a
12   2    14         b
13   2    15         b
14   2    16         b
15   2    17         c
16   2    18         c

I want to create a new column 'Counter' that, within each ID group, stays equal to Time until the first change in Job Level, and then restarts from zero and counts up every time the Job Level changes.

What I would like to have:

    ID  Time Job Level  Counter
0    1    17         a       17
1    1    18         a       18
2    1    19         a       19
3    1    20         a       20
4    1    21         a       21
5    1    22         b        0
6    1    23         b        1
7    1    24         b        2
8    2    10         a       10
9    2    11         a       11
10   2    12         a       12
11   2    13         a       13
12   2    14         b        0
13   2    15         b        1
14   2    16         b        2
15   2    17         c        0
16   2    18         c        1

This is what I tried:

df = df.sort_values(['ID']).reset_index(drop=True)
df['Counter'] = promo_details.groupby('ID')['job_level'].apply(lambda x: x.shift() != x)

def func(group):
    group.loc[group.index[0], 'Counter'] = group.loc[group.index[0], 'time_in_level']
    return group

df = df.groupby('emp_id').apply(func)
df['Counter'] = df['Counter'].replace(True, 'a')
df['Counter'] = np.where(df.Counter == False, df['Time'], df['Counter'])
df['Counter'] = df['Counter'].replace('a', 0)

This does not create a cumulative count after the first change while preserving the values before it.
Use GroupBy.cumcount for the counter, and keep the values from the Time column wherever a row still belongs to each ID's first group:

# if consecutive duplicates need to be treated as separate groups
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
print (df)
    ID  Time Job Level  Counter
0    1    17         a       17
1    1    18         a       18
2    1    19         a       19
3    1    20         a       20
4    1    21         a       21
5    1    22         b        0
6    1    23         b        1
7    1    24         b        2
8    2    10         a       10
9    2    11         a       11
10   2    12         a       12
11   2    13         a       13
12   2    14         b        0
13   2    15         b        1
14   2    16         b        2
15   2    17         c        0
16   2    18         c        1

Or:

# if the groups within each ID are unique (no repeats)
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())

Difference in changed data:

print (df)
    ID  Time Job Level
12   2    14         b
13   2    15         b
14   2    16         b
15   2    17         c
16   2    18         c
10   2    12         a
11   2    18         a
12   2    19         b
13   2    20         b

# if consecutive duplicates need to be treated as separate groups
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter1'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())

m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter2'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
print (df)
    ID  Time Job Level  Counter1  Counter2
12   2    14         b        14        14
13   2    15         b        15        15
14   2    16         b        16        16
15   2    17         c         0         0
16   2    18         c         1         1
10   2    12         a         0         0
11   2    18         a         1         1
12   2    19         b         0        19
13   2    20         b         1        20
Pandas df.isna().sum() not showing all column names
I have simple code in Databricks:

import pandas as pd

data_frame = pd.read_csv('/dbfs/some_very_large_file.csv')
data_frame.isna().sum()

Out[41]:
A    0
B    0
C    0
D    0
E    0
    ..
T    0
V    0
X    0
Z    0
Y    0
Length: 287, dtype: int64

How can I see all column names (A to Y) along with their NA counts? I tried setting pd.set_option('display.max_rows', 287) and pd.set_option('display.max_columns', 287), but this doesn't seem to work here. Also, isna() and sum() do not have any arguments that would let me manipulate the output, as far as I can tell.
Pandas truncates long output for display: once a frame or Series exceeds the display limit, only the first and last few rows are shown and the middle is replaced by "..". To view the entire frame, change the display options.

To display all rows of df:

pd.set_option('display.max_rows', None)

Ex:

>>> df
     A   B   C
0    4   8   8
1   13  17  13
2   19  13   2
3    9   9  16
4   14  19  19
..  ..  ..  ..
7    7   2   2
8    5   7   2
9   18  12  17
10  10   5  11
11   5   3  18

[12 rows x 3 columns]
>>> pd.set_option('display.max_rows', None)
>>> df
     A   B   C
0    4   8   8
1   13  17  13
2   19  13   2
3    9   9  16
4   14  19  19
5    3  17  12
6    9  13  17
7    7   2   2
8    5   7   2
9   18  12  17
10  10   5  11
11   5   3  18

Documentation: pandas.set_option
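If you prefer not to change the option globally, pd.option_context applies it only inside a with block; a sketch using the data_frame from the question above:
with pd.option_context('display.max_rows', None):
    print(data_frame.isna().sum())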
Concatenate dataframes along columns in a pandas dataframe
I want to concatenate two dataframes along columns. Both have the same number of rows.

df1
    A   B   C
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12

df2
    D   E   F
0  13  14  15
1  16  17  18
2  19  20  21
3  22  23  24

Expected:
    A   B   C   D   E   F
0   1   2   3  13  14  15
1   4   5   6  16  17  18
2   7   8   9  19  20  21
3  10  11  12  22  23  24

I have done:

df_combined = pd.concat([df1, df2], axis=1)

But df_combined has new rows with NaN values in some columns... I can't find my error. What do I have to do? Thanks in advance!
In this case, merge() works:

pd.merge(df1, df2, left_index=True, right_index=True)

Output:
    A   B   C   D   E   F
0   1   2   3  13  14  15
1   4   5   6  16  17  18
2   7   8   9  19  20  21
3  10  11  12  22  23  24

This works only if both dataframes have the same index.
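The extra NaN rows from pd.concat usually mean the two frames carry different index labels; resetting both indices first makes the concatenation purely positional (a sketch of that fix):
df_combined = pd.concat([df1.reset_index(drop=True),
                         df2.reset_index(drop=True)], axis=1)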
Permute groups in Pandas
Say I have a Pandas DataFrame whose data look like

import numpy as np
import pandas as pd

n = 30
df = pd.DataFrame({'a': np.arange(n),
                   'b': np.random.choice([0, 1, 2], n),
                   'c': np.arange(n)})

Question: how do I permute the groups (grouped by the b column)? Not a permutation within each group, but a permutation at the group level?

Example

Before
a  b  c
1  0  1
2  0  2
3  1  3
4  1  4
5  2  5
6  2  6

After
a  b  c
3  1  3
4  1  4
1  0  1
2  0  2
5  2  5
6  2  6

Basically, before the permutation df['b'].unique() == [0, 1, 2], and after the permutation df['b'].unique() == [1, 0, 2].
Here's an answer inspired by the accepted answer to this SO post, which uses a temporary Categorical column as a sorting key to do custom sort orderings. In this answer, I produce all permutations, but you can just take the first one if you are looking for only one.

import itertools

df_results = list()
orderings = itertools.permutations(df["b"].unique())

for ordering in orderings:
    df_2 = df.copy()
    df_2["b_key"] = pd.Categorical(df_2["b"], [i for i in ordering])
    df_2.sort_values("b_key", inplace=True)
    df_2.drop(["b_key"], axis=1, inplace=True)
    df_results.append(df_2)

for df in df_results:
    print(df)

The idea here is that we create a new categorical variable each time, with a slightly different enumerated order, then sort by it. We discard it at the end once we no longer need it.
If I understood your question correctly, you can do it this way:

n = 30
df = pd.DataFrame({'a': np.arange(n),
                   'b': np.random.choice([0, 1, 2], n),
                   'c': np.arange(n)})

order = pd.Series([1, 0, 2])
cols = df.columns
df['idx'] = df.b.map(order)
index = df.index
df = df.reset_index().sort_values(['idx', 'index'])[cols]

Step by step:

In [103]: df['idx'] = df.b.map(order)

In [104]: df
Out[104]:
     a  b   c  idx
0    0  2   0    2
1    1  0   1    1
2    2  1   2    0
3    3  0   3    1
4    4  1   4    0
5    5  1   5    0
6    6  1   6    0
7    7  2   7    2
8    8  0   8    1
9    9  1   9    0
10  10  0  10    1
11  11  1  11    0
12  12  0  12    1
13  13  2  13    2
14  14  0  14    1
15  15  2  15    2
16  16  1  16    0
17  17  2  17    2
18  18  1  18    0
19  19  1  19    0
20  20  0  20    1
21  21  0  21    1
22  22  1  22    0
23  23  1  23    0
24  24  2  24    2
25  25  0  25    1
26  26  0  26    1
27  27  0  27    1
28  28  1  28    0
29  29  1  29    0

In [105]: df.reset_index().sort_values(['idx', 'index'])
Out[105]:
    index   a  b   c  idx
2       2   2  1   2    0
4       4   4  1   4    0
5       5   5  1   5    0
6       6   6  1   6    0
9       9   9  1   9    0
11     11  11  1  11    0
16     16  16  1  16    0
18     18  18  1  18    0
19     19  19  1  19    0
22     22  22  1  22    0
23     23  23  1  23    0
28     28  28  1  28    0
29     29  29  1  29    0
1       1   1  0   1    1
3       3   3  0   3    1
8       8   8  0   8    1
10     10  10  0  10    1
12     12  12  0  12    1
14     14  14  0  14    1
20     20  20  0  20    1
21     21  21  0  21    1
25     25  25  0  25    1
26     26  26  0  26    1
27     27  27  0  27    1
0       0   0  2   0    2
7       7   7  2   7    2
13     13  13  2  13    2
15     15  15  2  15    2
17     17  17  2  17    2
24     24  24  2  24    2
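If the goal is a random group-level permutation rather than a fixed ordering, the same mapping idea works with a shuffled group list; a sketch (it assumes pandas >= 1.1 for the key argument of sort_values):
rng = np.random.default_rng()
rank = {g: i for i, g in enumerate(rng.permutation(df['b'].unique()))}
df_permuted = df.sort_values('b', key=lambda s: s.map(rank), kind='stable')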