I have a set of data:
,,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8
The desired output I'm trying to achieve is a DataFrame with "England" and "France" grouping their three stores as hierarchical column headings, and the year/gender pairs as a hierarchical row index.
I know that I can read the CSV and remove any NaN rows with:
df = pd.read_csv("Stores.csv",skipinitialspace=True)
df.dropna(how="all", inplace=True)
My 2 main issues are:
How do I group the unnamed columns so that they are just the countries "England" and "France"?
How do I set up an index so that each of the 3 stores falls under the relevant country?
I believe I can use hierarchical indexing for the headings, but all the examples I've come across use nice, clean data frames, unlike my CSV. I'd be very grateful if someone could point me in the right direction, as I'm fairly new to pandas.
Thank you.
You can try this:
from io import StringIO
import pandas as pd
import numpy as np
test=StringIO(""",,England,,,,,,,,,,,,France,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,Store 1,,,,Store 2,,,,Store 3,,,,Store 1,,,,Store 2,,,,Store 3,,,
,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5
,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10
,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20
,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8""")
df = pd.read_csv(test, index_col=[0, 1], header=[0, 1, 2], skiprows=lambda x: x % 2 == 1)
# replace the autogenerated 'Unnamed: ...' labels with NaN, then forward-fill
# across the merged header cells
df.columns = pd.MultiIndex.from_frame(
    df.columns.to_frame()
              .apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x))
              .ffill()
)
# the same trick for the row index: forward-fill the blank year cells
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())
print(df)
Output:
0 England ... France
1 Store 1 Store 2 Store 3 ... Store 1 Store 2 Store 3
2 F P M D F P M D F P ... M D F P M D F P M D
0 1 ...
Year 1 M 0 5 7 9 2 18 5 10 4 9 ... 18 11 10 19 18 20 3 17 19 13
F 0 13 14 11 0 6 8 6 2 12 ... 12 18 6 17 16 14 0 4 2 5
Year 2 M 5 10 6 6 1 20 5 18 4 9 ... 15 19 2 18 16 13 1 19 5 12
F 1 11 14 15 0 9 9 2 2 12 ... 18 14 9 18 13 14 0 9 2 10
Evening M 4 10 6 5 3 13 19 5 4 9 ... 10 18 3 11 20 11 4 18 17 20
F 4 12 12 13 0 9 3 8 2 12 ... 11 18 1 13 13 10 0 6 2 8
[6 rows x 24 columns]
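Once both MultiIndexes are in place, the usual hierarchical selection works. A minimal sketch (the level names 0/1/2 are the autogenerated ones visible in the output above):
# all four columns for England / Store 1
print(df[('England', 'Store 1')])
# a single cell: row Year 2 / F, column France / Store 2 / P
print(df.loc[('Year 2', 'F'), ('France', 'Store 2', 'P')])
# cross-section: every 'F' column across both countries and all stores
print(df.xs('F', axis=1, level=2))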
You'll have to set the (multi) index and headers yourself:
df = pd.read_csv("Stores.csv", header=None)
df.dropna(how='all', inplace=True)
df.reset_index(inplace=True, drop=True)
# build headers as the product of [England, France] x [Store 1..3] x [F, P, M, D]
headers = pd.MultiIndex.from_product([df.iloc[0].dropna().unique(),
                                      df.iloc[1].dropna().unique(),
                                      df.iloc[2].dropna().unique()])
df.drop([0, 1, 2], inplace=True)    # remove the header rows
df[0] = df[0].ffill()               # fill NaN values in the first index column
df.set_index([0, 1], inplace=True)  # set the multiindex
df.columns = headers
print(df)
Output:
England ... France
Store 1 Store 2 Store 3 ... Store 1 Store 2 Store 3
F P M D F P M D F P M ... P M D F P M D F P M D
0 1 ...
Year 1 M 0 5 7 9 2 18 5 10 4 9 6 ... 14 18 11 10 19 18 20 3 17 19 13
F 0 13 14 11 0 6 8 6 2 12 14 ... 17 12 18 6 17 16 14 0 4 2 5
Year 2 M 5 10 6 6 1 20 5 18 4 9 6 ... 13 15 19 2 18 16 13 1 19 5 12
F 1 11 14 15 0 9 9 2 2 12 14 ... 17 18 14 9 18 13 14 0 9 2 10
Evening M 4 10 6 5 3 13 19 5 4 9 6 ... 17 10 18 3 11 20 11 4 18 17 20
F 4 12 12 13 0 9 3 8 2 12 14 ... 18 11 18 1 13 13 10 0 6 2 8
[6 rows x 24 columns]
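Note that from_product assumes the header grid is fully crossed (every country has the same stores, every store the same four columns). If that ever doesn't hold, here is a sketch of an alternative that builds the header from the actual rows instead, used in place of the from_product call above:
# forward-fill the merged header cells horizontally, then build the
# MultiIndex directly from the three header rows (columns 0 and 1 are
# the future index columns, so they are skipped)
hdr = df.iloc[:3, 2:].ffill(axis=1)
headers = pd.MultiIndex.from_arrays(list(hdr.values))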
I have simple code in Databricks:
import pandas as pd
data_frame = pd.read_csv('/dbfs/some_very_large_file.csv')
data_frame.isna().sum()
Out[41]:
A 0
B 0
C 0
D 0
E 0
..
T 0
V 0
X 0
Z 0
Y 0
Length: 287, dtype: int64
How can I see all the column names (A to Y) along with their NaN counts? I tried setting pd.set_option('display.max_rows', 287) and pd.set_option('display.max_columns', 287), but this doesn't seem to work here. Also, isna() and sum() do not have any arguments that would let me manipulate the output, as far as I can tell.
By default, pandas truncates the printed output of a long frame in the middle, showing only a handful of rows once it exceeds the display limit. To view the entire frame, you need to change the display options.
To display all rows of df:
pd.set_option('display.max_rows',None)
Ex:
>>> df
A B C
0 4 8 8
1 13 17 13
2 19 13 2
3 9 9 16
4 14 19 19
.. .. .. ..
7 7 2 2
8 5 7 2
9 18 12 17
10 10 5 11
11 5 3 18
[12 rows x 3 columns]
>>> pd.set_option('display.max_rows',None)
>>> df
A B C
0 4 8 8
1 13 17 13
2 19 13 2
3 9 9 16
4 14 19 19
5 3 17 12
6 9 13 17
7 7 2 2
8 5 7 2
9 18 12 17
10 10 5 11
11 5 3 18
Documentation:
pandas.set_option
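If you'd rather not change the option globally, pd.option_context applies it only inside a with block. A small sketch using the frame from the question:
import pandas as pd
# lift the row limit for this one print only, then restore the defaults
with pd.option_context('display.max_rows', None):
    print(data_frame.isna().sum())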
I want to concatenate two DataFrames along the columns. Both have the same number of rows.
df1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
df2
D E F
0 13 14 15
1 16 17 18
2 19 20 21
3 22 23 24
Expected:
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
I have done:
df_combined = pd.concat([df1,df2], axis=1)
But df_combined has new rows, with NaN values in some columns...
I can't find my error. What do I have to do? Thanks in advance!
In this case, merge() works.
pd.merge(df1, df2, left_index=True, right_index=True)
output
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
This works only if both dataframes have the same indices.
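For completeness: the stray NaN rows appear because pd.concat aligns on the index, and the two frames carry different index labels. If the rows should be matched purely by position, resetting the indexes first also works; a sketch assuming both frames have the same number of rows:
import pandas as pd
# align on position instead of on index labels
df_combined = pd.concat([df1.reset_index(drop=True),
                         df2.reset_index(drop=True)], axis=1)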
A DataFrame with more than 10 rows is incorrectly sorted on Python 3.5.9 after converting it to JSON and back to a pandas DataFrame.
from pandas import DataFrame, read_json
columns = ['a', 'b', 'c']
data = [[1*i, 2*i, 3*i] for i in range(11)]
df = DataFrame(columns=columns, data=data)
print(df)
# a b c
# 0 0 0 0
# 1 1 2 3
# 2 2 4 6
# 3 3 6 9
# 4 4 8 12
# 5 5 10 15
# 6 6 12 18
# 7 7 14 21
# 8 8 16 24
# 9 9 18 27
# 10 10 20 30
new_df = read_json(df.to_json())
print(new_df)
# a b c
# 0 0 0 0
# 1 1 2 3
# 10 10 20 30 # this should be the last line
# 2 2 4 6
# 3 3 6 9
# 4 4 8 12
# 5 5 10 15
# 6 6 12 18
# 7 7 14 21
# 8 8 16 24
# 9 9 18 27
So the DataFrame created with read_json seems to sort its index like strings (1, 10, 2, 3, ...) instead of ints (1, 2, 3, ...).
Behaviour generated with Python 3.5.9 (default, Jan 4 2020, 04:09:01) (docker image python:3.5-stretch)
Everything seems to be working fine on my local machine (Python 3.8.1 (default, Dec 21 2019, 20:57:38)).
pandas==0.25.3 was used on both instances.
Is there a way to fix this without upgrading Python?
Use sort_values to sort the dataframe on column a, or sort_index to sort on the index. Something like below:
new_df = read_json(df.to_json())
#sort column
print(new_df.sort_values('a'))
#sort index
print(new_df.sort_index())
#output
a b c
0 0 0 0
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
6 6 12 18
7 7 14 21
8 8 16 24
9 9 18 27
10 10 20 30
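Alternatively, if you control the serialization side, orient='split' stores index, columns and data as separate lists and preserves their order on the round trip, so the lexicographic key sorting of the default orient never kicks in. A sketch:
from pandas import read_json
# round-trip via orient='split' keeps the original row order
new_df = read_json(df.to_json(orient='split'), orient='split')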
I have a dataframe as follows:
df1=pd.DataFrame(np.arange(24).reshape(6,-1),columns=['a','b','c','d'])
and I want to take blocks of 3 rows and place them side by side as columns.
NumPy reshape doesn't give the intended answer:
pd.DataFrame(np.reshape(df1.values,(3,-1)),columns=['a','b','c','d','e','f','g','h'])
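For reference, the plain reshape fails because it reads the 24 values in row-major order, so each output row is just two consecutive input rows glued together, instead of pairing row i of the first 3-row block with row i of the second. A quick check:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.arange(24).reshape(6, -1), columns=['a', 'b', 'c', 'd'])
print(np.reshape(df1.values, (3, -1)))
# [[ 0  1  2  3  4  5  6  7]
#  [ 8  9 10 11 12 13 14 15]
#  [16 17 18 19 20 21 22 23]]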
In [258]: df = pd.DataFrame(np.hstack(np.split(df1, 2)))
In [259]: df
Out[259]:
0 1 2 3 4 5 6 7
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
In [260]: import string
In [261]: df.columns = list(string.ascii_lowercase[:len(df.columns)])
In [262]: df
Out[262]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
Create a 3D array by reshape:
a = np.hstack(np.reshape(df1.values,(-1, 3, len(df1.columns))))
df = pd.DataFrame(a,columns=['a','b','c','d','e','f','g','h'])
print (df)
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
This uses the reshape/swapaxes/reshape idiom for rearranging sub-blocks of NumPy arrays.
In [26]: pd.DataFrame(df1.values.reshape(2,3,4).swapaxes(0,1).reshape(3,-1), columns=['a','b','c','d','e','f','g','h'])
Out[26]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
If you want a pure pandas solution:
df1.set_index([df1.index % 3, df1.index // 3])\
   .unstack()\
   .sort_index(level=1, axis=1)\
   .set_axis(list('abcdefgh'), axis=1, inplace=False)
Output:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
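Small caveat: set_axis(..., inplace=False) reflects the pandas API at the time of writing; on current pandas the inplace argument is gone and set_axis returns a new frame by default, so the equivalent would be (a sketch):
df1.set_index([df1.index % 3, df1.index // 3])\
   .unstack()\
   .sort_index(level=1, axis=1)\
   .set_axis(list('abcdefgh'), axis=1)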
EDIT
I found quite a nice solution and posted it below as an answer.
The result will look like the plot at the end of my answer below.
Some example data you can generate for this problem:
import pandas as pd
import numpy as np

codes = list('ABCDEFGH')
dates = pd.Series(pd.date_range('2013-11-01', '2014-01-31'))
dates = dates.append(dates)
dates = dates.sort_values()
df = pd.DataFrame({'amount': np.random.randint(1, 10, dates.size),
                   'col1': np.random.choice(codes, dates.size),
                   'col2': np.random.choice(codes, dates.size),
                   'date': dates})
resulting in:
In [55]: df
Out[55]:
amount col1 col2 date
0 1 D E 2013-11-01
0 5 E B 2013-11-01
1 5 G A 2013-11-02
1 7 D H 2013-11-02
2 5 E G 2013-11-03
2 4 H G 2013-11-03
3 7 A F 2013-11-04
3 3 A A 2013-11-04
4 1 E G 2013-11-05
4 7 D C 2013-11-05
5 5 C A 2013-11-06
5 7 H F 2013-11-06
6 1 G B 2013-11-07
6 8 D A 2013-11-07
7 1 B H 2013-11-08
7 8 F H 2013-11-08
8 3 A E 2013-11-09
8 1 H D 2013-11-09
9 3 B D 2013-11-10
9 1 H G 2013-11-10
10 6 E E 2013-11-11
10 6 F E 2013-11-11
11 2 G B 2013-11-12
11 5 H H 2013-11-12
12 5 F G 2013-11-13
12 5 G B 2013-11-13
13 8 H B 2013-11-14
13 6 G F 2013-11-14
14 9 F C 2013-11-15
14 4 H A 2013-11-15
.. ... ... ... ...
77 9 A B 2014-01-17
77 7 E B 2014-01-17
78 4 F E 2014-01-18
78 6 B E 2014-01-18
79 6 A H 2014-01-19
79 3 G D 2014-01-19
80 7 E E 2014-01-20
80 6 G C 2014-01-20
81 9 H G 2014-01-21
81 9 C B 2014-01-21
82 2 D D 2014-01-22
82 7 D A 2014-01-22
83 6 G B 2014-01-23
83 1 A G 2014-01-23
84 9 B D 2014-01-24
84 7 G D 2014-01-24
85 7 A F 2014-01-25
85 9 B H 2014-01-25
86 9 C D 2014-01-26
86 5 E B 2014-01-26
87 3 C H 2014-01-27
87 7 F D 2014-01-27
88 3 D G 2014-01-28
88 4 A D 2014-01-28
89 2 F A 2014-01-29
89 8 D A 2014-01-29
90 1 A G 2014-01-30
90 6 C A 2014-01-30
91 6 H C 2014-01-31
91 2 G F 2014-01-31
[184 rows x 4 columns]
I'd like to group by calendar week and by the value of col1, like this:
kw = lambda x: x.isocalendar()[1]
grouped = df.groupby([df['date'].map(kw), 'col1'], sort=False).agg({'amount': 'sum'})
resulting in:
In [58]: grouped
Out[58]:
amount
date col1
44 D 8
E 10
G 5
H 4
45 D 15
E 1
G 1
H 9
A 13
C 5
B 4
F 8
46 E 7
G 13
H 17
B 9
F 23
47 G 14
H 4
A 40
C 7
B 16
F 13
48 D 7
E 16
G 9
H 2
A 7
C 7
B 2
... ...
1 H 14
A 14
B 15
F 19
2 D 13
H 13
A 13
B 10
F 32
3 D 8
E 18
G 3
H 6
A 30
C 9
B 6
F 5
4 D 9
E 12
G 19
H 9
A 8
C 18
B 18
5 D 11
G 2
H 6
A 5
C 9
F 9
[87 rows x 1 columns]
Then I want a plot to be generated from it: calendar week and year (datetime) on the x-axis, and one bar for each of the grouped col1 values.
The problem I'm facing: I only have integers describing the calendar week (KW in the plot), but I somehow have to merge the date back in to get the ticks labelled by year as well. Furthermore, I can't just plot the grouped calendar weeks, because I need the items in the correct order (kw 47 and kw 48 of year 2013 have to be on the left side of kw 1, because that one belongs to 2014).
EDIT
I figured out from http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-barplot that grouped bars need to be columns instead of rows. So I thought about how to transform the data and found the pivot method, which turns out to be a great function. reset_index is needed to transform the multiindex into columns. At the end I fill NaNs with zero:
A = grouped.reset_index().pivot(index='date', columns='col1', values='amount').fillna(0)
transforms the data into:
col1 A B C D E F G H
date
1 4 31 0 0 0 18 13 8
2 0 12 13 22 1 17 0 8
3 3 10 4 13 12 8 7 6
4 17 0 10 7 0 25 7 4
5 7 0 7 9 8 6 0 7
44 0 0 2 11 7 0 0 2
45 9 3 2 14 0 16 21 2
46 0 14 7 2 17 13 11 8
47 5 13 0 15 19 7 5 10
48 15 8 12 2 20 4 7 6
49 20 0 0 18 22 17 11 0
50 7 11 8 6 5 6 13 10
51 8 26 0 0 5 5 16 9
52 8 13 7 5 4 10 0 11
which looks like the example data in the docs to be plotted in grouped bars:
A.plot(kind='bar')
This produces the grouped-bar plot, but now the x-axis is sorted from 1 to 52, which is actually wrong, because calendar week 52 belongs to year 2013 in this case... Any ideas on how to merge the real datetime back onto the calendar weeks and use it for the x-axis ticks?
I think resample('W') is a better way to do this - by default it groups by weeks ending on Sunday ('W' is the same as 'W-SUN') but you can specify whatever you want.
In your example, try this:
grouped = (df
    .groupby('col1')
    .apply(lambda g:                      # work on groups of col1
           g.set_index('date')[['amount']]
            .resample('W').agg('sum'))    # sum the amount field across weeks
    .unstack(level=0)                     # pivot the col1 index rows to columns
    .fillna(0)
)
grouped.columns = grouped.columns.droplevel()  # drop the 'col1' part of the multi-index column names
print(grouped)
grouped.plot(kind='bar')
which should print your data table and make a plot similar to yours, but with "real" date labels:
col1 A B C D E F G H
date
2013-11-03 18 0 9 0 8 0 0 4
2013-11-10 4 11 0 1 16 2 15 2
2013-11-17 10 14 19 8 13 6 9 8
2013-11-24 10 13 13 0 0 13 15 10
2013-12-01 6 3 19 8 8 17 8 12
2013-12-08 5 15 5 7 12 0 11 8
2013-12-15 8 6 11 11 0 16 6 14
2013-12-22 16 3 13 8 8 11 15 0
2013-12-29 1 3 6 10 7 7 17 15
2014-01-05 12 7 10 11 6 0 1 12
2014-01-12 13 0 17 0 23 0 10 12
2014-01-19 10 9 2 3 8 1 18 3
2014-01-26 24 9 8 1 19 10 0 3
2014-02-02 1 6 16 0 0 10 8 13
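If you prefer compact ISO-week tick labels over full dates, the DatetimeIndex produced by resample can be formatted before plotting. A sketch (assumes a platform whose strftime supports the ISO directives %G and %V, which CPython documents since 3.6):
# label each weekly bucket as "ISO-year-Wweek", e.g. "2013-W44"
grouped.index = grouped.index.strftime('%G-W%V')
grouped.plot(kind='bar')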
Okay, I'll answer the question myself, as I finally figured it out. The key is to not group by the calendar week alone (you would lose the information about the year) but rather to group by a string containing both the calendar week and the year.
Then change the layout (reshaping), as already mentioned in the question, by using pivot. The date will be the index. Use reset_index() to make the current date index a column and get an integer range as the index instead (which is then in the correct order for plotting: the lowest year/calendar week is index 0 and the highest year/calendar week is the highest integer).
Save the date column as a list in a new variable ticks and delete that column from the DataFrame. Now plot the bars and simply set the labels of the xticks to ticks. The complete solution is quite simple:
import pandas as pd
import numpy as np

codes = list('ABCDEFGH')
dates = pd.Series(pd.date_range('2013-11-01', '2014-01-31'))
dates = dates.append(dates)
dates = dates.sort_values()
df = pd.DataFrame({'amount': np.random.randint(1, 10, dates.size),
                   'col1': np.random.choice(codes, dates.size),
                   'col2': np.random.choice(codes, dates.size),
                   'date': dates})
kw = lambda x: x.isocalendar()[1]
kw_year = lambda x: str(x.year) + ' - ' + str(x.isocalendar()[1])
grouped = df.groupby([df['date'].map(kw_year), 'col1'], sort=False, as_index=False).agg({'amount': 'sum'})
A = grouped.pivot(index='date', columns='col1', values='amount').fillna(0).reset_index()
ticks = A.date.values.tolist()
del A['date']
ax = A.plot(kind='bar')
ax.set_xticklabels(ticks)
RESULT:
Add the week to 52 times the year, so that the weeks are ordered "by year". Then set the tick labels back to what you want (which might be nontrivial).
What you want is for the weeks to increase like so
nth week → (n+1)th week → (n+2)th week → etc.
but when you have a new year it instead falls by 51 (52 → 1).
To offset this, note that the year increases by one. So add the year's increase multiplied by 52 and the total change will be -51 + 52 = 1 as wanted.
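A sketch of that sort-key idea on the example df (week, year and order are hypothetical helper columns; the 52-weeks-per-year assumption is the same approximation made above, since ISO years can have 53 weeks):
# build a monotonically increasing key: each new year adds 52,
# compensating for the week number falling back from 52 to 1
df['week'] = df['date'].map(lambda d: d.isocalendar()[1])
df['year'] = df['date'].dt.year
df['order'] = df['year'] * 52 + df['week']
df = df.sort_values('order')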
KeyError: 'date'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>
     10 kw_year = lambda x: str(x.year) + ' - ' + str(x.isocalendar()[1])
     11 grouped = df.groupby([df['date'].map(kw_year), 'col1'], sort=False, as_index=False).agg({'amount': 'sum'})
---> 12 A = grouped.pivot(index='date', columns='col1', values='amount').fillna(0).reset_index()
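That KeyError suggests the mapped date key did not survive as a column of grouped (with as_index=False and a non-column grouping key, older pandas versions can drop it). A possible workaround, sketched here and untested against that exact pandas version, keeps the keys in the index and resets it afterwards:
grouped = (df.groupby([df['date'].map(kw_year), 'col1'], sort=False)
             .agg({'amount': 'sum'})
             .reset_index())
A = grouped.pivot(index='date', columns='col1', values='amount').fillna(0).reset_index()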