reshaping pandas DataFrame for export in a nested dict - python

Given the following DataFrame:
Category Area Country Code Function Last Name LanID Spend1 Spend2 Spend3 Spend4 Spend5
0 Bisc EE RU02,UA02 Mk Smith df3432 1.0 NaN NaN NaN NaN
1 Bisc EE RU02 Mk Bibs fdss34 1.0 NaN NaN NaN NaN
2 Bisc EE UA02,EURASIA Mk Crow fdsdr43 1.0 NaN NaN NaN NaN
3 Bisc WE FR31 Mk Ellis fdssdf3 1.0 NaN NaN NaN NaN
4 Bisc WE BE32,NL31 Mk Mower TOZ1720 1.0 NaN NaN NaN NaN
5 Bisc WE FR31,BE32,NL31 LKU Elan SKY8851 1.0 1.0 1.0 1.0 1.0
6 Bisc SE IT31 Mk Bobret 3dfsfg 1.0 NaN NaN NaN NaN
7 Bisc SE GR31 Mk Concept MOSGX009 1.0 NaN NaN NaN NaN
8 Bisc SE RU02,IT31,GR31,PT31,ES31 LKU Solar MSS5723 1.0 1.0 1.0 1.0 1.0
9 Bisc SE IT31,GR31,PT31,ES31 Mk Brix fdgd22 NaN 1.0 NaN NaN NaN
10 Choc CE RU02,CZ31,SK31,PL31,LT31 Fin Ocoser 43233d NaN 1.0 NaN NaN NaN
11 Choc CE DE31,AT31,HU31,CH31 Fin Smuth 4rewf NaN 1.0 NaN NaN NaN
12 Choc CE BG31,RO31,EMA Fin Momocs hgghg2 NaN 1.0 NaN NaN NaN
13 Choc WE FR31,BE32,NL31 Fin Bruntly ffdd32 NaN NaN NaN NaN 1.0
14 Choc WE FR31,BE32,NL31 Mk Ofer BROGX011 NaN 1.0 1.0 NaN NaN
15 Choc WE FR31,BE32,NL31 Mk Hem NZJ3189 NaN NaN NaN 1.0 1.0
16 G&C NE UA02,SE31 Mk Cre ORY9499 1.0 NaN NaN NaN NaN
17 G&C NE NO31 Mk Qlyo XVM7639 1.0 NaN NaN NaN NaN
18 G&C NE GB31,NO31,SE31,IE31,FI31 Mk Omny LOX1512 NaN 1.0 1.0 NaN NaN
I would like to get it exported into a nested Dict with the below structure:
{RU02: {Bisc: {EE: {Mkt: {Spend1: {df3432: Smith}
{fdss34: Bibs}
{Bisc: {SE: {LKU: {Spend1: {MSS5723: Solar}
{Spend2: {MSS5723: Solar}
{Spend3: {MSS5723: Solar}
{Spend4: {MSS5723: Solar}
{Spend5: {MSS5723: Solar}
{Choc: {CE: {Fin: {Spend2: {43233d: Ocoser}
.....
{UA02: {Bisc: {EE: {Mkt: {Spend1: {df3432: Smith}
{fdsdr43: Crow}
{G&C: {NE: {Mkt: {Spend1: {ORY9499: Cre}
.....
So essentially, in this Dict I'm trying to track, for each Country Code, the list of Last Names + LanIDs per Spend category (Spend1, Spend2, etc.) and their attributes (Function, Category, Area).
The DataFrame is not very large (less than 200 rows), but it contains almost all types of combinations between Category/Area/Country Code as well as Last Names and their Spend categories (many-to-many).
My challenge is that I'm unable to figure out how to clearly conceptualise the steps I need to take in order to prepare the DataFrame properly for export to a Dict.
What I figured out so far is that I would need:
1. a way to slice the contents of the "Country Code" column based on the "," separator: DONE
2. create new columns based on unique Country Codes, and have 1 in each row where that Country Code is present: DONE
3. set the index of the DataFrame recursively to each of the newly added columns
4. move each row into a new DataFrame for each Country Code where there is data
5. export all the new DataFrames to Dicts, and then merge them
Not sure if steps 3-5 are the best way to go about this though, as I'm still having difficulty understanding how pd.DataFrame.to_dict should be configured for my case (if that's even possible)...
Highly appreciate your help on the coding side, but also in briefly explaining your thought process for each stage.
Here is how far I got on my own:
# keeping track of initial order of columns
initialOrder = list(df.columns.values)

# split the Country Code by ","
CCodeNoCommas = [item for items in df['Country Code'].values for item in items.split(",")]

# add only the UNIQUE Country Codes -via set- as new columns in the DataFrame,
# with NaN for row values
df = pd.concat([df, pd.DataFrame(columns=list(set(CCodeNoCommas)))])

# reordering columns to have the newly added ones at the end
reordered = initialOrder + [c for c in df.columns if c not in initialOrder]
df = df[reordered]

# replace NaN with 1 in the newly added columns (Country Codes), where the same Country Code
# exists in the initial column "Country Code"; do this for each row
CCodeUniqueOnly = set(CCodeNoCommas)
for c in CCodeUniqueOnly:
    CCodeIsPresent_rowIndex = df.index[df['Country Code'].str.contains(c)]
    #print (CCodeIsPresent_rowIndex)
    df.loc[CCodeIsPresent_rowIndex, c] = 1

# no clue what to do next ??

If you re-shape your dataframe into the right format, you can use the handy recursive dictionary function from the answer by @DSM to this question. The goal is to get a dataframe where each row contains only one "entry" - a unique combination of the columns you're interested in.
First, you need to split your country code strings into lists:
df['Country Code'] = df['Country Code'].str.split(',')
And then expand those lists into multiple rows (using @RomanPekar's technique from this question):
s = df.apply(lambda x: pd.Series(x['Country Code']), axis=1) \
      .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)
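As a side note, on newer pandas (0.25 or later) the split-and-expand step can be done in one go with DataFrame.explode; a minimal sketch, assuming the split column from above:
df['Country Code'] = df['Country Code'].str.split(',')
df = df.explode('Country Code').reset_index(drop=True)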
Then you can reshape the Spend* columns into rows, where there's a row for each Spend* column where the value is not nan.
spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
       .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
                               .reset_index(level=1)['level_1'])) \
       .reset_index(drop=True)
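As an alternative sketch for the same reshape (instead of the groupby/stack/join above), assuming each Spend* cell is either 1.0 or NaN, you could melt the Spend columns into a single 'level_1' column and drop the empty entries:
id_cols = [c for c in df.columns if c not in spend_cols]
df = (df.melt(id_vars=id_cols, value_vars=spend_cols,
              var_name='level_1', value_name='spend_flag')
        .dropna(subset=['spend_flag'])
        .drop(columns='spend_flag')
        .reset_index(drop=True))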
Now you have a dataframe where each level in your nested dictionary is its own column. So you can use this recursive dictionary function:
def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
    return d
And apply it only to the columns you want to produce the nested dictionary, listed in the order in which they should nest:
cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])
That should produce your desired result.
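As a quick sanity check (the exact keys come from the values in your Function column, so 'Mk' rather than the 'Mkt' in the sketch above), drilling into the result for the sample data should look roughly like:
d['RU02']['Bisc']['EE']['Mk']['Spend1']
# {'df3432': 'Smith', 'fdss34': 'Bibs'}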
All in one piece:
df['Country Code'] = df['Country Code'].str.split(',')

s = df.apply(lambda x: pd.Series(x['Country Code']), axis=1) \
      .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)

spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
       .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
                               .reset_index(level=1)['level_1'])) \
       .reset_index(drop=True)

def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
    return d

cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])

Related

Pandas, how to calculate delta between one cell and another in different rows

I have the following frame:
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2
123,45,,,
123,,46,,
123,,47,,
123,,48,,
123,,49,,
123,,51,,
124,45,,,
124,,46,,
124,,47,,
124,,48,,
124,,49,,
124,,51,,
I'd like to add a DELTA column that is (EVENT2TIME - EVENT1TIME):
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2, DELTA
123,45,,,,
123,,46,,,1
123,,47,,,2
123,,48,,,3
123,,49,,,4
123,,51,,,6
124,45,,,,
124,,46,,,1
124,,47,,,2
124,,48,,,3
124,,49,,,4
124,,51,,,6
I think the first thing to do is to copy the value from the row where EVENT1TIME is populated into the other instances of that USERID. But I suspect there may be a better way.
I am making some assumptions:
You want to calculate the difference between column EVENT2TIME and first row of EVENT1TIME
You want to store the results into DELTA
You can do this as follows:
import pandas as pd
df = pd.read_csv('abc.txt')
print (df)
df['DELTA'] = df.iloc[:,2] - df.iloc[0,1]
print (df)
The output of this will be:
USERID EVENT1TIME EVENT2TIME MISC1 MISC2 DELTA
0 123 45.0 NaN NaN NaN NaN
1 123 NaN 46.0 NaN NaN 1.0
2 123 NaN 47.0 NaN NaN 2.0
3 123 NaN 48.0 NaN NaN 3.0
4 123 NaN 49.0 NaN NaN 4.0
5 123 NaN 51.0 NaN NaN 6.0
If you know EVENT1TIME is always and only in the first row, just store it as a variable and subtract it.
val = df.EVENT1TIME[0]
df['DELTA'] = df.EVENT2TIME - val
If you have multiple values every so often in EVENT1TIME, use some logic to back or forward fill all the empty rows for EVENT1TIME. This fill is not stored in the final output df.
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.ffill() # forward fill (down) all nan values
# OR
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.bfill() # back fill (up) all nan values
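If each USERID has its own EVENT1TIME row (as in the sample), a per-user fill keeps values from leaking between users; a minimal sketch, assuming the column names parse cleanly from the CSV:
first_event = df.groupby('USERID')['EVENT1TIME'].ffill()  # fill within each user only
df['DELTA'] = df['EVENT2TIME'] - first_event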
EDIT: Keeping this for continuity despite how hacky it is.
import numpy as np

locations = list(df[~np.isnan(df.EVENT1TIME)].index)
vals = df.EVENT1TIME.loc[locations]            # all EVENT1TIME values
locations.append(df.EVENT1TIME.index[-1] + 1)  # last row index + 1
df['DELTA'] = np.nan
last_loc = locations[0]
for idx, next_loc in enumerate(locations[1:]):
    temp = df.loc[last_loc:next_loc - 1]
    df.loc[last_loc:next_loc - 1, 'DELTA'] = temp.EVENT2TIME - vals[last_loc]
    last_loc = next_loc

pd.json_normalize() gives "str object has no attribute 'values'"

I manually create a DataFrame:
import pandas as pd
df_articles1 = pd.DataFrame({'Id': [4, 5, 8, 9],
                             'Class': [
                                 {'encourage': 1, 'contacting': 1},
                                 {'cardinality': 16, 'subClassOf': 3},
                                 {'get-13.5.1': 1},
                                 {'cardinality': 12, 'encourage': 1}
                             ]})
I export it to a csv file to import after splitting it:
df_articles1.to_csv(f"""{path}articles_split.csv""", index = False, sep=";")
I can split it with pd.json_normalize():
df_articles1 = pd.json_normalize(df_articles1['Class'])
I import its csv file to a DataFrame:
df_articles2 = pd.read_csv(f"""{path}articles_split.csv""", sep=";")
But this fails:
pd.json_normalize(df_articles2['Class'])
AttributeError: 'str' object has no attribute 'values'
That is because when you save with to_csv(), the data in your 'Class' column is stored as a string, not as a dictionary/JSON. So after loading that saved data:
df_articles2 = pd.read_csv(f"""{path}articles_split.csv""", sep=";")
convert it back to its original form using the eval() and apply() methods:
df_articles2['Class'] = df_articles2['Class'].apply(lambda x: eval(x))
Finally:
resultdf = pd.json_normalize(df_articles2['Class'])
If you now print resultdf, you will get your desired output.
While the accepted answer works, using eval is bad practice.
To parse a string column that looks like JSON/dict, use one of the following options (last one is best, if possible).
ast.literal_eval (better)
import ast
objects = df2['Class'].apply(ast.literal_eval)
normed = pd.json_normalize(objects)
df2[['Id']].join(normed)
# Id encourage contacting cardinality subClassOf get-13.5.1
# 0 4 1.0 1.0 NaN NaN NaN
# 1 5 NaN NaN 16.0 3.0 NaN
# 2 8 NaN NaN NaN NaN 1.0
# 3 9 1.0 NaN 12.0 NaN NaN
json.loads (even better)
import json
objects = df2['Class'].apply(json.loads)
normed = pd.json_normalize(objects)
df2[['Id']].join(normed)
# Id encourage contacting cardinality subClassOf get-13.5.1
# 0 4 1.0 1.0 NaN NaN NaN
# 1 5 NaN NaN 16.0 3.0 NaN
# 2 8 NaN NaN NaN NaN 1.0
# 3 9 1.0 NaN 12.0 NaN NaN
If the strings are single quoted, use str.replace to convert them to double quotes (and thus valid JSON) before applying json.loads:
objects = df2['Class'].str.replace("'", '"').apply(json.loads)
normed = pd.json_normalize(objects)
df2[['Id']].join(normed)
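As a variant on the same idea (an assumption about your file layout, not part of the original answer), the parsing can also happen at load time via read_csv's converters argument, reusing the path variable from the question:
import ast
import pandas as pd

df2 = pd.read_csv(f"{path}articles_split.csv", sep=";",
                  converters={'Class': ast.literal_eval})  # parse each 'Class' cell while reading
resultdf = df2[['Id']].join(pd.json_normalize(df2['Class']))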
pd.json_normalize before pd.to_csv (recommended)
If possible, when you originally save to CSV, just save the normalized JSON (not raw JSON objects):
df1 = df1[['Id']].join(pd.json_normalize(df1['Class']))
df1.to_csv('df1_normalized.csv', index=False, sep=';')
# Id;encourage;contacting;cardinality;subClassOf;get-13.5.1
# 4;1.0;1.0;;;
# 5;;;16.0;3.0;
# 8;;;;;1.0
# 9;1.0;;12.0;;
This is a more natural CSV workflow (rather than storing/loading object blobs):
df2 = pd.read_csv('df1_normalized.csv', sep=';')
# Id encourage contacting cardinality subClassOf get-13.5.1
# 0 4 1.0 1.0 NaN NaN NaN
# 1 5 NaN NaN 16.0 3.0 NaN
# 2 8 NaN NaN NaN NaN 1.0
# 3 9 1.0 NaN 12.0 NaN NaN

python pandas reshape - time series table

I need to reshape the time series table below, from format A to format B.
A
no,A,B,B_sub
1,start,val_s,val_s_sub
2,study,val_st,val_st_sub
3,work,val_w,val_w_sub
4,end,val_e,val_e_sub
5,start,val_s1,val_s1_sub
6,end,val_e1,val_e1_sub
7,start,val_s2,val_s2_sub
8,work,val_w1,val_w1_sub
9,end,val_e2,val_e2_sub
B
,start,,study,,work,,end,
,B,B_sub,B,B_sub,B,B_sub,B,B_sub
4-1,val_s,val_s_sub,val_st,val_st_sub,val_w,val_w_sub,val_e,val_e_sub
6-5,val_s1,val_s1_sub,,,,,val_e1,val_e1_sub
9-7,val_s2,val_s2_sub,,,val_w1,val_w1_sub,val_e2,val_e2_sub
I tried to use the pivot table function of the pandas library, but there is no common string to use as an index in my table.
Can I get a hint? I'm lost; please help.
Does this get you close enough?
df_a['grp'] = (df_a['A'] == 'start').cumsum()
df_a.set_index(['grp','A']).unstack('A')
Output:
no B B_sub
A end start study work end start study work end start study work
grp
1 4.0 1.0 2.0 3.0 val_e val_s val_st val_w val_e_sub val_s_sub val_st_sub val_w_sub
2 6.0 5.0 NaN NaN val_e1 val_s1 NaN NaN val_e1_sub val_s1_sub NaN NaN
3 9.0 7.0 NaN 8.0 val_e2 val_s2 NaN val_w1 val_e2_sub val_s2_sub NaN val_w1_sub
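The key idea in the first line is that (df_a['A'] == 'start') is a boolean Series and cumsum() turns it into a running counter that ticks up at every 'start' row, so each start..end block gets its own group id; for the nine sample rows that counter looks like:
(df_a['A'] == 'start').cumsum()
# 1 1 1 1 2 2 3 3 3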
Going a little further with reshaping and renaming:
df_r = df_a.set_index(['grp','A']).unstack('A')
steps = df_r[('no', 'end')].astype(int).astype(str).str.cat(df_r[('no', 'start')].astype(int).astype(str), sep='-')
df_r.set_index(steps)[['B', 'B_sub']].swaplevel(0,1, axis=1).sort_index(level=0, axis=1)
Output:
A end start study work
B B_sub B B_sub B B_sub B B_sub
(no, end)
4-1 val_e val_e_sub val_s val_s_sub val_st val_st_sub val_w val_w_sub
6-5 val_e1 val_e1_sub val_s1 val_s1_sub NaN NaN NaN NaN
9-7 val_e2 val_e2_sub val_s2 val_s2_sub NaN NaN val_w1 val_w1_sub

How to load a text file of data with many commented rows, into pandas?

I am trying to read a delimited text file into a dataframe in python. The delimiter is not being identified when I use pd.read_table. If I explicitly set sep = ' ', I get an error: Error tokenizing data. C error. Notably, the defaults work when I use np.loadtxt().
Example:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comment = '%',
header = None)
0
0 1850 1 -0.777 0.412 NaN NaN...
1 1850 2 -0.239 0.458 NaN NaN...
2 1850 3 -0.426 0.447 NaN NaN...
3 1850 4 -0.680 0.367 NaN NaN...
4 1850 5 -0.687 0.298 NaN NaN...
If I set sep = ' ', I get another error:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comment = '%',
header = None,
sep = ' ')
ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58
Looking up this error, people suggest using header = None (already done) and setting sep explicitly, but that is causing the problem: Python Pandas Error tokenizing data. I looked up line 78 and can't see any problems. If I set error_bad_lines=False I get an empty df, suggesting there is a problem with every entry.
Notably this works when I use np.loadtxt():
pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comments = '%'))
0 1 2 3 4 5 6 7 8 9 10 11
0 1850.0 1.0 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850.0 2.0 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850.0 3.0 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
3 1850.0 4.0 -0.680 0.367 NaN NaN NaN NaN NaN NaN NaN NaN
4 1850.0 5.0 -0.687 0.298 NaN NaN NaN NaN NaN NaN NaN NaN
This suggests to me that there isn't something wrong with the file, but rather with how I am calling pd.read_table(). I looked through the documentation for np.loadtxt() in the hope of setting the sep to the same value, but that just shows: delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).
I'd prefer to be able to import this as a pd.DataFrame, setting the names, rather than having to import as a matrix and then convert to pd.DataFrame.
What am I getting wrong?
This one is quite tricky. Please try the snippet below:
import pandas as pd
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
df = pd.read_csv(url,
                 sep='\s+',
                 comment='%',
                 usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.', 'A.Anomaly',
                        'A.Unc.', '5y.Anomaly', '5y.Unc.', '10y.Anomaly', '10y.Unc.',
                        '20y.Anomaly', '20y.Unc.'))
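For what it's worth, the separator is the crux here: sep=' ' treats every single space as a field boundary, so the file's runs of padding spaces create a ragged field count, while the regex '\s+' collapses any whitespace run into one delimiter, matching np.loadtxt's default behaviour. An equivalent sketch, assuming a pandas version where the flag still exists:
df = pd.read_csv(url, delim_whitespace=True, comment='%', header=None)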
The issue is that the file has 77 rows of commented text for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'.
Two of those rows are headers.
There's a bunch of data, then two more headers, and then a new set of data for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'.
This solution separates the two tables in the file into separate dataframes.
It is not as nice as the other answer, but the data is properly separated into different dataframes.
The headers were a pain; it would probably be easier to manually create a custom header and skip the lines of code for separating the headers from the text.
The important point is separating the air and water data.
import requests
import pandas as pd
import math
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# specify the data from the ranges in the file
air_header1 = data[74].split() # not used
air_header2 = [v.strip() for v in data[75].split(',')]
# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]
h2o_header1 = data[2129].split() # not used
h2o_header2 = [v.strip() for v in data[2130].split(',')]
# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)
Without the header code
Simplify the code by using a manual header list.
import pandas as pd
import requests
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
'Annual_Anomaly', 'Annual_Unc.',
'Five-year_Anomaly', 'Five-year_Unc.',
'Ten-year_Anomaly', 'Ten-year_Unc.',
'Twenty-year_Anomaly', 'Twenty-year_Unc.']
# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)
air
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
h2o
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.724 0.370 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.221 0.430 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.443 0.419 NaN NaN NaN NaN NaN NaN NaN NaN
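One caveat worth adding (an assumption about the next step, not part of the original answer): because the rows were built from split strings, every column in air and h2o ends up as object dtype, so a numeric conversion is usually wanted before doing math on them:
air = air.apply(pd.to_numeric, errors='coerce')  # convert all columns to numbers, NaN where empty
h2o = h2o.apply(pd.to_numeric, errors='coerce')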

Python - how to pass a result from group by to Pivot?

My goal was to apply pivot function to a data frame that contains duplicate records. I solved it by adding a unique column to the data frame:
my_df['id_column'] = range(1, len(my_df.index)+1)
df_pivot = my_df.pivot(index ='id_column', columns = 'type', values = 'age_16_18').fillna(0).astype(int)
I want to figure out how to apply pivot to the data frame without deleting duplicates or using pivot_table, by first grouping by multiple columns and then passing the result to the pivot function. I'm not sure how to pass the result of the grouping to pivot.
year category state_name type is_state gender age_16_18 age_18_30
0 2001 Foreigners CA Convicts 0 M 8 5
1 2001 Indians NY Convicts 0 F 5 2
2 2005 Foreigners NY Others 1 M 0 9
3 2009 Indians NJ Detenus 0 F 7 0
It's not entirely clear what you're attempting but see if you can get some inspiration from the following approaches. What columns are you wishing to group by?
import pandas
my_df = pandas.DataFrame({'year': [2001, 2001, 2005, 2009],
                          'category': ['Foreigners', 'Indians', 'Foreigners', 'Indians'],
                          'state_name': ['CA', 'NY', 'NY', 'NJ'],
                          'type': ['Convicts', 'Convicts', 'Others', 'Detenus'],
                          'is_state': [0, 0, 1, 0],
                          'gender': ['M', 'F', 'M', 'F'],
                          'age_16_18': [8, 5, 0, 7],
                          'age_18_30': [5, 2, 9, 0]},
                         columns=['year', 'category', 'state_name', 'type',
                                  'is_state', 'gender', 'age_16_18', 'age_18_30'])
>>> my_df.pivot( columns = 'type', values = 'age_16_18' )
type Convicts Detenus Others
0 8.0 NaN NaN
1 5.0 NaN NaN
2 NaN NaN 0.0
3 NaN 7.0 NaN
>>> my_df['key'] = my_df.category.str.cat(my_df.gender)
>>> my_df.pivot( index='key', columns = 'type', values = 'age_16_18' )
type Convicts Detenus Others
key
ForeignersM 8.0 NaN 0.0
IndiansF 5.0 7.0 NaN
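To address the "group by, then pivot" part of the question directly, one hedged sketch is to aggregate to a single row per index/column pair first and then pivot the aggregated frame (which is essentially what pivot_table does in one call); choose the grouping columns that should form the pivot's index and columns:
agg = my_df.groupby(['state_name', 'type'], as_index=False)['age_16_18'].sum()  # one row per state/type
out = agg.pivot(index='state_name', columns='type', values='age_16_18').fillna(0).astype(int)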
