My goal was to apply the pivot function to a DataFrame that contains duplicate records. I solved it by adding a unique column to the DataFrame:
my_df['id_column'] = range(1, len(my_df.index)+1)
df_pivot = my_df.pivot(index ='id_column', columns = 'type', values = 'age_16_18').fillna(0).astype(int)
Now I want to figure out how to apply pivot to the DataFrame without deleting duplicates or using pivot_table, by first grouping by multiple columns and then passing the result to pivot. I'm not sure how to pass the grouped result to pivot.
year category state_name type is_state gender age_16_18 age_18_30
0 2001 Foreigners CA Convicts 0 M 8 5
1 2001 Indians NY Convicts 0 F 5 2
2 2005 Foreigners NY Others 1 M 0 9
3 2009 Indians NJ Detenus 0 F 7 0
It's not entirely clear what you're attempting, but see if you can get some inspiration from the following approaches. Which columns do you wish to group by?
import pandas

my_df = pandas.DataFrame(
    {'year': [2001, 2001, 2005, 2009],
     'category': ['Foreigners', 'Indians', 'Foreigners', 'Indians'],
     'state_name': ['CA', 'NY', 'NY', 'NJ'],
     'type': ['Convicts', 'Convicts', 'Others', 'Detenus'],
     'is_state': [0, 0, 1, 0],
     'gender': ['M', 'F', 'M', 'F'],
     'age_16_18': [8, 5, 0, 7],
     'age_18_30': [5, 2, 9, 0]},
    columns=['year', 'category', 'state_name', 'type', 'is_state', 'gender', 'age_16_18', 'age_18_30'])
>>> my_df.pivot( columns = 'type', values = 'age_16_18' )
type Convicts Detenus Others
0 8.0 NaN NaN
1 5.0 NaN NaN
2 NaN NaN 0.0
3 NaN 7.0 NaN
>>> my_df['key'] = my_df.category.str.cat(my_df.gender)
>>> my_df.pivot( index='key', columns = 'type', values = 'age_16_18' )
type Convicts Detenus Others
key
ForeignersM 8.0 NaN 0.0
IndiansF 5.0 7.0 NaN
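To address the "group first, then pivot" part directly: once you aggregate, each index/column pair becomes unique, so pivot no longer complains about duplicates. A minimal sketch (grouping by state_name and type here is only an assumption; substitute whichever columns you need):
# Aggregate first so each (state_name, type) pair is unique, then pivot the aggregated result.
agg = my_df.groupby(['state_name', 'type'], as_index=False)['age_16_18'].sum()
wide = agg.pivot(index='state_name', columns='type', values='age_16_18').fillna(0).astype(int)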
I have a dataframe like below:
import pandas as pd

data = {'ID': [1,2,3,4,5],
'NODE_ID': [10,10,20,15,20],
'TYPE': ['a','a','b','a','b'],
'DATE': ['2021-12-02','2021-12-02','2021-12-02','2021-12-03','2021-12-02'],
'HOUR': [0,0,3,2,3],
'EVENTS_COUNT': [10,15,10,21,12]
}
df = pd.DataFrame(data,columns=['ID','NODE_ID', 'TYPE', 'DATE', 'HOUR', 'EVENTS_COUNT'])
There are two different TYPE values, a and b. I want to create a DataFrame out of this with the sum of EVENTS_COUNT for each TYPE (a and b), grouped by NODE_ID, DATE and HOUR.
The expected output is
Can you please suggest how to do this?
EDIT:
Updated the expected output.
Try with pivot_table:
output = (df.pivot_table(values="EVENTS_COUNT", index=["NODE_ID","DATE","HOUR"], columns="TYPE", aggfunc="sum")
.fillna(0)
.add_prefix("EVENTS_COUNT_")
.reset_index()
.rename_axis(None, axis=1))
>>> output
TYPE NODE_ID DATE HOUR EVENTS_COUNT_a EVENTS_COUNT_b
0 10 2021-12-02 0 25.0 0.0
1 15 2021-12-03 2 21.0 0.0
2 20 2021-12-02 3 0.0 22.0
If there is only one TYPE for each combination of ["NODE_ID", "DATE", "HOUR"], and you would like this as a separate column, you can do:
output["TYPE"] = output.filter(like="EVENTS_COUNT_").idxmax(axis=1).str.lstrip("EVENTS_COUNT_")
>>> output
NODE_ID DATE HOUR EVENTS_COUNT_a EVENTS_COUNT_b TYPE
0 10 2021-12-02 0 25.0 0.0 a
1 15 2021-12-03 2 21.0 0.0 a
2 20 2021-12-02 3 0.0 22.0 b
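Essentially the same result can also be sketched with groupby + unstack, which may read more naturally as "sum per group, then spread TYPE into columns" (aside from integer vs. float fill values, this mirrors the pivot_table call above):
# Sum EVENTS_COUNT per group, then move TYPE from the index into columns.
output = (df.groupby(["NODE_ID", "DATE", "HOUR", "TYPE"])["EVENTS_COUNT"].sum()
          .unstack("TYPE", fill_value=0)
          .add_prefix("EVENTS_COUNT_")
          .reset_index()
          .rename_axis(None, axis=1))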
Given the following DataFrame:
Category Area Country Code Function Last Name LanID Spend1 Spend2 Spend3 Spend4 Spend5
0 Bisc EE RU02,UA02 Mk Smith df3432 1.0 NaN NaN NaN NaN
1 Bisc EE RU02 Mk Bibs fdss34 1.0 NaN NaN NaN NaN
2 Bisc EE UA02,EURASIA Mk Crow fdsdr43 1.0 NaN NaN NaN NaN
3 Bisc WE FR31 Mk Ellis fdssdf3 1.0 NaN NaN NaN NaN
4 Bisc WE BE32,NL31 Mk Mower TOZ1720 1.0 NaN NaN NaN NaN
5 Bisc WE FR31,BE32,NL31 LKU Elan SKY8851 1.0 1.0 1.0 1.0 1.0
6 Bisc SE IT31 Mk Bobret 3dfsfg 1.0 NaN NaN NaN NaN
7 Bisc SE GR31 Mk Concept MOSGX009 1.0 NaN NaN NaN NaN
8 Bisc SE RU02,IT31,GR31,PT31,ES31 LKU Solar MSS5723 1.0 1.0 1.0 1.0 1.0
9 Bisc SE IT31,GR31,PT31,ES31 Mk Brix fdgd22 NaN 1.0 NaN NaN NaN
10 Choc CE RU02,CZ31,SK31,PL31,LT31 Fin Ocoser 43233d NaN 1.0 NaN NaN NaN
11 Choc CE DE31,AT31,HU31,CH31 Fin Smuth 4rewf NaN 1.0 NaN NaN NaN
12 Choc CE BG31,RO31,EMA Fin Momocs hgghg2 NaN 1.0 NaN NaN NaN
13 Choc WE FR31,BE32,NL31 Fin Bruntly ffdd32 NaN NaN NaN NaN 1.0
14 Choc WE FR31,BE32,NL31 Mk Ofer BROGX011 NaN 1.0 1.0 NaN NaN
15 Choc WE FR31,BE32,NL31 Mk Hem NZJ3189 NaN NaN NaN 1.0 1.0
16 G&C NE UA02,SE31 Mk Cre ORY9499 1.0 NaN NaN NaN NaN
17 G&C NE NO31 Mk Qlyo XVM7639 1.0 NaN NaN NaN NaN
18 G&C NE GB31,NO31,SE31,IE31,FI31 Mk Omny LOX1512 NaN 1.0 1.0 NaN NaN
I would like to get it exported into a nested Dict with the below structure:
{RU02: {Bisc: {EE: {Mkt: {Spend1: {df3432: Smith}
{fdss34: Bibs}
{Bisc: {SE: {LKU: {Spend1: {MSS5723: Solar}
{Spend2: {MSS5723: Solar}
{Spend3: {MSS5723: Solar}
{Spend4: {MSS5723: Solar}
{Spend5: {MSS5723: Solar}
{Choc: {CE: {Fin: {Spend2: {43233d: Ocoser}
.....
{UA02: {Bisc: {EE: {Mkt: {Spend1: {df3432: Smith}
{ffdsdr43: Crow}
{G&C: {NE: {Mkt: {Spend1: {ORY9499: Cre}
.....
So essentially, in this dict I'm trying to track, for each Country Code, the list of Last Names + LanIDs per Spend category (Spend1, Spend2, etc.) together with their attributes (Function, Category, Area).
The DataFrame is not very large (fewer than 200 rows), but it contains almost all types of combinations between Category/Area/Country Code as well as Last Names and their Spend categories (many-to-many).
My challenge is that I'm unable to figure out how to clearly conceptualise the steps I need to take in order to prepare the DataFrame properly for export to a dict.
What I have figured out so far is that I would need:
a way to slice the contents of the "Country Code" column based on the "," separator: DONE
create new columns based on unique Country Codes, and put 1 in each row where that country code is present: DONE
set the index of the DataFrame recursively to each of the newly added columns
move into a new DataFrame the rows for each Country Code where there is data
export all the new DataFrames to Dicts, and then merge them
Not sure if the last three steps are the best way to go about this though, as I'm still having difficulty understanding how pd.DataFrame.to_dict should be configured for my case (if that's even possible).
I'd highly appreciate your help on the coding side, but also a brief explanation of your thought process for each stage.
Here is how far I got on my own:
#keeping track of initial order of columns
initialOrder = list(df.columns.values)
# split the Country Code by ","
CCodeNoCommas = [item for items in df['Country Code'].values for item in items.split(",")]
# add only the UNIQUE Country Codes -via set- as new columns in the DataFrame,
#with NaN for row values
df = pd.concat([df,pd.DataFrame(columns=list(set(CCodeNoCommas)))])
# reordering columns to have the newly added ones at the end
reordered = initialOrder + [c for c in df.columns if c not in initialOrder]
df = df[reordered]
# replace NaN with 1 in the newly added columns (Country Codes), where the same Country code
# exists in the initial column "Country Code"; do this for each row
CCodeUniqueOnly = set(CCodeNoCommas)
for c in CCodeUniqueOnly:
    CCodeIsPresent_rowIndex = df.index[df['Country Code'].str.contains(c)]
    # print(CCodeIsPresent_rowIndex)
    df.loc[CCodeIsPresent_rowIndex, c] = 1

# no clue what to do next ??
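Side note (not part of the original attempt): on any reasonably recent pandas, steps 1-2 above can be collapsed into a single str.get_dummies call, roughly:
# str.get_dummies splits each string on the separator and builds one 0/1 column per unique code
dummies = df['Country Code'].str.get_dummies(sep=',')
df = pd.concat([df, dummies], axis=1)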
If you re-shape your dataframe into the right format, you can use the handy recursive dictionary function from the answer by @DSM to this question. The goal is to get a dataframe where each row contains only one "entry": a unique combination of the columns you're interested in.
First, you need to split your country code strings into lists:
df['Country Code'] = df['Country Code'].str.split(',')
And then expand those lists into multiple rows (using @RomanPekar's technique from this question):
s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
.stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)
Then you can reshape the Spend* columns into rows, so that there is a row for each Spend* column whose value is not NaN.
spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
.apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
.reset_index(level=1)['level_1'])) \
.reset_index(drop=True)
Now you have a dataframe where each level in your nested dictionary is its own column. So you can use this recursive dictionary function:
def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
    return d
And apply it only to the columns you want to produce the nested dictionary, listed in the order in which they should nest:
cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])
That should produce your desired result.
All in one piece:
df['Country Code'] = df['Country Code'].str.split(',')
s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
.stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)
spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
.apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
.reset_index(level=1)['level_1'])) \
.reset_index(drop=True)
def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
    return d
cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])
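On more recent pandas (0.25 or later, for DataFrame.explode), the two reshaping steps can also be sketched with explode and melt, starting again from the original frame; note the melted column is named 'Spend' here instead of 'level_1':
# One row per country code, then one row per non-null Spend* column.
spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
id_cols = ['Country Code', 'Category', 'Area', 'Function', 'Last Name', 'LanID']
long_df = (df.assign(**{'Country Code': df['Country Code'].str.split(',')})
           .explode('Country Code')
           .melt(id_vars=id_cols, value_vars=spend_cols, var_name='Spend')
           .dropna(subset=['value'])
           .drop(columns='value'))
cols = ['Country Code', 'Category', 'Area', 'Function', 'Spend', 'LanID', 'Last Name']
d = recur_dictify(long_df[cols])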
So I have a DataFrame of wine data:
wines_dict = {
'Appellation': list(wine_appellations),
'Ratings': list(wine_ratings),
'Region': list(wine_regions),
'Name': list(wine_names),
'Varietal': list(wine_varietals),
'WineType': list(wine_wine_types),
'RetailPrice': list(wine_retail_prices)
}
wines_df = pd.DataFrame(
data = wines_dict,
columns =[
'Region',
'Ratings',
'Appellation',
'Name',
'Varietal',
'WineType',
'RetailPrice'
]
)
I am trying to slice it using wines_df.where((wines_df['Ratings'] > 95) & (~pd.isnull(wines_df['Ratings']))), but it still returns rows with NaN ratings:
0 NaN
1 NaN
2 NaN
3 NaN
4 97.0
5 98.0
6 NaN
How can I slice it so that it returns only the non-null values that are greater than 95?
A simple boolean slice like this should give you the desired output:
wines_df[(wines_df['Ratings'] > 95) & (wines_df['Ratings'].notnull())]
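The reason .where still shows NaN rows is that DataFrame.where keeps the original shape and masks non-matching rows with NaN, whereas boolean indexing drops them. A quick illustration:
masked = wines_df.where(wines_df['Ratings'] > 95)   # same shape, NaN where the condition is False
sliced = wines_df[wines_df['Ratings'] > 95]         # only the matching rows
# equivalent: wines_df.where(wines_df['Ratings'] > 95).dropna(subset=['Ratings'])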
My table looks like this:
In [82]:df.head()
Out[82]:
MatDoc MatYr MvT Material Plnt SLoc Batch Customer AmountLC Amount ... PO MatYr.1 MatDoc.1 Order ProfitCtr SLED/BBD PstngDate EntryDate Time Username
0 4912693062 2015 551 100062 HDC2 0001 5G30MC1A11 NaN 9.03 9.06 ... NaN NaN NaN NaN IN1165B085 26.01.2016 01.08.2015 01.08.2015 01:13:16 O33462
1 4912693063 2015 501 166 HDC2 0004 NaN NaN 0.00 0.00 ... NaN NaN NaN NaN IN1165B085 NaN 01.08.2015 01.08.2015 01:13:17 O33462
2 4912693320 2015 551 101343 HDC2 0001 5G28MC1A11 NaN 53.73 53.72 ... NaN NaN NaN NaN IN1165B085 25.01.2016 01.08.2015 01.08.2015 01:16:30 O33462
Here, I need to group the data by the Order column and sum only the AmountLC column. Then I need to check which Order values are present in both MvT101group and MvT102group; if an Order appears in both sets of data, I need to subtract the MvT102group sum from the MvT101group sum and display:
Order|Plnt|Material|Batch|Sum101=SumofMvt101ofAmountLC|Sum102=SumofMvt102ofAmountLC|(Sum101-Sum102)/100
What I have done first is make two new DataFrames containing only the 101 and 102 rows: MvT101 and MvT102
MvT101 = df.loc[df['MvT'] == 101]
MvT102 = df.loc[df['MvT'] == 102]
Then I grouped each by Order and got the sum of the AmountLC column:
MvT101group = MvT101.groupby('Order', sort=True)
In [76]:
MvT101group[['AmountLC']].sum()
Out[76]:
Order AmountLC
1127828 16348566.88
1127829 22237710.38
1127830 29803745.65
1127831 30621381.06
1127832 33926352.51
MvT102group = MvT102.groupby('Order', sort=True)
In [77]:
MvT102group[['AmountLC']].sum()
Out[77]:
Order AmountLC
1127830 53221.70
1127831 651475.13
1127834 67442.16
1127835 2477494.17
1128622 218743.14
After this I am not able to understand how I should write my query.
Please ask me for any further details if needed. Here is the CSV file I am working from: Link
Hope I understood the question correctly. After grouping both groups as you did:
MvT101group = MvT101.groupby('Order',sort=True).sum()
MvT102group = MvT102.groupby('Order',sort=True).sum()
You can update the column names for both groups by appending a suffix:
MvT101group.columns = MvT101group.columns.map(lambda x: str(x) + '_101')
MvT102group.columns = MvT102group.columns.map(lambda x: str(x) + '_102')
Then merge all 3 tables so that you will have all 3 columns in the main table:
df = df.merge(MvT101group, left_on=['Order'], right_index=True, how='left')
df = df.merge(MvT102group, left_on=['Order'], right_index=True, how='left')
And then you can add the calculated column:
df['calc'] = (df['AmountLC_101'] - df['AmountLC_102']) / 100
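If only Orders present in both groups should be kept (as the question describes), one way to sketch it is an inner merge of the two grouped frames; the AmountLC_101/AmountLC_102 names follow the suffix convention above:
# Keep only Orders that appear in both grouped frames, then compute the difference.
both = MvT101group.merge(MvT102group, left_index=True, right_index=True, how='inner')
both['(Sum101-Sum102)/100'] = (both['AmountLC_101'] - both['AmountLC_102']) / 100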
I tried several examples on this topic but with no results. I'm reading a DataFrame like:
Code,Counts
10006,5
10011,2
10012,26
10013,20
10014,17
10015,2
10018,2
10019,3
How can I get another DataFrame like:
Code,Counts
10006,5
10007,NaN
10008,NaN
...
10011,2
10012,26
10013,20
10014,17
10015,2
10016,NaN
10017,NaN
10018,2
10019,3
Basically, I want to fill in the missing values of the 'Code' column. I tried the df.reindex() method but I can't figure out how it works. Thanks a lot.
I'd set the index to your 'Code' column, then reindex by passing in a new array based on your current index (np.arange accepts start and stop parameters; you need to add 1 to the stop value), and then reset_index. This assumes that your 'Code' values are already sorted:
In [21]:
df.set_index('Code', inplace=True)
df = df.reindex(index = np.arange(df.index[0], df.index[-1] + 1)).reset_index()
df
Out[21]:
Code Counts
0 10006 5
1 10007 NaN
2 10008 NaN
3 10009 NaN
4 10010 NaN
5 10011 2
6 10012 26
7 10013 20
8 10014 17
9 10015 2
10 10016 NaN
11 10017 NaN
12 10018 2
13 10019 3
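If the 'Code' values might not be sorted, a small variation on the same idea is to reindex over the full min-to-max range instead (a sketch along the same lines):
import numpy as np
# Build the complete range from the smallest to the largest code, so sort order does not matter.
full_range = np.arange(df['Code'].min(), df['Code'].max() + 1)
df = df.set_index('Code').reindex(full_range).reset_index()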