Pandas Dataframe Indexing, Where - python

So I have a dataframe of wine data:
wines_dict = {
    'Appellation': list(wine_appellations),
    'Ratings': list(wine_ratings),
    'Region': list(wine_regions),
    'Name': list(wine_names),
    'Varietal': list(wine_varietals),
    'WineType': list(wine_wine_types),
    'RetailPrice': list(wine_retail_prices)
}
wines_df = pd.DataFrame(
    data=wines_dict,
    columns=[
        'Region',
        'Ratings',
        'Appellation',
        'Name',
        'Varietal',
        'WineType',
        'RetailPrice'
    ]
)
I am trying to slice it using wines_df.where((wines_df['Ratings'] > 95) & (~pd.isnull(wines_df['Ratings']))), but it still returns NaN ratings:
0 NaN
1 NaN
2 NaN
3 NaN
4 97.0
5 98.0
6 NaN
How can I slice it so that it returns only the non-null values greater than 95?

A simple boolean slice like this should give you the desired output. .where() keeps the original shape and fills every non-matching row with NaN, which is why you still see NaNs; indexing with the boolean mask instead drops those rows entirely:
wines_df[(wines_df['Ratings'] > 95) & (wines_df['Ratings'].notnull())]
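A minimal sketch of the difference, using a hypothetical toy frame:
import pandas as pd
import numpy as np

toy = pd.DataFrame({'Ratings': [90.0, np.nan, 97.0, 98.0]})
mask = toy['Ratings'] > 95  # NaN compares as False, so the notnull() check is technically redundant

toy.where(mask)  # keeps all 4 rows; non-matching rows become NaN
toy[mask]        # keeps only the rows with 97.0 and 98.0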

Related

Python - how to pass a result from group by to Pivot?

My goal was to apply pivot function to a data frame that contains duplicate records. I solved it by adding a unique column to the data frame:
my_df['id_column'] = range(1, len(my_df.index)+1)
df_pivot = my_df.pivot(index ='id_column', columns = 'type', values = 'age_16_18').fillna(0).astype(int)
I want to figure out how to apply pivot to the data frame without deleting duplicates or using pivot_table: by first grouping by multiple columns, and then passing the result to the pivot function. I'm not sure how to pass the grouped result to pivot.
year category state_name type is_state gender age_16_18 age_18_30
0 2001 Foreigners CA Convicts 0 M 8 5
1 2001 Indians NY Convicts 0 F 5 2
2 2005 Foreigners NY Others 1 M 0 9
3 2009 Indians NJ Detenus 0 F 7 0
It's not entirely clear what you're attempting, but see if you can get some inspiration from the following approaches. Which columns do you wish to group by?
import pandas
my_df = pandas.DataFrame(
    {'year': [2001, 2001, 2005, 2009],
     'category': ['Foreigners', 'Indians', 'Foreigners', 'Indians'],
     'state_name': ['CA', 'NY', 'NY', 'NJ'],
     'type': ['Convicts', 'Convicts', 'Others', 'Detenus'],
     'is_state': [0, 0, 1, 0],
     'gender': ['M', 'F', 'M', 'F'],
     'age_16_18': [8, 5, 0, 7],
     'age_18_30': [5, 2, 9, 0]},
    columns=['year', 'category', 'state_name', 'type', 'is_state', 'gender', 'age_16_18', 'age_18_30'])
>>> my_df.pivot( columns = 'type', values = 'age_16_18' )
type Convicts Detenus Others
0 8.0 NaN NaN
1 5.0 NaN NaN
2 NaN NaN 0.0
3 NaN 7.0 NaN
>>> my_df['key'] = my_df.category.str.cat(my_df.gender)
>>> my_df.pivot( index='key', columns = 'type', values = 'age_16_18' )
type Convicts Detenus Others
key
ForeignersM 8.0 NaN 0.0
IndiansF 5.0 7.0 NaN
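If the goal is literally to group first and then feed the result to pivot, here is a hedged sketch (assuming, for illustration, you group by year and type and sum age_16_18): with as_index=False the grouped result comes back as a flat frame whose columns pivot can use directly, and the aggregation also removes the duplicates that pivot chokes on:
grouped = my_df.groupby(['year', 'type'], as_index=False)['age_16_18'].sum()
pivoted = grouped.pivot(index='year', columns='type', values='age_16_18').fillna(0).astype(int)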

reshaping pandas DataFrame for export in a nested dict

Given the following DataFrame:
Category Area Country Code Function Last Name LanID Spend1 Spend2 Spend3 Spend4 Spend5
0 Bisc EE RU02,UA02 Mk Smith df3432 1.0 NaN NaN NaN NaN
1 Bisc EE RU02 Mk Bibs fdss34 1.0 NaN NaN NaN NaN
2 Bisc EE UA02,EURASIA Mk Crow fdsdr43 1.0 NaN NaN NaN NaN
3 Bisc WE FR31 Mk Ellis fdssdf3 1.0 NaN NaN NaN NaN
4 Bisc WE BE32,NL31 Mk Mower TOZ1720 1.0 NaN NaN NaN NaN
5 Bisc WE FR31,BE32,NL31 LKU Elan SKY8851 1.0 1.0 1.0 1.0 1.0
6 Bisc SE IT31 Mk Bobret 3dfsfg 1.0 NaN NaN NaN NaN
7 Bisc SE GR31 Mk Concept MOSGX009 1.0 NaN NaN NaN NaN
8 Bisc SE RU02,IT31,GR31,PT31,ES31 LKU Solar MSS5723 1.0 1.0 1.0 1.0 1.0
9 Bisc SE IT31,GR31,PT31,ES31 Mk Brix fdgd22 NaN 1.0 NaN NaN NaN
10 Choc CE RU02,CZ31,SK31,PL31,LT31 Fin Ocoser 43233d NaN 1.0 NaN NaN NaN
11 Choc CE DE31,AT31,HU31,CH31 Fin Smuth 4rewf NaN 1.0 NaN NaN NaN
12 Choc CE BG31,RO31,EMA Fin Momocs hgghg2 NaN 1.0 NaN NaN NaN
13 Choc WE FR31,BE32,NL31 Fin Bruntly ffdd32 NaN NaN NaN NaN 1.0
14 Choc WE FR31,BE32,NL31 Mk Ofer BROGX011 NaN 1.0 1.0 NaN NaN
15 Choc WE FR31,BE32,NL31 Mk Hem NZJ3189 NaN NaN NaN 1.0 1.0
16 G&C NE UA02,SE31 Mk Cre ORY9499 1.0 NaN NaN NaN NaN
17 G&C NE NO31 Mk Qlyo XVM7639 1.0 NaN NaN NaN NaN
18 G&C NE GB31,NO31,SE31,IE31,FI31 Mk Omny LOX1512 NaN 1.0 1.0 NaN NaN
I would like to get it exported into a nested Dict with the below structure:
{RU02: {Bisc: {EE: {Mkt: {Spend1: {df3432: Smith}
{fdss34: Bibs}
{Bisc: {SE: {LKU: {Spend1: {MSS5723: Solar}
{Spend2: {MSS5723: Solar}
{Spend3: {MSS5723: Solar}
{Spend4: {MSS5723: Solar}
{Spend5: {MSS5723: Solar}
{Choc: {CE: {Fin: {Spend2: {43233d: Ocoser}
.....
{UA02: {Bisc: {EE: {Mkt: {Spend1: {df3432: Smith}
{ffdsdr43: Crow}
{G&C: {NE: {Mkt: {Spend1: {ORY9499: Cre}
.....
So essentially, in this dict I'm trying to track, for each Country Code, the list of Last Names + LanIDs per Spend category (Spend1, Spend2, etc.) and their attributes (Function, Category, Area).
The DataFrame is not very large (fewer than 200 rows), but it contains almost all combinations of Category/Area/Country Code as well as Last Names and their Spend categories (many-to-many).
My challenge is that I'm unable to figure out how to clearly conceptualise the steps needed to prepare the DataFrame for export to a dict.
What I have figured out so far is that I would need to:
1. slice the contents of the "Country Code" column on the "," separator: DONE
2. create new columns based on unique Country Codes, with 1 in each row where that country code is present: DONE
3. set the index of the DataFrame recursively to each of the newly added columns
4. move the rows for each Country Code where there is data into a new DataFrame
5. export all the new DataFrames to dicts, and then merge them
I'm not sure steps 3-5 are the best way to go about this though, as I'm still having difficulty understanding how pd.DataFrame.to_dict should be configured for my case (if that's even possible).
I would highly appreciate help on the coding side, but also a brief explanation of your thought process for each stage.
Here is how far I got on my own:
# keeping track of initial order of columns
initialOrder = list(df.columns.values)

# split the Country Code by ","
CCodeNoCommas = [item for items in df['Country Code'].values for item in items.split(",")]

# add only the UNIQUE Country Codes -via set- as new columns in the DataFrame,
# with NaN for row values
df = pd.concat([df, pd.DataFrame(columns=list(set(CCodeNoCommas)))])

# reordering columns to have the newly added ones at the end
reordered = initialOrder + [c for c in df.columns if c not in initialOrder]
df = df[reordered]

# replace NaN with 1 in the newly added columns (Country Codes), where the same Country Code
# exists in the initial column "Country Code"; do this for each row
CCodeUniqueOnly = set(CCodeNoCommas)
for c in CCodeUniqueOnly:
    CCodeIsPresent_rowIndex = df.index[df['Country Code'].str.contains(c)]
    # print(CCodeIsPresent_rowIndex)
    df.loc[CCodeIsPresent_rowIndex, c] = 1

# no clue what to do next ??
If you re-shape your dataframe into the right format, you can use the handy recursive dictionary function from @DSM's answer to this question. The goal is to get a dataframe where each row contains only one "entry": a unique combination of the columns you're interested in.
First, you need to split your country code strings into lists:
df['Country Code'] = df['Country Code'].str.split(',')
And then expand those lists into multiple rows (using @RomanPekar's technique from this question):
s = df.apply(lambda x: pd.Series(x['Country Code']), axis=1) \
      .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)
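As an aside of my own (not from the original answer): on pandas 0.25 or newer, the same split-and-expand can be done in one call with explode, which unnests a list-valued column into one row per element:
df['Country Code'] = df['Country Code'].str.split(',')
df = df.explode('Country Code').reset_index(drop=True)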
Then you can reshape the Spend* columns into rows, where there's a row for each Spend* column where the value is not nan.
spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
       .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
                                 .reset_index(level=1)['level_1'])) \
       .reset_index(drop=True)
Now you have a dataframe where each level in your nested dictionary is its own column. So you can use this recursive dictionary function:
def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
    return d
And apply it only to the columns you want to produce the nested dictionary, listed in the order in which they should nest:
cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])
That should produce your desired result.
All in one piece:
df['Country Code'] = df['Country Code'].str.split(',')

s = df.apply(lambda x: pd.Series(x['Country Code']), axis=1) \
      .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)

spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
       .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
                                 .reset_index(level=1)['level_1'])) \
       .reset_index(drop=True)

def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    d = {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
    return d

cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])

Extract series objects from Pandas DataFrame

I have a dataframe with the columns
['CPL4', 'Part Number', 'Calendar Year/Month', 'Sales', 'Inventory']
Within each 'Part Number', the 'Calendar Year/Month' values are unique.
I want to convert each part number to a univariate Series with 'Calendar Year/Month' as the index and either 'Sales' or 'Inventory' as the value.
How can I accomplish this using pandas built-in functions and not iterating through the dataframe manually?
In pandas this is called a MultiIndex. Set the two key columns as the index on the existing frame (the DataFrame constructor's index argument takes row labels, not column names, so it won't do this for you):
import pandas as pd
df = df.set_index(['Part Number', 'Calendar Year/Month'])
Then df.loc[part_number, 'Sales'] is a univariate Series indexed by 'Calendar Year/Month'.
You can also use the groupby method:
grouped_df = df.groupby('Part Number')
Then you can access the frame for a given part number and set the index easily:
new_df = grouped_df.get_group('THEPARTNUMBERYOUWANT').set_index('Calendar Year/Month')
If you only want the two columns:
print(new_df[['Sales', 'Inventory']])
From the answers and comments here, along with a little more research, I ended up with the following solution:
temp_series = df[df["Part Number"] == sku].pivot(columns="Calendar Year/Month", values="Sales").iloc[0]
where sku is a specific part number from df["Part Number"].unique().
This gives you a univariate time series (temp_series) indexed by "Calendar Year/Month" with values of "Sales", e.g.:
1.2015 NaN
1.2016 NaN
2.2015 NaN
2.2016 NaN
3.2015 NaN
3.2016 NaN
4.2015 NaN
4.2016 NaN
5.2015 NaN
5.2016 NaN
6.2015 NaN
6.2016 NaN
7.2015 NaN
7.2016 NaN
8.2015 NaN
8.2016 NaN
9.2015 NaN
10.2015 NaN
11.2015 NaN
12.2015 NaN
Name: 161, dtype: float64
<class 'pandas.core.series.Series'>
built from the original columns
['CPL4', 'Part Number', 'Calendar Year/Month', 'Sales', 'Inventory']
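A hedged alternative sketch of my own (not from the thread): a single groupby pass can build every per-part series at once, avoiding the per-sku pivot:
series_by_sku = {
    sku: grp.set_index('Calendar Year/Month')['Sales']
    for sku, grp in df.groupby('Part Number')
}
# series_by_sku['SOME-PART'] is then a univariate Series indexed by 'Calendar Year/Month'
# ('SOME-PART' is a hypothetical part number)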

How should I subtract two dataframes in Pandas and display the required output?

My table looks like this:
In [82]:df.head()
Out[82]:
MatDoc MatYr MvT Material Plnt SLoc Batch Customer AmountLC Amount ... PO MatYr.1 MatDoc.1 Order ProfitCtr SLED/BBD PstngDate EntryDate Time Username
0 4912693062 2015 551 100062 HDC2 0001 5G30MC1A11 NaN 9.03 9.06 ... NaN NaN NaN NaN IN1165B085 26.01.2016 01.08.2015 01.08.2015 01:13:16 O33462
1 4912693063 2015 501 166 HDC2 0004 NaN NaN 0.00 0.00 ... NaN NaN NaN NaN IN1165B085 NaN 01.08.2015 01.08.2015 01:13:17 O33462
2 4912693320 2015 551 101343 HDC2 0001 5G28MC1A11 NaN 53.73 53.72 ... NaN NaN NaN NaN IN1165B085 25.01.2016 01.08.2015 01.08.2015 01:16:30 O33462
Here, I need to group the data on the Order column and sum only the AmountLC column. Then I need to check the Order values such that each is present in both MvT101group and MvT102group; if an Order matches in both sets of data, I need to subtract MvT102group from MvT101group and display:
Order|Plnt|Material|Batch|Sum101=SumofMvt101ofAmountLC|Sum102=SumofMvt102ofAmountLC|(Sum101-Sum102)/100
What I have done is first make new DataFrames containing only MvT 101 and 102: MvT101 and MvT102
MvT101 = df.loc[df['MvT'] == 101]
MvT102 = df.loc[df['MvT'] == 102]
Then I grouped it by Order and got the sum value for the column
MvT101group = MvT101.groupby('Order', sort=True)
In [76]:
MvT101group[['AmountLC']].sum()
Out[76]:
Order AmountLC
1127828 16348566.88
1127829 22237710.38
1127830 29803745.65
1127831 30621381.06
1127832 33926352.51
MvT102group = MvT102.groupby('Order', sort=True)
In [77]:
MvT102group[['AmountLC']].sum()
Out[77]:
Order AmountLC
1127830 53221.70
1127831 651475.13
1127834 67442.16
1127835 2477494.17
1128622 218743.14
After this I am not able to understand how I should write my query.
Please ask me for any further details you need. Here is the CSV file I am working from: Link
Hope I understood the question correctly. After grouping both groups as you did:
MvT101group = MvT101.groupby('Order',sort=True).sum()
MvT102group = MvT102.groupby('Order',sort=True).sum()
You can update the columns' names for both groups:
MvT101group.columns = MvT101group.columns.map(lambda x: str(x) + '_101')
MvT102group.columns = MvT102group.columns.map(lambda x: str(x) + '_102')
Then merge all 3 tables so that you will have all 3 columns in the main table:
df = df.merge(MvT101group, left_on=['Order'], right_index=True, how='left')
df = df.merge(MvT102group, left_on=['Order'], right_index=True, how='left')
And then you can add the calculated column. Note that after the groupby-sum, Order is the index and the summed column is AmountLC, so the suffixed names to subtract are AmountLC_101 and AmountLC_102:
df['calc'] = (df['AmountLC_101'] - df['AmountLC_102']) / 100
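If you then want the one-row-per-Order display asked for above, a sketch (assuming the two merges have been applied) is to keep the rows where both sums exist and drop duplicate Orders:
cols = ['Order', 'Plnt', 'Material', 'Batch', 'AmountLC_101', 'AmountLC_102', 'calc']
result = df.loc[df['calc'].notnull(), cols].drop_duplicates('Order')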

Pandas aligning multiple dataframes with TimeStamp index

This has been the bane of my life for the past couple of days. I have numerous Pandas Dataframes that contain time series data with irregular frequencies. I try to align these into a single dataframe.
Below is some code, with representative dataframes, df1, df2, and df3 (I actually have n=5, and would appreciate a solution that would work for all n>2):
# df1, df2, df3 are given at the bottom
import pandas as pd
import datetime
# I can align df1 to df2 easily
df1aligned, df2aligned = df1.align(df2)
# And then concatenate into a single dataframe
combined_1_n_2 = pd.concat([df1aligned, df2aligned], axis =1 )
# Since I don't know any better, I then try to align df3 to combined_1_n_2 manually:
combined_1_n_2.align(df3)
error: Reindexing only valid with uniquely valued Index objects
I have an idea why I get this error, so I get rid of the duplicate indices in combined_1_n_2 and try again:
combined_1_n_2 = combined_1_n_2.groupby(combined_1_n_2.index).first()
combined_1_n_2.align(df3) # But still get the same error
error: Reindexing only valid with uniquely valued Index objects
Why am I getting this error? Even if this worked, it is completely manual and ugly. How can I align >2 time series and combine them in a single dataframe?
Data:
df1 = pd.DataFrame({'price': [62.1250, 62.2500, 62.2375, 61.9250, 61.9125]},
                   index=[pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
                          for s in ['2008-06-01 06:03:59.614000', '2008-06-01 06:03:59.692000',
                                    '2008-06-01 06:15:42.004000', '2008-06-01 06:15:42.083000',
                                    '2008-06-01 06:17:01.654000']])
df2 = pd.DataFrame({'price': [241.0625, 241.5000, 241.3750, 241.2500, 241.3750]},
                   index=[pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
                          for s in ['2008-06-01 06:13:34.524000', '2008-06-01 06:13:34.602000',
                                    '2008-06-01 06:15:05.399000', '2008-06-01 06:15:05.399000',
                                    '2008-06-01 06:15:42.082000']])
df3 = pd.DataFrame({'price': [67.656, 67.875, 67.8125, 67.75, 67.6875]},
                   index=[pd.DatetimeIndex([datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')])[0]
                          for s in ['2008-06-01 06:03:52.281000', '2008-06-01 06:03:52.359000',
                                    '2008-06-01 06:13:34.848000', '2008-06-01 06:13:34.926000',
                                    '2008-06-01 06:15:05.321000']])
Your specific error is due to the column names of combined_1_n_2 having duplicates (both columns will be named 'price'). You could rename the columns, and the second align would then work.
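A minimal sketch of that rename fix (assuming the deduplicated combined_1_n_2 from above; the new names are arbitrary):
combined_1_n_2.columns = ['price_1', 'price_2']  # make the column names unique
combined_aligned, df3_aligned = combined_1_n_2.align(df3)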
One alternative way would be to chain the join operator, which merges frames on the index, as below.
In [23]: df1.join(df2, how='outer', rsuffix='_1').join(df3, how='outer', rsuffix='_2')
Out[23]:
price price_1 price_2
2008-06-01 06:03:52.281000 NaN NaN 67.6560
2008-06-01 06:03:52.359000 NaN NaN 67.8750
2008-06-01 06:03:59.614000 62.1250 NaN NaN
2008-06-01 06:03:59.692000 62.2500 NaN NaN
2008-06-01 06:13:34.524000 NaN 241.0625 NaN
2008-06-01 06:13:34.602000 NaN 241.5000 NaN
2008-06-01 06:13:34.848000 NaN NaN 67.8125
2008-06-01 06:13:34.926000 NaN NaN 67.7500
2008-06-01 06:15:05.321000 NaN NaN 67.6875
2008-06-01 06:15:05.399000 NaN 241.3750 NaN
2008-06-01 06:15:05.399000 NaN 241.2500 NaN
2008-06-01 06:15:42.004000 62.2375 NaN NaN
2008-06-01 06:15:42.082000 NaN 241.3750 NaN
2008-06-01 06:15:42.083000 61.9250 NaN NaN
2008-06-01 06:17:01.654000 61.9125 NaN NaN
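For arbitrary n, a sketch of my own generalization (not from the original answer): suffix each frame's columns to avoid name collisions, then fold the same outer join over the list with functools.reduce:
from functools import reduce

frames = [df1, df2, df3]  # extend with df4, df5, ... as needed
renamed = [f.add_suffix('_%d' % i) for i, f in enumerate(frames)]
combined = reduce(lambda left, right: left.join(right, how='outer'), renamed)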
