I read in a .csv file and build the following data frame, which counts the vowels and consonants in each string of the Description column. This works great, but I want to split Description into 8 columns and count the consonants and vowels for each of them. The second part of my code splits Description into 8 columns. How can I count the vowels and consonants in all 8 columns that Description is split into?
import pandas as pd
import re
def anti_vowel(s):
    result = re.sub(r'[AEIOU]', '', s, flags=re.IGNORECASE)
    return result
data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data['Vowels'] = data['Description'].str.count(r'[aeiou]', flags=re.I)
data['Consonant'] = data['Description'].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
print (data)
This is the code I'm using to split the column Description into 8 columns.
import pandas as pd
data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data = data["Description"].str.split(" ", n = 8, expand = True)
print (data)
Now how can I put it all together?
To read each of the 8 columns and count its consonants and vowels, I know I can use the following, replacing the 0 with each of 0-7:
testconsonant = data[0].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
testvowel = data[0].str.count(r'[aeiou]', flags=re.I)
Desired output would be:
Description [0] vowel count consonant count Description [1] vowel count consonant count Description [2] vowel count consonant count Description [3] vowel count consonant count Description [4] vowel count consonant count all the way to description [7]
Use stack, count on the stacked Series, then unstack:
stacked = data.stack()
pd.concat({
'Vowels': stacked.str.count('[aeiou]', flags=re.I),
'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
Consonant Vowels
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 3.0 5.0 5.0 1.0 2.0 NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
1 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
2 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
3 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
4 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
5 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
6 3.0 4.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN NaN
7 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
8 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 3.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
9 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
10 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
11 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
12 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
13 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 1.0 0.0 0.0 0.0 NaN NaN NaN NaN
14 3.0 5.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15 3.0 3.0 0.0 3.0 1.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN
If you want to combine this with the data dataframe, you can do:
stacked = data.stack()
pd.concat({
'Data': stacked,
'Vowels': stacked.str.count('[aeiou]', flags=re.I),
'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
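As a rough end-to-end sketch of the stack/unstack idea, the snippet below runs on a tiny made-up frame instead of the CSV. The three Description strings and the names parts, counts and order are assumptions for illustration only. It uses n=7 so that the split yields at most 8 columns, and then reorders the result so each word position sits next to its own counts, as in the desired output.

```python
import re
import pandas as pd

# Hypothetical stand-in for the Description column of the CSV
data = pd.DataFrame(
    {"Description": ["HC ROOM PRIVATE", "IV SOLUTION", "XR CHEST 2 VIEWS"]}
)

# n=7 performs at most 7 splits, so at most 8 columns (0..7) are produced
parts = data["Description"].str.split(" ", n=7, expand=True)

# One entry per (row, word position)
stacked = parts.stack()

counts = pd.concat(
    {
        "Word": stacked,
        "Vowels": stacked.str.count(r"[aeiou]", flags=re.I),
        "Consonant": stacked.str.count(r"[bcdfghjklmnpqrstvwxyz]", flags=re.I),
    },
    axis=1,
).unstack()

# Put the word position on top and group Word/Vowels/Consonant under it
order = pd.MultiIndex.from_tuples(
    [(pos, name) for pos in parts.columns for name in ("Word", "Vowels", "Consonant")]
)
counts = counts.swaplevel(axis=1).reindex(columns=order)
print(counts)
```

Positions that a row does not have (for example a fourth word in a two-word description) show up as missing values rather than zeros.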
I have dataframe which looks like below:
df:
Review_Text Noun Thumbups
Would be nice to be able to import files from ... [My, Tracks, app, phone, Google, Drive, import... 1.0
No Offline Maps! It used to have offline maps ... [Offline, Maps, menu, option, video, exchange,... 18.0
Great application. Designed with very well tho... [application, application] 16.0
Great App. Nice and simple but accurate. Wish ... [Great, App, Nice, Exported] 0.0
Save For Offline - This does not work. The rou... [Save, Offline, route, filesystem] 12.0
Since latest update app will not run. Subscrip... [update, app, Subscription, March, application] 9.0
Great app. Love it! And all the things it does... [Great, app, Thank, work] 1.0
I have paid for subscription but keeps telling... [subscription, trial, period] 0.0
Error: The route cannot be save for no locatio... [Error, route, i, GPS] 0.0
When try to restore my tracks it says "unable ... [try, file, locally-1] 0.0
Was a good app but since the update it only re... [app, update, metre] 2.0
Based on the values in the 'Noun' column, I want to create other columns. For example, all values in the Noun list of the first row become columns, and those columns contain that row's 'Thumbups' value. If a column name is already present in the dataframe, the 'Thumbups' value is added to the existing value of that column.
I was trying to implement this with pivot_table:
pd.pivot_table(latest_review,columns='Noun',values='Thumbups')
But got following error:
TypeError: unhashable type: 'list'
Could anyone help me in fixing the issue?
Use Series.str.join with Series.str.get_dummies to build the dummy columns, then multiply by the Thumbups column with DataFrame.mul:
df1 = df['Noun'].str.join('|').str.get_dummies().mul(df['Thumbups'], axis=0)
print (df1)
App Drive Error Exported GPS Google Great Maps March My Nice \
0 0.0 10.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 10.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 180.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 90.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Offline Save Subscription Thank Tracks app application exchange \
0 0.0 0.0 0.0 0.0 10.0 10.0 0.0 0.0
1 180.0 0.0 0.0 0.0 0.0 0.0 0.0 180.0
2 0.0 0.0 0.0 0.0 0.0 0.0 160.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 120.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 90.0 0.0 0.0 90.0 90.0 0.0
6 0.0 0.0 0.0 10.0 0.0 10.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN
file filesystem i import locally-1 menu metre option period \
0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 180.0 0.0 180.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN
phone route subscription trial try update video work
0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 180.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 90.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN
rows = []
# Unpack each list in the Noun column into one row per noun, keeping df's column order
_ = df.apply(lambda row: [rows.append([row['Review_Text'], nn, row['Thumbups']])
                          for nn in row.Noun], axis=1)
# Create a new dataframe from the unpacked values
df_new = pd.DataFrame(rows, columns=df.columns)
# Pivot df_new; pivot_table sums Thumbups when a noun repeats within a review
pivot_df = df_new.pivot_table(index='Review_Text', columns='Noun', values='Thumbups', aggfunc='sum')
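A possible alternative, sketched here on a made-up three-row frame, is to let DataFrame.explode do the unpacking and then pivot. The review texts and values below are hypothetical, and the names long_df and wide are introduced only for this example.

```python
import pandas as pd

# Hypothetical miniature of the dataframe in the question
df = pd.DataFrame(
    {
        "Review_Text": ["review a", "review b", "review c"],
        "Noun": [["app", "phone"], ["app", "app"], ["route"]],
        "Thumbups": [1.0, 16.0, 12.0],
    }
)

# explode() gives each list element its own row and repeats the other columns
long_df = df.explode("Noun")

# Sum Thumbups per (review, noun); nouns absent from a review become 0
wide = long_df.pivot_table(
    index="Review_Text",
    columns="Noun",
    values="Thumbups",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
```

Note the difference from the get_dummies approach: a noun that appears twice in one list (as in "review b") is summed twice here, whereas get_dummies counts it once. Which behaviour is wanted depends on how the question's "adds the value" is read.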
I have a dataframe whose columns are not in numeric sequence. According to len(df.columns), my data has 3586 columns. How can I re-order the columns into sequence?
ID V1 V10 V100 V1000 V1001 V1002 ... V990 V991 V992 V993 V994
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0
I tried df = df.reindex(sorted(df.columns), axis=1) (based on the question "Re-ordering columns in pandas dataframe based on column name"), but it still does not work.
thank you
First get all columns that do not match the pattern V + number by filtering with str.contains, then get the remaining columns with Index.difference and sort them by their numeric part. Join both lists and pass the result to DataFrame.reindex, so the non-matching columns come first, followed by the V + number columns in numeric order:
L1 = df.columns[~df.columns.str.contains(r'^V\d+$')].tolist()
L2 = sorted(df.columns.difference(L1), key=lambda x: float(x[1:]))
df = df.reindex(L1 + L2, axis=1)
print (df)
ID V1 V10 V100 V990 V991 V992 V993 V994 V1000 V1001 V1002
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
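The reason sorted(df.columns) does not help is that the names are strings, so 'V10' sorts before 'V2'. A minimal sketch of the numeric sort on a made-up one-row frame follows; the column names and the helper names pattern, other and numbered are assumptions for illustration.

```python
import re
import pandas as pd

# Hypothetical frame whose columns are in plain string order
df = pd.DataFrame(
    [[1, 9.0, 2.9, 0.0, 0.0, 3.0, 1.0]],
    columns=["ID", "V1", "V10", "V100", "V1000", "V2", "V990"],
)

pattern = re.compile(r"^V(\d+)$")

# Columns that are not "V<number>" keep their original relative order
other = [c for c in df.columns if not pattern.match(c)]

# "V<number>" columns are sorted by the integer after the V
numbered = sorted(
    (c for c in df.columns if pattern.match(c)),
    key=lambda c: int(pattern.match(c).group(1)),
)

df = df[other + numbered]
print(df.columns.tolist())
```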
I'm using pandas in Python. After some crosstab calculations and concatenations, I end up with a data frame that looks like this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
The problem is that I want the last 4 rows, which start with Superior, to be placed directly after the Total row. In other words, I simply want to switch the positions of the last 4 rows with the 4 rows that start with Regular. How can I achieve this in pandas, so that I get this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
A more general solution uses Categorical and argsort. The rows of this df are already grouped under their headers, so ffill is safe here (np is numpy, imported with import numpy as np):
s=df.ID
s=s.where(s.isin(['Total','Regular','Superior'])).ffill()
s=pd.Categorical(s,['Total','Superior','Regular'],ordered=True)
df=df.iloc[np.argsort(s)]
df
Out[188]:
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
5 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
6 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
7 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
8 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
1 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
2 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
3 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
4 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Here's one way:
import numpy as np
df.iloc[1:,:] = np.roll(df.iloc[1:,:].values, 4, axis=0)
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
1 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
2 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
3 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
4 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
5 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
6 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
7 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
8 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
For a specific answer to this question, just use iloc
df.iloc[[0,5,6,7,8,1,2,3,4],:]
For a more generalized solution,
m = (df.ID.eq('Superior') | df.ID.eq('Regular')).cumsum()
pd.concat([df[m==0], df[m==2], df[m==1]])
or
order = (2,1)
pd.concat([df[m==0], *[df[m==c] for c in order]])
where order defines the mapping from previous ordering to new ordering.
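The block-reordering idea behind these answers can also be sketched with an explicit stable sort, so that rows inside each block keep their original order. The shortened frame and the names headers, wanted, block, rank and reordered below are made up for illustration; the sketch assumes every row belongs to the nearest header row above it.

```python
import pandas as pd

# Hypothetical shortened version of the frame in the question
df = pd.DataFrame(
    {
        "ID": ["Total", "Regular", "CR", "HDG", "Superior", "CR", "HDG"],
        "5": [87.0, 72.0, 22.0, 20.0, 15.0, 3.0, 5.0],
    }
)

headers = ["Total", "Regular", "Superior"]  # rows that start a block
wanted = ["Total", "Superior", "Regular"]   # desired block order

# Label every row with the header of the block it belongs to
block = df["ID"].where(df["ID"].isin(headers)).ffill()

# Position of each block in the desired order
rank = block.map({name: i for i, name in enumerate(wanted)})

# A stable sort keeps the rows within each block in their original order
reordered = df.loc[rank.sort_values(kind="stable").index]
print(reordered)
```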
I am trying to update this table 1 (Level I, Level II, and Level III) by using pandas iloc or loc with the dataset referenced below. I am open to a better way than loc and iloc if there are suggestions.
Table 1
Example 1
If I want the table to update with new information for the 1102 selection for Pay Grade 13 and Level III I would use the following pd.loc code:
jobseries = '1102'
result = df.loc[('3',jobseries),'13']
print (result)
14.0
Example 2: This works too.
jobseries = '1102'
result = df.loc[('3',jobseries),'13'].sum()
print (result)
14
However, the challenge is when I need to select multiple indexes or multiple columns.
MULTIPLE ROWS
Now, if I want to update Table 1, Total for all Level I, instead of doing some type of df.isin, I need to do the following:
Example 3:
total = df.loc[('1',jobseries),'07'] + df.loc[('1',jobseries),'09'] + and so on...
print (total)
32
This works, but I believe it will eventually throw a RuntimeWarning: invalid value encountered in long_scalars, so it's not the best way to do this. Any recommendations?
MULTIPLE COLUMNs
Now, if I want to update Table 1, # certs for Level I, Level II, and Level III, at any given grade level, I can't figure out the code. I've tried the following, but it's throwing a KeyError. I've tried multiple ways of doing this and still cannot figure it out:
Example 4:
jobseries = '1102'
result = df.loc[('1','2','3',jobseries),'All']
print (result)
KeyError: "None of [[('1', '2', '3', '1102')]] are in the [index]"
This KeyError confuses me, because when I check my index the labels appear to be there.
df.index:
MultiIndex(levels=[['1', '2', '3', 'All'], ['', '0301', '0341', '0342', '0343', '0501', '0560', '0810', '0850', '1101', '1102', '1105', '1106', '1109', '1145', '1146', '1170', '1410']],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3], [2, 3, 4, 6, 7, 9, 10, 11, 12, 13, 16, 17, 2, 8, 9, 10, 11, 1, 3, 4, 5, 9, 10, 11, 14, 15, 16, 0]],
names=['Level', 'JobSeries'])
I've also tried df.xs:
Example 5:
jobseries = '1102'
result = df.xs(jobseries, level=1)
print (result)
01 07 08 09 11 12 13 14 15 All
Level
1 1.0 0.0 0.0 9.0 8.0 9.0 6.0 0.0 0.0 15
2 0.0 0.0 0.0 4.0 6.0 12.0 6.0 1.0 0.0 13
3 1.0 0.0 0.0 0.0 1.0 11.0 14.0 9.0 3.0 14
CHANGES IN ROWS OR COLUMNS
The other challenge is that if the dataset changes and the index or rows change, pd.loc and pd.iloc will throw a KeyError. Is there any way around this?
df:
01 07 08 09 11 12 13 14 15 All
Level JobSeries
1 0341 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
0342 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1
0343 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 2
0560 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1
0810 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1
1101 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
1102 1.0 0.0 0.0 9.0 8.0 9.0 6.0 0.0 0.0 15
1105 0.0 7.0 3.0 5.0 0.0 0.0 0.0 0.0 0.0 9
1106 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2
1109 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 2
1170 0.0 0.0 0.0 0.0 1.0 2.0 0.0 0.0 0.0 3
1410 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1
2 0341 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1
0850 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1
1101 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 2
1102 0.0 0.0 0.0 4.0 6.0 12.0 6.0 1.0 0.0 13
1105 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1
3 0301 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1
0342 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1
0343 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1
0501 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1
1101 0.0 0.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0 2
1102 1.0 0.0 0.0 0.0 1.0 11.0 14.0 9.0 3.0 14
1105 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
1145 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1
1146 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1
1170 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 2
All 2.0 8.0 4.0 11.0 11.0 14.0 15.0 9.0 4.0 17
Reference:
pd.loc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
pd.xs: https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.xs.html
pd.iloc: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer
I'm not totally clear on the ask, but would
df.groupby(df.index).count()['13'] or df.groupby(df.index).sum()['13'] for a column, or
df.groupby(['Level','JobSeries']).sum().loc[('1','0341')] for a row
accomplish what you're looking for? Note that the labels in this index and these columns are strings. groupby also accepts a level argument (for example df.groupby(level='Level')), which is designed to deal with multi-index problems.
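For the multiple-row and multiple-column cases in the question, a hedged sketch of plain .loc selection on a MultiIndex is shown below. The five-row frame is a made-up subset of the data, and the names level1_total, certs, certs_slice, key and value are introduced only for this example. The KeyError in Example 4 is consistent with ('1','2','3',jobseries) being read as a single four-part key; listing the Level labels inside the tuple avoids that.

```python
import pandas as pd

# Hypothetical subset of the frame in the question; all labels are strings
index = pd.MultiIndex.from_tuples(
    [("1", "1102"), ("1", "1105"), ("2", "1102"), ("3", "1102"), ("3", "1105")],
    names=["Level", "JobSeries"],
)
df = pd.DataFrame(
    {
        "09": [9.0, 5.0, 4.0, 0.0, 0.0],
        "13": [6.0, 0.0, 6.0, 14.0, 0.0],
        "All": [15, 9, 13, 14, 1],
    },
    index=index,
)
jobseries = "1102"

# Several columns for one row: pass a list of column labels, then sum
level1_total = df.loc[("1", jobseries), ["09", "13"]].sum()

# Several levels for one job series: put a list of Level labels inside the tuple
certs = df.loc[(["1", "2", "3"], jobseries), "All"]

# The same selection with IndexSlice, taking every Level that has this job series
idx = pd.IndexSlice
certs_slice = df.loc[idx[:, jobseries], "All"]

# Guard against labels that may disappear when the dataset changes
key = ("4", jobseries)
value = df.loc[key, "13"] if key in df.index else 0.0

print(level1_total, certs.tolist(), certs_slice.tolist(), value)
```

Summing a list of columns in one step also replaces the long chain of additions in Example 3. The fallback of 0.0 for a missing key is an arbitrary choice for this sketch.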
I have a Pandas Dataframe which tells me monthly sales of items in shops
df.head():
ID month sold
0 150983 0 1.0
1 56520 0 13.0
2 56520 1 7.0
3 56520 2 13.0
4 56520 3 8.0
I want to remove all IDs where there were no sales last month. I.e. month == 33 & sold == 0. Doing the following
unwanted_df = df[((df['month'] == 33) & (df['sold'] == 0.0))]
I get just 46 rows, which is far too few. But never mind, I would like to have the data in a different format anyway. A pivoted version of the table above is just what I want:
pivoted_df = df.pivot(index='month', columns = 'ID', values = 'sold').fillna(0)
pivoted_df.head()
ID 0 2 3 5 6 7 8 10 11 12 ... 214182 214185 214187 214190 214191 214192 214193 214195 214197 214199
month
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Question. How to remove columns with the value 0 in the last row in pivoted_df?
You can do this with one line:
pivoted_df= pivoted_df.drop(pivoted_df.columns[pivoted_df.iloc[-1,:]==0],axis=1)
I want to remove all IDs where there were no sales last month
You can first calculate the IDs satisfying your condition:
id_selected = df.loc[(df['month'] == 33) & (df['sold'] == 0), 'ID']
Then filter these from your dataframe via a Boolean mask:
df = df[~df['ID'].isin(id_selected)]
Finally, use pd.pivot_table with your filtered dataframe.
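Put together, the filter-then-pivot steps might look like the sketch below, run on a tiny made-up frame. One possible reason the original filter returned so few rows is that an ID with no sales in the last month may have no row for that month at all, rather than a row with sold == 0. This variant therefore keeps the IDs that did sell instead of dropping the ones with an explicit zero. The data and the names last_month, sold_last, kept and pivoted are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: ID 2 has an explicit zero in the last month,
# ID 3 has no row for the last month at all
df = pd.DataFrame(
    {
        "ID": [1, 1, 2, 2, 3],
        "month": [32, 33, 32, 33, 32],
        "sold": [4.0, 2.0, 5.0, 0.0, 1.0],
    }
)

last_month = df["month"].max()

# IDs with at least one positive sale in the last month
sold_last = df.loc[(df["month"] == last_month) & (df["sold"] > 0), "ID"]

# Keep only those IDs, then pivot the filtered frame
kept = df[df["ID"].isin(sold_last)]
pivoted = kept.pivot_table(
    index="month", columns="ID", values="sold", aggfunc="sum", fill_value=0
)
print(pivoted)
```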