I want to pivot this DataFrame and convert the original columns into the second level of a MultiIndex (or into regular columns).
Original dataframe:
Type VC C B Security
0 Standard 2 2 2 A
1 Standard 16 13 0 B
2 Standard 52 35 2 C
3 RI 10 10 0 A
4 RI 10 15 31 B
5 RI 10 15 31 C
Desired dataframe:
Type A B C
0 Standard VC 2 16 52
1 Standard C 2 13 35
2 Standard B 2 0 2
3 RI VC 10 10 10
11 RI C 10 15 15
12 RI B 0 31 31
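For reference, the steps below assume the original frame can be constructed like this (a minimal sketch of the sample data above):

import pandas as pd

df = pd.DataFrame({
    'Type': ['Standard', 'Standard', 'Standard', 'RI', 'RI', 'RI'],
    'VC': [2, 16, 52, 10, 10, 10],
    'C': [2, 13, 35, 10, 15, 15],
    'B': [2, 0, 2, 0, 31, 31],
    'Security': ['A', 'B', 'C', 'A', 'B', 'C'],
})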
You could try as follows:
Use df.pivot and then transpose using df.T.
Next, chain df.sort_index to rearrange the entries, and apply df.swaplevel to change the order of the MultiIndex.
Lastly, consider getting rid of Security as the columns.name, and adding an index.name for the unnamed level, e.g. Subtype here.
If you want the MultiIndex as columns, you can of course simply use df.reset_index at this stage (see the snippet after the output below).
res = (df.pivot(index='Security', columns='Type')  # columns become a (variable, Type) MultiIndex
       .T                                          # transpose: that MultiIndex is now the row index
       .sort_index(level=[1, 0], ascending=[False, False])  # sort by Type, then variable, descending
       .swaplevel(0))                              # reorder the index levels to (Type, variable)
res.columns.name = None                 # drop the leftover 'Security' columns name
res.index.names = ['Type', 'Subtype']   # name the previously unnamed level
print(res)
A B C
Type Subtype
Standard VC 2 16 52
C 2 13 35
B 2 0 2
RI VC 10 10 10
C 10 15 15
B 0 31 31
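If you prefer the MultiIndex as regular columns (per the note above), a final reset_index does it:

res = res.reset_index()  # 'Type' and 'Subtype' become ordinary columns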
I'm trying to get the unique Available values for each site. The original pandas DataFrame has three columns:
Site  Available  Capacity
A     7          20
A     7          20
A     8          20
B     15         35
B     15         35
C     12         25
C     12         25
C     11         25
and I want to get the unique Available values for each site. The desired table looks like this:
Site  Unique Available
A     7
A     8
B     15
C     12
C     11
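For reproducibility, the answers below assume the sample data is loaded like this (a minimal sketch):

import pandas as pd

df = pd.DataFrame({
    'Site': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'Available': [7, 7, 8, 15, 15, 12, 12, 11],
    'Capacity': [20, 20, 20, 35, 35, 25, 25, 25],
})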
You can get the lists of unique Available values per site with GroupBy.unique():
>>> df.groupby('Site')['Available'].unique()
Site
A [7, 8]
B [15]
C [12, 11]
Name: Available, dtype: object
Then with explode() you can expand these lists and with reset_index() get the index back to a column:
>>> df.groupby('Site')['Available'].unique().explode().reset_index()
Site Available
0 A 7
1 A 8
2 B 15
3 C 12
4 C 11
Otherwise simply get both columns and remove duplicates:
>>> df[['Site', 'Available']].drop_duplicates()
Site Available
0 A 7
2 A 8
3 B 15
5 C 12
7 C 11
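If you want a clean 0..n index like in the desired table, chain reset_index(drop=True):
>>> df[['Site', 'Available']].drop_duplicates().reset_index(drop=True)
  Site  Available
0    A          7
1    A          8
2    B         15
3    C         12
4    C         11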
Approach with: GroupBy.apply() + Series.drop_duplicates()
(df.groupby('Site')['Available']
   .apply(lambda s: s.drop_duplicates())  # drop duplicate values within each Site group
   .reset_index(level=1, drop=True)       # discard the original row-index level
   .reset_index(name='Unique Available')  # turn the Site index back into a column
)
Result:
Site Unique Available
0 A 7
1 A 8
2 B 15
3 C 12
4 C 11
I have this table:
a b c d e f 19-08-06 19-08-07 19-08-08 g h i
1 2 3 4 5 6 7 8 9 10 11 12
I have 34 date columns, and I want to melt them into a single column.
How can I do this in Python?
Thanks in advance
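For reference, the snippets below assume the sample table is built like this (a minimal sketch):

import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
    columns=['a', 'b', 'c', 'd', 'e', 'f',
             '19-08-06', '19-08-07', '19-08-08', 'g', 'h', 'i'],
)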
You can use .str.fullmatch on the columns Index to create a boolean mask that selects the date columns, then use df.melt:
m = df.columns.str.fullmatch(r"\d{2}-\d{2}-\d{2}")
cols = df.columns[m]
df.melt(value_vars=cols, var_name='date', value_name='vals')
date vals
0 19-08-06 7
1 19-08-07 8
2 19-08-08 9
If you want to melt while keeping the other columns, try this:
df.melt(
    id_vars=df.columns.difference(cols), var_name="date", value_name="vals"
)
a b c d e f g h i date vals
0 1 2 3 4 5 6 10 11 12 19-08-06 7
1 1 2 3 4 5 6 10 11 12 19-08-07 8
2 1 2 3 4 5 6 10 11 12 19-08-08 9
Here I did not use value_vars=cols, as it's done implicitly:
value_vars: tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are
not set as id_vars.
I saw a primitive version of this question here,
but my dataframe has different names and I want to do the calculation separately for each of them:
A B C
0 a 3 5
1 a 6 9
2 b 3 8
3 b 11 19
I want to group by A and then find the difference between alternate B and C values, something like this:
A B C dA
0 a 3 5 6
1 a 6 9 NaN
2 b 3 8 16
3 b 11 19 NaN
I tried:
df['dA']=df.groupby('A')(['C']-['B'])
df['dA']=df.groupby('A')['C']-df.groupby('A')['B']
Neither of them helped.
What mistake am I making?
IIUC, here is one way to perform the calculation:
# create the data frame
from io import StringIO
import pandas as pd
data = '''idx A B C
0 a 3 5
1 a 6 9
2 b 3 8
3 b 11 19
'''
df = pd.read_csv(StringIO(data), sep=r'\s+', engine='python').set_index('idx')
Now, compute dA. I take the last value of C minus the first value of B, as grouped by A. (Is this right? Or is it max(C) minus min(B)?) If you're guaranteed to have the A values in pairs, then @BenT's shift() would be more concise (see the sketch at the end of this answer).
dA = (
    (df.groupby('A')['C'].transform('last') -
     df.groupby('A')['B'].transform('first'))
    .drop_duplicates()
    .rename('dA'))
print(pd.concat([df, dA], axis=1))
A B C dA
idx
0 a 3 5 6.0
1 a 6 9 NaN
2 b 3 8 16.0
3 b 11 19 NaN
I used groupby().transform() to preserve index values, to support the concat operation.
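For completeness, a sketch of the shift()-based alternative mentioned above, assuming every A value appears in exactly two consecutive rows:

# pair each row's B with the next row's C within the same group;
# the second row of each pair gets NaN
df['dA'] = df.groupby('A')['C'].shift(-1) - df['B']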
I have a number of string-based columns in a pandas dataframe that I'm looking to use in scikit-learn classification models. I know I have to use OneHotEncoder to properly encode the variables, but first I want to reduce the variation in the columns by taking out either the strings that appear less than x% of the time in the column, or those that are not among the top x strings by count in the column.
Here's an example:
df1 = pd.DataFrame({'a':range(22), 'b':list('aaaaaaaabbbbbbbcccdefg'), 'c':range(22)})
df1
a b c
0 0 a 0
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
5 5 a 5
6 6 a 6
7 7 a 7
8 8 b 8
9 9 b 9
10 10 b 10
11 11 b 11
12 12 b 12
13 13 b 13
14 14 b 14
15 15 c 15
16 16 c 16
17 17 c 17
18 18 d 18
19 19 e 19
20 20 f 20
21 21 g 21
As you can see, a, b, and c appear in column b more than 10% of the time, so I'd like to keep them. On the other hand, d, e, f, and g appear less than 10% (actually about 5% of the time), so I'd like to bucket these by changing them into 'other':
df1['b']
0 a
1 a
2 a
3 a
4 a
5 a
6 a
7 a
8 b
9 b
10 b
11 b
12 b
13 b
14 b
15 c
16 c
17 c
18 other
19 other
20 other
21 other
I'd similarly like to be able to say that I only want to keep the values that appear in the top 2 in terms of frequency, so that column b looks like this:
df1['b']
0 a
1 a
2 a
3 a
4 a
5 a
6 a
7 a
8 b
9 b
10 b
11 b
12 b
13 b
14 b
15 other
16 other
17 other
18 other
19 other
20 other
21 other
I don't see an obvious way to do this in Pandas, although I admittedly know a lot more about this in R. Any ideas? Any thoughts on how to make this robust to Nones, which may appear more than 10% of the time or sit in the top x number of values?
This is kinda contorted, but it's kind of a complicated question.
First, get the counts:
In [24]: sizes = df1["b"].value_counts()
In [25]: sizes
Out[25]:
b
a 8
b 7
c 3
d 1
e 1
f 1
g 1
dtype: int64
Now, pick the indices you don't like:
In [27]: bad = sizes.index[sizes < df1.shape[0]*0.1]
In [28]: bad
Out[28]: Index([u'd', u'e', u'f', u'g'], dtype='object')
Finally, assign "other" to those rows containing bad indices:
In [34]: df1.loc[df1["b"].isin(bad), "b"] = "other"
In [36]: df1
Out[36]:
a b c
0 0 a 0
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
5 5 a 5
6 6 a 6
7 7 a 7
8 8 b 8
9 9 b 9
10 10 b 10
11 11 b 11
12 12 b 12
13 13 b 13
14 14 b 14
15 15 c 15
16 16 c 16
17 17 c 17
18 18 other 18
19 19 other 19
20 20 other 20
21 21 other 21
[22 rows x 3 columns]
value_counts() already sorts the counts in descending order, so you can take the first n index values to keep just the top n categories (see the snippet below).
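For instance, a sketch of that top-n variant (assuming a fresh df1 and sizes as computed in In [24]):

top2 = sizes.index[:2]                        # or sizes.nlargest(2).index
df1.loc[~df1["b"].isin(top2), "b"] = "other"  # bucket everything else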
Edit: you should be able to do something like this, replacing all instances of "b" with filterByColumn:
def filterDataFrame(df1, filterByColumn):
    sizes = df1[filterByColumn].value_counts()
    ...
Here's my solution:
def cleanupData(inputCol, fillString, cutoffPercent=None, cutoffNum=None):
    col = inputCol
    col.fillna(fillString, inplace=True)
    valueCounts = col.value_counts()
    totalAmount = valueCounts.sum()
    if cutoffPercent is not None and cutoffNum is not None:
        raise ValueError("both cutoffPercent and cutoffNum have values; please only give one of them")
    if cutoffPercent is not None:
        # keep only the values whose share of the column exceeds the percentage cutoff
        cutoffAmount = cutoffPercent * totalAmount
        valuesToKeep = valueCounts[valueCounts > cutoffAmount].index.tolist()
        print("keeping " + str(len(valuesToKeep)) + " unique values in the returned column")
    if cutoffNum is not None:
        # value_counts is sorted descending, so the first cutoffNum entries are the top values
        valuesToKeep = valueCounts.index.tolist()[0:cutoffNum]
    newlist = []
    for row in col:
        if row in valuesToKeep:  # exact match rather than substring containment
            newlist.append(row)
        else:
            newlist.append("Other")
    return newlist
##
cleanupData(df1['b'], "Other", cutoffNum=2)
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other']
Assign your frame to a variable (I call it x), then count how often each value occurs in column "b":
d = {s: (x["b"].values == s).sum() for s in set(x["b"].values)}
You can then use a boolean mask to assign a new value to column "b" wherever d[s] is below a certain threshold:
x.loc[x["b"].values == s, "b"] = "Call me Peter Maffay"
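Putting it together, a hedged, self-contained version of this idea (reusing df1 from the question and an assumed 10% cutoff):

x = df1.copy()
counts = x["b"].value_counts()
threshold = 0.1 * len(x)  # assumed 10% cutoff
for s, n in counts.items():
    if n < threshold:
        x.loc[x["b"] == s, "b"] = "other"  # replace rare values in column "b" only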