I'm trying to get the unique Available values for each Site. The original pandas DataFrame has three columns:
  Site  Available  Capacity
0    A          7        20
1    A          7        20
2    A          8        20
3    B         15        35
4    B         15        35
5    C         12        25
6    C         12        25
7    C         11        25
and I want the unique Available values for each Site. The desired table looks like this:
Site  Unique Available
A     7
      8
B     15
C     12
      11
You can get the lists of unique Available values per Site with GroupBy.unique():
>>> df.groupby('Site')['Available'].unique()
Site
A [7, 8]
B [15]
C [12, 11]
Name: Available, dtype: object
Then with explode() you can expand these lists, and with reset_index() get the index back as a column:
>>> df.groupby('Site')['Available'].unique().explode().reset_index()
Site Available
0 A 7
1 A 8
2 B 15
3 C 12
4 C 11
Otherwise simply get both columns and remove duplicates:
>>> df[['Site', 'Available']].drop_duplicates()
Site Available
0 A 7
2 A 8
3 B 15
5 C 12
7 C 11
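If you also want the column labeled Unique Available and a fresh 0-based index to match the desired table (an assumption about what you need downstream), you can chain a rename and reset_index onto the drop_duplicates variant; a minimal sketch:
>>> (df[['Site', 'Available']]
...     .drop_duplicates()
...     .rename(columns={'Available': 'Unique Available'})
...     .reset_index(drop=True))
  Site  Unique Available
0    A                 7
1    A                 8
2    B                15
3    C                12
4    C                11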
Another approach, with GroupBy.apply() + Series.drop_duplicates():
(df.groupby('Site')['Available']
.apply(lambda s: s.drop_duplicates())
.reset_index(level=1, drop=True)
.reset_index(name='Unique Available')
)
Result:
Site Unique Available
0 A 7
1 A 8
2 B 15
3 C 12
4 C 11
Related
I want to pivot this DataFrame and convert the existing columns into a second level of a MultiIndex (or into a regular column).
Original dataframe:
Type VC C B Security
0 Standard 2 2 2 A
1 Standard 16 13 0 B
2 Standard 52 35 2 C
3 RI 10 10 0 A
4 RI 10 15 31 B
5 RI 10 15 31 C
Desired dataframe:
Type A B C
0 Standard VC 2 16 52
1 Standard C 2 13 35
2 Standard B 2 0 2
3 RI VC 10 10 10
11 RI C 10 15 15
12 RI B 0 31 31
You could try as follows:
Use df.pivot and then transpose using df.T.
Next, chain df.sort_index to rearrange the entries, and apply df.swaplevel to change the order of the MultiIndex.
Lastly, consider removing Security as the columns.name, and adding an index name for the unnamed level, e.g. Subtype here.
If you want the MultiIndex as columns, you can of course simply use df.reset_index at this stage (a short example follows the output below).
res = (df.pivot(index='Security', columns='Type').T
.sort_index(level=[1,0], ascending=[False, False])
.swaplevel(0))
res.columns.name = None
res.index.names = ['Type','Subtype']
print(res)
A B C
Type Subtype
Standard VC 2 16 52
C 2 13 35
B 2 0 2
RI VC 10 10 10
C 10 15 15
B 0 31 31
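As mentioned above, if you would rather have Type and Subtype as ordinary columns instead of a MultiIndex, a final reset_index does that; a minimal sketch:
res = res.reset_index()
print(res)
       Type Subtype   A   B   C
0  Standard      VC   2  16  52
1  Standard       C   2  13  35
2  Standard       B   2   0   2
3        RI      VC  10  10  10
4        RI       C  10  15  15
5        RI       B   0  31  31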
I have a DataFrame like this:
Index  Time      Id
0      10:10:00  11
1      10:10:01  12
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
8      10:10:12  11
9      10:10:14  13
I want to compare the Id column for each pair of rows: row 0 with row 1, row 2 with row 3, and so on.
In other words, I want to compare even rows with odd rows and keep only the rows of pairs whose Ids match.
My ideal output would be:
Index  Time      Id
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
I tried this, but it did not work:
df = df[
df[::2]["id"] ==df[1::2]["id"]
]
You can use a GroupBy.transform approach:
import numpy as np

# for each pair of rows, is there only one distinct Id?
out = df[df.groupby(np.arange(len(df)) // 2)['Id'].transform('nunique').eq(1)]
Or, more efficiently, using the underlying NumPy array:
# convert to numpy
a = df['Id'].to_numpy()
# are the odds equal to evens?
out = df[np.repeat((a[::2]==a[1::2]), 2)]
Output:
Index Time Id
2 2 10:10:02 12
3 3 10:10:04 12
4 4 10:10:06 13
5 5 10:10:07 13
6 6 10:10:08 11
7 7 10:10:10 11
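The np.repeat approach assumes an even number of rows, which holds in the example. If your frame could end with an unpaired row (an assumption about your data, not shown in the question), a small guard keeps the mask aligned; a minimal sketch:
import numpy as np

a = df['Id'].to_numpy()
n = len(a) - (len(a) % 2)          # ignore a trailing unpaired row, if any
mask = np.zeros(len(a), dtype=bool)
mask[:n] = np.repeat(a[:n:2] == a[1:n:2], 2)
out = df[mask]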
I have this table:
a b c d e f 19-08-06 19-08-07 19-08-08 g h i
1 2 3 4 5 6 7 8 9 10 11 12
I have 34 date columns in total, so I want to melt all the date columns into a single column.
How can I do this in Python?
Thanks in advance.
You can use the .str.fullmatch string method to create a boolean mask for extracting the date columns, then use df.melt:
m = df.columns.str.fullmatch(r"\d{2}-\d{2}-\d{2}")
cols = df.columns[m]
df.melt(value_vars=cols, var_name='date', value_name='vals')
date vals
0 19-08-06 7
1 19-08-07 8
2 19-08-08 9
If you want to melt while keeping the other columns, then try this:
df.melt(
id_vars=df.columns.difference(cols), var_name="date", value_name="vals"
)
a b c d e f g h i date vals
0 1 2 3 4 5 6 10 11 12 19-08-06 7
1 1 2 3 4 5 6 10 11 12 19-08-07 8
2 1 2 3 4 5 6 10 11 12 19-08-08 9
Here I did not pass value_vars=cols, since it is implied once id_vars is set (per the docs):
value_vars: tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are
not set as id_vars.
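After melting, the date column still holds plain strings. If the headers really are two-digit-year yy-mm-dd dates (an assumption based on the sample, e.g. '19-08-06' meaning 2019-08-06), you could convert them with pd.to_datetime; a minimal sketch:
import pandas as pd

out = df.melt(id_vars=df.columns.difference(cols), var_name='date', value_name='vals')
out['date'] = pd.to_datetime(out['date'], format='%y-%m-%d')   # '19-08-06' -> 2019-08-06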
I have a number of string-based columns in a pandas DataFrame that I'm looking to use with scikit-learn classification models. I know I have to use OneHotEncoder to properly encode the variables, but first I want to reduce the variation in the columns by taking out either the strings that appear less than x% of the time in the column, or the strings that are not among the top x values by count.
Here's an example:
df1 = pd.DataFrame({'a':range(22), 'b':list('aaaaaaaabbbbbbbcccdefg'), 'c':range(22)})
df1
a b c
0 0 a 0
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
5 5 a 5
6 6 a 6
7 7 a 7
8 8 b 8
9 9 b 9
10 10 b 10
11 11 b 11
12 12 b 12
13 13 b 13
14 14 b 14
15 15 c 15
16 16 c 16
17 17 c 17
18 18 d 18
19 19 e 19
20 20 f 20
21 21 g 21
As you can see, a, b, and c appear in column b more than 10% of the time, so I'd like to keep them. On the other hand, d, e, f, and g appear less than 10% (actually about 5% of the time), so I'd like to bucket these by changing them into 'other':
df1['b']
0 a
1 a
2 a
3 a
4 a
5 a
6 a
7 a
8 b
9 b
10 b
11 b
12 b
13 b
14 b
15 c
16 c
17 c
18 other
19 other
20 other
21 other
I'd similarly like to be able to say that I only want to keep the values that appear in the top 2 in terms of frequency, so that column b looks like this:
df1['b']
0 a
1 a
2 a
3 a
4 a
5 a
6 a
7 a
8 b
9 b
10 b
11 b
12 b
13 b
14 b
15 other
16 other
17 other
18 other
19 other
20 other
21 other
I don't see an obvious way to do this in Pandas, although I admittedly know a lot more about this in R. Any ideas? Any thoughts on how to make this robust to Nones, which may appear more than 10% of the time or sit in the top x number of values?
This is kinda contorted, but it's kind of a complicated question.
First, get the counts:
In [24]: sizes = df1["b"].value_counts()
In [25]: sizes
Out[25]:
b
a 8
b 7
c 3
d 1
e 1
f 1
g 1
dtype: int64
Now, pick the indices you don't like:
In [27]: bad = sizes.index[sizes < df1.shape[0]*0.1]
In [28]: bad
Out[28]: Index(['d', 'e', 'f', 'g'], dtype='object')
Finally, assign "other" to those rows containing bad indices:
In [34]: df1.loc[df1["b"].isin(bad), "b"] = "other"
In [36]: df1
Out[36]:
a b c
0 0 a 0
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
5 5 a 5
6 6 a 6
7 7 a 7
8 8 b 8
9 9 b 9
10 10 b 10
11 11 b 11
12 12 b 12
13 13 b 13
14 14 b 14
15 15 c 15
16 16 c 16
17 17 c 17
18 18 other 18
19 19 other 19
20 20 other 20
21 21 other 21
[22 rows x 3 columns]
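Equivalently, value_counts(normalize=True) returns fractions instead of counts, so the 10% threshold can be written directly; a minimal sketch on the original frame:
freq = df1["b"].value_counts(normalize=True)
bad = freq.index[freq < 0.10]            # values appearing in fewer than 10% of rows
df1.loc[df1["b"].isin(bad), "b"] = "other"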
For the top-n variant, note that value_counts() already returns the counts sorted in descending order, so you can take the first n index entries to keep just the top n values.
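A minimal sketch of that (keeping only the top two values, everything else bucketed):
top = sizes.index[:2]                    # value_counts() sorts descending
df1.loc[~df1["b"].isin(top), "b"] = "other"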
Edit: you should be able to do something like this, replacing all instances of "b" with filterByColumn:
def filterDataFrame(df1, filterByColumn):
sizes = df1[filterByColumn].value_counts()
...
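A possible completion of that sketch (the cutoff and other parameters are placeholders I'm adding, not part of the original answer):
def filterDataFrame(df1, filterByColumn, cutoff=0.1, other="other"):
    # bucket values of filterByColumn that appear in fewer than `cutoff` (fraction) of rows
    sizes = df1[filterByColumn].value_counts()
    bad = sizes.index[sizes < df1.shape[0] * cutoff]
    df1.loc[df1[filterByColumn].isin(bad), filterByColumn] = other
    return df1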
Here's my solution:
def cleanupData(inputCol, fillString, cutoffPercent=None, cutoffNum=None):
    col = inputCol
    col.fillna(fillString, inplace=True)
    valueCounts = col.value_counts()
    totalAmount = sum(valueCounts)
    if cutoffPercent is not None and cutoffNum is not None:
        raise ValueError("both cutoff percent and number have values; please only give one of them")
    if cutoffPercent is None and cutoffNum is None:
        raise ValueError("please give either cutoffPercent or cutoffNum")
    if cutoffPercent is not None:
        cutoffAmount = cutoffPercent * totalAmount
        valuesToKeep = valueCounts[valueCounts > cutoffAmount].index.tolist()
        print("keeping " + str(len(valuesToKeep)) + " unique values in the returned column")
    if cutoffNum is not None:
        # value_counts() is sorted descending, so the first cutoffNum entries are the top ones
        valuesToKeep = valueCounts.index.tolist()[0:cutoffNum]
    newlist = []
    for row in col:
        if row in valuesToKeep:
            newlist.append(row)
        else:
            newlist.append("Other")
    return newlist
##
cleanupData(df1['b'], "Other", cutoffNum=2)
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other']
Assign your frame to a variable (I call it x), then count the occurrences of each value in column "b":
d = {s: sum(x["b"].values == s) for s in set(x["b"].values)}
You can then use such a mask to assign a new value to column "b" wherever d[s] is below a certain threshold:
x.loc[x["b"].values == s, "b"] = "Call me Peter Maffay"
I have a pandas DataFrame in the following format:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
I want to append a calculated row that performs some math based on a given items index value, e.g. adding a row that sums the values of all items with an index value < 2, with the new row having an index label of 'Red'. Ultimately, I am trying to add three rows that group the index values into categories:
A row with the sum of item values where index value are < 2, labeled as 'Red'
A row with the sum of item values where index values are 1 < x < 4, labeled as 'Blue'
A row with the sum of item values where index values are > 3, labeled as 'Green'
Ideal output would look like this:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Red 3 5 7
Blue 15 17 19
Green 27 29 31
My current solution involves transposing the DataFrame, applying a map function for each calculated column and then re-transposing, but I would imagine pandas has a more efficient way of doing this, likely using .append().
EDIT:
My inelegant pre-set-list solution (it originally used .transpose(), but I improved it using .groupby() and .append()):
df = pd.DataFrame(np.arange(18).reshape((6,3)),columns=['a', 'b', 'c'])
df['x'] = ['Red', 'Red', 'Blue', 'Blue', 'Green', 'Green']
df2 = df.groupby('x').sum()
df = df.append(df2)
del df['x']
I much prefer the flexibility of BrenBarn's answer (see below).
Here is one way:
def group(ix):
if ix < 2:
return "Red"
elif 2 <= ix < 4:
return "Blue"
else:
return "Green"
>>> print(d)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
>>> print(d.append(d.groupby(d.index.to_series().map(group)).sum()))
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Blue 15 17 19
Green 27 29 31
Red 3 5 7
For the general case, you need to define a function (or dict) to handle the mapping to different groups. Then you can just use groupby and its usual abilities.
For your particular case, it can be done more simply by directly slicing on the index value as Dan Allan showed, but that will break down if you have a more complex case where the groups you want are not simply definable in terms of contiguous blocks of rows. The method above will also easily extend to situations where the groups you want to create are not based on the index but on some other column (i.e., group together all rows whose value in column X is within range 0-10, or whatever).
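For instance, to group on value ranges in another column rather than on index position (a hypothetical variant of the example above, with made-up ranges), pd.cut can produce the mapping directly; a minimal sketch:
import pandas as pd

# one bucket for rows whose 'a' value falls in 0-10, another for the rest
ranges = pd.cut(d['a'], bins=[0, 10, d['a'].max()], labels=['low', 'high'], include_lowest=True)
d.groupby(ranges).sum()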
The role of "transpose," which you say you used in your unshown solution, might be played more naturally by the orient keyword argument, which is available when you construct a DataFrame from a dictionary.
In [23]: df
Out[23]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
In [24]: dict = {'Red': df.loc[:1].sum(),
'Blue': df.loc[2:3].sum(),
'Green': df.loc[4:].sum()}
In [25]: DataFrame.from_dict(dict, orient='index')
Out[25]:
a b c
Blue 15 17 19
Green 27 29 31
Red 3 5 7
In [26]: df.append(_)
Out[26]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Blue 15 17 19
Green 27 29 31
Red 3 5 7
Based on the numbers in your example, I assume that by "> 4" you actually meant ">= 4".
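One modern note: DataFrame.append was deprecated and then removed in pandas 2.0, so with a recent pandas the same result is obtained with pd.concat; a minimal sketch reusing the group() helper from above:
import pandas as pd

summary = df.groupby(df.index.to_series().map(group)).sum()
out = pd.concat([df, summary])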