I have a table in pandas df1
id value
1 1500
2 -1000
3 0
4 50000
5 50
I also have another table in dataframe df2 that contains the upper boundaries of the groups, so essentially every row represents an interval from the previous boundary to the current one (the first interval is "<0"):
group upper
0 0
1 1000
2 NaN
How should I get the relevant group for each value in df1, using the intervals from df2? I can't use join, merge, etc., because the rule for this join should be "if value is between the previous upper and the current upper" rather than "if value equals something". The only way I've found is using a predefined function with df.apply() (it also handles a case of categorical values when interval_flag==False):
import math

def values_to_group(x, interval_flag, groups_def):
    if interval_flag == True:
        for ind, gr in groups_def.sort_values(by='group').iterrows():
            if x < gr[1]:
                return gr[0]
            elif math.isnan(gr[1]) == True:
                return gr[0]
    else:
        for ind, gr in groups_def.sort_values(by='group').iterrows():
            if x in gr[1]:
                return gr[0]
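For reference, the call then looks roughly like this (a sketch; it assumes interval_flag=True and df2 as the boundary table):
df1['group'] = df1['value'].apply(values_to_group, interval_flag=True, groups_def=df2)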
Is there an easier/more optimal way to do it?
The expected output should be this:
id value group
1 1500 2
2 -1000 0
3 0 1
4 50000 2
5 50 1
I suggest using cut with df2 sorted by upper and the last NaN replaced with np.inf:
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'group':[0,1,2], 'upper':[0,1000,np.nan]})
df2 = df2.sort_values('upper')
df2['upper'] = df2['upper'].replace(np.nan, np.inf)
print (df2)
group upper
0 0 0.000000
1 1 1000.000000
2 2 inf
# add the first bin edge, -np.inf
bins = np.insert(df2['upper'].values, 0, -np.inf)
df1['group'] = pd.cut(df1['value'], bins=bins, labels=df2['group'], right=False)
print (df1)
id value group
0 1 1500 2
1 2 -1000 0
2 3 0 1
3 4 50000 2
4 5 50 1
Here's a solution using numpy.digitize. Your only task is to construct the bins and names input lists, which should be possible from an input dataframe.
import pandas as pd, numpy as np
df = pd.DataFrame({'val': [99, 53, 71, 84, 84]})
df['ratio'] = df['val']/ df['val'].shift() - 1
bins = [-np.inf, 0, 0.2, 0.4, 0.6, 0.8, 1.0, np.inf]
names = ['<0', '0.0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8', '0.8-1.0', '>1']
d = dict(enumerate(names, 1))
df['Bucket'] = list(map(d.get, np.digitize(df['ratio'], bins)))
print(df)
val ratio Bucket
0 99 NaN None
1 53 -0.464646 <0
2 71 0.339623 0.2-0.4
3 84 0.183099 0.0-0.2
4 84 0.000000 0.0-0.2
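For the original question's data, the bins and names lists could be built from df2 along these lines (a sketch; it assumes df2 is sorted by group and maps the open-ended last interval to np.inf):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'value': [1500, -1000, 0, 50000, 50]})
df2 = pd.DataFrame({'group': [0, 1, 2], 'upper': [0, 1000, np.nan]})

# bin edges derived from df2: [-inf, 0, 1000, inf]
upper = df2.sort_values('group')['upper'].fillna(np.inf).to_numpy()
bins = np.concatenate(([-np.inf], upper))
names = df2.sort_values('group')['group'].tolist()

d = dict(enumerate(names, 1))
df1['group'] = list(map(d.get, np.digitize(df1['value'], bins)))
print(df1)   # groups come out as 2, 0, 1, 2, 1, matching the expected output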
I am looking to add a column that counts consecutive positive numbers and resets the counter on finding a negative number in a pandas dataframe. I might be able to loop through it with a 'for' statement, but I just know there is a better solution. I have looked at various similar posts that ask almost the same thing, but I just cannot get those solutions to work on my problem.
I have:
Slope
-25
-15
17
6
0.1
5
-3
5
1
3
-0.1
-0.2
1
-9
What I want:
Slope Count
-25 0
-15 0
17 1
6 2
0.1 3
5 4
-3 0
5 1
1 2
3 3
-0.1 0
-0.2 0
1 1
-9 0
Please keep in mind that this is a low-skill-level question. If there are multiple steps in your proposed solution, please explain each one. I would like an answer, but would prefer to understand the 'how'.
You first want to mark the positions where new segments (i.e., groups) start:
>>> df['Count'] = df.Slope.lt(0)
>>> df.head(7)
Slope Count
0 -25.0 True
1 -15.0 True
2 17.0 False
3 6.0 False
4 0.1 False
5 5.0 False
6 -3.0 True
Now you need to label each group using the cumulative sum: since True is evaluated as 1 in arithmetic operations, the cumulative sum labels each segment with an incrementing integer. (This is a very powerful concept in pandas!)
>>> df['Count'] = df.Count.cumsum()
>>> df.head(7)
Slope Count
0 -25.0 1
1 -15.0 2
2 17.0 2
3 6.0 2
4 0.1 2
5 5.0 2
6 -3.0 3
Now you can use groupby to access each segment; then all you need to do is generate an incrementing sequence starting at zero for each group. There are many ways to do that; I'd just use the reset index of each group, i.e., reset the index, take the fresh RangeIndex starting at 0, and turn it into a series:
>>> df.groupby('Count').apply(lambda x: x.reset_index().index.to_series())
Count
1 0 0
2 0 0
1 1
2 2
3 3
4 4
3 0 0
1 1
2 2
3 3
4 0 0
5 0 0
1 1
6 0 0
This results in the expected counts, but note that the final index doesn't match the original dataframe, so we need another reset_index() with drop=True to discard the grouped index before assigning this back to our original dataframe:
>>> df['Count'] = df.groupby('Count').apply(lambda x:x.reset_index().index.to_series()).reset_index(drop=True)
Et voilà:
>>> df
Slope Count
0 -25.0 0
1 -15.0 0
2 17.0 1
3 6.0 2
4 0.1 3
5 5.0 4
6 -3.0 0
7 5.0 1
8 1.0 2
9 3.0 3
10 -0.1 0
11 -0.2 0
12 1.0 1
13 -9.0 0
We can solve the problem by looping through all the rows and using the loc feature in pandas, assuming that you already have a dataframe named df with a column called slope. The idea is that we sequentially add one to the previous row's count, but whenever slope_i < 0 the result is multiplied by 0.
df['new_col'] = 0  # just preset everything to be zero
for i in range(1, len(df)):
    df.loc[i, 'new_col'] = (df.loc[i-1, 'new_col'] + 1) * (df.loc[i, 'slope'] >= 0)
You can do this by using the groupby command. It requires some steps, which could probably be shortened, but it works this way.
First, you create a reset column by finding the negative numbers:
# create reset condition
df['reset'] = df.slope.lt(0)
Then you create groups by applying cumsum() to these resets: at this point every run of positive values gets a unique group value. The last line here assigns all negative numbers to group 0.
# create groups of positive values
df['group'] = df.reset.cumsum()
df.loc[df['reset'], 'group'] = 0
Now you take the groups of positives and cumsum a column of ones (there MUST be a better solution than that) to get your result. The last line again cleans up the results for negative values.
# sum ones :-D
df['count'] = 1
df['count'] = df.groupby('group')['count'].cumsum()
df.loc[df['reset'], 'count'] = 0
It is not a neat one-liner, but especially for larger datasets it should be faster than iterating through the whole dataframe.
For easier copy & paste, here is the whole thing (including some commented lines which replace the lines before them; they make it shorter but harder to understand):
import pandas as pd
## create data
slope = [-25, -15, 17, 6, 0.1, 5, -3, 5, 1, 3, -0.1, -0.2, 1, -9]
df = pd.DataFrame(data=slope, columns=['slope'])
## create reset condition
df['reset'] = df.slope.lt(0)
## create groups of positive values
df['group'] = df.reset.cumsum()
df.loc[df['reset'], 'group'] = 0
# df['group'] = df.reset.cumsum().mask(df.reset, 0)
## sum ones :-D
df['count'] = 1
df['count'] = df.groupby('group')['count'].cumsum()
df.loc[df['reset'], 'count'] = 0
# df['count'] = df.groupby('group')['count'].cumsum().mask(df.reset, 0)
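As a side note on the "there MUST be a better solution" remark: the running count per group can also be produced with groupby().cumcount(), for example (a sketch, not part of the original answer):
reset = df.slope.lt(0)
df['count'] = df.groupby(reset.cumsum()).cumcount().where(~reset, 0)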
IMO, solving this problem iteratively is the only way because there is a condition that has to be met. You can use any iterative construct such as for or while. Solving this problem with map will be troublesome, since each element still depends on the previously computed one.
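A minimal sketch of the kind of loop described here (assuming the question's df with its 'Slope' column; not code from the original answer):
counts = []
running = 0
for slope in df['Slope']:
    running = running + 1 if slope >= 0 else 0  # reset the counter on a negative value
    counts.append(running)
df['Count'] = counts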
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
'b': [1,1,1,0,0,0,0],
})
grouped = df.groupby('b')
Now sample from each group; e.g., I want 30% from group b = 1 and 20% from group b = 0. How should I do that?
And if I want to have 150% for some group, can I do that?
You can dynamically return a random sample dataframe with a different percentage of samples per group. You can do this with percentages below 100% (see example 1) AND above 100% (see example 2) by passing replace=True:
Using np.select, create a new column c that holds the fraction of rows per group to be sampled randomly, according to a percentage (20%, 40%, etc.) that you set.
From there, you can sample x rows per group based on these percentage conditions. From these rows, return the .index of the rows and filter the original dataframe with .loc, keeping the columns 'a' and 'b'. The code grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])) creates a multiindex series of the output you are looking for, but it requires some cleanup. This is why, for me, it is easier to grab the .index and filter the original dataframe with .loc rather than try to clean up the messy multiindex series.
grouped = df.groupby('b', group_keys=False)
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)], [0.4, 0.2])
df.loc[grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])).index, ['a','b']]
Out[1]:
a b
6 7 0
8 9 0
3 4 1
If you would like to return a larger random sample using duplicates of the existing values, simply pass replace=True. Then do some cleanup to get the output.
grouped = df.groupby('b', group_keys=False)
v = df['b'].value_counts()
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)],
                    [int(v.loc[0] * 1.2), int(v.loc[1] * 2)])
# the frac parameter doesn't work with sample when frac > 1, so we have to
# calculate the integer number of rows to be sampled instead
(grouped.apply(lambda x: x['b'].sample(x['c'].iloc[0], replace=True))
.reset_index()
.rename({'index' : 'a'}, axis=1))
Out[2]:
a b
0 7 0
1 8 0
2 9 0
3 7 0
4 7 0
5 8 0
6 1 1
7 3 1
8 3 1
9 1 1
10 0 1
11 0 1
12 4 1
13 2 1
14 3 1
15 0 1
You can get a DataFrame from the GroupBy object with, e.g. grouped.get_group(0). If you want to sample from that you can use the .sample method. For instance grouped.get_group(0).sample(frac=0.2) gives:
a
5 6
For the example you give, both samples will only return one element, because the groups have 4 and 3 elements and 0.2*4 = 0.8 and 0.3*3 = 0.9 both round to 1.
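To sample with a different fraction per group, the per-group samples can be combined, for example (a sketch using the question's df and the 20%/30% fractions asked about):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
grouped = df.groupby('b')

fractions = {0: 0.2, 1: 0.3}   # fraction to sample from each group
sampled = pd.concat([grouped.get_group(g).sample(frac=f) for g, f in fractions.items()])
print(sampled)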
I have a data frame with 144 rows and 48 columns. It contains results from various prediction models as either 1 or 0. I want to go through each row, find the percentage of 1's in that row, and add a new column with 1 if the percentage is greater than 80, else 0.
I know how to do this in Excel with IF and COUNTIF/COUNT, but here I don't really know how to do it. I hope I provided enough info; I am sorry if I did not. Thank you very much for any advice.
You can find the percentage of 1's in each row with:
df['percentage'] = df.mean(axis=1)
Then to create your new binary column you can use np.where:
df['new'] = np.where(df['percentage'] > 0.8, 1, 0)
This works the same way as the Excel =IF(condition, value if true, value if false).
Example with dummy data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1':[0,0,1],'var2':[0,1,1], 'var3':[1,1,1]})
df['percentage'] = df.mean(axis=1)
df['new'] = np.where(df['percentage'] > 0.8, 1, 0)
print(df)
Output:
var1 var2 var3 percentage new
0 0 0 1 0.333333 0
1 0 1 1 0.666667 0
2 1 1 1 1.000000 1
You can use .sum and cast to int if you prefer that over booleans. To set the value of the column lots_of_ones to 1 if the percentage of 1s in the other columns is greater than a threshold, you can do:
import pandas as pd
threshold = 0.8
df = pd.DataFrame([[0,0,0,0],[0,1,1,1], [1,1,1,1]])
df["lots_of_ones"] = (df.sum(axis=1) / df.columns.shape[0] > threshold).astype(int)
Result
>>> df
0 1 2 3 lots_of_ones
0 0 0 0 0 0
1 0 1 1 1 0
2 1 1 1 1 1
Having the following Data Frame:
name value count total_count
0 A 0 1 20
1 A 1 2 20
2 A 2 2 20
3 A 3 2 20
4 A 4 3 20
5 A 5 3 20
6 A 6 2 20
7 A 7 2 20
8 A 8 2 20
9 A 9 1 20
----------------------------------
10 B 0 10 75
11 B 5 30 75
12 B 6 20 75
13 B 8 10 75
14 B 9 5 75
I would like to pivot the data, grouping each row by the name value, then create columns based on the value & count columns aggregated into bins.
Explanation: I have 10 possible values, in the range 0-9, and not all the values are present in each group. In the above example, group B is missing the values 1, 2, 3, 4, 7. I would like to create a histogram with 5 bins, ignore missing values, and calculate the percentage of count for each bin. So the result will look like this:
name 0-1 2-3 4-5 6-7 8-9
0 A 0.150000 0.2 0.3 0.2 0.150000
1 B 0.133333 0.0 0.4 0.4 0.066667
For example, for bin 0-1 of group A, the calculation is the sum of count for the values 0 and 1 (1+2) divided by the total_count of group A:
name 0-1
0 A (1+2)/20 = 0.15
I was looking into the hist method and this StackOverflow question, but I am still struggling to figure out the right approach.
Use pd.cut to bin your feature, then use df.groupby().count() and the .unstack() method to get the dataframe you are looking for. During the groupby you can use any aggregation function (.sum(), .count(), etc.) to get the results you are looking for. The code below works as an example.
import pandas as pd
import numpy as np
df = pd.DataFrame(
data ={'name': ['Group A','Group B']*5,
'number': np.arange(0,10),
'value': np.arange(30,40)})
df['number_bin'] = pd.cut(df['number'], bins=np.arange(0,10))
# Option 1: Sums
df.groupby(['number_bin','name'])['value'].sum().unstack(0)
# Options 2: Counts
df.groupby(['number_bin','name'])['value'].count().unstack(0)
The null values in the original data will not affect the result.
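Applied to the question's data, the same cut/groupby/unstack pattern could look roughly like this (a sketch; the bin edges and labels are assumptions):
import pandas as pd

df = pd.DataFrame({
    'name': ['A'] * 10 + ['B'] * 5,
    'value': list(range(10)) + [0, 5, 6, 8, 9],
    'count': [1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 10, 30, 20, 10, 5],
    'total_count': [20] * 10 + [75] * 5,
})

# five bins over the 0-9 value range, labelled as in the desired output
binned = pd.cut(df['value'], bins=[0, 2, 4, 6, 8, 10],
                labels=['0-1', '2-3', '4-5', '6-7', '8-9'], right=False)

pct = (df.groupby(['name', binned], observed=False)['count'].sum()
         .unstack()
         .div(df.groupby('name')['total_count'].first(), axis=0))
print(pct)   # per-bin share of count for each name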
To get the exact result you could try this.
bins=range(10)
res = df.groupby('name')['count'].sum()
intervals = pd.cut(df.value, bins=bins, include_lowest=True)
df1 = (df.groupby([intervals,"name"])['count'].sum()/res).unstack(0)
df1.columns = df1.columns.astype(str) # convert the cols to string
df1.columns = ['a','b','c','d','e','f','g','h','i'] # rename the cols
cols = ['a',"b","d","f","h"]
df1 = df1.add(df1.iloc[:,1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)
You can manually rename the cols later.
# Output:
a b d f h
name
A 0.150000 0.2 0.3 0.200000 0.15
B 0.133333 NaN 0.4 0.266667 0.20
You can replace the NaN values using df1.fillna(0.0).
I have the code below that creates a summary table of missing values in each column of my data frame. I wish I could build a similar table to count unique values, but DataFrame does not have a unique() method; only each column does individually.
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    return mis_val_table_ren_columns
(source: https://stackoverflow.com/a/39734251/7044473)
How can I accomplish the same for unique values?
You can use the nunique() method to get the unique count of all columns:
df = pd.DataFrame(np.random.randint(0, 3, (4, 3)))
print(df)
0 1 2
0 2 0 2
1 1 2 1
2 1 2 2
3 1 1 2
count=df.nunique()
print(count)
0 2
1 3
2 2
dtype: int64
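Building on that, a summary table analogous to missing_values_table above could be put together like this (a sketch, not part of the original answer):
import pandas as pd

def unique_values_table(df):
    # count of distinct values per column and their share of the total number of rows
    uniq = df.nunique()
    uniq_percent = 100 * uniq / len(df)
    table = pd.concat([uniq, uniq_percent], axis=1)
    return table.rename(columns={0: 'Unique Values', 1: '% of Total Rows'})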
You can create a series of unique value counts using the pd.unique function. For example:
>>> df = pd.DataFrame(np.random.randint(0, 3, (4, 3)))
>>> print(df)
0 1 2
0 2 0 2
1 1 2 1
2 1 2 2
3 1 1 2
>>> pd.Series({col: len(pd.unique(df[col])) for col in df})
0 2
1 3
2 2
dtype: int64
If you actually want the number of times each value appears in each column, you can do a similar thing with pd.value_counts:
>>> pd.DataFrame({col: pd.value_counts(df[col]) for col in df}).fillna(0)
0 1 2
0 0.0 1 0.0
1 3.0 1 1.0
2 1.0 2 3.0
This is not exactly what you asked for, but may be useful for your analysis.
def diversity_percentage(df, columns):
    """
    This function returns the number of different elements in each column as a percentage of the total elements in the group.
    A low value indicates there are many repeated elements.
    Example 1: a value of 0 indicates all values are the same.
    Example 2: a value of 100 indicates all values are different.
    """
    diversity = dict()
    for col in columns:
        diversity[col] = len(df[col].unique())
    diversity_series = pd.Series(diversity)
    return (100 * diversity_series / len(df)).sort_values()
>>> diversity_percentage(df, selected_columns)
operationdate 0.002803
payment 1.076414
description 16.933901
customer_id 17.536581
customer_name 48.895554
customer_email 62.129282
token 68.290632
id 100.000000
transactionid 100.000000
dtype: float64
However, you can always return diversity_series directly to obtain just the counts.