I'd like to generate random numbers from 1 to n based on the ID column in my DataFrame. Repeating values in this ID column should have the same random number. A Random number should be assigned to more than 1 ID, but the number of IDs belonging to each random number should be equal or the most equal as possible. I'd also like a seed value that way I can replicate the results.
A very simple example is let's say I have an ID column with values of A,B,C,D,E. I'd like to assign a random number of 1 to 2. So in this example, IDs A,B,E would be assigned to random number 1 and IDs C,D to 2.
ID Random
A 1
C 2
A 1
B 1
E 1
D 2
Also, I have a very large DataFrame so speed is very important.
Update: What I tried previously was getting a unique list of the IDs then generating random numbers for each, but I made a DataFrame and tried to merge the 2 DataFrames which was too time consuming.
Thanks to S3DEV he suggested mapping a dictionary to the column which was a lot faster.
ID_list = df['ID'].unique()
random_list = np.random.randint(1, 2, size=len(ID_list))
dic = {ID_list[i]: random_list[i] for i in range(len(ID_list))}
df['Random'] = df['ID'].map(dic)
To fix your approach (i.e. create side dataframe):
n = 10
ids = df[["ID"]].drop_duplicates()
ids["Random"] = np.random.randint(1, n, len(ids))
ids.set_index("ID", inplace=True)
df.set_index("ID", inplace=True)
df["Random"] = ids["Random"]
df.reset_index(inplace=True)
Outputs:
ID Random
0 A 6
1 C 7
2 A 6
3 B 4
4 E 1
5 D 6
Related
I ultimately want to count the number of months in a given range per each user. For example, see below, user 1 has 1 range of data from April 2021-June 2021. Where I'm struggling is counting users that multiple ranges (see users 3 & 4).
I have a pandas df w/ columns that looks like these:
username Jan_2021 Feb_2021 March_2021 April_2021 May_2021 June_2021 July_2021 Sum_of_Months
user 1 0 0 0 1 1 1 0 3
user 2 0 0 0 0 0 0 1 1
user 3 1 1 1 0 1 1 0 5
user 4 0 1 1 1 0 1 1 5
Id like to be able to get a summary column that says the number of groups and their count.
For example:
When I say num of groups I mean the amount of grouped 1's together. and when I say length of group I mean the amount of months in 1 group, like if I were to draw a circle around the 1s. For example, user 1 is 3 because there's a 1 in columns April-June 2021
username Num_of_groups Lenth_of_group
user 1 1 3
user 2 1 1
user 3 2 3,2
user 4 2 3,2
You can try with groupby function from itertools
from itertools import groupby
df1 = df[[col for col in df.columns if "2021" in col]]
df["Lenth_of_group"] = df1.apply(lambda x: [sum(g) for i, g in groupby(x) if i == 1],axis=1)
df["Num_of_groups"] = df["Lenth_of_group"].apply(lambda x: len(x))
Hope this Helps...
This solution uses staircase, and works by treating each users data as a step function (of 1s and 0s)
setup
import pandas as pd
df = pd.DataFrame(
{
"username":["user 1", "user 2", "user 3", "user 4"],
"Jan_2021":[0,0,1,0],
"Feb_2021":[0,0,1,0],
"Mar_2021":[0,0,1,1],
"April_2021":[1,0,0,1],
"May_2021":[1,0,1,0],
"June_2021":[1,0,1,1],
"July_2021":[0,1,1,0],
"Sum_of_Months":[3,1,5,5],
}
)
solution
import staircase as sc
# trim down to month columns only, and transpose to make users correspond to columns, and months correspond to rows
data = df[["Jan_2021", "Feb_2021", "Mar_2021", "April_2021", "May_2021", "June_2021", "July_2021"]].transpose().reset_index(drop=True)
def extract_groups(series):
return (
sc.Stairs.from_values(initial_value=0, values=series) # create step function for each user
.clip(0, len(series)) # clip step function to region of interest
.to_frame() # represent data as start/stop intervals in a dataframe
.query("value==1") # filter for groups of 1s
.eval("dist=end-start") # calculate the length of each "group"
["dist"].to_list() # convert the result from Series to list
)
sfs = data.columns.to_series().apply(lambda c: extract_groups(data[c]))
sfs is a pandas.Series where the values are lists representing number of groups and the lengths of each. It looks like this:
0 [3]
1 [1]
2 [3, 3]
3 [2, 1]
dtype: object
You can use it to create the data you need, eg
df["Num_of_groups"] = sfs.apply(list.__len__)
adds the Num_of_groups column to your original dataframe
Disclaimer: I am author of staircase
I am going to merge two datasets soon by 3 columns.
The hope is that there are no/few 3 column group repeats in the original dataset. I would like to produce something that says approximately how unique each row is. Like maybe some kind of frequency plot (might not work as I have a very large dataset), maybe a table that displays the average frequency for each .5million rows or something like that.
Is there a way to determine how unique each row is compared to the other rows?
1 2 3
A 100 B
A 200 B
A 200 B
Like for the above data frame, I would like to say that each row is unique
1 2 3
A 200 B
A 200 B
A 100 B
For this data set, rows 1 and 2 are not unique. I don't want to drop one, but I am hoping to quantify/weight the amount of non-unique rows.
The problem is my dataframe is 14,000,000 lines long, so I need to think of a way I can show how unique each row is on a set this big.
Assuming you are using pandas, here's one possible way:
import pandas as pd
# Setup, which you can probably skip since you already have the data.
cols = ["1", "2", "3"]
rows = [
["A", 200, "B",],
["A", 200, "B",],
["A", 100, "B",],
]
df1 = pd.DataFrame(rows, columns=cols)
# Get focus column values before adding a new column.
key_columns = df1.columns.values.tolist()
# Add a line column
df1["line"] = 1
# Set new column to cumulative sum of line values.
df1["match_count"] = df1.groupby(key_columns )['line'].apply(lambda x: x.cumsum())
# Drop line column.
df1.drop("line", axis=1, inplace=True)
Print results
print(df1)
Output -
1 2 3 match_count
0 A 200 B 1
1 A 200 B 2
2 A 100 B 1
Return only unique rows:
# We only want results where the count is less than 2,
# because we have our key columns saved, we can just return those
# and not worry about 'match_count'
df_unique = df1.loc[df1["match_count"] < 2, key_columns]
print(df_unique)
Output -
1 2 3
0 A 200 B
2 A 100 B
I have a DataFrame which I want to groupby with a few columns. I know how to aggregate the data after that, or view each index tuple. However, I am unsure of the best way to just append the "group number" of each group in a column on the original dataframe:
For example, I have a dataframe, df, with two indices (a_id and b_id) which I want to use for grouping the df using groupby.
import pandas as pd
a = pd.DataFrame({'a_id':['q','q','q','q','q','r','r','r','r','r'],
'b_id':['m','m','j','j','j','g','g','f','f','f'],
'val': [1,2,3,4,5,6,7,8,9,8]})
# Output:
a_id b_id val
0 q m 1
1 q m 2
2 q j 3
3 q j 4
4 q j 5
5 r g 6
6 r g 7
7 r f 8
8 r f 9
9 r f 8
When I do the groupby, rather than aggregate everything, I just want to add a column group_id that has an integer representing the group. However, I am not sure if there is a simple way to do this. My current solution involves inverting the GroupBy.indices dictionary, turning that into a series, and appending it to the dataframe as follows:
gb = a.groupby(['a_id','b_id'])
dict_g = dict(enumerate(gb.indices.values()))
dict_g_reversed = {x:k for k,v in dict_g.items() for x in v}
group_ids = pd.Series(dict_g_reversed)
a['group_id'] = group_ids
This gives me sort of what I want, although the group_id indices are not in the right order. This seems like it should be a simple function, but I'm not sure why it seems not to be. I know in MATLAB, for example, they have a findgroups that does exactly what I would like. So far I haven't been able to find an equivalent in pandas. How can this be done with a pd DataFrame?
You can using ngroup this will provide the order as occurrence
a.groupby(['a_id','b_id']).ngroup()
Or using factorize
pd.factorize(list(map(tuple,a[['a_id','b_id']].values.tolist())))[0]+1
df['newid']=pd.factorize(list(map(tuple,a.values.tolist())))[0]+1
I have seen a variant of this question asked that keeps the top n rows of each group in a pandas dataframe and the solutions use n as an absolute number rather than a percentage here Pandas get topmost n records within each group. However, in my dataframe, each group has different numbers of rows in it and I want to keep the top n% rows of each group. How would I approach this problem?
You can construct a Boolean series of flags and filter before you groupby. First let's create an example dataframe and look at the number of row for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resultant dataframe only has 3 0 indices and 2 1 indices, in each case half the number in the original dataframe.
Here is another option which builds on some of the answers in the post you mentioned
First of all here is a quick function to either round up or round down. If we want the top 30% of rows of a dataframe 8 rows long then we would try to take 2.4 rows. So we will need to either round up or down.
My preferred option is to round up. This is because, for eaxample, if we were to take 50% of the rows, but had one group which only had one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish
def round_func(x, up=True):
'''Function to round up or round down a float'''
if up:
return int(x+1)
else:
return int(x)
Next I make a dataframe to work with and set a parameter p to be the fraction of the rows from each group that we should keep. Everything follows and I have commented it so that hopefully you can follow.
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
p = 0.30 # top fraction to keep. Currently set to 80%
df_top = df.groupby('id').apply( # group by the ids
lambda x: x.reset_index()['value'].nlargest( # in each group take the top rows by column 'value'
round_func(x.count().max()*p))) # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1) # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1
This question is similar to this other question.
I have a pandas dataframe. I want to split it into groups, and select an arbitrary member of each group, defined elsewhere.
Example: I have a dataframe that can be divided in 6 groups of 4 observations each. I want to extract the observations according to:
selected = [0,3,2,3,1,3]
This is very similar to
df.groupy('groupvar').nth(n)
But, crucially, n varies for each group according to the selected list.
Thanks!
Typically everything that you do within groupby should be group independent. So, within any groupby.apply(), you will only get the group itself, not the context. An alternative is to compute the index value for the whole sample (following, index) out of the indices for the groups (here, selected). Note that the dataset is sorted by groups, which you need to do if you want to apply the following.
I use test, out of which I want to select selected:
In[231]: test
Out[231]:
score
name
0 A -0.208392
1 A -0.103659
2 A 1.645287
0 B 0.119709
1 B -0.047639
2 B -0.479155
0 C -0.415372
1 C -1.390416
2 C -0.384158
3 C -1.328278
selected = [0, 2, 1]
c = test.groupby(level=1).count()
In[242]: index = c.shift(1).cumsum().add(array([selected]).T, fill_value=0)
In[243]: index
Out[243]:
score
name
A 0
B 5
C 4
In[255]: test.iloc[index.values[:,0]]
Out[255]:
score
name
0 A -0.208392
2 B -0.479155
1 C -1.390416