I have some data as follows:
+--------+------+
| Reason | Keys |
+--------+------+
| x | a |
| y | a |
| z | a |
| y | b |
| z | b |
| x | c |
| w | d |
| x | d |
| w | d |
+--------+------+
I want to get the Reason corresponding to the first occurrence of each Key. In this example, I should get Reasons x, y, x, w for Keys a, b, c, d respectively. After that, I want to compute the percentage of each Reason, as a metric for how often each Reason occurs. Thus x = 2/4 = 50%, and w and y = 25% each.
For the percentage, I think I can use something like value_counts(normalize=True) * 100, based on the previous step. What is a good way to proceed?
You are right about the second step and the first step could be achieved by
summary = df.groupby("Keys").first()
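As a minimal sketch of both steps together, assuming the data is loaded into a DataFrame named df with the columns shown in the question (the names first_reason and percentages are only illustrative):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Reason": ["x", "y", "z", "y", "z", "x", "w", "x", "w"],
    "Keys":   ["a", "a", "a", "b", "b", "c", "d", "d", "d"],
})

# Step 1: the Reason from the first row of each Key
# (rows inside each group keep their original order).
first_reason = df.groupby("Keys")["Reason"].first()

# Step 2: share of each Reason among those first occurrences, in percent.
percentages = first_reason.value_counts(normalize=True) * 100
print(percentages)
```

Note that first() returns the first non-null value per group, so a missing Reason in the first row of a Key would be skipped rather than reported.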
You can also use drop_duplicates on Keys, which keeps the first row for each key:
df.drop_duplicates(['Keys'])
Out[207]:
Reason Keys
0 x a
3 y b
5 x c
6 w d
I have some data stored in my pandas dataframe that shows salary for a bunch of users and their category.
| category | user_id | salary |
|----------|-----------|--------|
| A | 546457568 | 49203 |
| C | 356835679 | 49694 |
| A | 356785637 | 48766 |
| B | 45668758 | 36627 |
| C | 686794 | 59508 |
| C | 234232376 | 32765 |
| C | 4356345 | 44058 |
| A | 9878987 | 9999999|
What I would like to do is generate a new column salary_bucket that shows a bucket for salary, determined from the upper/lower limits of the interquartile range (IQR) for salary.
E.g. calculate the upper/lower limits as q1 - 1.5 x iqr and q3 + 1.5 x iqr, then split this range into 10 equal buckets and assign each row to the relevant bucket based on salary. I know from exploration that there is no data below the lower limit, but for data above the upper limit I would like a separate bucket such as outside_iqr.
In the end I would like to get something like this:
| category | user_id | salary | salary_bucket |
|----------|-----------|--------|---------------|
| A | 546457568 | 49203 | 7 |
| C | 356835679 | 49694 | 7 |
| A | 356785637 | 48766 | 7 |
| B | 45668758 | 36627 | 3 |
| C | 686794 | 59508 | 5 |
| C | 234232376 | 32765 | 3 |
| C | 4356345 | 44058 | 4 |
| A | 9878987 | 9999999|outside_iqr |
(These buckets are not actually calculated; they are just for illustration.)
Is something like qcut useful here?
You can use pandas.cut to turn continuous data into categorical data.
# First, we need to calculate our IQR.
q1 = df.salary.quantile(0.25)
q3 = df.salary.quantile(0.75)
iqr = q3 - q1
# Now let's calculate upper and lower bounds.
lower = q1 - 1.5*iqr
upper = q3 + 1.5*iqr
# Let us create our bins:
num_bins = 10
bin_width = (upper - lower) / num_bins
bins = [lower + i*bin_width for i in range(num_bins)]
bins += [upper, float('inf')] # Now we add our last bin, which will contain any value greater than the upper-bound of the IQR.
# Let us create our labels:
labels = [f'Bucket {i}' for i in range(1,num_bins+1)]
labels.append('Outside IQR')
# Finally, we add a new column to the df:
df['salary_bucket'] = pd.cut(df.salary, bins=bins, labels=labels)
So basically, you'll need to generate your own list of buckets and labels according to what you require, and then pass those as arguments to pandas.cut.
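The steps above can be wrapped into one function and tried on the sample data from the question. This is only a sketch: the function name add_salary_bucket, the plain numeric labels and the use of include_lowest=True (so a salary exactly equal to the lower bound still lands in the first bucket) are assumptions, not part of the original answer.

```python
import pandas as pd

def add_salary_bucket(df, column="salary", num_bins=10):
    """Return a copy of df with a salary_bucket column based on the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr

    # num_bins equal-width bins between the fences, plus one open-ended bin above.
    width = (upper - lower) / num_bins
    edges = [lower + i * width for i in range(num_bins)] + [upper, float("inf")]
    labels = [str(i) for i in range(1, num_bins + 1)] + ["outside_iqr"]

    out = df.copy()
    out["salary_bucket"] = pd.cut(out[column], bins=edges, labels=labels,
                                  include_lowest=True)
    return out

# Sample data copied from the question.
df = pd.DataFrame({
    "category": ["A", "C", "A", "B", "C", "C", "C", "A"],
    "user_id": [546457568, 356835679, 356785637, 45668758,
                686794, 234232376, 4356345, 9878987],
    "salary": [49203, 49694, 48766, 36627, 59508, 32765, 44058, 9999999],
})
bucketed = add_salary_bucket(df)
print(bucketed)
```

Values below the lower fence would come out as NaN here, which matches the question's statement that no such data exists.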
I have a table that looks like this; it is the stacked version of a crosstab, so each combination of item and period is unique:
+------+--------+-------+
| item | period | value |
+------+--------+-------+
| x | 1 | 6 |
| x | 2 | 4 |
| x | 3 | 5 |
| y | 1 | 9 |
| y | 2 | 10 |
| y | 3 | 100 |
+------+--------+-------+
For each item, I need to find the period with the lowest value, so the desired result is:
+------+--------+-------+
| item | period | value |
+------+--------+-------+
| x | 2 | 4 |
| y | 1 | 9 |
+------+--------+-------+
I have looked into pandas.DataFrame.idxmin() but it doesn't seem to be what I need.
I have found a way with groupby, min and merge but I was wondering if there is a more elegant solution?
I have found many similar questions related to R and SQL (my solution is in fact "SQLish"), but none for Python.
My solution is:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['item'] = np.repeat(['x','y'],3)
df['period'] = np.tile( [1,2,3] ,2 )
df['value'] = [6,4,5,9,10,100]
min_value = df[['item','value']].groupby('item').min().reset_index(drop = False)
periods_with_min_value = pd.merge(min_value, df, how ='inner', on=['item','value'])
You can combine groupby with idxmin and pass the resulting index labels to .loc:
df.loc[df.groupby("item")["value"].idxmin()]
Out[12]:
item period value
1 x 2 4
3 y 1 9
Tested on pandas 1.1.3, python 3.7, debian 10 64-bit. No warning was emitted.
N.B. This solution won't work if the index contains repeated or corrupted values. This can be resolved by calling .reset_index(drop=True) beforehand.
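To illustrate the note about the index, here is a small sketch with a deliberately duplicated index (the duplicated labels are made up for the example); resetting the index first makes the labels returned by idxmin unambiguous:

```python
import pandas as pd

# Same data as in the question, but with a deliberately repeated index.
df = pd.DataFrame(
    {
        "item": ["x", "x", "x", "y", "y", "y"],
        "period": [1, 2, 3, 1, 2, 3],
        "value": [6, 4, 5, 9, 10, 100],
    },
    index=[0, 0, 1, 1, 2, 2],
)

# Give every row a unique label before looking up the minima.
clean = df.reset_index(drop=True)
lowest = clean.loc[clean.groupby("item")["value"].idxmin()]
print(lowest)
```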
I have a multiindexed dataframe where the index levels have multiple categories, something like this:
| Level1 | Level2 | Level3 | Var1 | Var2 | Var3 |
|--------|--------|--------|------|------|------|
| A | A | A | | | |
| A | A | B | | | |
| A | B | A | | | |
| A | B | B | | | |
| B | A | A | | | |
| B | A | B | | | |
| B | B | A | | | |
| B | B | B | | | |
In summary, and specifically in my case, Level 1 has 2 levels, Level 2 has 24, Level 3 has 6, and there are also Levels 4 (674) and Level 5 (9) (with some minor variation depending on specific higher-level values - Level1 == 1 actually has 24 Level2s, but Level1 == 2 has 23).
I need to generate all possible combinations of 3 at Level 5, then calculate their means for Vars 1-3.
I am trying something like this:
import itertools
import pandas as pd
# Resulting df to be populated
df_result = pd.DataFrame([])
# Retrieving values at Level1
lev1s = df.index.get_level_values("Level1").unique()
# Looping through each Level1 value
for lev1 in lev1s:
    # Filtering df based on Level1 value
    df_lev1 = df.query('Level1 == ' + str(lev1))
    # Repeating...
    lev2s = df_lev1.index.get_level_values("Level2").unique()
    for lev2 in lev2s:
        df_lev2 = df_lev1.query('Level2 == ' + str(lev2))
        # ... until Level3
        lev3s = df_lev2.index.get_level_values("Level3").unique()
        # Creating all combinations
        combs = itertools.combinations(lev3s, 3)
        # Looping through each combination
        for comb in combs:
            # Filtering values in combination
            df_comb = df_lev2.query('Level3 in ' + str(comb))
            # Calculating means using groupby (groupby might not be necessary,
            # but I don't believe it has much of an impact)
            df_means = df_comb.reset_index().groupby(['Level1', 'Level2']).mean()
            # Extending resulting dataframe
            df_result = df_result.append(df_means)
The thing is, after a little while, this process gets really slow. Since I have around 2 * 24 * 6 * 674 groups and 84 combinations (9 elements taken 3 at a time), I am expecting more than 16 million df_means frames to be calculated.
Is there any more efficient way to do this?
Thank you.
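One possible direction, sketched under the assumption that every index combination is a single row and that the level to combine is named Level5 (the function name combination_means and the toy data are hypothetical): unstack that level into columns, so that each combination becomes one vectorized operation over all groups instead of one query per group.

```python
import itertools
import pandas as pd

def combination_means(df, level, k=3):
    """Mean of every variable over each k-combination of values in `level`."""
    # One column per (variable, level value); rows are the remaining index levels.
    wide = df.unstack(level)
    values = wide.columns.get_level_values(level).unique()

    pieces = {}
    for comb in itertools.combinations(values, k):
        # Add up the k column blocks and divide: one vectorized pass per combination.
        total = sum(wide.xs(v, axis=1, level=level) for v in comb)
        pieces["-".join(map(str, comb))] = total / k
    return pd.concat(pieces, names=["combination"])

# Toy data: 2 groups, 4 values at the combined level, 2 variables.
index = pd.MultiIndex.from_product(
    [["A", "B"], [0, 1, 2, 3]], names=["Level1", "Level5"]
)
df = pd.DataFrame(
    {"Var1": [float(v) for v in range(8)],
     "Var2": [v * 10.0 for v in range(8)]},
    index=index,
)
result = combination_means(df, "Level5", k=3)
print(result)
```

With 9 values taken 3 at a time, the loop runs 84 times regardless of how many groups exist. A group that lacks one of the three values gets NaN for that combination in this sketch, which may or may not be the behaviour you want.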
I have data from a platform that records users' events, whether answers to polls or clickstream data. I am trying to bring together a number of related datasets, each of which has a session_id column.
Each dataset began as a CSV that was read in as a series of nested lists. Not every session will have a user answering a question or completing certain actions, so each dataset will not contain an entry for every session. However, every session exists in at least one of the datasets.
Assume there are 5 sessions recorded:
e.g. dataset 1:
SessionID |a | b | c | d
1 | x | x | x | x
2 | x | x | x | x
5 | x | x | x | x
e.g. dataset 2:
SessionID |e | f | g | h
1 | x | x | x | x
3 | x | x | x | x
5 | x | x | x | x
e.g. dataset 3:
SessionID |i | j | k | l
2 | x | x | x | x
3 | x | x | x | x
4 | x | x | x | x
How would I construct this:
SessionID |a | b | c | d | e | f | g | h | i | j | k | l
1 | x | x | x | x | x | x | x | x | - | - | - | -
2 | x | x | x | x | - | - | - | - | x | x | x | x
3 | - | - | - | - | x | x | x | x | x | x | x | x
4 | - | - | - | - | - | - | - | - | x | x | x | x
5 | x | x | x | x | x | x | x | x | - | - | - | -
By far the easiest way to do this is to import each csv into pandas:
merged_df = pd.merge(dataset1, dataset2, how='outer', on="SessionID")
merged_df = pd.merge(merged_df, dataset3, how='outer', on="SessionID")
However, the requirement is that I not use any external libraries.
I'm struggling to find a workable logic to detect gaps in the sessionID, and then pad the lists with null data so the three lists would be simply added together.
Any ideas?
How do you define "external libraries"? Does sqlite3 qualify as external or internal?
If it doesn't and you want to think about the problem in terms of relational operations, you could slam your tables into a sqlite3 file and take it from there.
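A rough sketch of that idea using only the standard-library sqlite3 module, with an in-memory database instead of a file. The table names d1 to d3 and the column name session_id are made up for the example, and the rows are assumed to be already parsed into tuples. Collecting the ids with UNION and then LEFT JOINing each table avoids relying on FULL OUTER JOIN, which older SQLite versions do not support.

```python
import sqlite3

# Sample rows shaped like the tables in the question.
dataset1 = [(1, "x", "x", "x", "x"), (2, "x", "x", "x", "x"), (5, "x", "x", "x", "x")]
dataset2 = [(1, "x", "x", "x", "x"), (3, "x", "x", "x", "x"), (5, "x", "x", "x", "x")]
dataset3 = [(2, "x", "x", "x", "x"), (3, "x", "x", "x", "x"), (4, "x", "x", "x", "x")]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE d1 (session_id INTEGER PRIMARY KEY, a, b, c, d)")
con.execute("CREATE TABLE d2 (session_id INTEGER PRIMARY KEY, e, f, g, h)")
con.execute("CREATE TABLE d3 (session_id INTEGER PRIMARY KEY, i, j, k, l)")
con.executemany("INSERT INTO d1 VALUES (?, ?, ?, ?, ?)", dataset1)
con.executemany("INSERT INTO d2 VALUES (?, ?, ?, ?, ?)", dataset2)
con.executemany("INSERT INTO d3 VALUES (?, ?, ?, ?, ?)", dataset3)

query = """
SELECT ids.session_id,
       d1.a, d1.b, d1.c, d1.d,
       d2.e, d2.f, d2.g, d2.h,
       d3.i, d3.j, d3.k, d3.l
FROM (SELECT session_id FROM d1
      UNION SELECT session_id FROM d2
      UNION SELECT session_id FROM d3) AS ids
LEFT JOIN d1 ON d1.session_id = ids.session_id
LEFT JOIN d2 ON d2.session_id = ids.session_id
LEFT JOIN d3 ON d3.session_id = ids.session_id
ORDER BY ids.session_id
"""
merged = con.execute(query).fetchall()  # gaps come back as None
con.close()

for row in merged:
    print(row)
```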
If the number of datasets is finite, you could create a class Session containing a dictionary in which each column (a to l) is a key. If you are comfortable with it, you could implement the __getattr__ method to allow "dot" notation where you need it. For the "table", I would simply use a dictionary keyed by the session id, then fill it up in three steps (dataset1, dataset2, dataset3). That way you wouldn't have to worry about gaps.
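A minimal pure-Python sketch of the dictionary approach, leaving out the Session class for brevity. It assumes each dataset is a header list plus nested row lists with the session id in the first position; the function name outer_join is hypothetical.

```python
def outer_join(datasets):
    """Merge (header, rows) pairs on their first column, padding gaps with None."""
    # Every non-key column across all datasets, in order of appearance.
    all_columns = []
    for header, _rows in datasets:
        all_columns.extend(header[1:])

    merged = {}
    for header, rows in datasets:
        for row in rows:
            # The first time a session id is seen, start from an all-None record.
            record = merged.setdefault(row[0], dict.fromkeys(all_columns))
            record.update(zip(header[1:], row[1:]))

    # Back to nested lists, one per session, ordered by session id.
    return [[sid] + [merged[sid][col] for col in all_columns]
            for sid in sorted(merged)]

dataset1 = (["SessionID", "a", "b", "c", "d"],
            [[1, "x", "x", "x", "x"], [2, "x", "x", "x", "x"], [5, "x", "x", "x", "x"]])
dataset2 = (["SessionID", "e", "f", "g", "h"],
            [[1, "x", "x", "x", "x"], [3, "x", "x", "x", "x"], [5, "x", "x", "x", "x"]])
dataset3 = (["SessionID", "i", "j", "k", "l"],
            [[2, "x", "x", "x", "x"], [3, "x", "x", "x", "x"], [4, "x", "x", "x", "x"]])

table = outer_join([dataset1, dataset2, dataset3])
for row in table:
    print(row)
```

Because each record starts out with None in every column, no explicit gap detection is needed; sessions missing from a dataset simply keep their None values.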
So I have a dataframe with some values. This is my dataframe:
|in|x|y|z|
+--+-+-+-+
| 1|a|a|b|
| 2|a|b|b|
| 3|a|b|c|
| 4|b|b|c|
I would like to get the number of unique values in each row, and the number of values that are not equal to the value in column x. The result should look like this:
|in | x | y | z | count of not x |unique|
+---+---+---+---+---+---+
| 1 | a | a | b | 1 | 2 |
| 2 | a | b | b | 2 | 2 |
| 3 | a | b | c | 2 | 3 |
| 4 | b | b |nan| 0 | 1 |
I could come up with some dirty solutions here, but there must be some elegant way of doing this. The options I keep coming back to are drop_duplicates (which does not work on a Series); converting to an array and calling .unique(); df.iterrows(), which I want to avoid; and .apply on each row.
Here are solutions using apply.
df['count of not x'] = df.apply(lambda x: (x[['y','z']] != x['x']).sum(), axis=1)
df['unique'] = df.apply(lambda x: x[['x','y','z']].nunique(), axis=1)
A non-apply solution for getting count of not x:
df['count of not x'] = (~df[['y','z']].isin(df['x'])).sum(1)
Can't think of anything great for unique. This uses apply, but may be faster, depending on the shape of the data.
df['unique'] = df[['x','y','z']].T.apply(lambda x: x.nunique())
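As a possible alternative without apply, here is a sketch that compares against column x with DataFrame.ne and counts distinct values per row with DataFrame.nunique(axis=1). The sample frame is rebuilt from the input table in the question.

```python
import pandas as pd

# Input table copied from the question.
df = pd.DataFrame({
    "in": [1, 2, 3, 4],
    "x": ["a", "a", "a", "b"],
    "y": ["a", "b", "b", "b"],
    "z": ["b", "b", "c", "c"],
})

# Compare y and z against x row by row (axis=0 aligns x on the row index),
# then count the mismatches in each row.
df["count of not x"] = df[["y", "z"]].ne(df["x"], axis=0).sum(axis=1)

# Number of distinct values across x, y and z in each row.
df["unique"] = df[["x", "y", "z"]].nunique(axis=1)
print(df)
```

Be aware that nunique ignores NaN by default, while ne treats a NaN as different from x, so rows with missing values may need separate handling.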