I need a suggestion on a procedure using pandas. I have a two-column dataset that looks like this:
A 0.4533
B 0.2323
A 1.2343
A 1.2353
B 4.3521
C 3.2113
C 2.1233
.. ...
where the first column contains strings and the second one floats. Looking at each subgroup given by the keys 'A', 'B', 'C', I would like to filter out the values in each subgroup that differ by more than 20% from that subgroup's minimum.
What I've tried to do has been:
out = df[df.groupby(stringcol)[floatcol].apply(lambda x: x <= x.min() * 1.2)]
But I'm not sure whether, done like this, the minimum is taken over the entire dataset or within each subgroup, which is what I want.
Many thanks,
James
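For what it's worth, here is a minimal sketch (column names 'key' and 'value' are placeholders) that makes the per-group minimum explicit: transform('min') broadcasts each subgroup's minimum back to its rows, so the comparison is against each group's own minimum rather than the global one.
import pandas as pd
df = pd.DataFrame({'key': ['A', 'B', 'A', 'A', 'B', 'C', 'C'],
                   'value': [0.4533, 0.2323, 1.2343, 1.2353, 4.3521, 3.2113, 2.1233]})
# per-group minimum, aligned with the original rows
group_min = df.groupby('key')['value'].transform('min')
# keep only the rows within 20% of their own group's minimum
out = df[df['value'] <= group_min * 1.2]
print(out)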
I have a Pandas DataFrame like:
COURSE BIB# COURSE 1 COURSE 2 STRAIGHT-GLIDING MEAN PRESTASJON
1 2 20.220 22.535 19.91 21.3775 1.073707
0 1 21.235 23.345 20.69 22.2900 1.077332
This is from a pilot and the DataFrame may be much longer when we perform the real experiment. Now that I have calculated the performance for each BIB#, I want to allocate them into two different groups based on their performance. I have therefore written the following code:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
This sorts values in the DataFrame. Now I want to assign even rows to one group and odd rows to another. How can I do this?
I have no idea what I am looking for. I have looked up in the documentation for the random module in Python but that is not exactly what I am looking for. I have seen some questions/posts pointing to a scikit-learn stratification function but I don't know if that is a good choice. Alternatively, is there a way to create a loop that accomplishes this? I appreciate your help.
Here is a figure to illustrate what I want to accomplish:
How about this:
threshold = 0.5
df1['group'] = df1['PRESTASJON'] > threshold
Or if you want values for your groups:
df1['group'] = np.where(df1['PRESTASJON'] > threshold, 'A', 'B')
Here, 'A' will be assigned to the 'group' column if PRESTASJON exceeds the threshold, otherwise 'B'. (np here is numpy, imported as import numpy as np.)
UPDATE: Per the OP's update on the post, if you want to split them alternately into two groups:
# sort your dataframe based on the PRESTASJON column
df1 = df1.sort_values(by='PRESTASJON')
# create a new column with default value 'A' and assign every other row to 'B'
df1['group'] = 'A'
df1.iloc[1::2, -1] = 'B'
Are you splitting the dataframe alternately? If so, you can do:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
for i, d in df1.groupby(np.arange(len(df1)) % 2):
    print(f'group {i}')
    print(d)
Another way without groupby:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
mask = np.arange(len(df1)) %2
group1 = df1.loc[mask==0]
group2 = df1.loc[mask==1]
I have a Black Friday dataset.
Here is how it looks.
The Age is given as ranges like 1-17, 18-25, etc. I want to replace each such range with its mean. I could traverse each element of the Age column, parse the string, and replace it with the mean, but that would probably be inefficient.
So I want to know: is there any shorter way to do that? Or is there an alternative way to process range data like this? (In Python, of course.)
There are several ways to transform this variable. In the picture I see that there are not only ranges but also the value '55+', which needs to be handled separately.
1) One liner:
df['age'].apply(lambda x: np.mean([int(x.split('-')[0]), int(x.split('-')[1])]) if '+' not in x else int(x[:-1]))
It checks whether the value contains '+' (like '55+'); if so, the value without the '+' is returned as an int. Otherwise the range is split into its two endpoints, they are converted to ints, and their mean is calculated.
2) Using dictionary for transformation:
mapping = {'1-17': 9, '18-25': 21.5, '55+': 55}
df['age'].apply(lambda x: mapping[x])
You need to add all values to the mapping dictionary (calculated manually or automatically), then apply this transformation to the series.
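If there are many ranges, a sketch of building the mapping automatically could look like this (continuing with the df['age'] column used above, and assuming every label is either a 'lo-hi' string or ends with '+'):
def range_to_mean(label):
    # '55+' -> 55.0; '18-25' -> mean of the two endpoints
    if label.endswith('+'):
        return float(label[:-1])
    lo, hi = label.split('-')
    return (int(lo) + int(hi)) / 2

mapping = {label: range_to_mean(label) for label in df['age'].unique()}
df['age'] = df['age'].map(mapping)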
I have a dataset from which I want a few averages of multiple variables I created.
I started off with:
data2['socialIdeology2'].mean()
data2['econIdeology'].mean()
^ that works perfectly, and gives me the averages I'm looking for.
Now I'm trying to do a conditional mean, i.e. the mean only for a select group within the dataset. (I want the ideologies broken down by whom each respondent voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'
Looking into it, I came to the conclusion a conditional mean just isn't a thing (although hopefully I am wrong?), so I was writing my own function for it.
This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:
def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count
print(mean())
Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.
So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?
And how can I generate means by category?
Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():
means = data2.groupby('voteChoice').mean()
or maybe, in your case, the following would be more efficient:
means = data2.groupby('voteChoice')['socialIdeology2'].mean()
to drill down to the mean you're looking for. (The first case will calculate means for all columns.) This is assuming that voteChoice is the name of the column you want to condition on.
If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:
voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:
means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()
Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in, whereas if you place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']) this computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
See the pandas documentation for more info on indexing DataFrames using .loc and on groupby.
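As a side note on the nan result in the question: pandas' Series.mean() skips NaN values by default (skipna=True), whereas a hand-written loop that adds the values propagates NaN, which is one likely reason the custom function returns nan while .mean() does not. A minimal illustration:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())  # 2.0 -- NaN values are skipped by default (skipna=True)

total, count = 0.0, 0
for value in s:
    total += value  # adding NaN makes the running total NaN
    count += 1
print(total / count)  # nan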
TL;DR - I want to mimic the behaviour of functions such as DataFrameGroupBy.std()
I have a DataFrame which I group.
I want to take one row to represent each group, and then add extra statistics regarding these groups to the resulting DataFrame (such as the mean and std of these groups)
Here's an example of what I mean:
df = pandas.DataFrame({"Amount": [numpy.nan,0,numpy.nan,0,0,100,200,50,0,numpy.nan,numpy.nan,100,200,100,0],
"Id": [0,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
"Date": pandas.to_datetime(["2011-11-02","NA","2011-11-03","2011-11-04",
"2011-11-05","NA","2011-11-04","2011-11-04",
"2011-11-06","2011-11-06","2011-11-06","2011-11-06",
"2011-11-08","2011-11-08","2011-11-08"],errors='coerce')})
g = df.groupby("Id")
f = g.first()
f["std"] = g.Amount.std()
Now, this works, but let's say I want a special std which ignores 0 and counts each unique value only once:
def get_unique_std(group):
    vals = group.unique()
    vals = vals[vals > 0]
    return vals.std() if vals.shape[0] > 1 else 0
If I use
f["std"] = g.Amount.transform(get_unique_std)
I only get zeros... (Also for any other function such as max etc.)
But if I do it like this:
std = g.Amount.transform(get_unique_std)
I get the correct result, only it's not grouped anymore... I guess I could calculate all of these as columns of the original DataFrame (in this case df) before taking the representative row of each group:
df["std"] = g.Amount.transform(get_unique_std)
# regroup again the modified df
g = df.groupby("Id")
f = g.first()
But that would just be a waste of memory, since many rows belonging to the same group would get the same value, and I'd also have to group df twice: once to calculate these statistics, and a second time to get the representative row...
So, as stated in the beginning, I wonder how I can mimic the behaviour of DataFrameGroupBy.std().
I think you may be looking for DataFrameGroupBy.agg()
You can pass your custom function like this and get a grouped result:
g.Amount.agg(get_unique_std)
You can also compute several statistics at once and get each as a named column, using named aggregation:
g.Amount.agg(my_std=get_unique_std, numpy_std=numpy.std)
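A minimal sketch of putting it together, continuing from the df, g and get_unique_std defined in the question (the agg result is a Series indexed by Id, so it aligns with f = g.first()):
f = g.first()
# agg collapses each group to a single value, indexed by Id, so it lines up with f
f["std"] = g.Amount.agg(get_unique_std)
print(f)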
Thanks for reading. I've spent 3-4 hours searching for examples to solve this but can't find any that do; the ones I did try didn't seem to work with a pandas DataFrame object. Any help would be very much appreciated! :)
Ok this is my problem.
I have a Pandas DataFrame containing 12 columns.
I have 500,000 rows of data.
Most of the columns are useless. The variables/columns I am interested in are called: x,y and profit
Many of the x and y points are the same, so I'd like to group them into unique combinations and then add up all the profit for each unique combination.
Each unique combination is a bin (like a bin used in histograms).
Then I'd like to plot a 2D chart/heatmap of x and y for each bin, with the colour representing the total profit of that bin.
e.g.
x,y,profit
7,4,230.0
7,5,162.4
6,8,19.3
7,4,-11.6
7,4,180.2
7,5,15.7
4,3,121.0
7,4,1162.8
Note how for x=7, y=4 there are 3 rows that meet this criterion, so the total profit should be:
230.0 - 11.6 + 1162.8 = 1381.2
So in bin x=7, y=4, the profit is 1381.2.
Note that for x=7, y=5 there are 2 instances, so the total profit should be: 162.4 + 15.7 = 178.1
So in bin x=7, y=5, the profit is 178.1.
So finally, I just want to be able to plot: x,y,total_profit_of_bin
To help illustrate what I'm looking for, I found this on the internet; it is similar to what I'd like (ignore the axes & numbers):
http://2.bp.blogspot.com/-F8q_ZcI-HJg/T4_l7D0C7yI/AAAAAAAAAgE/Bqtx3eIHzRk/s1600/heatmap.jpg
Thank-you so much for taking the time to read:)
If each 'bin' of x is a set of rows where the values of x are equal and the values of y are equal, then you can use groupby with agg. That would look something like this:
import pandas as pd
import numpy as np
df = YourData
AggDF = df.groupby('x').agg({'y' : 'max', 'profit' : 'sum'})
AggDF
That would get you the data I think you want, then you could plot as you see fit. Do you need assistance with that also?
NB: this will only work the way you want if, within each 'bin' (i.e. the data grouped according to the values of x), the values of y are equal. I assume this must be the case, as otherwise I don't think it would make much sense to graph x and y together.
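If the y values within an x bin are not all equal, a sketch of grouping on both columns and plotting the summed profit as a heatmap might look like this (assuming matplotlib is available; column names x, y and profit as in the question):
import matplotlib.pyplot as plt

# sum the profit for each unique (x, y) combination
binned = df.groupby(['x', 'y'])['profit'].sum().reset_index()

# reshape to a grid: rows = y, columns = x, cell value = total profit
grid = binned.pivot(index='y', columns='x', values='profit')

# cells with no data stay NaN and are left blank in the plot
plt.imshow(grid.values, origin='lower')
plt.xticks(range(len(grid.columns)), grid.columns)
plt.yticks(range(len(grid.index)), grid.index)
plt.colorbar(label='total profit')
plt.xlabel('x')
plt.ylabel('y')
plt.show()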