I am getting to grips with python pandas.
The toy problem below illustrates an issue I am having in a related exercise.
I have sorted a data-frame so that it presents a column's values (in this case students' test scores) in ascending order:
df_sorted =
variable test_score
1 52.0
1 53.0
4 54.0
6 64.0
6 64.0
6 64.0
5 71.0
10 73.0
15 75.0
4 77.0
However, I would now like to bin the data-frame by taking the mean of 2 columns (here "variable" and "test_score") for every X entries from the start to the end of the data-frame. This will also allow me to create bins that contain equal numbers of entries (very useful for plotting in my associated exercise).
The output if I bin every 3 rows would therefore look like:
df_sorted_binned =
variable test_score
2 53.0
6 64.0
10 73.0
4 77.0
Can anyone see how I can do this easily?
Much obliged!
Just groupby a dummy variable that goes 0, 0, 0, 1, 1, 1, etc. This can be obtained with floor division:
>>> d.groupby(np.arange(len(d))//3).mean()
variable test_score
0 2 53
1 6 64
2 10 73
3 4 77
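For completeness, here is a runnable sketch of the same idea applied to the frame from the question (data copied from the example above; the bin size 3 is the X from the question):

import numpy as np
import pandas as pd

df_sorted = pd.DataFrame({
    "variable":   [1, 1, 4, 6, 6, 6, 5, 10, 15, 4],
    "test_score": [52.0, 53.0, 54.0, 64.0, 64.0, 64.0, 71.0, 73.0, 75.0, 77.0],
})

# Floor-divide the positional index by the bin size to get the dummy key
# 0, 0, 0, 1, 1, 1, ... and take the mean of both columns within each bin.
df_sorted_binned = df_sorted.groupby(np.arange(len(df_sorted)) // 3).mean()
print(df_sorted_binned)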
As part of some data cleansing, I want to add the mean of a variable back into a dataframe, to use if the variable is missing for a particular observation. So I've calculated my averages as follows:
avg = all_data2.groupby("portfolio")["sales"].mean().reset_index(name="sales_mean")
Now I want to add that back into my original dataframe using a left join, but it doesn't appear to be working. What format is my avg now? I thought it would be a dataframe but is it something else?
UPDATED
This is probably the most succinct way to do it:
all_data2.sales = all_data2.sales.fillna(all_data2.groupby('portfolio').sales.transform('mean'))
This is another way to do it:
all_data2['sales'] = all_data2[['portfolio', 'sales']].groupby('portfolio').transform(lambda x: x.fillna(x.mean()))
Input:
portfolio sales
0 1 10.0
1 1 20.0
2 2 30.0
3 2 40.0
4 3 50.0
5 3 60.0
6 3 NaN
Output:
portfolio sales
0 1 10.0
1 1 20.0
2 2 30.0
3 2 40.0
4 3 50.0
5 3 60.0
6 3 55.0
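The reason transform('mean') plays well with fillna is that it returns a Series the same length as the original frame, with each group's mean repeated on that group's rows, so the two align row by row. A minimal sketch of that intermediate step, reconstructing the sample frame from the printed output above:

import numpy as np
import pandas as pd

all_data2 = pd.DataFrame({
    "portfolio": [1, 1, 2, 2, 3, 3, 3],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, np.nan],
})

# Each row gets its group's mean (NaN is ignored when computing the mean):
print(all_data2.groupby('portfolio').sales.transform('mean'))
# 0    15.0
# 1    15.0
# 2    35.0
# 3    35.0
# 4    55.0
# 5    55.0
# 6    55.0
# Name: sales, dtype: float64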
To answer the part of your question that reads "what format is my avg now? I thought it would be a dataframe but is it something else?": avg is indeed a dataframe, but using it may not be the most direct way to update missing data in the original dataframe. The dataframe avg looks like this for the sample input data above:
portfolio sales_mean
0 1 15.0
1 2 35.0
2 3 55.0
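If you do want to keep the avg frame and use a left join as you originally planned, a sketch along these lines should also work (it merges avg in, fills the gaps from the new column, then drops the helper column):

merged = all_data2.merge(avg, on='portfolio', how='left')
merged['sales'] = merged['sales'].fillna(merged['sales_mean'])
merged = merged.drop(columns='sales_mean')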
A related SO question that you may find helpful is here.
If you want to add a new column, you can use this code:
df['sales_mean']=df[['sales_1','sales_2']].mean(axis=1)
Somewhat similar question to an earlier question I had here: Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID
However, instead of just taking the sum of datapoints, I wanted to have the weighted average in an extra column. I'll repeat and rephrase the question:
I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID-number, and it has surfaces and U-values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and surface-weighted average U-value per apartment. There are three conditions for the original dataframe:
1. The dataframe can contain empty cells.
2. When the values of Surface or U-value are equal for all of the rows within an ID (so all the same values for the same ID), the data (surface, volumes) is not summed but one value/row is passed to the new summary column (example: 'ID 4'), as this could be a mistake in the original dataframe and the total surface/volume was inserted for all the rooms by the government employee.
3. The average U-value should be the Surface-weighted average U-value.
Initial dataframe 'data':
print(data)
ID Surface U-value
0 2 10.0 1.0
1 2 12.0 1.0
2 2 24.0 0.5
3 2 8.0 1.0
4 4 84.0 0.8
5 4 84.0 0.8
6 4 84.0 0.8
7 52 NaN 0.2
8 52 96.0 1.0
9 95 8.0 2.0
10 95 6.0 2.0
11 95 12.0 2.0
12 95 30.0 1.0
13 95 12.0 1.5
Desired output from 'df':
print(df)
ID Surface U-value #-> U-value = surface weighted U-value!; Surface = sum of all surfaces except when all surfaces per ID are the same (example 'ID 4')
0 2 54.0 0.777
1 4 84.0 0.8 #-> as the values are the same for each row of this ID in the original data, the sum is not taken, but only one of the rows is passed (see the second condition)
2 52 96.0 1.0 # -> as one of the 2 surfaces is empty, the corresponding U-value for the empty cell is ignored, so the output here should be the weighted average of the rows that have both a 'Surface' and a 'U-value' (in this case 1.0)
3 95 68.0 1.47
The code of jezrael in the reference already works brilliantly for the sum(), but how do I add a weighted-average 'U-value' column to it? I really have no idea. A plain average could be obtained with mean() instead of sum(), but the weighted average..?
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [2,4,52,95]})
data = pd.DataFrame({"ID": [2,2,2,2,4,4,4,52,52,95,95,95,95,95],
                     "Surface": [10,12,24,8,84,84,84,np.nan,96,8,6,12,30,12],
                     "U-value": [1.0,1.0,0.5,1.0,0.8,0.8,0.8,0.2,1.0,2.0,2.0,2.0,1.0,1.5]})
print(data)
cols = ['Surface']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
This should do the trick:
data.groupby('ID').apply(lambda g: (g['U-value']*g['Surface']).sum() / g['Surface'].sum())
To add it to the summary dataframe, don't reset the index first:
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum()
df['U-value'] = data.groupby('ID').apply(
lambda g: (g['U-value'] * g['Surface']).sum() / g['Surface'].sum())
df.reset_index(inplace=True)
The result:
ID Surface U-value
0 2 54.0 0.777778
1 4 84.0 0.800000
2 52 96.0 1.000000
3 95 68.0 1.470588
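If you prefer, the same surface-weighted average can be written with numpy.average; a small sketch (rows whose Surface is missing are dropped first so the weights stay well defined):

wavg = (data.dropna(subset=['Surface'])
            .groupby('ID')
            .apply(lambda g: np.average(g['U-value'], weights=g['Surface'])))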
This is the raw distribution of the var FREQUENCY
NaN 22131161
1.0 4182626
7.0 218343
3.0 145863
1 59432
0.0 29906
2.0 28129
4.0 15237
5.0 4553
8.0 3617
3 2754
7 2635
9.0 633
2 584
4 276
0 112
8 51
5 42
6.0 19
A 9
I 7
9 6
Q 3
Y 2
X 2
Z 1
C 1
N 1
G 1
B 1
Name: FREQUENCY, dtype: int64
Group 1.0 should be the same as 1. I wrote df['x'] = df['x'].replace({'1.0': '1'}) but it does not change anything. 9.0 vs 9 and 3.0 vs 3 show the same symptom.
How could FREQUENCY be rendered as int64 when letters are present?
Desired outcome 1: group all letter groups + NaN into one group, and consolidate the remaining numeric value groups (1.0 and 1 = 1, for example). In SAS, I would just run y = 1*X and give a value of 10 to represent the character groups + NaN. How do I do this in Python, ideally elegantly?
Desired outcome 2: extract a binary variable z = 1 if x is NaN, otherwise z = 0.
The first issue ("group 1.0 should be the same as 1. I wrote df['x'] = df['x'].replace({'1.0': '1'}) and it does not change anything; 9.0 vs 9 and 3.0 vs 3 show the same symptom") was fixed once I added dtype={'FREQUENCY': 'object'} while reading the csv file. Group 1.0 collapsed into group 1, and after that replace works just fine.
All the other issues are pretty much resolved, except issue 2: the variable type is still set to int64 even though character values are present. My guess is that Python adopts a majority rule to vote on the data type, and numeric values do indeed dominate the count.
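For the two desired outcomes, a minimal sketch along these lines should work (the column name FREQUENCY and the placeholder value 10 are taken from the question; errors='coerce' turns the letter groups into NaN, which then fall into the same bucket as the original NaN):

import pandas as pd

# Outcome 2 first, before anything is filled in: flag the original NaN.
df['z'] = df['FREQUENCY'].isna().astype(int)

# Outcome 1: coerce to numeric so '1' and '1.0' collapse to the same value;
# letters become NaN and are folded, together with the original NaN, into 10.
df['x'] = pd.to_numeric(df['FREQUENCY'], errors='coerce').fillna(10)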
I just found the 2 issues causing this, see solution below
I want to create a new column in my dataframe (df) based on another dataframe.
Basically df1 contains updated information that I want to plug into df.
In order to replicate my real case (>1m lines), I will just populate two dataframes with simple columns.
I use pandas.merge() to do this, but it is giving me strange results.
Here is a typical example. Let's create df randomly and create df1 with a simple relationship: "New Type" = "Type" + 1. I create this simple relationship so that we can easily check the output. In my real application I don't have such an easy relationship, of course.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)),columns = ["Type"])
df.head()
Type
0 45
1 3
2 89
3 6
4 39
df1 = pd.DataFrame({"Type":range(1,100)})
df1["New Type"] = df1["Type"] + 1
print(df1.head())
Type New Type
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
Now let's say I want to update df "Type" based on the "New Type" on df1
df["Type2"] = df.merge(df1,on="Type")["New Type"]
print(df.head())
I get this strange output, where we can clearly see that it does not work:
Type Type2
0 45 46.0
1 3 4.0
2 89 4.0
3 6 4.0
4 39 90.0
I would think output should be like
Type Type2
0 45 46.0
1 3 4.0
2 89 90.0
3 6 7.0
4 39 40.0
Only the first line is properly matched. Do you know what I've missed?
Solution:
1. I need to do the merge with how="left", otherwise the default "inner" produces another table with a different dimension than df.
2. I also need to pass sort=False to the merge function; otherwise the merge result is sorted before being applied to df.
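Putting those two fixes together, the original one-liner becomes something like this sketch (the left join keeps every row of df, and sort=False keeps them in their original order, so the positional assignment lines up):

df["Type2"] = df.merge(df1, on="Type", how="left", sort=False)["New Type"]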
One way you could do this using map, set_index, and squeeze:
df['Type2'] = df['Type'].map(df1.set_index('Type').squeeze())
Output:
Type Type2
0 22 23.0
1 56 57.0
2 63 64.0
3 33 34.0
4 25 26.0
First, I'd construct a Series of New Type indexed by the old Type from df1:
new_vals = df1.set_index('Type')['New Type']
Then it's simply:
df.replace(new_vals)
That will leave values which aren't mapped intact. If you want to instead have the output be NaN (null) where not mapped, do this:
new_vals[df.Type]
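One caveat: in recent pandas versions, indexing a Series with labels that are missing from its index raises a KeyError, so a reindex-based sketch may be the safer way to get NaN for unmapped values:

df['Type2'] = new_vals.reindex(df['Type']).to_numpy()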
I have a dataframe in pandas which looks something like this:
>>> df[1:3]
0 1 2 3 4 5 6 7 8
1 -0.59 -99.0 924.0 20.1 5.0 4.0 57.0 19.0 8.0
2 -1.30 -279.0 297.0 16.1 30.0 4.4 63.0 19.0 10.0
The number of points in the dataframe is ~1000.
Given a set of columns, I want to find out how many times each point is "better" than the others.
Given a set of n columns, a point is better than another point if it is better in at least one of the columns and at least equal in the other columns.
A point which is better in one column but worse in the other n-1 is not considered better, even though it is better than the other point in at least one column.
Edit1: Example:
>>> df
0 1 2
1 -0.59 -99.0 924.0
2 -1.30 -279.0 297.0
3 2.00 -100.0 500.0
4 0.0 0.0 0.0
If we consider only column 0, then the result would be:
1 - 1
2 - 0
3 - 3
4 - 2
because point 1 (-0.59) is only better than point 2 with respect to column 0.
Another example by taking columns - 0 and 1:
1 - 1 (only for point 2 are all values, i.e. column 0 and column 1, smaller than those of point 1)
2 - 0 (since no other point has a smaller value than point 2 in any dimension)
3 - 1 (point 2)
4 - 2 (point 1 and 2)
Edit 2:
Perhaps something like a function which, when given a dataframe, a point (its index) and a set of columns, gives the count: for each subset of columns, how many times that point is better than the other points.
def f(p, df, c):
    """Returns a list L = [(c1, n), (c2, m), ...]
    where c1 is a proper subset of c and n is the number of times
    that this point was better than the other points."""
rank each column separately
by ranking each column, I can see exactly how many other rows in that column the particular row you're in is greater than.
d1 = df.rank().sub(1)
d1
to solve your problem, it logically has to be the case that, for a particular row, the smallest rank among the row's elements is precisely the number of other rows that this row beats in every element.
for the first two columns [0, 1], it can be calculated by taking the min of d1
I use this for reference to compare the raw first two columns with the ranks
pd.concat([df.iloc[:, :2], d1.iloc[:, :2]], axis=1, keys=['raw', 'ranked'])
Take the min as stated above.
d1.iloc[:, :2].min(1)
1 1.0
2 0.0
3 1.0
4 2.0
dtype: float64
put the result next to raw data and ranks so we can see it
pd.concat([df.iloc[:, :2], d1.iloc[:, :2], d1.iloc[:, :2].min(1)],
axis=1, keys=['raw', 'ranked', 'results'])
sure enough, that ties out with your expected results.
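put together as one runnable sketch (sample data copied from the example above, first two columns only):

import pandas as pd

df = pd.DataFrame({0: [-0.59, -1.30, 2.00, 0.0],
                   1: [-99.0, -279.0, -100.0, 0.0],
                   2: [924.0, 297.0, 500.0, 0.0]},
                  index=[1, 2, 3, 4])

d1 = df.rank().sub(1)           # per-column count of rows each value beats
counts = d1.iloc[:, :2].min(1)  # minimum across the chosen columns
print(counts)
# 1    1.0
# 2    0.0
# 3    1.0
# 4    2.0
# dtype: float64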