Having the following Data Frame:
name value count total_count
0 A 0 1 20
1 A 1 2 20
2 A 2 2 20
3 A 3 2 20
4 A 4 3 20
5 A 5 3 20
6 A 6 2 20
7 A 7 2 20
8 A 8 2 20
9 A 9 1 20
----------------------------------
10 B 0 10 75
11 B 5 30 75
12 B 6 20 75
13 B 8 10 75
14 B 9 5 75
I would like to pivot the data, grouping the rows by name, then create columns based on the value and count columns aggregated into bins.
Explanation: I have 10 possible values, in the range 0-9, and not all of them are present in each group. In the example above, group B is missing the values 1, 2, 3, 4 and 7. I would like to create a histogram with 5 bins, ignore missing values, and calculate the percentage of count for each bin. The result should look like this:
name 0-1 2-3 4-5 6-7 8-9
0 A 0.150000 0.2 0.3 0.2 0.150000
1 B 0.133333 0.0 0.4 0.4 0.066667
For example for bin 0-1 of group A the calculation is the sum of count for the values 0,1 (1+2) divided by the total_count of group A
name 0-1
0 A (1+2)/20 = 0.15
I was looking into hist method and this StackOverflow question, but still struggling with figuring out what is the right approach.
Use pd.cut to bin your feature, then use df.groupby() with an aggregation and the .unstack() method to get the dataframe you are looking for. In the groupby you can use any aggregation function (.sum(), .count(), etc.) to get the result you need. The code below is a working example.
import pandas as pd
import numpy as np
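df = pd.DataFrame(
    data={'name': ['Group A', 'Group B'] * 5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})
# include_lowest=True keeps number == 0 in the first bin instead of leaving it unbinned
df['number_bin'] = pd.cut(df['number'], bins=np.arange(0, 10), include_lowest=True)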
# Option 1: Sums
df.groupby(['number_bin','name'])['value'].sum().unstack(0)
# Options 2: Counts
df.groupby(['number_bin','name'])['value'].count().unstack(0)
The null values in the original data will not affect the result.
To get the exact result you could try this.
bins=range(10)
res = df.groupby('name')['count'].sum()
intervals = pd.cut(df.value, bins=bins, include_lowest=True)
df1 = (df.groupby([intervals,"name"])['count'].sum()/res).unstack(0)
df1.columns = df1.columns.astype(str) # convert the cols to string
df1.columns = ['a','b','c','d','e','f','g','h','i'] # rename the cols
cols = ['a',"b","d","f","h"]
df1 = df1.add(df1.iloc[:,1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)
You can manually rename the cols later.
# Output:
a b d f h
name
A 0.150000 0.2 0.3 0.200000 0.15
B 0.133333 NaN 0.4 0.266667 0.20
You can replace the NaN values using df1.fillna(0.0) (pass the number 0.0, not the string "0.0", so the column stays numeric).
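As a possible alternative, pd.cut can be asked for the five two-value bins directly, which avoids the column shifting. This is only a minimal sketch using the question's data; the edges and labels lists and the names value_bin, share and result are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "name": ["A"] * 10 + ["B"] * 5,
    "value": list(range(10)) + [0, 5, 6, 8, 9],
    "count": [1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 10, 30, 20, 10, 5],
    "total_count": [20] * 10 + [75] * 5,
})

# Assumed bin edges: (-1, 1], (1, 3], ... so each bin holds two integer values.
edges = [-1, 1, 3, 5, 7, 9]
labels = ["0-1", "2-3", "4-5", "6-7", "8-9"]
value_bin = pd.cut(df["value"], bins=edges, labels=labels)

# Each row's share of its group's total, summed per (name, bin).
share = df["count"] / df["total_count"]
result = (
    share.groupby([df["name"], value_bin], observed=False)
    .sum()
    .unstack(fill_value=0.0)
)
print(result)
```

With observed=False every label appears as a column even when a group has no rows in that bin, so empty bins come out as 0.0 rather than NaN.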
I have missing values in one column that I would like to fill by random sampling from a source distribution:
import pandas as pd
import numpy as np
source = pd.DataFrame({'age':5*[21],
'location':[0,0,1,1,1],
'x':[1,2,3,4,4]})
source
age location x
0 21 0 1
1 21 0 2
2 21 1 3
3 21 1 4
4 21 1 4
target = pd.DataFrame({'age':5*[21],
'location':[0,0,0,1,2],
'x':5*[np.nan]})
target
age location x
0 21 0 NaN
1 21 0 NaN
2 21 0 NaN
3 21 1 NaN
4 21 2 NaN
Now I need to fill in the missing values of x in the target dataframe by drawing, with replacement, a random value of x from the rows of the source dataframe that have the same age and location as the row with the missing x. If source has no x for that combination of age and location, the value should be left missing.
Expected output:
age location x
0 21 0 1 with probability 0.5 2 otherwise
1 21 0 1 with probability 0.5 2 otherwise
2 21 0 1 with probability 0.5 2 otherwise
3 21 1 3 with probability 0.33 4 otherwise
4 21 2 NaN
I can loop through all the missing combinations of age and location and slice the source dataframe and then take a random sample, but my dataset is large enough that it takes quite a while to do.
Is there a better way?
You can set a MultiIndex on both DataFrames and then, in a custom function passed to GroupBy.transform, replace the NaN values with values drawn from the other DataFrame using numpy.random.choice:
source = pd.DataFrame({'age':5*[21],
'location':[0,0,1,1,1],
'x':[1,2,3,4,4]})
target = pd.DataFrame({'age':5*[21],
'location':[0,0,0,1,2],
'x':5*[np.nan]})
cols = ['age', 'location']
source1 = source.set_index(cols)['x']
target1 = target.set_index(cols)['x']
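def f(x):
    try:
        # candidate values from source for this (age, location) group
        a = source1.loc[x.name].to_numpy()
        m = x.isna()
        # draw with replacement, one value per missing entry
        x[m] = np.random.choice(a, size=m.sum())
        return x
    except KeyError:
        # no matching group in source: leave the values missing
        return np.nan
target1 = target1.groupby(level=[0,1]).transform(f).reset_index()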
print (target1)
age location x
0 21 0 1.0
1 21 0 2.0
2 21 0 2.0
3 21 1 3.0
4 21 2 NaN
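A variation on the same idea, shown only as a sketch: build one pool of candidate values per key combination up front and then loop over the distinct combinations rather than over rows. The names pools, filled and rng, and the fixed seed, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({"age": 5 * [21], "location": [0, 0, 1, 1, 1], "x": [1, 2, 3, 4, 4]})
target = pd.DataFrame({"age": 5 * [21], "location": [0, 0, 0, 1, 2], "x": 5 * [np.nan]})

cols = ["age", "location"]
rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

# One array of candidate x values per (age, location) combination.
pools = {key: grp.to_numpy() for key, grp in source.groupby(cols)["x"]}

filled = target.copy()
missing = filled["x"].isna()
# Loop over distinct key combinations, not over rows.
for key, idx in filled[missing].groupby(cols).groups.items():
    pool = pools.get(key)
    if pool is not None:  # no matching source rows: leave as NaN
        filled.loc[idx, "x"] = rng.choice(pool, size=len(idx))
print(filled)
```

Because rng.choice samples with replacement by default, repeated values in a pool (such as the two 4s for location 1) are drawn proportionally more often.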
You can create a common grouper and perform a merge:
cols = ['age', 'location']
(target[cols]
   .assign(group=target.groupby(cols).cumcount())  # compute subgroup for duplicates
   .merge((# below: assigns a random row group
           source.assign(group=source.sample(frac=1).groupby(cols, sort=False).cumcount())
                 .groupby(cols+['group'], as_index=False)  # get one row per group
                 .first()
          ),
          on=cols+['group'], how='left')  # merge
   #.drop('group', axis=1)  # column kept for clarity, uncomment to remove
)
output:
age location group x
0 20 0 0 0.339955
1 20 0 1 0.700506
2 21 0 0 0.777635
3 22 1 0 NaN
I'm trying to create 2-dimensional bins from a pandas DataFrame based on 3 columns. Here is a snippet from my DataFrame:
Scatters N z Dist_first
---------------------------------------
0 0 0 0.096144 2.761508
1 1 0 -8.229910 17.403039
2 2 0 0.038125 21.466233
3 3 0 -2.050480 29.239867
4 4 0 -1.620470 NaN
5 5 0 -1.975930 NaN
6 6 0 -11.672200 NaN
7 7 0 -16.629000 26.554049
8 8 0 0.096002 NaN
9 9 0 0.176049 NaN
10 10 0 0.176005 NaN
11 11 0 0.215408 NaN
12 12 0 0.255889 NaN
13 13 0 0.301834 27.700308
14 14 0 -29.593600 9.155065
15 15 1 -2.582290 NaN
16 16 1 0.016441 2.220946
17 17 1 -17.329100 NaN
18 18 1 -5.442320 34.520919
19 19 1 0.001741 39.579189
For my result each Dist_first should be binned with all "z <= 0" of lower index within a group "N" than the Distance itself. "Scatters" is a copy of the index left from an operation in an earlier stage of my code which is not relevant here. Nonetheless I came to use it instead of the index in the example below. The bins for the distances and z's are in 10 m and 0.1 m steps, respectively and I can obtain a result from looping through groups of the dataFrame:
# create new column for maximal possible distances per group N
for j in range(N.groupby('N')['Dist_first'].count().max()):
    N[j+1] = N.loc[N[N['Dist_first'].notna()].groupby('N')['Scatters'].nlargest(j+1).groupby('N').min()]['Dist_first']
    # fill nans with zeros to allow
    N[j+1] = N[j+1].fillna(0)
    # make sure no value is repeated
    if j+1 > 1:
        N[j+1] = N[j+1]-N[list(np.arange(j)+1)].sum(axis=1)
# and set all values <= 0 to NaN
N[N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] <= 0] = np.nan
# backwards fill to make sure every distance gets all necessary depths
N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] = N.set_index('N').groupby('N').bfill().set_index('Scatters')[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)]
# bin the result(s)
for j in range(N.groupby('N')['Dist_first'].count().max()):
    binned = N[N['z'] >= 0].groupby([pd.cut(N[N['z'] >= 0]['z'], bins_v, include_lowest=True), pd.cut(N[N['z'] >= 0][j+1], bins_h, include_lowest=True)])
    binned = binned.size().unstack()
    ## rename
    binned.index = N_v.index; binned.columns = N_h.index
    ## and sum up with earlier chunks
    V = V+binned
This bit of code works just fine and the result for the small snippet of the data I've shared looks like this:
Distance [m] 0.0 10.0 20.0 30.0 40.0
Depth [m]
----------------------------------------------------
0.0 1 1 1 4 2
0.1 1 2 2 4 0
0.2 0 3 0 3 0
0.3 0 2 0 2 0
0.4 0 0 0 0 0
However, the full datasets are excessively large (> 300 million rows each), and looping through all rows is not an option. Therefore I'm looking for a vectorized solution.
I suggest calculating the criteria in extra columns and then using a standard pandas binning function such as qcut (or cut, for fixed-width bins). It can be applied separately along the two binning dimensions. Not the most elegant approach, but definitely vectorized.
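To illustrate only the binning step, here is a rough sketch with numpy.histogram2d. It assumes the (z, distance) pairs have already been built as two equal-length arrays, which is the hard part of the question and is not addressed here. The sample values and the names z, dist, z_edges and d_edges are hypothetical.

```python
import numpy as np

# Hypothetical, already-paired samples: one depth z and one distance per pair.
z = np.array([0.05, 0.15, 0.05, 0.25])
dist = np.array([5.0, 15.0, 25.0, 5.0])

# Bin edges in 0.1 m (depth) and 10 m (distance) steps, as in the question.
z_edges = np.array([0.0, 0.1, 0.2, 0.3])
d_edges = np.array([0.0, 10.0, 20.0, 30.0])

# counts[i, j] is the number of pairs in depth bin i and distance bin j.
counts, _, _ = np.histogram2d(z, dist, bins=[z_edges, d_edges])
print(counts.astype(int))
```

Since the counts are additive, the same call could be applied chunk by chunk and the resulting arrays summed, which may help with data that does not fit in memory.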
I am joining two tables left_table and right_table on non-unique keys that results in row explosion. I then want to aggregate rows to match the number of rows in left_table. To do this I aggregate over left_table columns.
Weirdly, when I save the table the columns in left_table double. It seems like columns of left_table become an index for resulting dataframe...
Left table
k1 k2 s v c target
0 1 3 20 40 2 2
1 1 2 10 20 1 1
2 1 2 10 80 2 1
Right table
k11 k22 s2 v2
0 1 2 0 100
1 2 3 30 200
2 1 2 10 300
Left join
k1 k2 s v c target s2 v2
0 1 3 20 40 2 2 NaN NaN
1 1 2 10 20 1 1 0.0 100.0
2 1 2 10 20 1 1 10.0 300.0
3 1 2 10 80 2 1 0.0 100.0
4 1 2 10 80 2 1 10.0 300.0
Aggregation code
dic = {}
keys_to_agg_over = left_table_col_names
for col in numeric_cols:
    if col in all_cols:
        dic[col] = 'median'
left_join = left_join.groupby(keys_to_agg_over).aggregate(dic)
After aggregation (doubled number of left table cols)
k1 k2 s v c target s2 v2
k1 k2 s v c target
1 2 10 20 1 1 1 2 10 20 1 1 5.0 200.0
80 2 1 1 2 10 80 2 1 5.0 200.0
3 20 40 2 2 1 3 20 40 2 2 NaN NaN
Saved to csv file
k1,k2,s,v,c,target,k1,k2,s,v,c,target,s2,v2
1,2,10,20,1,1,1,2,10,20,1,1,5.0,200.0
1,2,10,80,2,1,1,2,10,80,2,1,5.0,200.0
1,3,20,40,2,2,1,3,20,40,2,2,,
I tried resetting index, as left_join.reset_index() but I get
ValueError: cannot insert target, already exists
How to fix the issue of column-doubling?
You have a couple of options:
Store the CSV without the index: I guess you are using the to_csv method to write the result. By default it includes your index columns in the generated CSV; you can call to_csv(index=False) to avoid storing them.
Drop the index when resetting it: you can use left_join.reset_index(drop=True) to discard the index columns instead of adding them to the dataframe. By default reset_index inserts the current index columns into the dataframe, which raises the ValueError you see because columns with those names already exist.
It seems like you are using:
left_join = left_table.merge(right_table, left_on=["k1", "k2"], right_on=["k11", "k22"], how="left")
This results in a dataframe with repeated rows, since rows 1 and 2 of the left table can each be joined to rows 0 and 2 of the right table. If that is the behavior you expected and you just want to get rid of duplicated rows, you can try the following before aggregating:
left_join = left_join.drop_duplicates()
This won't stop the join from duplicating rows; it only removes them afterwards so they cause no trouble.
To stop the grouping columns from becoming the index, you can also pass as_index=False to groupby, like this:
left_join = left_join.groupby(keys_to_agg_over, as_index=False).aggregate(dic)
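A small end-to-end sketch of the as_index=False route on the question's tables. It assumes the aggregation dict is restricted to the columns coming from the right table (s2 and v2), so the grouping keys are not also aggregated; that restriction is my assumption, not something stated in the question.

```python
import pandas as pd

left_table = pd.DataFrame({"k1": [1, 1, 1], "k2": [3, 2, 2], "s": [20, 10, 10],
                           "v": [40, 20, 80], "c": [2, 1, 2], "target": [2, 1, 1]})
right_table = pd.DataFrame({"k11": [1, 2, 1], "k22": [2, 3, 2],
                            "s2": [0, 30, 10], "v2": [100, 200, 300]})

left_join = left_table.merge(right_table, left_on=["k1", "k2"],
                             right_on=["k11", "k22"], how="left")

keys_to_agg_over = list(left_table.columns)
# Aggregate only the columns that come from the right table.
dic = {"s2": "median", "v2": "median"}

# as_index=False keeps the grouping keys as ordinary columns.
result = left_join.groupby(keys_to_agg_over, as_index=False).agg(dic)
print(result)
```

The result has one row per left-table row and each left-table column appears once, so writing it with to_csv(index=False) should not repeat any column.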
I have a dataframe with multiple NaN values. I want to fill each with a random number between 0 and 1. I tried fillna, but that fills every missing cell with the same value.
We could use iterrows, but it consumes a lot of resources. Is there another way to do it, and if so, how? The following is an example of my dataframe.
> df
a b c d
0 1 10 na na
1 2 20 40 30
2 24 na na na
expected output
> df
a b c d
0 1 10 0.7 0.9
1 2 20 40 30
2 24 0.9 0.34 0.532
Basically, each na should be replaced with some number in (0, 1).
You can create your own formula that uses a random number.
In the solution below, I multiply column a by a random number and keep only the fractional part, since you want numbers between 0 and 1.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'a':[1,2,24], 'b':[10,20, np.nan],'c':[np.nan,40,np.nan],'d':[np.nan,30,np.nan]})
for c in df.columns:
    df[c] = np.where(df[c].isnull(),(df['a']*random.random())%1,df[c])
print(df)
Output:
a b c d
0 1.0 10.000000 0.526793 0.678061
1 2.0 20.000000 40.000000 30.000000
2 24.0 0.865441 0.643032 0.273461
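Note that the loop above draws a single random number per column and scales column a by it, so the filled values within a column are not independent draws. If independent values are wanted, one option is to generate a full frame of random numbers and let fillna pick from it. This is a sketch only; the names rng, noise and filled are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 24], "b": [10, 20, np.nan],
                   "c": [np.nan, 40, np.nan], "d": [np.nan, 30, np.nan]})

rng = np.random.default_rng()
# One independent draw in [0, 1) per cell; only the NaN cells end up using it.
noise = pd.DataFrame(rng.random(df.shape), index=df.index, columns=df.columns)
filled = df.fillna(noise)
print(filled)
```

fillna aligns the replacement frame on index and columns, so existing values are left untouched and no Python-level loop over rows is needed.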
I have a table in pandas df1
id value
1 1500
2 -1000
3 0
4 50000
5 50
also I have another table in dataframe df2, that contains upper boundaries of groups, so essentially every row represents an interval from the previous boundary to the current one (the first interval is "<0"):
group upper
0 0
1 1000
2 NaN
How should I get the relevant group for each value in df1, using the intervals from df2? I can't use join, merge, etc., because the rule for this join is "value is between the previous upper and the current upper", not "value equals something". The only way I've found is to use a predefined function with df.apply() (it also handles categorical values when interval_flag==False):
import math

def values_to_group(x, interval_flag, groups_def):
    if interval_flag==True:
        for ind, gr in groups_def.sort_values(by='group').iterrows():
            if x<gr[1]:
                return gr[0]
            elif math.isnan(gr[1]) == True:
                return gr[0]
    else:
        for ind, gr in groups_def.sort_values(by='group').iterrows():
            if x in gr[1]:
                return gr[0]
Is there an easier/more optimal way to do it?
The expected output should be this:
id value group
1 1500 2
2 -1000 0
3 0 1
4 50000 2
5 50 1
I suggest using cut, with df2 sorted by upper and the last NaN replaced by np.inf:
df2 = pd.DataFrame({'group':[0,1,2], 'upper':[0,1000,np.nan]})
df2 = df2.sort_values('upper')
df2['upper'] = df2['upper'].replace(np.nan, np.inf)
print (df2)
group upper
0 0 0.000000
1 1 1000.000000
2 2 inf
#added first bin -np.inf
bins = np.insert(df2['upper'].values, 0, -np.inf)
df1['group'] = pd.cut(df1['value'], bins=bins, labels=df2['group'], right=False)
print (df1)
id value group
0 1 1500 2
1 2 -1000 0
2 3 0 1
3 4 50000 2
4 5 50 1
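For comparison, the same lookup can be sketched with numpy.searchsorted on the finite boundaries only, which avoids adding the infinite edges. The names finite_upper and pos are illustrative, and the sketch assumes df2 has exactly one open-ended (NaN) group.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3, 4, 5], "value": [1500, -1000, 0, 50000, 50]})
df2 = pd.DataFrame({"group": [0, 1, 2], "upper": [0, 1000, np.nan]})

df2 = df2.sort_values("upper")  # NaN (the open-ended group) sorts last
finite_upper = df2["upper"].dropna().to_numpy()

# side="right" puts a value equal to a boundary into the next group,
# matching right=False in pd.cut.
pos = np.searchsorted(finite_upper, df1["value"].to_numpy(), side="right")
df1["group"] = df2["group"].to_numpy()[pos]
print(df1)
```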
Here's a solution using numpy.digitize. Your only task is to construct bins and names input lists, which should be possible via an input dataframe.
import pandas as pd, numpy as np
df = pd.DataFrame({'val': [99, 53, 71, 84, 84]})
df['ratio'] = df['val']/ df['val'].shift() - 1
bins = [-np.inf, 0, 0.2, 0.4, 0.6, 0.8, 1.0, np.inf]
names = ['<0', '0.0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8', '0.8-1.0', '>1']
d = dict(enumerate(names, 1))
df['Bucket'] = list(map(d.get, np.digitize(df['ratio'], bins)))
print(df)
val ratio Bucket
0 99 NaN None
1 53 -0.464646 <0
2 71 0.339623 0.2-0.4
3 84 0.183099 0.0-0.2
4 84 0.000000 0.0-0.2
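The same bins and names lists can also be passed to pd.cut, which returns a categorical column and leaves the NaN ratio as NaN instead of None. A minimal sketch reusing the data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"val": [99, 53, 71, 84, 84]})
df["ratio"] = df["val"] / df["val"].shift() - 1

bins = [-np.inf, 0, 0.2, 0.4, 0.6, 0.8, 1.0, np.inf]
names = ["<0", "0.0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8", "0.8-1.0", ">1"]

# right=False makes intervals closed on the left, like np.digitize's default.
df["Bucket"] = pd.cut(df["ratio"], bins=bins, labels=names, right=False)
print(df)
```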