Unstack and convert dates of observations to sequence number? - python

I have a CSV with one row for every observation per individual:
USER DATE SCORE
1 7/9/2015 37.2
1 11/18/2015 68.9
2 7/7/2015 45.1
2 11/2/2015 42.9
3 6/4/2015 56
3 10/27/2015 39
3 5/11/2016 42.9
I'd like to produce a dataframe where the first observation is assigned to round one, second to round two, and so forth. So the result would look like:
USER R1 R2 R3
1 37.2 68.9 NaN
2 45.1 42.9 NaN
3 56 39 42.9
I've played around with pd.pivot and pd.unstack, but can't get what I need.
Suggestions?

You can use groupby with apply for creating new columns:
#if necessary sort values
df = df.sort_values(by=['USER','DATE'])
df = (df.groupby('USER')['SCORE']
        .apply(lambda x: pd.Series(x.values))
        .unstack()
        .rename(columns=lambda x: 'R' + str(x + 1))
        .reset_index())
print (df)
USER R1 R2 R3
0 1 37.2 68.9 NaN
1 2 45.1 42.9 NaN
2 3 56.0 39.0 42.9
Another solution with pivot, numbering the observations per user with cumcount:
#if necessary sort values
df = df.sort_values(by=['USER','DATE'])
df = (df.assign(ROUND=df.groupby('USER').cumcount() + 1)
        .pivot(index='USER', columns='ROUND', values='SCORE')
        .add_prefix('R')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
USER R1 R2 R3
0 1 37.2 68.9 NaN
1 2 45.1 42.9 NaN
2 3 56.0 39.0 42.9

First sort values by USER and DATE (this seems to be done already in example data but just to be sure).
Then create a new column ROUND that will sequentially number entries for every user.
Set index to columns USER and ROUND.
Finally, unstack the SCORE column.
Here's some example code:
import pandas as pd
from io import StringIO
data = '''USER DATE SCORE
1 7/9/2015 37.2
1 11/18/2015 68.9
2 7/7/2015 45.1
2 11/2/2015 42.9
3 6/4/2015 56
3 10/27/2015 39
3 5/11/2016 42.9'''
df = (pd.read_csv(StringIO(data), sep=r'\s+', parse_dates=['DATE'])
.sort_values(by=['USER','DATE'])
.assign(ROUND = lambda x: x.groupby('USER').cumcount() + 1)
.set_index(['USER','ROUND'])['SCORE']
.unstack()
.add_prefix('R')
)
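If you want USER back as a regular column to match the desired output above, a small optional addition (just a sketch) is to drop the columns name and reset the index:
df = df.rename_axis(None, axis=1).reset_index()
print(df)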

Related

Calculating tolerance

I am working with a data set that contains values with different decimal places. The data and code are shown below:
import numpy as np
import pandas as pd

data = {
    'value': [9.1, 10.5, 11.8,
              20.1, 21.2, 22.8,
              9.5, 10.3, 11.9,
              ]
}
df = pd.DataFrame(data, columns=['value'])
Which gives the following dataframe:
value
0 9.1
1 10.5
2 11.8
3 20.1
4 21.2
5 22.8
6 9.5
7 10.3
8 11.9
Now I want to add a new column titled adjusted. I want to calculate this column with the numpy.isclose function, using a tolerance of 2 (plus or minus 2). At the end I expect the results shown in the next table:
value adjusted
0 9.1 10
1 10.5 10
2 11.8 10
3 20.1 21
4 21.2 21
5 22.8 21
6 9.5 10
7 10.3 10
8 11.9 10
I tried the following line, but I only get True/False results, and it only works for one value (10), not for all values.
np.isclose(df1['value'],10,atol=2)
So can anybody help me solve this problem and calculate the tolerance for the values 10 and 21 in one line?
The exact logic and how this would generalize are not fully clear. Below are two options.
Assuming you want to test your values against a list of defined references, you can use the underlying numpy array and broadcasting:
vals = np.array([10, 21])
a = df['value'].to_numpy()
m = np.isclose(a[:, None], vals, atol=2)
df['adjusted'] = np.where(m.any(1), vals[m.argmax(1)], np.nan)
Assuming you want to group successive values, you can get the diff and start a new group when the difference is above threshold. Then round and get the median per group with groupby.transform:
group = df['value'].diff().abs().gt(2).cumsum()
df['adjusted'] = df['value'].round().groupby(group).transform('median')
Output:
value adjusted
0 9.1 10.0
1 10.5 10.0
2 11.8 10.0
3 20.1 21.0
4 21.2 21.0
5 22.8 21.0
6 9.5 10.0
7 10.3 10.0
8 11.9 10.0
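If you want plain integers in adjusted, as in the expected table, and the column contains no NaN, you can cast it afterwards (a small optional step, not part of the answer above):
df['adjusted'] = df['adjusted'].astype(int)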

Get weighted average summary data column in new pandas dataframe from existing dataframe based on other column-ID

This is somewhat similar to an earlier question I asked here: Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID
However, instead of just taking the sum of datapoints, I wanted to have the weighted average in an extra column. I'll repeat and rephrase the question:
I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surfaces and U-values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and the surface-weighted average U-value per apartment. There are three conditions for the original dataframe:
the dataframe can contain empty cells
when the values of surface or U-value are equal for all of the rows within an ID (so all the same values for the same ID), the data (surface, volumes) is not summed; instead one value/row is passed to the new summary column (example: 'ID 4'), as this could be a mistake in the original dataframe where the total surface/volume was inserted for all the rooms by the government employee
the average U-value should be the surface-weighted average U-value
Initial dataframe 'data':
print(data)
ID Surface U-value
0 2 10.0 1.0
1 2 12.0 1.0
2 2 24.0 0.5
3 2 8.0 1.0
4 4 84.0 0.8
5 4 84.0 0.8
6 4 84.0 0.8
7 52 NaN 0.2
8 52 96.0 1.0
9 95 8.0 2.0
10 95 6.0 2.0
11 95 12.0 2.0
12 95 30.0 1.0
13 95 12.0 1.5
Desired output from 'df':
print(df)
ID Surface U-value    #-> U-value = surface-weighted U-value; Surface = sum of all surfaces, except when all surfaces per ID are the same (example 'ID 4')
0 2 54.0 0.777
1 4 84.0 0.8     #-> as the values are the same for each row of this ID in the original data, the sum is not taken; only one of the rows is passed (see the second condition)
2 52 96.0 1.0    #-> as one of the two surfaces is empty, the corresponding U-value is ignored, so the output is the weighted average of the rows that have both a 'Surface' and a 'U-value' (in this case 1.0)
3 95 68.0 1.47
The code of jezrael in the referenced question already works brilliantly for the sum(), but how do I add a weighted-average 'U-value' column to it? I really have no idea. A plain average could be obtained with a mean() function instead of the sum(), but the weighted average..?
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2, 4, 52, 95]})
data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
    "U-value": [1.0, 1.0, 0.5, 1.0, 0.8, 0.8, 0.8, 0.2, 1.0, 2.0, 2.0, 2.0, 1.0, 1.5],
})
print(data)

cols = ['Surface']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
This should do the trick:
data.groupby('ID').apply(lambda g: (g['U-value']*g['Surface']).sum() / g['Surface'].sum())
To add it to the summary dataframe, don't reset the index first:
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum()
df['U-value'] = data.groupby('ID').apply(
    lambda g: (g['U-value'] * g['Surface']).sum() / g['Surface'].sum())
df.reset_index(inplace=True)
The result:
ID Surface U-value
0 2 54.0 0.777778
1 4 84.0 0.800000
2 52 96.0 1.000000
3 95 68.0 1.470588
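If you prefer to lean on numpy for the weighting, here is a minimal sketch of just the weighted-average part, assuming rows with a missing Surface should simply be dropped (np.average takes the weights directly):

# Surface-weighted U-value per ID; rows without a Surface are excluded first.
wavg = (data.dropna(subset=['Surface'])
            .groupby('ID')
            .apply(lambda g: np.average(g['U-value'], weights=g['Surface'])))
print(wavg)

For ID 52 the NaN-surface row is dropped, so only the row that has both values contributes, giving 1.0 as required.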

Pandas: Extract rows of a DF when a column value matches with a column value of another DF

I have two dataframes, DF1 and DF2, as shown below. The first column 'POS' of both dataframes might have matches, but the other columns will be different. I want to compare the 'POS' column of both dataframes: if a 'POS' value of DF1 is in DF2's 'POS' column, I want to store that row in a new DF1 dataframe, and do the same for DF2. I could do this easily with a dictionary by keeping POS as keys and comparing them to get the corresponding values, but a dictionary will not accept duplicate 'POS' keys, so I am wondering if there is a solution with pandas dataframes.
df1 =
POS id freq
0 100 "idex" 3.0
1 102 "ter" 2.0
2 102 "pec" 4.0
3 103 "jek" 4.0
4 104 "jek" 4.0
df2 =
POS id freq
0 100 "treg" 3.0
1 102 "dfet" 2.2
2 102 "idet" 7.0
3 108 "jeik" 1.0
4 109 "jek" 4.0
Expected:
new_df1 =
POS id freq
0 100 "idex" 3.0
1 102 "ter" 2.0
2 102 "pec" 4.0
new_df2 =
POS id freq
0 100 "treg" 3.0
1 102 "dfet" 2.2
2 102 "idet" 7.0
You can use isin for both dataframes:
new_df1 = df1[df1.POS.isin(df2.POS)]
new_df2 = df2[df2.POS.isin(df1.POS)]
>>> new_df1
POS id freq
0 100 idex 3.0
1 102 ter 2.0
2 102 pec 4.0
>>> new_df2
POS id freq
0 100 treg 3.0
1 102 dfet 2.2
2 102 idet 7.0
I believe you are describing a classic join problem.
I would recommend the .merge() method:
df = pd.merge(df1, df2, how='left', on='POS')
This returns a new dataframe with all rows of df1 and the columns of df2 joined on 'POS'; for POS values with no match in df2, the df2 columns are NaN. You can play around with the how= parameter (e.g. 'inner', 'outer') in order to get what you need. For more information, see the types of SQL joins.
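As a rough sketch of what that merge looks like with the example data (the suffixes argument is just an illustration to keep the overlapping id/freq columns apart):

merged = pd.merge(df1, df2, how='left', on='POS', suffixes=('_df1', '_df2'))
print(merged)

Be aware that duplicated 'POS' values produce one merged row per matching pair (POS 102 appears twice in both frames, giving four rows), so the result is not identical to the isin filtering above.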

Converting from Deep to wide format in pandas without Memory errors

I have a pandas dataframe, like the one below, which contains a person ID, a characteristic, and a count. This is currently in deep/long format.
Person Id Characteristics Count
123 Apple 2
123 Banana 4
124 Pineaple 1
125 Apple 2
I want to efficiently convert this into a wide format and create a matrix which needs to be fed into an algorithm for reducing components.
It should look something like below
Person Id Apple Banana Pineapple
123 2 4 0
124 0 0 1
125 2 0 0
I am looking for an efficient way of doing this. There are currently about 2000+ characteristics, so there will be about 2000 or more columns, and about 300K person IDs.
As you can see, if a characteristic is not present, we need to fill it with zeroes. My approach seems to be clogging up a lot of memory and I was getting memory errors.
I am confused as to how to implement this in an efficient way.
You can use pivot_table with reset_index and rename_axis (new in pandas 0.18.0), but pivoting needs a lot of memory:
print(df.pivot_table(index='Person Id',
                     columns='Characteristics',
                     values='Count',
                     fill_value=0).reset_index().rename_axis(None, axis=1))
Person Id Apple Banana Pineaple
0 123 2 4 0
1 124 0 0 1
2 125 2 0 0
Maybe faster is:
print(df.pivot(index='Person Id',
               columns='Characteristics',
               values='Count').fillna(0).reset_index().rename_axis(None, axis=1))
Person Id Apple Banana Pineaple
0 123 2.0 4.0 0.0
1 124 0.0 0.0 1.0
2 125 2.0 0.0 0.0
Timings:
In [69]: %timeit df.pivot_table(index='Person Id', columns='Characteristics', values='Count', fill_value=0).reset_index().rename_axis(None, axis=1)
100 loops, best of 3: 5.26 ms per loop
In [70]: %timeit df.pivot(index='Person Id', columns='Characteristics', values='Count').fillna(0).reset_index().rename_axis(None, axis=1)
1000 loops, best of 3: 1.87 ms per loop
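If even the faster pivot runs out of memory at roughly 300K rows x 2000 columns, one possible workaround (not part of the answer above, just a sketch) is to skip the dense frame entirely and build a sparse matrix from category codes, which many dimensionality-reduction implementations can consume directly:

import pandas as pd
from scipy import sparse

# Integer codes for rows (persons) and columns (characteristics).
persons = df['Person Id'].astype('category')
chars = df['Characteristics'].astype('category')

# One entry per (person, characteristic) pair; missing pairs stay implicit zeros.
mat = sparse.coo_matrix(
    (df['Count'].to_numpy(),
     (persons.cat.codes.to_numpy(), chars.cat.codes.to_numpy())),
    shape=(len(persons.cat.categories), len(chars.cat.categories))).tocsr()

# Labels, in case you need to map rows/columns back later.
row_labels = persons.cat.categories
col_labels = chars.cat.categories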

Pandas fillna with a lookup table

Having some trouble with filling NaNs. I want to take a dataframe column with a few NaNs and fill them with a value derived from a 'lookup table' based on a value from another column.
(You might recognize my data from the Titanic data set)...
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 Nan
I want to fill the NaN with a value from series 'pclass_lookup':
pclass_lookup
1 38.1
2 29.4
3 25.2
I have tried doing fillna with indexing like:
df.Age.fillna(pclass_lookup[df.Pclass]), but it gives me an error of
ValueError: cannot reindex from a duplicate axis
I also tried a lambda:
df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass])
but that does not seem to fill it right either. Am I totally missing the boat here?
Firstly, you have a duff value in row 4: you in fact have the string 'Nan', which is not the same as NaN, so even if your code did work this value would never be replaced.
So you need to replace that duff value and then you can just call map to perform the lookup on the NaN values:
In [317]:
df['Age'] = df['Age'].replace('Nan', np.nan)
df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
df
Out[317]:
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 29.4
4 1 38.1
Timings
For a df with 5000 rows:
In [26]:
%timeit df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:
%%timeit
def remove_na(x):
    if pd.isnull(x['Age']):
        return df1[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:
%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = df1.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop
So you can see that apply, because it iterates row-wise, scales poorly compared to the other two methods, which are vectorised; map is still the fastest.
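As a side note, the same map lookup can be combined with fillna in one line, so you do not need the boolean mask at all (assuming pclass_lookup is the Series from the question and the 'Nan' string has already been replaced with a real NaN):

df['Age'] = df['Age'].fillna(df['Pclass'].map(pclass_lookup))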
Building on the response of @vrajs5:
# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
# Solution:
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
>>> df
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1
Following should work for you:
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
df
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 NaN
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
pclass_lookup
1 38.1
2 29.4
3 25.2
dtype: float64
def remove_na(x):
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1
