I have a simple data frame with IDs and date, like below:
ID Date
a 2009/12/1
c 2009/12/1
d 2009/12/1
a 2010/4/1
c 2010/5/1
e 2010/5/1
b 2010/12/1
b 2012/3/1
e 2012/7/1
b 2013/1/1
...
...
I need to count the unique IDs in each month and accumulate the counts over time, without re-counting IDs that have already appeared. For instance:
2009/12/1 3
2010/4/1 3
2010/5/1 4
... ...
I created a loop, but it is not working:
for d in df['date'].drop_duplicates():
    c = df[df['date'] <= d].ID.nunique()
    df2 = DataFrame(data=c, index=d)
Can anyone tell me where the problem is? Thanks.
You should be using groupby() rather than looping over your data frame. (As written, the loop also overwrites df2 on every iteration, passes scalar values for data and index to the DataFrame constructor, and refers to the column as 'date' although it is named 'Date'.) After grouping by the Date column, you can count the unique instances of ID using:
df.groupby('Date')['ID'].nunique()
Quick example:
import pandas as pd

df = pd.DataFrame([['a', '2009/12/1'],
                   ['c', '2009/12/1'],
                   ['d', '2009/12/1'],
                   ['c', '2009/12/1'],
                   ['a', '2010/4/1'],
                   ['c', '2010/5/1'],
                   ['e', '2010/5/1']], columns=['ID', 'Date'])
df.groupby('Date')['ID'].nunique()
# returns:
# Date
# 2009/12/1 3
# 2010/4/1 1
# 2010/5/1 2
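Note that this counts unique IDs within each date rather than cumulatively. To get the running total the question asks for, counting each ID only at its first appearance, one possible sketch is to flag first occurrences and take a cumulative sum:

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
# True only at an ID's first appearance anywhere in the frame
first_seen = ~df['ID'].duplicated()
# count first appearances per date, then accumulate
first_seen.groupby(df['Date']).sum().cumsum()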
One option is to write a for loop and use a set to hold the cumulative unique IDs:
cumcount = []
cumunique = set()
date = []

for k, g in df.groupby(pd.to_datetime(df.Date)):
    cumunique |= set(g.ID)           # hold cumulative unique IDs
    date.append(g.Date.iat[0])       # get the date variable for each group
    cumcount.append(len(cumunique))  # hold cumulative count of unique IDs
pd.DataFrame({"Date": date, "ID": cumcount})
Description:
I have a GUI that allows the user to add variables that are displayed in a dataframe. As the variables are added, they are automatically numbered, e.g. 'FIELD_0', 'FIELD_1', etc., and each variable has a value associated with it. The data is actually row-based rather than column-based: the 'FIELD' IDs are in column 0 and progress downwards, and the corresponding value is in column 1, in the same row. As shown below:
0 1
0 FIELD_0 HH_5_MILES
1 FIELD_1 POP_5_MILES
The user is able to reorder these values and move them up/down a row. However, it's important that the number ordering remains sequential. So, if the user positions 'FIELD_1' above 'FIELD_0' then it gets re-numbered appropriately. Example:
0 1
0 FIELD_0 POP_5_MILES
1 FIELD_1 HH_5_MILES
Currently, I'm using the code below to perform this adjustment; the same re-numbering occurs with other variable names within the same dataframe.
df = pandas.DataFrame({0: ['FIELD_1', 'FIELD_0']})
variable_list = ['FIELD', 'OPERATOR', 'RESULT']

for var in variable_list:
    field_list = ['%s_%s' % (var, _) for _, field_name in enumerate(df[0].isin([var]))]
    field_count = 0
    for _, field_name in enumerate(df.loc[:, 0]):
        if var in field_name:
            df.loc[_, 0] = field_list[field_count]
            field_count += 1
This gets me the result I want, but it seems a bit inelegant. If there is a better way, I'd love to know what it is.
It appears you're looking to overwrite the FIELD values so that they always appear in order starting from 0.
We can filter to only the rows whose value str.contains the word FIELD, then assign a freshly numbered list comprehension (like your field_list) to those rows.
import pandas as pd
# Modified DF
df = pd.DataFrame({0: ['FIELD_1', 'OTHER_1', 'FIELD_0', 'OTHER_0']})
# Select Where Values are Field
m = df[0].str.contains('FIELD')
# Overwrite field with new values by iterating over the total matches
df.loc[m, 0] = [f'FIELD_{n}' for n in range(m.sum())]
print(df)
df:
0
0 FIELD_0
1 OTHER_1
2 FIELD_1
3 OTHER_0
For multiple variables:
import pandas as pd
# Modified DF
df = pd.DataFrame({0: ['FIELD_1', 'OTHER_1', 'FIELD_0', 'OTHER_0']})
variable_list = ['FIELD', 'OTHER']
for v in variable_list:
    # Select rows where the value contains the variable name
    m = df[0].str.contains(v)
    # Overwrite with new sequential values, iterating over the match count
    df.loc[m, 0] = [f'{v}_{n}' for n in range(m.sum())]
df:
0
0 FIELD_0
1 OTHER_0
2 FIELD_1
3 OTHER_1
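One caveat: str.contains interprets its pattern as a regular expression and matches anywhere in the string, so a variable name that happens to be a substring of another would over-match. If that can occur in your data, a stricter literal prefix match is an option (a sketch, using the same loop variable v):

m = df[0].str.startswith(v + '_')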
You can use sort_values with a numeric key, as below:
def f(x):
    l = x.split('_')[1]
    return int(l)

df.sort_values(0, key=lambda col: [f(k) for k in col]).reset_index(drop=True)
0
0 FIELD_0
1 FIELD_1
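The same key can also be written with pandas string methods instead of a helper function (note that sort_values with a key argument requires pandas >= 1.1):

df.sort_values(0, key=lambda col: col.str.split('_').str[1].astype(int)).reset_index(drop=True)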
I have a dataframe whose values are lists. How can I calculate the product of the lengths of all lists in a row and store it in a separate column? Maybe the following example will make it clear:
test_1 = ['Protocol', 'SCADA', 'SHM System']
test_2 = ['CM', 'Finances']
test_3 = ['RBA', 'PBA']
df = pd.DataFrame({'a':[test_1,test_2,test_3],'b':[test_2]*3, 'c':[test_3]*3, 'product of len(lists)':[12,8,8]})
This sample code shows that in the first row the product is 3 * 2 * 2 = 12, i.e. the lengths of the three lists in that row, and similarly for the other rows.
How can I compute these products and store in a new column, for a dataframe whose all values are lists?
Thank you.
Try using DataFrame.applymap and DataFrame.product:
df['product of len(lists)'] = df[['a', 'b', 'c']].applymap(len).product(axis=1)
[out]
a b c product of len(lists)
0 [Protocol, SCADA, SHM System] [CM, Finances] [RBA, PBA] 12
1 [CM, Finances] [CM, Finances] [RBA, PBA] 8
2 [RBA, PBA] [CM, Finances] [RBA, PBA] 8
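As an aside, on pandas 2.1+ applymap is deprecated in favour of the equivalent element-wise DataFrame.map, so a forward-compatible version would be:

df['product of len(lists)'] = df[['a', 'b', 'c']].map(len).product(axis=1)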
I would like to know if it is possible to create a dataframe from two dictionaries.
I get two dictionaries like this:
d = {'MO': ['N-2', 'N-8', 'N-7', 'N-6', 'N-9'], 'MO2': ['N0-6'], 'MO3': ['N-2']}
My desired result looks like this:
ID NUM
0 MO 'N-2', 'N-8', 'N-7', 'N-6', 'N-9'
1 MO2 'N0-6'
2 MO3 'N-2'
I tried to obtain this result, but the NUM column shows the values with list brackets ([]) and I can't remove them:
liste_id = list(d.keys())
liste_num = list(d.values())
df = pandas.DataFrame({'ID': liste_id, 'NUM': liste_num})
Merge the values in the dictionary into a single string before creating the dataframe; this removes the list brackets from the NUM column:
pd.DataFrame([(key, ", ".join(value)) for key, value in d.items()],
             columns=['ID', 'NUM'])
ID NUM
0 MO N-2, N-8, N-7, N-6, N-9
1 MO2 N0-6
2 MO3 N-2
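For reference, if you would rather keep the lists themselves in NUM, the frame can also be built directly from the dictionary's items:

pd.DataFrame(list(d.items()), columns=['ID', 'NUM'])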
I have a dataframe from which I want to know the highest value for each column. But I also want to know in what row it happened.
With my code I have to put the name of each column each time. Is there a better way to get all highest values from all columns?
df2.loc[df2['ALL'].idxmax()]
[Screenshots in the original question showed the dataframe, the single row returned by the code above, and the desired output listing each column's maximum together with the row it occurs in.]
You can stack your frame, sort the values from largest to smallest, and then take the first occurrence of each column name.
First I will create some fake data:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'),
                  index=list('nopqrstuvw'))
df.columns.name = 'level_0'
df.index.name = 'level_1'
Output
level_0 a b c d e
level_1
n 0.417317 0.821350 0.443729 0.167315 0.281859
o 0.166944 0.223317 0.418765 0.226544 0.508055
p 0.881260 0.789210 0.289563 0.369656 0.610923
q 0.893197 0.494227 0.677377 0.065087 0.228854
r 0.394382 0.573298 0.875070 0.505148 0.334238
s 0.046179 0.039642 0.930811 0.326114 0.880804
t 0.143488 0.561449 0.832186 0.486752 0.323215
u 0.891823 0.616401 0.247078 0.497050 0.995108
v 0.888553 0.386260 0.816100 0.874761 0.769073
w 0.557239 0.601758 0.932839 0.274614 0.854063
Now stack, sort, and drop all but the first occurrence of each column:
df.stack()\
  .sort_values(ascending=False)\
  .reset_index()\
  .drop_duplicates('level_0')\
  .sort_values('level_0')[['level_0', 0, 'level_1']]
level_0 0 level_1
3 a 0.893197 q
12 b 0.821350 n
1 c 0.932839 w
9 d 0.874761 v
0 e 0.995108 u
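If all you need is each column's maximum and the row label where it occurs, a shorter route may be to aggregate each column with max and idxmax:

df.agg(['max', 'idxmax'])

This returns a two-row frame: the maximum of each column and the index label at which it occurs.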
I have a pandas dataframe where all missing values are np.nan, and now I am trying to replace these missing values. The last column of my data is "class". I need to group the data by class, then get the mean/median/mode of each column within each group (depending on whether the data is categorical/continuous, normal/not) and replace the group's missing values in that column with the respective mean/median/mode.
This is the code I have come up with, which I know is overkill.
If I could:
group the columns of the dataframe,
get the median/mode/mean of each column within each group,
replace the missing values of those groups, and
recombine them back into the original df,
it would be great.
But currently I ended up finding the replacement values (mean/median/mode) group-wise and storing them in a dict, then separating the NaN rows from the non-NaN rows, replacing the missing values in the NaN rows, and trying to join them back to the dataframe (which I don't yet know how to do):
def fillMissing(df, dataType):
    '''
    Args:
        df (2d array/dict):
            eg: ('attribute1': [12, 24, 25], 'attribute2': ['good', 'bad'])
        dataType (dict): dictionary of attribute names of df as keys and values 0/1
            indicating categorical/continuous variable eg: ('attribute1': 1, 'attribute2': 0)
    Returns:
        dataframe with missing values filled
        writes a file with missing values replaced
    '''
    dataLabels = list(df.columns.values)
    # the dictionary to hold the values to put in place of nan
    replaceValues = {}
    for eachlabel in dataLabels:
        thisSer = df[eachlabel]
        if dataType[eachlabel] == 1:  # if it is a continuous variable
            _, pval = stats.normaltest(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                if pval < 0.5:
                    groupMiddle = group.median()  # get the median of the group
                else:
                    groupMiddle = group.mean()    # get the mean (if the group is normal)
                innerDict[name.strip()] = groupMiddle
            replaceValues[eachlabel] = innerDict
        else:  # if the series is categorical
            # freqCount = collections.Counter(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                freqC = collections.Counter(group)
                mostFreq = freqC.most_common(1)  # the most frequent value of the attribute (grouped by class)
                # newGroup = group.replace(np.nan, mostFreq)
                innerDict[name.strip()] = mostFreq[0][0].strip()
            replaceValues[eachlabel] = innerDict
    print(replaceValues)

    # replace the missing values =======================
    newfile = open('missingReplaced.csv', 'w')
    newdf = df
    mask = False
    for col in df.columns:
        mask = mask | df[col].isnull()
    # get the dataframe of rows that contain nulls
    dfnulls = df[mask]
    dfnotNulls = df[~mask]
    for _, row in dfnulls.iterrows():
        for colname in dataLabels:
            if pd.isnull(row[colname]):
                if row['class'].strip() == '>50K':
                    row[colname] = replaceValues[colname]['>50K']
                else:
                    row[colname] = replaceValues[colname]['<=50K']
            newfile.write(str(row[colname]) + ",")
        newdf.append(row)
        newfile.write("\n")
    # here add newdf to dfnotNulls to get finaldf
    return finaldf
If I understand correctly, this is mostly in the documentation, but probably not where you'd be looking if you're asking the question. See note regarding mode at the bottom as it is slightly trickier than mean and median.
df = pd.DataFrame({'v': [1, 2, 2, np.nan, 3, 4, 4, np.nan]}, index=[1, 1, 1, 1, 2, 2, 2, 2])
df['v_mean'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mean()))
df['v_med']  = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.median()))
df['v_mode'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mode()[0]))
df
v v_mean v_med v_mode
1 1 1.000000 1 1
1 2 2.000000 2 2
1 2 2.000000 2 2
1 NaN 1.666667 2 2
2 3 3.000000 3 3
2 4 4.000000 4 4
2 4 4.000000 4 4
2 NaN 3.666667 4 4
Note that mode() may not be unique, unlike mean and median and pandas returns it as a Series for that reason. To deal with that, I just took the simplest route and added [0] in order to extract the first member of the series.
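Applied to the question's setup, you would group on the 'class' column rather than the index; for example, for one continuous column (the column name here is hypothetical):

df['age'] = df.groupby('class')['age'].transform(lambda x: x.fillna(x.median()))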