I have a dataframe that looks like this, but with 26 rows and 110 columns:
index/io 1 2 3 4
0 42 53 23 4
1 53 24 6 12
2 63 12 65 34
3 13 64 23 43
Desired output:
index io value
0 1 42
0 2 53
0 3 23
0 4 4
1 1 53
1 2 24
1 3 6
1 4 12
2 1 63
2 2 12
...
I tried with dicts and lists, by converting the dataframe to a dict, then building a new list with the index values and updating a new dict with the io keys.
indx = []
for key, value in mydict.items():
    for k, v in value.items():
        indx.append(key)

indxio = {}
for element in indx:
    for key, value in mydict.items():
        for k, v in value.items():
            indxio.update({element: k})
I know this is probably way off, but it's the only thing I could think of. The process was taking too long, so I stopped it.
You can use set_index, stack, and reset_index, then rename the columns:
df.set_index("index/io").stack().reset_index(name="value") \
  .rename(columns={'index/io': 'index', 'level_1': 'io'})
Output:
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
You need set_index + stack + rename_axis + reset_index:
df = (df.set_index('index/io')
        .stack()
        .rename_axis(('index', 'io'))
        .reset_index(name='value'))
print(df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
A solution with melt and rename; the values come out in a different order, so sort_values is necessary:
d = {'index/io': 'index'}
df = (df.melt('index/io', var_name='io', value_name='value')
        .rename(columns=d)
        .sort_values(['index', 'io'])
        .reset_index(drop=True))
print(df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
And an alternative solution for NumPy fans:
import numpy as np

df = df.set_index('index/io')
a = np.repeat(df.index, len(df.columns))  # each index value, once per column
b = np.tile(df.columns, len(df.index))    # the column labels, repeated for every row
c = df.values.ravel()                     # the values, row by row
cols = ['index', 'io', 'value']
df = pd.DataFrame(np.column_stack([a, b, c]), columns=cols)
print(df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
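One caveat with the NumPy route: np.column_stack coerces everything to a common dtype, so if the column labels are strings, all three result columns come back as strings. A final astype (a sketch, assuming you want integers back) restores the numeric dtypes:
df = df.astype({'index': int, 'io': int, 'value': int})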
Assume I have this df:
pd.DataFrame({'data': [0,0,0,1,1,1,2,2,2,3,3,4,4,5,5,0,0,0,0,2,2,2,2,4,4,4,4]})
data
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
10 3
11 4
12 4
13 5
14 5
15 0
16 0
17 0
18 0
19 2
20 2
21 2
22 2
23 4
24 4
25 4
26 4
I'm looking for a way to create a new column in df that shows a running count of the repeated data items within each consecutive run. For example:
data new
0 0 1
1 0 2
2 0 3
3 1 1
4 1 2
5 1 3
6 2 1
7 2 2
8 2 3
9 3 1
10 3 2
11 4 1
12 4 2
13 5 1
14 5 2
15 0 1
16 0 2
17 0 3
18 0 4
19 2 1
20 2 2
21 2 3
22 2 4
23 4 1
24 4 2
25 4 3
26 4 4
My idea was to pull the rows into a Python list, compare the items, and build a new list.
Is there a simple way to do this?
Example
df = pd.DataFrame({'data': [0,0,0,1,1,1,2,2,2,3,3,4,4,5,5,0,0,0,0,2,2,2,2,4,4,4,4]})
Code
# True at the first row of each consecutive run; cumsum turns that into a run id
grouper = df['data'].ne(df['data'].shift(1)).cumsum()
# number the rows within each run, starting at 1
df['new'] = df.groupby(grouper).cumcount().add(1)
df
data new
0 0 1
1 0 2
2 0 3
3 1 1
4 1 2
5 1 3
6 2 1
7 2 2
8 2 3
9 3 1
10 3 2
11 4 1
12 4 2
13 5 1
14 5 2
15 0 1
16 0 2
17 0 3
18 0 4
19 2 1
20 2 2
21 2 3
22 2 4
23 4 1
24 4 2
25 4 3
26 4 4
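For intuition, here is what the grouper looks like on the example data (a minimal check, reusing the df defined above):
grouper = df['data'].ne(df['data'].shift(1)).cumsum()
print(grouper.tolist())
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9]
Rows in the same consecutive run share a run id, so cumcount restarts at every change of value.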
I am having a hard time understanding what the code below does. I initially thought it was counting the unique appearances of the values in (Weight, Age) and (Weight, Height); however, when I ran this example, I found out it was doing something else.
data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],[1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data,columns=["Lbl","Weight","Age","Height"])
print (df)
def group_fea(df, key, target):
    '''
    Adds columns for feature combinations
    '''
    tmp = df.groupby(key, as_index=False)[target].agg({
        key + target + '_nunique': 'nunique',
    }).reset_index()
    del tmp['index']
    print("****{}****".format(target))
    return tmp

# Add feature combinations
feature_key = ['Weight']
feature_target = ['Age', 'Height']
for key in feature_key:
    for target in feature_target:
        tmp = group_fea(df, key, target)
        df = df.merge(tmp, on=key, how='left')
print(df)
Lbl Weight Age Height
0 0 33 15 4
1 1 44 12 3
2 0 44 12 5
3 1 33 15 4
4 0 77 13 4
5 1 33 15 4
6 1 99 40 7
7 0 58 45 4
8 1 11 13 4
****Age****
****Height****
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
I want to understand what the values in WeightAge_nunique and WeightHeight_nunique mean.
The value of WeightAge_nunique on a given row is the number of unique Ages that have the same Weight. The corresponding thing is true of WeightHeight_nunique. E.g., for people of Weight=44, there is only 1 unique age (12), hence WeightAge_nunique=1 on those rows, but there are 2 unique Heights (3 and 5), hence WeightHeight_nunique=2 on those same rows.
You can see that this happens because the grouping function groups by the "key" column (Weight), then performs the "nunique" aggregation function on the "target" column (either Age or Height).
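A quick way to see this directly (a minimal check, reusing the df defined above):
print(df.groupby('Weight')['Age'].nunique())
print(df.groupby('Weight')['Height'].nunique())
Every Weight maps to a single unique Age, while Weight=44 maps to two unique Heights, which is exactly the 2 that appears in WeightHeight_nunique.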
Let us try transform:
g = df.groupby('Weight').transform('nunique')  # nunique of every other column, per Weight group
df['WeightAge_nunique'] = g['Age']
df['WeightHeight_nunique'] = g['Height']
df
Out[196]:
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
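Since transform('nunique') computes the statistic for every column, a per-column transform does only the needed work (a sketch of the same result):
df['WeightAge_nunique'] = df.groupby('Weight')['Age'].transform('nunique')
df['WeightHeight_nunique'] = df.groupby('Weight')['Height'].transform('nunique')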
How do we filter the dataframe below to remove all duplicate ID rows after a certain number of occurrences, i.e. remove all rows with ID == 0 after the 3rd occurrence of ID == 0?
Thanks
import numpy as np
import pandas as pd

pd.DataFrame(np.random.randint(0, 10, size=(100, 2)), columns=['ID', 'Value']).sort_values('ID')
Output:
ID Value
0 7
0 8
0 5
0 5
... ... ...
9 7
9 7
9 1
9 3
Desired Output for filter_count = 3:
Output:
ID Value
0 7
0 8
0 5
1 7
1 7
1 1
2 3
If you want to do this for all IDs, use:
df.groupby("ID").head(3)
For single ID, you can assign a new column using cumcount and then filter by conditions:
df["count"] = df.groupby("ID")["Value"].cumcount()
print (df.loc[(df["ID"].ne(0))|((df["ID"].eq(0)&(df["count"]<3)))])
ID Value count
64 0 6 0
77 0 6 1
83 0 0 2
44 1 7 0
58 1 5 1
40 1 2 2
35 1 7 3
89 1 9 4
19 1 7 5
10 1 3 6
45 2 4 0
68 2 1 1
74 2 4 2
75 2 8 3
34 2 4 4
60 2 6 5
78 2 0 6
31 2 8 7
97 2 9 8
2 2 6 9
93 2 8 10
13 2 2 11
...
I would do it without groupby; keep the first three ID == 0 rows and everything else:
df = pd.concat([df.loc[df.ID == 0].head(3), df.loc[df.ID != 0]])
Thanks Henry,
I modified your code and I think this should work as well.
Your df.groupby("ID").head(3) is great. Thanks.
df["count"] = df.groupby("ID")["Value"].cumcount()
df.loc[df["count"]<3].drop(['count'], axis=1)
I have a table which looks like this.
msno date num_25 num_50 num_75 num_985 num_100 num_unq \
0 rxIP2f2aN0rYNp+toI0Obt/N/FYQX8hcO1fTmmy2h34= 20150513 0 0 0 0 1 1
1 rxIP2f2aN0rYNp+toI0Obt/N/FYQX8hcO1fTmmy2h34= 20150709 9 1 0 0 7 11
2 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20150105 3 3 0 0 68 36
3 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20150306 1 0 1 1 97 27
4 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20150501 3 0 0 0 38 38
5 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20150702 4 0 1 1 33 10
6 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20150830 3 1 0 0 4 7
7 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20151107 1 0 0 0 4 5
8 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20160110 2 0 1 0 11 6
9 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20160316 9 3 4 1 67 50
10 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20160510 5 3 2 1 67 66
11 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20160804 1 4 5 0 36 43
12 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20160926 7 1 0 1 38 20
13 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20161115 0 1 4 1 38 40
14 yxiEWwE9VR5utpUecLxVdQ5B7NysUPfrNtGINaM2zA8= 20170106 0 0 0 1 39 38
15 PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 20151201 3 3 2 0 8 11
16 PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 20160628 0 0 1 1 1 3
17 PNxIsSLWOJDCm7pNPFzRO/6Mmg2WeZA2nf6hw6t1x3g= 20170106 2 1 0 0 35 34
18 KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8= 20150803 0 0 0 0 16 11
19 KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8= 20160527 4 3 0 2 2 11
20 KXF9c/T66LZIzFq+xS64icWMhDQE6miCZAtdXRjZHX8= 20160808 14 3 4 1 15 31
How should I sum up the columns 'num_25', 'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 'total_secs' to get totals, so that only one row per unique msno remains?
For example, after grouping all rows with the same msno, it should produce the result below, discarding the date column.
msno num_25 num_50 num_75 num_985 num_100 num_unq \
0 rxIP2f2aN0rYNp+toI0Obt/N/FYQX8hcO1fTmmy2h34= 9 1 0 0 8 12
I tried this, but the msno values are still duplicated and the date column is still there.
df_user_logs_v2.groupby(['msno', 'date'])['num_25', 'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 'total_secs'].sum()
Use drop + groupby + sum; grouping by date as well keeps every msno/date pair separate, so drop date first:
df = df_user_logs_v2.drop('date', axis=1).groupby('msno', as_index=False).sum()
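If you prefer to keep the summed columns explicit (a sketch using the column names from the question):
cols = ['num_25', 'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 'total_secs']
df = df_user_logs_v2.groupby('msno', as_index=False)[cols].sum()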
Is there a function that can swap between the following dataframes (df1, df2)?
import random
import pandas as pd
numbers = random.sample(range(1, 50), 10)
d = {'num': list(range(1, 6)) + list(range(1, 6)), 'values': numbers, 'type': ['a'] * 5 + ['b'] * 5}
df = pd.DataFrame(d)
e = {'num': list(range(1, 6)), 'a': numbers[:5], 'b': numbers[5:]}
df2 = pd.DataFrame(e)
Dataframe df1:
#df1
num type values
0 1 a 18
1 2 a 26
2 3 a 34
3 4 a 21
4 5 a 48
5 1 b 1
6 2 b 19
7 3 b 36
8 4 b 42
9 5 b 30
Dataframe df2:
a b num
0 18 1 1
1 26 19 2
2 34 36 3
3 21 42 4
4 48 30 5
I take the first df, and the values of the type column become column names holding the corresponding values. Is there a function that can do this (from df1 to df2) and the reverse (from df2 to df1)?
You can use stack and pivot:
print(df)
num type values
0 1 a 20
1 2 a 25
2 3 a 2
3 4 a 27
4 5 a 29
5 1 b 39
6 2 b 40
7 3 b 6
8 4 b 17
9 5 b 47
print(df2)
a b num
0 20 39 1
1 25 40 2
2 2 6 3
3 27 17 4
4 29 47 5
# wide -> long: move num into the index, stack a/b into rows
df1 = df2.set_index('num').stack().reset_index()
df1.columns = ['num', 'type', 'values']
df1 = df1.sort_values('type')
print(df1)
num type values
0 1 a 20
2 2 a 46
4 3 a 21
6 4 a 33
8 5 a 10
1 1 b 45
3 2 b 39
5 3 b 38
7 4 b 37
9 5 b 34
# long -> wide: type values become columns, 'values' fills the cells
df3 = df.pivot(index='num', columns='type', values='values').reset_index()
df3.columns.name = None
df3 = df3[['a', 'b', 'num']]
print(df3)
a b num
0 46 23 1
1 38 6 2
2 36 47 3
3 33 34 4
4 15 1 5
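As an aside, melt gets from the wide df2 back to the long format in one step (a sketch, reusing the df2 from above):
df1 = df2.melt(id_vars='num', var_name='type', value_name='values').sort_values('type')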