Sum of dataframe columns to another dataframe column Python gives NaN - python

I want to sum up the rows and columns of two dataframes (pdf and wdf) and save the results in the columns of another dataframe (to_hex).
It works for one dataframe, but for the other it gives NaN, and I cannot see what the difference is.
to_hex = pd.DataFrame(0, index=np.arange(len(sasiedztwo)), columns=['ID','podroze','p_rozmyte'])
to_hex.loc[:,'ID']= wdf.index+1
to_hex.index=pdf.index
to_hex.loc[:,'podroze']= pd.DataFrame(pdf.sum(axis=0))[:]
to_hex.index=wdf.index
to_hex.loc[:,'p_rozmyte']= pd.DataFrame(wdf.sum(axis=0))[:]
This is what the pdf dataframe looks like:
0 1 2 3 4 5 6 7 8
0 0 0 10 0 0 0 0 0 100
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 1000
8 0 0 0 0 0 0 0 0 0
This is wdf:
0 1 2 3 4 5 6 7 8
0 2.5 5.0 35.0 0.0 27.5 55.0 25.0 50.0 102.5
1 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
2 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 25.0
3 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
4 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 250.0
6 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
7 0.0 0.0 250.0 0.0 250.0 500.0 250.0 500.0 1000.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 500.0
And this is the result in to_hex:
ID podroze p_rozmyte
0 1 0 NaN
1 2 0 NaN
2 3 10 NaN
3 4 0 NaN
4 5 0 NaN
5 6 0 NaN
6 7 0 NaN
7 8 0 NaN
8 9 1100 NaN

SOLUTION:
One option is to modify your code as follows:
to_hex.loc[:,'ID']= wdf.index+1
# to_hex.index=pdf.index # no need
to_hex.loc[:,'podroze']= pdf.sum(axis=0) # modified: use the Series returned by sum() directly
# to_hex.index=wdf.index # no need
to_hex.loc[:,'p_rozmyte']= wdf.sum(axis=0) # modified
Then you get:
ID podroze p_rozmyte
0 1 0 2.5
1 2 0 5.0
2 3 10 302.5
3 4 0 0.0
4 5 0 277.5
5 6 0 555.0
6 7 0 275.0
7 8 0 550.0
8 9 1100 3527.5
I think the reason that you get NaN for one case and correct values for the other case lies in to_hex.dtypes:
ID int64
podroze int64
p_rozmyte int64
dtype: object
As you can see, the to_hex dataframe has int64 columns. This is fine when you assign the pdf sums (since they have the same dtype):
pd.DataFrame(pdf.sum(axis=0))[:].dtypes
0 int64
dtype: object
but it does not work when you assign the wdf sums:
pd.DataFrame(wdf.sum(axis=0))[:].dtypes
0 float64
dtype: object
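A still shorter variant (a sketch along the same lines, not part of the original answer) builds to_hex directly from the two column sums, which sidesteps the dtype question entirely because pandas picks suitable dtypes itself:
import pandas as pd
# assemble to_hex in one step from the column sums of pdf and wdf
to_hex = pd.DataFrame({
    'ID': wdf.index + 1,
    'podroze': pdf.sum(axis=0),     # int64 column sums
    'p_rozmyte': wdf.sum(axis=0),   # float64 column sums
})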

Related

Python pandas How to pick up certain values by internal numbering?

I have a dataframe that looks like this:
Answers all_answers Score
0 0.0 0 72
1 0.0 0 73
2 0.0 0 74
3 1.0 1 1
4 -1.0 1 2
5 1.0 1 3
6 -1.0 1 4
7 1.0 1 5
8 0.0 0 1
9 0.0 0 2
10 -1.0 1 1
11 0.0 0 1
12 0.0 0 2
13 1.0 1 1
14 0.0 0 1
15 0.0 0 2
16 1.0 1 1
The first column is a signal that the sign has changed in the calculation flow.
The second is the first column with the minus sign removed.
The third is a running count for the second column: how many ones or zeros have occurred in a row so far.
I want to add a fourth column that keeps only the ones that occur in a run of at least, say, 5 in a row, while preserving the sign from the first column.
To get something like this
Answers all_answers Score New
0 0.0 0 72 0
1 0.0 0 73 0
2 0.0 0 74 0
3 1.0 1 1 1
4 -1.0 1 2 -1
5 1.0 1 3 1
6 -1.0 1 4 -1
7 1.0 1 5 1
8 0.0 0 1 0
9 0.0 0 2 0
10 -1.0 1 1 0
11 0.0 0 1 0
12 0.0 0 2 0
13 1.0 1 1 0
14 0.0 0 1 0
15 0.0 0 2 0
16 1.0 1 1 0
17 0.0 0 1 0
Is it possible to do this with Pandas?
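To reproduce the answers below, the sample frame can be rebuilt from the table above (a sketch of the 17 rows shown; the expected output additionally lists a row 17 that is not in the input):
import pandas as pd

# the 17 example rows, in order
df = pd.DataFrame({
    'Answers':     [0.0, 0.0, 0.0, 1.0, -1.0, 1.0, -1.0, 1.0, 0.0,
                    0.0, -1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    'all_answers': [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    'Score':       [72, 73, 74, 1, 2, 3, 4, 5, 1, 2, 1, 1, 2, 1, 1, 2, 1],
})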
You can use:
# group by consecutive 0/1
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
# get size of each group and compare to threshold
m = df.groupby(g)['all_answers'].transform('size').ge(5)
# mask small groups
df['New'] = df['Answers'].where(m, 0)
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
A faster way (with regex):
import pandas as pd
import re

def repl5(m):
    # replace each run of five or more '1's with the same number of '5's
    return '5' * len(m.group())

# concatenate the 0/1 flags into one string, mark long runs, then build a boolean mask
s = df['all_answers'].astype(str).str.cat()
d = re.sub('(?:1{5,})', repl5, s)
d = [x == '5' for x in list(d)]
df['New'] = df['Answers'].where(d, 0.0)
df
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
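If the run-length threshold ever needs to change, only the regex quantifier is involved; a hedged sketch wrapping the same trick in a small helper (the name mark_runs and its signature are mine, not from the answer):
import re
import pandas as pd

def mark_runs(flags, min_len=5):
    # True inside runs of 1s whose length is at least min_len
    s = flags.astype(int).astype(str).str.cat()
    marked = re.sub('1{%d,}' % min_len, lambda m: 'x' * len(m.group()), s)
    return [c == 'x' for c in marked]

df['New'] = df['Answers'].where(mark_runs(df['all_answers'], 5), 0.0)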

Numpy read variable amount of columns from a text file into an array

My file is formatted like this:
2106 2002 27 26 1
1 0.000000 0.000000
2 0.389610 0.000000
3 0.779221 0.000000
4 1.168831 0.000000
5 1.558442 0.000000
6 1.948052 0.000000
7 2.337662 0.000000
8 2.727273 0.000000
9 3.116883 0.000000
10 3.506494 0.000000
I want to read these in. There are more rows than this, and some have only two columns. In MATLAB I use readmatrix() and it works well; does Python have anything comparable? genfromtxt() and loadtxt() do not work with a variable number of columns.
Should I just stick with MATLAB since Python seems to be missing key functionality like this?
Edit: Here is the output that I get in MATLAB and that I would like in NumPy:
2106 2002 27 26 1 0
1 0 0 0 0 0
2 0.389610000000000 0 0 0 0
3 0.779221000000000 0 0 0 0
4 1.16883100000000 0 0 0 0
5 1.55844200000000 0 0 0 0
6 1.94805200000000 0 0 0 0
7 2.33766200000000 0 0 0 0
8 2.72727300000000 0 0 0 0
9 3.11688300000000 0 0 0 0
10 3.50649400000000 0 0 0 0
import numpy as np

headers = []
rows = []
with open("test.txt", 'r') as file:
    for i, v in enumerate(file.readlines()):
        if i == 0:
            headers.extend(v.split())
        else:
            rows.append(v.split())

# pad short rows with zeros so every row has as many fields as the first line
for i, v in enumerate(rows):
    while len(v) != len(headers):
        v.append('0')
    rows[i] = v

rows = np.array(rows, dtype=float)
Let me know if any modifications are needed.
You have missing values in your columns, which MATLAB interprets as 0. You can read the same structure into pandas, and pandas will infer the right number of columns. It treats missing values as NaN, which you can later replace with 0 if you prefer. The only catch is that the first row must have the right number of columns: if a trailing value is missing there, write a 0 instead of leaving it blank.
import pandas as pd
df = pd.read_csv('file.csv', sep=r'\s+').fillna(0)
Output:
2106 2002 27 26 1 0
0 1 0.000000 0.0 0.0 0.0 0.0
1 2 0.389610 0.0 0.0 0.0 0.0
2 3 0.779221 0.0 0.0 0.0 0.0
3 4 1.168831 0.0 0.0 0.0 0.0
4 5 1.558442 0.0 0.0 0.0 0.0
5 6 1.948052 0.0 0.0 0.0 0.0
6 7 2.337662 0.0 0.0 0.0 0.0
7 8 2.727273 0.0 0.0 0.0 0.0
8 9 3.116883 0.0 0.0 0.0 0.0
9 10 3.506494 0.0 0.0 0.0 0.0
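If a plain NumPy array is wanted at the end (as with MATLAB's readmatrix), the DataFrame can simply be converted; a minimal sketch, assuming pandas 0.24+ for to_numpy(). Note that with this read the first line of the file becomes the column labels, so it is not part of the array:
import pandas as pd

df = pd.read_csv('file.csv', sep=r'\s+').fillna(0)
arr = df.to_numpy()          # 2-D float array of the data rows
print(arr.shape, arr.dtype)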

Forward fill missing values by group after condition is met in pandas

I'm having a bit of trouble with this. My dataframe looks like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 nan nan
1 nan nan
2 nan 0
2 50 0
2 20 1
2 nan nan
2 nan nan
So, what I need to do is: after dummy takes the value 1, fill the amount variable with zeros for each id, like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 0 nan
1 0 nan
2 nan 0
2 50 0
2 20 1
2 0 nan
2 0 nan
I'm guessing I'll need some combination of groupby('id'), fillna(method='ffill'), maybe a .loc or a shift() , but everything I tried has had some problem or is very slow. Any suggestions?
The way I would do it:
# flag rows from the first dummy == 1 onward within each id, then zero the amounts that are still NaN there
s = df.groupby('id')['dummy'].ffill().eq(1)
df.loc[s & df['dummy'].isna(), 'amount'] = 0
You can do this more easily:
data.loc[data['dummy'].isna(), 'amount'] = 0
This selects all the rows where dummy is NaN and fills the amount column with 0.
IIUC, ffill() and then mask the rows that are still NaN:
s = df.groupby('id')['amount'].ffill().notnull()
df.loc[df['amount'].isna() & s, 'amount'] = 0
Output:
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
Could you please try the following.
df.loc[df['dummy'].isnull(),'amount']=0
df
Output will be as follows.
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
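A compact variant of the first answer's idea (a sketch, not taken from the answers above) does the flagging and the fill in two lines with mask:
# within each id, flag everything from the first dummy == 1 onward,
# then zero out the amounts that are still NaN in the flagged rows
flag = df.groupby('id')['dummy'].ffill().eq(1)
df['amount'] = df['amount'].mask(flag & df['amount'].isna(), 0)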

better grouping of label frequency by month from dataframe

I have a dataframe with a date+time and a label, which I want to reshape into date (month) columns holding the label frequencies for that month:
date_time label
1 2017-09-26 17:08:00 0
3 2017-10-03 13:27:00 2
4 2017-10-04 19:04:00 0
11 2017-10-11 18:28:00 1
27 2017-10-13 11:22:00 0
28 2017-10-13 21:43:00 0
39 2017-10-16 14:43:00 0
40 2017-10-16 21:39:00 0
65 2017-10-21 21:53:00 2
...
98 2017-11-01 20:08:00 3
99 2017-11-02 12:00:00 3
100 2017-11-02 12:01:00 2
109 2017-11-02 12:03:00 3
110 2017-11-03 22:24:00 0
111 2017-11-04 09:05:00 3
112 2017-11-06 12:36:00 3
113 2017-11-06 12:48:00 2
128 2017-11-07 15:20:00 2
143 2017-11-10 16:36:00 3
144 2017-11-10 20:00:00 0
145 2017-11-10 20:02:00 0
I group the label frequency by month with this line (thanks partially to this post):
df2 = df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count()
which outputs
date_time label
2017-09-30 0 1
2017-10-31 0 6
1 1
2 8
3 2
2017-11-30 0 25
4 2
5 1
2 4
3 11
2017-12-31 0 14
5 3
2 5
3 7
2018-01-31 0 8
4 1
5 1
2 2
3 3
but, as mentioned before, I would like to get the data by month/date columns:
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
Currently I can sort of divide the data with
pd.concat([df2[m] for m in df2.index.levels[0]], axis=1).fillna(0)
but I lose the column names:
label label label label label
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
So I have to do a longer version where I generate a series, rename it, concatenate and then fill in the blanks:
m_list = []
for m in df2.index.levels[0]:
    m_labels = df2[m]
    m_labels = m_labels.rename(m)
    m_list.append(m_labels)
pd.concat(m_list, axis=1).fillna(0)
resulting in
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
Is there a shorter/more elegant way to get to this last dataframe from my original one?
You just need unstack here
df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count().unstack(0,fill_value=0)
Out[235]:
date_time 2017-09-30 2017-10-31 2017-11-30
label
0 1 5 3
1 0 1 0
2 0 2 3
3 0 0 6
Based on your groupby output:
s.unstack(0,fill_value=0)
Out[240]:
date_time 2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
label
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
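For the record, pd.crosstab reaches the same wide table in a single call; this is an alternative to the groupby/unstack above (a sketch assuming date_time is already a datetime64 column), with months shown as Period labels rather than month-end timestamps:
import pandas as pd

# rows = label, columns = month, values = counts; missing combinations become 0
out = pd.crosstab(df['label'], df['date_time'].dt.to_period('M'))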

Pandas multiIndex DataFrame sort

Let me just show my data:
In [14]: new_df
Out[14]:
action_type 1 2 3
user_id
0000110e00f7c85f550b329dc3d76210 31.0 4.0 0.0
00004931fe12d6f678f67e375b3806e3 8.0 4.0 0.0
0000c2b8660766ed74bafd48599255f0 0.0 2.0 0.0
0000d8d4ea411b05e0392be855fe9756 19.0 0.0 3.0
ffff18540a9567b455bd5645873e56d5 1.0 0.0 0.0
ffff3c8cf716efa3ae6d3ecfedb2270b 58.0 2.0 0.0
ffffa5fe57d2ef322061513bf60362ff 0.0 2.0 0.0
ffffce218e2b4af7729a4737b8702950 1.0 0.0 0.0
ffffd17a96348904fe49216ba3c7006f 1.0 0.0 0.0
[9 rows x 3 columns]
In [15]: new_df.columns
Out[15]: Int64Index([1, 2, 3], dtype='int64', name=u'action_type')
In [16]: new_df.index
Out[16]:
Index([u'0000110e00f7c85f550b329dc3d76210',
u'00004931fe12d6f678f67e375b3806e3',
...
u'ffffa5fe57d2ef322061513bf60362ff',
u'ffffce218e2b4af7729a4737b8702950',
u'ffffd17a96348904fe49216ba3c7006f'],
dtype='object', name=u'user_id', length=9)
The output that I want is:
# sort by the action_type value 1
action_type 1 2 3
user_id
ffff3c8cf716efa3ae6d3ecfedb2270b 58.0 2.0 0.0
0000110e00f7c85f550b329dc3d76210 31.0 4.0 0.0
0000d8d4ea411b05e0392be855fe9756 19.0 0.0 3.0
00004931fe12d6f678f67e375b3806e3 8.0 4.0 0.0
ffff18540a9567b455bd5645873e56d5 1.0 0.0 0.0
ffffce218e2b4af7729a4737b8702950 1.0 0.0 0.0
ffffd17a96348904fe49216ba3c7006f 1.0 0.0 0.0
0000c2b8660766ed74bafd48599255f0 0.0 2.0 0.0
ffffa5fe57d2ef322061513bf60362ff 0.0 2.0 0.0
[9 rows x 3 columns]
# sort by the action_type value 2
action_type 1 2 3
user_id
00004931fe12d6f678f67e375b3806e3 8.0 4.0 0.0
0000110e00f7c85f550b329dc3d76210 31.0 4.0 0.0
ffff3c8cf716efa3ae6d3ecfedb2270b 58.0 2.0 0.0
0000c2b8660766ed74bafd48599255f0 0.0 2.0 0.0
ffffa5fe57d2ef322061513bf60362ff 0.0 2.0 0.0
0000d8d4ea411b05e0392be855fe9756 19.0 0.0 3.0
ffff18540a9567b455bd5645873e56d5 1.0 0.0 0.0
ffffce218e2b4af7729a4737b8702950 1.0 0.0 0.0
ffffd17a96348904fe49216ba3c7006f 1.0 0.0 0.0
[9 rows x 3 columns]
So, what I want to do is sort the DataFrame by one action_type (1, 2 or 3), or by the sum of any combination of them (1+2, 1+3, 2+3, 1+2+3).
That is, the output should be sorted either by the value of a single action_type for each user, or by the sum of several action_types for each user (for example action_type 1 plus action_type 2, or any other combination).
For example:
for user id 0000110e00f7c85f550b329dc3d76210, the value of action_type 1 is 31.0, the value of action_type 2 is 4.0 and the value of action_type 3 is 0.0. The sum of action_type 1 and action_type 2 for this user is 31.0 + 4.0 = 35.0.
I have tried new_df.sortlevel(), but it seems it just sorted the dataframe by user_id, not by action_type (1, 2, 3).
How can I do it? Thank you!
UPDATE:
If you want to sort by column values, just use sort_values
df.sort_values(column_names)
Example:
In [173]: df
Out[173]:
1 2 3
0 6 3 8
1 0 8 0
2 3 8 0
3 5 2 7
4 1 2 1
sort descending by column 2
In [174]: df.sort_values(by=2, ascending=False)
Out[174]:
1 2 3
1 0 8 0
2 3 8 0
0 6 3 8
3 5 2 7
4 1 2 1
sort descending by sum of columns 2+3
In [177]: df.assign(sum=df.loc[:,[2,3]].sum(axis=1)).sort_values('sum', ascending=False)
Out[177]:
1 2 3 sum
0 6 3 8 11
3 5 2 7 9
1 0 8 0 8
2 3 8 0 8
4 1 2 1 3
OLD answer:
If I got you right, you can do it this way:
In [107]: df
Out[107]:
a b c
0 9 1 4
1 0 5 7
2 5 9 8
3 3 9 7
4 1 2 5
In [108]: df.assign(sum=df.sum(axis=1)).sort_values('sum', ascending=True)
Out[108]:
a b c sum
4 1 2 5 8
1 0 5 7 12
0 9 1 4 14
3 3 9 7 19
2 5 9 8 22
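Applied to the new_df from the question (a sketch assuming the integer column labels 1, 2, 3 shown above), the same calls give the requested orderings:
# sort descending by action_type 1
new_df.sort_values(by=1, ascending=False)

# sort descending by the sum of action_type 1 and action_type 2
(new_df.assign(total=new_df[[1, 2]].sum(axis=1))
       .sort_values('total', ascending=False)
       .drop(columns='total'))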
