Is it possible to multiply all the columns in a pandas DataFrame together to get a single value for every row in the DataFrame?
As an example, using
df = pd.DataFrame(np.random.randn(5,3)*10)
I want a new DataFrame df2 where df2.iloc[x, 0] will have the value of df.iloc[x, 0] * df.iloc[x, 1] * df.iloc[x, 2].
However, I do not want to hardcode this; how can I achieve it with a loop?
I found the method df.mul(series, axis=1) but can't figure out a way to use it for my purpose.
You could use DataFrame.prod():
>>> df = pd.DataFrame(np.random.randint(1, 10, (5, 3)))
>>> df
0 1 2
0 7 7 5
1 1 8 6
2 4 8 4
3 2 9 5
4 3 8 7
>>> df.prod(axis=1)
0 245
1 48
2 128
3 90
4 168
dtype: int64
You could also apply np.prod, which is what I'd originally done, but when a direct method is available it is usually faster.
>>> df = pd.DataFrame(np.random.randint(1, 10, (5, 3)))
>>> df
0 1 2
0 9 3 3
1 8 5 4
2 3 6 7
3 9 8 5
4 7 1 2
>>> df.apply(np.prod, axis=1)
0 81
1 160
2 126
3 360
4 14
dtype: int64
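As for the df.mul method from the question: it multiplies by another Series or DataFrame elementwise rather than reducing across columns, so prod is the better fit here. If you do want the explicit loop the question asks about, a minimal sketch (assuming numeric columns):
# Accumulate a running product, column by column;
# the result should match df.prod(axis=1).
result = df.iloc[:, 0].copy()
for col in df.columns[1:]:
    result = result * df[col]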
Related
given the following DF:
df = pd.DataFrame(data=np.random.randint(1,10,size=(10,4)),columns=list("abcd"),dtype=np.int64)
Let's say I want to update the first two columns with a list of two numpy arrays, each with a specific dtype (e.g. np.int8 and np.float32):
update_vals = [np.arange(1, 11, dtype=np.int8), np.ones(10, dtype=np.float32)]
I can do the following, which works:
df[["a", "b"]] = pd.DataFrame(dict(zip(list("ab"), update_vals)))
Expected column dtypes:
a: np.int8
b: np.float32
c, d: np.int64
Is there maybe a faster way to do this?
Update
Why not simply:
df['a'] = update_vals[0]
df['b'] = update_vals[1]
print(df.dtypes)
# Output:
a int8
b float32
c int64
d int64
dtype: object
Or:
for col, arr in zip(df.columns, update_vals):
    df[col] = arr
Use:
df[['a', 'b']] = np.array(update_vals).T
print(df)
# Output:
a b c d
0 1 1 1 2
1 2 1 5 1
2 3 1 4 8
3 4 1 6 3
4 5 1 3 4
5 6 1 8 2
6 7 1 3 1
7 8 1 8 7
8 9 1 4 1
9 10 1 3 6
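One caveat with the np.array(update_vals).T route: stacking an int8 array and a float32 array into a single numpy array promotes everything to one common dtype, so the per-column dtypes asked for in the question are not preserved. A quick check (a sketch; the exact resulting column dtypes can depend on the pandas version):
import numpy as np  # update_vals and df as defined in the question
arr = np.array(update_vals)
print(arr.dtype)     # one promoted dtype (float32 here), not int8 and float32
df[['a', 'b']] = arr.T
print(df.dtypes)     # 'a' and 'b' both take the promoted dtype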
I'm using pandas to create a dataframe by extracting a file located on an SFTP server.
What I'm trying to achieve here is to swap only the values between the columns PROCESS_FLAG and AD_WINDTOUCH_ID and leave the rest of the dataframe's columns and rows as they are. Please note I don't want to switch the column names, just the values. The only approach I found was reindexing, which doesn't work as expected. If any of you have run into this scenario, any help would be appreciated.
You can use tuple unpacking:
df.PROCESS_FLAG, df.AD_WINDTOUCH_ID = df.AD_WINDTOUCH_ID.copy(), df.PROCESS_FLAG.copy()
SAMPLE
Given a dataframe like this:
>>> df = pd.DataFrame({"A": np.random.randint(0,10,100), "B": np.random.randint(0,10,100)})
>>> df
Out[164]:
A B
0 1 9
1 4 7
2 3 8
3 6 8
4 5 0
.. .. ..
95 1 9
96 3 8
97 2 3
98 3 4
99 3 1
[100 rows x 2 columns]
>>> df.A, df.B = df.B.copy(), df.A.copy()
>>> df
Out[166]:
A B
0 9 1
1 7 4
2 8 3
3 8 6
4 0 5
.. .. ..
95 9 1
96 8 3
97 3 2
98 4 3
99 1 3
[100 rows x 2 columns]
To rename and reorder:
import pandas as pd
d = {'a': [1,2,3,4], 'b':[1,2,5,6]}
df = pd.DataFrame(d)
df.rename(columns={'a': 'b','b':'a'}, inplace=True)
df = df[['a','b']]
print(df)
a b
0 1 1
1 2 2
2 5 3
3 6 4
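Alternatively, a common idiom for swapping values without touching the column names is to assign through numpy, which bypasses pandas' label alignment (a sketch using the question's column names):
df[['PROCESS_FLAG', 'AD_WINDTOUCH_ID']] = (
    df[['AD_WINDTOUCH_ID', 'PROCESS_FLAG']].to_numpy()  # or .values
)
Without .to_numpy(), pandas would align the right-hand side on its column labels and the assignment would be a no-op.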
So this is more of a question than a problem I have.
I wanted to .append() some pandas Series together, and without thinking I just did total = series1 + series2 + series3.
The lengths of the series are 2,199,902, 171,175, and 178,989 respectively, and sum(pd.isnull(i) for i in total) = 2214596.
P.S. All 3 Series had no null values to start with. Is it something to do with merging 3 Series of different lengths that creates missing values? Even if that is the case, why are 2,214,596 null values created?
If you're trying to append Series, + is the wrong tool. The + operator calls .add, which adds the corresponding elements of the Series after aligning them by index. If your Series are not aligned, this generates a lot of NaNs.
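For example, a minimal sketch of that alignment behavior:
import pandas as pd
s1 = pd.Series([1, 2, 3])               # default index 0, 1, 2
s2 = pd.Series([10, 20], index=[5, 6])  # no overlapping labels
print(s1 + s2)                          # all NaN: no label appears in both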
If you're looking to append these together into one long series, you can use pd.concat:
pd.concat([s1, s2, s3], ignore_index=True)
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
If you're going to use append (note that Series.append was deprecated in pandas 1.4 and removed in 2.0, so prefer pd.concat on modern versions), you can do this in a loop, or with reduce:
s = s1
for i in [s2, s3]:
    s = s.append(i, ignore_index=True)
s
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
from functools import reduce
reduce(lambda x, y: x.append(y, ignore_index=True), [s1, s2, s3])
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
Both solutions generalise to multiple series quite nicely, but they are slow in comparison to pd.concat or np.concatenate.
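To put rough numbers on that, a benchmark sketch (timings vary by machine and pandas version; on pandas < 2.0 the append loop can be timed the same way for comparison):
import timeit
setup = (
    "import pandas as pd, numpy as np;"
    "parts = [pd.Series(np.random.randn(100_000)) for _ in range(3)]"
)
print(timeit.timeit("pd.concat(parts, ignore_index=True)", setup, number=100))
print(timeit.timeit(
    "pd.Series(np.concatenate([p.to_numpy() for p in parts]))",
    setup, number=100,
))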
When you sum Series, their indexes are aligned first. So if some index label exists in series1 but not in another Series, you get NaN at that position.
To avoid this, add with fill_value=0:
s = s1.add(s2, fill_value=0).add(s3, fill_value=0)
Sample:
s1 = pd.Series([1,2,4,5])
s2 = pd.Series([4,7], index=[10,11])
s3 = pd.Series([40,70], index=[2,4])
s = s1.add(s2, fill_value=0).add(s3, fill_value=0)
print (s)
0 1.0
1 2.0
2 44.0
3 5.0
4 70.0
10 4.0
11 7.0
dtype: float64
But if you need to append them together (or use concat, as mentioned by cᴏʟᴅsᴘᴇᴇᴅ):
s = s1.append(s2, ignore_index=True).append(s3, ignore_index=True)
print (s)
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
And a numpy alternative:
#alternative, thanks cᴏʟᴅsᴘᴇᴇᴅ - np.concatenate([s1, s2, s3])
s = pd.Series(np.concatenate([s1.values, s2.values, s3.values]))
print (s)
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
If you want to use + to append, you need to convert the Series to lists first:
s = pd.Series(s1.tolist() + s2.tolist() + s3.tolist())
print (s)
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'A': ['286a2', '17', '286a1', '373', '200b', '150'], 'B': range(6)})
A B
0 286a2 0
1 17 1
2 286a1 2
3 373 3
4 200b 4
5 150 5
which I want to sort according to A. When I do this using
df.sort_values(by='A')
I obtain
A B
5 150 5
1 17 1
4 200b 4
2 286a1 2
0 286a2 0
3 373 3
which is almost correct: I would like to have 17 before 150, but I don't know how to achieve this, as the entries are not plain numbers but strings consisting of numeric values and letters. Is there a way to do this?
EDIT
About the pattern of the entries:
Each entry always starts with a numeric value of arbitrary length, which can be followed by letters, which can in turn be followed by numeric values again.
You can replace the letters with ., cast the result to float, and sort via the index with sort_index:
df.index = df['A'].str.replace('[a-zA-Z]+', '.', regex=True).astype(float)
df = df.sort_index().reset_index(drop=True)
print (df)
A B
0 17 1
1 150 5
2 200b 4
3 286a1 2
4 286a2 0
5 373 3
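On pandas 1.1+, the same idea works without going through the index, by passing a key function to sort_values (a sketch):
df_sorted = df.sort_values(
    by='A',
    key=lambda s: s.str.replace('[a-zA-Z]+', '.', regex=True).astype(float),
).reset_index(drop=True)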
Another variant of jezrael's answer:
In [1706]: df.assign(
    A_=df.A.str.replace(r'[/\D]', '.', regex=True).astype(float)  # or '[a-zA-Z]+'
).sort_values(by='A_').drop('A_', axis=1)
Out[1706]:
A B
1 17 1
5 150 5
4 200b 4
2 286a1 2
0 286a2 0
3 373 3
Or you can try natsort:
from natsort import natsorted, ns
df.set_index('A').reindex(natsorted(df.A, key=lambda y: y.lower())).reset_index()
Out[395]:
A B
0 17 1
1 150 5
2 200b 4
3 286a1 2
4 286a2 0
5 373 3
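On pandas 1.1+, natsort's documented pandas integration is simpler still: natsort_keygen produces a key that plugs directly into sort_values (a sketch, assuming natsort >= 7.1):
from natsort import natsort_keygen
df_sorted = df.sort_values(by='A', key=natsort_keygen()).reset_index(drop=True)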
So I am filling dataframes from 2 different files. While those 2 files should have the same structure (though the values should differ), the resulting dataframes look different. When printing them I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second dataframe there is that "index" row that doesn't make any sense to me, and it causes trouble in my subsequent code. I neither wrote the code that fills the dataframes from the files nor created those files myself. So I am interested in how to check whether such a row exists and how I might remove it. Does anyone have an idea about this?
The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The index is just the first column: it numbers the rows by default, but you can change it in a number of ways (e.g. by filling it with values from one of the columns), as sketched below.
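For example, a minimal sketch of moving values between a column and the index:
import pandas as pd
df = pd.DataFrame({'key': ['x', 'y', 'z'], 'val': [1, 2, 3]})
df = df.set_index('key')    # the 'key' values now label the rows
df = df.reset_index()       # and back: the index becomes a column again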