How to remove ugly row in pandas.dataframe - python

so I am filling dataframes from 2 different files. While those 2 files should have the same structure (the values should be different thought) the resulting dataframes look different. So when printing those I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second dataframe there is that "index" row that doesnt make any sense for me and it causes troubles in my following code. I did neither write the code to fill the files into the dataframes nor I did create those files. So I am rather interested in checking if such a row exists and how I might be able to remove it. Does anyone have an idea about this?

The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14

Index is just the first column - it's numbering the rows by default, but you can change it a number of ways (e.g. filling it with values from one of the columns)

Related

How can I only swap the values of two columns and leave the rest as it is in a dataframe?

I'm using Pandas to create a dataframe by extracting a file which is located in an SFTP location. Sample df looks like below:
So what I'm trying to achieve here is to only swap the values between columns PROCESS_FLAG and AD_WINDTOUCH_ID and leave the rest of the dataframe cols and rows as it is. Please note I don't want to even switch the column names, just the values. I tried finding an approach and only came across reindexing which doesn't work as expected. If any of you'll have gone through this scenario, any help would be appreciated.
You can use unpacking,
df.PROCESS_FLAG, df.AD_WINDTOUCH_ID = df.AD_WINDTOUCH_ID.copy(), df.PROCESS_FLAG.copy()
SAMPLE
Given a dataframe like this:
>>> df = pd.DataFrame({"A": np.random.randint(0,10,100), "B": np.random.randint(0,10,100)})
>>> df
Out[164]:
A B
0 1 9
1 4 7
2 3 8
3 6 8
4 5 0
.. .. ..
95 1 9
96 3 8
97 2 3
98 3 4
99 3 1
[100 rows x 2 columns]
>>> df.A, df.B = df.B.copy(), df.A.copy()
>>> df
Out[166]:
A B
0 9 1
1 7 4
2 8 3
3 8 6
4 0 5
.. .. ..
95 9 1
96 8 3
97 3 2
98 4 3
99 1 3
[100 rows x 2 columns]
To rename and reorder:
import pandas as pd
d = {'a': [1,2,3,4], 'b':[1,2,5,6]}
df = pd.DataFrame(d)
df.rename(columns={'a': 'b','b':'a'}, inplace=True)
df = df[['a','b']]
print(df)
a b
0 1 1
1 2 2
2 5 3
3 6 4

Sort part of DataFrame in Python Panda, return new column with order depending on row values

My first question here, I hope this is understandable.
I have a Panda DataFrame:
order_numbers
x_closest_autobahn
0
34
3
1
11
3
2
5
3
3
8
12
4
2
12
I would like to get a new column with the order_number per closest_autobahn in ascending order:
order_numbers
x_closest_autobahn
order_number_autobahn_x
2
5
3
1
1
11
3
2
0
34
3
3
4
2
12
1
3
8
12
2
I have tried:
df['order_number_autobahn_x'] = ([df.loc[(df['x_closest_autobahn'] == 3)]].sort_values(by=['order_numbers'], ascending=True, inplace=True))
I have looked at slicing, sort_values and reset_index
df.sort_values(by=['order_numbers'], ascending=True, inplace=True)
df = df.reset_index() # reset index to the order after sort
df['order_numbers_index'] = df.index
but I can't seem to get the DataFrame I am looking for.
Use DataFrame.sort_values by both columns and for counter use GroupBy.cumcount:
df = df.sort_values(['x_closest_autobahn','order_numbers'])
df['order_number_autobahn_x'] = df.groupby('x_closest_autobahn').cumcount().add(1)
print (df)
order_numbers x_closest_autobahn order_number_autobahn_x
2 5 3 1
1 11 3 2
0 34 3 3
4 2 12 1
3 8 12 2

Why does pd.rolling and .apply() return multiple outputs from a function returning a single value?

I'm trying to create a rolling function that:
Divides two DataFrames with 3 columns in each df.
Calculate the mean of each row from the output in step 1.
Sums the averages from step 2.
This could be done by using pd.iterrows() hence looping through each row. However, this would be inefficient when working with larger datasets. Therefore, my objective is to create a pd.rolling function that could do this much faster.
What I would need help with is to understand why my approach below returns multiple values while the function I'm using only returns a single value.
EDIT : I have updated the question with the code that produces my desired output.
This is the test dataset I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
One method to achieve my desired output by looping through each row:
RunningSum = []
for index, rows in df1.iterrows():
if index > 3:
Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop="True").values)-1)*100))
Average = Div.mean(axis=0)
SumOfAverages = np.sum(Average)
RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[330.42328042328046,
212.0899470899471,
152.06349206349208,
205.55555555555554,
311.9047619047619,
209.1269841269841,
197.61904761904765,
116.94444444444444,
149.72222222222223,
430.0,
219.51058201058203,
215.34391534391537,
199.15343915343914,
159.6031746031746,
127.6984126984127,
326.85185185185185,
204.16666666666669]
However, this would be timely when working with large datasets. Therefore, I've tried to create a function which applies to a pd.rolling() object.
def SumOfAverageFunction(vals):
Div = df2 / vals.reset_index(drop="True")
Average = Div.mean(axis=0)
SumOfAverages = np.sum(Average)
return SumOfAverages
RunningSum = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
The problem here is that my function returns multiple output. How can I solve this?
print(RunningSum)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 3.214286 4.533333 2.277778
3 4.777778 3.200000 2.111111
4 5.888889 4.416667 1.656085
5 5.111111 5.400000 2.915344
6 3.455556 3.933333 5.714286
7 2.866667 2.066667 5.500000
8 2.977778 3.977778 3.063492
9 3.555556 5.622222 1.907937
10 2.750000 4.200000 1.747619
11 1.638889 2.377778 3.616667
12 2.986111 2.005556 5.500000
13 5.333333 3.075000 4.750000
14 4.396825 5.000000 3.055556
15 2.174603 3.888889 2.148148
16 2.111111 2.527778 1.418519
17 2.507937 3.500000 3.311111
18 2.880952 3.000000 5.366667
19 2.722222 3.370370 5.750000
20 2.138889 5.129630 5.666667
After reordering of operations, your calculations can be simplified
BASE = df2.sum(axis=0) /3
BASE_series = pd.Series({k: v for k, v in zip(df1.columns, BASE)})
result = df1.rdiv(BASE_series, axis=1).sum(axis=1)
print(np.around(result[4:], 3))
Outputs:
4 5.508
5 4.200
6 2.400
7 3.000
...
if you dont want to calculate anything before index 4 then change:
df1.iloc[4:].rdiv(...

Sort strings of mixed types and different lengths

I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'A': ['286a2', '17', '286a1', '373', '200b', '150'], 'B': range(6)})
A B
0 286a2 0
1 17 1
2 286a1 2
3 373 3
4 200b 4
5 150 5
which I want to sort according to A. When I do this using
df.sort_values(by='A')
I obtain
A B
5 150 5
1 17 1
4 200b 4
2 286a1 2
0 286a2 0
3 373 3
which is almost correct: I would like to have 17 before 150 but don't know how to do this as those entries are not just values but actual strings consisting of numerical values and letters. Is there a way to do this?
EDIT
About the pattern of the entries:
It is always a numeric value first of arbitrary length, then it can be followed by characters, which can be followed by numerical values again.
You can use replace characters to . with cast to float with sort_index:
df.index = df['A'].str.replace('[a-zA-Z]+','.').astype(float)
df = df.sort_index().reset_index(drop=True)
print (df)
A B
0 17 1
1 150 5
2 200b 4
3 286a1 2
4 286a2 0
5 373 3
Another variant to jezrael's
In [1706]: df.assign(
A_=df.A.str.replace('[/\D]', '.').astype(float) # or '[a-zA-Z]+'
).sort_values(by='A_').drop('A_', 1)
Out[1706]:
A B
1 17 1
5 150 5
4 200b 4
2 286a1 2
0 286a2 0
3 373 3
Or you can try , natsort
from natsort import natsorted, ns
df.set_index('A').reindex(natsorted(df.A, key=lambda y: y.lower())).reset_index()
Out[395]:
A B
0 17 1
1 150 5
2 200b 4
3 286a1 2
4 286a2 0
5 373 3

Multiply all columns in a Pandas dataframe together

Is it possible to multiply all the columns in a Pandas.DataFrame together to get a single value for every row in the DataFrame?
As an example, using
df = pd.DataFrame(np.random.randn(5,3)*10)
I want a new DataFrame df2 where df2.ix[x,0] will have the value of df.ix[x,0] * df.ix[x,1] * df.ix[x,2].
However I do not want to hardcode this, how can I use a loop to achieve this?
I found a function df.mul(series, axis=1) but cant figure out a way to use this for my purpose.
You could use DataFrame.prod():
>>> df = pd.DataFrame(np.random.randint(1, 10, (5, 3)))
>>> df
0 1 2
0 7 7 5
1 1 8 6
2 4 8 4
3 2 9 5
4 3 8 7
>>> df.prod(axis=1)
0 245
1 48
2 128
3 90
4 168
dtype: int64
You could also apply np.prod, which is what I'd originally done, but usually when available the direct methods are faster.
>>> df = pd.DataFrame(np.random.randint(1, 10, (5, 3)))
>>> df
0 1 2
0 9 3 3
1 8 5 4
2 3 6 7
3 9 8 5
4 7 1 2
>>> df.apply(np.prod, axis=1)
0 81
1 160
2 126
3 360
4 14
dtype: int64

Categories

Resources