I have a pandas DataFrame such as the following:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(20, 2))
I would like to remove from it the rows with index contained in the list:
list_ = [0,4,5,6,18]
How would I go about that?
Use drop:
df = df.drop(list_)
print(df)
0 1
1 0.311202 0.435528
2 0.225144 0.322580
3 0.165907 0.096357
7 0.925991 0.362254
8 0.777936 0.552340
9 0.445319 0.672854
10 0.919434 0.642704
11 0.751429 0.220296
12 0.425344 0.706789
13 0.708401 0.497214
14 0.110249 0.910790
15 0.972444 0.817364
16 0.108596 0.921172
17 0.299082 0.433137
19 0.234337 0.163187
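If some labels in list_ might be absent from the index, drop raises a KeyError by default; errors='ignore' makes it skip them:
df = df.drop(list_, errors='ignore')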
This will do it:
remove = df.index.isin(list_)
df[~remove]
Or just:
df[~df.index.isin(list_)]
In df, A and B are label-encoded categories, all belonging to a certain subset (typ).
These categories should now be decoded again into metric data taken from templates:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':   [0,1,2,3,0,1,2,3,0,2,2,2,3,3,2,3,1,1],
                   'B':   [2,3,1,1,1,3,2,2,0,2,2,2,3,3,3,3,2,1],
                   'typ': [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]})
A and B should be decoded to metric (float) data from the templates pivot_A and pivot_B, respectively. In the templates, the headers are the values to replace, the indices are the conditions to match, and the values are the new values:
pivot_A = pd.DataFrame(np.random.rand(9, 4),
                       columns=np.unique(df.A),
                       index=np.unique(df.typ))
pivot_B = pd.DataFrame(np.random.rand(9, 4),
                       columns=np.unique(df.B),
                       index=np.unique(df.typ))
pivot_B looks like:
In [5]: pivot_B
Out[5]:
0 1 2 3
type
1 0.326687 0.851405 0.830255 0.721817
2 0.496182 0.769574 0.083379 0.491332
3 0.442760 0.786503 0.593361 0.470658
4 0.100724 0.455841 0.485407 0.211383
5 0.989424 0.852057 0.530137 0.385900
6 0.413897 0.915375 0.708038 0.846020
7 0.548033 0.670561 0.900648 0.742418
8 0.077552 0.310529 0.156794 0.076186
9 0.463480 0.377749 0.876133 0.518022
pivot_A looks like:
In [6]: pivot_A
Out[6]:
0 1 2 3
type
1 0.012808 0.128041 0.001279 0.320740
2 0.615976 0.736491 0.879216 0.842910
3 0.298637 0.828012 0.962703 0.736827
4 0.700053 0.115463 0.670091 0.638931
5 0.416262 0.633604 0.504292 0.983946
6 0.956872 0.129720 0.611625 0.682046
7 0.414579 0.062104 0.118168 0.265530
8 0.162742 0.952069 0.112400 0.837696
9 0.123151 0.061040 0.326437 0.380834
Explained usage of the pivots (pseudocode):
if df.typ == pivot_A.index and df.A == X:
    df.A = pivot_A.loc[typ][X]
Decoding could be done by:
for categorie in [i for i in df.columns if i != 'typ']:
    for col in np.unique(df[categorie]):
        for type_ in np.unique(df.typ):
            df.loc[(df['typ'] == type_) & (df[categorie] == col), categorie] = \
                locals()['pivot_{}'.format(categorie)].loc[type_, col]
and results in:
In [7]: df
Out[7]:
A B typ
0 0.012808 0.830255 1
1 0.736491 0.491332 2
2 0.962703 0.786503 3
3 0.638931 0.455841 4
4 0.416262 0.852057 5
5 0.129720 0.846020 6
6 0.118168 0.900648 7
7 0.837696 0.156794 8
8 0.123151 0.463480 9
9 0.001279 0.830255 1
10 0.879216 0.083379 2
11 0.962703 0.593361 3
12 0.638931 0.211383 4
13 0.983946 0.385900 5
14 0.611625 0.846020 6
15 0.265530 0.742418 7
16 0.952069 0.156794 8
17 0.061040 0.377749 9
BUT this triple looping seems NOT to be the best way of doing it, right?!
How can I improve the code? pd.replace or dictionaries seem reasonable... but I cannot figure out how to handle the extra typ condition.
Melting the triple-loop process down to a single loop reduces the runtime a lot:
old_values = list(pivot_A.columns)  # from the template
new_values_df = pd.DataFrame()      # collects the decoded values without overwriting the old values
for typ_ in pivot_A.index:          # match the condition (the correct typ in each loop separately)
    new_values = list(pivot_A.loc[typ_])
    new_values_df = pd.concat([df[df['typ'] == typ_]['A']
                               .replace(old_values, new_values)
                               .to_frame('A'),
                               new_values_df])
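A fully vectorized variant is also possible; a sketch, assuming (as above) that every typ value appears in the pivot index and every category value appears in the pivot columns:
for cat, pivot in [('A', pivot_A), ('B', pivot_B)]:
    rows = pivot.index.get_indexer(df['typ'])   # position of each row's typ in the pivot index
    cols = pivot.columns.get_indexer(df[cat])   # position of each row's category value in the pivot columns
    df[cat] = pivot.to_numpy()[rows, cols]      # fancy-index the template in one shot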
I have generated a dataframe containing all the possible two-way combinations of electrocardiogram (ECG) leads with itertools, using the code below:
source = [ 'I-s', 'II-s', 'III-s', 'aVR-s', 'aVL-s', 'aVF-s', 'V1-s', 'V2-s', 'V3-s', 'V4-s', 'V5-s', 'V6-s', 'V1Long-s', 'IILong-s', 'V5Long-s', 'Information-s' ]
target = [ 'I-t', 'II-t', 'III-t', 'aVR-t', 'aVL-t', 'aVF-t', 'V1-t', 'V2-t', 'V3-t', 'V4-t', 'V5-t', 'V6-t', 'V1Long-t', 'IILong-t', 'V5Long-t', 'Information-t' ]
import pandas as pd
from itertools import product

test = pd.DataFrame(list(product(source, target)), columns=['source', 'target'])
The test dataframe contains 256 rows, one for each possible combination.
The value for each combination is initialized to zero as follows:
test['value'] = 0
The test df looks like this:
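  source target  value
0    I-s    I-t      0
1    I-s   II-t      0
2    I-s  III-t      0
3    I-s  aVR-t      0
4    I-s  aVL-t      0
(only the first five of the 256 rows are shown)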
I have another dataframe called diagramDF that contains the combinations whose value column is non-zero. diagramDF is significantly smaller than the test dataframe:
source target value
0 I-s II-t 137
1 II-s I-t 3
2 II-s III-t 81
3 II-s IILong-t 13
4 II-s V1-t 21
5 III-s II-t 3
6 III-s aVF-t 19
7 IILong-s II-t 13
8 IILong-s V1Long-t 353
9 V1-s aVL-t 11
10 V1Long-s IILong-t 175
11 V1Long-s V3-t 4
12 V1Long-s aVF-t 4
13 V2-s V3-t 8
14 V3-s V2-t 6
15 V3-s V6-t 2
16 V5-s aVR-t 5
17 V6-s III-t 4
18 aVF-s III-t 79
19 aVF-s V1Long-t 235
20 aVL-s I-t 1
21 aVL-s aVF-t 16
22 aVR-s aVL-t 1
Note that the source and target columns of both dataframes use the same notation.
I have tried to replace the zero values of the test dataframe with the nonzero values of the diagramDF using merge like below:
df = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
However, I get an error informing me that:
ValueError: The column label 'source' is not unique. For a multi-index, the label must be a tuple with elements corresponding to each level
Is there something that I am getting wrong? Is there a more efficient and fast way to do this?
Might help: merge on the two key columns only (pandas does not allow combining on= with right_index=True/left_index=True in the same call):
pd.merge(test, diagramDF, how='left', on=['source', 'target'])
Check this:
test = test.reset_index()
diagramDF = diagramDF.reset_index()
new = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
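After either left merge, both value columns are kept (value_x from test and value_y from diagramDF under the default suffixes) and rows absent from diagramDF hold NaN; a short follow-up sketch to build the final column:
new['value'] = new['value_y'].fillna(new['value_x']).astype(int)
new = new.drop(columns=['value_x', 'value_y'])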
Overview
How do you populate a pandas DataFrame using math that uses the column and row indices as variables?
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(5),
                  columns=['Combo_Class0', 'Combo_Class1', 'Combo_Class2',
                           'Combo_Class3', 'Combo_Class4'])
Objective
Each cell in df = row index * (column index + 2), counting both rows and columns from 1.
Attempt 1
You can use this solution to produce the following code:
row = 0
for i in range(5):
    row = row + 1
    df.loc[i] = [row * (1 + 2), row * (2 + 2), row * (3 + 2), row * (4 + 2), row * (5 + 2)]
Attempt 2
This solution seemed relevant as well, although I believe I've read that you're not supposed to loop through dataframes. Besides, I'm not seeing how to loop through rows and columns:
for i, j in df.iterrows():
    df.loc[i] = i
You can leverage broadcasting for a more efficient approach:
ix = (df.index + 1).to_numpy()  # use .values on pandas < 0.24
df[:] = ix[:, None] * (ix + 2)
print(df)
Combo_Class0 Combo_Class1 Combo_Class2 Combo_Class3 Combo_Class4
0 3 4 5 6 7
1 6 8 10 12 14
2 9 12 15 18 21
3 12 16 20 24 28
4 15 20 25 30 35
Using multiply.outer:
df[:] = np.multiply.outer(np.arange(5) + 1, np.arange(5) + 3)
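Equivalently, np.outer computes the same table of products; the two ranges encode the 1-based row index and the column index plus 2:
df[:] = np.outer(np.arange(1, 6), np.arange(3, 8))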
I know that there is a method .argmax() that returns the indexes of the maximum values across an axis.
But what if we want to get the indexes of the 10 highest values across an axis?
How could this be accomplished?
E.g.:
data = pd.DataFrame(np.random.random_sample((50, 40)))
You can use argsort: the positions at the end of the argsort result belong to the largest values.
s = pd.Series(np.random.permutation(30))
top_10 = s.argsort().iloc[-10:]  # positions of the 10 largest values, in ascending order of value
print(top_10.to_numpy())
For a random permutation of 0..29, this prints the positions at which the values 20 through 29 are stored.
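Applied to the 50x40 example frame from the question, the same idea works per column; a sketch using plain positional indices:
idx = np.argsort(data.to_numpy(), axis=0)[-10:]  # row positions of the 10 largest values in each column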
IIUC, if you want to get the indexes of the 10 largest numbers in column col:
data[col].nlargest(10).index
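To collect these for every column of the example frame (data and col are the names from the question; top10 is a hypothetical variable), a small sketch building a dict of index lists:
top10 = {col: data[col].nlargest(10).index.tolist() for col in data.columns}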
Give this a try. This will take the 10 largest values in each row and put them into a dataframe. Note that it returns the values themselves, not their indexes:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random_sample((50, 40)))
df2 = pd.DataFrame(np.sort(df.values)[:,-10:])
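If you want the positions rather than the values, the same pattern works with argsort; a sketch along the rows (idx_per_row is a hypothetical variable):
idx_per_row = pd.DataFrame(np.argsort(df.values, axis=1)[:, -10:])  # column positions of the 10 largest values in each row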
I am new to Pandas and I am trying to get the longest string in every row of a DataFrame.
import pandas as pd
import sqlite3

conn = sqlite3.connect('authors.db')  # hypothetical path to the SQLite database
authors = pd.read_sql('select * from authors', conn)
authors['name']
...
12 KRISHNAN RAJALAKSHMI
13 J O
14 TSIPE
15 NURRIZA
16 HATICE OZEL
17 D ROMERO
18 LLIBERTAT
19 E F
20 JASMEET KAUR
...
What I expect is to get back the longest string in each authors['name'] row:
...
12 RAJALAKSHMI
13 J
14 TSIPE
15 NURRIZA
16 HATICE
17 ROMERO
18 LLIBERTAT
19 E
20 JASMEET
...
I tried to split the string by spaces and apply(max) but it's not working. It seems that pandas is not applying max to each row.
authors['name'].str.split().apply(max)
# or
authors['name'].str.split().apply(lambda x: max(x))
# or
def get_max(x):
    y = max(x)
    print(y)  # y is the biggest string in each row
    return y

authors['name'].str.split().apply(get_max)
# Still results in:
...
12 KRISHNAN RAJALAKSHMI
13 J O
14 TSIPE
15 NURRIZA
16 HATICE OZEL
17 D ROMERO
18 LLIBERTAT
19 E F
20 JASMEET KAUR
...
When you tell pandas to apply max to the split series, it compares the strings alphabetically; it doesn't know that it should be maximizing by length. You might instead try something like
authors['name'].apply(lambda x: max(x.split(), key=len))
For each row, this will create a list of the substrings and return the longest string, using the string length as the key.
Also note that authors['name'].str.split().max() does not work either, since Series.max() compares whole values and accepts no key function, so it cannot select the maximum-length string of each split row.
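To keep the result, assign it back (the column name 'name' is from the question; 'longest' is a hypothetical new column):
authors['longest'] = authors['name'].apply(lambda x: max(x.split(), key=len))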
You are not replacing its values...
Try this function:
def getName(df):
    df['name'] = df['name'].apply(lambda x: max(x.split(), key=len))
And then you just have to call:
getName(authors)
Note that this reassigns each value of df['name'] in place.
Output:
name
0 RAJALAKSHMI
1 J
2 TSIPE
3 NURRIZA
4 HATICE
5 ROMERO
6 LLIBERTAT
7 E
8 JASMEET
The main problem in your code is that you weren't reassigning the values in each row.