I am new to Pandas and I am trying to get the biggest string for every row in a DataFrame.
import pandas as pd
import sqlite3
authors = pd.read_sql('select * from authors')
authors['name']
...
12 KRISHNAN RAJALAKSHMI
13 J O
14 TSIPE
15 NURRIZA
16 HATICE OZEL
17 D ROMERO
18 LLIBERTAT
19 E F
20 JASMEET KAUR
...
What I expect is to get back the biggest string in each authors['name'] row:
...
12 RAJALAKSHMI
13 J
14 TSIPE
15 NURRIZA
16 HATICE
17 ROMERO
18 LLIBERTAT
19 E
20 JASMEET
...
I tried to split the string by spaces and apply(max) but it's not working. It seems that pandas is not applying max to each row.
authors['name'].str.split().apply(max)
# or
authors['name'].str.split().apply(lambda x: max(x))
# or
def get_max(x):
    y = max(x)
    print(y)  # y is the biggest string in each row
    return y
authors['name'].str.split().apply(get_max)
# Still results in:
...
12 KRISHNAN RAJALAKSHMI
13 J O
14 TSIPE
15 NURRIZA
16 HATICE OZEL
17 D ROMERO
18 LLIBERTAT
19 E F
20 JASMEET KAUR
...
When you apply max to the split series without a key, Python compares the substrings alphabetically, so it is not maximizing by length. You might instead try something like
authors['name'].apply(lambda x: max(x.split(), key=len))
For each row, this builds a list of the substrings and returns the longest one, using the string length as the key.
Also note that authors['name'].apply(lambda x: max(x.split())) runs without key=len, but it returns the alphabetically last substring rather than the longest one. authors['name'].str.split().max() does not work either, since max() here is a pandas Series method that reduces the whole column to a single value instead of picking a substring from each split row.
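As a rough illustration of the difference between the two calls, here is a minimal sketch that uses a handful of the names from the question. The small authors frame is built by hand here as an assumption, since the original data comes from a database.

```python
import pandas as pd

# Hand-built stand-in for the authors table (assumption: a 'name' column of strings).
authors = pd.DataFrame({
    'name': ['KRISHNAN RAJALAKSHMI', 'J O', 'TSIPE', 'HATICE OZEL', 'D ROMERO']
})

# Longest substring per row; on ties, max() keeps the first one it sees.
longest = authors['name'].apply(lambda x: max(x.split(), key=len))

# Without key=len, max() compares the substrings alphabetically instead.
lexical = authors['name'].apply(lambda x: max(x.split()))

print(pd.DataFrame({'longest': longest, 'lexical': lexical}))
```

The two columns differ for 'J O' and 'HATICE OZEL', which is why key=len matters when the goal is the longest word.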
You are not replacing its values...
Try this function:
def getName(df):
    df[0] = df[0].apply(lambda x: max(x.split(), key=len))
And then you just have to call:
getName(authors)
Note that I reassign each value of df[0] in this code.
Output:
names
0 RAJALAKSHMI
1 O
2 TSIPE
3 NURRIZA
4 HATICE
5 ROMERO
6 LLIBERTAT
7 F
8 JASMEET
The main problem in your code is that you weren't reassigning the values in each row.
I have generated a dataframe containing all possible pairwise combinations of electrocardiogram (ECG) leads with itertools, using the code below
source = [ 'I-s', 'II-s', 'III-s', 'aVR-s', 'aVL-s', 'aVF-s', 'V1-s', 'V2-s', 'V3-s', 'V4-s', 'V5-s', 'V6-s', 'V1Long-s', 'IILong-s', 'V5Long-s', 'Information-s' ]
target = [ 'I-t', 'II-t', 'III-t', 'aVR-t', 'aVL-t', 'aVF-t', 'V1-t', 'V2-t', 'V3-t', 'V4-t', 'V5-t', 'V6-t', 'V1Long-t', 'IILong-t', 'V5Long-t', 'Information-t' ]
from itertools import product
test = pd.DataFrame(list(product(source, target)), columns=['source', 'target'])
The test dataframe contains 256 rows, one for each possible source/target pair.
The value for each combination is zero as follows
test['value'] = 0
The test df looks like this:
I have another dataframe called diagramDF that contains the combinations where the value column is non-zero. The diagramDF is significantly smaller than the test dataframe.
source target value
0 I-s II-t 137
1 II-s I-t 3
2 II-s III-t 81
3 II-s IILong-t 13
4 II-s V1-t 21
5 III-s II-t 3
6 III-s aVF-t 19
7 IILong-s II-t 13
8 IILong-s V1Long-t 353
9 V1-s aVL-t 11
10 V1Long-s IILong-t 175
11 V1Long-s V3-t 4
12 V1Long-s aVF-t 4
13 V2-s V3-t 8
14 V3-s V2-t 6
15 V3-s V6-t 2
16 V5-s aVR-t 5
17 V6-s III-t 4
18 aVF-s III-t 79
19 aVF-s V1Long-t 235
20 aVL-s I-t 1
21 aVL-s aVF-t 16
22 aVR-s aVL-t 1
Note that the first two columns, source and target, use the same notation in both dataframes.
I have tried to replace the zero values of the test dataframe with the nonzero values of the diagramDF using merge like below:
df = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
However, I get an error informing me that:
ValueError: The column label 'source' is not unique. For a
multi-index, the label must be a tuple with elements corresponding to
each level
Is there something that I am getting wrong? Is there a more efficient and fast way to do this?
Might help,
pd.merge(test, diagramDF, how='left', on=['source', 'target'],right_index=True,left_index=True)
Check this:
test = test.reset_index()
diagramDF = diagramDF.reset_index()
new = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
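For reference, here is a minimal sketch of the fill-in step once both frames have source and target as ordinary columns. It assumes a default integer index on both frames and uses a shortened lead list and two rows of diagramDF purely for illustration.

```python
import pandas as pd
from itertools import product

# Shortened lead lists (assumption: the real lists have 16 entries each).
source = ['I-s', 'II-s', 'III-s']
target = ['I-t', 'II-t', 'III-t']
test = pd.DataFrame(list(product(source, target)), columns=['source', 'target'])
test['value'] = 0

# Two of the non-zero rows shown in the question.
diagramDF = pd.DataFrame({
    'source': ['I-s', 'II-s'],
    'target': ['II-t', 'I-t'],
    'value': [137, 3],
})

# Merge only the key columns of test, so 'value' comes from diagramDF alone,
# then put 0 back wherever diagramDF had no matching row.
merged = test[['source', 'target']].merge(diagramDF, how='left', on=['source', 'target'])
merged['value'] = merged['value'].fillna(0).astype(int)

print(merged)
```

Dropping the all-zero value column from test before merging avoids ending up with value_x and value_y columns.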
I know that there is a method .argmax() that returns the indexes of the maximum values across an axis.
But what if we want to get the indexes of the 10 highest values across an axis?
How could this be accomplished?
E.g.:
data = pd.DataFrame(np.random.random_sample((50, 40)))
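You can use argsort, which returns the positions that would sort the values in ascending order, so the last 10 entries are the positions of the 10 largest values:
s = pd.Series(np.random.permutation(30))
sorted_indices = s.argsort()
top_10 = sorted_indices.iloc[-10:]
print(top_10)
The printed positions change from run to run because the permutation is random.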
IIUC, say, if you want to get the index of the top 10 largest numbers of column col:
data[col].nlargest(10).index
Give this a try. This will take the 10 largest values across a row and put them into a dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random_sample((50, 40)))
df2 = pd.DataFrame(np.sort(df.values)[:,-10:])
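Since the question asks for the indexes rather than the values, a possible sketch with np.argsort along axis 1 is shown below. The seeded random generator and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is repeatable (assumption for illustration).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((50, 40)))

values = data.to_numpy()

# argsort sorts ascending, so reverse each row and keep the first 10 positions:
# these are the column positions of the 10 largest values, largest first.
top10_idx = np.argsort(values, axis=1)[:, ::-1][:, :10]

# The matching values, gathered row by row.
top10_vals = np.take_along_axis(values, top10_idx, axis=1)

print(top10_idx[0])
print(top10_vals[0])
```

For a single column, data[col].nlargest(10).index gives the row labels directly, as in the answer above.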
I have a pandas dataframe such as the following:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(20,2))
I would like to remove from it the rows with index contained in the list:
list_ = [0,4,5,6,18]
How would I go about that?
Use drop:
df = df.drop(list_)
print(df)
0 1
1 0.311202 0.435528
2 0.225144 0.322580
3 0.165907 0.096357
7 0.925991 0.362254
8 0.777936 0.552340
9 0.445319 0.672854
10 0.919434 0.642704
11 0.751429 0.220296
12 0.425344 0.706789
13 0.708401 0.497214
14 0.110249 0.910790
15 0.972444 0.817364
16 0.108596 0.921172
17 0.299082 0.433137
19 0.234337 0.163187
This will do it:
remove = df.index.isin(list_)
df[~remove]
Or just:
df[~df.index.isin(list_)]
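The two approaches give the same result when every label in the list exists. A small sketch of how they compare, including the case of a label that is not in the index (the extra label 99 is made up for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 2))
list_ = [0, 4, 5, 6, 18]

# Both remove the rows whose index label is in list_.
dropped = df.drop(list_)
masked = df[~df.index.isin(list_)]

# drop() raises KeyError for a label that is not present, unless told to ignore it;
# the isin() mask never raises. The label 99 is hypothetical.
tolerant = df.drop(list_ + [99], errors='ignore')

print(dropped.index.tolist())
```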
I am working with a pandas dataframe. I am trying to create a new column, data['Labels'], which contains labels determined by the change in value between row i and row i+n in column data['diff'], for the entire length of the dataframe.
I imagine something like the following, but it raises errors:
for i in range(len(data['diff'])-1):
    data.loc[data['diff'][i] >= data['diff'][i+n], 'Labels'] = 'A'
    data.loc[data['diff'][i] < data['diff'][i+n], 'Labels'] = 'B'
example output:
index diff label
9 117.32 B
10 108.32 A
11 125.36 A
12 127.36 A
13 139.28 A
14 141.22 A
15 147.89 A
16 153.89 B
17 153.89 B
18 156.87 B
19 168.84 B
20 161.04 B
21 172.04 B
24 175.16 B
22 164.04 B
23 164.16 B
27 175.16 B
25 149.16 A
If I understand correctly, you want to set the label to A if there is a higher value in any of the following rows.
You can get the maximum value of the remaining rows with cummax. However, you need to reverse the row order first, otherwise cummax will return the maximum of the preceding rows. You can do that with .iloc[::-1]:
df['following_max'] = df['diff'].iloc[::-1].cummax().iloc[::-1]
Then apply the label A whenever the following max is greater than the current value:
df['Labels'] = np.where(df['diff'] < df['following_max'], 'A', 'B')
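To see what the two lines do, here is a small self-contained sketch on four made-up values. It follows the interpretation in this answer (a higher value anywhere in the following rows), not a comparison with row i+n.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({'diff': [3, 1, 4, 2]})

# Reverse, take the running maximum, reverse back:
# each row now holds the maximum of itself and all following rows.
df['following_max'] = df['diff'].iloc[::-1].cummax().iloc[::-1]

# 'A' where some later row is higher than the current one, otherwise 'B'.
df['Labels'] = np.where(df['diff'] < df['following_max'], 'A', 'B')

print(df)
```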
df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
df
A mask values
0 11 0 10
1 11 0 15
2 22 0 20
3 22 1 25
Now how can I group by A, keep the column names intact, and put the result of a custom function into a new column Z:
def calculate_df_stats(dfs):
    mask_ = list(dfs['mask'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean
df['Z'] = df.groupby('A').agg(calculate_df_stats) # does not work
and generate:
A mask values Z
0 11 0 10 12.5
1 22 0 20 25
Whatever I do it only replaces values column with the masked mean.
Also, can your solution be applied to a function that uses two columns and returns its result in a new column?
Thanks!
Edit:
To clarify more: let's say I have such a table in MySQL:
SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;
which gives me such result:
http://pastebin.com/qXiaWcJq
If I run now this:
SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
group by `reader_plate_ID`;
I get:
datapoint_ID plate_ID coordinate_x coordinate_y res_value wavelength ignore avg(action_value)
193 1 0 0 2.1783 560 NULL 2.090027083333334
481 2 0 0 1.7544 560 NULL 1.4695583333333333
769 3 0 0 2.0161 560 NULL 1.6637885416666673
How can I replicate this behaviour in Pandas? Note that all the column names stay the same, the first value of each group is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated dataframe. You will have to aggregate your original columns in some way; I took the first occurring value as an example:
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>>
>>> grouped = df.groupby("A")
>>>
>>> result = grouped.agg('first')
>>> result
mask values
A
11 0 10
22 0 20
and then add a column 'Z' to that result by applying your function on the groupby result 'grouped':
>>> def calculate_df_stats(dfs):
...     mask_ = list(dfs['mask'])
...     mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
...     return mean
...
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>>
>>> result
mask values Z
A
11 0 10 12.5
22 0 20 20.0
In your function definition you can always use more columns (just by their name) to return the result.
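Put together as a script, the approach above might look like the sketch below. Selecting the mask and values columns before apply, and the final reset_index to turn A back into a regular column, are additions made here for illustration rather than part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [11, 11, 22, 22],
                   'mask': [0, 0, 0, 1],
                   'values': np.arange(10, 30, 5)})

def calculate_df_stats(dfs):
    # Mean of 'values', skipping the rows where 'mask' is non-zero.
    masked = np.ma.array(dfs['values'].to_numpy(), mask=dfs['mask'].to_numpy())
    return masked.mean()

grouped = df.groupby('A')

# Keep the first row of each group for the original columns.
result = grouped.agg('first')

# The function sees one sub-frame per group, so it can use several columns at once.
result['Z'] = grouped[['mask', 'values']].apply(calculate_df_stats)

# Turn the group key back into an ordinary column.
result = result.reset_index()
print(result)
```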