I have a column in a pandas dataframe with street names such as
88 SØNDRE VEI 54
89 UTSIKTVEIEN 20B
92 KAARE MOURSUNDS VEG 14 A
94 OKSVALVEIEN 19
96 SLEMDALSVINGEN 33A
97 GAMLESTRØMSVEIEN 59
100 JONAS LIES VEI 68 A
What I want is to get separate columns for the street name, street number and street letter. Is there a way to split the street names into three columns using pd.apply and join?
Thanks!
Edit: 20B should be split into the separate values 20 and B.
IIUC, you can use this regex:
df[1].str.extract(r'(\D+)\s+(\d+)\s?(.*)')
Output:
                     0   1  2
0           SØNDRE VEI  54
1          UTSIKTVEIEN  20  B
2  KAARE MOURSUNDS VEG  14  A
3          OKSVALVEIEN  19
4       SLEMDALSVINGEN  33  A
5     GAMLESTRØMSVEIEN  59
6       JONAS LIES VEI  68  A
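A self-contained sketch of the same regex, for anyone who wants labelled columns. The column name `address` and the group names `street`, `number` and `letter` are assumptions made for illustration, not names from the question. Named groups make `str.extract` label the output columns directly.

```python
import pandas as pd

# Sample rows copied from the question; the column name "address" is an assumption.
df = pd.DataFrame({"address": [
    "SØNDRE VEI 54",
    "UTSIKTVEIEN 20B",
    "KAARE MOURSUNDS VEG 14 A",
]})

# Same pattern as above, with named groups so the result columns get labels.
pattern = r"(?P<street>\D+)\s+(?P<number>\d+)\s?(?P<letter>.*)"
parts = df["address"].str.extract(pattern)

# The extracted number is a string; convert it if a numeric column is needed.
parts["number"] = parts["number"].astype(int)
```

Rows without a letter get an empty string in `letter` rather than NaN, because the last group is allowed to match nothing.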
I have a DataFrame like this:
student marks term
steve 55 1
jordan 66 2
steve 53 1
alan 74 2
jordan 99 1
steve 81 2
alan 78 1
alan 76 2
jordan 48 1
I would like to return the highest two scores for each student:
student marks term
steve 81 2
steve 55 1
jordan 99 1
jordan 66 2
alan 78 1
alan 76 2
I have tried
df = df.groupby('student')['marks'].max()
but it returns only one row per student. I would like each student, in the order they are first mentioned, with their top two scores.
You could use groupby + nlargest to find the 2 largest values; then use loc to sort in the order they appear in df:
out = (df.groupby('student')['marks'].nlargest(2)
.droplevel(1)
.loc[df['student'].drop_duplicates()]
.reset_index())
Output:
student marks
0 steve 81
1 steve 55
2 jordan 99
3 jordan 66
4 alan 78
5 alan 76
If you want to keep the "term" column as well, you could use the index:
idx = df.groupby('student')['marks'].nlargest(2).index.get_level_values(1)
out = df.loc[idx].set_index('student').loc[df['student'].drop_duplicates()].reset_index()
Output:
student marks term
0 steve 81 2
1 steve 55 1
2 jordan 99 1
3 jordan 66 2
4 alan 78 1
5 alan 76 2
@sammywemmy suggested a better way to derive the second result:
out = (df.loc[df.groupby('student', sort=False)['marks'].nlargest(2)
.index.get_level_values(1)]
.reset_index(drop=True))
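A runnable sketch of this last approach, with the sample data from the question rebuilt by hand. The intermediate name `top` is introduced here for readability and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["steve", "jordan", "steve", "alan", "jordan",
                "steve", "alan", "alan", "jordan"],
    "marks": [55, 66, 53, 74, 99, 81, 78, 76, 48],
    "term": [1, 2, 1, 2, 1, 2, 1, 2, 1],
})

# nlargest on a grouped Series returns a MultiIndex of
# (student, original row label); sort=False keeps first-appearance order.
top = df.groupby("student", sort=False)["marks"].nlargest(2)

# Level 1 holds the original row labels, which select the full rows.
out = df.loc[top.index.get_level_values(1)].reset_index(drop=True)
```

Because the full rows are selected by label, every other column (here `term`) comes along without a second lookup.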
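You should use:
df = df.groupby(['student', 'term'])['marks'].max()
(with an optional .reset_index()). Note that this returns the highest mark per student per term, which equals the top two marks only when a student's two best marks fall in different terms, as they do in the sample data.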
Sorting before grouping should suffice, since you need to keep the term column:
df.sort_values('marks').groupby('student', sort=False).tail(2)
student marks term
0 steve 55 1
1 jordan 66 2
7 alan 76 2
6 alan 78 1
5 steve 81 2
4 jordan 99 1
I have the following dataframe:
df=pd.DataFrame({'Name':['JOHN','ALLEN','BOB','NIKI','CHARLIE','CHANG'],
'Age':[35,42,63,29,47,51],
'Salary_in_1000':[100,93,78,120,64,115],
'FT_Team':['STEELERS','SEAHAWKS','FALCONS','FALCONS','PATRIOTS','STEELERS']})
df output:
- Name Age Salary_in_1000 FT_Team
0 JOHN 35 100 STEELERS
1 ALLEN 42 93 SEAHAWKS
2 BOB 63 78 FALCONS
3 NIKI 29 120 FALCONS
4 CHARLIE 47 64 PATRIOTS
5 CHANG 51 115 STEELERS
And my dataframe that I am trying to complete:
df1=pd.DataFrame({'Name':['JOHN','ALLEN','BOB','NIKI','CHARLIE','CHANG'],
'Age':[35,42,63,29,47,51],})
df1 output:
- Name Age
0 JOHN 35
1 ALLEN 42
2 BOB 63
3 NIKI 29
4 CHARLIE 47
5 CHANG 51
I would like to add a new column to df1 that references ['FT_Team'] from df based upon 'Name' and 'Age' in df1.
I believe that the new code should look something like a .map; however, I am completely stumped as to what the arguments would be when matching on more than one column.
df1['FT_Team'] =
final output:
- Name Age FT_Team
0 JOHN 35 STEELERS
1 ALLEN 42 SEAHAWKS
2 BOB 63 FALCONS
3 NIKI 29 FALCONS
4 CHARLIE 47 PATRIOTS
5 CHANG 51 STEELERS
Ultimately, I would like to match the football team from df based upon Name AND Age in df1
Per Quang Hoang:
df1=df1.merge(df[['Name','Age','FT_Team']], on=['Name','Age'], how='left')
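A self-contained sketch of that merge using the frames from the question. The `validate="many_to_one"` argument is an addition not in the original answer: it makes pandas raise if a (Name, Age) pair appears more than once in the lookup frame, which would otherwise duplicate rows silently.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["JOHN", "ALLEN", "BOB", "NIKI", "CHARLIE", "CHANG"],
    "Age": [35, 42, 63, 29, 47, 51],
    "Salary_in_1000": [100, 93, 78, 120, 64, 115],
    "FT_Team": ["STEELERS", "SEAHAWKS", "FALCONS", "FALCONS", "PATRIOTS", "STEELERS"],
})
df1 = pd.DataFrame({
    "Name": ["JOHN", "ALLEN", "BOB", "NIKI", "CHARLIE", "CHANG"],
    "Age": [35, 42, 63, 29, 47, 51],
})

# A left merge keeps every row of df1 and looks up FT_Team by the (Name, Age) pair.
df1 = df1.merge(
    df[["Name", "Age", "FT_Team"]],
    on=["Name", "Age"],
    how="left",
    validate="many_to_one",  # added safeguard: lookup keys must be unique
)
```

Rows of df1 with no matching (Name, Age) pair in df would get NaN in FT_Team.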
I wonder how to extract only the rows where a column is not null and store them in a variable.
My code is
data_result = df[df['english'].isnull().sum()==0]
But an error occurred. How do I fix it?
Dataframe:
name math english
0 John 90 nan
1 Ann 85 84
2 Brown 77 nan
3 Eva 92 93
4 Anita 91 90
5 Jimmy 75 69
Result
name math english
1 Ann 85 84
3 Eva 92 93
4 Anita 91 90
5 Jimmy 75 69
Try this:
data_result = df[df['english'].notnull()]
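To see why the original attempt fails and this one works, here is a small sketch with the data from the question. `dropna(subset=...)` is shown as an equivalent alternative; the names `mask` and `same_result` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Ann", "Brown", "Eva", "Anita", "Jimmy"],
    "math": [90, 85, 77, 92, 91, 75],
    "english": [np.nan, 84, np.nan, 93, 90, 69],
})

# isnull().sum() == 0 collapses the whole column to a single True/False,
# which pandas then tries to look up as a column label.
# notnull() instead yields one boolean per row, which works as a row mask.
mask = df["english"].notnull()
data_result = df[mask]

# dropna restricted to one column selects the same rows.
same_result = df.dropna(subset=["english"])
```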
I'm working on a text mining problem and using pandas for text processing. From the following example, I need to pick only those rows which have the max span (end - start) within the same category (cat).
Given this dataframe:
name start end cat
0 coumadin 0 8 DRUG
1 albuterol 18 27 DRUG
2 albuterol sulfate 18 35 DRUG
3 sulfate 28 35 DRUG
4 2.5 36 39 STRENGTH
5 2.5 mg 36 42 STRENGTH
6 2.5 mg /3 ml 36 48 STRENGTH
7 0.083 50 55 STRENGTH
8 0.083 % 50 57 STRENGTH
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
10 solution 59 67 FORM
11 solution for nebulization 59 84 FORM
12 nebulization 72 84 ROUTE
13 one (1) 90 97 FREQUENCY
14 neb 98 101 ROUTE
15 neb inhalation 98 112 ROUTE
16 inhalation 102 112 ROUTE
17 q4h 113 116 FREQUENCY
18 every 118 123 FREQUENCY
19 every 4 hours 118 131 FREQUENCY
20 q4h (every 4 hours) 113 132 FREQUENCY
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
22 as needed 133 142 FREQUENCY
23 dyspnea 147 154 REASON
I need to get the following:
name start end cat
0 coumadin 0 8 DRUG
2 albuterol sulfate 18 35 DRUG
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
11 solution for nebulization 59 84 FORM
13 one (1) 90 97 FREQUENCY
15 neb inhalation 98 112 ROUTE
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
23 dyspnea 147 154 REASON
What I tried is to group by the category and then compute the max difference (end - start). However, I got stuck on how to find the max span for the same entity within the category. I guess it should not be very tricky.
COMMENT
Thank you all for suggestions, but I need ALL possible entities within each category. For example, in DRUG, there are two relevant drugs: coumadin and albuterol sulfate, and some fractions of them (albuterol and sulfate). I need to remove only (albuterol and sulfate) while keeping coumadin and albuterol sulfate. The same logic for other categories.
For example, rows 4-8 are all bits of a complete row 9, thus I need to keep only row 9. Rows 1 and 3 are parts of the row 2, thus I need to keep row 2 (in addition to row 0). Etc.
Obviously, all constituents ('bits') are within the max range, but the problem is to find the max (or unifying) range of the same entity and its constituents.
COMMENT 2
A possible solution could be to find all overlapping intervals within the same category cat and pick the largest. I'm trying to implement this, but no luck so far.
Possible Solution
I sorted the rows by start in ascending order and by end in descending order:
df.sort_values(by=[1,2], ascending=[True, False])
0 1 2 3
0 coumadin 0 8 DRUG
2 albuterol sulfate 18 35 DRUG
1 albuterol 18 27 DRUG
3 sulfate 28 35 DRUG
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
6 2.5 mg /3 ml 36 48 STRENGTH
5 2.5 mg 36 42 STRENGTH
4 2.5 36 39 STRENGTH
8 0.083 % 50 57 STRENGTH
7 0.083 50 55 STRENGTH
11 solution for nebulization 59 84 FORM
10 solution 59 67 FORM
12 nebulization 72 84 ROUTE
13 one (1) 90 97 FREQUENCY
15 neb inhalation 98 112 ROUTE
14 neb 98 101 ROUTE
16 inhalation 102 112 ROUTE
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
20 q4h (every 4 hours) 113 132 FREQUENCY
17 q4h 113 116 FREQUENCY
19 every 4 hours 118 131 FREQUENCY
18 every 118 123 FREQUENCY
22 as needed 133 142 FREQUENCY
23 dyspnea 147 154 REASON
This puts the relevant row first in each group; however, I still need to filter out the irrelevant rows.
I have tried this on a sample of your df.
Create a sample df:
import pandas as pd
Name = ['coumadin','albuterol','albuterol sulfate','sulfate']
Cat = ['D', 'D', 'D', 'D']
Start = [0, 18, 18, 28]
End = [8, 27, 33,35]
ID = [1,2,3,4]
df = pd.DataFrame(data=list(zip(ID, Name, Start, End, Cat)),
                  columns=['ID', 'Name', 'Start', 'End', 'Cat'])
Define a function that finds the rows whose Name contains a given name:
def matcher(x):
res = df.loc[df['Name'].str.contains(x, regex=False, case=False), 'ID']
return ','.join(res.astype(str))
Apply this function to each value of the Name column:
# Matches lists the IDs of all rows whose Name contains this row's Name;
# a row with a single ID is not part of any longer name.
df['Matches'] = df['Name'].apply(matcher)
ID Name Start End Cat Matches
0 1 coumadin 0 8 D 1
1 2 albuterol 18 27 D 2,3
2 3 albuterol sulfate 18 33 D 3
3 4 sulfate 28 35 D 3,4
Count the number of matching rows:
df['Count'] = df.Matches.apply(lambda x: len(x.split(',')))
Keep only the rows where "Count" is 1, since these are the rows that contain the other rows:
df = df[df.Count == 1]
ID Name Start End Cat Matches Count
0 1 coumadin 0 8 D 1 1
2 3 albuterol sulfate 18 33 D 3 1
You can then remove unnecessary columns :)
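As a different technique from the name matching above, the filtering can also be sketched on the spans themselves: drop every row whose (start, end) interval lies inside a strictly longer interval. This is only a sketch under two assumptions. First, containment is checked regardless of `cat`, because the expected output in the question drops row 12 (ROUTE), which sits inside row 11 (FORM). Second, the helper name `is_contained` and the six-row sample are made up for illustration.

```python
import pandas as pd

# A six-row subset of the data from the question.
df = pd.DataFrame({
    "name": ["coumadin", "albuterol", "albuterol sulfate", "sulfate",
             "solution for nebulization", "nebulization"],
    "start": [0, 18, 18, 28, 59, 72],
    "end": [8, 27, 35, 35, 84, 84],
    "cat": ["DRUG", "DRUG", "DRUG", "DRUG", "FORM", "ROUTE"],
})

def is_contained(start, end):
    """Return True if some strictly longer span in df covers [start, end]."""
    covers = (df["start"] <= start) & (df["end"] >= end)
    longer = (df["end"] - df["start"]) > (end - start)
    return bool((covers & longer).any())

# One boolean per row; keep the rows that no longer span covers.
mask = [is_contained(s, e) for s, e in zip(df["start"], df["end"])]
result = df[[not m for m in mask]]
```

This compares every row against every other row, so it is quadratic in the number of rows. Rows with identical spans are all kept, since neither is strictly longer than the other.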
I have the data below in a DataFrame.
city age
mumbai 12 33 5 55
delhi 24 56 78 23 43 55 67
kal 12 43 55 78 34
 mumbai 14 56 78 99 # has a leading space
MUMbai 34 59 # has capital letters
kal 11
I want to convert it into the format below:
city age
mumbai 12 33 5 55 14 56 78 99 34 59
delhi 24 56 78 23 43 55 67
kal 12 43 55 78 34 11
How can I achieve this?
Note:
I have edited the data: some city names are now in capital letters and some have leading spaces. How can we apply the strip() and lower() functions to them?
We use groupby with sort=False to ensure we present cities in the same order they first appear.
We use ' '.join to concatenate the strings together.
Lastly, we reset_index to get the city values that have been placed in the index into the dataframe proper.
df.groupby('city', sort=False).age.apply(' '.join).reset_index()
city age
0 mumbai 12 33 5 55 14 56 78 99 34 59
1 delhi 24 56 78 23 43 55 67
2 kal 12 43 55 78 34 11
Response to Edit
df.age.str.strip().groupby(
df.city.str.strip().str.lower(),
sort=False
).apply(' '.join).reset_index()
city age
0 mumbai 12 33 5 55 14 56 78 99 34 59
1 delhi 24 56 78 23 43 55 67
2 kal 12 43 55 78 34 11
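A self-contained sketch of the edited case, with the sample frame rebuilt by hand. It assumes `age` holds space-separated strings, as the ' '.join call above implies, and the intermediate name `key` is introduced here for readability.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["mumbai", "delhi", "kal", " mumbai", "MUMbai", "kal"],
    "age": ["12 33 5 55", "24 56 78 23 43 55 67", "12 43 55 78 34",
            "14 56 78 99", "34 59", "11"],
})

# Normalise the grouping key without modifying the original column.
key = df["city"].str.strip().str.lower()

# Group the age strings by the cleaned key, in order of first appearance,
# and join each group's strings with a single space.
out = df["age"].str.strip().groupby(key, sort=False).apply(" ".join).reset_index()
```

Grouping by a separate Series leaves `df["city"]` untouched, and the result column still gets the name `city` because the key Series keeps that name.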