Check if column contains (/,-,_, *or~) and split in another column - Pandas - python

I have a column with numbers and one of these characters between them -,/,*,~,_. I need to check if values contain any of the characters, then split the value in another column. Is there a different solution than shown below? In the end, columns subnumber1, subnumber2 ...subnumber5 will be merged in one column and column "number5" will be without characters. Those two columns I need to use in further process. I'm a newbie in Python so any advice is welcome.
if gdf['column_name'].str.contains('~').any():
gdf[['number1', 'subnumber1']] = gdf['column_name'].str.split('~', expand=True)
gdf
if gdf['column_name'].str.contains('^').any():
gdf[['number2', 'subnumber2']] = gdf['column_name'].str.split('^', expand=True)
gdf
Input column:
column_name
152/6*3
163/1-6
145/1
163/6^3
output:
number5 |subnumber1 |subnumber2
152 | 6 | 3
163 | 1 | 6
145 | 1 |
163 | 6 | 3

Use Series.str.split with list of possible separators and create new DataFrame:
import re
L = ['-','/','*','~','_','^', '.']
#some values like `^.` are escape
pat = '|'.join(re.escape(x) for x in L)
df = df['column_name'].str.split(pat, expand=True).add_prefix('num')
print (df)
num0 num1 num2
0 152 6 3
1 163 1 6
2 145 1 None
3 163 6 3
EDIT: If need match values before value use:
L = ["\-_",'\^|\*','~','/']
for val in L:
df[f'before {val}'] = df['column_name'].str.extract(rf'(\d+){[val]}')
#for last value not exist separator, so match $ for end of string
df['last'] = df['column_name'].str.extract(rf'(\d+)$')
print (df)
column_name before \-_ before \^|\* before ~ before / last
0 152/2~3_4*5 3 4 2 152 5
1 152/2~3-4^5 4 4 2 152 5
2 152/6*3 NaN 6 NaN 152 3
3 163/1-6 NaN NaN NaN 163 6
4 145/1 NaN NaN NaN 145 1
5 163/6^3 6 6 NaN 163 3

Use str.split:
df['column_name'].str.split(r'[*,-/^_]', expand=True)
output:
0 1 2
0 152 6 3
1 163 1 6
2 145 1 None
3 163 6 3
Or, if you know in advance that you have 3 numbers, use str.extract and named capturing groups:
regex = '(?P<number5>\d+)\D*(?P<subnumber1>\d*)\D*(?P<subnumber2>\d*)'
df['column_name'].str.extract(regex)
output:
number5 subnumber1 subnumber2
0 152 6 3
1 163 1 6
2 145 1
3 163 6 3

Related

pandas: how to check that a certain value in a column repeats maximum once in each group (after groupby)

I have a pandas DataFrame which I want to group by column A, and check that a certain value ('test') in group B does not repeat more than once in each group.
Is there a pandas native way to do the following:
1 - find the groups where 'test' appears in column B more than once ?
2 - delete the additional occurrences (keep the one with the min value in column C).
example:
A B C
0 1 test 342
1 1 t 4556
2 1 te 222
3 1 test 56456
4 2 t 234525
5 2 te 123
6 2 test 23434
7 3 test 777
8 3 tes 665
if I groupby 'A', I get that 'test' appears twice in A==1, which is the case I would like to deal with.
Solution for remove duplicated test values by columns A,B - keep first value per group:
df = df[df.B.ne('test') | ~df.duplicated(['A','B'])]
print (df)
A B C
0 1 test 342
1 1 t 4556
2 1 te 222
4 2 t 234525
5 2 te 123
6 2 test 23434
7 3 test 777
8 3 tes 665
EDIT: If need minimal C matched test in B and need all possible duplicated minimal C values compare by GroupBy.transform with replace C to NaN in Series.mask:
m = df.B.ne('test')
df = df[m | ~df.C.mask(m).groupby(df['A']).transform('min').ne(df['C'])]
But if need only first duplicated test value use DataFrameGroupBy.idxmin with filtered DataFrame:
m = df.B.ne('test')
m1 = df.index.isin(df[~m].groupby('A')['C'].idxmin())
df = df[m | m1]
Difference of solutions:
print (df)
A B C
-2 1 test 342
-1 1 test 342
0 1 test 342
1 1 t 4556
2 1 te 222
3 1 test 56456
4 2 t 234525
5 2 te 123
6 2 test 23434
7 3 test 777
8 3 tes 665
m = df.B.ne('test')
df1 = df[m | ~df.C.mask(m).groupby(df['A']).transform('min').ne(df['C'])]
print (df1)
A B C
-2 1 test 342
-1 1 test 342
0 1 test 342
1 1 t 4556
2 1 te 222
4 2 t 234525
5 2 te 123
6 2 test 23434
7 3 test 777
8 3 tes 665
m = df.B.ne('test')
m1 = df.index.isin(df[~m].groupby('A')['C'].idxmin())
df2 = df[m | m1]
print (df2)
A B C
-2 1 test 342
1 1 t 4556
2 1 te 222
4 2 t 234525
5 2 te 123
6 2 test 23434
7 3 test 777
8 3 tes 665

how to split a column based on a character and append the rest of columns with each split

Consider I have a dataframe:
a = [['A','def',2,3],['B|C','xyz|abc',56,3],['X|Y|Z','uiu|oi|kji',65,34],['K','rsq',98,12]]
df1 = pd.DataFrame(a, columns=['1', '2','3','4'])
df1
1 2 3 4
0 A def 2 3
1 B|C xyz|abc 56 3
2 X|Y|Z uiu|oi|kji 65 34
3 K rsq 98 12
First, how do I print all the rows that has "|" in column 1? I am trying the following but it prints all rows of the frame:
df1[df1[1].str.contains("|")]
Second, how do I split the column 1 and column 2 on "|", so that each split in column 1 gets its corresponding split from column 2 and the rest of the data is appended to each split.
For example, I want something like this from df1:
1 2 3 4
0 A def 2 3
1 B xyz 56 3
2 C abc 56 3
3 X uiu 65 34
4 Y oi 65 34
5 Z kji 65 34
6 K rsq 98 12
You can use custom lambda function with Series.str.split and Series.explode for columns specified in list and then add all another columns in DataFrame.join:
splitter = ['1','2']
cols = df1.columns.difference(splitter)
f = lambda x: x.str.split('|').explode()
df1 = df1[splitter].apply(f).join(df1[cols]).reset_index(drop=True)
print (df1)
1 2 3 4
0 A def 2 3
1 B xyz 56 3
2 C abc 56 3
3 X uiu 65 34
4 Y oi 65 34
5 Z kji 65 34
6 K rsq 98 12
For filter by | what is special regex character or add regex=False to Series.str.contains:
print(df1[df1[1].str.contains("|" regex=False)])
Or escape it by \|:
print(df1[df1[1].str.contains("\|")])

Fill the numbers between two columns in Pandas data frame

I have a Pandas dataframe with below columns:
id start end
1 101 101
2 102 104
3 108 109
I want to fill the gaps between start and end with additional rows, so the output may look like this:
id number
1 101
2 102
2 103
2 104
3 108
3 109
Is there anyway to do it in Pandas? Thanks.
Use nested list comprehension with range and flattening for list of tuples, last use DataFrame constructor:
zipped = zip(df['id'], df['start'], df['end'])
df = pd.DataFrame([(i, y) for i, s, e in zipped for y in range(s, e+1)],
columns=['id','number'])
print (df)
id number
0 1 101
1 2 102
2 2 103
3 2 104
4 3 108
5 3 109
Here is a pure pandas solution but performance-wise, #jaezrael's solution would be better,
df.set_index('id').apply(lambda x: pd.Series(np.arange(x.start, x.end + 1)), axis = 1)\
.stack().astype(int).reset_index()\
.drop('level_1', 1)\
.rename(columns = {0:'Number'})
id Number
0 1 101
1 2 102
2 2 103
3 2 104
4 3 108
5 3 109

Sort strings of mixed types and different lengths

I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'A': ['286a2', '17', '286a1', '373', '200b', '150'], 'B': range(6)})
A B
0 286a2 0
1 17 1
2 286a1 2
3 373 3
4 200b 4
5 150 5
which I want to sort according to A. When I do this using
df.sort_values(by='A')
I obtain
A B
5 150 5
1 17 1
4 200b 4
2 286a1 2
0 286a2 0
3 373 3
which is almost correct: I would like to have 17 before 150 but don't know how to do this as those entries are not just values but actual strings consisting of numerical values and letters. Is there a way to do this?
EDIT
About the pattern of the entries:
It is always a numeric value first of arbitrary length, then it can be followed by characters, which can be followed by numerical values again.
You can use replace characters to . with cast to float with sort_index:
df.index = df['A'].str.replace('[a-zA-Z]+','.').astype(float)
df = df.sort_index().reset_index(drop=True)
print (df)
A B
0 17 1
1 150 5
2 200b 4
3 286a1 2
4 286a2 0
5 373 3
Another variant to jezrael's
In [1706]: df.assign(
A_=df.A.str.replace('[/\D]', '.').astype(float) # or '[a-zA-Z]+'
).sort_values(by='A_').drop('A_', 1)
Out[1706]:
A B
1 17 1
5 150 5
4 200b 4
2 286a1 2
0 286a2 0
3 373 3
Or you can try , natsort
from natsort import natsorted, ns
df.set_index('A').reindex(natsorted(df.A, key=lambda y: y.lower())).reset_index()
Out[395]:
A B
0 17 1
1 150 5
2 200b 4
3 286a1 2
4 286a2 0
5 373 3

Pandas: Add new column based on comparison of two DFs

I have 2 dataframes that I am wanting to compare one to the other and add a 'True/False' to a new column in the first based on the comparison.
My data resembles:
DF1:
cat sub-cat low high
3 3 1 208 223
4 3 1 224 350
8 4 1 223 244
9 4 1 245 350
13 5 1 232 252
14 5 1 253 350
DF2:
Cat Sub-Cat Rating
0 5 1 246
1 5 2 239
2 8 1 203
3 8 2 218
4 K 1 149
5 K 2 165
6 K 1 171
7 K 2 185
8 K 1 157
9 K 2 171
Desired result would be for DF2 to have an additional column with a True or False depending on if, based on the cat and sub-cat, that the rating is between the low.min() and high.max() or Null if no matches found to compare.
Have been running rounds with this for far too long with no results to speak of.
Thank you in advance for any assistance.
Update:
First row would look something like:
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
As it falls within the min low and the max high.
Example: There are two rows in DF1 for cat = 5 and sub-cat = 2. I need to get the minimum low and the maximum high from those 2 rows and then check if the rating from row 0 in DF2 falls within the minimum low and maximum high from the two matching rows in DF1
join post groupby.agg
d2 = DF2.join(
DF1.groupby(
['cat', 'sub-cat']
).agg(dict(low='min', high='max')),
on=['Cat', 'Sub-Cat']
)
d2
Cat Sub-Cat Rating high low
0 5 1 246 350.0 232.0
1 5 2 239 NaN NaN
2 8 1 203 NaN NaN
3 8 2 218 NaN NaN
4 K 1 149 NaN NaN
5 K 2 165 NaN NaN
6 K 1 171 NaN NaN
7 K 2 185 NaN NaN
8 K 1 157 NaN NaN
9 K 2 171 NaN NaN
assign with .loc
DF2.loc[d2.eval('low <= Rating <= high'), 'In-Spec'] = True
DF2
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
1 5 2 239 NaN
2 8 1 203 NaN
3 8 2 218 NaN
4 K 1 149 NaN
5 K 2 165 NaN
6 K 1 171 NaN
7 K 2 185 NaN
8 K 1 157 NaN
9 K 2 171 NaN
To add a new column based on a boolean expression would involve something along the lines of:
temp = boolean code involving inequality
df2['new column name'] = temp
However I'm not sure I understand, the first row in your DF2 table for instance, has a rating of 246, which means it's true for row 13 of DF1, but false for row 14. What would you like it to return?
You can do it like this
df2['In-Spec'] = 'False'
df2['In-Spec'][(df2['Rating'] > df1['low']) & (df2['Rating'] < df1['high'])] = 'True'
But which rows should be compared with each others? Do you want them to compare by their index or by their cat & subcat names?

Categories

Resources