This is my data, which contains numbers and strings.
df2 = pd.DataFrame({'A': ['1,008$','4,000$','6,000$','10,00$','8,00$','45€','45€']})
df2 = pd.DataFrame(df2, columns = ['A'])
vv=df2[df2['A'].str.match('$')]
I want an output like this.
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
but I am getting this output:
Out[144]:
Empty DataFrame
Columns: [A]
Index: []
Can anyone please help me?
A somewhat verbose way using NumPy's defchararray module; I always like to give it some attention.
# Using #cᴏʟᴅsᴘᴇᴇᴅ's suggestion
# Same function as below but shorter namespace path
df2[np.char.find(df2.A.values.astype(str), '$') >= 0]
Old Answer
from numpy.core.defchararray import find
df2[find(df2.A.values.astype(str), '$') >= 0]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
str.match starts matching from the beginning of the string; however, your $ can only be found at the end.
The fix requires either a modification to your pattern or a change of function.
Option 1
str.match with a modified pattern (so \$ is matched at the end) -
df2[df2.A.str.match(r'.*\$$')]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
If you want to be specific about what is matched, you can match only digits and commas -
df2[df2.A.str.match(r'[\d,]+\$$')]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
Note that this does not validate the entries: any run of digits and commas terminated by $ is matched, with no check that it forms a well-formed number.
Option 2
str.contains with an end-anchored pattern -
df2[df2.A.str.contains(r'\$$')]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
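Beyond the regex options above, a plain suffix check with str.endswith avoids escaping entirely. A sketch using the question's data (not one of the answers above, just an alternative):

```python
import pandas as pd

df2 = pd.DataFrame({'A': ['1,008$', '4,000$', '6,000$', '10,00$', '8,00$', '45€', '45€']})

# str.endswith does a literal (non-regex) suffix test, so '$' needs no escaping
out = df2[df2['A'].str.endswith('$')]
print(out)
```

This keeps only the dollar-denominated rows 0 through 4.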
I have a dataframe like this.
print(df)
  ID ... Control
0 PDF-1 ... NaN
1 PDF-3 ... NaN
2 PDF-4 ... NaN
I want to get only the number from the ID column, so the result will be:
1
3
4
How do I get part of the strings in a dataframe column?
How about just replacing the common PDF- prefix?
df['ID'].str.replace('PDF-', '')
Could you please try the following.
df['ID'].replace(regex=True, to_replace=r'([^\d])', value=r'')
One could refer to the documentation for df.replace.
Basically this uses a regex to remove everything apart from digits in the column named ID: \d denotes a digit, and the negated class [^\d] matches everything that is not a digit.
Another possibility using regex is:
df.ID.str.extract(r'(\d+)')
This avoids changing the original data just to extract the integers.
So for the following simple example:
import pandas as pd
df = pd.DataFrame({'ID':['PDF-1','PDF-2','PDF-3','PDF-4','PDF-5']})
print(df.ID.str.extract(r'(\d+)'))
print(df)
we get the following:
0
0 1
1 2
2 3
3 4
4 5
ID
0 PDF-1
1 PDF-2
2 PDF-3
3 PDF-4
4 PDF-5
Find "PDF-" and replace it with nothing:
df['ID'] = df['ID'].str.replace('PDF-', '')
Then, to print it the way you asked, I'd convert the column to a string with no index:
print(df['ID'].to_string(index=False))
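If the goal is numeric values rather than strings, the digits extracted by the answers above can be cast to integers afterwards. A sketch combining those answers on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['PDF-1', 'PDF-3', 'PDF-4']})

# str.extract returns a DataFrame; column 0 holds the captured digits as strings
nums = df['ID'].str.extract(r'(\d+)')[0].astype(int)
print(nums.tolist())  # [1, 3, 4]
```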
df =
A B
1 5
2 6)
(3 7
4 8
To remove parentheses I did:
df.A = df.A.str.replace(r"\(.*\)","")
But nothing changes. I have checked a lot of replies here, but still get the same result.
I would appreciate help removing the parentheses from the whole data set, or at least from one column.
to remove parentheses from the whole data set
With regex character class [...] :
In [15]: df.apply(lambda s: s.str.replace(r'[()]', '', regex=True))
Out[15]:
A B
0 1 5
1 2 6
2 3 7
3 4 8
Or, more concisely, the same with df.replace(r'[()]', '', regex=True).
If you want regex, you can use the character class r"[()]" instead of alternation groups, as long as you only need to replace one character at a time.
df.A = df.A.str.replace(r"[()]", "", regex=True)
I find it easier to read and alter if needed.
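Putting the answers together on the question's data, a minimal end-to-end sketch of the whole-dataframe replace:

```python
import pandas as pd

# The question's data, with stray parentheses in column A
df = pd.DataFrame({'A': ['1', '2)', '(3', '4'], 'B': ['5', '6', '7', '8']})

# regex=True makes replace treat the pattern as a regular expression;
# the character class [()] matches either parenthesis
cleaned = df.replace(r'[()]', '', regex=True)
print(cleaned)
```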
I have a dataframe test with a column category containing a complex pattern of words, characters and digits. I need to extract the hyphen-separated words that come before the first hyphen followed by digits into a new column sub_category.
I'm not a regex expert and spent too much time fighting it. So will appreciate your help!
test = pd.DataFrame({
'id': ['1','2','3','4'],
'category': ['worda-wordb-1234.ds.er89.',
'worda-4567.we.77-ty','wordc-wordd-5698/de/','wordc-2356/rt/']
})
Desired output:
id category sub_category
0 1 worda-wordb-1234.ds.er89. worda-wordb
1 2 worda-4567.we.77-ty worda
2 3 wordc-wordd-5698/de/ wordc-wordd
3 4 wordc-2356/rt/ wordc
Use str.extract:
test['sub_category'] = test.category.str.extract(r'(.*)-\d+')
  id                   category sub_category
0 1 worda-wordb-1234.ds.er89. worda-wordb
1 2 worda-4567.we.77-ty worda
2 3 wordc-wordd-5698/de/ wordc-wordd
3 4 wordc-2356/rt/ wordc
What you want is simply the start of the string and as many non-digits as needed until the first hyphen followed by a digit. This should do the trick:
^\D+?(?=-\d)
Demo
Explanation:
^ matches the start of the string
\D+? matches non-digits in a non-greedy (lazy) manner
(?=-\d) is a lookahead requiring a hyphen followed by a digit; it forces the previous match to stop without consuming those characters.
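To apply that pattern in pandas, it can be dropped into str.extract with a capture group around it. A sketch on the question's data (expand=False keeps the result as a Series):

```python
import pandas as pd

test = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'category': ['worda-wordb-1234.ds.er89.', 'worda-4567.we.77-ty',
                 'wordc-wordd-5698/de/', 'wordc-2356/rt/']
})

# The lookahead (?=-\d) stops the lazy \D+? at the first hyphen-digit boundary
test['sub_category'] = test['category'].str.extract(r'(^\D+?(?=-\d))', expand=False)
print(test['sub_category'].tolist())  # ['worda-wordb', 'worda', 'wordc-wordd', 'wordc']
```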
You can do this with split() also:
>>> df
id category
0 1 worda-wordb-1234.ds.er89.
1 2 worda-4567.we.77-ty
2 3 wordc-wordd-5698/de/
3 4 wordc-2356/rt/
Resulting output:
>>> df['sub_category'] = df.category.str.split(r'-\d+', expand=True)[0]
>>> df
id category sub_category
0 1 worda-wordb-1234.ds.er89. worda-wordb
1 2 worda-4567.we.77-ty worda
2 3 wordc-wordd-5698/de/ wordc-wordd
3 4 wordc-2356/rt/ wordc
Or, as #jezrael suggested, use the split() method with a small change: specify the number of splits required for the dataset, here just one.
df['sub_category'] = df.category.str.split(r'-\d+', n=1).str[0]
I have the following pandas DataFrame in Python3.x:
import pandas as pd
dict1 = {
'ID':['first', 'second', 'third', 'fourth', 'fifth'],
'pattern':['AAABCDEE', 'ABBBBD', 'CCCDE', 'AA', 'ABCDE']
}
df = pd.DataFrame(dict1)
>>> df
ID pattern
0 first AAABCDEE
1 second ABBBBD
2 third CCCDE
3 fourth AA
4 fifth ABCDE
There are two columns, ID and pattern. The longest string in pattern is in the first row, len('AAABCDEE'), which is length 8.
My goal is to standardize the strings so that they all have the same length, padded at the end with ? characters.
Here is what the output should look like:
>>> df
ID pattern
0 first AAABCDEE
1 second ABBBBD??
2 third CCCDE???
3 fourth AA??????
4 fifth ABCDE???
If I were able to make the trailing positions NaN, then I could try something like:
df = df.applymap(lambda x: int(x) if pd.notnull(x) else str("?"))
But I'm not sure how to efficiently (1) find the longest string in pattern and (2) pad the ends of the strings up to that length. This may be a convoluted approach...
You can use Series.str.ljust for this, after acquiring the max string length in the column.
df.pattern.str.ljust(df.pattern.str.len().max(), '?')
# 0 AAABCDEE
# 1 ABBBBD??
# 2 CCCDE???
# 3 AA??????
# 4 ABCDE???
# Name: pattern, dtype: object
In the pandas 0.22.0 source it can be seen that ljust is entirely equivalent to pad with side='right', so pick whichever you find clearer.
You can use str.pad:
df.pattern.str.pad(width=df.pattern.str.len().max(),side='right',fillchar='?')
Out[1154]:
0 AAABCDEE
1 ABBBBD??
2 CCCDE???
3 AA??????
4 ABCDE???
Name: pattern, dtype: object
Python 3.6 f-string
n = df.pattern.str.len().max()
df.assign(pattern=[f'{i:?<{n}s}' for i in df.pattern])
ID pattern
0 first AAABCDEE
1 second ABBBBD??
2 third CCCDE???
3 fourth AA??????
4 fifth ABCDE???
I have a column of a dataframe df like the following:
df.a=
0 2
1 2;4
2 4;2
3 2;4
4 4;2
5 1
I want to find all the rows that contain the value 4.
I am looking for a command like
df[df.a==4]
I would use str.contains (as long as the column is a series of strings):
a = df.loc[df['a'].str.contains('4')]
this returns:
     a
1  2;4
2  4;2
3  2;4
4  4;2
EDIT: in the general case, you should use a regular expression to match standalone '4' values only:
a = df.loc[df['a'].str.contains(r'\b4\b')]
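A sketch showing why the word boundaries matter, with two extra rows ('14;2', '2;44') added to the question's data as hypothetical edge cases:

```python
import pandas as pd

df = pd.DataFrame({'a': ['2', '2;4', '4;2', '14;2', '2;44', '1']})

# Plain substring search also hits '14;2' and '2;44', which merely contain a '4'
loose = df.loc[df['a'].str.contains('4')]

# \b4\b only matches '4' as a standalone token delimited by non-word characters
strict = df.loc[df['a'].str.contains(r'\b4\b')]
print(strict['a'].tolist())  # ['2;4', '4;2']
```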