Delete the rows that contain the string - Pandas dataframe - python

I want to convert columns in a DataFrame from object to int, and for that I need to completely delete the rows that contain strings.
The following expression "saves" the data I care about and converts the column from object to int:
df["column name"] = df["column name"].astype(str).str.replace(r'/\d+$', '').astype(int)
However,before this, rows that contain letters (A-Z) I want to delete completely.
I tried:
df[~df["column name"].str.lower().str.startswith('A-Z')]
I also tried a few other expressions, but none of them cleaned the data.
DataFrame looks something like this:
          A     B           C
0      8161  0454        9600
1    - 3780  1773        1450
2      2564  0548        5060
3      1332  9179        2040
4      6010  3263        1050
5  I Forgot  7849  1400/10000
In column C, row 5 holds 1400/10000; the first expression I wrote simply removes "/10000" and "1400" remains.
Now I also need to remove the rows with word values, like the one in cell A5 ("I Forgot").

Using a regular expression you can create a mask for all rows that contain a character in [a-z], then drop those rows. Like this:
# Mark the rows whose A value contains any letter
mask = df['A'].str.lower().str.contains('[a-z]')
idx = df.index[mask]
df = df.drop(idx, axis=0)
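For completeness, a minimal runnable sketch combining both steps; the column names A and C and the sample values are assumptions taken from the question's printout:
import pandas as pd

df = pd.DataFrame({'A': ['8161', '- 3780', '2564', 'I Forgot'],
                   'C': ['9600', '1450', '5060', '1400/10000']})

# Drop every row whose A value contains a letter
df = df[~df['A'].str.lower().str.contains('[a-z]')]

# Strip the trailing '/...' part and convert C to int
df['C'] = df['C'].astype(str).str.replace(r'/\d+$', '', regex=True).astype(int)
print(df.dtypes)   # C is now int64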

Related

Filter on a pandas string column as numeric without creating a new column

This is quite an easy task; however, I am stuck. I have a DataFrame with a column of type string, so there are characters in it:
Category
AB00
CD01
EF02
GH03
RF04
Now I want to treat the trailing digits of these values as numeric, filter on them, and create a subset DataFrame, without changing the original DataFrame in any way. I tried:
df_subset = df[df['Category'].str[2:4] <= 3]
Of course this does not work, as the slice is still a string and cannot be evaluated as numeric for the comparison. I also tried
df_subset = df[int(df['Category'].str[2:4]) <= 3]
but I am not sure about this; I think it is wrong or not the way it should be done.
Add type conversion to your expression:
df[df['Category'].str[2:].astype(int) <= 3]
  Category
0     AB00
1     CD01
2     EF02
3     GH03
As you have leading zeros, you can directly use string comparison:
df_subset = df.loc[df['Category'].str[2:4] <= '03']
Output:
  Category
0     AB00
1     CD01
2     EF02
3     GH03
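For reference, a minimal sketch demonstrating both answers, assuming the sample column is exactly as shown in the question:
import pandas as pd

df = pd.DataFrame({'Category': ['AB00', 'CD01', 'EF02', 'GH03', 'RF04']})

# Numeric comparison: cast the digit suffix; df itself is left unchanged
print(df[df['Category'].str[2:].astype(int) <= 3])

# String comparison: safe here because the digits are zero-padded to equal width
print(df.loc[df['Category'].str[2:4] <= '03'])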

How to deal with Pandas dataframe column with list containing string values, get unique words

I am trying to do some basic operations on a DataFrame column (called dimensions) that contains a list. Do basic operations like df['dimensions'].str.replace() work when the column contains a list? It did not work for me. I also tried to replace the text in the column using the re.sub() method, and that did not work either.
This is the last column in my dataframe:
**dimensions**
[50' long]
None
[70ft long, 19ft wide, 8ft thick]
[5' high, 30' long, 18' wide]
This is what I have tried, but it did not work:
def dimension_unique_words(dimensions):
    if dimensions != 'None':
        for value in dimensions:
            new_value = re.sub(r'[^\w\s]|ft|feet', ' ', value)
            new_value = ''.join([i for i in new_value if not i.isdigit()])
        return new_value

df['new_col'] = df['dimensions'].apply(dimension_unique_words)
This is the output I got from my code:
**new_col**
NaN
None
NaN
None
NaN
None
What I want to do is replace the numbers and the units (ft, feet, ') in the dimensions column with a space, and then apply unique() to that column to get the unique values, which are [long, wide, thick, high].
The expected output would be:
**new_col**
[long]
None
[long, wide, thick]
[high, long, wide]
...then I want to apply the df.unique() on the new_col to get [long, wide, thick, high]
How to do that?
First we deal with the annoyance that your 'dimensions' column is sometimes None and sometimes a list of strings. So join the elements into a single string when it's non-null:
df['dimensions2'] = df['dimensions'].apply(lambda lst: ' '.join(lst) if lst else None)
Next, get all alphabetic strings in each row, excluding measurements:
>>> df['dimensions2'].str.findall(r'\b([a-zA-Z]+)')
0 [long]
1 None
2 [long, wide, thick]
3 [high, long, wide]
Note we use the \b word boundary (to exclude the 'ft' of '30ft'), and to stop Python interpreting \b as a backspace escape we have to use an r'' raw string for the regex.
This gives you a list per row. You wanted a set, to prevent duplicates, so (guarding the null rows, which findall leaves as NaN):
df['dimensions2'].str.findall(r'\b([a-zA-Z]+)').apply(lambda l: set(l) if isinstance(l, list) else None)
0 {long}
1 None
2 {thick, long, wide}
3 {high, long, wide}
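Put together as a runnable sketch; the sample column is rebuilt here, and joining the list elements is an assumption that also covers lists holding more than one string:
import pandas as pd

df = pd.DataFrame({'dimensions': [["50' long"], None,
                                  ['70ft long', '19ft wide', '8ft thick'],
                                  ["5' high", "30' long", "18' wide"]]})

# Join each list into one string; None stays None
df['dimensions2'] = df['dimensions'].apply(lambda lst: ' '.join(lst) if lst else None)

words = (df['dimensions2'].str.findall(r'\b([a-zA-Z]+)')
         .apply(lambda l: set(l) if isinstance(l, list) else None))

# Union of all row sets gives the unique vocabulary
print(set().union(*[s for s in words if isinstance(s, set)]))
# {'long', 'wide', 'thick', 'high'}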
Use str.findall to collect each row's dimension words into a list.
Use explode to expand the lists into one element per row, keeping the original index.
Then use groupby(level=0).unique() to drop duplicates within each index and rebuild the lists.
df['new_col'] = (
    df['dimensions'].fillna('').astype(str)
    .str.findall(r'\b[a-zA-Z]+\b')
    .explode().dropna()
    .groupby(level=0).unique()
)
Use df['new_col'].explode().dropna().unique() to get the unique dimension values:
array(['long', 'wide', 'thick', 'high'], dtype=object)
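And the whole pipeline as a runnable sketch on the question's sample data (rebuilt here as an assumption):
import pandas as pd

df = pd.DataFrame({'dimensions': [["50' long"], None,
                                  ['70ft long', '19ft wide', '8ft thick'],
                                  ["5' high", "30' long", "18' wide"]]})

df['new_col'] = (
    df['dimensions'].fillna('').astype(str)   # None becomes '', lists become strings
    .str.findall(r'\b[a-zA-Z]+\b')            # word-bounded letters only, so 'ft' in '70ft' is skipped
    .explode().dropna()
    .groupby(level=0).unique()
)

print(df['new_col'].explode().dropna().unique())
# array(['long', 'wide', 'thick', 'high'], dtype=object)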

Pandas dataframe to check an empty string

I would like to differentiate whitespace-only strings of a certain length from a regular string such as G1234567. The whitespace strings in my dataset currently have length 8, but I cannot guarantee that all future ones will.
This is what the column looks like when I print it out:
0
1
2
3
4
           ...
9461    G6000000
9462    G6000001
9463    G6000002
9464    G6000003
9465    G6000004
Name: Sub_ID, Length: 9466, dtype: object
If I apply pd.isnull() to the entire column, I get a mask that is entirely False. Is there any way for me to differentiate between an "empty" string of some length and a string that is actually populated with something?
Thank you so much for your help!
The following creates a mask for all the cells in your DataFrame (df) that are whitespace-only strings:
df.applymap(lambda cell: cell.isspace())
Note that str.isspace() returns False for the zero-length string '', so if truly empty strings can occur as well, strip the values instead (see the sketch below).
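A minimal sketch, assuming the column is named Sub_ID as in the printout, that catches both empty and whitespace-only strings of any length:
import pandas as pd

df = pd.DataFrame({'Sub_ID': ['        ', '', 'G6000000', 'G6000001']})

# strip() reduces whitespace-only strings to '', whatever their length
blank = df['Sub_ID'].str.strip() == ''
print(blank.tolist())            # [True, True, False, False]
print(df.loc[~blank, 'Sub_ID'])  # keeps only the populated IDs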

Numpy select by string from pandas DataFrame

I would like to create a new column in my pandas DataFrame based on matching strings. I have pathnames of images that contain either the string 'distorted' or 'original'. I would like to assign the string values 'd' and 'o' in the new column respectively. I have been using np.select but I got a shape-mismatch error.
This is my code:
type_cond = [df[df['img_name'].str.contains(r'\bdistorted\b')],
             df[df['img_name'].str.contains(r'\boriginal\b')]]
type_values = ['d', 'o']
df['image_type'] = np.select(type_cond, type_values)
When I run the conditions separately, I get the expected output:
distorted = df[df['img_name'].str.contains(r'\bdistorted\b')]
output:
      id  nr                                           img_name  rid
...
2995   I   2  images/distorted/png/3MRNMEIQW56USS7S1XTZ20C8J...    E
2996   I   3  images/distorted/png/30MVJZJNHMDCUC6BMWCK0PGQO...    E
2997   I   2  images/distorted/png/3MYYFCXHJ37164AYXVVQM4DUA...    E
2998   I   3  images/distorted/png/39RP059MEHTLJDRTND387N3XG...    E
2999   I   1  images/distorted/png/3EKVH9QMEY4OR6LKRRBUN4DZD...    E
[2003 rows x 4 columns]
Filtering the strings that contain 'original' selects [997 rows x 4 columns], and the entire DataFrame is [3000 rows x 4 columns], so I don't see why there is a shape mismatch: every row is covered by one of the two conditions.
The problem is that the elements of your conditions list are filtered DataFrames, not boolean masks.
So you need to remove the boolean indexing (the outer df[...]):
type_cond = [df['img_name'].str.contains(r'\bdistorted\b'),
             df['img_name'].str.contains(r'\boriginal\b')]
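A minimal corrected sketch, using hypothetical sample paths and an explicit default so any row matching neither pattern is visible (np.select otherwise fills such rows with 0):
import numpy as np
import pandas as pd

df = pd.DataFrame({'img_name': ['images/distorted/png/abc.png',
                                'images/original/png/def.png']})

type_cond = [df['img_name'].str.contains(r'\bdistorted\b'),
             df['img_name'].str.contains(r'\boriginal\b')]
type_values = ['d', 'o']

df['image_type'] = np.select(type_cond, type_values, default='?')
print(df)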

Drop rows from dataframe that contain characters outside of a list of characters

I'm trying to remove all rows from a pandas DataFrame whose values are not in a list of allowed characters:
from string import ascii_lowercase

allowed_chars = list(ascii_lowercase)
data = df[df['Value'].apply(lambda x: x in allowed_chars)]
print(data.Value.tolist())
The print just prints a list of 'False' values.
What you're doing is comparing the entire string stored in the Value column to the list of allowed characters. This won't work: the list consists of one-character strings, and none of them equals a whole word from the Value column. Here's what you could do instead:
allowed_chars = set('abcde...')  # or set(ascii_lowercase)
data = df[df['Value'].apply(lambda x: set(x).issubset(allowed_chars))]
print(data.Value.tolist())
Your format seems fine; the only thing I can think of is that the value isn't in the right format for the in test. You may need to wrap it as str(x) in your data = ... line. If you can give a snippet of ascii_lowercase and the data, I can look further.
df2
#    a  b  c
# 0  x  2  3
# 1  y  2  4

df2[df2.a.apply(lambda x: x in 'x')]
#    a  b  c
# 0  x  2  3
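If the allowed characters form a regex character class, a vectorized alternative sketch is str.fullmatch, which keeps only rows whose entire value consists of allowed characters (the sample values here are assumptions):
import pandas as pd

df = pd.DataFrame({'Value': ['abc', 'ab1', 'xyz', 'Hello']})

# Whole-string match against lowercase letters only
data = df[df['Value'].str.fullmatch(r'[a-z]+')]
print(data.Value.tolist())   # ['abc', 'xyz']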
