Pandas dataframe to check an empty string - python

I would like to differentiate between an empty string of a certain length and a regular string such as G1234567. The length of the empty strings in my dataset is currently 8, but I cannot guarantee that all future empty strings will still have a length of 8.
This is what the column looks like when I print it out:
0
1
2
3
4
          ...
9461    G6000000
9462    G6000001
9463    G6000002
9464    G6000003
9465    G6000004
Name: Sub_ID, Length: 9466, dtype: object
If I apply pd.isnull() to the entire column, I get a mask populated with all False. Is there any way for me to differentiate between an empty string of a certain length and a string that is actually populated with something?
Thank you so much for your help!

The following creates a mask for all the cells in your DataFrame (df) that are just empty strings (strings that contain only whitespace):
df.applymap(lambda cell: cell.isspace())
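If the "empty" values might also be zero-length strings (isspace() returns False for ''), a stripped-equality check covers both cases. A minimal sketch, assuming the column is named Sub_ID as in the printout above:

import pandas as pd

df = pd.DataFrame({'Sub_ID': ['        ', '', 'G6000000', 'G6000001']})

# strip() removes whitespace of any length, so this flags both blank and
# whitespace-only values regardless of how many spaces they contain
mask = df['Sub_ID'].str.strip().eq('')
print(mask)
# 0     True
# 1     True
# 2    False
# 3    False
# Name: Sub_ID, dtype: bool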

Related

Filter on a pandas string column as numeric without creating a new column

This should be a fairly easy task; however, I am stuck. I have a dataframe with a column of type string (i.e., it contains characters):
Category
AB00
CD01
EF02
GH03
RF04
Now I want to treat these values as numeric, filter on them, and create a subset dataframe. However, I do not want to change the dataframe in any way. I tried:
df_subset=df[df['Category'].str[2:4]<=3]
Of course this does not work, as the left-hand side is a string and cannot be evaluated as numeric and compared to 3.
I tried
df_subset=df[int(df['Category'].str[2:4])<=3]
but I am not sure about this; I think it is wrong, or at least not the way it should be done.
Add type conversion to your expression:
df[df['Category'].str[2:].astype(int) <= 3]
  Category
0     AB00
1     CD01
2     EF02
3     GH03
As you have leading zeros, you can directly use string comparison:
df_subset = df.loc[df['Category'].str[2:4] <= '03']
Output:
  Category
0     AB00
1     CD01
2     EF02
3     GH03
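A small reproducible sketch putting the two variants side by side, assuming the five sample values shown in the question:

import pandas as pd

df = pd.DataFrame({'Category': ['AB00', 'CD01', 'EF02', 'GH03', 'RF04']})

# Numeric variant: convert the digit suffix on the fly; the original column is untouched
numeric_subset = df[df['Category'].str[2:].astype(int) <= 3]

# String variant: works because the digits are zero-padded to the same width
string_subset = df.loc[df['Category'].str[2:4] <= '03']

print(numeric_subset.equals(string_subset))  # True
print(string_subset)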

Remove "?" from pandas column

I have a pandas dataset with columns whose dtype is object. The columns, however, contain numerical float values along with '?', and I'm trying to convert them to float. I want to remove these '?' from the entire column, make those values NaN (not 0), and then convert the column to float64.
The output of value_counts() on the Voltage column looks like this:
?         3771
240.67     363
240.48     356
240.74     356
240.62     356
          ...
227.61       1
227.01       1
226.36       1
227.28       1
227.02       1
Name: Voltage, Length: 2276, dtype: int64
What is the best way to do this when the entire dataset has "?" mixed in with the numbers and I want to convert them all at once?
I tried something like this, but it's not working. I want to do this operation for all the columns. Thanks
df['Voltage'] = df['Voltage'].apply(lambda x: float(x.split()[0].replace('?', '')))
One more question: how can I get the "?" counts from all the columns? I tried something like the following. Thanks
counts = []
for i in df.columns:
    if '?' not in df[i].values:
        continue
    series = df[i].value_counts()['?']
    counts.append(series)
So, from your value_counts, it is clear that you just have some values that are floats stored as strings, and some values that contain ? (apparently that simply ARE ?).
So, the one thing NOT to do is use apply or applymap.
Those are just one step below for loops and iterrows in the hierarchy of what not to do.
The only cases where you should use apply are when you would otherwise have to iterate over rows with a for loop. And those cases almost never happen (in my real life, I've used apply only once, and that was when I was a beginner; I am pretty sure that if I were to review that code now, I would find another way).
In your case
df.Voltage = df.Voltage.where(~df.Voltage.str.contains(r'\?')).astype(float)
should do what you want
df.Voltage.str.contains(r'\?') is a True/False series saying whether each row contains a '?'. So ~df.Voltage.str.contains(r'\?') is the opposite (True if the row does not contain a '?'). df.Voltage.where(~df.Voltage.str.contains(r'\?')) is therefore a series where values matching the condition are left as is, and the others are replaced by the second argument, or, if there is no second argument (which is our case), by NaN. So exactly what you want. Adding .astype(float) converts everything to float, since that is now possible (all rows contain either strings representing a float, such as 230.18, or NaN, all of which are convertible to float).
An alternative, closer to what you were trying, that first replaces the ? in place, would be
df.loc[df.Voltage=='?', 'Voltage']=None
# And then, df.Voltage.astype(float) converts to float, with NaN where you put None
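For the "all columns at once" part of the question, one option is to replace the literal '?' everywhere and then convert. This is a hedged sketch using DataFrame.replace rather than where, on a hypothetical two-column frame standing in for the real dataset, assuming '?' is always the entire cell value:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Voltage': ['240.67', '?', '227.61'],
                   'Current': ['1.2', '3.4', '?']})

# How many '?' placeholders does each column contain?
print(df.eq('?').sum())

# Swap every literal '?' for NaN, then convert the columns to float
cleaned = df.replace('?', np.nan).astype(float)
print(cleaned.dtypes)   # both columns are now float64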

Python replace NaN values of a field holding numeric values with empty and not single quotes which will be treated later as strings

I am uploading some data frames into snowflake cloud. I had to use the following to transform all field values into strings:
data = data.applymap(str)
The only reason I am doing that is that without it, I will get the following error:
TypeError: not all arguments converted during string formatting
The reason for that is that there are fields containing numeric values, but not all rows have them; some have 'NA' instead. For data integrity, we cannot replace them with 0s: in our domain, 0 means something, and a blank is different from the value 0.
At the beginning, I tried to replace NAs with single quotes '', but then all fields holding numbers were transformed into floats. So if a value is 123, it becomes 123.0.
How can I replace NA values in a numeric field with a completely blank value (not '') so the field can still be considered of type INT?
In the image below, I don't want the empty cell to be treated as a string, as the other fields will be transformed by applymap() into floats if they are ints:
Detect NaNs using np.isnan() and pass only non-NaN numbers to str().
If you don't want float-typed integers, just change the mapping from str() to str(int()).
Data
Note that column B contains NaN, which is actually a float, so its dtype is automatically float.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, np.nan]})
print(df)
   A    B
0  1  3.0
1  2  NaN
Code
import numpy as np

# NaN cells become empty strings; everything else becomes its str() representation
df.applymap(lambda el: "" if np.isnan(el) else str(el))
Out[12]:
   A    B
0  1  3.0
1  2
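And the str(int()) variant mentioned above, as a sketch that assumes every non-NaN value is a whole number (newer pandas versions also offer DataFrame.map as the replacement for applymap):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, np.nan]})

# NaN becomes an empty string; everything else is cast to int first, so 3.0 prints as "3"
out = df.applymap(lambda el: "" if np.isnan(el) else str(int(el)))
print(out)
#    A  B
# 0  1  3
# 1  2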

Delete the rows that contain the string - Pandas dataframe

I want to convert columns in a DataFrame from OBJECT to INT. I need to completely delete the rows that contain strings.
The following expression "saves" the data I care about and converts the column from the OBJECT to INT type:
df["column name"] = df["column name"].astype(str).str.replace(r'/\d+$', '').astype(int)
However, before this, I want to completely delete the rows that contain letters (A-Z).
I tried:
df[~df["column name"].str.lower().str.startswith('A-Z')]
I also tried a few other expressions; however, none of them cleaned the data.
DataFrame looks something like this:
A B C
0 8161 0454 9600
1 - 3780 1773 1450
2 2564 0548 5060
3 1332 9179 2040
4 6010 3263 1050
5 I Forgot 7849 1400/10000
Col C - 1400/10000 - the first expression I wrote simply removes "/10000" and keeps "1400".
Now I need to remove the word entries, like the one in cell A5 ("I Forgot").
Using a regular expression, you can create a mask for all rows that contain a character between [a-z]. Then you can drop those rows, like this:
# Mask of rows whose value in column A contains any letter a-z
mask = df['A'].str.lower().str.contains('[a-z]')
idx = df.index[mask]
df = df.drop(idx, axis=0)
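A reproducible sketch on a simplified, single-column version of the data (the 'oops' value is a made-up stand-in for a row containing letters), which drops the letter rows first and then applies the cleanup from the question:

import pandas as pd

df = pd.DataFrame({'C': ['9600', '1450', '5060', '2040', '1050', '1400/10000', 'oops']})

# Mask of rows whose value contains any letter a-z (case-insensitive thanks to lower())
mask = df['C'].str.lower().str.contains('[a-z]')
df = df.drop(df.index[mask], axis=0)

# Strip a trailing "/digits" suffix and convert; regex=True is required in recent pandas
df['C'] = df['C'].astype(str).str.replace(r'/\d+$', '', regex=True).astype(int)
print(df)
#       C
# 0  9600
# 1  1450
# 2  5060
# 3  2040
# 4  1050
# 5  1400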

Pandas: return feature names if variable is true

I have a list of ~2M strings and a list of ~800 words. I have created a dataframe with strings as rows and words as columns. With the exception of the string variable, all of the other variables are true or false values corresponding to whether or not the word is in the string. There are no missing values.
i.e.
import pandas as pd

df = pd.DataFrame({'strings': ['a string with california',
                               'a string with lobster',
                               'a str with california and lobster'],
                   'california': [True, False, True],
                   'lobster': [False, True, True],
                   'string': [True, True, False]})
Because the dataframe is too long and wide to view at once, I would like to have a variable that lists the column names that have a true value for that particular row. For example,
df_filtered = pd.DataFrame({'strings': ['a string with california',
                                        'a string with lobster',
                                        'a str with california and lobster'],
                            'matches': [['string', 'california'],
                                        ['string', 'lobster'],
                                        ['california', 'lobster']],
                            'california': [True, False, True],
                            'lobster': [False, True, True],
                            'string': [True, True, False]})
I am new to pandas and have figured out that I can create a list of column names with missing values with the following command
columns_w_na = df.columns[df.isnull().any()].tolist()
Is there a way that I can, for each row, similarly capture the names of columns with a particular value and represent them as a list?
You may want to check
df.eq(True).dot(df.columns+',').str[:-1].str.split()
0 [california,string]
1 [lobster,string]
2 [california,lobster]
dtype: object
use apply with a lambda expression:
# setting axis=1 in apply means you are looking across rows
df['new'] = df.apply(lambda x: df.columns[x == True].values, axis=1)
                              strings  california  lobster  string  \
0           a string with california        True    False    True
1               a string with lobster       False     True    True
2   a str with california and lobster        True     True   False

                      new
0    [california, string]
1       [lobster, string]
2   [california, lobster]
One of the responses above does a good job of creating a bracketed string of the matches separated by commas, which is really helpful. I had a subsequent issue where I needed to count the number of matched phrases, which made it more helpful to have the column as a list type rather than a string.
df['matches'] = df.eq(True).dot(df.columns+',').str[:-1].str.split(',')
df['num_matches'] = df['matches'].str.len()
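A reproducible sketch of those two lines on the sample frame from the question:

import pandas as pd

df = pd.DataFrame({'strings': ['a string with california',
                               'a string with lobster',
                               'a str with california and lobster'],
                   'california': [True, False, True],
                   'lobster': [False, True, True],
                   'string': [True, True, False]})

# Build "colname," pieces for every True cell, drop the trailing comma, then split on commas
df['matches'] = df.eq(True).dot(df.columns + ',').str[:-1].str.split(',')
df['num_matches'] = df['matches'].str.len()

print(df[['matches', 'num_matches']])
#                  matches  num_matches
# 0   [california, string]            2
# 1      [lobster, string]            2
# 2  [california, lobster]            2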
