How to find error values in a pandas DataFrame? - python

While reading CSV files exported from Excel into a pandas DataFrame, I got a value that contains a symbol, such as 2$3.74836730957, where it should be 243.74836730957 (it seems a 4 was mistaken for a $). Is there any way I could find values like the one mentioned above and change them into NaN values in the DataFrame?

You can use pd.to_numeric to get boolean values that denote whether a particular column holds only numerical values. To check all columns you can do:
df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
And the output would look like:
A True
B False
C True
D False
...
dtype: bool
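If you keep that result in a variable, you can then list just the offending columns; a small follow-up sketch (the variable names here are my own):
checks = df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
bad_cols = checks[~checks].index.tolist()  # columns that failed, e.g. ['B', 'D']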
Now if you want to know which specific row(s) are not numerical you can use np.isreal:
import numpy as np
df.applymap(np.isreal)  # note: in pandas >= 2.1, applymap is deprecated in favor of df.map
A B C D
item
r1 True True True True
r2 True True True True
r3 True True True False
...
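And to actually replace the malformed entries with NaN, as the question asks, a minimal sketch, assuming every column is supposed to be numeric:
df = df.apply(lambda s: pd.to_numeric(s, errors='coerce'))
# anything that cannot be parsed as a number, e.g. 2$3.74836730957,
# becomes NaN, while valid numbers like 243.74836730957 are kept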

Related

How to compress a pandas dataframe, based on their boolean column value?

Given a pandas frame like this:
In:
pokemon yes no ignore
Vulpix True False False
Nidorino False True False
Growlithe False False True
Krokorok False True False
Darumaka False False True
Klefki False True False
Croagunk True False False
What is the correct way of getting, as a row value next to the pokemon column, the name of the column that is True?:
Out:
pokemon Val
Vulpix yes
Nidorino no
Growlithe ignore
Krokorok no
Darumaka ignore
Klefki no
Croagunk yes
So far I tried with cross tab:
pd.crosstab(df, columns=['yes', 'no', 'ignore'])
However, I am getting a value error:
`ValueError: Shape of passed values`
What is the correct way of getting the previous output?
If those are one-hot encoded such that there is only ever a single 1/True on each row, use set_index and then dot with the columns.
df = df.set_index('pokemon')
df = df.dot(df.columns)
pokemon
Vulpix yes
Nidorino no
Growlithe ignore
Krokorok no
Darumaka ignore
Klefki no
Croagunk yes
dtype: object
The above is a Series, to get the DataFrame to match your output:
df = df.dot(df.columns).to_frame('Val').reset_index()
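An alternative sketch of my own, starting again from the original un-indexed frame and assuming the same one-True-per-row layout: idxmax(axis=1) returns, for each row, the label of the first column holding the row's maximum, which for booleans is the first True.
out = df.set_index('pokemon').idxmax(axis=1).to_frame('Val').reset_index()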

Split strings in DataFrame and keep only certain parts

I have a DataFrame like this:
x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame(data = x, columns = ['id'])
id
0 3.13.1.7-2.1
1 3.21.1.8-2.2
2 4.20.1.6-2.1
3 4.8.1.2-2.0
4 5.23.1.10-2.2
I need to split each id string on the periods, and then I need to know when the second part is 13 and the third part is 1. Ideally, I would have one additional column that is a boolean value (in the above example, index 0 would be TRUE and all others would be FALSE). But I could live with multiple additional columns, where one or more contain individual string parts, and one is for said boolean value.
I first tried to just split the string into parts:
df['id_split'] = df['id'].apply(lambda x: str(x).split('.'))
This worked, however if I try to isolate only the second part of the string like this...
df['id_split'] = df['id'].apply(lambda x: str(x).split('.')[1])
...I get an error that the list index is out of range.
Yet, if I check any individual index in the DataFrame like this...
df['id_split'][0][1]
...this works, producing only the second item in the list of strings.
I guess I'm not familiar enough with what the .apply() method is doing to know why it won't accept list indices. But anyway, I'd like to know how I can isolate just the second and third parts of each string, check their values, and output a boolean based on those values, in a scalable manner (the actual dataset is millions of rows). Thanks!
Let's use str.split to get the parts, then you can compare:
parts = df['id'].str.split(r'\.', expand=True)
(parts[[1,2]] == ['13','1']).all(1)
Output:
0 True
1 False
2 False
3 False
4 False
dtype: bool
You can do something like this
df['flag'] = df['id'].apply(lambda x: True if x.split('.')[1] == '13' and x.split('.')[2]=='1' else False)
Output
id flag
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
You can do it directly, like below:
df['new'] = df['id'].apply(lambda x: str(x).split('.')[1]=='13' and str(x).split('.')[2]=='1')
>>> print(df)
id new
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
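Since the question mentions millions of rows, one vectorized variant worth sketching (my own, same output as the answers above): compute the split once with the .str accessor and avoid the Python-level lambda entirely.
parts = df['id'].str.split(r'\.')  # computed once, a Series of lists
df['flag'] = parts.str[1].eq('13') & parts.str[2].eq('1')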

Find rows in a DataFrame that match

Suppose I make a DataFrame with pandas:
df = pandas.DataFrame({"a":[1,2,3],"b":[3,4,5]})
Now I have a Series:
s = pandas.Series([1,3], index=['a','b'])
OK, finally, my question is: how can I know whether s appears as a row in df? I especially need to consider performance.
PS: It would be best to return True or False depending on whether s is in df or not.
You want
df.eq(s).all(axis=1).any()
# True
We start with a row-wise comparison,
df.eq(s)
a b
0 True True
1 False False
2 False False
Find all rows which match fully,
df.eq(s).all(axis=1)
0 True
1 False
2 False
dtype: bool
Return True if any matches were found,
df.eq(s).all(axis=1).any()
# True
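If performance is the main concern, a NumPy-level sketch of the same check; this skips pandas index alignment, so it assumes s has the same column order as df:
(df.to_numpy() == s.to_numpy()).all(axis=1).any()
# True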

Using str.contains to look for two substrings with pandas in python

I'm afraid the solution is obvious or the question is a duplicate, but I couldn't find an answer yet: I have a pandas DataFrame that contains long strings, and I need two strings to be matched at the same time. I found the "or" version multiple times, but I didn't find the "and" solution yet.
Please assume the following DataFrame, where the interesting pieces of information, "element type" and "subpart type", are separated by a random in-between element:
import pandas as pd
data = pd.DataFrame({"col1":["element1_random_string_subpartA"
, "element2_ran_str_subpartA"
, "element1_some_text_subpartB"
, "element2_some_other_text_subpartB"]})
I'd now like to filter for all lines that contain element1 and subpartA.
data.col1.str.contains("element1|subpartA")
returns a Series
True
True
True
False
which is the expected result. But I need an "And" combination and
data.col1.str.contains("element1&subpartA")
returns
False
False
False
False
although I'd expect
True
False
False
False
An "and" in regex is not easy; it needs lookaheads:
m = data.col1.str.contains(r'(?=.*subpartA)(?=.*element1)')
Simpler is to chain both conditions with & for bitwise AND:
m = data.col1.str.contains("subpartA") & data.col1.str.contains("element1")
print (m)
0 True
1 False
2 False
3 False
Name: col1, dtype: bool
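A generalization of my own, in case you need more than two required substrings: build one contains mask per substring and AND them all together.
import numpy as np
required = ['element1', 'subpartA']  # hypothetical list of must-have substrings
m = np.logical_and.reduce([data.col1.str.contains(s, regex=False) for s in required])
# m is a NumPy boolean array; wrap it in pd.Series(m, index=data.index) if needed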

Find values of data frame in another dataframe in python

I have two data frames. df_1 contains:
["TP","MP"]
and df_2 contains:
["This is case 12389TP12098", "12378MP899 is now resolved", "12356DCT is pending"]
I want to take the values in df_1, search for them in each entry of df_2, and return the entries which match. In this case, that is the two entries which contain TP and MP.
I tried something like this:
df_2.str.contains(df_1)
You need to do it separately for each element of df_1. Pandas will help you:
df_1.apply(df_2.str.contains)
Out:
0 1 2
0 True False False
1 False True False
That's a matrix of all combinations. You can pretty it up:
matches = df_1.apply(df_2.str.contains)
matches.index = df_1
matches.columns = df_2
matches
Out:
This is case 12389TP12098 12378MP899 is now resolved 12356DCT is pending
TP True False False
MP False True False
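If you also want the matching entries of df_2 themselves, a small follow-up sketch (treating df_1 and df_2 as Series, as the answer above does), joining the df_1 values into a single regex alternation:
pattern = '|'.join(df_1)  # 'TP|MP'
df_2[df_2.str.contains(pattern)]
# 0     This is case 12389TP12098
# 1    12378MP899 is now resolved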
