Check which value in Pandas Dataframe Column is String - python

I have a DataFrame of around 200,000 records. When I pass this DataFrame as input to a model, it throws this error:
Cast string to float is not supported.
Is there any way I can check which particular value in the DataFrame is causing this error?
I've tried running this command, which checks whether any value in the column is not a string:
False in map((lambda x: type(x) == str), trainDF['Embeddings'])
Output:
True

In pandas, when we convert a column with mixed types we do:
df['col'] = pd.to_numeric(df['col'], errors='coerce')
This returns NaN for items that cannot be converted to float; you can then drop those rows with dropna or fill in a default value with fillna.
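For the original question, the same idea locates the offending values without a loop; a minimal sketch, assuming the column is trainDF['Embeddings']:
import pandas as pd

trainDF = pd.DataFrame({'Embeddings': ['100', '23.2', '44a', '453.2']})  # toy data

# Values that fail numeric conversion become NaN, so the NaN mask flags them.
converted = pd.to_numeric(trainDF['Embeddings'], errors='coerce')
print(trainDF[converted.isna()])
#   Embeddings
# 2        44a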

You can loop over trainDF's rows and collect the indices that raise a conversion error using try/except:
>>> import pandas as pd
>>> trainDF = pd.DataFrame({'Embeddings': ['100', '23.2', '44a', '453.2']})
>>> trainDF
  Embeddings
0        100
1       23.2
2        44a
3      453.2
>>> error_indices = []
>>> for idx, row in trainDF.iterrows():
...     try:
...         trainDF.loc[idx, 'Embeddings'] = float(row['Embeddings'])
...     except ValueError:
...         error_indices.append(idx)
...
>>> trainDF
  Embeddings
0      100.0
1       23.2
2        44a
3      453.2
>>> trainDF.loc[error_indices]
  Embeddings
2        44a
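Once the offending rows are identified, one way to proceed (a sketch, assuming the bad rows can simply be dropped):
>>> clean = trainDF.drop(error_indices)
>>> clean['Embeddings'] = clean['Embeddings'].astype(float)  # now safe to cast
>>> clean
  Embeddings
0      100.0
1       23.2
3      453.2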

Related

How to add multiple array outputs to a dataframe?

I am working with probabilities. When I print the output, it looks as follows:
[[4.88915104e-308 1.43405787e-307 2.20709896e-308 ... 3.08740254e-307
1.68481486e-307 1.72126050e-307]
[1.64744295e-004 8.66082462e-004 7.66062761e-005 ... 1.85613403e-003
9.68750380e-004 8.22260750e-004]
[6.18964539e-004 1.85605606e-003 2.71300593e-004 ... 3.86232296e-003
2.01921300e-003 2.18304351e-003]],
Is there a way in pandas to store it as a DataFrame?
Desired output:
Index Value
0 [4.88915104e-308 1.43405787e-307 2.20709896e-308 ... 3.08740254e-307
1.68481486e-307 1.72126050e-307]
1 [1.64744295e-004 8.66082462e-004 7.66062761e-005 ... 1.85613403e-003
9.68750380e-004 8.22260750e-004]
2 [6.18964539e-004 1.85605606e-003 2.71300593e-004 ... 3.86232296e-003
2.01921300e-003 2.18304351e-003]
I tried a lot of ways but I get:
ValueError: Wrong number of items passed 6, placement implies 1
Yes, it is possible if you convert the 2D array to a list:
df = pd.DataFrame({'col':arr.tolist()})
Or:
s = pd.Series(arr.tolist())
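A runnable sketch with a small stand-in array (the 3x4 shape and the names arr and df are just for illustration):
import numpy as np
import pandas as pd

arr = np.random.rand(3, 4)                # stand-in for the probability matrix
df = pd.DataFrame({'col': arr.tolist()})  # each row becomes one list-valued cell
print(df['col'].iloc[0])                  # a plain Python list of 4 floats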

Check if a pandas Dataframe string column contains all the elements given in an array

I have a dataframe as shown below:
>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
           a  b  c  d
0       app;  1  2  3
1  app; web;  4  5  6
2       web;  7  8  9
3             1  4  5
I have an input array that looks like this: ["app","web"]
For each of these values I want to check against a specific column of a dataframe and return a decision as shown below:
>>> df.a.str.contains("app")
0     True
1     True
2    False
3    False
Since str.contains only lets me look for an individual value, I was wondering whether there is some more direct way to do the same, something like:
df.a.str.contains(["app","web"]) # raises TypeError: unhashable type: 'list'
My end goal is not an absolute match (df.a.isin(["app", "web"])) but rather 'contains' logic that returns True if those characters are present anywhere in that cell of the data frame.
Note: I can of course use the apply method to write my own function with the same logic, such as:
elementsToLookFor = ["app","web"]
df[header] = df[header].apply(lambda element: all(a in element for a in elementsToLookFor))
But I am more interested in the optimal approach, so I would prefer a native pandas function, or failing that, the most optimized custom solution.
This should work too:
l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
This should work as well:
pd.concat([df['a'].str.contains(i) for i in l],axis=1).all(axis = 1)
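For the sample frame above, both return the same boolean Series (shown here for the findall version):
>>> l = ["app", "web"]
>>> df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
0    False
1     True
2    False
3    False
Name: a, dtype: bool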
So many solutions; which one is the most efficient?
The str.contains-based answers are generally the fastest, though str.findall is also very fast on smaller frames:
values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)

def replace_dummies_all(df):
    return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)

def findall_map(df):
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def lower_contains(df):
    return df.a.astype(str).str.lower().str.contains(pattern)

def contains_concat_all(df):
    return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)

def contains(df):
    return df.a.str.contains(pattern)
Try with str.get_dummies:
df.a.str.replace(' ', '').str.get_dummies(';')[['web', 'app']].all(1)
0    False
1     True
2    False
3    False
dtype: bool
Update: a single regex with two lookaheads also works; it matches only rows that contain both substrings, in any order:
df['a'].str.contains(r'^(?=.*web)(?=.*app)')
Update 2 (to ensure case insensitivity and that the column dtype is str, without which the logic may fail):
elementList = ['app', 'web']
valueString = ''
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'
df[header] = df[header].astype(str).str.lower()  # ensure case insensitivity and a string dtype
result = df[header].str.contains(valueString)

Convert string (comma separated) to int list in pandas Dataframe

I have a DataFrame with a column which contains integers and sometimes a string of multiple comma-separated numbers (like "1234567, 89012345, 65425774").
I want to convert such strings to integer lists so it is easier to search for specific numbers.
In [1]: import pandas as pd
In [2]: raw_input = "1111111111 666 10069759 9695011 9536391,2261003 9312405 15542804 15956127 8409044 9663061 7104622 3273441 3336156 15542815 15434808 3486259 8469323 7124395 15956159 3319393 15956184 15956217 13035908 3299927"
In [3]: df = pd.DataFrame({'x':raw_input.split()})
In [4]: df.head()
Out[4]:
                 x
0       1111111111
1              666
2         10069759
3          9695011
4  9536391,2261003
Since your column contains both strings and integers, you probably want something like this:
def to_integers(column_value):
    if not isinstance(column_value, int):
        return [int(v) for v in column_value.split(',')]
    else:
        return column_value

df.loc[:, 'column_name'] = df.loc[:, 'column_name'].apply(to_integers)
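Applied to the example column 'x' above, a membership search could then look like this (the helper contains_id is hypothetical, not part of the answer):
converted = df['x'].apply(to_integers)

def contains_id(cell, target):
    # After conversion, a cell is either an int or a list of ints.
    return target in cell if isinstance(cell, list) else cell == target

print(df[converted.apply(contains_id, target=2261003)])
#                  x
# 4  9536391,2261003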
Your best solution in cases like this, where a column holds one or more values, is to split the data into multiple columns.
Try something like
ids = df.ID.str.split(',', expand=True)
for i in range(ids.shape[1]):
    df['ID' + str(i + 1)] = ids.iloc[:, i]
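On the example column 'x', the same split idea looks like this (a sketch; rows with fewer values get None in the trailing columns):
ids = df['x'].str.split(',', expand=True)
print(ids.head())
#             0        1
# 0  1111111111     None
# 1         666     None
# 2    10069759     None
# 3     9695011     None
# 4     9536391  2261003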

Compare columns in Dataframe with inconsistent type values

import pandas as pd
df = pd.DataFrame({'RMDS': ['10.686000','NYSE_XNAS','0.472590','qrtr'], 'Mstar': ['10.690000', 'NYSE_XNAS', '0.473590','mnthly']})
Dataframe df will look like this:
       Mstar       RMDS
0  10.690000  10.686000
1  NYSE_XNAS  NYSE_XNAS
2   0.473590   0.472590
3     mnthly       qrtr
I want to compare the values of 'RMDS' with 'Mstar'. The dtype of the dataframe is 'object'; it is a huge dataframe, and I need to compare rounded values:
mask = np.around(pd.to_numeric(df.Mstar), 2) != np.around(pd.to_numeric(df.RMDS), 2)
df_Difference = df[mask]
Since the values in the columns are not consistent, the logic above fails whenever a string value like 'qrtr' appears, because I am using pd.to_numeric. I still want to compare 'qrtr' in 'RMDS' against 'mnthly' in 'Mstar'.
Is there any way to handle this kind of situation?
Use pd.to_numeric to convert what you can, then .fillna to get everything back that wasn't converted.
import pandas as pd
import numpy as np
df = np.round(df.apply(pd.to_numeric, errors='coerce'), 2).fillna(df)
#        RMDS      Mstar
#0      10.69      10.69
#1  NYSE_XNAS  NYSE_XNAS
#2       0.47       0.47
#3       qrtr     mnthly

df.RMDS == df.Mstar
#0     True
#1     True
#2     True
#3    False
#dtype: bool
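The question's mismatch mask then reduces to a plain comparison on the cleaned frame (sketch):
df_Difference = df[df.RMDS != df.Mstar]
#   RMDS   Mstar
#3  qrtr  mnthly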
Alternatively, define your own function and use .applymap:
def my_round(x):
    try:
        return np.round(float(x), 2)
    except ValueError:
        return x

df = df.applymap(my_round)

Replacing masked values (--) with a Null or None value using fill_value from numpy.ma in Python

Is there a way to replace a masked value in a numpy masked array with a null or None value? This is what I have tried, but it does not work.
for stars in range(length_masterlist_final):
    ....
    star = customSimbad.query_object(star_names[stars])
    # obtain stellar info
    photometry_dataframe.iloc[stars, 0] = star_IDs[stars]
    photometry_dataframe.iloc[stars, 1] = star_names[stars]
    photometry_dataframe.iloc[stars, 2] = star['FLUX_U'][0]
    # Replace "--" masked values with a Null (i.e., '') value.
    photometry_dataframe.iloc[stars, 2] = ma.filled(photometry_dataframe.iloc[stars, 2], fill_value=None)
.....
photometry_dataframe.to_csv(output_dir + "simbad_photometry.csv", index=False, header=True, na_rep='NaN')
Specifically,
photometry_dataframe.iloc[stars, 2] = ma.filled(photometry_dataframe.iloc[stars, 2], fill_value=None)
produces:
'MaskedConstant' object has no attribute '_fill_value'
I want masked values '--' replaced with '' when I output the dataframe as a CSV file. One workaround is to read the output CSV back into Python and replace '--' with '', but that is a horrible solution; there must be a better one. I don't want masked values printed as '--' in the CSV file.
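A direct fix at that point in the loop is to test for the masked sentinel before assigning (a sketch; np.ma.masked is the singleton that masked scalar cells return):
import numpy as np

value = star['FLUX_U'][0]
# np.ma.masked is a singleton, so an identity check catches masked cells.
photometry_dataframe.iloc[stars, 2] = '' if value is np.ma.masked else value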
Use Astropy:
>>> from pandas import DataFrame
>>> from astropy.table import Table
>>> import numpy as np
>>>
>>> df = DataFrame()
>>> df['a'] = [1, np.nan, 2]
>>> df['b'] = [3, 4, np.nan]
>>> df
     a    b
0  1.0  3.0
1  NaN  4.0
2  2.0  NaN
>>> t = Table.from_pandas(df)
>>> t
<Table masked=True length=3>
   a       b
float64 float64
------- -------
    1.0     3.0
     --     4.0
    2.0      --
>>> t.write('photometry.csv', format='ascii.csv')
>>>
(astropy)neptune$ cat photometry.csv
a,b
1.0,3.0
,4.0
2.0,
You can specify arbitrary transformations from table values to output values using the fill_values parameter (http://docs.astropy.org/en/stable/io/ascii/write.html#parameters-for-write).
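For example, to write masked entries as the literal string 'NULL' instead of empty fields (a sketch based on those docs):
from astropy.io import ascii

# ascii.masked is the sentinel the writer replaces wherever a cell is masked.
t.write('photometry.csv', format='ascii.csv',
        fill_values=[(ascii.masked, 'NULL')])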
