Is there a more efficient way to do the following? Ideally, by using only one if statement?
Suppose there is a dataframe with an "author" series, a "comedy" series (default = True), and a "horror" series (default = False). I want to search the author series for "stephen king" and "lovecraft" and in those cases change the value of "comedy" from True to False and change the value of "horror" from False to True.
for count, text in enumerate(df.loc[0:, "author"]):
    if "stephen king" in str(text):
        df.loc[count, "comedy"] = False
        df.loc[count, "horror"] = True
        continue
    elif "lovecraft" in str(text):
        df.loc[count, "comedy"] = False
        df.loc[count, "horror"] = True
        continue
When I try using str.contains(), I get the error "'str' object has no attribute 'str'".
Don't enumerate a DataFrame; index and slice it instead.
ix = df.author.str.contains('stephen king')
df.loc[ix, 'comedy'] = False
df.loc[ix, 'horror'] = True
ix = df.author.str.contains('lovecraft')
df.loc[ix, 'comedy'] = False
df.loc[ix, 'horror'] = True
You can assign these values with df.loc. If the string contains 'stephen king' or 'lovecraft', put False in the 'comedy' column and True in the 'horror' column:
df.loc[df['author'].str.contains('stephen king|lovecraft'),
['comedy', 'horror']] = [False, True]
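One caveat: if the author column can contain missing values, the str.contains mask will contain NaN and the .loc assignment will fail; passing na=False treats missing authors as non-matches. A minimal variant of the same assignment, assuming that situation:
mask = df['author'].str.contains('stephen king|lovecraft', na=False)
df.loc[mask, ['comedy', 'horror']] = [False, True]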
You can check whether the column contains any of several values by using
df['Author'].str.contains('|'.join(list_of_authors))
then assign the values using loc.
Ex:
>>> df
Author Comedy Horror
0 stephen king True False
1 lovecraft True False
2 jonathan ames True True
3 stephen king False True
4 oprah True False
>>> df.loc[df['Author'].str.contains('|'.join(['stephen king','lovecraft']),case=False,na=False),('Comedy','Horror')]=False,True
>>> df
Author Comedy Horror
0 stephen king False True
1 lovecraft False True
2 jonathan ames True True
3 stephen king False True
4 oprah True False
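One thing worth hedging here: '|'.join builds a regular expression, so if an author name could contain regex metacharacters (a period, parentheses, etc.), escape each name first with re.escape. A small sketch using the df above:
import re
list_of_authors = ['stephen king', 'lovecraft']
pattern = '|'.join(re.escape(a) for a in list_of_authors)  # escape regex metacharacters
df.loc[df['Author'].str.contains(pattern, case=False, na=False), ('Comedy', 'Horror')] = False, True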
In a pandas dataframe, I have a column of mixed data types, such as text, integers and datetimes. I need to find the rows where the datetimes match: (1) exact values in some cases, (2) only the date (ignoring the time), or (3) the date and time, but ignoring seconds.
In the following code example with a mixed data type dataframe column, there are three datetimes of varying precision. Mapping the conditions into a separate dataframe works for a precise value.
import pandas as pd
import numpy as np
# example data frame
inp = [{'Id': 0, 'mixCol': np.nan},
       {'Id': 1, 'mixCol': "text"},
       {'Id': 2, 'mixCol': 43831},
       {'Id': 3, 'mixCol': pd.to_datetime("2020-01-01 00:00:00")},
       {'Id': 4, 'mixCol': pd.to_datetime("2020-01-01 01:01:00")},
       {'Id': 5, 'mixCol': pd.to_datetime("2020-01-01 01:01:01")}]
df = pd.DataFrame(inp)
print(df.dtypes)
myMap = pd.DataFrame()
myMap["Exact"] = df["mixCol"] == pd.to_datetime("2020-01-01 01:01:01")
0    False
1    False
2    False
3    False
4    False
5     True
The output I need should be:
Id Exact DateOnly NoSeconds
0  False False    False
1  False False    False
2  False False    False
3  False True     False
4  False True     True
5  True  True     True
BUT, mapping just the date, without time, maps as if the date had a time of 00:00:00.
myMap["DateOnly"] = df["mixCol"] == pd.to_datetime("2020-01-01")
Id Exact DateOnly
0  False False
1  False False
2  False False
3  False True
4  False False
5  True  False
Trying to convert values in the mixed column throws an AttributeError: 'Series' object has no attribute 'date'; and trying to use ">" and "<" to define the relevant range throws a TypeError: '>=' not supported between instances of 'str' and 'Timestamp':
myMap["DateOnly"] = df["mixCol"].date == pd.to_datetime("2020-01-01")
myMap["NoSeconds"] = (df["mixCol"] >= pd.to_datetime("2020-01-01 01:01:00")) & (df["mixCol"] < pd.to_datetime("2020-01-01 01:02:00"))
If I try to follow the solution for mixed columns in pandas proposed here, both the np.nan and the text value map True as dates.
df["IsDate"] = df.apply(pd.to_datetime, errors='coerce',axis=1).nunique(1).eq(1).map({True:True ,False:False})
I'm not sure how to proceed in this situation.
Use Series.dt.normalize to compare datetimes with the time removed (set to 00:00:00), or Series.dt.floor by day or by minute to remove the seconds:
# convert the column to datetimes; anything that is not a datetime becomes NaT
d = pd.to_datetime(df["mixCol"], errors='coerce')
myMap["DateOnly"] = d.dt.normalize() == pd.to_datetime("2020-01-01")
myMap["DateOnly"] = d.dt.floor('D') == pd.to_datetime("2020-01-01")
#alternative with dates
myMap["DateOnly"] = d.dt.date == pd.to_datetime("2020-01-01").date()
myMap['NoSeconds'] = d.dt.floor('Min') == pd.to_datetime("2020-01-01 01:01:00")
print (myMap)
Exact DateOnly NoSeconds
0 False False False
1 False False False
2 False False False
3 False True False
4 False True True
5 True True True
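For completeness, the Exact column can be computed against the coerced series d as well, so all three comparisons run on the same datetime series rather than on the mixed-type original (a one-line sketch building on the code above):
myMap["Exact"] = d == pd.to_datetime("2020-01-01 01:01:01")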
So I'm currently finishing a tutorial with the titanic dataset (https://www.kaggle.com/c/titanic/data).
Now I'm trying a couple of new things that might be related.
The info for it is (screenshot of train_df.info() not shown):
There are 891 entries (red asterisk) and columns with NaN values (blue dashes).
When I went to find a little summary of the missing values, I got confused by .sum() & .count():
With train_df.isnull().sum(), .sum() adds one for each null value, so the output is the number of missing entries for each column in the data frame (which is what I want).
However, if we use .count() we get 891 for each column, no matter whether we use .isnull().count() or .notnull().count().
So my questions are:
What does .count() mean in this context?
I thought that it would count every instance of the wanted method (in this case every instance of a null or not-null entry; basically what .sum() did).
Also, is my "definition" of how .sum() is being used correct?
Just print out train_df.isnull() and you will see it.
# data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
train_df = pd.read_csv('train.csv')
print(train_df.isnull())
result:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \
0 False False False False False False False False False
1 False False False False False False False False False
2 False False False False False False False False False
3 False False False False False False False False False
4 False False False False False False False False False
.. ... ... ... ... ... ... ... ... ...
886 False False False False False False False False False
887 False False False False False False False False False
888 False False False False False True False False False
889 False False False False False False False False False
890 False False False False False False False False False
It has 891 rows, full of True and False values.
When you use sum(), it returns the sum of every column, which adds True (= 1) and False (= 0) together, like this:
print(False+False+True+True)
2
When you use count(), it just returns the number of rows.
Of course you will get 891 for each column, no matter whether you use .isnull().count() or .notnull().count().
data.isnull().count() returns the total number of rows irrespective of missing values. You need to use data.isnull().sum().
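A minimal sketch of the difference, using a small hypothetical frame rather than the titanic data:
import numpy as np
import pandas as pd

small = pd.DataFrame({'Age': [22.0, np.nan, 38.0]})
print(small['Age'].isnull().sum())    # 1 -> number of missing values
print(small['Age'].isnull().count())  # 3 -> number of rows, missing or not
print(small['Age'].count())           # 2 -> count() on the raw column skips NaN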
Columns L, M, N of my dataframe are populated with 'true' and 'false' statements (1000 rows). I would like to create a new column 'count_false' that returns the number of times 'false' occurs across columns L, M and N.
Any tips appreciated!
Thank you.
You can negate your dataframe and sum over axis=1:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, (5, 3)), columns=list('LMN')).astype(bool)
df['Falses'] = (~df).sum(1)
print(df)
L M N Falses
0 True False True 1
1 True False False 2
2 True True True 0
3 False True False 2
4 False False True 2
If you have additional columns, you can filter accordingly:
df['Falses'] = (~df[list('LMN')]).sum(1)
Try this: df[df == False].count() (note: this counts the False values per column, not per row).
As explained in the Stack Overflow question "True and False equal to 1 and 0 in Python", something like line three of the following example should solve your problem:
import pandas as pd
df = pd.DataFrame([[True, False, True],[False, False, False],[True, False, True],[False, False, True]], columns=['L','M','N'])
df['count_false'] = 3 - (df['L']*1 + df['M']*1 + df['N']*1)
I am trying to get a list of columns and values from a data frame column which is a nested dictionary:
The data frame column looks like this:
{"id":"0","request":{"plantSearch":"true","maxResults":"51","caller":"WMS","companyCode":"GB54","purchOrg":"UPSO","Code":"5852","confidential":"false","flag":"true","service":"false","Item":"false","mastered":"true","copas":"false","pscmBlock":"false","descOperator":"CO","assocManuf":"PETK"},"response":{"hasMoreResults":"false","resultsCount":"0","execTime":"878 ms"}}
I am writing a code:
s1.columns = ['data']
l2 = []
for idx, row in s1['data'].iteritems():
    tempdf = pd.DataFrame(row['request']['plantSearch'])
    tempdf['maxResults'] = row['maxResults']
    l2.append(tempdf)
pd.concat(l2, axis=0)
The issue is that Python treats 'row' as a string even though it looks like a dictionary.
I think you can use json.loads to convert each string to a dict, then the DataFrame constructor to parse all the data from the request keys:
df = pd.DataFrame({'data':['{"id":"0","request":{"plantSearch":"true","maxResults":"51","caller":"WMS","companyCode":"GB54","purchOrg":"UPSO","Code":"5852","confidential":"false","flag":"true","service":"false","Item":"false","mastered":"true","copas":"false","pscmBlock":"false","descOperator":"CO","assocManuf":"PETK"},"response":{"hasMoreResults":"false","resultsCount":"0","execTime":"878 ms"}}','{"id":"0","request":{"plantSearch":"true","maxResults":"51","caller":"WMS","companyCode":"GB54","purchOrg":"UPSO","Code":"5852","confidential":"false","flag":"true","service":"false","Item":"false","mastered":"true","copas":"false","pscmBlock":"false","descOperator":"CO","assocManuf":"PETK"},"response":{"hasMoreResults":"false","resultsCount":"0","execTime":"878 ms"}}']})
print (df)
data
0 {"id":"0","request":{"plantSearch":"true","max...
1 {"id":"0","request":{"plantSearch":"true","max...
df1 = pd.DataFrame(df['data'].apply(lambda x: pd.io.json.loads(x)['request']).values.tolist())
print (df1)
Code Item assocManuf caller companyCode confidential copas descOperator \
0 5852 false PETK WMS GB54 false false CO
1 5852 false PETK WMS GB54 false false CO
flag mastered maxResults plantSearch pscmBlock purchOrg service
0 true true 51 true false UPSO false
1 true true 51 true false UPSO false
Similar solution:
df = pd.DataFrame([pd.io.json.loads(x)['request'] for x in df['data']])
print (df)
Code Item assocManuf caller companyCode confidential copas descOperator \
0 5852 false PETK WMS GB54 false false CO
1 5852 false PETK WMS GB54 false false CO
flag mastered maxResults plantSearch pscmBlock purchOrg service
0 true true 51 true false UPSO false
1 true true 51 true false UPSO false
Last is possible select columns by subset:
cols = ['plantSearch','maxResults']
df2 = df[cols]
print (df2)
plantSearch maxResults
0 true 51
1 true 51
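Note that recent pandas versions no longer expose pd.io.json.loads publicly; the standard-library json module gives the same result (a sketch under that assumption, reusing the df above):
import json
df1 = pd.DataFrame([json.loads(x)['request'] for x in df['data']])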
I would like to make a boolean vector created by comparing two input boolean vectors. I can use a for loop, but is there a better way to do this?
My ideal solution would look like this:
df['A'] = [True, False, False, True]
df['B'] = [True, False, False, False]
C = ((df['A']==True) or (df['B']==True)).as_matrix()
print C
>>> True, False, False, True
I think this is what you are looking for:
C = (df['A']) | (df['B'])
C
0 True
1 False
2 False
3 True
dtype: bool
You could then leave this as a series or convert it to a list or array
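For example (a small sketch; .values also works on older pandas versions):
C.tolist()     # [True, False, False, True]
C.to_numpy()   # array([ True, False, False,  True])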
Alternatively, you could use the any method with axis=1 to check across the columns. It also works for any number of columns that contain True values:
In [1105]: df
Out[1105]:
B A
0 True True
1 False False
2 False False
3 False True
In [1106]: df.any(axis=1)
Out[1106]:
0 True
1 False
2 False
3 True
dtype: bool
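If the frame has more columns than A and B, you can restrict the check to just those two (a small variant of the same idea):
C = df[['A', 'B']].any(axis=1)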