create new dataframe field using lambda function - python

I am trying to create new columns based on conditions on other columns.
(The data frame is already aggregated by user.)
This is a sample of the data frame:
event_names country
["deleteobject", "getobject"] ["us"]
["getobject"] ["ca"]
["deleteobject", "putobject"] ["ch"]
I want to create 3 new columns:
was data deleted?
was data downloaded?
did the events come from my whitelisted countries?
WHITELISTED_COUNTRIES = ["us", "sg"]
like this:
event_names country was_data_deleted? was_data_downloaded? whitelisted_country?
["deleteobject","getobject"] ["us"] True True True
["getobject"] ["ca"] False True False
["deleteobject","putobject"] ["ch"] True False False
This is what I tried so far:
result_df['was_data_deleted'] = result_df['event_name'].apply(lambda x:True if any("delete" in x for i in x) else False)
result_df['was_data_downloaded'] = result_df['event_name'].apply(lambda x:True if "getObject" in i for i in x else False)
result_df['strange_countries'] = result_df['country'].apply(lambda x:False if any(x in WHITELISTED_COUNTRIES for x in result_df['country']) else False)
I get an error: "SyntaxError: invalid syntax"
Any ideas? Thanks!

df['was_data_deleted'] = df['event_names'].apply(lambda x: 'deleteobject' in x)
df['was_data_downloaded'] = df['event_names'].apply(lambda x: 'getobject' in x)
df['whitelisted_country'] = df['country'].apply(lambda x: x[0] in WHITELISTED_COUNTRIES)
print(df)
Prints:
event_names country was_data_deleted was_data_downloaded whitelisted_country
0 [deleteobject, getobject] [us] True True True
1 [getobject] [ca] False True False
2 [deleteobject, putobject] [ch] True False False
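If you want to reproduce the answer above end to end, here is a minimal self-contained setup (the DataFrame construction is my sketch of the sample data, not part of the original answer):

import pandas as pd

WHITELISTED_COUNTRIES = ["us", "sg"]

df = pd.DataFrame({
    "event_names": [["deleteobject", "getobject"], ["getobject"], ["deleteobject", "putobject"]],
    "country": [["us"], ["ca"], ["ch"]],
})

# 'in' on a list cell already returns a boolean, so no if-else is needed:
df['was_data_deleted'] = df['event_names'].apply(lambda x: 'deleteobject' in x)
df['was_data_downloaded'] = df['event_names'].apply(lambda x: 'getobject' in x)
df['whitelisted_country'] = df['country'].apply(lambda x: x[0] in WHITELISTED_COUNTRIES)
print(df)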

You can simplify your lambda functions by removing the if-else and the explicit True/False, because the comparisons already return booleans:
WHITELISTED_COUNTRIES = ["us", "sg"]
# check each event name for the substring "delete"
f1 = lambda x: any("delete" in i for i in x)
result_df['was_data_deleted'] = result_df['event_names'].apply(f1)
# check for the exact string "getobject"
f2 = lambda x: "getobject" in x
result_df['was_data_downloaded'] = result_df['event_names'].apply(f2)
# check the country list against the whitelist
f3 = lambda x: any(y in WHITELISTED_COUNTRIES for y in x)
result_df['strange_countries'] = result_df['country'].apply(f3)
print(result_df)
event_names country was_data_deleted was_data_downloaded \
0 [deleteobject, getobject] [us] True True
1 [getobject] [ca] False True
2 [deleteobject, putobject] [ch] True False
strange_countries
0 True
1 False
2 False
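Note the difference in matching semantics between f1 and f2: f1 matches any event name containing the substring "delete", while f2 only matches an exact "getobject" element. For example (illustrative event names, not from the question):

f1(["deleteobjecttagging"])  # True  -- substring match
f2(["getobjectacl"])         # False -- exact element match only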

Related

How to take a Python pandas dataframe column and create a custom list with delimiter

So, I have a dataframe in python named s3_df:
Unnamed: 0 UniprotID Name kinhub klifs pkinfam reviewed_uniprot dunbrack_msa
0 0 A0A0B4J2F2 SIK1B False False True True True
1 1 A4QPH2 PI4KAP2 False True False False False
2 2 B5MCJ9 TRIM66 True False False False False
3 3 O00141 SGK1 True True True True True
4 4 O00238 BMPR1B|BMR1B True True True True True
.. ... ... ... ... ... ... ... ...
547 547 Q9Y616 IRAK3 True True True True True
548 548 Q9Y6E0 STK24 True True True True True
549 549 Q9Y6M4 CSNK1G3|KC1G3 True True True True True
550 550 Q9Y6R4 M3K4|MAP3K4 True True True True True
551 551 Q9Y6S9 RPS6KL1|RPKL1 True True True True True
[552 rows x 8 columns]
Essentially, what I want to do is take only the UniprotID column, join its entries with " OR ", and store the result in another variable (kinase_list).
I want that column to look like this:
A0A0B4J2F2 OR A4QPH2 OR B5MCJ9 OR O00141 OR O00238 OR O00311 OR O00329 OR O00418 OR O00443 OR O00444 OR O00506 OR O00750 OR O14578 OR O14730 OR O14733 OR O14757 OR O14874 OR O14920 OR O14936 OR O14965 OR O14976 OR O14986 OR O15021 OR O15075 OR O15111 OR O15146 OR O15164 OR O15197 OR O15264 OR O15530 OR O43187 OR O43283 OR O43293 OR O43318 OR O43353 OR O43683 OR O43781 OR O43921 OR O43930 OR O60229 OR O60285 OR O60307 OR O60331 OR O60566 OR O60674 OR O60885 OR O75116 OR O75385 OR O75460 OR O75582 OR O75676 OR O75716 OR O75747 OR O75914 OR O75962 OR O76039 OR O94768 OR O94804 OR O94806 OR O94921 OR O95382 OR O95747 OR O95819 OR O95835 OR O96013 OR O96017 OR P00519 OR P00533 OR P00540 OR P04049 OR P04626 OR P04629 OR P05129 OR P05771 OR P06213 OR P06239 OR P06241 OR P06493 OR P07332 OR P07333 OR P07947 OR P07948 OR P07949 OR P08069 OR P08581 OR P08631 OR P08922 OR P09619 OR P09769 OR P0C1S8 OR P0C263 OR P0C264 OR P10398 OR P10721 OR P11274 OR P11309 OR P11362 OR P11801 OR P11802 OR P12931 OR P14616 OR P15056 OR P15735 OR P16066 OR P16234 OR P16591 OR P17252 OR P17612 OR P17948 OR P19525 OR P19784 OR P20594 OR P20794 OR P21127 OR P21675 OR P21709 OR P21802 OR P21860 OR P22455 OR P22607 OR P22612 OR P22694 OR P23443 OR P23458 OR P24723 OR P24941 OR P25092 OR P25098 OR P25440 OR P27037 OR P27361 OR P27448 OR P28482 OR P29317 OR P29320 OR P29322 OR P29323 OR P29376 OR P29597 OR P30291 OR P30530 OR P31152 OR P31749 OR P31751 OR P32298 OR P33981 OR P34925 OR P34947 OR P35557 OR P35590 OR P35626 OR P35916 OR P35968 OR P36507 OR P36888 OR P36894 OR P36896 OR P36897 OR P37023 OR P37173 OR P41240 OR P41279 OR P41743 OR P42336 OR P42338 OR P42345 OR P42356 OR P42679 OR P42680 OR P42681 OR P42684 OR P42685 OR P43250 OR P43403 OR P43405 OR P45983 OR P45984 OR P45985 OR P46734 OR P48426 OR P48729 OR P48730 OR P48736 OR P49137 OR P49336 OR P49674 OR P49759 OR P49760 OR P49761 OR P49840 OR P49841 OR P49842 OR P50613 OR P50750 OR P51451 OR P51617 OR P51812 OR P51813 OR P51817 OR P51841 OR P51955 OR P51956 OR P51957 OR P52333 OR P52564 OR P53004 OR P53350 OR P53355 OR P53667 OR P53671 OR P53778 OR P53779 OR P54646 OR P54753 OR P54756 OR P54760 OR P54762 OR P54764 OR P57058 OR P57059 OR P57078 OR P68400 OR P78356 OR P78362 OR P78368 OR P78527 OR P80192 OR Q00526 OR Q00532 OR Q00534 OR Q00535 OR Q00536 OR Q00537 OR Q01973 OR Q01974 OR Q02156 OR Q02750 OR Q02763 OR Q02779 OR Q02846 OR Q04759 OR Q04771 OR Q04912 OR Q05397 OR Q05513 OR Q05655 OR Q05823 OR Q06187 OR Q06418 OR Q07002 OR Q07912 OR Q08345 OR Q08881 OR Q09013 OR Q12851 OR Q12852 OR Q12866 OR Q12979 OR Q13043 OR Q13131 OR Q13153 OR Q13163 OR Q13164 OR Q13177 OR Q13188 OR Q13233 OR Q13237 OR Q13263 OR Q13308 OR Q13315 OR Q13418 OR Q13464 OR Q13470 OR Q13523 OR Q13535 OR Q13546 OR Q13554 OR Q13555 OR Q13557 OR Q13627 OR Q13705 OR Q13873 OR Q13882 OR Q13976 OR Q14004 OR Q14012 OR Q14164 OR Q14289 OR Q14296 OR Q14680 OR Q15059 OR Q15118 OR Q15119 OR Q15120 OR Q15131 OR Q15139 OR Q15208 OR Q15303 OR Q15349 OR Q15375 OR Q15418 OR Q15569 OR Q15746 OR Q15759 OR Q15772 OR Q15831 OR Q15835 OR Q16288 OR Q16512 OR Q16513 OR Q16539 OR Q16566 OR Q16584 OR Q16620 OR Q16644 OR Q16654 OR Q16659 OR Q16671 OR Q16816 OR Q16832 OR Q2M2I8 OR Q32MK0 OR Q38SD2 OR Q3MIX3 OR Q496M5 OR Q504Y2 OR Q52WX2 OR Q56UN5 OR Q58A45 OR Q58F21 OR Q59H18 OR Q5JZY3 OR Q5MAI5 OR Q5S007 OR Q5TCX8 OR Q5TCY1 OR Q5VST9 OR Q5VT25 OR Q6A1A2 OR Q6DT37 OR Q6IBK5 OR Q6IQ55 OR Q6J9G0 OR Q6P0Q8 OR Q6P2M8 OR Q6P3R8 OR Q6P3W7 OR Q6P5Z2 OR Q6PHR2 OR Q6SA08 OR Q6VAB6 OR Q6XUX3 OR Q6ZMQ8 OR Q6ZN16 OR Q6ZS72 OR 
Q6ZWH5 OR Q76MJ5 OR Q7KZI7 OR Q7L7X3 OR Q7RTN6 OR Q7Z2Y5 OR Q7Z695 OR Q7Z7A4 OR Q86SG6 OR Q86TB3 OR Q86TW2 OR Q86UE8 OR Q86UX6 OR Q86V86 OR Q86Y07 OR Q86YV5 OR Q86YV6 OR Q86Z02 OR Q8IU85 OR Q8IV63 OR Q8IVH8 OR Q8IVT5 OR Q8IVW4 OR Q8IW41 OR Q8IWB6 OR Q8IWQ3 OR Q8IWU2 OR Q8IY84 OR Q8IYT8 OR Q8IZE3 OR Q8IZL9 OR Q8IZX4 OR Q8N165 OR Q8N2I9 OR Q8N4C8 OR Q8N568 OR Q8N5S9 OR Q8N752 OR Q8N8J0 OR Q8NB16 OR Q8NCB2 OR Q8NE28 OR Q8NE63 OR Q8NEB9 OR Q8NER5 OR Q8NEV1 OR Q8NEV4 OR Q8NFD2 OR Q8NG66 OR Q8NI60 OR Q8TAS1 OR Q8TBX8 OR Q8TCG2 OR Q8TD08 OR Q8TD19 OR Q8TDC3 OR Q8TDR2 OR Q8TDX7 OR Q8TEA7 OR Q8TF76 OR Q8WTQ7 OR Q8WU08 OR Q8WXR4 OR Q8WZ42 OR Q92519 OR Q92630 OR Q92772 OR Q92918 OR Q96BR1 OR Q96C45 OR Q96D53 OR Q96GD4 OR Q96GX5 OR Q96J92 OR Q96KB5 OR Q96KG9 OR Q96L34 OR Q96L96 OR Q96LW2 OR Q96NX5 OR Q96PF2 OR Q96PN8 OR Q96PY6 OR Q96Q04 OR Q96Q15 OR Q96Q40 OR Q96QP1 OR Q96QS6 OR Q96QT4 OR Q96RG2 OR Q96RR4 OR Q96RU7 OR Q96RU8 OR Q96S38 OR Q96S44 OR Q96S53 OR Q96SB4 OR Q99558 OR Q99570 OR Q99640 OR Q99683 OR Q99755 OR Q99759 OR Q99986 OR Q9BQI3 OR Q9BRS2 OR Q9BTU6 OR Q9BUB5 OR Q9BVS4 OR Q9BWU1 OR Q9BX84 OR Q9BXA6 OR Q9BXA7 OR Q9BXM7 OR Q9BXU1 OR Q9BYP7 OR Q9BYT3 OR Q9BZL6 OR Q9C098 OR Q9C0K7 OR Q9H093 OR Q9H0K1 OR Q9H1R3 OR Q9H2G2 OR Q9H2K8 OR Q9H2X6 OR Q9H3Y6 OR Q9H422 OR Q9H4A3 OR Q9H4B4 OR Q9H5K3 OR Q9H6X2 OR Q9H792 OR Q9HAZ1 OR Q9HBH9 OR Q9HBY8 OR Q9HC98 OR Q9HCP0 OR Q9NQU5 OR Q9NR20 OR Q9NRH2 OR Q9NRL2 OR Q9NRM7 OR Q9NRP7 OR Q9NSY0 OR Q9NSY1 OR Q9NWZ3 OR Q9NY57 OR Q9NYL2 OR Q9NYV4 OR Q9NYY3 OR Q9NZJ5 OR Q9P0L2 OR Q9P1W9 OR Q9P286 OR Q9P289 OR Q9P2K8 OR Q9UBE8 OR Q9UBF8 OR Q9UBS0 OR Q9UEE5 OR Q9UEW8 OR Q9UF33 OR Q9UHD2 OR Q9UHY1 OR Q9UIG0 OR Q9UIK4 OR Q9UJY1 OR Q9UK32 OR Q9UKE5 OR Q9UKI8 OR Q9UL54 OR Q9UM73 OR Q9UPE1 OR Q9UPN9 OR Q9UPZ9 OR Q9UQ07 OR Q9UQ88 OR Q9UQB9 OR Q9UQM7 OR Q9Y243 OR Q9Y2H1 OR Q9Y2H9 OR Q9Y2K2 OR Q9Y2U5 OR Q9Y3S1 OR Q9Y463 OR Q9Y4A5 OR Q9Y4K4 OR Q9Y572 OR Q9Y5P4 OR Q9Y5S2 OR Q9Y616 OR Q9Y6E0 OR Q9Y6M4 OR Q9Y6R4 OR Q9Y6S9
You can use join:
kinase_list = ' OR '.join(s3_df['UniprotID'])
Output:
'A0A0B4J2F2 OR A4QPH2 OR B5MCJ9 OR O00141 OR O00238 OR ... OR Q9Y616 OR Q9Y6E0 OR Q9Y6M4 OR Q9Y6R4 OR Q9Y6S9'
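As a quick self-contained sanity check of the join (toy IDs, not the real data):

import pandas as pd

s3_df = pd.DataFrame({"UniprotID": ["A0A0B4J2F2", "A4QPH2", "B5MCJ9"]})
kinase_list = ' OR '.join(s3_df['UniprotID'])
print(kinase_list)  # A0A0B4J2F2 OR A4QPH2 OR B5MCJ9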

How to make this code not consume so much RAM?

I have these two functions, and when I run them my kernel dies very quickly. What can I do to prevent it? It happens after appending about 10 files to the dataframe. Unfortunately the JSON files are quite big (approx. 150 MB each, and there are dozens of them) and I have no idea how to join them together.
import os
import pandas as pd
from pandas.io.json import json_normalize
import json

def filtering_nodes(df):
    id_list = df.index.tolist()
    print("Dropping rows without 4 nodes and 3 members...")
    for x in id_list:
        if len(df['Nodes'][x]) != 4 and len(df['Members'][x]) != 3:
            df = df.drop(x)
    print("Converting to csv...")
    df.to_csv("whole_df.csv", sep='\t')
    return df

def merge_JsonFiles(filename):
    result = list()
    cnt = 0
    df_all = None
    data_all = None
    for f1 in filename:
        print("Appending file: ", f1)
        with open('../../data' + f1, 'r') as infile:
            data_all = json.loads(infile.read())
        if cnt == 0:
            df_all = pd.json_normalize(data_all, record_path=['List2D'], max_level=2, sep="-")
        else:
            df_all = df_all.append(pd.json_normalize(data_all, record_path=['List2D'], max_level=2, sep="-"), ignore_index=True)
        cnt += 1
    return df_all

files = os.listdir('../../data')
df_all_test = merge_JsonFiles(files)
df_all_test_drop = filtering_nodes(df_all_test)
EDIT:
Following @jlandercy's answer, I've made this:
def merging_to_csv():
    for path in pathlib.Path("../../data/loads_data/Dane/hilti/").glob("*.json"):
        # Open source files one by one:
        with path.open() as handler:
            df = pd.json_normalize(json.load(handler), record_path=['List2D'])
        # Identify rows to drop (boolean indexing):
        q = (df["Nodes"] != 4) & (df["Members"] != 3)
        # Inplace drop (no extra copy in RAM):
        df.drop(q, inplace=True)
        # Append data to disk instead of RAM:
        df.to_csv("output.csv", mode="a", header=False)

merging_to_csv()
and I get this error:
KeyError Traceback (most recent call last)
<ipython-input-55-cf18265ca50e> in <module>
----> 1 merging_to_csv()
<ipython-input-54-698c67461b34> in merging_to_csv()
51 q = (df["Nodes"] != 4) & (df["Members"] != 3)
52 # Inplace drop (no extra copy in RAM):
---> 53 df.drop(q, inplace=True)
54 # Append data to disk instead of RAM:
55 df.to_csv("output.csv", mode="a", header=False)
/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
4906 level=level,
4907 inplace=inplace,
-> 4908 errors=errors,
4909 )
4910
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
4148 for axis, labels in axes.items():
4149 if labels is not None:
-> 4150 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
4151
4152 if inplace:
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in _drop_axis(self, labels, axis, level, errors)
4183 new_axis = axis.drop(labels, level=level, errors=errors)
4184 else:
-> 4185 new_axis = axis.drop(labels, errors=errors)
4186 result = self.reindex(**{axis_name: new_axis})
4187
/opt/conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
6016 if mask.any():
6017 if errors != "ignore":
-> 6018 raise KeyError(f"{labels[mask]} not found in axis")
6019 indexer = indexer[~mask]
6020 return self.delete(indexer)
KeyError: '[ True True True True True True True True True True True True\n True True True True True True True True True True True True\n True True True True True True True True True True True True\n True True True True True True True True True True True True\n True True True True True True True True True True True True\n True True True True True True True True True True True True\n True True True True True True True True True True True True\n True True True True True True True True True True True True\n True] not found in axis'
What's wrong? I've uploaded the two smallest JSON files here:
https://drive.google.com/drive/folders/1xlC-kK6NLGr0isdy1Ln2tzGmel45GtPC?usp=sharing
You are facing multiple issues in your original approach:
Multiple copies of the dataframe: df = df.drop(...);
The whole dataset kept in RAM because of append;
An unnecessary for loop to filter rows; use boolean indexing instead.
The KeyError in your edit comes from passing a boolean mask to df.drop, which expects index labels rather than booleans; filtering with boolean indexing avoids it.
Here is a baseline snippet to solve your problem, based on the data sample you provided:
import json
import pathlib
import pandas as pd

# Iterate over source files:
for path in pathlib.Path(".").glob("result*.json"):
    # Open source files one by one:
    with path.open() as handler:
        # Normalize the JSON model:
        df = pd.json_normalize(json.load(handler), record_path=['List2D'], max_level=2, sep="-")
    # Apply len to the list fields to identify rows to drop (boolean indexing):
    q = (df["Nodes"].apply(len) != 4) & (df["Members"].apply(len) != 3)
    # Filter and append data to disk instead of RAM:
    df.loc[~q, :].to_csv("output.csv", mode="a", header=False)
It loads files into RAM one by one, then appends the filtered rows to disk rather than accumulating them in RAM. These fixes drastically reduce RAM usage, which should stay bounded at roughly twice the size of the biggest JSON file.
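One detail worth flagging with mode="a" (my addition, not part of the original answer): as written, output.csv never gets a header row. A common pattern is to write the header only when the file does not exist yet:

out = pathlib.Path("output.csv")
# Write the header on the first pass only, then keep appending rows:
df.loc[~q, :].to_csv(out, mode="a", header=not out.exists(), index=False)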

Better way to replace my function?

I have attached a link to the JSON data for download:
json data
Currently I have written the following function for getting each level of children data into a combined dataframe:
def get_children(catMapping):
    level4 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', 'children', 'children', 'children', ['children']])
    level3 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', 'children', 'children', ['children']])
    level2 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', 'children', ['children']])
    level1 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children', ['children']])
    level0 = json_normalize(catMapping['SuccessResponse']['Body'],
                            ['children'])
    combined = pd.concat([level0, level1, level2, level3, level4])
    combined = combined.reset_index(drop=True)
    return combined
And it looks like this is not the recommended way, but I am unable to write a function that traverses every level on its own.
Can you please help me with any better function?
Here is a function that recursively iterates over all items:
import pandas as pd
import ast

with open(r"data.json", "r") as f:
    data = ast.literal_eval(f.read())

def nest_iter(items):
    for item in items:
        children_ids = [o["categoryId"] for o in item["children"]]
        ret_item = item.copy()
        ret_item["children"] = children_ids
        yield ret_item
        yield from nest_iter(item["children"])

df = pd.DataFrame(nest_iter(data['SuccessResponse']['Body']))
The result:
categoryId children leaf name var
....
4970 10001244 [] True Business False
4971 10001245 [] True Casual False
4972 10001246 [] True Fashion False
4973 10001247 [] True Sports False
4974 7756 [7761, 7758, 7757, 7759, 7760] False Women False
4975 7761 [] True Accessories False
4976 7758 [] True Business False
4977 7757 [] True Casual False
4978 7759 [] True Fashion False
4979 7760 [] True Sports False
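If you also need to know which level of the tree each row came from (as the original per-level function implied), a small variation of the generator (a sketch on my part, not from the original answer) threads the depth through the recursion:

def nest_iter(items, level=0):
    for item in items:
        ret_item = item.copy()
        ret_item["children"] = [o["categoryId"] for o in item["children"]]
        ret_item["level"] = level  # depth of this node in the tree
        yield ret_item
        yield from nest_iter(item["children"], level + 1)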

pandas custom file format parsing

I have data in the following format:
1_engineer_grade1 |Boolean IsMale IsNorthAmerican IsFromUSA |Name blah
2_lawyer_grade7 |Boolean IsFemale IsAlive |Children 2
I need to convert this into a dataframe with the following columns:
id job grade Bool.IsMale Bool.IsFemale Bool.IsAlive Bool.IsNorthAmerican Bool.IsFromUSA Name Children
1 engineer 1 True False False True True blah NaN
2 lawyer 7 False True True True False NaN 2
I could preprocess this data in Python and then call pd.DataFrame on it, but I was wondering if there is a better way of doing this.
UPDATE: I ended up doing the following. If there are obvious optimizations, please let me know.
with open(vwfile, encoding='latin-1') as f:
    data = []
    for line in f:
        line = [x.strip() for x in line.strip().split('|')]
        # line == [
        #     "1_engineer_grade1",
        #     "Boolean IsMale IsNorthAmerican IsFromUSA",
        #     "Name blah"
        # ]
        ident, job, grade = line[0].split("_")
        grade = grade.replace("grade", "")  # "grade1" -> "1", matching the desired output
        features = line[1:]
        bools = {
            "IsMale": False,
            "IsFemale": False,
            "IsNorthAmerican": False,
            "IsFromUSA": False,
            "IsAlive": False,
        }
        others = {}
        for category in features:
            if category.startswith("Boolean "):
                # Every token after "Boolean" is a flag that is True for this row:
                for feature in category.split(' ')[1:]:
                    bools[feature] = True
            else:
                feature = category.split(" ")
                # feature == ["Name", "blah"]
                others[feature[0]] = feature[1]
        featuredict = {
            'ident': ident,
            'job': job,
            'grade': grade,
        }
        featuredict.update(bools)
        featuredict.update(others)
        data.append(featuredict)
df = pd.DataFrame(data)
UPDATE 2: A million-line file took about 55 seconds to process with this approach.
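One small simplification worth considering (a sketch, untested on the real file): if bools starts as an empty dict and only the flags actually present in a line are set, pandas can fill in the missing flags after the frame is built, instead of hard-coding every flag name up front:

df = pd.DataFrame(data)
bool_cols = [c for c in df.columns if c.startswith("Is")]
# Rows that never saw a given flag hold NaN; treat those as False:
df[bool_cols] = df[bool_cols].fillna(False)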

NLTK classifier precision and recall are always none (0)

I have used the Python NLTK library and its Naive Bayes classifier to detect whether a string should be tagged "php" or not, based on training data (Stack Overflow questions, in fact).
The classifier seems to find interesting features:
Most Informative Features
     contains-word-isset = True             True : False  =  125.6 : 1.0
      contains-word-echo = True             True : False  =   28.1 : 1.0
       contains-word-php = True             True : False  =   17.1 : 1.0
     contains-word-this- = True             True : False  =   16.0 : 1.0
     contains-word-mysql = True             True : False  =   14.3 : 1.0
      contains-word-_get = True             True : False  =   11.7 : 1.0
   contains-word-foreach = True             True : False  =    7.6 : 1.0
Features are defined as follows:
def features(question):
    features = {}
    for token in detectorTokens:
        featureName = "contains-word-" + token
        features[featureName] = (token in question)
    return features
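For instance, with a toy detectorTokens list (assumed here, since it is defined elsewhere in the asker's code):

detectorTokens = ["php", "echo", "isset"]
print(features("is this a php question?"))
# {'contains-word-php': True, 'contains-word-echo': False, 'contains-word-isset': False}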
but it seems the classifier decided to never tag a string as a "php" question.
Even a simple string like "is this a php question?" is classified as False.
Can anyone help me understand this phenomenon?
Here is some partial code (I have 3 or 4 pages of code, so this is just a small part):
classifier = nltk.NaiveBayesClassifier.train(train_set)
cross_valid_accuracy = nltk.classify.accuracy(classifier, cross_valid_set)

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(cross_valid_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print 'Precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
print 'Recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
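One likely explanation (an observation on my part, not an answer from the original thread): the labels in cross_valid_set appear to be True/False (see the Most Informative Features output above), but the metrics are looked up under the key 'pos'. If no example carries the literal label 'pos', both refsets['pos'] and testsets['pos'] are empty, and nltk.metrics.precision/recall return None. Computing the metrics against the actual label would fix the lookup:

print 'Precision:', nltk.metrics.precision(refsets[True], testsets[True])
print 'Recall:', nltk.metrics.recall(refsets[True], testsets[True])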
