Tokenize text in Pandas dataframe - python

I have a Pandas DataFrame with scripts collected from an external source. The column text_content contains the script contents. The longest script is 85,617 characters long.
A sample to give you an idea:
The scripts contain table names and other useful information. Currently, the dataframe is written to a SQLite database table, which can then be searched using ad-hoc SQL statements (and distributed to a larger crowd).
A common use case is that we'll have a list of table names and would like to know which scripts they appear in. Doing this in SQL would require wildcard searches using the LIKE operator, which performs poorly.
Thus, I wanted to extract the words from the script while it's still in a DataFrame, resulting in a two-column table, with each row consisting of:
a link to the original script row
a word that was found in the script
Each script would result in a number of rows (depending on the number of matches).
So far, I wrote this to extract the words from the script:
DataFrame(df[df.text_type == 'DISCRIPT']
          .dropna(subset=['text_content'])
          .apply(lambda x: re.findall(r'([a-zA-Z]\w+)', x['text_content']), axis=1)
          .tolist())
The result:
So far, so good (?).
There are two more steps I need to go through, but I'm a little stuck here.
Remove a list of common words (e.g. SQL reserved words).
Reshape the DataFrame so each row is a match, but with a link to the script in the original DataFrame.
I could use T to transpose the DataFrame, use replace() with a predefined list of reserved words (replacing them with an NA value), and finally use dropna() to keep only the relevant words. However, I'm not sure this is the best approach.
I'd very much appreciate your comments and suggestions!

IIUC, you can try adding index=df.index to the df2 constructor, then reshape with stack and filter with isin:
print(df)
                            text_content text_name text_type
1614  CHECK FOR LOCK STATUS CACHETABLEDB      TEXT  DISCRIPT
1615  CHECK FOR LOCK STATUS CACHETABLEDB      TEXT  DISCRIPT
df2 = pd.DataFrame(df[df.text_type == 'DISCRIPT']
                   .dropna(subset=['text_content'])
                   .apply(lambda x: re.findall(r'([a-zA-Z]\w+)', x['text_content']), axis=1)
                   .tolist(), index=df.index)
print(df2)
          0    1     2       3             4
1614  CHECK  FOR  LOCK  STATUS  CACHETABLEDB
1615  CHECK  FOR  LOCK  STATUS  CACHETABLEDB
# reshape the word columns into rows (one row per word)
df2 = df2.stack().reset_index(level=0)
df2.columns = ['id', 'words']
L = ['CACHETABLEDB','STATUS']
#remove reserved words
df2 = df2.loc[~df2.words.isin(L)].reset_index(drop=True)
print(df2)
     id  words
0  1614  CHECK
1  1614    FOR
2  1614   LOCK
3  1615  CHECK
4  1615    FOR
5  1615   LOCK
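On a recent pandas version (0.25 or newer), an alternative sketch, not part of the original answer, is str.findall plus explode, which goes straight from the text column to one row per word using the same df as above; stop_words below is just an illustrative stand-in for your SQL reserved-word list:
import pandas as pd

stop_words = {'CHECK', 'FOR'}   # illustrative; substitute your SQL reserved words

words = (df[df.text_type == 'DISCRIPT']
         .dropna(subset=['text_content'])
         ['text_content']
         .str.findall(r'[a-zA-Z]\w+')    # one list of words per script
         .explode()                      # one row per word; the index still points at the script row
         .rename('word')
         .reset_index()
         .rename(columns={'index': 'id'}))

words = words[~words['word'].isin(stop_words)].reset_index(drop=True)
print(words)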

Related

Find which columns contain a certain value for each row in a dataframe

I have a dataframe, df, shown below. Each row is a story and each column is a word that appears in the corpus of stories. A 0 means the word is absent in the story while a 1 means the word is present.
I want to find which words are present in each story (i.e. col val == 1). How can I go about finding this (preferably without for-loops)?
Thanks!
Assuming you are just trying to look at one story, you can filter for the story (let's say story 34972) and transpose the dataframe with:
df_34972 = df[df.index == 34972].T
and then you can collect the words whose value equals 1 into a list:
[*df_34972[df_34972[34972] == 1].index]
If you are trying to do this for all stories, then you can do this, but it will be a slightly different technique. From the link that SammyWemmy provided, you can melt() the dataframe and filter for 1 values for each story. From there you could .groupby('story_column') which is 'index' (after using reset_index()) in the example below:
df = df.reset_index().melt(id_vars='index')
df = df[df['value'] == 1]
df.groupby('index')['variable'].apply(list)
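For completeness, here is a minimal, self-contained sketch of the melt approach on made-up data (the story ids and words are illustrative):
import pandas as pd

df = pd.DataFrame({'wolf': [1, 0], 'castle': [0, 1], 'dragon': [1, 1]},
                  index=[34972, 34973])

long_df = df.reset_index().melt(id_vars='index')   # columns: index, variable, value
present = long_df[long_df['value'] == 1]
print(present.groupby('index')['variable'].apply(list))
# 34972      [wolf, dragon]
# 34973    [castle, dragon]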

Merge on multiple OR conditions in Python

I have two dataframes. One has product ID information and the other is a master data frame with a zone mapping along with a mapping ID.
# this is dummy dataframe for example
product_df = pd.DataFrame([['abc','+1235'],['cshdgs','+1352648'],['gdsfsn','+1232455']], columns=['roll','prod_id'])
master_df = pd.DataFrame([['AZ','32'],['WW','123'],['RT','12'],['PO','13'],['SZ','1352']], columns=['Zone','match_id'])
I want to get the Zone information alongside the product_df records, and the logic for that (in SQL) is:
select product_df.*, master_df.Zone from product_df left join master_df on '+'||match_id = substr(prod_id,1,2) or '+'||match_id = substr(prod_id,1,3) or '+'||match_id = substr(prod_id,1,4) or '+'||match_id = substr(prod_id,1,5)
Basically this will be a merge on condition situation.
I know that, due to this joining logic, multiple zones may be mapped to the same roll, but that is the ask.
I am trying to implement the same in Python. I am using the following code:
master_df_dict = master_df.set_index('match_id')['Zone'].to_dict()
keys_list = ['+' + key for key in master_df_dict.keys()]

def zone(pr_id):
    # look up the zone via the matched key; the first character is always '+',
    # so the dictionary key starts at index 1
    if pr_id[0:2] in keys_list:
        C = master_df_dict[pr_id[1:2]]
    elif pr_id[0:3] in keys_list:
        C = master_df_dict[pr_id[1:3]]
    elif pr_id[0:4] in keys_list:
        C = master_df_dict[pr_id[1:4]]
    elif pr_id[0:5] in keys_list:
        C = master_df_dict[pr_id[1:5]]
    else:
        C = ''
    return C

product_df['Zone_info'] = product_df['prod_id'].apply(zone)
There are two problems with this approach:
It only gives me the first matched code; even if later conditions would also match, it returns as soon as one condition matches.
Using this approach, it is taking me around 45 mins to parse 1700 records.
I need help in the following areas:
Why is the above code taking so long to run? How can I make it faster?
Can I get an exact Python equivalent of the SQL logic above, i.e. joining based on multiple conditions? As far as I have searched, pandas does not have the functionality to merge on conditions the way SQL does. Is there a workaround for this?
If there is no way to merge on all the OR conditions, is there a way to merge on at least the first matching condition and make it run faster?
Please help!!
Your matching criteria are not really a key. In these cases I form a Cartesian product, then establish a way to find the unique rows required. This is roughly equivalent to how a relational database performs a join, especially when the join condition is an inefficient expression.
Cartesian product idiom
derive 3 additional columns: a) expr, a regular expression used to match match_id against prod_id; b) len, the length of match_id, used in sort_values(); c) join, True if match_id lines up with prod_id on the criteria you specified
sort and take first interesting record
reset values where no join was found
drop the columns used to get this working
A better solution would be to have a reliable join key...
import re
product_df= pd.DataFrame(
[['abc','+1235'],['cshdgs','+1352648'],['gdsfsn','+1232455']],columns=['roll','prod_id'])
master_df=pd.DataFrame([['AZ','32'],['WW','123'],['RT','12'],['PO','13'],['SZ','1352']], columns=['Zone','match_id'])
cp = product_df.assign(foo=1).merge(master_df.assign(foo=1)).drop("foo", axis=1)
cp["len"] = cp.match_id.str.len()
cp["expr"] = cp.apply(lambda r: "^[+]" + "".join([f"[{c}]{'' if i<2 else '?'}" for i, c in enumerate(r.match_id)]), axis=1)
cp["join"] = cp.apply(lambda r: re.search(r.expr, r.prod_id) is not None, axis=1)
cp = cp.sort_values(["Zone", "join", "len"], ascending=[True, False, False]).reset_index()\
.groupby(["Zone"]).first().reset_index()
cp.loc[~cp["join"],("roll","prod_id")] = ""
cp.drop(["len","expr","join","index"], axis=1)
output
  Zone    roll   prod_id match_id
0   AZ                         32
1   PO  cshdgs  +1352648       13
2   RT     abc     +1235       12
3   SZ  cshdgs  +1352648     1352
4   WW     abc     +1235      123
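If the full Cartesian product gets too large, a hedged alternative (a sketch on the same example frames, not the approach above) is to expand each prod_id into its candidate prefixes and then do an ordinary merge; this keeps every matching zone per roll, which is what the question asks for, and needs pandas 0.25+ for explode:
import pandas as pd

product_df = pd.DataFrame(
    [['abc', '+1235'], ['cshdgs', '+1352648'], ['gdsfsn', '+1232455']],
    columns=['roll', 'prod_id'])
master_df = pd.DataFrame(
    [['AZ', '32'], ['WW', '123'], ['RT', '12'], ['PO', '13'], ['SZ', '1352']],
    columns=['Zone', 'match_id'])

# one row per (roll, prod_id, candidate prefix); the prefix lengths mirror
# substr(prod_id, 1, 2) .. substr(prod_id, 1, 5) with the leading '+' stripped
candidates = (product_df
              .assign(match_id=product_df['prod_id']
                      .apply(lambda p: [p[1:n] for n in range(2, 6)]))
              .explode('match_id'))

# a plain inner merge now gives the OR semantics; use how='left' if rolls
# without any match should be kept as well
result = candidates.merge(master_df, on='match_id')
print(result[['roll', 'prod_id', 'Zone', 'match_id']])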

Ignoring multiple commas while reading csv in pandas

I'm trying to read multiple files whose names start with 'site_'. For example, file names like site_1, site_a.
Each file has data like :
Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php
I need two columns in my pandas df: Login_id and Web.
I am facing an error when I try to read records like line 2.
df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)
I am facing the following error :
ValueError: Columns must be same length as key.
Please let me know where I am making a mistake, and any good approach to solve the problem. Thanks
Solution 1: use split with argument n=1 and expand=True.
result= df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns= ['Login_id', 'Web']
That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).
EDIT Solution 2: there is a nicer regex-based solution which uses a pandas function:
result = df['Login_id, Web'].str.extract(r'^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)
This splits the field and uses the names of the matching groups to create columns with their content. The output is:
Login_id URL
0 1 http://www.x1.com
1 2 http://www.x1.com,as.php
Solution 3: conventional version with regex:
You could do something customized, e.g with a regex:
import re
sp_re= re.compile('([^,]*),(.*)')
aux_series= df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id']= aux_series.str[0]
df['URL']= aux_series.str[1]
The result on your example data is:
Login_id, Web Login_id URL
0 1,http://www.x1.com 1 http://www.x1.com
1 2,http://www.x1.com,as.php 2 http://www.x1.com,as.php
Now you could drop the column 'Login_id, Web'.
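If you would rather handle the extra commas at read time instead of after read_csv, here is a plain-Python sketch (using the file name from the question) that splits each line on the first comma only:
import pandas as pd

rows = []
with open('site_1') as fh:
    next(fh)                            # skip the "Login_id, Web" header line
    for line in fh:
        if not line.strip():
            continue                    # ignore blank lines
        login_id, web = line.rstrip('\n').split(',', 1)   # split on the first comma only
        rows.append((login_id, web))

df = pd.DataFrame(rows, columns=['Login_id', 'Web'])
print(df)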

most efficient method to extract key: value pairs from a pandas column and use the keys as columns

I have a large data set, 50,000 or so CSVs each containing about 40,000 lines, that I need to read into dataframes, extract the key: value pairs from, and use them as columns/values in the same dataframe. The excerpt below is a single column of my pandas dataframe:
column
'this is my string of data., you can: parse me now, but: you will never find me'
'this is some crazy data., where are: you at today?, you can: never find me, but: this is fun.'
'this is more crazy than ever, where are:, you can: not parse me, strange: stuff'
How can I extract just the key: value pairs that match the following criteria? I am trying to do this in the most efficient way possible, since I am iterating over many files.
between two commas
must contain a colon
two spaces after the colon
any character to include spaces
The expected output expands the keys into columns, with their values beneath them:
you can        but                     where are      strange    <== columns
parse me now   you will never find me  NONE           NONE
never find me  this is fun             you at today?  NONE
not parse me   NONE                    NONE           stuff
Updated visual of the data:
                     1
0
0 subject          NaN
strange sub        AcDe1
strange name       i001$
stuff and things   86753
newby id           09
You can use extractall to get all the key value pairs in long format, and then transform it so the keys become the column headers, assuming the original column name is col here:
(df.col.str.extractall(r"([^,]+?):(?:\s{2}([^,]+))?")
.reset_index(level=1, drop=True)
.set_index(0, append=True)[1]
.unstack(level=1))
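A minimal, self-contained run of that pipeline on the strings from the question (written with two spaces after each colon, per the stated criteria) would look roughly like this:
import pandas as pd

df = pd.DataFrame({'col': [
    'this is my string of data., you can:  parse me now, but:  you will never find me',
    'this is some crazy data., where are:  you at today?, you can:  never find me, but:  this is fun.',
    'this is more crazy than ever, where are:, you can:  not parse me, strange:  stuff',
]})

wide = (df.col.str.extractall(r"([^,]+?):(?:\s{2}([^,]+))?")
          .reset_index(level=1, drop=True)
          .set_index(0, append=True)[1]
          .unstack(level=1))
print(wide)   # one column per extracted key, NaN where a key has no value in that row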

Pandas: query string where column name contains special characters

I am working with a data frame that has a structure something like the following:
In[75]: df.head(2)
Out[75]:
  statusdata             participant_id association  latency response  \
0   complete  CLIENT-TEST-1476362617727        seeya      715  dislike
1   complete  CLIENT-TEST-1476362617727       welome      800     like

    stimuli elementdata statusmetadata demo$gender demo$question2  \
0  Sample B    semi_imp       complete        male             23
1  Sample C    semi_imp       complete      female             23
I want to be able to run a query string against the column demo$gender.
I.e,
df.query("demo$gender=='male'")
But this has a problem with the $ sign. If I replace the $ sign with another delimiter (like -), the problem persists. Can I fix up my query string to avoid this problem? I would prefer not to rename the columns, as they correspond tightly with other parts of my application.
I really want to stick with a query string as it is supplied by another component of our tech stack and creating a parser would be a heavy lift for what seems like a simple problem.
Thanks in advance.
With the most recent version of pandas, you can escape a column name that contains special characters with a backtick (`):
df.query("`demo$gender` == 'male'")
Another possibility is to clean the column names as a previous step in your process, replacing special characters with something more appropriate.
For instance:
(df
.rename(columns = lambda value: value.replace('$', '_'))
.query("demo_gender == 'male'")
)
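A quick, self-contained comparison of both options on a toy frame (the column names and values are made up to mirror the question):
import pandas as pd

df = pd.DataFrame({'participant_id': ['a', 'b'],
                   'demo$gender': ['male', 'female']})

# backtick-quoted column name inside the query string (pandas 0.25+)
males = df.query("`demo$gender` == 'male'")

# or rename first and query with a plain identifier
males_alt = (df.rename(columns=lambda c: c.replace('$', '_'))
               .query("demo_gender == 'male'"))

print(males)
print(males_alt)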
For the interested, here is a simple procedure I used to accomplish the task:
# Identify invalid column names
invalid_column_names = [x for x in list(df.columns.values) if not x.isidentifier()]

# Make replacements in the query and keep track
# NOTE: This method fails if the frame has columns called REPL_0 etc.
replacements = dict()
for cn in invalid_column_names:
    r = 'REPL_' + str(invalid_column_names.index(cn))
    query = query.replace(cn, r)
    replacements[cn] = r

inv_replacements = {replacements[k]: k for k in replacements.keys()}

df = df.rename(columns=replacements)    # Rename the columns
df = df.query(query)                    # Carry out query
df = df.rename(columns=inv_replacements)
Which amounts to identifying the invalid column names, transforming the query and renaming the columns. Finally we perform the query and then translate the column names back.
Credit to #chrisb for their answer that pointed me in the right direction
The current implementation of query requires the string to be a valid Python expression, so column names must be valid Python identifiers. Your two options are renaming the column, or using a plain boolean filter, like this:
df[df['demo$gender'] == 'male']
