I'm having trouble matching regexes across two different dataframes that are linked by Type and Country. Here are samples of the data df and the regex df. Note that the two dataframes have different shapes because the regex df contains only unique (Country, Type) combinations.
**Data df**

Country Type Data
MY      ABC  MY1234567890
IT      ABC  IT1234567890
PL      PQR  PL123456
MY      XYZ  456792abc
IT      ABC  MY45889976
IT      ABC  IT567888976

**Regex df**

Country Type Regex
MY      ABC  ^MY[0-9]{10}
IT      ABC  ^IT[0-9]{10}
PL      PQR  ^PL
MY      XYZ  ^\w{6,10}$
I have tried merging them together and using a lambda to do the matching. Below is my code:

import re

df = df.merge(df_regex, on='Country')
df['Data Quality'] = df.apply(lambda r: 1 if re.match(r['Regex'], r['Data']) else 0, axis=1)
But it adds another row for each different Type within the same Country, so there is a lot of duplication, which is inefficient and time consuming. Is there a pythonic way to match each Data value against the regex for its own Country and Type, with the patterns kept in another dataframe, without merging those 2 dataframes? If the Data matches its own regex it should return 1, else 0.
To avoid the repetition based on Type, include Type in the join keys as well. Now apply the lambda:
df2 = df.merge(df_regex, on=['Country', 'Type'])
df2['Data Quality'] = df2.apply(lambda r: 1 if re.match(r['Regex'], r['Data']) else 0, axis=1)
df2
It will give you the following output.
Country Type Data Regex Data Quality
0 MY ABC MY1234567890 ^MY[0-9]{10} 1
1 IT ABC IT1234567890 ^IT[0-9]{10} 1
2 IT ABC MY45889976 ^IT[0-9]{10} 0
3 IT ABC IT567888976 ^IT[0-9]{10} 0
4 PL PQR PL123456 ^PL 1
5 MY XYZ 456792abc ^\w{6,10}$ 1
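If you would rather avoid the merge entirely, as the question asks, one possibility is to build a lookup of compiled patterns keyed by (Country, Type) and score each row against its own pattern. This is only a sketch, assuming the df and df_regex column names shown above:

import re

# Map each (Country, Type) pair to its compiled pattern
patterns = {(r.Country, r.Type): re.compile(r.Regex)
            for r in df_regex.itertuples(index=False)}

# Score every data row against the pattern for its own Country/Type
df['Data Quality'] = [1 if patterns[(c, t)].match(d) else 0
                      for c, t, d in zip(df['Country'], df['Type'], df['Data'])]

This compiles each regex only once and never materializes the merged frame, at the cost of raising a KeyError if a (Country, Type) pair has no pattern in df_regex.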
Related
I have a dataset which looks like the following (in a dataframe):
**_id** **paper_title** **references** **full_text**
1 XYZ [{'abc':'something','def':'something'},{'def':'something'},...many others] something
2 XYZ [{'abc':'something','def':'something'},{'def':'something'},...many others] something
3 XYZ [{'abc':'something'},{'def':'something'},...many others] something
Expected:

**_id** **paper_title** **abc**    **def**    **full_text**
1       XYZ             something  something  something
                        something  something
                        ...
(one row per dict in the references list, for each _id)
2       XYZ             something  something  something
                        something  something
                        ...
(one row per dict in the references list, for each _id)
I have tried df['column_name'].apply(pd.Series).apply(pd.Series) to split the lists and dictionaries into dataframe columns, but it doesn't help, as it doesn't split the dictionaries.
Assuming your original DataFrame's references column holds lists of dictionaries, each with one key:value pair whose key is named 'reference':
print(df)
id paper_title references full_text
0 1 xyz [{'reference': 'description1'}, {'reference': ... some text
1 2 xyz [{'reference': 'descriptiona'}, {'reference': ... more text
2 3 xyz [{'reference': 'descriptioni'}, {'reference': ... even more text
Then you can use concat to separate out your references with their index:
df1 = pd.concat([pd.DataFrame(i) for i in df['references']],
                keys=df.index).reset_index(level=1, drop=True)
print(df1)
reference
0 description1
0 description2
0 description3
1 descriptiona
1 descriptionb
1 descriptionc
2 descriptioni
2 descriptionii
2 descriptioniii
Then use DataFrame.join to join the columns back together on their index:
df = df.drop('references', axis=1).join(df1).reset_index(drop=True)
print(df)
id paper_title full_text reference
0 1 xyz some text description1
1 1 xyz some text description2
2 1 xyz some text description3
3 2 xyz more text descriptiona
4 2 xyz more text descriptionb
5 2 xyz more text descriptionc
6 3 xyz even more text descriptioni
7 3 xyz even more text descriptionii
8 3 xyz even more text descriptioniii
After a lot of reading of the pandas documentation, I found that the explode method combined with apply(pd.Series) is the easiest way to do what I was asking. Here is the code:
df = df.explode('references')
# explode turns each element of the list into its own row
df = df['references'].apply(pd.Series).merge(df, left_index=True, right_index=True, how='outer')
# split the dicts inside each cell into columns and merge them back with
# the original dataframe, like (A ∪ B) in set theory
Side note: when merging, watch for duplicated values across columns, as the merge will produce several columns with duplicated values. I hope this helps anyone with a DataFrame/Series column holding lists of multiple dictionaries who wants to split the dictionary keys into new columns, with the values as their rows.
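For comparison, here is a self-contained sketch of the same idea using explode together with pd.json_normalize (available in pandas >= 1.0) on made-up data shaped like the question's; the column names mirror the example above:

import pandas as pd

df = pd.DataFrame({
    '_id': [1, 2],
    'paper_title': ['XYZ', 'XYZ'],
    'references': [[{'abc': 'a1', 'def': 'd1'}, {'def': 'd2'}],
                   [{'abc': 'a2'}]],
    'full_text': ['some text', 'more text'],
})

# One row per dict, then one column per dict key
exploded = df.explode('references').reset_index(drop=True)
keys = pd.json_normalize(exploded['references'].tolist())
out = exploded.drop(columns='references').join(keys)
print(out)

Missing keys simply come back as NaN, so ragged dictionaries (like the third row in the question) are handled without extra work.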
I want to match the strings from 2 dataframes and, if a match is found, return the corresponding results. My first dataframe contains:
Name
abc
pqr
xyz
And the second dataframe contains:
Id Name
1 abc
2 lmn
3 pqr
4 qwe
I want to return the Id by comparing the (string) Name columns. Additionally, how can I achieve the same thing if each Name from the first dataframe is compared against every Name in dataframe 2? The code below is what I was trying after combining the 2 dataframes. This is a function which compares the strings and returns their difference:
from diff_match_patch import diff_match_patch

def bit_func(x):
    dmp = diff_match_patch()
    patches = dmp.patch_make(x.Name1, x.Name2)
    diff = dmp.patch_toText(patches)
    return diff
I have tried to get the difference, but the code is not working. I also want the corresponding Id for each name; how do I return that?
df['diff'] = df.apply(bit_func, axis=1)
You can just use pandas merge functionality to show the matches between the DataFrames and the Ids associated with them:
import pandas as pd
df1 = pd.DataFrame({'Name': ['abc', 'pqr', 'xyz']})
df2 = pd.DataFrame({'Name': ['abc', 'lmn', 'pqr', 'qwe'], 'Id': [1, 2, 3, 4]})
print(df1.merge(df2))
Output is:
Name Id
0 abc 1
1 pqr 3
To get the rows that are in one DataFrame but not the other, use the following:
df1.merge(df2, how='outer', indicator=True).query('_merge != "both"').drop(columns='_merge')
Which outputs:
Name Id
2 xyz NaN
3 lmn 2.0
4 qwe 4.0
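If you only need the Id attached to df1 (including a NaN for names with no match), a map-based lookup is another option; a sketch assuming the same df1 and df2 as above:

# Look up each Name in df2 and pull back its Id; non-matches become NaN
df1['Id'] = df1['Name'].map(df2.set_index('Name')['Id'])
print(df1)

Output:

  Name   Id
0  abc  1.0
1  pqr  3.0
2  xyz  NaN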
How can I slice column values based on first & last character location indicators from two other columns?
Here is the code for a sample df:
import pandas as pd

d = {'W': ['abcde', 'abcde', 'abcde', 'abcde']}
df = pd.DataFrame(data=d)
df['First'] = [0, 0, 0, 0]
df['Last'] = [1, 2, 3, 5]
df['Slice'] = ['a', 'ab', 'abc', 'abcde']
print(df.head())
(The Slice column in the sample above shows the desired output; the question is how to compute it from W, First, and Last.)
Just do it with a for loop; if you are worried about the speed, see "For loops with pandas - When should I care?":
df['Slice'] = [x[y:z] for x, y, z in zip(df.W, df.First, df.Last)]
df
Out[918]:
W First Last Slice
0 abcde 0 1 a
1 abcde 0 2 ab
2 abcde 0 3 abc
3 abcde 0 5 abcde
I am not sure if this will be faster, but a similar approach would be:
df['Slice'] = df.apply(lambda x: x['W'][x['First']:x['Last']], axis=1)
Briefly, you go through each row (axis=1) and apply a custom function. The function takes the row (stored as x) and slices the string in W using the First and Last values as the slicing indices (that's the lambda part). I will be happy to elaborate more if this isn't clear.
I have a pandas dataframe with one column and 100 rows. I would like to merge all the values of a column into one single value.
Ex:
S.No Text
0 abc
1 def
2 ghi
3 jkl
4 mno
I want the result to be "abcdefghijklmno" as a one single value.
Any ideas?
Two pandas ways:
df.Text.sum()
Out[72]: 'abcdefghijklmno'
''.join(df.Text)
Out[77]: 'abcdefghijklmno'
Using a for loop:

s = ''
for x, y in df.iterrows():
    s += y['Text']
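Another pandas idiom worth knowing here is Series.str.cat, which concatenates every string in the column and takes an optional separator:

df.Text.str.cat()           # 'abcdefghijklmno'
df.Text.str.cat(sep=', ')   # 'abc, def, ghi, jkl, mno'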
I have a pandas dataframe with two columns of strings. I want to identify all rows where the string in the first column (s1) appears within the string in the second column (s2).
So if my columns were:
abc abcd*ef_gh
z1y xxyyzz
I want to keep the first row, but not the second.
The only approach I can think of is to:
iterate through dataframe rows
apply df.str.contains() to s2 using the contents of s1 as the matching pattern
Is there a way to accomplish this that doesn't require iterating over the rows?
It is probably doable (for simple matching only), in a vectorised way, with numpy chararray methods:
In [325]:
import numpy as np

In [326]:
print(df)
    s1          s2
0  abc  abcd*ef_gh
1  z1y      xxyyzz
2  aaa   aaabbbsss

In [327]:
print(df.loc[np.char.find(df.s2.values.astype(str),
                          df.s1.values.astype(str)) >= 0,
             's1'])
0 abc
2 aaa
Name: s1, dtype: object
The best I could come up with is to use apply instead of manual iterations:
>>> df = pd.DataFrame({'x': ['abc', 'xyz'], 'y': ['1234', '12xyz34']})
>>> df
     x        y
0  abc     1234
1  xyz  12xyz34
>>> df.x[df.apply(lambda row: row.y.find(row.x) != -1, axis=1)]
1 xyz
Name: x, dtype: object
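If apply feels heavyweight, a plain list comprehension with the in operator builds the same boolean mask without row-wise apply; a sketch assuming the x/y columns from the example above:

# True where the x string occurs inside the y string
mask = [a in b for a, b in zip(df.x, df.y)]
print(df.x[mask])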