Check orthography in pandas data frame column using Enchant library - python

I'm new to pandas and enchant. I want to check orthography in short sentences using python.
I have a pandas data frame:
id_num word
1 live haapy
2 know more
3 ssweam good
4 eeat little
5 dream alot
And I want to achieve the next table with column “check”
id_num word check
1 live haapy True, False
2 know more True, True
3 ssweam good False, True
4 eeat little False, True
5 dream alot True, False
What is the best way to do this?
I tried this code:
import enchant
dic = enchant.Dict("ru_Eng")
df['list_word'] = df['word'].str.split() #Get list of all words in each sentence using split()
row = list()
for row in df[['id_num', 'list_word']].iterrows():
r = row[1]
for word in r.list_word:
rows.append((r.id_num, word))
df2 = pd.DataFrame(rows, columns=['id_num', 'word']) #Make the table with id_num column and a column of separate words
Then I got new data frame (df2):
id_num word
1 live
1 haapy
2 know
2 more
3 ssweam
3 good
4 eeat
4 little
5 dream
5 alot
After that I check words using:
column = df2['word']
for i in column:
n = dic.check(i)
print(n)
The result is:
True
False
True
True
False
True
False
True
True
False
Check is carried out correctly but when I tried to put this result to a new pandas data frame column I got all False values for all words.
for i in column:
df2['res'] = dic.check(i)
Resulted data frame:
id_num word res
1 live False
1 haapy False
2 know False
2 more False
3 ssweam False
3 good False
4 eeat False
4 little False
5 dream False
5 alot False
I will be grateful for any help!

Related

Improve the performance of my python code (I'm using Pandas)

I'm doing a python code for data analysis. I would like to mark the lines, in a new column, that have the same value in EMP, RAZAO and ATRIB columns and add the values in MONTCALC is zero. For exemple:
Example of datas
In this image the lines marked with color are subgroup and if you add the values of MONTCALC column the result is 0.
My code:
conciliation_df_temp = conciliation_df.copy()
doc_clear = 1
for i in conciliation_df_temp.index:
if conciliation_df_temp.loc[i,'DOC_COMP'] == "":
company = conciliation_df_temp.loc[i,'EMP']
gl_account = conciliation_df_temp.loc[i,'RAZAO']
assignment = conciliation_df_temp.loc[i,'ATRIB']
df_temp = conciliation_df_temp.loc[(cconciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment)]
if round(df_temp['MONTCALC'].sum(),2) == 0:
conciliation_df_temp.loc[(conciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment),'DOC_COMP'] = doc_clear
doc_clear += 1
The performance with few lines (10,000) is good execute less than 1 minute. In the 1 minute also has read a text file, file handling and convertion to dataframe. But if I put a text file with more than 1 million lines the script does't execute, I wait 5 hours with out return.
What do I do to improve performance this code?
Regards!!
Sorry my English
I tried delete lines in dataFrame to decrease size dataFrame to search be faster, but the execution was slower.
It seems like you could just check if the sum of the group is zero:
import pandas as pd
df = pd.DataFrame([
[3000,1131500040,8701731701,-156002.08],
[3000,1131500040,8701731701, 156002.08],
[3000,1131500040,"EA-17012.2.22", -3990],
[3000,1131500040,"EA-17012.2.22", 400],
[3000,1131500040,"000100000103", -35822.86],
[3000,1131500040,"000100000103", 35822.86],
[3000,1131500040,"000100000103", -35822.86],
[3000,1131500040,"000100000103", 35822.86]
], columns=['EMP','RAZAO','ATRIB','MONTCALC']
)
df['zero'] = df.groupby(['EMP','RAZAO','ATRIB'])['MONTCALC'].transform(lambda x: sum(x)==0)
print(df)
Output
EMP RAZAO ATRIB MONTCALC zero
0 3000 1131500040 8701731701 -156002.08 True
1 3000 1131500040 8701731701 156002.08 True
2 3000 1131500040 EA-17012.2.22 -3990.00 False
3 3000 1131500040 EA-17012.2.22 400.00 False
4 3000 1131500040 000100000103 -35822.86 True
5 3000 1131500040 000100000103 35822.86 True
6 3000 1131500040 000100000103 -35822.86 True
7 3000 1131500040 000100000103 35822.86 True

Sanitizing input to be used in pandas query with the str.contains method

How can I sanitize the input in the code below so that it can be used to query a dataset df?
query = f"`{field_name}`.str.contains('''{input()}''', case=False)"
df.query(query)
The main problem with the code above is that when the input contains triple quotes or backslashes it throws an error. Also keep in mind that the dataframe also contains backslashes in some cells and thus I would like the query to be able to perform that search as well (eg if the input is a\s I would like the query to return rows that contain a\s like for example aaaa\saaaaa would be a match).
Assume that field_name is given and not going to cause trouble.
If I understand you correctly, you want this:
import pandas as pd
import numpy as np
s1 = pd.Series(['Mouse', 'dog a\s', 'house and parrot', '23', np.NaN, 'aaaa\saaaaa', ' \ """ '])
s2 = s1.str.contains(input('input: '), regex=False)
print(s2)
input: a\s
0 False
1 True
2 False
3 False
4 NaN
5 True
6 False
dtype: object
Process finished with exit code 0
input: """
0 False
1 False
2 False
3 False
4 NaN
5 False
6 True
dtype: object
Process finished with exit code 0

Filtering inside the dataframe pandas

I want to extract two first symbols in case three first symbols match a certain pattern (first two symbols should be any of those inside the brackets [ptkbdgG_fvsSxzZhmnNJlrwj], the third symbol should be any of those inside the brackets[IEAOYye|aouKLM#)3*<!(#0~q^LMOEK].
The first two lines work correctly.
The last lines do not work and I do not understand why. The code doesn`t give any errors, it just does nothing for those
# extract tree first symbols and save them in the new column
df['first_three_symbols'] = df['ITEM'].str[0:3]
#create a boolean column on condition whether first three symbols contain symbols
df["ccv"] = df["first_three_symbols"].str.contains('[ptkbdgG_fvsSxzZhmnNJlrwj][ptkbdgG_fvsSxzZhmnNJlrwj][IEAOYye|aouKLM#)3*<!(#0~q^LMOEK]')
#create another column for True values in the previous column
if df["ccv"].item == True:
df['first_two_symbols'] = df["ITEM"].str[0:2]
Here is my output:
ID ITEM FREQ first_three_symbols ccv
0 0 a 563 a False
1 1 OlrMndmEn 1 Olr False
2 2 OlrMndSpOrtl#r 0 Olr False
3 3 AG#l 74 AG# False
4 4 AG#lbMm 24 AG# False
... ... ... ... ... ...
51723 51723 zytzWt# 8 zyt False
51724 51724 zytzytOst 0 zyt False
51725 51725 zYxtIx 5 zYx False
51726 51726 zYxtIxkWt 0 zYx False
51727 51727 zyZe 4 zyZ False
[51728 rows x 5 columns]
you can either create a function, use apply method :
def f(row):
if row["ccv"] == True:
return row["ITEM"].str[0:2]
else:
return None
df['first_two_symbols'] = df.apply(f,axis=1)
or you can use np.wherefunction from numpy package.

creating new column based on multiple values of another column

I have a data frame that has these values in one of its columns:
In:
df.line.unique()
Out:
array(['Line71A', 'Line71B', 'Line75B', 'Line79A', 'Line79B', 'Line75A', 'Line74A', 'Line74B',
'Line70A', 'Line70B', 'Line58B', 'Line70', 'Line71', 'Line74', 'Line75', 'Line79', 'Line58'],
dtype=object)
And I would like to create a new column with 2 values based on if the value string contains LineXX, like so:
if (df.line.str.contains("Line70") or (df.line.str.contains("Line71") or (df.line.str.contains("Line79")):
return 1
else:
return 0
So the value should be 1 in the new column, box_type, if the values in df.line contains "Line70", "Line71", "Line79" and the rest should be 0
I tried doing this with this code:
df['box_type'] = df.line.apply(lambda x: 1 if x.contains('Line70') or x.contains('Line71') or x.contains('Line79') else 0)
But I get this error:
AttributeError: 'str' object has no attribute 'contains'
And I tried adding .str in between x and contains, like x.str.contains(), but that also gave an error.
How can I do this?
Thanks!
How about:
df['box_type'] = df.line.str.contains('70|71|79')
Sample data:
np.random.seed(1)
df = pd.DataFrame({'line':np.random.choice(a, 10)})
Output:
line box_type
0 Line75A False
1 Line70 True
2 Line71 True
3 Line70A True
4 Line70B True
5 Line70 True
6 Line75A False
7 Line79 True
8 Line71A True
9 Line58 False

Pandas Improving Efficiency

I have a pandas dataframe with approximately 3 million rows.
I want to partially aggregate the last column in seperate spots based on another variable.
My solution was to separate the dataframe rows into a list of new dataframes based on that variable, aggregate the dataframes, and then join them again into a single dataframe. The problem is that after a few 10s of thousands of rows, I get a memory error. What methods can I use to improve the efficiency of my function to prevent these memory errors?
An example of my code is below
test = pd.DataFrame({"unneeded_var": [6,6,6,4,2,6,9,2,3,3,1,4,1,5,9],
"year": [0,0,0,0,1,1,1,2,2,2,2,3,3,3,3],
"month" : [0,0,0,0,1,1,1,2,2,2,3,3,3,4,4],
"day" : [0,0,0,1,1,1,2,2,2,2,3,3,4,4,5],
"day_count" : [7,4,3,2,1,5,4,2,3,2,5,3,2,1,3]})
test = test[["year", "month", "day", "day_count"]]
def agg_multiple(df, labels, aggvar, repl=None):
if(repl is None): repl = aggvar
conds = df.duplicated(labels).tolist() #returns boolean list of false for a unique (year,month) then true until next unique pair
groups = []
start = 0
for i in range(len(conds)): #When false, split previous to new df, aggregate count
bul = conds[i]
if(i == len(conds) - 1): i +=1 #no false marking end of last group, special case
if not bul and i > 0 or bul and i == len(conds):
sample = df.iloc[start:i , :]
start = i
sample = sample.groupby(labels, as_index=False).agg({aggvar:sum}).rename(columns={aggvar : repl})
groups.append(sample)
df = pd.concat(groups).reset_index(drop=True) #combine aggregated dfs into new df
return df
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
I suppose that I could potentially apply the function to small samples of the dataframe, to prevent a memory error and then combine those, but I'd rather improve the computation time of my function.
This function does the same, and is 10 times faster.
test.groupby(["year", "month"], as_index=False).agg({"day_count":sum}).rename(columns={"day_count":"month_count"})
There are almost always pandas methods that are pretty optimized for tasks that will vastly outperform iteration through the dataframe. If I understand correctly, in your case, the following will return the same exact output as your function:
test2 = (test.groupby(['year', 'month'])
.day_count.sum()
.to_frame('month_count')
.reset_index())
>>> test2
year month month_count
0 0 0 16
1 1 1 10
2 2 2 7
3 2 3 5
4 3 3 5
5 3 4 4
To check that it's the same:
# Your original function:
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
>>> test == test2
year month month_count
0 True True True
1 True True True
2 True True True
3 True True True
4 True True True
5 True True True

Categories

Resources