Number formatting after mapping? - python

I have a data frame with a number column, such as:
CompteNum
100
200
300
400
500
and a file with the mapping of all these numbers to other numbers, which I import into Python and convert into a dictionary:
{100: 1, 200: 2, 300: 3, 400: 4, 500: 5}
I am creating a second column in the data frame that combines both numbers in the format df number + dict number: from 100 to 1001, and so on...
## dictionary
accounts = pd.read_excel("mapping-accounts.xlsx")
accounts = accounts[['G/L Account #','FrMap']]
accounts = accounts.set_index('G/L Account #').to_dict()['FrMap']
## data frame --> CompteNum is the Number Column
df['CompteNum'] = df['CompteNum'].map(accounts).astype(str) + df['CompteNum'].astype(str)
The problem is that my output is then 100.01.0 instead of 1001, and that creates additional manual work in the output Excel file. I have tried:
df['CompteNum'] = df['CompteNum'].str.replace('.0', '')
but it doesn't delete ALL the zeros, and I would want the additional ones deleted. Any suggestions?

There is a problem with missing values for non-matched keys after map; a possible solution is:
print (df)
CompteNum
0 100
1 200
2 300
3 400
4 500
5 40
accounts1 = {100: 1, 200:2, 300:3, 400:4, 500:5}
s = df['CompteNum'].astype(str)
s1 = df['CompteNum'].map(accounts1).dropna().astype(int).astype(str)
df['CompteNum'] = (s + s1).fillna(s)
print (df)
CompteNum
0 1001
1 2002
2 3003
3 4004
4 5005
5 40
Your solution should be changed to a regex replace: use $ for the end of the string and escape the ., because . is a special regex character (it matches any character):
df['CompteNum'] = df['CompteNum'].str.replace(r'\.0$', '', regex=True)
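As a further option (a minimal sketch, not part of the original answers, and assuming a pandas version with the nullable Int64 dtype), you can keep the mapped values integer-typed so the .0 suffix never appears and no replace step is needed:
import pandas as pd

df = pd.DataFrame({'CompteNum': [100, 200, 300, 400, 500, 40]})
accounts1 = {100: 1, 200: 2, 300: 3, 400: 4, 500: 5}

s = df['CompteNum'].astype(str)
# Nullable Int64 keeps unmatched keys as <NA> instead of promoting the column to float
mapped = df['CompteNum'].map(accounts1).astype('Int64')
# Concatenate only where a mapping exists; keep the original number otherwise
df['CompteNum'] = (s + mapped.astype(str)).where(mapped.notna(), s)
print(df)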

Related

retrieve cell string values in a column between two unknown indexes based on substrings location

I need to locate the first location where the word 'then' appears in the Words table. I'm trying to write code that consolidates all strings in the 'text' column from this location up to the first text containing the substring '666' or '999' (in this case a combination of their, stoma22, fe156, sligh334, pain666; the desired subtrings_output = 'theirstoma22fe156sligh334pain666').
I've tried:
their_loc = np.where(words['text'].str.contains(r'their', na =True))[0][0]
666_999_loc = np.where(words['text'].str.contains(r'666', na =True))[0][0]
subtrings_output = Words['text'].loc[Words.index[their_loc:666_999_loc]]
As you can see, I'm not sure how to extend the condition of 666_999_loc to include substring 666 or 999, and slicing the index between two variables raises an error. Many thanks.
Words table:
page no  text      font
1        they      0
1        ate       0
1        apples    0
2        and       0
2        then      1
2        their     0
2        stoma22   0
2        fe156     1
2        sligh334  0
2        pain666   1
2        given     0
2        the       1
3        fruit     0
You just need to add one for the end of the slice, and add an or condition to the np.where of the 666_or_999_loc using the | operator.
text_col = words['text']
their_loc = np.where(text_col.str.contains(r'their', na=True))[0][0]
contains_666_or_999_loc = np.where(text_col.str.contains('666', na=True) |
text_col.str.contains('999', na=True))[0][0]
subtrings_output = ''.join(text_col.loc[words.index[their_loc:contains_666_or_999_loc + 1]])
print(subtrings_output)
Output:
theirstoma22fe156sligh334pain666
IIUC, use pandas.Series.idxmax with "".join().
Series.idxmax(axis=0, skipna=True, *args, **kwargs)
Return the row label of the maximum value.
If multiple values equal the maximum, the first row label with that
value is returned.
So, assuming Words is your dataframe, try this:
their_loc = Words["text"].str.contains("their").idxmax()
_666_999_loc = Words["text"].str.contains("666").idxmax()
subtrings_output = "".join(Words["text"].loc[Words.index[their_loc:_666_999_loc+1]])
Output :
print(subtrings_output)
#theirstoma22fe156sligh334pain666
#their stoma22 fe156 sligh334 pain666 # <- with " ".join()
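One small addition (a sketch, not from the original answers): the second answer only checks for '666', while the question also wants to stop at '999'. Both substrings can be covered with a single regex alternation; the mini table below is a hypothetical cut-down version of Words:
import numpy as np
import pandas as pd

words = pd.DataFrame({'text': ['they', 'then', 'their', 'stoma22', 'fe156',
                               'sligh334', 'pain666', 'given']})

start_loc = np.where(words['text'].str.contains('their', na=True))[0][0]
# One pattern matches either terminating substring
end_loc = np.where(words['text'].str.contains(r'666|999', na=True))[0][0]
subtrings_output = ''.join(words['text'].iloc[start_loc:end_loc + 1])
print(subtrings_output)  # theirstoma22fe156sligh334pain666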

How to rename values in column having a specific separation symbols?

Values in my DataFrame look like this:
id val
big_val_167 80
renv_100 100
color_100 200
color_60/write_10 200
I want to remove everything in the id column values starting from the first underscore that is followed by a number. So the desired result must look like:
id val
big_val 80
renv 100
color 200
color 200
How can I do that? I know that str.replace() can be used, but I don't understand how to write the regular expression part of it.
You can use a regex (re.search) to find the first occurrence of _ followed by a digit, and then slice the string at that position.
Code:
import re
import pandas as pd
def fix_id(value):
    # Find the first occurrence of "_" followed by a digit in the id
    digit_search = re.search(r"_\d", value)
    # Keep the original value if no such pattern exists
    return value[:digit_search.start()] if digit_search else value

# Your df
df = pd.DataFrame({"id": ["big_val_167", "renv_100", "color_100", "color_60/write_10"],
                   "val": [80, 100, 200, 200]})
df["id"] = df["id"].apply(fix_id)
print(df)
Output:
id val
0 big_val 80
1 renv 100
2 color 200
3 color 200
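Since the question explicitly asks how to write the regular-expression part of str.replace(), here is a vectorized alternative (a sketch built on the same sample data, not part of the original answer):
import pandas as pd

df = pd.DataFrame({"id": ["big_val_167", "renv_100", "color_100", "color_60/write_10"],
                   "val": [80, 100, 200, 200]})
# Drop everything from the first "_<digit>" to the end of the string
df["id"] = df["id"].str.replace(r"_\d.*$", "", regex=True)
print(df)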

Check presence of multiple keywords and create another column using python

I have a data frame as shown below
df = pd.DataFrame({'meds': ['Calcium Acetate','insulin GLARGINE -- LANTUS - inJECTable','amoxicillin 1 g + clavulanic acid 200 mg ','digoxin - TABLET'],
'details':['DOSE: 667 mg - TDS with food - Inject','DOSE: 12 unit(s) - ON - SC (SubCutaneous)','-- AUGMENTIN - inJECTable','DOSE: 62.5 mcg - Every other morning - PO'],
'extracted':['Calcium Acetate 667 mg Inject','insulin GLARGINE -- LANTUS 12 unit(s) - SC (SubCutaneous)','amoxicillin 1 g + clavulanic acid 200 mg -- AUGMENTIN','digoxin - TABLET 62.5 mcg PO/Tube']})
df['concatenated'] = df['meds'] + " "+ df['details']
What I would like to do is
a) Check whether all of the individual keywords from the extracted column are present in the concatenated column.
b) If present, assign 1 to the output column, else 0.
c) Assign the keyword that was not found to the issue column, as shown below.
So, I was trying something like below
df['clean_extract'] = df.extracted.str.extract(r'([a-zA-Z0-9\s]+)')
#the above regex is incorrect. I would like to clean the text (remove all symbols except spaces and retain a clean text)
df['keywords'] = df.clean_extract.str.split(' ') #split them into keywords
def value_present(row): # check whether each of the keywords is present in the `concatenated` column
    if isinstance(row['keywords'], list):
        for keyword in row['keywords']:
            return 1
    else:
        return 0
df['output'] = df[df.apply(value_present, axis=1)][['concatenated', 'keywords']].head()
If you think it's useful to clean the concatenated column as well, that's fine. I am only interested in finding the presence of all keywords.
Is there any efficient and elegant approach to do this on 7-8 million records?
I expect my output to be as shown below. Red color indicates a missing term between the extracted and concatenated columns; such a row is assigned 0 and the keyword is stored in the issue column.
Let us zip the columns extracted and concatenated and for each pair map it to a function f which computes the set difference and returns the result accordingly:
import numpy as np

def f(x, y):
    s = set(x.split()) - set(y.split())
    return [0, ', '.join(s)] if s else [1, np.nan]
df[['output', 'issue']] = [f(*s) for s in zip(df['extracted'], df['concatenated'])]
output issue
0 1 NaN
1 1 NaN
2 1 NaN
3 0 PO/Tube
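If the cleaning step mentioned in the question is also needed (stripping every symbol except letters, digits and spaces before the comparison), a hedged sketch along those lines could be:
import pandas as pd

def clean(series: pd.Series) -> pd.Series:
    # Replace anything that is not a letter, digit or whitespace with a space,
    # then collapse repeated whitespace and trim
    return (series.str.replace(r"[^A-Za-z0-9\s]", " ", regex=True)
                  .str.replace(r"\s+", " ", regex=True)
                  .str.strip())

# Usage with the data frame from the question (column names as defined there):
# df["extracted"] = clean(df["extracted"])
# df["concatenated"] = clean(df["concatenated"])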

Faster way of comparing 2 similar Data Frames for differences

This is in continuation of my previous question:
How to fetch the modified rows after comparing 2 versions of same data frame
I am now done with the MODIFICATIONS, however, I am using below method for finding the INSERTS and DELETES.
It works fine; however, it takes a lot of time for a typical CSV file with 10 columns and 10M rows.
For my problem:
INSERT: records which are not in the old file but are in the new file.
DELETE: records which are in the old file but not in the new file.
Below is the code:
def getInsDel(df_old, df_new, key):
    # concatenating old and new data to generate comparisons
    df = pd.concat([df_new, df_old])
    df = df.reset_index(drop=True)
    # doing a group by to get the frequency of each key
    print('Grouping data for frequency of key...')
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
    df_delta = df.reindex(idx)
    df_delta_freq = df_delta.groupby(key).size().reset_index(name='Freq')
    # Filtering data for frequency = 1, since these will be the target records for DELETE and INSERT
    print('Creating data frame to get records with Frequency = 1 ...')
    filter = df_delta_freq['Freq'] == 1
    df_delta_freq_ins_del = df_delta_freq.where(filter)
    # Dropping rows with NULL
    df_delta_freq_ins_del = df_delta_freq_ins_del.dropna()
    print('Creating data frames of Inserts and Deletes ...')
    # Creating INSERT dataFrame
    df_ins = pd.merge(df_new,
                      df_delta_freq_ins_del[key],
                      on=key,
                      how='inner')
    # Creating DELETE dataFrame
    df_del = pd.merge(df_old,
                      df_delta_freq_ins_del[key],
                      on=key,
                      how='inner')
    print('size of INSERT file: ' + str(df_ins.shape))
    print('size of DELETE file: ' + str(df_del.shape))
    return df_ins, df_del
The group by that computes the frequency of each key takes around 80% of the total time, so for my CSV it takes around 12-15 minutes.
Is there a faster approach for doing this?
For your reference, below is my result expectation:
For example, Old data is:
ID Name X Y
1 ABC 1 2
2 DEF 2 3
3 HIJ 3 4
and new data set is:
ID Name X Y
2 DEF 2 3
3 HIJ 55 42
4 KLM 4 5
Where ID is the Key.
Insert_DataFrame should be:
ID Name X Y
4 KLM 4 5
Deleted_DataFrame should be:
ID Name X Y
1 ABC 1 2
To get the records to be deleted:
delete = pd.merge(old, new, how='left', on='ID', indicator=True)
delete = delete.loc[delete['_merge'] == 'left_only']
delete.dropna(axis=1, inplace=True)
To get the records to be inserted:
insert = pd.merge(new, old, how='left', on='ID', indicator=True)
insert = insert.loc[insert['_merge'] == 'left_only']
insert.dropna(axis=1, inplace=True)
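For pure key-based INSERT/DELETE detection there is a usually faster route that skips the full merge (a sketch, assuming ID is a unique key as in the example):
import pandas as pd

df_old = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['ABC', 'DEF', 'HIJ'],
                       'X': [1, 2, 3], 'Y': [2, 3, 4]})
df_new = pd.DataFrame({'ID': [2, 3, 4], 'Name': ['DEF', 'HIJ', 'KLM'],
                       'X': [2, 55, 4], 'Y': [3, 42, 5]})

# Keys present only in the new file -> INSERT
df_ins = df_new[~df_new['ID'].isin(df_old['ID'])]
# Keys present only in the old file -> DELETE
df_del = df_old[~df_old['ID'].isin(df_new['ID'])]

print(df_ins)   # row with ID 4 (KLM)
print(df_del)   # row with ID 1 (ABC)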

Python text processing: NLTK and pandas

I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.
I have some text data with a few other attributes. I would like to run some analyses on the text and I would like to be able to correlate features extracted from text (such as individual word tokens or LDA topics) with the other attributes.
My plan was to load the data as a pandas data frame, with each response representing a document. Unfortunately, I ran into an issue:
import pandas as pd
import nltk
pd.options.display.max_colwidth = 10000
txt_data = pd.read_csv("data_file.csv",sep="|")
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45
txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
    txt_lines.append(line)
txt = str(txt_lines)
len(txt)
Out[14]: 1668813
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086
Note that in both cases the text was processed in such a way that anything but spaces, letters, and ,.?! was removed (for simplicity).
As you can see, a pandas field converted into a string returns fewer matches, and the length of the string is also shorter.
Is there any way to improve the above code?
Also, str(x) creates one big string out of the comments, while [str(x) for x in txt_data.comment] creates a list object which cannot be broken into a bag of words. What is the best way to produce an nltk.Text object that will retain document indices? In other words, I'm looking for a way to create a Term Document Matrix, R's equivalent of TermDocumentMatrix() from the tm package.
Many thanks.
The benefit of using a pandas DataFrame would be to apply the nltk functionality to each row like so:
import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize
word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]
df = pd.DataFrame(random_word_list, columns=['text'])
df.head()
text
0 Aaru Aaronic abandonable abandonedly abaction ...
1 abampere abampere abacus aback abalone abactor...
2 abaisance abalienate abandonedly abaff abacina...
3 Ababdeh abalone abac abaiser abandonable abact...
4 abandonable abandon aba abaiser abaft Abama ab...
len(df)
50
txt = df.text.apply(word_tokenize)
txt.head()
0 [Aaru, Aaronic, abandonable, abandonedly, abac...
1 [abampere, abampere, abacus, aback, abalone, a...
2 [abaisance, abalienate, abandonedly, abaff, ab...
3 [Ababdeh, abalone, abac, abaiser, abandonable,...
4 [abandonable, abandon, aba, abaiser, abaft, Ab...
txt.apply(len)
0 1000
1 1000
2 1000
3 1000
4 1000
....
44 1000
45 1000
46 1000
47 1000
48 1000
49 1000
Name: text, dtype: int64
As a result, you get the .count() for each row entry:
txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()
0 27
1 24
2 17
3 25
4 32
You can then sum the result using:
txt.sum()
1239
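Regarding the Term Document Matrix part of the question (the equivalent of R's TermDocumentMatrix() from the tm package), a common approach outside NLTK is scikit-learn's CountVectorizer. A minimal sketch, assuming a recent scikit-learn is installed and each row of the text column is one document:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'text': ['the cat sat on the mat',
                            'the dog ate the cat',
                            'dogs and cats']})

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(df['text'])  # sparse document-term matrix

# A dense DataFrame keyed by the original row index, so term counts can be
# joined back to the other attributes of the data frame
dtm_df = pd.DataFrame(dtm.toarray(),
                      columns=vectorizer.get_feature_names_out(),
                      index=df.index)
print(dtm_df)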
