I have texts in one column and a respective dictionary in another column. I have tokenized the text and want to replace those tokens that match a key in the respective dictionary. The text and the dictionary are specific to each record of a pandas DataFrame.
import pandas as pd
data =[['1','i love mangoes',{'love':'hate'}],['2', 'its been a long time we have not met',{'met':'meet'}],['3','i got a call from one of our friends',{'call':'phone call','one':'couple of'}]]
df = pd.DataFrame(data, columns = ['id', 'text','dictionary'])
The final output dataframe should be:
data = [['1','i hate mangoes'],['2','its been a long time we have not meet'],['3','i got a phone call from couple of of our friends']]
df = pd.DataFrame(data, columns=['id', 'modified_text'])
I am using Python 3 on a Windows machine.
You can use the dict.get method after zipping the two columns and splitting the sentence:
df['modified_text'] = [' '.join(b.get(i, i) for i in a.split())
                       for a, b in zip(df['text'], df['dictionary'])]
print(df)
Output:
id text \
0 1 i love mangoes
1 2 its been a long time we have not met
2 3 i got a call from one of our friends
dictionary \
0 {'love': 'hate'}
1 {'met': 'meet'}
2 {'call': 'phone call', 'one': 'couple of'}
modified_text
0 i hate mangoes
1 its been a long time we have not meet
2 i got a phone call from couple of of our friends
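One caveat worth noting: split() keeps punctuation attached to tokens, so a key like 'met' would not match the token 'met,'. A minimal sketch of a regex-token variant (the helper name replace_tokens is my own, not from the answer above):

```python
import re

def replace_tokens(text, mapping):
    # \w+ matches bare words, so trailing punctuation no longer
    # blocks a dictionary lookup.
    return re.sub(r'\w+', lambda m: mapping.get(m.group(0), m.group(0)), text)

print(replace_tokens('its been a long time we have not met.', {'met': 'meet'}))
# its been a long time we have not meet.
```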
I added spaces around the keys and values to distinguish a whole word from part of one:
def replace(text, mapping):
    new_s = text
    for key in mapping:
        k = ' ' + key + ' '
        val = ' ' + mapping[key] + ' '
        new_s = new_s.replace(k, val)
    return new_s

df_out = (df.assign(modified_text=lambda f:
                    f.apply(lambda row: replace(row.text, row.dictionary), axis=1))
          [['id', 'modified_text']])
print(df_out)
id modified_text
0 1 i hate mangoes
1 2 its been a long time we have not met
2 3 i got a phone call from couple of of our friends
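One limitation of the space-padding trick: a key at the very start or end of the string has no surrounding spaces, so it never matches. A sketch of an alternative using \b word boundaries (replace_words is a hypothetical helper, not from the answer above):

```python
import re

def replace_words(text, mapping):
    # \b matches at word edges, including the start and end of the string,
    # so no space padding is needed.
    for key, val in mapping.items():
        text = re.sub(rf'\b{re.escape(key)}\b', val, text)
    return text

print(replace_words('met you before we met', {'met': 'meet'}))
# meet you before we meet
```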
I am trying to remove emojis from a column in a pandas dataframe, using this code:
import re
import unicodedata

def remove_emoji(string):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

def decontracted(phrase):
    # specific
    phrase = phrase.rstrip()
    phrase = ' '.join(phrase.split())
    phrase = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', phrase)
    phrase = re.sub(r'#[\w]+', '', phrase)
    phrase = re.sub(r'[^\x00-\x7f]', r'', phrase)
    # general
    phrase = re.sub(r'#[^\s]+', '', phrase)
    phrase = remove_accented_chars(phrase)
    phrase = remove_special_characters(phrase)
    phrase = remove_emoji(phrase)
    return phrase

def remove_accented_chars(text):
    new_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return new_text

def remove_special_characters(text):
    # define the pattern to keep (note: a-zA-Z, not a-zA-z)
    pat = r'[^a-zA-Z0-9.,!?/:;\"\'\s]'
    return re.sub(pat, '', text)
Applying it to the dataframe column like so:
AAVE["sentence"] = AAVE["sentence"].apply(decontracted)
['He better hurry amp; come back from playing cards', 'I ordered a new phone', 'lol okay baby \ud83d\ude18\u2764\ud83d\ude0d', 'imma cry']
Above is an example of the text I'm testing on. \ud83d\ude18\u2764\ud83d\ude0d is not removed.
Edit:
Here is the code I am using to load the data that is in a TSV file:
AAVE = pd.read_csv('twitteraae_all_aa', sep='\t', on_bad_lines='skip')
columns = ['ID', 'Date', 'Num', 'Location','Num2', 'AA', 'Hispanic', 'Other', 'White']
AAVE.drop(columns, inplace=True, axis=1)
AAVE = AAVE.rename(columns={'Sentence': 'sentence'})
AAVE['label'] = 1
AAVE['sentence'] = AAVE['sentence'][0:391165].astype('string')
AAVE = AAVE.dropna()
AAVE['sentence1'] = AAVE['sentence'].astype('string').apply(decontracted).astype('string')
The code will work if I create an array of strings and apply the decontract function, but if I apply it to the dataframe, everything else that I want removed works, but not the emojis.
You have to apply it row by row:
AAVE["sentence"] = AAVE.apply(lambda row: remove_emoji(row["sentence"]), axis=1)
This line of code removes emojis, operating column by column:
df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
Note that it also removes all non-English letters and special characters, but if you need to keep them the code can be adjusted.
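A minimal self-contained demo of that encode/decode round-trip on a toy frame (the sample sentences and emoji are my own illustration):

```python
import pandas as pd

df = pd.DataFrame({'sentence': ['lol okay baby \U0001F618\u2764\U0001F60D', 'imma cry']})
# Encoding to ASCII with errors='ignore' drops every non-ASCII character,
# which removes the emoji as a side effect.
df = df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
print(df['sentence'].tolist())
# ['lol okay baby ', 'imma cry']
```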
Your functions work for me:
arr = ['He better hurry amp; come back from playing cards', 'I ordered a new phone',
'lol okay baby \ud83d\ude18\u2764\ud83d\ude0d', 'imma cry']
df = pd.DataFrame({"column1": [0, 1, 2, 3], "column2": arr})
df
column1 column2
0 0 He better hurry amp; come back from playing cards
1 1 I ordered a new phone
2 2 lol okay baby \ud83d\ude18❤\ud83d\ude0d
3 3 imma cry
df["column2"] = df["column2"].apply(decontracted)
df
column1 column2
0 0 He better hurry amp; come back from playing cards
1 1 I ordered a new phone
2 2 lol okay baby
3 3 imma cry
Could it be an issue with how the text is stored in your dataframe?
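One possibility worth ruling out (an assumption on my part, since the stored data isn't shown): the file may contain the literal six-character text \ud83d\ude18 rather than actual emoji code points. In that case the regex sees ordinary backslashes and hex digits and matches nothing:

```python
import re

emoji_pattern = re.compile(u"[\U0001F600-\U0001F64F]+")

real = 'baby \U0001F618'        # an actual emoji character
escaped = r'baby \ud83d\ude18'  # literal backslash text, e.g. as read from a file

print(repr(emoji_pattern.sub('', real)))     # emoji removed
print(repr(emoji_pattern.sub('', escaped)))  # string unchanged
```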
I have a data frame with two columns, Stg and Txt. The task is to check all of the words in the Stg column against each Txt row and output the matched words into a new column, keeping the word case as in Txt.
Example Code:
from pandas import DataFrame
import re

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
       'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
       }
df = DataFrame(new, columns=['Stg','Txt'])
my_list = df["Stg"].tolist()

def words_in_string(word_list, a_string):
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
        if not word_set:
            return  # raising StopIteration inside a generator is an error in Python 3.7+

df['new'] = ''
for i, values in enumerate(df['Txt']):
    b = []
    for word in words_in_string(my_list, values):
        b.append(word)
    df['new'][i] = b
The above code returns the case from the Stg column. Is there a way to get the case from Txt? Also, I want to match the entire word and not a substring: in the case of the text 'two-way', the current code returns the word 'way'.
Current Output:
Stg Txt new
0 way An early term []
1 Early two-way allowed [way, allowed]
2 phone New Phone feature that allowed [allowed]
3 allowed amazing universe []
4 type new day []
5 brand name the brand name is stage [brand name]
Expected Output:
Stg Txt new
0 way An early term [early]
1 Early two-way allowed [allowed]
2 phone New Phone feature that allowed [Phone, allowed]
3 allowed amazing universe []
4 type new day []
5 brand name the brand name is stage [brand name]
You should use Series.str.findall with a negative lookbehind:
import pandas as pd
import re
new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
}
df = pd.DataFrame(new,columns= ['Stg','Txt'])
pattern = "|".join(rf"\w*(?<![A-Za-z-;:,/|]){i}\b" for i in new["Stg"])
df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)
print (df)
#
Stg Txt new
0 way An early term [early]
1 Early two-way allowed [allowed]
2 phone New Phone feature that allowed [Phone, allowed]
3 allowed amazing universe []
4 type new day []
5 brand name the brand name is stage [brand name]
I have a dataframe which has some duplicate tags separated by commas in the "Tags" column. Is there a way to remove the duplicate strings from the series? I want the output in 400 to have just Museum, Drinking, Shopping.
I can't split on a comma and remove them because some tags in the series contain similar words: for example, [Museum, Art Museum, Shopping], so splitting and dropping multiple 'Museum' strings would affect the unique 'Art Museum' string.
Desired Output
You can split on the comma and convert to a set(), which removes duplicates, after removing leading/trailing whitespace with str.strip(). Then you can apply() this to your column.
df['Tags']=df['Tags'].apply(lambda x: ', '.join(set([y.strip() for y in x.split(',')])))
You can create a function that removes duplicates from a given string. Then apply this function to your column Tags.
def remove_dup(strng):
    '''
    Split the string on ', ' and rejoin the unique parts,
    preserving their original order (dict keys keep insertion order).
    '''
    return ', '.join(dict.fromkeys(strng.split(', ')))

df['Tags'] = df['Tags'].apply(remove_dup)
DEMO:
import pandas as pd
my_dict = {'Tags':["Museum, Art Museum, Shopping, Museum",'Drink, Drink','Shop','Visit'],'Country':['USA','USA','USA', 'USA']}
df = pd.DataFrame(my_dict)
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
df
Output:
Tags Country
0 Museum, Art Museum, Shopping USA
1 Drink USA
2 Shop USA
3 Visit USA
Without a code example to go on, I've thrown together something that should work.
import pandas as pd
test = [['Museum', 'Art Museum', 'Shopping', "Museum"]]
df = pd.DataFrame()
df[0] = test
df[0]= df.applymap(set)
Out[35]:
0
0 {Museum, Shopping, Art Museum}
One approach that avoids apply
# in your code just s = df['Tags']
s = pd.Series(['','', 'Tour',
'Outdoors, Beach, Sports',
'Museum, Drinking, Drinking, Shopping'])
(s.str.split(r',\s+', expand=True)
.stack()
.reset_index()
.drop_duplicates(['level_0',0])
.groupby('level_0')[0]
.agg(','.join)
)
Output:
level_0
0
1
2 Tour
3 Outdoors,Beach,Sports
4 Museum,Drinking,Shopping
Name: 0, dtype: object
There may be fancier ways of doing this kind of thing, but the following will do the job.
Make the tags lower-case:
data['tags'] = data['tags'].str.lower()
Split every row in the tags column on the comma; this returns a list of strings:
data['tags'] = data['tags'].str.split(',')
Map str.strip over every element of the list (removing surrounding spaces), then apply set to keep only the unique words:
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip , x)))
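The steps above, chained on a toy frame:

```python
import pandas as pd

data = pd.DataFrame({'tags': ['Museum, Drinking, Drinking, Shopping']})
data['tags'] = data['tags'].str.lower()                              # lower-case
data['tags'] = data['tags'].str.split(',')                           # list of strings
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))  # strip + dedupe
print(data['tags'][0])
```

Note that a set does not preserve the original tag order; if order matters, dict.fromkeys (as in the earlier answer) is the better choice.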
How do I remove multiple spaces between two strings in Python, e.g.:
"Bertug     Mete"  (with multiple blanks)
to
"Bertug Mete"
Input is read from an .xls file. I have tried using split() but it doesn't seem to work as expected.
import pandas as pd , string , re
dataFrame = pd.read_excel("C:\\Users\\Bertug\\Desktop\\example.xlsx")
#names1 = ''.join(dataFrame.Name.to_string().split())
print(type(dataFrame.Name))
#print(dataFrame.Name.str.split())
Let me know where I'm doing wrong.
I think you can use replace:
df.Name = df.Name.replace(r'\s+', ' ', regex=True)
Sample:
df = pd.DataFrame({'Name':['Bertug Mete','a','Joe Black']})
print (df)
Name
0 Bertug Mete
1 a
2 Joe Black
df.Name = df.Name.replace(r'\s+', ' ', regex=True)
#similar solution
#df.Name = df.Name.str.replace(r'\s+', ' ')
print (df)
Name
0 Bertug Mete
1 a
2 Joe Black
I have a text file; it can change each time and the number of lines can vary. Each line contains:
a string (can contain one word, two or even more) ^ a one-word string
EX:
level country ^ layla
hello sandra ^ organization
hello people ^ layla
hello samar ^ organization
I want to create dataframe using pandas such that:
item0 ( country, people)
item1 (sandra , samar)
For example, every time 'layla' appears we return the right-most name that belongs to it and add those names as the second column shown above, in this case (country, people); 'layla' becomes item0 and the index of the dataframe. I can't seem to arrange this, and I don't know how to write the logic that groups the duplicated values after the "^" and returns the list of right-most names belonging to each. My attempt so far, which doesn't really do it, is:
def text_file(file):
    list = []
    file_of_text = "text.txt"
    with open(file_of_context) as f:
        for l in f:
            l_dict = l.split(" ")
            list.append(l_dict)
    return list

def items(file_of_text):
    list_of_items = text_file(file_of_text)
    for a in list_of_items:
        for b in a:
            if a[-1]==

def main():
    file_of_text = "text.txt"

if __name__ == "__main__":
    main()
Starting with pandas read_csv(), specifying '^' as your delimiter and using arbitrary column names:
df = pd.read_csv('data.csv', delimiter=r'\^', names=['A', 'B'])
print (df)
A B
0 level country layla
1 hello sandra organization
2 hello people layla
3 hello samar organization
Then we split to get the values we want. The expand arg is new in pandas 0.16, I believe:
df['A'] = df['A'].str.split(' ', expand=True)[1]
print(df)
A B
0 country layla
1 sandra organization
2 people layla
3 samar organization
Then we group column B and apply the tuple function. Note: we're resetting the index so we can use it later:
g = df.groupby('B')['A'].apply(tuple).reset_index()
print(g)
B A
0 layla (country, people)
1 organization (sandra, samar)
Creating a new column with the string 'item' and the index
g['item'] = 'item' + g.index.astype(str)
print (g[['item','A']])
item A
0 item0 (country, people)
1 item1 (sandra, samar)
Let's assume that your file is called file_of_text.txt and contains the following:
level country ^ layla
hello sandra ^ organization
hello people ^ layla
hello samar ^ organization
You can get your data from a file to a dataframe similar to your desired output with the following lines of code:
import re
import pandas as pd

def main(myfile):
    # Open the file and read the lines
    text = open(myfile, 'r').readlines()
    # Split the lines into lists
    text = list(map(lambda x: re.split(r"\s[\^\s]*", x.strip()), text))
    # Put it in a DataFrame
    data = pd.DataFrame(text, columns=['A', 'B', 'C'])
    # Create an output DataFrame with rows "item0" and "item1"
    final_data = pd.DataFrame(['item0', 'item1'], columns=['D'])
    # Create your desired column
    final_data['E'] = data.groupby('C')['B'].apply(lambda x: tuple(x.values)).values
    print(final_data)

if __name__ == "__main__":
    myfile = "file_of_text.txt"
    main(myfile)
The idea is to read the lines from the text file and then split each line using the split method from the re module. The result is then passed to the DataFrame method to generate a dataframe called data, which is used to create the desired dataframe final_data. The result should look like the following:
# data
A B C
0 level country layla
1 hello sandra organization
2 hello people layla
3 hello samar organization
# final_data
D E
0 item0 (country, people)
1 item1 (sandra, samar)
Please take a look at the script and ask further questions, if you have any.
I hope this helps.