I hope someone can point me in the right direction.
What would be an efficient way to translate data within a row[x]?
For example, I want to convert the following: street, avenue, road, court to st, ave, rd, ct.
I was thinking of using a dictionary, the reason being that sometimes the first letter will be capitalized and sometimes it won't, e.g.: {'ave': ['Avenue', 'avenue', 'AVENUE', 'av', 'AV']}
Having said that, could I also do something (prior to translating) like converting all the data to lower case (in the original csv file) to avoid working with data that contains mixed caps?
This is for csv files with anywhere between 500-1000 lines.
Thank you.
Edit: I should add that the row[x] string would be something like '123 main street', and that is what I'm looking to translate to '123 main st'.
edit#2:
mydict = {'avenue': 'ave', 'street': 'st', 'road': 'rd', 'court': 'ct'}
add1 = '123 MAIN ROAD'
newadd1 = []
for i in add1.lower().split():
    newtext = mydict.get(i, i)  # i is already lower-cased here
    newadd1.append(newtext)
print(' '.join(newadd1))
Thank you, everyone.
The way I would tackle it is, as you suggested, by constructing a dictionary. For example, say that I would like to display any form of "Avenue" as "Ave":
mapper = {'ave': 'Ave', 'avenue': 'Ave', 'av': 'Ave', 'st': 'Street', 'street': 'Street', ...}
and then use it with every word in the address field as follows:
word = mapper.get(word.lower(), word)
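A minimal self-contained sketch of that lookup (the address and the `normalize` name are just for illustration):

```python
mapper = {'ave': 'Ave', 'avenue': 'Ave', 'av': 'Ave',
          'st': 'Street', 'street': 'Street'}

def normalize(address):
    # Look each lowercased word up in the mapper, falling back to the word itself.
    return ' '.join(mapper.get(word.lower(), word) for word in address.split())

print(normalize('123 Main STREET'))  # -> 123 Main Street
```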
Example dataframe:
import pandas as pd

data = pd.DataFrame({'Name': ['Nick', 'Matthew', 'Paul'],
                     'Text': ["Lived in Norway, England, Spain and Germany with his car",
                              "Used his bikes in England. Loved his bike",
                              "Lived in Alaska"]})
Example list:
example_list = ["England", "Bike"]
What I need
I want to create a new column, called x, where, if a term from example_list is found as a string/substring in data.Text (case insensitive), the word it matched in the text is added to the new column.
Output
So in row 1, the word England was found and returned, and bike was found and returned, as well as bikes (of which bike is a substring).
Progress so far:
I have managed, with the following code, to return terms that match regardless of case; however, it won't find substrings... e.g. if I search for "bike" and it finds "bikes", I want it to return "bikes".
pattern = fr'({"|".join(example_list)})'
data['Text'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(", ")
I think I might have found a solution for your pattern there:
pattern = fr'({"|".join("[a-zA-Z]*" + ex + "[a-zA-Z]*" for ex in example_list)})'
data['x'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(",")
Basically what I do is extend the pattern by optionally allowing letters before (I think you don't explicitly mention this, so maybe it has to be omitted) and after the word.
As an output I get the following:
I'm just not so sure in which format you want this x-column. In your code you join it via commas (which I followed here), but in the picture you only have a list of the values. If you specify this, I can update my solution.
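For reference, here is the snippet above as a self-contained script, using the example data from the question:

```python
import re
import pandas as pd

data = pd.DataFrame({'Name': ['Nick', 'Matthew', 'Paul'],
                     'Text': ["Lived in Norway, England, Spain and Germany with his car",
                              "Used his bikes in England. Loved his bike",
                              "Lived in Alaska"]})
example_list = ["England", "Bike"]

# Allow letters before and after each search term, matched case-insensitively.
pattern = fr'({"|".join("[a-zA-Z]*" + ex + "[a-zA-Z]*" for ex in example_list)})'
data['x'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(",")
print(data['x'].tolist())  # ['England', 'bikes,England,bike', '']
```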
I have an array of keywords:
keyword_list = ['word1', 'anotherWord', 'wordup', 'word to your papa']
I have a string of text:
string_of_text = """So this is a string of text. I want to talk about anotherWord...and then I'm going to say something I've been meaning to say "wordup". But I also wanted to say the following: word to your papa. And lastly I wanted to talk about word1..."""
I want to return the following:
{'list_word': 'word1', 'string_of_text_after': '...'}, {'list_word': 'anotherWord', 'string_of_text_after': '...and then I'm going to say something I've been meaning to say "'}, {'list_word': 'wordup', 'string_of_text_after': '". But I also wanted to say the following: '}, {'list_word': 'word to your papa', 'string_of_text_after': '. And lastly I wanted to talk about '}
As you can see it is a list of dictionaries with the list word and then the text that comes after the list word item but only until the next list word is detected is when it would discontinue.
What would be the most efficient way to do this in Python (Python 3 or later; 2 is also OK if there are any issues with deprecated methods)?
You could try something like this:
keyword_list = ['word1', 'anotherWord', 'wordup', 'word to your papa']
string_of_text = """So this is a string of text. I want to talk about anotherWord...\
 and then I'm going to say something I've been meaning to say "wordup". \
But I also wanted to say the following: word to your papa. \
And lastly I wanted to talk about word1..."""
def t(k, t):
    tmp = {i: len(i) for i in k}
    return [{"list_word": i, "string_of_text_after": t[t.find(i) + tmp[i]:]}
            for i in tmp if t.find(i) != -1]

from pprint import pprint
pprint(t(keyword_list, string_of_text))
Result:
[{'list_word': 'wordup',
'string_of_text_after': '". But I also wanted to say the following: word to your papa. And lastly I wanted to talk about word1...'},
{'list_word': 'word1', 'string_of_text_after': '...'},
{'list_word': 'anotherWord',
'string_of_text_after': '... and then I\'m going to say something I\'ve been meaning to say "wordup". But I also wanted to say the following: word to your papa. And lastly I wanted to talk about word1...'},
{'list_word': 'word to your papa',
'string_of_text_after': '. And lastly I wanted to talk about word1...'}]
ATTENTION
This code has several caveats:
the keyword_list has to consist of unique elements
the call t.find(i) is made twice for each keyword
the function returns a list, which must be held in memory; this could be avoided by returning a generator instead, like this:
return ({"list_word": i, "string_of_text_after": t[t.find(i) + tmp[i]:]} for i in tmp if t.find(i) != -1)
and consuming it where and when needed.
Good luck ! :)
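If it helps, here is a rough sketch of the doubled-find fix, computing each keyword's position exactly once (the shortened sample text and the function name are just for illustration):

```python
keyword_list = ['word1', 'anotherWord', 'wordup', 'word to your papa']
string_of_text = ('So this is a string of text. I want to talk about anotherWord... '
                  'and then I said "wordup". Then: word to your papa. '
                  'And lastly word1...')

def after_keywords(keywords, text):
    # One find() per keyword, cached in a dict.
    positions = {k: text.find(k) for k in keywords}
    return [{'list_word': k, 'string_of_text_after': text[pos + len(k):]}
            for k, pos in positions.items() if pos != -1]

result = after_keywords(keyword_list, string_of_text)
```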
I have a text, and I have got a task in Python that involves reading it:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have ideas about how it can be improved? (It gives me an error after some words; I guess the error happens because one of the 'Mr.'s is at the end of a line.)
orig_text = open('emma.txt', encoding='UTF-8')
lines = orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print(wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue
Try this:
import re

text = "When did Mr. Churchill tell Mr. James Brown about the fish"
m = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue simply read the entire file:
text = file.read()
Then, to count the occurrences, simply run:
from collections import Counter
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
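Putting the pieces together (the sample sentence is made up, not from the novel):

```python
import re
from collections import Counter

text = "When did Mr. Churchill tell Mr. James Brown about Mr. Churchill's fish?"

# Capture "Mr." followed by one or more capitalized words.
m = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]
counts = Counter(name[4:] for name in m)  # drop the leading "Mr. "
print(counts)  # Counter({'Churchill': 2, 'James Brown': 1})
```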
This can be easily done using regex and capturing group.
Take a look here for reference, in this scenario you might want to do something like
# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file) # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it to memory, take a look at this question
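For illustration, a rough mmap sketch (the sample file here is just a stand-in for emma.txt, so the example runs end to end):

```python
import mmap
import re

# Stand-in file for demonstration ('emma.txt' in the question).
with open('sample.txt', 'w', encoding='utf-8') as f:
    f.write("Emma talked to Mr. Knightley and then to Mr. Weston.")

# Memory-map the file instead of reading it into one big string;
# re can search a mmap object directly using a bytes pattern.
with open('sample.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        matches = re.findall(rb'Mr\. ([A-Z][a-z]*)', mm)

print(matches)  # [b'Knightley', b'Weston']
```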
I have a large dataset all_transcripts with conversations, and I have a small list gemeentes containing names of different cities. In all_transcripts, I want to replace each instance where a city name appears with 'woonplaats' (Dutch for city).
To do so, I have the following code:
all_transcripts['filtered'] = all_transcripts['no_punc'].str.replace('|'.join(gemeentes),' woonplaats ')
However, this replaces each instance in which the word combination appears and not just whole words.
What I'm looking for is something like:
all_transcripts['filtered'] = all_transcripts['no_punc'].re.sub('|'r"\b{}\b".format(join(gemeentes)),' woonplaats ')
But this doesn't work.
As an example, I have:
all_transcripts['no_punc'] = ['i live in amsterdam', 'i come from haarlem', 'groningen is her favourite city']
gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']
The output that I want, after I run the code is as follows:
>>> ['i live in woonplaats', 'i come from woonplaats', 'woonplaats is her favourite city']
Before, I've worked with the '\b' option of regex. However, I don't know how to apply it here. I could run a for loop for each word in gemeentes and apply it to the whole dataset, but given the sizes involved (gemeentes has over 300 entries and all_transcripts over 2.5 million rows), that would be very computationally expensive, so I would like an approach similar to the one above, where I replace via a single pattern using the OR operator.
It looks like you're close, but you'll want to change your re.sub call a little. Something like this should work:
import re

gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']
all_transcripts['filtered'] = [re.sub(r"\b({})\b".format("|".join(gemeentes)), "woonplaats", s)
                               for s in all_transcripts['no_punc']]
Output
all_transcripts['filtered'] = ['i live in woonplaats', 'i come from woonplaats', 'woonplaats is her favourite city']
As for performance, I'm not sure that you're going to get better speeds out of this than a traditional for-loop, as you're still looping over the 2.5 million entries and applying the regex.
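If you want to stay inside pandas' vectorized string methods, something like this should also work (same \b pattern, applied via Series.str.replace; pre-compiling the pattern is just a suggestion):

```python
import re
import pandas as pd

all_transcripts = pd.DataFrame({'no_punc': ['i live in amsterdam',
                                            'i come from haarlem',
                                            'groningen is her favourite city']})
gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']

# Compile the alternation once; \b keeps matches to whole words.
pattern = re.compile(r"\b({})\b".format("|".join(gemeentes)))
all_transcripts['filtered'] = all_transcripts['no_punc'].str.replace(pattern, 'woonplaats', regex=True)
```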
If you are using a pandas dataframe then you can use the following:
import pandas as pd
all_transcripts['filtered'] = all_transcripts['no_punc'].replace(['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen'], "woonplaats", regex=True)
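Note that with regex=True each list entry is treated as a regex, so you could wrap the names in \b to keep the replacement to whole words (a sketch using the example data from the question):

```python
import pandas as pd

all_transcripts = pd.DataFrame({'no_punc': ['i live in amsterdam',
                                            'i come from haarlem',
                                            'groningen is her favourite city']})
gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']

# With regex=True each entry is a regex, so \b limits replacement to whole words.
all_transcripts['filtered'] = all_transcripts['no_punc'].replace(
    [rf"\b{g}\b" for g in gemeentes], "woonplaats", regex=True)
```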
Given a list of actors, with their character name in brackets, separated by either a semi-colon (;) or comma (,):
Shelley Winters [Ruby]; Millicent Martin [Siddie]; Julia Foster [Gilda];
Jane Asher [Annie]; Shirley Ann Field [Carla]; Vivien Merchant [Lily];
Eleanor Bron [Woman Doctor], Denholm Elliott [Mr. Smith; abortionist];
Alfie Bass [Harry]
How would I parse this into a list of two-tuples in the form of [(actor, character), ...]?
--> [('Shelley Winters', 'Ruby'), ('Millicent Martin', 'Siddie'),
('Denholm Elliott', 'Mr. Smith; abortionist')]
I originally had:
actors = [item.strip().rstrip(']') for item in re.split(r'\[|,|;', data['actors'])]
data['actors'] = [(actors[i], actors[i + 1]) for i in range(0, len(actors), 2)]
But this doesn't quite work, as it also splits up items within brackets.
You can go with something like:
>>> re.findall(r'(\w[\w\s\.]+?)\s*\[([\w\s;\.,]+)\][,;\s$]*', s)
[('Shelley Winters', 'Ruby'),
('Millicent Martin', 'Siddie'),
('Julia Foster', 'Gilda'),
('Jane Asher', 'Annie'),
('Shirley Ann Field', 'Carla'),
('Vivien Merchant', 'Lily'),
('Eleanor Bron', 'Woman Doctor'),
('Denholm Elliott', 'Mr. Smith; abortionist'),
('Alfie Bass', 'Harry')]
One can also simplify some things with .*?:
re.findall(r'(\w.*?)\s*\[(.*?)\][,;\s$]*', s)
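For example, with a shortened version of the sample data:

```python
import re

s = ("Shelley Winters [Ruby]; Millicent Martin [Siddie]; "
     "Denholm Elliott [Mr. Smith; abortionist]; Alfie Bass [Harry]")

# Lazy groups stop at the first '[' and ']', so the bracketed part stays intact.
pairs = re.findall(r'(\w.*?)\s*\[(.*?)\][,;\s$]*', s)
print(pairs)
```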
actorList = []
dataList = []
inputData = inputData.replace("];", "\n")
inputData = inputData.replace("],", "\n")
inputData = inputData[:-1]
for line in inputData.split("\n"):
    actorList.append(line.partition("[")[0])
    dataList.append(line.partition("[")[2])
togetherList = list(zip(actorList, dataList))
This is a bit of a hack, and I'm sure you can clean it up from here. I'll walk through this approach just to make sure you understand what I'm doing.
I am replacing both the ; and the , with a newline, which I will later use to split every pair onto its own line. Assuming your content isn't filled with stray ]; or ], sequences, this should work. However, you'll notice the last line will have a ] at the end because it didn't need a comma or semi-colon, so I splice it off with the third line.
Then, just using the partition function on each line that we created within your input string, we assign the left part to the actor list, the right part to the data list and ignore the bracket (which is at position 1).
After that, Python's very useful zip function should finish the job for us by associating the i-th element of each list together into a list of matched tuples.