Lowercase sentences in lists in pandas dataframe - python

I have a pandas data frame like below. I want to convert all the text into lowercase. How can I do this in python?
Sample of data frame
[Nah I don't think he goes to usf, he lives around here though]
[Even my brother is not like to speak with me., They treat me like aids patent.]
[I HAVE A DATE ON SUNDAY WITH WILL!, !]
[As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers., Press *9 to copy your friends Callertune]
[WINNER!!, As a valued network customer you have been selected to receivea £900 prize reward!, To claim call 09061701461., Claim code KL341., Valid 12 hours only.]
What I tried
def toLowercase(fullCorpus):
    lowerCased = [sentences.lower() for sentences in fullCorpus['sentTokenized']]
    return lowerCased
I get this error
lowerCased = [sentences.lower()for sentences in fullCorpus['sentTokenized']]
AttributeError: 'list' object has no attribute 'lower'

It is easy:
df.applymap(str.lower)
or
df['col'].apply(str.lower)
df['col'].map(str.lower)
Okay, you have lists in rows. Then:
df['col'].map(lambda x: list(map(str.lower, x)))

You can also convert each list to a string, lower-case it with str.lower, and parse it back into a list:
import ast
df.sentTokenized.astype(str).str.lower().transform(ast.literal_eval)

You can try using apply and map:
def toLowercase(fullCorpus):
    lowerCased = fullCorpus['sentTokenized'].apply(lambda row: list(map(str.lower, row)))
    return lowerCased

There is also a nice way to do it with numpy:
import numpy as np
fullCorpus['sentTokenized'] = [np.char.lower(x) for x in fullCorpus['sentTokenized']]
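Putting the apply approach together, a minimal runnable sketch (the two-row corpus below is made up to match the question's column name and structure):

```python
import pandas as pd

# Hypothetical corpus matching the question's 'sentTokenized' column of lists
fullCorpus = pd.DataFrame({
    'sentTokenized': [
        ["Nah I don't think he goes to usf", "he lives around here though"],
        ["I HAVE A DATE ON SUNDAY WITH WILL!", "!"],
    ]
})

# Lower-case every sentence inside every list
fullCorpus['sentTokenized'] = fullCorpus['sentTokenized'].apply(
    lambda sentences: [s.lower() for s in sentences]
)

print(fullCorpus['sentTokenized'][1])  # -> ['i have a date on sunday with will!', '!']
```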


How can I sort the list in this order by popularity (most number of times)?

So I have a CSV file which contains data like this:
1,8dac2b,ewmzr,jewelry,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
2,668d39,aeqok,furniture,phone1,9759243157894736,jp,50.201.125.84,jmqlhflrzwuay9c
3,622r49,arqek,vehicle,phone2,9759544365415694736,az,53.001.135.54,weqlhrerreuert6f
4,6444t43,rrdwk,vehicle,phone9,9759543263245434353,au,54.241.234.64,weqqyqtqwrtert6f
and I'm trying to use a function def popvote(list) to return the most popular item in the auction, which in this example is vehicle.
So I want my function to return the most popular value in the 4th column, which is vehicle in this case.
This is what I have so far
def popvote(list):
    for x in list:
        g = list(x)
    return list.sort(g.sort)
However, this doesn't really work. What should I change to make it work?
Note: The answer should be returned as a set
Edit: so I'm trying to return the value that is repeated most in the list based on what's indicated in (** xxxx **) below
1,8dac2b,ewmzr,**jewelry**,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
2,668d39,aeqok,**furniture**,phone1,9759243157894736,jp,50.201.125.84,jmqlhflrzwuay9c
3,622r49,arqek,**vehicle**,phone2,9759544365415694736,az,53.001.135.54,weqlhrerreuert6f
4,6444t43,rrdwk,**vehicle**,phone9,9759543263245434353,au,54.241.234.64,weqqyqtqwrtert6f
So in this case, vehicle should be the output.
import pandas as pd
df = pd.read_csv("filename.csv")
most_common = df[df.columns[3]].value_counts().idxmax()
Any questions? Down in the comments.
An alternative solution could be (assuming you have your records as a list of lists):
from statistics import mode
mode(list(zip(*your_csv))[3])  # the item type is the 4th field (index 3)
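Since the question asks for the answer as a set, here is a stdlib-only sketch with csv and collections.Counter; the CSV content is inlined via StringIO so it runs without the file, and ties for first place all end up in the returned set:

```python
import csv
from collections import Counter
from io import StringIO

# Inline stand-in for the CSV file from the question
data = """1,8dac2b,ewmzr,jewelry,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
2,668d39,aeqok,furniture,phone1,9759243157894736,jp,50.201.125.84,jmqlhflrzwuay9c
3,622r49,arqek,vehicle,phone2,9759544365415694736,az,53.001.135.54,weqlhrerreuert6f
4,6444t43,rrdwk,vehicle,phone9,9759543263245434353,au,54.241.234.64,weqqyqtqwrtert6f"""

def popvote(rows):
    # Count the 4th field (index 3) of every row and return every
    # value tied for the highest count, as a set
    counts = Counter(row[3] for row in rows)
    top = max(counts.values())
    return {value for value, n in counts.items() if n == top}

rows = list(csv.reader(StringIO(data)))
print(popvote(rows))  # -> {'vehicle'}
```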

Using split function in Pandas bracket indexer

I'm trying to keep the text rows in a data frame that contain a specific word. I have tried the following:
df['hello' in df['text_column'].split()]
and received the following error:
'Series' object has no attribute 'split'
Note that I'm trying to check whether they contain a whole word, not a character sequence, so df[df['text_column'].str.contains('hello')] is not a solution: in that case 'helloss' or 'sshello' would also be matched.
One answer, besides a regex with word boundaries, is to use split combined with the map function like below:
df['keep_row'] = df['text_column'].map(lambda x: 'hello' in x.split())
df = df[df['keep_row']]
OR
df = df[df['text_column'].map(lambda x: 'hello' in x.split())]
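The word-boundary regex alternative looks like this; a small sketch with made-up rows, where \b keeps 'helloss' and 'sshello' from matching:

```python
import pandas as pd

df = pd.DataFrame({'text_column': ['say hello world', 'helloss there', 'sshello', 'hello']})

# \b matches word boundaries, so only whole-word 'hello' matches
mask = df['text_column'].str.contains(r'\bhello\b', regex=True)
print(df[mask])
```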

I have a list of lists how do i classify based on language?

I have three lists:
id = [1,3,4]
text = ["hello","hola","salut"]
date = ["20-12-2020","21-04-2018","15-04-2016"]
#I then combined it all in one list:
new_list = zip(id, text, date)
#which looks like [(1,"hello","20-12-2020"),(3,"hola","21-04-2018"),(4,"salut","15-04-2016")]
I want to delete the whole tuple if its text is not in English. To do this I installed langid and am using langid.classify.
I ran a loop over only the text and it's working, but I am unsure how to delete the whole tuple, such as (3,"hola","21-04-2018"), since "hola" is not English.
I am trying to build a new list that contains only the English entries. I then want to write the output list to an XML file.
For that I have made a sample XML file and am using the date as the parent key, since the date can be the same for multiple texts.
Try a simple list comprehension instead of a for loop:
new_list = [(1,"hello","20-12-2020"),(3,"hola","21-04-2018"),(4,"salut","15-04-2016")]
# condition to check if a word or sentence is English
new_list = [x for x in new_list if isEnglishWord(x[1])]
(A loop that calls new_list.pop(x) won't work: pop takes an index, not a tuple, and removing items from a list while iterating over it skips elements.)
Not sure exactly what parameters langid.classify takes, but something like this should work:
for i in range(len(new_list) - 1, -1, -1):
    if langid.classify(new_list[i][1])[0] != 'en':
        new_list.pop(i)
In this case, langid.classify takes a str and returns a (language_code, score) tuple, so we compare the language code against 'en'.
Iterating over the indices in reverse means popping an element never shifts the items we haven't visited yet.
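A self-contained sketch of the filtering pattern; detect_language below is a hypothetical stand-in for langid.classify (same (language_code, score) return shape), so this runs without installing langid:

```python
new_list = [(1, "hello", "20-12-2020"),
            (3, "hola", "21-04-2018"),
            (4, "salut", "15-04-2016")]

# Hypothetical stand-in for langid.classify: returns (language_code, score)
def detect_language(text):
    known = {"hello": "en", "hola": "es", "salut": "fr"}
    return (known.get(text.lower(), "en"), 0.0)

# Build a new list instead of popping from the one being iterated
english_only = [row for row in new_list if detect_language(row[1])[0] == "en"]
print(english_only)  # -> [(1, 'hello', '20-12-2020')]
```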

How to fix TypeError: can only concatenate str (not "list") to str

I am trying to learn python from the python crash course but this one task has stumped me and I can’t find an answer to it anywhere
The task is
Think of your favourite mode of transportation and make a list that stores several examples
Use your list to print a series of statements about these items
cars = ['rav4'], ['td5'], ['yaris'], ['land rover tdi']
print("I like the "+cars[0]+" ...")
I’m assuming that this is because I have letters and numbers together, but I don’t know how to produce a result without an error and help would be gratefully received
The error I get is
TypeError: can only concatenate str (not "list") to str
new_dinner = ['ali','zeshan','raza']
print('this is old friend', new_dinner)
Use a comma , instead of a plus +.
If you use a plus sign + as in print('this is old friend' + new_dinner), you will get this error.
Your first line actually produces a tuple of lists, hence cars[0] is a list.
If you print cars you'll see that it looks like this:
(['rav4'], ['td5'], ['yaris'], ['land rover tdi'])
Get rid of all the square brackets in between and you'll have a single list that you can index into.
This is one of the possibilities you can use to get the result needed.
It teaches you to import, use the format method, store datatypes in variables, and convert other datatypes into strings.
The main thing you have to do is convert the list (or the index you want) into a string, using the str() function. But the real problem is that you've created 4 lists, when you should only have one!
from pprint import pprint
cars = ['rav4'], ['td5'], ['yaris'], ['land rover tdi']
Word = str(cars[0])
pprint("I like the {0} ...".format(Word))
new_dinner = ['ali','zeshan','raza']
print ('this is old friend', str(new_dinner))
#Try turning the list into a string first
First, create a list (not tuple of lists) of strings and then you can access first element (string) of list.
cars = ['rav4', 'td5', 'yaris', 'land rover tdi']
print("I like the "+cars[0]+" ...")
The above code outputs: I like the rav4 ...
You can do it like this:
new_dinner = ['ali','zeshan','raza']
print ('this is old friend', *new_dinner)
Here is the answer:
cars = (['rav4'], ['td5'], ['yaris'], ['land rover tdi'])
print("I like the "+cars[0][0]+" ...")
What we have done here is index into the tuple first and then into the list inside it.
Since you are storing your data in a tuple of lists, this is your solution.
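For the original exercise (print a series of statements about the items), a minimal sketch using a single flat list and f-strings; the statement wording is made up:

```python
cars = ['rav4', 'td5', 'yaris', 'land rover tdi']

# One statement per vehicle, built with an f-string
statements = [f"I would love to own a {car}." for car in cars]
for line in statements:
    print(line)
```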

Map on a Pandas DataFrame Column containing lists

I am doing some text mining and therefore I need to lemmatize my documents after tokenization. So I have written a function that uses the Python NLP library spacy to convert my tokenized text column into a lemmatized text column. Actually I supposed that it would be easy and straightforward, but for some reason it does not work. My DataFrame looks like:
[screenshot of the DataFrame]
As mentioned before I have written a function for lemmatizing lists of strings using spacy:
import spacy
de = spacy.load('de')
def lemmatizer(x):
    return [de(unicode(y))[0].lemma_ for y in x]
When I use it on a simple list of strings, it works fine:
[screenshot of the working output]
Problems occur when I try to use it on my filtered column using map:
removed_pd['test'] = removed_pd['filtered'].map(lambda x: lemmatizer(x))
[screenshot of the error]
I don't know why, because my lemmatizer function operates on lists and the column 'filtered' contains lists.
And using other list functions like len works fine:
removed_pd['test'] = removed_pd['filtered'].map(lambda x: len(x))
[screenshot of the output]
I used the textblob-de Python package.
pypi link
from textblob_de.lemmatizers import PatternParserLemmatizer
def lemmatize_text(text):
    _lemmatizer = PatternParserLemmatizer()
    return [_lemmatizer.lemmatize(w)[0][0] for w in text]
removed_pd['test'] = removed_pd['filtered'].apply(lemmatize_text)
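The general pattern, sketched with a toy dictionary lemmatizer so it runs without spacy or textblob-de (the words, lemmas, and column names are made up; apply maps the function over each list-valued cell):

```python
import pandas as pd

removed_pd = pd.DataFrame({'filtered': [['Häuser', 'gingen'], ['Kinder']]})

# Toy lookup table standing in for a real German lemmatizer
LEMMAS = {'Häuser': 'Haus', 'gingen': 'gehen', 'Kinder': 'Kind'}

def lemmatize_text(tokens):
    # Map each token in the list to its lemma, falling back to the token itself
    return [LEMMAS.get(t, t) for t in tokens]

removed_pd['test'] = removed_pd['filtered'].apply(lemmatize_text)
print(removed_pd['test'].tolist())  # -> [['Haus', 'gehen'], ['Kind']]
```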
