I am doing some text mining, and therefore I need to lemmatize my documents after tokenization. So I have written a function that uses the Python NLP library spacy to convert my tokenized text column into a lemmatized text column. I assumed it would be easy and straightforward, but for some reason it does not work. My DataFrame looks like:
[screenshot of the DataFrame]
As mentioned before, I have written a function for lemmatizing lists of strings using spacy:
import spacy

de = spacy.load('de')

def lemmatizer(x):
    # run each token through the German pipeline and keep its lemma
    return [de(unicode(y))[0].lemma_ for y in x]
When I use it on a simple list of strings, it works fine:
[screenshot of the lemmatized output]
Problems occur when I try to use it on my filtered column using map.
removed_pd['test'] = removed_pd['filtered'].map(lambda x : lemmatizer(x))
[screenshot of the error traceback]
I don't know why because my lemmatizer function operates on lists and the column 'filtered' contains lists.
[screenshot of the 'filtered' column]
And using other list functions like len works fine:
removed_pd['test'] = removed_pd['filtered'].map(lambda x : len(x))
[screenshot of the resulting column]
I used the textblob-de Python package (PyPI link).
from textblob_de.lemmatizers import PatternParserLemmatizer

def lemmatize_text(text):
    _lemmatizer = PatternParserLemmatizer()
    # lemmatize returns a list of (lemma, tag) tuples, so take the first lemma
    return [_lemmatizer.lemmatize(w)[0][0] for w in text]

removed_pd['test'] = removed_pd['filtered'].apply(lemmatize_text)
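A quick sanity check on a plain list (the sample words are mine, not from the original data):

print(lemmatize_text(['Hunde', 'liefen']))

Note that PatternParserLemmatizer is constructed on every call; if the column is large, you may want to create it once outside the function.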
I have three lists:
id = [1, 3, 4]  # note: this shadows the built-in id()
text = ["hello", "hola", "salut"]
date = ["20-12-2020", "21-04-2018", "15-04-2016"]

# I then combined it all in one list:
new_list = list(zip(id, text, date))
# which looks like [(1, "hello", "20-12-2020"), (3, "hola", "21-04-2018"), (4, "salut", "15-04-2016")]
I want to delete the whole tuple if its text is not in English. To do this I installed langid and am using langid.classify.
I ran a loop on only the text and it's working, but I am unsure how to delete the whole entry, such as (3, "hola", "21-04-2018"), since "hola" is not English.
I am trying to build a new list that contains only the English entries. I then want to write the output list to an XML file.
To do that I have made a sample XML file, and I am using the date as a parent key since the date can be the same for multiple texts.
Try this simple for loop. Note that you can't safely remove items from a list while iterating over it, so iterate over a copy and remove from the original:

new_list = [(1, "hello", "20-12-2020"), (3, "hola", "21-04-2018"), (4, "salut", "15-04-2016")]

for x in new_list[:]:  # new_list[:] is a shallow copy
    # condition to check if a word or sentence is English
    if not isEnglishWord(x[1]):
        new_list.remove(x)

(isEnglishWord is a placeholder for whatever English check you use.)
Not sure how langid.classify works or what parameters it takes in, but something like this should work:

for i in range(len(new_list) - 1, -1, -1):  # walk the indices backwards
    if langid.classify(new_list[i][1])[0] != 'en':
        new_list.pop(i)

In this case, I'm assuming langid.classify takes in a str and returns a (language_code, score) tuple, where English comes back as 'en'.
I'm iterating over the indices in reverse so that popping an element doesn't shift the positions of the entries we haven't checked yet.
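For completeness, here's a minimal self-contained sketch using the actual langid package (assuming it is installed, e.g. via pip install langid):

import langid

new_list = [(1, "hello", "20-12-2020"), (3, "hola", "21-04-2018"), (4, "salut", "15-04-2016")]

# langid.classify returns a (language_code, score) tuple, e.g. ('en', -12.3)
english_only = [row for row in new_list if langid.classify(row[1])[0] == 'en']

Building a new list with a comprehension sidesteps the remove-while-iterating problem entirely.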
Currently, the output I am getting is in string format, and I am not sure how to convert that string to a pandas DataFrame.
I am getting 3 different tables in my output, all inside one string.
One of the following 2 solutions would work for me:
Convert that string output to 3 different DataFrames. OR
Change something in the function so that I get the output as 3 different DataFrames.
I have tried using RegEx to convert the string output to a DataFrame, but it won't work in my case since I want my output to be dynamic: it should work if I give it another input.
def column_ch(self, sample_count=10):
    report = render("header.txt")
    match_stats = []
    match_sample = []
    any_mismatch = False
    for column in self.column_stats:
        if not column["all_match"]:
            any_mismatch = True
            match_stats.append(
                {
                    "Column": column["column"],
                    "{} dtype".format(self.df1_name): column["dtype1"],
                    "{} dtype".format(self.df2_name): column["dtype2"],
                    "# Unequal": column["unequal_cnt"],
                    "Max Diff": column["max_diff"],
                    "# Null Diff": column["null_diff"],
                }
            )
            if column["unequal_cnt"] > 0:
                match_sample.append(
                    self.sample_mismatch(column["column"], sample_count, for_display=True)
                )
    if any_mismatch:
        for sample in match_sample:
            report += sample.to_string()
            report += "\n\n"
    print("type is", type(report))
    return report
Since you have a string, you can pass your string into a file-like buffer and then read it with pandas read_csv into a dataframe.
Assuming that your string with the dataframe is called dfstring, the code would look like this:
import io
import pandas as pd

bufdf = io.StringIO(dfstring)
df = pd.read_csv(bufdf, sep=???)
If your string contains multiple dataframes, split it with split and use a loop.
import io

dflist = []
for sdf in dfstring.split('\n\n'):  # '\n\n' seems to be the separator between two dataframes
    bufdf = io.StringIO(sdf)
    dflist.append(pd.read_csv(bufdf, sep=???))
Be careful to pass an appropriate sep parameter; the ??? above means I cannot tell from your output what a proper value would be. Your fields are separated by spaces, so you could use sep='\s+', but I see that you also have spaces that are not meant to be separators, so this may cause a parsing error.
sep accepts a regex, so to treat 2 consecutive spaces as the separator you could use sep='\s\s+' (this requires the additional parameter engine='python'). But again, be sure that you have at least 2 spaces between any two consecutive fields.
See here for reference about the io module and StringIO.
Note that this usage of io.StringIO is for Python 3; in Python 2 the equivalent lives under another name (the StringIO module). But since the latest pandas versions require Python 3, I guess you are using Python 3.
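To make the whole thing concrete, here is a minimal self-contained sketch; the table content below is made up, and it assumes fields are separated by at least two spaces:

import io
import pandas as pd

# made-up report text: two whitespace-aligned tables separated by a blank line
dfstring = (
    "Column  Max Diff\n"
    "a       1.5\n"
    "b       0.0\n"
    "\n"
    "Column  # Null Diff\n"
    "a       2\n"
    "b       0\n"
)

dflist = []
for sdf in dfstring.split('\n\n'):
    # 2+ spaces as the separator; engine='python' is needed for a regex sep
    dflist.append(pd.read_csv(io.StringIO(sdf), sep=r'\s\s+', engine='python'))

print(dflist[1])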
I have a pandas DataFrame like the one below. I want to convert all the text to lowercase. How can I do this in Python?
Sample of data frame
[Nah I don't think he goes to usf, he lives around here though]
[Even my brother is not like to speak with me., They treat me like aids patent.]
[I HAVE A DATE ON SUNDAY WITH WILL!, !]
[As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers., Press *9 to copy your friends Callertune]
[WINNER!!, As a valued network customer you have been selected to receivea £900 prize reward!, To claim call 09061701461., Claim code KL341., Valid 12 hours only.]
What I tried
def toLowercase(fullCorpus):
    lowerCased = [sentences.lower() for sentences in fullCorpus['sentTokenized']]
    return lowerCased
I get this error
lowerCased = [sentences.lower() for sentences in fullCorpus['sentTokenized']]
AttributeError: 'list' object has no attribute 'lower'
It is easy:
df.applymap(str.lower)
or
df['col'].apply(str.lower)
df['col'].map(str.lower)
Okay, you have lists in rows. Then:
df['col'].map(lambda x: list(map(str.lower, x)))
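For example, on a toy frame shaped like the sample above (a minimal sketch; the column name sentTokenized is taken from the question):

import pandas as pd

df = pd.DataFrame({'sentTokenized': [['WINNER!!', 'Valid 12 hours only.'],
                                     ['I HAVE A DATE ON SUNDAY WITH WILL!', '!']]})

df['sentTokenized'] = df['sentTokenized'].map(lambda x: list(map(str.lower, x)))
print(df['sentTokenized'].iloc[0])  # ['winner!!', 'valid 12 hours only.']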
You can also cast the column to string, use str.lower, and convert back to lists:
import ast
df.sentTokenized.astype(str).str.lower().transform(ast.literal_eval)
You can try using apply and map:
def toLowercase(fullCorpus):
    lowerCased = fullCorpus['sentTokenized'].apply(lambda row: list(map(str.lower, row)))
    return lowerCased
There is also a nice way to do it with numpy (note that np.char.lower returns numpy arrays rather than lists):

import numpy as np

fullCorpus['sentTokenized'] = [np.char.lower(x) for x in fullCorpus['sentTokenized']]
The API here: https://api.bitfinex.com/v2/tickers?symbols=ALL
does not have any labels, and I want to extract all of the tBTCUSD, tLTCUSD, etc. Basically everything without numbers. Normally, I would extract this information if it were labeled, so I could do something like:
data['name']
or something like that, but this API does not have labels. How can I get this info with Python?
You can do it like this:
import requests

j = requests.get('https://api.bitfinex.com/v2/tickers?symbols=ALL').json()
mydict = {}
for i in j:
    mydict[i[0]] = i[1:]
Or using dictionary comprehension:
mydict = {i[0]: i[1:] for i in j}
Then access it as:
mydict['tZRXETH']
I don't have access to Python right now, but it looks like they're organized in a superarray of several subarrays.
You should be able to extract everything (the superarray) as data, and then do:

for array in data:
    print(array[0])

Not sure if this answers your question. Let me know!
Even if it doesn't have labels (or, more specifically, even if it's not a JSON object), it's still perfectly legal JSON, since it's just some arrays contained within a parent array.
Assuming you can already get the text from the api, you can load it as a Python object using json.loads:
import json
data = json.loads(your_data_as_string)
Then, since the labels you want to extract are always in the first position of the arrays, you can store them in a list using a list comprehension:
labels = [x[0] for x in data]
labels will be:
['tBTCUSD', 'tLTCUSD', 'tLTCBTC', 'tETHUSD', 'tETHBTC', 'tETCBTC', ...]
I'm trying to extract a delimited string as a list using PyParsing, as follows:
from pyparsing import *
string = "arm + mips + x86"
pattern = delimitedList(Word(printables), delim="+")("architectures")
result = pattern.parseString(string)
print(dict(result))
The problem is that this prints
{'architectures': (['arm', 'mips', 'x86'], {})}
which is the string representation of a ParseResults object. However, I would like the result to be a Python list:
{'architectures': ['arm', 'mips', 'x86']}
I've looked into doing this with setParseAction, but I wasn't able to figure out how to achieve it with that method's API. I would actually like to apply list() to the entire ParseResults, but setParseAction functions have to take the original string, location, and tokens as input (cf. http://pyparsing.wikispaces.com/HowToUsePyparsing).
How can I post-process the result to make it a list?
Converting PaulMcG's comment to an answer: result.asDict() yields the desired result.
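For completeness, the question's snippet with that one change applied:

from pyparsing import Word, delimitedList, printables

string = "arm + mips + x86"
pattern = delimitedList(Word(printables), delim="+")("architectures")
result = pattern.parseString(string)

# asDict converts the nested ParseResults into plain Python dicts and lists
print(result.asDict())  # {'architectures': ['arm', 'mips', 'x86']}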