Faster way to flatten list in Pandas Dataframe - python

I have a dataframe below:
import pandas
df = pandas.DataFrame({"terms" : [[['the', 'boy', 'and', 'the goat'],['a', 'girl', 'and', 'the cat']], [['fish', 'boy', 'with', 'the dog'],['when', 'girl', 'find', 'the mouse'], ['if', 'dog', 'see', 'the cat']]]})
My desired outcome is as follows:
df2 = pandas.DataFrame({"terms" : ['the boy and the goat','a girl and the cat', 'fish boy with the dog','when girl find the mouse', 'if dog see the cat']})
Is there a simple way to accomplish this without having to use a for loop to iterate through each row for each element and substring:
result = pandas.DataFrame()
for i in range(len(df.terms.tolist())):
    x = df.terms.tolist()[i]
    for y in x:
        z = str(y).replace(",",'').replace("'",'').replace('[','').replace(']','')
        flattened = pandas.DataFrame({'flattened_term':[z]})
        result = result.append(flattened)
print(result)
Thank you.

There is certainly no way to avoid loops here, at least not implicitly. Pandas was not designed to handle list objects as elements; it deals magnificently with numeric data and pretty well with strings. In any case, your fundamental problem is that you are calling DataFrame.append in a loop, which is a quadratic-time algorithm (the entire data frame is re-created on each iteration). You can probably just get away with the following, which should be significantly faster:
>>> df
terms
0 [[the, boy, and, the goat], [a, girl, and, the...
1 [[fish, boy, with, the dog], [when, girl, find...
>>> pandas.DataFrame([' '.join(term) for row in df.itertuples() for term in row.terms])
0
0 the boy and the goat
1 a girl and the cat
2 fish boy with the dog
3 when girl find the mouse
4 if dog see the cat
>>>
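For reference, here is a self-contained sketch of the same idea that builds the result in one pass instead of appending row by row (DataFrame.append was removed in pandas 2.0, so the loop in the question no longer runs on current versions). The explode-based variant is an alternative not mentioned above; it assumes every element of terms is a list of lists of strings.

```python
import pandas as pd

df = pd.DataFrame({"terms": [
    [["the", "boy", "and", "the goat"], ["a", "girl", "and", "the cat"]],
    [["fish", "boy", "with", "the dog"],
     ["when", "girl", "find", "the mouse"],
     ["if", "dog", "see", "the cat"]],
]})

# Variant 1: plain nested comprehension, one DataFrame built at the end.
flat = [" ".join(term) for row in df["terms"] for term in row]
df2 = pd.DataFrame({"terms": flat})

# Variant 2: explode the outer lists into rows, then join each inner list.
alt = df["terms"].explode().str.join(" ").reset_index(drop=True)

print(df2)
```

Both variants still loop internally; the gain comes from creating the frame once rather than once per row.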

Related

Get an item value from a nested dictionary inside the rows of a pandas df and get rid off the rest

I implemented allennlp's OIE, which extracts subject, predicate, object information (in the form of ARG0, V, ARG1 etc) embedded in nested strings. However, I need to make sure that each output is linked to the given ID of the original sentence.
I produced the following pandas dataframe, where OIE output contains the raw output of the allennlp algorithm.
Current output:
sentence | ID | OIE output
'The girl went to the cinema' | 'abcd' | {'verbs':[{'verb': 'went', 'description':'[ARG0: The girl] [V: went] [ARG1:to the cinema]'}]}
'He is right and he is an engineer' | 'efgh' | {'verbs':[{'verb': 'is', 'description':'[ARG0: He] [V: is] [ARG1:right]'}, {'verb': 'is', 'description':'[ARG0: He] [V: is] [ARG1:an engineer]'}]}
My code to get the above table:
oie_l = []
for sent in sentences:
    oie_pred = predictor_oie.predict(sentence=sent)  # allennlp oie predictor
    for d in oie_pred['verbs']:  # get to the nested info
        d.pop('tags')  # remove unnecessary info
    oie_l.append(oie_pred)
df['OIE output'] = oie_l  # add new column to df
Desired output:
sentence | ID | OIE Triples
'The girl went to the cinema' | 'abcd' | '[ARG0: The girl] [V: went] [ARG1:to the cinema]'
'He is right and he is an engineer' | 'efgh' | '[ARG0: He] [V: is] [ARG1:right]'
'He is right and he is an engineer' | 'efgh' | '[ARG0: He] [V: is] [ARG1:an engineer]'
Approach idea:
To get to the desired output of 'OIE Triples', I was considering transforming the initial 'OIE output' into a string and then using a regular expression to extract the ARGs. However, I am not sure if this is the best solution, as the ARGs can vary. Another approach would be to iterate over the nested description values, replace what is currently in 'OIE output' with a list of them, and then use the df.explode() method to expand it, so that the right sentence and ID stay linked to each triple after exploding.
Any advice is appreciated.
Your second idea should do the trick:
import ast
df["OIE Triples"] = df["OIE output"].apply(ast.literal_eval)
df["OIE Triples"] = df["OIE Triples"].apply(lambda val: [a_dict["description"]
                                                         for a_dict in val["verbs"]])
df = df.explode("OIE Triples").drop(columns="OIE output")
In case the "OIE output" values are not truly dicts but strings, we first convert them to dicts via ast.literal_eval. (If they already are dicts, skip the first 2 lines and apply the lambda to df["OIE output"] directly.)
Then, for each value of the series, we build a list of the "description" entries found under the outermost "verbs" key.
Finally, we explode these description lists and drop the "OIE output" column, as it is no longer needed.
This gives:
sentence ID OIE Triples
0 'The girl went to the cinema' 'abcd' [ARG0: The girl] [V: went] [ARG1:to the cinema]
1 'He is right and he is an engineer' 'efgh' [ARG0: He] [V: is] [ARG1:right]
1 'He is right and he is an engineer' 'efgh' [ARG0: He] [V: is] [ARG1:an engineer]
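A self-contained sketch of the same steps, for the case where the "OIE output" column already holds dicts (so ast.literal_eval is not needed). The sample frame is rebuilt by hand from the table in the question rather than produced by allennlp.

```python
import pandas as pd

df = pd.DataFrame({
    "sentence": ["The girl went to the cinema",
                 "He is right and he is an engineer"],
    "ID": ["abcd", "efgh"],
    "OIE output": [
        {"verbs": [{"verb": "went",
                    "description": "[ARG0: The girl] [V: went] [ARG1:to the cinema]"}]},
        {"verbs": [{"verb": "is",
                    "description": "[ARG0: He] [V: is] [ARG1:right]"},
                   {"verb": "is",
                    "description": "[ARG0: He] [V: is] [ARG1:an engineer]"}]},
    ],
})

# One list of description strings per row.
df["OIE Triples"] = df["OIE output"].apply(
    lambda val: [verb["description"] for verb in val["verbs"]]
)

# One row per description; sentence and ID are repeated automatically.
out = df.explode("OIE Triples").drop(columns="OIE output")
print(out)
```

explode keeps the original index (0, 1, 1 here); call reset_index(drop=True) afterwards if a fresh index is preferred.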

Most efficient way to remove list elements from column

I have a large DataFrame df with 40k+ rows:
words
0 coffee in the morning
1 good morning
2 hello my name is
3 hello world
...
I have a list of English stopwords that I want to remove from the column, which I do as follows:
df["noStopwords"] = df["words"].apply(lambda x: [i for i in x if i not in stopwords])
noStopwords
0 coffee morning
1 good morning
2 hello name
3 hello world
...
This works but takes too long. Is there an efficient way of doing this?
EDIT:
print(stopwords)
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', ...]
Not sure if it's any faster, but you could try str.replace. Since each value in words is actually a list, we have to join before replacing and split again afterwards.
import re
pattern = rf"\b(?:{'|'.join(map(re.escape, stopwords))})\b"
df['noStopwords'] = (df.words
                     .str.join(' ')
                     .str.replace(pattern, '', regex=True)
                     .str.strip()  # avoid empty strings at the ends after removal
                     .str.split(r'\s+')
                     )
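Another option, not taken from the answer above, is to keep the original list comprehension but test membership against a set instead of a list: a lookup in a list scans every stopword, while a set lookup takes roughly constant time. This is a minimal sketch; the short stopword list and the four sample rows are made up for illustration, and the words column is assumed to hold lists of tokens.

```python
import pandas as pd

stopwords = ["a", "in", "the", "my", "is"]  # hypothetical short list
df = pd.DataFrame({"words": [
    ["coffee", "in", "the", "morning"],
    ["good", "morning"],
    ["hello", "my", "name", "is"],
    ["hello", "world"],
]})

# Build the set once, outside the loop.
stopword_set = set(stopwords)

df["noStopwords"] = [
    [word for word in words if word not in stopword_set]
    for words in df["words"]
]
print(df)
```

Whether this is fast enough for a given data set is best checked by timing it on a sample of the real column.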

Pretty print table from 2d list with tabs

I've tried to use pandas and PrettyTable but neither of them helped me in my case.
Here is my case:
left_headers = ['Numbers', 'Animals', 'Names', 'Flowers']
data = [
[1, 2, 3, 4, 5, 6],
['dog', 'cat', 'rabbit', 'elephant', 'hyena', 'kangaroo'],
['short name', 'a very long name', '123', 'some text', 'different name', 'another name'],
['tulip', 'cactus', 'daffodil', 'hydrangea', 'geranium', 'rose']
]
Now I want it in this form:
Numbers 1 2 3 4 5 6
Animals dog cat rabbit elephant hyena kangaroo
Names short name a very long name 123 some text different name another name
Flowers tulip cactus daffodil hydrangea geranium rose
Data is separated by tabs, not spaces. The first characters of each column should be aligned.
The main idea: headers are on the left side. All data (and headers) are separated by some number of tabs. My problem is that I don't know how to predict how many tabs I need to fit the data. I want to use as few tabs as possible so that all data fits with minimal space, but there should be at least one gap between adjacent cells (like between "Numbers" and "1").
Edit: I did it with very ugly code. I added my answer.
You can use pandas to achieve this:
import pandas as pd
left_headers = ['Numbers', 'animals', 'name', 'flowers']
data = [
[1, 2, 3, 4, 5, 6],
['dog', 'cat', 'rabbit', 'elephant', 'hyena', 'kangaroo'],
['short name', 'a very long name', '123', 'some text', 'different name', 'another name'],
['tulip', 'cactus', 'daffodil', 'hydrangea', 'geranium', 'rose']
]
df = pd.DataFrame(data, index=left_headers)
print(df.to_string(header=False))
The output is:
Numbers 1 2 3 4 5 6
animals dog cat rabbit elephant hyena kangaroo
name short name a very long name 123 some text different name another name
flowers tulip cactus daffodil hydrangea geranium rose
The answer depends on the required output format
1. With one tab (\t) separation
With tab (\t) separation it is very easy to print it:
for header, items in zip(left_headers, data):
    # sep='\t' avoids the extra spaces print() would otherwise put around the tab
    print(header, '\t'.join(map(str, items)), sep='\t')
Output:
Numbers 1 2 3 4 5 6
animals dog cat rabbit elephant hyena kangaroo
name short name a very long name 123 some text different name another name
flowers tulip cactus daffodil hydrangea geranium rose
Short explanation
map(str, items) converts every item to a string (one row contains integers, so this is needed).
'\t'.join(lst) creates a new string from the items in lst, joined with \t.
zip(lst1, lst2) iterates over two lists in parallel, taking one element at a time from each.
2. With space separation (equal width columns)
This is a one-liner with tabulate:
from tabulate import tabulate
print(tabulate(data, showindex=left_headers, tablefmt='plain'))
Output
Numbers 1 2 3 4 5 6
animals dog cat rabbit elephant hyena kangaroo
name short name a very long name 123 some text different name another name
flowers tulip cactus daffodil hydrangea geranium rose
3. With variable tab separation
This is the toughest one. You have to make an assumption about how the program that displays the output handles tabs. Here it is assumed that one tab equals 4 spaces.
import math

SPACES_PER_TAB = 4
table = [[str(item) for item in items] for items in data]
for header, items in zip(left_headers, table):
    items.insert(0, header)
offset_table = []  # in tabs
for col in zip(*table):
    lengths = [len(x) for x in col]
    cell_length = math.ceil(max(lengths)/SPACES_PER_TAB)*SPACES_PER_TAB
    offsets_s = [cell_length - length for length in lengths]  # in spaces
    additional_tabs = 1 if min(offsets_s) == 0 else 0
    offsets = [math.ceil(o/SPACES_PER_TAB) + additional_tabs for o in offsets_s]
    offset_table.append(offsets)
with open('table_out.txt', 'w') as f:
    for row, row_offsets in zip(table, zip(*offset_table)):
        for item, offset in zip(row, row_offsets):
            f.write(item)
            f.write('\t'*offset)
        f.write('\n')
The output looks like this (tabs copied here won't look good, so here's a printscreen from Notepad++)
Short explanation
First, we just create one table called table that contains the headers and the data as strings.
Then, for each column, we calculate the cell width (in spaces) by rounding the longest entry up to a multiple of the tab width, and from that the number of tabs each cell needs. One additional tab is added to the whole column if some cell would otherwise end up with no space before the next cell.
Here the builtin zip() is really put to work, and it is used for example to transpose lists of lists by zip(*lst).
Finally, the results are written into an output file.
I did it!
My code is not simple but does what I want:
left_headers = ['Numbers', 'Animals', 'Names', 'Flowers']
data = [
[1, 2, 3, 4, 5, 6],
['dog', 'cat', 'rabbit', 'elephant', 'hyena', 'kangaroo'],
['short name', 'a very long name', '123', 'some text', 'different name', 'another name'],
['tulip', 'cactus', 'daffodil', 'hydrangea', 'geranium', 'rose']
]
for i in range(len(left_headers)):
    print(left_headers[i], end="\t")
    how_many_tabs_do_i_need = max([len(h) for h in left_headers]) // 4
    how_many_tabs_actual_word_has = len(left_headers[i]) // 4
    print("\t"*(how_many_tabs_do_i_need-how_many_tabs_actual_word_has), end="")
    for j in range(len(data[0])):
        how_many_tabs_do_i_need = max([len(str(data[k][j])) for k in range(len(left_headers))]) // 4
        how_many_tabs_actual_word_has = len(str(data[i][j])) // 4
        print(str(data[i][j]) +"\t"*(how_many_tabs_do_i_need - how_many_tabs_actual_word_has + 1), end="")
    print()
The output:
Numbers 1 2 3 4 5 6
Animals dog cat rabbit elephant hyena kangaroo
Names short name a very long name 123 some text different name another name
Flowers tulip cactus daffodil hydrangea geranium rose
If anyone can simplify the code, the question is still open.
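One possible simplification, as a sketch: the same tab arithmetic can be applied uniformly to the header column and the data columns. The function name tab_aligned_rows and the tab_width parameter are introduced here for illustration, and the result only lines up in a viewer that really uses tab stops every tab_width characters (4 is an assumption carried over from the code above).

```python
def tab_aligned_rows(headers, data, tab_width=4):
    """Return one tab-separated line per row, padded so columns line up."""
    # Put the header in front of each row and convert everything to str.
    table = [[str(h)] + [str(x) for x in row] for h, row in zip(headers, data)]
    # Width of each column in tab stops; "+ 1" guarantees at least one tab
    # after the longest cell of the column.
    stops = [max(len(cell) for cell in col) // tab_width + 1
             for col in zip(*table)]
    lines = []
    for row in table:
        parts = [cell + "\t" * (stop - len(cell) // tab_width)
                 for cell, stop in zip(row, stops)]
        lines.append("".join(parts).rstrip("\t"))
    return lines


left_headers = ['Numbers', 'Animals', 'Names', 'Flowers']
data = [
    [1, 2, 3, 4, 5, 6],
    ['dog', 'cat', 'rabbit', 'elephant', 'hyena', 'kangaroo'],
    ['short name', 'a very long name', '123', 'some text', 'different name', 'another name'],
    ['tulip', 'cactus', 'daffodil', 'hydrangea', 'geranium', 'rose'],
]
lines = tab_aligned_rows(left_headers, data)
print("\n".join(lines))
```

A cell of length n ends inside tab stop n // tab_width, so adding (stops - n // tab_width) tabs moves the cursor to the start of the next column for every row.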

Replacing text in tags

I have been having problems trying to find a way to replace tags in my strings in Python.
What I have at the moment is the text:
you should buy a {{cat_breed + dog_breed}} or a {{cat_breed + dog_breed}}
Where cat_breed and dog_breed are lists of cat and dog breeds.
What I want to end up with is:
you should buy a Scottish short hair or a golden retriever
I want the tag to be replaced by a random entry in one of the two lists.
I have been looking at re.sub() but I do not know how to fix the problem and not just end up with the same result in both tags.
Use random.sample to get two unique elements from the population.
import random
cats = 'lazy cat', 'cuddly cat', 'angry cat'
dogs = 'dirty dog', 'happy dog', 'shaggy dog'
print("you should buy a {} or a {}".format(*random.sample(dogs + cats, 2)))
There's no reason to use regular expressions here. Just use str.format instead.
I hope the snippet below gives you an idea of how to complete your task:
import random
list1 = ['cat_breed1', 'cat_breed2']
list2 = ['dog_breed1', 'dog_breed2']
a = random.choice(list1)
b = random.choice(list2)
sentence = "you should buy a %s or a %s" %(a, b)
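If the {{...}} tags have to stay in the template, re.sub can also be given a function instead of a fixed replacement string; the function is called once per match, so each tag can receive a different value. The sketch below is one way to do that and is not taken from the answers above: fill_tags is a hypothetical helper, the breed lists are made-up sample data, and the tag contents are ignored (every tag draws from the same combined pool).

```python
import random
import re

cat_breed = ["Scottish short hair", "Siamese"]
dog_breed = ["golden retriever", "beagle"]
template = "you should buy a {{cat_breed + dog_breed}} or a {{cat_breed + dog_breed}}"


def fill_tags(text, choices):
    """Replace each {{...}} tag with a different random entry from choices."""
    remaining = list(choices)
    random.shuffle(remaining)
    # pop() hands out each entry at most once, so no two tags get the same value.
    return re.sub(r"\{\{.*?\}\}", lambda match: remaining.pop(), text)


result = fill_tags(template, cat_breed + dog_breed)
print(result)
```

pop() raises IndexError if the text contains more tags than there are choices; use random.choice inside the lambda instead if repeats are acceptable.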

Sorting a list of strings in Python [duplicate]

Possible Duplicate:
How do I sort a list of strings in Python?
How do I sort unicode strings alphabetically in Python?
I have a list of strings list and want to sort it alphabetically. When I call list.sort() the first part of the list contains the entries starting with upper case letters sorted alphabetically, the second part contains the sorted entries starting with a lower case letter. Like so:
Airplane
Boat
Car
Dog
apple
bicycle
cow
doctor
I googled for an answer but didn't come up with a working algorithm. I read about the locale module and the sort parameters cmp and key. Often a lambda was used together with sort, which did not make things any easier for me to understand.
How can I get from:
list = ['Dog', 'bicycle', 'cow', 'doctor', 'Car', 'Boat', 'apple', 'Airplane']
to:
Airplane
apple
bicycle
Boat
Car
cow
doctor
Dog
Characters of foreign languages should be taken into account (like ä, é, î).
Use case-insensitive comparison:
>>> sorted(['Dog', 'bicycle', 'cow', 'doctor', 'Car', 'Boat',
...         'apple', 'Airplane'], key=str.lower)
['Airplane', 'apple', 'bicycle', 'Boat', 'Car', 'cow', 'doctor', 'Dog']
This is actually the way suggested on the python wiki about sorting:
Starting with Python 2.4, both list.sort() and sorted() added a key
parameter to specify a function to be called on each list element
prior to making comparisons.
For example, here's a case-insensitive string comparison:
>>> sorted("This is a test string from Andrew".split(), key=str.lower)
['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']
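The key=str.lower approach above does not address the accented characters mentioned in the question (ä, é, î), which sort after all unaccented letters by default. A rough sketch of one workaround is to strip the accents inside the key function; sort_key is a name introduced here, and the two accented sample words are added for illustration. This is only an approximation of alphabetical order, not true locale-aware collation (for that, locale.strxfrm with a suitable locale set is the standard-library route, and its result depends on the system).

```python
import unicodedata


def sort_key(text):
    """Case-insensitive key that ignores accents (approximate ordering)."""
    # NFKD splits 'ä' into 'a' plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


words = ['Dog', 'bicycle', 'cow', 'doctor', 'Car', 'Boat',
         'apple', 'Airplane', 'äpfel', 'école']
result = sorted(words, key=sort_key)
print(result)
# ['Airplane', 'äpfel', 'apple', 'bicycle', 'Boat', 'Car', 'cow', 'doctor', 'Dog', 'école']
```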
There seems to be a good overview of that topic here:
http://wiki.python.org/moin/HowTo/Sorting/
Scroll down about a page to here;
For example, here's a case-insensitive string comparison:
>>> sorted("This is a test string from Andrew".split(), key=str.lower)
