I've tried pandas and PrettyTable, but neither of them helped in my case.
Here is my case:
left_headers = ['Numbers', 'Animals', 'Names', 'Flowers']
data = [
[1, 2, 3, 4, 5, 6],
['dog', 'cat', 'rabbit', 'elephant', 'hyena', 'kangaroo'],
['short name', 'a very long name', '123', 'some text', 'different name', 'another name'],
['tulip', 'cactus', 'daffodil', 'hydrangea', 'geranium', 'rose']
]
Now I want it in this form:
Numbers 1 2 3 4 5 6
Animals dog cat rabbit elephant hyena kangaroo
Names short name a very long name 123 some text different name another name
Flowers tulip cactus daffodil hydrangea geranium rose
The data is separated by tabs, not spaces, and the columns should line up at their first characters.
The main idea: the headers are on the left side, and all data (and headers) are separated by some number of tabs. My problem is that I don't know how to predict how many tabs I need to fit the data. I want to use as few tabs as possible while still fitting everything, but there should be at least one 'space' between entries (like between "Numbers" and "1").
Edit: I did it with very ugly code. I added my answer.
You can use pandas to achieve this:
import pandas as pd
left_headers = ['Numbers', 'animals', 'name', 'flowers']
data = [
[1, 2, 3, 4, 5, 6],
['dog', 'cat', 'rabbit', 'elephant', 'hyena', 'kangaroo'],
['short name', 'a very long name', '123', 'some text', 'different name', 'another name'],
['tulip', 'cactus', 'daffodil', 'hydrangea', 'geranium', 'rose']
]
df = pd.DataFrame(data, index=left_headers)
print(df.to_string(header=False))
The output is:
Numbers 1 2 3 4 5 6
animals dog cat rabbit elephant hyena kangaroo
name short name a very long name 123 some text different name another name
flowers tulip cactus daffodil hydrangea geranium rose
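If plain tab-separated output (a single tab between fields) is acceptable, the same DataFrame can also be written out directly with to_csv; a small sketch reusing the df built above (the file name is just an example). Note that a single tab per field will not visually align wide columns.
# Each row is written as "<header>\t<item>\t<item>...", with no header row
df.to_csv('table_out.tsv', sep='\t', header=False)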
The answer depends on the required output format
1. With one tab (\t) separation
With tab (\t) separation it is very easy to print it:
for header, items in zip(left_headers, data):
    print(header, '\t', '\t'.join(map(str, items)))
Output:
Numbers 1 2 3 4 5 6
animals dog cat rabbit elephant hyena kangaroo
name short name a very long name 123 some text different name another name
flowers tulip cactus daffodil hydrangea geranium rose
Short explanation
map(str, items) turns a list of items into a list of strings (one of the lists contains integers, so this is needed)
'\t'.join(lst) creates a new string from the items in a list lst, joining them with \t.
zip(lst1, lst2) is used to iterate over two lists, taking one element at a time from each.
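A quick illustration of those three building blocks (the values here are just examples):
list(map(str, [1, 2, 3]))        # ['1', '2', '3']  - every element becomes a string
'\t'.join(['a', 'b', 'c'])       # 'a\tb\tc'        - strings glued together with tabs
list(zip(['x', 'y'], [1, 2]))    # [('x', 1), ('y', 2)] - elements paired up from both lists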
2. With space separation (equal width columns)
This is a one-liner with tabulate:
from tabulate import tabulate
print(tabulate(data, showindex=left_headers, tablefmt='plain'))
Output
Numbers 1 2 3 4 5 6
animals dog cat rabbit elephant hyena kangaroo
name short name a very long name 123 some text different name another name
flowers tulip cactus daffodil hydrangea geranium rose
3. With variable tab separation
This is the toughest one. You have to make an assumption about how the tab character is handled by the program that reads the output. Here it is assumed that "tab = 4 spaces".
import math

SPACES_PER_TAB = 4

table = [[str(item) for item in items] for items in data]
for header, items in zip(left_headers, table):
    items.insert(0, header)

offset_table = []  # in tabs
for col in zip(*table):
    lengths = [len(x) for x in col]
    cell_length = math.ceil(max(lengths) / SPACES_PER_TAB) * SPACES_PER_TAB
    offsets_s = [cell_length - length for length in lengths]  # in spaces
    additional_tabs = 1 if min(offsets_s) == 0 else 0
    offsets = [math.ceil(o / SPACES_PER_TAB) + additional_tabs for o in offsets_s]
    offset_table.append(offsets)

with open('table_out.txt', 'w') as f:
    for row, row_offsets in zip(table, zip(*offset_table)):
        for item, offset in zip(row, row_offsets):
            f.write(item)
            f.write('\t' * offset)
        f.write('\n')
The output looks like this (tabs copied here won't render well, so here's a screenshot from Notepad++).
Short explanation
First, we just create one table called table that contains the headers and the data as strings.
Then, we calculate the lengths of the cells (in spaces), rounding each column up to a multiple of the tab width. One additional tab is added if some cell would otherwise end up with no space before the next cell.
Here the builtin zip() is really put to work, and it is used for example to transpose lists of lists by zip(*lst).
Finally, the results are written into an output file.
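To sanity-check the result, the file can be read back with the tabs expanded to spaces; a small check, reusing SPACES_PER_TAB from the snippet above (expandtabs must be given the same tab width the viewing program uses):
# Read the file back and expand tabs to 4 spaces to preview the alignment
with open('table_out.txt') as f:
    for line in f:
        print(line.rstrip('\n').expandtabs(SPACES_PER_TAB))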
I did it!
My code is not simple but does what I want:
left_headers = ['Numbers', 'Animals', 'Names', 'Flowers']
data = [
[1, 2, 3, 4, 5, 6],
['dog', 'cat', 'rabbit', 'elephant', 'hyena', 'kangaroo'],
['short name', 'a very long name', '123', 'some text', 'different name', 'another name'],
['tulip', 'cactus', 'daffodil', 'hydrangea', 'geranium', 'rose']
]
for i in range(len(left_headers)):
    print(left_headers[i], end="\t")
    how_many_tabs_do_i_need = max([len(h) for h in left_headers]) // 4
    how_many_tabs_actual_word_has = len(left_headers[i]) // 4
    print("\t"*(how_many_tabs_do_i_need - how_many_tabs_actual_word_has), end="")
    for j in range(len(data[0])):
        how_many_tabs_do_i_need = max([len(str(data[k][j])) for k in range(len(left_headers))]) // 4
        how_many_tabs_actual_word_has = len(str(data[i][j])) // 4
        print(str(data[i][j]) + "\t"*(how_many_tabs_do_i_need - how_many_tabs_actual_word_has + 1), end="")
    print()
The output:
Numbers 1 2 3 4 5 6
Animals dog cat rabbit elephant hyena kangaroo
Names short name a very long name 123 some text different name another name
Flowers tulip cactus daffodil hydrangea geranium rose
If anyone can simplify the code, the problem is still open.
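One possible simplification, as a sketch under the same "tab = 4 spaces" assumption; for this example data it prints the same output as the code above:
rows = [[h] + [str(x) for x in row] for h, row in zip(left_headers, data)]
# tabs after a cell = (widest cell in the column // 4) - (this cell // 4) + 1
col_max = [max(len(cell) for cell in col) for col in zip(*rows)]
for row in rows:
    line = ''
    for cell, widest in zip(row, col_max):
        line += cell + '\t' * (widest // 4 - len(cell) // 4 + 1)
    print(line)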
Related
I have two dataframes. I need to check whether each word from the first df is contained as a substring in any string of the second df, and get a list of the words that do appear in the second df.
First df(word):
word
apples
dog
cat
cheese
Second df(sentence):
sentence
apples grow on a tree
...
I love cheese
I tried this one:
tru = []
for i in word['word']:
    if i in sentence['sentence'].values:
        tru.append(i)
And this one:
tru = []
for i in word['word']:
    if sentence['sentence'].str.contains(i):
        tru.append(i)
I expect to get a list like ['apples',..., 'cheese']
One possible way is to use Series.str.extractall:
import re
import pandas as pd
df_word = pd.Series(["apples", "dog", "cat", "cheese"])
df_sentence = pd.Series(["apples grow on a tree", "i love cheese"])
pattern = f"({'|'.join(df_word.apply(re.escape))})"
matches = df_sentence.str.extractall(pattern)
matches
Output:
              0
  match
0 0      apples
1 0      cheese
You can then convert the results to a list:
matches[0].unique().tolist()
Output:
['apples', 'cheese']
I have a dataframe and a list as follows:
import pandas as pd

data = {"text": ["I have one apple and two bananas", "this is my apple", "she has three apples", "My friend has five apples but she only has one banana"]}
df = pd.DataFrame(data=data, columns=['text'])
my_list = ['one', 'two', 'three', 'four', 'five']
What I would like to have as an output is an extra column 'new_text' where, in the sentences containing words from the list, those words are replaced with each of the other words from my_list, so the output would look like this:
output:
text new_text
0 I have one apple and two bananas I have two apple and three bananas, I have three apple and four bananas,I have four apple and five bananas,I have five apple and one bananas,...
1 this is my apple this is my apple
2 she has three apples she has two apples,she has four apples,she has five apples,...
and so on...
The repetition of the same sentence and plural forms do not matter; the only important thing is that all the words from the list appear in the sentences in the 'new_text' column.
I have tried the code here: Python: Replace one word in a sentence with a list of words and put the new sentences in another column in pandas
with a change in step 1, but it only finds the first word:
data1 = data['text'].str.extract(
    r"(?i)(?P<before>.*)\s(?P<clock>\(?=\bone\b | \btwo\b | \bthree\b | \bfour\b | \bfive\b))\s(?P<after>.*)")
Thank you in advance
You can use this:
a = ["I have one apple and two bananas", "this is my apple",
"she has three apples", "My friend has five apples but she only has one banana"]
b = ['one','two','three','four','five', 'six']
new = []
for sentence in a:
newSen = []
for word in sentence.split():
if word in b:
newSen.append(b[b.index(word) + 1])
else:
newSen.append(word)
new.append(' '.join(newSen))
print(new) #output : ['I have two apple and three bananas', 'this is my apple', 'she has four apples', 'My friend has six apples but she only has two banana']
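If the goal is the 'new_text' column from the question (each number word replaced, one at a time, by every other word from my_list, with the variants joined by commas), a sketch along the same lines could look like this. The exact pairing of replacements differs from the example output, but every word from the list ends up in the column, which the question says is the only requirement:
import pandas as pd

data = {"text": ["I have one apple and two bananas", "this is my apple",
                 "she has three apples",
                 "My friend has five apples but she only has one banana"]}
df = pd.DataFrame(data=data, columns=['text'])
my_list = ['one', 'two', 'three', 'four', 'five']

def variants(sentence):
    # For each occurrence of a listed word, build one sentence per replacement word
    words = sentence.split()
    out = []
    for i, w in enumerate(words):
        if w in my_list:
            for repl in my_list:
                if repl != w:
                    out.append(' '.join(words[:i] + [repl] + words[i + 1:]))
    # Sentences without any listed word are kept unchanged
    return ', '.join(out) if out else sentence

df['new_text'] = df['text'].apply(variants)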
I have a bag of words as elements in a list. I am trying to check whether each of these words appears in the pandas data frame, but ONLY when a row starts with the element from the list. I have tried 'startswith' and 'contains' to compare.
Code:
import pandas as pd
# list of words to search for
searchwords = ['harry','harry potter','secret garden']
# Data
l1 = [1, 2, 3, 4, 5]
l2 = ['Harry Potter is a great book',
      'Harry Potter is very famous',
      'I enjoyed reading Harry Potter series',
      'LOTR is also a great book along',
      'Have you read Secret Garden as well?'
      ]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()
# Preview df:
id text
0 1 harry potter is a great book
1 2 harry potter is very famous
2 3 i enjoyed reading harry potter series
3 4 lotr is also a great book along
4 5 have you read secret garden as well?
Try #1:
When I run this command it picks up matches anywhere in the text column and gives me results accordingly. That's not what I am looking for; I only ran it as an example to check that I was doing things right.
df[df['text'].str.contains('|'.join(searchwords))]
Try #2:
When I run this command it returns nothing. Why is that? Am I doing something wrong? It works when I search for the single string 'harry', but not when I pass in the joined list of elements.
df[df['text'].str.startswith('harry')] # works with single string.
df[df['text'].str.startswith('|'.join(searchwords))] # returns nothing!
Use startswith with a tuple
Ex:
searchwords = ['harry','harry potter','secret garden']
# Data
l1 = [1, 2, 3, 4, 5]
l2 = ['Harry Potter is a great book',
      'Harry Potter is very famous',
      'I enjoyed reading Harry Potter series',
      'LOTR is also a great book along',
      'Have you read Secret Garden as well?'
      ]

df = pd.DataFrame({'id': l1, 'text': l2})
df['text'] = df['text'].str.lower()
print(df[df['text'].str.startswith(tuple(searchwords))])
Output:
id text
0 1 harry potter is a great book
1 2 harry potter is very famous
Since startswith accepts a plain string (or a tuple of strings) and not a regex, use str.findall:
df[df['text'].str.findall('^(?:'+'|'.join(searchwords) + ')').apply(len) > 0]
Output
id text
0 1 harry potter is a great book
1 2 harry potter is very famous
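Series.str.match, which anchors the regex at the start of each string, is a slightly shorter way to express the same idea; a sketch:
# str.match only matches at the beginning of each string
df[df['text'].str.match('|'.join(searchwords))]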
You could pass a tuple to the startswith function to check for multiple words.
See this str.startswith with a list of strings to test for
In your case, you can do
df['text'].str.startswith(tuple(searchwords))
Out:
0 True
1 True
2 False
3 False
4 False
Name: text, dtype: bool
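The boolean Series can then be used directly as a mask to keep only the matching rows:
df[df['text'].str.startswith(tuple(searchwords))]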
I have a dataframe below:
import pandas

df = pandas.DataFrame({"terms": [
    [['the', 'boy', 'and', 'the goat'], ['a', 'girl', 'and', 'the cat']],
    [['fish', 'boy', 'with', 'the dog'], ['when', 'girl', 'find', 'the mouse'], ['if', 'dog', 'see', 'the cat']],
]})
My desired outcome is as follows:
df2 = pandas.DataFrame({"terms" : ['the boy and the goat','a girl and the cat', 'fish boy with the dog','when girl find the mouse', 'if dog see the cat']})
Is there a simple way to accomplish this without having to use a for loop to iterate through each row for each element and substring:
result = pandas.DataFrame()
for i in range(len(df.terms.tolist())):
    x = df.terms.tolist()[i]
    for y in x:
        z = str(y).replace(",", '').replace("'", '').replace('[', '').replace(']', '')
        flattened = pandas.DataFrame({'flattened_term': [z]})
        result = result.append(flattened)
print(result)
Thank you.
There is certainly no way to avoid loops here, at least not implicitly. Pandas is not designed to handle list objects as elements; it deals magnificently with numeric data, and pretty well with strings. In any case, your fundamental problem is that you are using pd.DataFrame.append in a loop, which is a quadratic-time algorithm (the entire DataFrame is re-created on each iteration). But you can probably just get away with the following, and it should be significantly faster:
>>> df
terms
0 [[the, boy, and, the goat], [a, girl, and, the...
1 [[fish, boy, with, the dog], [when, girl, find...
>>> pandas.DataFrame([' '.join(term) for row in df.itertuples() for term in row.terms])
0
0 the boy and the goat
1 a girl and the cat
2 fish boy with the dog
3 when girl find the mouse
4 if dog see the cat
>>>
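On pandas 0.25 or newer, where Series.explode is available, the same flattening can be written with explode plus str.join; a sketch, assuming the df defined above:
# Explode the outer lists so each inner list becomes its own row,
# then join each inner word list into a single string
df2 = pandas.DataFrame({'terms': df['terms'].explode().str.join(' ').reset_index(drop=True)})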
If I have a frame like this
frame = pd.DataFrame({
"a": ["the cat is blue", "the sky is green", "the dog is black"]
})
and I want to check if any of those rows contain a certain word I just have to do this.
frame["b"] = (
frame.a.str.contains("dog") |
frame.a.str.contains("cat") |
frame.a.str.contains("fish")
)
frame["b"] outputs:
0 True
1 False
2 True
Name: b, dtype: bool
If I decide to make a list:
mylist = ["dog", "cat", "fish"]
How would I check that the rows contain a certain word in the list?
frame = pd.DataFrame({'a' : ['the cat is blue', 'the sky is green', 'the dog is black']})
frame
a
0 the cat is blue
1 the sky is green
2 the dog is black
The str.contains method accepts a regular expression pattern:
mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)
pattern
'dog|cat|fish'
frame.a.str.contains(pattern)
0 True
1 False
2 True
Name: a, dtype: bool
Because regex patterns are supported, you can also embed flags:
frame = pd.DataFrame({'a' : ['Cat Mr. Nibbles is blue', 'the sky is green', 'the dog is black']})
frame
a
0 Cat Mr. Nibbles is blue
1 the sky is green
2 the dog is black
pattern = '|'.join([f'(?i){animal}' for animal in mylist]) # python 3.6+
pattern
'(?i)dog|(?i)cat|(?i)fish'
frame.a.str.contains(pattern)
0 True # Because of the (?i) flag, 'Cat' is also matched to 'cat'
1 False
2 True
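If the words in the list might contain regex metacharacters, it is safer to escape them first; str.contains also accepts a case argument as an alternative to inline (?i) flags. A small sketch:
import re

pattern = '|'.join(map(re.escape, mylist))  # escape any regex metacharacters
frame.a.str.contains(pattern, case=False)   # case-insensitive match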
For a list, this should work:
print(frame[frame["a"].isin(mylist)])
See pandas.DataFrame.isin().
After going through the comments on the accepted answer about extracting the matched string, this approach can also be tried.
frame = pd.DataFrame({'a' : ['the cat is blue', 'the sky is green', 'the dog is black']})
frame
a
0 the cat is blue
1 the sky is green
2 the dog is black
Let us create our list, which will contain the strings that need to be matched and extracted.
mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)
Now let's create a function which will be responsible for finding and extracting the substring.
import re

def pattern_searcher(search_str: str, search_list: str):
    search_obj = re.search(search_list, search_str)
    if search_obj:
        return_str = search_str[search_obj.start(): search_obj.end()]
    else:
        return_str = 'NA'
    return return_str
We will use this function with pandas.DataFrame.apply
frame['matched_str'] = frame['a'].apply(lambda x: pattern_searcher(search_str=x, search_list=pattern))
Result :
a matched_str
0 the cat is blue cat
1 the sky is green NA
2 the dog is black dog
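For comparison, Series.str.extract can do the same extraction in one step; a sketch using the same pattern (rows without a match come back as NaN rather than 'NA'):
# expand=False returns a Series; the pattern needs one capturing group
frame['matched_str'] = frame['a'].str.extract('(' + pattern + ')', expand=False)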
We can check for three patterns simultaneously using the pipe (|), for example:
import re

for i in range(len(df)):
    if re.findall(r'car|oxide|gen', df.iat[i, 1]):
        df.iat[i, 2] = 'Yes'
    else:
        df.iat[i, 2] = 'No'
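The same flag can usually be set without an explicit Python loop by combining str.contains with numpy.where; a sketch, assuming, as in the loop above, that the text sits in the second column and the result goes in the third:
import numpy as np

# True where the text column matches any of the three patterns
mask = df.iloc[:, 1].str.contains(r'car|oxide|gen')
df.iloc[:, 2] = np.where(mask, 'Yes', 'No')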