I have a dataframe like this:
test = pd.DataFrame({'id':['a','C','D','b','b','D','c','c','c'], 'text':['a','x','a','b','b','b','c','c','c']})
Using the following for-loop I can add x to a new_col. This for-loop works fine for the small dataframe. However, for dataframes that have thousands of rows, it will take many hours to process. Any suggestions to speed it up?
for index, row in test.iterrows():
    if row['id'] == 'C':
        if test['id'][index+1] == 'D':
            test['new_col'][index+1] = test['text'][index]
Try using shift() and conditions.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': ['a', 'C', 'D', 'b', 'b', 'D', 'c', 'c', 'c'],
                   'text': ['a', 'x', 'a', 'b', 'b', 'b', 'c', 'c', 'c']})
df['temp_col'] = df['id'].shift()
df['new_col'] = np.where((df['id'] == 'D') & (df['temp_col'] == 'C'), df['text'].shift(), "")
del df['temp_col']
print(df)
We can also do it without a temporary column. (Thanks and credit to Prayson 🙂)
df['new_col'] = np.where((df['id'].eq('D')) & (df['id'].shift().eq('C')), df['text'].shift(), "")
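The same condition can also be written as a boolean mask combined with .loc, which may be easier to extend if more conditions are added later. This is only a sketch: the names prev_is_c and mask are illustrative and do not come from the answers above.

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'C', 'D', 'b', 'b', 'D', 'c', 'c', 'c'],
                   'text': ['a', 'x', 'a', 'b', 'b', 'b', 'c', 'c', 'c']})

# True where the previous row's id is 'C' (the first row compares against NaN)
prev_is_c = df['id'].shift().eq('C')
# True where the current row is 'D' and the previous row is 'C'
mask = df['id'].eq('D') & prev_is_c

# Start with empty strings, then copy the previous row's text where the mask holds
df['new_col'] = ""
df.loc[mask, 'new_col'] = df['text'].shift()[mask]
print(df)
```

Only the row at index 2 satisfies the mask here, so it is the only row that receives the previous row's text.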
I have a dataset along the lines of:
data = [['a', 'b', 'c'], ['a', 'x', 'y', 'z'], ['a', 'x', 'e', 'f'], ['a']]
I've searched SO and found ways to return the items common to all lists using intersection_update() (so, in this example, 'a'), but I actually want to return the items that appear in more than one list, i.e.:
retVal = ['a', 'x']
Since 'a' and 'x' are duplicated at least once among all lists. Is there a built-in for Python 2.7 that can do this?
Use a Counter to determine the number of each item and chain.from_iterable to pass the items from the sublists to the Counter.
from itertools import chain
from collections import Counter
data=[['a', 'b', 'c'], ['a', 'x', 'y', 'z'], ['a', 'x', 'e', 'f'], ['a']]
c = Counter(chain.from_iterable(data))
retVal = [k for k, count in c.items() if count >= 2]
print(retVal)
# ['x', 'a'] on Python 2.7 (order is arbitrary); ['a', 'x'] on Python 3.7+
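Note that the Counter above also counts repeats inside a single sublist, so ['a', 'a'] alone would report 'a' as a duplicate. If only items shared between different sublists should count, one possible variation is to deduplicate each sublist first. The helper name shared_items is hypothetical, and the result is sorted only to make the output order predictable.

```python
from collections import Counter
from itertools import chain

data = [['a', 'b', 'c'], ['a', 'x', 'y', 'z'], ['a', 'x', 'e', 'f'], ['a']]

def shared_items(lists):
    # set(sub) collapses repeats inside one sublist, so each sublist
    # contributes at most 1 to an item's count
    counts = Counter(chain.from_iterable(set(sub) for sub in lists))
    return sorted(k for k, n in counts.items() if n >= 2)

ret_val = shared_items(data)
print(ret_val)
```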
I'm new to Python, so I might be making basic mistakes; apologies in advance.
Here is the kind of result I'm trying to obtain:
foo = [
["B","C","E","A","D"],
["E","B","A","C","D"],
["D","B","A","E","C"],
["C","D","E","B","A"]
]
So basically, a list of lists, each a random permutation of the letters with no letter repeated.
Here is what I get so far:
foo = ['BDCEA', 'BDCEA', 'BDCEA', 'BDCEA']
The main problem is that I get the same permutation every time. This is my code so far:
import random
import numpy as np
letters = ["A", "B", "C", "D", "E"]
nblines = 4
foo = np.repeat(''.join(random.sample(letters, len(letters))), nblines)
Help appreciated. Thanks
The problem with your code is that the line
foo = np.repeat(''.join(random.sample(letters, len(letters))), nblines)
will first create one random permutation, and then repeat that same permutation nblines times. numpy.repeat does not repeatedly invoke a function; it repeats the elements of an already existing array, which here holds the single string you created with random.sample.
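A small sketch of that behaviour, splitting the original one-liner into two steps. The names one_permutation and repeated are illustrative only.

```python
import random
import numpy as np

letters = ["A", "B", "C", "D", "E"]

# random.sample runs exactly once, producing a single string
one_permutation = ''.join(random.sample(letters, len(letters)))

# np.repeat only copies that existing value; it never calls random.sample again
repeated = np.repeat(one_permutation, 4)
print(repeated)
```

Whatever permutation is drawn, all four entries of repeated are identical.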
Another point is that numpy is designed primarily for numerical data rather than strings. Here is a short code snippet (without numpy) that produces your desired result:
[random.sample(letters,len(letters)) for i in range(nblines)]
The result looks similar to this:
foo = [
["B","C","E","A","D"],
["E","B","A","C","D"],
["D","B","A","E","C"],
["C","D","E","B","A"]
]
I hope this helped ;)
PS: I see that others gave similar answers to this while I was writing it.
np.repeat repeats the same array. Your approach would work if you changed it to:
[''.join(random.sample(letters, len(letters))) for _ in range(nblines)]
Out: ['EBCAD', 'BCEAD', 'EBDCA', 'DBACE']
This is a short way of writing this:
foo = []
for _ in range(nblines):
    foo.append(''.join(random.sample(letters, len(letters))))
foo
Out: ['DBACE', 'CBAED', 'ACDEB', 'ADBCE']
Here's a plain Python solution using a "traditional" style for loop.
from random import shuffle
nblines = 4
letters = list("ABCDE")
foo = []
for _ in range(nblines):
    shuffle(letters)
    foo.append(letters[:])
print(foo)
typical output
[['E', 'C', 'D', 'A', 'B'], ['A', 'B', 'D', 'C', 'E'], ['A', 'C', 'B', 'E', 'D'], ['C', 'A', 'E', 'B', 'D']]
The random.shuffle function shuffles the list in-place. We append a copy of the list to foo using letters[:], otherwise foo would just end up containing 4 references to the one list object.
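To illustrate why the copy matters, this sketch builds one list by appending letters directly and another by appending letters[:]. The names aliased and copied are made up for the comparison.

```python
from random import shuffle

letters = list("ABCDE")
aliased = []  # will hold the same list object four times
copied = []   # will hold four independent snapshots

for _ in range(4):
    shuffle(letters)
    aliased.append(letters)     # no copy: just another reference
    copied.append(letters[:])   # slice copy: frozen at this shuffle

# Every entry of aliased is the very same object as letters
all_same_object = all(row is letters for row in aliased)
# Every entry of copied is its own list object
distinct_objects = len({id(row) for row in copied}) == 4

print(aliased)
print(copied)
```

Printing aliased shows four identical rows (the final shuffle), while copied keeps each intermediate shuffle.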
Here's a slightly more advanced version, using a generator function to handle the shuffling. Each time we call next(sh) it shuffles the lst list stored in the generator and yields a copy of it. So we can call next(sh) in a list comprehension to build the list, which is a little neater than using a traditional for loop. Also, list comprehensions can be slightly faster than using .append in a traditional for loop.
from random import shuffle
def shuffler(seq):
    lst = list(seq)
    while True:
        shuffle(lst)
        yield lst[:]

sh = shuffler('ABCDE')
foo = [next(sh) for _ in range(10)]
for row in foo:
    print(row)
typical output
['C', 'B', 'A', 'E', 'D']
['C', 'A', 'E', 'B', 'D']
['D', 'B', 'C', 'A', 'E']
['E', 'D', 'A', 'B', 'C']
['B', 'A', 'E', 'C', 'D']
['B', 'D', 'C', 'E', 'A']
['A', 'B', 'C', 'E', 'D']
['D', 'C', 'A', 'B', 'E']
['D', 'C', 'B', 'E', 'A']
['E', 'D', 'A', 'C', 'B']
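Because shuffler is an infinite generator, itertools.islice is another way to take a fixed number of rows from it, as an alternative to calling next in a comprehension. This is a sketch that repeats the generator definition so it runs on its own; the name rows is illustrative.

```python
from itertools import islice
from random import shuffle

def shuffler(seq):
    lst = list(seq)
    while True:
        shuffle(lst)
        yield lst[:]  # yield a copy so earlier results are not overwritten

# islice stops after 3 items, so the infinite loop is never exhausted
rows = list(islice(shuffler('ABCDE'), 3))
for row in rows:
    print(row)
```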
This is the sequence:
l = [['A', 'G'], 'A', ['A', 'C']]
I need the three element sequence back for each permutation
all = ['AAA','GAA','AAC','GAC']
I can't figure this one out! I'm having trouble retaining the permutation order!
You want the product:
from itertools import product
l = [['A', 'G'], 'A', ['A', 'C']]
print(["".join(p) for p in product(*l)])
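product(*l) works here because the bare string 'A' is itself an iterable of one character. A multi-character string such as 'AT' would be split into 'A' and 'T'. If that is not wanted, one option is to wrap bare strings in a list first. The name choices is an assumption introduced for this sketch.

```python
from itertools import product

l = [['A', 'G'], 'A', ['A', 'C']]

# Wrap bare strings so each one counts as a single choice,
# even if it is longer than one character
choices = [[item] if isinstance(item, str) else list(item) for item in l]

result = ["".join(p) for p in product(*choices)]
print(result)
# ['AAA', 'AAC', 'GAA', 'GAC']
```

The last position varies fastest, so the order differs from the list in the question even though the same four strings are produced.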
I am a beginner in Python. I am trying to make a horizontal bar chart in which the colors appear in a different order in each bar.
I have a data set like the one below:
dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
{'A':34, 'B':68, 'C':32, 'D':38},
{'A':35, 'B':45, 'C':66, 'D':50},
{'A':23, 'B':23, 'C':21, 'D':16}]
data_orders = [['A', 'B', 'C', 'D'],
['B', 'A', 'C', 'D'],
['A', 'B', 'D', 'C'],
['B', 'A', 'C', 'D']]
The first list contains numerical data, and the second one contains the order of each data item. I need the second list here, because the order of A, B, C, and D is crucial for the dataset when presenting them in my case.
Using data like the above, I want to make a stacked bar chart like the one in the picture below, which I made manually in MS Excel. What I hope to do now is to make this type of bar chart with Matplotlib from a dataset like the one above, in a more automatic way. I also want to add a legend to the chart if possible.
I have gotten completely lost trying this by myself. Any help would be greatly appreciated.
Thank you very much for your attention!
It's a long program, but it works. I added one extra dummy row so that the number of rows differs from the number of columns:
import numpy as np
from matplotlib import pyplot as plt
dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
{'A':34, 'B':68, 'C':32, 'D':38},
{'A':35, 'B':45, 'C':66, 'D':50},
{'A':23, 'B':23, 'C':21, 'D':16},
{'A':35, 'B':45, 'C':66, 'D':50}]
data_orders = [['A', 'B', 'C', 'D'],
['B', 'A', 'C', 'D'],
['A', 'B', 'D', 'C'],
['B', 'A', 'C', 'D'],
['A', 'B', 'C', 'D']]
colors = ["r","g","b","y"]
names = sorted(dataset[0].keys())
values = np.array([[data[name] for name in order] for data,order in zip(dataset, data_orders)])
lefts = np.insert(np.cumsum(values, axis=1),0,0, axis=1)[:, :-1]
orders = np.array(data_orders)
bottoms = np.arange(len(data_orders))
for name, color in zip(names, colors):
    # (row, column) positions where this category sits in each row's order
    idx = np.where(orders == name)
    value = values[idx]
    left = lefts[idx]
    # barh replaces the older plt.bar(left=..., orientation="horizontal") call
    plt.barh(bottoms, value, height=0.8, left=left,
             color=color, label=name)
plt.yticks(bottoms, ["data %d" % (t+1) for t in bottoms])
plt.legend(loc="best", bbox_to_anchor=(1.0, 1.00))
plt.subplots_adjust(right=0.85)
plt.show()
The result is a horizontal stacked bar chart with one bar per row of the dataset.
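The least obvious step in that program is how lefts is computed. As a sketch on the first two rows only (with row 2 already reordered to B, A, C, D), each segment starts where the running total of the previous segments ends. The name running_totals is illustrative.

```python
import numpy as np

# Rows are already in display order: row 1 is A,B,C,D and row 2 is B,A,C,D
values = np.array([[19, 39, 61, 70],
                   [68, 34, 32, 38]])

# Running total along each row gives the right edge of every segment
running_totals = np.cumsum(values, axis=1)

# Prepend 0 and drop the last column to get the left edge of every segment
lefts = np.insert(running_totals, 0, 0, axis=1)[:, :-1]
print(lefts)
# [[  0  19  58 119]
#  [  0  68 102 134]]
```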
>>> dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
...            {'A':34, 'B':68, 'C':32, 'D':38},
...            {'A':35, 'B':45, 'C':66, 'D':50},
...            {'A':23, 'B':23, 'C':21, 'D':16}]
>>> data_orders = [['A', 'B', 'C', 'D'],
...                ['B', 'A', 'C', 'D'],
...                ['A', 'B', 'D', 'C'],
...                ['B', 'A', 'C', 'D']]
>>> for i, x in enumerate(data_orders):
...     for y in x:
...         # do something here with dataset[i][y] in matplotlib
...         pass
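One way the "do something" step might be filled in, shown here without any plotting: walk each row in its own order and record where every segment starts and how wide it is. The segments list and its tuple layout (row, name, left, width) are assumptions made for this sketch, not part of the answer above.

```python
dataset = [{'A': 19, 'B': 39, 'C': 61, 'D': 70},
           {'A': 34, 'B': 68, 'C': 32, 'D': 38},
           {'A': 35, 'B': 45, 'C': 66, 'D': 50},
           {'A': 23, 'B': 23, 'C': 21, 'D': 16}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'D', 'C'],
               ['B', 'A', 'C', 'D']]

segments = []  # hypothetical layout: (row index, name, left edge, width)
for i, order in enumerate(data_orders):
    left = 0
    for name in order:
        width = dataset[i][name]
        segments.append((i, name, left, width))
        left += width  # the next segment starts where this one ends

for seg in segments[:4]:
    print(seg)
```

Each tuple could then be drawn with a call such as plt.barh(i, width, left=left), using one fixed color per name.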