How to speed up nested for loop with dataframe?

I have a dataframe like this:
test = pd.DataFrame({'id':['a','C','D','b','b','D','c','c','c'], 'text':['a','x','a','b','b','b','c','c','c']})
Using the following for loop, I can copy the text of a 'C' row (here x) into new_col of the 'D' row that immediately follows it. The loop works fine for this small dataframe, but for dataframes with thousands of rows it takes many hours to run. Any suggestions to speed it up?
for index, row in test.iterrows():
    if row['id'] == 'C':
        if test['id'][index+1] == 'D':
            test['new_col'][index+1] = test['text'][index]

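Try using shift() with a boolean condition. shift() lines each row up with the row before it, so the whole comparison runs on full columns instead of row by row.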
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': ['a', 'C', 'D', 'b', 'b', 'D', 'c', 'c', 'c'],
                   'text': ['a', 'x', 'a', 'b', 'b', 'b', 'c', 'c', 'c']})
df['temp_col'] = df['id'].shift()
df['new_col'] = np.where((df['id'] == 'D') & (df['temp_col'] == 'C'), df['text'].shift(), "")
del df['temp_col']
print(df)
We can also do it without a temporary column. (Thanks and credit to Prayson 🙂)
df['new_col'] = np.where((df['id'].eq('D')) & (df['id'].shift().eq('C')), df['text'].shift(), "")
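To make the role of shift() more concrete, here is a minimal sketch that splits the one-liner above into named steps and checks it against a plain positional loop. The names prev_id, prev_text, mask and expected are only illustrative, and the reference loop assumes the default 0..n-1 index used in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['a', 'C', 'D', 'b', 'b', 'D', 'c', 'c', 'c'],
                   'text': ['a', 'x', 'a', 'b', 'b', 'b', 'c', 'c', 'c']})

# shift() moves every value down one row, so row i sees the value from row i-1.
prev_id = df['id'].shift()
prev_text = df['text'].shift()

# True only where the current row is 'D' and the row directly before it is 'C'.
mask = df['id'].eq('D') & prev_id.eq('C')

# Take the previous row's text where the mask holds, an empty string elsewhere.
df['new_col'] = np.where(mask, prev_text, "")

# Reference result from a plain positional loop, for comparison only.
expected = [""] * len(df)
for i in range(len(df) - 1):
    if df['id'].iloc[i] == 'C' and df['id'].iloc[i + 1] == 'D':
        expected[i + 1] = df['text'].iloc[i]

print(df['new_col'].tolist() == expected)
```

Only the row at position 2 qualifies here: the second 'D' (position 5) follows a 'b', so it stays empty.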

Related

Reorder the contents of my string in Python

Here is a sample dataframe I put together to illustrate the problem.
import pandas as pd
import numpy as np
df = pd.DataFrame([["{postal2:express}", "{postal1:regular}", "{postal4:slow}", "{postal3:slow}","{address1:ABC}","{address4:VPN}","{address3:XYZ}","{address2:POL}"],["{postal1:regular}","{postal4:slow}","{address4:VPN}","{address3:XYZ}","{postal3:slow}","{address1:ABC}","{postal2:express}","{address2:POL}"]], columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])
However, the output I need should be grouped as in the image:
[image: targetdf]

Moving a bunch of distinct items to the end of a python list

I have this python list:
['Intercept', 'a', 'country[T.BE]', 'country[T.CY]', 'country[T.DE]', 'b', 'c', 'd', 'e']
I want the country items at the end:
['Intercept', 'a', 'b', 'c', 'd', 'e', 'country[T.BE]', 'country[T.CY]', 'country[T.DE]']
How to accomplish this?
(Note, the items are column headers of a dataframe that I will use for regression analysis. The column names and the weird ordering are generated by patsy.dmatrices.)
I tried sorting, pop, del, and a list comprehension, but to no avail. In this case I decided not to describe the attempts that did not work. It is a simple problem, and unlike one commentator, I do not have decades of programming experience.

If the rule is to move every item that contains country to the back, use sorted with a key:
l = ['Intercept', 'a', 'country[T.BE]', 'country[T.CY]', 'country[T.DE]', 'b', 'c', 'd', 'e']
sorted(l, key=lambda x: 'country' in x)
Output:
['Intercept',
'a',
'b',
'c',
'd',
'e',
'country[T.BE]',
'country[T.CY]',
'country[T.DE]']
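Why this works may not be obvious, so here is a small sketch of the idea: the key returns False for ordinary names and True for country names, False sorts before True, and Python's sort is stable, so items keep their original relative order within each group. The variable names cols and reordered are illustrative only.

```python
cols = ['Intercept', 'a', 'country[T.BE]', 'country[T.CY]', 'country[T.DE]',
        'b', 'c', 'd', 'e']

# The key is a boolean: False (0) for regular columns, True (1) for country columns.
keys = ['country' in name for name in cols]

# sorted() is stable, so ties keep their input order; only the two groups are moved.
reordered = sorted(cols, key=lambda name: 'country' in name)

print(keys)
print(reordered)
```

Because the sort only compares the boolean keys, 'Intercept' stays first even though it would not be first alphabetically.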

Here I assume that whether an item contains the text country[ is what decides which group it belongs to. In that case you can concatenate two list comprehensions:
li = ['Intercept', 'a', 'country[T.BE]', 'country[T.CY]', 'country[T.DE]', 'b', 'c', 'd', 'e']
[x for x in li if 'country[' not in x] + [x for x in li if 'country[' in x]

pandas map function to data frame across all columns and one fixed column

I have a pandas dataframe df with 5 columns, 'A', 'B', 'C', 'D', 'E'
I would like to apply a function to the first 4 columns ('A', 'B', 'C', 'D') that takes two inputs X[i] and E[i] for row i where X is one of the first four columns.
Ignoring E[i], this is fairly straightforward:
def do_something(value):
    return some_transformation(value)
df[['A', 'B', 'C', 'D']].applymap(do_something)
Similarly, if I have a constant value I can do it with map:
def do_something(value, i):
    return some_transformation(value, i)
df[['A', 'B', 'C', 'D']].map(lambda f: do_something(f, 6))
But how do I do this if instead of 6 I want to pass in the value of E in the same row?

With np.vectorize you can pass whole columns to the function while the actual computation still happens on each pair of elements.
import numpy as np
def do_something(x, y):
    return some_transformation(x, y)
v = np.vectorize(do_something)
df[['A', 'B', 'C', 'D']].apply(v, args=(df['E'], ))
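As a runnable illustration of the np.vectorize approach, the sketch below substitutes a made-up transformation (x * e + 1) for some_transformation and uses a tiny two-row frame. Both the transformation and the sample values are assumptions chosen only so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

def do_something(x, e):
    # Hypothetical stand-in for some_transformation(x, e).
    return x * e + 1

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8],
                   'E': [10, 100]})

# np.vectorize wraps the scalar function so it accepts whole columns;
# internally it still calls do_something once per pair of elements.
v = np.vectorize(do_something)

# apply() hands each of the four columns to v, together with column E.
result = df[['A', 'B', 'C', 'D']].apply(v, args=(df['E'],))

print(result)
```

Each cell is combined with the E value from its own row, for example 2 * 100 + 1 = 201 in the second row of column A. Note that np.vectorize is a convenience wrapper rather than a speed-up; when the transformation can be written with plain column arithmetic, that is usually the faster route.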

Start loop after certain element in list is reached

How do I start executing code in a for loop only after a certain element in the list has been reached? I've got something that works, but is there a more Pythonic or faster way of doing this?
items = ['a', 'b', 'c', 'd', 'e', 'f']
condition = 0
for i in items:
    if i == 'c' or condition == 1:
        condition = 1
        print(i)

One way would be to iterate over a generator that combines dropwhile and islice:
from itertools import dropwhile, islice
data = ['a', 'b', 'c', 'd', 'e', 'f']
for after in islice(dropwhile(lambda L: L != 'c', data), 1, None):
    print(after)
If you want to include the matching element itself, drop the islice.

A slightly simpler version:
lst = ['a', 'b', 'c', 'd', 'e', 'f']
start_index = lst.index('c')
for i in lst[start_index:]:
    print(i)
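The answers above differ in whether the matching element is included and in what happens when it is absent. The following sketch compares them side by side; the variable names are illustrative.

```python
from itertools import dropwhile, islice

data = ['a', 'b', 'c', 'd', 'e', 'f']

# dropwhile skips items while the predicate is true, then yields everything else,
# starting with the first item that failed the predicate ('c' itself).
from_c = list(dropwhile(lambda item: item != 'c', data))

# islice(..., 1, None) additionally skips that first yielded item.
after_c = list(islice(dropwhile(lambda item: item != 'c', data), 1, None))

# If the element never appears, dropwhile simply yields nothing ...
missing = list(dropwhile(lambda item: item != 'z', data))

# ... whereas list.index raises ValueError, which has to be handled explicitly.
try:
    data.index('z')
    index_raised = False
except ValueError:
    index_raised = True

print(from_c, after_c, missing, index_raised)
```

The dropwhile version also works on any iterable, including generators, while the slicing version needs a real list.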

Stacked bar chart with differently ordered colors using matplotlib

I am a beginner in Python. I am trying to make a horizontal bar chart with differently ordered colors.
I have a data set like the one below:
dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
           {'A':34, 'B':68, 'C':32, 'D':38},
           {'A':35, 'B':45, 'C':66, 'D':50},
           {'A':23, 'B':23, 'C':21, 'D':16}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'D', 'C'],
               ['B', 'A', 'C', 'D']]
The first list contains numerical data, and the second one contains the order of each data item. I need the second list here, because the order of A, B, C, and D is crucial for the dataset when presenting them in my case.
Using data like the above, I want to make a stacked bar chart like the picture below, which I made manually in MS Excel. What I hope to do now is produce this type of bar chart with Matplotlib, from a dataset like the one above, in a more automatic way. I would also like to add a legend to the chart if possible.
I have gotten completely lost trying this by myself. Any help would be greatly appreciated.
Thank you very much for your attention!

It's a long program, but it works. I added one dummy row of data so that the number of rows differs from the number of columns:
import numpy as np
from matplotlib import pyplot as plt
dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
           {'A':34, 'B':68, 'C':32, 'D':38},
           {'A':35, 'B':45, 'C':66, 'D':50},
           {'A':23, 'B':23, 'C':21, 'D':16},
           {'A':35, 'B':45, 'C':66, 'D':50}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'D', 'C'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'C', 'D']]
colors = ["r","g","b","y"]
names = sorted(dataset[0].keys())
# values[i, j] is the j-th segment of row i, in that row's own order
values = np.array([[data[name] for name in order] for data,order in zip(dataset, data_orders)])
# left edge of each segment = sum of the segments before it in the same row
lefts = np.insert(np.cumsum(values, axis=1),0,0, axis=1)[:, :-1]
orders = np.array(data_orders)
bottoms = np.arange(len(data_orders))
for name, color in zip(names, colors):
    # position of this name in every row (one match per row)
    idx = np.where(orders == name)
    value = values[idx]
    left = lefts[idx]
    # barh replaces the original plt.bar(left=..., orientation="horizontal") call,
    # whose left keyword is no longer accepted by current matplotlib
    plt.barh(bottoms, value, height=0.8, left=left,
             color=color, label=name, align="edge")
plt.yticks(bottoms+0.4, ["data %d" % (t+1) for t in bottoms])
plt.legend(loc="best", bbox_to_anchor=(1.0, 1.00))
plt.subplots_adjust(right=0.85)
plt.show()
The resulting figure:
[image: resulting stacked bar chart]

You can iterate over data_orders and look up each value in the matching dataset entry:
dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
           {'A':34, 'B':68, 'C':32, 'D':38},
           {'A':35, 'B':45, 'C':66, 'D':50},
           {'A':23, 'B':23, 'C':21, 'D':16}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'D', 'C'],
               ['B', 'A', 'C', 'D']]
for i, x in enumerate(data_orders):
    for y in x:
        value = dataset[i][y]
        # do something here with value in matplotlib
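One possible way to fill in that loop is sketched below, under a few assumptions: the helper names build_segments and plot_segments and the color mapping are made up for illustration, and matplotlib is imported only inside the plotting function so the bookkeeping can be checked without it. The key step is keeping a running left offset per row so each segment starts where the previous one ended.

```python
dataset = [{'A': 19, 'B': 39, 'C': 61, 'D': 70},
           {'A': 34, 'B': 68, 'C': 32, 'D': 38},
           {'A': 35, 'B': 45, 'C': 66, 'D': 50},
           {'A': 23, 'B': 23, 'C': 21, 'D': 16}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'D', 'C'],
               ['B', 'A', 'C', 'D']]

def build_segments(dataset, data_orders):
    """Return (row, key, left, width) for every bar segment, in drawing order."""
    segments = []
    for row, order in enumerate(data_orders):
        left = 0  # running offset: where the next segment of this row starts
        for key in order:
            width = dataset[row][key]
            segments.append((row, key, left, width))
            left += width
    return segments

def plot_segments(segments, colors):
    """Draw the segments as horizontal stacked bars; colors maps key -> color."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for drawing
    fig, ax = plt.subplots()
    labelled = set()
    for row, key, left, width in segments:
        # Label each key once; labels starting with "_" are skipped by the legend.
        label = key if key not in labelled else "_nolegend_"
        labelled.add(key)
        ax.barh(row, width, left=left, height=0.8, color=colors[key], label=label)
    rows = sorted({row for row, _, _, _ in segments})
    ax.set_yticks(rows)
    ax.set_yticklabels(["data %d" % (r + 1) for r in rows])
    ax.legend(loc="best")
    return fig

segments = build_segments(dataset, data_orders)
colors = {'A': 'r', 'B': 'g', 'C': 'b', 'D': 'y'}  # assumed color choice
# plot_segments(segments, colors).savefig("stacked.png")  # uncomment to draw
```

Keeping the offset calculation separate from the drawing makes it easy to verify: in the second row, for instance, 'B' starts at 0 and 'A' starts at 68 because that row's order puts 'B' first.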
