I want to know if the Pandas applymap function always goes through the DataFrame from top to bottom and left to right (iterating through each row on a per-column basis).
Mainly, I'm using applymap with a dictionary to count the number of items in the list in each cell, BUT I have to account for a value differently the first time it is seen. So if applymap always works in a consistent order, I can use it, but if there is some weird potential for race conditions, then I can't.
import numpy as np
import pandas as pd
vals = np.arange(25).reshape([5,5])
df = pd.DataFrame(vals)
print(df)
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
l = []
_ = df.applymap(lambda x: l.append(x))
print(l)
[ 0, 5, 10, 15, 20,
1, 6, 11, 16, 21,
2, 7, 12, 17, 22,
3, 8, 13, 18, 23,
4, 9, 14, 19, 24]
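For concreteness, here is a rough sketch of the kind of order-dependent counting I have in mind; the data, the dict and the seen-set are just hypothetical stand-ins for my real case:
import pandas as pd
df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["y", "z", "z"]})  # toy data
counts = {}   # running tally per value
seen = set()  # values already encountered at least once
def tally(value):
    # the first sighting of a value is handled differently from repeats
    if value not in seen:
        seen.add(value)
        counts[value] = 1
    else:
        counts[value] += 1
    return counts[value]
_ = df.applymap(tally)
print(counts)  # only meaningful if applymap visits cells in a fixed order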
I believe this will always be consistent, as apply by default also works column by column.
I found a comment here on Stack Overflow to that effect (emphasis mine):
strictly speaking, applymap internally is implemented via apply with a little wrap-up over the passed function parameter (roughly speaking, replacing func with lambda x: [func(y) for y in x], and applying column-wise)
In the source code, applymap uses apply, which works column by column by default.
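A rough way to see that relationship (an approximation of the quoted comment, not the actual internals) is that applying a per-column map gives the same result as applymap:
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
via_applymap = df.applymap(lambda x: x * 10)
via_apply = df.apply(lambda col: col.map(lambda x: x * 10))  # one column at a time, left to right
print(via_applymap.equals(via_apply))  # True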
The order seems consistent, even on a shuffled array:
import numpy as np
import pandas as pd
from itertools import count
df = pd.DataFrame(np.zeros((5,5)))
c = count()
df.sample(frac=1).sample(frac=1, axis=1).applymap(lambda x: next(c))
output:
1 3 2 0 4
0 0 5 10 15 20
4 1 6 11 16 21
3 2 7 12 17 22
1 3 8 13 18 23
2 4 9 14 19 24
Now, I think the real question is, "is this behavior stable or is it just an implementation detail that could change in the future?"
I've read the docs (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html).
When I use the below, all is fine and the code works perfectly:
df['c06new']='not_women_gathered'
df['c06new'].where(f1 & f2, "women_gathered", inplace=True)
(df['c06new']=='women_gathered').sum()
However, if I use,
df.where(f1 & f2,"women_gathered", other='not',inplace=True)
I get: TypeError: where() got multiple values for argument 'other'
Why is this?
Your code is producing an error because you're supplying other both as a positional argument and as a keyword argument. For example, this reproduces your error, TypeError: where() got multiple values for argument 'other'.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':range(10)})
df.where(df > 5, 10, other=12)
However, these two examples give the expected result.
df.where(df > 5, other=12)
df.where(df > 5, 12)
Output:
A
0 12
1 12
2 12
3 12
4 12
5 12
6 6
7 7
8 8
9 9
As suggested by Emma, you might be looking for np.where().
df['A'] = np.where(df['A'] > 5, 99, 12)
Output on original df:
A
0 12
1 12
2 12
3 12
4 12
5 12
6 99
7 99
8 99
9 99
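Applied to the columns from your question, that would look roughly like the following (assuming f1 and f2 are boolean Series aligned with df, and that you want the same result as your working two-liner):
import numpy as np
# keep 'not_women_gathered' where both filters hold, otherwise 'women_gathered'
df['c06new'] = np.where(f1 & f2, 'not_women_gathered', 'women_gathered')
(df['c06new'] == 'women_gathered').sum()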
I need to build a dataframe from 10 lists of lists. I did it manually, but it takes time. What is a better way to do it?
I tried to do it manually and it works fine (#1).
I tried code (#2) for better performance, but it returns only the last column.
#1
import pandas as pd
import numpy as np
a1T=[([7,8,9]),([10,11,12]),([13,14,15])]
a2T=[([1,2,3]),([5,0,2]),([3,4,5])]
print (a1T)
#Output[[7, 8, 9], [10, 11, 12], [13, 14, 15]]
vis1=np.array (a1T)
vis_1_1=vis1.T
tmp2=np.array (a2T)
tmp_2_1=tmp2.T
X=np.column_stack([vis_1_1, tmp_2_1])
dataset_all = pd.DataFrame({"Visab1":X[:,0], "Visab2":X[:,1], "Visab3":X[:,2], "Temp1":X[:,3], "Temp2":X[:,4], "Temp3":X[:,5]})
print (dataset_all)
Output: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
> Actually I have a varying number of columns in the dataframe (500-1500), that's why I need auto-generated column names. The extra index (1, 2, 3) after the names Visab, Temp and so on is constant for every case. See the code below.
For better performance I tried the following code:
#2
n = 3  # This is a varying parameter; it affects the number of columns in the table.
m = 2  # This is constant for every case; here it is 2 because we have "Visab", "Temp"
mlist = ('Visab', 'Temp')
nlist = [range(1, n)]
for j in range(1, n):
    for i in range(1, m):
        col = i + (j - 1) * n
        dataset_all = pd.DataFrame({mlist[j] + str(i): X[:, col]})
I expect output like
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
but there is no result (only the error: expected an indented block).
Ok, so the number of columns n is the number of sublists in each list, right? You can measure that with len:
len(a1T)
#Output
3
I'll simplify the answer above so you don't need X, and add automatic column-name creation:
my_lists = [a1T,a2T]
my_names = ["Visab","Temp"]
dfs=[]
for one_list, name in zip(my_lists, my_names):
    n_columns = len(one_list)
    col_names = [name + "_" + str(i) for i in range(n_columns)]
    df = pd.DataFrame(one_list).T
    df.columns = col_names
    dfs.append(df)
dataset_all = pd.concat(dfs,axis=1)
#Output
Visab_0 Visab_1 Visab_2 Temp_0 Temp_1 Temp_2
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
Now it's much clearer. So you have:
X=np.column_stack([vis_1_1, tmp_2_1])
Let's create a list with the names of the columns:
columns_names = ["Visab1","Visab2","Visab3","Temp1","Temp2","Temp3"]
Now you can directly make a dataframe like this:
dataset_all = pd.DataFrame(X,columns=columns_names)
#Output
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
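Since you mentioned the number of columns varies (500-1500), the name list itself can also be generated rather than typed out. A small sketch, reusing X from above and the n and prefixes from your #2 attempt:
n = 3                          # columns per measurement, varies per case
prefixes = ("Visab", "Temp")   # constant for every case
columns_names = [prefix + str(i) for prefix in prefixes for i in range(1, n + 1)]
# ['Visab1', 'Visab2', 'Visab3', 'Temp1', 'Temp2', 'Temp3']
dataset_all = pd.DataFrame(X, columns=columns_names)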
Let's say I have some arrays/lists that contain a lot of values, which means that loading several of them into memory at once would result in a memory error. One way to circumvent this is to load them into a generator and use them when needed. However, with generators you don't have as much control as with arrays/lists, and that is my problem.
Let me explain.
As an example I have the following code, which produces a generator with some small lists. So yeah, this is not memory intensive at all, just an example:
import numpy as np
np.random.seed(10)
number_of_lists = range(0, 5)
generator_list = (np.random.randint(0, 10, 10) for i in number_of_lists)
If I iterate over this generator I get the following:
for i in generator_list:
    print(i)
>> [9 4 0 1 9 0 1 8 9 0]
>> [8 6 4 3 0 4 6 8 1 8]
>> [4 1 3 6 5 3 9 6 9 1]
>> [9 4 2 6 7 8 8 9 2 0]
>> [6 7 8 1 7 1 4 0 8 5]
What I would like to do is sum element-wise across all the lists (axis=0). So the above should in turn result in:
[36, 22, 17, 17, 28, 16, 28, 31, 29, 14]
To do this I could use the following:
sum = [0]*10
for i in generator_list:
    sum += i
where 10 is the length of one of the lists.
So far so good. I am not sure if there is a better/more optimized way of doing it, but it works.
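A more compact variant that seems to do the same thing on a fresh generator is the built-in sum with an array start value, so the addition is element-wise (a sketch, not something I've benchmarked):
import numpy as np
total = sum(generator_list, np.zeros(10, dtype=int))  # consumes the generator, just like the loop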
My problem is that I would like to determine which lists in the generator_list I want to use. For example, what if I wanted to sum two copies of the first [0] list, one of the third, and two of the last, i.e.:
[9 4 0 1 9 0 1 8 9 0]
[9 4 0 1 9 0 1 8 9 0]
[4 1 3 6 5 3 9 6 9 1]
[6 7 8 1 7 1 4 0 8 5]
[6 7 8 1 7 1 4 0 8 5]
>> [34, 23, 19, 10, 37, 5, 19, 22, 43, 11]
How would I go about doing that ?
And before any questions arise about why I want to do it this way: in my real case, getting the arrays into the generator takes some time. I could in principle build a new generator that yields the lists in the order shown above, but again, that would mean waiting for them to be generated. If this has to happen thousands of times (as with bootstrapping), it would take a long time. With the first generator I have ALL the lists available; I just wish to use them selectively, so I don't have to create a new generator every time I want to mix things up and sum a new set of arrays/lists.
import numpy as np
np.random.seed(10)
number_of_lists = range(5)
generator_list = (np.random.randint(0, 10, 10) for i in number_of_lists)
indices = [0, 0, 2, 4, 4]
assert sorted(indices) == indices, "only works for sorted list"
# sum_ = [0] * 10
# I prefer this:
sum_ = np.zeros((10,), dtype=int)
generator_index = -1
for index in indices:
    while generator_index < index:
        vector = next(generator_list)
        generator_index += 1
    sum_ += vector
print(sum_)
outputs
[34 23 19 10 37 5 19 22 43 11]
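If the indices are not guaranteed to be sorted, one alternative sketch (my own variation, not required by the question) is to count how many times each position is wanted and weight each generated array in a single pass:
from collections import Counter
import numpy as np
np.random.seed(10)
generator_list = (np.random.randint(0, 10, 10) for i in range(5))
indices = [4, 0, 2, 0, 4]         # any order, repeats allowed
multiplicity = Counter(indices)    # how often each position is requested
sum_ = np.zeros(10, dtype=int)
for position, vector in enumerate(generator_list):
    # weight each array by how many times its index was requested (0 if never)
    sum_ += multiplicity.get(position, 0) * vector
print(sum_)  # same totals as the sorted-index approach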
I have a very large dataframe
in>> all_data.shape
out>> (228714, 436)
What I would like to do efficiently is multiply many of the columns together. I started with a for loop and a list of columns; the most efficient way I have found is
from itertools import combinations
newcolnames=list(all_data.columns.values)
newcolnames=newcolnames[0:87]
#make cross products (the columns I want to operate on are the first 87)
for c1, c2 in combinations(newcolnames, 2):
    all_data['{0}*{1}'.format(c1,c2)] = all_data[c1] * all_data[c2]
The problem, as one may guess, is that I have 87 columns, which would give on the order of 3800 new columns (yes, this is what I intended). Both my Jupyter notebook and IPython shell choke on this calculation. I need to figure out a better way to undertake this multiplication.
Is there a more efficient way to vectorize and/or process this? Perhaps using a NumPy array (my dataframe has been processed and now contains only numbers and NaNs; it started with categorical variables).
As you have mentioned NumPy in the question, that might be a viable option here, especially because you might want to work in NumPy's 2D space instead of pandas' 1D columnar processing. To start off, you can convert the dataframe to a NumPy array with a call to np.array, like so -
arr = np.array(df) # df is the input dataframe
Now, you can get the pairwise combinations of the column IDs, index into the columns, and perform column-wise multiplications, all in a vectorized manner, like so -
idx = np.array(list(combinations(newcolnames, 2)))
out = arr[:,idx[:,0]]*arr[:,idx[:,1]]
Sample run -
In [117]: arr = np.random.randint(0,9,(4,8))
...: newcolnames = [1,4,5,7]
...: for c1, c2 in combinations(newcolnames, 2):
...:     print(arr[:,c1] * arr[:,c2])
...:
[16 2 4 56]
[64 2 6 16]
[56 3 0 24]
[16 4 24 14]
[14 6 0 21]
[56 6 0 6]
In [118]: idx = np.array(list(combinations(newcolnames, 2)))
...: out = arr[:,idx[:,0]]*arr[:,idx[:,1]]
...:
In [119]: out.T
Out[119]:
array([[16, 2, 4, 56],
[64, 2, 6, 16],
[56, 3, 0, 24],
[16, 4, 24, 14],
[14, 6, 0, 21],
[56, 6, 0, 6]])
Finally, you can create the output dataframe with proper column headers (if needed), like so -
>>> headers = ['{0}*{1}'.format(idx[i,0],idx[i,1]) for i in range(len(idx))]
>>> out_df = pd.DataFrame(out,columns = headers)
>>> df
0 1 2 3 4 5 6 7
0 6 1 1 6 1 5 6 3
1 6 1 2 6 4 3 8 8
2 5 1 4 1 0 6 5 3
3 7 2 0 3 7 0 5 7
>>> out_df
1*4 1*5 1*7 4*5 4*7 5*7
0 1 5 3 5 3 15
1 4 3 8 12 32 24
2 0 6 3 0 0 18
3 14 0 14 0 49 0
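One more thing worth keeping in mind: with 228714 rows, 87 columns give 87*86/2 = 3741 pairwise products, i.e. roughly 228714 x 3741 ≈ 8.6e8 new float64 values, which is on the order of 6-7 GB of additional memory before any intermediate copies. So even a fully vectorized version may need to be run in chunks or with a smaller dtype; this is a back-of-the-envelope estimate, not a measurement.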
You can try the df.eval() method:
for c1, c2 in combinations(newcolnames, 2):
    all_data['{0}*{1}'.format(c1,c2)] = all_data.eval('{} * {}'.format(c1, c2))
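Whichever variant you use, note that inserting ~3800 columns one at a time can itself be slow. One possible sketch (assuming all_data and newcolnames as defined in the question) is to build all the products first and attach them with a single concat:
import pandas as pd
from itertools import combinations
arr = all_data[newcolnames].to_numpy()
pos = {name: i for i, name in enumerate(newcolnames)}
products = pd.DataFrame(
    {'{0}*{1}'.format(c1, c2): arr[:, pos[c1]] * arr[:, pos[c2]]
     for c1, c2 in combinations(newcolnames, 2)},
    index=all_data.index,
)
all_data = pd.concat([all_data, products], axis=1)  # one concat instead of thousands of inserts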
I have a pandas.DataFrame object like the one below
start, end
5, 9
6, 11
13, 11
14, 11
15, 17
16, 17
18, 17
19, 17
20, 24
22, 26
"end" has to always be > "start"
So, I need to filter it from the row where the "end" value becomes < "start" until the next row where they are back to normal.
In the above example, I need:
1.
13,11
15,17
2.
18,17
20,24
Edit (updated):
Think of these as timestamps in seconds. So I can find that it took 2 seconds to recover in both scenarios.
I can do this by iterating over the data, but does pandas have a better way?
You could use pandas' boolean indexing to keep only the rows where start < end. If you then reset the index, the differences between consecutive values of the original index give the delta between valid rows, and those deltas bound the stretches where start > end.
For example you could do something like the following:
# A = starts, B = ends
df = pd.DataFrame({'A': [5, 6, 13, 14, 15, 16, 18, 19, 20, 22],
                   'B': [9, 11, 11, 11, 17, 17, 17, 17, 24, 26]})
# use boolean indexing
df = df[df['A'] < df['B']].reset_index()
# calculate the difference of each row's "old" index to determine delta
diffs = df['index'].diff()
# create a column to show deltas
df['delta'] = diffs
print(diffs)
print(df)
The resulting diffs Series looks like:
0    NaN
1    1.0
2    3.0
3    1.0
4    3.0
5    1.0
Name: index, dtype: float64
Notice the NaN value: the diff() method subtracts the previous row from the current row, and the first row has no previous row, so it is marked NaN. If the data had started with some number of rows where start > end, you would only need to look at the first value of the index column to work out that first delta.
The fully augmented data frame would then look like:
   index   A   B  delta
0      0   5   9    NaN
1      1   6  11    1.0
2      4  15  17    3.0
3      5  16  17    1.0
4      8  20  24    3.0
5      9  22  26    1.0
If you wish to delete any of the extraneous columns, you can use del like so:
del df['index']
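If you also want the actual (violation, recovery) row pairs from the question, one possible sketch (my own extension of the same boolean-mask idea, assuming the default integer index and that the data recovers after each bad stretch) is:
import pandas as pd
df = pd.DataFrame({'A': [5, 6, 13, 14, 15, 16, 18, 19, 20, 22],
                   'B': [9, 11, 11, 11, 17, 17, 17, 17, 24, 26]})
bad = df['A'] >= df['B']                  # rows that violate end > start
runs = (bad != bad.shift()).cumsum()      # label consecutive runs of good/bad rows
for _, run in df[bad].groupby(runs[bad]):
    first_bad = run.iloc[0]               # row where the violation starts
    recovery = df.loc[run.index[-1] + 1]  # first row back to normal
    print(first_bad.tolist(), '->', recovery.tolist())
# [13, 11] -> [15, 17]
# [18, 17] -> [20, 24]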