I used regex to extract patterns from a CSV document; the pattern is (qty x volume in L), e.g. 2x2L or 3x4L. (Note that one cell can contain more than one pattern, e.g. "I want 2x4L and 3x1L".)
0 []
1 [(2, x1L), (2, x4L)]
2 [(1, x1L), (1, x4L)]
3 [(2, x4L)]
4 [(1, x4L), (1, x1L)]
...
95 [(1, x2L)]
96 [(1, x1L), (1, x4L)]
97 [(2, x1L)]
98 [(6, x1L)]
99 [(6, x1L), (4, x2L), (4, x4L)]
Name: cards__name, Length: 100, dtype: object
I want to create 3 columns called "1L", "2L" and "4L" and then, for every item, take the quantity and add it to the relevant row under the relevant column, like so:
1L 2L 4L
2 0 2
1 0 1
0 0 2
1 0 1
However, I am not able to index into the tuples to extract the quantity and the volume size for each item.
Any ideas?
Before you can use pivot you have to normalize your columns, e.g. this way:
# assuming each row of 'order_1' holds a single (quantity, volume) tuple
df['multiplier_1'] = df['order_1'].apply(lambda r: r[0])   # quantity
df['base_volume_1'] = df['order_1'].apply(lambda r: r[1])  # volume, e.g. 'x4L'
That way you can ungroup the orders and eventually split them into the separate base volumes.
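Building on that idea, here is a minimal end-to-end sketch for the data shown in the question, assuming each cell of the extracted Series is a list of (quantity, 'x<size>') tuples; the sample Series below is made up to match the printout, and if the quantities come out of the regex as strings you would cast them with astype(int) first:
import pandas as pd

# made-up sample mirroring the question's Series
s = pd.Series([[], [(2, 'x1L'), (2, 'x4L')], [(1, 'x1L'), (1, 'x4L')], [(2, 'x4L')]],
              name='cards__name')

exploded = s.explode().dropna()                      # one row per (qty, size) tuple
tmp = pd.DataFrame(exploded.tolist(), index=exploded.index,
                   columns=['qty', 'size'])
tmp['size'] = tmp['size'].str.lstrip('x')            # 'x4L' -> '4L'

# sum quantities per original row and size, then force the three target columns
result = (tmp.pivot_table(index=tmp.index, columns='size', values='qty',
                          aggfunc='sum', fill_value=0)
             .reindex(index=s.index, columns=['1L', '2L', '4L'], fill_value=0))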
I have a dataframe like the following
df
entry
0 (5, 4)
1 (4, 2, 1)
2 (0, 1)
3 (2, 7)
4 (9, 4, 3)
I would like to keep only the entries that contain two values
df
entry
0 (5, 4)
1 (0, 1)
2 (2, 7)
If the entries are tuples, use Series.str.len to get their lengths, compare with Series.le for <=, and filter with boolean indexing:
df1 = df[df['entry'].str.len().le(2)]
print (df1)
entry
0 (5, 4)
2 (0, 1)
3 (2, 7)
If the entries are strings, count the commas with Series.str.count and compare with Series.lt for <:
df2 = df[df['entry'].str.count(',').lt(2)]
print (df2)
entry
0 (5,4)
2 (0,1)
3 (2,7)
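A small side note: if you need rows with exactly two values rather than at most two, the same patterns work with Series.eq:
df1 = df[df['entry'].str.len().eq(2)]       # tuples: exactly two items
df2 = df[df['entry'].str.count(',').eq(1)]  # strings: one comma means two values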
I'm stuck on rolling a window over multiple columns in Pandas. What I have is:
import pandas as pd

df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8]})

def test(ts):
    print(ts.shape)

df.rolling(2).apply(test)
However, the problem is that ts.shape prints (2,) and I wanted it to print (2, 2), i.e. include the whole window of both rows and columns.
What is wrong with my intuition about how rolling works, and how can I get the result I'm after using Pandas?
You can use a little hack: get the number of numeric columns with select_dtypes and use that scalar value:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8], 'C':list('abcd')})
print (df)
A B C
0 1 5 a
1 2 6 b
2 3 7 c
3 4 8 d
cols = len(df.select_dtypes(include=[np.number]).columns)
print (cols)
2
def test(ts):
    print(tuple((ts.shape[0], cols)))
    return ts.sum()

df = df.rolling(2).apply(test)

(2, 2)
(2, 2)
(2, 2)
(2, 2)
(2, 2)
(2, 2)
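If all you need inside the function is the full two-dimensional window, a simple alternative is to slice the frame yourself instead of going through rolling.apply (newer pandas, 1.3+, also has a rolling(..., method='table') option for the numba engine, but the manual slice below is a minimal, version-independent sketch):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

window = 2
for start in range(len(df) - window + 1):
    ts = df.iloc[start:start + window]   # DataFrame holding the whole window
    print(ts.shape)                      # prints (2, 2) for every window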
I am currently working with a pandas DataFrame that uses tuples for column names. When attempting to use .loc as I would for normal columns, the tuple names cause it to error out.
Test code is below:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,4),
                   columns=[('a','1'), ('b','2'), ('c','3'), 'nontuple'])
df1.loc[:3, 'nontuple']
df1.loc[:3, ('c','3')]
The first .loc line works as expected and displays the column 'nontuple' for rows 0:3. The second .loc line does not work and instead gives the error:
KeyError: "None of [('c', '3')] are in the [columns]"
Any idea how to resolve this issue short of not using tuples as column names?
Also, I have found that the code below works even though the .loc doesn't:
df1.ix[:3][('c','3')]
Documentation
access by tuple, returns DF:
In [508]: df1.loc[:3, [('c', '3')]]
Out[508]:
(c, 3)
0 1.433004
1 -0.731705
2 -1.633657
3 0.565320
access by non-tuple column, returns series:
In [514]: df1.loc[:3, 'nontuple']
Out[514]:
0 0.783621
1 1.984459
2 -2.211271
3 -0.532457
Name: nontuple, dtype: float64
access by non-tuple column, returns DF:
In [517]: df1.loc[:3, ['nontuple']]
Out[517]:
nontuple
0 0.783621
1 1.984459
2 -2.211271
3 -0.532457
access any column by its number, returns series:
In [515]: df1.iloc[:3, 2]
Out[515]:
0 1.433004
1 -0.731705
2 -1.633657
Name: (c, 3), dtype: float64
access any column(s) by its number, returns DF:
In [516]: df1.iloc[:3, [2]]
Out[516]:
(c, 3)
0 1.433004
1 -0.731705
2 -1.633657
NOTE: pay attention to the differences between .loc[] and .iloc[] - they filter rows differently!
this works like Python's slicing:
In [531]: df1.iloc[0:2]
Out[531]:
(a, 1) (b, 2) (c, 3) nontuple
0 0.650961 -1.130000 1.433004 0.783621
1 0.073805 1.907998 -0.731705 1.984459
this includes right index boundary:
In [532]: df1.loc[0:2]
Out[532]:
(a, 1) (b, 2) (c, 3) nontuple
0 0.650961 -1.130000 1.433004 0.783621
1 0.073805 1.907998 -0.731705 1.984459
2 -1.511939 0.167122 -1.633657 -2.211271
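Since .ix is deprecated (and removed in modern pandas), here are a couple of ways to get at a tuple-named column without it, sketched against the same df1:
df1[('c', '3')]            # plain column access returns the Series
df1.loc[:3, [('c', '3')]]  # wrapping the tuple in a list keeps .loc happy, returns a DataFrame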
I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use np.bincount():
import numpy as np
ids = [1,1,1,2,2,2,3]
data = [20,30,0,4,8,9,18]
print(np.bincount(ids, weights=data))
the output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, and so on.
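Note that np.bincount needs the ids to be (reasonably small) non-negative integers; if yours are arbitrary labels, one option is to map them to contiguous codes with np.unique first (a sketch with made-up ids):
import numpy as np

ids = np.array([10, 10, 10, 205, 205, 205, 3])     # arbitrary example ids
data = np.array([20, 30, 0, 4, 8, 9, 18])

uniq, codes = np.unique(ids, return_inverse=True)  # codes run 0..len(uniq)-1
sums = np.bincount(codes, weights=data)
# uniq -> [  3  10 205], sums -> [18. 50. 21.]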
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using this module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
id score
0 1 20
1 1 30
2 1 0
3 2 4
4 2 8
5 2 9
6 3 18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
score
id
1 50
2 21
3 18
By default, the result would be sorted by the group keys, so I use the flag sort=False, which might improve speed for huge dataframes.
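If you prefer to keep id as a regular column instead of the index, the same aggregation can be written as:
df.groupby('id', sort=False, as_index=False)['score'].sum()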
You can try using boolean operations:
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more effective than using np.any, but will clearly have trouble if you have a very large number of unique ids to go along with large overall size of the data table.
If you're looking only for the sum, you probably want to go with bincount. If you also need other grouping operations like product, mean, std, etc., have a look at https://github.com/ml31415/numpy-groupies . It's the fastest python/numpy grouping package around; see the speed comparison there.
Your sum operation there would look like:
import numpy_groupies as npg
res = npg.aggregate(id, score)   # func defaults to 'sum'
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
unique_ids, sums = npi.group_by(id).sum(score)   # returns (unique keys, per-key sums)
You can use a for loop and numba:
import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0          # integer zero bins, one per possible id
    for i in range(len(b)):
        bins[b[i]] += w[i]   # add each weight to its id's bin
    return bins
Using @HYRY's variables
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe using itertools.groupby you can group on the ID and then iterate over the grouped data.
(The data must be sorted by the grouping key, in this case the ID.)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
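If you actually want the per-id totals (the original goal), the same groups can be reduced with a dict comprehension; the data still has to be sorted by id first:
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> {key: sum(row[2] for row in rows)
...  for key, rows in itertools.groupby(data, key=lambda x: x[0])}
{1: 50, 2: 4}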