R sequence function in Python

pandas version: 1.2
I am trying to take a pandas DataFrame column and recreate the same logic as in R, which would be
ss=sequence(df$los)
Which produces for the first two records
[1] 1 2 3 4 5 1 2 3 4 5
Example dataframe:
df = pd.DataFrame([('test', 5), ('t2', 5), ('t3', 2), ('t4', 6)],
                  columns=['first', 'los'])
df
  first  los
0  test    5
1    t2    5
2    t3    2
3    t4    6
So the first row is sequenced 1-5, the second row is sequenced 1-5, the third row is sequenced 1-2, and so on. In R this becomes one sequenced list. I would like to do the same in Python.
What I have been able to do so far is:
ss = df['los']
ss.apply(lambda x: np.array(range(1, x + 1)))
18     [1, 2, 3, 4, 5]
90     [1, 2, 3, 4, 5]
105    [1, 2]
106    [1, 2, 3, 4, 5, 6]
Which is close, but then I need to combine it into a single pd.Series so that it becomes:
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 6]

Use explode():
df.los.apply(lambda x: np.arange(1, x+1)).explode().tolist()
Output:
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 6]
Note - you can skip the ss assignment step, and use np.arange to streamline a bit.
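If you want the result back as a pd.Series rather than a plain list, a small variation of the same idea (just a sketch) keeps the exploded values, fixes the dtype and resets the index:
ss = (
    df['los']
    .apply(lambda x: np.arange(1, x + 1))
    .explode()
    .astype(int)
    .reset_index(drop=True)
)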

You can just use concatenate:
np.concatenate([np.arange(x)+1 for x in df['los']])
Output:
array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 6])
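If los can get large and you want to skip the Python-level loop entirely, a fully vectorized sketch (assuming los holds positive integers) produces the same array with np.repeat and a cumulative sum:
los = df['los'].to_numpy()
starts = np.cumsum(los) - los            # starting offset of each row's run
np.arange(los.sum()) - np.repeat(starts, los) + 1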

Related

Getting the sum of rows until a certain point

I would like some code that adds one to the count from the row above until a new 'SCU_KEY' comes up. For example, here is my data and what I would like:
df = pd.DataFrame({'SCU_KEY' : [3, 3, 3, 5, 5, 5, 7, 8, 8, 8, 8], 'count':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})
Expected output:
df = pd.DataFrame({'SCU_KEY' : [3, 3, 3, 5, 5, 5, 7, 8, 8, 8, 8], 'count':[1, 2, 3, 1, 2, 3, 1, 1, 2, 3, 4]})
You can try this:
import pandas as pd
df = pd.DataFrame({
    'SCU_KEY': [3, 3, 3, 5, 5, 5, 7, 8, 8, 8, 8],
    'count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
})
s = df['SCU_KEY']
df['count'] = s.groupby(s).cumcount() + 1
print(df)
It gives:
    SCU_KEY  count
0         3      1
1         3      2
2         3      3
3         5      1
4         5      2
5         5      3
6         7      1
7         8      1
8         8      2
9         8      3
10        8      4
This assumes that values of the SCU_KEY column cannot reappear once they change, or that they can reappear but then you want to continue counting them where you left off.
If, instead, each contiguous sequence of repeating values should be counted starting from 1, then you can use this instead:
s = df['SCU_KEY']
df['count'] = s.groupby((s.shift() != s).cumsum()).cumcount() + 1
For the above dataframe the result will be the same as before, but you can add, say, 3 at the end of the SCU_KEY column to see the difference.
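For instance, a quick sketch with that extra 3 appended:
s = pd.Series([3, 3, 3, 5, 5, 5, 7, 8, 8, 8, 8, 3])
s.groupby(s).cumcount() + 1                          # trailing 3 is counted as the 4th 3 -> 4
s.groupby((s.shift() != s).cumsum()).cumcount() + 1  # trailing 3 starts a new run -> 1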
This will do the job:
import pandas as pd
df = pd.DataFrame({'SCU_KEY' : [3, 3, 3, 5, 5, 5, 7, 8, 8, 8, 8], 'count':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})
for item in set(df['SCU_KEY']):
    inc = 0
    for i in range(len(df.index)):
        if df['SCU_KEY'][i] == item:
            df.loc[i, 'count'] += inc  # use .loc to avoid chained-assignment issues
            inc += 1
P.S. As others have mentioned, it's good practice to show what you have tried before asking for a solution; it shows effort, which everyone appreciates and which encourages people to help you.

How to extract values from a column of lists in a pandas dataframe

I want to extract the values from the list so that I can perform some pandas operation on the string.
Distance                         TGR Grade               TGR1
[342m, 342m, 530m, 342m]         [M, M, RW, RW]          [1, 1, 7, 1]
[390m, 390m, 390m, 390m, 450]    [M, 7, 6G, X45, X67]    [1, 2, 4, 5, 5]
[]                               []                      []
I need a clean df of this form.
Distance                   TGR Grade          TGR1
342m,342m,530m,342m        M,M,RW,RW          1,1,7,1
390m,390m,390m,390m,450    M,7,6G,X45,X67     1,2,4,5,5
I have tried the below functions:
df.columns = [''.join(i.split()) for i in df.columns]
df = df.applymap(lambda x: ''.join(x.strip('\[').strip('\]').split()))
and
df = df.replace('[', '')
df = df.replace(']', '')
My first attempt led to this error:
AttributeError: 'list' object has no attribute 'strip'
Checking the values in an individual column resulted in this:
df['TGR1']
0 [1, 1, 7, 1, 1, 8, 8, 1, 1, 8]
1 [1, 2, 4, 5, 5, 1, 2, 7, 6, 8]
2 [6, 1, 4, 4, 7, 1, 7, 1, 8, 3, 4, 5]
3 [1, 7, 4, 4, 3, 2, 1, 1, 2, 2, 2, 1]
4 [3, 4, 5, 2, 1, 8, 5, 2, 3, 6, 5, 3]
Try:
df = df.apply(lambda x: x.apply(lambda x: ','.join(map(str, x))))
Output:
                  Distance       TGR Grade       TGR1
0      342m,342m,530m,342m       M,M,RW,RW    1,1,7,1
1  390m,390m,390m,390m,450  M,7,6G,X45,X67  1,2,4,5,5
2
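A roughly equivalent sketch, if you prefer an element-wise method, is DataFrame.applymap (note that newer pandas versions rename this to DataFrame.map):
df.applymap(lambda lst: ','.join(map(str, lst)))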
You could also check out DataFrame.explode (exploding multiple columns requires pandas 1.3+), though that gives one row per list element rather than comma-joined strings:
df.explode(['Distance', 'TGR Grade', 'TGR1'])

How to repeat a numpy array on both axes? [duplicate]

I have a 2D array, let's say the array is:
1 2 3
4 5 6
I want it to repeat 3 times on both axis, so it will look like:
1 1 1 2 2 2 3 3 3
1 1 1 2 2 2 3 3 3
1 1 1 2 2 2 3 3 3
4 4 4 5 5 5 6 6 6
4 4 4 5 5 5 6 6 6
4 4 4 5 5 5 6 6 6
I've tried using numpy.repeat but was unsuccessful.
Any suggestions? Thanks!
You can do it with the Kronecker product, np.kron, and a ones array the size of the block.
a = np.arange(6).reshape(2,3) + 1
np.kron(a, np.ones((3,3), dtype = a.dtype))
Out[]:
array([[1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3],
       [4, 4, 4, 5, 5, 5, 6, 6, 6],
       [4, 4, 4, 5, 5, 5, 6, 6, 6],
       [4, 4, 4, 5, 5, 5, 6, 6, 6]])
You can do it with numpy.repeat:
>>> data = np.array([[1,2,3],[4,5,6]])
>>> data
array([[1, 2, 3],
       [4, 5, 6]])
>>> np.repeat(data,[3,3,3],axis=1).repeat([3,3],axis=0)
array([[1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3],
       [1, 1, 1, 2, 2, 2, 3, 3, 3],
       [4, 4, 4, 5, 5, 5, 6, 6, 6],
       [4, 4, 4, 5, 5, 5, 6, 6, 6],
       [4, 4, 4, 5, 5, 5, 6, 6, 6]])
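Since every row and column gets the same repeat count here, a scalar works too (a small simplification, same result):
>>> data.repeat(3, axis=0).repeat(3, axis=1)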

Combining data contained in several lists

I am working on a personal project in Python 3.6. I used pandas to import the data from an Excel file into a dataframe and then extracted the data into several lists.
Now, I will give an example to illustrate exactly what I am trying to achieve.
So I have, let's say, 3 input lists a, b and c (I inserted the index row and some additional whitespace so the lists are easier to follow):
0 1 2 3 4 5 6
a=[1, 5, 6, [10,12,13], 1, [5,3] ,7]
b=[3, [1,2], 3, [5,6], [1,3], [5,6], 9]
c=[1, 0 , 4, [1,2], 2 , 8 , 9]
I am trying to combine the data so that I get all combinations whenever one of the lists has a nested list with multiple elements at a given position. So the output needs to look like this:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
a=[1, 5, 5, 6, 10,10,10, 10, 12, 12, 12, 12, 13, 13, 13, 13, 1, 1, 5, 5, 3, 3, 7]
b=[3, 1, 2, 3, 5, 5, 6, 6, 5, 5, 6, 6, 5, 5, 6, 6, 1, 3, 5, 6, 5, 6, 9]
c=[1, 0, 0, 4, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 8, 8, 8, 8, 9]
To make this more clear:
From the original lists, if we look at the elements at index 1:
a[1] = 5, b[1] = [1, 2], c[1] = 0. These get transformed into the following values at index positions 1 and 2: a[1:3] = [5, 5]; b[1:3] = [1, 2]; c[1:3] = [0, 0].
The same needs to be applied to indices 3, 4 and 5 of the original input lists to obtain the example output above.
I want to be able to generalize this to more lists (a, b, c, ..., n). I have been able to do this for two lists, but in a way that is neither elegant nor pythonic, and I think the code I wrote can't be generalized to more lists.
I am looking for some help, or at least some pointers to reading material, that can help me achieve what I described above.
Thank you!
You could do something like this. It looks at each column, works out the combinations, then outputs the lists:
import pandas as pd
import numpy

a = [1, 5, 6, [10, 12, 13], 1, [5, 3], 7]
b = [3, [1, 2], 3, [5, 6], [1, 3], [5, 6], 9]
c = [1, 0, 4, [1, 2], 2, 8, 9]

df = pd.DataFrame([a, b, c])
final_df = pd.DataFrame()
i = 0
for col in df.columns:
    temp_df = pd.DataFrame(df[col])
    get_combo = []
    for idx, row in temp_df.iterrows():
        get_combo.append([row[i]])
    combo_list = [list(x) for x in numpy.array(numpy.meshgrid(*get_combo)).T.reshape(-1, len(get_combo))]
    temp_df_alpha = pd.DataFrame(combo_list).T
    i += 1
    if len(final_df) == 0:
        final_df = temp_df_alpha
    else:
        final_df = pd.concat([final_df, temp_df_alpha], axis=1, sort=False)

for idx, row in final_df.iterrows():
    print(row.tolist())
Output:
[1, 5, 5, 6, 10, 10, 12, 12, 13, 13, 10, 10, 12, 12, 13, 13, 1, 1, 5, 5, 3, 3, 7]
[3, 1, 2, 3, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6, 1, 3, 5, 6, 5, 6, 9]
[1, 0, 0, 4, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 8, 8, 8, 8, 9]
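If you want something that generalizes to any number of lists without pandas or meshgrid, a plain-Python sketch using itertools.product, working position by position, could look like this (the helper name expand is just for illustration):
from itertools import product

def expand(*lists):
    # Treat a scalar entry as a one-option list, a nested list as its options.
    as_options = lambda v: v if isinstance(v, list) else [v]
    out = [[] for _ in lists]
    for position in zip(*lists):
        for combo in product(*(as_options(v) for v in position)):
            for acc, value in zip(out, combo):
                acc.append(value)
    return out

a2, b2, c2 = expand(a, b, c)   # matches the expected output in the question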

Groupby and reduce pandas dataframes with numpy arrays as entries

I have a pandas.DataFrame with the following structure:
>>> data
a  b  values
1  0  [1, 2, 3, 4]
2  0  [3, 4, 5, 6]
1  1  [1, 3, 7, 9]
2  1  [2, 4, 6, 8]
(The 'values' entries are of type numpy.ndarray.) What I want to do is group the data by column 'a' and then combine the lists of values.
My goal is to end up with the following:
>>> data
a values
1 [1, 2, 3, 4, 1, 3, 7, 9]
2 [3, 4, 5, 6, 2, 4, 6, 8]
Note that the order of the values does not matter. How do I achieve this? I thought about something like
>>> grps = data.groupby(['a'])
>>> grps['values'].agg(np.concatenate)
but this fails with a KeyError. I'm sure there is an idiomatic pandas way to achieve this, but how?
Thanks.
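(For reference, a minimal sketch of how to construct this example frame, assuming the entries really are numpy arrays, so the snippets below can be run directly:)
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [0, 0, 1, 1],
    'values': [np.array([1, 2, 3, 4]), np.array([3, 4, 5, 6]),
               np.array([1, 3, 7, 9]), np.array([2, 4, 6, 8])],
})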
Similar to John Galt's answer, you can group and then apply np.hstack:
In [278]: df.groupby('a')['values'].apply(np.hstack)
Out[278]:
a
1 [1, 2, 3, 4, 1, 3, 7, 9]
2 [3, 4, 5, 6, 2, 4, 6, 8]
Name: values, dtype: object
To get back your frame, you'll need pd.Series.to_frame and reset_index:
In [311]: df.groupby('a')['values'].apply(np.hstack).to_frame().reset_index()
Out[311]:
a values
0 1 [1, 2, 3, 4, 1, 3, 7, 9]
1 2 [3, 4, 5, 6, 2, 4, 6, 8]
Performance
df_test = pd.concat([df] * 10000) # setup
%timeit df_test.groupby('a')['values'].apply(np.hstack) # mine
1 loop, best of 3: 219 ms per loop
%timeit df_test.groupby('a')['values'].sum() # John's
1 loop, best of 3: 4.44 s per loop
sum is very inefficient for lists, and it does not work as intended when the 'values' entries are np.arrays.
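The reason, for what it's worth: + on numpy arrays is element-wise, so summing arrays adds them rather than concatenating them, e.g.:
sum([np.array([1, 2, 3, 4]), np.array([1, 3, 7, 9])])   # -> array([ 2,  5, 10, 13]), not concatenation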
You can use sum to join lists.
In [640]: data.groupby('a')['values'].sum()
Out[640]:
a
1 [1, 2, 3, 4, 1, 3, 7, 9]
2 [3, 4, 5, 6, 2, 4, 6, 8]
Name: values, dtype: object
Or,
In [653]: data.groupby('a', as_index=False).agg({'values': 'sum'})
Out[653]:
a values
0 1 [1, 2, 3, 4, 1, 3, 7, 9]
1 2 [3, 4, 5, 6, 2, 4, 6, 8]
