I'd like to keep the columns in the order in which they were defined with pd.DataFrame. In the example below, df.info() shows that GroupId is the first column, and print prints GroupId first as well.
I'm using Python version 3.6.3.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Id' : np.random.randint(1,100,10),
                   'GroupId' : np.random.randint(1,5,10) })
df.info()
print(df.iloc[:,0])
One way is to use collections.OrderedDict, as below. Note that the OrderedDict object takes a list of tuples as an input.
from collections import OrderedDict
df = pd.DataFrame(OrderedDict([('Id', np.random.randint(1,100,10)),
                               ('GroupId', np.random.randint(1,5,10))]))
# Id GroupId
# 0 37 4
# 1 10 2
# 2 42 1
# 3 97 2
# 4 6 4
# 5 59 2
# 6 12 2
# 7 69 1
# 8 79 1
# 9 17 1
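As an aside (my addition, not part of the answer above): you can also force the order by passing the columns argument explicitly, which selects and orders the dict's keys:
df = pd.DataFrame({'Id': np.random.randint(1, 100, 10),
                   'GroupId': np.random.randint(1, 5, 10)},
                  columns=['Id', 'GroupId'])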
Unless you're using Python 3.6+, where dictionaries preserve insertion order (an implementation detail in 3.6, guaranteed from 3.7), this just isn't possible with a standard dictionary. You will need to zip your items together and pass a list of tuples:
np.random.seed(0)
a = np.random.randint(1, 100, 10)
b = np.random.randint(1, 5, 10)
df = pd.DataFrame(list(zip(a, b)), columns=['Id', 'GroupId'])
Or,
data = [a, b]
df = pd.DataFrame(list(zip(*data)), columns=['Id', 'GroupId'])
df
Id GroupId
0 45 3
1 48 1
2 65 1
3 68 1
4 68 3
5 10 2
6 84 3
7 22 4
8 37 4
9 88 3
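For completeness (my addition): on Python 3.7+ plain dicts are guaranteed to preserve insertion order, and modern pandas respects that order when building the frame, so the original construction works unchanged:
df = pd.DataFrame({'Id': np.random.randint(1, 100, 10),
                   'GroupId': np.random.randint(1, 5, 10)})
df.columns
# Index(['Id', 'GroupId'], dtype='object')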
This is an extension of an earlier question.
I am trying to add an extra column to the groupby:
# Import pandas library
import pandas as pd
import numpy as np
# data
data = [['tom', 10, 2, 'c', 100, 'x'], ['tom', 16, 3, 'a', 100, 'x'], ['tom', 22, 2, 'a', 100, 'x'],
        ['matt', 10, 1, 'c', 100, 'x'], ['matt', 15, 5, 'b', 100, 'x'], ['matt', 14, 1, 'b', 100, 'x']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Attempts','Score','Category','Rating','Other'])
df['AttemptsbyRating'] = df.groupby(by=['Rating','Other'])['Attempts'].transform('count')
df
Then I try to add another column with the count of rows that have a Score greater than 1 (which should equal 4):
df['scoregreaterthan1'] = df['Score'].gt(1).groupby(by=df[['Rating','Other']]).transform('sum')
But I am getting a
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
Any ideas? Thanks very much!
df['Score'].gt(1) returns a boolean Series rather than a DataFrame, and the grouper you pass (df[['Rating','Other']]) is a two-column DataFrame, which is not 1-dimensional; hence the error. Filter the DataFrame first, then group by the relevant columns.
Use:
df = df[df['Score'].gt(1)]
df['scoregreaterthan1'] = df.groupby(['Rating','Other'])['Score'].transform('count')
df
output:
Name Attempts Score Category Rating Other AttemptsbyRating scoregreaterthan1
0 tom 10 2 c 100 x 6 4
1 tom 16 3 a 100 x 6 4
2 tom 22 2 a 100 x 6 4
4 matt 15 5 b 100 x 6 4
If you want to keep the people who have a score that is not greater than one, then instead of this:
df = df[df['Score'].gt(1)]
df['scoregreaterthan1'] = df.groupby(['Rating','Other'])['Score'].transform('count')
do this:
df['scoregreaterthan1'] = df[df['Score'].gt(1)].groupby(['Rating','Other'])['Score'].transform('count')
df['scoregreaterthan1'] = df['scoregreaterthan1'].ffill().astype(int)
output 2:
Name Attempts Score Category Rating Other AttemptsbyRating scoregreaterthan1
0 tom 10 2 c 100 x 6 4
1 tom 16 3 a 100 x 6 4
2 tom 22 2 a 100 x 6 4
3 matt 10 1 c 100 x 6 4
4 matt 15 5 b 100 x 6 4
5 matt 14 1 b 100 x 6 4
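As a further aside (my own variant, not from the answer above), you can compute the per-group count in one step with a lambda inside transform. This avoids both the intermediate filter and the ffill edge case where a leading row with Score <= 1 would stay NaN:
df['scoregreaterthan1'] = (df.groupby(['Rating', 'Other'])['Score']
                             .transform(lambda s: s.gt(1).sum()))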
I already have a solution, but it is very slow (13 minutes for 800 rows). Here is an example of the dataframe:
import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
df
In a new column, I want to calculate how many of the previous values of col2 (for example, the previous three) are greater than or equal to the row's value of col1. The first rows, for which there are not enough previous values, are simply marked with 'x'.
This is my slow code:
start_at_nr = 3  # row at which to start calculating
df["overlap_count"] = ""  # create new column
for row in range(len(df)):
    if row <= start_at_nr - 1:
        df["overlap_count"].loc[row] = "x"
    else:
        df["overlap_count"].loc[row] = (
            df["col2"].loc[row - start_at_nr:row - 1] >=
            (df["col1"].loc[row])).sum()
df
I would like to obtain a faster solution. Thank you for your time!
This is the result I obtain:
col1 col2 overlap_count
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3
IIUC, you can do:
import numpy as np  # needed for np.nan below

df['overlap_count'] = 0
for i in range(1, start_at_nr + 1):
    df['overlap_count'] += df['col1'].le(df['col2'].shift(i))
# mask the first few rows, where there are not enough previous values
df.iloc[:start_at_nr, -1] = np.nan
Output:
col1 col2 overlap_count
0 20 39 NaN
1 23 32 NaN
2 40 42 NaN
3 41 50 1.0
4 48 63 1.0
5 49 68 2.0
6 50 68 3.0
7 50 69 3.0
Takes about 11 ms for 800 rows and start_at_nr=3.
You basically compare the current value of col1 to the previous 3 rows of col2, starting the comparison from row 3. You may use shift as follows:
n = 3
s = ((pd.concat([df.col2.shift(x) for x in range(1, n+1)], axis=1) >= df.col1.values[:, None])
     .sum(1)[3:])
or
s = (pd.concat([df.col2.shift(x) for x in range(1, n+1)], axis=1).ge(df.col1, axis=0)
     .sum(1)[3:])
Out[65]:
3 1
4 1
5 2
6 3
7 3
dtype: int64
To get your desired output, assign it back to df and fillna:
n = 3
s = (pd.concat([df.col2.shift(x) for x in range(1, n+1)], axis=1).ge(df.col1, axis=0)
     .sum(1)[3:])
df_final = df.assign(overlap_count=s).fillna('x')
Out[68]:
col1 col2 overlap_count
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3
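An aside not in the answers above: on NumPy 1.20+ the same windowed comparison can be written with sliding_window_view (a sketch under that assumption):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n = 3
c1 = df['col1'].to_numpy()
c2 = df['col2'].to_numpy()
# windows[j] holds c2[j:j+n], so windows[i-n] are the n values preceding row i
windows = sliding_window_view(c2, n)
counts = (windows[:-1] >= c1[n:, None]).sum(axis=1)
df['overlap_count'] = 'x'  # placeholder for the first n rows
df.loc[n:, 'overlap_count'] = counts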
You could do it with .apply() in a single statement as follows. I have used a convenience function process_row(), which is also included below.
df.assign(OVERLAP_COUNT = (df.reset_index(drop=False).rename(
    columns={'index': 'ID'})).apply(
    lambda x: process_row(x, df, offset=3), axis=1))
For More Speed:
In case you need more speed and are processing a lot of rows, you may consider using swifter library. All you have to do is:
Install swifter: pip install swifter.
Import the library: import swifter.
Replace any .apply() with .swifter.apply() in the code block above; a sketch of the swapped call follows.
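For instance, the swapped call would look like this (a sketch, assuming swifter is installed):
import swifter  # importing registers the .swifter accessor on pandas objects

df.swifter.apply(lambda x: process_row(x, df, offset=3), axis=1)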
Solution in Detail
#!pip install -U swifter
#import swifter
import numpy as np
import pandas as pd
d = {'col1': [20,23,40,41,48,49,50,50], 'col2': [39,32,42,50,63,68,68,69]}
df = pd.DataFrame(data=d)
def process_row(x, df, offset=3):
    # Count how many of the previous `offset` values of col2 are >= this row's col1
    value = (df.loc[x.ID - offset:x.ID - 1, 'col2'] >= df.loc[x.ID, 'col1']).sum() if (x.ID >= offset) else 'x'
    return value
# Use df.swifter.apply() for faster processing, instead of df.apply()
df.assign(OVERLAP_COUNT = (df.reset_index(drop=False, inplace=False).rename(
    columns={'index': 'ID'}, inplace=False)).apply(
    lambda x: process_row(x, df, offset=3), axis=1))
Output:
col1 col2 OVERLAP_COUNT
0 20 39 x
1 23 32 x
2 40 42 x
3 41 50 1
4 48 63 1
5 49 68 2
6 50 68 3
7 50 69 3
I am trying to find some elegant ways of rearranging a pandas dataframe.
My initial dataframe looks like this:
PS PSS 10PS 10PSS 5PS 5PSS
1 6 263 5 23 2 101
2 5 49 2 30 1 30
desired arrangement would be:
1-PS 1-PSS 1-10PS 1-10PSS 1-5PS 1-5PSS 2-PS 2-PSS 2-10PS 2-10PSS 2-5PS 2-5PSS
A 6 263 5 23 2 101 5 49 2 30 1 30
Where A is a new index, and the row labels are merged into the column names.
You need stack here, then join the two index levels into the column names:
s=df.stack().to_frame('A')
s.index=s.index.map('{0[0]}-{0[1]}'.format)
s.T
Out[42]:
1-PS 1-PSS 1-10PS 1-10PSS 1-5PS 1-5PSS 2-PS 2-PSS 2-10PS 2-10PSS \
A 6 263 5 23 2 101 5 49 2 30
2-5PS 2-5PSS
A 1 30
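Equivalently (my preference, assuming Python 3.6+ for f-strings), the index join can be written as a comprehension:
s.index = [f'{a}-{b}' for a, b in s.index]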
Hopefully these lines can help you out:
# Put a pandas Series from each line in a generator
series = (pd.Series(i, index=['{}-{}'.format(ind, x) for x in df.columns])
          for ind, i in zip(df.index, df.values))
# Concatenate and convert to frame + transpose
df = pd.concat(series).to_frame('A').T
Full example:
import io
import pandas as pd
data = '''\
index PS PSS 10PS 10PSS 5PS 5PSS
1 6 263 5 23 2 101
2 5 49 2 30 1 30'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+').set_index('index')
# Put a pandas Series from each line in a generator
series = (pd.Series(i, index=['{}-{}'.format(ind, x) for x in df.columns])
          for ind, i in zip(df.index, df.values))
# Concatenate and convert to frame + transpose
df = pd.concat(series).to_frame('A').T
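The result matches the desired arrangement:
print(df)
#    1-PS  1-PSS  1-10PS  1-10PSS  1-5PS  1-5PSS  2-PS  2-PSS  2-10PS  2-10PSS  2-5PS  2-5PSS
# A     6    263       5       23      2     101     5     49       2       30      1      30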
I have a pandas dataframe
>>> df = pd.DataFrame()
>>> df['a'] = np.random.choice(range(0,100), 200)
>>> df['b'] = np.random.choice([0,1], 200)
>>> df.head()
a b
0 69 1
1 49 1
2 79 1
3 88 0
4 57 0
>>>
Some of the variables (in this example 'a') have a lot of unique values.
I would like to replace 'a' with a2, where a2 has 5 unique values. In other words, I want to define 5 groups and assign each value of a to one of them.
For example, a2=1 if 0 <= df['a'] < 20, a2=2 if 20 <= df['a'] < 40, and so on.
Note:
I used groups of size 20 because 100/5 = 20.
How can I do that using numpy or pandas or something else?
EDIT:
Possible solution
def group_array(a):
    # Rescale to [0, 100], then integer-divide into bins of width 20
    a = a - a.min()
    a = 100 * a / a.max()
    a = (a.apply(int) // 20) + 1
    return a
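For reference, a usage sketch of that helper. Note an edge case: the maximum value of a lands in a sixth group, since int(100) // 20 + 1 == 6; the pd.cut answer below avoids this.
df['a2'] = group_array(df['a'])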
You could use pd.cut to categorize the values in df['a']:
import pandas as pd
df = pd.DataFrame({'a':[69,49,79,88,57], 'b':[1,1,1,0,0]})
df['a2'] = pd.cut(df['a'], bins=range(0, 101, 20), labels=range(1, 6))
print(df)
yields
a b a2
0 69 1 4
1 49 1 3
2 79 1 4
3 88 0 5
4 57 0 3
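As an aside (my addition): if you want five groups of roughly equal size rather than equal-width bins, pd.qcut is the quantile-based counterpart:
df['a2'] = pd.qcut(df['a'], q=5, labels=range(1, 6))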
I have a list of arrays (one-dimensional numpy array) (a_) and a list (l_) and want to have a DataFrame with them as its columns. They look like this:
a_: [array([381]), array([376]), array([402]), array([400])...]
l_: [1.5,2.34,4.22,...]
I can do it by:
df_l = pd.DataFrame(l_)
df_a = pd.DataFrame(a_)
df = pd.concat([df_l, df_a], axis=1)
Is there a shorter way of doing it? I tried to use DataFrame.append:
df_l = pd.DataFrame(l_)
df_l = df_l.append(a_)
However, because both column indices are 0, it appends a_ to the end of the existing column, resulting in a single column. Is there something like this:
l_ = l_.append(a_).reset(columns)
that sets a new column index for the appended array? Well, obviously this does not work!
the desired output is like:
0 0
0 1.50 381
1 2.34 376
2 4.22 402
...
Thanks.
Suggestion:
df_l = pd.DataFrame(l_)
df_l['a_'] = pd.Series(a_, index=df_l.index)
Example #1:
L = list(data)  # 'data' here stands for any iterable
A = list(data)
data_frame = pd.DataFrame(L)
data_frame['A'] = pd.Series(A, index=data_frame.index)
Example #2 - Same Series length (create series and set index to the same as existing data frame):
In [33]: L = list(item for item in range(10))
In [34]: A = list(item for item in range(10,20))
In [35]: data_frame = pd.DataFrame(L,columns=['L'])
In [36]: data_frame['A'] = pd.Series(A, index=data_frame.index)
In [37]: print(data_frame)
L A
0 0 10
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
Example #3 - Different Series lengths (create series and let pandas handle index matching):
In [45]: not_same_length = list(item for item in range(50,55))
In [46]: data_frame['nsl'] = pd.Series(not_same_length)
In [47]: print(data_frame)
L A nsl
0 0 10 50
1 1 11 51
2 2 12 52
3 3 13 53
4 4 14 54
5 5 15 NaN
6 6 16 NaN
7 7 17 NaN
8 8 18 NaN
9 9 19 NaN
Based on your comments, it looks like you want to flatten your list of lists. I'm assuming they can be treated as plain lists, since array(...) is just how numpy displays them. To do that, you would do the following:
In [63]: A = [[381],[376], [402], [400]]
In [64]: A = [inner_item for item in A for inner_item in item]
In [65]: print(A)
[381, 376, 402, 400]
Then create the Series using the new list and follow the steps above to add it to your data frame.
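Putting it together, a minimal sketch using the names from the question (the values below are illustrative, since the question elides the full lists):
import numpy as np
import pandas as pd

a_ = [np.array([381]), np.array([376]), np.array([402]), np.array([400])]
l_ = [1.50, 2.34, 4.22, 5.01]  # illustrative; the question truncates this list

df_l = pd.DataFrame(l_)
df_l['a_'] = pd.Series(np.concatenate(a_), index=df_l.index)  # flatten, then attach
print(df_l)
#       0   a_
# 0  1.50  381
# 1  2.34  376
# 2  4.22  402
# 3  5.01  400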