Given any df with only 3 columns and n rows, I'm trying to split it horizontally, in a loop, at the row where the value in a given column is at its maximum.
Something close to what np.array_split() does, but not necessarily into equal sizes. The split would have to happen at the row whose value is picked by the max rule at that point in the loop. I imagine deciding whether the cut falls just before or just after that row is not the hard part.
An example (sorry, this is my first time asking a question, and I don't yet know how to format code here):
df = pd.DataFrame({'a': [3,1,5,5,4,4], 'b': [1,7,1,2,5,5], 'c': [2,4,1,3,2,2]})
With the max-value condition applied to column b (7), this df would be cut into one df with 2 rows and another with 4 rows.
Perhaps this might help you. Assume our n by 3 dataframe is as follows:
df = pd.DataFrame({'a': [1,2,3,4], 'b': [4,3,2,1], 'c': [2,4,4,3]})
>>> df
a b c
0 1 4 2
1 2 3 4
2 3 2 4
3 4 1 3
We can create a list of rows where max values occur for each column.
rows = [df[df[i] == max(df[i])] for i in df.columns]
>>> rows[0]
a b c
3 4 1 3
>>> rows[2]
a b c
1 2 3 4
2 3 2 4
This can also be written as a list of indexes if preferred.
indexes = [i.index for i in rows]
>>> indexes
[Int64Index([3], dtype='int64'), Int64Index([0], dtype='int64'), Int64Index([1, 2], dtype='int64')]
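To get from "where is the max" to the actual cut the question asks for, here is a minimal sketch of one step of the loop. The helper name split_at_max is made up for illustration, and it assumes the max row belongs to the upper piece and that the first occurrence wins on ties:

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 1, 5, 5, 4, 4],
                   'b': [1, 7, 1, 2, 5, 5],
                   'c': [2, 4, 1, 3, 2, 2]})

def split_at_max(frame, column):
    """Hypothetical helper: cut `frame` just after the first max of `column`."""
    # Positional location of the first maximum (independent of the index labels)
    pos = frame[column].to_numpy().argmax()
    # Upper piece includes the max row; lower piece is everything after it
    return frame.iloc[:pos + 1], frame.iloc[pos + 1:]

top, bottom = split_at_max(df, 'b')
```

For the loop, the same helper could be called again on bottom (or top) with whichever column applies at that step; moving the max row to the lower piece only means slicing at pos instead of pos + 1.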
Suppose I have the following Pandas DataFrame:
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
})
a b c
0 1 4 7
1 2 5 8
2 3 6 9
I want to generate a new pandas.Series so that the values of this series are selected, row by row, from a random column in the DataFrame. So, a possible output for that would be the series:
0 7
1 2
2 9
dtype: int64
(where in row 0 it randomly chose 'c', in row 1 it randomly chose 'a' and in row 2 it randomly chose 'c' again).
I know this can be done by iterating over the rows and using random.choice to choose each row, but iterating over the rows not only has bad performance but also is "unpandonic", so to speak. Also, df.sample(axis=1) would choose whole columns, so all of them would be chosen from the same column, which is not what I want. Is there a better way to do this with vectorized pandas methods?
Here is a fully vectorized solution. Note however that it does not use Pandas methods, but rather involves operations on the underlying numpy array.
import numpy as np
indices = np.random.choice(np.arange(len(df.columns)), len(df), replace=True)
Example output is [2, 1, 2], which corresponds to ['c', 'b', 'c'].
Then use this to slice the numpy array:
df['random'] = df.to_numpy()[np.arange(len(df)), indices]
Results:
a b c random
0 1 4 7 7
1 2 5 8 5
2 3 6 9 9
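For reference, the same idea as a self-contained sketch that returns a Series instead of adding a column. The seeded generator and the names rng, col_idx and picked are my own choices so the run is repeatable; they are not part of the answer above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Seeded generator so the draw can be reproduced (the seed value is arbitrary)
rng = np.random.default_rng(0)

# One random column position per row
col_idx = rng.integers(0, df.shape[1], size=len(df))

# Fancy indexing: pair row i with its randomly chosen column col_idx[i]
picked = pd.Series(df.to_numpy()[np.arange(len(df)), col_idx], index=df.index)
```

Note that to_numpy() on a frame with mixed dtypes falls back to a common dtype (often object), so this works best when all columns share one type.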
Maybe something like:
pd.Series([np.random.choice(i,1)[0] for i in df.values])
This does the job (using the built-in module random):
ddf = df.apply(lambda row : random.choice(row.tolist()), axis=1)
or using pandas sample:
ddf = df.apply(lambda row : row.sample().iloc[0], axis=1)
Both have the same behaviour. ddf is your Series.
pd.DataFrame(
df.values[range(df.shape[0]),
np.random.randint(
0, df.shape[1], size=df.shape[0])])
output
0
0 4
1 5
2 9
You're probably still going to need to iterate through each row while selecting a random value in each row - whether you do it explicitly with a for loop or implicitly with whatever function you decide to call.
You can, however, simplify this to a single line using a list comprehension, if it suits your style:
result = pd.Series([random.choice(df.iloc[i].tolist()) for i in range(len(df))])
If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
    df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge on DataFrame and Series as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
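Keep in mind that merging on the index matches row labels; it does not broadcast the Series values across every row. A small sketch with made-up data (the frame, the series and the name 'new' are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# The Series index deliberately lists the row labels in reverse order
s = pd.Series([10, 20], index=[1, 0], name='new')

# Rows are paired by label, so label 0 receives 20 and label 1 receives 10
merged = df.merge(s, left_index=True, right_index=True)
```

If the Series index holds the new column names instead (as with s1 and s2 in the question), the labels have nothing in common with the rows of df, so this is not the tool for that case.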
You could construct a dataframe from the series and then merge with the dataframe.
Pass the series values as the data, repeated once per row of df, set the columns to the series index, and set both left_index and right_index to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT: if you want the DataFrame constructed from the series to use the index of df, you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the indices match the length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in each column by forward-filling with fillna(method='ffill'); recent pandas versions deprecate the method argument in favour of the equivalent .ffill():
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
Nowadays, a much simpler and more concise solution can achieve the same task. Leveraging the ability of DataFrame.apply() to expand a returned Series into columns of the resulting DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we used DataFrame.apply() with a simple lambda as the applied function on axis=1. The lambda just returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This ability of DataFrame.apply() to expand a returned Series into columns is described in the official documentation as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official documentation also includes an example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
If I could suggest setting up your dataframes like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using df.shape[0] to get the number of rows in df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display (df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int such as in your example. If the column you specify isn't in the df, then pandas will create a new column with the name you specify. So after your dataframe is constructed, (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to make it do this for all the elements in a list of tuples, or keys and values in a dictionary depending on how you have your real data stored.
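As one possible way to do that for a whole Series at once, here is a sketch using DataFrame.assign with the Series unpacked as keyword arguments. It assumes the Series labels are strings that are valid keyword names; the variable name result is mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, 6]}, index=[3, 5, 6])
s = pd.Series({'s1': 5, 's2': 6})

# Each label of s becomes a new column holding that scalar on every row;
# assign returns a new frame and leaves df itself untouched
result = df.assign(**s.to_dict())
```

Because each value is a scalar, it is broadcast to every row regardless of what the index of df looks like.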
If df is a pandas.DataFrame, then df['new_col'] = a list or Series of length len(df) will add that list or Series as a column named 'new_col'. df['new_col'] = scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col'] = [scalar]*len(df)
So a two-line loop serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
    df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
When I add a column using apply on other columns, does pandas store the result in the same row as the one used for the computation? If not, how can I make it do so?
The reason I'm not completely confident is the following example:
df = pd.DataFrame({'index':[0,1,2,3,4], 'value':[1,2,3,4,5]})
df2 = pd.DataFrame({'index':[0,2,1,3,5], 'value':[1,2,3,4,5]})
df['second_value'] = df['value'].apply(lambda x: x**2)
df['third_value'] = df2['value'].apply(lambda x: x**2)
df
The result this yields is
index value second_value third_value
0 1 1 1
1 2 4 4
2 3 9 9
3 4 16 16
4 5 25 25
So what I see here is that pandas only seems to consider the order of the rows. Can a DataFrame be re-sorted at some arbitrary moment, which would mess this up, or can I assume that the order is always preserved when I perform
df['new_value'] = df['old_value'].apply(...)
?
EDIT: In my original code snippet I forgot to set the index, and that is actually where I went wrong. I had df.set_index('index') and df2.set_index('index') before using apply. The problem is that this method returns a copy with the new index. So either you assign the results back to the original dataframes df and df2 or, even better, you pass inplace=True in the method call so that no copy is created and the index is set on the given dataframe.
That's not how you define an index. You need to pass a list/iterable to the index keyword argument when calling the pd.DataFrame constructor.
df = pd.DataFrame({'value' : [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'value' : [1, 2, 3, 4, 5]}, index=[0, 2, 1, 3, 4])
df['second'] = df['value'] ** 2
df['third'] = df2['value'] ** 2
df
value second third
0 1 1 1
1 2 4 9 # note these
2 3 9 4 # two rows
3 4 16 16
4 5 25 25
The assignment operations are always index aligned.
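A short sketch of what alignment means in practice, and how to opt out of it when positional order is really what you want. The data and the column names 'aligned' and 'positional' are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3]})

# Same length as df, but the labels are in reverse order
other = pd.Series([10, 20, 30], index=[2, 1, 0])

# Series assignment is matched by label: row 0 receives the value labelled 0
df['aligned'] = other

# A plain array carries no labels, so it is matched by position instead
df['positional'] = other.to_numpy()
```

So a column computed with apply from another column of the same frame always lands on the rows it was computed from, because both share the same index.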
So I've been doing things like this with pandas:
usrdata['columnA'] = usrdata.apply(functionA, axis=1)
in order to do row operations and changing/adding columns to my dataframe.
However, now I want to try to do something like this:
usrdata['columnB', 'columnC'] = usrdata.apply(functionB, axis=1)
But the output of functionB is apparently a Series with a single column of tuples (two values per row). Is there a nice way for me to either:
format the output from functionB so it can readily be added to my dataframe, or
add (and possibly have to unpack) the output from functionB and assign each value to its own column of my dataframe?
Try using zip:
usrdata['columnB'], usrdata['columnC'] = zip(*usrdata.apply(functionB, axis=1))
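An alternative sketch that lets apply do the unpacking via result_type='expand'. The frame and function_b here are stand-ins I made up, since the real functionB is not shown in the question:

```python
import pandas as pd

usrdata = pd.DataFrame({'x': [1, 2, 3]})

def function_b(row):
    # Stand-in for the question's functionB: returns two values per row
    return row['x'] * 2, row['x'] ** 2

# 'expand' turns each returned tuple into one column per element,
# and the two resulting columns are assigned to the two new names in order
usrdata[['columnB', 'columnC']] = usrdata.apply(
    function_b, axis=1, result_type='expand'
)
```

The zip version avoids building the intermediate frame, while this one keeps everything in a single assignment; which reads better is a matter of taste.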
I'd assign directly to a list of the new column names and modify the func body to return a Series constructed from a list of the data:
In [9]:
df = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df
Out[9]:
a
0 1
1 2
2 3
3 4
4 5
In [10]:
def func(x):
    return pd.Series([x*3, x*10])
df[['b','c']] = df['a'].apply(func)
df
Out[10]:
a b c
0 1 3 10
1 2 6 20
2 3 9 30
3 4 12 40
4 5 15 50
The following code will (of course) keep only the first occurrence of 'Item1' in rows sorted by 'Date'. Any suggestions as to how I could get it to keep, say the first 5 occurrences?
## Sort the dataframe by Date and keep only the earliest appearance of 'Item1'
## drop_duplicates considers the column 'Item1' and keeps only the first occurrence
coocdates = data.sort('Date').drop_duplicates(cols=['Item1'])
You want to use head, either on the dataframe itself or on the groupby:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [1, 6], [2, 8]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
1 1 4
2 1 6
3 2 8
In [13]: df.head(2) # the first two rows
Out[13]:
A B
0 1 2
1 1 4
In [14]: df.groupby('A').head(2) # the first two rows in each group
Out[14]:
A B
0 1 2
1 1 4
3 2 8
Note: the behaviour of groupby's head was changed in 0.14 (before that it didn't act like a filter, but modified the index), so you will have to reset the index if using an earlier version.
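Applied to the question, a sketch could look like the following. The sample data is invented, the column names 'Item1' and 'Date' are taken from the question, and I use 2 instead of 5 to keep the example small:

```python
import pandas as pd

data = pd.DataFrame({
    'Item1': ['x', 'x', 'x', 'y', 'x', 'y'],
    'Date': pd.to_datetime(['2020-01-03', '2020-01-01', '2020-01-02',
                            '2020-01-05', '2020-01-04', '2020-01-06']),
})

# Sort by date first, then keep the earliest two rows of each Item1 group;
# head() filters the sorted frame and keeps its original row labels
first_two = data.sort_values('Date').groupby('Item1').head(2)
```

Replacing head(2) with head(5) gives the first five occurrences per item.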
Use groupby() and nth():
According to the pandas docs, nth() will:
Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.
Therefore all you need is (without inplace=True, which would make reset_index return None and discard the result):
result = df.groupby('Date').nth([0,1,2,3,4]).reset_index()