Pandas, concat Series to DF as rows - python

I am attempting to add a Series to an empty DataFrame and cannot find an answer
either in the docs or in other questions. Since you can append two DataFrames by row
or by column, it would seem there must be an "axis marker" missing from a Series. Can
anyone explain why this does not work?
import pandas as pd
df1 = pd.DataFrame()
s1 = pd.Series(['a',5,6])
df1 = pd.concat([df1,s1],axis = 1)
#go run some process return s2, s3, sn ...
s2 = pd.Series(['b',8,9])
df1 = pd.concat([df1,s2],axis = 1)
s3 = pd.Series(['c',10,11])
df1 = pd.concat([df1,s3],axis = 1)
If my example above is somehow misleading, perhaps using the example from the docs will help.
Quoting: Appending rows to a DataFrame.
While not especially efficient (since a new object must be created), you can append a
single row to a DataFrame by passing a Series or dict to append, which returns a new DataFrame as above. End Quote.
The example from the docs appends "s", which is a row from a DataFrame. "s1" is a Series,
and attempting to append "s1" produces an error. My question is: WHY will appending "s1" not work? The assumption behind the question is that a DataFrame must encode or contain axis information for two axes, whereas a Series contains information for only one axis.
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.xs(3)  # row with index label 3 (the fourth row of the DataFrame)
s1 = pd.Series([np.random.randn(4)])  # new Series of equal len
df = df.append(s, ignore_index=True)
Result
   0  1
0  a  b
1  5  8
2  6  9
Desired
   0  1  2
0  a  5  6
1  b  8  9

You were close; just transpose the result from concat:
In [14]: s1
Out[14]:
0 a
1 5
2 6
dtype: object
In [15]: s2
Out[15]:
0 b
1 8
2 9
dtype: object
In [16]: pd.concat([s1, s2], axis=1).T
Out[16]:
0 1 2
0 a 5 6
1 b 8 9
[2 rows x 3 columns]
You also don't need to create the empty DataFrame.
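One way to see why the transpose is needed: a Series has a single axis, so with axis=1 concat treats each Series as a column. A minimal sketch, assuming you would rather build each row explicitly, turns every Series into a one-row frame with to_frame().T and stacks the rows in a single call (the variable names are illustrative):

```python
import pandas as pd

s1 = pd.Series(['a', 5, 6])
s2 = pd.Series(['b', 8, 9])

# to_frame() gives a one-column frame; .T turns it into a one-row frame
rows = [s.to_frame().T for s in (s1, s2)]

# stack the one-row frames vertically and renumber the rows
df = pd.concat(rows, ignore_index=True)
print(df)
```

Collecting the rows in a list and concatenating once avoids creating a new DataFrame on every iteration.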

The best way is to use the DataFrame constructor to build a DataFrame from a sequence of Series, rather than using concat:
import pandas as pd
s1 = pd.Series(['a',5,6])
s2 = pd.Series(['b',8,9])
pd.DataFrame([s1, s2])
Output:
In [4]: pd.DataFrame([s1, s2])
Out[4]:
0 1 2
0 a 5 6
1 b 8 9

Another way to accomplish the same objective as appending a Series to a DataFrame
is to convert the data to a list of rows and append that list to the DataFrame.
# data as a list of rows
def get_example(idx):
    list1 = (idx+1, idx+2, chr(idx + 97))
    data = [list1]
    return data
df1 = pd.DataFrame()
for idx in range(4):
    data = get_example(idx)
    df1 = df1.append(data, ignore_index=True)
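DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above fails on current versions. A sketch of the same idea, assuming it is acceptable to collect the rows first and build the frame once at the end (get_example is the helper from the answer above, repeated so the block runs on its own):

```python
import pandas as pd

def get_example(idx):
    # one row of data wrapped in a list, as in the answer above
    return [(idx + 1, idx + 2, chr(idx + 97))]

rows = []
for idx in range(4):
    # extend() adds the row(s) returned by the helper to the running list
    rows.extend(get_example(idx))

# build the DataFrame once from all collected rows
df1 = pd.DataFrame(rows)
print(df1)
```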

Related

DataFrame insert row

I am having some trouble with my Python work.
My steps are:
1) add the list to an existing DataFrame
2) delete the column that holds the minimum value in the list
My list is called 'each_c' and my existing DataFrame is called 'df_col'.
I want it to become like this:
Hope someone can help me, thanks!
This is clearly described in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df_col.drop(columns=[3])
Convert each_c to a Series, append it with DataFrame.append, then get the label of the minimal value with Series.idxmin and pass it to drop. This removes only the first column that holds the minimum:
s = pd.Series(each_c)
df = df_col.append(s, ignore_index=True).drop(s.idxmin(), axis=1)
If you need to remove all columns when the minimum occurs more than once:
each_c = [-0.025,0.008,-0.308,-0.308]
s = pd.Series(each_c)
df_col = pd.DataFrame(np.random.random((10,4)))
df = df_col.append(s, ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print (df)
0 1
0 0.602312 0.641220
1 0.586233 0.634599
2 0.294047 0.339367
3 0.246470 0.546825
4 0.093003 0.375238
5 0.765421 0.605539
6 0.962440 0.990816
7 0.810420 0.943681
8 0.307483 0.170656
9 0.851870 0.460508
10 -0.025000 0.008000
EDIT: If the solution raises the error:
IndexError: Boolean index has wrong length:
it means the columns do not have the default range names 0, 1, 2, 3. A possible solution is to set the index values of the Series to the column names with rename:
each_c = [-0.025,0.008,-0.308,-0.308]
df_col = pd.DataFrame(np.random.random((10,4)), columns=list('abcd'))
s = pd.Series(each_c).rename(dict(enumerate(df_col.columns)))
df = df_col.append(s, ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print (df)
a b
0 0.321498 0.327755
1 0.514713 0.575802
2 0.866681 0.301447
3 0.068989 0.140084
4 0.069780 0.979451
5 0.629282 0.606209
6 0.032888 0.204491
7 0.248555 0.338516
8 0.270608 0.731319
9 0.732802 0.911920
10 -0.025000 0.008000
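The same steps can be written with pd.concat, which also runs on pandas versions where DataFrame.append is no longer available. This is a sketch under the assumption that each_c has exactly one value per column of df_col; giving the Series the column labels as its index keeps the boolean mask aligned with the columns:

```python
import numpy as np
import pandas as pd

each_c = [-0.025, 0.008, -0.308, -0.308]
df_col = pd.DataFrame(np.random.random((10, 4)), columns=list('abcd'))

# label the values with the column names so they align with df_col
s = pd.Series(each_c, index=df_col.columns)

# add the values as a new last row
df = pd.concat([df_col, s.to_frame().T], ignore_index=True)

# keep only the columns whose value in s is not the minimum
df = df.loc[:, s.ne(s.min())]
print(df)
```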

Pandas-iterate through a dataframe column and concatenate corresponding row values that contain a list

I have a DataFrame with column 1 containing string values and column 2 containing lists of string values.
I want to iterate through column1 and concatenate column1 values with their corresponding row values into a new data-frame.
Say, my input is
`dfd = {'TRAINSET':['101','102','103', '104'], 'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}`
after the operation my data will look like this
dfd2 = {'TRAINSET':['101a1','101x1','101b2', '102a1','102b3','102b2','103d3', '103g5','103x2','104x1','104b2', '104a1']}
What I tried is:
dg = pd.concat([g['TRAINSET'].map(g['unique']).apply(pd.Series)], axis = 1)
but I get KeyError: 'TRAINSET', as this is probably not the proper syntax.
Also, I would like to remove the NaN values in the list.
One option is a list comprehension that flattens the lists, joins the values with +, and passes the result to the DataFrame constructor:
#if necessary
#df = df.reset_index()
#flatten values with filter out missing values
L = [(str(a) + x) for a, b in df[['TRAINSET','unique']].values for x in b if pd.notna(x)]
df1 = pd.DataFrame({'TRAINSET': L})
print (df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
Or use DataFrame.explode (pandas 0.25+), remove missing values with DataFrame.dropna, create a default index, then join the columns with + and use Series.to_frame to get a one-column DataFrame:
df = df.explode('unique').dropna(subset=['unique']).reset_index(drop=True)
df1 = (df['TRAINSET'].astype(str) + df['unique']).to_frame('TRAINSET')
print (df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
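The snippets above assume df already exists. A self-contained sketch of the explode approach, using a small made-up input that includes a missing value to show the dropna step (the sample data is an assumption, not the asker's real data):

```python
import pandas as pd

df = pd.DataFrame({
    'TRAINSET': ['101', '102'],
    'unique': [['a1', float('nan'), 'b2'], ['a1', 'b3']],
})

# one row per list item, then drop the rows whose item is missing
flat = df.explode('unique').dropna(subset=['unique']).reset_index(drop=True)

# concatenate the key with each remaining item
df1 = (flat['TRAINSET'].astype(str) + flat['unique']).to_frame('TRAINSET')
print(df1)
```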
Starting from your original data, you can do the following using explode (new in pandas 0.25) and agg:
Input:
dfd = {'TRAINSET':['101','102','103', '104'],
'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}
Solution:
df = pd.DataFrame(dfd)
df.explode('unique').astype(str).agg(''.join,1).to_frame('TRAINSET').to_dict('list')
{'TRAINSET': ['101a1',
'101x1',
'101b2',
'102a1',
'102b3',
'102b2',
'103d3',
'103g5',
'103x2',
'104x1',
'104b2',
'104a1']}
Another solution, just to give you some choice...
import pandas as pd
_dfd = {'TRAINSET':['101','102','103', '104'], 'unique':[['a1','x1','b2'],['a1','b3','b2'] ,['d3','g5','x2'],['x1','b2','a1']]}
dfd = pd.DataFrame.from_dict(_dfd)
dfd.set_index("TRAINSET", inplace=True)
print(dfd)
dfd2 = dfd.reset_index()
def refactor(row):
    # prefix every item of the row's list with the row's TRAINSET key
    key, items = str(row["TRAINSET"]), row["unique"]
    return [key + i for i in items]
dfd2['TRAINSET'] = dfd2.apply(refactor, axis=1)
# drop the source column, then turn each list item into its own row
dfd2 = dfd2.drop("unique", axis=1).explode("TRAINSET").reset_index(drop=True)
print(dfd2)

How to convert a Pandas series into a Dataframe for merging [duplicate]

If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
    df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge on DataFrame and Series as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
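Note that these index-based merges match rows of df against the index labels of the Series, so they fit the case where both share the same index. A small sketch with made-up data (the labels 'x', 'y', 'z' and the name 'new' are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'], name='new')

# a named Series can be merged on the index directly (pandas 0.24+)
merged = df.merge(s, left_index=True, right_index=True)

# DataFrame.join performs the same index-based alignment
joined = df.join(s)
print(merged)
```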
You could construct a dataframe from the series and then merge with the dataframe.
So you specify the data as the values but multiply them by the length, set the columns to the index and set params for left_index and right_index to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT for the situation where you want the index of your constructed df from the series to use the index of the df then you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the indices match the length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
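The repeat-and-reshape step can be hard to read. As an alternative sketch (not the answer's original code), np.tile builds the same block of repeated values directly, using the df and s from the edited question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, 6]}, index=[3, 5, 6])
s = pd.Series({'s1': 5, 's2': 6})

# repeat the values of s once per row of df: shape (len(df), len(s))
block = pd.DataFrame(np.tile(s.values, (len(df), 1)),
                     columns=s.index, index=df.index)

result = df.join(block)
print(result)
```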
Nowadays, a much simpler and more concise solution can achieve the same task. Leveraging the capability of DataFrame.apply() to expand a returned Series into columns of the resulting DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we used DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to expand a returned Series into columns is well documented in the official documentation as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official documentation also includes an example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
If I could suggest setting up your dataframes like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using df.shape[0] to get the number of rows in df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display (df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int such as in your example. If the column you specify isn't in the df, then pandas will create a new column with the name you specify. So after your dataframe is constructed, (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to make it do this for all the elements in a list of tuples, or keys and values in a dictionary depending on how you have your real data stored.
If df is a pandas.DataFrame, then df['new_col'] = values, where values is a list or Series of length len(df), will add it as a column named 'new_col'. df['new_col'] = scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col'] = [scalar]*len(df).
So a short loop serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
    df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
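The scalar broadcasting described above can also be written without an explicit loop. A sketch, assuming the Series index holds strings that are valid keyword names, passes the Series as keyword arguments to DataFrame.assign, which returns a new DataFrame and leaves df unchanged:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3], 'b': [4, 5, 6]}, index=[3, 5, 6])
s = pd.Series({'s1': 5, 's2': 6})

# each key becomes a column name; each scalar is broadcast to every row
out = df.assign(**s.to_dict())
print(out)
```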

Pandas, selecting by column and row

I want to sum up all values that I select based on some function of column and row.
Another way of putting it is that I want to use a function of the row index and column index to determine if a value should be included in a sum along an axis.
Is there an easy way of doing this?
Columns can be selected using the syntax dataframe[<list of columns>]. The index (row labels) can be used for filtering via the dataframe.index attribute.
import pandas as pd
df = pd.DataFrame({'a': [0.1, 0.2], 'b': [0.2, 0.1]})
odd_a = df['a'][df.index % 2 == 1]
even_b = df['b'][df.index % 2 == 0]
# odd_a:
# 1 0.2
# Name: a, dtype: float64
# even_b:
# 0 0.2
# Name: b, dtype: float64
If df is your dataframe :
In [477]: df
Out[477]:
A s2 B
0 1 5 5
1 2 3 5
2 4 5 5
You can access the odd rows like this :
In [478]: df.loc[1::2]
Out[478]:
A s2 B
1 2 3 5
and the even ones like this:
In [479]: df.loc[::2]
Out[479]:
A s2 B
0 1 5 5
2 4 5 5
To answer your question, getting even rows and column B would be :
In [480]: df.loc[::2,'B']
Out[480]:
0 5
2 5
Name: B, dtype: int64
and odd rows and column A can be done as:
In [481]: df.loc[1::2,'A']
Out[481]:
1 2
Name: A, dtype: int64
I think this should be fairly general if not the cleanest implementation. This should allow applying separate functions for rows and columns depending on conditions (that I defined here in dictionaries).
import numpy as np
import pandas as pd
ran = np.random.randint(0,10,size=(5,5))
df = pd.DataFrame(ran,columns = ["a","b","c","d","e"])
# A dictionary to define what function is passed
d_col = {"high":["a","c","e"], "low":["b","d"]}
d_row = {"high":[1,2,3], "low":[0,4]}
# Generate list of Pandas boolean Series
i_col = [df[i].apply(lambda x: x>5) if i in d_col["high"] else df[i].apply(lambda x: x<5) for i in df.columns]
# Pass the series as a matrix
df = df[pd.concat(i_col,axis=1)]
# Now do this again for rows
i_row = [df.T[i].apply(lambda x: x>5) if i in d_row["high"] else df.T[i].apply(lambda x: x<5) for i in df.T.columns]
# Return back the DataFrame in original shape
df = df.T[pd.concat(i_row,axis=1)].T
# Perform the final operation such as sum on the returned DataFrame
print(df.sum().sum())
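If the condition really is a function of the row and column positions, another option is to build a boolean mask with NumPy broadcasting and sum only the selected cells. This is a sketch; the predicate used here (row position plus column position is even) is purely hypothetical and stands in for whatever rule you need:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4))

# column vector of row positions and row vector of column positions
rows = np.arange(df.shape[0])[:, None]
cols = np.arange(df.shape[1])[None, :]

# hypothetical predicate: keep cells where (row + column) is even
mask = (rows + cols) % 2 == 0

# where() replaces unselected cells with NaN, which sum() skips
col_sums = df.where(mask).sum(axis=0)
total = col_sums.sum()
print(col_sums)
print(total)
```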

How to edit/add two columns to a dataframe in pandas at once - df.apply()

So I've been doing things like this with pandas:
usrdata['columnA'] = usrdata.apply(functionA, axis=1)
in order to do row operations and changing/adding columns to my dataframe.
However, now I want to try to do something like this:
usrdata['columnB', 'columnC'] = usrdata.apply(functionB, axis=1)
But the output of functionB is apparently a single-column Series holding a tuple (with two values) for each row. Is there a nice way for me to either:
1) format the output from functionB so it can readily be added to my dataframe, or
2) add (and possibly have to unpack) the output from functionB and assign each value to its own column of my dataframe?
Try using zip:
usrdata['columnB'], usrdata['columnC'] = zip(*usrdata.apply(functionB, axis=1))
I'd assign directly to the new columns of your df and modify the func body to return a Series constructed with a list of the data:
In [9]:
df = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df
Out[9]:
a
0 1
1 2
2 3
3 4
4 5
In [10]:
def func(x):
    return pd.Series([x*3, x*10])

df[['b','c']] = df['a'].apply(func)
df
Out[10]:
a b c
0 1 3 10
1 2 6 20
2 3 9 30
3 4 12 40
4 5 15 50
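If the function has to keep returning a tuple, as functionB does in the question, DataFrame.apply can expand it into columns with result_type='expand'. A sketch in which function_b and the column 'x' are stand-ins for the asker's real function and data:

```python
import pandas as pd

usrdata = pd.DataFrame({'x': [1, 2, 3]})

def function_b(row):
    # hypothetical row function returning two values as a tuple
    return row['x'] * 3, row['x'] * 10

# expand each returned tuple into separate columns (labelled 0 and 1)
expanded = usrdata.apply(function_b, axis=1, result_type='expand')
expanded.columns = ['columnB', 'columnC']

# attach the new columns; the row index is shared with usrdata
usrdata = usrdata.join(expanded)
print(usrdata)
```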
