So I currently have a dataframe that looks like:

    Actual  Forecasted
0  18.4420     19.6377
1  15.4233     13.1665
2  20.6217     19.3992
3  16.7000     17.4557
4  18.1850     14.0053
And I want to add a completely new column called "Predictors" with only one cell that contains an array.
So [0, 'Predictors'] should contain an array and everything below that cell in the same column should be empty.
Here's my attempt: I tried to create a separate dataframe that just contained the "Predictors" column and tried appending it to the current dataframe, but I get: 'Length mismatch: Expected axis has 3 elements, new values have 4 elements.'
How do I append a single cell containing an array to my dataframe?
# create a list and dataframe to hold the names of predictors
dataframe=dataframe.drop(['price','Date'],axis=1)
predictorsList = dataframe.columns.tolist()
predictorsList = np.array(predictorsList, dtype=object)
# Combine actual and forecasted lists to one dataframe
combinedResults = pd.DataFrame({'Actual': actual, 'Forecasted': forecasted})
predictorsDF = pd.DataFrame({'Predictors': [predictorsList]})
# Add Predictors to dataframe
#combinedResults.at[0, 'Predictors'] = predictorsList
pd.concat([combinedResults,predictorsDF], ignore_index=True, axis=1)
You could fill the rest of the cells in the desired column with NaN, though they will not be "empty". To do that, use pd.merge on both indexes:
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Actual': [18.442, 15.4233, 20.6217, 16.7, 18.185],
    'Forecasted': [19.6377, 13.1665, 19.3992, 17.4557, 14.0053]
})
arr = np.zeros(3)
df_arr = pd.DataFrame({'Predictors': [arr]})
Merging df and df_arr
result = pd.merge(
    df,
    df_arr,
    how='left',
    left_index=True,   # Merge on both indexes, since right only has 0...
    right_index=True   # all the other rows will be NaN
)
Results
>>> print(result)
Actual Forecasted Predictors
0 18.4420 19.6377 [0.0, 0.0, 0.0]
1 15.4233 13.1665 NaN
2 20.6217 19.3992 NaN
3 16.7000 17.4557 NaN
4 18.1850 14.0053 NaN
>>> result.loc[0, 'Predictors']
array([0., 0., 0.])
>>> result.loc[1, 'Predictors'] # actually contains a NaN value
nan
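Applied to the names from your snippet, the same pattern would be (a sketch, assuming combinedResults and predictorsList are defined as in the question):
predictorsDF = pd.DataFrame({'Predictors': [predictorsList]})
combinedResults = pd.merge(
    combinedResults,
    predictorsDF,
    how='left',
    left_index=True,
    right_index=True
)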
You need to change the dtype of the column (in your case Predictors) to object first
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(20).reshape(5,4), columns=list('abcd'))
df = df.astype(object) # this line allows the assignment of the array
df.iloc[1,2] = np.array([99,99,99])
print(df)
gives
a b c d
0 0 1 2 3
1 4 5 [99, 99, 99] 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
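Applied to your case, this is what makes the commented-out .at assignment from the question work (a sketch, assuming combinedResults and predictorsList exist as in your snippet):
# create the column, then widen its dtype so one cell can hold an array
combinedResults['Predictors'] = np.nan
combinedResults['Predictors'] = combinedResults['Predictors'].astype(object)
combinedResults.at[0, 'Predictors'] = predictorsList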
Related
My goal is to add the column titles from the small df to an existing large dataframe without manually typing each name in.
This is the small dataframe.
veddra_term_code veddra_version veddra_term_name number_of_animals_affected accuracy
335 11 Emesis NaN NaN
142 11 Anaemia NOS NaN NaN
The large dataframe is similar to the above but has forty columns.
This is the code I used to extract the small dataframe from the dict.
df = pd.DataFrame(reaction for result in d['results'] for reaction in result['reaction']) #get reaction data
df
You can pass dataframe.reindex a list of columns, consisting of the existing columns plus new ones. If a column does not yet exist in the dataframe, it will be filled with NaN.
Assume that df is your big dataframe you want to extend with columns. You can then create a new list of column names (columns_to_add) from your small dataframe and combine them. Then you call reindex on the big dataframe.
import pandas as pd
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
existing_columns = df.columns.tolist()
columns_to_add = ["C", "D"] # or use small_df.columns.tolist()
new_columns = existing_columns + columns_to_add
df = df.reindex(columns = new_columns)
This will produce:
A B C D
0 1 2 NaN NaN
1 2 3 NaN NaN
2 3 4 NaN NaN
If you do not like NaN you can use a different value by passing the keyword fill_value
(e.g. df.reindex(columns=new_columns, fill_value=0)).
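For example, starting again from the original two-column df (note that fill_value only applies to columns that do not exist yet, so use it in place of, not after, the reindex above):
df.reindex(columns=new_columns, fill_value=0)
This will produce:
   A  B  C  D
0  1  2  0  0
1  2  3  0  0
2  3  4  0  0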
df.columns will give you an Index of the column names
import numpy as np

# loop over the small dataframe's headers
for i in small_df.columns:
    # if the large df doesn't have the header, create the header
    if i not in large_df.columns:
        # creates a new header with no data
        large_df.loc[:, i] = np.nan
Is there a way to sort each row of a pandas data frame?
I don't care about columns names or row indexes, I just want a table with the values of each row sorted from highest to lowest.
You can use np.sort with axis=1 on the numpy data:
# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,10, (2,4)))
# output
pd.DataFrame(np.sort(df.values, axis=1)[:,::-1],
             index=df.index,
             columns=df.columns)
Output:
0 1 2 3
0 9 6 6 1
1 8 7 2 1
If you want to override your original dataframe:
df[:] = np.sort(df.values, axis=1)[:,::-1]
Update
np.sort(df)[:,::-1] works as well: df is converted to a numpy array, and axis=-1 (the last axis) is the default.
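So the in-place version above can be shortened to:
df[:] = np.sort(df)[:, ::-1]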
I have a dataframe and I want to replace the values in columns V2 and V3 if they have the same value as V1.
import pandas as pd
import numpy as np
df_start= pd.DataFrame({"ID":[1, 2 , 3 ,4, 5], "V1":[10,5,15,20,20], "V2":[10,5,20,17,15], "V3":[10, 25, 15, 10, 20]})
df_end = pd.DataFrame({"ID":[1, 2 , 3 ,4, 5], "V1":[10,5,15,20,20], "V2":[np.nan,np.nan,20,17,15], "V3":[np.nan, 25, np.nan, 10, np.nan]})
I know iterrows is not recommended but I don't know what I should do.
You can use mask:
For a separate dataframe use assign:
df_end = df_start.assign(**df_start[['V2','V3']]
                         .mask(df_start[['V2','V3']].eq(df_start['V1'], axis=0)))
For modifying the input dataframe just assign inplace:
df_start[['V2','V3']] = (df_start[['V2','V3']]
                         .mask(df_start[['V2','V3']].eq(df_start['V1'], axis=0)))
ID V1 V2 V3
0 1 10 NaN NaN
1 2 5 NaN 25.0
2 3 15 20.0 NaN
3 4 20 17.0 10.0
4 5 20 15.0 NaN
You'll still use a regular loop to go through the columns, but the apply function is your best friend for this kind of row-wise operation. If you're going to use info from more than one column (here you're comparing some column and "V1"), you use apply on the DataFrame and specify the axis. If you're only looking at info from one column (like making a column that doubles values from V1 if they're even), you can use apply on just a Series.
For both versions of the function, the argument you pass is a lambda expression. If you apply it to a DataFrame like you are here, the x represents a row whose values can be indexed by column name. Finally, you assign the result back to a new or existing column in your DataFrame.
Assuming that df_start and df_end represent your planned input and output:
cols = ["V2","V3"]
for col in cols:
    df_start[col] = df_start.apply(lambda x: x[col] if x[col] != x["V1"] else np.nan, axis=1)
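The single-column case mentioned above would use apply on just the Series; a small sketch (the V1_doubled column name is made up for illustration):
# double V1 where it is even, using apply on the Series alone
df_start["V1_doubled"] = df_start["V1"].apply(lambda v: v * 2 if v % 2 == 0 else v)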
I want to sum up all values that I select based on some function of column and row.
Another way of putting it is that I want to use a function of the row index and column index to determine if a value should be included in a sum along an axis.
Is there an easy way of doing this?
Columns can be selected using the syntax dataframe[<list of columns>]. The index (row) can be used for filtering via the dataframe.index attribute.
import pandas as pd
df = pd.DataFrame({'a': [0.1, 0.2], 'b': [0.2, 0.1]})
odd_a = df['a'][df.index % 2 == 1]
even_b = df['b'][df.index % 2 == 0]
# odd_a:
# 1 0.2
# Name: a, dtype: float64
# even_b:
# 0 0.2
# Name: b, dtype: float64
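To get the sum the question asks for, combine the two selections:
total = odd_a.sum() + even_b.sum() # 0.2 + 0.2 = 0.4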
If df is your dataframe:
In [477]: df
Out[477]:
A s2 B
0 1 5 5
1 2 3 5
2 4 5 5
You can access the odd rows like this :
In [478]: df.loc[1::2]
Out[478]:
A s2 B
1 2 3 5
and the even ones like this:
In [479]: df.loc[::2]
Out[479]:
A s2 B
0 1 5 5
2 4 5 5
To answer your question, getting even rows and column B would be :
In [480]: df.loc[::2,'B']
Out[480]:
0 5
2 5
Name: B, dtype: int64
and odd rows and column A can be done as:
In [481]: df.loc[1::2,'A']
Out[481]:
1 2
Name: A, dtype: int64
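And summing those selections gives the index-dependent sum the question describes (continuing the session):
In [482]: df.loc[::2,'B'].sum() + df.loc[1::2,'A'].sum()
Out[482]: 12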
I think this should be fairly general if not the cleanest implementation. This should allow applying separate functions for rows and columns depending on conditions (that I defined here in dictionaries).
import numpy as np
import pandas as pd
ran = np.random.randint(0,10,size=(5,5))
df = pd.DataFrame(ran,columns = ["a","b","c","d","e"])
# A dictionary to define what function is passed
d_col = {"high":["a","c","e"], "low":["b","d"]}
d_row = {"high":[1,2,3], "low":[0,4]}
# Generate list of Pandas boolean Series
i_col = [df[i].apply(lambda x: x>5) if i in d_col["high"] else df[i].apply(lambda x: x<5) for i in df.columns]
# Combine the boolean Series into a single mask; cells that fail the condition become NaN
df = df[pd.concat(i_col,axis=1)]
# Now do this again for rows
i_row = [df.T[i].apply(lambda x: x>5) if i in d_row["high"] else df.T[i].apply(lambda x: x<5) for i in df.T.columns]
# Return back the DataFrame in original shape
df = df.T[pd.concat(i_row,axis=1)].T
# Perform the final operation such as sum on the returned DataFrame
print(df.sum().sum())
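As an aside, the i_col line above can be built without apply, using vectorized comparisons that produce the same boolean Series:
i_col = [df[i] > 5 if i in d_col["high"] else df[i] < 5 for i in df.columns]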
I am attempting to add a Series to an empty DataFrame and cannot find an answer, either in the docs or in other questions. Since you can append two DataFrames by row or by column, it would seem there must be an "axis marker" missing from a Series. Can anyone explain why this does not work?
import pandas as pd
df1 = pd.DataFrame()
s1 = pd.Series(['a',5,6])
df1 = pd.concat([df1,s1],axis = 1)
# go run some process, return s2, s3, sn ...
s2 = pd.Series(['b',8,9])
df1 = pd.concat([df1,s2],axis = 1)
s3 = pd.Series(['c',10,11])
df1 = pd.concat([df1,s3],axis = 1)
If my example above is somehow misleading, perhaps using the example from the docs will help.
Quoting: Appending rows to a DataFrame.
While not especially efficient (since a new object must be created), you can append a
single row to a DataFrame by passing a Series or dict to append, which returns a new DataFrame as above. End Quote.
The example from the docs appends "s", which is a row from a DataFrame, while "s1" is a Series, and attempting to append "s1" produces an error. My question is: WHY will appending "s1" not work? The assumption behind the question is that a DataFrame must contain axis information for two axes, whereas a Series contains information for only one axis.
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.xs(3) # row with index label 3 of the DataFrame
s1 = pd.Series([np.random.randn(4)]); #new Series of equal len
df= df.append(s, ignore_index=True)
Result
0 1
0 a b
1 5 8
2 6 9
Desired
0 1 2
0 a 5 6
1 b 8 9
You were close; just transpose the result from concat:
In [14]: s1
Out[14]:
0 a
1 5
2 6
dtype: object
In [15]: s2
Out[15]:
0 b
1 8
2 9
dtype: object
In [16]: pd.concat([s1, s2], axis=1).T
Out[16]:
0 1 2
0 a 5 6
1 b 8 9
[2 rows x 3 columns]
You also don't need to create the empty DataFrame.
The best way is to use the DataFrame constructor to build the DF from a sequence of Series, rather than using concat:
import pandas as pd
s1 = pd.Series(['a',5,6])
s2 = pd.Series(['b',8,9])
pd.DataFrame([s1, s2])
Output:
In [4]: pd.DataFrame([s1, s2])
Out[4]:
0 1 2
0 a 5 6
1 b 8 9
A method of accomplishing the same objective as appending a Series to a DataFrame is to convert the data to an array of lists and append the array(s) to the DataFrame.
# data as an array of lists
def get_example(idx):
    list1 = (idx + 1, idx + 2, chr(idx + 97))
    data = [list1]
    return data

df1 = pd.DataFrame()
for idx in range(4):
    data = get_example(idx)
    df1 = df1.append(data, ignore_index=True)
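Note that DataFrame.append was removed in pandas 2.0; on current versions the same loop can collect the rows first and concatenate once (a sketch, reusing get_example from above):
frames = [pd.DataFrame(get_example(idx)) for idx in range(4)]
df1 = pd.concat(frames, ignore_index=True)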