Pandas - fill new column with values from following day - python

In the following dataframe
# Create data
import pandas as pd
data = {'Day': [1,1,2,2,3,3],
'Where': ['A','B','A','B','B','B'],
'What': ['x','y','x','x','x','y'],
'Dollars': [100,200,100,100,100,200]}
index = range(len(data['Day']))
columns = ['Day','Where','What','Dollars']
df = pd.DataFrame(data, index=index, columns=columns)
df
I would like to add a column with the next day's values. In this case, the first value should be 100, because on day 2 x was sold at A for 100 dollars. The complete column should contain the values 100, None, None, 100, None, None.
I thought that I could index the cells in the following way
df2 = df
df2['Tomorrow_Dollars'] = df[df.Day == df2.Day+1,'Dollars']
but this throws the following error
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Is there a solution to this or a smarter approach?

The idea is to create the missing combinations by reindexing with MultiIndex.from_product, then reshape with unstack so that each Day is a single row, which makes shift possible. Finally, reshape back with unstack and join to add the new column:
df1 = df.set_index(['Day','Where','What'])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
s = df1.reindex(mux)['Dollars'].unstack([1,2]).shift(-1).unstack().rename('Tomorrow_Dollars')
df = df.join(s, on=['Where','What','Day'])
print (df)
Day Where What Dollars Tomorrow_Dollars
0 1 A x 100 100.0
1 1 B y 200 NaN
2 2 A x 100 NaN
3 2 B x 100 100.0
4 3 B x 100 NaN
5 3 B y 200 NaN
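As a cross-check on the result above, the same column can be produced with a self-merge instead of reshaping. This is only a minimal sketch, not the answerer's method: it assumes that each (Day, Where, What) combination appears at most once and that "tomorrow" always means Day + 1. The names tomorrow and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Day': [1, 1, 2, 2, 3, 3],
    'Where': ['A', 'B', 'A', 'B', 'B', 'B'],
    'What': ['x', 'y', 'x', 'x', 'x', 'y'],
    'Dollars': [100, 200, 100, 100, 100, 200],
})

# Relabel every row with the previous day, so a row for day 2
# becomes "the value tomorrow" as seen from day 1.
tomorrow = (
    df.assign(Day=df['Day'] - 1)
      .rename(columns={'Dollars': 'Tomorrow_Dollars'})
)

# A left merge keeps every original row in its original order;
# rows without a match on the following day get NaN.
result = df.merge(tomorrow, on=['Day', 'Where', 'What'], how='left')
print(result)
```

If a combination could appear more than once per day, the merge would duplicate rows, so the uniqueness assumption matters here.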

Related

create dataframe with outliers and then replace with nan

I am trying to make a function to spot the columns with "100" in the header and replace the values in these columns with NaN depending on multiple criteria.
I also want in the function the value of the column "first_column" corresponding to the outlier.
For instance let's say I have a df where I want to replace all numbers that are above 100 or below 0 with NaN values :
I start with this dataframe:
import pandas as pd
data = {'first_column': ['product_name', 'product_name2', 'product_name3'],
'second_column': ['first_value', 'second_value', 'third_value'],
'third_100':['89', '9', '589'],
'fourth_100':['25', '1568200', '5'],
}
df = pd.DataFrame(data)
print (df)
expected output:
IIUC, you can use filter and boolean indexing:
# get "100" columns and convert to integer
df2 = df.filter(like='100').astype(int)
# identify values <0 or >100
mask = (df2.lt(0)|df2.gt(100))
# mask them
out1 = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
# get rows with at least one match
out2 = df.loc[mask.any(axis=1), ['first_column']+list(df.filter(like='100'))]
output 1:
first_column second_column third_100 fourth_100
0 product_name first_value 89 25
1 product_name2 second_value 9 NaN
2 product_name3 third_value NaN 5
output 2:
first_column third_100 fourth_100
1 product_name2 9 1568200
2 product_name3 589 5
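Since the question asks for a function, the steps above could be wrapped roughly as follows. This is a sketch under assumptions: the function name mask_outliers and its parameters (marker, low, high, id_col) are hypothetical, and pd.to_numeric with errors="coerce" is swapped in for astype(int) so that non-numeric strings become NaN instead of raising.

```python
import pandas as pd

def mask_outliers(df, marker="100", low=0, high=100, id_col="first_column"):
    """Hypothetical helper: blank out-of-range values in columns whose name contains `marker`."""
    # Select the target columns and convert their strings to numbers.
    numeric = df.filter(like=marker).apply(pd.to_numeric, errors="coerce")
    # True where a value is below `low` or above `high`.
    mask = numeric.lt(low) | numeric.gt(high)
    # Expand the mask to all columns (False elsewhere) and replace flagged cells with NaN.
    cleaned = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
    # Rows with at least one outlier, together with their identifier.
    offenders = df.loc[mask.any(axis=1), [id_col] + list(numeric.columns)]
    return cleaned, offenders

data = {
    'first_column': ['product_name', 'product_name2', 'product_name3'],
    'second_column': ['first_value', 'second_value', 'third_value'],
    'third_100': ['89', '9', '589'],
    'fourth_100': ['25', '1568200', '5'],
}
df = pd.DataFrame(data)

cleaned, offenders = mask_outliers(df)
print(cleaned)
print(offenders)
```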

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless indexes but a carefully curated order, and I want to merge them while preserving that order. So, for example:
>>> df1
First
a 1
b 3
and
>>> df2
Second
c 2
d 4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant; what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat( [df1, df2] , axis = 1, ignore_index= True )
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up but thanks to the suggestion from jsmart and topsail, you can dereference the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work I think:
df1["second"] = df2["Second"].values
It would keep the index from the first dataframe, but since you have values in there such as "AnythingAtAll" and "SeriouslyIDontCare", I guess any index values whatsoever are acceptable.
Basically, we are just adding the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
# -----------
# sample data
# -----------
df1 = pd.DataFrame(
{
'x': ['a','b'],
'First': [1,3],
})
df1.set_index("x", drop=True, inplace=True)
df2 = pd.DataFrame(
{
'x': ['c','d'],
'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
First Second
a 1 2
b 3 4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
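ignore_index controls whether the labels along the concatenation axis are kept. If it is True, the original labels on that axis are discarded and replaced with 0 to n-1, which is why the column headers in your result are 0 and 1. With axis=1 the rows are still aligned on the index, so the mismatched row labels still produce NaN.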
You can try
out = pd.concat( [df1.reset_index(drop=True), df2.reset_index(drop=True)] , axis = 1)
print(out)
First Second
0 1 2
1 3 4
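The difference between the two calls can be checked directly. A small sketch using the same sample frames as above; the names misaligned and aligned are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame({'Second': [2, 4]}, index=['c', 'd'])

# With axis=1, ignore_index=True only relabels the columns as 0 and 1.
# Rows are still matched on their labels, and a/b never match c/d.
misaligned = pd.concat([df1, df2], axis=1, ignore_index=True)
print(misaligned.shape)          # four rows, two columns
print(list(misaligned.columns))  # [0, 1]

# Dropping the row labels first gives both frames the index 0..n-1,
# so rows pair up purely by position and the column names survive.
aligned = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)],
    axis=1,
)
print(aligned)
```

reset_index(drop=True) does not reorder rows; it only replaces the labels, so the curated order is preserved.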

Pandas groupby diff removes column

I have a dataframe like this:
d = {'id': ['101_i','101_e','102_i','102_e'], 1: [3, 4, 5, 7], 2: [5,9,10,11], 3: [8,4,3,7]}
df = pd.DataFrame(data=d)
I want to subtract all rows which have the same prefix id, i.e. subtract all values of rows 101_i with 101_e or vice versa. The code I use for that is:
df['new_identifier'] = [x.upper().replace('E', '').replace('I','').replace('_','') for x in df['id']]
df = df.groupby('new_identifier')[df.columns[1:-1]].diff().dropna()
I get the output like this:
I see that I lose the new column that I create, new_identifier. Is there a way I can retain that?
You can pass a specific aggregation function per column (in this case np.diff for columns 1, 2, and 3) when you know the column types (int or float in this case). The grouping key new_identifier is then kept as the index of the result.
import numpy as np
df.groupby('new_identifier').agg({i: np.diff for i in range(1, 4)}).dropna()
Result:
1 2 3
new_identifier
101 1 4 -4
102 2 1 4
Use Series.str.split to get the groups, set them as the row index with DataFrame.set_axis() before the GroupBy, and then apply GroupBy.diff:
cols = df.columns.difference(['id'])
groups = df['id'].str.split('_').str[0]
new_df = (
df.set_axis(groups, axis=0)
.groupby(level=0)
[cols]
.diff()
.dropna()
)
print(new_df)
1 2 3
id
101 1.0 4.0 -4.0
102 2.0 1.0 4.0
Detail Groups
df['id'].str.split('_').str[0]
0 101
1 101
2 102
3 102
Name: id, dtype: object
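If the identifier should stay as a regular column rather than as the index, one option is to join it back after the diff. This is a sketch, not taken from either answer: it relies on GroupBy.diff keeping the original row index, and the names value_cols, diffs and result are illustrative.

```python
import pandas as pd

d = {'id': ['101_i', '101_e', '102_i', '102_e'],
     1: [3, 4, 5, 7], 2: [5, 9, 10, 11], 3: [8, 4, 3, 7]}
df = pd.DataFrame(data=d)

value_cols = [1, 2, 3]
df['new_identifier'] = df['id'].str.split('_').str[0]

# diff() returns only the differenced columns, aligned on the original
# row index; the first row of each group is NaN and is dropped here.
diffs = df.groupby('new_identifier')[value_cols].diff().dropna()

# Because the row index is preserved, the identifier can be joined back on it.
result = diffs.join(df['new_identifier'])
print(result)
```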

Concat two Pandas DataFrame column with different length of index

How do I merge a column from one Pandas dataframe into another dataframe when the new column has fewer rows? Specifically, I need the new column to be filled with NaN in the first few rows of the merged DataFrame instead of the last few rows. Please refer to the picture. Thanks.
Use:
df1 = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
})
df2 = pd.DataFrame({
'SMA':list('rty')
})
df3 = df1.join(df2.set_index(df1.index[-len(df2):]))
Or:
df3 = pd.concat([df1, df2.set_index(df1.index[-len(df2):])], axis=1)
print (df3)
A B SMA
0 a 4 NaN
1 b 5 NaN
2 c 4 NaN
3 d 5 r
4 e 5 t
5 f 4 y
How it works:
First, the last len(df2) labels of the df1 index are selected:
print (df1.index[-len(df2):])
RangeIndex(start=3, stop=6, step=1)
Then the existing index of df2 is replaced with these labels by DataFrame.set_index:
print (df2.set_index(df1.index[-len(df2):]))
SMA
3 r
4 t
5 y
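The two steps above could be wrapped in a small helper. This is a sketch only: the name join_bottom_aligned is hypothetical, set_axis is used in place of set_index to replace the row labels, and the slice is written as len(left) - len(right) because index[-0:] would select the whole index when the right frame is empty.

```python
import pandas as pd

def join_bottom_aligned(left, right):
    """Hypothetical helper: attach `right` to the last len(right) rows of `left`."""
    if len(right) > len(left):
        raise ValueError("right has more rows than left")
    # Labels of the last len(right) rows of left.
    tail = left.index[len(left) - len(right):]
    # Relabel right with those labels, then join on the index.
    return left.join(right.set_axis(tail, axis=0))

df1 = pd.DataFrame({'A': list('abcdef'), 'B': [4, 5, 4, 5, 5, 4]})
df2 = pd.DataFrame({'SMA': list('rty')})

df3 = join_bottom_aligned(df1, df2)
print(df3)
```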

Pandas flatten hierarchical index on non overlapping columns

I have a dataframe, and I set the index to a column of the dataframe. This creates a hierarchical column index. I want to flatten the columns to a single level. Similar to this question - Python Pandas - How to flatten a hierarchical index in columns, however, the columns do not overlap (i.e. 'id' is not at level 0 of the hierarchical index, and other columns are at level 1 of the index).
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
A B
id
101 3 x
102 5 y
Desired output is flattened columns, like this:
id A B
101 3 x
102 5 y
You are misinterpreting what you are seeing.
A B
id
101 3 x
102 5 y
Is not showing you a hierarchical column index. id is the name of the row index. In order to show you the name of the index, pandas is putting that space there for you.
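This can be verified on the frame itself. A short sketch using the dataframe from the question; flat is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([(101, 3, 'x'), (102, 5, 'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)

# The columns are a flat Index with a single level, not a MultiIndex.
print(df.columns.nlevels)
# 'id' is the name of the row index, which pandas prints on its own line.
print(df.index.name)

# reset_index() turns the index back into an ordinary column.
flat = df.reset_index()
print(list(flat.columns))
```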
The answer to your question depends on what you really want or need.
As the df is, you can dump it to a csv just the way you want:
print(df.to_csv(sep='\t'))
id A B
101 3 x
102 5 y
print(df.to_csv())
id,A,B
101,3,x
102,5,y
Or you can alter the df so that it displays the way you'd like
print(df.rename_axis(None))
A B
101 3 x
102 5 y
Please do not do this! I'm including it only to demonstrate how the axis names can be manipulated.
I could also keep the index as it is but manipulate both the column and row index names so that it prints how you would like.
print(df.rename_axis(None).rename_axis('id', axis=1))
id A B
101 3 x
102 5 y
But this has named the columns' index id which makes no sense.
There will always be an index in your dataframes. If you don't set 'id' as the index, it will be at the same level as the other columns, and pandas will populate an increasing integer index starting from 0.
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
In[52]: df
Out[52]:
id A B
0 101 3 x
1 102 5 y
The index is there so you can slice the original dataframe, such as:
df.iloc[0]
Out[53]:
id 101
A 3
B x
Name: 0, dtype: object
So let's say you want id as both the index and a column, which is very redundant; you could do:
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df['id'] = df.index
df
Out[55]:
A B id
id
101 3 x 101
102 5 y 102
With this you can slice by 'id', such as:
df.loc[101]
Out[57]:
A 3
B x
id 101
Name: 101, dtype: object
but it would hold the same info as:
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df.loc[101]
Out[58]:
A 3
B x
Name: 101, dtype: object
Given:
>>> df2=pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
>>> df2.set_index('id', inplace=True)
>>> df2
A B
id
101 3 x
102 5 y
For pretty printing, you can produce a copy of the DataFrame with the index reset and use .to_string:
>>> print(df2.reset_index().to_string(index=False))
id A B
101 3 x
102 5 y
Then play around with the formatting options so that the output suits your needs:
>>> fmts = [lambda s: "{:^5}".format(str(s).strip())] * 3
>>> print(df2.reset_index().to_string(index=False, formatters=fmts))
id A B
101 3 x
102 5 y
