How to assign pandas dataframe to slice of other dataframe - python

I have Excel spreadsheets with data, one for each year. Alas, the columns change slightly over the years. What I want is one dataframe with all the data, with the missing columns filled with predefined data. I wrote a small example program to test that.
import numpy as np
import pandas as pd

# Initialize three dataframes
df1 = pd.DataFrame([[1, 2], [11, 22], [111, 222]], columns=['een', 'twee'])
df2 = pd.DataFrame([[3, 4], [33, 44], [333, 444]], columns=['een', 'drie'])
df3 = pd.DataFrame([[5, 6], [55, 66], [555, 666]], columns=['twee', 'vier'])

# Store these in a dictionary and print for verification
d = {'df1': df1, 'df2': df2, 'df3': df3}
for key in d:
    print(d[key])
    print()

# Create a list of all columns; as order is relevant, a set is not used
cols = []
# Count total number of rows
nrows = 0
# Loop through each dataframe to determine the total number of rows and columns
for key in d:
    df = d[key]
    nrows += len(df)
    for col in df.columns:
        if col not in cols:
            cols += [col]

# Create the total dataframe, filled with a default (zeros)
data = pd.DataFrame(np.zeros((nrows, len(cols))), columns=cols)

# Assign each dataframe to its slice
c = 0
for key in d:
    data.loc[c:c+len(d[key])-1, d[key].columns] = d[key]
    c += len(d[key])
print(data)
The dataframes are initialized all right, but there is something weird with the assignment to the slice of the data dataframe. What I wanted (and expected) is:
     een   twee   drie   vier
0    1.0    2.0    0.0    0.0
1   11.0   22.0    0.0    0.0
2  111.0  222.0    0.0    0.0
3    3.0    0.0    4.0    0.0
4   33.0    0.0   44.0    0.0
5  333.0    0.0  444.0    0.0
6    0.0    5.0    0.0    6.0
7    0.0   55.0    0.0   66.0
8    0.0  555.0    0.0  666.0
But this is what I got:
     een   twee   drie  vier
0    1.0    2.0    0.0   0.0
1   11.0   22.0    0.0   0.0
2  111.0  222.0    0.0   0.0
3    NaN    0.0    NaN   0.0
4    NaN    0.0    NaN   0.0
5    NaN    0.0    NaN   0.0
6    0.0    NaN    0.0   NaN
7    0.0    NaN    0.0   NaN
8    0.0    NaN    0.0   NaN
The location AND the data of the first dataframe are correctly assigned. However, the second dataframe is assigned to the correct location, but not its contents: NaN is assigned instead. This also happens for the third dataframe: correct location but missing data. I have tried assigning d[key].loc[0:2, d[key].columns] and some more fanciful expressions to the data slice, but all return NaN. How can I get the contents of the dataframes assigned to data as well?

Per the comments, you can use:
pd.concat([df1, df2, df3])
OR
pd.concat([df1, df2, df3]).fillna(0)
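The NaN values come from index alignment: when assigning through .loc, pandas matches the right-hand DataFrame by its row labels, and df2's labels 0-2 never match the target labels 3-5, so nothing is transferred. Note also that a plain concat keeps each frame's own 0-2 index; to reproduce the expected output exactly, a sketch along these lines should work:

import pandas as pd

# ignore_index=True renumbers the rows 0..8; fillna(0) supplies the
# predefined default for columns that a given year's sheet lacks.
data = pd.concat([df1, df2, df3], ignore_index=True).fillna(0)

# Alternatively, the original loop works if alignment is bypassed by
# assigning raw values instead of a labelled DataFrame:
# data.loc[c:c+len(d[key])-1, d[key].columns] = d[key].to_numpy()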


2 dataframes, same number of columns, different number of rows comparing and replacing values [duplicate]

I think this is an easy question and that I know where to look, using merge, join, loc, iloc or one of those functions, but I have not figured it out yet. Here is a simplistic example of what I want to do. df1 and df2 have the same columns but a different number of rows. I want to find rows where the column "t1" is the same for both dataframes and then replace the values in column "c1" of df1 with the values of column "c1" of df2 (where their t1 values are the same). I also tried the functions where and replace, but I am pretty sure I need merge or join. Thank you.
import pandas as pd
df1 = pd.DataFrame()
df2 = pd.DataFrame()
# close price
df1.at[0,"c1"]=0
df1.at[1,"c1"]=0
df1.at[2,"c1"]=0
df1.at[3,"c1"]=0
df1.at[4,"c1"]=0
df1.at[5,"c1"]=0
df1.at[6,"c1"]=0
df1.at[7,"c1"]=0
df2.at[0,"c1"]=20
df2.at[1,"c1"]=26
df2.at[3,"c1"]=23
df2.at[4,"c1"]=21
# time stamp
df1.at[0,"t1"]=3
df1.at[1,"t1"]=4
df1.at[2,"t1"]=5
df1.at[3,"t1"]=6
df1.at[4,"t1"]=7
df1.at[5,"t1"]=8
df1.at[6,"t1"]=9
df1.at[7,"t1"]=10
df2.at[0,"t1"]=5
df2.at[1,"t1"]=6
df2.at[3,"t1"]=7
df2.at[4,"t1"]=8
They look like:
>>> df1
    c1    t1
0  0.0   3.0
1  0.0   4.0
2  0.0   5.0
3  0.0   6.0
4  0.0   7.0
5  0.0   8.0
6  0.0   9.0
7  0.0  10.0
>>> df2
     c1   t1
0  20.0  5.0
1  26.0  6.0
3  23.0  7.0
4  21.0  8.0
So I want df1 to look like the frame shown below. At the rows where the value for "t1" is the same for both df1 and df2 I want to replace the values in column "c1" in df1 with the values from df2.
>>> df1
     c1    t1
0   0.0   3.0
1   0.0   4.0
2  20.0   5.0
3  26.0   6.0
4  23.0   7.0
5  21.0   8.0
6   0.0   9.0
7   0.0  10.0
You can use pd.merge for this:
df1 = df1.merge(df2, on=['t1'], how='left')
Which results in:
   c1_x    t1  c1_y
0   0.0   3.0   NaN
1   0.0   4.0   NaN
2   0.0   5.0  20.0
3   0.0   6.0  26.0
4   0.0   7.0  23.0
5   0.0   8.0  21.0
6   0.0   9.0   NaN
7   0.0  10.0   NaN
It adds a new column c1_y which holds the merged values from df2. To create the desired output we only need to do the following:
df1['c1'] = df1.c1_y.fillna(df1.c1_x)
df1 = df1[['c1', 't1']]
Output:
     c1    t1
0   0.0   3.0
1   0.0   4.0
2  20.0   5.0
3  26.0   6.0
4  23.0   7.0
5  21.0   8.0
6   0.0   9.0
7   0.0  10.0
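As a shorter alternative (a sketch, not part of the original answers), the same lookup can be done with Series.map against df2 indexed by its timestamps:

# Look up each df1 timestamp in df2; unmatched rows yield NaN,
# which fillna replaces with df1's original value.
df1['c1'] = df1['t1'].map(df2.set_index('t1')['c1']).fillna(df1['c1'])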
Simply use merge:
res = pd.merge(df1, df2, on='t1', how='outer')
df1['c1'] = res['c1_y'].fillna(df1['c1'])
print(df1)
###output:
###     c1    t1
###0   0.0   3.0
###1   0.0   4.0
###2  20.0   5.0
###3  26.0   6.0
###4  23.0   7.0
###5  21.0   8.0
###6   0.0   9.0
###7   0.0  10.0

Read only numerical values from one dataframe and create another dataframe from those values

I have imported an Excel file into a dataframe and it looks like this:
rule_id  reqid1  reqid2  reqid3
50014       1.0     0.0     1.0
50238       0.0     1.0     0.0
50239       0.0     1.0     0.0
50356       0.0     0.0     1.0
50412       0.0     0.0     1.0
51181       0.0     1.0     0.0
53139       0.0     0.0     1.0
Then I wrote this code to compare corresponding reqids with each other and then drop the reqid columns:
m = df1.eq(df1.shift(-1, axis=1))
arr1 = np.select([df1 == 0, m], [np.nan, 1], 100)
dft4 = pd.DataFrame(arr1, index=df1.index).rename(columns=lambda x: 'comp{}'.format(x+1))
dft5 = df1.join(dft4)
cols = [c for c in dft5.columns if 'reqid' in c]
df8 = dft5.drop(cols, axis=1)
The result looked like this (screenshot omitted in the original), and after transposing it the data had one column per rule_id. Now I want to write this data into a separate dataframe where only the numerical values are kept and empty or null cells are removed (also shown only as a screenshot). If anybody could help me, I would greatly appreciate it.
Use the justify function and then remove the all-NaN rows with DataFrame.dropna and the parameter how='all':
df8 = dft5.drop(cols, axis=1).T
df8 = pd.DataFrame(justify(df8.values, invalid_val=np.nan, axis=0, side='up'),
                   columns=df8.columns).dropna(how='all')
print(df8)
rule_id  50014  50238  50239  50356  50412  51181  53139
0        100.0  100.0  100.0  100.0  100.0  100.0  100.0
1        100.0    NaN    NaN    NaN    NaN    NaN    NaN
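Note that justify is not a pandas or NumPy built-in; a minimal sketch of the helper (adapted from a widely circulated NumPy answer, so treat the details as an assumption) looks like this:

import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # Push the valid entries of the 2D array `a` to one side along `axis`,
    # moving `invalid_val` entries to the other side.
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting booleans moves the True (valid) flags to the high end...
    justified_mask = np.sort(mask, axis=axis)
    # ...so flip them back for 'up'/'left' justification.
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=float)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out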
Another pandas solution:
df8 = df8.apply(lambda x: pd.Series(x.dropna().values))
print(df8)
rule_id  50014  50238  50239  50356  50412  51181  53139
0        100.0  100.0  100.0  100.0  100.0  100.0  100.0
1        100.0    NaN    NaN    NaN    NaN    NaN    NaN

Pandas: cell-wise fillna(method = 'pad') of a list of DataFrame

Basically, I'm trying to do something like this but for a fillna instead of a sum.
I have a list of df's, each with same colunms/indexes, ordered over time:
import numpy as np
import pandas as pd

np.random.seed(0)
df_list = []
for index in range(3):
    a = pd.DataFrame(np.random.randint(3, size=(5, 3)), columns=list('abc'))
    mask = np.random.choice([True, False], size=a.shape)
    df_list.append(a.mask(mask))
Now, I want to replace the numpy.nan cells of the i-th DataFrame in df_list with the value of the same cell in the (i-1)-th DataFrame in df_list.
So if the first DataFrame is:
     a    b    c
0  NaN  1.0  0.0
1  1.0  1.0  NaN
2  0.0  NaN  0.0
3  NaN  0.0  2.0
4  NaN  2.0  2.0
and the 2nd is:
     a    b    c
0  0.0  NaN  NaN
1  NaN  NaN  NaN
2  0.0  1.0  NaN
3  NaN  NaN  2.0
4  0.0  NaN  2.0
Then the output output_list should be a list of the same length as df_list, also having DataFrames as elements.
The first entry of output_list is the same as the first entry of df_list.
The second entry of output_list is:
     a    b    c
0  0.0  1.0  0.0
1  1.0  1.0  NaN
2  0.0  1.0  0.0
3  NaN  0.0  2.0
4  0.0  2.0  2.0
I believe the update functionality is very good for this, see the docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
It is a method that specifically allows you to update a DataFrame, in your case only the NaN-elements of it.
In particular, you could use it like this:
new_df_list = df_list[:1]
for df_new, df_old in zip(df_list[1:], df_list[:-1]):
    df_new.update(df_old, overwrite=False)
    new_df_list.append(df_new)
This will give you the desired output.
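Note that DataFrame.update modifies df_new in place, so the frames inside df_list themselves change as the loop runs, which also makes the fill cumulative, like a pad. A non-mutating sketch using combine_first (which keeps the caller's values and fills its NaNs from the argument) would be:

new_df_list = [df_list[0]]
for df_new in df_list[1:]:
    # Fill df_new's NaNs from the previous (already filled) frame,
    # without touching df_new itself.
    new_df_list.append(df_new.combine_first(new_df_list[-1]))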

No exception raised when accessing wrong column labels in Pandas?

Accessing a Pandas dataframe in some cases does not raise an exception even when the column labels do not exist.
How should I check for these cases, to avoid reading wrong results?
a = pd.DataFrame(np.zeros((5,2)), columns=['la', 'lb'])
a
Out[349]:
    la   lb
0  0.0  0.0
1  0.0  0.0
2  0.0  0.0
3  0.0  0.0
4  0.0  0.0
a.loc[:, 'lc'] # Raised exception as expected.
a.loc[:, ['la', 'lb', 'lc']] # Not expected.
Out[353]:
    la   lb  lc
0  0.0  0.0 NaN
1  0.0  0.0 NaN
2  0.0  0.0 NaN
3  0.0  0.0 NaN
4  0.0  0.0 NaN
a.loc[:, ['la', 'wrong_lb', 'lc']] # Not expected.
Out[354]:
    la  wrong_lb  lc
0  0.0       NaN NaN
1  0.0       NaN NaN
2  0.0       NaN NaN
3  0.0       NaN NaN
4  0.0       NaN NaN
Update: There is a suggested duplicate question (Safe label-based selection in DataFrame), but it's about row selection; my question is about column selection.
It looks like, because at least one of the columns exists, pandas returns an enlarged df, as in a reindex operation.
You could define a user function that validates the columns, handling whether each column exists or not. Here I construct a pandas Index object from the passed-in iterable and call intersection to return the values common to the existing df and the passed-in iterable:
In [80]:
def val_cols(cols):
    return pd.Index(cols).intersection(a.columns)

a.loc[:, val_cols(['la', 'lb', 'lc'])]
Out[80]:
    la   lb
0  0.0  0.0
1  0.0  0.0
2  0.0  0.0
3  0.0  0.0
4  0.0  0.0
This also handles completely missing columns:
In [81]:
a.loc[:, val_cols(['x', 'y'])]
Out[81]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
This also handles your latter case:
In [83]:
a.loc[:, val_cols(['la', 'wrong_lb', 'lc'])]
Out[83]:
    la
0  0.0
1  0.0
2  0.0
3  0.0
4  0.0
Update: in the case where you just want to test whether all the columns are valid, you can iterate over each column in the list and collect the duff (invalid) ones:
In [93]:
def val_cols(cols):
    duff = []
    for col in cols:
        try:
            a[col]
        except KeyError:
            duff.append(col)
    return duff

invalid = val_cols(['la', 'x', 'y'])
print(invalid)
['x', 'y']
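Note that in recent pandas versions (1.0 and later), passing a list containing any missing labels to .loc raises a KeyError, so the silent reindex-like enlargement shown above no longer occurs. If the NaN-filled enlargement is actually what you want, reindex states that intent explicitly:

# Missing columns come back filled with NaN instead of raising.
a.reindex(columns=['la', 'wrong_lb', 'lc'])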

Fill in missing values using climatology data except for current year

df.groupby([df.index.month, df.index.day])[vars_rs].transform(lambda y: y.fillna(y.median()))
I am filling missing values in a dataframe with median values from climatology. The days range from Jan 1 2010 to Dec 31st 2016. However, I only want to fill in missing values for days before the current date (say Oct 1st 2016). How do I modify the statement?
The algorithm would be:
Get a part of the data frame which contains only rows filtered by date with a boolean mask
Perform required replacements on it
Append the rest of the initial data frame to the end of the resulting data frame.
Dummy data:
df = pd.DataFrame(np.zeros((5, 2)), columns=['A', 'B'], index=pd.date_range('2000', periods=5, freq='M'))
              A    B
2000-01-31  0.0  0.0
2000-02-29  0.0  0.0
2000-03-31  0.0  0.0
2000-04-30  0.0  0.0
2000-05-31  0.0  0.0
The code:
vars_rs = ['A', 'B']
mask = df.index < '2000-03-31'
early = df[mask]
early = early.groupby([early.index.month, early.index.day])[vars_rs].transform(lambda y: y.replace(0.0, 1))  # replace with your code
result = pd.concat([early, df[~mask]])
So the result is:
              A    B
2000-01-31  1.0  1.0
2000-02-29  1.0  1.0
2000-03-31  0.0  0.0
2000-04-30  0.0  0.0
2000-05-31  0.0  0.0
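Applied to the original climatology statement, the same masking idea might look like this sketch (the cutoff date and the names df and vars_rs are taken from the question; the exact form is an assumption):

import pandas as pd

cutoff = pd.Timestamp('2016-10-01')  # the "current date" from the question
mask = df.index < cutoff

# Fill only the pre-cutoff rows with the day-of-year median climatology;
# rows on or after the cutoff keep their NaNs.
early = df[mask]
filled = early.groupby([early.index.month, early.index.day])[vars_rs] \
              .transform(lambda y: y.fillna(y.median()))
df.loc[mask, vars_rs] = filled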
Use np.where, for example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [1, np.nan, np.nan, np.nan, np.nan, np.nan]})
df.loc[:, 'C'] = np.where((df.A != 'c') & (df.B < 4) & (pd.isnull(df.C)), -99, df.loc[:, 'C'])
This way you can directly modify the desired column using boolean expressions over any of the columns.
Original dataframe:
   A  B    C
0  a  1  1.0
1  a  2  NaN
2  b  3  NaN
3  b  4  NaN
4  c  5  NaN
5  c  6  NaN
Modified dataframe:
   A  B     C
0  a  1   1.0
1  a  2 -99.0
2  b  3 -99.0
3  b  4   NaN
4  c  5   NaN
5  c  6   NaN
