I am writing a function to aid DataFrame merges between two tables. The function creates a mapping key in the first DataFrame using variables in the second DataFrame.
My issue arises when I try to include the .fillna(method=) at the end of the function.
# Import libraries
import pandas as pd
# Create data
data_1 = {"col_1": [1, 2, 3, 4, 5], "col_2": [1, None, 3, None, 5]}
data_2 = {"col_1": [1, 2, 3, 4, 5], "col_3": [1, None, 3, None, 5]}
df = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
def merge_on_key(df, df2, join_how="left", fill_na=None):
    # Code to create mapping key not required for question
    # Merge the two dataframes
    print(fill_na)
    print(type(fill_na))
    df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(method=fill_na)
    return df3
df3 = merge_on_key(df, df2)
output:
>>> None
>>> <class 'NoneType'>
error message:
ValueError: Must specify a fill 'value' or 'method'
My question is: why does passing fill_na, which is equal to None, fail, given that None is the default value for the method argument of fillna?
You have to use either a 'value' or a 'method'. In your call to fillna, both of them end up as None. In short, you're telling pandas to fill the missing (NaN) values in the dataframe with nothing, which does nothing, and thus it raises an exception.
Based on the docs (link), you could either supply a non-empty value (note that value and method are mutually exclusive, so pass only one of them):
df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(value=0)
or change the method from None (which means "directly substitute the missing values in the dataframe with the given value") to one of {'backfill', 'bfill', 'pad', 'ffill'} (each described in the docs):
df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(method='backfill')
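If the goal is for merge_on_key to simply skip filling when no method is supplied, a minimal sketch of the corrected function (using the sample data from the question) could be:

import pandas as pd

def merge_on_key(df, df2, join_how="left", fill_na=None):
    # Merge the two dataframes on the shared key column
    df3 = pd.merge(df, df2, how=join_how, on="col_1")
    # Only call fillna when a fill method was actually supplied
    if fill_na is not None:
        df3 = df3.fillna(method=fill_na)
    return df3

df3 = merge_on_key(df, df2)                          # leaves NaN untouched
df3_filled = merge_on_key(df, df2, fill_na="ffill")  # forward-fills NaN

Note that newer pandas versions deprecate the method= argument in favor of calling df3.ffill() or df3.bfill() directly.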
I want to find a way in Python to merge the files on 'seq' but return all rows that share an id with a matching sequence; in this example, only the lines with id 2 would be removed.
File one:
seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSDLILYYEQYF,2
CASSDLILYYTQYF,2
CASSGSYEQYF,3
CASSGSYEQYY,3
File two:
seq
CSVGPPNNEQFF
CASRGEAAGFYEQYF
CASSGSYEQYY
Output:
seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSGSYEQYF,3
CASSGSYEQYY,3
I have tried:
df3 = df1.merge(df2.groupby('seq', as_index=False)[['seq']].agg(','.join), how='right')
output:
seq,id
CASRGEAAGFYEQYF,1
CASSGSYEQYY,3
CSVGPPNNEQFF,0
Does anyone have any advice how to solve this?
Do you want to merge the two dataframes, or just take a subset of the first dataframe according to which ids are included in the second dataframe (by seq)? Either way, the following gives the required result.
df1 = pd.DataFrame({
    'seq': [
        'CSVGPPNNEQFF',
        'CTVGPPNNEQFF',
        'CTVGPPNNERFF',
        'CASRGEAAGFYEQYF',
        'RASRGEAAGFYEQYF',
        'CASRGGAAGFYEQYF',
        'CASSDLILYYEQYF',
        'CASSDLILYYTQYF',
        'CASSGSYEQYF',
        'CASSGSYEQYY'
    ],
    'id': [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
})
df2 = pd.DataFrame({
    'seq': [
        'CSVGPPNNEQFF',
        'CASRGEAAGFYEQYF',
        'CASSGSYEQYY'
    ]
})
df3 = df1.loc[df1['id'].isin(df1['id'][df1['seq'].isin(df2['seq'])])]
Explanation: df1['id'][df1['seq'].isin(df2['seq'])] takes those values of id from df1 that contain at least one seq that is included in df2. Then all rows with those values of id are taken from df1.
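To make that chained expression easier to follow, here is an equivalent two-step version (the intermediate name matched_ids is illustrative):

# ids of every row of df1 whose seq appears in df2
matched_ids = df1.loc[df1['seq'].isin(df2['seq']), 'id']
# keep all rows of df1 carrying one of those ids
df3 = df1[df1['id'].isin(matched_ids)]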
You can use the isin() pandas method; the code looks as follows:
df1.loc[df1['seq'].isin(df2['seq'])]
This assumes both objects are pandas DataFrames and 'seq' is a column.
I am currently running a parameter study in which the results are returned as pandas DataFrames. I want to store these DFs in an HDF5 file together with the parameter values that were used to create them (parameter foo in the example below, with values 'bar' and 'foo', respectively).
I would like to be able to query the HDF5 file based on these attributes to arrive at the respective DFs - for example, I would like to be able to query for a DF with the attribute foo equal to 'bar'. Is it possible to do this in HDF5? Or would it be smarter in this case to create a multiindex DF instead of saving the parameter values as attributes?
import pandas as pd
df_1 = pd.DataFrame({'col_1': [1, 2],
                     'col_2': [3, 4]})
df_2 = pd.DataFrame({'col_1': [5, 6],
                     'col_2': [7, 8]})
store = pd.HDFStore('file.hdf5')
store.put('table_1', df_1)
store.put('table_2', df_2)
store.get_storer('table_1').attrs.foo = 'bar'
store.get_storer('table_2').attrs.foo = 'foo'
store.close()
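HDF5 attributes are not indexed, so as far as I know there is no built-in query over them; one workaround is to scan the stored keys and compare attributes. A minimal sketch of that idea (the helper name find_by_attr is hypothetical):

import pandas as pd

def find_by_attr(path, name, value):
    # Walk every table in the store and return the first DataFrame
    # whose storer attribute `name` equals `value`.
    with pd.HDFStore(path, mode='r') as store:
        for key in store.keys():
            attrs = store.get_storer(key).attrs
            if getattr(attrs, name, None) == value:
                return store.get(key)
    return None

df = find_by_attr('file.hdf5', 'foo', 'bar')  # returns df_1 from the example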
In the following pandas dataframe there are missing values in different columns for each row.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)
df
I know I can use this to locate which columns are not empty in the ith row
df.iloc[0].notnull()
And then something like the following to find which specific columns are not empty.
np.where(df.iloc[0].notnull())
However, how can I then use those values as indices to return the non missing columns in the ith row?
For example, in the 0th row I'd like to return back columns
df.iloc[0, [0,2]]
This isn't quite right, but I'm guessing it's somewhere along these lines?
df.iloc[0, np.where(df.iloc[0].notnull())]
Edit:
I realize I can do this
df.iloc[0, np.where(df.iloc[0].notnull())[0].tolist()]
And this returns the expected result. However, is this the most efficient approach?
Here's a way using np.isnan
# set row number
row_number = 0
# get dataframe
df.loc[row_number, ~np.isnan(df.values)[row_number]]
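For comparison, the boolean Series returned by notnull() can also be used directly as a mask on the row, which avoids np.where entirely (a sketch; more readable, though not necessarily faster):

row = df.iloc[0]
non_missing = row[row.notnull()]  # Series holding only the non-missing columns of row 0
print(non_missing)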
I have a large number of DataFrames with similar prefix df_, that look like:
df_1
df_x
df_ab
.
.
.
df_1a
df_2b
Of course I can do final_df = pd.concat([df_1, df_x, df_ab, ... df_1a, df_2b], axis = 1)
The issue is that although the prefix df_ will always be there, the rest of the dataframes' names keep changing and do not follow any pattern, so I have to constantly update the list of dataframes in pd.concat to create the final_df, which is cumbersome.
Question: is there any way to tell Python to concatenate all defined dataframes in the namespace (only those starting with df_) and create the final_df, or at least return a list of all such dataframes that I can then manually feed into pd.concat?
You could do something like this, using the built-in function globals():
def concat_all(prefix='df_'):
    dfs = [df for name, df in globals().items()
           if name.startswith(prefix) and isinstance(df, pd.DataFrame)]
    return pd.concat(dfs, axis=1)
Logic:
Filter down your global namespace to DataFrames that start with prefix
Put these in a list (concat doesn't take a generator)
Call concat() along the columns (axis=1).
Example:
import pandas as pd
df_1 = pd.DataFrame([[0, 1], [2, 3]])
df_2 = pd.DataFrame([[4, 5], [6, 7]])
other_df = df_1.copy() * 2 # ignore this
s_1 = pd.Series([1, 2, 3, 4]) # and this
final_df = concat_all()
final_df
0 1 0 1
0 0 1 4 5
1 2 3 6 7
Always use globals() with caution. It gets you a dictionary of the entire module namespace.
You need globals() rather than locals() because the dictionary is used inside a function; locals() there would only contain the function's own local names, not the DataFrames defined at module level.
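If you would rather not inspect the namespace at all, a common alternative is to collect the frames in a dict as they are created and concatenate its values (a sketch; the name frames is illustrative):

import pandas as pd

frames = {}
frames['df_1'] = pd.DataFrame([[0, 1], [2, 3]])
frames['df_2'] = pd.DataFrame([[4, 5], [6, 7]])

# Concatenate every collected frame along the columns
final_df = pd.concat(list(frames.values()), axis=1)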
I am trying to append a numpy.ndarray to a dataframe, with little success.
The dataframe is called user2 and the numpy.ndarray is called CallTime.
I tried:
user2["CallTime"] = CallTime.values
but I get an error message:
Traceback (most recent call last):
File "<ipython-input-53-fa327550a3e0>", line 1, in <module>
user2["CallTime"] = CallTime.values
AttributeError: 'numpy.ndarray' object has no attribute 'values'
Then I tried:
user2["CallTime"] = user2.assign(CallTime = CallTime.values)
but again I get the same error message as above.
I also tried to use the merge command but for some reason it was not recognized by Python although I have imported pandas. In the example below CallTime is a dataframe:
user3 = merge(user2, CallTime)
Error message:
Traceback (most recent call last):
File "<ipython-input-56-0ebf65759df3>", line 1, in <module>
user3 = merge(user2, CallTime)
NameError: name 'merge' is not defined
Any ideas?
Thank you!
A pandas DataFrame is a 2-dimensional data structure, and each column of a DataFrame is a 1-dimensional Series. So if you want to add one column to a DataFrame, you must first convert it into a Series. np.ndarray is a multi-dimensional data structure. From your code, I believe the shape of the np.ndarray CallTime is n x 1 (n rows and 1 column), so it's easy to convert it to a Series. Here is an example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 2), columns=['A', 'B'])
This creates a dataframe df with two columns 'A' and 'B', and 5 rows.
CallTime = np.random.rand(5,1)
Assume this is your np.ndarray data CallTime
df['C'] = pd.Series(CallTime[:, 0])
This will add a new column to df. Here CallTime[:, 0] selects the first column of CallTime; if you want to use a different column from the np.ndarray, change the index.
Please make sure that the number of rows for df and CallTime are equal.
Hope this is helpful.
Instead of pointing only to documentation, I will try to provide a sample:
import numpy as np
import pandas as pd
data = {'A': [2010, 2011, 2012],
        'B': ['Bears', 'Bears', 'Bears'],
        'C': [11, 8, 10],
        'D': [5, 8, 6]}
user2 = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
# create the array that will be appended to the pandas dataframe user2
CallTime = np.array([1, 2, 3])
# convert the ndarray CallTime to a list; if your CallTime is a matrix, you can
# iterate over it after converting it to a list, or convert it into a dataframe
# and append or join just the required column (see the sketch after this answer)
user2.loc[:, 'CallTime'] = CallTime.tolist()
print(user2)
I think this one will help; also check the numpy.ndarray.tolist documentation if you need to find out why we need the list and how to do it. Here is also an example of how to create a dataframe from numpy, in case you need it: https://stackoverflow.com/a/35245297/2027457
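The comment in the snippet above also mentions the 2-D case. One way to handle it, sketched with illustrative names, is to wrap the matrix in a DataFrame and join only the column you need:

# a 3 x 2 matrix instead of a flat array
CallTimeMatrix = np.array([[1, 10], [2, 20], [3, 30]])
ct = pd.DataFrame(CallTimeMatrix, columns=['CallTime2', 'Extra'])
user2 = user2.join(ct[['CallTime2']])  # default RangeIndexes align row by row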
Here is a simple solution.
user2["CallTime"] = CallTime
The problem here is that CallTime is already an array, so you can't use .values; .values is what converts a dataframe (or a column of it) to an array. For example,
df = pd.DataFrame(np.random.rand(10, 2), columns=['A', 'B'])
# The followings are correct
df.values
df['A'].values
df['B'].values
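For completeness, a tiny runnable check of the direct assignment above (the array length must match the number of rows):

import numpy as np
import pandas as pd

user2 = pd.DataFrame({'A': [1, 2, 3]})
CallTime = np.array([0.5, 1.5, 2.5])  # 1-D ndarray, same length as user2

user2['CallTime'] = CallTime          # assign the array directly
print(user2)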