I have a project where I'm trying to update a dataframe to a new set of changes being rolled out. There are currently 15,000 data samples in the dataframe, so runtime can become an issue quickly. I know vectorizing a dataframe using numpy is a good way to cut back on runtime, but I'm running into an issue with my numpy array and dictionary.
The goal is to look at the value in col3, use that as the key to df_dict, and use the value of that dictionary entry to multiply to col2 and assign to col1.
I've been able to do this using for loops, but it runs into a serious problem of runtime - especially because there are more steps involved than just what I'm asking for help on.
d = {"col1": [1, 2, 3, 4], "col2": [1, 2, 3, 4], "col3": ["a","b","c","d"]}
df = pd.DataFrame(data=d)
df_dict = {"a":1.2,"b":1.5,"c":0.95,"d":1.25}
df["col1"]=df["col2"].values*df_dict[df["col3"].values]
I expect col1 to be updated to [1.2, 3, 2.85, 5], but instead I get the error
TypeError: unhashable type: 'numpy.ndarray'
I get why the error occurs, I just want to find the best alternative.
Looks like you need.
d = {"col1": [1, 2, 3, 4], "col2": [1, 2, 3, 4], "col3": ["a","b","c","d"]}
df = pd.DataFrame(data=d)
df_dict = {"a":1.2,"b":1.5,"c":0.95,"d":1.25}
df["col1"]=df["col2"]* [df_dict.get(i, 1) for i in df["col3"]]
print(df)
Output:
col1 col2 col3
0 1.20 1 a
1 3.00 2 b
2 2.85 3 c
3 5.00 4 d
You can use a little better solution using .map.
So replace:
df["col1"]=df["col2"].values*df_dict[df["col3"].values]
With:
df["col1"]=df["col2"] * df['col3'].map(df_dict)
Related
I try to do following:
import pandas as pd
d = {'col1': [1, 7, 3, 6], 'col2': [3, 4, 9, 1]}
df = pd.DataFrame(data=d)
out = df.query('col1 > col2')
out= col1 col2
1 7 4
3 6 1
This works OK. But when I modify column name col1 --> col1:suf
d = {'col1:suf': [1, 7, 3, 6], 'col2': [3, 4, 9, 1]}
df = pd.DataFrame(data=d)
out = df.query('col1:suf > col2')
I get an error:
'AnnAssign' nodes are not implemented
Is there easy way to avoid this behavior? Or course renaming headers etc. is a workaround
The colon : is a special character in SQL queries. You need to enclose it in backticks.
Try this :
out = df.query('`col1:suf` > col2')
Output :
print(out)
col1:suf col2
1 7 4
3 6 1
According to ValentinFFM's comment on this issue, you need to put a backtick quote around your column name like
df.query('`Column: Name`==value')
Imagine I have a Pandas DataFrame:
# create df
df = pd.DataFrame({'id': [1,1,1,2,2,2],
'val': [5,4,6,3,2,3]})
Lets assume it is ordered by 'id' and an imaginary, not shown, date column (ascending).
I want to create another column where each row is a list of 'val' at that date.
The ending DataFrame will look like this:
df = pd.DataFrame({'id': [1,1,1,2,2,2],
'val': [5,4,6,3,2,3],
'val_list': [[5],[5,4],[5,4,6],[3],[3,2],[3,2,3]]})
I don't want to use a loop because the actual df I am working with has about 4 million records. I am imagining I would use a lambda function in conjunction with groupby (something like this):
df['val_list'] = df.groupby('id')['val'].apply(lambda x: x.runlist())
This raises an AttributError because the runlist() method does not exist, but I am thinking the solution would be something like this.
Does anyone know what to do to solve this problem?
Let us try
df['new'] = df.val.map(lambda x : [x]).groupby(df.id).apply(lambda x : x.cumsum())
Out[138]:
0 [5]
1 [5, 4]
2 [5, 4, 6]
3 [3]
4 [3, 2]
5 [3, 2, 3]
Name: val, dtype: object
I'm trying to write a function that take a pandas Dataframe as argument and at some concatenate this datagframe with another.
for exemple:
def concat(df):
df = pd.concat((df, pd.DataFrame({'E': [1, 1, 1]})), axis=1)
I would like this function to modify in place the input df but I can't find how to achieve this. When I do
...
print(df)
concat(df)
print(df)
The dataframe df is identical before and after the function call
Note: I don't want to do df['E'] = [1, 1, 1] because I don't know how many column will be added to df. So I want to use pd.concat(), if possible...
This will edit the original DataFrame inplace and give the desired output as long as the new data contains the same number of rows as the original, and there are no conflicting column names.
It's the same idea as your df['E'] = [1, 1, 1] suggestion, except it will work for an arbitrary number of columns.
I don't think there is a way to achieve this using pd.concat, as it doesn't have an inplace parameter as some Pandas functions do.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [10, 20, 30], 'D': [40, 50, 60]})
df[df2.columns] = df2
Results (df):
A B C D
0 1 4 10 40
1 2 5 20 50
2 3 6 30 60
In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?
Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.
Here is a solution through list comprehension. cols is the list of columns to get index for:
[df.columns.get_loc(c) for c in cols if c in df]
DSM's solution works, but if you wanted a direct equivalent to which you could do (df.columns == name).nonzero()
For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels in the index (columns only support unique labels) get_indexer_for. It takes the same args as get_indexer:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing with, f.i. for float values taking the nearest value with a tolerance. If two indices have the same distance to the specified label or are duplicates, the index with the larger index value is selected:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
When you might be looking to find multiple column matches, a vectorized solution using searchsorted method could be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -
def column_index(df, query_cols):
cols = df.columns.values
sidx = np.argsort(cols)
return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
apple banana pear orange peach
0 8 3 4 4 2
1 4 4 3 0 1
2 1 2 6 8 1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])
Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs
In case you want the column name from the column location (the other way around to the OP question), you can use:
>>> df.columns.values()[location]
Using #DSM Example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values()[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location] #(thanks to #roobie-nuby for pointing that out in comments.)
To modify DSM's answer a bit, get_loc has some weird properties depending on the type of index in the current version of Pandas (1.1.5) so depending on your Index type you might get back an index, a mask, or a slice. This is somewhat frustrating for me because I don't want to modify the entire columns just to extract one variable's index. Much simpler is to avoid the function altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.
how about this:
df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]
When the column might or might not exist, then the following (variant from above works.
ix = 'none'
try:
ix = list(df.columns).index('Col_X')
except ValueError as e:
ix = None
pass
if ix is None:
# do something
import random
def char_range(c1, c2): # question 7001144
for c in range(ord(c1), ord(c2)+1):
yield chr(c)
df = pd.DataFrame()
for c in char_range('a', 'z'):
df[f'{c}'] = random.sample(range(10), 3) # Random Data
rearranged = random.sample(range(26), 26) # Random Order
df = df.iloc[:, rearranged]
print(df.iloc[:,:15]) # 15 Col View
for col in df.columns: # List of indices and columns
print(str(df.columns.get_loc(col)) + '\t' + col)
![Results](Results
In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?
Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.
Here is a solution through list comprehension. cols is the list of columns to get index for:
[df.columns.get_loc(c) for c in cols if c in df]
DSM's solution works, but if you wanted a direct equivalent to which you could do (df.columns == name).nonzero()
For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels in the index (columns only support unique labels) get_indexer_for. It takes the same args as get_indexer:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing with, f.i. for float values taking the nearest value with a tolerance. If two indices have the same distance to the specified label or are duplicates, the index with the larger index value is selected:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
When you might be looking to find multiple column matches, a vectorized solution using searchsorted method could be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -
def column_index(df, query_cols):
cols = df.columns.values
sidx = np.argsort(cols)
return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
apple banana pear orange peach
0 8 3 4 4 2
1 4 4 3 0 1
2 1 2 6 8 1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])
Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs
In case you want the column name from the column location (the other way around to the OP question), you can use:
>>> df.columns.values()[location]
Using #DSM Example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values()[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location] #(thanks to #roobie-nuby for pointing that out in comments.)
To modify DSM's answer a bit, get_loc has some weird properties depending on the type of index in the current version of Pandas (1.1.5) so depending on your Index type you might get back an index, a mask, or a slice. This is somewhat frustrating for me because I don't want to modify the entire columns just to extract one variable's index. Much simpler is to avoid the function altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.
how about this:
df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]
When the column might or might not exist, then the following (variant from above works.
ix = 'none'
try:
ix = list(df.columns).index('Col_X')
except ValueError as e:
ix = None
pass
if ix is None:
# do something
import random
def char_range(c1, c2): # question 7001144
for c in range(ord(c1), ord(c2)+1):
yield chr(c)
df = pd.DataFrame()
for c in char_range('a', 'z'):
df[f'{c}'] = random.sample(range(10), 3) # Random Data
rearranged = random.sample(range(26), 26) # Random Order
df = df.iloc[:, rearranged]
print(df.iloc[:,:15]) # 15 Col View
for col in df.columns: # List of indices and columns
print(str(df.columns.get_loc(col)) + '\t' + col)
![Results](Results