I've created a 4-level binary-tree-like structure in which each leaf is a Pandas Series and each level is a Series with two values indexed by True and False. Each level's Series is named after its level. The result is not a very useful object, but it is convenient to create.
The code below shows how to create a similar (but simpler) object that has the same essential properties.
What I really want is a MultiIndex DataFrame, where each index level inherits its name from the Series at that level.
import random
import pandas as pd

def sertree(names):
    if len(names) <= 1:
        ga = pd.Series([random.randint(0, 100) for x in range(5)], name='last')
        gb = pd.Series([random.randint(0, 100) for x in range(5)], name='last')
        return pd.Series([ga, gb], index=[True, False], name=names[0])
    else:
        xa = sertree(names[1:])
        xb = sertree(names[1:])
        return pd.Series([xa, xb], index=[True, False], name=names[0])
pp = sertree(['top', 'next', 'end'])

n = 4
while True:
    print(f"{'':>{n}s}{pp.name}")
    n += 4
    if len(pp) > 2: break
    pp = pp[True]
    top
        next
            end
                last
What I want is something like this...
top = [True, False]
nxt = [True, False]
end = [True, False]
last = range(5)
midx = pd.MultiIndex.from_product([top, nxt, end, last], names=['top', 'next', 'end', 'last'])
midf = pd.DataFrame([random.randint(0, 100) for x in range(len(midx))], index=midx, columns=['name'])
In [593]: midf.head(12)
Out[593]:
                      name
top  next  end   last
True True  True  0      99
                 1      74
                 2      16
                 3      61
                 4       3
           False 0      44
                 1      46
                 2      59
                 3      14
                 4      82
     False True  0      98
                 1      93
Any ideas on how to transform my 'pp' abomination into a nice multi-indexed DataFrame? Is there a Pandas method I'm failing to see? The essential requirement is to keep the Series name at each level as the MultiIndex name for that level.
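One possible approach (a rough sketch of my own, not a built-in Pandas method; flatten and to_multiindex_frame are hypothetical helper names) is to walk the nested Series recursively, collecting the index keys and the Series names along each path, and then rebuild everything with pd.MultiIndex.from_tuples:

import pandas as pd

def flatten(node):
    # Yield (keys, names, value) for every scalar reachable from this nested Series
    if not isinstance(node.iloc[0], pd.Series):          # leaf level ('last')
        for key, value in node.items():
            yield (key,), (node.name,), value
    else:
        for key, child in node.items():
            for keys, names, value in flatten(child):
                yield (key,) + keys, (node.name,) + names, value

def to_multiindex_frame(root, column='name'):
    rows = list(flatten(root))
    names = rows[0][1]                                   # level names recovered from the Series names
    idx = pd.MultiIndex.from_tuples([k for k, _, _ in rows], names=names)
    return pd.DataFrame({column: [v for _, _, v in rows]}, index=idx)

root = sertree(['top', 'next', 'end'])                   # use a fresh tree; the loop above overwrote pp
midf = to_multiindex_frame(root)

This only assumes that every non-leaf value in the tree is itself a Series, which is how sertree builds it.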
I have some chemical data that I'm trying to process using Pandas. I have two dataframes:
C_atoms_all.head()
id_all index_all label_all species_all position
0 217 1 C C [6.609, 6.6024, 19.3301]
1 218 2 C C [4.8792, 11.9845, 14.6312]
2 219 3 C C [4.8373, 10.7563, 13.9466]
3 220 4 C C [4.7366, 10.9327, 12.5408]
4 6573 5 C C [1.9482, -3.8747, 19.6319]
C_atoms_a.head()
id_a index_a label_a species_a position
0 55 1 C C [6.609, 6.6024, 19.3302]
1 56 2 C C [4.8792, 11.9844, 14.6313]
2 57 3 C C [4.8372, 10.7565, 13.9467]
3 58 4 C C [4.7367, 10.9326, 12.5409]
4 59 5 C C [5.1528, 15.5976, 14.1249]
What I want to do is get a mapping of all of the id_all values to the id_a values where their positions match. You can see that for C_atoms_all.iloc[0] (id_all = 217) and C_atoms_a.iloc[0] (id_a = 55), the position values match (within a small fudge factor), which I should also account for in the query.
The problem I'm having is that I can't merge or filter on the position columns because lists aren't hashable in Python.
I'd ideally like to return a dataframe that looks like so:
id_all id_a position
217 55 [6.609, 6.6024, 19.3301]
... ... ...
for every row where the position values match.
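For context, the fudge factor could in principle be handled directly with a pairwise tolerance comparison before any merge. This is only a rough sketch of my own (the atol value is an arbitrary assumption, and it builds the full n x m comparison in memory):

import numpy as np
import pandas as pd

# Stack the position lists into (n, 3) arrays (assumes every position has exactly 3 coordinates)
pos_all = np.vstack(C_atoms_all['position'].to_numpy())
pos_a = np.vstack(C_atoms_a['position'].to_numpy())

# close[i, j] is True when row i of C_atoms_all matches row j of C_atoms_a within the tolerance
close = np.isclose(pos_all[:, None, :], pos_a[None, :, :], atol=1e-3).all(axis=2)
i_all, i_a = np.nonzero(close)

mapping = pd.DataFrame({
    'id_all': C_atoms_all['id_all'].to_numpy()[i_all],
    'id_a': C_atoms_a['id_a'].to_numpy()[i_a],
    'position': C_atoms_all['position'].to_numpy()[i_all],
})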
You can do it as shown below.
I renamed your C_atoms_all to df_all and C_atoms_a to df_a:
# First we extract the individual values from the "position" column of both dataframes.
df_all["val0"] = df_all["position"].str[0]
df_all["val1"] = df_all["position"].str[1]
df_all["val2"] = df_all["position"].str[2]
df_a["val0"] = df_a["position"].str[0]
df_a["val1"] = df_a["position"].str[1]
df_a["val2"] = df_a["position"].str[2]
# Then because the position values match (within a small fudge factor)
# we round them to three decimals
df_all.loc[:, ["val0", "val1", "val2"]] = df_all[["val0", "val1", "val2"]].round(3)
df_a.loc[:, ["val0", "val1", "val2"]]= df_a[["val0", "val1", "val2"]].round(3)
# We use loc to modify the original dataframe, instead of a copy of it.
# Then we use merge on three extracted values from position column
df = df_all.merge(df_a, on=["val0", "val1", "val2"], left_index=False, right_index=False,
suffixes=(None, "_y"))
# Finally we just keep the desired columns
df = df[["id_all", "id_a", "position"]]
print(df)
id_all id_a position
0 217 55 [6.609, 6.6024, 19.3301]
1 218 56 [4.8792, 11.9845, 14.6312]
2 219 57 [4.8373, 10.7563, 13.9466]
3 220 58 [4.7366, 10.9327, 12.5408]
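A caveat worth noting about the rounding step (my note, not part of the original answer): rounding only matches values that land in the same rounding bin, so two coordinates that differ by less than the fudge factor can still round differently and fail to merge.

# Two values only 0.0002 apart can still round to different 3-decimal bins
print(round(1.2494, 3), round(1.2496, 3))   # 1.249 1.25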
This isn't pretty, but it might work for you
import numpy as np
import pandas as pd

def do(x, df_a):
    # Return the first id_a whose position equals x exactly; NaN if nothing matches
    try:
        return next(df_a.loc[i, 'id_a'] for i in df_a.index if df_a.loc[i, 'position'] == x)
    except StopIteration:
        return np.nan

match = pd.DataFrame(C_atoms_all[['id_all', 'position']])
match['id_a'] = C_atoms_all['position'].apply(do, args=(C_atoms_a,))
You can create a new column in both datasets that contains the hash of the position column and then merge both datasets by that new column.
# Custom hash function
def hash_position(position):
    return hash(tuple(position))
# Create the hash column "hashed_position"
C_atoms_all['hashed_position'] = C_atoms_all['position'].apply(hash_position)
C_atoms_a['hashed_position'] = C_atoms_a['position'].apply(hash_position)
# merge datasets
C_atoms_a.merge(C_atoms_all, how='inner', on='hashed_position')
# ... keep the columns you need
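A small follow-up (my addition, not part of the original answer): hashing only matches exact values, so given the fudge factor in the question the coordinates would likely need to be rounded before hashing, for example:

# Hash a rounded copy of the coordinates so that nearly-equal positions collide
def hash_position(position, decimals=3):
    return hash(tuple(round(p, decimals) for p in position))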
Your question is not entirely clear, though it seems an interesting one. For that reason I have reproduced your data in a more useful format, in case someone can help more than I can.
Data
C_atoms_all = pd.DataFrame({
'id_all': [217,218,219,220,6573],
'index_all': [1,2,3,4,5],
'label_all': ['C','C','C','C','C'],
'species_all': ['C','C','C','C','C'],
'position':[[6.609, 6.6024, 19.3301],[4.8792, 11.9845, 14.6312],[4.8373, 10.7563, 13.9466],[4.7366, 10.9327, 12.5408],[1.9482,-3.8747, 19.6319]]})
C_atoms_a = pd.DataFrame({
'id_a': [55,56,57,58,59],
'index_a': [1,2,3,4,5],
'label_a': ['C','C','C','C','C'],
'species_a': ['C','C','C','C','C'],
'position':[[6.609, 6.6024, 19.3302],[4.8792, 11.9844, 14.6313],[4.8372, 10.7565, 13.9467],[4.7367, 10.9326, 12.5409],[5.1528, 15.5976, 14.1249]]})
Solution
import numpy as np

# New dataframe bringing the two position columns together
df3 = C_atoms_all.set_index('index_all').join(
    C_atoms_a.set_index('index_a').loc[:, 'position'].to_frame(), rsuffix='_r').reset_index()
# Create a temp column holding the rounded element-wise differences between the two positions
df3['temp'] = df3.filter(regex='^position').apply(
    lambda x: np.round(np.array(x.iloc[0]) - np.array(x.iloc[1]), 4), axis=1)
# Keep a row when at least two of the three rounded differences are exactly 0
# (i.e. at most one coordinate is off)
C_atoms_all[df3.explode('temp').groupby(level=0)['temp'].apply(lambda x: x.eq(0).sum()).gt(1)]
id_all index_all label_all species_all position
0 217 1 C C [6.609, 6.6024, 19.3301]
I have a dataframe df with 2 columns
Sub_marks Total_marks
40 90
60 80
100 90
0 0
I need to find all rows that fail the criterion sub_marks <= Total_marks.
Currently I am using a sympy function, as below:
import sympy as sy

def fn_validate(formula, **args):
    exp = sy.sympify(formula)
    exp = exp.subs(args)
    return exp
I am calling the above function using the apply method, as below:
df['val_check'] = df.apply(lambda row: fn_validate('X<=Y', X=row['Sub_marks'], Y=row['Total_marks']), axis=1)
I am expecting a column val_check with the True/False result of the expression validation, but in the case of 0 values I am getting an error:
Invalid NaN comparison
I can't remove these values from the dataframe.
Please let me know if there is any other way to do this expression validation.
You can try this:
df['val_check'] = df.Sub_marks <= df.Total_marks
df
Sub_marks Total_marks val_check
0 40 90 True
1 60 80 True
2 100 90 False
3 0 0 True
You can directly compare the two columns; the result is a boolean Series marking the rows that fail the criterion:
condition = df['Sub_marks'] > df['Total_marks']
print(condition)
Output:
0    False
1    False
2     True
3    False
dtype: bool
I am using the data from the example shown here:
http://pandas.pydata.org/pandas-docs/stable/groupby.html. Go to the subheading: New syntax to window and resample operations
At the command prompt, the new syntax works as shown in the pandas documentation. But I want to add a new column with the expanded data to the existing dataframe, as would be done in a saved program.
Before the syntax upgrade to the groupby expanding code, I was able to use the following single-line code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1] * 10 + [5] * 10, 'B': np.arange(20)})
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: pd.expanding_sum(x))
This gives the expected results, but also gives an 'expanding_sum is deprecated' message. Expected results are:
A B Sum of B
0 1 0 0
1 1 1 1
2 1 2 3
3 1 3 6
4 1 4 10
5 1 5 15
6 1 6 21
7 1 7 28
8 1 8 36
9 1 9 45
10 5 10 10
11 5 11 21
12 5 12 33
13 5 13 46
14 5 14 60
15 5 15 75
16 5 16 91
17 5 17 108
18 5 18 126
19 5 19 145
I want to use the new syntax to replace the deprecated syntax. If I try the new syntax, I get the error message:
df['Sum of B'] = df.groupby('A').expanding().B.sum()
TypeError: incompatible index of inserted column with frame index
I did some searching on here, and saw something that might have helped, but it gave me a different message:
df['Sum of B'] = df.groupby('A').expanding().B.sum().reset_index(level = 0)
ValueError: Wrong number of items passed 2, placement implies 1
The only way I can get it to work is to assign the result to a temporary df, then merge the temporary df into the original df:
temp_df = df.groupby('A').expanding().B.sum().reset_index(level = 0).rename(columns = {'B' : 'Sum of B'})
new_df = pd.merge(df, temp_df, on = 'A', left_index = True, right_index = True)
print (new_df)
This code gives the expected results as shown above.
I've tried different variations using transform as well, but have not been able to come up with a way to code this in one line as I did before the deprecation. Is there a single-line syntax that will work? Thanks.
It seems you need a cumsum:
df.groupby('A')['B'].cumsum()
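Since the grouped cumsum keeps the original row index, it can be assigned back in a single line without any merge, matching the expected output above (a minimal usage sketch):

df['Sum of B'] = df.groupby('A')['B'].cumsum()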
TL;DR
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: x.expanding().sum())
Explanation
We start from the offending line:
df.groupby('A')['B'].transform(lambda x: pd.expanding_sum(x))
Let's read carefully the warning you mentioned:
FutureWarning: pd.expanding_sum is deprecated for Series and will be
removed in a future version, replace with
Series.expanding(min_periods=1).sum()
After reading Pandas 0.17.0: pandas.expanding_sum it becomes clear that the Series the warning is talking about is the first parameter of the pd.expanding_sum. I.e. in our case it is x.
Now we apply the code transformation suggested in the warning. So pd.expanding_sum(x) becomes x.expanding(min_periods=1).sum().
According to Pandas 0.22.0: pandas.Series.expanding min_periods has a default value of 1 so in your case it can be omitted altogether, hence the final result.
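To illustrate why transform is the piece that makes single-line assignment possible (an illustrative sketch of my own, not part of the original answer): transform returns a result aligned with the dataframe's original flat index, whereas the plain groupby/expanding chain returns a MultiIndex of (group key, original index), which is what triggered the "incompatible index" error in the question.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 10 + [5] * 10, 'B': np.arange(20)})

# Aligned with df's index, so direct assignment works
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: x.expanding().sum())

# The un-transformed chain carries a (group, original index) MultiIndex instead
print(df.groupby('A').expanding()['B'].sum().index.names)   # ['A', None]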
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [11, 11, 22, 22], 'mask': [0, 0, 0, 1], 'values': np.arange(10, 30, 5)})
df
A mask values
0 11 0 10
1 11 0 15
2 22 0 20
3 22 1 25
Now how can I group by A, keep the column names intact, and yet put a custom function's result into a new column Z:
def calculate_df_stats(dfs):
    mask_ = list(dfs['mask'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean

df['Z'] = df.groupby('A').agg(calculate_df_stats)  # does not work
and generate:
A mask values Z
0 11 0 10 12.5
1 22 0 20 25
Whatever I do, it only replaces the values column with the masked mean.
And can your solution be applied to a function that uses two columns and returns the result in a new column?
Thanks!
Edit:
To clarify further: let's say I have a table like this in MySQL:
SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;
which gives me this result:
http://pastebin.com/qXiaWcJq
If I now run this:
SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
group by `reader_plate_ID`;
I get:
datapoint_ID plate_ID coordinate_x coordinate_y res_value wavelength ignore avg(action_value)
193 1 0 0 2.1783 560 NULL 2.090027083333334
481 2 0 0 1.7544 560 NULL 1.4695583333333333
769 3 0 0 2.0161 560 NULL 1.6637885416666673
How can I replicate this behaviour in Pandas? Note that all the column names stay the same, the first value of each is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated dataframe (but you will have to aggregate your original columns in some way; I took the first occurring value as an example):
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>>
>>> grouped = df.groupby("A")
>>>
>>> result = grouped.agg('first')
>>> result
mask values
A
11 0 10
22 0 20
and then add a column 'Z' to that result by applying your function on the groupby result 'grouped':
>>> def calculate_df_stats(dfs):
... mask_ = list(dfs['mask'])
... mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
... return mean
...
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>>
>>> result
mask values Z
A
11 0 10 12.5
22 0 20 20.0
In your function definition you can always use more columns (just by their names) to compute the result.
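For instance, a function that combines the two columns to produce another statistic could follow exactly the same pattern (an illustrative sketch; masked_range is just a made-up name):

def masked_range(dfs):
    # Spread of the unmasked values within each group
    vals = np.ma.array(list(dfs['values']), mask=list(dfs['mask']))
    return vals.max() - vals.min()

result['range'] = grouped.apply(masked_range)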