How to fix error: 'AnnAssign' nodes are not implemented in Python

I am trying to do the following:
import pandas as pd
d = {'col1': [1, 7, 3, 6], 'col2': [3, 4, 9, 1]}
df = pd.DataFrame(data=d)
out = df.query('col1 > col2')
out =
   col1  col2
1     7     4
3     6     1
This works OK. But when I rename the column col1 to col1:suf:
d = {'col1:suf': [1, 7, 3, 6], 'col2': [3, 4, 9, 1]}
df = pd.DataFrame(data=d)
out = df.query('col1:suf > col2')
I get an error:
'AnnAssign' nodes are not implemented
Is there an easy way to avoid this behavior? Of course, renaming the headers is a workaround.

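The colon : is not allowed in a bare column name inside a query string. DataFrame.query parses the string as a Python expression, so col1:suf > col2 is read as an annotated assignment, which is where the 'AnnAssign' in the error message comes from. You need to enclose the column name in backticks.
Try this: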
out = df.query('`col1:suf` > col2')
Output:
print(out)
   col1:suf  col2
1         7     4
3         6     1

According to ValentinFFM's comment on this issue, you need to put backticks around the column name, like this:
df.query('`Column: Name`==value')
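As a quick check of the backtick fix, the sketch below compares it with an equivalent boolean mask, which avoids the query parser altogether. It reuses the data from the question; the names out_query and out_mask are only illustrative, and it assumes the installed pandas version supports backtick quoting for names containing a colon.

```python
import pandas as pd

# Data from the question, with a colon in one column name
df = pd.DataFrame({'col1:suf': [1, 7, 3, 6], 'col2': [3, 4, 9, 1]})

# Backticks tell query() to treat the whole name as one column
out_query = df.query('`col1:suf` > col2')

# Equivalent boolean mask: no expression parsing, so any name works
out_mask = df[df['col1:suf'] > df['col2']]

print(out_query)
print(out_query.equals(out_mask))
```

The mask form is more verbose, but it does not depend on how query() tokenizes column names.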

Related

Python Pandas drop

I am building a script with Python and pandas.
I'm trying to delete rows from a DataFrame.
I want to delete the rows that have empty values in two specific columns.
If one of those two columns is filled in but not the other, the row is kept.
I wrote the code below, and it works. But I'm a beginner and I'm sure it can be simplified.
I'm sure I don't need a for loop in my function; I think there is a built-in method for this. I read the documentation but found nothing.
I tried my best, but I need help.
Also, for various reasons, I don't want to use numpy.
Here is my code:
import pandas as pnd
def drop_empty_line(df):
    a = df[(df["B"].isna()) & (df["C"].isna())].index
    for i in a:
        df = df.drop([i])
    return df
def main():
    df = pnd.DataFrame({
        "A": [5, 0, 4, 6, 5],
        "B": [pnd.NA, 4, pnd.NA, pnd.NA, 5],
        "C": [pnd.NA, pnd.NA, 9, pnd.NA, 8],
        "D": [5, 3, 8, 5, 2],
        "E": [pnd.NA, 4, 2, 0, 3]
    })
    print(drop_empty_line(df))
if __name__ == '__main__':
    main()
You indeed don't need a loop. You don't even need a custom function, there is already dropna:
df = df.dropna(subset=['B', 'C'], how='all')
# or in place:
# df.dropna(subset=['B', 'C'], how='all', inplace=True)
output:
   A     B     C  D  E
1  0     4  <NA>  3  4
2  4  <NA>     9  8  2
4  5     5     8  2  3
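To make the role of how='all' concrete, here is a small sketch on a trimmed version of the question's data (only columns A, B and C are kept, which is a simplification). It contrasts how='all' with how='any' and with the explicit boolean mask the original loop was built on; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [5, 0, 4, 6, 5],
    "B": [pd.NA, 4, pd.NA, pd.NA, 5],
    "C": [pd.NA, pd.NA, 9, pd.NA, 8],
})

# how="all": drop a row only when B and C are BOTH missing
kept_all = df.dropna(subset=["B", "C"], how="all")

# how="any": drop a row when B or C (or both) is missing
kept_any = df.dropna(subset=["B", "C"], how="any")

# Same result as how="all", written as an explicit mask
both_missing = df["B"].isna() & df["C"].isna()
kept_mask = df[~both_missing]

print(list(kept_all.index))   # rows 1, 2 and 4 survive
print(list(kept_any.index))   # only row 4 survives
```

The how='all' variant matches the behavior asked for in the question, where a row is kept as long as one of the two columns is filled in.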

Proper way to do this in pandas without using for loop

I would like to avoid iterrows here.
From my dataframe I want to create a new column "unique": each distinct pair of values in columns "a" and "b" gets a label "uniqueN", and every other row with exactly the same "a" and "b" gets the same label.
In this case:
"1", "3" (the first row) is the first unique pair of "a" and "b", so it gets the value "unique1"; the seventh row is also "1", "3", so it gets "unique1" as well.
"2", "2" (the second row) is the next unique pair, so it gets "unique2"; the eighth row is also "2", "2", so it also gets "unique2".
"3", "1" (the third row) is the next unique pair, so it gets "unique3"; no other row in the df is "3", "1", so that value won't repeat.
And so on.
I have working code that uses loops, but this is not the pandas way. Can anyone suggest how to do this with pandas functions?
Expected output (my code works, but it doesn't use pandas methods):
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Code
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})
c = 1
seen = {}
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1
for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value
Let's use groupby ngroup with sort=False to ensure values are enumerated in order of appearance, add 1 so group numbers start at one, then convert to string with astype so we can add the prefix unique to the number:
df['unique'] = 'unique' + \
    df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
Or with map and format instead of converting and concatenating:
df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
    .add(1)
    .map('unique{}'.format)
)
df:
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Setup:
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})
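To see what each step of the groupby answer contributes, the sketch below splits the one-liner into intermediate values. The names group_ids and labels are illustrative only; the data is the setup frame from this answer.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2],
    'b': [3, 2, 1, 2, 3, 2, 3, 2],
})

# ngroup() numbers each (a, b) group from 0; sort=False keeps the
# numbering in order of first appearance instead of sorted key order
group_ids = df.groupby(['a', 'b'], sort=False).ngroup()
print(group_ids.tolist())

# Shift to 1-based numbering, convert to text, and add the prefix
labels = 'unique' + group_ids.add(1).astype(str)
df['unique'] = labels
print(df)
```

With the default sort=True, the pair (1, 3) would still be numbered first here, but (2, 2) and (3, 1) would be numbered by sorted key order rather than by row order, so the labels could differ from the expected output.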
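I came up with a slightly different solution. I'll add it for posterity, but the groupby answer is superior.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})
print(df)
# keep the first occurrence of each (a, b) pair; copy so the assignment below does not act on a view
df1 = df[~df.duplicated()].copy()
print(df1)
# label each pair with the index of its first occurrence (a number, not a "uniqueN" string)
df1['unique'] = df1.index
print(df1)
# broadcast the label back to every row with the same (a, b)
df2 = df.merge(df1, how='left')
print(df2)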

Average by value duplicated pandas python

I have the following CSV, and I need to find the duplicated values in the DialedNumber column and then the average Duration of those duplicates.
I already found the duplicates with the following code:
import pandas as pd
df = pd.read_csv('cdrs.csv')
dnidump = pd.DataFrame(df, columns=['DialedNumber'])
pd.options.display.float_format = '{:.0f}'.format
dupl_dni = dnidump.pivot_table(index=['DialedNumber'], aggfunc='size')
a1 = dupl_dni.to_frame().rename(columns={0:'TimesRepeated'}).sort_values(by=['TimesRepeated'], ascending=False)
b = a1.head(10)
print(b)
Output:
DialedNumber TimesRepeated
50947740194 4
50936564292 2
50931473242 3
I can't figure out how to get the average duration of those duplicates. Any ideas?
Thanks
Try:
df_mean = df.groupby('DialedNumber').mean()
Use df.groupby('column').mean()
Here is sample code.
Input
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': [2461, 1023, 9, 5614, 212],
'C': [2, 4, 8, 16, 32]}, columns=['A', 'B', 'C'])
df.groupby('A').mean()
Output
             B          C
A
1  1164.333333   4.666667
2  2913.000000  24.000000
API reference of pandas.core.groupby.GroupBy.mean
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.mean.html
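Applied to the original question, the groupby approach might look like the sketch below. The CSV is not shown, so the frame here is made-up sample data, and the column name Duration is an assumption taken from the wording of the question; TimesRepeated and AvgDuration are illustrative output names.

```python
import pandas as pd

# Hypothetical stand-in for cdrs.csv (values invented for illustration)
df = pd.DataFrame({
    "DialedNumber": [50947740194, 50947740194, 50936564292,
                     50931473242, 50936564292],
    "Duration": [10, 20, 30, 40, 50],
})

# Count the calls and average the duration per dialed number in one pass
stats = df.groupby("DialedNumber")["Duration"].agg(
    TimesRepeated="size",
    AvgDuration="mean",
)

# Keep only numbers that appear more than once, most repeated first
dupes = stats[stats["TimesRepeated"] > 1].sort_values(
    "TimesRepeated", ascending=False
)
print(dupes)
```

Selecting the Duration column before aggregating avoids averaging unrelated columns, which matters if the real file also contains text fields.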

Iterating through numpy array for use in dictionary

I have a project where I'm trying to update a dataframe to a new set of changes being rolled out. There are currently 15,000 data samples in the dataframe, so runtime can become an issue quickly. I know vectorizing a dataframe using numpy is a good way to cut back on runtime, but I'm running into an issue with my numpy array and dictionary.
The goal is to look at the value in col3, use that as the key to df_dict, and use the value of that dictionary entry to multiply to col2 and assign to col1.
I've been able to do this using for loops, but it runs into a serious problem of runtime - especially because there are more steps involved than just what I'm asking for help on.
d = {"col1": [1, 2, 3, 4], "col2": [1, 2, 3, 4], "col3": ["a","b","c","d"]}
df = pd.DataFrame(data=d)
df_dict = {"a":1.2,"b":1.5,"c":0.95,"d":1.25}
df["col1"]=df["col2"].values*df_dict[df["col3"].values]
I expect col1 to be updated to [1.2, 3, 2.85, 5], but instead I get the error
TypeError: unhashable type: 'numpy.ndarray'
I get why the error occurs, I just want to find the best alternative.
It looks like you need a list comprehension with dict.get:
d = {"col1": [1, 2, 3, 4], "col2": [1, 2, 3, 4], "col3": ["a","b","c","d"]}
df = pd.DataFrame(data=d)
df_dict = {"a":1.2,"b":1.5,"c":0.95,"d":1.25}
df["col1"]=df["col2"]* [df_dict.get(i, 1) for i in df["col3"]]
print(df)
Output:
   col1  col2 col3
0  1.20     1    a
1  3.00     2    b
2  2.85     3    c
3  5.00     4    d
A slightly better solution is to use .map.
Replace:
df["col1"]=df["col2"].values*df_dict[df["col3"].values]
With:
df["col1"]=df["col2"] * df['col3'].map(df_dict)
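One difference between the two answers is how a key missing from the dictionary is handled. The sketch below changes the last value of col3 to "z", a hypothetical key that is not in df_dict, to show it: .map returns NaN for it, and fillna(1) reproduces the default of 1 used by df_dict.get(i, 1).

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [1, 2, 3, 4],
    "col3": ["a", "b", "c", "z"],   # "z" is deliberately not a key below
})
df_dict = {"a": 1.2, "b": 1.5, "c": 0.95, "d": 1.25}

# Vectorized dictionary lookup; keys without an entry become NaN
factors = df["col3"].map(df_dict)

# Fall back to a factor of 1 for unknown keys, then multiply
df["col1"] = df["col2"] * factors.fillna(1)
print(df)
```

Without the fillna call, the row with the unknown key would end up as NaN in col1, which can be useful when a missing key should be flagged rather than silently defaulted.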

Pandas div using index

I sometimes struggle to understand pandas data structures, and it seems to be the case again. Basically, I've got:
1 pivot table, the major axis being a serial number
a Series using the same index
I would like to divide each column of my pivot table by the value in the Series, using the index to match the rows. I've tried plenty of combinations, without success so far :/
import pandas as pd
df = pd.DataFrame([['123', 1, 1, 3], ['456', 2, 3, 4], ['123', 4, 5, 6]], columns=['A', 'B', 'C', 'D'])
pt = pd.pivot_table(df, rows=['A', 'B'], cols='C', values='D', fill_value=0)
serie = pd.Series([5, 5, 5], index=['123', '678', '345'])
pt.div(serie, axis='index')
But I am only getting NaN. I guess it's because the column names don't match, but that's why I was using index as the axis. Any ideas on what I am doing wrong?
Thanks
You say "using the same index", but they're not the same: pt has a MultiIndex, and serie only a plain Index:
>>> pt.index
MultiIndex(levels=[[u'123', u'456'], [1, 2, 4]],
labels=[[0, 0, 1], [0, 2, 1]],
names=[u'A', u'B'])
And you haven't told the division that you want to align on the A part of the index. You can pass that information using level:
>>> pt.div(serie, level='A', axis='index')
C        1   3    5
A   B
123 1  0.6   0  0.0
    4  0.0   0  1.2
456 2  NaN NaN  NaN
[3 rows x 3 columns]
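The rows and cols keywords used in the question come from an older pandas release. The sketch below is a version of the same example that assumes a recent pandas, where pivot_table takes index and columns instead; the level='A' argument is unchanged, and the name result is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [['123', 1, 1, 3], ['456', 2, 3, 4], ['123', 4, 5, 6]],
    columns=['A', 'B', 'C', 'D'],
)

# index/columns replace the old rows/cols keywords
pt = pd.pivot_table(df, index=['A', 'B'], columns='C', values='D',
                    fill_value=0)

serie = pd.Series([5, 5, 5], index=['123', '678', '345'])

# Align the Series on level 'A' of the row MultiIndex and divide row-wise;
# '456' has no entry in serie, so its row becomes NaN
result = pt.div(serie, level='A', axis='index')
print(result)
```

Rows whose level 'A' value is missing from the Series come out as NaN, as in the output shown in the answer.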
