There are a few issues I am having with Dask DataFrames.
Let's say I have a dataframe with two columns, ['a', 'b'], and I want a new column c = a + b.
In pandas I would do:
df['c'] = df['a'] + df['b']
In dask I am doing the same operation as follows:
df = df.assign(c=(df.a + df.b).compute())
Is it possible to write this operation in a better way, similar to what we do in pandas?
My second question is troubling me more.
In pandas, if I want to change the value of 'a' for rows 2 & 6 to np.pi, I do the following:
df.loc[[2, 6], 'a'] = np.pi
I have not been able to figure out how to do a similar operation in Dask. My logic selects some rows, and I only want to change values in those rows.
Edit: Add new columns
Setitem syntax now works in dask.dataframe
df['z'] = df.x + df.y
Old answer: Add new columns
You're correct that the setitem syntax doesn't work in dask.dataframe.
df['c'] = ... # mutation not supported
As you suggest you should instead use .assign(...).
df = df.assign(c=df.a + df.b)
In your example you have an unnecessary call to .compute(). Generally you want to call compute only at the very end, once you have your final result.
Change rows
As before, dask.dataframe does not support changing rows in place. In-place operations are difficult to reason about in parallel code. At the moment dask.dataframe has no nice alternative operation for this case. I've raised issue #653 for conversation on this topic.
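One common workaround is to express the update as a column-level operation instead of an in-place row assignment, e.g. with `Series.mask`, which dask.dataframe also implements to my knowledge. A minimal pandas sketch of the idea (the toy frame and selected rows are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.zeros(7), 'b': np.arange(7.0)})

# Boolean selection standing in for "my logic selects some rows"
selected = df.index.isin([2, 6])

# mask() returns a new Series rather than mutating in place, so the
# same pattern translates to dask via df.assign(...)
df = df.assign(a=df['a'].mask(selected, np.pi))
```

Because nothing is mutated, the same `assign` call works on a dask dataframe without triggering the in-place limitation.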
Related
I'm running several loops of code on a pandas dataframe which should add new columns.
There are several blocks but they basically look like this:
Bbands_list = [-3,-2.5,-2,-1.5,-1,0, -0.5,0.5,1,1.5,2,2.5,3]
SMA_list = [1,2,3,4,5,6,7,8,9,10,12,14,16,18,20,21,22,24,26,28,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100,125,150,175,200]
for m in SMA_list:
    for b in Bbands_list:
        name = 'M' + str(m) + "B" + str(b)
        df[name] = df['Close'].rolling(m).mean() + (df['Close'].rolling(m).std() * b)
        df[name] = (df[name] - df['Close']) / df['Close']
But when I run the code, I get this error:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df[name] = (df['Close'].rolling(5).std() - df['Close'].rolling(5).std(i))/df['Close'].rolling(5).std()
My understanding was that this was just a performance warning, and I have plenty of memory, so I've ignored it in the past or made a fresh copy after the loops to get a defragmented frame. But this time, whenever I run the code, it ends with df being equal to None. Any idea what might be going on or how I can fix this?
I've tried df = df.copy(), but no matter where in the code I place it, nothing changes.
As suggested by @mozway, don't repeatedly insert columns into a dataframe. Prefer collecting the data in a Python structure (dict, list) and then concatenating it to create a DataFrame. Something like:
from itertools import product

data = {}  # <- a dict
for m, b in product(SMA_list, Bbands_list):
    data[f"M{m}_B{b}"] = (df['Close'].rolling(m).mean()
                          + df['Close'].rolling(m).std() * b
                          - df['Close']) / df['Close']
out = pd.concat(data, axis=1)  # <- the dataframe
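A self-contained sketch of the same pattern on toy data (the prices, window sizes, and band offsets below are illustrative only), showing that all columns are created in one `pd.concat` call with no per-column inserts:

```python
from itertools import product

import numpy as np
import pandas as pd

df = pd.DataFrame({'Close': np.linspace(100, 110, 50)})
SMA_list = [2, 3]
Bbands_list = [-1, 0, 1]

data = {}
for m, b in product(SMA_list, Bbands_list):
    # One Series per (m, b) pair; nothing is inserted into df yet
    data[f"M{m}_B{b}"] = (df['Close'].rolling(m).mean()
                          + df['Close'].rolling(m).std() * b
                          - df['Close']) / df['Close']

# Single concat at the end: no fragmentation warning
out = pd.concat(data, axis=1)
```

Since columns are only assembled once, pandas never has to reallocate block by block, which is what the PerformanceWarning complains about.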
I have a dataframe with 10 columns. I want to add a new column 'age_bmi' which should be a calculated column multiplying 'age' * 'bmi'. age is an INT, bmi is a FLOAT.
That then creates the new dataframe with 11 columns.
Something I am doing isn't quite right. I think it's a syntax issue. Any ideas?
Thanks
df2['age_bmi'] = df(['age'] * ['bmi'])
print(df2)
Try df2['age_bmi'] = df.age * df.bmi.
You're trying to call the dataframe as a function, when you need to get the values of the columns. You can access them by key, like a dictionary, or as an attribute if the column name is lowercase, has no spaces, and doesn't collide with an existing DataFrame method.
Someone linked this in a comment the other day and it's pretty awesome. I recommend giving it a watch, even if you don't do the exercises: https://www.youtube.com/watch?v=5JnMutdy6Fw
As pointed out by Cory, you're calling the dataframe as a function; that will not work as you expect. Here are 4 ways to multiply two columns; in most cases you'd use the first method.
In [299]: df['age_bmi'] = df.age * df.bmi
or,
In [300]: df['age_bmi'] = df.eval('age*bmi')
or,
In [301]: df['age_bmi'] = pd.eval('df.age*df.bmi')
or,
In [302]: df['age_bmi'] = df.age.mul(df.bmi)
You have combined age & bmi inside brackets and are treating df as a function rather than a dataframe. Here df should be used to access the columns as properties of the DataFrame:
df2['age_bmi'] = df['age'] * df['bmi']
You can also use assign:
df2 = df.assign(age_bmi = df['age'] * df['bmi'])
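As a quick sanity check, the variants above all produce the same column; a runnable sketch on toy data (the ages and BMI values are made up):

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 40, 60], 'bmi': [22.5, 27.0, 31.25]})

direct = df['age'] * df['bmi']      # bracket-style arithmetic
evaled = df.eval('age * bmi')       # expression string
method = df['age'].mul(df['bmi'])   # explicit .mul()

# All three agree element-wise
assert direct.tolist() == evaled.tolist() == method.tolist()

df2 = df.assign(age_bmi=direct)
```

`assign` returns a new frame, which is handy when you want to keep the original df untouched.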
I would like to use python 3.4 to compare columns.
I have two columns a and b
If A = B, print A in column C.
If B > A, print all numbers between A and B, including A and B, in column C.
Results for subsequently compared rows would print in column C after the results of the previous comparison.
Any help is appreciated. My question wording must be off as I'm sure this has been done before, but I just can't find it here or elsewhere.
As brittenb noticed, try the apply function in pandas.
import pandas as pd
df = pd.read_excel("somefile.xlsx")
df['c'] = df.apply(lambda r: list(range(r['a'], r['b']+1)), axis=1)
Update
If you want to add rows, writing in pandas may get complicated. If you don't care much about speed and memory, classic python style seems easier to understand.
ary = []
for i, r in df.iterrows():
    for j in range(r['a'], r['b'] + 1):
        ary.append((r['a'], r['b'], j))
df = pd.DataFrame(ary, columns=['a', 'b', 'c'])
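In newer pandas (0.25+), the row-expansion step can also be done with `DataFrame.explode`, which avoids the explicit `iterrows` loop; a sketch on toy data:

```python
import pandas as pd

# Toy stand-ins for columns a and b
df = pd.DataFrame({'a': [1, 4], 'b': [3, 6]})

# Build the list column with apply()...
df['c'] = df.apply(lambda r: list(range(r['a'], r['b'] + 1)), axis=1)

# ...then expand to one row per list element
out = df.explode('c').reset_index(drop=True)
```

`explode` repeats the a and b values for each element of the list in c, so the original pairing is preserved.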
Is there a way with a pandas DataFrame to name only the first, or the first and second, column even if there are 4 columns?
Here:
for x in range(1, len(table2_query) + 1):
    if x == 1:
        cursor.execute(table2_query[x])
        df = pd.DataFrame(data=cursor.fetchall(), columns=['Q', col_name[x - 1]])
and it gives me this :
AssertionError: 2 columns passed, passed data had 4 columns
Consider the df:
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))
df
then use rename and pass a dictionary with the name changes to the argument columns:
df.rename(columns=dict(A='a', B='b'))
Instantiating a DataFrame while only naming a subset of the columns
When constructing a dataframe with pd.DataFrame, you either don't pass an index/columns argument and let pandas auto-generate the index/columns object, or you pass one in yourself. If you pass it in yourself, it must match the dimensions of your data. Mimicking pandas' auto-generation while overriding just the names you want is not worth the trouble: it's ugly and probably non-performant. In other words, I can't even think of a good reason to do it.
On the other hand, it is super easy to rename the columns/index values. In fact, we can rename just a few. I think below is more in line with the spirit of your question:
df = pd.DataFrame(np.arange(8).reshape(2, 4)).rename(columns=str).rename(columns={'1': 'A', '3': 'F'})
df
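As a quick check, the chained-rename approach above leaves the untouched positions with their stringified default names; a runnable sketch:

```python
import numpy as np
import pandas as pd

# Auto-generated integer columns 0..3, stringified, then two renamed
df = (pd.DataFrame(np.arange(8).reshape(2, 4))
        .rename(columns=str)                    # 0,1,2,3 -> '0','1','2','3'
        .rename(columns={'1': 'A', '3': 'F'}))  # rename just two of them
```

Only the keys present in the dict are changed; rename silently ignores any others, which is what makes partial renaming painless.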
I have 2 dataframes. df1 comprises a Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 runs each row through a function which takes 3 columns and produces a result for each row: specialSauce.
df2= pd.DataFrame(list(map(lambda w,r,e: doStuff(w,r,e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
produces all the df1 columns but NaN for the df2 column (and vice versa if df1/df2 are switched in the append).
So the problem I have is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with apply() and doStuff, while df1 does not have that column. When trying to append one pd.DataFrame to the other, the result will have all columns of both pd.DataFrame objects. Naturally, you will have some NaN's for ['specialSauce'] since this column does not exist in df1.
This would be the same if you were to use pd.concat(), both methods do the same thing in this case. The one thing that you could do to bring the result closer to your desired result is use the ignore_index flag like this:
>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
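A minimal sketch of the side-by-side concat, with toy frames standing in for df1 and df2 (the values are made up):

```python
import pandas as pd

# Toy stand-ins for df1 and df2, sharing the same index
df1 = pd.DataFrame({'wins': [1, 2], 'runs': [3, 4]})
df2 = pd.DataFrame({'specialSauce': [4.0, 6.0]})

# axis=1 aligns on the index and places the columns side by side
combined = pd.concat([df1, df2], axis=1)
```

Because both frames share an index, no NaN's appear; rows that exist in only one frame would get NaN in the other frame's columns.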
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>> df1['specialSauce'] = <'specialSauce values'>
I hope that helps, if not please rephrase the description of what you're after.
Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
                  columns=["winnings", "returns",
                           "spent", "runs",
                           "wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
winnings returns spent runs wins expected specialSauce
0 0.166085 0.781964 0.852285 -0.707071 -0.931657 0.886661 -0.752067
1 -0.055704 1.163688 0.079710 0.155916 -1.212917 -0.045265 -1.102266
2 -0.554241 1.928014 0.271214 -0.462848 0.452802 1.692924 1.682878
3 0.627985 3.047389 -1.594841 -1.099262 -0.308115 4.356977 2.949601
4 0.796156 3.228755 -0.273482 -0.661442 -0.111355 2.827409 2.054611
Also
You tried to use pd.DataFrame.append(). Per the linked documentation, it attaches the DataFrame specified as the argument to the end of the DataFrame being appended to. You would have wanted to use pd.concat().