I am well aware that vectorized operations are strongly preferred over row-by-row operations in pandas. However, I have exhausted the options I know of for an exercise in which I need to create a new df column that depends on the previous row's value of another column, while that other column in turn depends on the previous row's value of the new column. The two columns are therefore interdependent (this is not circular referencing, because each column depends only on the previous-row values of the other).
An event-driven, for-loop approach seems most relevant, and I learned of the itertuples() function as a starting point:
for row in df.itertuples():
However, what is the best way to build the new df column from such a loop? Should I first create a new, blank df column and then append to it in some way?
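One common pattern for this kind of row-by-row recurrence is to collect the computed values in plain Python lists inside the loop and assign them as columns once at the end, rather than appending to the DataFrame on each iteration. The sketch below is only illustrative: the column name x, the starting values of 0.0, and the two update formulas are hypothetical stand-ins for the real rules.

```python
import pandas as pd

# Hypothetical input; "x" stands in for whatever drives the recurrence.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

a_vals, b_vals = [], []
prev_a, prev_b = 0.0, 0.0  # assumed starting values for the first row

for row in df.itertuples(index=False):
    # Each new value depends on the previous row's value of the other column.
    a = row.x + prev_b          # hypothetical rule for column "a"
    b = 0.5 * prev_a + row.x    # hypothetical rule for column "b"
    a_vals.append(a)
    b_vals.append(b)
    prev_a, prev_b = a, b

# Assign both columns in one step after the loop.
df["a"] = a_vals
df["b"] = b_vals
```

Building lists and assigning once avoids repeated DataFrame writes inside the loop, which tend to be slow.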
Related
I have two dataframes, df1 and df2. They share a column with common values, but in df1['comon_values_column'] every value appears only once, while in df2['comon_values_column'] each value can appear more than once. I want to know whether I can do the following in a single line and without a loop:
for value in df2['comon_values_column']:
    df2['empty_column'].loc[df2['comon_values_column']==value]=df1['other_column'].loc[df1['comon_values_column']==value]
I have tried to use merge, but because of the size of the dataframe it is very difficult to verify that it does exactly what I want.
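One possible loop-free alternative to merge is to turn df1 into a lookup Series and use Series.map. The sketch below uses made-up data and assumes, as stated in the question, that each key appears only once in df1.

```python
import pandas as pd

# Made-up data: keys are unique in df1 and repeated in df2.
df1 = pd.DataFrame({
    "comon_values_column": ["a", "b", "c"],
    "other_column": [1, 2, 3],
})
df2 = pd.DataFrame({"comon_values_column": ["b", "a", "b", "c", "a"]})

# Lookup Series: index holds the key, values hold the data to copy.
lookup = df1.set_index("comon_values_column")["other_column"]

# map() looks up each key in df2; keys missing from the lookup would give NaN.
df2["empty_column"] = df2["comon_values_column"].map(lookup)
```

Because map keeps the row order and row count of df2 unchanged, the result is easier to check than a merge on a large dataframe.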
Can I create another index on an existing column of pandas DataFrame? Just like what CREATE INDEX in SQL does. For example: My DataFrame has two columns id_a and id_b, both of them are unique for each row, and I'd like to index rows sometimes with id_a while other times with id_b (so I think MultiIndex won't work for me). I want these operations to be fast, so "index" must be created for both id_a and id_b.
Is this currently possible in pandas?
You can't have 2 indices in a Pandas DataFrame object.
You will have to work around this limitation, for example:
by code logic,
using other Pandas features,
using columns and flags depending which column needs to be used to index a given row
The operations should be fast. For additional performance, you can adjust the dtypes according to your needs. To match hashmap lookups or similar, you will have to put more thought into your use case and perhaps use a different logical approach, such as a separate mapping or dict.
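As a minimal sketch of the "separate mapping" idea, one option is to keep id_a as the real index and build a small Series that translates id_b values into id_a labels. The data and the names df and b_to_a below are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Illustrative data: both id_a and id_b are unique per row.
df = pd.DataFrame({
    "id_a": [10, 11, 12],
    "id_b": ["x", "y", "z"],
    "value": [1.0, 2.0, 3.0],
}).set_index("id_a")

# Secondary lookup: id_b -> id_a (a plain dict would work as well).
b_to_a = pd.Series(df.index, index=df["id_b"].to_numpy())

row_by_a = df.loc[11]            # direct lookup on the real index
row_by_b = df.loc[b_to_a["z"]]   # translate id_b to id_a, then look up
```

The mapping has to be rebuilt whenever rows are added or removed, which is the price of maintaining a second key by hand.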
I am using pandas for the first time.
df.groupby(np.arange(len(df))//10).mean()
I used the code above, which works to take the average of each block of 10 rows. I want to save this updated dataframe, but df.to_csv saves the original dataframe that I imported.
I also want to multiply one column from my df (df.groupby dataframe essentially) with a number and make a new column. How do I do that?
The operation:
df.groupby(np.arange(len(df))//10).mean()
might return the averaged dataframe you want, but it won't change the original dataframe. Instead, you'll need to do:
df_new = df.groupby(np.arange(len(df))//10).mean()
You could assign it to the same name if you want. The other option is that some operations you might expect to modify the dataframe accept an inplace argument, which normally defaults to False. See this question on SO.
To create a new column that is an existing column multiplied by a number, you'd do:
df_new['new_col'] = df_new['existing_col']*a_number
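Putting the steps above together, here is a small end-to-end sketch. The sample data and the multiplier 2 are made up, and the CSV is written to an in-memory buffer only so that the example has no side effects; a file path such as a hypothetical "averaged.csv" could be passed to to_csv instead.

```python
import io

import numpy as np
import pandas as pd

# Made-up data: 20 rows, so there are two blocks of 10.
df = pd.DataFrame({"existing_col": np.arange(20, dtype=float)})

# Keep the result: groupby(...).mean() returns a new dataframe.
df_new = df.groupby(np.arange(len(df)) // 10).mean()

# New column derived from the averaged data.
df_new["new_col"] = df_new["existing_col"] * 2

# Save the averaged dataframe, not the original one.
buffer = io.StringIO()
df_new.to_csv(buffer)
```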
I have a master dataframe with anywhere between 750 to 3000 rows of data.
I have a daily order dataframe with anywhere from 3000 to 5000 rows of data.
If the product code of the daily order dataframe is found in the master dataframe, I get the item cost. Otherwise, it is marked as invalid and deleted.
I currently do this with 2 for loops, but I will have to do many more such comparisons and data updates (other fields to compare, other values to copy).
What is the most efficient way to do this?
I cannot make the column I am comparing the index column of the master dataframe.
In this case, the product code may be unique in the master and I could do a merge, but there are other cases where I may have to compare other values like supplier city which may not be unique.
I seem to be doing this repeatedly in all my Python code, and I want to learn the most efficient way to do it.
Order DF:
[![Order csv from which the Order DF is created][1]][1]
Master DF
[![Master csv from which Master DF is created][1]][1]
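def fillVol(orderDF, mstrDF, paramC, paramF, notFound):
    # paramC: name of the column to compare; paramF: names of the columns to fill and use
    orderDF['ttlVol'] = 0
    for i in range(len(orderDF)):
        found = False
        for row in mstrDF.itertuples():
            if orderDF.loc[i, paramC] == getattr(row, paramC):
                orderDF.loc[i, paramF[0]] = getattr(row, paramF[0])  # mtrl cbf
                found = True
                break
        if not found:
            # the original referred to an undefined name "inv"; orderDF is assumed here
            notFound.append(orderDF.loc[i, paramC])
    orderDF['ttlVol'] = orderDF[paramF[0]] * orderDF[paramF[2]]
    return notFound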
I am passing along the names of the columns I am comparing and the names of the columns I am filling with data, because there are minor variations in naming between the csv files. In the data I have shared, the material volume is CBF; in some cases it is CBM.
The data columns cannot be used as an index because no single column holds unique values; it is always a combination of values that makes a row unique.
The data, in this case, is a float and numpy could be used, but in other cases like copying city names from a master, the data is a string. numpy was the suggestion to other people with a similar issue
I don't know if this is the most efficient way of doing it. As someone who started programming with Fortran and then C, I always lean towards basic datatypes, and this solution does not use them. It is definitely a highly Pythonic solution, though.
orderDF=orderDF[orderDF[paramC].isin(mstrDF[paramC])]  # keep only orders whose paramC value appears in the master
orderDF=orderDF.reset_index(drop=True)
I use a left merge on the orderDF and mstrDF data frames to copy all relevant values:
orderDF=orderDF.merge(mstrDF.drop_duplicates(paramC,keep='last')[[paramC,paramF[0]]], on=paramC, how='left', validate='m:1')
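The filter-then-merge idea can also be done in a single pass using the indicator option of merge, which marks the rows that had no match in the master. The sketch below is only one possible arrangement; the column names code, cbf and qty and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical master and order data; "code" plays the role of paramC.
mstrDF = pd.DataFrame({"code": ["A", "B", "B"], "cbf": [1.5, 2.0, 2.5]})
orderDF = pd.DataFrame({"code": ["A", "B", "Z"], "qty": [2, 3, 1]})

# One row per code in the master, keeping the last duplicate.
master_unique = mstrDF.drop_duplicates("code", keep="last")[["code", "cbf"]]

# Left merge; the _merge column records whether each order found a match.
merged = orderDF.merge(
    master_unique, on="code", how="left", validate="m:1", indicator=True
)

# Codes with no match in the master (the "invalid" orders).
notFound = merged.loc[merged["_merge"] == "left_only", "code"].tolist()

# Keep only the matched orders and compute the total volume.
valid = merged[merged["_merge"] == "both"].drop(columns="_merge")
valid = valid.reset_index(drop=True)
valid["ttlVol"] = valid["cbf"] * valid["qty"]
```

The validate='m:1' check raises an error if the master side still contains duplicate keys, which guards against silently multiplying order rows.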
I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds a new column whose name is the index number as a string. Is there something akin to the Pandas approach of, say, df.iloc[:,-1] = df.iloc[:,-1]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a Dask dataframe that is stored in many pieces. However, methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
In the future, we recommend asking separate questions separately on Stack Overflow.
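As a rough sketch of the .map_partitions idea: each partition of a Dask dataframe is an ordinary pandas DataFrame, so the per-column change can be written as a plain pandas function. The helper name negate_last_column is hypothetical, and the example below runs on pandas only so that it does not depend on a Dask installation.

```python
import pandas as pd


def negate_last_column(part: pd.DataFrame) -> pd.DataFrame:
    """Make positive values in the last column negative (hypothetical helper)."""
    part = part.copy()
    last = part.columns[-1]  # works for string or integer column labels
    # Keep values that are already <= 0; replace positive values with their negation.
    part[last] = part[last].where(part[last] <= 0, -part[last])
    return part


# Plain pandas check, using integer column labels as in the question.
pdf = pd.DataFrame({0: [1.0, -2.0], 1: [3.0, -4.0]})
result = negate_last_column(pdf)
```

With a Dask dataframe, the same function could then be passed along the lines of df = df.map_partitions(negate_last_column), so that it is applied to each partition in turn.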