How to use dataframe.iat with a string column index - Python

I need to read and update cell values in my data based on dataframe.iat[row, column].
My data has about 338,000 rows, so I need the fastest approach (iat) for this.
I have to refer to the column by name because it changes dynamically in another loop.
When I execute my code I get the following error:
for i in range(30000):
    b = data_jeux.iat[i, 'skill_id_%s' % k]
ValueError: iAt based indexing can only have integer indexers
PS: I have already used df.get_value(); it works correctly, but I specifically need a solution with .iat.

pd.Index.get_loc
With Pandas, you should generally avoid Python-level for loops. However, assuming you must iterate explicitly, you can use get_loc to extract the integer position of the column:
col_loc = data_jeux.columns.get_loc
for i in range(30000):
    b = data_jeux.iat[i, col_loc(f'skill_id_{k}')]
Given you have a large loop, assigning data_jeux.columns.get_loc to a variable outside your loop and using f-strings may offer some marginal performance improvements.
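For illustration, a minimal runnable sketch of this pattern (the DataFrame contents and the value of k are made up here, and k is assumed fixed for the duration of the loop):
import pandas as pd

# hypothetical stand-in for data_jeux
data_jeux = pd.DataFrame({'skill_id_1': [10, 20, 30],
                          'skill_id_2': [40, 50, 60]})
k = 1

col_loc = data_jeux.columns.get_loc   # bind the method once, outside the loop
j = col_loc(f'skill_id_{k}')          # resolve column name -> integer position
for i in range(len(data_jeux)):
    b = data_jeux.iat[i, j]           # pure positional scalar access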

Related

Use a function on the result of pandas .loc

Hi peeps,
I would like to know if there is some way to apply a function to the result of pandas .loc, or if some better way to do it exists.
So what I'm trying to do is:
If a value in this series is != 0, then take the values of other columns in that row and use them as parameters for a function (in this case, get_working_days_delta), then put the result back in the same series.
df.loc[(df["SERIES"] != 0), 'SERIES'] = df.apply(cal.get_working_days_delta(df["DATE_1"],df["DATE_2"]))
The output is: datetime64[ns] is of unsupported type (<class 'pandas.core.series.Series'>)
In this case, the parameters used (df["DATE_1"], df["DATE_2"]) are passed as the entire Series rather than cell values.
I don't want to use .apply or .at because this df has over 4 million rows.
It's hard to give a good answer without a reproducible example, but I think the problem is that you need to filter before using apply.
Try:
df[df["SERIES"] != 0].apply(...)

How can I work with .iloc[] in Python to do some calculation?

I have to implement some functions to calculate special values. I read a csv file for it with pd.read_csv(). Then I used .iloc[] to find the respective row and column I need for my calculation:
V_left = data_one.iloc[0,0:4]
V_right= data_one.iloc[0,5:9]
My formula, which I want to implement is: V_left/V_right
V is a vector of 5 parameters (values).
My question is now: How can I use the values, which I pick out with .iloc[], to do a calculation like my formula?
See my current code here.
You can use:
V_left.values and V_right.values to turn those Series into NumPy arrays, so that you can manipulate them.
However, I wouldn't use iloc in the first place; you can convert directly:
V_left = data_one.values[0, :4]
V_right = data_one.values[0, 5:9]
Then V_left / V_right should be enough.
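A minimal sketch of that elementwise division, with a made-up one-row DataFrame standing in for the CSV data:
import numpy as np
import pandas as pd

data_one = pd.DataFrame(np.arange(1.0, 11.0).reshape(1, 10))  # toy stand-in

V_left = data_one.values[0, :4]    # first four values, already a NumPy array
V_right = data_one.values[0, 5:9]  # four values after the gap

result = V_left / V_right          # plain elementwise NumPy division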

How to calculate percentage and store it in a new column?

I have the following dataframe and I want to add a new column with the percentage value:
df =
   TIME_1  TIME_2
0      80     150
1     120      20
I want to add a new column TIME_1_PROC that stores the percentage that TIME_1 contributes to TIME_1 + TIME_2.
This is my code, but it triggers a warning:
df.TIME_1_PROC = (df.TIME_1*100/(df.TIME_1+df.TIME_2))
Warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
This creates the new column:
df['TIME1_PROC'] = (df.TIME_1 * 100 / (df.TIME_1 + df.TIME_2))
Out[27]:
TIME_1 TIME_2 TIME1_PROC
0 80 150 34.782609
1 120 20 85.714286
Generally speaking...
Just a quick elaboration on @Imo's correct answer. Most of the time you are better off creating and referring to columns like this:
df['x']
rather than this:
df.x
And when you are creating a new column, you MUST use the first method. Even for existing columns, the first way is considered better, because you avoid potential ambiguity if you happen to have a column called "index": if you type df.index, will that return the index or the column named "index"? Of course, we all use the attribute style as a shortcut on occasion, so perhaps a more reasonable rule of thumb is to use the shortcut only on the right-hand side.
This particular example...
All that said, the behavior of pandas here does not seem ideal. The warning message you got is a common one in pandas and is often ignorable (as it is here). What is unfortunate is that you didn't get an error message about assigning to a non-existent column. Furthermore, consider the following:
df['TIME_1_PROC'] # KeyError: 'TIME_1_PROC'
df.TIME_1_PROC
0 34.782609
1 85.714286
dtype: float64
So your new "column" did get created, but as an attribute rather than a column. To be more explicit: usually when we use an attribute-style reference, pandas interprets it as referring to a column, but in this case it actually is a plain Python attribute (and that's not what you want).
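To see the difference concretely, a toy reproduction (not the OP's data; note that newer pandas versions also emit a UserWarning on the attribute assignment):
import pandas as pd

df = pd.DataFrame({'TIME_1': [80, 120], 'TIME_2': [150, 20]})

df.TIME_1_PROC = df.TIME_1 * 100 / (df.TIME_1 + df.TIME_2)  # sets a plain attribute, not a column
print('TIME_1_PROC' in df.columns)  # False: no column was created

df['TIME_1_PROC'] = df.TIME_1 * 100 / (df.TIME_1 + df.TIME_2)  # creates a real column
print('TIME_1_PROC' in df.columns)  # True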
You can use pd.set_option('mode.chained_assignment', None) to suppress such messages.

Pandas: set one cell value equal to another

I want to set a cell of pandas dataframe equal to another. For example:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude']
However, when I checked
station_dim.loc[station_dim.nlc==573,'longitude']
It returns NaN
Besides directly setting station_dim.loc[station_dim.nlc==573,'longitude'] to a number, what other choice do I have? And why doesn't this method work?
Take a look at get_value, or use .values:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude'].values[0]
For the assignment to work, the pd.Series returned by .loc[] on the right-hand side would need an index that aligns with the rows you are assigning to, which it probably doesn't (573 and 5152 select different rows). So either extract the value directly using .get_value() (where you need to get the index position first), or use .values, which returns a np.array, and take the first value of that array.
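A small reproduction of the alignment issue (toy values for nlc and longitude):
import pandas as pd

station_dim = pd.DataFrame({'nlc': [573, 5152], 'longitude': [0.0, -0.1]})

rhs = station_dim.loc[station_dim.nlc == 5152, 'longitude']  # Series with index [1]
station_dim.loc[station_dim.nlc == 573, 'longitude'] = rhs   # targets index [0]: no overlap -> NaN
station_dim.loc[station_dim.nlc == 573, 'longitude'] = rhs.values[0]  # scalar assignment: works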

Pandas: DataFrame.sum() or DataFrame().as_matrix.sum()

I am writing a function that computes conditional probabilities for all pairs of columns in a pd.DataFrame with ~800 columns. I wrote a few versions of the function and found a very big difference in compute time between two primary options:
col_sums = data.sum() #Simple Column Sum over 800 x 800 DataFrame
Option #1 ('col_sums' and 'data' are a Series and DataFrame respectively; this runs inside a loop over index1 and index2 to get all combinations):
joint_occurance = data[index1] * data[index2]
sum_joint_occurance = joint_occurance.sum()
max_single_occurance = max(col_sums[index1], col_sums[index2])
cond_prob = sum_joint_occurance / max_single_occurance  # symmetric conditional probability
results[index1][index2] = cond_prob
Vs.
Option #2 (while looping over index1 and index2 to get all combinations): the only difference is that I exported the data to a np.array before looping:
new_data = data.T.as_matrix()  # type: np.ndarray
Option #1 Runtime is ~1700 sec
Option #2 Runtime is ~122 sec
Questions:
Is converting the contents of DataFrames to np.array's best for computational tasks?
Is the .sum() routine in pandas significantly different from the .sum() routine in NumPy, or is the difference in speed due to label-based access to the data?
Why are these runtimes so different?
While reading the documentation I came across:
Section 7.1.1, "Fast scalar value getting and setting": Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the get_value method, which is implemented on all of the data structures:
In [656]: s.get_value(dates[5])
Out[656]: -0.67368970808837059
In [657]: df.get_value(dates[5], 'A')
Out[657]: -0.67368970808837059
Best guess:
Because I am accessing individual data elements many times from the DataFrame (on the order of ~640,000 lookups per matrix), I think the slowdown comes from how I referenced the data (i.e. "indexing with [] must handle a lot of cases"), and therefore I should be using the get_value() method for accessing scalars, similar to a matrix lookup.
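For what it's worth, if the data is a 0/1 occurrence matrix (an assumption; the question doesn't say), the whole pairwise loop can be replaced by a single matrix product, which avoids per-element access entirely. A sketch:
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randint(0, 2, size=(1000, 8)))  # toy 0/1 occurrence data

X = data.to_numpy(dtype=float)    # .values (or .as_matrix() in old pandas)
joint = X.T @ X                   # joint[i, j] = sum(col_i * col_j)
col_sums = X.sum(axis=0)
max_single = np.maximum.outer(col_sums, col_sums)
cond_prob = joint / max_single    # symmetric conditional probability matrix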
