I am trying to find the position (row index, column index) of the maximum cell in each column of a dataframe.
There are many similar dataframes, so I wrote the function below.
import pandas as pd

def FindPosition_max(series_max, txt_name):
    # series_max : only needed to get the number of columns in time_history(Index starts from 1).
    time_history = pd.read_csv(txt_name, skiprows=len(series_max)+7, header=None, usecols=[i for i in range(1,len(series_max)+1)])[:-2]
    col_index = series_max.index
    row_index = []
    for current_col_index in series_max.index:
        row_index.append(time_history.loc[:, current_col_index].idxmax())
    return row_index, col_index.tolist()
This works well, but it takes too long to run over many dataframes. I read that .apply() is much faster than a for loop, so I tried this:
def FindPosition_max(series_max, txt_name):
    time_history = pd.read_csv(txt_name, skiprows=len(series_max)+7, header=None, usecols=[i for i in range(1,len(series_max)+1)])[:-2]
    col_index = series_max.index
    row_index = pd.Series(series_max.index).apply(lambda x: time_history.loc[:, x].idxmax())
    return row_index, series_max.index.tolist()
It raises this error:
File "C:\Users\hwlee\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 844, in _list_of_series_to_arrays
indexer = indexer_cache[id(index)] = index.get_indexer(columns)
AttributeError: 'builtin_function_or_method' object has no attribute 'get_indexer'
I tried to find out what causes this error, but I cannot get rid of it. When I run the code from inside the function separately, it works well.
Could anyone help me solve this problem? Thank you!
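One possible way to avoid both the for loop and .apply() is DataFrame.idxmax(), which returns, for every column, the row label of that column's maximum. The sketch below is not a diagnosis of the AttributeError above. It assumes time_history has already been loaded, and the function name find_position_max and the sample data are hypothetical.

```python
import pandas as pd

def find_position_max(time_history):
    """Return (row labels, column labels) of the maximum of each column."""
    # idxmax() on a DataFrame works column by column and returns a Series
    # mapping each column label to the row label of that column's maximum.
    row_of_max = time_history.idxmax()
    return row_of_max.tolist(), row_of_max.index.tolist()

# Hypothetical sample: two columns labelled 1 and 2, as in the question.
sample = pd.DataFrame({1: [1, 5, 2], 2: [9, 0, 3]})
rows, cols = find_position_max(sample)
print(rows, cols)
```

Because the work is done inside pandas for all columns at once, there is no per-column Python call at all.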
I have the following problem.
I am working with PySpark to build a model for a regression problem.
df = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load(path)
I cast everything to double/int, which works well.
With this line I see no null values:
df.describe().toPandas()
I separate the label from the features:
rdd_ml = df.rdd.map(lambda x: (x[0], DenseVector(x[1:])))
But the following line throws the error raise ValueError("The first row in RDD is empty, "
df_ml = spark.createDataFrame(rdd_ml, ['label', 'features'])
If I execute this line, for example, I see nothing strange:
df.rdd.collect()
Is there something that I am missing here?
Thank you for any advice on this.
Ok, I've got a weird one. I might have found a bug, but let's assume I made a mistake first. Anyways, I am running into some issues with pandas.
I want to locate the last two rows of a dataframe to compare the values of column 'Col'. I run the code inside a for loop because it needs to run on all files in a folder. This code:
import pandas
for item in itemList:
    df = df[['Col']].tail(2)
    valA = df.iloc[1]['Col']
    valB = df.iloc[0]['Col']
It mostly works. I ran it over 1040 data frames without issues. Then at 1041 of about 2000 it raises this error:
Traceback (most recent call last):
File "/path/to/script.py", line 206, in <module>
valA = df.iloc[1]['Col']
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1373, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1830, in _getitem_axis
self._is_valid_integer(key, axis)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1713, in _is_valid_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
From this I thought, the data frame might be too short. It shouldn't be, I test for this elsewhere, but ok, mistakes happen so let's print(df) to figure this out.
If I print(df) before the assignment of .tail(2) like this:
print(df)
df = df[['Col']].tail(2)
valA = df.iloc[1]['Col']
valB = df.iloc[0]['Col']
I see a data frame of 37 rows. In my world, 37 > 2.
Now, let's move the print(df) down one line like so:
df = df[['Col']].tail(2)
print(df)
The output is usually two rows, as one would expect. However, at the point of the error, df.tail(2) returns a data frame with a single row out of a data frame with 37 rows. Not two rows, one row. This only happens for one item in the loop. All others work fine. If I skip over the item manually like so:
for item in itemList:
    if item == 'troublemaker':
        continue
... the script runs through to the end. No errors happen.
I must add, I am fairly new to all this, so I might overlook something entirely. Am I? Suggestions appreciated. Thanks.
Edit: Here's the output of print(df) in case of the error
              Col
Date
2018-11-30   True
and in all other cases:
              Col
Date
2018-10-31  False
2018-11-30   True
The frame has no second row, which is why iloc[1] raises the error. Try using tail and head instead, but be aware that for your sample df, which has a single row, valA and valB will be the same value.
import pandas
for item in itemList:
    df = df[['Col']].tail(2)
    valA = df.tail(1)['Col']
    valB = df.head(1)['Col']
I don't think it's a bug, since it only happens to one df in 2000. Can you show that df?
I also don't think you need tail here. Have you tried
valA = df.iloc[-1]['Col']  # last row, as in the question
valB = df.iloc[-2]['Col']  # second-to-last row
to get the last values?
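Both positional variants still raise IndexError when a frame has fewer than two rows, which is the situation shown in the edit to the question. A minimal defensive sketch follows. The helper name last_two_values and the choice to return None for a missing row are assumptions for illustration, not part of either answer.

```python
import pandas as pd

def last_two_values(df, col="Col"):
    """Return (last, previous) values of `col`; a missing row yields None."""
    tail = df[col].tail(2)  # at most two rows, fewer if the frame is short
    last = tail.iloc[-1] if len(tail) >= 1 else None
    previous = tail.iloc[-2] if len(tail) >= 2 else None
    return last, previous

# Frames shaped like the two printed outputs in the question.
dates = pd.to_datetime(["2018-10-31", "2018-11-30"])
two_rows = pd.DataFrame({"Col": [False, True]}, index=dates)
one_row = two_rows.tail(1)

val_a, val_b = last_two_values(two_rows)
only_a, missing_b = last_two_values(one_row)
```

Checking len() before indexing makes the short frame visible as a None instead of an exception, so the loop can log the offending item and continue.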
I've checked other posts and haven't found a solution to my problem. I'm getting the error I put in the subject after the code has been running fine for a while.
I'm simply trying to add a row to a holder dataframe that only appends rows that aren't similar to previously appended rows. You'll see that friend is checked against 'Target' and Target against 'Friend' in the query.
It iterates 71 times before giving me the error. 'cur' is the iterator, which is not included in this section of code. Here's the code:
same = df[(df['Source']==cur) & (df['StratDiff']==0)]
holder = pd.DataFrame(index=['pbp'],columns=['Source', 'Target', 'Friend', 'SS', 'TS', 'FS'])
holder.iloc[0:0]
i=1
for index, row in same.iterrows():
    Target = row['Target']
    stratcur = row['SourceStrategy']
    strattar = row['TargetStrategy']
    sametarget = df[(df['Source']==Target)]
    samejoin = pd.merge(same, sametarget, how='inner', left_on=['Target'],
                        right_on = ['Target'])
    for index, row in samejoin.iterrows():
        Friend = row['Target']
        stratfriend = row['TargetStrategy_x']
        #print(cur, Friend, Target)
        temp = holder[holder[(holder['Source']==cur) &
                      (holder['Target']==Friend) & (holder['Friend']==Target)]]
        if temp.isnull().values.any():
            holder.loc[i] = [cur,Target,Friend,stratcur,strattar,stratfriend]
            print(i, cur)
            i=i+1
I just want to update everyone. I was able to solve this. It took a while, but the problem was in the line where I query holder. It was too complex. I split it into multiple, simpler queries. It works fine now.
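The simplified queries themselves were not posted. As a rough sketch of what splitting the lookup into simpler steps could look like, the hypothetical helper below (already_recorded is an invented name) filters holder one column at a time and then checks whether any row is left. It assumes holder has the 'Source', 'Target' and 'Friend' columns from the code above.

```python
import pandas as pd

def already_recorded(holder, source, target, friend):
    """Return True if holder already has a row with this combination."""
    # One simple filter per step instead of a single nested expression.
    same_source = holder[holder["Source"] == source]
    same_target = same_source[same_source["Target"] == target]
    match = same_target[same_target["Friend"] == friend]
    return not match.empty

# Hypothetical holder with one stored row.
holder = pd.DataFrame(columns=["Source", "Target", "Friend", "SS", "TS", "FS"])
holder.loc[1] = ["a", "b", "c", 0, 0, 0]

found = already_recorded(holder, "a", "b", "c")
not_found = already_recorded(holder, "a", "c", "b")
```

In the loop above, the swapped check would be written as already_recorded(holder, cur, Friend, Target), and a new row would be appended only when it returns False.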
I am using the following function with a DataFrame:
df['error_code'] = df.apply(lambda row: replace_semi_colon(row), axis=1)
The embedded function is:
def replace_semi_colon(row):
    errrcd = str(row['error_code'])
    semi_colon_pat = re.compile(r'.*;.*')
    if pd.notnull(errrcd):
        if semi_colon_pat.match(errrcd):
            mod_error_code = str(errrcd.replace(';',':'))
            return mod_error_code
    return errrcd
But I am receiving the (in)famous
SettingWithCopyWarning
I have read many posts but still do not know how to prevent it.
The strange thing is that I use other apply functions the same way but they do not throw the same error.
Can someone explain why I am getting this warning?
Before the apply there was another statement:
df = df.query('error_code != "BM" and error_code != "PM"')
I modified that to:
df.loc[:] = df.query('error_code != "BM" and error_code != "PM"')
That solved it.
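A commonly suggested alternative is to take an explicit copy of the filtered frame, so that the later assignment clearly targets a new object rather than a possible view of the original. The sketch below is an assumption-laden illustration, not the poster's code: clean_error_codes is a hypothetical name, and it swaps the row-wise apply for the vectorized Series.str.replace, which leaves missing values as missing instead of turning them into the string 'nan'.

```python
import pandas as pd

def clean_error_codes(df):
    """Drop BM/PM rows, then replace ';' with ':' in error_code."""
    # .copy() makes the filtered result an independent DataFrame, so the
    # assignment below does not trigger SettingWithCopyWarning.
    filtered = df.query('error_code != "BM" and error_code != "PM"').copy()
    filtered["error_code"] = filtered["error_code"].str.replace(";", ":", regex=False)
    return filtered

# Hypothetical sample data.
raw = pd.DataFrame({"error_code": ["A;B", "BM", "PM", "C"]})
cleaned = clean_error_codes(raw)
```

The original frame raw is left untouched, which also makes it easier to compare before and after.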
Given a cell, I want to know the value of the cell right before it (same row, previous column).
Here is my code. I thought it was working, but:
def excel_test(col_num, sheet_object):
    for cell in sheet_object.columns[col_num]:
        prev_col = (column_index_from_string(cell.column))
        row = cell.row
        prev_cell = sheet_object.cell(row, prev_col)
I keep getting this error:
coordinate = coordinate.upper().replace('$', '')
builtins.AttributeError: 'int' object has no attribute 'upper'
I have also tried this:
def excel_test(col_num, sheet_object):
    for cell in sheet_object.columns[col_num]:
        prev_col = (column_index_from_string(cell.column))
        row = cell.row
        prev_cell = sheet_object.cell(row, get_column_letter(prev_col))
Can somebody tell me how I can access that cell? I have imported everything that needs to be imported.
You should look at the cell.offset() method.
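A minimal sketch of that suggestion, assuming a reasonably recent openpyxl where Cell.offset(row=0, column=0) takes relative offsets. The in-memory workbook and its values are made up for illustration.

```python
from openpyxl import Workbook

# Hypothetical in-memory workbook with two neighbouring cells.
wb = Workbook()
ws = wb.active
ws["A1"] = "left"
ws["B1"] = "right"

cell = ws["B1"]
# Same row (row=0), one column to the left (column=-1).
prev_cell = cell.offset(row=0, column=-1)
prev_value = prev_cell.value
print(prev_cell.coordinate, prev_value)
```

Since the offset is relative to the cell itself, no conversion between column letters and column numbers is needed. For cells in the first column there is no previous column, so that case needs its own check.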