pandas crashes on repeated DataFrame.reset_index() - python

Very weird bug here: I'm using pandas to merge several dataframes. As part of the merge, I have to call reset_index several times. But when I do, it crashes unexpectedly on the second or third use of reset_index.
Here's minimal code to reproduce the error:
import pandas
A = pandas.DataFrame({
    'val': ['aaaaa', 'acaca', 'ddddd', 'zzzzz'],
    'extra': range(10, 14),
})
A = A.reset_index()
A = A.reset_index()
A = A.reset_index()
Here's the relevant part of the traceback:
....
A = A.reset_index()
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2393, in reset_index
new_obj.insert(0, name, _maybe_cast(self.index.values))
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1787, in insert
self._data.insert(loc, column, value)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 893, in insert
raise Exception('cannot insert %s, already exists' % item)
Exception: cannot insert level_0, already exists
Any idea what's going wrong here? How do I work around it?

Inspecting frame.py, it looks like reset_index saves the old index as a new column named 'index', or 'level_0' if a column named 'index' already exists. The third call fails because both names are already taken, so pandas throws the error.
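To make the naming scheme concrete, here is a quick sketch (observed behavior on a toy frame, not from the original post):

import pandas

A = pandas.DataFrame({'val': ['a', 'b'], 'extra': [1, 2]})
print(A.reset_index().columns.tolist())
# ['index', 'val', 'extra'] -- the old index is saved as 'index'
print(A.reset_index().reset_index().columns.tolist())
# ['level_0', 'index', 'val', 'extra'] -- 'index' is taken, so 'level_0' is used
# A third bare reset_index() would try 'level_0' again and raise.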
Fortunately, there's a "drop" option. AFAICT, this discards the old index entirely instead of inserting it as a column, so the name collision never happens. Note that the old index values are thrown away, so only use it if you don't need them.
"Fixed" code:
import pandas
A = pandas.DataFrame({
    'val': ['aaaaa', 'acaca', 'ddddd', 'zzzzz'],
    'extra': range(10, 14),
})
A = A.reset_index(drop=True)
A = A.reset_index(drop=True)
A = A.reset_index(drop=True)

You can use:
A.reset_index(drop=True, inplace=True)
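This modifies A in place and returns None, so no reassignment is needed.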

Related

How to fix RDD empty in pyspark

I have the following problem: I am working with PySpark to build a model for a regression problem.
df = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load(path)
I cast everything into double/int, and it works well. With this line I see no null values:
df.describe().toPandas()
I separate the label from the features:
rdd_ml = df.rdd.map(lambda x: (x[0], DenseVector(x[1:])))
But this line throws the error raise ValueError("The first row in RDD is empty, ":
df_ml = spark.createDataFrame(rdd_ml, ['label', 'features'])
If I execute this line, for example, I see nothing strange:
df.rdd.collect()
Is there something that I am missing here?
Thank you if you have any advice on this.
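For what it's worth, a minimal sketch of the same label/features pattern on a hand-built frame (toy data, not the original CSV): if this runs cleanly but the real file fails, the failure most likely comes from rows that end up empty or malformed after the casts and the DROPMALFORMED option.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import DenseVector

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the CSV: first column is the label, the rest are features.
df = spark.createDataFrame([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], ["label", "f1", "f2"])

rdd_ml = df.rdd.map(lambda x: (x[0], DenseVector(x[1:])))
df_ml = spark.createDataFrame(rdd_ml, ['label', 'features'])
df_ml.show()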

Pandas 1.3.3 ValueError: Columns must be same length as key (no duplicated columns and same shape result)

I have recently updated pandas to version 1.3.3 (in version 1.2.0 this code works perfectly).
I am trying to do the following:
self._df[cts.NUMERIC_COLS] = self._df[cts.NUMERIC_COLS].apply(pd.to_numeric, errors='coerce')
where cts.NUMERIC_COLS is a list of column names that should be parsed.
I've checked that my dataframe has no duplicated columns and I am getting the following error:
File "/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py", line 3641, in _setitem_array
self[k1] = value[k2]
File "/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py", line 3602, in __setitem__
self._set_item_frame_value(key, value)
File "/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py", line 3729, in _set_item_frame_value
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
When executing that line:
self._df[cts.NUMERIC_COLS].shape >> (1142, 58)
self._df[cts.NUMERIC_COLS].apply(pd.to_numeric, errors='coerce').shape >> (1142, 58)
I've tried different approaches but without success. Am I doing something wrong? Has this happened to anyone else?
Thank you to all in advance!
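One hedged workaround, assuming the goal is only to coerce those columns: assign one column at a time, which sidesteps the multi-column assignment path that raises in 1.3.3 (the generic names below stand in for self._df and cts.NUMERIC_COLS):

import pandas as pd

df = pd.DataFrame({'a': ['1', 'x'], 'b': ['2.5', '3']})
numeric_cols = ['a', 'b']  # stands in for cts.NUMERIC_COLS

# Coerce column by column instead of df[cols] = frame.
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')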

Problem while trying to merge two columns of two different dataframes?

I am currently facing a problem that I don't seem to be able to solve with regards to handling and manipulating dataframes using Pandas.
To give you an idea of the dataframes I'm talking about and that you'll see in my code:
I'm trying to change the words found in column 'exercise' of the dataset 'data' with the words found in column 'name' of the dataset 'exercise'.
For example, the acronym 'Dl' in the exercise column of the 'data' dataset should be changed into 'Dead lifts' found in the 'name' column of the 'exercise' dataset.
I have tried many methods but all have seemed to fail. I receive the same error every time.
Here is my code with the methods I tried:
### Method 1 ###
# Rename Name Column in 'exercise'
exercise = exercise.rename(columns={'label': 'exercise'})
# Merge Exercise Columns in 'exercise' and in 'data'
data = pd.merge(data, exercise, how = 'left', on='exercise')
### Method 2 ###
data.merge(exercise, left_on='exercise', right_on='label')
### Method 3 ###
data['exercise'] = data['exercise'].astype('category')
EXERCISELIST = exercise['name'].copy().to_list()
data['exercise'].cat.rename_categories(new_categories = EXERCISELIST, inplace = True)
### Same Error, New dataset ###
# Rename Name Column in 'area'
area = area.rename(columns={'description': 'area'})
# Merge Exercise Columns in 'exercise' and in 'data'
data = pd.merge(data, area, how = 'left', on = 'area')
This is the error I get:
Traceback (most recent call last):
File "---", line 232, in
data.to_frame().merge(exercise, left_on='exercise', right_on='label')
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 8192, in merge
return merge(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 74, in merge
op = _MergeOperation(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 668, in init
) = self._get_merge_keys()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 1046, in _get_merge_keys
left_keys.append(left._get_label_or_level_values(lk))
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/generic.py", line 1683, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'exercise'
Is someone able to help me with this? Thank you very much in advance.
Step 1: merge, then drop and rename columns between data and area.
Step 2: merge, then drop and rename columns between the result of step 1 and exercise.
area = pd.DataFrame({"arealabel": ["AGI", "BAL"],
                     "description": ["Agility", "Balance"]})
exercise = pd.DataFrame({"description": ["Jump rope", "Dead lifts"],
                         "label": ["Jr", "Dl"]})
data = pd.DataFrame({"exercise": ["Dl", "Dl"],
                     "area": ["AGI", "BAL"],
                     "level": [0, 3]})

(data.merge(area, left_on="area", right_on="arealabel")
     .drop(columns=["arealabel", "area"])
     .rename(columns={"description": "area"})
     .merge(exercise, left_on="exercise", right_on="label")
     .drop(columns=["exercise", "label"])
     .rename(columns={"description": "exercise"})
)
   level     area    exercise
0      0  Agility  Dead lifts
1      3  Balance  Dead lifts
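A hedged alternative on the same toy data: since the goal is just to swap codes for names, Series.map with a lookup dict avoids the merges entirely:

import pandas as pd

exercise = pd.DataFrame({"description": ["Jump rope", "Dead lifts"],
                         "label": ["Jr", "Dl"]})
data = pd.DataFrame({"exercise": ["Dl", "Dl"],
                     "area": ["AGI", "BAL"],
                     "level": [0, 3]})

# Build a code -> name lookup and map it over the column.
lookup = dict(zip(exercise["label"], exercise["description"]))
data["exercise"] = data["exercise"].map(lookup)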

How to iterate over a CSV file with Pywikibot

I wanted to try uploading a series of items to test.wikidata, creating the item and then adding a statement of inception P571. The csv file sometimes has a date value, sometimes not. When no date value is given, I want to write out a placeholder 'some value'.
Imagine a dataframe like this:
df = pd.DataFrame({'Object': [1, 2, 3], 'Date': [250, None, 300]})
However, I am not sure how to iterate over a CSV file with Pywikibot to create an item for each row and add a statement. Here is the code I wrote:
import pywikibot
import pandas as pd

site = pywikibot.Site("test", "wikidata")
repo = site.data_repository()
df = pd.read_csv('experiment.csv')
item = pywikibot.ItemPage(repo)
for item in df:
    date = df['date']
    prop_date = pywikibot.Claim(repo, u'P571')
    if date == '':
        prop_date.setSnakType('somevalue')
    else:
        target = pywikibot.WbTime(year=date)
        prop_date.setTarget(target)
    item.addClaim(prop_date)
When I run this through PAWS, I get the message: KeyError: 'date'
But I think the real issue here is that I am not sure how to get Pywikibot to iterate over each row of the dataframe and create a new claim for each new date value. I would value any feedback or suggestions for good examples and documentation. Many thanks!
Looking back on this, the solution was to use .iterrows() or .itertuples() or .loc[] to access the values in the row.
So:
for row in df.itertuples():
    item = pywikibot.ItemPage(repo)  # create a fresh item for each row
    prop_date = pywikibot.Claim(repo, u'P571')
    if pd.isna(row.Date):  # read_csv gives NaN for missing cells, not ''
        prop_date.setSnakType('somevalue')
    else:
        target = pywikibot.WbTime(year=int(row.Date))
        prop_date.setTarget(target)
    item.addClaim(prop_date)

Have I found a bug or made a mistake in pandas df.tail()?

Ok, I've got a weird one. I might have found a bug, but let's assume I made a mistake first. Anyways, I am running into some issues with pandas.
I want to look at the last two rows of a dataframe to compare the values of column 'Col'. I run the code inside a for loop because it needs to run on all files in a folder. This code:
import pandas
for item in itemList:
    df = df[['Col']].tail(2)
    valA = df.iloc[1]['Col']
    valB = df.iloc[0]['Col']
This works, mostly: I ran it over 1040 data frames without issues. Then, at number 1041 of about 2000, it causes this error:
Traceback (most recent call last):
File "/path/to/script.py", line 206, in <module>
valA = df.iloc[1]['Col']
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1373, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1830, in _getitem_axis
self._is_valid_integer(key, axis)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1713, in _is_valid_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
From this I thought the data frame might be too short. It shouldn't be (I test for this elsewhere), but OK, mistakes happen, so let's print(df) to figure this out.
If I print(df) before the assignment of .tail(2) like this:
print(df)
df = df[['Col']].tail(2)
valA = df.iloc[1]['Col']
valB = df.iloc[0]['Col']
I see a data frame of 37 rows. In my world, 37 > 2.
Now, let's move the print(df) down one line like so:
df = df[['Col']].tail(2)
print(df)
The output is usually two lines, as one would expect. However, at the error, df.tail(2) returns a single-row data frame out of a data frame with 37 rows. Not two rows: one row. This only happens for one item in the loop; all others work fine. If I skip over the item manually like so:
for item in itemList:
    if item == 'troublemaker':
        continue
... the script runs through to the end. No errors happen.
I must add, I am fairly new to all this, so I might be overlooking something entirely. Am I? Suggestions appreciated. Thanks.
Edit: Here's the output of print(df) in case of the error
             Col
Date
2018-11-30  True
and in all other cases:
              Col
Date
2018-10-31  False
2018-11-30   True
Since that one-row frame does not have a second positional index, df.iloc[1] raises the error. Try using tail and head instead; be aware that for your single-row sample df, valA and valB will then be the same value:
import pandas
for item in itemList:
    df = df[['Col']].tail(2)
    valA = df.tail(1)['Col']
    valB = df.head(1)['Col']
I don't think it's a bug, since it only happens to one df out of 2000. Can you show that df?
I also don't think you need tail here; have you tried
valA = df.iloc[-2]['Col']
valB = df.iloc[-1]['Col']
to get the last values?
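A hedged variant of the same idea that also guards against the one-row frame that triggered the IndexError (a sketch, not the asker's full loop):

import pandas as pd

df = pd.DataFrame({'Col': [False, True]},
                  index=pd.Index(['2018-10-31', '2018-11-30'], name='Date'))

tail = df[['Col']].tail(2)
if len(tail) >= 2:
    valA = tail['Col'].iloc[-1]  # last row
    valB = tail['Col'].iloc[-2]  # second-to-last row
else:
    # Only one row survived upstream processing; decide explicitly what to do.
    valA = valB = tail['Col'].iloc[-1]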
