Problem while trying to merge two columns of two different dataframes? - python

I am currently facing a problem I can't seem to solve regarding handling and manipulating dataframes with Pandas.
To give you an idea of the dataframes I'm talking about, which you'll see in my code:
I'm trying to replace the words in column 'exercise' of the dataset 'data' with the words in column 'name' of the dataset 'exercise'.
For example, the acronym 'Dl' in the exercise column of the 'data' dataset should be changed to 'Dead lifts' from the 'name' column of the 'exercise' dataset.
I have tried many methods, but they all fail with the same error.
Here is my code with the methods I tried:
### Method 1 ###
# Rename the label column in 'exercise' to 'exercise'
exercise = exercise.rename(columns={'label': 'exercise'})
# Merge on the now-shared 'exercise' column
data = pd.merge(data, exercise, how='left', on='exercise')

### Method 2 ###
data.merge(exercise, left_on='exercise', right_on='label')

### Method 3 ###
data['exercise'] = data['exercise'].astype('category')
EXERCISELIST = exercise['name'].copy().to_list()
data['exercise'].cat.rename_categories(new_categories=EXERCISELIST, inplace=True)

### Same error, new dataset ###
# Rename the description column in 'area' to 'area'
area = area.rename(columns={'description': 'area'})
# Merge on the now-shared 'area' column
data = pd.merge(data, area, how='left', on='area')
This is the error I get:
Traceback (most recent call last):
File "---", line 232, in
data.to_frame().merge(exercise, left_on='exercise', right_on='label')
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 8192, in merge
return merge(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 74, in merge
op = _MergeOperation(
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 668, in init
) = self._get_merge_keys()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 1046, in _get_merge_keys
left_keys.append(left._get_label_or_level_values(lk))
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/generic.py", line 1683, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'exercise'
Is someone able to help me with this? Thank you very much in advance.

Merge, then drop and rename columns between data and area; then merge, drop and rename columns between the result of step 1 and exercise:
area = pd.DataFrame({"arealabel": ["AGI", "BAL"],
                     "description": ["Agility", "Balance"]})
exercise = pd.DataFrame({"description": ["Jump rope", "Dead lifts"],
                         "label": ["Jr", "Dl"]})
data = pd.DataFrame({"exercise": ["Dl", "Dl"],
                     "area": ["AGI", "BAL"],
                     "level": [0, 3]})
(data.merge(area, left_on="area", right_on="arealabel")
     .drop(columns=["arealabel", "area"])
     .rename(columns={"description": "area"})
     .merge(exercise, left_on="exercise", right_on="label")
     .drop(columns=["exercise", "label"])
     .rename(columns={"description": "exercise"})
)
   level     area    exercise
0      0  Agility  Dead lifts
1      3  Balance  Dead lifts
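An alternative sketch, using the same sample frames: build a lookup Series from each reference table and use Series.map, which avoids the merge/drop/rename round trip.

# Map each code to its full name via a lookup Series
# (label -> description, arealabel -> description).
data['exercise'] = data['exercise'].map(exercise.set_index('label')['description'])
data['area'] = data['area'].map(area.set_index('arealabel')['description'])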

Related

Cannot concatenate object when adding to a DataFrame

I am trying to add a sentence as well as a coin (a label of sorts, in this case) to a DataFrame, but I keep getting this error:
Traceback (most recent call last):
File "c:\Users\gjohn\Documents\code\machineLearning\trading_bot\filter.py", line 132, in <module>
df = df.append({'coin': coin, 'review': sentence}, ignore_index=True)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 2877, in append
return concat(
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\reshape\concat.py", line 294, in concat
op = _Concatenator(
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\reshape\concat.py", line 384, in __init__
raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'dict'>'; only Series and DataFrame objs are valid
Here is the code:
data = pd.read_csv('C:\\Users\\gjohn\\Documents\\code\\machineLearning\\trading_bot\\testreviews.csv')
df = data['review'] # Create a dataframe of the reviews.
classes = data['class'] # Create a dataframe of the classes.
for sentence in sentences:
    coin = find_coin(common_words, sentence)
    if len(sentence) > 0 and coin != None:
        df = df.append({'coin': coin, 'review': sentence}, ignore_index=True)
I can't find how to fix this and I really need help, it would be great if you could help me out. Thanks!
Also sorry for the messy code :D
What is the sentence you use to construct the dictionary?
Perhaps you should check if the dictionary is constructed correctly?
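For what it's worth, the traceback points into pandas/core/series.py: data['review'] is a Series, not a DataFrame, and Series.append only accepts Series or DataFrame objects, never a plain dict. A minimal sketch of one workaround, collecting rows first and building the frame once (find_coin, common_words and sentences here are stand-ins for the question's objects):

import pandas as pd

# Hypothetical stand-ins for the question's variables.
sentences = ["BTC looks strong today", "no coin mentioned here"]
common_words = []
def find_coin(words, sentence):
    return "BTC" if "BTC" in sentence else None

rows = []  # collect plain dicts, then build the DataFrame once
for sentence in sentences:
    coin = find_coin(common_words, sentence)
    if len(sentence) > 0 and coin is not None:
        rows.append({'coin': coin, 'review': sentence})

df = pd.DataFrame(rows, columns=['coin', 'review'])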

How to display a filtered dataframe using Python and Streamlit

I am new to Streamlit. I have a dataframe consisting of 3 columns, and I want to apply a filter to the dataframe using df.loc and then display the result as a table with Streamlit.
The filter is based on the columns and values the user selects from two drop-down lists, using a for loop to iterate over all the columns and values.
Screenshot:
The expected result based on the picture: it must return an empty table, as there is no record that has loc3 & cat3.
When I run the code below it displays this error:
KeyError: 0
Traceback:
File "f:\aienv\lib\site-packages\streamlit\script_runner.py", line 333, in _run_script
exec(code, module.__dict__)
File "F:\AIenv\streamlit\app.py", line 309, in <module>
st.table(df.loc[(sidebars[0].str.startswith(sidebars[0][0])) & (sidebars[1].str.startswith(sidebars[1][1]) | (sidebars[2].str.startswith(sidebars[2][2]))),['source_number','location','category']])
code:
import numpy as np
import pandas as pd
import streamlit as st
df =pd.DataFrame({
"source_number":[11199,11328,11287,32345,12342,1232,13456,123244,13456],
"location":["loc2","loc1","loc3","loc1","loc2","loc2","loc3","loc2","loc1"],
"category":["cat1","cat2","cat1","cat3","cat3","cat3","cat2","cat3","cat2"],
})
is_check = st.checkbox("Display Data")
if is_check:
st.table(df)
columns = st.sidebar.multiselect("Enter the variables", df.columns)
st.write("You selected these columns", columns)
sidebars = {}
for y in columns:
ucolumns=list(df[y].unique())
sidebars[y+"filter"]=st.sidebar.multiselect('Filter '+y, ucolumns)
st.write(sidebars)
Where is the error in my code, and how do I display the correct result?
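One reading of the traceback: sidebars is a dict keyed by strings such as 'locationfilter', so sidebars[0] raises KeyError: 0; its values are also plain Python lists, which have no .str accessor. A sketch of one possible fix, under the assumption that the dict is keyed by the column name itself and that every non-empty selection should be AND-combined:

# Assumption: key the selections by column name (not y + "filter")
# and build a single boolean mask from them.
sidebars = {}
for y in columns:
    sidebars[y] = st.sidebar.multiselect('Filter ' + y, list(df[y].unique()))

mask = pd.Series(True, index=df.index)
for col, chosen in sidebars.items():
    if chosen:                        # ignore columns with no selection
        mask &= df[col].isin(chosen)

# An impossible combination (e.g. loc3 & cat3) now yields an empty table.
st.table(df.loc[mask, ['source_number', 'location', 'category']])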

Pandas TypeError when trying to count NaNs in subset of dataframe column

I'm writing a script to perform LLoD analysis for qPCR assays for my lab. I import the relevant columns from the .csv of data from the instrument using pandas.read_csv() with the usecols parameter, make a list of the unique values of RNA quantity/concentration column, and then I need to determine the detection rate / hit rate at each given concentration. If the target is detected, the result will be a number; if not, it'll be listed as "TND" or "Undetermined" or some other non-numeric string (depends on the instrument). So I wrote a function that (should) take a quantity and the dataframe of results and return the probability of detection for that quantity. However, on running the script, I get the following error:
Traceback (most recent call last):
File "C:\Python\llod_custom.py", line 34, in <module>
prop[idx] = hitrate(val, data)
File "C:\Python\llod_custom.py", line 29, in hitrate
df = pd.to_numeric(list[:,1], errors='coerce').isna()
File "C:\Users\wmacturk\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\wmacturk\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '(slice(None, None, None), 1)' is an invalid key
The idea in the line that's throwing the error (df = pd.to_numeric(list[:,1], errors='coerce').isna()) is to change any non-numeric values in the column to NaN, then get a boolean array telling me whether a given row's entry is NaN, so I can count the number of numeric entries with df.sum() later.
I'm sure it's something that should be obvious to anyone who's worked with pandas / dataframes, but I haven't used dataframes in python before, so I'm at a loss. I'm also much more familiar with C and JavaScript, so something like python that isn't as rigid can actually be a bit confusing since it's so flexible. Any help would be greatly appreciated.
N.B. the conc column will consist of 5 to 10 different values, each repeated 5-10 times (i.e. 5-10 replicates at each of the 5-10 concentrations); the detect column will contain either a number or a character string in each row -- numbers mean success, strings mean failure... For my purposes the value of the numbers is irrelevant, I only need to know if the target was detected or not for a given replicate. My script (up to this point) follows:
import os
import pandas as pd
import numpy as np
import statsmodels as sm
from scipy.stats import norm
from tkinter import filedialog
from tkinter import *

# initialize tkinter
root = Tk()
root.withdraw()

# prompt for data file and column headers, then read those columns into a dataframe
print("In the directory prompt, select the .csv file containing data for analysis")
path = filedialog.askopenfilename()
conc = input("Enter the column header for concentration/number of copies: ")
detect = input("Enter the column header for target detection: ")
tnd = input("Enter the value listed when a target is not detected (e.g. \"TND\", \"Undetected\", etc.): ")
data = pd.read_csv(path, usecols=[conc, detect])

# create list of unique values for quantity of RNA, initialize vectors of same length
# to store probabilities and probit scores for each
qtys = data[conc].unique()
prop = probit = [0] * len(qtys)

# Function to get the hitrate/probability of detection for a given quantity
def hitrate(qty, dataFrame):
    list = dataFrame[dataFrame.iloc[:,0] == qty]
    df = pd.to_numeric(list[:,1], errors='coerce').isna()  # <- the line from the traceback
    return (len(df) - (len(df)-df.sum()))/len(df)

# iterate over quantities to calculate the corresponding probability of detection
# and its associated probit score
for idx, val in enumerate(qtys):
    prop[idx] = hitrate(val, data)
    probit[idx] = norm.ppf(hitrate(val, data))

# create an array of the quantities with their associated probabilities & probit scores
hitTable = np.vstack([qtys, prop, probit])
sample dataframe can be created with:
d = {'qty':[1,1,1,1,1, 10,10,10,10,10, 20,20,20,20,20, 50,50,50,50,50, 100,100,100,100,100], 'result':['TND','TND','TND',5,'TND', 'TND',5,'TND',5,'TND', 5,'TND',5,'TND',5, 5,6,5,5,'TND', 5,5,5,5,5]}
exData = pd.DataFrame(data=d)
Then just use exData as the dataframe data in the original code
EDIT: I've fixed the problem by tweaking Loic RW's answer slightly. The function hitrate should be

def hitrate(qty, df):
    t_s = df[df.qty == qty].result
    t_s = t_s.apply(pd.to_numeric, args=('coerce',)).isna()
    return (len(t_s) - t_s.sum()) / len(t_s)
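For comparison, a sketch of the direct positional fix for the original failing line: plain [] on a DataFrame does not accept a (rows, column) tuple, which is exactly what the '(slice(None, None, None), 1)' key in the traceback is; positional selection needs .iloc.

def hitrate(qty, dataFrame):
    subset = dataFrame[dataFrame.iloc[:, 0] == qty]
    # .iloc[:, 1] selects the second column positionally; notna() marks
    # the rows whose result parsed as a number, i.e. the detections.
    detected = pd.to_numeric(subset.iloc[:, 1], errors='coerce').notna()
    return detected.sum() / len(detected)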
Does the following achieve what you want? I made some assumptions about the structure of your data.

def hitrate(qty, df):
    target_subset = df[df.qty == qty].target
    target_subset = target_subset.apply(pd.to_numeric, args=('coerce',)).isna()
    return 1 - (target_subset.sum() / len(target_subset))
If I run the following:

data = pd.DataFrame({'qty': [1, 2, 2, 2, 3],
                     'target': [.5, .8, 'TND', 'Undetermined', .99]})
hitrate(2, data)
I get:
0.33333333333333337

Have I found a bug or made a mistake in pandas df.tail()?

Ok, I've got a weird one. I might have found a bug, but let's assume I made a mistake first. Anyways, I am running into some issues with pandas.
I want to locate the last two rows of a dataframe to compare the values of column 'Col'. I run the code inside a for loop because it needs to run on all files in a folder. This code:
import pandas
for item in itemList:
    df = df[['Col']].tail(2)
    valA = df.iloc[1]['Col']
    valB = df.iloc[0]['Col']
Works mostly. I ran it over 1040 data frames without issues. Then at 1041 of about 2000 it causes this error:
Traceback (most recent call last):
File "/path/to/script.py", line 206, in <module>
valA = df.iloc[1]['Col']
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1373, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1830, in _getitem_axis
self._is_valid_integer(key, axis)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1713, in _is_valid_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
From this I thought the data frame might be too short. It shouldn't be (I test for this elsewhere), but OK, mistakes happen, so let's print(df) to figure this out.
If I print(df) before the assignment of .tail(2) like this:
print(df)
df = df[['Col']].tail(2)
valA = df.iloc[1]['Col']
valB = df.iloc[0]['Col']
I see a data frame of 37 rows. In my world, 37 > 2.
Now, let's move the print(df) down one line like so:
df = df[['Col']].tail(2)
print(df)
The output is usually two lines, as one would expect. At the error, however, df.tail(2) returns a single-row data frame out of a data frame with 37 rows. Not two rows, one row. This only happens for one item in the loop; all others work fine. If I skip over that item manually like so:
for item in itemList:
    if item == 'troublemaker':
        continue
... the script runs through to the end. No errors happen.
I must add, I am fairly new to all this, so I might overlook something entirely. Am I? Suggestions appreciated. Thanks.
Edit: Here's the output of print(df) in case of the error:

            Col
Date
2018-11-30  True

and in all other cases:

            Col
Date
2018-10-31  False
2018-11-30  True
Since the one-row frame does not have a second positional index, that is why it returns the error. Try using tail and head instead; be aware that for your single-row sample df, valA and valB will be the same value:

import pandas
for item in itemList:
    df = df[['Col']].tail(2)
    valA = df.tail(1)['Col']
    valB = df.head(1)['Col']
I don't think it's a bug since it only happens to one df in 2000. Can you show that df?
I also don't think you need tail here; have you tried
valA = df.iloc[-2]['Col']
valB = df.iloc[-1]['Col']
to get the last values.
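If an occasional short frame is expected, a small defensive sketch (assuming it is acceptable to skip such files) avoids the out-of-bounds access entirely:

# Inside the loop over files: skip frames shorter than two rows
# instead of indexing past the end.
tail = df[['Col']].tail(2)
if len(tail) < 2:
    continue  # or log `item` to find the troublemaker
valA = tail.iloc[-1]['Col']
valB = tail.iloc[-2]['Col']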

pandas crashes on repeated DataFrame.reset_index()

Very weird bug here: I'm using pandas to merge several dataframes. As part of the merge, I have to call reset_index several times. But when I do, it crashes unexpectedly on the second or third use of reset_index.
Here's minimal code to reproduce the error:
import pandas
A = pandas.DataFrame({
    'val': ['aaaaa', 'acaca', 'ddddd', 'zzzzz'],
    'extra': range(10, 14),
})
A = A.reset_index()
A = A.reset_index()
A = A.reset_index()
Here's the relevant part of the traceback:
....
A = A.reset_index()
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2393, in reset_index
new_obj.insert(0, name, _maybe_cast(self.index.values))
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1787, in insert
self._data.insert(loc, column, value)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 893, in insert
raise Exception('cannot insert %s, already exists' % item)
Exception: cannot insert level_0, already exists
Any idea what's going wrong here? How do I work around it?
Inspecting frame.py, it looks like pandas tries to insert a column named 'index' or 'level_0'; if the name it wants is already taken, it throws the error.
Fortunately, there's a drop option. As far as I can tell, this discards the old index instead of inserting it as a column, and replaces it with the new, reset index. This might get you in trouble if you have a column named 'index', but I think otherwise you're okay.
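To illustrate the naming sequence (a sketch; the exact exception type varies between pandas versions):

import pandas

A = pandas.DataFrame({'val': ['a', 'b']})
A = A.reset_index()      # inserts a column named 'index'
A = A.reset_index()      # 'index' is taken, so it inserts 'level_0'
print(list(A.columns))   # ['level_0', 'index', 'val']
# A = A.reset_index()    # would fail: 'level_0' already exists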
"Fixed" code:
import pandas
A = pandas.DataFrame({
    'val': ['aaaaa', 'acaca', 'ddddd', 'zzzzz'],
    'extra': range(10, 14),
})
A = A.reset_index(drop=True)
A = A.reset_index(drop=True)
A = A.reset_index(drop=True)
You can also use:
A.reset_index(drop=True, inplace=True)
Note that with inplace=True the method returns None, so there is nothing to reassign.
