Why am I receiving a KeyError after slicing my data? [duplicate]

This question already has answers here:
Problem with getting rid of specific columns [closed]
(2 answers)
Closed 3 years ago.
I have code that slices data and is then supposed to calculate different indices from the columns.
My code worked well, but today I had to slice the data differently, and since then I get a KeyError whenever I try to compute the indices.
Unfortunately I can't share my original data, but I hope this code helps in understanding what happened here.
This is my code with some explanations:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_plants = pd.read_csv('my_data')
#My data contains columns of numerical data whose column titles are numbers
#here I have converted those string titles into floats
float_cols = [float(i) for i in df_plants.columns.tolist()[4:] if type(i)==str]
df_plants.columns.values[4:] = float_cols
#detector edges removal
#Here my goal is to remove some of the columns that have wrong data.
#this part was added today and might be the reason for the problem
cols = df_plants.columns.tolist()
df_plants=df_plants[cols[:4] + cols[11:]].copy()
#Trying to calculate indices:
filter_plants['NDVI']=(filter_plants['801.03']-filter_plants['680.75'])/(filter_plants['801.03']+filter_plants['680.75'])
KeyError: '801.03'
In order to solve this problem I tried adding these lines again before the calculation:
float_cols = [float(i) for i in df_plants.columns.tolist()[4:] ]
df_plants.columns.values[4:] = float_cols
but I still got the KeyError.
My end goal is to be able to do calculations with my indices, and I believe the failure relates to the change in the type of the column labels.

Try changing the last line to:
filter_plants['NDVI']=(filter_plants[801.03]-filter_plants[680.75])/(filter_plants[801.03]+filter_plants[680.75])
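The reason this works: the earlier column conversion replaced the string labels with floats, so string keys like '801.03' no longer match anything. A minimal sketch of the mismatch (the values here are made up):
import pandas as pd

# Two columns whose labels start out as strings (values are made up)
df = pd.DataFrame([[0.5, 0.3]], columns=['801.03', '680.75'])
df.columns = [float(c) for c in df.columns]  # labels are now floats

# df['801.03'] would raise KeyError: the label is the float 801.03, not a string
print(df[801.03])  # float-key lookup succeeds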

Related

Print Pandas without dtype

I've read a few other posts about this but the other solutions haven't worked for me. I'm trying to look at 2 different CSV files and compare data from 1 column from each file. Here's what I have so far:
import pandas as pd
import numpy as np
dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataOrderTrimmed = dataBI.iloc[:,1:2].values
dataVA05 = pd.read_csv("U:/eu_inventory/VA05_Export.csv")
dataVAOrder = dataVA05.iloc[:,1:2].values
dataVAList = []
ordersInBoth = []
ordersInBI = []
ordersInVA = []
for order in np.nditer(dataOrderTrimmed):
    if order in dataVAOrder:
        ordersInBoth.append(order)
    else:
        ordersInBI.append(order)
So if the order number from dataOrderTrimmed is also in dataVAOrder I want to add it to ordersInBoth; otherwise I want to add it to ordersInBI. I think it splits the information correctly, but if I try to print ordersInBoth, each item prints as array(5555555, dtype=int64). I want a list of the order numbers, not arrays, and without the dtype information. Let me know if you need more information or if the way I've typed it out is confusing. Thanks!
The way you're using .iloc is giving you a DataFrame, which becomes a 2D array when you access .values. If you just want the values in the column at index 1, then you should just say:
dataOrderTrimmed = dataBI.iloc[:, 1].values
Then you can iterate over dataOrderTrimmed directly (i.e. you don't need nditer), and you will get regular scalar values.
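For instance (a small sketch with made-up data standing in for the CSV):
import pandas as pd

# Made-up stand-in for the "EO BI Orders.csv" data
dataBI = pd.DataFrame({"A": [1, 2, 3], "Order": [5555555, 5555556, 5555557]})

dataOrderTrimmed = dataBI.iloc[:, 1].values  # 1-D array of scalars, not a 2-D array
print(dataOrderTrimmed.tolist())             # [5555555, 5555556, 5555557] as plain ints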

How do you run a loop that takes data from a certain file and adds it to the elements of a 2D array in Python? [duplicate]

This question already has answers here:
Why does this iterative list-growing code give IndexError: list assignment index out of range? How can I repeatedly add (append) elements to a list?
(9 answers)
Closed 4 months ago.
I am basically trying to read data from an Excel file and I want to add it to an array. The Excel file has multiple rows and multiple columns, and I want them to be shown the exact same way in the array too.
This data needs to be shown in an array
from importlib.resources import open_binary
import openpyxl
import numpy as np
import array
import sys
wb = openpyxl.load_workbook("IEEE.xlsx")
bus1 = wb['33-Bus']
bus2 = wb['69-Bus']
rowMax = bus1.max_row
columnMax = bus1.max_column
print(rowMax,columnMax)
for i in range(1,rowMax+1):
    for j in range(1,columnMax+1):
        result = [bus1.cell(i,j).value]
        print(result)
I want the loop to run and add the first cell to the array, then run again and add the second element, and keep doing that until it hits the end of that row. Then I want it to create another array for the second row, and then add all of these row arrays together into a new array. So a 2D array. Can anyone help me with that? I have JavaScript knowledge and Python is kinda different, so I can't seem to figure this out.
Implementing what I said in my comment,
matrix = []
for i in range(1,rowMax+1):
    row = []
    for j in range(1,columnMax+1):
        row.append( bus1.cell(i,j).value )
    matrix.append(row)
Or
matrix = []
for i in range(1,rowMax+1):
    matrix.append( [bus1.cell(i,j+1).value for j in range(columnMax)] )
Or even
matrix = [
    [bus1.cell(i+1,j+1).value for j in range(columnMax)]
    for i in range(rowMax)
]
That said, pandas has a read_excel function that can do this in one step.
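A minimal sketch of that one-step route, assuming the same workbook and sheet name as above and that the sheet has no header row:
import pandas as pd

# header=None treats every row as data, mirroring the raw cell loops above
df = pd.read_excel("IEEE.xlsx", sheet_name="33-Bus", header=None)
matrix = df.values.tolist()  # nested lists, one inner list per row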

How to load a dataframe from a printed dataframe string? [duplicate]

This question already has answers here:
Create Pandas DataFrame from a string
(7 answers)
How to make good reproducible pandas examples
(5 answers)
Closed 3 years ago.
Often people ask questions on Stack Overflow with the output of print(dataframe). It is convenient to have a way of quickly loading that dataframe data into a pandas.DataFrame object.
What is/are the most advisable ways of loading a dataframe from a dataframe string (which may or may not be properly formatted)?
Example-1
If you want to load the following string as a dataframe what would you do?
# Dummy Data
s1 = """
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
"""
Example-2
This type is more similar to what you find in a CSV file.
# Dummy Data
s2 = """
Client, NumberOfProducts, ID
A, 1, 2
A, 5, 1
B, 1, 2
B, 6, 1
C, 9, 1
"""
Expected Output
References
Note: The following two links do not address the specific situation presented in Example-1. The reason I think my question is not a duplicate is that I think one cannot load the string in Example-1 using any of the solutions already posted on those links (at the time of writing).
Create Pandas DataFrame from a string. Note that pd.read_csv(StringIO(s1), sep), as suggested there, doesn't really work for Example-1; the frame comes out mis-parsed.
This question was marked as a duplicate of two Stack Overflow links. One of them is the one above, which fails to address the case presented in Example-1. The second one is "How to make good reproducible pandas examples". Among all the answers presented there, only one looked like it might work for Example-1, but it did not.
# could not read the clipboard and threw error
pd.read_clipboard(sep='\s\s+')
Error Thrown:
PyperclipException:
Pyperclip could not find a copy/paste mechanism for your system.
For more information, please visit https://pyperclip.readthedocs.org
I can suggest two methods to approach this problem.
Method-1
Process the string with regex and numpy to make the dataframe. In my experience this works most of the time, and it handles the case presented in "Example-1".
import pandas as pd
import numpy as np
import re

# Make Dataframe
# s = s1
ncols = 3  # number of columns
ss = re.sub(r'\s+', ',', s.strip())
sa = np.array(ss.split(',')).reshape(-1, ncols)
df = pd.DataFrame(dict((k, v) for k, v in zip(sa[0, :], sa[1:, :].T)))
df
Method-2
Use io.StringIO to feed the string into pandas.read_csv(). But this only works if the separator is well defined, for instance if your data looks similar to "Example-2". Source credit
import pandas as pd
from io import StringIO
# Make Dataframe
# s = s2
df = pd.read_csv(StringIO(s), sep=',')
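One caveat worth adding (my note, not part of the original answer): with Example-2's data each field carries a blank after the comma, so the parsed headers and values keep leading spaces. read_csv's skipinitialspace flag handles that:
import pandas as pd
from io import StringIO

s2 = """
Client, NumberOfProducts, ID
A, 1, 2
A, 5, 1
B, 1, 2
B, 6, 1
C, 9, 1
"""

# skipinitialspace=True drops the blank after each delimiter, so the
# columns parse as 'NumberOfProducts' rather than ' NumberOfProducts'
df = pd.read_csv(StringIO(s2), sep=',', skipinitialspace=True)
print(df)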

Delete series value from row of a pandas data frame based on another data frame value

My question is a little bit different from the question posted here,
so I thought to open a new thread. I have a pandas data frame with 5 attributes. One of these attributes is created using a pandas Series. Here is the sample code for creating the data frame:
import pandas as pd
import numpy as np
mydf1=pd.DataFrame(columns=['group','id','name','mail','gender'])
data = np.array([2540948, 2540955, 2540956,2540956,7138932])
x=pd.Series(data)
mydf1.loc[0]=[1,x,'abc','abc#xyz.com','male']
I have another data frame; the code for creating it is given below:
mydf2=pd.DataFrame(columns=['group','id'])
data1 = np.array([2540948, 2540955, 2540956])
y=pd.Series(data1)
mydf2.loc[0]=[1,y]
These are sample data; the actual data will have a large number of rows, and the series are long too. I want to match mydf1 against mydf2 (sometimes there will be no matching element in mydf2) and delete the id values from mydf1 that are also present in mydf2. For example, after the run, the id for group 1 should be 2540956, 7138932. I also tried the code mentioned in the above link, but for the first line
counts = mydf1.groupby('id').cumcount()
I got this error message:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
in my Python 3.x. Can you please suggest how to solve this?
This should work. We use Counter to find the difference between two lists of ids. (P.S. This problem does not require the difference to be in order.)
Setup
import pandas as pd
import numpy as np
from collections import Counter
mydf1=pd.DataFrame(columns=['group','id','name','mail','gender'])
x = [2540948, 2540955, 2540956,2540956,7138932]
y = [2540948, 2540955, 2540956,2540956,7138932]
mydf1.loc[0]=[1,x,'abc','abc#xyz.com','male']
mydf1.loc[1]=[2,y,'def','def#xyz.com','female']
mydf2=pd.DataFrame(columns=['group','id'])
x2 = np.array([2540948, 2540955, 2540956])
y2 = np.array([2540955, 2540956])
mydf2.loc[0]=[1,x2]
mydf2.loc[1]=[2,y2]
Code
mydf3 = mydf1[["group", "id"]]
mydf3 = mydf3.merge(mydf2, how="inner", on="group")
new_id_finder = lambda x: list((Counter(x.id_x) - Counter(x.id_y)).elements())
mydf3["new_id"] = mydf3.apply(new_id_finder, 1)
mydf3["new_id"]
group new_id
0 1 [2540956, 7138932]
1 2 [2540948, 2540956, 7138932]
One Counter object can subtract another to get the difference in occurrences of elements. Then you can use the elements() method to retrieve all the values that remain.
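A quick illustration of that subtraction, using the ids from the question:
from collections import Counter

left = Counter([2540948, 2540955, 2540956, 2540956, 7138932])
right = Counter([2540948, 2540955, 2540956])

# Subtraction keeps only the surplus occurrences from the left side
print(list((left - right).elements()))  # [2540956, 7138932]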

Categorizing CSV data by groups defined through string values

So I am trying to organize data from a CSV file using pandas so that I can graph it in matplotlib. I have different rows of values, some of which are control and others experimental. I am able to separate the rows to graph, however I cannot seem to make it work; I have attempted for loops (seen below) to graph, but I keep getting 'TypeError: 'type' object is not subscriptable'.
import pandas as pd
import numpy as np
import matplotlib as plt
df = pd.read_csv('C:\\Users\\User\\Desktop\\Ubiome samples\\samples.csv')
group = (df['Group'])
count = (df['Count'])
time = (df['Time'])
for steps in range [group]:
    plt.plot([time],[count],'bs')
    plt.show()
There is a typo in your for loop:
for steps in range [group]:
Should be
for steps in range(group):
Your for loop tries to call __getitem__ on the range type, but since this method isn't defined for range, you get a TypeError: 'type' object is not subscriptable. Check the Python documentation for __getitem__() for more details.
However, you cannot use range on a pandas Series to loop over every item in it, since range expects integers as its input. Instead you should use:
for steps in group:
This will loop over every row in your csv file, and output the exact same plot for each row. I'm quite sure this is not what you actually want to do.
If I understand your question well, you want to plot each group of experimental/control values you have in your csv.
Then you should try (untested):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('C:\\Users\\User\\Desktop\\Ubiome samples\\samples.csv')
for group in df['Group'].unique():
    group_data = df[df['Group'] == group]
    plt.plot(group_data['Time'], group_data['Count'], 'bs')
plt.show()
for group in df['Group'].unique() will loop over every piece of data in the Group column, ignoring duplicates.
For instance, if your column has 1000 strings in it, but all of these strings are either "experimental" or "control", then this will loop over ['experimental', 'control'] (actually a numpy array; also, note that unique() doesn't sort, so the order of the output depends on the order of the input).
df[df['Group'] == group] will then select all the rows where the column 'Group' is equal to group.
Check the pandas documentation for the where method and boolean masking for more details.
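As a tiny illustration of that masking step (with made-up data):
import pandas as pd

# Made-up rows standing in for samples.csv
df = pd.DataFrame({'Group': ['control', 'experimental', 'control'],
                   'Count': [10, 20, 30]})

mask = df['Group'] == 'control'  # boolean Series: [True, False, True]
print(df[mask])                  # keeps only the 'control' rows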
