Categorizing CSV data by groups defined through string values - python

So I am trying to organize data from a CSV file using pandas so I can graph it in matplotlib. I have different rows of values, some of which are control and others experimental. I am able to separate the rows for graphing, but I cannot seem to make it work: I have attempted for loops (seen below), but I keep getting 'TypeError: 'type' object is not subscriptable'.
import pandas as pd
import numpy as np
import matplotlib as plt
df = pd.read_csv('C:\\Users\\User\\Desktop\\Ubiome samples\\samples.csv')
group = (df['Group'])
count = (df['Count'])
time = (df['Time'])
for steps in range [group]:
    plt.plot([time],[count],'bs')
    plt.show()

There is a typo in your for loop:
for steps in range [group]:
should be
for steps in range(group):
Your for loop writes range [group], which subscripts the range class itself instead of calling it. Since the class object does not support subscription, you get TypeError: 'type' object is not subscriptable. Check the Python documentation for __getitem__ for more details.
However, you cannot use range on a pandas Series to loop over every item in it, since range expects an integer as its input. Instead you should use:
for steps in group:
This will loop over every value in the Group column (one per row of your CSV file), and your loop body would draw the exact same plot each time. I'm quite sure this is not what you actually want to do.
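A quick illustration of the difference, using a made-up Series (not your actual data):
import pandas as pd

group = pd.Series(['control', 'experimental', 'control'])

# range [group]  -> TypeError: 'type' object is not subscriptable
# range(group)   -> also a TypeError, because a Series is not an integer

for steps in group:   # iterates over the values of the Series
    print(steps)      # prints: control, experimental, control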
If I understand your question correctly, you want to plot each group of experimental/control values you have in your CSV.
Then you should try (untested) :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('C:\\Users\\User\\Desktop\\Ubiome samples\\samples.csv')
for group in df['Group'].unique():
    group_data = df[df['Group'] == group]
    plt.plot(group_data['Time'], group_data['Count'], 'bs')
plt.show()
for group in df['Group'].unique() will loop over every distinct value in the Group column, ignoring duplicates.
For instance, if your column has 1000 strings in it, but all of these strings are either "experimental" or "control", then this will loop over ['experimental', 'control'] (actually a numpy array; also note that unique() doesn't sort, so the order of the output depends on the order of the input).
df[df['Group'] == group] will then select all the rows where the column 'Group' is equal to group.
Check the pandas documentation on boolean masking and the where method for more details.
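If you also want to tell the groups apart visually, a small variation of the loop above (just a sketch; the marker choice is arbitrary) gives every group its own legend entry and lets matplotlib pick a different colour per group:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('C:\\Users\\User\\Desktop\\Ubiome samples\\samples.csv')

for group in df['Group'].unique():
    group_data = df[df['Group'] == group]
    # 's' keeps the square marker but lets matplotlib cycle colours per group
    plt.plot(group_data['Time'], group_data['Count'], 's', label=group)

plt.legend()
plt.show()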

Related

Convert Varying Column Length to Rows in Pandas

I'm trying to create a graph with Seaborn that shows all of the Windows events in the Domain Controller that have taken place in a given time range, which means you have, say, five events now, but when you run the program again in 10 minutes, you might get 25 events.
With that said, I've been able to parse these events (labeled Actions) from a mumbo-jumbo of other data in the log file and then create a DataFrame in Pandas. The script outputs the data as a dictionary. After creating the DataFrame, this is what the output looks like:
   logged-out  kerberos-authentication-ticket-requested  logged-in  created-process  exited-process
1           1                                          5          2                1               1
Note: The values you see above are the number of times the process took place within that time frame.
That would be good enough for me if a table were all I needed. When I try to put this DataFrame into Seaborn, I get an error because I don't know what to name the x and y axes, since they are always changing. So my solution was to use df.melt() to convert those columns into rows and then label the only two columns needed ('Action', 'Count'). But that's where I fumbled multiple times: I can't figure out how to use df.melt() correctly.
Here is my code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Ever-changing data
actions = {'logged-out': 2, 'kerberos-authentication-ticket-requested': 5, 'logged-in': 2,
           'created-process': 1, 'exited-process': 1, 'logged-out': 1}
#Create DataFrame
data = actions
index = 1 * pd.RangeIndex(start=1, stop=2) #add automatic index
df = pd.DataFrame(data,index=index,columns=['Action','Count'])
print(df)
#Convert Columns to Rows and Add
df.melt(id_vars=["Action", "Count"],
        var_name="Action",
        value_name="Count")
#Create graph
sns.barplot(data=df,x='Action',y='Count',
            palette=['#476a6f','#e63946'],
            dodge=False,saturation=0.65)
plt.savefig('fig.png')
plt.show()
Any help is appreciated.
You can use:
df.melt(var_name="Action", value_name="Count")
without using any id_vars!
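Putting it together, here is a sketch of how the whole flow could look (the actions dictionary is made-up data, and the duplicate 'logged-out' key from your snippet is dropped, since a Python dict only keeps the last value for a repeated key):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Ever-changing data: the keys (action names) can differ from run to run
actions = {'logged-out': 1, 'kerberos-authentication-ticket-requested': 5,
           'logged-in': 2, 'created-process': 1, 'exited-process': 1}

# One wide row whose column names are the action names
df = pd.DataFrame(actions, index=[1])

# Melt into two fixed columns that Seaborn can always rely on
long_df = df.melt(var_name='Action', value_name='Count')

sns.barplot(data=long_df, x='Action', y='Count', dodge=False, saturation=0.65)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('fig.png')
plt.show()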

Pandas qcut function duplicates parameter

Maybe I am missing the point, but why doesn't the pandas qcut function accept "ignore" as an argument for duplicates?
Small datasets with duplicate values raise the error:
"Bin edges must be unique"
along with the advice to use the "drop" option. But if you want a fixed number of bins, there is no possibility?
Small code example that's not working:
import pandas as pd
import numpy as np
data=pd.Series([1,1,2,3])
pd.qcut(data,10,labels=np.arange(0,10),duplicates="raise")
Small code example that works, but doesn't give the same number of bins:
import pandas as pd
import numpy as np
data=pd.Series([1,1,2,3])
pd.qcut(data,4,labels=np.arange(0,3),duplicates="drop")
What could be a possible solution:
Insert a third option "ignore" to https://github.com/pandas-dev/pandas/blob/06d230151e6f18fdb8139d09abf539867a8cd481/pandas/core/reshape/tile.py#L405
Change the if else block in https://github.com/pandas-dev/pandas/blob/06d230151e6f18fdb8139d09abf539867a8cd481/pandas/core/reshape/tile.py#L418-L424
to
if duplicates == "raise":
    raise ValueError(
        f"Bin edges must be unique: {repr(bins)}.\n"
        f"You can drop duplicate edges by setting the 'duplicates' kwarg"
    )
elif duplicates == "drop":
    bins = unique_bins
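Until something like that exists, a common workaround (just a sketch, not part of the proposal above) is to rank the values first, which makes every bin edge unique and therefore always yields the requested number of bins; note that tied values then get split across bins:
import pandas as pd
import numpy as np

data = pd.Series([1, 1, 2, 3])

# rank(method='first') breaks ties, so the quantile edges are guaranteed unique
bins = pd.qcut(data.rank(method='first'), 4, labels=np.arange(0, 4))
print(bins.tolist())   # [0, 1, 2, 3]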

Why am I receiving a key error after slicing my data? [duplicate]

I have a code that slices data and is then supposed to calculate different indices from the columns.
My code worked well, but today I had to slice the data differently, and since then I get a KeyError whenever I try to compute the indices.
Unfortunately I can't share my original data, but I hope this code can help in understanding what happened here.
This is my code with some explanations:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_plants = pd.read_csv('my_data')
#My data contains columns with numerical data whose column titles are numbers
#here I have changed those number titles from strings into floats
float_cols = [float(i) for i in df_plants.columns.tolist()[4:] if type(i)==str]
df_plants.columns.values[4:] = float_cols
#detector edges removal
#Here my goal is to remove some of the columns that have wrong data.
#this part was added today and might be the reason for the problem
cols = df_plants.columns.tolist()
df_plants=df_plants[cols[:4] + cols[11:]].copy()
#Trying to calculate indices:
filter_plants['NDVI']=(filter_plants['801.03']-filter_plants['680.75'])/(filter_plants['801.03']+filter_plants['680.75'])
KeyError: '801.03'
In order to solve this problem I have tried to add these lines again before the calculation:
float_cols = [float(i) for i in df_plants.columns.tolist()[4:] ]
df_plants.columns.values[4:] = float_cols
but I still got the KeyError.
My end goal is to be able to do these calculations with my indices; I believe the problem relates to the change in the type of the columns.
Try changing the last line to:
filter_plants['NDVI']=(filter_plants[801.03]-filter_plants[680.75])/(filter_plants[801.03]+filter_plants[680.75])
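The reason, as far as the snippet shows, is that the assignment df_plants.columns.values[4:] = float_cols turned the column labels into float objects, so the string '801.03' no longer matches any column. A toy example of the same situation:
import pandas as pd

# Toy frame standing in for the real data: labels start out as strings
df = pd.DataFrame([[0.2, 0.6]], columns=['680.75', '801.03'])
df.columns = [float(c) for c in df.columns]

# df['801.03'] would now raise KeyError, but the float label works:
ndvi = (df[801.03] - df[680.75]) / (df[801.03] + df[680.75])
print(ndvi)   # 0    0.5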

Find big differences in numpy array

I have a CSV file that contains data from two LED measurements. There are some mistakes in the file that produce huge sparks in the graph. I want to locate the places where this happens.
I have this code that makes two arrays that I plot.
x625 = np.array(df['LED Group 625'].dropna(axis=0, how='all'))
x940 = np.array(df['LED Group 940'].dropna(axis=0, how='all'))
I will provide an answer with some artificial data since you have not posted any data yet.
So after you convert the pandas columns into a numpy array, you can do something like this:
import numpy as np
# some random data. 100 lines and 1 column
x625 = np.random.rand(100,1)
# Assume that the maximum value in `x625` is a spark.
spark = x625.max()
# Find where these spark are in the `x625`
np.where(x625==spark)
#(array([64]), array([0]))
The above means that a value equal to spark is located at row index 64 of column 0.
Similarly, you can use np.where(x625 > any_number_here) to find values above some threshold.
If instead of the location you need to create a boolean mask use this:
boolean_mask = (x625==spark)
# verify
np.where(boolean_mask)
# (array([64]), array([0]))
EDIT 1
You can use numpy.diff() to get the element-wise differences between consecutive elements of the array.
diffs = np.diff(x625.ravel())
This will have at index 0 the result of element1 - element0.
If the values in diffs are big at a specific index position, then a spark occurred at that position.
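A sketch of how that might look in practice (the injected spark and the threshold value are made up; pick a threshold that fits your data):
import numpy as np

x625 = np.random.rand(100)      # stand-in for the real LED measurements
x625[40] += 10                  # inject an artificial spark

diffs = np.diff(x625)           # diffs[i] = x625[i+1] - x625[i]
threshold = 5                   # assumed threshold for a "big" jump
spark_positions = np.where(np.abs(diffs) > threshold)[0]
print(spark_positions)          # e.g. [39 40]: the jump into and out of the spark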

Delete series value from row of a pandas data frame based on another data frame value

My question is a little bit different from the question posted here,
so I thought to open a new thread. I have a pandas data frame with 5 attributes. One of these attributes is created using a pandas Series. Here is the sample code for creating the data frame:
import numpy as np
import pandas as pd
mydf1=pd.DataFrame(columns=['group','id','name','mail','gender'])
data = np.array([2540948, 2540955, 2540956,2540956,7138932])
x=pd.Series(data)
mydf1.loc[0]=[1,x,'abc','abc#xyz.com','male']
I have another data frame; the code for creating it is given below:
mydf2=pd.DataFrame(columns=['group','id'])
data1 = np.array([2540948, 2540955, 2540956])
y=pd.Series(data1)
mydf2.loc[0]=[1,y]
These are sample data. The actual data will have a large number of rows, and the series length is large too. I want to match mydf1 against mydf2 and, where there is a match (sometimes there will be no matching element in mydf2), delete from mydf1's id the values that are present in mydf2. For example, after the run, the id for group 1 will be 2540956, 7138932. I also tried the code mentioned in the above link, but for the first line
counts = mydf1.groupby('id').cumcount()
I got error message as
TypeError: 'Series' objects are mutable, thus they cannot be hashed
in my Python 3.X. Can you please suggest how to solve this?
This should work. We use Counter to find the difference between the two lists of ids. (P.S. this approach does not require the difference to be in order.)
Setup
import numpy as np
import pandas as pd
from collections import Counter
mydf1=pd.DataFrame(columns=['group','id','name','mail','gender'])
x = [2540948, 2540955, 2540956,2540956,7138932]
y = [2540948, 2540955, 2540956,2540956,7138932]
mydf1.loc[0]=[1,x,'abc','abc#xyz.com','male']
mydf1.loc[1]=[2,y,'def','def#xyz.com','female']
mydf2=pd.DataFrame(columns=['group','id'])
x2 = np.array([2540948, 2540955, 2540956])
y2 = np.array([2540955, 2540956])
mydf2.loc[0]=[1,x2]
mydf2.loc[1]=[2,y2]
Code
mydf3 = mydf1[["group", "id"]]
mydf3 = mydf3.merge(mydf2, how="inner", on="group")
new_id_finder = lambda x: list((Counter(x.id_x) - Counter(x.id_y)).elements())
mydf3["new_id"] = mydf3.apply(new_id_finder, 1)
mydf3["new_id"]
   group                        new_id
0      1           [2540956, 7138932]
1      2  [2540948, 2540956, 7138932]
One Counter object can subtract another to get the difference in occurrences of elements. Then you can use the elements() method to retrieve all the values that are left.
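If the end goal is to write these reduced id lists back into mydf1 (my assumption about the desired result), they can be mapped back by group:
# Map the computed lists back onto mydf1 by group (assumes 'group' is unique in mydf3)
mydf1['id'] = mydf1['group'].map(mydf3.set_index('group')['new_id'])
print(mydf1[['group', 'id']])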
