I am a beginner and this is my first project.. I searched for the answer but it still isn't clear.
I have imported a worksheet from excel using Pandas..
**Rabbit Class:
Num Behavior Speaking Listening
0 1 3 1 1
1 2 1 1 1
2 3 3 1 1
3 4 1 1 1
4 5 3 2 2
5 6 3 2 3
6 7 3 3 1
7 8 3 3 3
8 9 2 3 2
What I want to do is create if functions.. ex. if a student's behavior is a "1" I want it to print one string, else print a different string. How can I reference a particular cell of the worksheet to set up such a function? I tried: val = df.at(1, "Behavior") but that clearly isn't working..
Here is the code I have so far..
import os
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
path = r"C:\Users\USER\Desktop\Python\rabbit_class.xls"
print("Rabbit Class:")
print(df)
Also you can do
dff = df.loc[df['Behavior']==1]
if(not(dff.empty)):
# do Something
What you want is to find rows where df.Behavior is equal to 1. Use any of the following three methods.
# Method-1
df[df["Behavior"]==1]
# Method-2
df.loc[df["Behavior"]==1]
# Method-3
df.query("Behavior==1")
Output:
Num Behavior Speaking Listening LastColumn
0 0 1 3 1 1
Note: Dummy Data
Your sample data does not have a column header (the last one). So I named it LastColumn and read-in the data as a dataframe.
# Dummy Data
s = """
Num Behavior Speaking Listening LastColumn
0 1 3 1 1
1 2 1 1 1
2 3 3 1 1
3 4 1 1 1
4 5 3 2 2
5 6 3 2 3
6 7 3 3 1
7 8 3 3 3
8 9 2 3 2
"""
# Make Dataframe
ss = re.sub('\s+',',',s)
ss = ss[1:-1]
sa = np.array(ss.split(',')).reshape(-1,5)
df = pd.DataFrame(dict((k,v) for k,v in zip(sa[0,:], sa[1:,].T)))
df = df.astype(int)
df
Hope below example will help you
import pandas as pd
df = pd.read_excel(r"D:\test_stackoverflow.xlsx")
print(df.columns)
def _filter(col, filter_):
return df[df[col]==filter_]
print(_filter('Behavior', 1))
Thank you all for your answers. I finally figured out what I was trying to do using the following code:
i = 0
for i in df.index:
student_number = df["Student Number"][i]
print(student_number)
student_name = student_list[int(student_number) - 1]
behavior = df["Behavior"][i]
if behavior == 1:
print("%s's behavior is good" % student_name)
elif behavior == 2:
print ("%s's behavior is average." % student_name)
else:
print ("%s's behavior is poor" % student_name)
speaking = df["Speaking"][i]
Related
I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However the loops take forever and I wanted to know if there was a faster way. Everybody keeps mentioning using shift but I don't understand how that would even work.
df = pd.DataFrame(index=range(500)
df["A"]= 2
df["B"]= 5
df["A"][0]= 1
for i in range(len(df):
if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
numpy_ext can be used for expanding calculations
pandas-rolling-apply-using-multiple-columns for reference
I have also included a simpler calc to demonstrate behaviour in simpler way
df = pd.DataFrame(index=range(5000))
df["A"]= 2
df["B"]= 5
df["A"][0]= 1
import numpy_ext as npe
# for i in range(len(df):
# if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
# SO example - function of previous values in A and B
def f(A,B):
r = np.sum(A[:-1]/3) - np.sum(B[:-1] + 25) if len(A)>1 else A[0]
return r
# much simpler example, sum of previous values
def g(A):
return np.sum(A[:-1])
df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
sample output
A
B
AB_combo
A_running
0
1
5
1
0
1
2
5
-29.6667
1
2
2
5
-59
3
3
2
5
-88.3333
5
4
2
5
-117.667
7
5
2
5
-147
9
6
2
5
-176.333
11
7
2
5
-205.667
13
8
2
5
-235
15
9
2
5
-264.333
17
I have an excel sheet which consists of 2 columns. The first keywords and the second is Url.
I am making a script to extract groups which shares the same 3 URLs or more.
I wrote the below code but it takes around an hour to process the main function on a huge excel sheet.
import pandas as pd
import numpy as np
import time
loop = 1
numerator = 0
continuee= []
df_list = []
for index in list(df.sort_values('Url').set_index('Url').index.unique()):
if len(df.sort_values('Url').set_index('Url').loc[index].values) == 1:
list1 = list(df.sort_values('Url').set_index('Url').loc[index].values)
elif len(df.sort_values('Url').set_index('Url').loc[index].keywords.values) > 1:
list1 = list(df.sort_values('Url').set_index('Url').loc[index].keywords.values)
df1 = df[df.keywords.isin(list1)]
df1 = df1[df1.Url.duplicated(keep=False)]
df1 = df1.groupby('Url').filter(lambda x: x.Url.value_counts() == df1.keywords.nunique())
df1 = df1.groupby('keywords').filter(lambda x: x.keywords.value_counts() >= 3)
df1 = df1.groupby('Url').filter(lambda x: x.Url.value_counts() == df1.keywords.nunique())
if df1.keywords.nunique() > 1:
silos = list(df1.keywords.unique())
df_list.append({numerator:silos})
word = word[~(word.isin(silos))]
numerator += 1
else:
singles = list(word[word.keywords.isin(list1)].keywords.unique())
df_list.append({"single" : singles})
word = word[~(word.isin(singles))]
print(loop)
loop += 1
trial = pd.DataFrame(df_list)
if 'single' in list(trial.columns):
for i in list(word.keywords.unique()):
if i not in list(trial.single):
df_list.append({"single" : i})
else:
for i in list(word.keywords.unique()):
df_list.append({"single" : i})
trial = pd.DataFrame(df_list)
I tried many times to use multiprocessing but I failed as I am not really getting how it works with Pandas. Is there a way to help me, please? Also, if I wanted to pass another couple of functions how would I do it? Many thanks in advance.
From what I can gather, this should be your solution;
by_size = df.groupby(df.columns.tolist()).size().reset_index()
three_or_more=by_size[by_size[0]>=3].iloc[:,:-1]
Example:
>>> df
keyword url
0 2 2
1 4 3
2 2 1
3 4 3
4 1 1
5 2 1
6 4 1
7 2 1
8 1 1
9 3 3
>>> by_size = df.groupby(df.columns.tolist()).size().reset_index()
>>> by_size
keyword url 0
0 1 1 2
1 2 1 3
2 2 2 1
3 3 3 1
4 4 1 1
5 4 3 2
>>> three_or_more=by_size[by_size[0]>=3].iloc[:,:-1]
>>> three_or_more
keyword url
1 2 1
Can you help on the following task? I have a dataframe column such as:
index df['Q0']
0 1
1 2
2 3
3 5
4 5
5 6
6 7
7 8
8 3
9 2
10 4
11 7
I want to substitute the values in df.loc[3:8,'Q0'] with the values in df.loc[0:2,'Q0'] if df.loc[0,'Q0']!=df.loc[3,'Q0']
The result should look like the one below:
index df['Q0']
0 1
1 2
2 3
3 1
4 2
5 3
6 1
7 2
8 3
9 2
10 4
11 7
I tried the following line:
df.loc[3:8,'Q0'].where(~df.loc[0,'Q0']!=df.loc[3,'Q0']),other=df.loc[0:2,'Q0'],inplace=True)
or
df['Q0'].replace(to_replace=df.loc[3:8,'Q0'], value=df.loc[0:2,'Q0'], inplace=True)
But it doesn't work. Most possible I am doing something wrong.
Any suggestions?
You can use the cycle function:
from itertools import cycle
c = cycle(df["Q0"][0:3])
if df.Q0[0] != df.Q0[3]:
df["Q0"][3:8] = [next(c) for _ in range(5)]
Thanks for the replies. I tried the suggestions but I have some issues:
#adnanmuttaleb -
When I applied the function in a dataframe with more than 1 column (e.g. 12x2 or larger) I notice that the value in df.Q0[8] didn't change. Why?
#jezrael -
When I adjust to your suggestion I get the error:
ValueError: cannot copy sequence with size 5 to array axis with dimension 6
When I change the range to 6, I am getting wrong results
import pandas as pd
from itertools import cycle
data={'Q0':[1,2,3,5,5,6,7,8,3,2,4,7],
'Q0_New':[0,0,0,0,0,0,0,0,0,0,0,0]}
df = pd.DataFrame(data)
##### version 1
c = cycle(df["Q0"][0:3])
if df.Q0[0] != df.Q0[3]:
df['Q0_New'][3:8] = [next(c) for _ in range(5)]
##### version 2
d = cycle(df.loc[0:3,'Q0'])
if df.Q0[0] != df.Q0[3]:
df.loc[3:8,'Q0_New'] = [next(d) for _ in range(6)]
Why we have different behaviors and what corrections need to be made?
Thanks once more guys.
I'm trying to parse a logfile of our manufacturing process. Most of the time the process is run automatically but occasionally, the engineer needs to switch into manual mode to make some changes and then switches back to automatic control by the reactor software. When set to manual mode the logfile records the step as being "MAN.OP." instead of a number. Below is a representative example.
steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
ser_orig = pd.Series(steps)
which results in
0 1
1 2
2 2
3 MAN.OP.
4 MAN.OP.
5 2
6 2
7 3
8 3
9 MAN.OP.
10 MAN.OP.
11 4
12 4
dtype: object
I need to detect the 'MAN.OP.' and make them distinct from each other. In this example, the two regions with values == 2 should be one region after detecting the manual mode section like this:
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
I have code that iterates over this series and produces the correct result when the series is passed to my object. The setter is:
#step_series.setter
def step_series(self, ss):
"""
On assignment, give the manual mode steps a unique name. Leave
the steps done on recipe the same.
"""
manual_mode = "MAN.OP."
new_manual_mode_text = "Manual_Mode_{}"
counter = 0
continuous = False
for i in ss.index:
if continuous and ss.at[i] != manual_mode:
continuous = False
counter += 1
elif not continuous and ss.at[i] == manual_mode:
continuous = True
ss.at[i] = new_manual_mode_text.format(str(counter))
elif continuous and ss.at[i] == manual_mode:
ss.at[i] = new_manual_mode_text.format(str(counter))
self._step_series = ss
but this iterates over the entire dataframe and is the slowest part of my code other than reading the logfile over the network.
How can I detect these non-unique sections and rename them uniquely without iterating over the entire series? The series is a column selection from a larger dataframe so adding extra columns is fine if needed.
For the completed answer I ended up with:
#step_series.setter
def step_series(self, ss):
pd.options.mode.chained_assignment = None
manual_mode = "MAN.OP."
new_manual_mode_text = "Manual_Mode_{}"
newManOp = (ss=='MAN.OP.') & (ss != ss.shift())
ss[ss == 'MAN.OP.'] = 'Manual_Mode_' + (newManOp.cumsum()-1).astype(str)
self._step_series = ss
Here's one way:
steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
steps = pd.Series(steps)
newManOp = (steps=='MAN.OP.') & (steps != steps.shift())
steps[steps=='MAN.OP.'] += seq.cumsum().astype(str)
>>> steps
0 1
1 2
2 2
3 MAN.OP.1
4 MAN.OP.1
5 2
6 2
7 3
8 3
9 MAN.OP.2
10 MAN.OP.2
11 4
12 4
dtype: object
To get the exact format you listed (starting from zero instead of one, and changing from "MAN.OP." to "Manual_mode_"), just tweak the last line:
steps[steps=='MAN.OP.'] = 'Manual_Mode_' + (seq.cumsum()-1).astype(str)
>>> steps
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
There a pandas enhancement request for contiguous groupby, which would make this type of task simpler.
There is s function in matplotlib that takes a boolean array and returns a list of (start, end) pairs. Each pair represents a contiguous region where the input is True.
import matplotlib.mlab as mlab
regions = mlab.contiguous_regions(ser_orig == manual_mode)
for i, (start, end) in enumerate(regions):
ser_orig[start:end] = new_manual_mode_text.format(i)
ser_orig
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
I have a sheet of numbers, separated by spaces into columns. Each column represents a different category, and within each column, each number represents a different value. For example, column number four represents age, and within the column, the number 5 represents an age of 44-55. Obviously, each row is a different person's record. I'd like to use a Python script to search through the the sheet, and find all columns where the sixth column is number "1." After that, I want to know how many times each number in column one appears where the number in column six is equal to "1." The script should output to the user that "While column six equals '1', the value '1' appears 12 times in column one. The value '2' appears 18 times..." etc. I hope I'm being clear here. I just want it to list the numbers, basically. Anyway, I'm new to Python. I've attached my code below. I think I should be using dictionaries, but I'm just not totally sure how. So far, I haven't really come close to figuring this out. I would really appreciate if someone could walk me through the logic that would be behind such code. Thank you so much!
ldata = open("list.data", "r")
income_dist = {}
for line in ldata:
linelist = line.strip().split(" ")
key_income_dist = linelist[6]
if key_income_dist in income_dist:
income_dist[key_income_dist] = 1 + income_dist[key_income_dist]
else:
income_dist[key_income_dist] = 1
ldata.close()
print value_no_occupations
First, indentation is majorly important in Python and the above is bad: the 5 lines following linelist = line.strip().split(" ") need to be indented to be in the loop like they should be.
Next they should be indented further and this line added before them:
if len(linelist)>6 and linelist[6]=="1":
This line skips over short lines (there are some), and tests for what you said you wanted: "where column six equals "1."" This is column [6] where the first number on the line is referenced as [0] (these are "offsets", not "cardinal", or counting, numbers).
You'll probably want to change key_income_dist = linelist[6] to key_income_dist = linelist[0] or [1] to get what you want. Play around if necessary.
Finally, you should say print income_dist at the end to get a look at your results. If you want fancier output, study up on formatting.
This is actually easier than it seems! The key is collections.Counter
from collections import Counter
ldata = open("list.data")
rows = [tuple(row.split()) for row in ldata if row.split()[5]==1]
# warning this will break if some rows are shorter than 6 columns
first_col = Counter(item[0] for item in rows)
If you want the distribution of every column (not just the first) do:
distribution = {column: Counter(item[column] for item in rows) for column in range(len(rows[0]))}
# warning this will break if all rows are not the same size!
Considering that the data file has ~9000 rows of data, if you don't want to keep the original data, you can combine step 1 & 2 to make the program use less memory and a little faster.
ldata = open("list.data", "r")
# read in all the rows, note that the list values are strings instead of integers
# keep only the rows with 6th column = '1'
only1 = []
for line in ldata:
if line.strip() == '': # ignor blank lines
continue
row = tuple(line.strip().split(" "))
if row[5] == '1':
only1.append(row)
ldata.close()
# tally the statistics
income_dist = {}
for row in only1:
if row[0] in income_dist:
income_dist[row[0]] += 1
else:
income_dist[row[0]] = 1
# print result
print "While column six equals '1',"
for num in sorted(income_dist):
print "the value %s appears %d times in column one." % (num, income_dist[num])
Sample Test Data in list.data:
9 2 1 5 4 5 5 3 3 0 1 1 7 NA
9 1 1 5 5 5 5 3 5 2 1 1 7 1
9 2 1 3 5 1 5 2 3 1 2 3 7 1
1 2 5 1 2 6 5 1 4 2 3 1 7 1
1 2 5 1 2 6 3 1 4 2 3 1 7 1
8 1 1 6 4 8 5 3 2 0 1 1 7 1
1 1 5 2 3 9 4 1 3 1 2 3 7 1
6 1 3 3 4 1 5 1 1 0 2 3 7 1
2 1 1 6 3 8 5 3 3 0 2 3 7 1
4 1 1 7 4 8 4 3 2 0 2 3 7 1
1 1 5 2 4 1 5 1 1 0 2 3 7 1
4 2 2 2 3 2 5 1 2 0 1 1 5 1
8 2 1 3 6 6 2 2 4 2 1 1 7 1
7 2 1 5 3 5 5 3 4 0 2 1 7 1
1 1 5 2 3 9 4 1 3 1 2 3 7 1
6 1 3 3 4 1 5 1 1 0 2 3 7 1
2 1 1 6 3 8 5 3 3 0 2 3 7 1
4 1 1 7 4 8 4 3 2 0 2 3 7 1
1 1 5 2 4 9 5 1 1 0 2 3 7 1
4 2 2 2 3 2 5 1 2 0 1 1 5 1
Following your original program logic, I come up with this version:
ldata = open("list.data", "r")
# read in all the rows, note that the list values are strings instead of integers
linelist = []
for line in ldata:
linelist.append(tuple(line.strip().split(" ")))
ldata.close()
# keep only the rows with 6th column = '1'
only1 = []
for row in linelist:
if row[5] == '1':
only1.append(row)
# tally the statistics
income_dist = {}
for row in only1:
if row[0] in income_dist:
income_dist[row[0]] += 1
else:
income_dist[row[0]] = 1
# print result
print "While column six equals '1',"
for num in sorted(income_dist):
print "the value %s appears %d times in column one." % (num, income_dist[num])