I am trying to create a running (moving) total of a value called var1.
For example, if var1 = 5, 4, 3, 12 for the first four values, I want the TOTAL values to be
9 (5+4), 7 (4+3), 15 (3+12), etc.
Instead, it is just taking 2 times var1, so the first four values of total are:
10, 8, 6, 24 etc.
This is the code I am trying. It seems to work (no errors):
import datetime
import pandas as pd

data = pd.read_csv("C:/Users/ameri/tempjohn.csv")
data.total = 0
i = 1
while i < 3:
    data.total += data.var1
    i += 1
print(data.total)
Can anybody help?
Thanks,
John
A Pandas dataframe is not a simple Python variable, even if you can do computations with it: it behaves more or less like a vectorized 2D array.
What happens in your code:
you set the column total of the dataframe to 0: data.total becomes a Series of the same length as the dataframe, containing only 0 values
you execute (for i == 1) data.total += data.var1: as it previously contained only 0 values, data.total is now a copy of (the Series) data.var1
you execute (for i == 2) data.total += data.var1: data.total now contains twice the values of data.var1
end of loop, because 3 < 3 is false...
What to do next:
read a Pandas tutorial if you want to go that way, but please remember that Pandas is not Python and some Pandas objects have different semantics than standard Python ones... or forget about Pandas if you only want to learn Python
If you really want to do it the Pandas way, the magic word is shift: data['total'] = data.var1 + data.var1.shift() (use the bracket syntax here; assigning to data.total would only set an attribute on the dataframe, not create a new column).
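For John's sample values, a minimal sketch of the shift approach (the first total is NaN because the first row has no predecessor):

import pandas as pd

data = pd.DataFrame({"var1": [5, 4, 3, 12]})
# Each total is the current value plus the previous one.
data['total'] = data.var1 + data.var1.shift()
print(data['total'].tolist())  # [nan, 9.0, 7.0, 15.0]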
Related
I am sure this is not hard, but I can't figure it out!
I want to create a dataframe that starts at 1 for the first row and ends at 100,000 in increments of 1, 2, 4, 5, or whatever. I could do this in my sleep in Excel, but is there a slick way to do this without importing a .csv or .txt file?
I have needed to do this in variations many times and just settled on importing a .csv, but I am tired of that.
Example in Excel
Generating numbers
Generating numbers is not something special to pandas; the numpy module or the built-in range function (as mentioned by @Grismer) can do the trick. Let's say you want to generate a series of numbers and assign them to a dataframe. There are multiple approaches; here are the two I personally prefer.
range function
Take range(1, 1000, 1) as an example. This function takes three arguments, two of which are not mandatory. The first argument is the start number, the second is the end number, and the last is the step of the range. So the example above produces the numbers 1 to 999 (note that a range is a half-open interval, closed at the start and open at the end).
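For instance, a quick check in the interpreter:

>>> list(range(1, 10, 2))
[1, 3, 5, 7, 9]
>>> list(range(1, 1000, 1))[-1]  # half-open: 1000 itself is excluded
999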
numpy.arange function
To get the same result as in the previous example, take numpy.arange(1, 1000, 1). The arguments are exactly the same as range's arguments.
Assigning to dataframe
Now, if you want to assign these numbers to a dataframe, you can easily do so with the pandas module. The code below is an example of how to generate such a dataframe:
import numpy as np
import pandas as pd
myRange = np.arange(1, 1001, 1)  # could also be built with range(1, 1001, 1)
df = pd.DataFrame({"numbers": myRange})
df.head(5)
which results in a dataframe like this (note that just the first five rows are shown):
   numbers
0        1
1        2
2        3
3        4
4        5
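Since the question also mentions other increments (2, 4, 5, or whatever), the same pattern works with any step; for example:

df_step5 = pd.DataFrame({"numbers": np.arange(1, 100001, 5)})  # 1, 6, 11, ..., 99996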
Difference of numpy.arange and range
To keep this answer short, I'd rather refer to this answer by @hpaulj.
I am using the numpy and pandas modules to work with data from an Excel sheet. I want to iterate through a column and make sure each row's value is higher than the previous one's by 1.
For example, cell A1 of excel sheet has a value of 1, I would like to make sure cell A2 has a value of 2. And I would like to do this for the entire column of my excel sheet.
The problem is I'm not sure if this is a good way to go about doing this.
This is the code I've come up with so far:
import numpy as np
import pandas as pd

i = 1
df = pd.read_excel("HR-Employee-Attrition(1).xlsx")
out = df['EmployeeNumber'].to_numpy().tolist()
print(out)
for i in out:
    if out[i] + 1 == out[i+1]:
        if out[i] == 1470:
            break
        i += 1
        pass
    else:
        print(out[i])
        break
It gives me the error:
IndexError: list index out of range.
Could someone advise me on how to check every row in my excel column?
If I understood the problem correctly, you may need to iterate up to the length of the list minus 1 to avoid going out of range:
for i in range(len(out)-1):
    if out[i] + 1 == out[i+1]:
        if out[i] == 1470:
            break
        i += 1
        pass
    else:
        print(out[i])
        break
There is an easier way to achieve this, though:
df['EmployeeNumber'].diff()
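For example, a sketch of the whole check built on diff() (the first diff is NaN because row 0 has no predecessor):

diffs = df['EmployeeNumber'].diff().iloc[1:]  # drop the first row's NaN
if (diffs == 1).all():
    print("EmployeeNumber increases by exactly 1 on every row")
else:
    print("first bad row:", diffs[diffs != 1].index[0])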
I don't understand why you are using a for-loop for such a thing:
I've created an Excel-sheet, with two columns, like this:
Index  Name
1      A
2      B
       C
       D
       E
I selected the two numbers (1 and 2) and double-clicked on the right-bottom corner of the selection rectangle, while recording what I was doing, and this macro got recorded:
Selection.AutoFill Destination:=Range("A2:A6")
As you see, Excel does not write a for-loop for this (a for-loop might prove to be a performance hole in the case of large Excel sheets).
The result on my Excel sheet was:
Index  Name
1      A
2      B
3      C
4      D
5      E
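For completeness, the pandas equivalent of that autofill is just a range-based column (a sketch with made-up data):

import pandas as pd

df = pd.DataFrame({"Name": list("ABCDE")})
df.insert(0, "Index", range(1, len(df) + 1))  # 1-based running index, like Excel's autofill
print(df)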
I have a little script that creates a new column in my pandas dataset called class, and assigns class values for a given time range. It works well, but suddenly I have thousands of time ranges to input, and wondered if it might be possible to write some kind of loop which gets the three columns (start, finish, and class) from a pandas dataframe.
To complicate things, the time ranges are of irregular interval in dataframe 1 (e.g. a nanosecond, 30 seconds, 4 minutes) and in dataframe 2, (which contains accelerometer data) the time series data increases in increments of 0.010 seconds. Any help appreciated as I'm new to Python.
conditions = [
    (X['DATETIME'] < '2017-11-17 07:31:07') & (X['DATETIME'] >= '2017-11-17 00:00:00'),
    (X['DATETIME'] < '2017-11-17 07:32:35') & (X['DATETIME'] >= '2017-11-17 07:31:07'),
    (X['DATETIME'] < '2017-11-17 09:01:05') & (X['DATETIME'] >= '2017-11-17 08:58:39'),
]
classes = ['0', '1', '2']
X['CLASS'] = np.select(conditions, classes, default='5')
There are many possible solutions to this; you could use for loops as you said, etc. But since you are new to Python, I think this answer will show you more of the power of Python and its great packages. I will use the numpy package here, and I suppose that your first table is in a pandas data frame named X while the second is in one named conditions.
import numpy as np
X['CLASS'] = conditions['CLASS'].iloc[np.digitize(X['Datetime'].view('i8'),
conditions['Start'].view('i8')) - 1]
Don't worry, I won't leave you there. np.digitize takes its first argument and bins its values using the bin borders defined by the second argument, so here you get, for each row, the index of the condition corresponding to the time in that row.
There are a couple of details to be noted:
.view('i8') provides an integer view of the datetime object which can easily be used by the numpy package (if you are interested, you can read more about the details)
-1 is needed to realign the results (the value right after the start of the first condition would get a value of 1, but we want it to start from 0)
in the end we use the iloc function of the conditions['CLASS'] series to map these indices to the class values
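Putting it together, a minimal runnable sketch (the two data frames are made-up stand-ins for yours, and I use astype('int64') in place of .view('i8'), which newer pandas versions deprecate):

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two data frames described above.
conditions = pd.DataFrame({
    'Start': pd.to_datetime(['2017-11-17 00:00:00',
                             '2017-11-17 07:31:07',
                             '2017-11-17 08:58:39']),
    'CLASS': ['0', '1', '2'],
})
X = pd.DataFrame({'Datetime': pd.to_datetime(['2017-11-17 03:00:00',
                                              '2017-11-17 07:32:00',
                                              '2017-11-17 09:00:30'])})

# Bin each timestamp by the range starts; -1 makes the indices 0-based.
idx = np.digitize(X['Datetime'].astype('int64'),
                  conditions['Start'].astype('int64')) - 1
X['CLASS'] = conditions['CLASS'].iloc[idx].to_numpy()
print(X)  # the three rows get classes '0', '1', '2' respectively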
I have a Pandas dataframe with 3000+ rows that looks like this:
       t090:   c0S/m:     pr:       timeJ:  potemp090C:   sal00:  depSM:
407  19.3574  4.16649   1.836   189.617454      19.3571  30.3949   1.824
408  19.3519  4.47521   1.381   189.617512      19.3517  32.9250   1.372
409  19.3712  4.44736   0.710   189.617569      19.3711  32.6810   0.705
410  19.3602  4.26486   0.264   189.617627      19.3602  31.1949   0.262
411  19.3616  3.55025   0.084   189.617685      19.3616  25.4410   0.083
412  19.2559  0.13710   0.071   189.617743      19.2559   0.7783   0.071
413  19.2092  0.03000   0.068   189.617801      19.2092   0.1630   0.068
414  19.4396  0.00522   0.068   189.617859      19.4396   0.0321   0.068
What I want to do is: create individual dataframes from each portion of the dataframe in which the values in column 'c0S/m' exceed 0.1 (e.g. rows 407-412 in the example above).
So let's say that I have 7 sections in my 3000+ row dataframe in which a series of rows exceed 0.1 in the second column. My if/for/while statement will slice these sections and create 7 separate dataframes.
I tried researching the best I could but could not find a question that would address this problem. Any help is appreciated.
Thank you.
You can try this:
First add a splitter column of 0s and 1s based on whether the value is greater than the cutoff (1 in the code below) or not.
df['splitter'] = np.where(df['c0S/m:'] > 1, 1, 0)
Now group by the cumulative sum of this column's diff:
df.groupby((df['splitter'].diff(1) != 0).astype('int').cumsum()).apply(
    lambda x: [x.index.min(), x.index.max()])
You get the required blocks of indices:
splitter
1 [407, 411]
2 [412, 414]
3 [415, 415]
Now you can create dataframes using loc:
df.loc[407:411]
Note: I added a line to your sample df using:
df.loc[415] = [19.01, 5.005, 0.09, 189.62, 19.01, 0.026, 0.09]
to be able to test better, hence the split into 3 groups
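To actually materialize the per-block dataframes rather than just their index bounds, you can iterate over the same groupby (a sketch; it keeps only the blocks where the cutoff is exceeded):

groups = df.groupby((df['splitter'].diff(1) != 0).astype('int').cumsum())
blocks = {name: g.drop(columns='splitter')
          for name, g in groups
          if g['splitter'].iat[0] == 1}  # keep only above-cutoff runs
# blocks[1] holds rows 407-411, blocks[3] holds row 415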
Here's another way.
sub_set = df[df['c0S/m'] > 0.1]
last = None
for i in sub_set.index:
    if last is None:
        start = i
    else:
        if i - last > 1:
            print(start, last)
            start = i
    last = i
I think it works. (Instead of print(start, last) you could insert code to create the slices you wanted of the original data frame.)
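For instance, to collect the blocks instead of printing them (a sketch; note that the loop above never emits the final block, so it has to be flushed after the loop):

sub_set = df[df['c0S/m'] > 0.1]
frames = []
start = last = None
for i in sub_set.index:
    if last is None:
        start = i
    elif i - last > 1:  # a gap in the indices closes the previous block
        frames.append(df.loc[start:last])
        start = i
    last = i
if last is not None:
    frames.append(df.loc[start:last])  # flush the final block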
Some neat tricks here that do an even better job.
I've got a fairly large data set of about 2 million records, each of which has a start time and an end time. I'd like to insert a field into each record that counts how many records there are in the table where:
Start time is less than or equal to "this row"'s start time
AND end time is greater than "this row"'s start time
So basically each record ends up with a count of how many events, including itself, are "active" concurrently with it.
I've been trying to teach myself pandas to do this, but I am not even sure where to start looking. I can find lots of examples of summing rows that meet a given condition like "> 2", but I can't seem to grasp how to iterate over rows to conditionally sum a column based on values in the current row.
You can try the code below to get the final result.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[2, 10], [5, 8], [3, 8], [6, 9]]), columns=["start", "end"])
active_events = {}
for i in df.index:
    active_events[i] = len(df[(df["start"] <= df.loc[i, "start"]) & (df["end"] > df.loc[i, "start"])])

last_columns = pd.DataFrame({'No. active events': pd.Series(active_events)})
df.join(last_columns)
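With the four sample rows above, the joined result should look like this:

   start  end  No. active events
0      2   10                  1
1      5    8                  3
2      3    8                  2
3      6    9                  4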
Here goes. This is going to be SLOW.
Note that this counts each row as overlapping with itself, so the results column will never be 0. (Subtract 1 from the result to do it the other way.)
import pandas as pd

df = pd.DataFrame({'start_time': [4, 3, 1, 2], 'end_time': [7, 5, 3, 8]})
df = df[['start_time', 'end_time']]  # just changing the order of the columns for aesthetics

def overlaps_with_row(row, frame):
    starts_before_mask = frame.start_time <= row.start_time
    ends_after_mask = frame.end_time > row.start_time
    return (starts_before_mask & ends_after_mask).sum()

df['number_which_overlap'] = df.apply(overlaps_with_row, frame=df, axis=1)
Yields:
In [8]: df
Out[8]:
start_time end_time number_which_overlap
0 4 7 3
1 3 5 2
2 1 3 1
3 2 8 2
[4 rows x 3 columns]
def counter(s: pd.Series):
    return ((df["start"] <= s["start"]) & (df["end"] >= s["start"])).sum()

df["count"] = df.apply(counter, axis=1)
This feels like a much simpler approach, using the apply method, and it doesn't really compromise much on speed: although apply is not as fast as vectorized pandas operations like cumsum(), it should be faster than an explicit for loop.
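At the two-million-row scale from the question, all of the apply-based answers are O(n^2). For completeness, a sorting-based sketch of my own (not from the answers above) computes the same counts in O(n log n), assuming start <= end on every row:

import numpy as np
import pandas as pd

df = pd.DataFrame({"start": [2, 5, 3, 6], "end": [10, 8, 8, 9]})

# For each row i, count rows j with start_j <= start_i and end_j > start_i.
# Since start_j <= end_j always holds, that count equals
#   #{start_j <= start_i} - #{end_j <= start_i},
# and both terms come from binary searches against sorted copies.
starts = np.sort(df["start"].to_numpy())
ends = np.sort(df["end"].to_numpy())
s = df["start"].to_numpy()
df["count"] = (np.searchsorted(starts, s, side="right")
               - np.searchsorted(ends, s, side="right"))
print(df)  # counts: 1, 3, 2, 4 (using the strict end > start convention)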