I am sure this is not hard, but I can't figure it out!
I want to create a dataframe that starts at 1 for the first row and ends at 100,000 in increments of 1, 2, 4, 5, or whatever. I could do this in my sleep in Excel, but is there a slick way to do this without importing a .csv or .txt file?
I have needed variations of this many times and have just settled for importing a .csv, but I am tired of that.
Example in Excel
Generating numbers
Generating numbers is nothing special to pandas; rather, the numpy module or the built-in range function (as mentioned by @Grismer) can do the trick. Let's say you want to generate a series of numbers and assign them to a dataframe. There are multiple approaches, two of which I personally prefer.
range function
Take range(1, 1000, 1) as an example. This function takes three arguments, two of which are optional. The first argument defines the start, the second the stop, and the last the step of the range. So the example above yields the numbers 1 to 999 (note that the range is a half-open interval, closed at the start and open at the end, so the stop value 1000 is excluded).
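For instance, a quick illustration of the three arguments (plain Python, nothing pandas-specific):
list(range(1, 10, 2))   # -> [1, 3, 5, 7, 9]; the stop value 10 is excluded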
numpy.arange function
To get the same result as the previous example, take numpy.arange(1, 1000, 1). Its arguments are exactly the same as range's.
Assigning to dataframe
Now, if you want to assign these numbers to a dataframe, you can easily do so with the pandas module. The code below is an example of how to generate such a dataframe:
import numpy as np
import pandas as pd
myRange = np.arange(1, 1001, 1)  # could equally be myRange = range(1, 1001, 1)
df = pd.DataFrame({"numbers": myRange})
df.head(5)
which results in a dataframe like this (note that only the first five rows are shown):
   numbers
0        1
1        2
2        3
3        4
4        5
Difference of numpy.arange and range
To keep this answer short, I'd rather refer to this answer by @hpaulj.
Related
I have a large dataframe that consists of around 19,000 rows and 150 columns. Many of these columns contain values of -1 and -2. When I try to replace the -1s and -2s with 0 using the following code, Jupyter times out on me and says there is no memory left. So, I am curious whether you can select a range of columns and apply the replace function to it. This way I can replace in batches, since I can't seem to replace in one pass with my available memory.
Here is the code I tried to use that timed out on me when first replacing the -2s:
df.replace(to_replace=-2, value="0")
Thank you for any guidance!
Sean
Let's say you want to divide your columns into chunks of 10; then you could try something like this:
import math

columns = your_df.columns
division_num = 10
chunks_num = math.ceil(len(columns) / division_num)  # ceil so the last partial chunk is included
index = 0
for i in range(chunks_num):
    cols = columns[index: index + division_num]
    your_df[cols] = your_df[cols].replace(to_replace=-2, value="0")  # assign back so the replacement sticks
    index += division_num
If your memory keeps overflowing, then you can try the loc indexer to divide the data by rows instead of columns, as sketched below.
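A minimal sketch of that row-wise variant, assuming a hypothetical batch size of 1000 rows (replacing with the number 0 rather than the string "0" keeps the columns numeric):

row_chunk = 1000  # hypothetical batch size; tune it to your available memory
for start in range(0, len(your_df), row_chunk):
    rows = your_df.index[start: start + row_chunk]
    your_df.loc[rows] = your_df.loc[rows].replace(to_replace=-2, value=0)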
I am trying to modify a pandas dataframe column this way:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE["Var"]["Jan"] = 2678400*SLICE["Var"]["Jan"]
However, this does not work. The resulting column SLICE["Var"]["Jan"] is still the same as before the multiplication.
If I multiply with a scalar two orders of magnitude smaller, the multiplication works. A subsequent multiplication by 100, to arrive at the value intended in the first place, also works:
SLICE["Var"]["Jan"] = 26784*SLICE["Var"]["Jan"]
SLICE["Var"]["Jan"] = 100*SLICE["Var"]["Jan"]
It seems like the scalar is too large for the multiplication. Is this a Python thing or a pandas thing? How can I make sure that the multiplication with the 7-digit number works directly?
I am using Python 3.8; the numbers in the dataframe are float32, they lie between 5.0e-5 and -5.0e-5, and some have an absolute value smaller than 1e-11.
EDIT: It might have to do with the 2-level column indexing. When I delete the first level, the calculation works:
Temporary=DF.loc[start:end].copy()
SLICE=Temporary.unstack("time").copy()
SLICE=SLICE.droplevel(0, axis=1)
SLICE["Jan"] = 2678400*SLICE["Jan"]
Your first method might give a SettingWithCopyWarning, which basically means the changes are not made to the actual dataframe. You can use .loc instead:
SLICE.loc[:,('Var', 'Jan')] = SLICE.loc[:,('Var', 'Jan')]*2678400
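For illustration, a minimal sketch with a made-up two-level column layout (the ('Var', 'Jan') labels are just placeholders mirroring the question):

import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples([('Var', 'Jan'), ('Var', 'Feb')])
SLICE = pd.DataFrame(np.full((3, 2), 5e-5, dtype='float32'), columns=cols)
SLICE.loc[:, ('Var', 'Jan')] = SLICE.loc[:, ('Var', 'Jan')] * 2678400
print(SLICE[('Var', 'Jan')])   # values are updated; no chained-assignment copy involved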
I have an array with around 160k entries which I get from a CSV-file and it looks like this:
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ...
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])
I now get around 3k unique IDs from some database, and what I have to do is look up each of these IDs in the array and sum the corresponding numbers.
So if I would need to look up "ID0524", the sum would be 7.7.
My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):
def sumValues(self, id):
    sub_arr = data_arr[data_arr[0:data_arr.size, 0] == id]  # rows whose first column matches id
    sum_arr = sub_arr[0:sub_arr.size, 1]                    # keep only the number column
    return sum_arr.sum()
And it takes around 18 s to do this for all 3k IDs.
I wondered if there is a faster way to do this, as the current runtime seems a bit too long to me. I would appreciate any guidance and hints on this. Thank you!
You could try using the built-in numpy methods:
numpy.intersect1d to find the unique IDs
numpy.sum to sum them up
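A rough sketch of that idea (db_ids stands for the ~3k IDs from the database; the name is just an assumption):

import numpy as np

ids_to_sum = np.intersect1d(data_arr[:, 0], db_ids)   # IDs present in both sources
totals = {i: data_arr[data_arr[:, 0] == i, 1].astype(float).sum()
          for i in ids_to_sum}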
A convenient tool to do your task is Pandas, with its grouping mechanism.
Start from the necessary import:
import pandas as pd
Then convert data_arr to a pandasonic DataFrame:
df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})
The reason for some complication in the above code is that the elements of your input array are all of a single type (in this case object), so it is necessary to convert the second column to float.
Then you can get the expected result in a single instruction:
result = df.groupby('Id').sum()
The result, for your data sample, is:
Amount
Id
ID0324 3.0
ID0524 7.7
ID0965 2.5
Another approach is to read your CSV file directly into a DataFrame (see the read_csv method), so there is no need to use any Numpy array at all.
The advantage is that read_csv is clever enough to recognize the data type of each column separately; at the very least, it can tell numbers apart from strings.
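A hedged sketch of that alternative, assuming a hypothetical file name and that the CSV has a header row with columns named Id and Amount (adjust to your actual file):

import pandas as pd

df = pd.read_csv("your_file.csv")          # column names and dtypes inferred from the file
result = df.groupby("Id")["Amount"].sum()  # same grouping as above, no intermediate array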
I am trying to create a running (moving) total of a value called var1.
Thus, I would want it to look like this:
Thus, if var1 = 5, 4, 3, 12 for the first four values respectively, I want
9 (5+4), 7 (4+3), 15 (3+12) for the TOTAL values etc.
Instead, it is just taking 2 TIMES var1, so that the first four values of total are:
10, 8, 6, 24 etc.
This is the code I am trying. It runs without errors:
import datetime
import pandas as pd

data = pd.read_csv("C:/Users/ameri/tempjohn.csv")
data.total = 0
i = 1
while i < 3:
    data.total += data.var1
    i += 1
print(data.total)
can anybody help?
thanks
John
A Pandas dataframe is not a simple Python variable even if you do computations with it: it behaves more or less as a vectorized 2D array.
What happens in your code:
you set the column total of the dataframe to 0: data.total becomes a Series of the same length as the dataframe, containing only 0 values
you execute (for i == 1) data.total += data.var1: as it previously contained 0 values, data.total is now a copy of (the Series) data.var1
you execute (for i == 2) data.total += data.var1: ok, data.total now contains twice the values of data.var1
end of loop because 3 < 3 is false...
What to do next:
read a Pandas tutorial if you want to go that way, but please remember that Pandas is not Python and some Pandas objects have different semantics than standard Python ones... or forget about Pandas if you only want to learn Python
If you really want to do it the Pandas way, the magic word is shift: data.total = data.var1 + data.var1.shift()
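A minimal sketch with the values from the question (5, 4, 3, 12), just to show what shift does:

import pandas as pd

data = pd.DataFrame({"var1": [5, 4, 3, 12]})
data["total"] = data["var1"] + data["var1"].shift()
print(data["total"].tolist())   # [nan, 9.0, 7.0, 15.0] -- each value plus the previous one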
I have a little script that creates a new column in my pandas dataset called class, and assigns class values for a given time range. It works well, but suddenly I have thousands of time ranges to input, and I wondered if it might be possible to write some kind of loop which gets the three columns (start, finish, and class) from a pandas dataframe.
To complicate things, the time ranges are of irregular interval in dataframe 1 (e.g. a nanosecond, 30 seconds, 4 minutes) and in dataframe 2, (which contains accelerometer data) the time series data increases in increments of 0.010 seconds. Any help appreciated as I'm new to Python.
conditions = [
    (X['DATETIME'] < '2017-11-17 07:31:07') & (X['DATETIME'] >= '2017-11-17 00:00:00'),
    (X['DATETIME'] < '2017-11-17 07:32:35') & (X['DATETIME'] >= '2017-11-17 07:31:07'),
    (X['DATETIME'] < '2017-11-17 09:01:05') & (X['DATETIME'] >= '2017-11-17 08:58:39'),
]
classes = ['0','1','2']
X['CLASS'] = np.select(conditions, classes, default='5')
There are many possible solutions to this; you could use for loops as you said, etc. But since you are new to Python, I think this answer will show you more of the power of Python and its great packages. I will use the numpy package here, and I suppose that your accelerometer data is in a pandas dataframe named X while the time-range table is in one named conditions.
import numpy as np

X['CLASS'] = conditions['CLASS'].iloc[
    np.digitize(X['Datetime'].view('i8'), conditions['Start'].view('i8')) - 1
].values  # .values avoids index alignment issues when assigning back to X
Don't worry, I won't leave you there. np.digitize takes its first argument and bins it according to the bin borders defined by the second argument, so here you get the index of the condition corresponding to the time in the given row.
There are a couple of details to be noted:
.view('i8') provides a view of the datetime object which can be easily used by the numpy package (if you are interested, you can read more about the details)
-1 is needed to realign the results (the values after the start of the first condition would get a value of 1, but we want the indices to start from 0).
in the end we use the iloc function of the conditions['CLASS'] series to map these indices to the class values.
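Putting it together, a small self-contained sketch with made-up times (the column names mirror the code above; on newer pandas you may need .astype('int64') instead of .view('i8')):

import numpy as np
import pandas as pd

conditions = pd.DataFrame({
    'Start': pd.to_datetime(['2017-11-17 00:00:00',
                             '2017-11-17 07:31:07',
                             '2017-11-17 08:58:39']),
    'CLASS': ['0', '1', '2'],
})
X = pd.DataFrame({'Datetime': pd.to_datetime(['2017-11-17 03:00:00',
                                              '2017-11-17 07:31:30',
                                              '2017-11-17 09:00:00'])})

idx = np.digitize(X['Datetime'].view('i8'), conditions['Start'].view('i8')) - 1
X['CLASS'] = conditions['CLASS'].iloc[idx].values
print(X)   # the three rows get classes '0', '1', '2'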