I'm trying to create a range of numbers in this dataframe:
pd.DataFrame({'Ranges': 1 - np.arange(0, 1, 0.1)})
But the expected output is (in the same column):
0.9 - 1
0.8 - 0.9
0.7 - 0.8
0.6 - 0.7
0.5 - 0.6
0.4 - 0.5
0.3 - 0.4
0.2 - 0.3
0.1 - 0.2
0 - 0.1
I have tried these 1, 2, 3 solutions, but none of them gets me closer to a solution. Any suggestions?
PS: The specific format of the numbers doesn't matter (it could be 1.0 or 1, 0.5 or .5, for example).
As far as I could understand, you needed to make intervals such as "0.9 - 1"; here's my suggestion.
pd.DataFrame({
    'Ranges': [str(x/10) + ' - ' + str(y/10)
               for x, y in zip(9 - np.arange(0, 10), 10 - np.arange(0, 10))]
})
This yields one interval per row, from "0.9 - 1.0" down to "0.0 - 0.1".
You can use string concatenation on the shifted Series:
df = pd.DataFrame({'Ranges': 1 - np.arange(0, 1, 0.1)})
s = df['Ranges'].round(2).astype(str)
out = s.shift(-1, fill_value='0.0') + ' - ' + s
output:
0 0.9 - 1.0
1 0.8 - 0.9
2 0.7 - 0.8
3 0.6 - 0.7
4 0.5 - 0.6
5 0.4 - 0.5
6 0.3 - 0.4
7 0.2 - 0.3
8 0.1 - 0.2
9 0.0 - 0.1
Name: Ranges, dtype: object
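For completeness, a sketch of a third option (my addition, not from the answers above), assuming one decimal place is an acceptable format: let pd.interval_range build the bin edges and format each interval yourself:
import pandas as pd
# ten contiguous bins covering (0, 1], formatted with the top interval first
intervals = pd.interval_range(start=0, end=1, freq=0.1)
df = pd.DataFrame({'Ranges': [f'{iv.left:.1f} - {iv.right:.1f}' for iv in intervals[::-1]]})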
I have a pandas dataframe that contains scores such as
score
0.1
0.15
0.2
0.3
0.35
0.4
0.5
etc
I want to group these values into groups of 0.2,
so if the score is between 0 and 0.2, the value for this row in score will be 0.2,
and if the score is between 0.2 and 0.4, then the value for score will be 0.4.
So, for example, if the max score is 1, I will have 5 buckets of score: 0.2, 0.4, 0.6, 0.8, 1.
desired output:
score
0.2
0.2
0.2
0.4
0.4
0.4
0.6
You can first define a function that does the rounding for you:
import numpy as np

def custom_round(x, base):
    return base * np.ceil(x / base)
Then use .apply() to apply the function to your column:
df.score.apply(lambda x: custom_round(x, base=.2))
Output:
0 0.2
1 0.2
2 0.2
3 0.4
4 0.4
5 0.4
6 0.6
Name: score, dtype: float64
Try np.ceil:
import pandas as pd
import numpy as np
data = {'score': {0: 0.1, 1: 0.15, 2: 0.2, 3: 0.3, 4: 0.35, 5: 0.4, 6: 0.5}}
df = pd.DataFrame(data)
base = 0.2
df['score'] = base * np.ceil(df.score/base)
print(df)
score
0 0.2
1 0.2
2 0.2
3 0.4
4 0.4
5 0.4
6 0.6
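A sketch of a third option (my addition), assuming all scores are greater than 0 and at most 1: pd.cut with explicit bin edges, labelling each bin with its right edge:
import pandas as pd
import numpy as np
df = pd.DataFrame({'score': [0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.5]})
# bins (0, 0.2], (0.2, 0.4], ..., (0.8, 1.0], each labelled by its right edge
edges = np.linspace(0, 1, 6)
df['score'] = pd.cut(df['score'], bins=edges, labels=edges[1:]).astype(float)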
I have two time-based datasets. One is the accelerometer's measurement data; the other is label data.
For example,
accelerometer.csv
timestamp,X,Y,Z
1.0,0.5,0.2,0.0
1.1,0.2,0.3,0.0
1.2,-0.1,0.5,0.0
...
2.0,0.9,0.8,0.5
2.1,0.4,0.1,0.0
2.2,0.3,0.2,0.3
...
label.csv
start,end,label
1.0,2.0,"running"
2.0,3.0,"exercising"
These data may be unrealistic; they are just examples.
In this case, I want to merge these data as below:
merged.csv
timestamp,X,Y,Z,label
1.0,0.5,0.2,0.0,"running"
1.1,0.2,0.3,0.0,"running"
1.2,-0.1,0.5,0.0,"running"
...
2.0,0.9,0.8,0.5,"exercising"
2.1,0.4,0.1,0.0,"exercising"
2.2,0.3,0.2,0.3,"exercising"
...
I'm using pandas' "iterrows". However, the number of rows of the real data is greater than 10,000, so the running time of the program is very long. I think there must be at least one way to do this without iteration.
My code looks like this:
import pandas as pd
acc = pd.read_csv("./accelerometer.csv")
labeled = pd.read_csv("./label.csv")
for index, row in labeled.iterrows():
    start = row["start"]
    end = row["end"]
    acc.loc[(start <= acc["timestamp"]) & (acc["timestamp"] < end), "label"] = row["label"]
How can I modify my code to get rid of the "for" iteration?
If the times in accelerometer don't go outside the boundaries of the times in label, you could use merge_asof:
accmerged = pd.merge_asof(acc, labeled, left_on='timestamp', right_on='start', direction='backward')
Output (for the sample data in your question):
timestamp X Y Z start end label
0 1.0 0.5 0.2 0.0 1.0 2.0 running
1 1.1 0.2 0.3 0.0 1.0 2.0 running
2 1.2 -0.1 0.5 0.0 1.0 2.0 running
3 2.0 0.9 0.8 0.5 2.0 3.0 exercising
4 2.1 0.4 0.1 0.0 2.0 3.0 exercising
5 2.2 0.3 0.2 0.3 2.0 3.0 exercising
Note you can remove the start and end columns with drop if you want to:
accmerged = accmerged.drop(['start', 'end'], axis=1)
Output:
timestamp X Y Z label
0 1.0 0.5 0.2 0.0 running
1 1.1 0.2 0.3 0.0 running
2 1.2 -0.1 0.5 0.0 running
3 2.0 0.9 0.8 0.5 exercising
4 2.1 0.4 0.1 0.0 exercising
5 2.2 0.3 0.2 0.3 exercising
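Another sketch (an addition, with extra assumptions): if the label rows tile the timeline as non-overlapping [start, end) intervals and every timestamp falls inside one of them, an IntervalIndex lookup also avoids the loop:
import pandas as pd
acc = pd.read_csv("./accelerometer.csv")
labeled = pd.read_csv("./label.csv")
# build [start, end) intervals; get_indexer returns the position of the
# interval containing each timestamp (it requires non-overlapping intervals)
intervals = pd.IntervalIndex.from_arrays(labeled["start"], labeled["end"], closed="left")
acc["label"] = labeled["label"].to_numpy()[intervals.get_indexer(acc["timestamp"])]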
Consider a dataframe df of the following structure:
Name Slide Height Weight Status General
A X 3 0.1 0.5 0.2
B Y 10 0.2 0.7 0.8
...
I would like to create duplicates of each row in this dataframe (specific to the Name and Slide) for the following combinations of Height and Weight, given by this list:
list_combinations = [[3,0.1],[10,0.2],[5,1.3]]
The desired output:
Name Slide Height Weight Status General
A X 3 0.1 0.5 0.2 #original
A X 10 0.2 0.5 0.2 # modified duplicate
A X 5 1.3 0.5 0.2 # modified duplicate
B Y 10 0.2 0.7 0.8 #original
B Y 3 0.1 0.7 0.8 # modified duplicate
B Y 5 1.3 0.7 0.8 # modified duplicate
etc. ...
Any suggestions and help would be much appreciated.
We can do a merge with how='cross':
out = (pd.DataFrame(list_combinations, columns=['Height', 'Weight'])
       .merge(df, how='cross', suffixes=('', '_'))
       .reindex(columns=df.columns)
       .sort_values('Name'))
Name Slide Height Weight Status General
0 A X 3 0.1 0.5 0.2
2 A X 10 0.2 0.5 0.2
4 A X 5 1.3 0.5 0.2
1 B Y 3 0.1 0.7 0.8
3 B Y 10 0.2 0.7 0.8
5 B Y 5 1.3 0.7 0.8
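Note that how='cross' requires pandas 1.2+. On older versions, a sketch of the usual workaround is to emulate the cross join with a constant key (_k here is just a throwaway column name):
import pandas as pd
combos = pd.DataFrame(list_combinations, columns=['Height', 'Weight'])
out = (combos.assign(_k=1)
       .merge(df.assign(_k=1), on='_k', suffixes=('', '_'))  # every-to-every join
       .reindex(columns=df.columns)
       .sort_values('Name'))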
I have ~1.2k files that when converted into dataframes look like this:
df1
A B C D
0 0.1 0.5 0.2 C
1 0.0 0.0 0.8 C
2 0.5 0.1 0.1 H
3 0.4 0.5 0.1 H
4 0.0 0.0 0.8 C
5 0.1 0.5 0.2 C
6 0.1 0.5 0.2 C
Now, I have to subset each dataframe with a window of fixed size along the rows, and add its contents to a second dataframe, with all its values originally initialized to 0.
df_sum
A B C
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
For example, let's set the window size to 3. The first subset therefore will be
window = df.loc[start:end, 'A':'C']
window
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
window.index = correct_index
df_sum = df_sum.add(window, fill_value=0)
df_sum
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
After that, the window will be the subset of df1 from rows 1-4, then rows 2-5, and finally rows 3-6. Once the first file has been scanned, the second file begins, until all files have been processed. As you can see, this approach relies on df.loc for the subsetting and df.add for the addition. However, despite being easy to code, it is very inefficient: on my machine it takes about 5 minutes to process the whole batch of 1.2k files of 200 lines each. I know that an implementation based on numpy arrays is orders of magnitude faster (about 10 seconds), but a bit more complicated in terms of subsetting and adding. Is there any way to increase the performance of this method while still using dataframes? For example, by substituting loc with a more performant slicing method.
Example:
def generate_index_list(window_size):
    before_offset = -(window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    index_list = list()
    for n in range(before_offset, after_offset + 1):
        index_list.append(str(n))
    return index_list

window_size = 3
for file in os.listdir('.'):
    df1 = pd.read_csv(file, sep='\t')
    starting_index = (window_size - 1) // 2
    before_offset = (window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    for index in df1.index:
        if index < starting_index or index + before_offset + 1 > len(df1.index):
            continue
        indexes = generate_index_list(window_size)
        window = df1.loc[index - before_offset:index + after_offset, 'A':'C']
        window.index = indexes
        df_sum = df_sum.add(window, fill_value=0)
Expected output:
df_sum
A B C
0 1.0 1.1 2.0
1 1.0 1.1 2.0
2 1.1 1.6 1.4
Consider building a list of subsetted data frames with .loc and .head, then run a groupby aggregation after the individual elements are concatenated.
window_size = 3

def window_process(file):
    csv_df = pd.read_csv(file, sep='\t')
    window_dfs = [(csv_df.loc[i:, ['A', 'B', 'C']]   # ROW AND COLUMN SLICE
                   .head(window_size)                # SELECT FIRST WINDOW ROWS
                   .reset_index(drop=True)           # RESET INDEX TO 0, 1, 2, ...
                  ) for i in range(csv_df.shape[0])]
    sum_df = (pd.concat(window_dfs)                  # COMBINE WINDOW DFS
              .groupby(level=0).sum())               # AGGREGATE BY INDEX
    return sum_df

# BUILD LONG DF FROM ALL FILES
long_df = pd.concat([window_process(file) for file in os.listdir('.')])
# FINAL AGGREGATION
df_sum = long_df.groupby(level=0).sum()
Using the posted data sample, below are the contents of each of the window_dfs:
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
A B C
0 0.0 0.0 0.8
1 0.5 0.1 0.1
2 0.4 0.5 0.1
A B C
0 0.5 0.1 0.1
1 0.4 0.5 0.1
2 0.0 0.0 0.8
A B C
0 0.4 0.5 0.1
1 0.0 0.0 0.8
2 0.1 0.5 0.2
A B C
0 0.0 0.0 0.8
1 0.1 0.5 0.2
2 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
1 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
And the final df_sum, to compare against the DataFrame.add() approach:
df_sum
A B C
0 1.2 2.1 2.4
1 1.1 1.6 2.2
2 1.1 1.6 1.4
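For reference, a sketch of the numpy route the question alludes to, using sliding_window_view (NumPy 1.20+); it assumes every file has columns A, B, C and at least window_size rows, and it only sums the complete windows:
import os
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

window_size = 3
totals = np.zeros((window_size, 3))
for file in os.listdir('.'):
    values = pd.read_csv(file, sep='\t')[['A', 'B', 'C']].to_numpy()
    # windows has shape (n_windows, window_size, 3); summing over axis 0
    # adds the windows together position by position
    windows = sliding_window_view(values, (window_size, 3)).squeeze(axis=1)
    totals += windows.sum(axis=0)
df_sum = pd.DataFrame(totals, columns=['A', 'B', 'C'])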
I'm trying to create a script that counts to 3 (step size 0.1) using while, and I'm trying to make it not display .0 for numbers without a decimal part (1.0 should be displayed as 1, 2.0 should be 2, ...).
What I tried to do is convert the float to an int and then check if they are equal. The problem is that it works only with the first number (0), but it doesn't work when it gets to 1.0 and 2.0.
This is my code:
i = 0
while i < 3.1:
    if int(i) == i:
        print int(i)
    else:
        print i
    i = i + 0.1
that's the output I get:
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2.0
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
3.0
the output I should get:
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
3
thank you for your time.
Due to the lack of precision in floating point numbers, they will not have an exact integral representation. Therefore, you want to make sure the difference is smaller than some small epsilon.
epsilon = 1e-10
i = 0
while i < 3.1:
    if abs(round(i) - i) < epsilon:
        print int(round(i))  # int() so 1 prints as 1, not 1.0
    else:
        print i
    i = i + 0.1
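If you want to avoid the epsilon entirely, a sketch using the decimal module (my addition, staying with Python 2 to match the code above); the steps stay exact, so 1.0 and 2.0 are hit exactly:
from decimal import Decimal
i = Decimal('0')
while i < Decimal('3.1'):
    # exact decimal arithmetic: no drift, so the integer test is reliable
    print int(i) if i == i.to_integral_value() else i
    i += Decimal('0.1')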
You can remove trailing zeros with '{0:g}'.format(1.00).
i = 0
while i < 3.1:
    if int(i) == i:
        print int(i)
    else:
        print '{0:g}'.format(i)
    i = i + 0.1
See: https://docs.python.org/3/library/string.html#format-specification-mini-language
Update: the int(i) check isn't needed at all (thanks, aganders3):
i = 0
while i < 3.1:
    print '{0:g}'.format(i)
    i = i + 0.1
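One more sketch (my addition): deriving each value from an integer counter also avoids the accumulating float error, and combines well with the g format:
n = 0
while n <= 30:
    # n / 10.0 is computed fresh each step, so no error accumulates
    print '{0:g}'.format(n / 10.0)
    n += 1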