I have a dataframe with the following information:
ticker date close gap
0 BHP 1981-07-31 0.945416 -0.199458
1 BHP 1981-08-31 0.919463 -0.235930
2 BHP 1981-09-30 0.760040 -0.434985
3 BHP 1981-10-30 0.711842 -0.509136
4 BHP 1981-11-30 0.778578 -0.428161
.. ... ... ... ...
460 BHP 2019-11-29 38.230000 0.472563
461 BHP 2019-12-31 38.920000 0.463312
462 BHP 2020-01-31 39.400000 0.459691
463 BHP 2020-02-28 33.600000 0.627567
464 BHP 2020-03-31 28.980000 0.784124
I developed the following code to find the rows where 'gap' crosses 0:
zero_crossings = np.where(np.diff(np.sign(BHP_data['gap'])))[0]
This returns:
array([ 52, 54, 57, 75, 79, 86, 93, 194, 220, 221, 234, 235, 236,
238, 245, 248, 277, 379, 381, 382, 383, 391, 392, 393, 395, 396],
dtype=int64)
I need to be able to do the following:
calculate the number of months between points where 'gap' crosses 0
remove items where the number of months is <12
average the remaining months
However, I don't know how to turn this ndarray into something useful that I can run the calculations on. When I try:
pd.DataFrame(zero_crossings)
I get the following df, which only returns the index:
0
0 52
1 54
2 57
3 75
4 79
5 86
.. ..
Please help...
I just extended your code a bit to get the zero crossings into the original dataframe, as required.
import pandas as pd
import numpy as np
BHP_data = pd.DataFrame({'gap': [-0.199458, 0.472563, 0.463312, 0.493318, -0.509136, 0.534985, 0.784124]})
BHP_data['zero_crossings'] = 0
zero_crossings = np.where(np.diff(np.sign(BHP_data['gap'])))[0]
print(zero_crossings) # [0 3 4]
# Updates the column to 1 based on the 0 crossing
BHP_data.loc[zero_crossings, 'zero_crossings'] = 1
print(BHP_data)
Output
gap zero_crossings
0 -0.199458 1
1 0.472563 0
2 0.463312 0
3 0.493318 1
4 -0.509136 1
5 0.534985 0
6 0.784124 0
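From there, the three bullet points can be answered directly from the crossing indices on the original dataframe. A minimal sketch, assuming the rows are strictly monthly with no gaps, so the difference between consecutive crossing indices approximates the number of months between crossings:
import numpy as np
zero_crossings = np.where(np.diff(np.sign(BHP_data['gap'])))[0]
months_between = np.diff(zero_crossings)          # months between successive crossings
long_runs = months_between[months_between >= 12]  # keep only runs of 12 months or more
average_months = long_runs.mean()                 # average of the remaining runs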
Related
I have a pandas dataset that tracks the number of cases, n, of an instance over time.
I have sorted the dataset in ascending order from the first recorded date and have created a new column called 'change'.
However, I am unsure how to take the data from column n and map it onto the 'change' column, so that each cell in 'change' represents the difference from the previous day.
For example, if on day 334 there were n = 14000 and on day 335 there were n = 14500 cases, in that corresponding 'change' cell I would want it to say '500'.
I have been trying things out for the past couple of hours but to no avail so have come here for some help.
I know this is wordier than I would like, but if you need any clarification let me know.
import pandas as pd
df = pd.DataFrame({
    'date': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'cases': [100, 120, 129, 231, 243, 212, 375, 412, 440, 1]
})
df['change'] = df.cases.diff()
OUTPUT
date cases change
0 1 100 NaN
1 2 120 20.0
2 3 129 9.0
3 4 231 102.0
4 5 243 12.0
5 6 212 -31.0
6 7 375 163.0
7 8 412 37.0
8 9 440 28.0
9 10 1 -439.0
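Note that diff leaves NaN in the first row because there is no previous day to compare against. If that's unwanted, one option (an assumption about your preferred handling, not the only one) is to fill it with 0:
df['change'] = df.cases.diff().fillna(0).astype(int)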
I have a dataframe as follows. The values are in a cell, as a list of elements. I want to visualize the distribution of the values from the "Values" column using histograms, either stacked in rows or separated by colour (Area_code).
How can I get the values out and construct the histograms in plotly? Any other ideas are also welcome. Thank you.
Area_code Values
0 New_York [999, 54, 231, 43, 177, 313, 212, 279, 199, 267]
1 Dallas [915, 183, 2326, 316, 206, 31, 317, 26, 31, 56, 316]
2 XXX [560]
3 YYY [884, 13]
4 ZZZ [203, 1066, 453, 266, 160, 109, 45, 627, 83, 685, 120, 410, 151, 33, 618, 164, 496]
If you reshape your data, this would be a perfect case for px.histogram. And from there you can choose between several aggregations like sum, average, and count via the histfunc argument:
fig = px.histogram(df, x='Area_code', y='Values', histfunc='sum')
fig.show()
You haven't specified what kind of output you're aiming for, but I'll leave it up to you to change the argument for histfunc and see which option suits your needs best.
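For example, switching to the per-area average is just a change of argument:
# average instead of sum; other accepted values include 'count', 'min', 'max'
fig = px.histogram(df, x='Area_code', y='Values', histfunc='avg')
fig.show()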
I'm often inclined to urge users to rethink their entire data process, but I'm just going to assume that there are good reasons why you're stuck with what seems like a pretty weird setup in your dataframe. The snippet below contains a complete data munging process to reshape your data from your setup to a so-called long format:
Area_code Values
0 New_York 999
1 New_York 54
2 New_York 231
3 New_York 43
4 New_York 177
5 New_York 313
6 New_York 212
7 New_York 279
8 New_York 199
9 New_York 267
10 Dallas 915
11 Dallas 183
12 Dallas 2326
13 Dallas 316
14 Dallas 206
15 Dallas 31
16 Dallas 317
17 Dallas 26
18 Dallas 31
19 Dallas 56
20 Dallas 316
21 XXX 560
22 YYY 884
23 YYY 13
24 ZZZ 203
And this is a perfect format for many of the great functionalities of plotly.express.
Complete code:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
# data input
df = pd.DataFrame({'Area_code': {0: 'New_York', 1: 'Dallas', 2: 'XXX', 3: 'YYY', 4: 'ZZZ'},
'Values': {0: [999, 54, 231, 43, 177, 313, 212, 279, 199, 267],
1: [915, 183, 2326, 316, 206, 31, 317, 26, 31, 56, 316],
2: [560],
3: [884, 13],
4: [203, 1066, 453, 266, 160, 109, 45, 627, 83, 685, 120, 410, 151, 33, 618, 164, 496]}})
# data munging
areas = []
value = []
for i, row in df.iterrows():
    for j, val in enumerate(row['Values']):
        areas.append(row['Area_code'])
        value.append(val)
df = pd.DataFrame({'Area_code': areas,
                   'Values': value})
# plotly
fig = px.histogram(df, x='Area_code', y='Values', histfunc='sum')
fig.show()
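As a side note, assuming pandas 0.25 or newer, the munging loop can likely be replaced by DataFrame.explode, applied to the original df (before it is reassigned) to unnest the list-valued cells into one row per element:
# one row per list element; same long format as the loop above
df_long = df.explode('Values').reset_index(drop=True)
df_long['Values'] = df_long['Values'].astype(int)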
I have the following dataframe, dfgeo:
x y z zt n k pv geometry dist
0 6574878.210 4757530.610 1152.588 1 8 4 90 POINT (6574878.210 4757530.610) 0.000000
1 6574919.993 4757570.314 1174.724 0 POINT (6574919.993 4757570.314) 57.638760
2 6575020.518 4757665.839 1177.339 0 POINT (6575020.518 4757665.839) 138.673362
3 6575239.548 4757873.972 1160.156 1 8 4 90 POINT (6575239.548 4757873.972) 302.148120
4 6575351.603 4757980.452 1202.418 0 POINT (6575351.603 4757980.452) 154.577856
5 6575442.780 4758067.093 1199.297 0 POINT (6575442.780 4758067.093) 125.777217
6 6575538.217 4758157.782 1192.914 1 8 4 90 POINT (6575538.217 4758157.782) 131.653772
7 6575594.625 4758240.033 1217.442 0 POINT (6575594.625 4758240.033) 99.735096
8 6575738.820 4758450.289 1174.477 0 POINT (6575738.820 4758450.289) 254.950551
9 6575850.937 4758613.772 1123.852 1 8 4 90 POINT (6575850.937 4758613.772) 198.234490
10 6575984.323 4758647.118 1131.761 0 POINT (6575984.323 4758647.118) 137.491020
11 6576204.312 4758702.115 1119.407 0 POINT (6576204.312 4758702.115) 226.759410
12 6576303.976 4758727.031 1103.064 0 POINT (6576303.976 4758727.031) 102.731300
13 6576591.496 4758798.910 1114.06 0 POINT (6576591.496 4758798.910) 296.368590
14 6576736.965 4758835.277 1120.285 1 8 4 90 POINT (6576736.965 4758835.277) 149.945952
I am trying to group by the zt values and summarize the dist column. I have tried this:
def summarize(group):
    s = group['zt'].eq(1).cumsum()
    return group.groupby(s).agg(
        D=('dist', 'sum')
    )
dfzp=dfgeo.apply(summarize)
But I get the following error on the last line of the code:
s = group['zt'].eq(1).cumsum()
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 871, in __getitem__
result = self.index.get_value(self, key)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 4405, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 135, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index_class_helper.pxi", line 109, in pandas._libs.index.Int64Engine._check_type
KeyError: 'zt'
Any help in resolving this is appreciated.
If you need to pass the whole DataFrame to the function, use:
dfzp=summarize(dfgeo)
Or DataFrame.pipe:
dfzp=dfgeo.pipe(summarize)
If you use DataFrame.apply, the function is called once per column (or once per row if axis=1). That is why summarize receives a single column as a Series, and group['zt'] raises KeyError: 'zt'.
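For completeness, a minimal sketch of the corrected call (the function itself is unchanged from the question):
def summarize(group):
    # each zt == 1 row starts a new segment; cumsum() labels the segments
    s = group['zt'].eq(1).cumsum()
    return group.groupby(s).agg(D=('dist', 'sum'))

dfzp = dfgeo.pipe(summarize)  # equivalent to summarize(dfgeo)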
I have a dataframe column which specifies how many times a user has performed an activity.
e.g.
>>> df['ActivityCount']
Users ActivityCount
User0 220
User1 190
User2 105
User3 109
User4 271
User5 265
...
User95 64
User96 15
User97 168
User98 251
User99 278
Name: ActivityCount, Length: 100, dtype: int32
>>> activities = sorted(df['ActivityCount'].unique())
[9, 15, 16, 17, 20, 23, 25, 26, 28, 31, 33, 34, 36, 38, 39, 43, 49, 57, 59, 64, 65, 71, 76, 77, 78,
83, 88, 94, 95, 100, 105, 109, 110, 111, 115, 116, 117, 120, 132, 137, 138, 139, 140, 141, 144, 145, 148, 153, 155, 157, 162, 168, 177, 180, 182, 186, 190, 192, 194, 197, 203, 212, 213, 220, 223, 231, 232, 238, 240, 244, 247, 251, 255, 258, 260, 265, 268, 269, 271, 272, 276, 278, 282, 283, 285, 290]
According to their ActivityCount, I have to divide users into 5 different categories, e.g. A, B, C, D, and E.
The ActivityCount range varies from time to time. In the above example it's roughly 9-290 (the lowest and highest of the series); it could be 5-500 or 5-30.
In the above example, I can take the maximum number of activities, divide it by 5, and categorize each user into ranges of width 58 (from 290/5), like Range A: 0-58, Range B: 59-116, Range C: 117-174, etc.
Is there any other way to achieve this using pandas or numpy, so that I can directly categorize the column in the given categories?
Expected output: -
>>> df
Users ActivityCount Category/Range
User0 220 D
User1 190 D
User2 105 B
User3 109 B
User4 271 E
User5 265 E
...
User95 64 B
User96 15 A
User97 168 C
User98 251 E
User99 278 E
The natural way to do that would be to split the range of the data into 5 equal-width bins and assign each value to its bin. Luckily, pandas allows you to do that easily:
df["category"] = pd.cut(df.Activity, 5, labels= ["a","b", "c", "d", "e"])
The output is something like:
Activity Category
34 115 b
15 43 a
57 192 d
78 271 e
26 88 b
6 25 a
55 186 d
63 220 d
1 15 a
76 268 e
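If you'd rather have roughly the same number of users in each category than bins of equal width, pd.qcut is the quantile-based counterpart:
# equal-frequency bins instead of equal-width bins
df["category"] = pd.qcut(df.Activity, 5, labels=["a", "b", "c", "d", "e"])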
An alternative view - clustering
In the above method, we've split the data into 5 bins of equal width. An alternative, more sophisticated approach would be to split the data into 5 clusters and aim to have the data points in each cluster be as similar to each other as possible. In machine learning, this is known as a clustering problem.
One classic clustering algorithm is k-means. It's typically used for data with multiple dimensions (e.g. monthly activity, age, gender, etc.). This is, therefore, a very simplistic case of clustering.
In this case, k-means clustering can be done in the following way:
import numpy as np
import pandas as pd
from scipy.cluster.vq import vq, kmeans, whiten

df = pd.DataFrame({"Activity": l})  # l is the list of activity counts shown above
features = np.array([[x] for x in df.Activity], dtype=float)
whitened = whiten(features)                  # scale each feature to unit variance
codebook, distortion = kmeans(whitened, 5)   # find 5 cluster centroids
code, dist = vq(whitened, codebook)          # assign each point to its nearest centroid
df["Category"] = code
And the output looks like:
Activity Category
40 138 1
79 272 0
72 255 0
13 38 3
41 139 1
65 231 0
26 88 2
59 197 4
76 268 0
45 145 1
A couple of notes:
The labels of the categories are arbitrary; in the output above, for example, label '1' refers to higher activity than label '2'.
I didn't migrate the labels from 0-4 to A-E. This can easily be done using pandas' map.
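Since the raw k-means labels are in no particular order, one way (a sketch, not part of the original answer) is to rank the clusters by their centroids first, then map the ranks to letters:
# order cluster ids by centroid value, then map rank -> letter
order = np.argsort(codebook.ravel())
rank = {cluster: i for i, cluster in enumerate(order)}
df["Category"] = df["Category"].map(rank).map(dict(enumerate("ABCDE")))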
Try the below solution:
df['Categ'] = pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'))
It creates the Categ column as the result of dividing ActivityCount
into 5 bins, labelled A through E.
The bin borders are set by dividing the full range into 5 subranges of
equal width.
You can also see the borders of each bin by calling:
pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'), retbins=True)[1]
I have a sample pandas DataFrame:
import pandas as pd
df = {'ID': [73, 68, 1, 94, 42, 22, 28, 70, 47, 46, 17, 19, 56, 33],
      'CloneID': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
      'VGene': ['64D', '64D', '64D', 61, 61, 61, 311, 311, 311, 311, 311, 311, 311, 311]}
df = pd.DataFrame(df)
it looks like this:
df
Out[7]:
CloneID ID VGene
0 1 73 64D
1 1 68 64D
2 1 1 64D
3 1 94 61
4 1 42 61
5 2 22 61
6 2 28 311
7 3 70 311
8 3 47 311
9 3 46 311
10 4 17 311
11 4 19 311
12 4 56 311
13 4 33 311
I want to write a simple script to output each CloneID to a different output file, so in this case there would be 4 different files.
The first file would be named 'CloneID1.txt' and it would look like this:
CloneID ID VGene
1 73 64D
1 68 64D
1 1 64D
1 94 61
1 42 61
The second file would be named 'CloneID2.txt':
CloneID ID VGene
2 22 61
2 28 311
The third file would be named 'CloneID3.txt':
CloneID ID VGene
3 70 311
3 47 311
3 46 311
And the last file would be 'CloneID4.txt':
CloneID ID VGene
4 17 311
4 19 311
4 56 311
4 33 311
The code I found online was:
import pandas as pd
data = pd.read_excel('data.xlsx')
for group_name, data in data.groupby('CloneID'):
    with open('results.csv', 'a') as f:
        data.to_csv(f)
But it outputs everything to one file instead of multiple files.
You can do something like the following:
In [19]:
gp = df.groupby('CloneID')
for g in gp.groups:
    print('CloneID' + str(g) + '.txt')
    print(gp.get_group(g).to_csv())
CloneID1.txt
,CloneID,ID,VGene
0,1,73,64D
1,1,68,64D
2,1,1,64D
3,1,94,61
4,1,42,61
CloneID2.txt
,CloneID,ID,VGene
5,2,22,61
6,2,28,311
CloneID3.txt
,CloneID,ID,VGene
7,3,70,311
8,3,47,311
9,3,46,311
CloneID4.txt
,CloneID,ID,VGene
10,4,17,311
11,4,19,311
12,4,56,311
13,4,33,311
So here we iterate over the groups with for g in gp.groups:, use each group key to build the result file path, and call to_csv on the group. The following should work for you:
gp = df.groupby('CloneID')
for g in gp.groups:
    path = 'CloneID' + str(g) + '.txt'
    gp.get_group(g).to_csv(path)
Actually the following would be even simpler:
gp = df.groupby('CloneID')
gp.apply(lambda x: x.to_csv('CloneID' + str(x.name) + '.txt'))
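One caveat: to_csv writes the row index by default, so the files will contain a leading index column, as in the printed output above. If the files should match the samples in the question (no index column), pass index=False, and sep='\t' if you want tab-separated fields:
gp = df.groupby('CloneID')
gp.apply(lambda x: x.to_csv('CloneID' + str(x.name) + '.txt', sep='\t', index=False))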