Remove groups of rows based on a condition on the rows - Python

I have a dataframe which has two columns, 'Group' and 'Sample_Number'.
Within each group the sample number '11' is unique: each group has exactly one '11', followed by sample numbers in the range 21 to 29 (for example 21, 22, 23, 24, 25, 26, 27, 28, 29), and then by sample numbers in the range 31 to 39 (for example 31, 32, 33, 34, 35, 36, 37, 38, 39). Hence each group should have exactly one '11' sample number, at least one sample number in the range 21 to 29, and at least one sample number in the range 31 to 39.
I wish to write code that goes through each group and:
1) checks whether there is a sample number 11 in the group;
2) checks whether there is at least one sample number in the range 21 to 29;
3) checks whether there is at least one sample number in the range 31 to 39.
If any of these three conditions is not met, the code removes the entire group from the dataframe.
Below is the dataframe in table format:
Group  Sample_Number
Z007   11
Z007   21
Z007   22
Z007   23
Z007   31
Z007   32
Z008   11
Z008   31
Z008   32
Z008   33
Z009   11
Z009   21
Z009   22
Z009   23
Z010   21
Z010   22
Z010   23
Z010   24
Z010   31
Z010   32
Z010   33
Z010   34
df = pd.DataFrame([['Z007', 11], ['Z007', 21], ['Z007', 22], ['Z007', 23], ['Z007', 31], ['Z007', 32], ['Z008', 11], ['Z008', 31], ['Z008', 32], ['Z008', 33], ['Z009', 11], ['Z009', 21], ['Z009', 22], ['Z009', 23], ['Z010', 21], ['Z010', 22], ['Z010', 23], ['Z010', 24], ['Z010', 31], ['Z010', 32], ['Z010', 33], ['Z010', 34]], columns=['Group', 'Sample_Number'])
The code should remove group 'Z008' as it does not have a sample number in the range 21 to 29. It should remove group 'Z009' as it does not have a sample number in the range 31 to 39. It should also remove group 'Z010' as it does not have the sample number '11'.
Expected answer is below:
Group  Sample_Number
Z007   11
Z007   21
Z007   22
Z007   23
Z007   31
Z007   32
I could do it only for sample number 11, but I am struggling to do the same for the other sample numbers in the ranges 21 to 29 and 31 to 39. Below is the code for sample number 11:
invalid_group_no = [i for i in df['Group'].unique() if
                    df[df['Group'] == i]['Sample_Number'].to_list().count(11) != 1]
Can anyone please help me with the other sample numbers? Please feel free to implement your own ways. Any help is appreciated.

Try this:
# Keep only groups that have an 11, at least one number in 21-29,
# and at least one number in 31-39.
groups = (set(df['Group'][df['Sample_Number'] == 11])
          & set(df['Group'][df['Sample_Number'].isin(range(21, 30))])
          & set(df['Group'][df['Sample_Number'].isin(range(31, 40))]))
df = df[df['Group'].isin(groups)]
Group Sample_Number
0 Z007 11
1 Z007 21
2 Z007 22
3 Z007 23
4 Z007 31
5 Z007 32
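An alternative sketch using groupby/filter, which also enforces that 11 appears exactly once per group, as the question describes (is_valid is an illustrative name, not part of the original answer):

def is_valid(g):
    s = g['Sample_Number']
    return ((s == 11).sum() == 1
            and s.between(21, 29).any()
            and s.between(31, 39).any())

df = df.groupby('Group').filter(is_valid)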

Related

Pandas dataframe Plotly line chart with two lines

I have a pandas dataframe as below and I would like to produce a few charts with the data. The 'Acc' column holds the account names, the 'User' column is the number of users under each account, and the month columns are the login counts of each account in each month.
Acc User Jan Feb Mar Apr May June
Nora 39 5 13 16 22 14 20
Bianca 53 14 31 22 21 20 29
Anna 65 30 17 18 28 12 13
Katie 46 9 12 30 34 25 15
Melissa 29 29 12 30 10 4 9
1st: I would like to monitor the trend of logins from January to May. One line illustrates Bianca's login and the other line illustrates everyone else's login.
2nd: I would like to monitor the percentage change of logins from January to May. One line illustrates Bianca's login percentage change and the other line illustrates everyone else's login percentage change.
Thank you for your time and assistance. I'm a beginner at this. I appreciate any help on this! Much appreciated!!
I suggest the best approach to grouping is to use categoricals. pct_change is not a direct aggregate function, so it's a bit more involved to get it.
import io
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv(io.StringIO("""Acc User Jan Feb Mar Apr May June
Nora 39 5 13 16 22 14 20
Bianca 53 14 31 22 21 20 29
Anna 65 30 17 18 28 12 13
Katie 46 9 12 30 34 25 15
Melissa 29 29 12 30 10 4 9"""), sep=r"\s+")
# just set up 2 plot areas
fig, ax = plt.subplots(1, 2, figsize=[20, 5])
# we want to divide the data into 2 groups
df["grp"] = pd.Categorical(df["Acc"], ["Bianca", "Others"])
df["grp"].fillna("Others", inplace=True)
# just get the User column out of the way...
df.drop(columns="User", inplace=True)
# simple plot where an aggregate function exists directly; no transform needed to get the lines
df.groupby("grp").sum().T.plot(ax=ax[0])
# a bit more sophisticated to get pct change...
df.groupby("grp").sum().T.assign(
    Bianca=lambda x: x["Bianca"].pct_change().fillna(0) * 100,
    Others=lambda x: x["Others"].pct_change().fillna(0) * 100,
).plot(ax=ax[1])
output: two line charts (login totals on the left, percentage change on the right)
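Since the question asks for Plotly specifically, here is a minimal sketch of the first chart with plotly.express (assumed to be installed; restricting the sum to the month columns is an assumption to keep the aggregation numeric):

import plotly.express as px

months = ["Jan", "Feb", "Mar", "Apr", "May", "June"]
# same grouped totals as above, restricted to the numeric month columns
totals = df.groupby("grp")[months].sum().T.reset_index().rename(columns={"index": "Month"})
fig = px.line(totals, x="Month", y=["Bianca", "Others"])
fig.show()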

Python: Predicting series of numbers without INPUT to a NN

I have a random list of series (integers) along with dates in a csv like:
1/1/2019,34 44 57 62 70
12/28/2018,09 10 25 37 38
12/25/2018,02 08 42 43 50
12/21/2018,10 13 61 62 70
12/18/2018,13 22 32 60 69
12/14/2018,05 22 26 43 49
12/11/2018,04 38 39 54 59
12/7/2018,04 10 20 33 57
12/4/2018,28 31 41 42 50
The list goes all the way back to the year 1997, 2336 rows in total. What I am trying to do is predict the next series (or get as close as possible) based on these data.
What have I tried?
The approach that I've used so far (e.g. for 1/1/2019,34 44 57 62 70) is:
1) Get the occurrence of each number in the list, i.e. the number 34 has occurred 170 times out of the total list (2336).
2) Find the percentage at which each number has occurred, i.e.
Perc/Chances(34) = Occurrence / TotalNo.
Chances(34) = 170 / 2336
Chances(34) = 0.072, roughly 7%
One way to get a list would be to just pick the 5 numbers with the lowest percentages, but that won't be very effective.
On the other hand, I now have data which contains each number, its percentage and its occurrence. Is there any way I can somehow train a neural network that predicts the next series, or the closest one?
Hierarchy:
Where comp_data.csv contains data like:
1/1/2019,34 44 57 62 70
12/28/2018,09 10 25 37 38
12/25/2018,02 08 42 43 50
12/21/2018,10 13 61 62 70
12/18/2018,13 22 32 60 69
12/14/2018,05 22 26 43 49
12/11/2018,04 38 39 54 59
12/7/2018,04 10 20 33 57
12/4/2018,28 31 41 42 50
and occurrence.csv contains:
34,170
44,197
57,36
62,38
70,37
09,186
10,210
25,197
37,185
38,206
02,217
08,185
and report.csv contains the number, occurrence and its percentage:
34,3,11
44,1,03
57,5,19
62,5,19
70,5,19
09,1,03
10,5,19
25,2,07
37,3,11
38,2,07
02,1,03
08,2,07
So I have the list of series, their occurrences over a period of time, and the percentages. Is there any way I can create a NN that takes some inputs, trains over the data, and predicts the output (a series in this case)?
The Problem:
Which ones would be the input, given that it is a purely random problem? P.S. I cannot provide any input since I need a series without input. Perhaps an LSTM network for regression?
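No answer was posted in the thread, but one common way to frame a series-only problem is a sliding window: the previous k draws are the input and the next draw is the regression target. Below is a minimal, hypothetical Keras sketch of that framing (the window size, the scaling by an assumed maximum number of 70, and the file layout are all illustrative assumptions; for genuinely random draws no model will beat chance):

import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# parse the csv: one date column, then five space-separated numbers
raw = pd.read_csv('comp_data.csv', header=None, names=['date', 'numbers'])
draws = np.array([[int(n) for n in row.split()] for row in raw['numbers']], dtype=float)
draws = draws[::-1] / 70.0  # oldest draw first, scaled to [0, 1] (70 assumed to be the max number)

window = 10  # each sample: the 10 previous draws
X = np.stack([draws[i:i + window] for i in range(len(draws) - window)])
y = draws[window:]

model = Sequential([
    LSTM(32, input_shape=(window, 5)),
    Dense(5),  # regression output: the next 5 numbers, scaled
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=10, batch_size=32)

# predict the next draw from the most recent window
next_draw = model.predict(draws[-window:][None, ...]) * 70.0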

Break Existing Dataframe Apart Based on Multi Index

I have an existing dataframe that is sorted like this:
In [3]: result_GB_daily_average
Out[3]:
NREL Avert
Month Day
1 1 14.718417 37.250000
2 40.381167 45.250000
3 42.512646 40.666667
4 12.166896 31.583333
5 14.583208 50.416667
6 34.238000 45.333333
7 45.581229 29.125000
8 60.548479 27.916667
9 48.061583 34.041667
10 20.606958 37.583333
11 5.418833 70.833333
12 51.261375 43.208333
13 21.796771 42.541667
14 27.118979 41.958333
15 8.230542 43.625000
16 14.233958 48.708333
17 28.345875 51.125000
18 43.896375 55.500000
19 95.800542 44.500000
20 53.763104 39.958333
21 26.171437 50.958333
22 20.372688 66.916667
23 20.594042 42.541667
24 16.889083 48.083333
25 16.416479 42.125000
26 28.459625 40.125000
27 1.055229 49.833333
28 36.798792 42.791667
29 27.260083 47.041667
30 23.584917 55.750000
... ... ...
12 2 34.491604 55.916667
3 26.444333 53.458333
4 15.088333 45.000000
5 10.213500 32.083333
6 19.087688 17.000000
7 23.078292 17.375000
8 41.523667 29.458333
9 17.173854 37.833333
10 11.488687 52.541667
11 15.203479 30.000000
12 8.390917 37.666667
13 70.067062 23.458333
14 24.281729 25.583333
15 31.826104 33.458333
16 5.085271 42.916667
17 3.778229 46.916667
18 31.276958 57.625000
19 7.399458 46.916667
20 18.531958 39.291667
21 26.831937 35.958333
22 55.514000 32.375000
23 24.018875 34.041667
24 54.454125 43.083333
25 57.379812 25.250000
26 94.520833 33.958333
27 49.693854 27.500000
28 2.406438 46.916667
29 7.133833 53.916667
30 7.829167 51.500000
31 5.584646 55.791667
I would like to split this dataframe apart into 12 different dataframes, one for each month, but the problem is that they are all slightly different lengths because the number of days in a month varies, so attempts at using np.array_split have failed. How can I split this based on the Month index?
One solution:
df = result_GB_daily_average
[df.iloc[df.index.get_level_values('Month') == i + 1] for i in range(12)]
or, shorter (.ix has since been removed from pandas; with a modern version use .loc with the month labels):
[df.loc[m] for m in range(1, 13)]
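Another minimal sketch, not from the original answers, that splits on the index level with groupby and avoids hard-coding the number of months:

# one sub-frame per month, keyed by the Month label
monthly = {month: sub.droplevel('Month') for month, sub in df.groupby(level='Month')}
monthly[1]  # the January sub-frame, indexed by Day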

Efficient way to iterate through a large dataframe

I have a csv file that contains several thousand records of company stock data. It contains the following integer fields:
low_price, high_price, volume_traded
10, 20, 45667
15, 22, 256565
41, 47, 45645
30, 39, 547343
My requirement is to create a new csv file from this data by accumulating the volume_traded at every price level (from low to high). The final result would just be two columns as follows:
price, total_volume_traded
10, 45667
11, 45667
12, 45667
....
....
15, 302232
etc
In other words the final csv contains one record for every price level (not just the high/low but also the prices in-between), along with the total amount of volume_traded at that price level.
I have got this working, however it is terribly slow and inefficient. I'm sure there must be better ways of accomplishing this.
Basically what I've done is use nested loops:
First iterate through each row.
On each row, create a nested loop to iterate through the price range from low_price to high_price.
Check if the price already exists in the new dataframe; if so, add the current volume_traded to it. If it doesn't exist, append the price and volume (i.e. create a new row).
Below is some of the relevant code. I would be grateful if anyone could advise a better way of doing this in terms of efficiency/speed:
df_existing = ...  # dataframe created from the existing csv
df_new = ...       # dataframe for the new Price/Volume values

def accumulate_volume(df_new, price, volume):
    # If the price level already exists, add the volume to it
    if df_new['Price'].loc[df_new['Price'] == price].count() > 0:
        df_new['Volume'].loc[df_new['Price'] == price] += volume
        return df_new
    else:
        # first occurrence of this price level, add a new row
        tmp = {'Price': int(price), 'Volume': volume}
        return df_new.append(tmp, ignore_index=True)

for index, row in df_existing.iterrows():
    price = row['low_price']
    for i in range(row['low_price'], row['high_price'] + 1):
        volume = row['volume_traded']
        df_new = accumulate_volume(df_new, price, volume)
        price += 1

# once the above finishes, df_new is written to the new csv file
My guess for why this is so slow is at least partly because 'append' creates a new object every time it's called, and it gets called a LOT. In total, the nested loop from the above code gets run 1595653 times.
I would be very grateful for any assistance.
Let's forget for a moment about potential issues with the methodology (think about how your results would look if 100k shares traded at a price of 50-51 and 100k traded at 50-59).
Below are a set of commented steps that should achieve your goal:
import pandas as pd

# Initialize DataFrame.
df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

# Initialize a price dictionary spanning the range of potential prices.
d = {price: 0 for price in range(min(df.low), max(df.high) + 1)}

# Create a helper function to add volume to a given price bucket.
def add_volume(price_dict, price, volume):
    price_dict[price] += volume

# Use a nested list comprehension to call the function and populate the dictionary.
_ = [[add_volume(d, price, volume) for price in range(low, high + 1)]
     for low, high, volume in zip(df.low, df.high, df.volume)]

# Convert the dictionary to a DataFrame and output to csv.
idx = pd.Index(list(d.keys()), name='price')
df = pd.DataFrame(list(d.values()), index=idx, columns=['total_volume_traded'])
df.to_csv('output.csv')
>>> df
total_volume_traded
price
10 45667
11 45667
12 45667
13 45667
14 45667
15 302232
16 302232
17 302232
18 302232
19 302232
20 302232
21 256565
22 256565
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 0
41 45645
42 45645
43 45645
44 45645
45 45645
46 45645
47 45645
I would first groupby the 'low_price' column, then sum up the volume_traded and reset the index. This effectively accumulates all the prices of interest. Then sort by price, which makes the prices monotonic so that we can use them as the index. After setting the index we can call reindex with a new index and fill the missing values using method='pad':
In [33]:
temp="""low_price,high_price,volume_traded
10,20,45667
15,22,256565
41,47,45645
10,20,12345
30,39,547343"""
df = pd.read_csv(io.StringIO(temp))
df
Out[33]:
low_price high_price volume_traded
0 10 20 45667
1 15 22 256565
2 41 47 45645
3 10 20 12345
4 30 39 547343
In [34]:
df1 = df.groupby('low_price')['volume_traded'].sum().reset_index()
df1
Out[34]:
low_price volume_traded
0 10 58012
1 15 256565
2 30 547343
3 41 45645
In [36]:
df1.sort_values(['low_price']).set_index(['low_price']).reindex(index=np.arange(df1['low_price'].min(), df1['low_price'].max() + 1), method='pad')
Out[36]:
volume_traded
low_price
10 58012
11 58012
12 58012
13 58012
14 58012
15 256565
16 256565
17 256565
18 256565
19 256565
20 256565
21 256565
22 256565
23 256565
24 256565
25 256565
26 256565
27 256565
28 256565
29 256565
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 547343
41 45645

Finding average value in a single data frame using pandas

import pandas as pd

l = []
url = "ftp://ftp.cpc.ncep.noaa.gov/htdocs/degree_days/weighted/daily_data/{}/Population.Heating.txt"
for i in range(1981, 2018):
    df = pd.read_csv(url.format(i), sep=r'\|', skiprows=3, engine='python')
    l.append(df.loc[9])
print(pd.concat(l))
pd.concat(l)
Region CONUS
19810101 28
19810102 29
19810103 33
19810104 37
19810105 38
19810106 33
19810107 31
19810108 36
19810109 37
19810110 36
...
20171227 37
20171228 38
20171229 35
20171230 34
20171231 40
Name: 9, Length: 13551, dtype: object
>>>
This code gives the temperature from 1981 to 2017, and I am trying to find the average value of each month.
pd.concat(l).mean() didn't work....
Can anyone help me with this issue? Thank you!
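No answer was posted; here is a minimal sketch of one way to do it, assuming the concatenated series looks like the output above (YYYYMMDD labels as the index, plus a stray 'Region' label from each year's file that has to be dropped first):

s = pd.concat(l)
s = s[s.index != 'Region']                  # drop the 'Region'/'CONUS' entries
s.index = pd.to_datetime(s.index, format='%Y%m%d')
s = pd.to_numeric(s)
print(s.groupby(s.index.month).mean())      # average per calendar month across all years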
