I have a data frame with 384 rows (plus an additional dummy row at the beginning).
Each row has 4 variables I entered manually, 3 fields calculated from those 4 variables, and 3 fields that compare each calculated field to the previous row. Each of these fields can take one of two values (essentially True/False).
Final goal: I want to arrange the data frame so that each of the 64 possible combinations of the 6 calculated fields (2^6) occurs exactly 6 times (2^6 * 6 = 384).
Each iteration builds a frequency table (pivot), and if any count differs from 6 it breaks and randomizes the order again.
The problem is that there are 384!-12*6! possible combinations, and my computer has been running the following script for over 4 days without finding a solution.
import pandas as pd
from numpy import random

# a function that determines whether a row is congruent or incongruent
def set_cong(df):
    if df["left"] > df["right"] and df["left_size"] > df["right_size"] or df["left"] < df["right"] and df["left_size"] < df["right_size"]:
        return "Cong"
    else:
        return "InC"

# open the file and calculate the basic fields
DF = pd.read_csv("generator.csv")
DF["distance"] = abs(DF.right - DF.left)
DF["CR"] = DF.left > DF.right
DF["Cong"] = DF.apply(set_cong, axis=1)

again = 1
# main loop: keep trying random orders until a balanced one is found
while again == 1:
    # work on a copy of DF so the file does not have to be reloaded each iteration
    df = DF.copy()
    again = 0
    df["rand"] = random.randint(low=1, high=100000, size=df.shape[0])
    # 3 of the fields are calculated from the previous row, so the first row is a
    # dummy and has to stay first when sorted
    df.loc[0, "rand"] = 0
    Sorted = df.sort_values(['rand'])
    Sorted["Cong_n1"] = Sorted.Cong.eq(Sorted.Cong.shift())
    Sorted["Side_n1"] = Sorted.CR.eq(Sorted.CR.shift())
    Sorted["Dist_n1"] = Sorted.distance.eq(Sorted.distance.shift())
    # here the dummy row is deleted
    Sorted = Sorted.drop(0, axis=0)
    grouped = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', "Side_n1"])
    for name, group in grouped:
        if group.shape[0] != 6:
            again = 1
            break

Sorted.to_csv("Edos.csv", sep="\t", index=False)
print("bye")
The data frame looks like this:
left  right  size_left  size_right  distance  cong  CR  distance_n1  cong_n1  side_n1
1     6      22         44          5         T     F   dummy        dummy    dummy
5     4      44         22          1         T     T   F            T        F
2     3      44         22          1         F     F   T            F        F
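As an aside, a hedged sketch of the balance check itself, reusing the column names from the script above: comparing the group sizes directly would avoid the explicit Python loop over groups.
counts = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', 'Side_n1']).size()
# balanced only if all 64 combinations occur and each occurs exactly 6 times
again = 0 if len(counts) == 64 and (counts == 6).all() else 1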
import pandas as pd
import numpy as np

data_dir = 'data_r14.csv'
data = pd.read_csv(data_dir)
# print(data)
signals = data['signal']
value_counts = signals.value_counts()
buy_count = value_counts[1]
signals_code = [1, 2]
buy_sell_rows = data.loc[data['signal'].isin(signals_code)]
data_without_signals = data[~data['signal'].isin(signals_code)]
random_0_indexes = np.random.choice(data_without_signals.index.values, buy_count)
value_counts2 = data_without_signals['signal'].value_counts()
# print(value_counts2)
for index in random_0_indexes:
    row = data.loc[index, :]
    # df = row.to_frame()
    print(row)
    buy_sell_rows.append(row)
    # print(buy_sell_rows)
    # print(signals.loc[index, :])
# print(random_0_rows)
print(buy_sell_rows)
# print(buy_sell_rows['signal'].value_counts())
So I have a dataframe with a column named signal whose values are 0, 1, or 2, and I want to balance them so there is an equal number of rows for each value, because they are very unbalanced: I have only 1984 rows with non-zero values and over 20000 rows with zero values.
So I created a new dataframe where all the signal values are zero, called data_without_signals, and took a random list of indexes from it. Then I run a loop to fetch each of those rows and append it to another dataframe I created, called buy_sell_rows, which holds only the non-zero rows; but the issue is that no row is actually being appended.
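For reference, a likely cause (sketched here, not a confirmed diagnosis): DataFrame.append returns a new frame rather than modifying buy_sell_rows in place, so the sampled rows need to be reassigned or collected and concatenated once, e.g.:
random_0_rows = data.loc[random_0_indexes]                 # all sampled zero-signal rows at once
buy_sell_rows = pd.concat([buy_sell_rows, random_0_rows])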
As said in my comment, I think your general approach could be simplified by randomly sampling the different signals:
# my test signal of 0s, 1s and 2s
test = pd.DataFrame({"data" : [0,0,0,1,1,1,1,1,1,1,2,2,2,2,2,2]})
# get the lowest size of any group, which is the size all groups should be reduced to
max_size = test.groupby("data")["data"].count().min()
# sample
output = (test
.groupby(["data"])
.agg(sample = ("data", lambda x : x.sample(max_size).to_list()))
.explode("sample")
.reset_index(drop=True)
)
and the output for this test is:
   sample
0       0
1       0
2       0
3       1
4       1
5       1
6       2
7       2
8       2
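As a usage note (assuming pandas 1.1 or later), a similar balanced sample can be drawn more directly with GroupBy.sample, which keeps the original data column instead of an exploded sample column:
output = test.groupby("data").sample(n=max_size).reset_index(drop=True)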
I have a Data Frame which contains a column like this:
pct_change
0 NaN
1 -0.029767
2 0.039884 # period of one
3 -0.026398
4 0.044498 # period of two
5 0.061383 # period of two
6 -0.006618
7 0.028240 # period of one
8 -0.009859
9 -0.012233
10 0.035714 # period of three
11 0.042547 # period of three
12 0.027874 # period of three
13 -0.008823
14 -0.000131
15 0.044907 # period of one
I want to get all the periods where the pct change was positive into a list, so with the example column it will be:
raise_periods = [1,2,1,3,1]
Assuming that the column of your dataframe is a series called y which contains the pct_changes, the following code provides a vectorized solution without loops.
y = df['pct_change']
raise_periods = (y < 0).cumsum()[y > 0]
raise_periods.groupby(raise_periods).count()
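For the example column above this reproduces the desired counts; appending .tolist() gives the list form:
print(raise_periods.groupby(raise_periods).count().tolist())   # [1, 2, 1, 3, 1]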
Eventually, the answer provided by @gioxc88 didn't get me exactly where I wanted, but it did point me in the right direction.
What I ended up doing is this:
import itertools
import numpy as np

def get_rise_avg_period(cls, df):
    df[COMPOUND_DIFF] = df[NEWS_COMPOUND].diff()
    df[CONSECUTIVE_COMPOUND] = df[COMPOUND_DIFF].apply(lambda x: 1 if x > 0 else 0)
    # group together the consecutive periods of rising and falling changes
    unfiltered_periods = [list(group) for key, group in itertools.groupby(df.consecutive_high.values.tolist())]
    # keep only the rising periods
    positive_periods = [li for li in unfiltered_periods if 0 not in li]
I wanted to get the average length of these positive periods, so I added this at the end (defining positive_periods_lens as the list of period lengths):
positive_periods_lens = [len(li) for li in positive_periods]
period = round(np.mean(positive_periods_lens))
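For what it's worth, the same average could presumably also be read straight from the vectorized counts in the earlier answer, e.g.:
period = round(raise_periods.groupby(raise_periods).count().mean())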
This is in continuation of my previous question:
How to fetch the modified rows after comparing 2 versions of same data frame
I am now done with the MODIFICATIONS; however, I am using the method below for finding the INSERTS and DELETES.
It works fine, but it takes a lot of time for a typical CSV file with 10 columns and 10M rows.
For my problem,
INSERTs are the records which are not in the old file but are in the new file.
DELETEs are the records which are in the old file but not in the new file.
Below is the code:
def getInsDel(df_old, df_new, key):
    # concatenating old and new data to generate comparisons
    df = pd.concat([df_new, df_old])
    df = df.reset_index(drop=True)
    # doing a group by to get the frequency of each key
    print('Grouping data for frequency of key...')
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
    df_delta = df.reindex(idx)
    df_delta_freq = df_delta.groupby(key).size().reset_index(name='Freq')
    # filtering data for frequency = 1, since these will be the target records for DELETE and INSERT
    print('Creating data frame to get records with Frequency = 1 ...')
    filter = df_delta_freq['Freq'] == 1
    df_delta_freq_ins_del = df_delta_freq.where(filter)
    # dropping rows with NULL
    df_delta_freq_ins_del = df_delta_freq_ins_del.dropna()
    print('Creating data frames of Inserts and Deletes ...')
    # creating the INSERT dataFrame
    df_ins = pd.merge(df_new,
                      df_delta_freq_ins_del[key],
                      on=key,
                      how='inner'
                      )
    # creating the DELETE dataFrame
    df_del = pd.merge(df_old,
                      df_delta_freq_ins_del[key],
                      on=key,
                      how='inner'
                      )
    print('size of INSERT file: ' + str(df_ins.shape))
    print('size of DELETE file: ' + str(df_del.shape))
    return df_ins, df_del
The group-by section that computes the frequency of each key takes around 80% of the total time, so for my CSV it takes around 12-15 minutes.
Surely there must be a faster approach for doing this?
For your reference, below is my expected result:
For example, Old data is:
ID Name X Y
1 ABC 1 2
2 DEF 2 3
3 HIJ 3 4
and new data set is:
ID Name X Y
2 DEF 2 3
3 HIJ 55 42
4 KLM 4 5
Where ID is the Key.
Insert_DataFrame should be:
ID Name X Y
4 KLM 4 5
Deleted_DataFrame should be:
ID Name X Y
1 ABC 1 2
To be deleted:
delete = pd.merge(old, new, how='left', on='ID', indicator=True)
delete = delete.loc[delete['_merge'] == 'left_only']
delete.dropna(axis=1, inplace=True)
To be inserted:
insert = pd.merge(new, old, how='left', on='ID', indicator=True)
insert = insert.loc[insert['_merge'] == 'left_only']
insert.dropna(axis=1, inplace=True)
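A hedged alternative sketch, assuming ID alone identifies a record in each frame: plain isin masks select the same rows without introducing merge suffixes:
insert = new[~new['ID'].isin(old['ID'])]   # IDs only in the new file
delete = old[~old['ID'].isin(new['ID'])]   # IDs only in the old file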
I have the following code, which reads a CSV file and then analyzes it. One patient can have more than one illness, and I need to find how many times each illness appears across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 mins. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also which have a specific patient id. If you have a lot of patients, you will need to do this query a lot of times. A simpler way to do it would be to not filter on the patient id and then take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).
Finally another source of error in your current code was the fact that you were not keeping the count for every patient id. You needed to increment illnesses.at[index, "Count_of_Patientes_Having"] not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
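If Count_of_Patientes_Having is meant to count distinct patients rather than rows, one hedged variant (same filter, counting unique IDs) would be:
illnesses.at[index, "Count_of_Patientes_Having"] = raw_data.loc[raw_data['Finding Labels'].str.contains(ctr), 'Patient ID'].nunique()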
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve repeated in-memory copying. This is usually a sign of general-purpose Python rather than Pandas programming, which ideally handles data in blockwise, vectorized operations.
Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses to line up Finding Labels to each illness in same row to be filtered if longer string contains shorter item. Then, run a couple of groupby() to return the count and distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
.merge(pd.DataFrame({'ills':ills, 'key':1}), on='key')
.drop(columns=['key'])
)
# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]
# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
"Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
"Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
'Finding Labels': np.core.defchararray.add(
np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
np.random.choice(ills, 25).astype('str')),
np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
})
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2
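As a final hedged sketch, assuming the real Finding Labels column is pipe-delimited as in the question's sample (the seeded demo above embeds labels inside random strings instead), str.get_dummies would give both counts without a cross join:
dummies = raw_data['Finding Labels'].str.get_dummies(sep='|')
image_counts = dummies.sum()                                           # images per illness
patient_counts = dummies.groupby(raw_data['Patient ID']).max().sum()   # distinct patients per illness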
I'm finding my way around GroupBy, but I still need some help. Let's say I have a DataFrame with a column Group giving each object's group number, some parameter R, and spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
    'R': (-21.0, -21.5, -22.1, -23.7, -23.8, -20.4, -21.8, -19.3, -22.5, -24.7, -19.9),
    'RA': (154.362789, 154.409301, 154.419191, 154.474165, 154.424842, 162.568516, 8.355454, 8.346812, 8.728223, 8.759622, 8.799796),
    'Dec': (-0.495605, -0.453085, -0.481657, -0.614827, -0.584243, 8.214719, 8.355454, 8.346812, 8.728223, 8.759622, 8.799796),
    'Group': (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), and the 3 closest objects of the group (so I keep 4 objects in each group; we can assume that no group has fewer than 4 objects if needed).
We assume here that we have defined the following functions:
# deg to rad
def d2r(x):
    return x * np.pi / 180.0

# rad to deg
def r2d(x):
    return x * 180.0 / np.pi

# computes separation on a sphere
def calc_sep(phi1, theta1, phi2, theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1))
and that separation between two objects is given by r2d(calc_sep(RA1,Dec1,RA2,Dec2)), with RA1 as RA for the first object, and so on.
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glazes over a lot of details, see here for some details on internals if you're interested.)
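For instance, a minimal sketch of that iteration, using the mock df above:
for group_id, sub_df in df.groupby('Group'):
    print(group_id, sub_df.shape)   # each sub_df holds only the rows of that group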
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    min_r = df.loc[df.R.idxmin()]   # idxmin gives the row label of the minimum R
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    idx = sep.nsmallest(keep+1).index
    return df.loc[idx]
Then just apply and you get a MultiIndex DataFrame where the first index level is the group.
print(df.groupby('Group').apply(sep_df))
Dec Group R RA
Group
1 3 -0.61483 1 -23.7 154.47416
2 -0.48166 1 -22.1 154.41919
0 -0.49561 1 -21.0 154.36279
4 -0.58424 1 -23.8 154.42484
2 8 8.72822 2 -22.5 8.72822
10 8.79980 2 -19.9 8.79980
6 8.35545 2 -21.8 8.35545
9 8.75962 2 -24.7 8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-DataFrame (this is what GroupBy does under the hood)
    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.idxmin()]    # Series - row at which R is minimum (label-based lookup)
    RA1, Dec1 = min_r.RA, min_r.Dec  # Relevant 2 scalars within this row
    # Calculate separation for each pair including the minimum R row
    # The result is a series of separations, same length as `df`
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    # Get index values of `keep` (default 3) smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row where separation=0
    idx = sep.nsmallest(keep+1).index
    # Restrict the result to those 3 index labels + your minimum R row
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
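For example, the apply call above would then become (a one-line sketch):
print(df.groupby('Group', sort=False).apply(sep_df))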
I want to build a selection containing for each group the "brightest" object...and the 3 closest objects of the group
step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')[['Group', 'Dec', 'RA']].head(1)
step 2:
merge the two frames on Group & calculate the separation
merged = df.merge(maxR, on = 'Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
axis=1
)
step 3:
order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group
finaldf = (merged.sort_values(['Group', 'sep'], ascending=[True, True])
                 .groupby('Group')[df.columns]
                 .head(4))
Produces the following data frame with your sample data:
Dec Group R RA
4 -0.584243 1 -23.8 154.424842
3 -0.614827 1 -23.7 154.474165
2 -0.481657 1 -22.1 154.419191
0 -0.495605 1 -21.0 154.362789
9 8.759622 2 -24.7 8.759622
8 8.728223 2 -22.5 8.728223
10 8.799796 2 -19.9 8.799796
6 8.355454 2 -21.8 8.355454