I'm writing Python code for data analysis. I would like to mark, in a new column, the rows that share the same values in the EMP, RAZAO and ATRIB columns and whose MONTCALC values sum to zero. For example:
Example of the data
In this image, the rows highlighted in the same color form a subgroup, and the MONTCALC values of each subgroup add up to 0.
My code:
conciliation_df_temp = conciliation_df.copy()
doc_clear = 1
for i in conciliation_df_temp.index:
    if conciliation_df_temp.loc[i,'DOC_COMP'] == "":
        company = conciliation_df_temp.loc[i,'EMP']
        gl_account = conciliation_df_temp.loc[i,'RAZAO']
        assignment = conciliation_df_temp.loc[i,'ATRIB']
        df_temp = conciliation_df_temp.loc[(conciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment)]
        if round(df_temp['MONTCALC'].sum(),2) == 0:
            conciliation_df_temp.loc[(conciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment),'DOC_COMP'] = doc_clear
            doc_clear += 1
Performance on a small input (10,000 lines) is good: the script runs in under 1 minute, and that minute also includes reading a text file, handling the file and converting it to a DataFrame. But with a text file of more than 1 million lines the script does not finish; I waited 5 hours without a result.
What can I do to improve the performance of this code?
Regards!!
Sorry for my English.
I tried deleting rows from the DataFrame to shrink it and make the search faster, but the execution got slower.
It seems like you could just check if the sum of the group is zero:
import pandas as pd
df = pd.DataFrame([
[3000,1131500040,8701731701,-156002.08],
[3000,1131500040,8701731701, 156002.08],
[3000,1131500040,"EA-17012.2.22", -3990],
[3000,1131500040,"EA-17012.2.22", 400],
[3000,1131500040,"000100000103", -35822.86],
[3000,1131500040,"000100000103", 35822.86],
[3000,1131500040,"000100000103", -35822.86],
[3000,1131500040,"000100000103", 35822.86]
], columns=['EMP','RAZAO','ATRIB','MONTCALC']
)
df['zero'] = df.groupby(['EMP','RAZAO','ATRIB'])['MONTCALC'].transform(lambda x: sum(x)==0)
print(df)
Output
    EMP       RAZAO          ATRIB    MONTCALC   zero
0  3000  1131500040     8701731701  -156002.08   True
1  3000  1131500040     8701731701   156002.08   True
2  3000  1131500040  EA-17012.2.22    -3990.00  False
3  3000  1131500040  EA-17012.2.22      400.00  False
4  3000  1131500040   000100000103   -35822.86   True
5  3000  1131500040   000100000103    35822.86   True
6  3000  1131500040   000100000103   -35822.86   True
7  3000  1131500040   000100000103    35822.86   True
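If the goal is also to number each zero-sum group the way doc_clear does in the question, the same groupby idea can probably be extended. The following is only a sketch under some assumptions: every row is a candidate (the question's check for an empty DOC_COMP is left out), the sum is rounded to 2 decimals as in the original loop, and 0 is used here to mean "not cleared".

```python
import pandas as pd

df = pd.DataFrame(
    [
        [3000, 1131500040, "8701731701", -156002.08],
        [3000, 1131500040, "8701731701", 156002.08],
        [3000, 1131500040, "EA-17012.2.22", -3990.0],
        [3000, 1131500040, "EA-17012.2.22", 400.0],
        [3000, 1131500040, "000100000103", -35822.86],
        [3000, 1131500040, "000100000103", 35822.86],
    ],
    columns=["EMP", "RAZAO", "ATRIB", "MONTCALC"],
)
keys = ["EMP", "RAZAO", "ATRIB"]

# Sum of MONTCALC per group, broadcast back to every row of the group.
group_sum = df.groupby(keys)["MONTCALC"].transform("sum")
# Round before comparing, as the original loop does, to absorb float noise.
is_cleared = group_sum.round(2) == 0

# Integer id per group, numbered in order of first appearance.
group_id = df.groupby(keys, sort=False).ngroup()
# Renumber only the cleared groups; factorize gives missing values the code -1,
# so adding 1 maps uncleared rows to 0 and cleared groups to 1, 2, ...
codes, _ = pd.factorize(group_id.where(is_cleared))
df["DOC_COMP"] = codes + 1

print(df)
```

This runs a fixed number of passes over the frame instead of one filter per row, which is where the original loop spends its time.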
I want to extract the first two symbols of a string when its first three symbols match a certain pattern: each of the first two symbols must be one of the characters inside the brackets [ptkbdgG_fvsSxzZhmnNJlrwj], and the third symbol must be one of the characters inside the brackets [IEAOYye|aouKLM#)3*<!(#0~q^LMOEK].
The first two statements work correctly.
The last statements do not work, and I do not understand why. The code doesn't raise any errors; it just does nothing for them.
# extract the first three symbols and save them in a new column
df['first_three_symbols'] = df['ITEM'].str[0:3]
# create a boolean column indicating whether the first three symbols match the pattern
df["ccv"] = df["first_three_symbols"].str.contains('[ptkbdgG_fvsSxzZhmnNJlrwj][ptkbdgG_fvsSxzZhmnNJlrwj][IEAOYye|aouKLM#)3*<!(#0~q^LMOEK]')
# create another column for the True values in the previous column
if df["ccv"].item == True:
    df['first_two_symbols'] = df["ITEM"].str[0:2]
Here is my output:
ID ITEM FREQ first_three_symbols ccv
0 0 a 563 a False
1 1 OlrMndmEn 1 Olr False
2 2 OlrMndSpOrtl#r 0 Olr False
3 3 AG#l 74 AG# False
4 4 AG#lbMm 24 AG# False
... ... ... ... ... ...
51723 51723 zytzWt# 8 zyt False
51724 51724 zytzytOst 0 zyt False
51725 51725 zYxtIx 5 zYx False
51726 51726 zYxtIxkWt 0 zYx False
51727 51727 zyZe 4 zyZ False
[51728 rows x 5 columns]
You can either write a function and use the apply method:
def f(row):
    if row["ccv"] == True:
        # row["ITEM"] is a plain string here, so slice it directly
        return row["ITEM"][0:2]
    else:
        return None
df['first_two_symbols'] = df.apply(f,axis=1)
or you can use the np.where function from the numpy package.
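A minimal sketch of the np.where variant, using a few made-up ITEM values. It assumes the pattern should be anchored at the start of the string, which str.match does, so the intermediate first_three_symbols column is not needed here.

```python
import numpy as np
import pandas as pd

# Made-up sample values, only for illustration.
df = pd.DataFrame({"ITEM": ["ptA", "a", "OlrMndmEn", "zyt", "kbE|x"]})

# Two characters from the first class, then one from the second class.
pattern = r"[ptkbdgG_fvsSxzZhmnNJlrwj]{2}[IEAOYye|aouKLM#)3*<!(#0~q^LMOEK]"

# str.match only matches at the beginning of each string.
df["ccv"] = df["ITEM"].str.match(pattern)

# Keep the first two symbols where the pattern matched, otherwise leave the cell empty.
df["first_two_symbols"] = np.where(df["ccv"], df["ITEM"].str[:2], None)

print(df)
```

The original `if df["ccv"].item == True:` compares a bound method with True, which is always False, so the assignment inside it never runs. That would explain why no error appears.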
I have the following code, which reads a CSV file and then analyzes it. A patient can have more than one illness, and I need to find how many times each illness appears across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 minutes. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label":[],
                          "Count_of_Patientes_Having":[],
                          "Count_of_Times_Being_Shown_In_An_Image":[]})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also which have a specific patient id. If you have a lot of patients, you will need to do this query a lot of times. A simpler way to do it would be to not filter on the patient id and then take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).
Finally another source of error in your current code was the fact that you were not keeping the count for every patient id. You needed to increment illnesses.at[index, "Count_of_Patientes_Having"] not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
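If "Count_of_Patientes_Having" is meant to be the number of distinct patients with the illness (an assumption on my part, since the question does not spell it out), one mask per illness is enough and nunique can do the patient count. A small sketch with made-up rows in the "IllnessA|IllnessB" format from the question:

```python
import pandas as pd

# Made-up rows; the real data comes from the CSV in the question.
raw_data = pd.DataFrame({
    "Finding Labels": ["Cardiomegaly|Effusion", "Cardiomegaly", "No Finding", "Effusion"],
    "Patient ID": [1, 1, 2, 3],
})
labels = ["Cardiomegaly", "Effusion", "No Finding"]

rows = []
for label in labels:
    # One boolean mask per illness, reused for both counts.
    mask = raw_data["Finding Labels"].str.contains(label, regex=False)
    rows.append({
        "Finding_Label": label,
        # Number of rows (images) whose labels mention the illness.
        "Count_of_Times_Being_Shown_In_An_Image": int(mask.sum()),
        # Number of distinct patients among those rows.
        "Count_of_Patients_Having": raw_data.loc[mask, "Patient ID"].nunique(),
    })

illnesses = pd.DataFrame(rows)
print(illnesses)
```

Building the result from a list of dicts also avoids growing the DataFrame cell by cell with .at.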
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve repeated in-memory copying. That style is typical of general-purpose Python rather than pandas programming, which ideally handles data in blockwise, vectorized operations.
Consider a cross join of your data (assuming a reasonable data size) with the list of illnesses, so that every Finding Labels value is lined up with each illness in the same row. Then keep only the rows where the longer string contains the shorter item, and run a couple of groupby() calls to return the count and the distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills':ills, 'key':1}), on='key')
                    .drop(columns=['key'])
           )
# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]
# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.char.add(
                             np.char.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                         np.random.choice(ills, 25).astype('str')),
                             np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                        })
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2
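When the labels are cleanly separated by "|" as in the question's sample ("IllnessA|IllnessB"), the cross join can possibly be skipped by splitting and exploding the column instead. This is only a sketch of that alternative, with made-up rows, and it assumes every label is an exact illness name with no surrounding text:

```python
import pandas as pd

# Made-up rows in the "IllnessA|IllnessB" format from the question.
raw_data = pd.DataFrame({
    "Patient ID": [1, 1, 2, 3],
    "Finding Labels": ["Cardiomegaly|Effusion", "Cardiomegaly", "No Finding", "Effusion"],
})

# One row per (image, illness) pair: split on the literal "|" and explode the lists.
long = (raw_data.assign(ills=raw_data["Finding Labels"].str.split("|", regex=False))
                .explode("ills"))

# Row count and distinct patient count per illness in a single groupby.
illnesses = long.groupby("ills").agg(
    Count_of_Times_Being_Shown_In_An_Image=("Patient ID", "size"),
    Count_of_Patients_Having=("Patient ID", "nunique"),
)
print(illnesses)
```

The exploded frame has one row per label occurrence, so it stays much smaller than a cross join against all 15 illnesses.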
I have a pandas dataframe with approximately 3 million rows.
I want to partially aggregate the last column, in separate blocks determined by another variable.
My solution was to separate the dataframe rows into a list of new dataframes based on that variable, aggregate the dataframes, and then join them again into a single dataframe. The problem is that after a few tens of thousands of rows, I get a memory error. What methods can I use to improve the efficiency of my function to prevent these memory errors?
An example of my code is below
test = pd.DataFrame({"unneeded_var": [6,6,6,4,2,6,9,2,3,3,1,4,1,5,9],
                     "year": [0,0,0,0,1,1,1,2,2,2,2,3,3,3,3],
                     "month" : [0,0,0,0,1,1,1,2,2,2,3,3,3,4,4],
                     "day" : [0,0,0,1,1,1,2,2,2,2,3,3,4,4,5],
                     "day_count" : [7,4,3,2,1,5,4,2,3,2,5,3,2,1,3]})
test = test[["year", "month", "day", "day_count"]]
def agg_multiple(df, labels, aggvar, repl=None):
    if(repl is None): repl = aggvar
    conds = df.duplicated(labels).tolist() #returns boolean list of false for a unique (year,month) then true until next unique pair
    groups = []
    start = 0
    for i in range(len(conds)): #When false, split previous to new df, aggregate count
        bul = conds[i]
        if(i == len(conds) - 1): i +=1 #no false marking end of last group, special case
        if not bul and i > 0 or bul and i == len(conds):
            sample = df.iloc[start:i , :]
            start = i
            sample = sample.groupby(labels, as_index=False).agg({aggvar:sum}).rename(columns={aggvar : repl})
            groups.append(sample)
    df = pd.concat(groups).reset_index(drop=True) #combine aggregated dfs into new df
    return df
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
I suppose that I could potentially apply the function to small samples of the dataframe, to prevent a memory error and then combine those, but I'd rather improve the computation time of my function.
This one-liner does the same and is 10 times faster.
test.groupby(["year", "month"], as_index=False).agg({"day_count":"sum"}).rename(columns={"day_count":"month_count"})
There are almost always pandas methods that are pretty optimized for tasks that will vastly outperform iteration through the dataframe. If I understand correctly, in your case, the following will return the same exact output as your function:
test2 = (test.groupby(['year', 'month'])
             .day_count.sum()
             .to_frame('month_count')
             .reset_index())
>>> test2
year month month_count
0 0 0 16
1 1 1 10
2 2 2 7
3 2 3 5
4 3 3 5
5 3 4 4
To check that it's the same:
# Your original function:
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
>>> test == test2
year month month_count
0 True True True
1 True True True
2 True True True
3 True True True
4 True True True
5 True True True
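For a single True/False answer instead of an element-wise table, DataFrame.equals can be used. The sketch below also shows named aggregation, which does the sum and the rename in one step; the frame is the question's sample data without the day column.

```python
import pandas as pd

test = pd.DataFrame({
    "year":      [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "month":     [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "day_count": [7, 4, 3, 2, 1, 5, 4, 2, 3, 2, 5, 3, 2, 1, 3],
})

# Dict-style aggregation followed by a rename.
by_dict = (test.groupby(["year", "month"], as_index=False)
               .agg({"day_count": "sum"})
               .rename(columns={"day_count": "month_count"}))

# Named aggregation: output column name = (input column, aggregation).
by_name = test.groupby(["year", "month"], as_index=False).agg(
    month_count=("day_count", "sum")
)

# equals compares values, column labels and dtypes and returns one bool.
same = by_dict.equals(by_name)
print(by_name)
print(same)
```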
I'm new to pandas and enchant. I want to check orthography in short sentences using python.
I have a pandas data frame:
id_num word
1 live haapy
2 know more
3 ssweam good
4 eeat little
5 dream alot
And I want to achieve the next table with column “check”
id_num word check
1 live haapy True, False
2 know more True, True
3 ssweam good False, True
4 eeat little False, True
5 dream alot True, False
What is the best way to do this?
I tried this code:
import enchant
dic = enchant.Dict("ru_Eng")
df['list_word'] = df['word'].str.split() #Get list of all words in each sentence using split()
rows = list()
for row in df[['id_num', 'list_word']].iterrows():
    r = row[1]
    for word in r.list_word:
        rows.append((r.id_num, word))
df2 = pd.DataFrame(rows, columns=['id_num', 'word']) #Make the table with id_num column and a column of separate words
Then I got new data frame (df2):
id_num word
1 live
1 haapy
2 know
2 more
3 ssweam
3 good
4 eeat
4 little
5 dream
5 alot
After that I check words using:
column = df2['word']
for i in column:
    n = dic.check(i)
    print(n)
The result is:
True
False
True
True
False
True
False
True
True
False
Check is carried out correctly but when I tried to put this result to a new pandas data frame column I got all False values for all words.
for i in column:
    df2['res'] = dic.check(i)
Resulted data frame:
id_num word res
1 live False
1 haapy False
2 know False
2 more False
3 ssweam False
3 good False
4 eeat False
4 little False
5 dream False
5 alot False
I will be grateful for any help!
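A note on the all-False column: inside the loop, `df2['res'] = dic.check(i)` assigns one scalar to the whole column on every pass, so only the result for the last word ("alot") survives. One possible way to get the "check" column directly is to apply the checker word by word to each split sentence. In the sketch below, `check` and `KNOWN_WORDS` are hypothetical stand-ins so the example runs without enchant; with enchant installed, `dic.check` would presumably take the place of `check`.

```python
import pandas as pd

# Hypothetical stand-in for enchant's dic.check, so the sketch is self-contained.
KNOWN_WORDS = {"live", "know", "more", "good", "little", "dream"}

def check(word):
    return word in KNOWN_WORDS

df = pd.DataFrame({
    "id_num": [1, 2, 3, 4, 5],
    "word": ["live haapy", "know more", "ssweam good", "eeat little", "dream alot"],
})

# Split each sentence into words and check every word, keeping one list per row.
df["check"] = df["word"].str.split().apply(lambda words: [check(w) for w in words])

print(df)
```

For the exploded df2 layout, the same idea would be `df2['res'] = df2['word'].apply(check)`, which produces one value per row instead of overwriting the column.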
I am playing with the really nice code @piRSquared has provided, which can be seen below.
I have added another condition, if row[col2] == 4000, which is met only once in the additional column I added. As expected, with this extra condition the function yields only a single row, because the condition is met only once.
My question is: how can the code be modified to then yield another row once the move is >= move_size?
The desired output is two rows: one when row['B'] == 4000 (as the code produces now) and another when a move >= move_size is seen in column A. I see these as a trade entry and exit, so it would be nice to have an order id in another dataframe column, df['C'], as in the desired output shown below.
Code from original post:
#starting python community conventions
import numpy as np
import pandas as pd
# n is number of observations
n = 5000
day = pd.to_datetime(['2013-02-06'])
# irregular seconds spanning 28800 seconds (8 hours)
seconds = np.random.rand(n) * 28800 * pd.Timedelta(1, 's')
# start at 8 am
start = pd.offsets.Hour(8)
# irregular timeseries
tidx = day + start + seconds
tidx = tidx.sort_values()
s = pd.Series(np.random.randn(n), tidx, name='A').cumsum()
s.plot()
Generator function with slight modification:
def mover_df(df, col,col2, move_size=10):
    ref = None
    for i, row in df.iterrows():
        #added test condition for new col2 signal column
        if row[col2] == 4000:
            if ref is None or (abs(ref - row.loc[col]) >= move_size):
                yield row
                ref = row.loc[col]
Generate data
df = s.to_frame()
df['B'] = range(0,len(df))
moves_df = pd.concat(mover_df(df, 'A','B', 3), axis=1).T
Current output:
A B
2013-02-06 14:30:43.874386317 -50.136432 4000.0
Desired output:
(Values in cols A,B on the second row would be whatever the code generates,I have just added random values to show the format I'm interested in. Col C is the trade id and for every two rows this would increment +1)
A B C
2013-02-06 14:30:43.874386317 -50.136432 4000.0 1
2013-02-06 14:30:43.874386317 -47.136432 6000.0 1
I have been trying to code this for hours (it doesn't help having the kids running around the house now that it's the school holidays...) and appreciate any help. It would be fantastic to get input from @piRSquared, but I appreciate people are busy.
I don't have too much experience with generators or Pandas, but does this work? My data has different output due to the random seed so I am not sure.
I changed the generator to include the alternative case given, that the first column row[col2] == 4000, so calling the generator twice should give both values:
def mover_df(df, col, col2, move_size=10, found=False):
    ref = None
    for i, row in df.iterrows():
        #added test condition for new col2 signal column
        if row[col2] == 4000:
            if ref is None or (abs(ref - row.loc[col]) >= move_size):
                yield row
                found = True # flag that we found the first row we want
                ref = row.loc[col]
        elif found: # if we found the first row, find the second meeting the condition
            if ref is None or (abs(ref - row.loc[col]) >= move_size):
                yield row
And then you can use it like this:
data_generator = mover_df(df, 'A', 'B', 3)
moves_df = pd.concat([next(data_generator), next(data_generator)], axis=1).T
I'd edit the mover_df like this
note:
I changed 4000 condition to % 1000 == 0 to give a few more samples
def mover_df(df, move_col, look_col, move_size=10):
    ref, seen = None, False
    for i, row in df.iterrows():
        #added test condition for new col2 signal column
        look_cond = row[look_col] % 1000 == 0
        if look_cond and not seen:
            yield row
            ref, seen = row.loc[move_col], True
        elif seen:
            move_cond = (abs(ref - row.loc[move_col]) >= move_size)
            if move_cond:
                yield row
                ref, seen = None, False
df = s.to_frame()
df['B'] = range(0,len(df))
moves_df = pd.concat(mover_df(df, 'A','B', 3), axis=1).T
print(moves_df)
A B
2013-02-06 08:00:03.264481639 0.554390 0.0
2013-02-06 08:04:26.609855185 -2.479520 35.0
2013-02-06 09:38:07.962175581 -15.042391 1000.0
2013-02-06 09:40:50.737806497 -18.385956 1026.0
2013-02-06 11:13:03.018013689 -29.074125 2000.0
2013-02-06 11:14:30.980633575 -32.221009 2019.0
2013-02-06 12:49:41.432845325 -35.048040 3000.0
2013-02-06 12:50:28.098114592 -38.881795 3012.0
2013-02-06 14:27:15.008225195 13.437165 4000.0
2013-02-06 14:27:32.790466500 9.513736 4003.0
caveat
This will continue to look for an exit until one is found or you reach the end of the dataframe, even if you reach another potential entry point. Meaning, in my example, I look every 1000 rows and enter. I then look for a move of at least move_size and exit. If I do not find such a move before the next 1000 row marker arrives, I'll ignore that 1000 row marker and continue looking for an exit.
The philosophy was that if I'm in the trade, I have to exit. I don't want to enter into another trade prior to resolving the one I'm still in.
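For the trade id column C from the question, one option is to number the rows of moves_df in pairs. This is a sketch that assumes the rows strictly alternate entry, exit, which is what the generator above yields; the values are rounded from the first four rows of the printed output. If the data ends while a trade is still open, the last row would be an entry with an id of its own.

```python
import numpy as np
import pandas as pd

# First four rows of the printed output, rounded, standing in for moves_df.
moves_df = pd.DataFrame({
    "A": [0.55, -2.48, -15.04, -18.39],
    "B": [0.0, 35.0, 1000.0, 1026.0],
})

# Rows 0-1 form trade 1, rows 2-3 form trade 2, and so on.
moves_df["C"] = np.arange(len(moves_df)) // 2 + 1

print(moves_df)
```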