Too many columns resulting in `PerformanceWarning: DataFrame is highly fragmented` - python
I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
}
df = pd.DataFrame(data)
df["Animal"] =(df['filepath'].str.contains("dog|cat",case=False,regex=True))
df["Fish"] =(df['filepath'].str.contains("barracuda",case=False))
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because I have so many, I get the following warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
To fix this I have tried:
dfAnimal = df.copy
dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)
dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)
df = pd.concat(dfAnimal,dfFish)
The above gives me errors such as 'method' object is not iterable and 'method' object is not subscriptable. I then tried df = df.loc[df['filepath'].isin(['cat','dog'])], but isin only matches when 'cat' or 'dog' is the entire cell value, not part of a longer path. How do I avoid the performance warning?
Try creating all your new columns in a dict, then converting that dict into a DataFrame, and finally using a single pd.concat to attach the resulting DataFrame (holding the new columns) to the original one:
new_columns = {
'Animal': df['filepath'].str.contains("dog|cat",case=False,regex=True),
'Fish': df['filepath'].str.contains("barracuda",case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
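Since the real use case has around 200 of these category checks, the dict does not have to be written out by hand. Below is a minimal sketch (the category_patterns mapping and its entries are invented for illustration; only 'Animal' and 'Fish' come from the question) that builds every boolean column with a dict comprehension and still attaches them with a single pd.concat:

import pandas as pd

df = pd.DataFrame({'filepath': ['C:/barracuda/document.doc',
                                'C:/dog/document.doc',
                                'C:/cat/document.doc']})

# Hypothetical mapping of new column name -> regex pattern to look for in the filepath.
# Only 'Animal' and 'Fish' come from the question; in practice this would hold ~200 entries.
category_patterns = {
    'Animal': 'dog|cat',
    'Fish': 'barracuda',
}

# Build every boolean column first, then attach them all in one concat,
# so no column is inserted into the frame one at a time (that is what fragments it).
new_columns = {
    name: df['filepath'].str.contains(pattern, case=False, regex=True)
    for name, pattern in category_patterns.items()
}
df = pd.concat([df, pd.DataFrame(new_columns)], axis=1)

Because nothing is inserted into df column by column, the fragmentation warning never appears.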
Added to your original code, it would be something like this:
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
}
df = pd.DataFrame(data)
##### These are the new lines #####
new_columns = {
'Animal': df['filepath'].str.contains("dog|cat",case=False,regex=True),
'Fish': df['filepath'].str.contains("barracuda",case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
##### End of new lines #####
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
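A side note on the df.copy / pd.concat attempt from the question: df.copy without parentheses returns the bound method itself rather than a copied frame, and pd.concat expects a list (or other iterable) of objects, which is where the "'method' object is not iterable" and "not subscriptable" errors come from. A corrected sketch of that attempt, assuming df is the original single-column frame from the question, would look like this (it still copies the frame once per category, so the dict-plus-concat approach above scales better):

dfAnimal = df.copy()   # parentheses: actually call .copy()
dfFish = df.copy()
dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat", case=False, regex=True)
dfFish['Fish'] = dfFish['filepath'].str.contains("barracuda", case=False)
# pd.concat takes a list; axis=1 puts the pieces side by side.
# Take only the new 'Fish' column from dfFish to avoid duplicating 'filepath'.
df = pd.concat([dfAnimal, dfFish['Fish']], axis=1)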
Related
check for value in pandas Dataframe cell which has a list
I have the following df:

df = pd.DataFrame(columns=['Place', 'PLZ','shortName','Parzellen'])
new_row1 = {'Place':'Winterthur', 'PLZ':[8400, 8401, 8402, 8404, 8405, 8406, 8407, 8408, 8409, 8410, 8411], 'shortName':'WIN', 'Parzellen':[]}
new_row2 = {'Place':'Opfikon', 'PLZ':[8152], 'shortName':'OPF', 'Parzellen':[]}
new_row3 = {'Place':'Stadel', 'PLZ':[8174], 'shortName':'STA', 'Parzellen':[]}
new_row4 = {'Place':'Kloten', 'PLZ':[8302], 'shortName':'KLO', 'Parzellen':[]}
new_row5 = {'Place':'Niederhasli', 'PLZ':[8155,8156], 'shortName':'NIH', 'Parzellen':[]}
new_row6 = {'Place':'Bassersdorf', 'PLZ':[8303], 'shortName':'BAS', 'Parzellen':[]}
new_row7 = {'Place':'Oberglatt', 'PLZ':[8154], 'shortName':'OBE', 'Parzellen':[]}
new_row8 = {'Place':'Bülach', 'PLZ':[8180], 'shortName':'BUE', 'Parzellen':[]}
df = df.append(new_row1, ignore_index=True)
df = df.append(new_row2, ignore_index=True)
df = df.append(new_row3, ignore_index=True)
df = df.append(new_row4, ignore_index=True)
df = df.append(new_row5, ignore_index=True)
df = df.append(new_row6, ignore_index=True)
df = df.append(new_row7, ignore_index=True)
df = df.append(new_row8, ignore_index=True)
print(df)

Now I have a number like 8405 and I want to know the Place, or the whole row, that has this number in its df['PLZ'] list. I also tried a class-based approach, but it was hard to collect all numbers across all objects: I want to be able to get every PLZ in one list and, given any number, check which Place it belongs to. Maybe there is an obvious better way and I just don't know it.
Try boolean masking with the map() method:

df[df['PLZ'].map(lambda x: 8405 in x)]

or boolean masking with the agg() method:

df[df['PLZ'].agg(lambda x: 8405 in x)]  # you can also use apply() in place of agg

Output of the above code:

        Place                                                PLZ shortName Parzellen
0  Winterthur  [8400, 8401, 8402, 8404, 8405, 8406, 8407, 840...       WIN        []
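If only the place name is needed rather than the whole row, the same mask can be combined with .loc column selection; a small sketch building on the answer above:

mask = df['PLZ'].map(lambda x: 8405 in x)   # True where the PLZ list contains 8405
places = df.loc[mask, 'Place']              # just the matching place names
print(places.tolist())                      # ['Winterthur']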
Pandas read_csv reads list in column as type float
I have the following CSV file (shortened):

"","PacketTime","FrameLen","FrameCapLen","IPHdrLen","IPLen","TCPLen","TCPHdrLen","WindowSize","WindowSizeValue","BytesInFlight","PushBytesSent","ACKRTT","Payload","TLSRecordContentType","TLSRecordLen","TLSAppData","Movement","Distance","Speed","Delay","Loss","Interval"
"1",0.056078384,116,116,20,102,50,32,83,83,50,50,NA,"17:03:03:00:2d:89:da:a1:d8:d8:a5:9b:38:29:9e:0e:b1:51:c9:a0:e1:66:af:57:e2:a2:e6:c1:16:16:eb:2e:26:02:ec:f6:4e:f7:90:05:20:3c:45:61:14:0d:c4:c3:df:2e",23,45,"89:da:a1:d8:d8:a5:9b:38:29:9e:0e:b1:51:c9:a0:e1:66:af:57:e2:a2:e6:c1:16:16:eb:2e:26:02:ec:f6:4e:f7:90:05:20:3c:45:61:14:0d:c4:c3:df:2e",1,1,25,0,0,"0"
"2",0.056106291,66,66,20,52,0,32,84,84,NA,NA,2.7907e-05,NA,NA,NA,NA,1,1,25,0,0,"0"
"3",2.058089106,116,116,20,102,50,32,83,83,50,50,NA,"17:03:03:00:2d:ba:92:5d:6a:18:1e:d5:89:6a:6a:a3:f7:5a:cf:dd:4d:f8:38:1f:4b:ad:1b:3f:94:8a:07:fa:9b:27:c8:06:34:cd:10:a3:08:d0:db:01:42:2b:2d:27:fa:dd",23,45,"ba:92:5d:6a:18:1e:d5:89:6a:6a:a3:f7:5a:cf:dd:4d:f8:38:1f:4b:ad:1b:3f:94:8a:07:fa:9b:27:c8:06:34:cd:10:a3:08:d0:db:01:42:2b:2d:27:fa:dd",1,1,25,0,0,"2"
"4",2.058114719,66,66,20,52,0,32,84,84,NA,NA,2.5613e-05,NA,NA,NA,NA,1,1,25,0,0,"2"
"5",4.060316193,116,116,20,102,50,32,83,83,50,50,NA,"17:03:03:00:2d:c5:5d:a0:5d:7c:6f:4e:70:31:18:0d:a2:0b:ac:dd:19:18:59:4d:3e:d7:f4:a6:92:5d:4e:98:4e:ed:ae:5a:d2:e8:cd:d2:83:b0:82:91:48:88:0e:d4:ed:09",23,45,"c5:5d:a0:5d:7c:6f:4e:70:31:18:0d:a2:0b:ac:dd:19:18:59:4d:3e:d7:f4:a6:92:5d:4e:98:4e:ed:ae:5a:d2:e8:cd:d2:83:b0:82:91:48:88:0e:d4:ed:09",1,1,25,0,0,"4"
"6",4.060340382,66,66,20,52,0,32,84,84,NA,NA,2.4189e-05,NA,NA,NA,NA,1,1,25,0,0,"4"
"7",6.063347757,116,116,20,102,50,32,83,83,50,50,NA,"17:03:03:00:2d:7e:cc:44:54:43:45:71:c6:5f:75:15:ad:f7:ce:81:a6:31:51:ce:76:0a:03:52:60:72:fc:17:9f:be:f7:92:06:b6:80:64:38:0d:6f:6e:a0:df:ea:b9:16:8e",23,45,"7e:cc:44:54:43:45:71:c6:5f:75:15:ad:f7:ce:81:a6:31:51:ce:76:0a:03:52:60:72:fc:17:9f:be:f7:92:06:b6:80:64:38:0d:6f:6e:a0:df:ea:b9:16:8e",1,1,25,0,0,"6"
"8",6.06337245,66,66,20,52,0,32,84,84,NA,NA,2.4693e-05,NA,NA,NA,NA,1,1,25,0,0,"6"
"9",8.065573696,116,116,20,102,50,32,83,83,50,50,NA,"17:03:03:00:2d:e3:07:5a:eb:d6:b7:3b:55:6b:77:57:99:76:fa:f4:43:38:34:d4:82:60:40:10:eb:90:2a:01:14:21:aa:db:a0:d3:c4:eb:6a:e8:08:05:4e:59:ca:67:f1:63",23,45,"e3:07:5a:eb:d6:b7:3b:55:6b:77:57:99:76:fa:f4:43:38:34:d4:82:60:40:10:eb:90:2a:01:14:21:aa:db:a0:d3:c4:eb:6a:e8:08:05:4e:59:ca:67:f1:63",1,1,25,0,0,"8"
"10",8.065602121,66,66,20,52,0,32,84,84,NA,NA,2.8425e-05,NA,NA,NA,NA,1,1,25,0,0,"8"
"11",10.066978328,116,116,20,102,50,32,83,83,50,50,NA,"17:03:03:00:2d:d2:2e:ed:cc:21:12:20:66:cb:6d:41:5f:5a:b8:ea:53:2d:7a:ff:f7:ca:07:91:07:64:51:a4:91:6e:28:58:6f:17:29:8d:7f:2c:ca:c4:22:a7:81:d9:af:3c",23,45,"d2:2e:ed:cc:21:12:20:66:cb:6d:41:5f:5a:b8:ea:53:2d:7a:ff:f7:ca:07:91:07:64:51:a4:91:6e:28:58:6f:17:29:8d:7f:2c:ca:c4:22:a7:81:d9:af:3c",1,1,25,0,0,"10"
"12",10.067001964,66,66,20,52,0,32,84,84,NA,NA,2.3636e-05,NA,NA,NA,NA,1,1,25,0,0,"10"
"13",12.069526007,116,116,20,102,50,32,83,83,50,50,NA,"17:03:03:00:2d:6b:4e:48:e6:ce:0b:f5:2c:18:df:36:c1:08:56:7a:f1:5e:be:f5:8a:e2:b7:84:87:30:66:c9:de:60:ac:4a:ad:80:4b:44:64:3b:21:96:18:c7:42:c8:03:20",23,45,"6b:4e:48:e6:ce:0b:f5:2c:18:df:36:c1:08:56:7a:f1:5e:be:f5:8a:e2:b7:84:87:30:66:c9:de:60:ac:4a:ad:80:4b:44:64:3b:21:96:18:c7:42:c8:03:20",1,1,25,0,0,"12"
"14",12.069551287,66,66,20,52,0,32,84,84,NA,NA,2.528e-05,NA,NA,NA,NA,1,1,25,0,0,"12"

Now, when I call read_csv I do the following:

# min max for the packet time
def min_max(s):
    s = s.astype('float64')
    return s.max()-s.min()

def to_list(df):
    return df.T.apply(lambda x: x.to_list(), axis='columns')

def group(csv):
    df = pd.read_csv(csv)
    df_other = df.groupby('Interval')\
        .apply(to_list)\
        .drop(columns='PacketTime')
    s_Interval = df.groupby('Interval')['PacketTime']\
        .apply(min_max)
    final_df = pd.concat([df_other, s_Interval], axis='columns')
    final_df.drop(['Unnamed: 0'], axis=1, inplace=True)
    return final_df

dataset = group("csv_location")
dataset.drop(['Interval'], axis=1, inplace=True)
ptime = dataset.pop('PacketTime')
target = dataset.pop('Movement')
other_targets = pd.DataFrame([dataset.pop(x) for x in ['Distance', 'Speed', 'Delay', 'Loss']])

However, when I loop through the dataset columns -- you will notice the columns contain lists -- the first column seems to be of <class 'list'>, but the second somehow is of <class 'float'>. Here is the loop I am doing:

columns = list(dataset)
for col in columns:
    df = pd.DataFrame(dataset[col].astype('list').tolist())
    df.columns = [col+"_"+str(y) for y in range(len(df.columns))]
    df = df.dropna(axis='columns')
    dataset.drop(col, axis=1, inplace=True)
    dataset = pd.concat([dataset, df], axis=1)

The error I am getting, raised on the line df = pd.DataFrame(dataset[col].tolist()), is:

TypeError: object of type 'float' has no len()
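No answer is included for this one, but the TypeError itself is easy to reproduce: after the groupby/apply step, groups with no value end up as NaN, so a "list column" can hold plain floats, and pd.DataFrame(...tolist()) then calls len() on a float. A minimal sketch of that failure mode (the values here are made up, not taken from the CSV above):

import numpy as np
import pandas as pd

s = pd.Series([[1, 2, 3], np.nan])       # a list column where one entry is NaN
print(type(s.iloc[0]), type(s.iloc[1]))  # <class 'list'> <class 'float'>
pd.DataFrame(s.tolist())                 # raises the same TypeError: object of type 'float' has no len()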
Python pandas DF question : trying to drop a column but nothing happens (df.drop) - code also runs without any errors
I am trying to delete a column called Rank but nothing happens. The remaining code all executes without any issue, but the column itself remains in the output file. I've highlighted the part of the code that is not working.

def read_csv():
    file = "\mona" + yday+".csv"
    #df=[]
    df = pd.read_csv(save_path+file,skiprows=3,encoding = "ISO-8859-1",error_bad_lines=False)
    return df

# replace . with / in column EPIC
def tickerchange():
    df=read_csv()
    df['EPIC'] = df['EPIC'].str.replace('.','/')
    return df

def consolidate_AB_listings():
    df=tickerchange()
    Aline = df.loc[(df['EPIC'] =='RDSA'),'Mkt Cap (àm)']
    Bline = df.loc[(df['EPIC'] =='RDSB'),'Mkt Cap (àm)']
    df.loc[(df['EPIC'] =='RDSA'),'Mkt Cap (àm)']= float(Aline) + float(Bline)
    df = df.loc[(df.Ind != 'I/E')]
    df = df.loc[(df.Ind != 'FL')]
    df = df.loc[(df.Ind != 'M')]
    df = df.loc[(df.EPIC != 'RDSB')]
    return df

def ranking_mktcap():
    df = consolidate_AB_listings()
    df['Rank']= df['Mkt Cap (àm)'].rank(ascending=False)
    df = df.loc[(df.Rank != 1)]
    df['Rank1']= df['Mkt Cap (Em)'].rank(ascending=False)
    ## This doesn't seem to work
    df = df.drop(df['Security'], 1)
    return df

def save_outputfile():
    #df = drop()
    df = ranking_mktcap()
    df.to_csv(r'S:\Index_Analytics\UK\Index Methodology\FTSE\Py_file_download\MonitoredList.csv', index=False)
    print("finished")

if __name__ == "__main__":
    main()
    read_csv()
    tickerchange()
    consolidate_AB_listings()
    ranking_mktcap()
    save_outputfile()
DataFrame.drop() takes the following:

DataFrame.drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

When you call df = df.drop(df['Security'], 1), the values of df['Security'] are used as the labels to drop, and the 1 is passed through the axis parameter. If you want to drop the column 'Security', you'd want to do:

df = df.drop('Security', axis=1)
# this is the same as
df = df.drop(labels='Security', axis=1)
# you can also specify the column name directly, like this
df = df.drop(columns='Security')

Note: the columns= parameter can take a single label (str) as above, or a list of column names.
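As a small illustration of that last note, the list form drops several columns in one call (a sketch, reusing column names from the question):

df = df.drop(columns=['Security', 'Rank'])   # a list of labels drops both columns at once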
Try replacing df = df.drop(df['Security'], 1) with df.drop(['Security'], axis=1, inplace=True).
I had the same issue and solved it with inplace=True, so it becomes df.drop('Security', axis=1, inplace=True). Note that with inplace=True the call returns None, so don't assign the result back to df.
Pandas set element style dependent on another dataframe with multi index
I have previously asked the question Pandas set element style dependent on another dataframe, which I have a working solution to, but now I am trying to apply it to a data frame with a multi index and I am getting an error which I do not understand.

Problem
I have a pandas df and an accompanying boolean matrix. I want to highlight the df depending on the boolean matrix.

Data
import pandas as pd
import numpy as np
from datetime import datetime

date = pd.date_range(start = datetime(2016,1,1), end = datetime(2016,2,1), freq = "D")
i = len(date)
dic = {'X':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B']),
       'Y':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B']),
       'Z':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B'])}
df = pd.concat(dic.values(),axis=1,keys=dic.keys())

boo = [True, False]
bool_matrix = {'X':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B']),
               'Y':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B']),
               'Z':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B'])}
bool_matrix = pd.concat(bool_matrix.values(),axis=1,keys=bool_matrix.keys())

My attempted solution
def highlight(value):
    return 'background-color: green'

my_style = df.style
for column in df.columns:
    for i in df[column].index:
        data = bool_matrix.loc[i, column]
        if data:
            my_style = df.style.use(my_style.export()).applymap(highlight, subset = pd.IndexSlice[i, column])
my_style

Results
The above throws an AttributeError: 'Series' object has no attribute 'applymap'. I do not understand what is being returned as a Series. This is a single value I am subsetting, and this solution worked for non multi-indexed df's, as shown below.

Without Multi-index
import pandas as pd
import numpy as np
from datetime import datetime

np.random.seed(24)
date = pd.date_range(start = datetime(2016,1,1), end = datetime(2016,2,1), freq = "D")
df = pd.DataFrame({'A': np.linspace(1, 100, len(date))})
df = pd.concat([df, pd.DataFrame(np.random.randn(len(date), 4), columns=list('BCDE'))], axis=1)
df['date'] = date
df.set_index("date", inplace = True)

boo = [True, False]
bool_matrix = pd.DataFrame(np.random.choice(boo, (len(date), 5),p=[0.3,.7]), index = date,columns=list('ABCDE'))

def highlight(value):
    return 'background-color: green'

my_style = df.style
for column in df.columns:
    for i in bool_matrix.index:
        data = bool_matrix.loc[i, column]
        if data:
            my_style = df.style.use(my_style.export()).applymap(highlight, subset = pd.IndexSlice[i,column])
my_style

Documentation
The docs make reference to CSS Classes and say that "Index label cells include level<k> where k is the level in a MultiIndex." I am obviously indexing this wrong, but am stumped on how to proceed.
It's very nice that there is a runnable example. You can use df.style.apply(..., axis=None) to apply a highlight method to the whole dataframe. With your df and bool_matrix, try this:

def highlight(value):
    d = value.copy()
    for c in d.columns:
        for r in df.index:
            if bool_matrix.loc[r, c]:
                d.loc[r, c] = 'background-color: green'
            else:
                d.loc[r, c] = ''
    return d

df.style.apply(highlight, axis=None)

Or, to keep the code simple, you can try:

def highlight(value):
    return bool_matrix.applymap(lambda x: 'background-color: green' if x else '')

df.style.apply(highlight, axis=None)

Hope this is what you need.
Python & Pandas: How to return a copy of a dataframe?
Here is the problem. I use a function to return randomized data:

data1 = [3,5,7,3,2,6,1,6,7,8]
data2 = [1,5,2,1,6,4,3,2,7,8]
df = pd.DataFrame(data1, columns = ['c1'])
df['c2'] = data2

def randomize_data(df):
    df['c1_ran'] = df['c1'].apply(lambda x: (x + np.random.uniform(0,1)))
    df['c1'] = df['c1_ran']
    # df.drop(['c1_ran'], 1, inplace=True)
    return df

temp_df = randomize_data(df)
display(df)
display(temp_df)

However, the df (source data) and temp_df (randomized data) turn out to be the same. How can I make temp_df and df different from each other? I find I can get rid of the problem by adding df.copy() at the beginning of the function:

def randomize_data(df):
    df = df.copy()

But I'm not sure if this is the right way to deal with it.
Use DataFrame.assign():

def randomize_data(df):
    return df.assign(c1=df.c1 + np.random.uniform(0, 1, df.shape[0]))
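A short note on why this works: assign does not modify the frame it is called on; it returns a new DataFrame with c1 replaced, so the original df stays untouched. A small usage sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [3, 5, 7], 'c2': [1, 5, 2]})

def randomize_data(df):
    # assign returns a new DataFrame; df itself is not modified
    return df.assign(c1=df.c1 + np.random.uniform(0, 1, df.shape[0]))

temp_df = randomize_data(df)
print(df['c1'].equals(temp_df['c1']))   # False -> only the returned copy was randomized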
I think you are right, and DataFrame.copy() has an optional argument 'deep'. You can find details at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html
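To illustrate the deep argument mentioned above (a minimal sketch): deep=True (the default) copies the data, while deep=False only copies the container and shares the underlying values, so on older pandas (before copy-on-write) changes to the original can show up in the shallow copy.

import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3]})
deep = df.copy()               # deep=True is the default: data is copied
shallow = df.copy(deep=False)  # shares the underlying data with df

df.loc[0, 'c1'] = 99
print(deep.loc[0, 'c1'])       # 1  -> the deep copy is unaffected
print(shallow.loc[0, 'c1'])    # 99 on older pandas (shared data); with copy-on-write
                               # enabled (the default in pandas 3.0) it stays 1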