Reset Index error when index all nulls - python

I am working with a CSV in a format that I cannot change. It contains a multi-index. The raw file looks like this:
I use the following code to build the MultiIndex, stack, and then reset the index. It works.
import pandas as pd
myfile = 'c:/temp/myfile.csv'
df = pd.read_csv(myfile, header=[0, 1], tupleize_cols=True)
df.columns = [c for _, c in df.columns[:3]] + [c for c in df.columns[3:]]
df = df.set_index(list(df.columns[:3]), append = True)
df.columns = pd.MultiIndex.from_tuples(df.columns, names = ['hour', 'field'])
df.stack(level=['hour'])
df2 = df.reset_index().copy()
df2
Sometimes the "Zone" field is left blank, though.
Putting the file through the same code gives me this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-15-8e51ff24c0c4> in <module>()
6 df.columns = pd.MultiIndex.from_tuples(df.columns, names = ['hour', 'field'])
7 df.stack(level=['hour'])
----> 8 df2 = df.reset_index().copy()
9 df2
C:\Anaconda3\lib\site-packages\pandas\core\frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
2832
2833 # to ndarray and maybe infer different dtype
-> 2834 level_values = _maybe_casted_values(lev, lab)
2835 if level is None or i in level:
2836 new_obj.insert(0, col_name, level_values)
C:\Anaconda3\lib\site-packages\pandas\core\frame.py in _maybe_casted_values(index, labels)
2796 if labels is not None:
2797 mask = labels == -1
-> 2798 values = values.take(labels)
2799 if mask.any():
2800 values, changed = com._maybe_upcast_putmask(values,
IndexError: cannot do a non-empty take from an empty axes.
Ideally, I would like to keep the NaNs in the df post reset.

I ran into the same problem. This is my hack:
# Loop through the index columns
for clmNm in df_w_idx.index.names:
    print(clmNm)
    # Make a new column in the dataframe from the index level values
    df_w_idx[clmNm] = df_w_idx.index.get_level_values(clmNm)

# Now you can reset the index
df_w_idx = df_w_idx.reset_index(drop=True).copy()
df_w_idx
Below is fully reproducible code; I am sure there are better ways.
import pandas as pd
import numpy as np
import random
import string

# Create 12 random strings, 3 characters long
rndm_strgs = [''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits)
                      for _ in range(3)) for i in range(12)]
rndm_strgs[0] = None
rndm_strgs[5] = None

# Make the DataFrame
df = pd.DataFrame({'A': list('pandasisgood'),
                   'B': np.nan,
                   'C': rndm_strgs,
                   'D': np.random.rand(12)})

# Set an index -> index levels have NaNs
df_w_idx = df.set_index(['A', 'B', 'C'])
for clmNm in df_w_idx.index.names:
    print(clmNm)
    df_w_idx[clmNm] = df_w_idx.index.get_level_values(clmNm)
df_w_idx = df_w_idx.reset_index(drop=True).copy()
df_w_idx
Also see issue 6322 on GitHub; it looks closed.
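A more compact variant of the same hack, sketched under the assumption of a pandas recent enough to provide MultiIndex.to_frame(index=False):

import pandas as pd

# Materialize the index levels as ordinary columns (NaNs survive), then
# drop the problematic index with reset_index(drop=True), which avoids
# the failing take() path entirely.
flat = pd.concat([df_w_idx.index.to_frame(index=False),
                  df_w_idx.reset_index(drop=True)], axis=1)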

Related

Too many columns resulting in `PerformanceWarning: DataFrame is highly fragmented`

I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.
import pandas as pd
import numpy as np

data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']}
df = pd.DataFrame(data)

df["Animal"] = df['filepath'].str.contains("dog|cat", case=False, regex=True)
df["Fish"] = df['filepath'].str.contains("barracuda", case=False)

df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False, np.nan)

def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)

df = df.apply(squeeze_nan, axis=1)
print(df)
This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because I have so many, I get this warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
To fix this I have tried:
dfAnimal = df.copy
dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)
dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)
df = pd.concat(dfAnimal,dfFish)
The above gives me errors such as 'method' object is not iterable and 'method' object is not subscriptable. I then tried df = df.loc[df['filepath'].isin(['cat','dog'])] but this only works when 'cat' or 'dog' is the only word in the column. How do I avoid the performance warning?
Try creating all your new columns in a dict, convert that dict into a dataframe, and then use pd.concat to add the resulting dataframe (containing the new columns) to the original dataframe:
new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat", case=False, regex=True),
    'Fish': df['filepath'].str.contains("barracuda", case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
Added to your original code, it would be something like this:
import pandas as pd
import numpy as np

data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']}
df = pd.DataFrame(data)

##### These are the new lines #####
new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat", case=False, regex=True),
    'Fish': df['filepath'].str.contains("barracuda", case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
##### End of new lines #####

df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False, np.nan)

def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)

df = df.apply(squeeze_nan, axis=1)
print(df)
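If the ~200 checks are driven by data, the same fix scales: build every column in one dict comprehension and concat once. A sketch, where patterns is a hypothetical mapping of new column name to regex (only 'Animal' and 'Fish' come from the question):

patterns = {
    'Animal': "dog|cat",
    'Fish': "barracuda",
    # ... the remaining ~200 entries
}
new_columns = {name: df['filepath'].str.contains(pat, case=False, regex=True)
               for name, pat in patterns.items()}
# a single concat means a single insert, so no fragmentation warning
df = pd.concat([df, pd.DataFrame(new_columns)], axis=1)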

Applying a missing value distribution of a Dataframe to a subset of the Dataframe: Needs to be faster

I have a large pandas Dataframe (20k rows). Mocking up some data:
import pandas as pd
import numpy as np

columns = [chr(i) for i in range(ord('a'), ord('z') + 1)]
df = pd.DataFrame(np.random.randint(0, 100, size=(20000, 26)), columns=columns)
for col in df.columns:
    df.loc[df.sample(frac=0.4).index, col] = np.nan
In this Dataframe some of the columns might contain missing values indicated with NaN. I need to filter out missing values for a single column and return a new Dataframe that has no missing values in that column:
col_name = "a"
dframe = df.copy()
df_col = dframe[~dframe[col_name].isnull()]
Now df_col may still have missing values in the other columns of the subset. But what I have lost is the way missing values co-occur with the column I filtered on. So if col_name is "A", it might be that "D" is usually missing when "A" is missing; now it appears that "D" is always present in df_col.
I want to take the missing distribution from dframe and randomly sample from df_col to simulate the missing values. By missing distribution I mean combinations of column names that have NaN values and their proportions:
{
    ("A", "E", "G"): 0.24,
    ("G", "Z"): 0.01,
    ("G",): 0.32,
    ...,
    ("R", "M"): 0.09
}
I have functions that do this but they are too slow for my needs:
from typing import Dict
import pandas as pd
import numpy as np

def get_freq_dict(df: pd.DataFrame) -> Dict[tuple, float]:
    num_samples = df.shape[0]
    col_names = df.columns
    # per-row list of the column names that are NaN in that row
    list_of_lists = df.apply(lambda row: [i for i in col_names if np.isnan(row[i])], axis=1).tolist()
    output = {}
    for lis in list_of_lists:
        output.setdefault(tuple(lis), list()).append(1)
    for a, b in output.items():
        output[a] = sum(b) / float(num_samples)
    return output

def add_in_missingness(df_col, freq_dist) -> pd.DataFrame:
    sample_list = []
    df_m = df_col.copy()
    for key in freq_dist:
        sample = df_m.sample(frac=freq_dist[key], replace=False, random_state=1)
        # remove the sampled rows from the pool so they are not drawn again
        blacklist = list(sample.index)
        df_m = df_m[~df_m.index.isin(blacklist)]
        col_names = list(key)
        sample[col_names] = np.NaN
        sample_list.append(sample)
    df_col = pd.concat(sample_list)
    return df_col
Running it:
%%time
import time
freq_dist = get_freq_dict(df)
df_ = add_in_missingness(df_col, freq_dist)
Seems to work but it takes far too long for my purposes:
CPU times: user 1min 1s, sys: 439 ms, total: 1min 1s
Wall time: 1min 1s
I need help making these functions efficient. Any ideas?
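One possible speedup, sketched here as a suggestion rather than a tested answer: the row-wise apply in get_freq_dict is likely where the minute goes, and it can be replaced with the boolean-mask "dot the column names" trick, which builds each row's missing-pattern key without a Python-level loop (get_freq_dict_fast is my name for this variant):

from typing import Dict, Tuple
import pandas as pd

def get_freq_dict_fast(df: pd.DataFrame) -> Dict[Tuple[str, ...], float]:
    # A boolean row dotted with 'name,' strings concatenates the names of
    # the NaN columns: True * 'a,' == 'a,' while False * 'a,' == ''.
    keys = df.isna().dot(df.columns + ',').str.rstrip(',')
    counts = keys.value_counts(normalize=True)
    # Convert back to tuple keys; rows with no NaNs map to the empty tuple.
    return {tuple(k.split(',')) if k else (): v for k, v in counts.items()}

add_in_missingness only loops over the unique patterns, so it should be comparatively cheap once the frequency computation is vectorized.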

Python pandas DF question: trying to drop a column but nothing happens (df.drop) - code also runs without any errors

I am trying to delete a column called Rank but nothing happens. The remaining code all executes without any issue but the column itself remains in the output file. I've highlighted the part of the code that is not working.
def read_csv():
    file = "\mona" + yday + ".csv"
    #df=[]
    df = pd.read_csv(save_path + file, skiprows=3, encoding="ISO-8859-1", error_bad_lines=False)
    return df

# replace . with / in column EPIC
def tickerchange():
    df = read_csv()
    df['EPIC'] = df['EPIC'].str.replace('.', '/')
    return df

def consolidate_AB_listings():
    df = tickerchange()
    Aline = df.loc[(df['EPIC'] == 'RDSA'), 'Mkt Cap (àm)']
    Bline = df.loc[(df['EPIC'] == 'RDSB'), 'Mkt Cap (àm)']
    df.loc[(df['EPIC'] == 'RDSA'), 'Mkt Cap (àm)'] = float(Aline) + float(Bline)
    df = df.loc[(df.Ind != 'I/E')]
    df = df.loc[(df.Ind != 'FL')]
    df = df.loc[(df.Ind != 'M')]
    df = df.loc[(df.EPIC != 'RDSB')]
    return df

def ranking_mktcap():
    df = consolidate_AB_listings()
    df['Rank'] = df['Mkt Cap (àm)'].rank(ascending=False)
    df = df.loc[(df.Rank != 1)]
    df['Rank1'] = df['Mkt Cap (Em)'].rank(ascending=False)
    ## This doesn't seem to work
    df = df.drop(df['Security'], 1)
    return df

def save_outputfile():
    #df = drop()
    df = ranking_mktcap()
    df.to_csv(r'S:\Index_Analytics\UK\Index Methodology\FTSE\Py_file_download\MonitoredList.csv', index=False)
    print("finished")

if __name__ == "__main__":
    main()
    read_csv()
    tickerchange()
    consolidate_AB_listings()
    ranking_mktcap()
    save_outputfile()
DataFrame.drop() has this signature: DataFrame.drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise').
When you call df = df.drop(df['Security'], 1), the values of df['Security'] are used as the labels to drop, and the 1 is passed through the axis parameter.
If you want to drop the column 'Security' then you'd want to do:
df = df.drop('Security', axis=1)
# this is the same as
df = df.drop(labels='Security', axis=1)
# you can also specify the column name directly, like this
df = df.drop(columns='Security')
Note: the columns= parameter can take a single label (str) as above, or a list of column names.
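A tiny demonstration of the failure mode on a made-up frame (the Security values here are hypothetical, not the asker's data):

import pandas as pd

df = pd.DataFrame({'Security': ['AAA', 'BBB'], 'Rank': [1, 2]})

df.drop('Security', axis=1)   # drops the column, as intended
# df.drop(df['Security'], 1)  # passes the values 'AAA', 'BBB' as labels,
#                             # so pandas looks for columns named 'AAA'/'BBB'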
Try replacing
df = df.drop(df['Security'], 1)
with
df.drop(['Security'], axis=1, inplace=True)
I had the same issue and all I did was add inplace=True (note that drop with inplace=True returns None, so don't assign the result back to df):
df.drop('Security', axis=1, inplace=True)

Pandas set element style dependent on another dataframe with multi index

I have previously asked the question Pandas set element style dependent on another dataframe, which I have a working solution to, but now I am trying to apply it to a data frame with a multi index and I am getting an error, which I do not understand.
Problem
I have a pandas df and accompanying boolean matrix. I want to highlight the df depending on the boolean matrix.
Data
import pandas as pd
import numpy as np
from datetime import datetime

date = pd.date_range(start=datetime(2016, 1, 1), end=datetime(2016, 2, 1), freq="D")
i = len(date)
dic = {'X': pd.DataFrame(np.random.randn(i, 2), index=date, columns=['A', 'B']),
       'Y': pd.DataFrame(np.random.randn(i, 2), index=date, columns=['A', 'B']),
       'Z': pd.DataFrame(np.random.randn(i, 2), index=date, columns=['A', 'B'])}
df = pd.concat(dic.values(), axis=1, keys=dic.keys())

boo = [True, False]
bool_matrix = {'X': pd.DataFrame(np.random.choice(boo, (i, 2), p=[0.3, .7]), index=date, columns=['A', 'B']),
               'Y': pd.DataFrame(np.random.choice(boo, (i, 2), p=[0.3, .7]), index=date, columns=['A', 'B']),
               'Z': pd.DataFrame(np.random.choice(boo, (i, 2), p=[0.3, .7]), index=date, columns=['A', 'B'])}
bool_matrix = pd.concat(bool_matrix.values(), axis=1, keys=bool_matrix.keys())
My attempted solution
def highlight(value):
    return 'background-color: green'

my_style = df.style
for column in df.columns:
    for i in df[column].index:
        data = bool_matrix.loc[i, column]
        if data:
            my_style = df.style.use(my_style.export()).applymap(highlight, subset=pd.IndexSlice[i, column])
my_style
Results
The above throws an AttributeError: 'Series' object has no attribute 'applymap'
I do not understand what is being returned as a Series. I am subsetting a single value, and this solution worked for non-multi-indexed df's, as shown below.
Without Multi-index
import pandas as pd
import numpy as np
from datetime import datetime

np.random.seed(24)
date = pd.date_range(start=datetime(2016, 1, 1), end=datetime(2016, 2, 1), freq="D")
df = pd.DataFrame({'A': np.linspace(1, 100, len(date))})
df = pd.concat([df, pd.DataFrame(np.random.randn(len(date), 4), columns=list('BCDE'))],
               axis=1)
df['date'] = date
df.set_index("date", inplace=True)

boo = [True, False]
bool_matrix = pd.DataFrame(np.random.choice(boo, (len(date), 5), p=[0.3, .7]),
                           index=date, columns=list('ABCDE'))

def highlight(value):
    return 'background-color: green'

my_style = df.style
for column in df.columns:
    for i in bool_matrix.index:
        data = bool_matrix.loc[i, column]
        if data:
            my_style = df.style.use(my_style.export()).applymap(highlight, subset=pd.IndexSlice[i, column])
my_style
Documentation
The docs make reference to CSS classes and say that "Index label cells include level<k> where k is the level in a MultiIndex." I am obviously indexing this wrong, but am stumped on how to proceed.
It's very nice that there is a runnable example.
You can use df.style.apply(..., axis=None) to apply a highlight method to the whole dataframe.
With your df and bool_matrix, try this:
def highlight(value):
    d = value.copy()
    for c in d.columns:
        for r in df.index:
            if bool_matrix.loc[r, c]:
                d.loc[r, c] = 'background-color: green'
            else:
                d.loc[r, c] = ''
    return d

df.style.apply(highlight, axis=None)
Or, to keep the code simple, you can try:
def highlight(value):
    return bool_matrix.applymap(lambda x: 'background-color: green' if x else '')

df.style.apply(highlight, axis=None)
Hope this is what you need.
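A short note on why this works: with axis=None, Styler.apply passes the entire DataFrame to the function and expects back a same-shaped object of CSS strings, so bool_matrix must share df's MultiIndex rows and columns exactly. A quick sanity check (my addition, assuming the variables from the example above):

# both frames must align label-for-label for the styling to land correctly
assert bool_matrix.index.equals(df.index)
assert bool_matrix.columns.equals(df.columns)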

adding rows to empty dataframe with columns

I am using Pandas and want to add rows to an empty DataFrame with columns already established.
So far my code looks like this...
def addRows(cereals, lines):
    for i in np.arange(1, len(lines)):
        dt = parseLine(lines[i])
        dt = pd.Series(dt)
        print(dt)
        # YOUR CODE GOES HERE (add dt to cereals)
        cereals.append(dt, ignore_index=True)
    return cereals
However, when I run...
cereals = addRows(cereals,lines)
cereals
the dataframe returns with no rows, just the columns. I am not sure what I am doing wrong but I am pretty sure it has something to do with the append method. Anyone have any ideas as to what I am doing wrong?
There are two probable reasons your code is not operating as intended:
cereals.append(dt, ignore_index = True) is not doing what you think it is. You're trying to append a series, not a DataFrame there.
cereals.append(dt, ignore_index = True) does not modify cereals in place, so when you return it, you're returning the unchanged original. An equivalent function would look like this:
>>> def foo(a):
...     a + 1
...     return a
...
>>> foo(1)
1
I haven't tested this on my machine, but I think your fixed solution would look like this:
def addRows(cereals, lines):
    for i in np.arange(1, len(lines)):
        data = parseLine(lines[i])
        # wrap data in a list so it becomes one row of the new frame
        new_df = pd.DataFrame([data], columns=cereals.columns)
        cereals = cereals.append(new_df, ignore_index=True)
    return cereals
By the way, I don't really know where lines is coming from, but right away I would at least modify it to look like this:
data = [parseLine(line) for line in lines[1:]]
cereals = cereals.append(pd.DataFrame(data, columns=cereals.columns), ignore_index=True)
How to add an extra row to a pandas dataframe
You could also create a new DataFrame and just append that DataFrame to your existing one. E.g.
>>> import pandas as pd
>>> empty_alph = pd.DataFrame(columns=['letter', 'index'])
>>> alph_abc = pd.DataFrame([['a', 0], ['b', 1], ['c', 2]], columns=['letter', 'index'])
>>> empty_alph.append(alph_abc)
  letter  index
0      a    0.0
1      b    1.0
2      c    2.0
As I noted in the link, you can also use the loc method on a DataFrame:
>>> df = empty_alph.append(alph_abc)
>>> df.loc[df.shape[0]] = ['d', 3]  # df.shape[0] just finds the next # in the index
>>> df
  letter  index
0      a    0.0
1      b    1.0
2      c    2.0
3      d    3.0
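One caveat worth adding as a note, since it postdates these answers: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same pattern is written with pd.concat:

import pandas as pd

empty_alph = pd.DataFrame(columns=['letter', 'index'])
alph_abc = pd.DataFrame([['a', 0], ['b', 1], ['c', 2]], columns=['letter', 'index'])

# equivalent of empty_alph.append(alph_abc) on pandas >= 2.0
df = pd.concat([empty_alph, alph_abc], ignore_index=True)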
