Pandas: Efficiently splitting entries - python

I have a Pandas dataframe with columns as such:
event_id, obj_0_type, obj_0_foo, obj_0_bar, obj_1_type, obj_1_foo, obj_1_bar, ..., obj_n_type, obj_n_foo, obj_n_bar
For example:
import numpy as np
import pandas as pd

col_idx = ['event_id']
for d in range(5):
    col_idx.extend(('obj_%d_id' % d, 'obj_%d_foo' % d, 'obj_%d_bar' % d))
event_id = np.arange(5)
data = np.random.rand(15, 5)
data = np.vstack((event_id, data))
df = pd.DataFrame(data.T, index=range(5), columns=col_idx)
I would like to split each individual row of the dataframe so that I'd have a single entry per object, as such:
event_id, obj_type, obj_foo, obj_bar
Where event_id would be shared among all the objects of a given event.
There are ways of doing it by iterating over the dataframe rows and creating new Series objects, but those are atrociously slow and obviously unpythonic. Is there a simpler way I am missing?

With some suggestions from some people in #pydata on freenode, this is what I came up with:
data = []
for d in range(5):
    temp = df.loc[:, ['event_id', 'obj_%d_id' % d, 'obj_%d_foo' % d, 'obj_%d_bar' % d]]
    # Give the columns common names so the pieces line up on concat.
    temp.columns = ['event_id', 'obj_id', 'obj_foo', 'obj_bar']
    # Create a unique index per (event, object) pair.
    temp.index = temp['event_id'] * 10 + d
    data.append(temp)
pd.concat(data)
This works and is reasonably fast!
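For reference, the same reshape can also be done without the explicit loop by turning the obj_<n>_<field> column names into a two-level column index and stacking the object level. A minimal sketch, assuming the column layout from the example above (the names wide and long_df are just for illustration):
wide = df.set_index('event_id')
# Split 'obj_<n>_<field>' into a (<n>, <field>) column MultiIndex.
wide.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')[1:]) for c in wide.columns],
    names=['obj', 'field'])
# Stack the object level so each (event, object) pair becomes one row.
long_df = wide.stack(level='obj').reset_index()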


Pandas Concat to Multiindex on Columns

I've been playing around with this for a while. I'm working with test data: a series of test points, a series of sensors, and for each sensor min/max/avg/stdev values.
I had played around with the idea of simply appending e.g. "_min" to the end of each tag and creating a dataframe nColumns*3 wide. But that feels hacky, and when I go to plot the variables I'd have to manipulate the strings to add the suffix, which feels clumsy.
It seems that a multiindex is the right way to do it, which would allow me to handle the sensor name, and the measurement individually.
I'm currently reading in the data like:
data = pd.read_excel(os.path.join(working_path, working_dir, staticDataFileName),
                     sheet_name='sheet', skiprows=6, nrows=2000, usecols='A,D:G',
                     names=["Tag", "Min", "Max", "Avg", "Stdev"], dtype={'Tag': str})
I'm then splitting the dataframe into each individual variable.
df_min = data[["Tag", "Min"]]
...
I currently have some code working where I only have a single average value.
temp = readRawData(wd, f, dataset)
# Drop the bad rows
temp.drop(temp.index[temp['Tag'] == '0'], inplace = True)
temp2 = temp.T
temp2.rename(columns=temp2.iloc[0], inplace = True)
temp2.drop(temp2.index[0], inplace = True)
I transpose the dataframe so the tags become columns, set the column names from the first row (the tags), and then drop that row, since it now just repeats the tag names. In my code, I loop over all files and build up the dataframe for all data points with
data = pd.concat([data, temp2])
Somewhere in there, I need to figure out how to create this multiindex dataframe. Most of the examples given in the pandas user guide have multi-level indices rather than multi-level columns, and I'm having a hard time following the example they do give.
I'm looking for guidance on how to take a series of dataframes which look like
df_min
   Tag1  Tag2  TagN
0  min1  min2  minN
df_avg
   Tag1  Tag2  TagN
0  avg1  avg2  avgN
and combine them into
df
   Tag1              Tag2              ...  TagN
   Min   Max   Avg   Min   Max   Avg        Min   Max   Avg
0  min1  max1  avg1  min2  max2  avg2       minN  maxN  avgN
Of course, if this is a terrible idea, please let me know. Thanks!
I was able to make this work by using a solution here:
https://stackoverflow.com/a/47338266/14066896
It's not pretty... but it seems to be working
for f in staticDataFileName:
    temp_all = readRawData(wd, f)
    temp_all.drop(temp_all.index[temp_all['Tag'] == '0'], inplace=True)
    column_list = []
    steady_dict = dict()
    temp = temp_all.T
    temp.rename(columns=temp.iloc[0], inplace=True)
    temp.drop(temp.index[0], inplace=True)
    temp.reset_index(inplace=True)
    temp.drop(columns=['index'], inplace=True)
    # Create the (tag, statistic) column names.
    for column in temp.columns:
        column_list.append((column, "Min"))
        column_list.append((column, "Max"))
        column_list.append((column, "Avg"))
        column_list.append((column, "Stdev"))
    j = 0
    # .items() replaces the now-removed .iteritems()
    for columnName, columnData in temp.items():
        temp_dict = dict()
        temp_dict["Min"] = temp.iloc[0, j]
        temp_dict["Max"] = temp.iloc[1, j]
        temp_dict["Avg"] = temp.iloc[2, j]
        temp_dict["Stdev"] = temp.iloc[3, j]
        j += 1
        steady_dict[columnName] = temp_dict
    t = pd.DataFrame(steady_dict).unstack().to_frame().T
    t.columns = pd.MultiIndex.from_tuples(column_list)
    # correctStaticData(temp2, wd2)
    data = pd.concat([data, t])
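For comparison, when the per-statistic frames already share the tag names as columns (like df_min and df_avg in the question, plus an analogous df_max), pd.concat with keys can build the same two-level column layout more directly. A minimal sketch under that assumption:
df = pd.concat([df_min, df_max, df_avg], axis=1, keys=['Min', 'Max', 'Avg'])
# The keys become the outer column level; swap levels so the tag is on top.
df = df.swaplevel(axis=1).sort_index(axis=1)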

Reformatting a dataframe to access it for sort after concatenating two series

I've concatenated two series into a dataframe. However, one of the issues I'm now facing is that I have no column headings on the actual data that would help me do a sort.
hist_a = pd.crosstab(category_a, category, normalize=True)
hist_b = pd.crosstab(category_b, category, normalize=True)
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index])
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index])
df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
The data looks like the following:
                      0             1
category
0017817703277  0.000516  5.384341e-04
0017817703284  0.000516  5.384341e-04
0017817731348  0.000216  2.856169e-04
0017817731355  0.000216  2.856169e-04
and I'd like to do a sort, but there are no proper column headings
df_plots = df_plots.sort_values(by=['0?'])
But the dataframe seems to be in two parts. How could I better structure the dataframe to have 'proper' columns such as '0' or 'plot a' rather than being indexable by an integer, which seems to be hard to work with.
category       plot a    plot b
0017817703277  0.000516  5.384341e-04
0017817703284  0.000516  5.384341e-04
0017817731348  0.000216  2.856169e-04
0017817731355  0.000216  2.856169e-04
Just rename the columns of the dataframe, for example:
df = pd.DataFrame({0:[1,23]})
df = df.rename(columns={0:'new name'})
If you have a lot of columns, you can rename all of them at once like:
df = pd.DataFrame({0:[1,23]})
rename_dict = {key: f'Col {key}' for key in df.keys() }
df = df.rename(columns=rename_dict)
You can also define the series with the name, so you avoid changing the name afterwards:
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index], name = 'counts_a')
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index], name = 'counts_b')
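The names can also be set at concat time via the keys argument, which avoids renaming afterwards (same counts_a / counts_b as above):
df_plots = pd.concat([counts_a, counts_b], axis=1, keys=['plot a', 'plot b']).fillna(0)
df_plots = df_plots.sort_values(by='plot a')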

Pandas: how to add a dataframe inside a cell of another dataframe?

I have an empty dataframe like the following:
simReal2013 = pd.DataFrame(index = np.arange(0,1,1))
Then I read as dataframes some .csv files.
stat = np.arange(0,5)
xv = [0.005, 0.01, 0.05]
br = [0.001,0.005]
for i in xv:
    for j in br:
        I = 0
        for s in stat:
            string = 'results/2013/real/run_%d_%f_%f_15.0_10.0_T0_RealNet.csv' % (s, i, j)
            sim = pd.read_csv(string, sep=' ')
            I += np.array(sim.I)
        sim.I = I / 5
        col = '%f_%f' % (i, j)
        simReal2013.insert(0, col, sim)
I would like to add the dataframe that I read in a cell of simReal2013. In doing so I get the following error:
ValueError: Wrong number of items passed 9, placement implies 1
Yes, putting a dataframe inside of a dataframe is probably not the way you want to go, but if you must, this is one way to do it:
df_in = pd.DataFrame([[1, 2, 3]] * 2)
d = {}
d['a'] = df_in
df_out = pd.DataFrame([d])
type(df_out.loc[0,"a"])
>>> pandas.core.frame.DataFrame
Maybe a dictionary of dataframes would suffice for your use.
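For instance, a sketch that mirrors the loop from the question but stores each averaged frame in a plain dict keyed by the same '%f_%f' label (reusing stat, xv and br from above; file paths are the asker's):
simReal2013 = {}
for i in xv:
    for j in br:
        I = 0
        for s in stat:
            string = 'results/2013/real/run_%d_%f_%f_15.0_10.0_T0_RealNet.csv' % (s, i, j)
            sim = pd.read_csv(string, sep=' ')
            I += np.array(sim.I)
        sim.I = I / 5  # average over the 5 runs, as in the question
        simReal2013['%f_%f' % (i, j)] = sim  # whole frame stored under its parameter key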

Iterating over multiple pandas dataframes is slow

I'm trying to find, for every row in Dataframe1, the number of words it shares with every row in Dataframe2.
Based on the similarities I want to create a new dataframe where the columns are the N rows of Dataframe2 and the values are the similarity counts.
My current code works, but it runs very slowly. I'm not sure how to optimize it...
df = pd.DataFrame([])
for x in range(10000):
    save = {}
    terms_1 = data['text_tokenized'].iloc[x]
    save['code'] = data['code'].iloc[x]
    for y in range(3000):
        terms_2 = data2['terms'].iloc[y]
        similar_n = len(list(terms_2.intersection(terms_1)))
        save[data2['code'].iloc[y]] = similar_n
    df = df.append(pd.DataFrame([save]))
Update: new code (still running slow)
def get_sim(x, terms):
    similar_n = len(list(x.intersection(terms)))
    return similar_n

for index in icd10_terms.itertuples():
    code, terms = index[1], index[2]
    data[code] = data['text_tokenized'].apply(get_sim, args=(terms,))
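Part of the slowness in the first snippet comes from growing the DataFrame with append inside the loop; collecting plain dicts in a list and building the frame once at the end is usually much faster. A sketch of that change, reusing the column names from the question (adjust the ranges to your data):
rows = []
for x in range(len(data)):
    terms_1 = data['text_tokenized'].iloc[x]
    row = {'code': data['code'].iloc[x]}
    for y in range(len(data2)):
        # Count the set intersection directly, without building a list first.
        row[data2['code'].iloc[y]] = len(data2['terms'].iloc[y].intersection(terms_1))
    rows.append(row)
df = pd.DataFrame(rows)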

Iteratively add columns of various length to DataFrame

I have a few categorical columns (descriptions) in my DataFrame df_churn which I'd like to convert to numerical values. And of course I'd like to create a lookup table, because I will need to convert them back eventually.
The problem is that every column has a different number of categories, so appending to df_categories is not easy and I can't think of any simple way to do it.
Here is what I have so far. It stops after the first column because of the different lengths.
cat_clmn = ['CLI_REGION','CLI_PROVINCE','CLI_ORIGIN','cli_origin2','cli_origin3', 'ONE_PRD_TYPE_1']
df_categories = pd.DataFrame()
def categorizer(_clmn):
    for clmn in cat_clmn:
        dict_cat = {key: value for value, key in enumerate(df_churn[clmn].unique())}
        df_categories[clmn] = dict_cat.values()
        df_categories[clmn + '_key'] = dict_cat.keys()
        df_churn[clmn + '_CAT'] = df_churn[clmn].map(dict_cat)
categorizer(cat_clmn)
There is a temporary solution, but I am sure it can be done in a better way.
df_CLI_REGION = pd.DataFrame()
df_CLI_PROVINCE = pd.DataFrame()
df_CLI_ORIGIN = pd.DataFrame()
df_cli_origin2 = pd.DataFrame()
df_cli_origin3 = pd.DataFrame()
df_ONE_PRD_TYPE_1 = pd.DataFrame()
cat_clmn = ['CLI_REGION','CLI_PROVINCE','CLI_ORIGIN','cli_origin2','cli_origin3', 'ONE_PRD_TYPE_1']
df_lst = [df_CLI_REGION,df_CLI_PROVINCE,df_CLI_ORIGIN,df_cli_origin2,df_cli_origin3, df_ONE_PRD_TYPE_1]
def categorizer(_clmn):
    for clmn, df in zip(cat_clmn, df_lst):
        d = {key: value for value, key in enumerate(df_churn[clmn].unique())}
        df[clmn] = d.values()
        df[clmn + '_key'] = d.keys()
        df_churn[clmn + '_CAT'] = df_churn[clmn].map(d)
categorizer(cat_clmn)
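A more compact variant could use pd.factorize and keep one small lookup frame per column in a dict, which sidesteps the length mismatch entirely. A sketch under the same df_churn / cat_clmn assumptions (the lookups dict is new, not part of the original code):
lookups = {}
for clmn in cat_clmn:
    codes, uniques = pd.factorize(df_churn[clmn])
    df_churn[clmn + '_CAT'] = codes
    # One lookup table per column: numeric code -> original category.
    lookups[clmn] = pd.DataFrame({clmn: range(len(uniques)), clmn + '_key': uniques})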
