Pandas Concat to Multiindex on Columns - python

Been playing around with this for a while. I'm working with test data, of which I have a series of test points, a series of sensors, and for each sensor I have min/max/avg/stdev data.
I had played around with the idea of simply appending e.g. "_min" on the end of each tag and creating a dataframe of nColumns*3 width. But... that feels hacky -- and when I call to plot the variables, I'm going to have to process the string value to add that suffix on... feels clumsy.
It seems that a multiindex is the right way to do it, which would allow me to handle the sensor name, and the measurement individually.
I'm currently reading in the data like:
data = pd.read_excel(os.path.join(working_path, working_dir, staticDataFileName),
sheet_name='sheet', skiprows = 6, nrows=2000, usecols = 'A,D:G', names = ["Tag", "Min", "Max", "Avg", "Stdev"], dtype={'Tag': str})
I'm then splitting the dataframe into each individual variable.
df_min = data[["Tag", "Min"]]
...
I currently have some code working where I only have a single average value.
temp = readRawData(wd, f, dataset)
# Drop the bad rows
temp.drop(temp.index[temp['Tag'] == '0'], inplace = True)
temp2 = temp.T
temp2.rename(columns=temp2.iloc[0], inplace = True)
temp2.drop(temp2.index[0], inplace = True)
I need to transpose the dataframe to get the tag names as columns, and then set the columns to the tag names. I then drop the first index, which now is just the tag names. In my code, I am looping over all files, and create the dataframe for all datapoints with
data = pd.concat([data, temp2])
Somewhere in there, I need to figure out how to create this multiindex dataframe. Most of the examples given in the pandas user guide LINK have the indices as multi-level, not columns. The example they give.. I'm having a hard time following.
I'm looking for guidance on how to take a series of dataframe which look like
df_min
Tag1 Tag2 TagN
0 min1 min2 minN
df_avg
Tag1 Tag2 TagN
0 avg1 avg2 avgN
and combine them into
df
Tag1 Tag2 ... TagN
Min Max Avg Min Max Avg Min Max Avg
0 min1 max1 avg1 min2 max2 avg2 minN maxN avgN
Of course, is this is a terrible idea, please let me know. Thanks!

I was able to make this work by using a solution here:
https://stackoverflow.com/a/47338266/14066896
It's not pretty... but it seems to be working
for f in staticDataFileName:
temp_all = readRawData(wd, f)
temp_all.drop(temp_all.index[temp_all['Tag'] == '0'], inplace = True)
column_list = []
steady_dict = dict()
temp = temp_all.T
temp.rename(columns=temp.iloc[0], inplace=True)
temp.drop(temp.index[0], inplace=True)
temp.reset_index(inplace=True)
temp.drop(columns=['index'], inplace=True)
#create column names
for column in temp.columns:
column_list.append((column, "Min"))
column_list.append((column, "Max"))
column_list.append((column, "Avg"))
column_list.append((column, "Stdev"))
j = 0
for columnName, columnData in temp.iteritems():
temp_dict = dict()
temp_dict["Min"] = temp.iloc[0, j]
temp_dict["Max"] = temp.iloc[1, j]
temp_dict["Avg"] = temp.iloc[2, j]
temp_dict["Stdev"] = temp.iloc[3, j]
j += 1
steady_dict[columnName] = temp_dict
t = pd.DataFrame(steady_dict).unstack().to_frame().T
t.columns = pd.MultiIndex.from_tuples(column_list)
#correctStaticData(temp2, wd2)
data = pd.concat([data, t])

Related

How to concatenate a series to a pandas dataframe in python?

I would like to iterate through a dataframe rows and concatenate that row to a different dataframe basically building up a different dataframe with some rows.
For example:
`IPCSection and IPCClass Dataframes
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis = 0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
for icl, clrow in IPCClass.iterrows():
if (secrow[0] in clrow[0]):
pdList = [finalpatentclasses, pd.DataFrame(secrow), pd.DataFrame(clrow)]
finalpatentclasses = pd.concat(pdList, axis=0, ignore_index=True)
display(finalpatentclasses)
The output is:
I want the nan values to dissapear and move all the data under the correct columns. I tried axis = 1 but messes up the column names. Append does not work as well all values are placed diagonally at the table with nan values as well.
Alright, I have figured it out. The idea is that you create a newrowDataframe and concatenate all the data in a list from there you can add it to the dataframe and then conc with the final dataframe.
Here is the code:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis = 0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
for icl, clrow in IPCClass.iterrows():
newrow = pd.DataFrame(columns=allcolumns)
values = np.concatenate((secrow.values, subclrow.values), axis=0)
newrow.loc[len(newrow.index)] = values
finalpatentclasses = pd.concat([finalpatentclasses, newrow], axis=0)
finalpatentclasses.reset_index(drop=false, inplace=True)
display(finalpatentclasses)
Update the code below is more efficient:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns, IPCSubClass.columns, IPCGroup.columns), axis = 0)
newList = []
for secrow in IPCSection.itertuples():
for clrow in IPCClass.itertuples():
if (secrow[1] in clrow[1]):
values = ([secrow[1], secrow[2], subclrow[1], subclrow[2]])
new_row = {IPCSection.columns[0]: [secrow[1]], IPCSection.columns[1]: [secrow[2]],
IPCClass.columns[0]: [clrow[1]], IPCClass.columns[1]: [clrow[2]]}
newList.append(values)
finalpatentclasses = pd.DataFrame(newList, columns=allcolumns)
display(finalpatentclasses)

Split a dataframe into two dataframe using first column that have a string values in python

I have two .txt file where I want to separate the data frame into two parts using the first column value. If the value is less than "H1000", we want in a first dataframe and if it is greater or equal to "H1000" we want in a second dataframe.First column starts the value with H followed by a four numbers. I want to ignore H when comparing numbers less than 1000 or greater than 1000 in python.
What I have tried this thing,but it is not working.
ht_data = all_dfs.index[all_dfs.iloc[:, 0] == "H1000"][0]
print(ht_data)
Here is my code:
if (".txt" in str(path_txt).lower()) and path_txt.is_file():
txt_files = [Path(path_txt)]
else:
txt_files = list(Path(path_txt).glob("*.txt"))
for fn in txt_files:
all_dfs = pd.read_csv(fn,sep="\t", header=None) #Reading file
all_dfs = all_dfs.dropna(axis=1, how='all') #Drop the columns where all columns are NaN
all_dfs = all_dfs.dropna(axis=0, how='all') #Drop the rows where all columns are NaN
print(all_dfs)
ht_data = all_dfs.index[all_dfs.iloc[:, 0] == "H1000"][0]
print(ht_data)
df_h = all_dfs[0:ht_data] # Head Data
df_t = all_dfs[ht_data:] # Tene Data
Can anyone help me how to achieve this task in python?
Assuming this data
import pandas as pd
data = pd.DataFrame(
[
["H0002", "Version", "5"],
["H0003", "Date_generated", "8-Aug-11"],
["H0004", "Reporting_period_end_date", "19-Jun-11"],
["H0005", "State", "AW"],
["H1000", "Tene_no/Combined_rept_no", "E75/3794"],
["H1001", "Tenem_holder Magnetic Resources", "NL"],
],
columns = ["id", "col1", "col2"]
)
We can create a mask of over and under a pre set threshold, like 1000.
mask = data["id"].str.strip("H").astype(int) < 1000
df_h = data[mask]
df_t = data[~mask]
If you want to compare values of the format val = HXXXX where X is a digit represented as a character, try this:
val = 'H1003'
val_cmp = int(val[1:])
if val_cmp < 1000:
# First Dataframe
else:
# Second Dataframe

Pandas assign value based on next row(s)

Consider this simple pandas DataFrame with columns 'record', 'start', and 'param'. There can be multiple rows with the same record value, and each unique record value corresponds to the same start value. However, the 'param' value can be different for the same 'record' and 'start' combination:
pd.DataFrame({'record':[1,2,3,4,4,5,6,7,7,7,8], 'start':[0,5,7,13,13,19,27,38,38,38,54], 'param':['t','t','t','u','v','t','t','t','u','v','t']})
I'd like to make a column 'end' that takes the value of 'start' in the row with the next unique value of 'record'. The values of column 'end' should be:
[5,7,13,19,19,27,38,54,54,54,NaN]
I'm able to do this using a for loop, but I know this is not preferred when using pandas:
max_end = 100
for idx, row in df.iterrows():
try:
n = 1
next_row = df.iloc[idx+n]
while next_row['start'] == row['start']:
n = n+1
next_row = df.iloc[idx+n]
end = next_row['start']
except:
end = max_end
df.at[idx, 'end'] = end
Is there an easy way to achieve this without a for loop?
I have no doubt there is a smarter solution but here is mine.
df1['end'] = df1.drop_duplicates(subset = ['record', 'start'])['start'].shift(-1).reindex(index = df1.index, method = 'ffill')
-=EDIT=-
Added subset into drop_duplicates to account for question amendment
This solution is equivalent to #Quixotic22 although more explicit.
df = pd.DataFrame({
'record':[1,2,3,4,4,5,6,7,7,7,8],
'start':[0,5,7,13,13,19,27,38,38,38,54],
'param':['t','t','t','u','v','t','t','t','u','v','t']
})
max_end = 100
df["end"] = None # create new column with empty values
loc = df["record"].shift(1) != df["record"] # record where the next value is diff from previous
df.loc[loc, "end"] = df.loc[loc, "start"].shift(-1) # assign desired values
df["end"].fillna(method = "ffill", inplace = True) # fill remaining missing values
df.loc[df.index[-1], "end"] = max_end # override last value
df

DataFrame with one column 0 to 100

I need a DataFrame of one column ['Week'] that has all values from 0 to 100 inclusive.
I need it as a Dataframe so I can perform a pd.merge
So far I have tried creating an empty DataFrame, creating a series of 0-100 and then attempting to append this series to the DataFrame as a column.
alert_count_list = pd.DataFrame()
week_list= pd.Series(range(0,101))
alert_count_list['week'] = alert_count_list.append(week_list)
Try this:
df = pd.DataFrame(columns=["week"])
df.loc[:,"week"] = np.arange(101)
alert_count_list = pd.DataFrame(np.zeros(101), columns=['week'])
or
alert_count_list = pd.DataFrame({'week':range(101)})
You can try:
week_vals = []
for i in range(0, 101):
week_vals.append(i)
df = pd.Dataframe(columns = ['week'])
df['week'] = week_vals

Pandas: how to add a dataframe inside a cell of another dataframe?

I have an empty dataframe like the following:
simReal2013 = pd.DataFrame(index = np.arange(0,1,1))
Then I read as dataframes some .csv files.
stat = np.arange(0,5)
xv = [0.005, 0.01, 0.05]
br = [0.001,0.005]
for i in xv:
for j in br:
I = 0
for s in stat:
string = 'results/2013/real/run_%d_%f_%f_15.0_10.0_T0_RealNet.csv'%(s,i,j)
sim = pd.read_csv(string, sep=' ')
I += np.array(sim.I)
sim.I = I / 5
col = '%f_%f'%(i,j)
simReal2013.insert(0, col, sim)
I would like to add the dataframe that I read in a cell of simReal2013. In doing so I get the following error:
ValueError: Wrong number of items passed 9, placement implies 1
Yes, putting a dataframe inside of a dataframe is probably not the way you want to go, but if you must, this is one way to do it:
df_in=pd.DataFrame([[1,2,3]]*2)
d={}
d['a']=df
df_out=pd.DataFrame([d])
type(df_out.loc[0,"a"])
>>> pandas.core.frame.DataFrame
Maybe a dictionary of dataframes would suffice for your use.

Categories

Resources