Pandas dataframe create from dict of crossings - python

I want to create a simple matrix with the names of the software requirements (SWRS) as the index and all the software test cases (SWTS) in the project as the columns.
Where a SWRS is covered by a SWTS, I need to place "something" (for example a cross).
In my code draft, I create an empty dataframe and then iterate to place the cross:
import pandas as pd

struct = {
    "swrslist": ["swrs1", "swrs2", "swrs3", "swrs4"],
    "swtslist": ["swts1", "swts2", "swts3", "swts4", "swts5", "swts6"],
    "mapping": {
        "swrs1": ["swts1", "swts3", "swts4"],
        "swrs2": ["swts2", "swts3", "swts5"],
        "swrs4": ["swts1", "swts3", "swts5"]
    }
}

if __name__ == "__main__":
    df = pd.DataFrame(index=pd.Index(pd.Series(struct["swrslist"])),
                      columns=pd.Index(struct["swtslist"]))
    print(df)
    for key in struct["mapping"].keys():
        for elem in struct["mapping"][key]:
            print(key, elem)
            df.at[key, elem] = "x"
    print(df)
    df.to_excel("mapping.xlsx")
the output is the following
      swts1 swts2 swts3 swts4 swts5 swts6
swrs1     x   NaN     x     x   NaN   NaN
swrs2   NaN     x     x   NaN     x   NaN
swrs3   NaN   NaN   NaN   NaN   NaN   NaN
swrs4     x   NaN     x   NaN     x   NaN
I know that creating an empty dataframe and then iterating over it is not efficient.
I tried to create the dataframe directly, as follows:
df = pd.DataFrame(struct["mapping"], index=pd.Index(pd.Series(struct["swrslist"])),
                  columns=pd.Index(struct["swtslist"]))
but it creates an empty dataframe:
      swts1 swts2 swts3 swts4 swts5 swts6
swrs1   NaN   NaN   NaN   NaN   NaN   NaN
swrs2   NaN   NaN   NaN   NaN   NaN   NaN
swrs3   NaN   NaN   NaN   NaN   NaN   NaN
swrs4   NaN   NaN   NaN   NaN   NaN   NaN
Furthermore, in the future I plan to provide different values depending on whether a SWTS passed, failed or was not executed.
How can I create the dataframe efficiently, rather than iterating over the "mapping" entries?

Though I used a for loop too, how about this?
df = pd.DataFrame(index=pd.Index(pd.Series(struct["swrslist"])), columns=pd.Index(struct["swtslist"]))
for key, value in struct["mapping"].items():
    df.loc[key, value] = "x"
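If you want to avoid writing the loop yourself, here is a loop-free sketch (my own suggestion, not from the answer above, and assuming the same struct as in the question): explode the mapping into (SWRS, SWTS) pairs and cross-tabulate them. The final replace step is also a natural place to later map to pass/fail/not-executed markers instead of a plain "x".
pairs = pd.Series(struct["mapping"]).explode()  # one row per (swrs, swts) pair
df = (
    pd.crosstab(pairs.index, pairs)  # 1 where a pair exists, 0 otherwise
      .reindex(index=struct["swrslist"], columns=struct["swtslist"], fill_value=0)  # add the unmapped swrs3 and swts6
      .replace({1: "x", 0: ""})
)
print(df)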

Related

Adding a value at the end of a column in a multiindex column dataframe

I have a simple problem that probably has a simple solution but I couldn't find it anywhere. I have the following multiindex column DataFrame:
mux = pd.MultiIndex.from_product([['A', 'B', 'C'], ['Datetime', 'Str', 'Ret']])
dfr = pd.DataFrame(columns=mux)
|       A        |       B        |       C        |
|Datetime|Str|Ret|Datetime|Str|Ret|Datetime|Str|Ret|
I need to add values one by one at the end of a specific sub-column. For example, add one value at the end of column A, sub-column Datetime, and leave the rest of the row as it is; then add another value to column B, sub-column Str, and again leave the rest of the values in that row untouched, and so on. So my questions are: Is it possible to target individual locations in this type of DataFrame, and how? Also, is it possible to append not a full row but an individual value, always at the end after the previous value, without knowing where that end is? Thank you so much for your answers.
IIUC, you can use .loc:
idx = len(dfr) # get the index of the next row after the last one
dfr.loc[idx, ('A', 'Datetime')] = pd.to_datetime('2021-09-24')
dfr.loc[idx, ('B', 'Str')] = 'Hello'
dfr.loc[idx, ('C', 'Ret')] = 4.3
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 00:00:00 NaN NaN NaN Hello NaN NaN NaN 4.3
Update
I mean, for example, when I have a different number of values in different columns (for example 6 values in column A-Str but only 4 in column B-Datetime), but I don't really know how many. In that case what I need is to add the next value in that column after the last one, so I need to know the index of the last non-NaN value of that particular column in order to use it in your answer. If I use len(dfr) while trying to add a value to the column that only has 4 values, it will end up in the 7th row instead of the 5th row, because one of the columns may have more values than the others.
You can do it easily using last_valid_index. Create a convenience function append_to_col to append values in place in your dataframe:
def append_to_col(col, val):
    idx = dfr[col].last_valid_index()
    dfr.loc[idx + 1 if idx is not None else 0, col] = val

# Fill your dataframe
append_to_col(('A', 'Datetime'), '2021-09-24')
append_to_col(('A', 'Datetime'), '2021-09-25')
append_to_col(('B', 'Str'), 'Hello')
append_to_col(('C', 'Ret'), 4.3)
append_to_col(('C', 'Ret'), 8.2)
append_to_col(('A', 'Datetime'), '2021-09-26')
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 NaN NaN NaN Hello NaN NaN NaN 4.3
1 2021-09-25 NaN NaN NaN NaN NaN NaN NaN 8.2
2 2021-09-26 NaN NaN NaN NaN NaN NaN NaN NaN

Dataframe updates with pandas that includes duplicated column headers

I am incredibly new to the pandas Python module and have a problem I'm trying to solve. Take the following dataframe as an example. It was read in from a .csv where "link" is the column header for the last three columns:
summary link link.1 link.2
0 test PCR-12345 PCR-54321 PCR-65432
1 test2 NaN NaN NaN
2 test3 DR-1234 PCR-1244 NaN
3 test4 PCR-4321 DR-4321 NaN
My goal is to update the dataframe to the following:
summary link link.1 link.2
0 test NaN NaN NaN
1 test2 NaN NaN NaN
2 test3 DR-1234 NaN NaN
3 test4 NaN DR-4321 NaN
So the criterion is basically: if the column header is "link.X" AND the value is a string that starts with "PCR-", update it to an empty/NaN value.
How do I loop through each row's values, check the header and value, and replace if criteria is satisfied?
Let's try pd.Series.str.startswith and pd.Series.mask:
# columns starting with `link`
cols = df.columns[df.columns.str[:4]=='link']
# for each `link` column, mask the `PCR` with `NaN`:
df[cols] = df[cols].apply(lambda x: x.mask(x.str.startswith('PCR')==True) )
Output:
summary link link.1 link.2
0 test NaN NaN NaN
1 test2 NaN NaN NaN
2 test3 DR-1234 NaN NaN
3 test4 NaN DR-4321 NaN
Here is another way. It uses str.startswith() to find the columns that start with link, then where() to keep only the values where the condition is True:
cols = df.columns.str.startswith('link')
df.loc[:,cols] = df.loc[:,cols].where(df.loc[:,cols].replace(r'[-].*','',regex=True).eq('DR'))
I used @Quang Hoang's answer. I also needed to make sure the headers were written back out to a csv with their original values. I did that by first grabbing the original header:
with open('test.csv') as f:
    orig_header = f.readline()
orig_header = orig_header.split(",")
orig_header[-1] = orig_header[-1].strip()  # get rid of newline
I then went ahead and did the data manipulation with Quang's suggestion. After that I wrote it back out to a csv with the original header:
df.to_csv('test_updated.csv', index = False, header=orig_header)

select range of values for all columns in pandas dataframe

I have a dataframe 'DF', part of which looks like this:
I want to select only the values between 0 and 0.01, to form a new dataframe (with blanks where the value was over 0.01).
To do this, I tried:
similarity = []
for x in DF:
    similarity.append([DF[DF.between(0, 0.01).any(axis=1)]])
simdf = pd.DataFrame(similarity)
simdf.to_csv("similarity.csv")
However, I get the error AttributeError: 'DataFrame' object has no attribute 'between'.
How do I select a range of values and create a new dataframe with these?
Just do the two comparisons:
df_new = df[(df>0) & (df<0.01)]
Example:
import pandas as pd
df = pd.DataFrame({"a":[0,2,4,54,56,4],"b":[4,5,7,12,3,4]})
print(df[(df>5) & (df<33)])
     a     b
0  NaN   NaN
1  NaN   NaN
2  NaN   7.0
3  NaN  12.0
4  NaN   NaN
5  NaN   NaN
If you want a blank string instead of NaN:
df[(df>5) & (df<33)].fillna("")

Fast elementwise apply function using index and column name - pandas

I have a dataframe which can be simplified like this:
df = pd.DataFrame(index = ['01/11/2017', '01/11/2017', '01/11/2017', '02/11/2017', '02/11/2017', '02/11/2017'],
columns = ['Period','_A', '_B', '_C'] )
df.Period = [1, 2, 3, 1, 2, 3]
df
which looks like:
Date Period _A _B _C
01/11/2017 1 NaN NaN NaN
01/11/2017 2 NaN NaN NaN
01/11/2017 3 NaN NaN NaN
02/11/2017 1 NaN NaN NaN
02/11/2017 2 NaN NaN NaN
02/11/2017 3 NaN NaN NaN
And I want to apply my function to each cell
Get_Y(Date, Period, Location)
(where _A, _B, _C, ... are the locations).
Get_Y is a complex function that looks up data from other dataframes using the Date, Period and Location and, based on criteria, gives a value for Y (a float between 0 and 1).
I have managed to make this work with iterrows:
for index, row in PeriodDF.iterrows():
    date = index
    Period = row.loc[row.index[0]]
    LocationList = row.index[1:]
    print(date, Period)
    for Location in LocationList:
        PeriodDF.loc[(PeriodDF.index == date) & (PeriodDF.Period == Period), Location] = Get_Y(date, Period, Location)
But this takes over 1 hour.
There must be a way to do this faster in pandas.
I have tried creating 3 dataframes, one an array of the Period, one an array of the Location, and one of the Date, but I am not sure how to apply Get_Y elementwise using the value from each dataframe.
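One possible sketch (my own suggestion, not a tested answer, and assuming Get_Y exists exactly as described): reshape the frame to long form so each (Date, Period, Location) combination is a single row, call Get_Y once per row, and unstack back. This removes the repeated boolean-mask assignment inside the loop, though if Get_Y itself dominates the runtime this alone will not help much.
long_df = (PeriodDF.reset_index()
                   .rename(columns={'index': 'Date'})
                   .melt(id_vars=['Date', 'Period'], var_name='Location', value_name='Y'))
# one Get_Y call per (Date, Period, Location) row
long_df['Y'] = [Get_Y(d, p, loc) for d, p, loc in
                zip(long_df['Date'], long_df['Period'], long_df['Location'])]
result = long_df.set_index(['Date', 'Period', 'Location'])['Y'].unstack('Location')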

combining different data which have 1 axis in common python numpy, write to table

I have a set of numpy 2d arrays which all have one axis in common, and I wish to put them in order in the same 'table'.
a=np.loadtxt('file',unpack=True,dtype='str')
b=np.loadtxt('file',unpack=True,dtype='str')
c=np.loadtxt('file',unpack=True,dtype='str')
d=np.loadtxt('file',unpack=True,dtype='str')
From these arrays, a[0], b[0], c[0] and d[0] are all times, and a[1], b[1], c[1] and d[1] are values of different things. I wish to put them all on the same axis. I also wish to put NaN values in where there are no values from the arrays. In the end I will get a table like the one below. Is there an easy way to do this in Python? The problem is that a, b, c and d are all different lengths, so I need to put in NaNs where some of the variables don't have values; the time axis also needs to be generated from all 4 variables.
time  a      b      c      d
t1    NaN    value  NaN    NaN
t2    value  NaN    NaN    NaN
t3    value  NaN    value  NaN
t4    NaN    NaN    value  NaN
t5    value  NaN    value  value
t6    NaN    NaN    value  value
t7    NaN    NaN    NaN    value
t8    NaN    NaN    value  NaN
t9    NaN    NaN    value  value
Pandas might not be fast for matrix operations, but it's good for generating a "table":
import pandas as pd
# give a column name to each column where your numpy arrays share the same time axis
# (renamed the dict to data so it does not shadow the array d)
data = {'time': a[0], 'a': a[1], 'b': b[1], 'c': c[1], 'd': d[1]}
df = pd.DataFrame(data=data)
I think pandas fills in NaN automatically where data is missing.
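A caveat and a sketch of my own (not part of the original answer): a plain dict of unequal-length arrays will raise a ValueError, so when a, b, c and d have different lengths one option is to build one Series per array indexed by its own times and let pd.concat align them on the union of time stamps, filling NaN elsewhere. The small arrays below are hypothetical stand-ins for the np.loadtxt results.
import numpy as np
import pandas as pd

# hypothetical stand-ins: row 0 holds times, row 1 holds values, and the lengths differ
a = np.array([["t2", "t3", "t5"], ["1.0", "1.1", "1.2"]])
b = np.array([["t1", "t2"], ["2.0", "2.1"]])

series = {name: pd.Series(arr[1], index=arr[0], name=name)
          for name, arr in {"a": a, "b": b}.items()}
# concat aligns on the union of all time indices and fills NaN where a time is missing
table = pd.concat(series, axis=1).sort_index()
print(table)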
