How to create a summary row in Pandas from audit fields - python

I am trying to derive a single row from an original input row plus various changes to individual column values at different points in time. I have simplified the example below.
I have read some data into my dataframe like so:
   A  B  C  D  E
0  h  h  h  h  h
1  x
2     y  1
3        2  3
Row 0 ("h") represents my original record.
Rows 1 - 3 are changes over time to specific columns.
I would like to create a single "result row" that would look something like:
'x', 'y', '2', '3', 'h'
Is there a simple way to do this with Pandas and Python without excessive looping?

You can get it as a list like so:
>>> [df[s][df[s].last_valid_index()] for s in df]
['x', 'y', 2, 3, 'h']
If you need it appended to the frame with a name, you need to build a Series with the frame's columns as its index and then append it, like so:
df.append(pd.Series(temp, index=df.columns, name='total'))
# note: this returns a new object rather than modifying df in place
# 'temp' is the list produced by the code above
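Note that DataFrame.append was removed in pandas 2.0; a concat-based sketch of the same idea, using the example frame above:
import pandas as pd

temp = [df[c][df[c].last_valid_index()] for c in df]   # last valid value per column
total = pd.Series(temp, index=df.columns, name='total')
df = pd.concat([df, total.to_frame().T])               # appends a 'total' row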

You can just try forward-filling, so the last row ends up holding the most recent value in each column:
# if the blanks were read in as empty strings, convert them to NaN first:
# df = df.replace({'': np.nan})
df.ffill().iloc[[-1], :]

Related

Comparing two Astropy Tables, based on a condition related to specific columns

My apologies if this is a duplicate but I couldn't find anything exactly like this myself.
I have two Astropy tables, let's say X and Y. Each has multiple columns but what I want to do is to compare them by setting various conditions on different columns.
For example, table X looks like this and has 1000 rows and 9 columns (let's say):
Name_X (str)     Date_X (float64)     Date (int32)   ...
GaiaX21-116383   59458.633888888886   59458          ...
GaiaX21-116382   59458.504375         59458          ...
and table Y looks like this and has 500 rows and 29 columns (let's say):
Name_Y (str14)   Date_Y (float64)     Date (int32)   ...
GaiaX21-117313   59461.911724537036   59461          ...
GaiaX21-118760   59466.905173611114   59466          ...
I want to compare the two tables: basically, check whether the same 'Name' exists in both tables. If it does, I treat that as a "match", take that entire row, and put it in a new table, discarding everything else (or storing it in another temporary table).
So I wrote a function like this:
def find_diff(table1, table2, param):
    # table1 is bigger; param defines which column, assuming both tables use the same column names
    temp = Table(table1[0:0])
    table3 = Table(table1[0:0])
    for i in range(0, len(table1)):
        for j in range(0, len(table2)):
            if table1[param][i] != table2[param][j]:
                # temp.add_row(table2[j])
                # else:
                table3.add_row(table1[i])
    return table3
While this works in principle, it also takes a huge amount of time to finish, so it simply isn't practical to run the code this way. Similarly, I want to apply other conditions to other columns (cross-matching the observation dates, for example).
Any suggestions would be greatly helpful, thank you!
It sounds like you want to do a table join on the name columns. This can be done as documented at https://docs.astropy.org/en/stable/table/operations.html#join.
E.g.
# Assume table_x and table_y
from astropy.table import join
table_xy = join(table_x, table_y, keys_left='Name_X', keys_right='Name_Y')
As a full example with non-unique key values:
In [10]: t1 = Table([['x', 'x', 'y', 'z'], [1, 2, 3, 4]], names=['a', 'b'])

In [11]: t2 = Table([['x', 'y', 'y', 'Q'], [10, 20, 30, 40]], names=['a', 'c'])

In [12]: table.join(t1, t2, keys='a')
Out[12]:
<Table length=4>
  a     b     c
str1 int64 int64
---- ----- -----
   x     1    10
   x     2    10
   y     3    20
   y     3    30
I believe this site would be your best friend for this problem: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
So in theory I believe you would want something like this:
result = pd.merge(table_y, table_x, on="Name")
The key difference here is that you may need to adjust the column names so that both tables use the same name for the key. The merge will then match rows on the "Name" column between the two tables and, where they are equal, put them in the result variable. From there you can do whatever you like with the dataframe.
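Since these are Astropy tables, a minimal sketch of that route (assuming Table.to_pandas() conversion; the unified 'Name' column is an illustration):
import pandas as pd

# assumption: table_x and table_y are the astropy Tables described above
df_x = table_x.to_pandas().rename(columns={'Name_X': 'Name'})
df_y = table_y.to_pandas().rename(columns={'Name_Y': 'Name'})

# inner merge keeps only rows whose 'Name' appears in both tables
result = pd.merge(df_y, df_x, on='Name')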

Obtain a view of a DataFrame using the loc method

I am trying to obtain a view of a pandas dataframe using the loc method but it is not working as expected when I am modifying the original DataFrame.
I want to extract a row/slice of a DataFrame using the loc method so that when a modification is done to the DataFrame, the slice reflects the change.
Let's have a look at this example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':np.arange(0,5,2), 'a':np.arange(3), 'b':np.arange(3)}).set_index('ID')
df
a b
ID
0 0 0
2 1 1
4 2 2
Now I create a slice using loc:
slice1 = df.loc[[2],]
slice1
a b
ID
2 1 1
Then I modify the original DataFrame:
df.loc[2, 'b'] = 9
df
a b
ID
0 0 0
2 1 9
4 2 2
But unfortunately our slice does not reflect this modification, as I would expect for a view:
slice1
a b
ID
2 1 1
My expectation:
a b
ID
2 1 9
I found an ugly fix using a mix of iloc and loc but I hope there is a nicer way to obtain the result I am expecting.
Thank you for your help.
Disclaimer: This is not an answer.
I tried testing how overwriting the values behaves with chained assignment vs .loc, referring to the pandas documentation link that was shared by @Quang Hoang above.
This is what I tried:
dfmi = pd.DataFrame([list('abcd'),
                     list('efgh'),
                     list('ijkl'),
                     list('mnop')],
                    columns=pd.MultiIndex.from_product([['one', 'two'],
                                                        ['first', 'second']]))
df1 = dfmi['one']['second']
df2 = dfmi.loc[:, ('one', 'second')]
Output of both df1 and df2:
0 b
1 f
2 j
3 n
Iteration 1:
value = ['z', 'x', 'c', 'v']
dfmi['one']['second'] = value
Output df1:
0 z
1 x
2 c
3 v
Iteration 2:
value = ['z', 'x', 'c', 'v']
dfmi.loc[:, ('one', 'second')] = value
Output df2:
0 z
1 x
2 c
3 v
Assigning a new set of values changes the underlying data in both cases.
The documentation says:
Quote 1: 'method 2 (.loc) is much preferred over method 1 (chained [])'
Quote 2:
'Outside of simple cases, it’s very hard to predict whether __getitem__ (used by the chained option) will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward.'
I am not able to understand the explanation above. If the value in dfmi can change (as in my case) but may not change (as in Benoit's case), which approach should be used to obtain the result? Not sure if I am missing a point here.
Looking for help.
The reason the slice didn't reflect the changes you made in the original dataframe is because you created the slice first.
When you create a slice, you create a "copy" of a slice of the data. You're not directly linking the two.
The short answer here is that you have two options: 1) change the original df first, then create the slice, or 2) don't slice; just do your operations referencing the original df using .loc or .iloc.
The memory addresses of your dataframe and of your slice are different, so changes in the dataframe won't be reflected in the slice.
The answer is to change the value in the dataframe first and then slice it.
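As a minimal sketch of that ordering, reusing the example frame from the question:
df.loc[2, 'b'] = 9         # modify the original frame first
slice1 = df.loc[[2], :]    # then take the slice; slice1 now shows b == 9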

Python: How to pass row and next row DataFrame.apply() method?

I have DataFrame with thousands rows. Its structure is as below
   A   B    C  D
0  q  20  'f'
1  q  14  'd'
2  o  20  'a'
I want to compare the A column of the current row and the next row. If those values are equal, I want to write the smaller of the two B values into the D column of the row that has the greater B value, and then remove the row the smaller value came from. It's like a swap process.
   A   B    C   D
0  q  20  'f'  14
1  o  20  'a'
I have thousands of rows, and the iloc, loc, and at methods work slowly. I would like to use the DataFrame apply method at least. I tried some code samples, but they didn't work.
I want to do something as below:
DataFrame.apply(lambda row: self.compare(row, next(row)), axis=1)
I have a compare method, but I couldn't pass the next row to it. How can I pass it to the method? I am also open to faster pandas solutions.
Best not to do that with apply as it will be slow; you can look at using shift, e.g.
df['A_shift'] = df['A'].shift(1)
df['Is_Same'] = 0
df.loc[df.A_shift == df.A, 'Is_Same'] = 1
Gets a bit more complicated if you're doing the shift within groups, but still possible.
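Building on shift, a hedged sketch of the full swap, assuming matching rows are adjacent and the row with the greater B comes first (as in the example):
import pandas as pd

df = pd.DataFrame({'A': ['q', 'q', 'o'],
                   'B': [20, 14, 20],
                   'C': ['f', 'd', 'a']})

same_as_next = df['A'].eq(df['A'].shift(-1))       # current row's A equals next row's A
df.loc[same_as_next, 'D'] = df['B'].shift(-1)      # pull the next row's (smaller) B into D
result = df[~df['A'].eq(df['A'].shift(1))].reset_index(drop=True)  # drop the absorbed rows
print(result)
#    A   B  C     D
# 0  q  20  f  14.0
# 1  o  20  a   NaN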

Tuple key Dictionary to Table/Graph 3-dimensional

I have a dictionary like this:
dict = {(100,22,123):'55%',(110,24,123):'58%'}
Where, for example, the elements of the tuple are (x, y, z) and the value is the error rate of something... I want to print that dictionary, but it's not very clear to me how to do it or in what format it would be easiest to read the information (maybe: x - y - z - Rate).
I found this: Converting Dictionary to Dataframe with tuple as key, but I don't think it fits what I want, and I can't understand it.
Thank you
You can use Series with reset_index, and lastly set new column names:
import pandas as pd
d = {(100,22,123):'55%',(110,24,123):'58%'}
df = pd.Series(d).reset_index()
df.columns = ['a','b','c', 'd']
print (df)
     a   b    c    d
0  100  22  123  55%
1  110  24  123  58%
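An equivalent sketch that skips the Series step and builds the frame directly from the dict items:
df = pd.DataFrame([k + (v,) for k, v in d.items()], columns=['a', 'b', 'c', 'd'])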

Get integer row index of MultiIndex Series

I have a pandas Series with a MultiIndex, and I want to get the integer row numbers that belong to one level of the MultiIndex.
For example, if I have sample data s
s = pandas.Series([10, 23, 2, 19],
                  index=pandas.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))
which looks like this:
a  c    10
   d    23
b  c     2
   d    19
I want to get the row numbers that correspond to the level b. So here, I'd get [2, 3] as the output, because the last two rows are under b. Also, I really only need the first row that belongs under b.
I wanted to get the numbers so that I can compare across Series. Say I have five Series objects with a b level. These are time-series data, and b corresponds to a condition that was present during some of the observations (and c is a sub-condition, etc). I want to see which Series had the conditions present at the same time.
Edit: To clarify, I don't need to compare the values themselves, just the indices. For example, in R if I had this dataframe:
d = data.frame(col_1 = c('a','a','b','b'), col_2 = c('c','d','c','d'), col_3 = runif(4))
Then the command which(d$col_1 == 'b') would produce the results I want.
If the level that you want to index by is the outermost one, you can use loc:
s.loc['b']
To get the first row, I find the head method the easiest:
s.loc['b'].head(1)
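Note that loc returns the matching rows, not their integer positions. If you need the positions themselves (the [2, 3] from the question), a minimal sketch using the level values:
import numpy as np

positions = np.flatnonzero(s.index.get_level_values(0) == 'b')   # array([2, 3])
first_pos = positions[0]                                         # 2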
The idiomatic way to do the second part of your question is as follows. Say your series are named series1, series2 and series3.
big_series = pd.concat([series1, series2, series3])
big_series.loc['b']
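If you also need to know which series each row came from, one sketch uses the keys argument of concat to tag them:
big_series = pd.concat([series1, series2, series3], keys=['s1', 's2', 's3'])
big_series.xs('b', level=1)   # entries under 'b', still indexed by series tag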
