How to save dataframe with loop that produces NaN as output? - python

I have written a function random_sample() whose output is a large dataframe with 2 rows and 751 columns. Every time I run the function it returns a new dataframe, for example:
W_332 W_333 W_334 ... W_1066 W_1067 W_1068 W_1069
0 0.098432 0.094451 0.096085 ... 0.090937 0.068576 0.085326 0.095416
1 0.164170 0.197848 0.161228 ... 0.272259 0.283551 0.229989 0.230067
[2 rows x 751 columns]
When I run the following code
for g in range(5):
    sample = random_sample()
    ref = {g: sample}
    empt_dict.append(ref)
df_pd = pd.DataFrame(empt_dict)
df_pd
I get the following table as my output. While the loop runs I can see each of the 5 dataframes ([2 rows x 751 columns], each with a new set of numbers), but they are not stored together:
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
How do I save the five 2 by 751 dataframes that my function produces across the iterations? Thank you!
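One possible approach, offered as a sketch rather than a confirmed answer: pd.DataFrame treats each dict in a list as one row, so the frames are not stacked. Collecting the frames in a dict and passing it to pd.concat keeps them together under an outer index level. The random_sample() below is a hypothetical stand-in for the asker's function, and the column names and the level names "run" and "row" are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def random_sample():
    # Hypothetical stand-in for the asker's function: 2 rows x 751 columns.
    cols = [f"W_{i}" for i in range(751)]  # assumed column names
    return pd.DataFrame(np.random.rand(2, 751), columns=cols)


# Keep each frame under its loop counter instead of wrapping it in a one-key dict.
samples = {g: random_sample() for g in range(5)}

# Stack the frames; the dict keys become the outer index level.
combined = pd.concat(samples, names=["run", "row"])

print(combined.shape)
print(combined.loc[3])  # the 2 x 751 frame produced on iteration 3
```

If the individual frames are all that is needed, the samples dict (or a plain list of frames) already stores them; pd.concat is only required when a single table is wanted.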

Related

Pandas Matrix Addition/Division between index values and column header values

I have the following data frame, and for each cell I want to add the index value to the column header and divide the sum by 2.
The initial grid would look like this:
grid = pd.DataFrame(
columns=[1,2,3,4,5],
index = [1,2,3,4,5]
)
grid
This results in the following:
1 2 3 4 5
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
What I'm looking to get is the table below. For example, an index of 1 plus a column header of 3 gives 4, which divided by 2 is 2:
1 2 3 4 5
1 1 1.5 2 2.5 3
2 1.5 2 2.5 3 3.5
3 2 2.5 3 3.5 4
4 2.5 3 3.5 4 4.5
5 3 3.5 4 4.5 5
You can use broadcasting here:
grid[:] = (grid.index.values[:,None] + grid.columns.values)/2
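To make the broadcasting step easier to follow, here is a minimal self-contained sketch. It builds the grid directly from the broadcast array instead of assigning into an empty frame; the names labels, idx and values are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

labels = [1, 2, 3, 4, 5]
idx = np.array(labels)

# idx[:, None] has shape (5, 1) and idx[None, :] has shape (1, 5);
# adding them broadcasts to a (5, 5) array of index + column sums.
values = (idx[:, None] + idx[None, :]) / 2

grid = pd.DataFrame(values, index=labels, columns=labels)
print(grid)
```

The None index inserts a length-1 axis, which is what lets NumPy pair every index value with every column value without an explicit loop.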

Select rows with specific values in columns and include rows with NaN in pandas dataframe

I have a DataFrame df that looks something like this:
df
a b c
0 0.557894 -0.196294 -0.020490
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
5 -0.337374 NaN -0.771888
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
9 -2.345448 2.443669 -1.409422
I want to select the rows that have a value over some value, which I would normally do using:
new_df = df[df['c'] >= .5]
but that will return:
a b c
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
8 0.737413 NaN 0.679575
I want to get those rows, but also keep the rows that have NaN values in column 'c'. I haven't been able to find a question asking the same thing; they usually ask for one or the other, but not both. I could hard-code the rows that I want to drop, since I know the specific values, but I was wondering if there is a better solution. The end result should look something like this:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
Only rows 0, 5 and 9 are dropped, since their values in column 'c' are less than .5.
You should combine the two conditions with the | (element-wise or) operator, wrapping each condition in parentheses.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.557894,1.138774,np.nan,-0.069319,1.040089,-0.337374,-1.813278,np.nan,0.737413,-2.345448],
'b': [-0.196294,-0.699224,2.384483,np.nan,-0.271777,np.nan,-1.564666,np.nan,np.nan,2.443669],
'c': [-0.020490,np.nan,0.554292,1.162941,np.nan,-0.771888,np.nan,np.nan,0.679575,-1.409422]})
df = df[(df['c'] >= .5) | (df['c'].isnull())]
print(df)
Output:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
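You should be able to do this with a boolean mask. Note that Python's or does not work element-wise on a Series, and NaN never compares equal to anything, so use | together with isna():
new_df = df[(df['c'] >= .5) | df['c'].isna()]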
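As a small illustration of why the mask needs | and isna() (or isnull()) rather than or and an equality test, the sketch below uses a shortened, made-up version of column 'c'. The Series c and its values are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative stand-in for column 'c'.
c = pd.Series([-0.02, np.nan, 0.55, 1.16, np.nan])

# NaN never compares equal to anything, so an equality test finds no rows.
print((c == np.nan).any())  # False

# Element-wise OR of the threshold test and the missing-value test.
mask = (c >= 0.5) | c.isna()
print(c[mask].index.tolist())  # [1, 2, 3, 4]

# Python's `or` asks for a single truth value of the whole Series and fails.
try:
    c >= 0.5 or c.isna()
except ValueError as err:
    print("ValueError:", err)
```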

How to fill and merge df with 10 empty rows?

How can I fill a df with empty rows, or create a df with empty rows?
I have this df:
df = pd.DataFrame(columns=["naming","type"])
How can I fill this df with empty rows?
Specify index values:
df = pd.DataFrame(columns=["naming","type"], index=range(10))
print (df)
naming type
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
If you need empty strings instead:
df = pd.DataFrame('',columns=["naming","type"], index=range(10))
print (df)
naming type
0
1
2
3
4
5
6
7
8
9
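If the empty df already exists and should be padded afterwards, rather than created with an index, reindex is one option. This is a sketch under that assumption; the names padded and blank are illustrative.

```python
import pandas as pd

df = pd.DataFrame(columns=["naming", "type"])

# Add ten rows labelled 0..9; cells with no data become NaN.
padded = df.reindex(range(10))
print(padded.shape)

# Same idea, but fill the new cells with empty strings.
blank = df.reindex(range(10), fill_value="")
print(blank.head())
```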

MultiIndex slicing doesn't work as expected (error involving lexsorted tuples)

I've got a problem, and it just doesn't make sense. I've got a large pd.DataFrame that I reduced in size so that I could easily show it in an example (called test1):
>>> print(test1)
value TIME \
star 0 1 2 3 4
0 1952.205873 1952.205873 1952.205873 1952.205873 1952.205873
1 1952.226307 1952.226307 1952.226307 1952.226307 1952.226307
2 1952.246740 1952.246740 1952.246740 1952.246740 1952.246740
3 1952.267174 1952.267174 1952.267174 1952.267174 1952.267174
value CNTS \
star 5 0 1 2
0 1952.205873 575311.432228 534103.079080 179471.239561
1 1952.226307 571480.854183 533138.021051 187456.451900
2 1952.246740 555631.798095 530263.846685 203247.734806
3 1952.267174 553639.056784 527058.335157 210088.229427
value
star 3 4 5
0 121884.201457 39003.397835 2089.321993
1 122796.312201 39552.401359 2810.010142
2 123500.068304 39158.050385 2652.409086
3 124357.387418 38881.565235 2721.908129
and I want to perform slice indexing on it. However it just doesn't seem to work. Here is what I try:
test1.loc[:,(slice(None),0)]
and I get this error:
*** KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'
This isn't the first time I've had this error or asked the question, but I still don't understand how to fix it and what's wrong.
Even more confusing is that the following code seems to work without a hitch:
import pandas as pd
import numpy as np
column_values = ['TIME', 'XPOS']
target = range(0,2)
mindex = pd.MultiIndex.from_product([column_values, target], names=['value', 'target'])
df = pd.DataFrame(columns=mindex, index=range(10), dtype=float)
print(df.loc[:,(slice(None),0)])
I just don't understand what's happening and what's wrong here.
You only need to sort the MultiIndex in the columns with sort_index:
df = df.sort_index(axis=1)
You can also check the docs section on sorting a MultiIndex.
Sample (columns are not lexsorted):
#your sample, only swap values in column_values
column_values = ['XPOS', 'TIME']
target = range(0,2)
mindex = pd.MultiIndex.from_product([column_values, target], names=['value', 'target'])
df = pd.DataFrame(columns=mindex, index=range(10), dtype=float)
print (df)
value XPOS TIME
target 0 1 0 1
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
print (df.columns.is_lexsorted())
False
df = df.sort_index(axis=1)
print (df.columns.is_lexsorted())
True
print(df.loc[:,(slice(None),0)])
value TIME XPOS
target 0 0
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
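A compact version of the same check, as a sketch: to my knowledge MultiIndex.is_lexsorted() was deprecated and later removed in newer pandas releases, so this variant uses is_monotonic_increasing as the sortedness test. Whether the unsorted slice raises may depend on the pandas version, so only the sorted case is shown.

```python
import pandas as pd

# Column levels deliberately given in non-sorted order, as in the sample above.
mindex = pd.MultiIndex.from_product(
    [["XPOS", "TIME"], range(2)], names=["value", "target"]
)
df = pd.DataFrame(columns=mindex, index=range(10), dtype=float)
print(df.columns.is_monotonic_increasing)  # False

# Sort the column MultiIndex, then slice on the second level.
df = df.sort_index(axis=1)
print(df.columns.is_monotonic_increasing)  # True

sliced = df.loc[:, (slice(None), 0)]
print(list(sliced.columns))  # [('TIME', 0), ('XPOS', 0)]
```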

Column operations in Pandas

Say I have a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
I would like to subtract the entries in column df.a from all other columns. In other words, I would like to get a dataframe that holds the following columns:
|col_b - col_a | col_c - col_a | col_d - col_a|
I have tried df - df.a but this yields something odd:
0 1 2 3 a b c d e
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
How can I do this type of column-wise operation in Pandas? Also, just wondering, what does df - df.a do?
You probably want
>>> df.sub(df.a, axis=0)
a b c d e
0 0 0.112285 0.267105 0.365407 -0.159907
1 0 0.380421 0.119536 0.356203 0.096637
2 0 -0.100310 -0.180927 0.112677 0.260202
3 0 0.653642 0.566408 0.086720 0.256536
df - df.a is basically trying to do the subtraction along the other axis: the index of df.a (0 to 3) is matched against the column labels of df (a to e). When using binary operators like subtraction, "mismatched indices will be unioned together" (as the docs say). Since the labels don't match, you wind up with the columns 0 1 2 3 a b c d e, all filled with NaN.
You could also get to the same destination more indirectly by transposing things: (df.T - df.a).T. Flipping df means that the default axis is now the right one.
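To get only the difference columns the question asks for (b - a, c - a, and so on), one possible sketch is to drop column a before subtracting. The names diff and same are illustrative, and the transpose variant is included only to show that both routes agree.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 5), columns=list("abcde"))

# Subtract column a from every other column, aligning on the row index.
diff = df.drop(columns="a").sub(df["a"], axis=0)

# The transpose route from the answer, with column a dropped afterwards.
same = (df.T - df["a"]).T.drop(columns="a")

print(diff)
print(np.allclose(diff.to_numpy(), same.to_numpy()))  # True
```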
