I have simple code in Databricks:
import pandas as pd
data_frame = pd.read_csv('/dbfs/some_very_large_file.csv')
data_frame.isna().sum()
Out[41]:
A 0
B 0
C 0
D 0
E 0
..
T 0
V 0
X 0
Z 0
Y 0
Length: 287, dtype: int64
How can I see all column names (A to Y) along with their NA counts? I tried setting pd.set_option('display.max_rows', 287) and pd.set_option('display.max_columns', 287), but that doesn't seem to work here. Also, isna() and sum() don't have any arguments that would let me control the output, as far as I can tell.
By default pandas truncates long output: display.max_rows defaults to 60, and once a frame or Series exceeds it, only display.min_rows (10 by default) rows are shown, with the middle elided. Your 287-row Series is being truncated this way. To view the entire frame, change the display options.
To display all rows of the df:
pd.set_option('display.max_rows', None)
Ex:
>>> df
A B C
0 4 8 8
1 13 17 13
2 19 13 2
3 9 9 16
4 14 19 19
.. .. .. ..
7 7 2 2
8 5 7 2
9 18 12 17
10 10 5 11
11 5 3 18
[12 rows x 3 columns]
>>> pd.set_option('display.max_rows',None)
>>> df
A B C
0 4 8 8
1 13 17 13
2 19 13 2
3 9 9 16
4 14 19 19
5 3 17 12
6 9 13 17
7 7 2 2
8 5 7 2
9 18 12 17
10 10 5 11
11 5 3 18
Documentation:
pandas.set_option
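If you only need the full listing once, without changing the option globally, pd.option_context scopes the change to a single block. A sketch below, with a small hypothetical frame standing in for the 287-column CSV:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the large CSV: 3 columns, no missing values
df = pd.DataFrame(np.arange(600).reshape(200, 3), columns=list('ABC'))

# lift the row limit only for this print; the option reverts afterwards
with pd.option_context('display.max_rows', None):
    print(df.isna().sum())
```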
Related
I have rand_df1:
np.random.seed(1)
rand_df1 = pd.DataFrame(np.random.randint(0, 40, size=(3, 2)), columns=list('AB'))
print(rand_df1, '\n')
A B
0 37 12
1 8 9
2 11 5
Also, rand_df2:
rand_df2 = pd.DataFrame(np.random.randint(0, 40, size=(3, 2)), columns=list('AB'))
rand_df2 = rand_df2.loc[rand_df2.index.repeat(rand_df2['B'])]
print(rand_df2, '\n')
A B
1 16 1
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
I need to reassign the values in column 'A' of the first dataframe with the values in 'A' of the second dataframe, matched by index. Desired output of rand_df1:
A B
0 37 12
1 16 1
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
If I've interpreted your question correctly, you are looking to append new rows onto rand_df2. These rows are to be selected from rand_df1 where they have an index which does not appear in rand_df2. Is that correct?
This will do the trick (on pandas 2.0+, where DataFrame.append has been removed, use pd.concat instead):
rand_df2_new = rand_df2.append(rand_df1[~rand_df1.index.isin(rand_df2.index)]).sort_index()
Thanks to Henry Yik for his solution:
rand_df2.combine_first(rand_df1)
A B
0 37 12
1 16 1
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
I also tested this with an extra column present in one dataframe but missing from the other, and the reverse situation. It works well.
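Since DataFrame.append was deprecated and later removed from pandas, the same append-and-sort approach can be sketched with pd.concat on current versions. The frames below are rebuilt from the printed output above rather than from the random seed:

```python
import pandas as pd

# frames reconstructed from the printed output in the question
rand_df1 = pd.DataFrame({'A': [37, 8, 11], 'B': [12, 9, 5]})
rand_df2 = pd.DataFrame({'A': [16] + [12] * 7, 'B': [1] + [7] * 7},
                        index=[1] + [2] * 7)

# rows of rand_df1 whose index labels are absent from rand_df2
missing = rand_df1[~rand_df1.index.isin(rand_df2.index)]

# pd.concat replaces the removed DataFrame.append
result = pd.concat([rand_df2, missing]).sort_index()
print(result)
```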
A DataFrame with more than 10 rows is incorrectly sorted on Python 3.5.9 after converting to JSON and back to a pandas.DataFrame.
from pandas import DataFrame, read_json
columns = ['a', 'b', 'c']
data = [[1*i, 2*i, 3*i] for i in range(11)]
df = DataFrame(columns=columns, data=data)
print(df)
# a b c
# 0 0 0 0
# 1 1 2 3
# 2 2 4 6
# 3 3 6 9
# 4 4 8 12
# 5 5 10 15
# 6 6 12 18
# 7 7 14 21
# 8 8 16 24
# 9 9 18 27
# 10 10 20 30
new_df = read_json(df.to_json())
print(new_df)
# a b c
# 0 0 0 0
# 1 1 2 3
# 10 10 20 30 # this should be the last line
# 2 2 4 6
# 3 3 6 9
# 4 4 8 12
# 5 5 10 15
# 6 6 12 18
# 7 7 14 21
# 8 8 16 24
# 9 9 18 27
So the DataFrame created with read_json seems to sort the index like strings (1, 10, 2, 3, ...) instead of ints (1, 2, 3, ...).
Behaviour generated with Python 3.5.9 (default, Jan 4 2020, 04:09:01) (docker image python:3.5-stretch)
Everything seems to be working fine on my local machine (Python 3.8.1 (default, Dec 21 2019, 20:57:38)).
pandas==0.25.3 was used on both instances.
Is there a way to fix this without upgrading Python?
Use sort_values to sort the dataframe on column a, or sort_index to restore the original index order. Something like below:
new_df = read_json(df.to_json())
#sort column
print(new_df.sort_values('a'))
#sort index
print(new_df.sort_index())
#output
a b c
0 0 0 0
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
6 6 12 18
7 7 14 21
8 8 16 24
9 9 18 27
10 10 20 30
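If changing the serialization format is an option, a sketch using orient='split' sidesteps the problem entirely: the index is stored as a JSON array rather than as dict keys, so row order survives the round trip without any re-sorting:

```python
from io import StringIO

import pandas as pd

columns = ['a', 'b', 'c']
data = [[1 * i, 2 * i, 3 * i] for i in range(11)]
df = pd.DataFrame(columns=columns, data=data)

# orient='split' serializes index, columns and data as JSON arrays,
# so row order is preserved regardless of dict key ordering
json_str = df.to_json(orient='split')
new_df = pd.read_json(StringIO(json_str), orient='split')
print(new_df)
```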
The df looks like below:
A B C
1 8 23
2 8 22
3 8 45
4 9 45
5 6 12
6 8 10
7 11 12
8 9 67
I want to create a new df containing every row where 'B' is 8, together with the row that immediately follows each such occurrence.
New df:
A B C
1 8 23
2 8 22
3 8 45
4 9 45
6 8 10
7 11 12
Use boolean indexing, comparing against the shifted values, with | for bitwise OR:
df = df[df.B.shift().eq(8) | df.B.eq(8)]
print (df)
A B C
0 1 8 23
1 2 8 22
2 3 8 45
3 4 9 45
5 6 8 10
6 7 11 12
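A self-contained sketch, rebuilding the sample df and applying the same mask:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8],
                   'B': [8, 8, 8, 9, 6, 8, 11, 9],
                   'C': [23, 22, 45, 45, 12, 10, 12, 67]})

# keep rows where B is 8, or where the previous row's B was 8
mask = df.B.eq(8) | df.B.shift().eq(8)
result = df[mask]
print(result)
```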
I have a dataframe as follows :
df1=pd.DataFrame(np.arange(24).reshape(6,-1),columns=['a','b','c','d'])
and I want to take sets of 3 rows and lay them out side by side as columns, in the following order.
NumPy reshape doesn't give the intended answer:
pd.DataFrame(np.reshape(df1.values,(3,-1)),columns=['a','b','c','d','e','f','g','h'])
In [258]: df = pd.DataFrame(np.hstack(np.split(df1, 2)))
In [259]: df
Out[259]:
0 1 2 3 4 5 6 7
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
In [260]: import string
In [261]: df.columns = list(string.ascii_lowercase[:len(df.columns)])
In [262]: df
Out[262]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
Create a 3D array with reshape, then stack the blocks horizontally:
a = np.hstack(np.reshape(df1.values,(-1, 3, len(df1.columns))))
df = pd.DataFrame(a,columns=['a','b','c','d','e','f','g','h'])
print (df)
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
This uses the reshape/swapaxes/reshape idiom for rearranging sub-blocks of NumPy arrays.
In [26]: pd.DataFrame(df1.values.reshape(2,3,4).swapaxes(0,1).reshape(3,-1), columns=['a','b','c','d','e','f','g','h'])
Out[26]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
If you want a pure pandas solution:
df1.set_index([df1.index % 3, df1.index // 3])\
   .unstack()\
   .sort_index(level=1, axis=1)\
   .set_axis(list('abcdefgh'), axis=1)
Output:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
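All of the approaches above produce the same 3x8 frame; a quick sketch checking that the hstack/split and reshape/swapaxes variants agree:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(24).reshape(6, -1), columns=['a', 'b', 'c', 'd'])
cols = list('abcdefgh')

# split the 6x4 block into two 3x4 halves and place them side by side
out_hstack = pd.DataFrame(np.hstack(np.split(df1.values, 2)), columns=cols)

# same rearrangement via the reshape/swapaxes/reshape idiom
out_swap = pd.DataFrame(
    df1.values.reshape(2, 3, 4).swapaxes(0, 1).reshape(3, -1), columns=cols)

print(out_hstack.equals(out_swap))
```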
I want to create categorical variables from my data with this method:
cat.var condition
1 x > 10
2 x == 10
3 x < 10
I tried using the C() method from patsy, but it doesn't work. I know that in Stata I would use the code below, but after searching I didn't find any clean way to do this in Python:
generate mpg3 = .
(74 missing values generated)
replace mpg3 = 1 if (mpg <= 18)
(27 real changes made)
replace mpg3 = 2 if (mpg >= 19) & (mpg <=23)
(24 real changes made)
replace mpg3 = 3 if (mpg >= 24) & (mpg <.)
(23 real changes made)
You can do it this way (we will do it just for column a):
In [36]: df
Out[36]:
a b c
0 10 12 6
1 12 8 8
2 10 5 8
3 14 7 7
4 7 12 11
5 14 11 8
6 7 7 14
7 11 9 11
8 5 14 9
9 9 12 9
10 7 8 8
11 13 9 8
12 13 14 6
13 9 7 13
14 12 7 5
15 6 9 8
16 6 12 12
17 7 12 13
18 7 7 6
19 8 13 9
# assign with .loc; chained assignment (df.a[df.a < 10] = 3) is unreliable
df.loc[df.a < 10, 'a'] = 3
df.loc[df.a == 10, 'a'] = 2
df.loc[df.a > 10, 'a'] = 1
In [40]: df
Out[40]:
a b c
0 2 12 6
1 1 8 8
2 2 5 8
3 1 7 7
4 3 12 11
5 1 11 8
6 3 7 14
7 1 9 11
8 3 14 9
9 3 12 9
10 3 8 8
11 1 9 8
12 1 14 6
13 3 7 13
14 1 7 5
15 3 9 8
16 3 12 12
17 3 12 13
18 3 7 6
19 3 13 9
In [41]: df.a = df.a.astype('category')
In [42]: df.dtypes
Out[42]:
a category
b int32
c int32
dtype: object
I'm using this df as a sample.
>>> df
A
0 3
1 13
2 10
3 31
You could use .loc like this (.ix was deprecated and has been removed from pandas):
df['CAT'] = np.nan
df.loc[df.A > 10, 'CAT'] = 1
df.loc[df.A == 10, 'CAT'] = 2
df.loc[df.A < 10, 'CAT'] = 3
Or define a function to do the job, like this:
def do_the_job(x):
    ret = 3
    if x > 10:
        ret = 1
    elif x == 10:
        ret = 2
    return ret
and finally run this over the right Series in your df, like this:
>>> df['CAT'] = df.A.apply(do_the_job)
>>> df
A CAT
0 3 3
1 13 1
2 10 2
3 31 1
I hope this helps!
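As one more sketch (using the same sample df), np.select expresses the three-way mapping in a single vectorized call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3, 13, 10, 31]})

# np.select picks the first matching condition for each element
conditions = [df.A > 10, df.A == 10, df.A < 10]
choices = [1, 2, 3]
df['CAT'] = np.select(conditions, choices)
df['CAT'] = df['CAT'].astype('category')
print(df)
```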