Pandas dataframe values reassignment by index - python

I have rand_df1:
np.random.seed(1)
rand_df1 = pd.DataFrame(np.random.randint(0, 40, size=(3, 2)), columns=list('AB'))
print(rand_df1, '\n')
A B
0 37 12
1 8 9
2 11 5
Also, rand_df2:
rand_df2 = pd.DataFrame(np.random.randint(0, 40, size=(3, 2)), columns=list('AB'))
rand_df2 = rand_df2.loc[rand_df2.index.repeat(rand_df2['B'])]
print(rand_df2, '\n')
A B
1 16 1
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
I need to reassign values in the first dataframe col 'A' with values in 'A' of the second dataframe by index. Desired output of rand_df1:
A B
0 37 12
1 16 1
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7

If I've interpreted your question correctly, you are looking to append new rows onto rand_df2. These rows are to be selected from rand_df1 where they have an index which does not appear in rand_df2. Is that correct?
This will do the trick (pd.concat is used here because DataFrame.append was removed in pandas 2.0):
rand_df2_new = pd.concat([rand_df2, rand_df1[~rand_df1.index.isin(rand_df2.index)]]).sort_index()
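A self-contained sketch of this approach follows. It assumes the two frames can be hard-coded from the printed output in the question (instead of being regenerated with np.random), and the name missing_rows is introduced only for illustration:

```python
import pandas as pd

# Frames hard-coded from the output shown in the question (an assumption,
# so the result does not depend on the random generator).
rand_df1 = pd.DataFrame({"A": [37, 8, 11], "B": [12, 9, 5]})
rand_df2 = pd.DataFrame(
    {"A": [16] + [12] * 7, "B": [1] + [7] * 7},
    index=[1] + [2] * 7,
)

# Rows of rand_df1 whose index label never occurs in rand_df2.
missing_rows = rand_df1[~rand_df1.index.isin(rand_df2.index)]

# Append those rows to rand_df2 and restore index order.
rand_df2_new = pd.concat([rand_df2, missing_rows]).sort_index()
print(rand_df2_new)
```

Only index label 0 is missing from rand_df2, so a single row is carried over from rand_df1.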

Thanks to Henry Yik for his solution:
rand_df2.combine_first(rand_df1)
A B
0 37 12
1 16 1
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
2 12 7
I also tested this with an extra column in one dataframe that doesn't appear in the other, and the reverse situation. It works well.

Related

Stacking columns

I have a df that looks like
L.1 L.2 G.1 G.2
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
This is just an arbitrary example, but the structure of my df is exactly the same: 4 column titles with numbers under them. I would like to stack my columns so that the result looks like
L G
1 9
2 10
3 11
4 12
5 13
6 14
7 15
8 16
If someone could help me in solving this, it would be great as I am having a really hard time doing this.
Use wide_to_long, then remove the resulting MultiIndex with DataFrame.reset_index and drop=True:
df = (pd.wide_to_long(df.reset_index(), stubnames=['L','G'], i='index', j='tmp', sep='.')
.reset_index(drop=True))
print (df)
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
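For readers who want to run the wide_to_long approach end to end, here is a minimal sketch. It assumes the sample frame from the question, and the name long_df is introduced only for illustration:

```python
import pandas as pd

# Sample frame matching the question's layout.
df = pd.DataFrame({
    "L.1": [1, 2, 3, 4],
    "L.2": [5, 6, 7, 8],
    "G.1": [9, 10, 11, 12],
    "G.2": [13, 14, 15, 16],
})

# reset_index() adds an 'index' column that serves as the row identifier (i);
# the numeric suffix after '.' ends up in the temporary 'tmp' level (j).
long_df = (
    pd.wide_to_long(df.reset_index(), stubnames=["L", "G"],
                    i="index", j="tmp", sep=".")
    .reset_index(drop=True)
)
print(long_df)
```

Each L value stays paired with the G value from the same original row and suffix.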
Or split the column names with str.split, reshape with DataFrame.stack, sort the MultiIndex with DataFrame.sort_index, and finally remove the MultiIndex:
df.columns = df.columns.str.split('.', expand=True)
df = df.stack().sort_index(level=[1,0]).reset_index(drop=True)
print (df)
G L
0 9 1
1 10 2
2 11 3
3 12 4
4 13 5
5 14 6
6 15 7
7 16 8
You can convert each column to a list, concatenate the lists, and create a new dataframe from the result:
import pandas as pd
df = pd.DataFrame({'L.1': [1, 2, 3, 4], 'L.2': [5, 6, 7, 8], 'G.1':[9, 10, 11, 12], 'G.2': [13, 14, 15, 16]})
new_df = pd.DataFrame({'L':df['L.1'].tolist()+df['L.2'].tolist(),
'G':df['G.1'].tolist()+df['G.2'].tolist()})
Printing new_df will give you:
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
The columns have a pattern: some start with L, others start with G. We can use pivot_longer from pyjanitor (available as a DataFrame method after import janitor) to abstract the process; simply pass a list of new column names, and a list of regular expressions to match the patterns:
df.pivot_longer(index = None,
names_to = ['L', 'G'],
names_pattern = ['^L', '^G'])
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16
Using pivot_longer, you can use the .value approach, along with a regular expression that contains groups - the grouped part is retained as a column header:
df.pivot_longer(index = None,
names_to = ".value",
names_pattern = r"(.).")
L G
0 1 9
1 2 10
2 3 11
3 4 12
4 5 13
5 6 14
6 7 15
7 8 16

Pandas df.isna().sum() not showing all column names

I have simple code in databricks:
import pandas as pd
data_frame = pd.read_csv('/dbfs/some_very_large_file.csv')
data_frame.isna().sum()
Out[41]:
A 0
B 0
C 0
D 0
E 0
..
T 0
V 0
X 0
Z 0
Y 0
Length: 287, dtype: int64
How can I see all column names (A to Y) along with their N/A counts? I tried setting pd.set_option('display.max_rows', 287) and pd.set_option('display.max_columns', 287), but this doesn't seem to work here. Also, the isna() and sum() methods do not have any arguments that would let me manipulate the output, as far as I can tell.
pandas truncates displayed output in the middle once an object has more rows than display.max_rows (60 by default); the truncated view then shows display.min_rows rows (10 by default). The result of isna().sum() is a Series with one row per column, so display.max_rows is the relevant option here, not display.max_columns. To view the entire object, you need to change the display options.
To display all rows of df:
pd.set_option('display.max_rows',None)
Ex:
>>> df
A B C
0 4 8 8
1 13 17 13
2 19 13 2
3 9 9 16
4 14 19 19
.. .. .. ..
7 7 2 2
8 5 7 2
9 18 12 17
10 10 5 11
11 5 3 18
[12 rows x 3 columns]
>>> pd.set_option('display.max_rows',None)
>>> df
A B C
0 4 8 8
1 13 17 13
2 19 13 2
3 9 9 16
4 14 19 19
5 3 17 12
6 9 13 17
7 7 2 2
8 5 7 2
9 18 12 17
10 10 5 11
11 5 3 18
Documentation:
pandas.set_option
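As an alternative to changing the option globally, the sketch below shows two ways to get the full listing for one call only. The 287-column frame and its col_N names are hypothetical stand-ins for the CSV in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the large CSV: 287 columns with no missing values.
data_frame = pd.DataFrame(
    np.zeros((3, 287)),
    columns=[f"col_{i}" for i in range(287)],
)
na_counts = data_frame.isna().sum()

# Option 1: lift the row limit only inside this block.
with pd.option_context("display.max_rows", None):
    full_repr = repr(na_counts)

# Option 2: to_string() renders the whole Series regardless of display options.
full_text = na_counts.to_string()
print(full_text)
```

pd.option_context restores the previous setting when the block exits, so other output in the notebook keeps its usual truncation.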

How to copy the current row and the next row value in a new dataframe using python?

The df looks like below:
A B C
1 8 23
2 8 22
3 8 45
4 9 45
5 6 12
6 8 10
7 11 12
8 9 67
I want to create a new df containing every row where 'B' is 8, plus the row that immediately follows each such row.
New df:
A B C
1 8 23
2 8 22
3 8 45
4 9 45
6 8 10
7 11 12
Use boolean indexing: compare both the column and its shifted values with 8, and chain the two masks with | for bitwise OR:
df = df[df.B.shift().eq(8) | df.B.eq(8)]
print (df)
A B C
0 1 8 23
1 2 8 22
2 3 8 45
3 4 9 45
5 6 8 10
6 7 11 12
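To make the two masks easier to follow, here is the same idea spelled out step by step on the question's data. The names is_eight and follows_eight are introduced only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [8, 8, 8, 9, 6, 8, 11, 9],
    "C": [23, 22, 45, 45, 12, 10, 12, 67],
})

# True where the row itself has B == 8.
is_eight = df["B"].eq(8)

# shift() moves every value down one row (the first row becomes NaN),
# so this is True where the *previous* row has B == 8.
follows_eight = df["B"].shift().eq(8)

result = df[is_eight | follows_eight]
print(result)
```

Rows with A equal to 5 and 8 are dropped because neither they nor the rows before them have B equal to 8.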

create categorical variables by condition in python with pandas or statsmodels

I want to create categorical variables from my data with this method:
cat.var condition
1 x > 10
2 x == 10
3 x < 10
I tried using the C() method from patsy, but it doesn't work. I know that in Stata I have to use the code below, but after searching I didn't find any clean way to do this in Python:
generate mpg3 = .
(74 missing values generated)
replace mpg3 = 1 if (mpg <= 18)
(27 real changes made)
replace mpg3 = 2 if (mpg >= 19) & (mpg <=23)
(24 real changes made)
replace mpg3 = 3 if (mpg >= 24) & (mpg <.)
(23 real changes made)
You can do it this way (we will do it just for column a):
In [36]: df
Out[36]:
a b c
0 10 12 6
1 12 8 8
2 10 5 8
3 14 7 7
4 7 12 11
5 14 11 8
6 7 7 14
7 11 9 11
8 5 14 9
9 9 12 9
10 7 8 8
11 13 9 8
12 13 14 6
13 9 7 13
14 12 7 5
15 6 9 8
16 6 12 12
17 7 12 13
18 7 7 6
19 8 13 9
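# use .loc rather than chained indexing (df.a[mask] = ...), which may assign to a copy
df.loc[df.a < 10, 'a'] = 3
df.loc[df.a == 10, 'a'] = 2
df.loc[df.a > 10, 'a'] = 1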
In [40]: df
Out[40]:
a b c
0 2 12 6
1 1 8 8
2 2 5 8
3 1 7 7
4 3 12 11
5 1 11 8
6 3 7 14
7 1 9 11
8 3 14 9
9 3 12 9
10 3 8 8
11 1 9 8
12 1 14 6
13 3 7 13
14 1 7 5
15 3 9 8
16 3 12 12
17 3 12 13
18 3 7 6
19 3 13 9
In [41]: df.a = df.a.astype('category')
In [42]: df.dtypes
Out[42]:
a category
b int32
c int32
dtype: object
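The three sequential assignments above only work because the replacement codes (1, 2, 3) never satisfy a later condition. A possible alternative that evaluates all conditions against the original values at once is numpy.select; the sketch below assumes a small sample column, and a_cat is a hypothetical column name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [10, 12, 10, 14, 7]})

# All conditions are evaluated on the original values of 'a'.
conditions = [df["a"] > 10, df["a"] == 10, df["a"] < 10]
codes = [1, 2, 3]

# np.select picks the code of the first matching condition for each row.
df["a_cat"] = np.select(conditions, codes)
df["a_cat"] = df["a_cat"].astype("category")
print(df)
```

Writing the result to a new column also keeps the original values available for later checks.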
I'm using this df as a sample.
>>> df
A
0 3
1 13
2 10
3 31
You could use .loc like this (the older .ix indexer has been removed from pandas):
df['CAT'] = np.nan
df.loc[df.A > 10, 'CAT'] = 1
df.loc[df.A == 10, 'CAT'] = 2
df.loc[df.A < 10, 'CAT'] = 3
Or define a function to do the job, like this:
def do_the_job(x):
    ret = 3
    if (x > 10):
        ret = 1
    elif (x == 10):
        ret = 2
    return ret
and finally run this over the right Series in your df, like this:
>>> df['CAT'] = df.A.apply(do_the_job)
>>> df
A CAT
0 3 3
1 13 1
2 10 2
3 31 1
I hope this helps!
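For the range-based Stata example in the question (mpg3), pandas.cut may be a closer match. This is only a sketch: the mpg values are made up, and it assumes integer-valued data, since the Stata conditions leave gaps such as 18.5 unassigned whereas the bins below do not:

```python
import numpy as np
import pandas as pd

# Made-up sample values, including one missing observation.
mpg = pd.Series([14, 18, 19, 23, 24, 30, np.nan])

# Bins are right-inclusive by default: (-inf, 18], (18, 23], (23, inf).
mpg3 = pd.cut(mpg, bins=[-np.inf, 18, 23, np.inf], labels=[1, 2, 3])
print(mpg3)
```

The result already has a categorical dtype, and missing mpg values stay missing, much like the `mpg < .` guard in the Stata code.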

Iterating through DataFrame row index in reverse order [duplicate]

This question already has answers here:
Right way to reverse a pandas DataFrame?
(6 answers)
Closed 2 years ago.
I know how to iterate through the rows of a pandas DataFrame:
for id, value in df.iterrows():
but now I'd like to go through the rows in reverse order (id is numeric, but doesn't coincide with the row number). At first I thought of sorting on the index with data.sort(ascending = False) and then running the same iteration procedure, but it didn't work (it still seems to go from smaller id to larger).
How can I accomplish this?
Iterating through a DataFrame is usually a bad idea, unless you use Cython. If you really have to, you can use the slice notation to reverse the DataFrame:
In [7]: import numpy as np
In [8]: import pandas as pd
In [9]: pd.DataFrame(np.arange(20).reshape(4,5))
Out[9]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
In [10]: pd.DataFrame(np.arange(20).reshape(4,5))[::-1]
Out[10]:
0 1 2 3 4
3 15 16 17 18 19
2 10 11 12 13 14
1 5 6 7 8 9
0 0 1 2 3 4
In [11]: for row in pd.DataFrame(np.arange(20).reshape(4,5))[::-1].iterrows():
...:     print(row)
...:
(3, 0 15
1 16
2 17
3 18
4 19
Name: 3)
(2, 0 10
1 11
2 12
3 13
4 14
Name: 2)
(1, 0 5
1 6
2 7
3 8
4 9
Name: 1)
(0, 0 0
1 1
2 2
3 3
4 4
Name: 0)
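Since the id in the question is numeric but does not coincide with the row number, "reverse order" can mean two different things. The sketch below contrasts them on a small made-up frame; the column name value and the two list names are hypothetical:

```python
import pandas as pd

# Made-up frame whose index labels are not in positional order.
df = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 30, 20])

# Reverse by position: last row first, whatever its label is.
reversed_positions = [idx for idx, row in df.iloc[::-1].iterrows()]

# Reverse by label: largest index value first.
descending_labels = [
    idx for idx, row in df.sort_index(ascending=False).iterrows()
]

print(reversed_positions)   # [20, 30, 10]
print(descending_labels)    # [30, 20, 10]
```

sort_index returns a new DataFrame and leaves the original untouched, so its result has to be iterated over or assigned back; calling a sort without using the returned object would explain an unchanged iteration order.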
