combine data in pandas - python

I have a pandas dataframe like this:
index integer_2_x integer_2_y
0 49348 NaN
1 26005 NaN
2 5 NaN
3 NaN 26
4 26129 NaN
5 129 NaN
6 NaN 26
7 NaN 17
8 60657 NaN
9 17031 NaN
I want to make a third column that looks like this, taking the numeric value from whichever of the first two columns is not NaN. How do I do this?
index integer_2_z
0 49348
1 26005
2 5
3 26
4 26129
5 129
6 26
7 17
8 60657
9 17031

One way is to use the update function.
import pandas as pd
import numpy as np
# some artificial data
# ========================
df = pd.DataFrame({'X':[10,20,np.nan,40,np.nan], 'Y':[np.nan,np.nan,30,np.nan,50]})
print(df)
X Y
0 10 NaN
1 20 NaN
2 NaN 30
3 40 NaN
4 NaN 50
# processing
# =======================
df['Z'] = df['X']
# for every missing value in column Z, replace it with value in column Y
df['Z'].update(df['Y'])
print(df)
X Y Z
0 10 NaN 10
1 20 NaN 20
2 NaN 30 30
3 40 NaN 40
4 NaN 50 50
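An equivalent one-liner is Series.combine_first, which fills the caller's NaNs from the other series (a sketch on the same artificial data):
# keep X where it is not NaN, fall back to Y otherwise
df['Z'] = df['X'].combine_first(df['Y'])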

I used Series.combine, documented at http://pandas.pydata.org/pandas-docs/stable/basics.html#general-dataframe-combine
import pandas as pd
import numpy as np
df = pd.read_csv("data", sep=r"\s+") # cut and pasted your data into a 'data' file
df["integer_2_z"] = df["integer_2_x"].combine(df["integer_2_y"], lambda x, y: np.where(pd.isnull(x), y, x))
Output
index integer_2_x integer_2_y integer_2_z
0 0 49348 NaN 49348
1 1 26005 NaN 26005
2 2 5 NaN 5
3 3 NaN 26 26
4 4 26129 NaN 26129
5 5 129 NaN 129
6 6 NaN 26 26
7 7 NaN 17 17
8 8 60657 NaN 60657
9 9 17031 NaN 17031
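The same result can also be had without combine by applying np.where to the whole columns at once (a sketch, assuming the frame read above):
# vectorized: pick y where x is NaN, otherwise keep x
df["integer_2_z"] = np.where(df["integer_2_x"].isnull(),
                             df["integer_2_y"],
                             df["integer_2_x"])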

Maybe you can simply use the fillna function.
import pandas as pd
import numpy as np

# Creating the DataFrame
df = pd.DataFrame({'integer_2_x': [49348, 26005, 5, np.nan, 26129, 129, np.nan, np.nan, 60657, 17031],
                   'integer_2_y': [np.nan, np.nan, np.nan, 26, np.nan, np.nan, 26, 17, np.nan, np.nan]})
# Using fillna to fill a new column
df['integer_2_z'] = df['integer_2_x'].fillna(df['integer_2_y'])
# Printing the result below; you can also drop the x and y columns if they are no longer required
print(df)
integer_2_x integer_2_y integer_2_z
0 49348 NaN 49348
1 26005 NaN 26005
2 5 NaN 5
3 NaN 26 26
4 26129 NaN 26129
5 129 NaN 129
6 NaN 26 26
7 NaN 17 17
8 60657 NaN 60657
9 17031 NaN 17031
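If the x and y columns are indeed no longer required, dropping them afterwards is a one-liner:
df = df.drop(columns=['integer_2_x', 'integer_2_y'])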

Related

Remove pandas row that is based on previous row

I have the following dataframe, in which the value should be increasing. Originally the dataframe has some unknown values.
index  value
0      1
1
2
3      2
4
5
6
7      4
8
9
10     3
11     3
12
13
14
15     5
Based on the assumption that the value should be increasing, I would like to remove the values at indices 10 and 11. This would be the desired dataframe:
index  value
0      1
1
2
3      2
4
5
6
7      4
8
9
12
13
14
15     5
Thank you very much
Assuming NaN in the empty cells (if not, temporarily replace them with NaN), use boolean indexing:
# if not NaNs uncomment below
# and use s in place of df['value'] afterwards
# s = pd.to_numeric(df['value'], errors='coerce')
# is the cell empty?
m1 = df['value'].isna()
# are the values strictly increasing?
m2 = df['value'].ge(df['value'].cummax())
out = df[m1|m2]
Output:
index value
0 0 1.0
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
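For reference, a self-contained version of this answer, rebuilding the question's frame with NaN in the empty cells (the values are taken from the question's table):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'index': range(16),
    'value': [1, np.nan, np.nan, 2, np.nan, np.nan, np.nan, 4,
              np.nan, np.nan, 3, 3, np.nan, np.nan, np.nan, 5],
})

m1 = df['value'].isna()                    # empty cells are kept
m2 = df['value'].ge(df['value'].cummax())  # values that keep the running max are kept
out = df[m1 | m2]                          # drops indices 10 and 11
print(out)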
Try this:
def del_df(df):
    df_no_na = df.dropna().reset_index(drop=True)
    num_tmp = df_no_na['value'][0]  # first value which is not NaN
    del_index_list = []  # indices to delete
    for row_index in range(1, len(df_no_na)):
        if df_no_na['value'][row_index] > num_tmp:  # increasing
            num_tmp = df_no_na['value'][row_index]  # to compare the following two values
        else:  # not increasing (same or decreasing)
            del_index_list.append(df_no_na['index'][row_index])  # index to delete
    df_goal = df.drop([df.index[i] for i in del_index_list])
    return df_goal
Output:
index value
0 0 1.0
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
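For completeness, the output above comes from calling the helper on the question's frame, e.g. print(del_df(df)). Note that it relies on the frame having an explicit 'index' column, as in the question's data.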

Insert rows from dataframeB to DataframeA with keys and without Merge

I have a dataframe with thousands of records:
ID to from Date price Type
1 69 18 2/2020 10 A
2 11 12 2/2020 5 A
3 18 10 3/2020 4 B
4 10 11 3/2020 10 A
5 12 69 3/2020 4 B
6 12 20 3/2020 3 B
7 69 21 3/2020 3 A
The output that I want is:
ID to from Date price Type ID to from Date price Type
1 69 18 2/2020 4 A 5 12 69 3/2020 4 B
1' 69 18 2/2020 6 A NaN NaN NaN NaN NaN NaN
2 11 12 2/2020 5 A NaN NaN NaN NaN NaN NaN
4 10 11 3/2020 4 A 3 18 10 3/2020 4 B
4' 10 11 3/2020 6 A NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN 6 12 20 3/2020 3 B
7 69 21 3/2020 3 A NaN NaN NaN NaN NaN NaN
The idea is to iterate over the rows: if the type is B, put the row next to the first record with type A whose to equals its from. If the prices are equal that's fine; if not, split the row with the higher price, and the new row's price is the difference left after subtraction.
I split the dataframe into types A and B, and I'm trying to iterate over both of them:
grp = df.groupby('Type')
transformed_df_list = []
for idx, frame in grp:
    frame.reset_index(drop=True, inplace=True)
    transformed_df_list.append(frame.copy())

A = pd.DataFrame(transformed_df_list[0])
B = pd.DataFrame(transformed_df_list[1])

for i, row in A.iterrows():
    for j, row1 in B.iterrows():
        if row['to'] == row1['from']:
            if row['price'] == row1['price']:
                row_df = pd.DataFrame([row1])

output = pd.merge(A, B, how='left', left_on=['to'], right_on=['from'])
The problem is that with the merge function I get several duplicate rows, and I can't check the price to split the row. Is there a way to insert the B rows into the A dataframe without the merge function?
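No answer is recorded here, but a rough sketch of the matching-and-splitting logic described above might look like the following. This is a hypothetical approach, assuming each B row matches at most one A row and that, as in the sample, the A row carries the higher price when they differ; the resulting pairs could then be laid side by side to build the wide output.
import pandas as pd

# assumes df is the question's frame with columns ID, to, from, Date, price, Type
A = df[df['Type'] == 'A'].copy()
B = df[df['Type'] == 'B'].copy()

pairs = []      # (A-side row or None, B-side row or None)
used_b = set()  # IDs of B rows already matched

for _, a in A.iterrows():
    candidates = B[(B['from'] == a['to']) & (~B['ID'].isin(used_b))]
    if candidates.empty:
        pairs.append((a, None))  # A row with no B partner
        continue
    b = candidates.iloc[0]
    used_b.add(b['ID'])
    if a['price'] == b['price']:
        pairs.append((a, b))     # prices agree, pair as-is
    else:
        matched = a.copy()
        matched['price'] = b['price']            # matched part takes the B price
        rest = a.copy()
        rest['price'] = a['price'] - b['price']  # leftover price goes on a split row
        pairs.append((matched, b))
        pairs.append((rest, None))

# B rows that never found an A partner
for _, b in B[~B['ID'].isin(used_b)].iterrows():
    pairs.append((None, b))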

pivot table in specific intervals pandas

I have a one column dataframe which looks like this:
Neive Bayes
0 8.322087e-07
1 3.213342e-24
2 4.474122e-28
3 2.230054e-16
4 3.957606e-29
5 9.999992e-01
6 3.254807e-13
7 8.836033e-18
8 1.222642e-09
9 6.825381e-03
10 5.275194e-07
11 2.224289e-06
12 2.259303e-09
13 2.014053e-09
14 1.755933e-05
15 1.889681e-04
16 9.929193e-01
17 4.599619e-05
18 6.944654e-01
19 5.377576e-05
I want to pivot it to wide format, but in specific intervals: the first 9 rows should make up the 9 columns of the first row, and the pattern should continue until the final table has 9 columns and 9 times fewer rows than now. How would I achieve this?
Using pivot_table:
df.pivot_table(columns=df.index % 9, index=df.index // 9, values='Neive Bayes')
0 1 2 3 4 \
0 8.322087e-07 3.213342e-24 4.474122e-28 2.230054e-16 3.957606e-29
1 6.825381e-03 5.275194e-07 2.224289e-06 2.259303e-09 2.014053e-09
2 6.944654e-01 5.377576e-05 NaN NaN NaN
5 6 7 8
0 0.999999 3.254807e-13 8.836033e-18 1.222642e-09
1 0.000018 1.889681e-04 9.929193e-01 4.599619e-05
2 NaN NaN NaN NaN
Construct a MultiIndex, then set_index and unstack:
iix = pd.MultiIndex.from_arrays([np.arange(df.shape[0]) // 9,
                                 np.arange(df.shape[0]) % 9])
df_wide = df.set_index(iix)['Neive Bayes'].unstack()
Output:
0 1 2 3 4 \
0 8.322087e-07 3.213342e-24 4.474122e-28 2.230054e-16 3.957606e-29
1 6.825381e-03 5.275194e-07 2.224289e-06 2.259303e-09 2.014053e-09
2 6.944654e-01 5.377576e-05 NaN NaN NaN
5 6 7 8
0 0.999999 3.254807e-13 8.836033e-18 1.222642e-09
1 0.000018 1.889681e-04 9.929193e-01 4.599619e-05
2 NaN NaN NaN NaN
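A plain numpy alternative is to pad the column to a multiple of 9 and reshape (a sketch, assuming the single-column frame from the question):
import numpy as np
import pandas as pd

vals = df['Neive Bayes'].to_numpy()
n = -(-len(vals) // 9) * 9        # round the length up to a multiple of 9
padded = np.full(n, np.nan)       # pad the tail with NaN
padded[:len(vals)] = vals
df_wide = pd.DataFrame(padded.reshape(-1, 9))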

Appending Pandas DataFrame in a loop

Let's say I have a df such as this:
df = pd.DataFrame({'A': [1,2,3,4,5], 'A_z': [2,3,4,5,6], 'B': [3,4,5,6,7], 'B_z': [4,5,6,7,8],
'C': [5,6,7,8,9], 'C_z': [6,7,8,9,10]})
Which looks like this:
A A_z B B_z C C_z
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
What I'm looking to do is create a new df and, for each letter (A, B, C), append that letter's two columns of data vertically, so that it looks like this:
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
As far as I'm concerned something like this should work fine:
for col in df.columns:
    if col[-1] != 'z':
        new_df = new_df.append(df[[col, col + '_z']])
However this results in the following mess:
A A_z B B_z C C_z
0 1.0 2.0 NaN NaN NaN NaN
1 2.0 3.0 NaN NaN NaN NaN
2 3.0 4.0 NaN NaN NaN NaN
3 4.0 5.0 NaN NaN NaN NaN
4 5.0 6.0 NaN NaN NaN NaN
0 NaN NaN 3.0 4.0 NaN NaN
1 NaN NaN 4.0 5.0 NaN NaN
2 NaN NaN 5.0 6.0 NaN NaN
3 NaN NaN 6.0 7.0 NaN NaN
4 NaN NaN 7.0 8.0 NaN NaN
0 NaN NaN NaN NaN 5.0 6.0
1 NaN NaN NaN NaN 6.0 7.0
2 NaN NaN NaN NaN 7.0 8.0
3 NaN NaN NaN NaN 8.0 9.0
4 NaN NaN NaN NaN 9.0 10.0
What am I doing wrong? Any help would be really appreciated, cheers.
EDIT:
After the kind help from jezrael, the renaming of the columns in his answer got me thinking about a possible way to do it using my original train of thought.
I can now also achieve the new df I want using the following:
for col in df:
    if col[-1] != 'z':
        d = df[[col, col + '_z']]
        d.columns = ['Letter', 'Letter_z']
        new_df = new_df.append(d)
The differing column names were clearly what was causing the problem, which is something I wasn't aware of at the time. Hope this helps someone.
One idea is to use Series.str.split with expand=True to build a MultiIndex, then use rename to avoid NaNs and produce the final column names, reshape with DataFrame.stack, sort into the correct order with DataFrame.sort_index, and finally remove the MultiIndex:
df.columns = df.columns.str.split('_', expand=True)
df = df.rename(columns=lambda x:'Letter_z' if x == 'z' else 'Letter', level=1)
df = df.stack(0).sort_index(level=[1,0]).reset_index(drop=True)
print (df)
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
Or, if possible, simplify the problem by reshaping all non-z values into one column and all z values into another with numpy.ravel:
m = df.columns.str.endswith('_z')
a = df.loc[:, ~m].to_numpy().T.ravel()
b = df.loc[:, m].to_numpy().T.ravel()
df = pd.DataFrame({'Letter': a,'Letter_z': b})
print (df)
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
You can use the function concat and a list comprehension:
cols = df.columns[~df.columns.str.endswith('_z')]
func = lambda x: 'letter_z' if x.endswith('_z') else 'letter'
pd.concat([df.filter(like=i).rename(func, axis=1) for i in cols])
or
cols = df.columns[~df.columns.str.endswith('_z')]
pd.concat([df.filter(like=i).set_axis(['letter', 'letter_z'], axis=1, inplace=False) for i in cols])
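If a fresh 0..n-1 row index is wanted, as in the desired output above, concat's ignore_index flag takes care of it:
out = pd.concat(
    [df.filter(like=i).set_axis(['letter', 'letter_z'], axis=1) for i in cols],
    ignore_index=True,  # renumber rows instead of repeating 0..4
)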

How to set a value of a pandas dataframe between two indices?

I would like to set a value in a pandas dataframe based on the values of another column. In a nutshell, for example, if I wanted to set the values in a column my_column of a pandas dataframe pd where another column, my_interesting_column, is between 10 and 30, I would like to do something like:
start_index=pd.find_closest_index_where_pd["my_interesting_column"].is_closest_to(10)
end_index=pd.find_closest_index_where_pd["my_interesting_column"].is_closest_to(30)
pd["my_column"].between(star_index, end_index)= some_value
As a simple illustration, suppose I have the following dataframe
df = pd.DataFrame(np.arange(10, 20), columns=list('A'))
df["B"]=np.nan
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 NaN
4 14 NaN
5 15 NaN
6 16 NaN
7 17 NaN
8 18 NaN
9 19 NaN
How can I do something like
df.where(df["A"].is_between(13,16))= 5
So that the end result looks like:
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 5
4 14 5
5 15 5
6 16 5
7 17 NaN
8 18 NaN
9 19 NaN
pd.loc[start_idx:end_idx, 'my_column'] = some_value
I think this is what you are looking for:
df.loc[(df['A'] >= 13) & (df['A'] <= 16), 'B'] = 5
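Series.between builds the same mask more compactly (both endpoints are inclusive by default):
df.loc[df['A'].between(13, 16), 'B'] = 5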
