I'm new to Pandas so please bear with me; I have a dataframe A:
one two three
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
And a dataframe B, which represents relationships between columns in A:
one two # these get mutated in place
three 1 1
one 0 0
I need to use this to multiply column values in place by the values of other columns. The output should be:
one two three
0 9 45 9
1 20 60 10
2 33 77 11
3 48 96 12
So in this case I have made the adjustments for each row:
one *= three
two *= three
Is there an efficient way to use this with Pandas / Numpy?
Take a look here:
In [37]: df
Out[37]:
one two three
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
In [38]: df['one'] *= df['three']
In [39]: df['two'] *= df['three']
In [40]: df
Out[40]:
one two three
0 9 45 9
1 20 60 10
2 33 77 11
3 48 96 12
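If B really does encode arbitrary column relationships, the multiplications can be driven directly from B rather than hardcoded. A minimal sketch, assuming B's index names the source columns and its columns name the targets (a 1 meaning "multiply the target by the source"):
import pandas as pd

A = pd.DataFrame({'one': [1, 2, 3, 4],
                  'two': [5, 6, 7, 8],
                  'three': [9, 10, 11, 12]})
B = pd.DataFrame([[1, 1], [0, 0]],
                 index=['three', 'one'], columns=['one', 'two'])

for src, flags in B.iterrows():      # src column drives the multiplication
    for tgt, flag in flags.items():  # tgt columns get mutated in place
        if flag:
            A[tgt] *= A[src]
Note that the loop order matters if a mutated column later appears as a source, so this is only safe for the kind of one-way relationships shown in the question.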
How to split a single column containing 1000 rows into two columns of 500 rows each in pandas.
I have a csv file that contains a single column and I need to split this into multiple columns. Below is the format in csv.
Steps I took:
I had multiple csv files, each containing one column with 364 rows. I converted them into dataframes and concatenated them, but the result stacks the files into one long column instead of placing them side by side.
Code I tried
import pandas as pd

files = ['D0_monthly.csv', 'c1_monthly.csv', 'c2_monthly.csv', 'c2i_monthly.csv',
         'c3i_monthly.csv', 'c4i_monthly.csv', 'D1_monthly.csv', 'D2i_monthly.csv',
         'D3i_monthly.csv', 'D4i_monthly.csv', 'D2j_monthly.csv', 'D3j_monthly.csv',
         'D4j_monthly.csv', 'c2j_monthly.csv', 'c3j_monthly.csv', 'c4j_monthly.csv']
monthly_list = []
for file in files:
    # read the single data column, skipping each file's header row
    monthly_file = pd.read_csv(file, header=None, index_col=None, skiprows=[0])
    monthly_list.append(monthly_file)
monthly_all_file = pd.concat(monthly_list)
How the data is:
column1
1
2
3
.
.
364
1
2
3
.
.
364
I need to split the above column in the format shown below.
What the data should be:
column1  column2
1        1
2        2
3        3
4        4
5        5
.        .
.        .
.        .
364      364
Answer updated to work for an arbitrary number of columns
You could start from either the number of columns or the target column length; given the initial column length, each can be calculated from the other. This answer uses the desired target column length, tgt_row_len.
import numpy as np
import pandas as pd

nb_groups = 4
tgt_row_len = 5
df = pd.DataFrame({'column1': np.arange(1, tgt_row_len * nb_groups + 1)})
print(df)
column1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
...
17 18
18 19
19 20
Create groups in the index for the grouping operation that follows:
df.index = df.reset_index(drop=True).index // tgt_row_len
column1
0 1
0 2
0 3
0 4
0 5
1 6
1 7
...
3 17
3 18
3 19
3 20
dfn = (
    df.groupby(level=0)                                        # one group per chunk
      .apply(lambda x: x['column1'].reset_index(drop=True)).T  # chunks become columns
      .rename(columns=lambda x: 'col' + str(x + 1))            # col1, col2, ...
      .rename_axis(None)
)
print(dfn)
col1 col2 col3 col4
0 1 6 11 16
1 2 7 12 17
2 3 8 13 18
3 4 9 14 19
4 5 10 15 20
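An alternative sketch for the same result, assuming the column length is an exact multiple of tgt_row_len: reshape the underlying array so each consecutive chunk becomes its own column.
import numpy as np
import pandas as pd

nb_groups, tgt_row_len = 4, 5
df = pd.DataFrame({'column1': np.arange(1, tgt_row_len * nb_groups + 1)})

# each row of the reshaped array is one chunk; transposing turns chunks into columns
vals = df['column1'].to_numpy().reshape(-1, tgt_row_len).T
dfn = pd.DataFrame(vals, columns=[f'col{i + 1}' for i in range(vals.shape[1])])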
Previous answer that handles creating two columns
This answer just shows 10 target rows as an example. That can easily be changed to 364 or 500.
A dataframe where column1 contains 2 sets of 10 rows
tgt_row_len = 10
df = pd.DataFrame({'column1': np.tile(np.arange(1,tgt_row_len+1),2)})
print(df)
column1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 1
11 2
12 3
13 4
14 5
15 6
16 7
17 8
18 9
19 10
Move the bottom set of rows to column2
# shift pulls the second block of rows up alongside the first; slicing off the
# NaN tail and astype(int) undoes the float upcast that shift introduces
df.assign(column2=df['column1'].shift(-tgt_row_len)).iloc[:tgt_row_len].astype(int)
column1 column2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 9 9
9 10 10
I don't know if anyone has a more efficient solution, but using pd.merge on a temp column should solve your issue. Merging on a constant key performs a cross join, pairing every row of one frame with every row of the other. Here is a quick implementation of what you could write.
csv1['temp'] = 1
csv2['temp'] = 1
new_df = pd.merge(csv1, csv2, on=['temp'])
new_df = new_df.drop('temp', axis=1)  # drop returns a copy, so assign it back
I hope this helps!
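For what it's worth, if the goal is just to place equal-length columns side by side, pd.concat along axis=1 may be a simpler fit; a small sketch with hypothetical one-column frames standing in for the csv files:
import pandas as pd

csv1 = pd.DataFrame({0: [1, 2, 3]})
csv2 = pd.DataFrame({0: [1, 2, 3]})

# axis=1 lines the frames up side by side; resetting the index keeps
# the rows aligned by position rather than by label
side_by_side = pd.concat([c.reset_index(drop=True) for c in (csv1, csv2)],
                         axis=1, ignore_index=True)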
I tried to check other questions but didn't find what I needed.
I have a dataframe df:
a b
0 6 4
1 5 6
2 2 2
3 7 4
4 3 6
5 5 2
6 4 7
and a second dataframe df2
d
0 60
1 50
5 50
6 40
I want to replace the values in df['a'] with the values in df2['d'] - but only in the relevant indices.
Output:
a b
0 60 4
1 50 6
2 2 2
3 7 4
4 3 6
5 50 2
6 40 7
All the other questions I saw, like this one, refer to a single value, but I want to replace values based on an entire column.
I know I can iterate the rows one by one and replace the values, but I'm looking for a more efficient way.
Note: df2 does not have indices that are not in df, and every value of df2 should be written into df.
Simply use indexing:
df.loc[df2.index, 'a'] = df2['d']
output:
a b
0 60 4
1 50 6
2 2 2
3 7 4
4 3 6
5 50 2
6 40 7
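A related option is Series.update, which also aligns on the index and overwrites only the matching positions; a minimal sketch reproducing the frames above:
import pandas as pd

df = pd.DataFrame({'a': [6, 5, 2, 7, 3, 5, 4],
                   'b': [4, 6, 2, 4, 6, 2, 7]})
df2 = pd.DataFrame({'d': [60, 50, 50, 40]}, index=[0, 1, 5, 6])

s = df['a'].copy()
s.update(df2['d'])  # overwrite values wherever df2['d'] has a matching index
df['a'] = s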
I have the following pandas sample dataset:
Dim1 Dim2 Dim3 Dim4
0 1 2 7 15
1 1 10 12 2
2 9 19 18 16
3 4 2 4 15
4 8 1 9 5
5 14 18 3 14
6 19 9 9 17
I want to make a complex comparison based on all 4 columns and generate a column called Domination_count. For every row, I want to calculate how many other rows the given one dominates. Domination is defined as "being better in one dimension, while not being worse in the others". A is better than B if the value of A is less than B.
The final result should become:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
Some explanation behind the final numbers:
option 0 dominates options 2 and 6
option 1 dominates option 2
options 2, 5 and 6 dominate no other option
options 3 and 4 dominate options 2 and 6
I could not think of any code that allows me to compare multiple columns simultaneously. I found this approach which does not do the comparison simultaneously.
Improving on the answer:
My first answer worked only if there were no equal rows; equal rows incremented each other's domination count because they are not worse than one another.
This somewhat simpler solution takes care of that problem.
# create a dataframe with a duplicate row (row 7 repeats row 5)
df = pd.DataFrame(
    [[1, 2, 7, 15], [1, 10, 12, 2], [9, 19, 18, 16], [4, 2, 4, 15],
     [8, 1, 9, 5], [14, 18, 3, 14], [19, 9, 9, 17], [14, 18, 3, 14]],
    columns=['Dim1', 'Dim2', 'Dim3', 'Dim4']
)
df2 = df.copy()
def domination(row, df):
    # filter for all rows where none of the columns are worse
    df = df[(row <= df).all(axis=1)]
    # filter for rows where any column is strictly better
    df = df[(row < df).any(axis=1)]
    return len(df)
df['Domination_count'] = df.apply(domination, args=[df], axis = 1)
df
This correctly accounts for the criteria in the post and does not count the duplicate row in the domination count:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
7 14 18 3 14 0
My previous solution counts the equal rows:
df2['Domination_count'] = df2.apply(lambda x: (x <= df2).all(axis=1).sum() -1, axis=1)
df2
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 1
6 19 9 9 17 0
7 14 18 3 14 1
Original Solution
I like this as a solution. It takes each row of the dataframe and compares each element to all rows of the dataframe to see whether that element is less than or equal to the corresponding elements of the other rows (not worse than them). Then it counts the rows where all of the elements are not worse. This count includes the current row, which is never worse than itself, so we subtract 1.
df['Domination_count'] = df.apply(lambda x: (x <= df).all(axis=1).sum() -1, axis=1)
The result is:
Dim1 Dim2 Dim3 Dim4 Domination_count
0 1 2 7 15 2
1 1 10 12 2 1
2 9 19 18 16 0
3 4 2 4 15 2
4 8 1 9 5 2
5 14 18 3 14 0
6 19 9 9 17 0
In one line using list comprehension:
df['Domination_count'] = [(df.loc[df.index!=row] - df.loc[row].values.squeeze() > 0).all(axis = 1).sum() for row in df.index]
Subtract each row from all remaining rows elementwise, then count rows with all positive values (meaning that each corresponding value in the row we subtracted was lower) in the resulting dataframe.
I may have gotten your definition of domination wrong, so perhaps you'll need to change the strict-positivity check to whatever you need.
A simple iterative solution:
import numpy as np

df['Domination_count'] = 0          # initialize the count column to zero
cols = df.columns[:-1]              # all columns except Domination_count
for i in range(len(df.index)):      # compare every row i ...
    for j in range(len(df.index)):  # ... against every other row j
        # row i dominates row j if it is not worse in any column
        if np.all(df.loc[i, cols] <= df.loc[j, cols]) and i != j:
            df.loc[i, 'Domination_count'] += 1
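A vectorized alternative sketch using numpy broadcasting; it builds an n x n comparison, so it assumes the frame is small enough for that to fit in memory:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 7, 15], [1, 10, 12, 2], [9, 19, 18, 16],
                   [4, 2, 4, 15], [8, 1, 9, 5], [14, 18, 3, 14],
                   [19, 9, 9, 17]],
                  columns=['Dim1', 'Dim2', 'Dim3', 'Dim4'])

a = df.to_numpy()
# not_worse[i, j]: row i is <= row j in every dimension
not_worse = (a[:, None, :] <= a[None, :, :]).all(axis=2)
# strictly_better[i, j]: row i is < row j in at least one dimension
strictly_better = (a[:, None, :] < a[None, :, :]).any(axis=2)
# row i dominates row j when both hold; the diagonal drops out automatically
df['Domination_count'] = (not_worse & strictly_better).sum(axis=1)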
Below is my data frame:
data = pd.DataFrame(
    [['A', 1, 15, 100, 123], ['A', 2, 16, 50, 7], ['A', 3, 17, 100, 5],
     ['B', 1, 20, 75, 123], ['B', 2, 25, 125, 7], ['B', 3, 23, 100, 7],
     ['C', 1, 5, 85, 12], ['C', 2, 1, 25, 6], ['C', 3, 7, 100, 7]],
    columns=['Group', 'Ranking', 'Data1', 'Data2', 'Correspondence']
)
Group Ranking Data1 Data2 Correspondence
0 A 1 15 100 123
1 A 2 16 50 7
2 A 3 17 100 5
3 B 1 20 75 123
4 B 2 25 125 7
5 B 3 23 100 7
6 C 1 5 85 12
7 C 2 1 25 6
8 C 3 7 100 7
I have already sorted the data frame by 'Group'. However, I still need to sort the data within each Group: Data1 must be sorted from lowest to highest, and the values in Data2 must follow the movement of Data1. The Correspondence column must not be touched (it stays as in the original df), and the Ranking column stays as it is as well. I have used df.sort_values() but am unable to get the result below:
Group Ranking Data1 Data2 Correspondence
0 A 1 15 100 123
1 A 2 16 50 7
2 A 3 17 100 5
3 B 1 20 75 123
4 B 2 23 100 7
5 B 3 25 125 7
6 C 1 1 25 12
7 C 2 5 85 6
8 C 3 7 100 7
So basically my aim is: sort value in Data1 from lowest to highest within each Group, the value in Data2 will follow the movement of Data1 after sorting, while column Correspondence stays where it originally stands.
Thanks.
Use DataFrame.sort_values with both columns and assign the values back as a numpy array with .values:
cols = ['Data1','Data2']
data[cols] = data.sort_values(['Group','Data1'])[cols].values
#pandas 0.24+
#data[cols] = data.sort_values(['Group','Data1'])[cols].to_numpy()
print (data)
Group Ranking Data1 Data2 Correspondence
0 A 1 15 100 123
1 A 2 16 50 7
2 A 3 17 100 5
3 B 1 20 75 123
4 B 2 23 100 7
5 B 3 25 125 7
6 C 1 1 25 12
7 C 2 5 85 6
8 C 3 7 100 7
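The .values / .to_numpy() step is what makes this work: plain assignment would realign the sorted rows on their index and undo the sort. A short sketch of the difference, reusing data and cols from above:
# pandas realigns the right-hand side on the index, so this is a no-op
# with respect to row order:
data[cols] = data.sort_values(['Group', 'Data1'])[cols]

# a bare array has no index, so the assignment is purely positional:
data[cols] = data.sort_values(['Group', 'Data1'])[cols].to_numpy()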
Did you try it like this?
data2 = data.sort_values(by=['Group', 'Data1'], ascending=True).reset_index()
data2['Correspondence'] = data['Correspondence']
Have you tried the sort_values function? Based on the documentation you can do it like this:
data.sort_values(['Group', 'Data1'], ascending=[True, True])
You can try something like:
data.sort_values(['Group', 'Data1', 'Data2'], ascending=[True, True, False])
And if you want a column to be sorted descending, set that column's entry to False.
Use a left join (via .join) as follows:
df2 = df.sort_values(['Group', 'Data1']).reset_index()
df3 = df2[['Group', 'Ranking', 'Data1', 'Data2']].join(df[['Correspondence']])
df3
which gives the following result:
Group Ranking Data1 Data2 Correspondence
0 A 1 15 100 123
1 A 2 16 50 7
2 A 3 17 100 5
3 B 1 20 75 123
4 B 3 23 100 7
5 B 2 25 125 7
6 C 2 1 25 12
7 C 1 5 85 6
8 C 3 7 100 7
I have a dataset with only two columns. I would like to extract a small part out of it based on some condition on one column. Consider this as my dataset.
A B
1 10
1 9
2 11
3 12
3 11
4 9
Suppose I want to extract only those rows whose values in B are between 10 and 12, so I would get a new dataset:
A B
1 10
2 11
3 12
3 11
I tried using df.loc[df["B"] == range(10, 12)] but it does not work; can someone help me with this?
You can use .between
In [1031]: df.loc[df.B.between(10, 12)]
Out[1031]:
A B
0 1 10
2 2 11
3 3 12
4 3 11
Or, isin
In [1032]: df.loc[df.B.isin(range(10, 13))]
Out[1032]:
A B
0 1 10
2 2 11
3 3 12
4 3 11
Or, query
In [1033]: df.query('10 <= B <= 12')
Out[1033]:
A B
0 1 10
2 2 11
3 3 12
4 3 11
Or, good ol' boolean indexing
In [1034]: df.loc[(df.B >= 10) & (df.B <= 12)]
Out[1034]:
A B
0 1 10
2 2 11
3 3 12
4 3 11
Here's one more (not using .loc or .query) that looks most like the initial, unsuccessful attempt. Note that a Python range excludes its stop value, so range(10, 13) is needed to include 12; the original attempt failed because == compares the column against the range object itself rather than testing membership:
df[df.B.isin(range(10,13))]