I have a dataframe like this,
d = {'ID': ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"],
'value': [23, 23, 52, 52, 36, 36, 46, 46, 9, 9, 110, 110]}
df = pd.DataFrame(data=d)
ID value
0 A 23
1 A 23
2 B 52
3 B 52
4 C 36
5 C 36
6 D 46
7 D 46
8 E 9
9 E 9
10 F 110
11 F 110
Basically, I replicated the original data set (n rows). The dataframe I want looks like this:
ID value
0 A 23
1 B 23
2 B 52
3 C 52
4 C 36
5 D 36
6 D 46
7 E 46
8 E 9
9 F 9
Shift the value column down by one row while keeping the pairing, so I lose the first row of A, the last row of F, and the two 110 values. The result has 2n-2 rows.
I think you need:
df = df.set_index('ID').shift().iloc[1:-1].reset_index()
print (df)
ID value
0 A 23.0
1 B 23.0
2 B 52.0
3 C 52.0
4 C 36.0
5 D 36.0
6 D 46.0
7 E 46.0
8 E 9.0
9 F 9.0
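Note that shift() puts a NaN in the first position before it is sliced away, which upcasts value to float (hence 23.0 rather than 23). If you need integers back, an optional cast after the slice restores the dtype, since the remaining rows contain no NaN:
df = df.set_index('ID').shift().iloc[1:-1].reset_index()
df['value'] = df['value'].astype(int)  # safe: no NaN left after iloc[1:-1]
print(df)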
I think this will solve your problem
import pandas as pd
d = {'ID': ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E","F","F"],
'value': [23, 23, 52, 52, 36, 36, 46, 46, 9, 9, 110, 110]}
df = pd.DataFrame(data=d)
df['value'] = df['value'].shift(1)
df2 = df[1:11]  # here 11 is len(df) - 1; adjust it to the number of rows in your dataframe
print(df2)
So if you just want to shift ID by -1 and exclude the last 2 rows:
df['ID'] = df['ID'].shift(-1)
result = df[:-2]
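A nice side effect of shifting ID instead of value is that value never picks up a NaN, so it keeps its integer dtype without any extra casting.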
Assume I have the following two pandas DataFrames:
df1 = pd.DataFrame({"A": [1, 2, 3],
"B": ["a", "b", "c"],
"C": [7, 43, 15]})
df2 = pd.DataFrame({"A": [4, 5],
"B": ["c", "d"],
"C": [12, 19]})
Now, I want to iterate over the rows in df1, and if a certain condition is met for that row, add the row to df2.
For example:
for i, row in df1.iterrows():
    if row["C"] == 43:
        df2 = pd.concat([row, df2])
df2.head()
Should give me output:
A B C
4 c 12
5 d 19
2 b 43
But instead I get an output where the column names of the DataFrames appear in the rows:
0 A B C
A 2 NaN NaN NaN
B b NaN NaN NaN
C 43 NaN NaN NaN
0 NaN 4.0 c 12.0
1 NaN 5.0 d 19.0
How to solve this?
I think you just need concat with boolean indexing on df1.
pd.concat([df2, df1[df1['C'] == 43]], ignore_index=True)
The df1[df1['C'] == 43] part takes a slice of df1 where column C equals 43 and concatenates it to df2.
Output:
A B C
0 4 c 12
1 5 d 19
2 2 b 43
Change your code to this:
for i, row in df1.iterrows():
    if row["C"] == 43:
        df2.loc[len(df2.index)] = row
df2.head()
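For what it's worth, the original loop failed because row is a Series, so pd.concat stacks it as a single extra column whose index is the column names A/B/C. If you want to keep the concat-in-a-loop shape, one sketch (assuming appending at the end is fine) is to turn the row into a one-row frame first:
for i, row in df1.iterrows():
    if row["C"] == 43:
        # row.to_frame().T turns the Series back into a one-row DataFrame
        df2 = pd.concat([df2, row.to_frame().T], ignore_index=True)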
Use pd.concat()
pd.concat() takes a list of DataFrames and stacks them vertically when no axis is given.
In your case, pass df2 along with df1[df1["C"] == 43], which returns only the rows whose column C equals 43.
reset_index(drop=True) ensures the output doesn't end up with duplicate index values.
df2 = pd.concat([df2,df1[df1["C"] == 43]]).reset_index(drop=True)
print(df2)
A B C
0 4 c 12
1 5 d 19
2 2 b 43
I am new to pandas and I am facing an issue with replacing values. I am creating a function that replaces the values of the given columns of a data frame based on its parameters. The condition is that it should replace all values of each listed column with a single value, as shown below.
When I tried it, I got a 'length didn't match' error.
def replace(df, column, condition):
    for i in column:
        for j in condition:
            df[i] = j
    return df
column = ['A','C']
condition = 11,34
df
A B C
0 12 5 1
1 13 6 5
2 14 7 7
replace(df,column,condition)
My expected output:
A B C
0 11 5 34
1 11 6 34
2 11 7 34
Edit: I initially suggested using apply but then realized that is not necessary, since you are ignoring the existing values in the series. This is simpler and should serve your purposes.
Example:
import pandas as pd
data = [[12, 5, 1], [13, 6, 5], [14, 7, 7]]
df = pd.DataFrame(data, columns = ["A", "B", "C"])
def replace(df, columns, values):
    for one_column, one_value in zip(columns, values):
        df[one_column] = one_value
    return df
print(replace(df, ["A", "C"], [11, 34]))
Output:
A B C
0 11 5 34
1 11 6 34
2 11 7 34
Using key:value pairs, convert column and condition into a dict, then unpack and assign the values:
df.assign(**dict(zip(column, condition)))
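For example (note that assign returns a new DataFrame rather than modifying df in place, so reassign it if you want to keep the result):
column = ['A', 'C']
condition = (11, 34)
df = df.assign(**dict(zip(column, condition)))
print(df)
    A  B   C
0  11  5  34
1  11  6  34
2  11  7  34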
I have a sample table like this:
Dataframe: df
Col1 Col2 Col3 Col4
A 1 10 i
A 1 11 k
A 1 12 a
A 2 10 w
A 2 11 e
B 1 15 s
B 1 16 d
B 2 21 w
B 2 25 e
B 2 36 q
C 1 23 a
C 1 24 b
I'm trying to get all rows of the (Col1, Col2) group that has the smaller number of records within each Col1, while skipping Col1 values that have only one (Col1, Col2) group (in this example Col1 = 'C'). So the output would be as follows:
A 2 10 w
A 2 11 e
B 1 15 s
B 1 16 d
since group (A,2) has 2 records compared to group (A,1) which has 3 records.
I tried to approach this issue from different angles but just can't seem to get the result that I need. I am able to find the groups that I need using a combination of groupby, filter and agg, but how do I now use this as a selection filter on df? After spending a lot of time on this, I wasn't even sure that the approach was correct, as it looked overly complicated. I am sure there is an elegant solution but I just can't see it.
Any advice on how to approach this would be greatly appreciated.
I had this to get the groups for which I wanted the rows displayed:
groups = df.groupby(["Col1", "Col2"])["Col2"].agg({'no': 'count'})
filteredGroups = groups.groupby(level=0).filter(lambda group: group.size > 1)
print(filteredGroups.groupby(level=0).agg('idxmin'))
The second line was to account for groups that may have only one record, as I don't want to consider those. Honestly, I tried so many variations and approaches that none of them eventually gave me the result I wanted. I see that none of the answers are one-liners, so at least I don't feel like I was overthinking the problem.
# size of each (Col1, Col2) group, broadcast back to every row
df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")
# rank the group sizes within each Col1: 1 marks the smallest sub-group
df['rnk'] = df.groupby('Col1')['sz'].rank(method='min')
# reverse rank: 1 marks the largest sub-group (1 everywhere when Col1 has a single sub-group)
df['rnk_rev'] = df.groupby('Col1')['sz'].rank(method='min',ascending=False)
# keep the smallest sub-group, but drop Col1 values whose smallest group is also the largest
df.loc[ (df['rnk'] == 1.0) & (df['rnk_rev'] != 1.0) ]
Col1 Col2 Col3 Col4 sz rnk rnk_rev
3 A 2 10 w 2 1.0 4.0
4 A 2 11 e 2 1.0 4.0
5 B 1 15 s 2 1.0 4.0
6 B 1 16 d 2 1.0 4.0
Edit: changed "count" to "size" (as in #Marco Spinaci's answer) which doesn't matter in this example but might if there were missing values.
And for clarity, here's what the df looks like before dropping the selected rows.
Col1 Col2 Col3 Col4 sz rnk rnk_rev
0 A 1 10 i 3 3.0 1.0
1 A 1 11 k 3 3.0 1.0
2 A 1 12 a 3 3.0 1.0
3 A 2 10 w 2 1.0 4.0
4 A 2 11 e 2 1.0 4.0
5 B 1 15 s 2 1.0 4.0
6 B 1 16 d 2 1.0 4.0
7 B 2 21 w 3 3.0 1.0
8 B 2 25 e 3 3.0 1.0
9 B 2 36 q 3 3.0 1.0
10 C 1 23 a 2 1.0 1.0
11 C 1 24 b 2 1.0 1.0
Definitely not a nice answer, but it should work:
tmp = df.groupby(['Col1', 'Col2']).size()
df['occurrencies'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]][df.Col2[i]])
df['min_occurrencies'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]].min())
df[df.occurrencies == df.min_occurrencies]
But there must be a more clever way to use groupby than creating an auxiliary data frame...
The following is a solution based on the groupby.apply methodology. Simpler methods are available that create helper Series, as in JohnE's answer, which I would say is superior.
The solution works by grouping the dataframe at the Col1 level and then passing a function to apply that further groups the data by Col2. Each sub_group is then assessed to yield the smallest group. Note that ties in size will be determined by whichever is evaluated first. This may not be desirable.
#create data
import pandas as pd
df = pd.DataFrame({
"Col1" : ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
"Col2" : [1, 1, 1, 2, 2, 1, 1, 2, 2, 2],
"Col3" : [10, 11, 12, 10, 11, 15, 16, 21, 25, 36],
"Col4" : ["i", "k", "a", "w", "e", "s", "d", "w", "e", "q"]
})
Grouped = df.groupby("Col1")
def transFunc(x):
    smallest = [None, None]
    sub_groups = x.groupby("Col2")
    for group, data in sub_groups:
        if not smallest[1] or len(data) < smallest[1]:
            smallest[0] = group
            smallest[1] = len(data)
    return sub_groups.get_group(smallest[0])
Grouped.apply(transFunc).reset_index(drop = True)
Edit: to assign the result:
result = Grouped.apply(transFunc).reset_index(drop = True)
print(result)
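On the sample data built above this should print something like:
  Col1  Col2  Col3 Col4
0    A     2    10    w
1    A     2    11    e
2    B     1    15    s
3    B     1    16    d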
I would like to add a shorter yet readable version of JohnE's solution
df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")
rnk = df.groupby('Col1')['sz'].rank(method='min')
df[(rnk == 1) & (df.groupby('Col1')['sz'].rank(method='min', ascending=False) != 1)]
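On the sample data this again selects rows 3 to 6, i.e. the (A, 2) and (B, 1) groups, with the helper sz column still attached.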
I have the following DataFrame:
df1 = pd.DataFrame(columns=["DATE","QTY1", "QTY2", "ID", "CODE"])
df1["DATE"] = ["2000-05-01", "2000-05-03", "2001-01-15", "2001-02-20", "2001-02-22"]
df1["QTY1"] = [10, 11, 12,5,4]
df1["QTY2"] = [100, 101, 102,15,14]
df1["ID"] = [1,2,3,4,5]
df1["CODE"] = ["A", "B", "C", "D", "E"]
df2 = pd.DataFrame(columns=["DATE","QTY1", "QTY2", "ID", "CODE"])
df2["DATE"] = ["2000-05-02", "2000-05-04", "2001-01-12", "2001-03-28", "2001-08-21", "2005-07-01"]
df2["QTY1"] = [9, 101, 11,5.1,100, 10]
df2["QTY2"] = [99, 12, 1000,6,3, 1]
df2["ID"] = [1,2,3,8,5, 9]
df2["CODE"] = ["F", "G", "H", "I", "L", "M"]
df1:
DATE QTY1 QTY2 ID CODE
0 2000-05-01 10 100 1 A
1 2000-05-03 11 101 2 B
2 2001-01-15 12 102 3 C
3 2001-02-20 5 15 4 D
4 2001-02-22 4 14 5 E
df2
DATE QTY1 QTY2 ID CODE
0 2000-05-02 9.0 99 1 F
1 2000-05-04 101.0 12 2 G
2 2001-01-12 11.0 1000 3 H
3 2001-03-28 5.1 6 8 I
4 2001-08-21 100.0 3 5 L
5 2005-07-01 10 1 9 M
My goal is to create a signature with some tolerance for each row and match the rows of the two DFs whose signatures fall within those tolerances.
The signature for each row is structured as follows:
DATE (with tolerance of +/- 5 days)
Qty1 (with tolerance of 10%)
Qty2 (with tolerance of 10%)
ID (perfect match).
So for example the matching result for the above DFs will return the following rows (the first line from each DF) grouped by signature:
Signature1 2000-05-01 10 100 1 A
2000-05-02 9.0 99 1 F
All the other rows violate one or more of the tolerances.
Currently I am doing this with a classic for loop using iterrows() and checking all the fields, but for large DFs the performance is quite poor.
I was wondering if there is a more pandas-like approach that could help me to speed it up.
Thanks
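A minimal sketch of one vectorized alternative, assuming the tolerances are measured relative to df1's values and that the _1/_2 suffixes introduced here are acceptable: join the frames on the exact-match key (ID) first, then keep only the pairs that satisfy the date and quantity tolerances.
import pandas as pd

m = df1.merge(df2, on="ID", suffixes=("_1", "_2"))

date_ok = (pd.to_datetime(m["DATE_1"]) - pd.to_datetime(m["DATE_2"])).abs() <= pd.Timedelta(days=5)
qty1_ok = (m["QTY1_1"] - m["QTY1_2"]).abs() <= 0.10 * m["QTY1_1"].abs()
qty2_ok = (m["QTY2_1"] - m["QTY2_2"]).abs() <= 0.10 * m["QTY2_1"].abs()

matches = m[date_ok & qty1_ok & qty2_ok]  # one row per matching (df1, df2) pair
print(matches)
On the sample frames this keeps only the ID 1 pair (codes A and F), which is the expected match.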
I have two data frames, d1 and d2, both with the same categorical variables. However, the categories of a particular variable might be different.
For example, for variable v1 in data frame d1, we have the following categories or levels: "a", "b", "c", "d", "e", while for the same variable v1 in data frame d2 we have the levels: "a", "b", "c".
I then want to transform v1 in data frame d1 such that only the levels common with d2 remain and the rest are relabeled as "other", i.e., the levels of d1["v1"] should become: "a", "b", "c", "other", "other".
Both data frames have over 4 million observations and hence I am looking for a fast way to do this.
Example below:
d1 = pd.DataFrame({"id": range(1, 11), "v1": ["a", "b", "c", "d", "e", "a", "e", "d", "a", "d"]})
d2 = pd.DataFrame({"id": range(1, 11), "v1": ["a", "b", "c", "a", "c", "b", "c", "a", "b", "a"]})
d1
id v1
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 a
6 7 e
7 8 d
8 9 a
9 10 d
[10 rows x 2 columns]
d2
id v1
0 1 a
1 2 b
2 3 c
3 4 a
4 5 c
5 6 b
6 7 c
7 8 a
8 9 b
9 10 a
[10 rows x 2 columns]
After transformation, new d1 should look like:
d1
id v1
0 1 a
1 2 b
2 3 c
3 4 other
4 5 other
5 6 a
6 7 other
7 8 other
8 9 a
9 10 other
[10 rows x 2 columns]
How about:
d1.loc[~d1.v1.isin(d2.v1.unique()), 'v1'] = 'other'
Edit On reflection, an explanation would be good too. :)
d2.v1.unique() - select unique values in d2.v1
d1.v1.isin() - find those values in d1.v1
d1.loc[~..., 'v1'] - invert the mask, select the rows that match the condition, and change the v1 column on those rows
Edit 2 Sorry, my original answer changed both rows in d1 to other. Updated.
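If you prefer a non-mutating form (for example to build the column in one expression on a copy), an equivalent sketch using Series.where is:
d1["v1"] = d1["v1"].where(d1["v1"].isin(d2["v1"].unique()), "other")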