I have the following dataframe, in which I would like to sort the columns by name.
1 | 13_1 | 13_10 | 13_2 | 2 | 3
9 | 31 | 2 | 1 | 3 | 4
I am trying to sort the columns in the following way:
1 | 2 | 3 | 13_1 | 13_2 | 13_10
9 | 3 | 4 | 31 | 1 | 2
I've been trying to solve this using df.sort_index(axis=1, inplace=True); however, the result turns out to be the same as my initial dataframe, i.e.:
1 | 13_1 | 13_10 | 13_2 | 2 | 3
9 | 31 | 2 | 1 | 3 | 4
It seems to treat 13_1 as 1.31 rather than 13.1 (the names are being compared as strings). Furthermore, I tried converting the column names from string to float, but that treats 13_1 and 13_10 both as 13.1, giving me duplicate column names.
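For reference, the behaviour can be reproduced directly: sort_index(axis=1) compares the labels lexicographically as strings, which for these particular names happens to match the original order (a minimal sketch using the column names above):

```python
import pandas as pd

# Minimal reproduction of the question's frame
df = pd.DataFrame([[9, 31, 2, 1, 3, 4]],
                  columns=['1', '13_1', '13_10', '13_2', '2', '3'])

# String comparison puts '13_10' before '13_2' (because '1' < '2' at
# position 3), so the "sorted" order equals the starting order here
print(list(df.sort_index(axis=1).columns))
# ['1', '13_1', '13_10', '13_2', '2', '3']
```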
natsort
from natsort import natsorted
df = df.reindex(natsorted(df.columns), axis=1)
# 1 2 3 13_1 13_2 13_10
#0 9 3 4 31 1 2
First of all, natsort from the other answers looks awesome; I'd totally use that.
In case you don't want to install a new package:
It seems you want to sort numerically, first by the number before the _ and then by the number after it as a tie-break. That means you want a tuple sort order, splitting each name into a tuple on _.
Try this:
df = df[sorted(df.columns, key=lambda x: tuple(map(int,x.split('_'))))]
Output:
1 2 3 13_1 13_2 13_10
9 3 4 31 1 2
Here is one way using natsorted
from natsort import natsorted
df = df.reindex(columns=natsorted(df.columns))
Out[337]:
1 2 3 13_1 13_2 13_10
0 9 3 4 31 1 2
Another way, using only pandas, no 3rd-party lib :-)
idx = (df.columns.to_series()
         .str.split('_', expand=True)
         .astype(float)
         .reset_index(drop=True)
         .sort_values([0, 1])
         .index)
df = df.iloc[:, idx]
Out[355]:
1 2 3 13_1 13_2 13_10
0 9 3 4 31 1 2
Related
I am trying to create a function that will take in CSV files and create dataframes and concatenate/sum like so:
id number_of_visits
0 3902932804358904910 2
1 5972629290368575970 1
2 5345473950081783242 1
3 4289865755939302179 1
4 36619425050724793929 19
+
id number_of_visits
0 3902932804358904910 5
1 5972629290368575970 10
2 5345473950081783242 3
3 4289865755939302179 20
4 36619425050724793929 13
=
id number_of_visits
0 3902932804358904910 7
1 5972629290368575970 11
2 5345473950081783242 4
3 4289865755939302179 21
4 36619425050724793929 32
My main issue is that in the for loop, after creating the dataframes, I tried to concatenate with df += new_df, but new_df wasn't being added. So I tried the following implementation.
def add_dfs(files):
    master = []
    big = pd.DataFrame({'id': 0, 'number_of_visits': 0}, index=[0])  # dummy df to initialize
    for k in range(len(files)):
        new_df = create_df(str(files[k]))  # helper method to read, create and clean dfs
        master.append(new_df)  # creates a list of dataframes in master
    for k in range(len(master)):
        # iterate through list of dfs and add them together
        big = pd.concat([big, master[k]]).groupby(['id', 'number_of_visits']).sum().reset_index()
    return big
Which gives me the following
id number_of_visits
1 1000036822946495682 2
2 1000036822946495682 4
3 1000044447054156512 1
4 1000044447054156512 9
5 1000131582129684623 1
So the number_of_visits for each id aren't actually being added together; they're just being sorted by number_of_visits. (Grouping on ['id', 'number_of_visits'] makes each distinct count its own group, so nothing gets summed.)
Pass your list of dataframes directly to pd.concat(), then group on id and sum.
>>> pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
id number_of_visits
0 36619425050724793929 32
1 3902932804358904910 7
2 4289865755939302179 21
3 5345473950081783242 4
4 5972629290368575970 11
def add_dfs(files):
    master = []
    for f in files:
        new_df = create_df(f)
        master.append(new_df)
    big = pd.concat(master).groupby('id').number_of_visits.sum().reset_index()
    return big
You can use
df1['number_of_visits'] += df2['number_of_visits']
this gives you:
| | id | number_of_visits |
|---:|---------------------:|-------------------:|
| 0 | 3902932804358904910 | 7 |
| 1 | 5972629290368575970 | 11 |
| 2 | 5345473950081783242 | 4 |
| 3 | 4289865755939302179 | 21 |
| 4 | 36619425050724793929 | 32 |
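Note that += aligns on the row index, not on id, so this assumes both frames list the same ids in the same order. A sketch with hypothetical stand-in data showing how aligning on id first makes the addition order-independent:

```python
import pandas as pd

# Hypothetical stand-ins for the question's two visit tables
df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'number_of_visits': [2, 1, 1]})
df2 = pd.DataFrame({'id': ['c', 'a', 'b'], 'number_of_visits': [3, 5, 10]})  # different row order

# Align on id (instead of row position) before adding
total = (df1.set_index('id')['number_of_visits']
            .add(df2.set_index('id')['number_of_visits'])
            .reset_index())
print(total['number_of_visits'].tolist())  # [7, 11, 4]
```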
I have a df which looks like this:
a b
apple | 7 | 2 |
google | 8 | 8 |
swatch | 6 | 6 |
merc | 7 | 8 |
other | 8 | 9 |
I want to select a given row by name, say "apple", and move it to a new location, say -1 (second-last row).
Desired output:
a b
google | 8 | 8 |
swatch | 6 | 6 |
merc | 7 | 8 |
apple | 7 | 2 |
other | 8 | 9 |
Are there any functions available to achieve this?
Use Index.difference to remove the value and numpy.insert to add it at the new position; then use DataFrame.reindex or DataFrame.loc to change the order of the rows:
a = 'apple'
idx = np.insert(df.index.difference([a], sort=False), -1, a)
print (idx)
Index(['google', 'swatch', 'merc', 'apple', 'other'], dtype='object')
df = df.reindex(idx)
#alternative
#df = df.loc[idx]
print (df)
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
This seems good; I am using pd.Index.insert() and pd.Index.drop_duplicates():
df.reindex(df.index.insert(-1,'apple').drop_duplicates(keep='last'))
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
I'm not aware of any built-in function, but one approach would be to manipulate the index only, then use the new index to re-order the DataFrame (assumes all index values are unique):
name = 'apple'
position = -1
new_index = [i for i in df.index if i != name]
new_index.insert(position, name)
df = df.loc[new_index]
Results:
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
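The index-manipulation approach above generalizes to a small helper (a hypothetical move_row, assuming unique index labels):

```python
import pandas as pd

def move_row(df, name, position):
    """Return df with the row labelled `name` moved to `position`.
    Hypothetical helper; assumes the index labels are unique."""
    idx = [i for i in df.index if i != name]
    idx.insert(position, name)
    return df.loc[idx]

df = pd.DataFrame({'a': [7, 8, 6, 7, 8], 'b': [2, 8, 6, 8, 9]},
                  index=['apple', 'google', 'swatch', 'merc', 'other'])
print(move_row(df, 'apple', -1).index.tolist())
# ['google', 'swatch', 'merc', 'apple', 'other']
```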
Currently I am working on clustering problem and I have a problem with copying the values from one dataframe to the original dataframe.
CustomerID | Date | Time | TotalSum | CohortMonth | CohortIndex
0 | 17850.0 | 2017-11-29 | 08:26:00 | 15.30 | 2017-11-01 | 1
1 | 17850.0 | 2017-11-29 | 08:26:00 | 20.34 | 2017-11-01 | 1
2 | 17850.0 | 2017-11-29 | 08:26:00 | 22.00 | 2017-11-01 | 1
3 | 17850.0 | 2017-11-29 | 08:26:00 | 20.34 | 2017-11-01 | 1
And the dataframe with values (clusters) to copy:
CustomerID | Cluster
12346.0 | 1
12346.0 | 1
12346.0 | 1
Please help me with the problem: How to copy values from the second df based on Customer ID criteria to the first dataframe.
I tried the code like this:
df.merge(ic,left_on='CustomerID',right_on='Cluster',how='left').drop('CustomerID',1).fillna('')
But it doesn't work and I get an error...
Besides it tried a version of such code as:
df, ic = [d.reset_index(drop=True) for d in (df, ic)]
ic.join(df[['CustomerID']])
But I get the same error, or an error like 'CustomerID' not in df...
Sorry if the question is unclear or badly formatted; it is my first question on Stack Overflow. Thank you all.
UPDATE
I have tried this
df1=df.merge(ic,left_on='CustomerID',right_on='Cluster',how='left')
if ic['CustomerID'].values != df1['CustomerID_x'].values:
    df1.Cluster = ic.Cluster
else:
    df1.Cluster = 'NaN'
But I've got different cluster for the same customer.
CustomerID_x | Date | Time | TotalSum | CohortMonth | CohortIndex | CustomerID_y | Cluster
0 | 17850.0 | 2017-11-29 | 08:26:00 | 15.30 | 2017-11-01 | 1 | NaN | 1.0
1 | 17850.0 | 2017-11-29 | 08:26:00 | 20.34 | 2017-11-01 | 1 | NaN | 0.0
2 | 17850.0 | 2017-11-29 | 08:26:00 | 22.00 | 2017-11-01 | 1 | NaN | 1.0
3 | 17850.0 | 2017-11-29 | 08:26:00 | 20.34 | 2017-11-01 | 1 | NaN | 2.0
4 | 17850.0 | 2017-11-29 | 08:26:00 | 20.34 | 2017-11-01 | 1 | NaN | 1.0
Given what you've written, I think you want:
>>> df1 = pd.DataFrame({"CustomerID": [17850.0] * 4, "CohortIndex": [1,1,1,1] })
>>> df1
CustomerID CohortIndex
0 17850.0 1
1 17850.0 1
2 17850.0 1
3 17850.0 1
>>> df2
CustomerID Cluster
0 12346.0 1
1 17850.0 1
2 12345.0 1
>>> pd.merge(df1, df2, 'left', 'CustomerID')
CustomerID CohortIndex Cluster
0 17850.0 1 1
1 17850.0 1 1
2 17850.0 1 1
3 17850.0 1 1
I cannot figure out how to get rid of rows containing a certain value while keeping that value's first occurrence.
I tried using drop_duplicates, but that gets rid of too much; I only want to drop the repeats of one specific value (within the same column).
Data is formatted like so:
Col_A | Col_B
5 | 1
5 | 2
1 | 3
5 | 4
1 | 5
5 | 6
I want it like (based on Col_A):
Col_A | Col_B
5 | 1
5 | 2
1 | 3
5 | 4
5 | 6
Use idxmax on a boolean mask (its idxmax is the first True, i.e. the first occurrence of the value) and check the index. This of course assumes your index is unique.
m = df.Col_A.eq(1) # replace 1 with your desired bad value
df.loc[~m | (df.index == m.idxmax())]
Col_A Col_B
0 5 1
1 5 2
2 1 3
3 5 4
5 5 6
Try this:
df1=df.copy()
mask=df['Col_A'] == 5
df1.loc[mask,'Col_A'] = df1.loc[mask,'Col_A']+range(len(df1.loc[mask,'Col_A']))
df1=df1.drop_duplicates(subset='Col_A',keep='first')
print(df.iloc[df1.index])
Output:
Col_A Col_B
0 5 1
1 5 2
2 1 3
3 5 4
5 5 6
Let's say I have these 2 pandas dataframes:
id | userid | type
1 | 20 | a
2 | 20 | a
3 | 20 | b
4 | 21 | a
5 | 21 | b
6 | 21 | a
7 | 21 | b
8 | 21 | b
I want to obtain the number of times 'b follows a' for each user, and obtain a new dataframe like this:
userid | b_follows_a
20 | 1
21 | 2
I know I can do this using for loop. However, I wonder if there is a more elegant solution to this.
You can use shift() to check if an a is followed by a b with a vectorized &, and then count the Trues with a sum:
df.groupby('userid').type.apply(lambda x: ((x == "a") & (x.shift(-1) == "b")).sum()).reset_index()
#userid type
#0 20 1
#1 21 2
Creative solution:
In [49]: df.groupby('userid')['type'].sum().str.count('ab').reset_index()
Out[49]:
userid type
0 20 1
1 21 2
Explanation:
In [50]: df.groupby('userid')['type'].sum()
Out[50]:
userid
20 aab
21 ababb
Name: type, dtype: object
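The string concatenation that .sum() performs here is implicit; an equivalent, more explicit spelling uses agg with ''.join (a sketch with the question's data):

```python
import pandas as pd

df = pd.DataFrame({'userid': [20, 20, 20, 21, 21, 21, 21, 21],
                   'type': list('aab') + list('ababb')})

# Join each user's sequence into one string, then count 'ab' pairs
out = df.groupby('userid')['type'].agg(''.join).str.count('ab').reset_index()
print(out['type'].tolist())  # [1, 2]
```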