I have a pandas column that contains a lot of strings that appear fewer than 5 times. I do not want to remove these values; however, I do want to replace them with a placeholder string called "pruned". What is the best way to do this?
df= pd.DataFrame(['a','a','b','c'],columns=["x"])
# get value counts and set 'pruned'; I want something that behaves like the following
df[df[count<2]] = "pruned"
I suspect there is a more efficient way to do this, but a simple way is to build a dict of counts and then prune the values whose counts fall below the threshold. Consider the example df:
df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
foo
0 12
1 11
2 4
3 15
4 6
5 12
6 4
7 7
# make a dict with counts
count_dict = {d:(df['foo']==d).sum() for d in df.foo.unique()}
# assign that dict to a column
df['bar'] = [count_dict[d] for d in df.foo]
# loc in the 'pruned' tag
df.loc[df.bar < 2, 'foo']='pruned'
Returns as desired:
foo bar
0 12 2
1 pruned 1
2 4 2
3 pruned 1
4 pruned 1
5 12 2
6 4 2
7 pruned 1
(and of course you would change 2 to 5 and dump that bar column if you want).
UPDATE
Per request for an in-place version, here is a one-liner that can do it without assigning another column or explicitly creating that dict (and thanks @TrigonaMinima for the value_counts() tip):
df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
print(df)
df.foo = df.foo.apply(lambda row: 'pruned' if (df.foo.value_counts() < 2)[row] else row)
print(df)
which returns again as desired:
foo
0 12
1 11
2 4
3 15
4 6
5 12
6 4
7 7
foo
0 12
1 pruned
2 4
3 pruned
4 pruned
5 12
6 4
7 pruned
This is the solution I ended up using based on the answer above.
import pandas as pd
df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
# make a dict with counts
count_dict = dict(df.foo.value_counts())
# assign that dict to a column
df['temp_count'] = [count_dict[d] for d in df.foo]
# loc in the 'pruned' tag
df.loc[df.temp_count < 2, 'foo']='pruned'
df = df.drop(["temp_count"], axis=1)
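For reference, here is a vectorized sketch of the same idea that skips the temporary column entirely, using value_counts, map and where (my own variation, not taken from the answers above):
import pandas as pd
df = pd.DataFrame([12, 11, 4, 15, 6, 12, 4, 7], columns=['foo'])
# map every value to its frequency, then replace the rare ones in a single pass
counts = df['foo'].map(df['foo'].value_counts())
df['foo'] = df['foo'].where(counts >= 2, other='pruned')
print(df)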
I currently have a pandas DataFrame df of size 168078 rows × 43 columns. A summary of df is shown below:
doi gender order year ... count
9384155 10.1103/PRL.102.039801 male 1 2009 ... 1
...
3679211 10.1103/PRD.69.024009 male 2 2004 ... 501
The df is currently sorted by count, which varies from 1 to 501.
I would like to split the df into 501 smaller sub-DataFrames by splitting it on count. In other words, at the end of the process I would have 501 different sub-DataFrames, one per distinct count value.
Since the number of resulting (desired) DataFrames is quite high, and since the data is quantitative, I was wondering if:
a) it is possible to split the DataFrame that many times (and if yes, how), and
b) it is possible to name each DataFrame quantitatively without manually assigning a name 501 times; for example, the df with count == 1 would be df.1 without having to assign it by hand.
The best approach is to create a dictionary of DataFrames. Below is an example:
df=pd.DataFrame({'A':[4,5,6,7,7,5,4,5,6,7],
'count':[1,2,3,4,5,6,7,8,9,10],
'C':['a','b','c','d','e','f','g','h','i','j']})
print(df)
A count C
0 4 1 a
1 5 2 b
2 6 3 c
3 7 4 d
4 7 5 e
5 5 6 f
6 4 7 g
7 5 8 h
8 6 9 i
9 7 10 j
Now we create the dictionary. As you can see, the key is the value of count in each row.
Keep in mind that Series.unique is used here so that, when two rows share the same count value, they end up in the same dictionary entry.
dfs={key:df[df['count']==key] for key in df['count'].unique()}
Below I show the content of the entire dictionary created and how to access it:
for key in dfs:
print(f'dfs[{key}]')
print(dfs[key])
print('-'*50)
dfs[1]
A count C
0 4 1 a
--------------------------------------------------
dfs[2]
A count C
1 5 2 b
--------------------------------------------------
dfs[3]
A count C
2 6 3 c
--------------------------------------------------
dfs[4]
A count C
3 7 4 d
--------------------------------------------------
dfs[5]
A count C
4 7 5 e
--------------------------------------------------
dfs[6]
A count C
5 5 6 f
--------------------------------------------------
dfs[7]
A count C
6 4 7 g
--------------------------------------------------
dfs[8]
A count C
7 5 8 h
--------------------------------------------------
dfs[9]
A count C
8 6 9 i
--------------------------------------------------
dfs[10]
A count C
9 7 10 j
--------------------------------------------------
You can just use groupby to get the result, like below. Here:
g.groups: gives the group name (group id) for each group
g.get_group: gives you the group with a given group name
import numpy as np
import pandas as pd
df=pd.DataFrame({'A':np.random.choice(["a","b","c", "d"], 10),
'count':np.random.choice(10,10)
})
g = df.groupby("count")
for key in g.groups:
print(g.get_group(key))
print("\n---------------")
Result
A count
3 c 0
---------------
A count
9 a 2
---------------
A count
0 c 3
2 b 3
---------------
A count
1 b 4
5 d 4
6 a 4
7 b 4
---------------
A count
8 c 5
---------------
A count
4 d 8
---------------
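If you prefer the dictionary-of-DataFrames form from the first answer, groupby can also build it directly; a small sketch on the same toy data (names are only for illustration):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.random.choice(["a", "b", "c", "d"], 10),
                   'count': np.random.choice(10, 10)})
# groupby yields (key, sub-DataFrame) pairs, which dict() accepts directly
dfs = dict(tuple(df.groupby('count')))
print(dfs.keys())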
How do I divide a column into 5 groups by the column's values, sorted, and add a column identifying the groups?
For example:
import pandas as pd
df = pd.DataFrame({'x1':[1,2,3,4,5,6,7,8,9,10]})
and I want to add a column giving each row's group.
You probably want to look at pd.cut, and set the argument bins to an integer of however many groups you want, and the labels argument to False (to return integer indicators of your groups instead of ranges):
df['add_col'] = pd.cut(df['x1'], bins=5, labels=False) + 1
>>> df
x1 add_col
0 1 1
1 2 1
2 3 2
3 4 2
4 5 3
5 6 3
6 7 4
7 8 4
8 9 5
9 10 5
Note that the + 1 is only there so that your groups are numbered 1 to 5, as in your desired output. Without it they will be numbered 0 to 4.
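As an aside, if the intent is five equal-sized groups based on the sorted values (quantiles) rather than five equal-width value ranges, pd.qcut is the counterpart to try; this is only a guess at the intent:
import pandas as pd
df = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
# qcut splits on quantiles, so each group holds (roughly) the same number of rows
df['add_col'] = pd.qcut(df['x1'], q=5, labels=False) + 1
print(df)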
Description
Long story short, I need a way to sort a DataFrame by a specific column, given a function analogous to the "key" parameter of Python's built-in sorted() function. Yet there is no such "key" parameter in pd.DataFrame.sort_values().
The approach used for now
I have to create a new column to store the "scores" of each row and delete it at the end. The problem with this approach is the need to generate a column name that does not already exist in the DataFrame, and it becomes more troublesome when sorting by multiple columns.
I wonder if there is a more suitable way for this purpose, with no need to come up with a new column name, just like using sorted() and specifying its "key" parameter.
Update: I changed my implementation to use a new object, instead of generating a new string not already among the columns, to avoid collisions, as shown in the code below.
Code
Here is the example code. In this sample the DataFrame needs to be sorted according to the length of the data in the "snippet" column. Please don't make additional assumptions about the type of the objects in each row of that column. The only things given are the column itself and a function object/lambda expression (in this example: len) that takes each object in the column as input and produces a value, which is used for comparison.
def sort_table_by_key(self, ascending=True, key=len):
"""
Sort the table inplace.
"""
# column_tmp = "".join(self._table.columns)
column_tmp = object() # Create a new object to avoid column name collision.
# Calculate the scores of the objects.
self._table[column_tmp] = self._table["snippet"].apply(key)
self._table.sort_values(by=column_tmp, ascending=ascending, inplace=True)
del self._table[column_tmp]
As of now this is not implemented; check GitHub issue 3942.
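Side note: newer pandas releases (1.1.0 and later, if I remember correctly) added a key argument to sort_values for exactly this case, tracked in that issue. A sketch assuming a recent pandas and a toy "snippet" column:
import pandas as pd
df = pd.DataFrame({'snippet': ['assdsd', 'sda', 'affd', 'db', 'cf']})
# key receives the whole Series, so it has to be a vectorized callable
df_sorted = df.sort_values(by='snippet', key=lambda s: s.str.len())
print(df_sorted)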
In the meantime, I think you need argsort and then selection with iloc:
df = pd.DataFrame({
'A': ['assdsd','sda','affd','asddsd','ffb','sdb','db','cf','d'],
'B': list(range(9))
})
print (df)
A B
0 assdsd 0
1 sda 1
2 affd 2
3 asddsd 3
4 ffb 4
5 sdb 5
6 db 6
7 cf 7
8 d 8
def sort_table_by_length(column, ascending=True):
if ascending:
return df.iloc[df[column].str.len().argsort()]
else:
return df.iloc[df[column].str.len().argsort()[::-1]]
print (sort_table_by_length('A'))
A B
8 d 8
6 db 6
7 cf 7
1 sda 1
4 ffb 4
5 sdb 5
2 affd 2
0 assdsd 0
3 asddsd 3
print (sort_table_by_length('A', False))
A B
3 asddsd 3
0 assdsd 0
2 affd 2
5 sdb 5
4 ffb 4
1 sda 1
7 cf 7
6 db 6
8 d 8
How it works:
First, get the lengths as a new Series:
print (df['A'].str.len())
0 6
1 3
2 4
3 6
4 3
5 3
6 2
7 2
8 1
Name: A, dtype: int64
Then get the indices of the sorted values with argsort; for descending order the result is simply reversed:
print (df['A'].str.len().argsort())
0 8
1 6
2 7
3 1
4 4
5 5
6 2
7 0
8 3
Name: A, dtype: int64
Last, change the ordering with iloc:
print (df.iloc[df['A'].str.len().argsort()])
A B
8 d 8
6 db 6
7 cf 7
1 sda 1
4 ffb 4
5 sdb 5
2 affd 2
0 assdsd 0
3 asddsd 3
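The question also mentions that sorting by multiple columns is troublesome with this pattern; one way to extend the argsort idea to several keys is numpy.lexsort. A sketch of my own, not part of the original answer, sorting primarily by the length of A and breaking ties with B:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['assdsd', 'sda', 'affd', 'db', 'cf'],
                   'B': [0, 1, 2, 6, 7]})
# np.lexsort sorts by the last key first, so the primary key goes last
order = np.lexsort((df['B'].values, df['A'].str.len().values))
print(df.iloc[order])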
I am currently using pandas and Python to handle many of the repetitive tasks I need done for my master's thesis. At this point, I have written some code (with help from Stack Overflow) that, based on some event dates in one file, finds a start and end date to use as a date range in another file. These dates are then located and appended to an empty list, which I can then output to Excel. However, using the code below I get a DataFrame with 5 columns and 400,000+ rows (which is basically what I want), but not laid out the way I want it in Excel. Below is my code:
end_date = pd.DataFrame(data=(df_sample['Date']-pd.DateOffset(days=2)))
start_date = pd.DataFrame(data=(df_sample['Date']-pd.offsets.BDay(n=252)))
merged_dates = pd.merge(end_date,start_date,left_index=True,right_index=True)
ff_factors = []
for index, row in merged_dates.iterrows():
time_range= (df['Date'] > row['Date_y']) & (df['Date'] <= row['Date_x'])
df_factor = df.loc[time_range]
ff_factors.append(df_factor)
appended_data = pd.concat(ff_factors, axis=0)
I need the data to be 5 columns and 250 rows (the columns are variable identifiers) side by side, so that when outputting it to Excel I have, for example, columns A-D and then 250 rows for each column. This then needs to be repeated for columns E-H and so on. Using iloc, I can locate the first 250 observations with appended_data.iloc[0:250], with both 5 columns and 250 rows, and then output them to Excel.
Is there any way for me to automate the process, so that after selecting the first 250 and outputting them to Excel, it selects the next 250 and outputs them next to the first 250, and so on?
I hope the above is precise and clear; otherwise I'm happy to elaborate!
EDIT:
The picture above illustrates what I get when outputting to Excel: 5 columns and 407,764 rows. What I need is to get this split up in the following way:
The second picture illustrates how I need the total sample to be split up. The first five columns and the corresponding 250 rows need to look like the second picture. When I do the next split using iloc[250:500], I get the next 250 rows, which need to be added after the initial five columns, and so on.
You can do this with a combination of np.reshape, which can be made to behave as desired on individual columns and should be much faster than looping through the rows, and pd.concat, which joins the resulting DataFrames back together:
def reshape_appended(df, target_rows, pad=4):
df = df.copy() # don't modify in-place
# below line adds strings, '0000',...,'0004' to the column names
# this ensures sorting the columns preserves the order
df.columns = [str(i).zfill(pad)+df.columns[i] for i in range(len(df.columns))]
#target number of new columns per column in df
target_cols = len(df.index)//target_rows
last_group = pd.DataFrame()
# below conditional fires if there will be leftover rows - % is mod
if len(df.index)%target_rows != 0:
last_group = df.iloc[-(len(df.index)%target_rows):].reset_index(drop=True)
df = df.iloc[:-(len(df.index)%target_rows)] # keep rows that divide nicely
#this is a large list comprehension, that I'll elaborate on below
groups = [pd.DataFrame(df[col].values.reshape((target_rows, target_cols),
order='F'),
columns=[str(i).zfill(pad)+col for i in range(target_cols)])
for col in df.columns]
if not last_group.empty: # if there are leftover rows, add them back
last_group.columns = [pad*'9'+col for col in last_group.columns]
groups.append(last_group)
out = pd.concat(groups, axis=1).sort_index(axis=1)
out.columns = out.columns.str[2*pad:] # remove the extra characters in the column names
return out
last_group takes care of any rows that don't divide evenly into sets of 250. The playing around with column names enforces proper sorting order.
df[col].values.reshape((target_rows, target_cols), order='F')
Reshapes the values in the column col of df into the shape specified by the tuple (target_rows, target_cols), with the ordering Fortran uses, indicated by F.
columns=[str(i).zfill(pad)+col for i in range(target_cols)]
just gives names to these columns, with an eye to establishing the proper ordering afterward.
Ex:
df = pd.DataFrame(np.random.randint(0, 10, (23, 3)), columns=list('abc'))
reshape_appended(df, 5)
Out[160]:
a b c a b c a b c a b c a b c
0 8 3 0 4 1 9 5 4 7 2 3 4 5.0 7.0 2.0
1 1 6 1 3 5 1 1 6 0 5 9 4 6.0 0.0 1.0
2 3 1 3 4 3 8 9 3 9 8 7 8 7.0 3.0 2.0
3 4 0 1 5 5 6 6 4 4 0 0 3 NaN NaN NaN
4 9 7 3 5 7 4 6 5 8 9 5 5 NaN NaN NaN
df
Out[161]:
a b c
0 8 3 0
1 1 6 1
2 3 1 3
3 4 0 1
4 9 7 3
5 4 1 9
6 3 5 1
7 4 3 8
8 5 5 6
9 5 7 4
10 5 4 7
11 1 6 0
12 9 3 9
13 6 4 4
14 6 5 8
15 2 3 4
16 5 9 4
17 8 7 8
18 0 0 3
19 9 5 5
20 5 7 2
21 6 0 1
22 7 3 2
My best guess at solving the problem would be to loop in steps of 250 until the counter passes the length of the data, along these lines:
i = 250  # right limit of the current slice
j = 0    # left limit of the current slice
chunks = []
while j < len(appended_data):
    # iloc clips the final slice automatically if fewer than 250 rows remain
    chunks.append(appended_data.iloc[j:i])
    j = i
    i += 250
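If the end goal is an Excel sheet with each 250-row block placed next to the previous one, the chunks can also be written directly with ExcelWriter and the startcol argument. This is only a sketch, assuming an Excel engine such as openpyxl is installed, with dummy data standing in for appended_data and a made-up file name:
import numpy as np
import pandas as pd
# stand-in for the concatenated data from the question
appended_data = pd.DataFrame(np.random.rand(1000, 5), columns=list('ABCDE'))
chunk_size = 250
n_cols = appended_data.shape[1]
with pd.ExcelWriter('blocks.xlsx') as writer:  # hypothetical output file
    for k, j in enumerate(range(0, len(appended_data), chunk_size)):
        chunk = appended_data.iloc[j:j + chunk_size].reset_index(drop=True)
        # shift each block n_cols columns to the right of the previous one
        chunk.to_excel(writer, sheet_name='Sheet1', startcol=k * n_cols, index=False)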
I have a data frame as below:
id_1 id_2 value
1 0 1
1 1 2
1 2 3
2 0 4
2 1 1
3 0 5
3 1 1
4 0 5
4 1 1
4 2 6
4 3 7
11 0 8
11 1 14
13 0 10
13 1 9
I would like to take out a random sample of size n, without replacement, from this table, based on id_1. Each sampled row needs to be unique with respect to the id_1 column, i.e. each id_1 can occur only once.
End result something like:
id_1 id_2 value
1 1 2
2 0 4
4 3 7
13 0 10
I have tried to do a group by and use the indices to take out a row through random.sample, but it doesn't go all the way.
Can someone give me a pointer on how to make this work? Code for DF below!
As always, thanks for time and input!
/swepab
df = pd.DataFrame({'id_1' : [1,1,1,2,2,3,3,4,4,4,4,11,11,13,13],
'id_2' : [0,1,2,0,1,0,1,0,1,2,3,0,1,0,1],
'value_col' : [1,2,3,4,1,5,1,5,1,6,7,8,14,10,9]})
You can do this with vectorized functions (no loops):
import numpy as np
uniqued = df.id_1.reindex(np.random.permutation(df.index)).drop_duplicates()
# .ix has been removed from modern pandas; .loc works here since these are index labels
df.loc[np.random.choice(uniqued.index, 1, replace=False)]
uniqued is created by a random shuffle followed by keeping one occurrence per unique id_1. Then a random sample (without replacement) is drawn from it.
This samples one random row per id:
for id in sorted(set(df["id_1"])):
print(df[df["id_1"] == id].sample(1))
PS:
The above solution translated into a Python list comprehension, returning a list of indices:
idx = [df[df["id_1"] == val].sample(1).index[0] for val in sorted(set(df["id_1"]))]
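For what it's worth, on pandas 1.1+ the whole thing can also be done with GroupBy.sample, picking one row per id_1 and then n of those rows; a sketch, with n chosen arbitrarily:
import pandas as pd
df = pd.DataFrame({'id_1': [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 11, 11, 13, 13],
                   'id_2': [0, 1, 2, 0, 1, 0, 1, 0, 1, 2, 3, 0, 1, 0, 1],
                   'value_col': [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8, 14, 10, 9]})
n = 4
# one random row per id_1 (GroupBy.sample needs pandas >= 1.1),
# then n of those rows without replacement
one_per_id = df.groupby('id_1').sample(n=1)
result = one_per_id.sample(n=n, replace=False)
print(result)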