Solved Below
Issue: I cannot sort with .groupby() because a single value in ColA is a string-type object. In the Data In below, the value 10 at Index 5 of ColA is a string, not an int. pd.to_numeric() sorts the column properly if I sort by that column alone.
Question: Can a single value in ColA be converted?
Method:
ind = pd.to_numeric(df['ColA'], errors='coerce').fillna(999).astype(int).argsort()
df = df.reindex(ind)
df = df.groupby(df.ColA).apply(pd.DataFrame.sort_values, 'ColB')
df = df.reset_index(drop=True)
Data in:
Index ColA ColB ColC
0 2 14-5 MumboJumbo
1 4 18-2 MumboJumbo2
2 2 24-5 MumboJumbo3
3 3 23-8 MumboJumbo4
4 2 13-6 MumboJumbo5
5 10 86-1 MumboJumbo6
6 10 42-1 MumboJumbo7
7 2 35-6 MumboJumbo8
8 Load NaN MumboJumbo9
Desired Output:
Index ColA ColB ColC
0 2 13-6 MumboJumbo5
1 2 14-5 MumboJumbo
2 2 24-5 MumboJumbo3
3 2 35-6 MumboJumbo8
4 3 23-8 MumboJumbo4
5 4 18-2 MumboJumbo2
6 10 42-1 MumboJumbo7
7 10 86-1 MumboJumbo6
8 Load NaN MumboJumbo9
Thanks!
I don't really understand the problem in the question, but you can select specific values in a DataFrame using iloc (positional index) or loc (label index). Since you are asking to replace the value at row 5 in the first column of your dataset, we use iloc.
df.iloc[from_row:to_row,column_position]
To convert the value '10' in ColA at row 5 to an int, you simply select it and then update it.
df.iloc[5:6,0] = 10
If you don't know the location of the value you need to convert, then iloc and loc are no help.
There are several ways to convert all values in a column to a specific dtype. One way would be using a lambda-function.
df[column_name].apply(lambda x: int(x))
The lambda above will break because your data also contains the string Load, which can't be converted to an int. One way to solve this is to add a condition to your lambda.
df[column_name].apply(lambda x: int(x) if something else something)
Given the data in your question the most straightforward way would be to check if x is not 'Load':
df[column_name].apply(lambda x: int(x) if x != 'Load' else x)
This becomes a hassle if you have loads of actual strings in your column. If you want to use a lambda, you could make a list of the actual strings and then check whether x is in the list.
list_of_strings = ['Load', 'Road', 'Toad']
df[column_name].apply(lambda x: int(x) if x not in list_of_strings else x)
Another way would be to write a separate function to handle the conversion using try/except blocks.
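A minimal sketch of such a helper, assuming the data from the question (the function name convert_to_int is only illustrative, not part of the original answer):
def convert_to_int(x):
    # try to convert; keep the original value (e.g. 'Load') if it can't be parsed
    try:
        return int(x)
    except (ValueError, TypeError):
        return x

df['ColA'] = df['ColA'].apply(convert_to_int)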
Related
I have a dataframe like this:
Seq Value
20-35-ABCDE 14268142.986651151
21-33-ABEDFD 4204281194.109206
61-72-ASDASD 172970.7123134008912
61-76-ASLDKAS 869238.232460215262
63-72-ASDASD string1
63-76-OIASD 20823821.49471747433
64-76-ASDAS(s)D string1
65-72-AS*AS 8762472.99003354316
65-76-ASYAD*S* 32512348.3285536161
66-76-A(AD(AD)) 3843230.72933184169
I want to rank the rows based on the Value, highest to lowest, and return the top 50% of the rows (where the row number could change over time).
I wrote this to do the ranking:
import sys
import pandas as pd

df = pd.read_csv(sys.argv[1], sep='\t')
df.columns = ['Seq', 'Value']
# can't do this because there are some strings
# pd.to_numeric(df['Value'])
df2 = df.sort_values(['Value'], ascending=True).head(10)
print(df2)
The output is like this:
Seq Value
17210 ASK1 0.0
15061 ASD**ASDHA 0.0
41110 ASD(£)DA 1.4355078174305618
50638 EILMH 1000.7985554926368
62019 VSEFMTRLF 10000.89805735126
41473 LEDSAGES 10002.182707004016
41473 LEDSASDES 10000886.012834921
So I guess it sorted them as strings instead of floats. I'm struggling to work out how to sort by float, because some of the entries in that column say string1. I want to sort by all the floats and push all the string1 entries to the end of the list, and then I want to return the Seq values in the top 50% of the sorted rows.
Can someone help me with this, even just the sorting part?
The problem is that your column is storing the values as strings, so they will sort according to string sorting, not numeric sorting. You can sort numerically using the key of DataFrame.sort_values, which also allows you to preserve the string values in that column.
Another option would be to turn that column into a numeric column before the sort, but then non-numeric values must be replaced with NaN.
Sample data
import pandas as pd
df = pd.DataFrame({'Seq': [1, 2, 3, 4, 5],
                   'Value': ['11', '2', '1.411', 'string1', '91']})
# String sorting
df.sort_values('Value')
#Seq Value
#2 3 1.411
#0 1 11
#1 2 2
#4 5 91
#3 4 string1
Code
# Numeric sorting
df.sort_values('Value', key=lambda x: pd.to_numeric(x, errors='coerce'))
Seq Value
2 3 1.411
1 2 2
0 1 11
4 5 91
3 4 string1
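The other option mentioned above, converting the column itself to numeric before sorting, could look like this minimal sketch on the same sample data; note it overwrites the string values with NaN:
# Convert in place: non-numeric entries become NaN and sort to the end
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
df.sort_values('Value', na_position='last')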
Honestly, I think it would make more sense to use one column for the float values and one for the strings. That said, you can convert to numeric only for the sorting using the key parameter of sort_values. The NaN/strings will be pushed to the end.
df.sort_values(by='Value', key=lambda x: pd.to_numeric(x, errors='coerce'))
output:
Seq Value
2 61-72-ASDASD 172970.7123134008912
3 61-76-ASLDKAS 869238.232460215262
9 66-76-A(AD(AD)) 3843230.72933184169
7 65-72-AS*AS 8762472.99003354316
0 20-35-ABCDE 14268142.986651151
5 63-76-OIASD 20823821.49471747433
8 65-76-ASYAD*S* 32512348.3285536161
1 21-33-ABEDFD 4204281194.109206
4 63-72-ASDASD string1
6 64-76-ASDAS(s)D string1
Alternative: splitting the floats and strings apart:
s = df['Value']
(df.assign(Value=pd.to_numeric(s, errors='coerce'),
           Strings=lambda d: s.where(d['Value'].isna()))
   .sort_values(by=['Value', 'Strings'])
)
output:
Seq Value Strings
2 61-72-ASDASD 1.729707e+05 NaN
3 61-76-ASLDKAS 8.692382e+05 NaN
9 66-76-A(AD(AD)) 3.843231e+06 NaN
7 65-72-AS*AS 8.762473e+06 NaN
0 20-35-ABCDE 1.426814e+07 NaN
5 63-76-OIASD 2.082382e+07 NaN
8 65-76-ASYAD*S* 3.251235e+07 NaN
1 21-33-ABEDFD 4.204281e+09 NaN
4 63-72-ASDASD NaN string1
6 64-76-ASDAS(s)D NaN string1
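To then keep only the top 50% of rows by Value, as the question asks, one possible sketch (not part of the original answer): sort highest to lowest; the string1 rows become NaN for the sort, land at the end, and drop out of the top half.
ranked = df.sort_values(by='Value', key=lambda x: pd.to_numeric(x, errors='coerce'), ascending=False)
top_half = ranked.head(len(ranked) // 2)  # first half of the sorted rows
print(top_half['Seq'])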
I would like to convert one column of data to multiple columns in dataframe based on certain values/conditions.
Please find the code to generate the input dataframe
df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male',
                                 '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})
The data is a single VARIABLE column, as generated by the code above.
Please note that I may not know the column names in advance. But it usually follows this format. What I have shown above is a sample data and real data might have around 600-700 columns and data arranged in this fashion
What I would like to do is convert values which start with non-digits(characters) as new columns in dataframe. It can be a new dataframe.
I attempted to write a for loop but it failed with an error. Can you please help me achieve this outcome?
for i in range(3, len(df1)):
    #str(df1['VARIABLE'][i].contains('^\d'))
    if (df1['VARIABLE'][i].astype(str).contains('^\d') == True):
Through the above loop, I was trying to check whether first char is a digit, if yes, then retain it as a value (ex: 1,2,3 etc) and if it's a character (ex:gender, ethnicity etc), then create a new column. But guess this is an incorrect and lengthy approach
For example, in the above example, the columns would be studyid,age_interview,Gender,Ethnicity.
The final output would have those as column names, with the corresponding values listed under each.
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list)
                          .values.tolist())
            .set_index(0).T)
print(new_df.rename_axis(None, axis=1))
studyid age_interview Gender Ethnicity
1 1 65 1.Male 1.Chinese
2 None None 2.Female 2.Indian
3 None None None 3.Malay
Explanation: m is a helper series which helps separate the groups:
print(m.cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
Then we group this helper series and apply list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1 [studyid, 1]
2 [age_interview, 65]
3 [Gender, 1.Male, 2.Female]
4 [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list with the column name as the first entry.
So we create a dataframe with this and set the first column as index and transpose to get our desired output.
Use itertools.groupby and then construct pd.DataFrame:
import pandas as pd
import itertools
l = ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male',
     '2.Female',
     'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']
l = list(map(str, l))
# group consecutive items by whether they start with a digit
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x: x[0].isnumeric())]
# pair each name group with the value group that follows it; k[0] is the column name
d = {k[0]: v for k, v in zip(grouped[::2], grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
Gender studyid age_interview Ethnicity
0 1.Male 1 65 1.Chinese
1 2.Female None None 2.Indian
2 None None None 3.Malay
When I use the following code:
print(self.df.groupby(by=[2])[3].agg(['sum']))
On the following Dataframe:
0 1 2 3 4 5 6 7
0 15 LCU Test 1 308.02 170703 ALCU 4868 MS10
1 16 LCU Test 2 127.37 170703 ALCU 4868 MS10
The sum is not computed correctly: the value column (col 3) returns a concatenated string of the values (308.02127.37) instead of adding the individual numeric values.
It seems like your 3rd column holds strings. Did you load your dataframe using dtype=str?
Furthermore, try not to hardcode your columns. You can use .astype or pd.to_numeric to cast and then apply sum:
self.df.groupby(self.df.columns[2])[self.df.columns[3]].agg(
    lambda x: pd.to_numeric(x, errors='coerce').sum()
)
Or
self.df.groupby(self.df.columns[2])[self.df.columns[3]].agg(
    lambda x: x.astype(float).sum()
)
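Alternatively, if you can afford to cast the column up front, a sketch (not part of the original answer, assuming column 3 should be fully numeric) would be to convert once and then aggregate normally:
self.df[self.df.columns[3]] = pd.to_numeric(self.df[self.df.columns[3]], errors='coerce')
print(self.df.groupby(self.df.columns[2])[self.df.columns[3]].agg(['sum']))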
I have a dataframe generated by:
import numpy as np
import pandas as pd

df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3], [101, ' ', 4]])
It looks like
0 1 2
0 100 tes t 3
1 100 NaN 2
2 101 test1 3
3 101 4
I would like to fill column 1 "forward" with test and test1. I believe one approach would be to replace the whitespace-only entries with np.nan, but that is tricky since the words themselves contain whitespace as well. I could also groupby column 0 and then use the first element of each group to fill forward. Could you provide me with some code for both alternatives? I can't get it coded.
Additionally, I would like to add a column that contains the group means, that is,
the final dataframe should look like this:
0 1 2 3
0 100 tes t 3 2.5
1 100 tes t 2 2.5
2 101 test1 3 3.5
3 101 test1 4 3.5
Could you also please advise how to accomplish something like this?
Many thanks; please let me know in case you need further information.
IIUC, you could use str.strip and then check if the stripped string is empty.
Then, perform groupby operations: fill the NaNs forward with ffill and calculate the means using the groupby.transform function, as shown:
# strip whitespace; real NaNs are dropped before apply and restored by index alignment on assignment
df[1] = df[1].str.strip().dropna().apply(lambda x: np.nan if len(x) == 0 else x)
# forward fill the NaNs within each group of column 0
df[1] = df.groupby(0)[1].ffill()
# group means of column 2
df[3] = df.groupby(0)[2].transform(lambda x: x.mean())
df
Note: If you must instead fill NaN values with the first element of that group, then you must do this:
df.groupby(0)[1].apply(lambda x: x.fillna(x.iloc[0]))
Breakup of steps:
Since we want to apply the function only to strings, we drop all the NaN values first; otherwise we would get a TypeError, because the column contains both floats and strings and float has no len().
df[1].str.strip().dropna()
0 tes t # operates only on indices where strings are present(empty strings included)
2 test1
3
Name: 1, dtype: object
The reindexing part isn't a necessary step as it only computes on the indices where strings are present.
Also, the reset_index(drop=True) part was indeed unwanted, as the groupby operation returns a series after the forward fill, which can be assigned straight back to column 1.
I have a two-column dataframe df where each row is distinct; an element in one column can map to one or more elements in the other column. I want to filter OUT those multi-mapping elements, so that in the final dataframe each element in one column maps to exactly one element in the other column.
What I am doing is to groupby one column and count the duplicates, then remove rows with counts of more than 1, and do it again for the other column. I am wondering if there is a better, simpler way.
Thanks
Edit 1: I just realized my solution is INCORRECT: removing multi-mapping elements in column A reduces the number of mappings in column B. Consider the following example:
A B
1 4
1 3
2 4
1 maps to 3 and 4, so the first two rows should be removed, and 4 maps to 1 and 2, so the last row should also go. The final table should be empty. However, my solution will keep the last row.
Can anyone provide me a fast and simple solution ? thanks
Well, you could do something like the following:
>>> df
A B
0 1 4
1 1 3
2 2 4
3 3 5
You only want to keep a row if no other row has its value of 'A' and no other row has its value of 'B'. Only the row at index 3 meets those conditions in this example:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Aone.merge(Bone,on=['A','B'],how='inner')
A B
0 3 5
Explanation:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Aone
A B
2 2 4
3 3 5
The above grabs the rows that may be allowed based on looking at column 'A' alone.
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Bone
A B
1 1 3
3 3 5
The above grabs the rows that may be allowed based on looking at column 'B' alone. Merging the two with an inner join then leaves you with only the rows that meet both conditions:
>>> Aone.merge(Bone,on=['A','B'],how='inner')
Note: you could also do a similar thing using groupby/transform, but transform tends to be slowish so I didn't include it as an alternative.
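For completeness, a minimal sketch of that transform-based alternative (not from the original answer; it keeps rows whose 'A' value and 'B' value each occur exactly once in the example data above):
>>> mask = (df.groupby('A')['B'].transform('size') == 1) & (df.groupby('B')['A'].transform('size') == 1)
>>> df[mask]
A B
3 3 5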