I have a pandas dataframe that looks like this. The rows and the columns have the same name.
name a b c d e f g
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10
I can get the 5 largest values by calling df['column_name'].nlargest(n=5), but if I have to return the top 50% in descending order, is there anything built into pandas for this, or do I have to write a function? How can I get them? I am quite new to Python. Please help me out.
UPDATE: Let's take column a into consideration; it has the values 10, 5, -, 6, 8, 3 and 4. I have to sum them all up and take the top values that make up 50% of that sum. The total in this case is 36, and 50% of it is 18. So from column a I want to select only 10 and 8. Similarly, I want to go through all the other columns and select 50%.
Sorting is flexible :)
df.sort_values('column_name', ascending=False).head(int(df.shape[0] * .5))
Update: the frac argument is available only on .sample(), not on .head() or .tail(). df.sample(frac=.5) does give 50% of the rows, but head and tail expect an int. df.head(frac=.5) fails with TypeError: head() got an unexpected keyword argument 'frac'.
Note on int() vs round():
int(3.X) == 3    # True where 0 <= X <= 9 (int truncates)
round(3.45) == 3 # True
round(3.5) == 4  # True
So when doing .head(int(...)) or .head(round(...)), think about which behaviour fits your need.
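For completeness, a minimal sketch (on a made-up df) contrasting the two approaches, since .head() wants a row count while .sample() takes a fraction:
import pandas as pd
df = pd.DataFrame({'a': range(7)})
# .head() needs an integer row count, so compute 50% of the length yourself.
top_half = df.sort_values('a', ascending=False).head(int(len(df) * .5))
# .sample() accepts a fraction directly, but returns random rows, not the largest.
random_half = df.sample(frac=.5)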
Updated: Requirements
Let's take column a into consideration; it has the values 10, 5, -, 6, 8, 3 and 4. I have to sum them all up and take the top values that make up 50% of that sum. The total in this case is 36, and 50% of it is 18. So from column a I want to select only 10 and 8. Similarly, I want to go through all the other columns and select 50%. – Matt
A silly hack would be to sort, take the cumulative sum, divide it by the column total to get the running share, and then use that to select the leading part of your sorted column, e.g.:
import pandas as pd
from io import StringIO  # use io.StringIO; pd.compat.StringIO is gone in modern pandas

data = pd.read_csv(
    StringIO("""name a b c d e f g
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10"""),
    sep=r'\s+', index_col='name'
).apply(pd.to_numeric, errors='coerce', downcast='signed')  # '-' becomes NaN

# Sort column 'a' descending, then keep rows while the running share stays <= 50%.
col = data[['a']].sort_values(by='a', ascending=False)
x = col[(col.cumsum() / col.sum()) <= .5].dropna()
print(x)
Outcome: only the rows whose running total stays within 50% of the column sum survive; for column a that is 10 (row a) and 8 (row e), matching the update above.
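To cover the "go through all the other columns" part, here is a minimal sketch (my extension, reusing the data frame built above) that applies the same cumulative-share trick to each column in turn:
def top_half_by_sum(s):
    # Keep the largest values whose running total stays within 50% of the column sum.
    s = s.sort_values(ascending=False)
    return s[(s.cumsum() / s.sum()) <= .5].dropna()

for col in data.columns:
    print(top_half_by_sum(data[col]), end='\n\n')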
You could sort the data frame and display only the top 90% of the rows:
df.sort_values('column_name', ascending=False).head(round(0.9 * len(df)))
data.csv
name,a,b,c,d,e,f,g
a,10,5,4,8,5,6,4
b,5,10,6,5,4,3,3
c,-,4,9,3,6,5,7
d,6,9,8,6,6,8,2
e,8,5,4,4,14,9,6
f,3,3,-,4,5,14,7
g,4,5,8,9,6,7,10
test.py
#!/usr/bin/env python
import pandas as pd

def percentageOfList(l, p):
    # Keep the first p-fraction of the (already sorted) values.
    return l[0:int(len(l) * p)]

df = pd.read_csv('data.csv')
print(percentageOfList(df.sort_values('b', ascending=False)['b'], 0.9))
I have a csv file that contains columns like StateName, Population, CityName... Note that every state can have multiple city names, and thus multiple population figures, one per city.
What I want is, for each StateName, the three cities with the highest population.
What I have and what I want to have are shown as screenshots in the original post.
my code is:
def answer_six():
    x = census_df['STNAME'].unique()
    census_df2 = df = pd.DataFrame()
    for a in x:
        census_dfcopy = census_df.copy()
        census_dfcopy = census_dfcopy.set_index(['STNAME'])
        census_dfcopy = census_dfcopy.loc[a]
        census_dfcopy = census_dfcopy.reset_index()
        census_dfcopy = census_dfcopy.set_index(['CENSUS2010POP'])
        census_dfcopy1 = census_dfcopy.sort_index(ascending=False)
        census_dfcopy1 = census_dfcopy1.append(census_dfcopy1)
        census_dfcopy1.groupby('STNAME')
    return census_dfcopy1.head(3)

answer_six()
I only get the last 3 values of the last state.
To download the csv file, please visit this link:
https://drive.google.com/open?id=1ptE6MRQ1NGrfRYBB7NKjqhOJZXlxScPo
You could do
census_df.groupby('STNAME').CENSUS2010POP.nlargest(3)
In action:
In [51]: df
Out[51]:
ctyname pop stname
0 0 10 a
1 1 9 a
2 2 1 a
3 3 3 a
4 4 12 b
5 5 12 b
6 6 13 b
7 7 14 b
8 8 4 c
9 9 3 c
10 10 2 c
11 11 1 c
In [68]: df.groupby('stname').pop.nlargest(3)
Out[68]:
stname
a 0 10
1 9
3 3
b 7 14
6 13
4 12
c 8 4
9 3
10 2
Description
Long story short, I need a way to sort a DataFrame by a specific column, given a specific function, analogous to the "key" parameter of Python's built-in sorted() function. Yet there is no such "key" parameter in the pd.DataFrame.sort_values() function.
The approach used for now
I have to create a new column to store the "scores" of each row, and delete it at the end. The problem with this approach is the necessity to generate a column name that does not already exist in the DataFrame, and it becomes even more troublesome when sorting by multiple columns.
I wonder if there is a more suitable way for this purpose, one with no need to come up with a new column name, just like using sorted() and specifying its "key" parameter.
Update: I changed my implementation by using a new object instead of generating a new string beyond those in the columns to avoid collision, as shown in the code below.
Code
Here is the example code. In this sample the DataFrame needs to be sorted according to the length of the data in the "snippet" column. Please don't make additional assumptions about the type of the objects in each row of the specific column. The only things given are the column itself and a function object/lambda expression (in this example: len) that takes each object in the column as input and produces a value, which is used for comparison.
def sort_table_by_key(self, ascending=True, key=len):
    """
    Sort the table inplace.
    """
    # column_tmp = "".join(self._table.columns)
    column_tmp = object()  # Create a new object to avoid column name collision.
    # Calculate the scores of the objects.
    self._table[column_tmp] = self._table["snippet"].apply(key)
    self._table.sort_values(by=column_tmp, ascending=ascending, inplace=True)
    del self._table[column_tmp]
For now this is not implemented; check GitHub issue 3942.
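(For what it's worth, the feature tracked in that issue did land later: since pandas 1.1, sort_values accepts a key callable, applied to each sort column as a Series. On a new enough pandas the temporary-column workaround is unnecessary; using the example df built below:)
# Requires pandas >= 1.1; `key` gets each column as a Series and must return a Series.
df.sort_values(by='A', key=lambda col: col.str.len())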
I think you need argsort and then select by iloc:
import pandas as pd

df = pd.DataFrame({
    'A': ['assdsd','sda','affd','asddsd','ffb','sdb','db','cf','d'],
    'B': list(range(9))
})
print (df)
A B
0 assdsd 0
1 sda 1
2 affd 2
3 asddsd 3
4 ffb 4
5 sdb 5
6 db 6
7 cf 7
8 d 8
def sort_table_by_length(column, ascending=True):
    if ascending:
        return df.iloc[df[column].str.len().argsort()]
    else:
        return df.iloc[df[column].str.len().argsort()[::-1]]
print (sort_table_by_length('A'))
A B
8 d 8
6 db 6
7 cf 7
1 sda 1
4 ffb 4
5 sdb 5
2 affd 2
0 assdsd 0
3 asddsd 3
print (sort_table_by_length('A', False))
A B
3 asddsd 3
0 assdsd 0
2 affd 2
5 sdb 5
4 ffb 4
1 sda 1
7 cf 7
6 db 6
8 d 8
How it works:
First, get the lengths as a new Series:
print (df['A'].str.len())
0 6
1 3
2 4
3 6
4 3
5 3
6 2
7 2
8 1
Name: A, dtype: int64
Then get the positions of the sorted values with argsort; for descending order the positions are simply reversed with [::-1]:
print (df['A'].str.len().argsort())
0 8
1 6
2 7
3 1
4 4
5 5
6 2
7 0
8 3
Name: A, dtype: int64
Finally, change the row order with iloc:
print (df.iloc[df['A'].str.len().argsort()])
A B
8 d 8
6 db 6
7 cf 7
1 sda 1
4 ffb 4
5 sdb 5
2 affd 2
0 assdsd 0
3 asddsd 3
I am curious if there is a way to display my columns of data right beside each other, instead of 7 columns and then the remaining columns wrapped below.
This is the output
aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg \
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
3 1 2 3 4 5 6 7
hhhhhhhh iiiiiiii
1 8 9
2 8 9
3 8 9
Press any key to continue . . .
This is the code
import pandas as pd
df = pd.DataFrame(index=['1','2','3'])
df['aaaaaaaa'] = 1
df['bbbbbbbb'] = 2
df['cccccccc'] = 3
df['dddddddd'] = 4
df['eeeeeeee'] = 5
df['ffffffff'] = 6
df['gggggggg'] = 7
df['hhhhhhhh'] = 8
df['iiiiiiii'] = 9
print (df.head())
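What you are seeing is just the console repr wrapping at the default display width. A sketch (my suggestion, not from the thread) using pandas display options so wide frames stay on one line:
import pandas as pd
# Don't split the repr of wide frames into multiple blocks.
pd.set_option('display.expand_frame_repr', False)
# Optionally let pandas auto-detect the full terminal width.
pd.set_option('display.width', None)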
Were you trying to combine two different CSV files? If so, you need to use the append method: DataFrame.append.
I have the following pd.DataFrame:
AllData =
a#a.6 f#s.2 c#c.2 d#w.4 k#a.3
1 8 3 3 8
4 4 7 4 3
6 8 9 1 6
3 4 5 6 1
7 6 0 8 1
And I would like to create a new pd.DataFrame with only the columns whose names are keys in the following dictionary:
my_dict = {'a#a.6': value1, 'c#c.2': value2, 'd#w.4': value5}
So the new DataFrame would be:
FilteredData =
a#a.6 c#c.2 d#w.4
1 3 3
4 7 4
6 9 1
3 5 6
7 0 8
What is the most efficient way of doing this?
I have tried to use:
FilteredData = AllData.filter(regex=my_dict.keys)
but unsurprisingly, this didn't work. Any suggestions/advice welcome
Cheers, Alex
You can also do this without the filter method at all, like this:
FilteredData = AllData[list(my_dict.keys())]
Pandas DataFrames have a method called filter that returns a new DataFrame. Try this:
FilteredData = AllData.filter(items=my_dict.keys())
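One practical difference worth knowing (my note, not from the thread): direct indexing raises a KeyError if any key is missing from the columns, while filter(items=...) silently drops the missing ones. A minimal sketch with made-up values:
import pandas as pd
AllData = pd.DataFrame({'a#a.6': [1, 4], 'c#c.2': [3, 7], 'd#w.4': [3, 4]})
my_dict = {'a#a.6': 1, 'c#c.2': 2, 'missing': 3}  # hypothetical
print(AllData.filter(items=my_dict.keys()))  # 'missing' is ignored
# AllData[list(my_dict.keys())]              # would raise KeyError: ['missing']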