How to reverse the order of a specific column in python - python

I have extracted some data online and I would like to reverse the first column order.
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/us_weekly.html").content, 'html.parser')
for e in soup.select('#spotifyweekly tr:has(td)'):
    data.append({
        'Frequency': e.td.text,
        'Artists': e.a.text,
        'Songs': e.a.find_next_sibling('a').text
    })
data2 = data[:100]
print(data2)
data = pd.DataFrame(data2).to_excel('Kworb_Weekly.xlsx', index = False)
And here is my output:
[![enter image description here][1]][1]
[1]: https://i.stack.imgur.com/TmGmI.png
I've used [::-1], but it reversed all the columns, and I only want to reverse the first column.

Your first column is 'Frequency', so you can select that column from the data frame and slice both sides of the assignment, [::1] on the left and [::-1] on the right:
data = pd.DataFrame(data2)
print(data)
data['Frequency'][::1] = data['Frequency'][::-1]
print(data)
Got this as the output:
Frequency Artists Songs
0 1 SZA Kill Bill
1 2 PinkPantheress Boy's a liar Pt. 2
2 3 Miley Cyrus Flowers
3 4 Morgan Wallen Last Night
4 5 Lil Uzi Vert Just Wanna Rock
.. ... ... ...
95 96 Lizzo Special
96 97 Glass Animals Heat Waves
97 98 Frank Ocean Pink + White
98 99 Foo Fighters Everlong
99 100 Meghan Trainor Made You Look
[100 rows x 3 columns]
Frequency Artists Songs
0 100 SZA Kill Bill
1 99 PinkPantheress Boy's a liar Pt. 2
2 98 Miley Cyrus Flowers
3 97 Morgan Wallen Last Night
4 96 Lil Uzi Vert Just Wanna Rock
.. ... ... ...
95 5 Lizzo Special
96 4 Glass Animals Heat Waves
97 3 Frank Ocean Pink + White
98 2 Foo Fighters Everlong
99 1 Meghan Trainor Made You Look
[100 rows x 3 columns]
Process finished with exit code 0
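The slice-on-both-sides assignment above is a chained assignment, which is fragile and does not take effect when pandas runs with copy-on-write. A variant that avoids it is to assign the reversed values as a plain NumPy array. This is a minimal sketch, assuming a three-row stand-in frame in place of the scraped data:

```python
import pandas as pd

# Small stand-in for the scraped table (three rows only, for illustration).
data = pd.DataFrame({
    'Frequency': ['1', '2', '3'],
    'Artists': ['SZA', 'PinkPantheress', 'Miley Cyrus'],
    'Songs': ['Kill Bill', "Boy's a liar Pt. 2", 'Flowers'],
})

# to_numpy() drops the index, so pandas cannot realign the reversed values
# to their original rows. Assigning data['Frequency'][::-1] directly would be
# aligned by index and leave the column unchanged.
data['Frequency'] = data['Frequency'].to_numpy()[::-1]
print(data)
```

Only the 'Frequency' column changes; 'Artists' and 'Songs' keep their original order.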

Related

pandas: increment based on a condition in another column

I have a dataframe with only one column, like the following (a minimal example):
import pandas as pd
dataframe = pd.DataFrame({'text': ['##weather', 'how is today?', 'we go out', '##rain',
'my day is rainy', 'I am not feeling well', 'rainy blues',
'##flower', 'the blue flower', 'she likes red',
'this flower is nice']})
I would like to add a second column called 'id' that increments every time a row contains '##'. My desired output would be:
text id
0 ##weather 100
1 how is today? 100
2 we go out 100
3 ##rain 101
4 my day is rainy 101
5 I am not feeling well 101
6 rainy blues 101
7 ##flower 102
8 the blue flower 102
9 she likes red 102
10 this flower is nice 102
So far I have done the following, which does not return the output I want:
dataframe['id']= 100
dataframe.loc[dataframe['text'].str.contains('## intent:'), 'id'] += 1
You can try groupby with ngroup:
m = dataframe['text'].str.contains('##').cumsum()
dataframe['id'] = dataframe.groupby(m).ngroup() + 100
print(dataframe)
text id
0 ##weather 100
1 how is today? 100
2 we go out 100
3 ##rain 101
4 my day is rainy 101
5 I am not feeling well 101
6 rainy 101
7 blues 101
8 ##flower 102
9 the blue flower 102
10 she likes red 102
11 this flower is nice 102
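To make the intermediate steps of the groupby/ngroup answer visible, here is a self-contained sketch. It assumes 'rainy blues' is a single row, as in the desired output of the question, and the names is_header and block are illustrative only:

```python
import pandas as pd

dataframe = pd.DataFrame({'text': [
    '##weather', 'how is today?', 'we go out',
    '##rain', 'my day is rainy', 'I am not feeling well', 'rainy blues',
    '##flower', 'the blue flower', 'she likes red', 'this flower is nice',
]})

# True on the rows that start a new block.
is_header = dataframe['text'].str.contains('##', regex=False)

# Running count of headers seen so far: 1, 1, 1, 2, 2, 2, 2, 3, ...
block = is_header.cumsum()

# ngroup() numbers the blocks from 0, so adding 100 gives 100, 101, 102.
dataframe['id'] = dataframe.groupby(block).ngroup() + 100
print(dataframe)
```

Because ngroup() numbers groups from zero, the first id is 100 even when the first row does not contain '##'.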

How to run hypothesis test with pandas data frame and specific conditions?

I am trying to run a hypothesis test using an OLS model. I want to fit the model for tweet count across the four groups in my data frame: Athletes, CEOs, Politicians, and Celebrities. Each name is labeled with its group in a single 'group' column.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df=final_df.reindex(columns=["name","group","tweet_count","retweet_count","favorite_count"])
final_df
model=ols("tweet_count ~ C(group)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model=ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924

Python : get random data from dataframe pandas

I have a df with these values:
name algo accuracy
tom 1 88
tommy 2 87
mark 1 88
stuart 3 100
alex 2 99
lincoln 1 88
How can I randomly pick 4 records from df, with the condition that at least one record is picked for each unique value in the algo column? Here the algo column has only 3 unique values (1, 2, 3).
Sample outputs:
name algo accuracy
tom 1 88
tommy 2 87
stuart 3 100
lincoln 1 88
Sample output 2:
name algo accuracy
mark 1 88
stuart 3 100
alex 2 99
lincoln 1 88
One way
num_sample, num_algo = 4, 3
# sample one for each algo
out = df.groupby('algo').sample(n=num_sample//num_algo)
# add the remaining samples from the rows that didn't get selected
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
out = pd.concat([out, df.drop(out.index).sample(n=num_sample - len(out))])
Another way is to shuffle the whole data, enumerate the rows within each algo, sort by that enumeration and take the required number of samples. This is slightly more code than the first approach, but is cheaper and produces more balanced algo counts:
# shuffle data
df_random = df['algo'].sample(frac=1)
# enumerations of rows with the same algo
enums = df_random.groupby(df_random).cumcount()
# sort by the enumeration so the first row of every algo comes first:
enums = enums.sort_values()
# pick the first num_sample indices
# these will be indices of the samples
# so we can use `loc`
out = df.loc[enums.iloc[:num_sample].index]
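As a runnable check of the first approach, here is a minimal sketch that uses the sample data from the question and pd.concat. The names first and rest are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['tom', 'tommy', 'mark', 'stuart', 'alex', 'lincoln'],
    'algo': [1, 2, 1, 3, 2, 1],
    'accuracy': [88, 87, 88, 100, 99, 88],
})
num_sample = 4

# One random row per algo guarantees that every algo is represented.
first = df.groupby('algo').sample(n=1)

# Fill the remaining slots from the rows that have not been picked yet.
rest = df.drop(first.index).sample(n=num_sample - len(first))

out = pd.concat([first, rest])
print(out)
```

The result differs from run to run, but it always has four distinct rows and contains each algo value at least once.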

Pivoting count of column value using python pandas

I have student data with IDs and some values, and I need to pivot the table to get the count of IDs.
Here's an example of data:
id name maths science
0 B001 john 50 60
1 B021 Kenny 89 77
2 B041 Jessi 100 89
3 B121 Annie 91 73
4 B456 Mark 45 33
pivot table:
count of ID
5
Lots of different ways to approach this, I would use either shape or nunique() as Sandeep suggested.
import pandas as pd
data = {'id' : ['0','1','2','3','4'],
'name' : ['john', 'kenny', 'jessi', 'Annie', 'Mark'],
'math' : [50,89,100,91,45],
'science' : [60,77,89,73,33]}
df = pd.DataFrame(data)
print(df)
id name math science
0 0 john 50 60
1 1 kenny 89 77
2 2 jessi 100 89
3 3 Annie 91 73
4 4 Mark 45 33
then use either of the following:
len(df) or df.shape[0], which gives you the number of rows in the data frame (shape is an attribute, not a method, and returns a (rows, columns) tuple),
or
in: df['id'].nunique()
out: 5
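The options above can be compared side by side. This is a minimal sketch using the IDs from the question; the one-cell summary frame is only an assumption about what the "pivot table" should look like:

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['B001', 'B021', 'B041', 'B121', 'B456'],
    'name': ['john', 'Kenny', 'Jessi', 'Annie', 'Mark'],
    'maths': [50, 89, 100, 91, 45],
    'science': [60, 77, 89, 73, 33],
})

n_rows = len(df)                   # number of rows
n_rows_from_shape = df.shape[0]    # same value, read from the shape tuple
n_unique_ids = df['id'].nunique()  # number of distinct IDs

# A one-cell frame that mirrors the "count of ID" layout in the question.
summary = pd.DataFrame({'count of ID': [n_unique_ids]})
print(summary)
```

The row count and the unique count agree here because every ID appears once; with repeated IDs, nunique() would be the smaller number.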

Python - Performing Max Function on Multiple Groupby

I have a data frame below that shows the price of wood and steel from two different suppliers.
I would like to add a column that shows the highest price for the opposite item (i.e. if line is wood, it would pull steel) from the same supplier.
For example, the "Steel" row for "Tom" would show his highest wood price which is 42.
The code I have so far simply returns the highest price for the original item (i.e. not the opposite, so for Tom's steel row returns 24 but I would have wanted it to return 42).
I think this is an issue with pulling the max value for a multi-group. I have tried a number of different ways but just cannot seem to get it.
Any thoughts would be greatly appreciated.
import pandas as pd
import numpy as np
data = {'Supplier':['Tom', 'Tom', 'Tom', 'Bill','Bill','Bill'],'Item':['Wood','Wood','Steel','Steel','Steel','Wood'],'Price':[42,33,24,16,12,18]}
df = pd.DataFrame(data)
df['Opp_Item'] = np.where(df['Item']=="Wood", "Steel", "Wood")
df['Opp_Item_Max'] = df.groupby(['Supplier','Opp_Item'])['Price'].transform(max)
print(df)
Supplier Item Price Opp_Item Opp_Item_Max
0 Tom Wood 42 Steel 42
1 Tom Wood 33 Steel 42
2 Tom Steel 24 Wood 24
3 Bill Steel 16 Wood 16
4 Bill Steel 12 Wood 16
5 Bill Wood 18 Steel 18
If you can find the per supplier+item maximum, then you can just swap the values and assign them back through a join:
v = df.groupby(['Supplier', 'Item']).Price.max().unstack(-1)
# This reversal operation works under the assumption that
# there are only two items and that they are opposites of each other.
v[:] = v.values[:, ::-1]
df = (df.set_index(['Supplier', 'Item'])
.join(v.stack().to_frame('Opp_Item_Max'), how='left')
.reset_index())
print(df)
Supplier Item Price Opp_Item_Max
0 Bill Steel 16 18
1 Bill Steel 12 18
2 Bill Wood 18 16
3 Tom Steel 24 42
4 Tom Wood 42 24
5 Tom Wood 33 24
Note: Ordering of your data will not be preserved after the join.
You could map to the opposite values before a groupby, and then merge this back to the original DataFrame.
d = {'Steel': 'Wood', 'Wood': 'Steel'}
df.merge(df.assign(Item = df.Item.map(d))
.groupby(['Supplier', 'Item'], as_index=False).max(),
on=['Supplier', 'Item'],
how='left',
suffixes=['', '_Opp_Item'])
Supplier Item Price Price_Opp_Item
0 Tom Wood 42 24
1 Tom Wood 33 24
2 Tom Steel 24 42
3 Bill Steel 16 18
4 Bill Steel 12 18
5 Bill Wood 18 16
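A variant of the same idea that keeps the original row order and adds the column in place is to look up each row's (supplier, opposite item) pair in the grouped maxima. This is a sketch under the same assumption as above, namely that there are exactly two items; the names opposite, max_price and keys are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    'Supplier': ['Tom', 'Tom', 'Tom', 'Bill', 'Bill', 'Bill'],
    'Item': ['Wood', 'Wood', 'Steel', 'Steel', 'Steel', 'Wood'],
    'Price': [42, 33, 24, 16, 12, 18],
})

opposite = {'Steel': 'Wood', 'Wood': 'Steel'}

# Highest price per (Supplier, Item) pair, indexed by that pair.
max_price = df.groupby(['Supplier', 'Item'])['Price'].max()

# For every row, build the key (Supplier, opposite Item) and look it up.
keys = pd.MultiIndex.from_arrays([df['Supplier'], df['Item'].map(opposite)])
df['Opp_Item_Max'] = max_price.reindex(keys).to_numpy()
print(df)
```

If a supplier has no rows for the opposite item, the lookup yields NaN for that supplier's rows.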
