Here is an example dataset found from google search close to my datasets in my environment
I'm trying to get output like this
import pandas as pd
import numpy as np
data = {'Product':['Box','Bottles','Pen','Markers','Bottles','Pen','Markers','Bottles','Box','Markers','Markers','Pen'],
'State':['Alaska','California','Texas','North Carolina','California','Texas','Alaska','Texas','North Carolina','Alaska','California','Texas'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df=pd.DataFrame(data, columns=['Product','State','Sales'])
df1=df.sort_values('State')
#df1['Total']=df1.groupby('State').count()
df1['line']=df1.groupby('State').cumcount()+1
print(df1.to_string(index=False))
Commented out line throws this error
ValueError: Columns must be same length as key
Tried with size() it gives NaN for all rows
Hope someone points me to right direction
Thanks in advance
I think this should work for 'Total':
df1['Total']=df1.groupby('State')['Product'].transform(lambda x: x.count())
Try this:
df = pd.DataFrame(data).sort_values("State")
grp = df.groupby("State")
df["Total"] = grp["State"].transform("size")
df["line"] = grp.cumcount() + 1
Related
So, I was working on titanic dataset to extract Title(Mr,Ms,Mrs) from Name column from Data frame(df). Its has 1309 rows.
for ind,name in enumerate(df['Name']):
if type(name)==str:
inf = name.find(', ') + 2
df.loc[ind+1,'Title'] = name[inf:name.find('.')]
else :
print(name,ind)
This peice of code gives the following output
nan 1309
As supposed it had to stop for ind=1308, but it goes one step further even if not indicated to do so.
What could be the flaw here? Is it due to the fact that I am using 1 based indexing of the data frame?
If so, what could be done here to prevent such behaviour?
I am new to this platform, so please ask for clarifications in case of any discrepancies.
Here is a short Example:-
import numpy as np
import pandas as pd
dict1 = {'Name':['Hey, Mr.','Hello, Ms.','Hi, Mrs,','Welcome, Master.','Yes, Mr.'],'ind':[1,2,3,4,5]}
df = pd.DataFrame(data = dict1)
df.set_index('ind')
for ind,name in enumerate(df['Name']):
if type(name)==str:
inf = name.find(', ') + 2
df.loc[ind+1,'Title'] = name[inf:name.find('.')]
else :
print(name,ind)
print(df['Title'])
import pandas as pd
nba = pd.read_csv("nba.csv")
names = pd.Series(nba['Name'])
data = nba['Salary']
nba_series = (data, index=[names])
print(nba_series)
Hello I am trying to convert the columns 'Name' and 'Salary' into a series from a dataframe. I need to set the names as the index and the salaries as the values but i cannot figure it out. this is my best attempt so far anyone guidance is appreciated
I think you are over-thinking this. Simply construct it with pd.Series(). Note the data needs to be with .values, otherwis eyou'll get Nans
import pandas as pd
nba = pd.read_csv("nba.csv")
nba_series = pd.Series(data=nba['Salary'].values, index=nba['Name'])
Maybe try set_index?
nba.set_index('name', inlace = True )
nba_series = nba['Salary']
This might help you
import pandas as pd
nba = pd.read_csv("nba.csv")
names = nba['Name']
#It's automatically a series
data = nba['Salary']
#Set names as index of series
data.index = nba_series
data.index = names might be correct but depends on the data
I am doing Sentiment Analysis on Bitcoin News. During my coding a TypeError Problem occured. I hope you can help me and thank you very much in advance!
from newsapi.newsapi_client import NewsApiClient
from textblob import TextBlob
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import datetime
from datetime import time
import csv
from dateutil import parser
api = NewsApiClient(api_key='my key')
all_articles = api.get_everything(q='bitcoin',
sources='bbc-news,the-verge,financial-times,metro,business-insider,reuters,bloomberg,cnbc,cbc-news,fortune,crypto-coins-news',
domains='bbc.co.uk,techcrunch.com',
from_param='2019-10-20',
to='2019-11-19',
language='en',
sort_by='relevancy',
page_size=100)
news= pd.DataFrame(all_articles['articles'])
news['polarity'] = news.apply(lambda x: TextBlob(x['description']).sentiment.polarity, axis=1)
news['subjectivity'] = news.apply(lambda x: TextBlob(x['description']).sentiment.subjectivity, axis=1)
news['date']= news.apply(lambda x: parser.parse(x['publishedAt']).strftime('%Y.%m.%d'), axis=1)
news['time']= news.apply(lambda x: parser.parse(x['publishedAt']).strftime('%H:%M'), axis=1)
Then this TypeError occurs:
imgur_link
As #Hayat correctly pointed out, some of the rows in description column have None values which are causing the exception. There are four such rows, see the screenshot below.
You should remove such rows and operate on the ones that have proper data. You can filter rows that have None in description column using
news_filtered = news[news['description'].notnull()]
news_filtered['polarity'] = news_filtered.apply(lambda x: TextBlob(x['description']).sentiment.polarity, axis=1)
You may have to repeat the above for other columns.
You need to debug your code. You are passing None value.
x['description'] might have some None value in there.
news['polarity'] = news.apply(lambda x: TextBlob(x['description']).sentiment.polarity, axis=1)
Make sure in your preprocessing stage you don't have any None or NaN value in your dataframe
I want to execute SVM light and SVM rank,
so I need to process my data into the format of SVM light.
But I had a big problem....
My Python codes are below:
import pandas as pd
import numpy as np
from sklearn.datasets import dump_svmlight_file
self.df = pd.DataFrame()
self.df['patent_id'] = patent_id_list
self.df['Target'] = class_list
self.df['backward_citation'] = backward_citation_list
self.df['uspc_originality'] = uspc_originality_list
self.df['science_linkage'] = science_linkage_list
self.df['sim_bc_structure'] = sim_bc_structure_list
self.df['claim_num'] = claim_num_list
self.qid = dataset_list
X = self.df[np.setdiff1d(self.df.columns, ['patent_id','Target'])]
y = self.df.Target
dump_svmlight_file(X,y,'test.dat',zero_based=False, query_id=self.qid,multilabel=False)
The output file "test.dat" is look like this:
But the real data is look like this:
I got a wrong index....
Take first instance for example, the value of column 1 is 7, and the values of column 2~4 are zeros, the value of column 5 is 2....
So my expected result is look like this:
1 qid:1 1:7 5:2
but the column index of output file are totally wrong....
and unfortunately... I cannot figure out where is the problem occur....
I cannot fix this problem for a long time....
Thank you for help!!
I change the data structure, I use np.array to produce array-like input.
Finally, I succeed!
If you're interested in loading into a numpy array, try:
X = clicks_train[:,0:2]
y = clicks_train[:,2]
where 2 is the index of the target column
I've pulled some stock data from Quandl for both Crude Oil prices (WTI) and Caterpillar (CAT) price. When I concatenate the two dataframes together I'm left with some NaNs. My ultimate goal is to run a .Pearsonr() to assess the correlation (along with p-values), however I can't get Pearsonr() to work because of all the Nan's. So I'm trying to clean them up. When I use the .fillNA() function it doesn't seem to be working. I've even tried .interpolate() as well as .dropna(). None of them appear to work. Here is my working code.
import Quandl
import pandas as pd
import numpy as np
#WTI Data#
WTI_daily = Quandl.get("DOE/RWTC", collapse="daily",trim_start="1986-10-10", trim_end="1986-10-15")
WTI_daily.columns = ['WTI']
#CAT Data
CAT_daily = Quandl.get("YAHOO/CAT.6", collapse = "daily",trim_start="1986-10-10", trim_end="1986-10-15")
CAT_daily.columns = ['CAT']
#Combine Data Frames
daily_price_df = pd.concat([CAT_daily, WTI_daily], axis=1)
print daily_price_df
#Verify they are dataFrames:
def really_a_df(var):
if isinstance(var, pd.DataFrame):
print "DATAFRAME SUCCESS"
else:
print "Wahh Wahh"
return 'done'
print really_a_df(daily_price_df)
#Fill NAs
#CAN'T GET THIS TO WORK!!
daily_price_df.fillna(method='pad', limit=8)
print daily_price_df
# Try to interpolate
#CAN'T GET THIS TO WORK!!
daily_price_df.interpolate()
print daily_price_df
#Drop NAs
#CAN'T GET THIS TO WORK!!
daily_price_df.dropna(axis=1)
print daily_price_df
For what it's worth I've managed to get the function working when I create a dataframe from scratch using this code:
import pandas as pd
import numpy as np
d = {'a' : 0., 'b' : 1., 'c' : 2.,'d':None,'e':6}
d_series = pd.Series(d, index=['a', 'b', 'c', 'd','e'])
d_df = pd.DataFrame(d_series)
d_df = d_df.fillna(method='pad')
print d_df
Initially I was thinking that perhaps my data wasn't in dataframe form, but I used a simple test to confirm they are in fact dataframe. The only conclusion I that remains (in my opinion) is that it is something about the structure of the Quandl dataframe, or possibly the TimeSeries nature. Please know I'm somewhat new to python so structure answers for a begginner/novice. Any help is much appreciated!
pot shot - have you just forgotten to assign or use the inplace flag.
daily_price_df = daily_price_df.fillna(method='pad', limit=8)
OR
daily_price_df.fillna(method='pad', limit=8, inplace=True)