I'm trying to build the DataFrame below:
df = pd.DataFrame(columns=['Year', 'Revenue', 'Gross Profit', 'Operating Profit', 'Net Profit'])
rep_vals = ['year', 'net_sales', 'gross_income', 'operating_income', 'profit_to_equity_holders']
for i in range(len(yearly_reports)):
    df.loc[i] = [yearly_reports[i].x for x in rep_vals]
However, I get an error: 'Report' object has no attribute 'x'.
The brute-force version below works:
for i in range(len(yearly_reports)):
    df.loc[i] = [yearly_reports[i].year, yearly_reports[i].net_sales,
                 yearly_reports[i].gross_income, yearly_reports[i].operating_income,
                 yearly_reports[i].profit_to_equity_holders]
My issue is that I want to add many more columns, and I don't want to fetch every item from yearly_reports into the DataFrame. How can I iterate over just the values I want more efficiently?
Instead of the literal attribute access .x (which looks for an attribute actually named x), look the attribute up by name with getattr:
getattr(yearly_reports[i], x)
Also, it is probably a bad idea (and slow) to grow your DataFrame row by row like this; have a look at join/merge, which might be a lot faster.
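For example, here is a sketch that builds all the rows first and constructs the DataFrame in one call; the Report stand-in below is hypothetical, so substitute your own class:

from collections import namedtuple
import pandas as pd

# Hypothetical stand-in for the Report class from the question.
Report = namedtuple('Report', ['year', 'net_sales', 'gross_income',
                               'operating_income', 'profit_to_equity_holders'])
yearly_reports = [Report(2021, 100, 40, 25, 18),
                  Report(2022, 120, 50, 30, 22)]

# Map DataFrame column names to Report attribute names, then build every
# row up front -- one DataFrame() call beats growing it via df.loc[i].
cols = {'Year': 'year', 'Revenue': 'net_sales', 'Gross Profit': 'gross_income',
        'Operating Profit': 'operating_income',
        'Net Profit': 'profit_to_equity_holders'}
rows = [{c: getattr(r, a) for c, a in cols.items()} for r in yearly_reports]
df = pd.DataFrame(rows)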
This should be an easy solution, but I'm stuck. I have a bunch of DataFrames stored in a list. I need to randomly select one of the DataFrames, but also acquire that DataFrame's list index and store it in a variable for later use. My current attempt throws the following error: "Can only compare identically-labeled DataFrame objects".
I have used enumerate() in for loops before, so maybe it could help solve this problem as well.
random_df = random.choice(df_list)
random_df_il = df_list.index(random_df)
You could use enumerate and "unpack" the random choice using:
random_df_il, random_df = random.choice(list(enumerate(df_list)))
You can choose among the list indexes, then select your df:
ix = range(len(df_list))
i_rand = random.choice(ix)
random_df = df_list[i_rand]
You can also directly pick a random integer with random.randint(0, len(df_list)-1).
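A minimal runnable sketch of this index-first approach, with toy DataFrames (random.randrange(len(df_list)) is equivalent to randint(0, len(df_list)-1)):

import random
import pandas as pd

df_list = [pd.DataFrame({'a': [1]}), pd.DataFrame({'a': [2]}),
           pd.DataFrame({'a': [3]})]

# Pick the index first, then look the DataFrame up. This sidesteps
# list.index(), whose == comparison on DataFrames raises
# "Can only compare identically-labeled DataFrame objects".
random_df_il = random.randrange(len(df_list))
random_df = df_list[random_df_il]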
I have the following dataframe:
I would like to use this code to compare the means across my entire dataframe:
F_statistic, pVal = stats.f_oneway(percentage_age_ss.iloc[:, 0:1],
                                   percentage_age_ss.iloc[:, 1:2],
                                   percentage_age_ss.iloc[:, 2:3],
                                   percentage_age_ss.iloc[:, 3:4])  # etc.
However, I don't want to write .iloc each time because it takes too long. Is there another way to do it?
Thanks
Build one argument per column with a generator expression, then use star syntax to unpack them into the arglist:
stats.f_oneway(*(percentage_age_ss[col] for col in percentage_age_ss.columns))
or, just
stats.f_oneway(*(percentage_age_ss.T.values))
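For example, a self-contained sketch with a made-up percentage_age_ss (assuming all columns are numeric):

import pandas as pd
from scipy import stats

# Made-up stand-in for percentage_age_ss.
percentage_age_ss = pd.DataFrame({'g1': [1.0, 2.0, 3.0],
                                  'g2': [2.0, 2.5, 3.5],
                                  'g3': [4.0, 5.0, 6.0],
                                  'g4': [1.5, 2.5, 2.0]})

# One 1-D sample per column, unpacked into f_oneway's argument list.
F_statistic, pVal = stats.f_oneway(
    *(percentage_age_ss[col] for col in percentage_age_ss.columns))
print(F_statistic, pVal)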
This is what I have currently; I get the error "'int' object is not iterable". If I understand correctly, my issue is that BIKES_AVAILABLE is assigned a number at the top of my project, so instead of looking at the column it is looking at that number and hitting an error. How should I go about iterating through the column? I apologize in advance for the newbie question.
for i in range(len(stations[BIKES_AVAILABLE]) -1):
most_bikes = max(stations[BIKES_AVAILABLE])
sort(stations[BIKES_AVAILABLE]).remove(max(stations[BIKES_AVAILABLE]))
if most_bikes == max(stations[BIKES_AVAILABLE]):
second_most = max(stations[BIKES_AVAILABLE])
index_1 = index(most_bikes)
index_2 = index(second_most)
most_bikes = max(data[0][index_1], data[0][index_2])
return most_bikes
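In plain Python, one way to get the two largest values (and their positions) without repeatedly mutating the list is a sketch like the following, assuming stations[BIKES_AVAILABLE] is the list of per-station counts:

# Hypothetical stand-in for stations[BIKES_AVAILABLE]: one count per station.
counts = [10, 4, 7, 10, 2]

# Sort the *indices* by their count, descending; the original list is untouched.
ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
index_1, index_2 = ranked[0], ranked[1]  # positions of the two largest counts
most_bikes = max(counts[index_1], counts[index_2])
print(index_1, index_2, most_bikes)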
Another method that might serve you better for data manipulation is the pandas module.
Then you could do this:
import pandas as pd
data = pd.read_csv('bicycle_data.csv')
# Alternative:
# most_sales = data['sold'].max()
most_sales = max(data['sold'])
Now you don't have to worry about indexing columns with numbers:
You can also do something like this:
sorted_data = data.sort_values(by='sold', ascending=False)
# Displays top 5 sold bicycles.
print(sorted_data.head(5))
More importantly, if you like working with indexes, pandas has a built-in function called idxmax that returns the index of the max value.
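For instance, a minimal sketch of idxmax with made-up data:

import pandas as pd

# Made-up stand-in for the CSV data.
data = pd.DataFrame({'model': ['road', 'hybrid', 'bmx'],
                     'sold': [15, 42, 7]})

# idxmax returns the row label of the maximum value, so .loc can
# then pull back the entire winning row.
best_index = data['sold'].idxmax()
print(best_index)            # 1
print(data.loc[best_index])  # the row for 'hybrid'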
Using a generator inside max()
If you have a CSV file named test.csv, with contents:
line1,3,abc
line2,1,ahc
line3,9,sbc
line4,4,agc
You can use a generator expression inside the max() function for a memory-efficient solution (i.e. no intermediate list is created).
If you wanted to do this for the second column, then:
max(int(l.split(',')[1]) for l in open("test.csv"))
which would give 9 for this example.
Update
To get the row (index), you need to carry the index of each number along, so that the position of the max can be recovered:
max(((i, int(l.split(',')[1])) for i, l in enumerate(open("test.csv"))), key=lambda t: t[1])[0]
which gives 2 here, since the line in test.csv (above) with the max number in the second column (which is 9) is at index 2 (i.e. the third line).
This works fine, but you may prefer to just break it up slightly:
lines = open("test.csv").readlines()
max(((i, int(l.split(',')[1])) for i, l in enumerate(lines)), key=lambda t: t[1])[0]
Assuming a csv structure like so:
data = ['1,blue,15,True',
'2,red,25,False',
'3,orange,35,False',
'4,yellow,24,True',
'5,green,12,True']
If I want to get the max value from the 3rd column I would do this:
largest_number = max(int(n.split(',')[2]) for n in data)  # cast to int so values compare numerically, not lexicographically
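If the fields might be quoted or contain embedded commas, the csv module splits more robustly than a bare str.split; a sketch over the same data:

import csv

data = ['1,blue,15,True',
        '2,red,25,False',
        '3,orange,35,False',
        '4,yellow,24,True',
        '5,green,12,True']

# csv.reader handles quoting/escaping that split(',') would not;
# int() keeps the comparison numeric.
largest_number = max(int(row[2]) for row in csv.reader(data))
print(largest_number)  # 35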
I'm trying to add a new column to my dataframe each time I run my function. This causes the error 'ValueError: Length of values does not match length of index'. I assume this is because the list I add to df as a new column varies in length with each run of the function.
I have seen many threads suggest using concat, but this probably won't work for me, as I can't seem to use concat and simply overwrite my existing df; I need one complete df at the end, with a column from each run of my function.
My code functions like this:
df = DataFrame()
mylist = []

def myfunc(number):
    mylist = []
    for x in range(0, 10):
        if 'some condition':
            mylist.append(x)
    df['results%d' % number] = mylist
So for each function call I'm adding the contents of mylist as a new DataFrame column. At the second call this causes the above-mentioned error. I need some way of letting pandas tolerate the index/column length mismatch. From the threads suggesting concat, I gather that passing axis=1 fixes the problem of different lengths, so the solution might be parallel to that.
Alternatively, I could create a separate list for each 'number' parameter passed to the function, either before the function definition or at the beginning of it, but this is a very primitive solution.
I'm not exactly clear on what you're trying to do, but maybe you want something like this?
df = DataFrame()

def myfunc(number):
    row_index = 0
    for x in range(0, 10):
        if 'some condition':
            df.loc[row_index, 'results%d' % number] = x
            row_index += 1
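If you'd rather build everything at once, here is a sketch of the concat route the question alludes to: collect each run's list as a named Series, then concat with axis=1, and pandas pads the shorter columns with NaN instead of raising a length error.

import pandas as pd

# Hypothetical results of two runs with different lengths.
results = [pd.Series([1, 2, 3], name='results1'),
           pd.Series([4, 5], name='results2')]

# axis=1 aligns on the index and fills the gap in the shorter column with NaN.
df = pd.concat(results, axis=1)
print(df)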
I'm trying to use pandas to do some analysis on messaging data and am running into a few problems prepping the data. It comes from a database I don't control, so I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
# Select all the messages in the database. Be careful: the whole test
# database may have 5,000,000 messages.
full_set_data = pd.read_sql("Select * from message", con=engine)
After I make this change to the timestamp and set it as the index, I'm no longer able to call to_csv:
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
# Extract the data columns I really care about, since there are a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here I need to format the DATA columns, and I get a SettingWithCopyWarning:
# Now format the DATA columns into something useful by masking off the upper 16 bits
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages for a given group; get_group() requires that I know the key values ahead of time.
key_group = groups.get_group((1,1,1))
#foreach group in groups:
#do analysis
I have tried everything I could think of to fix these problems, but I can't seem to get around them. I'm sure it comes from me misunderstanding/misusing pandas, as I'm still figuring it out.
I'm looking to solve these issues:
1) I can't save to CSV after setting the timestamp (converted to timedelta64) as the index.
2) How do I apply a function to a set of columns without triggering SettingWithCopyWarning when reformatting the DATA columns?
3) How do I grab the rows for each group without using get_group(), since I don't know the keys ahead of time?
Thanks for any insight and help so I can better understand how to properly use Pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates='timestamp', index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql; read_sql is a convenience wrapper that dispatches to read_sql_query or read_sql_table depending on its input.
The SettingWithCopy warning is due to the fact that datacolumns is a view of indexed, i.e. a subset of its rows/columns, not an object in its own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would be to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
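Concretely, with the columns from the question, the first option might look like this (a sketch):

# Take an explicit copy so later assignments modify an independent
# object rather than a view of `indexed`.
keep = ['address', 'subaddress', 'rx_or_tx', 'wordcount']
keep += [c for c in indexed.columns if 'DATA' in c]
datacolumns = indexed[keep].copy()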
As for the groupby, you could introduce a column of tuples to serve as the group keys:
indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'],
                                      indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda df: some_function(df.interaction_key, ...))
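For question 3, you can also just iterate over the GroupBy object itself; it yields (key, sub-DataFrame) pairs, so you never need to know the keys in advance:

# Each iteration yields the group key and the rows belonging to it;
# do_analysis is a hypothetical stand-in for your analysis code.
for key, group in datacolumns.groupby(['address', 'subaddress', 'rx_or_tx']):
    do_analysis(key, group)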
I'm not sure if it's all exactly what you want but let me know and I can edit.