This should be an easy problem, but I'm stuck. I have a bunch of DataFrames stored in a list. I need to randomly select one of them, but also capture the list index of that DataFrame and store it in a variable for later use. My attempt currently throws the following error: "Can only compare identically-labeled DataFrame objects"
I have also used enumerate() methods in for loops before, so maybe it could be used to solve this problem as well.
random_df = random.choice(df_list)
random_df_il = df_list.index(random_df)
You could use enumerate and "unpack" the random choice using:
random_df_il, random_df = random.choice(list(enumerate(df_list)))
You can do a choice among the list indexes, then select your df:
ix = range(len(df_list))
i_rand = random.choice(ix)
random_df = df_list[i_rand]
You can also directly pick a random integer with random.randint(0, len(df_list)-1).
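As a minimal, self-contained sketch of the enumerate approach (using small placeholder DataFrames in place of the question's df_list):

```python
import random
import pandas as pd

# Hypothetical stand-in for the question's df_list.
df_list = [pd.DataFrame({"a": [n]}) for n in range(5)]

# Unpack the index and the DataFrame in one step;
# this avoids comparing DataFrames with list.index().
random_df_il, random_df = random.choice(list(enumerate(df_list)))
```

Because the index is drawn together with the DataFrame, no equality comparison between DataFrames ever happens, which sidesteps the original error.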
I'm trying to build the DataFrame below:
df = pd.DataFrame(columns=['Year','Revenue','Gross Profit','Operating Profit','Net Profit'])
rep_vals =['year','net_sales','gross_income','operating_income','profit_to_equity_holders']
for i in range(len(yearly_reports)):
    df.loc[i] = [yearly_reports[i].x for x in rep_vals]
However, I get the error: 'Report' object has no attribute 'x'
The brute-force version of the code below works:
for i in range(len(yearly_reports)):
    df.loc[i] = [yearly_reports[i].year, yearly_reports[i].net_sales,
                 yearly_reports[i].gross_income, yearly_reports[i].operating_income,
                 yearly_reports[i].profit_to_equity_holders]
My issue is that I want to add many more columns, and I don't want to fetch every item from my yearly_reports into the DataFrame. How can I iterate over just the values I want more efficiently, please?
The .x syntax looks for an attribute literally named x. To look an attribute up by the name stored in the variable x, use getattr:
getattr(yearly_reports[i], x)
(Indexing with yearly_reports[i][x] would only work if the Report class supports item access.)
Also, it is probably a bad idea / not necessary / slow to iterate over your dataframe like this. Have a look at join/merge which might be a lot faster.
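A minimal sketch of the getattr approach, using a hypothetical Report class in place of the question's objects:

```python
import pandas as pd

class Report:
    """Hypothetical stand-in for the question's Report objects."""
    def __init__(self, year, net_sales):
        self.year = year
        self.net_sales = net_sales

yearly_reports = [Report(2020, 100), Report(2021, 120)]
rep_vals = ['year', 'net_sales']

# getattr(obj, name) fetches the attribute whose name is stored in `name`,
# so the column list drives which attributes get pulled in.
df = pd.DataFrame(
    [[getattr(r, x) for x in rep_vals] for r in yearly_reports],
    columns=['Year', 'Revenue'],
)
```

Building the full list of rows first and passing it to the DataFrame constructor is also faster than assigning row by row with df.loc[i].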
I'm an economist, not a programmer, so I hope there is a good Samaritan who can help me out.
I have a list called iShares:
iShares = ['IE0005042456', 'IE00B1FZS574', 'IE00B0M62Q58', 'IE00BF4RFH31', 'IE0031442068', 'IE00BGL86Z12', 'IE00BD45KH83']
Also I have a list of dataframes with identical structure:
Lookup_Sector_IE0005042456
Lookup_Sector_IE00B1FZS574
Lookup_Sector_IE00B0M62Q58
Lookup_Sector_IE00BF4RFH31
Lookup_Sector_IE0031442068
Lookup_Sector_IE00BGL86Z12
Lookup_Sector_IE00BD45KH83
I would like to concatenate these DataFrames into a single DataFrame named Lookup_Sector. This is possible by hardcoding the names:
Lookup_Sector = pd.concat([Lookup_Sector_IE0005042456, Lookup_Sector_IE00B1FZS574, Lookup_Sector_IE00B0M62Q58, Lookup_Sector_IE00BF4RFH31, Lookup_Sector_IE0031442068, Lookup_Sector_IE00BGL86Z12, Lookup_Sector_IE00BD45KH83])
But rather than hardcoding, I would like to be more flexible and use the list iShares, first creating another list with the names of the DataFrames, called Lookup_Sector_list:
Lookup_Sector_list = ['Lookup_Sector_' + i for i in iShares]
Then, in the next step, I would like to concatenate using a for loop:
for i in Lookup_Sector_list:
    Sector = pd.concat([Lookup_Sector_list[i]])
If I do so I get an error:
TypeError: list indices must be integers or slices, not str
How can I concatenate the dataframes without hardcoding the names and instead using the list iShares?
Thank you so much in advance.
Since all your DataFrames are in the global namespace, you can look each one up with globals() and then concatenate them:
pd.concat([globals()['Lookup_Sector_' + i] for i in iShares])
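As an aside, relying on globals() is fragile. If you control how the lookup frames are created, collecting them in a dict keyed by ISIN avoids the dynamic-name lookup entirely. A sketch with placeholder DataFrames standing in for the real lookup frames:

```python
import pandas as pd

iShares = ['IE0005042456', 'IE00B1FZS574']

# Hypothetical: store each lookup frame in a dict keyed by ISIN
# instead of in separate global variables named Lookup_Sector_<ISIN>.
lookup_sectors = {i: pd.DataFrame({'sector': [i]}) for i in iShares}

# Concatenating is then a plain comprehension over the list.
Lookup_Sector = pd.concat([lookup_sectors[i] for i in iShares],
                          ignore_index=True)
```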
I want to create new DataFrames using the query method and a for loop, but when I try, this error appears: UndefinedVariableError: name 'i' is not defined.
I tried to do this using this code:
for sigla in sigla_estados:
    nome_estado_df = 'dataset_' + sigla

for i in range(28):
    nome_estado_df = consumo_alimentar.query("UF == @lista_estados[i]")
My list (lista_estados) has 27 items, so I tried to iterate over all of them using range. I can't figure out what the problem is; I am a beginner.
From your code I suppose you want to create multiple DataFrames, each containing the rows of consumo_alimentar that apply to one specific state (rows whose UF column matches one of the names in lista_estados). I also assume you have a list (sigla_estados) that contains the state codes, has the same length as lista_estados, and is arranged so that the code for lista_estados[x] is sigla_estados[x] for every x.
If my assumptions are right, this code could work:
nome_estado_df = {}  # dict mapping state code -> DataFrame
for i in range(len(lista_estados)):
    estado = lista_estados[i]
    sigla = sigla_estados[i]
    mask = consumo_alimentar['UF'] == estado
    nome_estado_df[sigla] = consumo_alimentar[mask]
With that code you'll get a dict of DataFrames, which I think is more or less what you want. If you want to use the query method, this should also work (note the @ prefix, which tells query to look up the Python variable):
nome_estado_df = {}
for i in range(len(lista_estados)):
    estado = lista_estados[i]
    sigla = sigla_estados[i]
    query_str = "UF == @estado"
    nome_estado_df[sigla] = consumo_alimentar.query(query_str)
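Putting this together with a tiny stand-in dataset (the DataFrame and lists below are hypothetical examples, not the asker's real data):

```python
import pandas as pd

# Hypothetical stand-in for consumo_alimentar.
consumo_alimentar = pd.DataFrame({
    'UF': ['Sao Paulo', 'Bahia', 'Sao Paulo'],
    'consumo': [10, 20, 30],
})
lista_estados = ['Sao Paulo', 'Bahia']
sigla_estados = ['SP', 'BA']

nome_estado_df = {}
for estado, sigla in zip(lista_estados, sigla_estados):
    # "@estado" makes query read the local Python variable `estado`.
    nome_estado_df[sigla] = consumo_alimentar.query("UF == @estado")
```

Using zip over the two lists also removes the need for range-based indexing altogether.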
This is what I have currently; I get the error 'int' object is not iterable. If I understand correctly, my issue is that BIKES_AVAILABLE is assigned a number at the top of my project, so instead of looking at the column it is using that number and hitting an error. How should I go about going through the column? I apologize in advance for the newbie question.
for i in range(len(stations[BIKES_AVAILABLE]) - 1):
    most_bikes = max(stations[BIKES_AVAILABLE])
    sort(stations[BIKES_AVAILABLE]).remove(max(stations[BIKES_AVAILABLE]))
    if most_bikes == max(stations[BIKES_AVAILABLE]):
        second_most = max(stations[BIKES_AVAILABLE])
        index_1 = index(most_bikes)
        index_2 = index(second_most)
        most_bikes = max(data[0][index_1], data[0][index_2])
return most_bikes
Another approach that might work better for data manipulation is the pandas module.
Then you could do this:
import pandas as pd
data = pd.read_csv('bicycle_data.csv')
# Alternative:
# most_sales = data['sold'].max()
most_sales = max(data['sold'])
Now you don't have to worry about indexing columns with numbers:
You can also do something like this:
sorted_data = data.sort_values(by='sold', ascending=False)
# Displays top 5 sold bicycles.
print(sorted_data.head(5))
More importantly, if you like working with indexes, pandas has a built-in method called idxmax that returns the index of the max value.
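For instance, with an in-memory DataFrame standing in for the CSV data (the column names here are illustrative):

```python
import pandas as pd

# Hypothetical data standing in for bicycle_data.csv.
data = pd.DataFrame({'model': ['A', 'B', 'C'], 'sold': [3, 9, 4]})

most_sales = data['sold'].max()     # largest value in the column
best_index = data['sold'].idxmax()  # index label of the row holding it
```

Having the index label lets you pull the whole row with data.loc[best_index] instead of searching for the value again.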
Using a generator inside max()
If you have a CSV file named test.csv, with contents:
line1,3,abc
line2,1,ahc
line3,9,sbc
line4,4,agc
You can use a generator expression inside the max() function for a memory efficient solution (i.e. no list is created).
If you wanted to do this for the second column, then:
max(int(l.split(',')[1]) for l in open("test.csv"))
(Note there is no .readlines() call: iterating over the file object directly is what keeps this memory efficient, since .readlines() would build the whole list up front.)
which would give 9 for this example.
Update
To get the row (index), you need to store the index of the max number in the column so that you can access this:
max(((i, int(l.split(',')[1])) for i, l in enumerate(open("test.csv"))), key=lambda t: t[1])[0]
which gives 2 here, since the line of test.csv (above) with the max number in column 2 (which is 9) is at index 2 (i.e. the third line).
This works fine, but you may prefer to just break it up slightly:
lines = open("test.csv").readlines()
max(((i, int(l.split(',')[1])) for i, l in enumerate(lines)), key=lambda t: t[1])[0]
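An equivalent using the csv module from the standard library, which also handles quoted fields correctly (shown here on an in-memory string rather than a file, so the example is self-contained):

```python
import csv
import io

# In-memory stand-in for the test.csv contents shown above.
text = "line1,3,abc\nline2,1,ahc\nline3,9,sbc\nline4,4,agc\n"
rows = list(csv.reader(io.StringIO(text)))

# Index and row of the max value in the second column (index 1).
# The key converts the field to int so comparison is numeric.
i_max, best_row = max(enumerate(rows), key=lambda t: int(t[1][1]))
```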
Assuming a csv structure like so:
data = ['1,blue,15,True',
'2,red,25,False',
'3,orange,35,False',
'4,yellow,24,True',
'5,green,12,True']
If I want to get the max value from the 3rd column, I would convert each field to int before comparing (otherwise the strings are compared lexicographically):
largest_number = max(int(n.split(',')[2]) for n in data)
I'm trying to add a new column to my DataFrame each time I run my function. This causes the error: ValueError: Length of values does not match length of index. I assume this is because the list I add to df as a new column varies in length with every run of the function.
I have seen many threads suggest using concat, but this probably won't work for me, as I can't seem to use concat and just overwrite my existing df, and I need one complete df at the end with a column from each run of my function.
My code functions like this:
df = pd.DataFrame()
mylist = []

def myfunc(number):
    mylist = []
    for x in range(0, 10):
        if 'some condition':
            mylist.append(x)
    df['results%d' % number] = mylist
So for each function call I'm adding the contents of mylist as a new DataFrame column. At the second call this causes the error mentioned above. I need some way of letting pandas ignore the index/column length mismatch. From the threads suggesting concat, I gather that passing axis=1 fixes the problem of different lengths, so the solution might be parallel to that.
Alternatively, I could create one list per 'number' parameter, either before the function definition or at the beginning of it, but that is a very crude solution.
I'm not exactly clear on what you're trying to do, but maybe you want something like this?
df = pd.DataFrame()

def myfunc(number):
    row_index = 0
    for x in range(0, 10):
        if 'some condition':
            df.loc[row_index, 'results%d' % number] = x
            row_index += 1
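Alternatively, the concat approach mentioned in the question does handle columns of different lengths if each run's results are wrapped in a Series and joined with axis=1 (pandas aligns on the index and pads shorter columns with NaN). A sketch, where the function name and the hard-coded value lists are illustrative:

```python
import pandas as pd

def add_results(df, number, values):
    # Wrap the list in a Series so concat can align on the index
    # and pad the shorter column with NaN.
    new_col = pd.Series(values, name='results%d' % number)
    return pd.concat([df, new_col], axis=1)

df = pd.DataFrame()
df = add_results(df, 0, [1, 2, 3])
df = add_results(df, 1, [4, 5])  # shorter list: last row becomes NaN
```

Returning the new DataFrame instead of mutating a global also makes the function easier to test and reuse.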