I created an empty dataframe, drugs:
drugs = pd.DataFrame({'name': [],
                      'value': []})
after that, I wanted to add data from another data frame to it using a for loop:
for i in range(227, 498):
    value = drug_users[drug_users[drug_users.columns[i]] == 1][drug_users.columns[i]].sum() / 10138
    name = drug_users.columns[i]
    d2 = pd.DataFrame({'name': [name],
                       'value': [value]})
    print(d2)
    pd.concat([drugs, d2], ignore_index = True, axis = 0)
but when I take a sample from drugs, I get the error:
ValueError: a must be greater than 0 unless no samples are taken
The concat method returns a new dataframe instead of changing the current one. You need to assign the return value, e.g.:
drugs = pd.concat([drugs, d2], ignore_index = True, axis = 0)
You need to assign the return value from the concat function; I'm afraid it is not an in-place operation.
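A minimal, self-contained sketch of the fix (the names and values here are made up stand-ins for the real drug_users data):

```python
import pandas as pd

drugs = pd.DataFrame({'name': [], 'value': []})

# pd.concat returns a new dataframe; without reassigning to drugs,
# the loop would leave drugs empty and sample() would raise ValueError
for name, value in [('alcohol', 0.42), ('caffeine', 0.97)]:
    d2 = pd.DataFrame({'name': [name], 'value': [value]})
    drugs = pd.concat([drugs, d2], ignore_index=True)

print(len(drugs))  # 2, so drugs.sample(1) now works
```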
I have the following part of code:
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
where I am extracting some values based on some columns of the DataFrame batch. Since the initial DataFrame df can be quite large, I need to find an efficient way of doing the following:
Putting together the results of the for loop in a new DataFrame with columns unique_request, unique_ua, reply_length_avg and response4xx at each iteration.
Stacking these DataFrames below each other at each iteration.
I tried to do the following:
df_final = pd.DataFrame()
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = [unique_request, unique_ua, reply_length_avg, response4xx]
    df_final = pd.concat([df_final, concat], axis = 1, ignore_index = True)
return df_final
But I am getting the following error:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
Any idea of what I should try?
First of all, avoid using pd.concat to grow the main dataframe inside a for loop, as repeated concatenation gets quadratically slower. The problem you are facing is that pd.concat should receive a list of dataframes as input, however you are passing [df_final, concat], which is, in essence, a list containing 2 elements: one dataframe and one plain Python list. Ultimately, it seems you want to stack the results vertically, thus axis should be 0 and not 1.
Therefore, I suggest the following:
df_final = []
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = pd.concat([unique_request, unique_ua, reply_length_avg, response4xx], axis = 1, ignore_index = True)
    df_final.append(concat)
df_final = pd.concat(df_final, axis = 0, ignore_index = True)
return df_final
Note that pd.concat receives a list of dataframes, not a list that contains a list inside of it! Also, this approach is much faster, since the concatenation inside the for loop no longer grows with every iteration :)
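Since chunk and df aren't defined above, here is a self-contained sketch of the same pattern (collect per-chunk results in a list, concatenate once at the end), using a toy clientip/bytes frame and a made-up chunk helper:

```python
import pandas as pd

df = pd.DataFrame({
    'clientip': ['a', 'a', 'b', 'b', 'c', 'c'],
    'bytes':    [100, 300, 50, 150, 10, 30],
})

def chunk(frame, n):
    # split frame into consecutive chunks of n rows
    for start in range(0, len(frame), n):
        yield frame.iloc[start:start + n]

parts = []
for batch in chunk(df, 2):
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    parts.append(reply_length_avg.to_frame('reply_length_avg'))

# one concat, outside the loop
df_final = pd.concat(parts, axis=0)
```

Each chunk here happens to contain one client, so df_final holds one per-client mean per chunk; with real data the same index value can appear once per chunk and you would aggregate again after the concat.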
I hope it helps!
I can't apply the alterations I make to dataframes stored inside a dictionary. The changes are made with a for loop.
The problem is that although the loop works and each iterated df is changed, the changes do not apply to the dictionary the dfs are stored in.
The end goal is to create a merge of all the dataframes, since they come from different Excel files and sheets.
Here the code:
Import the two Excel files, assigning None to the sheet_name parameter in order to import all the sheets of each document into a dict. I have 8 sheets in the EEG Excel file and 5 in the SC file:
import numpy as np
import pandas as pd
eeg = pd.read_excel("path_file", sheet_name = None)
sc = pd.read_excel("path_file", sheet_name = None)
Merge the first dictionary with the second one using the update method. Now the eeg dict contains both EEG and SC, so I have a dict with 13 dfs inside:
eeg.update(sc)
The for loop is needed in order to carry out some modifications inside each df: reset the index to a specific column (common to all dfs), change its name, add a prefix that corresponds to the df's key, and lastly replace 0 with nan.
for key, df in eeg.items():
    df.set_index('Unnamed: 0', inplace = True)
    df.index.rename('Sbj', inplace = True)
    df = df.add_prefix(key + '_')
    df.replace(0, np.nan, inplace = True)
Although the loop iterates over the dictionary items and each single dataframe is modified correctly inside the loop, the changes don't show up on the dfs in the dictionary, and therefore I can't proceed to extract them into a list and merge them.
As you can see in fig. 1, the single df in the for loop is good!
But when I look at the dfs in the dict, they are still as before.
You need to map your modified dataframe back into your dictionary:
for key, df in eeg.items():
    df.set_index('Unnamed: 0', inplace = True)
    df.index.rename('Sbj', inplace = True)
    df = df.add_prefix(key + '_')
    df.replace(0, np.nan, inplace = True)
    eeg[key] = df  # map df back into eeg
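The reason the write-back is needed: df = df.add_prefix(...) rebinds the loop variable to a brand-new frame without touching the dictionary. A tiny sketch of the same effect, with a made-up one-entry dict:

```python
import pandas as pd

d = {'s1': pd.DataFrame({'x': [1, 2]})}

for key, df in d.items():
    df = df.add_prefix(key + '_')  # df now points at a *new* frame

print(list(d['s1'].columns))  # still ['x'] -- the dict was never updated

for key, df in d.items():
    d[key] = df.add_prefix(key + '_')  # write the new frame back

print(list(d['s1'].columns))  # ['s1_x']
```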
What you probably want is to apply the per-sheet changes first (the prefix depends on each df's key, so it must happen before merging) and concatenate once at the end:
# apply the changes to each dataframe, collecting the results
frames = []
for key, df in eeg.items():
    df = df.set_index('Unnamed: 0')
    df.index.rename('Sbj', inplace = True)
    df = df.add_prefix(key + '_')
    frames.append(df)
# merge the dataframes in your dictionary into one
df1 = pd.concat(frames)
df1.replace(0, np.nan, inplace = True)
I have multiple dataframes with the same columns but different values that look like this:
Product 1 Dataframe
Here's the code that generated them
import pandas as pd
d1 = {"Year":[2018,2019,2020],"Quantity": [10,20,30], "Price": [100,200,300]}
df_product1 = pd.DataFrame(data=d1)
d2 = {"Year":[2018,2019,2020],"Quantity": [20,20,50], "Price": [120,110,380]}
df_product2 = pd.DataFrame(data=d2)
d3 = {"Year":[2018,2019,2020],"Quantity": [40,20,70], "Price": [1000,140,380]}
df_product3 = pd.DataFrame(data=d3)
I merge two dataframes, specifying suffixes, like so:
df_total = df_product1.merge(df_product2,on="Year", suffixes = ("_Product1","_Product2"))
And I get
First Merged Dataframe
However, when I merge another dataframe to the result above using:
df_total = df_total.merge(df_product3,on="Year", suffixes = ("_Product","_Product3"))
I get
Final Merged Dataframe
Where there is no suffix for the third product.
I would like the last two columns of the dataframe to be Quantity_Product3, Price_Product3 instead of just Quantity and Price.
Let me know if it is possible or if I need to approach the problem from a completely different angle.
Why you don't get the result you want
It's explained in the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
suffixes : list-like, default is ("_x", "_y")
    A length-2 sequence where each element is optionally a string
    indicating the suffix to add to overlapping column names in left
    and right respectively. Pass a value of None instead of a string
    to indicate that the column name from left or right should be
    left as-is, with no suffix. At least one of the values must not
    be None.
Suffixes are added to overlapping column names.
See this example - suffixes are added to column b, because both dataframes have a column b, but not to columns a and c, as they are unique and not in common between the two dataframes.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b'], data=np.random.rand(10, 2))
df2 = pd.DataFrame(columns=['b', 'c'], data=np.random.rand(10, 2), index=np.arange(5, 15))
# equivalent to an inner join on the indices
out = pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
A crude solution
Why don't you just rename the columns manually? Not elegant, but effective.
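A sketch of the manual rename, reusing the product frames from the question (shortened to two years here):

```python
import pandas as pd

df_product1 = pd.DataFrame({"Year": [2018, 2019], "Quantity": [10, 20], "Price": [100, 200]})
df_product2 = pd.DataFrame({"Year": [2018, 2019], "Quantity": [20, 20], "Price": [120, 110]})
df_product3 = pd.DataFrame({"Year": [2018, 2019], "Quantity": [40, 20], "Price": [1000, 140]})

df_total = df_product1.merge(df_product2, on="Year", suffixes=("_Product1", "_Product2"))
df_total = df_total.merge(df_product3, on="Year")
# after the first merge the third frame's Quantity/Price no longer
# overlap with anything, so no suffix was added -- rename them by hand
df_total = df_total.rename(columns={"Quantity": "Quantity_Product3",
                                    "Price": "Price_Product3"})
```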
A possible alternative
The table you are trying to build looks like a pivot. I would look into normalising all your dataframes, concatenating them, then running a pivot on the result.
Depending on your case, this may well be more convoluted and could well be overkill. I mention it because I want to bring your attention to the concepts of pivoting/unpivoting (stacking/unstacking/normalising) data.
The code below takes a df which looks similar to yours and normalises it. For simpler cases you can use pandas.melt(). I don't have the exact data of your example but this should be a good starting point.
def random_dates(start, end, n, unit='D', seed=None):
    ndays = (end - start).days + 1
    return start + pd.to_timedelta(
        np.random.randint(0, ndays, n), unit=unit
    )

df = pd.DataFrame()
mysize = 20
df['key'] = np.arange(0, mysize)
df['A_value'] = np.random.randint(0, 10000, mysize)
df['A_date'] = random_dates(pd.to_datetime('2010-01-01'), pd.to_datetime('2019-01-01'), mysize)
df['B_value'] = np.random.randint(-5000, 5000, mysize)
df['B_date'] = random_dates(pd.to_datetime('2005-01-01'), pd.to_datetime('2015-01-01'), mysize)
df['C_value'] = np.random.randint(-10000, 10000, mysize)
df['C_date'] = random_dates(pd.to_datetime('2000-01-01'), pd.to_datetime('2019-01-01'), mysize)

df2 = df.set_index('key', drop=True, verify_integrity=True)
df2 = df2.stack().reset_index()
df2.columns = ['key', 'rawsource', 'rawvalue']
df2['source'] = df2['rawsource'].apply(lambda x: x[0:1])
df2['metric'] = df2['rawsource'].apply(lambda x: x[2:])
df2 = df2.drop(['rawsource'], axis=1)

df_piv = df2.pivot_table(index=['key', 'source'], columns='metric',
                         values='rawvalue', aggfunc='first'
                         ).reset_index().rename_axis(None, axis=1)
data = pd.read_csv("file.csv")
As = data.groupby('A')
for name, group in As:
    current_column = group.iloc[:, i]
    current_column.iloc[0] = np.NAN
The problem: 'data' stays the same after this loop, even though I'm trying to set values to np.NAN .
As @ohduran suggested:
data = pd.read_csv("file.csv")
As = data.groupby('A')
new_data = pd.DataFrame()
for name, group in As:
    # edit grouped data
    # eg group.loc[:, 'column'] = np.nan
    new_data = pd.concat([new_data, group])  # DataFrame.append was removed in pandas 2.0
.groupby() does not change the initial DataFrame. You might want to store what you do with groupby() in a different variable, and then accumulate it into a new DataFrame using that for loop.
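A small self-contained demonstration of that point (toy data, single 'val' column standing in for the real columns): the groups yielded by groupby are views for reading, so edits should go onto copies that you accumulate separately.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': ['x', 'x', 'y'], 'val': [1.0, 2.0, 3.0]})

parts = []
for name, group in data.groupby('A'):
    group = group.copy()  # work on a copy, not the original
    group.iloc[0, group.columns.get_loc('val')] = np.nan
    parts.append(group)
new_data = pd.concat(parts)

print(data['val'].isna().sum())      # 0 -- original untouched
print(new_data['val'].isna().sum())  # 2 -- one NaN per group
```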
I want to iterate through the rows of a DataFrame and assign values to a new DataFrame. I've accomplished that task indirectly like this:
#first I read the data from df1 and assign it to df2 if something happens
counter = 0                               #line1
for index, row in df1.iterrows():         #line2
    value = row['df1_col']                #line3
    value2 = row['df1_col2']              #line4
    #try unzipping a file (pseudo code)
    df2.loc[counter, 'df2_col'] = value   #line5
    counter += 1                          #line6
    #except
    print("Error, could not unzip {}")    #line7
#then I set the desired index for df2
df2 = df2.set_index(['df2_col'])          #line8
Is there a way to assign the values to the index of df2 directly in line5? Sorry, my original question was unclear: I'm creating an index based on that something happening.
There are a bunch of ways to do this. According to your code, all you've done is create an empty df2 dataframe with an index of values from df1.df1_col. You could do this directly like this:
df2 = pd.DataFrame([], df1.df1_col)
# ^ ^
# | |
# specifies no data, yet |
# defines the index
If you are concerned about having to filter df1 then you can do:
# cond is some boolean mask representing a condition to filter on.
# I'll make one up for you.
cond = df1.df1_col > 10
df2 = pd.DataFrame([], df1.loc[cond, 'df1_col'])
No need to iterate, you can do:
df2.index = df1['df1_col']
If you really want to iterate, save it to a list and set the index.
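For the iterative variant, that could look like the sketch below (df1_col and df2 are the names from the question; the zip filenames and the success condition are made up to stand in for "unzipping worked"):

```python
import pandas as pd

df1 = pd.DataFrame({'df1_col': ['a.zip', 'b.zip', 'c.zip']})

index_values = []
for _, row in df1.iterrows():
    # hypothetical condition standing in for a successful unzip
    if row['df1_col'] != 'b.zip':
        index_values.append(row['df1_col'])

# build the index once, after the loop
df2 = pd.DataFrame(index=pd.Index(index_values, name='df2_col'))
```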