I need to fill a pandas dataframe column with empty numpy arrays. I mean that every row has to be an empty array. Something like
df['ColumnName'] = np.empty(0,dtype=float)
but this doesn't work, because pandas tries to broadcast the array and assign one value per row.
I tried then
for k in range(len(df)):
    df['ColumnName'].iloc[k] = np.empty(0, dtype=float)
but still no luck. Any advice?
You can repeat the np.empty once per row and then assign the resulting list to the column. Since an array isn't a scalar, it can't be assigned directly the way df['x'] = some_scalar is.
df = pd.DataFrame({'a':[0,1,2]})
df['c'] = [np.empty(0,dtype=float)]*len(df)
Output:
   a   c
0  0  []
1  1  []
2  2  []
You can also use a simple list comprehension:
df = pd.DataFrame({'a':[0,1,2]})
df['c'] = [[] for i in range(len(df))]
Output:
   a   c
0  0  []
1  1  []
2  2  []
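One caveat (not in the original answers): the * len(df) trick puts the same array object in every row. If each row must hold its own independent numpy array, a comprehension can build a fresh one per row:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2]})
# a new empty float array per row, so no two rows share an object
df['c'] = [np.empty(0, dtype=float) for _ in range(len(df))]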
Related
I am trying to select a range of rows within a data frame based on their values. I have the logic worked out in Excel and just need to translate it into a Python script. I need to return a range of rows starting where a value appears in Column A and ending where Column B contains that same value. Example below:
index  A        B        output range
0      dsdfsdf
1
2
3
4      quwfi
5               dsdfsdf  0:5
6               quwfi    4:6
One thing to note: the value in Column B will always appear lower down the list than its match in Column A.
So far I have tried to just grab the index of the Column A match and put it in the output range row for Column B using:
df['output range'] = np.where(df['B'] != "", (df.index[df['A'] == df.at[df['B']].value]))
This gives me a ValueError: Invalid call for scalar access (getting)!
Removing the np.where portion of it does not change the result
This should give you the required behavior:
df = pd.DataFrame({'A': ['dsdfsdf','','','','quwfi','',''],'B': ['','','','','','dsdfsdf','quwfi']})
def get_range(x):
    if x != '':
        first_index = df[df['A'] == x].index.values[0]
        current_index = df[df['B'] == x].index.values[0]
        return f"{first_index}:{current_index}"
    return ''
df['output range']= df['B'].apply(get_range)
df
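On the sample frame above, the new column comes out as follows (derived from the question's table):

df['output range'].tolist()
# ['', '', '', '', '', '0:5', '4:6']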
I'm facing a challenge in a python/pandas script.
My data is a gene expression table, which is organized as follow:
Basically, index 0 contains the two conditions studied, while index 1 has the information about the genes identified across the samples.
Then, I would like to produce a table with indexes 0 and 1 merged together, as follows:
I've tried a lot of things, such as generating a list from index 0 to join onto index 1...
Save me, guys, please!
Thank you
Assuming your first set of column names is in row 0 and your second set is in row 1, try this:
df.columns = [f'{c1}.{c2}'.strip('.') for c1,c2 in zip(df.loc[0], df.loc[1])]
df.loc[2:]
The two header rows should now be merged into single dotted column names.
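Since the original screenshots are not reproduced here, a minimal sketch with an assumed layout (the condition and sample names below are invented) shows the effect:

import pandas as pd

# assumed layout: row 0 holds the condition labels, row 1 the gene/sample labels
df = pd.DataFrame([
    ['', 'Condition1', 'Condition2'],        # row 0: conditions
    ['Gene name', 'Sample 1', 'Sample 5'],   # row 1: gene/sample info
    ['HK1', 1213, 0],                        # first data row
])
df.columns = [f'{c1}.{c2}'.strip('.') for c1, c2 in zip(df.loc[0], df.loc[1])]
df = df.loc[2:]
print(df.columns.tolist())
# ['Gene name', 'Condition1.Sample 1', 'Condition2.Sample 5']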
According to OP's comment, I changed the add_suffix function.
Construct the dataframe:
s1 = "Gene name,Description,Foldchange,Anova,Sample 1,Sample 2,Sample 3,Sample 4,Sample 5,Sample 6".split(",")
s2 = "HK1,Hexokinase,Infinity,0.05,1213,1353,14356,0,0,0".split(",")
df = pd.DataFrame(s2).T
df.columns = s1
Define a function (adjust it to fit other situations):
def add_suffix(x):
    try:
        flag = int(x[-1])
    except ValueError:
        return x
    if flag <= 4:
        return x + '.Condition1'
    else:
        return x + '.Condition2'
And then assign the columns:
cols = df.columns.to_series().apply(add_suffix)
df.columns = cols
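With the sample frame constructed above, the renamed columns come out as:

print(df.columns.tolist())
# ['Gene name', 'Description', 'Foldchange', 'Anova',
#  'Sample 1.Condition1', 'Sample 2.Condition1', 'Sample 3.Condition1',
#  'Sample 4.Condition1', 'Sample 5.Condition2', 'Sample 6.Condition2']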
I have the following data frame of the form:
1 2 3 4 5 6 7 8
A C C T G A T C
C A G T T A D N
Y F V H Q A F D
I need to randomly select a column k times where k is the number of columns in the given sample. My program creates a list of empty lists of size k and then randomly selects a column from the dataframe to be appended to the list. Each list must be unique and cannot have duplicates.
From the above example dataframe, an expected output should be something like:
[[2][4][6][1][7][3][5][8]]
However I am obtaining results like:
[[1][1][3][6][7][8][8][2]]
What is the most pythonic way to go about doing this? Here is my sorry attempt:
k = len(df.columns)
k_clusters = [[] for i in range(k)]
for i in range(len(k_clusters)):
    for j in range(i + 1, len(k_clusters)):
        k_clusters[i].append(df.sample(1, axis=1))
        if k_clusters[i] == k_clusters[j]:
            k_clusters[j].pop(0)
            k_clusters[j].append(df.sample(1, axis=1))
Aside from the shuffling step, your question is very similar to How to change the order of DataFrame columns?. Shuffling can be done in any number of ways in Python:
cols = np.array(df.columns)
np.random.shuffle(cols)
Or using the standard library:
import random

cols = list(df.columns)
random.shuffle(cols)
You do not want to do cols = df.columns.values, because that will give you write access to the underlying column name data. You will then end up shuffling the column names in-place, messing up your dataframe.
Rearranging your columns is then easy:
df = df[cols]
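As a side note (not from the original answer), pandas itself can do the shuffle in one step by sampling all columns without replacement:

# frac=1 samples every column exactly once, in random order;
# pass random_state=... for a reproducible shuffle
df = df.sample(frac=1, axis=1)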
You can use numpy.random.shuffle to shuffle just the column indexes, which, judging from your question, is what I assume you want to do.
An example:
import numpy as np
to_shuffle = np.array(df.columns)
np.random.shuffle(to_shuffle)
print(to_shuffle)
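To apply the shuffled order back to the frame, reindex with the shuffled labels just as in the previous answer:

df = df[to_shuffle]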
I have several dataframes in a list, obtained after using np.array_split, and I want to concat some of them into a single dataframe. In this example, I want to concat the 3 dataframes contained in b (all of a except its 2nd element, a[1]):
df = pd.DataFrame({'country':['a','b','c','d'],
'gdp':[1,2,3,4],
'iso':['x','y','z','w']})
a = np.array_split(df,4)
i = 1
b = a[:i]+a[i+1:]
desired_final_df = pd.DataFrame({'country':['a','c','d'],
'gdp':[1,3,4],
'iso':['x','z','w']})
I have tried creating an empty df and then appending the elements of b in a loop, but without complete success:
CV = pd.DataFrame()
CV = [CV.append[(b[i])] for i in b] #try1
CV = [CV.append(b[i]) for i in b] #try2
CV = pd.DataFrame([CV.append[(b[i])] for i in b]) #try3
for i in b:
    CV.append(b) #try4
I have reached a solution which works, but it is not efficient:
CV = pd.DataFrame()
CV = [CV.append(b) for i in b][0]
In this case, CV contains the same dataframe (with all the rows) three times, and I just keep the first copy. However, in my real case, with big datasets, building the same thing three times costs much more computation.
How could I do that without repeating operations?
According to the docs, DataFrame.append does not work in-place the way list.append does. The resulting DataFrame object is returned instead, so catching that object should be enough for what you need:
df = pd.DataFrame()
for next_df in list_of_dfs:
    df = df.append(next_df)
You may want to use the keyword argument ignore_index=True in the append call so that the indices become continuous, instead of starting from 0 for each appended DataFrame (assuming that the index of the DataFrames in the list all start from 0).
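A sketch of that loop with the keyword in place (list_of_dfs stands in for your list b); note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the pd.concat approach in the next answer is the long-term option:

df = pd.DataFrame()
for next_df in list_of_dfs:
    df = df.append(next_df, ignore_index=True)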
To concatenate multiple DFs, resetting the index, use pandas.concat:
pd.concat(b, ignore_index=True)
Output:
  country  gdp iso
0       a    1   x
1       c    3   z
2       d    4   w
I am trying to apply a function to multiple columns and, in turn, create multiple columns that count the length of each entry.
Basically I have 5 columns, at indexes 5, 7, 9, 13 and 15, and each entry in those columns is a string of the form 'WrappedArray(|2008-11-12, |2008-11-12)'. In my function I try to strip the WrappedArray part, split the two values and count (length - 1) using the following:
def updates(row, num_col):
    strp = row[num_col].strip('WrappedAway')
    lis = list(strp.split(','))
    return len(lis) - 1
where num_col is the index of the column and can take the values 5, 7, 9, 13 and 15.
I have done this but only for 1 column:
fn = lambda row: updates(row,5)
col = df.apply(fn, axis=1)
df = df.assign(**{'count1':col.values})
I basically want to apply this function to ALL the columns with the indexes mentioned (not just 5 as above), and create a separate count column for each of columns 5, 7, 9, 13 and 15 in one short piece of code instead of doing it separately for each value.
I hope I made sense.
Regarding the number of elements in the list, it looks like you could simply use str.count() to count the occurrences of ',' in the strings. And in order to apply a defined function to a set of columns you could do something like:
cols = [5, 7, 9, 13, 15]
for col in cols:
    col_counts = {'{}_count'.format(col): df.iloc[:, col].apply(lambda x: x.count(','))}
    df = df.assign(**col_counts)
Alternatively you can also use strip('WrappedAway').split(',') as you were using:
def count_elements(x):
    return len(x.strip('WrappedAway').split(',')) - 1

for col in cols:
    col_counts = {'{}_count'.format(col): df.iloc[:, col].apply(count_elements)}
    df = df.assign(**col_counts)
So for example with the following dataframe:
df = pd.DataFrame({'A': ['WrappedArray(|2008-11-12, |2008-11-12, |2008-10-11)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
'B': ['WrappedArray(|2008-11-12,|2008-11-12)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
'C': ['WrappedArray(|2008-11-12|2008-11-12)', 'WrappedArray(|2008-11-12|2008-11-12)']})
Redefining the set of columns on which we want to count the amount of elements:
for col in [0, 1, 2]:
    col_counts = {'{}_count'.format(col): df.iloc[:, col].apply(count_elements)}
    df = df.assign(**col_counts)
Would yield:
A \
0 WrappedArray(|2008-11-12, |2008-11-12, |2008-1...
1 WrappedArray(|2008-11-12, |2008-11-12)
B \
0 WrappedArray(|2008-11-12,|2008-11-12)
1 WrappedArray(|2008-11-12, |2008-11-12)
C 0_count 1_count 2_count
0 WrappedArray(|2008-11-12|2008-11-12) 2 1 0
1 WrappedArray(|2008-11-12|2008-11-12) 1 1 0
You are confusing row-wise and column-wise operations by trying to do both in one function. Choose one or the other. Column-wise operations are usually more efficient and you can utilize Pandas str methods.
Setup
df = pd.DataFrame({'A': ['WrappedArray(|2008-11-12, |2008-11-12, |2008-10-11)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
'B': ['WrappedArray(|2008-11-12,|2008-11-12)', 'WrappedArray(|2008-11-12|2008-11-12)']})
Logic
# perform operations on strings in a series
def calc_length(series):
    return series.str.strip('WrappedAway').str.split(',').str.len() - 1
# apply to each column and join to original dataframe
df = df.join(df.apply(calc_length).add_suffix('_Length'))
Result
print(df)
A \
0 WrappedArray(|2008-11-12, |2008-11-12, |2008-1...
1 WrappedArray(|2008-11-12, |2008-11-12)
B A_Length B_Length
0 WrappedArray(|2008-11-12,|2008-11-12) 2 1
1 WrappedArray(|2008-11-12|2008-11-12) 1 0
I think we can use pandas str.count()
df = pd.DataFrame({
    "col1": ['WrappedArray(|2008-11-12, |2008-11-12)',
             'WrappedArray(|2018-11-12, |2017-11-12, |2018-11-12)'],
    "col2": ['WrappedArray(|2008-11-12, |2008-11-12,|2008-11-12,|2008-11-12)',
             'WrappedArray(|2018-11-12, |2017-11-12, |2018-11-12)']})
df["col1"].str.count(',')