I have a pandas.DataFrame df with one of the column headers being 'X'. Let's say it is of size (N, M); N=3, M=2 in this example:
X Y
0 1 a
1 2 b
2 3 c
I have a 1D numpy.array arr of size (Q,), that contains values, some of which are repeats. Q=5 in this example:
array([1, 2, 3, 2, 2])
I would like to create a new pandas.DataFrame df_op that contains rows from df, where each row's X value matches an entry from arr. This means some rows are extracted more than once, and the resulting df_op has size (Q, M). If possible, I would also like to keep the same order of entries as in arr.
X Y
0 1 a
1 2 b
2 3 c
3 2 b
4 2 b
Using the usual boolean indexing does not work, because that only picks up unique rows. I would also like to avoid loops if possible, because Q is large.
How can I get df_op? Thank you.
Use set_index with loc to select the same row multiple times:
x = [1, 2, 3, 2, 2]
df = pd.DataFrame({'X': [1, 2, 3], 'Y': ['a', 'b', 'c']})
out = df.set_index('X').loc[x].reset_index()
Output:
>>> out
X Y
0 1 a
1 2 b
2 3 c
3 2 b
4 2 b
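If you prefer not to change the index, a left merge gives the same result; building the left frame from arr preserves arr's order in the output (variable names here are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3], 'Y': ['a', 'b', 'c']})
arr = [1, 2, 3, 2, 2]

# Build a one-column frame from arr and left-merge; with how='left'
# the left frame's row order (the order of arr) is preserved.
out = pd.DataFrame({'X': arr}).merge(df, on='X', how='left')
print(out)
```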
I have a pandas DataFrame where the columns are named like:
0,1,2,3,4,.....,n
I would like to drop every 3rd column, so that I get a new DataFrame with columns like:
0,1,3,4,6,7,9,.....,n
I have tried this:
shape = df.shape[1]
for i in range(2, shape, 3):
    df = df.drop(df.columns[i], axis=1)
but I get an error saying the index is out of bounds, and I assume this happens because the shape of the DataFrame changes as I drop columns. If I just don't store the output of the for loop, the code runs, but then I don't get my new DataFrame.
How do I solve this?
Thanks
The issue with your code is that each time you drop a column in the loop, you end up with a different set of columns, because you write the result back to df after each iteration. When you then try to drop the next 3rd column of THAT new set of columns, you not only drop the wrong one, you eventually run out of columns. That's why you get the error you are getting.
iter1 -> 0,1,3,4,5,6,7,8,9,10 ... n #first you drop 2 which is 3rd col
iter2 -> 0,1,3,4,5,7,8,9,10 ... n #next you drop 6 which is 6th col (should be 5)
iter3 -> 0,1,3,4,5,7,8,9, ... n #next you drop 10 which is 9th col (should be 8)
What you want to do instead is compute the indexes beforehand and remove the columns in one go. You can get the indexes of the columns to remove with range and then drop them all at once:
drop_idx = list(range(2,df.shape[1],3)) #Indexes to drop
df2 = df.drop(drop_idx, axis=1) #Drop them at once over axis=1
print('old columns->', list(df.columns))
print('idx to drop->', drop_idx)
print('new columns->',list(df2.columns))
old columns-> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
idx to drop-> [2, 5, 8]
new columns-> [0, 1, 3, 4, 6, 7, 9]
Note: This works only because your column names are the same as their positional indexes. If your column names are different, you will need an extra step of fetching the column names based on the indexes you want to drop:
drop_idx = list(range(2,df.shape[1],3))
drop_cols = [j for i,j in enumerate(df.columns) if i in drop_idx] #<--
df2 = df.drop(drop_cols, axis=1)
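A quick sketch of that extra step, using hypothetical string column names:

```python
import pandas as pd

# Hypothetical frame whose column names are strings, not positions
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list('abcdef'))

drop_idx = list(range(2, df.shape[1], 3))                            # positions [2, 5]
drop_cols = [j for i, j in enumerate(df.columns) if i in drop_idx]   # names ['c', 'f']
df2 = df.drop(drop_cols, axis=1)
print(list(df2.columns))
```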
Here is a solution with inverted logic: select all columns except every 3rd one.
Build a helper array of column positions, add 1, keep the positions where the result modulo 3 is not equal to 0, and pass the boolean mask to DataFrame.loc:
import numpy as np
import pandas as pd

df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
df = df.loc[:, (np.arange(len(df.columns)) + 1) % 3 != 0]
print (df)
A B D E
0 a 4 1 5
1 b 5 3 3
2 c 4 5 6
3 d 5 7 9
4 e 5 1 2
5 f 4 0 4
You can use a list comprehension to filter columns:
df = df[[k for k in df.columns if (k + 1) % 3 != 0]]
If the names are different (e.g. strings) and you want to discard every 3rd column regardless of its name, then:
df = df[[k for i, k in enumerate(df.columns, 1) if i % 3 != 0]]
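A small sketch of both variants, on hypothetical frames with integer and string column names respectively:

```python
import pandas as pd

# Integer column names, matching the first variant above
df_int = pd.DataFrame([[0] * 7], columns=range(7))
out_int = df_int[[k for k in df_int.columns if (k + 1) % 3 != 0]]

# String column names, matching the enumerate variant
df_str = pd.DataFrame([[0] * 6], columns=list('abcdef'))
out_str = df_str[[k for i, k in enumerate(df_str.columns, 1) if i % 3 != 0]]

print(list(out_int.columns), list(out_str.columns))
```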
I have a pandas DataFrame which has only one column; the value of each cell in that column is a list/array of numbers. Each list has length 100, and this length is consistent across all cells.
We need to convert each list element into its own column value, in other words produce a DataFrame with 100 columns where each column holds one list/array item.
Something like this
becomes
It can be done with iterrows() as shown below, but we have around 1.5 million rows and need a scalable solution, as iterrows() would take a lot of time.
cols = [f'col_{i}' for i in range(0, 4)]
df_inter = pd.DataFrame(columns=cols)
for index, row in df.iterrows():
    df_inter.loc[len(df_inter)] = row['message']
You can do this:
In [28]: df = pd.DataFrame({'message':[[1,2,3,4,5], [3,4,5,6,7]]})
In [29]: df
Out[29]:
message
0 [1, 2, 3, 4, 5]
1 [3, 4, 5, 6, 7]
In [30]: res = pd.DataFrame(df.message.tolist(), index= df.index)
In [31]: res
Out[31]:
0 1 2 3 4
0 1 2 3 4 5
1 3 4 5 6 7
I think this would work:
df.message.apply(pd.Series)
To use dask to scale (assuming it is installed):
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=8)
ddf.message.apply(pd.Series, meta={0: 'object'})
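Another scalable sketch, without dask: since all the lists have the same length, they can be stacked into a single 2-D NumPy array and wrapped once (assuming numeric values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'message': [[1, 2, 3, 4, 5], [3, 4, 5, 6, 7]]})

# np.vstack builds one (n_rows, list_length) array from the object
# column, avoiding per-row Python work inside pandas
res = pd.DataFrame(np.vstack(df['message'].to_numpy()), index=df.index)
print(res)
```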
This is my dataframe:
df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 3, 4],
                   'col2': [1, 3, 2, 4, 6, 5, 7]})
I want to recode values based on how often they appear in my dataset; here I want to relabel every value which occurs only once to "other". This is the desired output:
#desired
"col1": [1,1,1,2,2,"other", "other"]
I tried this but it did not work:
df["recoded"] = np.where(df["col1"].value_counts() > 1, df["col1"], "other")
My idea is to save the value counts, filter them, and then loop over the result array, but this seems overly complicated. Is there an easy "pythonic/pandas" way to achieve this?
You are close - you need Series.map to get a Series of counts with the same length as the original DataFrame:
df["recoded"] = np.where(df["col1"].map(df["col1"].value_counts()) > 1, df["col1"], "other")
Or use GroupBy.transform with 'size' to count the values per group:
df["recoded"] = np.where(df.groupby('col1')["col1"].transform('size') > 1,
df["col1"],
"other")
If you want to test for duplicates directly, use Series.duplicated with keep=False to get a mask of all duplicated values:
df["recoded"] = np.where(df["col1"].duplicated(keep=False), df["col1"], "other")
print (df)
col1 col2 recoded
0 1 1 1
1 1 3 1
2 1 2 1
3 2 4 2
4 2 6 2
5 3 5 other
6 4 7 other
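A pandas-only sketch of the duplicated approach, using Series.where instead of np.where so the kept values stay untouched (no NumPy type promotion):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 3, 4],
                   'col2': [1, 3, 2, 4, 6, 5, 7]})

# Keep values that appear more than once; replace the rest with "other"
df['recoded'] = df['col1'].where(df['col1'].duplicated(keep=False), 'other')
print(df['recoded'].tolist())
```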
Each column of the DataFrame needs its values normalized by the value of the first element in that column.
for timestamp, prices in data.items():
    normalizedPrices = prices / prices.iloc[0]
    print(normalizedPrices)  # how do we update the DataFrame with this Series?
However, how do we update the DataFrame once we have created the normalized column of data? I believe if we do prices = normalizedPrices we are merely rebinding a local name to a copy/view rather than modifying the original DataFrame.
It might be simplest to normalize the entire DataFrame in one go (and avoid looping over rows/columns altogether):
>>> df = pd.DataFrame({'a': [2, 4, 5], 'b': [3, 9, 4]}, dtype=float) # a DataFrame
>>> df
a b
0 2 3
1 4 9
2 5 4
>>> df = df.div(df.loc[0]) # normalise DataFrame and bind back to df
>>> df
a b
0 1.0 1.000000
1 2.0 3.000000
2 2.5 1.333333
Assign to data[col]:
for col in data:
data[col] /= data[col].iloc[0]
Or in one vectorized step, using NumPy broadcasting to divide every row by the first row:
data[0:] = data[0:].values / data[0:1].values
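A minimal end-to-end check of the broadcasting approach, on hypothetical data:

```python
import pandas as pd

data = pd.DataFrame({'a': [2.0, 4.0, 5.0], 'b': [3.0, 9.0, 4.0]})

# Divide every row by the first row; NumPy broadcasts the (1, M)
# first-row slice across the whole (N, M) array of values
data[0:] = data[0:].values / data[0:1].values
print(data)
```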