I have a DataFrame, and a Series of the same vertical size as the DataFrame. I want to assign
that Series to ALL columns of the DataFrame.
What is the natural way to do it?
For example
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
ser = pd.Series([1, 2, 3])
I want all columns of "df" to be equal to "ser".
PS Related: one way to solve it is via the answer to "How to assign dataframe[ boolean Mask] = Series - make it row-wise ?", i.e. where Mask is true, take values from the same row of the Series (creating an all-true mask), but I guess there should be some simpler way.
If I need NOT all, but only SOME columns, the answer is given here:
Assign a Series to several Rows of a Pandas DataFrame
Use to_frame with reindex:
a = ser.to_frame().reindex(columns=df.columns, method='ffill')
print(a)
0 1
0 1 1
1 2 2
2 3 3
But the solution from the comments seems even simpler; the columns parameter is added there in case you need the same column order as the original frame:
df = pd.DataFrame({c:ser for c in df.columns}, columns=df.columns)
Maybe a different way to look at it:
df = pd.concat([ser] * df.shape[1], axis=1)
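For completeness, a minimal sketch (reusing df and ser from the question) checking that all three approaches produce the same frame; the column rename for the concat variant is only needed when df has real column names:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
ser = pd.Series([1, 2, 3])

# reindex a one-column frame across all of df's columns
a = ser.to_frame().reindex(columns=df.columns, method='ffill')

# build the frame column by column from the series
b = pd.DataFrame({c: ser for c in df.columns}, columns=df.columns)

# concatenate the series side by side, once per column
c = pd.concat([ser] * df.shape[1], axis=1)
c.columns = df.columns

assert a.equals(b) and b.equals(c)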
Using pandas, I open some csv files in a loop and set the index to the cycleID column, except the cycleID column is not unique. See below:
for filename in all_files:
    abfdata = pd.read_csv(filename, index_col=None, header=0)
    abfdata = abfdata.set_index("cycleID", drop=False)  # keep cycleID as a column too
    for index, row in abfdata.iterrows():
        print(row['cycleID'], row['mean'])
This prints the 2 columns (cycleID and mean) of the dataframe I am interested in for further computations:
1 1.5020712104685252e-11
1 6.56683605063102e-12
2 1.3993315187144084e-11
2 -8.670502467042485e-13
3 7.0270625256163566e-12
3 9.509995221868016e-12
4 1.2901435995915644e-11
4 9.513106448422182e-12
The objective is to use the rows corresponding to the same cycleID and calculate the difference between the mean column values. So, if there are 8 rows in the table, the final array or list would store 4 values.
I want to make it scalable as well where there can be 3 or more rows with the same cycleIDs. In that case, each cycleID could have 2 or more mean differences.
Update: Instead of creating a new question about it, I thought I'd add it here.
I used the diff and groupby approach as mentioned in the solution. It works great, but I have the extra need to save one of the mean values (odd row or even row, it doesn't matter) in a new column and make that part of the new data frame as well. How do I do that?
You can use GroupBy.diff and drop each group's leading NaN:
s2 = df.groupby(['cycleID'])['mean'].diff()
s2.dropna(inplace=True)
Output:
1   -8.453876e-12
3   -1.486037e-11
5    2.482933e-12
7   -3.388330e-12
Name: mean, dtype: float64
UPDATE
d = [[1, 1.5020712104685252e-11],
[1, 6.56683605063102e-12],
[2, 1.3993315187144084e-11],
[2, -8.670502467042485e-13],
[3, 7.0270625256163566e-12],
[3, 9.509995221868016e-12],
[4, 1.2901435995915644e-11],
[4, 9.513106448422182e-12]]
df = pd.DataFrame(d, columns=['cycleID', 'mean'])
df2 = df.groupby(['cycleID']).diff().dropna().rename(columns={'mean': 'difference'})
df2['mean'] = df['mean'].iloc[df2.index]  # positions 1, 3, 5, 7: the second row of each pair
difference mean
1 -8.453876e-12 6.566836e-12
3 -1.486037e-11 -8.670502e-13
5 2.482933e-12 9.509995e-12
7 -3.388330e-12 9.513106e-12
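The same groupby/diff approach also covers the scalable case from the question: with three or more rows per cycleID, diff simply yields one difference per consecutive pair within each group. A quick sketch with made-up values:
import pandas as pd

# hypothetical data: three rows per cycleID
df3 = pd.DataFrame({'cycleID': [1, 1, 1, 2, 2, 2],
                    'mean': [1.0, 3.0, 6.0, 10.0, 8.0, 5.0]})

print(df3.groupby('cycleID')['mean'].diff().dropna())
# 1    2.0
# 2    3.0
# 4   -2.0
# 5   -3.0
# Name: mean, dtype: float64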
I'm working in Python with a pandas DataFrame of video games, each with a genre. I'm trying to remove any video game with a genre that appears less than some number of times in the DataFrame, but I have no clue how to go about this. I did find a StackOverflow question that seems to be related, but I can't decipher the solution at all (possibly because I've never heard of R and my memory of functional programming is rusty at best).
Help?
Use groupby filter:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
1 1 4
2 5 6
In [13]: df.groupby("A").filter(lambda x: len(x) > 1)
Out[13]:
A B
0 1 2
1 1 4
I recommend reading the split-apply-combine section of the docs.
A solution with better performance is GroupBy.transform with 'size': it returns the count of each group as a Series the same length as the original df, so you can filter with boolean indexing:
df1 = df[df.groupby("A")['A'].transform('size') > 1]
Or use Series.map with Series.value_counts:
df1 = df[df['A'].map(df['A'].value_counts()) > 1]
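A self-contained check that both variants keep the same rows as the filter solution above (the threshold of 1 is just the example's; swap in your own):
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

# transform('size') returns the group size aligned to every row
out1 = df[df.groupby('A')['A'].transform('size') > 1]

# value_counts gives a count per value; map aligns it to every row
out2 = df[df['A'].map(df['A'].value_counts()) > 1]

print(out1.equals(out2))  # True: both keep the two rows with A == 1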
@jezrael's solution works very well. Here is a different approach, filtering based on value counts:
For example, if the dataset is :
df = pd.DataFrame({'a': [1,2,3,3,1,6], 'b': [11,2,33,4,55,6]})
Convert and save the counts as a dictionary:
count_freq = dict(df['a'].value_counts())
Create a new column as a copy of the target column, then map the dictionary onto it:
df['count_freq'] = df['a']
df['count_freq'] = df['count_freq'].map(count_freq)
Now we have a new column with the count frequency; you can define a threshold and filter easily with this column:
df[df.count_freq > 1]
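On the example frame above, this keeps the rows whose value in 'a' occurs more than once (a quick check, same data as above):
print(df[df.count_freq > 1])
#    a   b  count_freq
# 0  1  11           2
# 2  3  33           2
# 3  3   4           2
# 4  1  55           2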
Additionally, in case one wants to filter and keep a 'count' column:
attr = 'A'
limit = 10
df2 = df.groupby(attr)[attr].agg(count='count')
df2 = df2.loc[df2['count'] > limit].reset_index()
print(df2)
# outputs rows where the grouped 'A' count > 10, with columns: index, count, A
I might be a little late to this party but:
df = pd.DataFrame(df_you_have.groupby(['IdA', 'SomeOtherA'])['theA_you_want_to_count'].count())
df.reset_index(inplace=True)
This is how you create a new dataframe and then just filter it...
df[df['A']>100]
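A runnable version of that pattern with made-up data; the frame and column names below are purely illustrative stand-ins for the placeholders in the snippet:
import pandas as pd

# illustrative stand-in for df_you_have
df_you_have = pd.DataFrame({'IdA': [1, 1, 1, 2],
                            'SomeOtherA': ['x', 'x', 'x', 'y'],
                            'A': [10, 20, 30, 40]})

counts = pd.DataFrame(df_you_have.groupby(['IdA', 'SomeOtherA'])['A'].count())
counts.reset_index(inplace=True)

print(counts[counts['A'] > 2])  # keep groups seen more than twice
#    IdA SomeOtherA  A
# 0    1          x  3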
I would like to find the numeric difference between two or more columns of two different dataframes. The starting table (Table 1) and Table 2, which holds the single row of values I need to subtract from Table 1, were posted as images. I would like to get a third table with the numeric differences between each row of Table 1 and the single row of Table 2. Any help?
Try
df.subtract(df2.values)
with df being your starting table and df2 being Table 2.
Can you try this and see if this is what you need:
import pandas as pd
df = pd.DataFrame({'A':[5, 3, 1, 2, 2], 'B':[2, 3, 4, 2, 2]})
df2 = pd.DataFrame({'A':[1], 'B':[2]})
pd.DataFrame(df.values-df2.values, columns=df.columns)
Out:
A B
0 4 0
1 2 1
2 0 2
3 1 0
4 1 0
You can just do df1 - df2.values, as below. This uses NumPy broadcasting to subtract df2 from all rows, but df2 must have only one row.
Example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(15).reshape(-1, 3), columns="A B C".split())
df2 = pd.DataFrame(np.ones(3).reshape(-1, 3), columns="A B C".split())
df1 - df2.values
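A related pandas-native spelling, if you prefer to avoid raw NumPy: subtracting a Series from a DataFrame aligns on the column labels and broadcasts across rows, so taking df2's single row as a Series works too (a sketch, same frames as above):
# df2.iloc[0] is a Series indexed by the column labels A, B, C;
# DataFrame minus Series aligns on columns and broadcasts row-wise
result = df1 - df2.iloc[0]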
Sorry for editing the question again, but as I dug deeper into it I realized that it boils down to the question of whether I can access the values of a column and the values of a row index in the same way. This would seem quite natural to me, as the row index and a column are actually very similar entities.
For example, if I define a DataFrame with a two-level rows multi-index like that:
df = pd.DataFrame(data=None, index=pd.MultiIndex.from_product([['A', 'B'], [1, 2]], names=['X', 'Y']))
df.insert(loc=0, column='DATA', value=[1, 2, 3, 4])
Which gives
DATA
X Y
A 1 1
2 2
B 1 3
2 4
To access column values I can, e.g., use df.DATA or df.loc[:, 'DATA']. Consequently, to select all rows where DATA is 2, I can do df.loc[df.DATA == 2, :] or df.loc[df.loc[:, 'DATA'] == 2, :].
However, the same does not work for, say, the index column Y: neither df.Y nor df.loc[:, 'Y'] works. Therefore I can't select rows based on index values as above: df.loc[df.Y == 2, :] and df.loc[df.loc[:, 'Y'] == 2, :] both fail.
Which is a pity, as it requires writing different code depending on whether the column is a normal column or part of the index. Or is there another way to do this that works for both columns and indexes?
If you want to call .loc[:, 'Y'], reset the index and then call it, i.e.
df.reset_index().loc[:,'Y']
Output:
0 1
1 2
2 1
3 2
Name: Y, dtype: object
If you want to select the data based on a condition, then
df.reset_index()[df.reset_index().Y == 2].set_index(['X','Y'])
Output:
DATA
X Y
A 2 2
B 2 4
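An alternative that avoids the round trip through reset_index: Index.get_level_values pulls one level of the MultiIndex as a plain index, so you can build a boolean mask on it directly (standard pandas API, shown here on the df from the question):
# boolean mask built from the 'Y' index level, no reset_index needed
df.loc[df.index.get_level_values('Y') == 2]
#      DATA
# X Y
# A 2     2
# B 2     4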
I'm trying to re-insert into a pandas dataframe a column that I extracted and whose order I changed by sorting it.
Very simply, I have extracted a column from a pandas df:
col1 = df.col1
This column contains integers, and I used the .sort() method to order it from smallest to largest, then did some operations on the data.
col1.sort()
#do stuff that changes the values of col1.
Now the indexes of col1 are the same as the indexes of the overall df, but in a different order.
I was wondering how I can insert the column back into the original dataframe (replacing the col1 that is there at the moment)
I have tried both of the following methods:
1)
df.col1 = col1
2)
df.insert(column_index_of_col1, "col1", col1)
but both methods give me the following error:
ValueError: cannot reindex from a duplicate axis
Any help will be greatly appreciated.
Thank you.
Consider this DataFrame:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])
df
Out:
A B
0 1 6
0 2 5
1 3 4
Assign the second column to b, sort it, and take the square, for example:
b = df['B']
b = b.sort_values()
b = b**2
Now b is:
b
Out:
1 16
0 25
0 36
Name: B, dtype: int64
Without knowing the exact operation you've done on the column, there is no way to know whether 25 corresponds to the first row of the original DataFrame or the second. You could invert the operation (take the square root and match, for example), but I think that would be unnecessary. If you start with an index that has unique elements (df = df.reset_index()), it becomes much easier. In that case,
df['B'] = b
should work just fine.
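A minimal sketch of the full round trip under that assumption: make the index unique, do the sorting and transformation, and assign back. Pandas aligns the assignment on the (now unique) index, so the row order of b does not matter (drop=True here simply discards the old duplicated index rather than keeping it as a column):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])
df = df.reset_index(drop=True)   # make the index unique first

b = df['B'].sort_values() ** 2   # sort, then transform
df['B'] = b                      # aligns on the unique index; no ValueError

print(df)
#    A   B
# 0  1  36
# 1  2  25
# 2  3  16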