I have a dataframe in pandas where each column has a different value range. For example:
My desired output is:
First, set a 2-level index, then unstack on that MultiIndex and, finally, rename the columns.
import pandas as pd

df = pd.DataFrame({'axis_x': [0, 1, 2, 0, 1, 2, 0, 1, 2], 'axis_y': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'data': ['diode', 'switch', 'coil', '$2.2', '$4.5', '$3.2', 'colombia', 'china', 'brazil']})
df = df.set_index(['axis_x', 'axis_y']).unstack().rename(columns={0: 'product', 1: 'price', 2: 'country'})
print(df)
Prints:
           data
axis_y  product price   country
axis_x
0         diode  $2.2  colombia
1        switch  $4.5     china
2          coil  $3.2    brazil
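For reference, the same reshape can also be written with pivot applied to the original three-column frame (a sketch of an equivalent call, not part of the original answer; the only difference is that the columns are not nested under a 'data' level):

df2 = df.pivot(index='axis_x', columns='axis_y', values='data')
df2 = df2.rename(columns={0: 'product', 1: 'price', 2: 'country'})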
I am trying to count the frequencies of an array.
I've read this post; I am using a DataFrame and get a Series.
>>> a = np.array([1, 1, 5, 0, 1, 2, 2, 0, 1, 4])
>>> df = pd.DataFrame(a, columns=['a'])
>>> b = df.groupby('a').size()
>>> b
a
0 2
1 4
2 2
4 1
5 1
dtype: int64
>>> b.iloc[:,-1]
When I try to get the last column, I get this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pan/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1472, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/pan/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 2013, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/Users/pan/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 220, in _has_valid_tuple
    raise IndexingError('Too many indexers')
pandas.core.indexing.IndexingError: Too many indexers
How do I get the last column of b?
Since pandas.Series is a

One-dimensional ndarray with axis labels

it has no columns to select from. If you want just the frequencies, i.e. the values of your series, use:
b.tolist()
or, alternatively:
b.to_dict()
to keep both labels and frequencies.
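For the series above, these give (values taken from the example output):

>>> b.tolist()
[2, 4, 2, 1, 1]
>>> b.to_dict()
{0: 2, 1: 4, 2: 2, 4: 1, 5: 1}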
P.S.:
For your specific task, also consider the collections package:
>>> from collections import Counter
>>> a = [1, 1, 5, 0, 1, 2, 2, 0, 1, 4]
>>> c = Counter(a)
>>> list(c.values())
[4, 1, 2, 2, 1]
(the counts appear in order of first occurrence of each value, not in sorted key order)
The problem is that the output of GroupBy.size is a Series, and a Series has no columns, so only the last value can be selected:
b.iloc[-1]
If you use:

b.iloc[:,-1]

it returns the last column of a DataFrame: the : means all rows, and the -1 in the second position means the last column.
So if you create a DataFrame from the Series:

b1 = df.groupby('a').size().reset_index(name='count')

it works as expected.
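For completeness, here is what that DataFrame looks like and how the column indexer then behaves (output reproduced from the sample data above):

b1 = df.groupby('a').size().reset_index(name='count')
print(b1)
#    a  count
# 0  0      2
# 1  1      4
# 2  2      2
# 3  4      1
# 4  5      1

print(b1.iloc[:, -1])   # two axes now, so selecting the last column works
# 0    2
# 1    4
# 2    2
# 3    1
# 4    1
# Name: count, dtype: int64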
I have a problem that has to be solved as efficiently as possible. My current approach kind of works, but is extremely slow.
I have a dataframe with multiple columns; in this case I only care about one of them. It contains positive continuous numbers and some zeros.
My goal is to find the row where nearly no zeros appear in the following rows.
To make clear what I mean I wrote this example to replicate my problem:
import pandas as pd

df = pd.DataFrame([0,0,0,0,1,0,1,0,0,2,0,0,0,1,1,0,1,2,3,4,0,4,0,5,1,0,1,2,3,4,
                   0,0,1,2,1,1,1,1,2,2,1,3,6,1,1,5,1,2,3,4,4,4,3,5,1,2,1,2,3,4],
                  index=pd.date_range('2018-01-01', periods=60, freq='15T'))
There are some zeros at the beginning, but they get less after some time.
Here comes my unoptimized code to visualize the number of zeros:
zerosum = 0  # counter for all zeros that have appeared so far
for i in range(len(df)):
    if df[0][i] == 0.0:
        df.loc[df.index[i], 'zerosum'] = zerosum
        zerosum += 1
    else:
        df.loc[df.index[i], 'zerosum'] = zerosum
df['zerosum'].plot()
With that unoptimized code I can see the distribution of zeros over time.
My expected output in this example would be the date 01-Jan-2018 08:00, because no zeros appear after that date.
The problem I have when dealing with my real data is that some single zeros can appear later. Therefore I can't just pick the last row that contains a zero. I have to somehow inspect the distribution of zeros and ignore later outliers.
Note: The visualization is not necessary to solve my problem; I just included it to explain my problem as well as possible. Thanks
OK, second go:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame([0,0,0,0,1,0,1,0,0,2,0,0,0,1,1,0,1,2,3,4,0,4,0,5,1,0,1,2,3,4,
0,0,1,2,1,1,1,1,2,2,1,3,6,1,1,5,1,2,3,4,4,4,3,5,1,2,1,2,3,4],
index=pd.date_range('2018-01-01', periods=60, freq='15T'),
columns=['values'])
We create a column that contains the rank of each zero, and 0 for non-zero values:
df['zero_idx'] = np.where(df['values']==0,np.cumsum(np.where(df['values']==0,1,0)), 0)
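To see what this produces, here are the first few rows of the sample frame after adding the column (a quick check, not part of the original answer; non-zero rows get 0, zeros get their running count):

print(df.head(10))
#                      values  zero_idx
# 2018-01-01 00:00:00       0         1
# 2018-01-01 00:15:00       0         2
# 2018-01-01 00:30:00       0         3
# 2018-01-01 00:45:00       0         4
# 2018-01-01 01:00:00       1         0
# 2018-01-01 01:15:00       0         5
# 2018-01-01 01:30:00       1         0
# 2018-01-01 01:45:00       0         6
# 2018-01-01 02:00:00       0         7
# 2018-01-01 02:15:00       2         0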
We can use this column to get the location of a zero of any rank. I don't know what your criterion is for calling a zero an outlier, but let's say we want to make sure that we are past at least 90% of all zeros:
# Total number of zeros
n_zeros = max(df['zero_idx'])
# Get past at least this percentage
tolerance = 0.9
# The rank of the abovementioned zero
rank_tolerance = math.ceil(tolerance * n_zeros)
df[df['zero_idx']==rank_tolerance].index
Out[44]: DatetimeIndex(['2018-01-01 07:30:00'], dtype='datetime64[ns]', freq='15T')
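If you then want everything from that point on, you could slice at that timestamp (a usage sketch, not part of the original answer):

cutoff = df[df['zero_idx'] == rank_tolerance].index[0]
print(df.loc[cutoff:].head())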
Okay, if you need to get the index after the last zero occurred, you can try this:
last = 0
for i in range(len(df)):
    if df[0][i] == 0:
        last = i
print(df.iloc[last + 1])
Or by filtering:
new = df.loc[df[0]==0]
last = df.index.get_loc(new.index[-1])
print(df.iloc[last+1])
Here is my solution using a filter and a cumsum. The cumulative count of zeros reaches its maximum at the last zero and stays there, so index[1] of the filtered frame is the first row after that zero:
df = pd.DataFrame([0, 0, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1, 2, 3, 4, 0, 4, 0, 5, 1, 0, 1, 2, 3, 4,
0, 0, 1, 2, 1, 1, 1, 1, 2, 2, 1, 3, 6, 1, 1, 5, 1, 2, 3, 4, 4, 4, 3, 5, 1, 2, 1, 2, 3, 4],
index=pd.date_range('2018-01-01', periods=60, freq='15T'))
a = df[0] == 0
df['zerosum'] = a.cumsum()
maxval = max(df['zerosum'])
firstdate = df[df['zerosum'] == maxval].index[1]
print(firstdate)
output:
2018-01-01 08:00:00
I used the following way to read my data:
df=pd.read_csv("file.dat",delim_whitespace=True,header=None,skiprows=None)
df.head()
and then I obtained:
0, 1, 0, 1, 0, 0, 0, 0, 0, 0.5
1, 1, 0, 1, 0, 0, 0, 0, 0, 1.5
...
All the columns (except the last one) contain the number followed by a ','. However, I just need the numeric value (without the commas) in each column. How should I read the table?
Your option delim_whitespace=True is equivalent to sep='\s+'. If your file has values separated by commas, omit delim_whitespace=True; you don't need sep=',' either, as it is the default value.
try:
df = pd.read_csv("file.dat", header=None, skiprows=None)
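As a quick sanity check (not part of the original answer), you can confirm that the commas were treated as separators and the values parsed as numbers:

print(df.dtypes)   # every column should be int64/float64, not object
print(df.head())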
I have a large dataframe with the following columns:
import pandas as pd
f = pd.DataFrame(columns=['month', 'family_id', 'house_value'])
Months go from 0 to 239, Family_ids up to 10900 and house values vary. So the dataframe has more than 2 and a half million lines.
I want to filter the DataFrame to keep only the families for which there is a difference between the final house value and the initial house value.
Some sample data would look like this:
f = pd.DataFrame({'month': [0, 0, 0, 0, 0, 1, 1, 239, 239], 'family_id': [0, 1, 2, 3, 4, 0, 1, 0, 1], 'house_value': [10, 10, 5, 7, 8, 10, 11, 10, 11]})
And from that sample, the resulting dataframe would be:
g = pd.DataFrame({'month': [0, 1, 239], 'family_id': [1, 1, 1], 'house_value': [10, 11, 11]})
So I thought of code that would be something like this:
ft = f[f.loc['month'==239, 'house_value'] > f.loc['month'==0, 'house_value']]
Also tried this:
g = f[f.house_value[f.month==239] > f.house_value[f.month==0] and f.family_id[f.month==239] == f.family_id[f.month==0]]
The above code gives a KeyError: False and a ValueError. Any ideas? Thanks.
Use groupby.filter:
(f.sort_values('month')
.groupby('family_id')
.filter(lambda g: g.house_value.iat[-1] != g.house_value.iat[0]))
# family_id house_value month
#1 1 10 0
#6 1 11 1
#8 1 11 239
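An equivalent mask-based variant (a sketch assuming the same sample frame f, not part of the answer above) uses groupby().transform to build a row-wise boolean, which can be convenient on very large frames:

fs = f.sort_values('month')
g = fs.groupby('family_id')['house_value']
ft = fs[g.transform('first') != g.transform('last')]   # keep families whose value changed

This keeps exactly the rows of families whose first and last house_value differ, the same selection as the filter above.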
As commented by @Bharath, your approach errors out because a boolean filter expects the boolean series to have the same length as the original data frame, which is not true in either of your cases due to the filtering you applied before the comparison.