I have a very large DataFrame that looks like this:
                     A  B
SPH2008  3/21/2008   1  2
         3/21/2008   1  2
         3/21/2008   1  2
SPM2008  6/21/2008   1  2
         6/21/2008   1  2
         6/21/2008   1  2
And I have the following code, which is intended to flatten the index and collect the unique pairs of the two index levels into a new DataFrame:
indeces = [df.index.get_level_values(0), df.index.get_level_values(1)]
tmp = pd.DataFrame(data=indeces).T.drop_duplicates()
tmp.columns = ['ID', 'ExpirationDate']
tmp.sort_values('ExpirationDate', inplace=True)
However, this operation takes a remarkably long amount of time. Is there a more efficient way to do this?
Use drop_duplicates on the index (pandas.Index.drop_duplicates):
pd.DataFrame([*df.index.drop_duplicates()], columns=['ID', 'ExpirationDate'])
ID ExpirationDate
0 SPH2008 3/21/2008
1 SPM2008 6/21/2008
With older versions of Python that can't unpack in that way:
pd.DataFrame(df.index.drop_duplicates().tolist(), columns=['ID', 'ExpirationDate'])
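As a self-contained sketch of the same idea, the snippet below rebuilds a small frame shaped like the one in the question (sample values copied from it) and uses MultiIndex.to_frame to turn the de-duplicated index straight into columns. The names index and unique_pairs are only illustrative.

```python
import pandas as pd

# Rebuild a small frame with the same two-level index as in the question.
index = pd.MultiIndex.from_tuples(
    [("SPH2008", "3/21/2008")] * 3 + [("SPM2008", "6/21/2008")] * 3
)
df = pd.DataFrame({"A": 1, "B": 2}, index=index)

# De-duplicate the index itself, then convert its levels into columns.
unique_pairs = df.index.drop_duplicates().to_frame(
    index=False, name=["ID", "ExpirationDate"]
)
print(unique_pairs)
```

Because the dates here are plain strings, sort_values('ExpirationDate') would order them lexicographically; converting the column with pd.to_datetime first is one option if chronological order matters.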
IIUC, you can also group by the levels of your MultiIndex, then create a DataFrame from the group keys with your desired columns:
>>> pd.DataFrame(df.groupby(level=[0,1]).groups.keys(), columns=['ID', 'ExpirationDate'])
ID ExpirationDate
0 SPH2008 3/21/2008
1 SPM2008 6/21/2008
Related
I have this DataFrame and I need to drop all duplicates, but I need to keep both the first AND the last values.
For example:
1 0
2 0
3 0
4 0
output:
1 0
4 0
I tried df.column.drop_duplicates(keep=("first","last")) but it doesn't work; it returns
ValueError: keep must be either "first", "last" or False
Does anyone know a workaround for this?
Thanks
You could use pandas' concat function to create a DataFrame with both the first and last values.
pd.concat([
df['X'].drop_duplicates(keep='first'),
df['X'].drop_duplicates(keep='last'),
])
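You can't keep both the first and the last occurrence in a single drop_duplicates call, so the trick is to concat the DataFrames of first and last occurrences.
When you concat, you have to avoid duplicating rows that were never duplicates in the first place, so only take the rows of the second DataFrame whose index labels are not already in the first. (Not sure whether merge/join would work better.)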
import pandas as pd
d = {1:0,2:0,10:1, 3:0,4:0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
print(df)
cnt
1 0
2 0
10 1
3 0
4 0
Then do this:
d1 = df.drop_duplicates(keep="first")
d2 = df.drop_duplicates(keep="last")
# recent pandas versions reject a set as a .loc indexer, so use a boolean mask instead
d3 = pd.concat([d1, d2.loc[~d2.index.isin(d1.index)]])
d3
Out[60]:
cnt
1 0
10 1
4 0
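A possible alternative that avoids concat altogether is to build a boolean mask from DataFrame.duplicated: a row is dropped only when it is neither the first nor the last occurrence of its value. This is a minimal sketch on the same sample data; the names not_first, not_last and first_and_last are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"cnt": [0, 0, 1, 0, 0]}, index=[1, 2, 10, 3, 4])

# True for every occurrence except the first / except the last.
not_first = df.duplicated(keep="first")
not_last = df.duplicated(keep="last")

# Keep a row unless it is both "not the first" and "not the last".
first_and_last = df[~(not_first & not_last)]
print(first_and_last)
```

Rows with a unique value are neither flagged by keep="first" nor by keep="last", so they are kept exactly once and the original row order is preserved.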
Use a groupby on your column named column, then reindex. If you ever want to check for duplicate values in more than one column, you can extend the columns you include in your groupby.
df = pd.DataFrame({'column':[0,0,0,0]})
Input:
column
0 0
1 0
2 0
3 0
df.groupby('column', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[0, -1]]).reset_index(level=0, drop=True)
Output:
column
0 0
3 0
I am trying to find the data that is not common to two DataFrames.
df1
df1 = pd.DataFrame({
'contact_id': [1,2,3,4]
})
contact_id
0 1
1 2
2 3
3 4
df2
df2 = pd.DataFrame({
'contact_id': [1,3,4,5]
})
contact_id
0 1
1 3
2 4
3 5
Expected output
contact_id
0 2
1 5
I have tried the code below, but I am getting incorrect output:
df = df2[~df2.contact_id.isin(df1.contact_id)]
Can anyone help me get the expected output?
Try merge() with indicator=True, then filter the rows with query(), and finally drop the extra column with drop():
out=(df1.merge(df2,indicator=True,on='contact_id',how='outer')
.query("_merge!='both'").drop(columns='_merge'))
output of out:
contact_id
1 2
4 5
Alternatively,
you can concatenate the two DataFrames, drop every row that appears more than once (keep=False), and reset the index. If you want to keep the original index, remove the reset_index call.
pd.concat([df1,df2]).drop_duplicates(keep=False).reset_index(drop =True)
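The attempt in the question only checks one direction (values of df2 missing from df1), which is why it misses 2. A minimal sketch that applies the same isin filter in both directions and stacks the results could look like this; only_in_df1, only_in_df2 and not_common are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"contact_id": [1, 2, 3, 4]})
df2 = pd.DataFrame({"contact_id": [1, 3, 4, 5]})

# Rows whose contact_id appears in one frame but not in the other.
only_in_df1 = df1[~df1["contact_id"].isin(df2["contact_id"])]
only_in_df2 = df2[~df2["contact_id"].isin(df1["contact_id"])]

not_common = pd.concat([only_in_df1, only_in_df2], ignore_index=True)
print(not_common)
```

Unlike concat followed by drop_duplicates(keep=False), this keeps a value that is repeated inside a single frame as long as it is absent from the other one.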
Pandas:
data = data.dropna(axis = 'columns')
I am trying to do something similar using a cuDF DataFrame, but the API doesn't seem to offer this functionality.
My solution is to convert to a pandas df, do the above command, then re-convert to a cudf. Is there a better solution?
cuDF now supports column based dropna, so the following will work:
import cudf
df = cudf.DataFrame({'a':[0,1,None], 'b':[None,0,2], 'c':[1,2,3]})
print(df)
a b c
0 0 null 1
1 1 0 2
2 null 2 3
df.dropna(axis='columns')
c
0 1
1 2
2 3
Until dropna is implemented, you can check the null_count of each column and drop the ones with null_count>0.
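The null-count workaround can be sketched roughly as follows. To keep the example runnable without a GPU it uses pandas and isnull().sum(); the assumption is that the same calls behave the same way on a cuDF frame, where each column also exposes null_count as mentioned above. The names complete_columns and result are illustrative.

```python
import pandas as pd  # stand-in for cudf so the sketch runs without a GPU

df = pd.DataFrame({"a": [0, 1, None], "b": [None, 0, 2], "c": [1, 2, 3]})

# Keep only the columns that contain no nulls at all.
complete_columns = [name for name in df.columns if df[name].isnull().sum() == 0]
result = df[complete_columns]
print(result)
```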
The documentation on dataframe.query() is very terse http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html . I was also unable to find examples of projections by web search.
So I tried simply providing the column names: that gave a syntax error. Likewise for typing select and then the column names. So .. how to do this?
After playing around with this for a while and reading through the source code for DataFrame.query, I can't figure out a way to do it.
If it's not impossible, apparently it's at least strongly discouraged. When this question came up on github, prolific Pandas dev/maintainer jreback suggested using df.eval() for selecting columns and df.query() for filtering on rows.
UPDATE:
javadba points out that the return value of eval is not a dataframe. For example, to flesh out jreback's example a bit more...
df.eval('A')
returns a Pandas Series, but
df.eval(['A', 'B'])
does not return a DataFrame; it returns a list (of Pandas Series).
So it seems ultimately the best way to maintain flexibility to filter on rows and columns is to use iloc/loc, e.g.
df.loc[0:4, ['A', 'C']]
output
A C
0 -0.497163 -0.046484
1 1.331614 0.741711
2 1.046903 -2.511548
3 0.314644 -0.526187
4 -0.061883 -0.615978
Dataframe.query is more like the where clause in a SQL statement than the select part.
import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
To select a column or columns you can use the following:
df['A'] or df.loc[:,'A']
or
df[['A','B']] or df.loc[:,['A','B']]
To use the .query method you do something like
df.query('A > B'), which would return all the rows where the value in column A is greater than the value in column B.
A B C D
2000-01-03 1.265936 -0.866740 -0.678886 -0.094709
2000-01-04 1.491390 -0.638902 -0.443982 -0.434351
2000-01-05 2.205930 2.186786 1.004054 0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589
Which is more readable, in my opinion, than boolean index selection with
df[df['A'] > df['B']]
How about
df_new = df.query('col1==1 & col2=="x" ')[['col1', 'col3']]
This would filter the rows where col1 equals 1 and col2 equals "x", and return only col1 and col3.
Note that you need a row filter for this to work.
For selecting columns only, it is better to use .loc or .iloc.
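To make the "filter rows, then project columns" pattern concrete, here is a small sketch on the random frame defined earlier in this thread. It shows the query-then-select form next to the equivalent single .loc call; via_query and via_loc are illustrative names.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
dates = pd.date_range("1/1/2000", periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=["A", "B", "C", "D"])

# Row filter with query (the WHERE part), then column selection (the SELECT part).
via_query = df.query("A > B")[["A", "C"]]

# The same result in one step with a boolean mask and a column list.
via_loc = df.loc[df["A"] > df["B"], ["A", "C"]]

print(via_query.equals(via_loc))
```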
pandasql
https://pypi.python.org/pypi/pandasql/0.1.0
Here is an example from the following blog http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html . The inputs are two DataFrames meat and births : and this approach gives the projections, filtering, aggregation and sorting expected from sql.
@maxpower did mention that this package is buggy, so let's see. At least the code from the blog, shown below, works fine.
from pandasql import sqldf, load_meat, load_births
pysqldf = lambda q: sqldf(q, globals())
q = """
SELECT
m.date
, m.beef
, b.births
FROM
meat m
LEFT JOIN
births b
ON m.date = b.date
WHERE
m.date > '1974-12-31';
"""
meat = load_meat()
births = load_births()
df = pysqldf(q)
The output is a pandas DataFrame as desired.
It is working great for my particular use case (evaluating US crimes):
odf = pysqldf("select %s from df where sweapons > 10 order by sweapons desc limit 10" %scols)
p('odf\n', odf)
odf
: SMURDER SRAPE SROBBERY SAGASSLT SOTHASLT SVANDLSM SWEAPONS
0 0 0 0 1 1 10 54
1 0 0 0 0 1 0 52
2 0 0 0 0 1 0 46
3 0 0 0 0 1 0 43
4 0 0 0 0 1 0 33
5 1 0 2 16 28 4 32
6 0 0 0 7 17 4 30
7 0 0 0 0 1 0 29
8 0 0 0 7 16 3 29
9 0 0 0 1 0 5 28
Update: I have done a bunch of stuff with pandasql now: calculated fields, limits, aliases, cascaded DataFrames. It is just so productive.
Another update (3 yrs later): this works, but be warned that it is very slow (seconds vs. milliseconds).
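If the speed of pandasql becomes a problem, the join/filter/projection from the SQL example above can also be written in plain pandas. The sketch below uses tiny made-up stand-ins for the meat and births tables (the values are hypothetical, not the real datasets), so it only illustrates the shape of the translation.

```python
import pandas as pd

# Hypothetical stand-ins for the meat and births tables; the values are made up.
meat = pd.DataFrame({
    "date": pd.to_datetime(["1974-12-01", "1975-01-01", "1975-02-01"]),
    "beef": [100.0, 110.0, 120.0],
})
births = pd.DataFrame({
    "date": pd.to_datetime(["1975-01-01", "1975-02-01"]),
    "births": [265000, 241000],
})

# LEFT JOIN ... ON m.date = b.date
joined = meat.merge(births, on="date", how="left")

# WHERE m.date > '1974-12-31', then SELECT date, beef, births
result = joined.loc[joined["date"] > pd.Timestamp("1974-12-31"), ["date", "beef", "births"]]
print(result)
```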
Just a simpler example solution (using get):
My goal:
I want the lat and lon columns out of the result of the query.
My table details:
df_city.columns
Index(['name', 'city_id', 'lat', 'lon', 'CountryName',
'ContinentName'], dtype='object')
# All columns
city_continent = df_city.get(df_city['ContinentName']=='Oceania')
# Only lat and lon
city_continent[['lat', 'lon']]
lat lon
113883 -19.12753 -169.84623
113884 -19.11667 -169.90000
113885 -19.10000 -169.91667
113886 -46.33333 168.85000
113887 -46.36667 168.55000
... ... ...
347956 -23.14083 113.77630
347957 -31.48023 131.84242
347958 -28.29967 153.30142
347959 -35.60358 138.10548
347960 -35.02852 117.83416
3712 rows × 2 columns
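The same row filter and column selection can also be done in one step with .loc. This is a minimal sketch on made-up rows that reuse some of the column names from df_city above; the city names and coordinates are hypothetical.

```python
import pandas as pd

# Made-up rows using column names from the question; the values are hypothetical.
df_city = pd.DataFrame({
    "name": ["city_a", "city_b", "city_c"],
    "lat": [-19.1, 48.9, -35.0],
    "lon": [-169.8, 2.3, 117.8],
    "ContinentName": ["Oceania", "Europe", "Oceania"],
})

# Boolean mask for the rows, list of labels for the columns.
oceania_coords = df_city.loc[df_city["ContinentName"] == "Oceania", ["lat", "lon"]]
print(oceania_coords)
```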
Is there a way, without use of loops, to reindex a DataFrame using a dict? Here is an example:
df = pd.DataFrame([[1,2], [3,4]])
dic = {0:'first', 1:'second'}
I want to apply something efficient to df to obtain:
0 1
first 1 2
second 3 4
Speed is important, as the index in the actual DataFrame I am dealing with has a huge number of unique values. Thanks
You need the rename function:
df.rename(index=dic)
# 0 1
#first 1 2
#second 3 4
Modified the dic to get the results: dic = {0:'first', 1:'second'}
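For comparison, here is a small sketch of rename next to Index.map, which can also take a dict. The two agree when every label is in the dict, but differ for missing labels: rename leaves them unchanged, while map turns them into NaN. The name partial is a hypothetical incomplete mapping used only to show that difference.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
dic = {0: "first", 1: "second"}

# Both approaches give the same index when every label has a mapping.
renamed = df.rename(index=dic)
mapped = df.copy()
mapped.index = mapped.index.map(dic)

# With an incomplete mapping the behaviour differs.
partial = {0: "first"}
renamed_partial = df.rename(index=partial)  # label 1 is left as it is
mapped_partial = df.index.map(partial)      # label 1 becomes NaN

print(renamed_partial.index.tolist())
print(mapped_partial)
```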