I'm new to Python, so sorry for any mistakes; I hope you can understand me.
I have a problem quite like dropping duplicate rows, except that here I treat 1,2 as the same as 2,1, so there are no actual duplicate rows in the pandas DataFrame. For example, I have this df:
first second
1 2
2 1
2 4
4 2
and I need df eventually to become:
first second
1 2
2 4
How can I tackle this problem?
Thanks in advance.
Update
Here is another problem: the DataFrame has 1,860,000 rows, so using this method raises a MemoryError. Is there a way to tackle this?
You can use apply with sorted and then drop_duplicates:
print (df.apply(sorted, axis=1))
first second
0 1 2
1 1 2
2 2 4
3 2 4
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
first second
0 1 2
2 2 4
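Regarding the update: with 1,860,000 rows the row-wise apply builds a Python list per row, which is slow and memory-hungry (and, depending on your pandas version, apply(sorted, axis=1) may return a Series of lists rather than a DataFrame). A minimal sketch that sorts the underlying numpy array instead:
import numpy as np
import pandas as pd

# Sort each row's values with numpy and rebuild the frame, then drop
# duplicate rows; no per-row Python lists are created.
df = pd.DataFrame(np.sort(df.values, axis=1), columns=df.columns).drop_duplicates()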
A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number, because I do not know the numbers. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
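Applied to the column names from the question, that becomes:
foo = foo.drop(columns=foo.loc[:, "columnWhatever233":"columnWhatever826"].columns)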
I would like to know whether I can get some help in "translating" a multi-dimensional list into a single column of a pandas DataFrame.
I found help here for translating a multi-dimensional list into multiple columns, but I need the data in a single one.
Suppose I have the following list of lists:
x=[[1,2,3],[4,5,6]]
If I create a frame I get
frame = pd.DataFrame(x)
   0  1  2
0  1  2  3
1  4  5  6
But my desired outcome is
0
1
2
3
4
5
6
with zero as the column header.
I can of course get the result with a for loop, but from my point of view that takes too much time. Is there a pythonic/pandas way to get it?
Thanks for helping me.
You can use np.concatenate:
import numpy as np
import pandas as pd

x = [[1, 2, 3], [4, 5, 6]]
frame = pd.DataFrame(np.concatenate(x))
print(frame)
Output:
0
0 1
1 2
2 3
3 4
4 5
5 6
First it is necessary to flatten the nested lists and pass the result to the DataFrame constructor:
df = pd.DataFrame([z for y in x for z in y])
Or:
from itertools import chain
df = pd.DataFrame(list(chain.from_iterable(x)))
print (df)
0
0 1
1 2
2 3
3 4
4 5
5 6
If you use numpy you can use the ravel() method:
pd.DataFrame(np.array(x).ravel())
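Note that np.array(x) only forms a regular 2-D array when the inner lists have equal lengths. For ragged input, the flattening approaches above still work, and on pandas 0.25+ Series.explode is another option (a sketch):
import pandas as pd

x = [[1, 2, 3], [4, 5]]  # ragged inner lists; np.array(x) would not give a 2-D array
frame = pd.Series(x).explode().reset_index(drop=True).to_frame(name=0)
print(frame)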
I am working on a pyspark dataframe which looks like below
id  category
1   A
1   A
1   B
2   B
2   A
3   B
3   B
3   B
I want to unstack the category column and count their occurrences. So, the result I want is shown below
id  A     B
1   2     1
2   1     1
3   Null  3
I tried to find something on the internet that could help me, but I couldn't find anything that gives this specific result.
Short version: you don't have to do multiple groupBys.
df.groupBy("id").pivot("category").count().show()
Try this (not sure it's optimized):
df = spark.createDataFrame([(1,'A'),(1,'A'),(1,'B'),(2,'B'),(2,'A'),(3,'B'),(3,'B'),(3,'B')],['id','category'])
df = df.groupBy('id','category').count()
df.groupBy('id').pivot('category').sum('count').show()
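If you want zeros instead of nulls in the pivoted result, fillna can be chained onto either variant, e.g. (a small sketch):
# Replace the nulls that pivot produces for missing (id, category) pairs.
df.groupBy("id").pivot("category").count().fillna(0).show()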
My dataframe looks like this:
Country Code Duration
A 1 0
A 1 1
A 1 2
A 1 3
A 2 0
A 2 1
A 1 0
A 1 1
A 1 2
I need to get max values from a "Duration" column - not just a maximum value, but a list of maximum values for each sequence of numbers in this column. The output might look like this:
Country Code Duration
A 1 3
A 2 1
A 1 2
I could have grouped by "Code", but its values are often repeating, so that's probably not an option. Any help or tips would be much appreciated.
Use idxmax after creating another group key with diff and cumsum:
df.loc[df.groupby([df.Country,df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
Country Code Duration
3 A 1 3
5 A 2 1
8 A 1 2
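A sketch reconstructing the sample frame from the question so the one-liner above is runnable:
import pandas as pd

df = pd.DataFrame({
    'Country': ['A'] * 9,
    'Code': [1, 1, 1, 1, 2, 2, 1, 1, 1],
    'Duration': [0, 1, 2, 3, 0, 1, 0, 1, 2],
})
# diff/cumsum builds a key that changes whenever Code changes, so each
# consecutive run of the same Code becomes its own group.
out = df.loc[df.groupby([df.Country, df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
print(out)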
First we create a mask to mark the sequences. Then we groupby to create the wanted output:
m = (~df['Code'].eq(df['Code'].shift())).cumsum()
df.groupby(m).agg({'Country': 'first',
                   'Code': 'first',
                   'Duration': 'max'}).reset_index(drop=True)
Country Code Duration
0 A 1 3
1 A 2 1
2 A 1 2
The problem is slightly unclear. However, assuming that order is important, we can move toward a solution.
import pandas as pd
d = pd.read_csv('data.csv')
s = d.Code
d['series'] = s.ne(s.shift()).cumsum()
print(d.groupby(['Country','Code','series'])['Duration'].max().reset_index())
Returns:
Country Code series Duration
0 A 1 1 3
1 A 1 3 2
2 A 2 2 1
You can then drop the series.
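For example, continuing from the snippet above:
result = d.groupby(['Country', 'Code', 'series'])['Duration'].max().reset_index()
result = result.drop(columns='series')  # remove the helper grouping column
print(result)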
You might want to check this link; it might be the answer you're looking for:
pandas groupby where you get the max of one column and the min of another column. It goes as:
result = df.groupby(['Code', 'Country']).agg({'Duration':'max'})[['Duration']].reset_index()
I have a dataframe with 2 index levels:
                   value
Trial measurement
1     0               13
      1                3
      2                4
2     0              NaN
      1               12
3     0               34
Which I want to turn into this:
Trial measurement value
1 0 13
1 1 3
1 2 4
2 0 NaN
2 1 12
3 0 34
How can I best do this?
I need this because I want to aggregate the data as instructed here, but I can't select my columns like that if they are in use as indices.
The reset_index() is a pandas DataFrame method that will transfer index values into the DataFrame as columns. The default setting for the parameter is drop=False (which will keep the index values as columns).
All you have to do is call .reset_index() on the DataFrame:
df = df.reset_index()
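A minimal runnable sketch reconstructing the frame from the question:
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0)],
    names=['Trial', 'measurement'])
df = pd.DataFrame({'value': [13, 3, 4, np.nan, 12, 34]}, index=index)

# Both index levels become ordinary columns, and a fresh RangeIndex is used.
print(df.reset_index())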
This doesn't really apply to your case, but it could be helpful for others (like myself 5 minutes ago) to know. If one's MultiIndex levels have the same name, like this:
             value
Trial Trial
1     0         13
      1          3
      2          4
2     0        NaN
      1         12
3     0         34
df.reset_index(inplace=True) will fail, because the columns that are created cannot have the same name.
So you need to rename the MultiIndex levels with df.index = df.index.set_names(['Trial', 'measurement']) to get:
                   value
Trial measurement
1     0               13
1     1                3
1     2                4
2     0              NaN
2     1               12
3     0               34
And then df.reset_index(inplace=True) will work like a charm.
I encountered this problem after grouping by year and month on a datetime column (not the index) called live_date, which meant that both year and month were named live_date.
There may be situations when df.reset_index() cannot be used (e.g., when you need the index, too). In this case, use index.get_level_values() to access index values directly:
df['Trial'] = df.index.get_level_values(0)
df['measurement'] = df.index.get_level_values(1)
This will assign index values to individual columns and keep the index.
See the docs for further info.
As @cs95 mentioned in a comment, to drop only one level, use:
df.reset_index(level=[...])
This avoids having to redefine your desired index after reset.
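For the frame from the question, for example, this moves only measurement into a column and keeps Trial as the index:
# Keep 'Trial' as the index; turn only 'measurement' into a column.
df = df.reset_index(level=['measurement'])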
I ran into Karl's issue as well. I just found myself renaming the aggregated column then resetting the index.
df = pd.DataFrame(df.groupby(['arms', 'success'])['success'].sum()).rename(columns={'success':'sum'})
df = df.reset_index()
Short and simple
df2 = pd.DataFrame({'test_col': df['test_col'].describe()})
df2 = df2.reset_index()
A solution that might be helpful in cases when not every column has multiple index levels:
df.columns = df.columns.map(''.join)
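Note that this line targets a MultiIndex on the columns (as produced by, e.g., an agg with several functions) rather than on the rows. A minimal sketch, using '_'.join so the flattened names stay readable (the separator is my choice, not part of the original answer):
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 2, 3]})
agg = df.groupby('g').agg({'v': ['sum', 'max']})  # columns become a MultiIndex: ('v', 'sum'), ('v', 'max')
agg.columns = agg.columns.map('_'.join)           # flattened to 'v_sum', 'v_max'
print(agg.reset_index())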
Similar to Alex's solution, but in a more generalized form. It keeps the index untouched and adds each index level as a new column with its name:
for i in df.index.names:
    df[i] = df.index.get_level_values(i)
which gives
                   value  Trial  measurement
Trial measurement
1     0               13      1            0
      1                3      1            1
      2                4      1            2
...