Dataframe reindexing in order - python

I have a dataframe like this
datasource datavalue
0 aaaa.pdf 5
0 bbbbb.pdf 5
0 cccc.pdf 9
I don't know if this is the reason, but it seems to be messing up a Dash display, so I would like to reindex it like this:
datasource datavalue
0 aaaa.pdf 5
1 bbbbb.pdf 5
2 cccc.pdf 9
I used
data_all.reset_index()
but it is not working; the index values are still 0.
How should it be done?
EDIT1:
Thanks to the two participants who made me notice my mistake.
I should have put
data_all=data_all.reset_index()
Unfortunately it did not go as expected.
Before:
datasource datavalue
0 aaaa.pdf 5
0 bbbbb.pdf 5
0 cccc.pdf 9
Then
data_all.keys()
Index(['datasource','datavalue'],dtype='object')
So
data_all.reset_index()
After
index datasource datavalue
0 0 aaaa.pdf 5
1 0 bbbbb.pdf 5
2 0 cccc.pdf 9
data_all.keys()
Index(['index','datasource','datavalue'],dtype='object')
As you can see, a column "index" was added. I suppose I can drop that column, but I was expecting something that reindexes the df in one step without adding anything.
EDIT2: Turns out drop=True was necessary!
Thanks everybody!

I think this is what you are looking for.
df.reset_index(drop=True, inplace=True)
# drop: do not insert the old index into the dataframe columns; reset to the default integer index.
# inplace: modify the DataFrame in place rather than creating a new one.

Try:
data_all = data_all.reset_index(drop=True)
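A minimal runnable sketch of the drop=True fix, using a toy frame built to mirror the example above (the column names are taken from the question; everything else is placeholder data):
import pandas as pd

# Three rows that all carry index 0, as in the question.
data_all = pd.DataFrame(
    {'datasource': ['aaaa.pdf', 'bbbbb.pdf', 'cccc.pdf'], 'datavalue': [5, 5, 9]},
    index=[0, 0, 0],
)

# drop=True discards the old index instead of inserting it as a column;
# assigning the result back (or passing inplace=True) keeps the change.
data_all = data_all.reset_index(drop=True)
print(data_all.index.tolist())  # [0, 1, 2]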

Related

Pandas total count each day

I have a large dataset (df) with lots of columns and I am trying to get the total count for each day.
|datetime|id|col3|col4|col...
1 |11-11-2020|7|col3|col4|col...
2 |10-11-2020|5|col3|col4|col...
3 |09-11-2020|5|col3|col4|col...
4 |10-11-2020|4|col3|col4|col...
5 |10-11-2020|4|col3|col4|col...
6 |07-11-2020|4|col3|col4|col...
I want my result to be something like this
|datetime|id|col3|col4|col...|Count
6 |07-11-2020|4|col3|col4|col...| 1
3 |5|col3|col4|col...| 1
2 |10-11-2020|5|col3|col4|col...| 1
4 |4|col3|col4|col...| 2
1 |11-11-2020|7|col3|col4|col...| 1
I tried to use resample like this:
df = df.groupby(['id', 'col3', pd.Grouper(key='datetime', freq='D')]).sum().reset_index()
and this is my result. I am still new to programming and Pandas; I have read the pandas docs but am still unable to do it.
|datetime|id|col3|col4|col...
6 |07-11-2020|4|col3|1|0.0
3 |07-11-2020|5|col3|1|0.0
2 |10-11-2020|5|col3|1|0.0
4 |10-11-2020|4|col3|2|0.0
1 |11-11-2020|7|col3|1|0.0
Try this:
df = df.groupby(['datetime', 'id', 'col3']).count()
If you want the count values for all columns based only on the date, then:
df.groupby('datetime').count()
You'll get a DataFrame that has the datetime as the index, with the column cells holding the number of entries for that given index.
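For completeness, a small sketch of both ideas on a toy frame shaped like the question's data (the dates and ids come from the example; the col3 values are made up, and the Count column is attached with a groupby size so the original rows are kept):
import pandas as pd

df = pd.DataFrame({
    'datetime': ['11-11-2020', '10-11-2020', '09-11-2020',
                 '10-11-2020', '10-11-2020', '07-11-2020'],
    'id': [7, 5, 5, 4, 4, 4],
    'col3': ['a', 'b', 'c', 'd', 'd', 'e'],
})
df['datetime'] = pd.to_datetime(df['datetime'], format='%d-%m-%Y')

# One number per date: how many rows fall on each day.
per_day = df.groupby('datetime').size()

# Or keep every row and attach the per-group count as a new column.
df['Count'] = df.groupby(['datetime', 'id', 'col3'])['id'].transform('size')
print(per_day)
print(df)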

python pandas - transforming table

I would like to transform a table which looks similar to the one below:
X|Y|Z|
1|2|3|
3|5|2|
4|2|1|
The result I want to achieve should look like this:
col|1|2|3|4|5|
X |1|0|1|0|0|
Y |0|2|0|0|1|
Z |1|1|1|0|0|
So, after the transformation the new columns should be the unique values from the previous table, the cells should be populated with the count of appearances, and the index should hold the old column names.
I got stuck and do not know how to handle this because I am new to Python, so thanks in advance for your support.
Regards,
guddy_7
Use apply with value_counts, replace missing values with 0, and transpose with T:
df = df.apply(pd.value_counts).fillna(0).astype(int).T
print (df)
1 2 3 4 5
X 1 0 1 1 0
Y 0 2 0 0 1
Z 1 1 1 0 0
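A self-contained version of the same idea, rebuilt from the sample table in the question (newer pandas releases deprecate the top-level pd.value_counts, so this calls the equivalent Series.value_counts through apply):
import pandas as pd

df = pd.DataFrame({'X': [1, 3, 4], 'Y': [2, 5, 2], 'Z': [3, 2, 1]})

# Count how often each value occurs per column, fill the gaps with 0,
# and transpose so the old column names end up in the index.
out = df.apply(pd.Series.value_counts).fillna(0).astype(int).T
print(out)
#    1  2  3  4  5
# X  1  0  1  1  0
# Y  0  2  0  0  1
# Z  1  1  1  0  0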

Trying to update a dataframe

I have a dataframe (df) which looks like:
0 1 2 3
0 BBG.apples.S BBG.XNGS.bananas.S 0
1 BBG.apples.S BBG.XNGS.oranges.S 0
2 BBG.apples.S BBG.XNGS.pairs.S 0
3 BBG.apples.S BBG.XNGS.mango.S 0
4 BBG.apples.S BBG.XNYS.mango.S 0
5 BBG.XNGS.bananas.S BBG.XNGS.oranges.S 0
6 BBG.XNGS.bananas.S BBG.XNGS.pairs.S 0
7 BBG.XNGS.bananas.S BBG.XNGS.kiwi.S 0
8 BBG.XNGS.oranges.S BBG.XNGS.pairs.S 0
9 BBG.XNGS.oranges.S BBG.XNGS.kiwi.S 0
10 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
11 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
12 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
13 BBG.XNGS.peaches.S BBG.XNGS.kiwi.S 0
I am trying to update a value (first row, third column) in the dataframe using:
for index, row in df.iterrows():
    status = row[3]
    if int(status) == 0:
        df[index]['3'] = 1
but when I print the dataframe out it remains unchanged.
What am I doing wrong?
Replace your last line with:
df.at[index,'3'] = 1
Obviously as mentioned by others you're better off using a vectorized expression instead of iterating, especially for large dataframes.
You can't modify a data frame by iterating like that. See here.
If you only want to modify the element at [1, 3], you can access it directly with .at:
df.at[1, '3'] = 1
If you're trying to turn every 0 in column 3 into a 1, try this:
df.loc[df['3'] == 0, '3'] = 1
EDIT: In addition, the docs for iterrows say that you'll often get a copy back, which is why the operation fails.
If you are trying to update the third column for all rows based on the row having a certain value, as shown in your example code, then it would be much easier to use the where method on the dataframe:
df.loc[:,'3'] = df['3'].where(df['3'] != 0, 1)
Try to update the row using .loc or .iloc (depending on your needs).
For example, in this case:
if int(status) == 0:
    df.loc[index, '3'] = 1
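A short sketch contrasting the corrected loop with the vectorized form, on a toy two-row frame (the column is literally labelled '3' here to mirror the answers above; whether your real frame uses the string '3' or the integer 3 as the label is an assumption worth checking):
import pandas as pd

df = pd.DataFrame({'name': ['BBG.apples.S', 'BBG.XNGS.bananas.S'], '3': [0, 0]})

# Loop version: .at writes back into the frame, unlike chained indexing.
for index, row in df.iterrows():
    if int(row['3']) == 0:
        df.at[index, '3'] = 1

# Vectorized version: one pass, no Python loop.
df2 = pd.DataFrame({'name': ['BBG.apples.S', 'BBG.XNGS.bananas.S'], '3': [0, 0]})
df2.loc[df2['3'] == 0, '3'] = 1

print(df)
print(df2)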

Python How to extract specified string within [ ] brackets in pandas dataframe and create a new column with boolean values

I'm new to programming and would appreciate any of your insights!
I have a data frame like this.
df:
info Price
0 [100:Sailing] $100
1 [150:Boating, 100:Sailing] $200
2 [200:Surfing] $300
I would like to create new columns named after the activities in the info column, with a 1 in the new column when the corresponding name appears in info. It should look like the dataframe below.
Price Sailing Boating Surfing
0 $100 1 0 0
1 $200 1 1 0
2 $300 0 0 1
I tried the code below, but it did not work (even though this approach works on other columns):
df1 = df.info.str.extract(r'(Boating|Sailing|Surfing)',expand=False)
df2 = pd.concat([df,pd.get_dummies(df1).astype(int)],axis=1)
I have over ten thousand rows like this, so ideally I would like to write code that automatically extracts the specified string (like Surfing) from the info column, creates a new column with the activity name, and fills in 1 or 0 as shown above. I thought that maybe the brackets in the data or the data type in the dataframe were causing the problem, but I am not sure how to tackle this.
I assumed the format of the values in the info column is like a Python list.
df1 = df['info'].str[1:-1].str.replace(' ', '').str.get_dummies(',')
df1.rename(columns=lambda x: x.rsplit(':')[-1], inplace=True)
df2 = pd.concat([df, df1.astype(int)], axis=1)
df2
Out:
info Price Sailing Boating Surfing
0 [100:Sailing] $100 1 0 0
1 [150:Boating, 100:Sailing] $200 1 1 0
2 [200:Surfing] $300 0 0 1
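For reference, a self-contained run of that approach, with the sample frame rebuilt from the question (the info values are assumed to be plain strings that merely look like Python lists, which is what the slicing relies on):
import pandas as pd

df = pd.DataFrame({
    'info': ['[100:Sailing]', '[150:Boating, 100:Sailing]', '[200:Surfing]'],
    'Price': ['$100', '$200', '$300'],
})

# Strip the brackets and spaces, one-hot encode on commas,
# then keep only the activity name after the colon.
df1 = df['info'].str[1:-1].str.replace(' ', '').str.get_dummies(',')
df1.columns = [c.rsplit(':')[-1] for c in df1.columns]
df2 = pd.concat([df, df1.astype(int)], axis=1)
print(df2)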

How do I find duplicate indices in a DataFrame?

I have a pandas DataFrame with a multi-level index ("instance" and "index"). I want to find all the first-level ("instance") index values which are non-unique and to print out those values.
My frame looks like this:
A
instance index
a 1 10
2 12
3 4
b 1 12
2 5
3 2
b 1 12
2 5
3 2
I want to find "b" as the duplicate 0-level index and print its value ("b") out.
You can use the get_duplicates() method:
>>> df.index.get_level_values('instance').get_duplicates()
[0, 1]
(In my example data 0 and 1 both appear multiple times.)
The get_level_values() method can accept a label (such as 'instance') or an integer and retrieves the relevant part of the MultiIndex.
Assuming that your df has an index made of 'instance' and 'index' you could do this:
df1 = df.reset_index().pivot_table(index=['instance','index'], values='A', aggfunc='count')
df1[df1 > 1].index.get_level_values(0).drop_duplicates()
Which yields:
Index([u'b'], dtype='object')
Adding .values at the end (.drop_duplicates().values) will make an array:
array(['b'], dtype=object)
Or the same with one line using .groupby:
df[df.groupby(level=['instance','index']).count() > 1].dropna().index.get_level_values(0).drop_duplicates()
This gives you the whole rows rather than just the level values, which isn't quite what you asked for but might be close enough:
df[df.index.get_level_values('instance').duplicated()]
You want the duplicated method:
df['Instance'].duplicated()
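Worth noting: Index.get_duplicates() was deprecated and later removed (as of pandas 1.0), so on current pandas the duplicated() route is the one that still works. A small sketch rebuilt from the example frame above, where flagging duplicated (instance, index) pairs and then taking their level-0 values yields just 'b':
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('a', 3),
     ('b', 1), ('b', 2), ('b', 3),
     ('b', 1), ('b', 2), ('b', 3)],
    names=['instance', 'index'],
)
df = pd.DataFrame({'A': [10, 12, 4, 12, 5, 2, 12, 5, 2]}, index=idx)

# Rows whose full (instance, index) pair occurs more than once,
# reduced to the distinct level-0 values.
dupes = df.index[df.index.duplicated()].get_level_values('instance').unique()
print(dupes)  # Index(['b'], dtype='object')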
