python dataframe group rows based on row num [duplicate] - python

This question already has answers here:
How to iterate over consecutive chunks of Pandas dataframe efficiently
(8 answers)
Closed 3 years ago.
I have a dataframe with 40 rows,
and I want to iterate over it in 4 iterations of 10 rows each, in order.
So group #0 will be rows 0-9, group #1 will be rows 10-19, and so on.
How can I do it?

Two solutions from this Stack Overflow question: How to iterate over consecutive chunks of Pandas dataframe efficiently
I advise you to check the link.
Solution from DSM:
import numpy as np
# integer-divide the row position by the chunk size to get a group key
for k, g in df.groupby(np.arange(len(df)) // 10):
    print(k, g)
Solution from Ryan:
def chunker(seq, size):
    # yield consecutive slices of at most `size` rows
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
for i in chunker(df, 10):
    print(i)
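As a minimal sketch of both approaches on the 40-row case from the question, assuming a hypothetical single-column frame `df` built here purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical 40-row frame, made up for illustration only.
df = pd.DataFrame({"value": range(40)})

# Approach 1: integer-divide the row position by the chunk size to get a group key.
groups = [g for _, g in df.groupby(np.arange(len(df)) // 10)]

# Approach 2: slice by position with iloc, stepping 10 rows at a time.
chunks = [df.iloc[pos:pos + 10] for pos in range(0, len(df), 10)]

for k, g in enumerate(groups):
    # print the group number and the first and last row label of each group
    print(k, g.index[0], g.index[-1])
```

Both lists should hold the same four 10-row pieces. The groupby version keys on row position, so it does not depend on what the index labels happen to be.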

import pandas as pd
import numpy as np
df1 = {
    'State': ['Arizona','Georgia','Newyork','Indiana','Florida'],
    'Score1': [4,47,55,74,31]}
df1 = pd.DataFrame(df1, columns=['State','Score1'])
print(df1)
To generate a row number, add an offset (here 430) to the index and store the result in a new column:
df1['New_ID'] = df1.index + 430
print(df1)

Related

How to cast pandas series into dataframe [duplicate]

This question already has answers here:
Python dataframe replace last n rows with a list of n elements
(2 answers)
df.append() is not appending to the DataFrame
(2 answers)
Closed 1 year ago.
I'm trying to cast a series of 20 values at the end of a dataframe with more than 20 rows.
The original values are coming from a numpy array 'Y_pred':
[[3495.47227957]
[3493.27865109]
[3491.08502262]
[3488.89139414]
[3486.69776567]
[3484.50413719]
[3482.31050871]
[3480.11688024]
[3477.92325176]
[3475.72962329]
[3473.53599481]
[3471.34236633]
[3469.14873786]
[3466.95510938]
[3464.7614809 ]
[3462.56785243]
[3460.37422395]
[3458.18059548]
[3455.986967 ]
[3453.79333852]]
I create the column Y_pred and try to cast the converted series into it:
df['Y_pred'] = np.nan
df.Y_pred.iloc[-len(Y_pred):].append(pd.Series({'Y_pred': Y_pred}), ignore_index=True)
The result is that all rows are NaN.
I also tried this:
series = pd.Series(Y_pred[:, 0])
df.Y_pred.iloc[-20:].append(series, ignore_index=True)
and
df['Y_pred'].append(Y_pred)
Nothing works. How do I do it properly?
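One possible approach, sketched under assumptions: `append` returned a new object instead of modifying `df` in place (and was removed in pandas 2.0), so the attempts above leave the column untouched. Assigning through `.loc` to the last `len(Y_pred)` index labels writes the values in place. The frame `df` and the array `Y_pred` below are hypothetical stand-ins for the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: a 25-row frame and a (20, 1) prediction array.
df = pd.DataFrame({"x": np.arange(25)})
Y_pred = np.linspace(3495.0, 3453.0, 20).reshape(-1, 1)

df["Y_pred"] = np.nan
# Select the last len(Y_pred) row labels and assign in place;
# ravel() flattens the (20, 1) array to shape (20,).
df.loc[df.index[-len(Y_pred):], "Y_pred"] = Y_pred.ravel()

print(df.tail(22))
```

The rows before the last 20 keep their NaN placeholder.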

extract value from pandas dataframe [duplicate]

This question already has answers here:
Extract int from string in Pandas
(8 answers)
Closed 1 year ago.
Below is the dataframe
import pandas as pd
import numpy as np
d = {'col1': ['Get URI||1621992600749||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90',
'Get URI||1621992600799||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90']}
df = pd.DataFrame(data=d)
and I need to extract the "1621992600749" and "1621992600799" values.
I have tried several approaches, including the split function:
new = df["col1"].str.split("||", n = 1, expand = True)
but it doesn't give the expected results. Any thoughts would be helpful.
You can use extract with a regex:
df['col1'].str.extract(r'(\d+)')
#output
0
0 1621992600749
1 1621992600799
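A likely reason the split attempt misbehaves is that a multi-character pattern is treated as a regular expression, in which `|` means alternation rather than a literal pipe. As a sketch, escaping the pipes and dropping the `n = 1` limit lets split work too; the names `parts` and `timestamps` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": [
    "Get URI||1621992600749||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90",
    "Get URI||1621992600799||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90",
]})

# Escape each pipe so the pattern matches the literal "||" separator.
parts = df["col1"].str.split(r"\|\|", expand=True)

# The value of interest is the second field (column 1).
timestamps = parts[1]
print(timestamps)
```

Unlike `str.extract(r'(\d+)')`, which takes the first run of digits wherever it occurs, this selects the field by position, which may be safer if the first field could ever contain digits.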

How doing division each cell dataframe [duplicate]

This question already has answers here:
Pandas sum across columns and divide each cell from that value
(5 answers)
Closed 3 years ago.
I want to divide each cell by the sum of its row. In practice there are many columns, not only A and B.
import pandas as pd
data = pd.DataFrame({'A':[1,2,3,1,2,3,1],
                     'B':[4,5,6,4,5,6,4]})
sum_row = data.sum(axis=1)
Here is an example of what I expect.
I think this should do the trick:
import pandas as pd
data = pd.DataFrame({'A':[1,2,3,1,2,3,1],
                     'B':[4,5,6,4,5,6,4]})
# remember the original columns before adding the row total
value_cols = list(data.columns)
data['sum_row'] = data[value_cols].sum(axis=1)
for col in value_cols:
    # divide each cell of this column by its row total
    data[col + ' / Sum_Row'] = data[col] / data['sum_row']
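A loop-free alternative, as a minimal sketch: `DataFrame.div` with `axis=0` aligns the row totals with the row index, so every cell is divided by its own row's sum regardless of how many columns there are. The name `ratios` is only illustrative.

```python
import pandas as pd

data = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3, 1],
                     "B": [4, 5, 6, 4, 5, 6, 4]})

# Divide every cell by the sum of its row; axis=0 matches the totals to rows.
ratios = data.div(data.sum(axis=1), axis=0)
print(ratios)
```

Each row of `ratios` should then sum to 1.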

Count instances in a dataframe [duplicate]

This question already has answers here:
Pandas, group by count and add count to original dataframe?
(3 answers)
Closed 3 years ago.
I have a dataframe containing a column of values (X).
df = pd.DataFrame({'X' : [2,3,5,2,2,3,7,2,2,7,5,2]})
For each row, I would like to find how many times its value of X appears (A).
My expected output is:
Create a temporary column of 1s, then group by X and count to get the desired result:
import pandas as pd
df = pd.DataFrame({'X' : [2,3,5,2,2,3,7,2,2,7,5,2]})
df['temp'] = 1
# transform returns one value per original row, so it can be assigned back directly
df['count'] = df.groupby('X')['temp'].transform('count')
del df['temp']
print(df)
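The temporary column is not strictly needed. As a sketch, `transform('count')` can be applied to X itself, and mapping `value_counts()` back onto the column is an equivalent route; the column name `count_alt` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"X": [2, 3, 5, 2, 2, 3, 7, 2, 2, 7, 5, 2]})

# Count the members of each X group and broadcast the count back to every row.
df["count"] = df.groupby("X")["X"].transform("count")

# Equivalent: look up each value in the frequency table of X.
df["count_alt"] = df["X"].map(df["X"].value_counts())

print(df)
```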

Python/Pandas - Query a MultiIndex Column [duplicate]

This question already has answers here:
Select columns using pandas dataframe.query()
(5 answers)
Closed 4 years ago.
I'm trying to use query on a MultiIndex column. It works on a MultiIndex row, but not the column. Is there a reason for this? The documentation shows examples like the first one below, but it doesn't indicate that it won't work for a MultiIndex column.
I know there are other ways to do this, but I'm specifically trying to do it with the query function
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((4,4)))
df.index = pd.MultiIndex.from_product([[1,2],['A','B']])
df.index.names = ['RowInd1', 'RowInd2']
# This works
print(df.query('RowInd2 in ["A"]'))
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([[1,2],['A','B']])
df.columns.names = ['ColInd1', 'ColInd2']
# query on index works, but not on the multiindexed column
print(df.query('index < 2'))
print(df.query('ColInd2 in ["A"]'))
To answer my own question, it looks like query shouldn't be used at all (regardless of using MultiIndex columns) for selecting certain columns, based on the answer(s) here:
Select columns using pandas dataframe.query()
You can use IndexSlice:
df.query('ilevel_0>2')
Out[327]:
ColInd1 1 2
ColInd2 A B A B
3 0.652576 0.639522 0.52087 0.446931
df.loc[:,pd.IndexSlice[:,'A']]
Out[328]:
ColInd1 1 2
ColInd2 A A
0 0.092394 0.427668
1 0.326748 0.383632
2 0.717328 0.354294
3 0.652576 0.520870
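For reference, a self-contained sketch of selecting columns by a MultiIndex level without `query`, using the level names from the question. The seeded random data is made up for illustration, and `xs` and a boolean mask are shown as alternatives to `IndexSlice`.

```python
import numpy as np
import pandas as pd

# Seeded random data, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 4)))
df.columns = pd.MultiIndex.from_product(
    [[1, 2], ["A", "B"]], names=["ColInd1", "ColInd2"]
)

# Option 1: slice the second column level with IndexSlice.
sliced = df.loc[:, pd.IndexSlice[:, "A"]]

# Option 2: cross-section on the named level, keeping both levels.
by_level = df.xs("A", axis=1, level="ColInd2", drop_level=False)

# Option 3: boolean mask built from the level values.
mask = df.columns.get_level_values("ColInd2") == "A"
masked = df.loc[:, mask]

print(sliced)
```

All three should select the same two columns, (1, 'A') and (2, 'A').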
