dataFrame duplication extraction row - python

The code below gives exactly the following Jupyter output:
date open high low close volume
0 29/04/1992 2.21 2.21 1.98 1.99 0
1 29/04/1992 2.21 2.21 1.98 1.98 0
2 30/04/1992 2.02 2.32 1.95 1.98 0
size: 6686
no duplicates? False
date open high low close volume
0 29/04/1992 2.21 2.21 1.98 1.99 0
1 29/04/1992 2.21 2.21 1.98 1.98 0
2 30/04/1992 2.02 2.32 1.95 1.98 0
no duplicates? False
size: 6686
What should I change in the duplication-extraction line?
Thanks!
fskilnik
checking = pd.DataFrame(df)
print(checking.head(3))
size2 = len(checking.index)
print('size:',size2)
print('no duplicates?', checking.date.is_unique)
checking.drop_duplicates(['date'], keep='last')
print(checking.head(3))
print('no duplicates?', checking.date.is_unique)
size2 = len(checking.index)
print('size:',size2)

You should add inplace=True to the drop_duplicates method or reassign the dataframe like:
checking.drop_duplicates(['date'], keep='last', inplace=True)
Or:
checking = checking.drop_duplicates(['date'], keep='last')
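The difference between the two calls can be seen in a minimal sketch. The small frame below is made-up sample data modelled on the question's output, not the asker's real file, and the variable names are only illustrative:

```python
import pandas as pd

# Hypothetical sample data with one repeated date (not the asker's real file).
df = pd.DataFrame({
    "date": ["29/04/1992", "29/04/1992", "30/04/1992"],
    "close": [1.99, 1.98, 1.98],
})

# drop_duplicates returns a new DataFrame; ignoring the return value
# leaves df exactly as it was.
df.drop_duplicates(["date"], keep="last")
unchanged_size = len(df.index)

# Reassigning (or passing inplace=True) keeps the deduplicated result.
deduped = df.drop_duplicates(["date"], keep="last")

print(unchanged_size, len(deduped.index), deduped["date"].is_unique)
```

With keep='last', the later of the two rows dated 29/04/1992 is the one that survives.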

Related

Combine a row with column in dataFrame and show the corresponding values

So I want to show this data in just two columns. For example, I want to turn this data
Year Jan Feb Mar Apr May Jun
1997 3.45 2.15 1.89 2.03 2.25 2.20
1998 2.09 2.23 2.24 2.43 2.14 2.17
1999 1.85 1.77 1.79 2.15 2.26 2.30
2000 2.42 2.66 2.79 3.04 3.59 4.29
into this
Date Price
Jan-1997 3.45
Feb-1997 2.15
Mar-1997 1.89
Apr-1997 2.03
....
Jan-2000 2.42
Feb-2000 2.66
So far, I have read about how to combine two columns into another dataframe using .apply() and .agg(), but I found nothing on how to combine them as shown above.
import pandas as pd
df = pd.read_csv('matrix-A.csv', index_col =0 )
matrix_b = ({})
new = pd.DataFrame(matrix_b)
new["Date"] = df['Year'].astype(float) + "-" + df["Dec"]
print(new)
I have tried this way, but it of course does not work. I have also tried using pd.Series() but no success
I want to ask whether there is any site where I can learn how to do this, or does anybody know correct way to solve this?
Another possible solution, which is based on pandas.DataFrame.stack:
out = df.set_index('Year').stack()
out.index = ['{}_{}'.format(j, i) for i, j in out.index]
out = out.reset_index()
out.columns = ['Date', 'Value']
Output:
Date Value
0 Jan_1997 3.45
1 Feb_1997 2.15
2 Mar_1997 1.89
3 Apr_1997 2.03
4 May_1997 2.25
....
19 Feb_2000 2.66
20 Mar_2000 2.79
21 Apr_2000 3.04
22 May_2000 3.59
23 Jun_2000 4.29
You can first convert it to long-form using melt. Then, create a new column for Date by combining two columns.
long_df = pd.melt(df, id_vars=['Year'], var_name='Month', value_name="Price")
long_df['Date'] = long_df['Month'] + "-" + long_df['Year'].astype('str')
long_df[['Date', 'Price']]
If you want to sort your date column, here is a good resource. Follow those instructions after melting and before creating the Date column.
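As one possible way to get chronological order (this is a sketch, not the approach from the linked resource, which is not reproduced here), the "Jan-1997" style labels can be parsed into timestamps and used as a sort key. It assumes English three-letter month abbreviations, and the sort_key and result names are only illustrative:

```python
import pandas as pd

# Sample data copied from the question (first two years, first three months).
df = pd.DataFrame({
    "Year": [1997, 1998],
    "Jan": [3.45, 2.09],
    "Feb": [2.15, 2.23],
    "Mar": [1.89, 2.24],
})

long_df = pd.melt(df, id_vars=["Year"], var_name="Month", value_name="Price")
long_df["Date"] = long_df["Month"] + "-" + long_df["Year"].astype(str)

# Parse the labels into real timestamps so they sort chronologically.
# Assumption: month names are English abbreviations, as %b expects by default.
sort_key = pd.to_datetime(long_df["Date"], format="%b-%Y")

result = (
    long_df.assign(sort_key=sort_key)
    .sort_values("sort_key")
    .loc[:, ["Date", "Price"]]
    .reset_index(drop=True)
)
print(result)
```

The Date column stays a plain string here; only the temporary key is a datetime.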
You can use pandas.DataFrame.melt :
out = (
df
.melt(id_vars="Year", var_name="Month", value_name="Price")
.assign(month_num= lambda x: pd.to_datetime(x["Month"] , format="%b").dt.month)
.sort_values(by=["Year", "month_num"])
.assign(Date= lambda x: x.pop("Month") + "-" + x.pop("Year").astype(str))
.loc[:, ["Date", "Price"]]
)
# Output:
print(out)
Date Price
0 Jan-1997 3.45
4 Feb-1997 2.15
8 Mar-1997 1.89
12 Apr-1997 2.03
16 May-1997 2.25
.. ... ...
7 Feb-2000 2.66
11 Mar-2000 2.79
15 Apr-2000 3.04
19 May-2000 3.59
23 Jun-2000 4.29
[24 rows x 2 columns]

Convert fractional values to decimal in Pandas

I have a dataframe with messy data.
df:
1 2 3
-- ------- ------- -------
0 123/100 221/100 103/50
1 49/100 333/100 223/50
2 153/100 81/50 229/100
3 183/100 47/25 31/20
4 2.23 3.2 3.04
5 2.39 3.61 2.69
I want the fractional values to be converted to decimal with the conversion formula being
e.g:
123/100 = (123/100 + 1) = 2.23
333/100 = (333/100 +1) = 4.33
The calculation is fractional value + 1
And of course leave the decimal values as is.
How can I do it in Pandas and Python?
A simple way to do this is to first define a conversion function that will be applied to each element in a column:
def convert(s):
    if '/' in s:  # is a fraction
        num, den = s.split('/')
        return 1 + (int(num) / int(den))
    else:
        return float(s)
Then use the .apply function to run all elements of a column through this function:
df['1'] = df['1'].apply(convert)
Result:
df['1']:
0 2.23
1 1.49
2 2.53
3 2.83
4 2.23
5 2.39
Then repeat on any other column as needed.
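Repeating the conversion for every column could be sketched with a simple loop. The frame below reuses a few values from the question. The str(s) coercion is an added assumption so that cells that are already numeric do not raise an error:

```python
import pandas as pd

def convert(s):
    s = str(s)  # assumption: tolerate cells that are already numeric
    if "/" in s:  # is a fraction
        num, den = s.split("/")
        return 1 + int(num) / int(den)
    return float(s)

# A few values taken from the question's table.
df = pd.DataFrame({
    "1": ["123/100", "49/100", "2.23"],
    "2": ["221/100", "333/100", "3.2"],
})

# Run every column through the same conversion function.
for col in df.columns:
    df[col] = df[col].apply(convert)

# Round for display, since 1 + 123/100 is subject to floating-point error.
df = df.round(2)
print(df)
```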
If you trust the data in your dataset, the simplest way is to use eval or, better, pd.eval, as suggested by @mozway (in pandas 2.1 and later, DataFrame.map replaces the deprecated applymap):
>>> df.replace(r'(\d+)/(\d+)', r'1+\1/\2', regex=True).applymap(pd.eval)
1 2 3
0 2.23 3.21 3.06
1 1.49 4.33 5.46
2 2.53 2.62 3.29
3 2.83 2.88 2.55
4 2.23 3.20 3.04
5 2.39 3.61 2.69

Dataframe split columns value, how to solve error message?

I have a pandas dataframe with the following columns:
Stock ROC5 ROC20 ROC63 ROCmean
0 IBGL.SW -0.59 3.55 6.57 3.18
0 EHYA.SW 0.98 4.00 6.98 3.99
0 HIGH.SW 0.94 4.22 7.18 4.11
0 IHYG.SW 0.56 2.46 6.16 3.06
0 HYGU.SW 1.12 4.56 7.82 4.50
0 IBCI.SW 0.64 3.57 6.04 3.42
0 IAEX.SW 8.34 18.49 14.95 13.93
0 AGED.SW 9.45 24.74 28.13 20.77
0 ISAG.SW 7.97 21.61 34.34 21.31
0 IAPD.SW 0.51 6.62 19.54 8.89
0 IASP.SW 1.08 2.54 12.18 5.27
0 RBOT.SW 10.35 30.53 39.15 26.68
0 RBOD.SW 11.33 30.50 39.69 27.17
0 BRIC.SW 7.24 11.08 75.60 31.31
0 CNYB.SW 1.14 4.78 8.36 4.76
0 FXC.SW 5.68 13.84 19.29 12.94
0 DJSXE.SW 3.11 9.24 6.44 6.26
0 CSSX5E.SW -0.53 5.29 11.85 5.54
How can I add a new column "Symbol" to the dataframe, containing the stock name without ".SW"?
For example, the result for the first row should be IBGL (from the value IBGL.SW).
The result for the last row should be CSSX5E (from the value CSSX5E.SW).
If I run the following command:
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
Then I receive an error message:
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
How can I solve this problem?
Thanks a lot for your support.
METHOD 1:
You can do a vectorized operation with str.get(0):
df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)
METHOD 2:
You can do another vectorized operation by using expand=True in str.split() and then getting the first column.
df['SYMBOL'] = df['Stock'].str.split('.', expand = True)[0]
METHOD 3:
Or you can write a custom lambda function with apply (for more complex processes). Note, this is slower but good if you have your own UDF.
df['SYMBOL'] = df['Stock'].apply(lambda x:x.split('.')[0])
This is a warning, not an error, which is why your script still runs to completion.
Edit: Given your comments, the issue seems to originate earlier in your code, where new_df is created as a slice of another DataFrame. I suggest making an explicit copy first:
new_df = new_df.copy()
And then assigning the new column with:
new_df.loc[:, 'Symbol'] = new_df['Stock'].str.split('.').str[0]
new_df = new_df.copy()
new_df['Symbol'] = new_df.Stock.str.replace('.SW', '', regex=False)
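To illustrate where the warning typically comes from, here is a minimal sketch. The filtering step is a hypothetical stand-in for whatever created new_df in the asker's code, which is not shown in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Stock": ["IBGL.SW", "EHYA.SW", "CSSX5E.SW"],
    "ROC5": [-0.59, 0.98, -0.53],
})

# Hypothetical earlier step: new_df is derived from df by filtering.
# The explicit .copy() makes new_df an independent DataFrame, so pandas
# no longer has to guess whether a later assignment targets df or a copy.
new_df = df[df["ROC5"] < 1].copy()

# Adding a column now affects only new_df.
new_df["Symbol"] = new_df["Stock"].str.split(".").str[0]

print(new_df["Symbol"].tolist())
```

The original df is left without a Symbol column, which is usually what is intended when working on a filtered subset.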

rearrange dataframe for paired items

I have a dataframe like this:
id Shimmer
P01_S01_a 2.31
P01_S01_b 3.87
P01_S02_a 2.54
P01_S02_b 2.96
P02_S01_a 1.78
P02_S01_b 3.19
P02_S02_1 2.04
P02_S02_2 2.08
and I want to rearrange it to that:
id Shimmer_a Shimmer_b
P01_S01 2.31 3.87
P01_S02 2.54 2.96
P02_S01 1.78 3.19
P02_S02 2.04 2.08
I think it would be good to start with a range loop because the rows always come in pairs, but I don't know how to tell Python to rearrange them.
Use Series.replace with a dictionary of regex patterns, where $ anchors the match to the end of the string, to turn the trailing 1 and 2 into a and b. Then split on the last _ with Series.str.rsplit and n=1, reshape with DataFrame.pivot, and clean up the result with DataFrame.rename_axis and DataFrame.add_prefix. Note that pandas 2.0 and later require keyword arguments for pivot:
df1 = (df.join(df['id'].replace({'1$':'a', '2$':'b'}, regex=True)
                       .str.rsplit('_', expand=True, n=1))
         .pivot(index=0, columns=1, values='Shimmer')
         .rename_axis(index='id', columns=None)
         .add_prefix('Shimmer_')
         .reset_index())
A solution that does not rename the values after the last _ and instead numbers the rows within each group with GroupBy.cumcount:
df1 = (df.assign(a = df['id'].str.rsplit('_', n=1).str[0],
                 g = lambda x: x.groupby('a').cumcount())
         .pivot(index='a', columns='g', values='Shimmer')
         .rename(columns={0:'a', 1:'b'})
         .rename_axis(index='id', columns=None)
         .add_prefix('Shimmer_')
         .reset_index()
       )
print (df1)
id Shimmer_a Shimmer_b
0 P01_S01 2.31 3.87
1 P01_S02 2.54 2.96
2 P02_S01 1.78 3.19
3 P02_S02 2.04 2.08
Combination of str.split and pivot:
temp = df['id'].str.split('_')
df['id'],df['group'] = temp.str[:-1].str.join('_'), temp.str[-1]
df['group'] = df['group'].replace({'1':'a', '2':'b'})
df = df.pivot(index='id', columns='group', values='Shimmer')
df.columns = ['Shimmer_a', 'Shimmer_b']
Shimmer_a Shimmer_b
id
P01_S01 2.31 3.87
P01_S02 2.54 2.96
P02_S01 1.78 3.19
P02_S02 2.04 2.08

extract only certain rows in a dataframe

I have a dataframe like this:
Code Date Open High Low Close Volume VWAP TWAP
0 US_GWA_BTC 2014-04-01 467.28 488.62 467.28 479.56 74,776.48 482.76 482.82
1 GWA_BTC 2014-04-02 479.20 494.30 431.32 437.08 114,052.96 460.19 465.93
2 GWA_BTC 2014-04-03 437.33 449.74 414.41 445.60 91,415.08 432.29 433.28
.
316 MWA_XRP_US 2018-01-19 1.57 1.69 1.48 1.53 242,563,870.44 1.59 1.59
317 MWA_XRP_US 2018-01-20 1.54 1.62 1.49 1.57 140,459,727.30 1.56 1.56
I want to extract only the rows whose Code starts with GWA.
I tried this code but it's not working.
df.set_index("Code").filter(regex='[GWA_]*', axis=0)
Try using startswith:
df[df.Code.str.startswith('GWA')]
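A short sketch of why the original attempt keeps every row, using a few Code values from the question. The anchored str.match variant at the end is an extra alternative, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    "Code": ["US_GWA_BTC", "GWA_BTC", "GWA_BTC", "MWA_XRP_US"],
    "Close": [479.56, 437.08, 445.60, 1.53],
})

# '[GWA_]*' is a character class repeated zero or more times, so it can
# match the empty string in any label and no row is filtered out.
all_rows = df.set_index("Code").filter(regex="[GWA_]*", axis=0)

# startswith keeps only the codes that begin with the literal text 'GWA'.
gwa_rows = df[df["Code"].str.startswith("GWA")]

# str.match anchors the pattern at the start of each string.
gwa_rows_regex = df[df["Code"].str.match(r"GWA")]

print(len(all_rows), gwa_rows["Code"].tolist())
```

US_GWA_BTC is excluded because GWA is not at the start of that code.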
