Getting the columns of a pandas series - python

I have a pandas.core.series.Series like this:
140228202800 25
130422174258 5
131213194708 3
130726171426 1
I would like to get the first column and second column separately
Column 1:
140228202800
130422174258
131213194708
130726171426
Column 2:
25
5
3
1
I tried the following but no luck.
my_series.iloc[:,0]
my_series.loc[:,0]
my_series[:,0]

The first "column" is the index you can get it using s.index or s.index.to_list() to get obtain it as a list.
To get the series values as a list use s.to_list and in order to get it as a numpy array use s.values.
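For example, rebuilding the series from the question (a minimal sketch):
import pandas as pd
my_series = pd.Series([25, 5, 3, 1],
                      index=[140228202800, 130422174258, 131213194708, 130726171426])
col1 = my_series.index.to_list()  # [140228202800, 130422174258, 131213194708, 130726171426]
col2 = my_series.to_list()        # [25, 5, 3, 1]
arr = my_series.values            # the same values as a numpy array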

Related

Index returned by np.argmax of a series within a dataframe slice points to wrong value when used as index into same dataframe

I have a dataframe created from collected sampled data. I then manipulate the dataframe to remove duplicates, sort, and remove saturated values:
df = pd.read_csv(path + newfilename, header=0, usecols=[0,1,2,3,5,7,10],
                 names=['ch1_real', 'ch1_imag', 'ch2_real', 'ch2_imag',
                        'ch1_log_mag', 'ch1_phase', 'ch2_log_mag', 'ch2_phase',
                        'pr_sample_real', 'pr_sample_imag', 'distance'])
tmp = df.drop_duplicates(subset='distance', keep='first').copy()
tmp.sort_values("distance", inplace=True)
dfUnique = tmp[tmp.distance < 65000].copy()
I also add two calculated columns (with help from @Stef):
dfUnique['ch1_log_mag'] = 20*np.log10((dfUnique.ch1_real + 1j*dfUnique.ch1_imag).abs())
dfUnique['ch2_log_mag'] = 20*np.log10((dfUnique.ch2_real + 1j*dfUnique.ch2_imag).abs())
The problem arises when I try to find the index of the maximum magnitude. It turns out (unexpectedly, to me) that dataframes keep their original indices. So, after sorting and removing rows, the index of a given row is not its position in the new, ordered dataframe, but its row index within the original dataframe:
ch1_real ch1_imag ch2_real ... distance ch1_log_mag ch2_log_mag
79 0.011960 -0.003418 0.005127 ... 0.0 -38.104414 -33.896518
78 -0.009766 -0.005371 -0.015870 ... 1.0 -39.058001 -34.533870
343 0.002197 0.010990 0.003662 ... 2.0 -39.009865 -37.278737
80 -0.002686 0.010740 0.011960 ... 3.0 -39.116435 -34.902513
341 -0.007080 0.009033 0.016600 ... 4.0 -38.803434 -35.582833
81 -0.004883 -0.008545 -0.016850 ... 12.0 -40.138523 -35.410047
83 -0.009277 0.004883 -0.000977 ... 14.0 -39.589769 -34.848170
84 0.006592 -0.010250 -0.009521 ... 27.0 -38.282239 -33.891250
85 0.004395 0.010010 0.017580 ... 41.0 -39.225735 -34.890353
86 -0.007812 -0.005127 -0.015380 ... 53.0 -40.589187 -35.625615
When I then use:
np.argmax(dfUnique.ch1_log_mag)
to find the index of maximum magnitude, this returns the index in the new ordered dataframe series. But, when I use this to index into the dataframe to extract other values in that row, I get elements from the original dataframe at that row index.
I exported the dataframe to Excel to more easily observe what was happening. Column 1 is the dataframe index. Notice that it is different from the row number on the spreadsheet.
The np.argmax command above returns 161. If I look at the new ordered dataframe, index 161 is this row highlighted below (data starts on row two in the spreadsheet, and indices start at 0 in python):
and is correct. However, per the original dataframe's order, this row was at index 238. When I then try to access ch1_log_mag[161],
dfUnique.ch1_log_mag[161]
I get -30.9759, instead of -11.453. It grabbed the value using 161 as the index into original dataframe:
This is pretty scary: two functions use two different reference frames (at least to a novice python user). How do I avoid this? (How) do I reindex the dataframe? Or should I be using an equivalent pandas way of finding the maximum in a series within a dataframe (assuming the issue is due to how pandas and numpy operate on data)? Is the issue the way I'm creating copies of the dataframe?
If you sort a dataframe, it preserves indices.
import numpy as np
import pandas as pd
a = pd.DataFrame(np.random.randn(24).reshape(6,4), columns=list('abcd'))
a.sort_values(by='d', inplace=True)
print(a)
>>>
a b c d
2 -0.553612 1.407712 -0.454262 -1.822359
0 -1.046893 0.656053 1.036462 -0.994408
5 -0.772923 -0.554434 -0.254187 -0.948573
4 -1.660773 0.291029 1.785757 -0.457495
3 0.128831 1.399746 0.083545 -0.101106
1 -0.250536 -0.045355 0.072153 1.871799
In order to reset index, you can use .reset_index(drop=True):
b = a.sort_values(by='d').reset_index(drop=True)
print(b)
>>>
a b c d
0 -0.553612 1.407712 -0.454262 -1.822359
1 -1.046893 0.656053 1.036462 -0.994408
2 -0.772923 -0.554434 -0.254187 -0.948573
3 -1.660773 0.291029 1.785757 -0.457495
4 0.128831 1.399746 0.083545 -0.101106
5 -0.250536 -0.045355 0.072153 1.871799
To find the original index of the max value, you can use .idxmax() and then .loc[]:
ix_max = a.d.idxmax()
# or ix_max = np.argmax(a.d)  # caution: in modern pandas this returns the position, not the label
print(f"ix_max = {ix_max}")
a.loc[ix_max]
>>>
ix_max = 1
a -0.250536
b -0.045355
c 0.072153
d 1.871799
Name: 1, dtype: float64
Or, if you prefer to work with positions in the new order, you can use .iloc:
iix = np.argmax(a.d.values)
print(f"iix = {iix}")
print(a.iloc[iix])
>>>
iix = 5
a -0.250536
b -0.045355
c 0.072153
d 1.871799
Name: 1, dtype: float64
You can have a look at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
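Applied to the question's dataframe, a sketch reusing the names from the question; the key is to pick one reference frame (labels or positions) and stay in it:
import numpy as np
# label-based: idxmax returns an index label, .loc looks up by label
ix = dfUnique['ch1_log_mag'].idxmax()
row = dfUnique.loc[ix]
# position-based: np.argmax on the raw values returns a position, .iloc looks up by position
pos = np.argmax(dfUnique['ch1_log_mag'].values)
row = dfUnique.iloc[pos]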

How do I create a DataFrame from a list so that the list will be shown as a column and not as one single row?

Using a Jupyter notebook. I have scraped some data from the web, which I named "graphValues". The type of graphValues is a list and the values within the list are strings. I would like to put the data of graphValues as a column in a new dataframe named "data". When I do this, the dataframe contains only one single element, which is the graphValues list showing as a row, not a column:
data=pd.DataFrame([graphValues])
print(data)
output:
0 [10,0,0,2,0,3,2,4,4,14,11,20,12,18,43,50,20,80...
Something else I tried is putting graphValues in a dictionary as follows:
code:
graphValues2={'Graph Val': graphValues}
data=pd.DataFrame(graphValues2)
print(data)
This gave an error saying:
ValueError: If using all scalar values, you must pass an index
but if I add an index of length x, the df will just contain the same list x times (x being the length of graphValues, of course).
How can I get the following output? Is there a way without a for loop? What is the most efficient way?
Wanted output:
0 10
1 0
2 0
3 2
4 0
: :
: :
Do not use print. In Jupyter, just evaluate the variable on its own, like this:
data=pd.DataFrame(graphValues.split(','))
data
Or, if graphValues is already a list (rather than a comma-separated string), use this:
data = pd.DataFrame(graphValues)
data
>>> graphValues="10,0,0,2,0,3".split(",")
>>> data=pd.DataFrame(graphValues)
>>> data
0
0 10
1 0
2 0
3 2
4 0
5 3
pd.DataFrame(['r0c0','r0c1','r0c2']) creates a single column. Add an outer list and pandas treats each inner list as a row (pd.DataFrame([['r0c0','r0c1','r0c2'], ['r1c0','r1c1','r1c2']])). Since graphValues was already a list, pd.DataFrame([graphValues]) was doing that second thing: one row.
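If graphValues actually arrived as one comma-separated string (which would also explain the ValueError from the dict attempt, since a dict of scalars needs an index), splitting it first makes the dict form work and gives the column a name. A sketch under that assumption:
import pandas as pd
graphValues = "10,0,0,2,0,3"                 # assumed: scraped as one string
values = graphValues.split(',')              # now a real list of strings
data = pd.DataFrame({'Graph Val': values})   # dict-of-list -> one named column
print(data)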

How to store 2D matrix values in a variable?

import numpy as np
import pandas as pd

files = np.load('arr.npy', allow_pickle=True)
# print(files)
data = pd.DataFrame(files)
type(data)
rr = data.shape[0]
for i in range(rr):
    res = data[0][i]
After running this, the res variable contains only the last element, but I want all the values. How do I store all the 2D matrix values in Python? The data variable is the dataframe; it contains 9339 rows and 2 columns. I want the 1st column, where each entry is a 32x32 matrix. How do I store those values in the res variable?
Notice that res = data[0][i] initializes a new variable res on the first iteration of the loop (when i is 0), but then keeps reassigning it to the value in the next row (staying in column 0), so only the last value survives.
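If the goal is simply to keep every value rather than only the last one, a minimal fix (reusing the question's variable names) is to accumulate into a list:
res = []
for i in range(rr):
    res.append(data[0][i])
# or, without an explicit loop:
res = data[0].tolist()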
I'm not sure exactly what you want, but it sounds like you just want the first column in a separate variable? Here is how to get the first column, as a pandas series and/or a plain list, with a smaller example (9 rows and 2 columns):
import numpy as np
import pandas as pd
random_data = np.random.rand(9,2)
data_df = pd.DataFrame(random_data)
print(data_df)
# this gets the first column as a pandas series. Change index from 0 to get another column.
print('\nfirst column:')
first_col = data_df[data_df.columns[0]]
print(first_col)
# if you want a plain list instead of a series
print('\nfirst column as list:')
print(first_col.tolist())
Output:
0 1
0 0.218237 0.323922
1 0.806697 0.371456
2 0.526571 0.993491
3 0.403947 0.299652
4 0.753333 0.542269
5 0.365885 0.534462
6 0.404583 0.514687
7 0.298897 0.637910
8 0.453891 0.234333
first column:
0 0.218237
1 0.806697
2 0.526571
3 0.403947
4 0.753333
5 0.365885
6 0.404583
7 0.298897
8 0.453891
Name: 0, dtype: float64
first column as list:
[0.21823726509923325, 0.8066974875381492, 0.526571422644495, 0.40394686954663594, 0.7533330239460391, 0.36588470364914194, 0.4045827678891364, 0.2988970490642284, 0.45389073978613426]

Indexing a Pandas Dataframe using the index of a Series

I have a TimeSeries and I want to extract the first three elements and use them to create a row of a pandas DataFrame with three columns. I can do this easily using a dictionary, for example. The problem is that I would like the index of this one-row DataFrame to be the datetime index of the first element of the series. Here I fail.
For a reproducible example:
CRM
Date
2018-08-30 0.000442
2018-08-29 0.005923
2018-08-28 0.004782
2018-08-27 0.003243
pd.DataFrame({'Reg_Coef_5_1': ts1.iloc[0][0],
              'Reg_Coef_5_2': ts1.shift(-5).iloc[0][0],
              'Reg_Coef_5_3': ts1.shift(-10).iloc[0][0]}, index=ts1.iloc[0].index)
I get:
Reg_Coef_5_1 Reg_Coef_5_2 Reg_Coef_5_3
CRM 0.000442 0.001041 -0.00035
Instead I would like the index to be '2018-08-30' a datetime object.
If I understand you correctly, you would like the index to be a date object instead of "CRM" as it is in your example. Just set the index accordingly: index = [ts1.index[0]] instead of index = ts1.iloc[0].index.
df = pd.DataFrame({'Reg_Coef_5_1': ts1.iloc[0][0],
                   'Reg_Coef_5_2': ts1.shift(-5).iloc[0][0],
                   'Reg_Coef_5_3': ts1.shift(-10).iloc[0][0]}, index=[ts1.index[0]])
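A small variant: slicing the index, rather than wrapping the label in a list, keeps the original one-element DatetimeIndex without relying on type inference (a sketch, assuming ts1 is datetime-indexed as in the example):
df = pd.DataFrame({'Reg_Coef_5_1': ts1.iloc[0][0],
                   'Reg_Coef_5_2': ts1.shift(-5).iloc[0][0],
                   'Reg_Coef_5_3': ts1.shift(-10).iloc[0][0]}, index=ts1.index[:1])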
But as user10300706 has said, there might be a better way to do what you want, ultimately.
If you're simply trying to recover the index position then do:
index = ts1.index[0]
I would note that if you are shifting your dataframe up incrementally (5/10 respectively), the indexes won't align. I assume, however, that you're trying to build out some lagging indicator.

I need to create a python list object, or any object, out of a pandas DataFrame object grouping pieces of values from different rows

My DataFrame has a string in the first column, and a number in the second one:
GEOSTRING IDactivity
9 wydm2p01uk0fd2z 2
10 wydm86pg6r3jyrg 2
11 wydm2p01uk0fd2z 2
12 wydm80xfxm9j22v 2
39 wydm9w92j538xze 4
40 wydm8km72gbyuvf 4
41 wydm86pg6r3jyrg 4
42 wydm8mzt874p1v5 4
43 wydm8mzmpz5gkt8 5
44 wydm86pg6r3jyrg 5
45 wydm8w1q8bjfpcj 5
46 wydm8w1q8bjfpcj 5
What I want to do is manipulate this DataFrame so that I end up with a list object that contains, for each different "IDactivity" value, a string made out of the 5th character of each "GEOSTRING" value.
So in this case, I have 3 different "IDactivity" values, and I will have in my list object 3 strings that look like this:
['2828', '9888', '8888']
where, again, the characters you see in each string are the 5th character of each "GEOSTRING" value.
What I'm asking for is a solution, or an approach, that doesn't involve a too-complicated for loop and is as efficient as possible, since I have to manipulate lots of data. I'd like it to be clean and fast.
I hope it's clear enough.
This can be done easily as a one-liner (and it's considered to be pretty fast, too):
result = df.groupby('IDactivity')['GEOSTRING'].apply(lambda x:''.join(x.str[4])).tolist()
This groups the dataframe by the values of IDactivity, then selects the 5th character (index 4) from each corresponding GEOSTRING string and joins them together. Finally, we add the tolist() method to get the output as a list rather than a pandas Series.
output:
['2828', '9888', '8888']
Documentation:
pandas.groupby
pandas.apply
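For reference, a self-contained sketch that rebuilds the question's data and runs the one-liner:
import pandas as pd

df = pd.DataFrame({
    'GEOSTRING': ['wydm2p01uk0fd2z', 'wydm86pg6r3jyrg', 'wydm2p01uk0fd2z',
                  'wydm80xfxm9j22v', 'wydm9w92j538xze', 'wydm8km72gbyuvf',
                  'wydm86pg6r3jyrg', 'wydm8mzt874p1v5', 'wydm8mzmpz5gkt8',
                  'wydm86pg6r3jyrg', 'wydm8w1q8bjfpcj', 'wydm8w1q8bjfpcj'],
    'IDactivity': [2, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5],
})
result = df.groupby('IDactivity')['GEOSTRING'].apply(lambda x: ''.join(x.str[4])).tolist()
print(result)  # ['2828', '9888', '8888']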
Here's a solution involving a temp column, and taking inspiration for the key operation from this answer:
# create a temp column with the character we want from each string
dframe['Temp'] = dframe['GEOSTRING'].apply(lambda x: x[4])
# groupby ID and then concatenate using a sneaky call to .sum()
dframe.groupby('IDactivity')['Temp'].sum().tolist()
Result:
['2828', '9888', '8888']
