Trying to change a column from a Series type to a list.
I tried converting it directly to a list, but it still comes up as a Series when I check its type.
First I grab the first 4 digits so I can have just the year, then I create a new column in the table called year to hold that new data.
year = df['date'].str.extract(r'^(\d{4})')
df['year'] = pd.to_numeric(year)
df['year'].dtype
print(type(df['year']))
Want the type of 'year' to be a list. Thanks!
If you want to get a list of the year values from the date column, you could try this:
import pandas as pd
df = pd.DataFrame({'date':['2019/01/02', '2018/02/03', '2017/03/04']})
year = df.date.str.extract(r'(\d{4})')[0].to_list()
print(f'type: {type(year)}: {year}')
# type: <class 'list'>: ['2019', '2018', '2017']
df.date.str.extract returns a new DataFrame with one row for each subject string and one column for each capture group; we then take the first (and only) group with [0] and convert it to a plain list with .to_list()
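To make the intermediate step visible, here is a quick sketch (same sample frame as above) that prints what extract returns before the [0] selection:

import pandas as pd

df = pd.DataFrame({'date': ['2019/01/02', '2018/02/03', '2017/03/04']})
extracted = df.date.str.extract(r'(\d{4})')
print(type(extracted))  # <class 'pandas.core.frame.DataFrame'>
print(extracted)
#       0
# 0  2019
# 1  2018
# 2  2017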
It seems pretty straightforward to turn a series into a list. The builtin list function works fine:
> df = pd.DataFrame({'date':['2019/01/02', '2018/02/03', '2017/03/04']})
> dates = list(df['date'])
> type(dates)
< <class 'list'>
> dates
< ['2019/01/02', '2018/02/03', '2017/03/04']
I can see from here how to iterate through a list of dates from a datetime index. However, I would like to define the range of dates using:
my_df['Some_Column'].first_valid_index()
and
my_df['Some_Column'].last_valid_index()
My attempt looks like this:
for today_index, values in range(my_df['Some_Column'].first_valid_index(), my_df['Some_Column'].last_valid_index()):
    print(today_index)
However I get the following error:
TypeError: 'Timestamp' object cannot be interpreted as an integer
How do I inform the loop to restrict to those specific dates?
I think you need date_range:
s = my_df['Some_Column'].first_valid_index()
e = my_df['Some_Column'].last_valid_index()
r = pd.date_range(s, e)
And for the loop, use:
for val in r:
    print(val)
If you need to select rows in the DataFrame:
df1 = df.loc[s:e]
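For completeness, here is a self-contained sketch of the whole approach; the frame, the NaN padding, and the dates are made up for illustration:

import numpy as np
import pandas as pd

# A date-indexed column with NaNs before and after the valid span
idx = pd.date_range('2023-01-01', periods=6)
my_df = pd.DataFrame({'Some_Column': [np.nan, 1.0, 2.0, 3.0, np.nan, np.nan]}, index=idx)

s = my_df['Some_Column'].first_valid_index()  # 2023-01-02
e = my_df['Some_Column'].last_valid_index()   # 2023-01-04

for val in pd.date_range(s, e):
    print(val)

df1 = my_df.loc[s:e]  # the rows between the first and last valid index
print(df1)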
I'm constructing a DataFrame like so:
import datetime
import pandas as pd

dates = [datetime.datetime.today() + datetime.timedelta(days=x) for x in range(0, 2)]
d = pd.DataFrame([[1,2],[3,4]], index=dates, columns=['a', 'b'])
I want to get a value like so:
d[d.index[0]]['a']
But I get the following error:
KeyError: Timestamp('2018-04-26 16:08:16.120031')
How come?
If you are trying to get the first element from column 'a', you access it like this:
d.loc[d.index[0], 'a']
The way you have it written now, d[d.index[0]] is trying to get a column with name d.index[0].
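A minimal reproduction of the difference, reusing the construction from the question:

import datetime
import pandas as pd

dates = [datetime.datetime.today() + datetime.timedelta(days=x) for x in range(2)]
d = pd.DataFrame([[1, 2], [3, 4]], index=dates, columns=['a', 'b'])

# d[d.index[0]] looks for a *column* named after the first timestamp -> KeyError
# .loc takes the row label first, then the column
print(d.loc[d.index[0], 'a'])  # 1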
It depends what you want to do.
If you just want the first row, you could access it with iloc:
d.iloc[0]['a']
If you want to filter the dataframe for example by the year, you could do:
d.loc[d.index.year == 2018, 'a']
d['a'][d.index[0]]
My confusion came from the fact that a DataFrame is column-first, not row-first as one would expect from general multi-dimensional data structures. So in order to get the value, one must switch the indices:
dataFrame[column][row]
Thanks @Michael for the hint.
First of all, you always have to know the type of data you are dealing with; in your case, you create a DatetimeIndex:
DatetimeIndex(['2020-08-25 11:00:00.000307403',
'2020-08-25 11:00:00.000558638',
'2020-08-25 11:00:00.002280412',
'2020-08-25 11:00:00.002440933'])
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
and inside the DatetimeIndex, each element of it is a Timestamp:
2020-08-25 11:00:00.000307403
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
As you are working with DatetimeIndexes, you have to index the values by the actual timestamp string ('2020-08-25 11:00:00.000307403') and not by the full Timestamp repr (Timestamp('2020-08-25 11:00:00.000307403')).
So, instead of doing:
df[Timestamp('2020-08-25 11:00:00.000307403')]
you should do:
df['2020-08-25 11:00:00.000307403']
I lost about two hours catching this error; since it is a bit confusing that the repr includes the type name as part of the value, the easiest way to solve this is just to convert the timestamp to a string.
For your solution:
d.loc[str(d.index[0]), 'a']
should work
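A short sketch of the string-based indexing; the two-row Series here is made up to stand in for your data:

import pandas as pd

idx = pd.to_datetime(['2020-08-25 11:00:00.000307403',
                      '2020-08-25 11:00:00.000558638'])
s = pd.Series([10, 20], index=idx)

# Index by the string form of the first timestamp, not the Timestamp repr
print(s[str(s.index[0])])  # 10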
I'm using pandas 0.20.3 with Python 3.x. I want to add a column to one pandas DataFrame from another pandas DataFrame. Both DataFrames contain 51 rows. So I used the following code:
class_df['phone']=group['phone'].values
I got following error message:
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
class_df.dtypes gives me:
Group_ID object
YEAR object
Terget object
phone object
age object
and type(group['phone']) returns pandas.core.series.Series
Can you suggest what changes I need to make to remove this error?
The first 5 rows of group['phone'] are given below:
0 [735015372, 72151508105, 7217511580, 721150431...
1 []
2 [735152771, 7351515043, 7115380870, 7115427...
3 [7111332015, 73140214, 737443075, 7110815115...
4 [718218718, 718221342, 73551401, 71811507...
Name: phoen, dtype: object
In most cases, this error comes up when you end up with an empty DataFrame. The best approach that worked for me was to check if the DataFrame is empty first before using apply():
if len(df) != 0:
    df['indicator'] = df.apply(assign_indicator, axis=1)
You have a column of ragged lists. Your only option is to assign a list of lists, and not an array of lists (which is what .values gives).
class_df['phone'] = group['phone'].tolist()
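A sketch under the same assumptions, i.e. two frames with matching row counts where group['phone'] holds lists of varying length:

import pandas as pd

class_df = pd.DataFrame({'Group_ID': ['g1', 'g2', 'g3']})
group = pd.DataFrame({'phone': [[735015372, 72151508105], [], [7111332015]]})

# .tolist() yields a plain list of lists, which assigns cleanly
class_df['phone'] = group['phone'].tolist()
print(class_df)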
The error from the question's title,
"ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series"
can also occur if, for whatever reason, the DataFrame does not have any rows.
Instead of using an if-statement, you can set the result_type argument of apply() to 'reduce':
df['new_column'] = df.apply(func, axis=1, result_type='reduce')
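A sketch of the zero-row case (the empty frame and the per-row func here are made up): without result_type='reduce', apply over an empty frame returns an empty DataFrame, and assigning that to a column raises the ValueError; with 'reduce' it returns an empty Series instead, so the assignment goes through.

import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])  # no rows

func = lambda row: row['a']  # hypothetical per-row function

df['new_column'] = df.apply(func, axis=1, result_type='reduce')
print(df)  # still empty, but no ValueError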
The data assigned to a column of a DataFrame must be a one-dimensional array. For example, consider a num_arr to be added to a DataFrame:
num_arr.shape
(1, 126)
For this num_arr to be added as a DataFrame column, it should be reshaped:
num_arr = num_arr.reshape(-1, )
num_arr.shape
(126,)
Now I can set this array as a DataFrame column:
df = pd.DataFrame()
df['numbers'] = num_arr
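Put together as a runnable sketch (the array contents are made up):

import numpy as np
import pandas as pd

num_arr = np.arange(126).reshape(1, 126)  # shape (1, 126)
num_arr = num_arr.reshape(-1)             # flattened to shape (126,)

df = pd.DataFrame()
df['numbers'] = num_arr
print(df.shape)  # (126, 1)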
I am new to python. I am trying to subset data from a pandas dataframe using values present in a list. Below is a simple example of what I am trying to do.
import pandas as pd
# Create dataframe df which contains only one column having weekdays as values
df = pd.DataFrame({'days':['monday','tuesday','wednesday','thursday','friday']})
# A list containing all seven days of a week
day_list = ['monday','tuesday','wednesday','thursday','friday','saturday','sunday']
# Create a new dataframe which should contain values present in list but missing in dataframe
df1 = df[~df.days.isin(day_list)]
# Output shows empty dataframe
Empty DataFrame
Columns: [days]
Index: []
# This gives error
df2 = df[~day_list.isin(df.days)]
# output from df2 code execution
df2 = df[~day_list.isin(df.days)]
AttributeError: 'list' object has no attribute 'isin'
In R, I can easily get this result using the below condition.
# Code from R
df1 <- day_list[!(day_list %in% df$days)]
I want to create a new dataframe which contains only those values present in the list day_list but not present in df.days. In this case, it should return 'saturday' and 'sunday' as output. How can I get this result? I have looked at the solution provided in this thread - How to implement 'in' and 'not in' for Pandas dataframe. But it is not solving my problem. Any guidance on this to a Python 3.x newbie would really be appreciated.
I believe you need numpy.setdiff1d with the DataFrame constructor:
import numpy as np

df1 = pd.DataFrame({'all_days': np.setdiff1d(day_list, df['days'])})
print(df1)
all_days
0 saturday
1 sunday
Another solution is to convert the list to a pandas structure like a Series or DataFrame and then use isin:
s = pd.Series(day_list)
s1 = s[~s.isin(df['days'])]
print(s1)
5 saturday
6 sunday
dtype: object
The DataFrame variant works the same way:
df2 = pd.DataFrame({'all_days': day_list})
df1 = df2[~df2['all_days'].isin(df['days'])]
print(df1)
all_days
5 saturday
6 sunday
I have a pandas dataframe, and one of the columns has date values as strings (like "2014-01-01"). I would like to define a different list for each year that is present in the column, where the elements of the list are the index of the row in which the year is found in the dataframe.
Here's what I've tried:
import pandas as pd
df = pd.DataFrame(["2014-01-01","2013-01-01","2014-02-02", "2012-08-09"])
df = df.values.flatten().tolist()
for i in range(len(df)):
    df[i] = df[i][0:4]

y2012 = []; y2013 = []; y2014 = []
for i in range(len(df)):
    if df[i] == "2012":
        y2012.append(i)
    elif df[i] == "2013":
        y2013.append(i)
    else:
        y2014.append(i)

print(y2014)  # [0, 2]
print(y2013)  # [1]
print(y2012)  # [3]
Does anyone know a better way of doing this? This way works fine, but I have a lot of years, so I have to manually define each variable and then run it through the for loop, and so the code gets really long. I was trying to use groupby in pandas, but I couldn't seem to get it to work.
Thank you so much for any help!
Scan through the original DataFrame values and parse out the year. Given that, add the index into a defaultdict. That is, the following code creates a dict with one item per year; the value for a specific year is a list of the rows in which that year is found in the dataframe.
A defaultdict sounds scary, but it's just a dictionary. In this case, each value is a list. If we append to a nonexistent value, then it gets spontaneously created. Convenient!
source
from collections import defaultdict
import pandas as pd
df = pd.DataFrame(["2014-01-01","2013-01-01","2014-02-02", "2012-08-09"])
# df = df.values.flatten().tolist()
dindex = defaultdict(list)
for index, dateval in enumerate(df.values):
    year = dateval[0].split('-')[0]
    dindex[year].append(index)

assert dindex == {'2014': [0, 2], '2013': [1], '2012': [3]}
print(dindex)
output
defaultdict(<class 'list'>, {'2014': [0, 2], '2013': [1], '2012': [3]})
Pandas is awesome for this kind of thing, so don't be so hasty to turn your dataframe back into lists right away.
The trick here lies in the .apply() method and the .groupby() method.
Take a dataframe that has strings with ISO formatted dates in it
Parse the column containing the date strings into datetime objects
Create another column of years using the datetime.year attribute of the items in the datetime column
Group the dataframe by the new year column
Iterate over the groupby object and extract your column
Here's some code for you to play with and grok:
import pandas as pd
import dateutil.parser

df = pd.DataFrame({'strings': ["2014-01-01","2013-01-01","2014-02-02", "2012-08-09"]})
df['datetimes'] = df['strings'].apply(dateutil.parser.parse)
df['year'] = df['datetimes'].apply(lambda x: x.year)
grouped_data = df.groupby('year')
lists_by_year = {}
for year, data in grouped_data:
    lists_by_year[year] = list(data['strings'])
Which gives us a dictionary of lists, where the key is the year and the contents is a list of strings with that year.
print(lists_by_year)
{2012: ['2012-08-09'],
2013: ['2013-01-01'],
2014: ['2014-01-01', '2014-02-02']}
As it turns out,
df.groupby('A')  # is just syntactic sugar for df.groupby(df['A'])
This means that all you have to do to group by year is leverage the apply function and rework the syntax:
Solution
getYear = lambda x: x.split("-")[0]
yearGroups = df.groupby(df["dates"].apply(getYear))
Output
for key, group in yearGroups:
    print(key)
2012
2013
2014
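As a follow-up sketch for the original question (per-year lists of row indices), the .groups attribute of the groupby object gives that mapping directly; the 'dates' column name is assumed as in the solution above:

import pandas as pd

df = pd.DataFrame({'dates': ["2014-01-01", "2013-01-01", "2014-02-02", "2012-08-09"]})
yearGroups = df.groupby(df['dates'].apply(lambda x: x.split('-')[0]))

# Maps each year to the row labels in that group,
# e.g. {'2012': [3], '2013': [1], '2014': [0, 2]}
print(yearGroups.groups)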