I'm trying to merge two dataframes: 'new_df' and 'df3'.
new_df contains years and months, and df3 contains years, months and other columns.
I've cast most of the columns as object, and tried to merge them both.
The merge 'works' as doesn't return an error, but my final datafram is all empty, only the year and month columns are correct.
new_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_test 119 non-null datetime64[ns]
1 year 119 non-null object
2 month 119 non-null object
dtypes: datetime64[ns](1), object(2)
df3
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 53 to 1297
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_number 191 non-null object
1 date 191 non-null object
2 year 191 non-null object
3 country 191 non-null object
4 area 191 non-null object
5 location 191 non-null object
6 activity 191 non-null object
7 fatal_y_n 182 non-null object
8 time 172 non-null object
9 species 103 non-null object
10 month 190 non-null object
dtypes: object(11)
I've tried this line of code:
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year','month'])
I was expecting a table with only filled fields in all columns, instead i got the table:
Your issue is with the data types for month and year in both columns - they're of type object which gets a bit weird during the join.
Here's a great answer that goes into depth about converting types to numbers, but here's what the code might look like before joining:
# convert column "year" and "month" of new_df
new_df["year"] = pd.to_numeric(new_df["year"])
new_df["month"] = pd.to_numeric(new_df["month"])
And make sure you do the same with df3 as well.
You may also have a data integrity problem as well - not sure what you're doing before you get those data frames, but if it's casting as an 'Object', you may have had a mix of ints/strings or other data types that got merged together. Here's a good article that goes over Panda Data Types. Specifically, and Object data type can be a mix of strings or other data, so the join might get weird.
Hope that helps!
I suppose, it should be easy question for experienced guys. I want to group records by week' day and to have number of records at particular week-day.
Here is my DataFrame rent_week.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1689 entries, 3 to 1832
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1689 non-null int64
1 createdAt 1689 non-null datetime64[ns]
2 updatedAt 1689 non-null datetime64[ns]
3 endAt 1689 non-null datetime64[ns]
4 timeoutAt 1689 non-null datetime64[ns]
5 powerBankId 1689 non-null int64
6 station_id 1689 non-null int64
7 endPlaceId 1686 non-null float64
8 endStatus 1689 non-null object
9 userId 1689 non-null int64
10 station_name 1689 non-null object
dtypes: datetime64[ns](4), float64(1), int64(4), object(2)
memory usage: 158.3+ KB
Data in 'createdAt' columns looks like "2020-07-19T18:00:27.190010000"
I am trying to add new column:
rent_week['a_day'] = rent_week['createdAt'].strftime('%A')
and receive error back: AttributeError: 'Series' object has no attribute 'strftime'.
Meanwhile, if I write:
a_day = datetime.today()
print(a_day.strftime('%A'))
it shows expected result. In my understanding, a_day and rent_week['a_day'] have similar type datetime.
Even request through:
rent_week['a_day'] = pd.to_datetime(rent_week['createdAt']).strftime('%A')
shows me the same error: no strftime attribute.
I even didn't start grouping my data. What I am expecting in result is a DataFrame with information like:
a_day number_of_records
Monday 101
Tuesday 55
...
Try a_day.dt.strftime('%A') - note the additional .dt on your DataFrame column/Series object.
Background: the "similar" type assumption you make is almost correct. However, as a column could be of many types (numeric, string, datetime, geographic, ...), the methods of the underlying values are typically stored in a namespace to not clutter the already broad API (method count) of the Series type itself. That's why string functions are available only through .str, and datetime functions only available through .dt.
You can make a lambda function for conversion and apply that function to the column of "createdAt" Columns. After this step you can groupby based on your requirement. You can take help from this code:
rent_week['a_day'] = rent_week['createdAt'].apply(lambda x: x.strftime('%A'))
Thank you Quamar and Ojdo for your contribution. I found the problem: it is in index
<ipython-input-41-a42a82727cdd>:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
rent_week['a_day'] = rent_week['createdAt'].dt.strftime('%A')
as soon as I reset index
rent_week.reset_index()
both variants are working as expected!
I have a csv-file called 'filename' and want to read in these data as 64float, except the column 'hour'. I managed it with the pd.read_csv - function and an converter.
df = pd.read_csv("../data/filename.csv",
delimiter = ';',
date_parser = ['hour'],
skiprows = 1,
converters={'column1': lambda x: float(x.replace ('.','').replace(',','.'))})
Now, I have two points:
FIRST:
The delimiter works with ; ,but if I take a look in Notepad to my data, there are ',', not ';'. But if I take ',' I get: 'pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 13, saw 9'
SECOND:
If I want to use the converter for all columns, how can I get this?! What`s the right term?
I try to use dtype = float in the readin-function, but I get 'AttributeError: 'NoneType' object has no attribute 'dtype'' Whats happend? Thats the reasion why I want to managed it with the converter.
Data:
,hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind
offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"
This should work:
In [40]:
# text data
temp=''',hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"'''
# so read the csv, pass params quotechar and the thousands character
df = pd.read_csv(io.StringIO(temp), quotechar='"', thousands=',')
df
Out[40]:
Unnamed: 0 hour PV Wind onshore Wind offshore PV.1 Wind onshore.1 \
0 0 1 0 12985.0 9614.0 0 32825.5
1 1 2 0 12908.9 9290.8 0 36052.3
2 2 3 0 12740.9 8886.9 0 38540.9
3 3 4 0 12485.3 8644.5 0 40734.0
4 4 5 0 11188.5 8079.0 0 42688.0
5 5 6 0 11219.0 7594.2 0 43333.5
Wind offshore.1 PV.2 Wind onshore.2 Wind offshore.2
0 9495.7 0 13110.3 10855.5
1 9589.1 0 13670.2 10828.6
2 10087.3 0 14610.8 10828.6
3 10087.3 0 15638.3 10343.7
4 10087.3 0 16809.4 10343.7
5 10025.0 0 18266.9 10343.7
In [41]:
# check the dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 11 columns):
Unnamed: 0 6 non-null int64
hour 6 non-null int64
PV 6 non-null float64
Wind onshore 6 non-null float64
Wind offshore 6 non-null float64
PV.1 6 non-null float64
Wind onshore.1 6 non-null float64
Wind offshore.1 6 non-null float64
PV.2 6 non-null float64
Wind onshore.2 6 non-null float64
Wind offshore.2 6 non-null float64
dtypes: float64(9), int64(2)
memory usage: 576.0 bytes
So basically you need to pass the quotechar='"' and thousands=',' params to read_csv to achieve what you want, see the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
EDIT
If you want to convert after importing (which is a waste when you can do it upfront) then you can do this for each column of interest:
In [43]:
# replace the comma separator
df['Wind onshore'] = df['Wind onshore'].str.replace(',','')
# convert the type
df['Wind onshore'] = df['Wind onshore'].astype(np.float64)
df['Wind onshore'].dtype
Out[43]:
dtype('float64')
It would be faster to replace the comma separator on all the columns of interest first and just call convert_objects like so: df.convert_objects(convert_numeric=True)