I have two spreadsheets in CSV format that I've been provided with for my master's course at uni.
Part of the data processing involved merging the files and then running some reports off the merged content using dates. This I've completed successfully, however...
I'm led to believe the current date format is epoch; for example, the first date on the spreadsheet is 43471.
First, I ran this code to check what format the dates were stored in:
import pandas as pd
df = pd.read_csv('bookloans_merged.csv')
df.info()
This returned the result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Number 1958 non-null int64
1 Title 1958 non-null object
2 Author 1854 non-null object
3 Genre 1958 non-null object
4 SubGenre 1958 non-null object
5 Publisher 1845 non-null object
6 member_number 1958 non-null int64
7 date_of_loan 1958 non-null int64
8 date_of_return 1958 non-null int64
dtypes: int64(4), object(5)
memory usage: 137.8+ KB
I then ran the following code:
# parsing date values
df = pd.read_csv('bookloans_merged.csv')
df[['date_of_loan','date_of_return']] = df[['date_of_loan','date_of_return']].apply(pd.to_datetime, format='%Y-%m-%d %H:%M:%S.%f')
df.to_csv('bookloans_merged_dates.csv', index=False)
Running this again:
pd.read_csv('bookloans_merged_dates.csv')
df.info()
I get:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Number 1958 non-null int64
1 Title 1958 non-null object
2 Author 1854 non-null object
3 Genre 1958 non-null object
4 SubGenre 1958 non-null object
5 Publisher 1845 non-null object
6 member_number 1958 non-null int64
7 date_of_loan 1958 non-null datetime64[ns]
8 date_of_return 1958 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(2), object(5)
memory usage: 137.8+ KB
So I can see that date_of_loan and date_of_return are now datetime64.
The trouble is, all the dates now show as 1970-01-01 00:00:00.000043471.
How do I get to 01/03/2019 format please?
Thanks
David.
So I managed to get this figured out, with a little help. Here is the answer
from datetime import datetime
excel_date = 43139
d_time = datetime.fromordinal(datetime(1900, 1, 1).toordinal() + excel_date - 2)
t_time = d_time.timetuple()
print(d_time)
print(t_time)
Here is how I used that premise in my program:
import pandas as pd
# data_frame is the merged book-loans data loaded earlier (not shown here)
df1 = pd.DataFrame(data_frame, columns=['Title','Author','date_of_loan'])
# Excel serial dates count days from 1899-12-30, which matches the "- 2" adjustment above
df1['date_of_loan'] = pd.to_datetime(df1['date_of_loan'], unit='D', origin='1899-12-30')
df1 = df1.sort_values('date_of_loan', ascending=True)
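A minimal sketch of the same conversion on a small made-up frame, assuming the integers are Excel serial day numbers rather than Unix epoch values. Without unit and origin, pd.to_datetime reads an integer as nanoseconds since 1970-01-01, which is where 1970-01-01 00:00:00.000043471 came from. The loans frame, the titles and the loan_text column are hypothetical names for illustration; the two serial numbers are the ones quoted above.

```python
import pandas as pd

# Hypothetical sample rows; only the two serial numbers come from the post.
loans = pd.DataFrame(
    {"Title": ["Book A", "Book B"], "date_of_loan": [43471, 43139]}
)

# Interpret the integers as whole days counted from 1899-12-30,
# the origin that lines up with Excel's serial dates.
loans["date_of_loan"] = pd.to_datetime(
    loans["date_of_loan"], unit="D", origin="1899-12-30"
)

# Keep the datetime column for sorting and reports;
# build a day/month/year string only for display.
loans["loan_text"] = loans["date_of_loan"].dt.strftime("%d/%m/%Y")

print(loans.sort_values("date_of_loan"))
```

The column stays datetime64 for sorting and date-range reports; strftime is only needed when the dates are displayed or exported as text.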
I'm trying to merge two dataframes: 'new_df' and 'df3'.
new_df contains years and months, and df3 contains years, months and other columns.
I've cast most of the columns to object and tried to merge the two.
The merge 'works' in that it doesn't return an error, but my final dataframe is all empty; only the year and month columns are correct.
new_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_test 119 non-null datetime64[ns]
1 year 119 non-null object
2 month 119 non-null object
dtypes: datetime64[ns](1), object(2)
df3
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 53 to 1297
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_number 191 non-null object
1 date 191 non-null object
2 year 191 non-null object
3 country 191 non-null object
4 area 191 non-null object
5 location 191 non-null object
6 activity 191 non-null object
7 fatal_y_n 182 non-null object
8 time 172 non-null object
9 species 103 non-null object
10 month 190 non-null object
dtypes: object(11)
I've tried this line of code:
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year','month'])
I was expecting a table with every column filled in, but that is not what I got.
Your issue is with the data types of month and year in both dataframes: they're of type object, which can behave unexpectedly during the join.
Here's a great answer that goes into depth about converting types to numbers, but here's what the code might look like before joining:
# convert column "year" and "month" of new_df
new_df["year"] = pd.to_numeric(new_df["year"])
new_df["month"] = pd.to_numeric(new_df["month"])
And make sure you do the same with df3 as well.
You may also have a data integrity problem. I'm not sure what you're doing before you get those dataframes, but if a column ends up as 'object', you may have had a mix of ints, strings or other data types merged together. Here's a good article that goes over pandas data types. Specifically, an object column can hold a mix of strings and other data, so the join might not match the keys you expect.
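As a rough sketch of the point above: two object columns can look identical when printed and still fail to match in a merge if one holds strings and the other holds integers. The sample values below are made up; only the names new_df, df3, year, month and country come from the question.

```python
import pandas as pd

# Hypothetical data: new_df stores its keys as strings, df3 as integers,
# and both key columns have dtype object.
new_df = pd.DataFrame({"year": ["2020", "2020"], "month": ["1", "2"]}, dtype=object)
df3 = pd.DataFrame(
    {"year": [2020, 2020], "month": [1, 2], "country": ["USA", "Australia"]}
).astype({"year": object, "month": object})

# "2020" (str) never equals 2020 (int), so no row finds a partner.
before = pd.merge(new_df, df3, how="left", on=["year", "month"])

# Convert the key columns of BOTH frames to numbers, then merge again.
for frame in (new_df, df3):
    frame["year"] = pd.to_numeric(frame["year"])
    frame["month"] = pd.to_numeric(frame["month"])

after = pd.merge(new_df, df3, how="left", on=["year", "month"], indicator=True)

print(before)
print(after)
```

The indicator=True argument adds a _merge column showing whether each row matched ("both") or not ("left_only"), which is a quick way to check the join.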
Hope that helps!
When I read the data from the Excel file, I found that
the column called "TOT_SALES" has data type float64 and all its values are shown with a % sign.
I want to remove this sign and divide all the values by 100. In the Excel file itself, the values in this column are regular numbers, as shown further down.
Any help on how to remove the (%) from the values?
df= pd.read_excel("QVI_transaction_data_1.xlsx")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE 264836 non-null datetime64[ns]
1 STORE_NBR 264836 non-null int64
2 LYLTY_CARD_NBR 264836 non-null int64
3 TXN_ID 264836 non-null int64
4 PROD_NBR 264836 non-null int64
5 PROD_NAME 264836 non-null object
6 PROD_QTY 264836 non-null int64
7 TOT_SALES 264836 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
df.head()
This is the result that appears in the DataFrame:
TOT_SALES
1180.00%
740.00%
420.00%
1080.00%
660.00%
These are the values in the Excel file, without the (%) sign:
TOT_SALES
11.80
7.40
4.20
10.80
6.60
The data set has "deaths" as object and I need to convert it to integer. I tried the approach from another thread and it doesn't seem to work.
****Input:****
data.info()
****Output:****
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1270 entries, 0 to 1271
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 1270 non-null object
1 leading_cause 1270 non-null object
2 sex 1270 non-null object
3 race_ethnicity 1270 non-null object
4 deaths 1270 non-null object
dtypes: object(5)
memory usage: 59.5+ KB
****Input:****
df = pd.DataFrame({'deaths':['50','30','28']})
print (df)
****Output:****
deaths
0 50
1 30
2 28
****Input:****
print (pd.to_numeric(df.deaths, errors='coerce'))
****Output:****
0 50
1 30
2 28
Name: deaths, dtype: int64
****Input:****
df.deaths = pd.to_numeric(df.deaths, errors='coerce').astype('Int64')
print (df)
****Output:****
deaths
0 50
1 30
2 28
****Input:****
data.info()
****Output:****
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1270 entries, 0 to 1271
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 1270 non-null object
1 leading_cause 1270 non-null object
2 sex 1270 non-null object
3 race_ethnicity 1270 non-null object
4 deaths 1270 non-null object
dtypes: object(5)
memory usage: 59.5+ KB
If you have nulls (np.nan) in the column, it will not convert to the plain int type.
You need to deal with the nulls first.
1 Either replace them with an int value:
df.deaths = df.deaths.fillna(0)
df.deaths = df.deaths.astype(int)
2 Or drop null values:
df = df[df.deaths.notna()]
df.deaths = df.deaths.astype(int)
3 Or (preferred) learn to live with them:
# make your other function accept null values
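The three options above can be sketched side by side. This is only an illustration on made-up values, assuming the column holds numeric strings plus a few missing or non-numeric entries; data and deaths are the names from the question, while numeric, filled, dropped and nullable are hypothetical. Note that the conversion has to be applied to data itself, not to a separate test frame.

```python
import pandas as pd

# Hypothetical sample: numeric strings, one missing value, one junk entry.
data = pd.DataFrame({"deaths": ["50", "30", None, "."]}, dtype=object)

# Anything that cannot be parsed becomes NaN instead of raising.
numeric = pd.to_numeric(data["deaths"], errors="coerce")

# Option 1: replace nulls with an int value, then cast.
filled = numeric.fillna(0).astype(int)

# Option 2: drop the nulls, then cast.
dropped = numeric.dropna().astype(int)

# Option 3: keep the nulls by using pandas' nullable integer dtype.
nullable = numeric.astype("Int64")

# Assign back to the frame you actually want to change.
data["deaths"] = nullable
data.info()
```

The capital-I "Int64" dtype stores missing entries as <NA>, so no placeholder value has to be invented.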
I am having trouble solving an assignment. In one column of a dataframe I have values stored as text strings (objects). I want to convert them to numeric values, but every time I get an error saying the string cannot be converted to a float.
I want to try using regex to convert the string '-1 203.45' into the value '1203.45'. Please show me how the code should be written in pandas.
I've tested virtually all of the forum hints and none of them work. Please give me a hint.
Here is what I did first.
I read the CSV file in two different ways:
df = pd.read_csv('dane_navision.csv', delimiter=";", decimal= ",",
thousands=" " )
and
df = pd.read_table('dane_navision.csv', delimiter=";", thousands=" ", decimal=',')
This is the df.info() output I get:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1905 entries, 0 to 1904
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Data księgowania 1905 non-null object
1 Typ zapisu 1905 non-null object
2 Nr zapisu 1905 non-null int64
3 Nr dokumentu 1905 non-null object
4 Nr zapasu 1905 non-null object
5 Opis 1905 non-null object
6 Kod lokalizacji 1905 non-null object
7 Ilość 1905 non-null object
8 Ilość zafakturowana 1905 non-null object
9 MPK - kod 1905 non-null object
10 Nr dok.zewn. 0 non-null float64
11 Kod kategorii zapasu 1826 non-null object
12 Ilość na jednostkę miary 1905 non-null int64
13 Kwota kosztu (rzeczywista) 1905 non-null object
dtypes: float64(1), int64(2), object(11)
memory usage: 208.5+ KB
Column 13 is the important one for me. I wanted to change it from object to float, but I can't.
I get the following error every time:
ValueError: could not convert string to float: '-1\xa0032.02'
I tried several approaches:
df['column_name'].astype(float) - does not work.
df['column_name'].str.replace(" ", "") - does not work.
Maybe someone has an idea how to strip the space from the string, so that it is easier to convert it to a number.
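One possible explanation, going only by the error message: the separator inside the number looks like a non-breaking space (\xa0) rather than an ordinary space, so replacing " " leaves it in place. A minimal sketch under that assumption follows; the sample values and the names raw, cleaned and values are made up for illustration.

```python
import pandas as pd

# Hypothetical values shaped like the one in the error message.
raw = pd.Series(["-1\xa0032.02", "15.50", "2\xa0500.00"], dtype=object)

# Remove non-breaking spaces and ordinary spaces as literal text (no regex),
# then convert the cleaned strings to float.
cleaned = (
    raw.str.replace("\xa0", "", regex=False)
       .str.replace(" ", "", regex=False)
)
values = cleaned.astype(float)

print(values)
```

If the column still uses a decimal comma, one more .str.replace(",", ".", regex=False) before astype(float) would be needed.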
price  quantity  high time
10.4   3         2021-11-08 14:26:00-05:00
The dataframe is named ddg.
The datatype of the high time column is datetime64[ns, America/New_York].
I want the high time to be only 14:26:00 (getting rid of 2021-11-08 and -05:00), but I got an error when using the code below:
ddg['high_time'] = ddg['high_time'].dt.strftime('%H:%M')
I think it's because that's not the right column name:
# Your code
>>> ddg['high_time'].dt.strftime('%H:%M')
...
KeyError: 'high_time'
# With right column name
>>> ddg['high time'].dt.strftime('%H:%M')
0 14:26
Name: high time, dtype: object
# My dataframe:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 1 non-null float64
1 quantity 1 non-null int64
2 high time 1 non-null datetime64[ns, America/New_York]
dtypes: datetime64[ns, America/New_York](1), float64(1), int64(1)
memory usage: 152.0 bytes
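Since the question asks for 14:26:00 including seconds, here is a small sketch that rebuilds the one-row frame and uses '%H:%M:%S' instead of '%H:%M'. The frame is reconstructed from the values shown in the question; as_text and as_time are hypothetical names.

```python
import pandas as pd

# Rebuild the sample row with a timezone-aware "high time" column.
high_time = pd.to_datetime(
    ["2021-11-08 14:26:00-05:00"], utc=True
).tz_convert("America/New_York")
ddg = pd.DataFrame({"price": [10.4], "quantity": [3], "high time": high_time})

# Text version: hours, minutes and seconds in local (New York) time.
as_text = ddg["high time"].dt.strftime("%H:%M:%S")

# Alternative: datetime.time objects, with the date and UTC offset dropped.
as_time = ddg["high time"].dt.time

print(as_text)
print(as_time)
```

strftime gives strings (dtype object), which is fine for display; dt.time keeps real time objects if they need to be compared later.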