Converting object to Int pandas - python

Hello, I have an issue converting a whole column of dtype object to integer.
I have a dataframe and I tried to convert some columns that are detected as object into integer (or float), but none of the answers I already found are working for me.
First status
Then I tried to apply the to_numeric method, but it doesn't work.
To numeric method
Then a custom method that you can find here: Pandas: convert dtype 'object' to int
but it doesn't work either: data3['Title'].astype(str).astype(int)
(I cannot post the image anymore; you have to trust me that it doesn't work.)
I tried to use the inplace argument, but it doesn't seem to be supported by those methods.
I am pretty sure the answer is simple, but I cannot find it.

You need to assign the output back:
# it may also work to omit astype(str)
data3['Title'] = data3['Title'].astype(str).astype(int)
Or:
data3['Title'] = pd.to_numeric(data3['Title'])
Sample:
data3 = pd.DataFrame({'Title':['15','12','10']})
print (data3)
  Title
0    15
1    12
2    10
print (data3.dtypes)
Title object
dtype: object
data3['Title'] = pd.to_numeric(data3['Title'])
print (data3.dtypes)
Title int64
dtype: object
data3['Title'] = data3['Title'].astype(int)
print (data3.dtypes)
Title int32
dtype: object
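If the column contains values that cannot be parsed as numbers, both approaches above raise a ValueError. As a sketch (not part of the original answer), the errors parameter of pd.to_numeric gives a more forgiving variant:
# unparseable strings become NaN instead of raising an error
data3['Title'] = pd.to_numeric(data3['Title'], errors='coerce')
The resulting column is float64, because NaN has no plain integer representation; the nullable-integer approach described further down can be applied afterwards if an integer dtype is required.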

As python_enthusiast said,
this command works for me too:
data3.Title = data3.Title.str.replace(',', '').astype(float).astype(int)
but it also works fine with:
data3.Title = data3.Title.str.replace(',', '').astype(int)
You have to use .str before .replace in order to get rid of the commas; only then can you change the column to int/float, otherwise you will get an error.

2 years and 11 months later, but here I go.
It's important to first check whether your data has any spaces or special characters (like commas, dots, or whatever else). If it does, you basically need to remove those and then convert your string data into float and then into an integer (this is what worked for me in the case where my data was numerical values but with commas, like 4,118,662).
data3.Title = data3.Title.str.replace(',', '').astype(float).astype(int)

You can also try this code; it works fine for me:
data3.Title = pd.factorize(data3.Title)[0]
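Note that pd.factorize does not parse the numbers; it encodes each distinct value as a label (0, 1, 2, ...) in order of appearance, so it is only suitable when you want categorical codes. A small sketch of the difference, using made-up data:
s = pd.Series(['15', '12', '10', '15'])
print(pd.factorize(s)[0])        # [0 1 2 0]   -> labels, not values
print(pd.to_numeric(s).values)   # [15 12 10 15] -> parsed numbers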

Version that works with Nulls
Older versions of pandas had no missing-value support for integer columns, but newer versions offer the nullable Int64 dtype, which uses pd.NA.
So to go from object to int with missing data, you can do this:
df['col'] = df['col'].astype(float)
df['col'] = df['col'].astype('Int64')
By switching to float first you avoid the "object cannot be converted to an IntegerDtype" error.
Note the capital 'I' in Int64.
More info here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
Working with pd.NA
In Pandas 1.0 the new pd.NA datatype has been introduced; the goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).
With this in mind they have created the DataFrame.convert_dtypes() and Series.convert_dtypes() functions, which convert to datatypes that support pd.NA. This is currently considered experimental, but it may well have a bright future.
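As a minimal sketch of that workflow (the column name is made up):
df = pd.DataFrame({'col': ['1', '2', None]})
df['col'] = pd.to_numeric(df['col'])  # object -> float64, None becomes NaN
df = df.convert_dtypes()              # float64 -> nullable Int64, NaN becomes <NA>
print(df.dtypes)                      # col    Int64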

I had a dataset like this
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79902 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79902 non-null object
1 Video Title 79902 non-null object
2 Video ID 79902 non-null object
3 Video Views 79902 non-null object
4 Comment ID 79902 non-null object
5 cleaned_comments 79902 non-null object
dtypes: object(6)
memory usage: 5.5+ MB
I removed the None/NaN entries using:
dataset = dataset.replace(to_replace='None', value=np.nan).dropna()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79868 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79868 non-null object
1 Video Title 79868 non-null object
2 Video ID 79868 non-null object
3 Video Views 79868 non-null object
4 Comment ID 79868 non-null object
5 cleaned_comments 79868 non-null object
dtypes: object(6)
memory usage: 6.1+ MB
Notice the reduced number of entries.
But the Video Views values were floats, as shown by dataset.head().
Then I used:
dataset['Video Views'] = pd.to_numeric(dataset['Video Views'])
dataset['Video Views'] = dataset['Video Views'].astype(int)
Now,
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79868 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79868 non-null object
1 Video Title 79868 non-null object
2 Video ID 79868 non-null object
3 Video Views 79868 non-null int64
4 Comment ID 79868 non-null object
5 cleaned_comments 79868 non-null object
dtypes: int64(1), object(5)
memory usage: 6.1+ MB

Related

A merge in pandas is returning only NaN values

I'm trying to merge two dataframes: 'new_df' and 'df3'.
new_df contains years and months, and df3 contains years, months and other columns.
I've cast most of the columns as object, and tried to merge them both.
The merge 'works' in that it doesn't return an error, but my final dataframe is almost empty; only the year and month columns are correct.
new_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date_test 119 non-null datetime64[ns]
1 year 119 non-null object
2 month 119 non-null object
dtypes: datetime64[ns](1), object(2)
df3
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 53 to 1297
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 case_number 191 non-null object
1 date 191 non-null object
2 year 191 non-null object
3 country 191 non-null object
4 area 191 non-null object
5 location 191 non-null object
6 activity 191 non-null object
7 fatal_y_n 182 non-null object
8 time 172 non-null object
9 species 103 non-null object
10 month 190 non-null object
dtypes: object(11)
I've tried this line of code:
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year','month'])
I was expecting a table with filled fields in all columns; instead I got a table where those columns are all NaN.
Your issue is with the data types of month and year in both dataframes: they're of type object, which gets a bit weird during the join.
Here's a great answer that goes into depth about converting types to numbers, but here's what the code might look like before joining:
# convert column "year" and "month" of new_df
new_df["year"] = pd.to_numeric(new_df["year"])
new_df["month"] = pd.to_numeric(new_df["month"])
And make sure you do the same with df3 as well.
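For completeness, a sketch of the same conversion applied to df3 before the merge (assuming its year and month values are plain numeric strings):
df3["year"] = pd.to_numeric(df3["year"])
df3["month"] = pd.to_numeric(df3["month"])
df_joined = pd.merge(left=new_df, right=df3, how='left', on=['year', 'month'])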
You may also have a data integrity problem: not sure what you're doing before you get those dataframes, but if something is being cast as object, you may have had a mix of ints/strings or other data types that got merged together. Here's a good article that goes over pandas data types. Specifically, an object column can be a mix of strings and other data, so the join might get weird.
Hope that helps!

copy output from describe and info commands to different dataframes python

I am reading a CSV file as a dataframe in Python. Then I use the two commands below to get more information about that file.
Is there a way to copy the output of these two commands into separate dataframes?
data.describe(include='all')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null object
2 color 53940 non-null object
3 clarity 53940 non-null object
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
Regarding df.describe(), since its output is itself a DataFrame, you can either create a new dataframe directly or save it to CSV as below:
des=pd.DataFrame(df.describe())
or
df.describe().to_csv()
Regarding df.info(), its return value is of type NoneType (the method only prints to the screen), which means it cannot be saved directly. You can check some alternative solutions here:
Is there a way to export pandas dataframe info -- df.info() into an excel file?
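One common workaround, sketched here using the buf parameter that df.info() accepts, is to redirect the printed output into a string buffer and save that:
import io
buffer = io.StringIO()
data.info(buf=buffer)            # info() writes into the buffer instead of stdout
info_str = buffer.getvalue()     # plain text you can parse or write to a file
with open('data_info.txt', 'w') as f:
    f.write(info_str)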

grouping DateTime by week of the day

I suppose this should be an easy question for experienced users. I want to group records by day of the week and get the number of records for each weekday.
Here is my DataFrame rent_week.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1689 entries, 3 to 1832
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1689 non-null int64
1 createdAt 1689 non-null datetime64[ns]
2 updatedAt 1689 non-null datetime64[ns]
3 endAt 1689 non-null datetime64[ns]
4 timeoutAt 1689 non-null datetime64[ns]
5 powerBankId 1689 non-null int64
6 station_id 1689 non-null int64
7 endPlaceId 1686 non-null float64
8 endStatus 1689 non-null object
9 userId 1689 non-null int64
10 station_name 1689 non-null object
dtypes: datetime64[ns](4), float64(1), int64(4), object(2)
memory usage: 158.3+ KB
Data in 'createdAt' columns looks like "2020-07-19T18:00:27.190010000"
I am trying to add a new column:
rent_week['a_day'] = rent_week['createdAt'].strftime('%A')
and receive error back: AttributeError: 'Series' object has no attribute 'strftime'.
Meanwhile, if I write:
a_day = datetime.today()
print(a_day.strftime('%A'))
it shows the expected result. In my understanding, a_day and rent_week['a_day'] have a similar datetime type.
Even going through:
rent_week['a_day'] = pd.to_datetime(rent_week['createdAt']).strftime('%A')
shows me the same error: no strftime attribute.
I haven't even started grouping my data. What I am expecting as a result is a DataFrame with information like:
a_day number_of_records
Monday 101
Tuesday 55
...
Try rent_week['createdAt'].dt.strftime('%A') - note the additional .dt on your DataFrame column/Series object.
Background: the "similar" type assumption you make is almost correct. However, as a column could be of many types (numeric, string, datetime, geographic, ...), the methods of the underlying values are typically stored in a namespace to not clutter the already broad API (method count) of the Series type itself. That's why string functions are available only through .str, and datetime functions only available through .dt.
You can make a lambda function for the conversion and apply it to the "createdAt" column. After this step you can group by based on your requirement. You can take help from this code:
rent_week['a_day'] = rent_week['createdAt'].apply(lambda x: x.strftime('%A'))
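For the grouping the question asks about, a minimal sketch (assuming the column names from the question) could be:
rent_week['a_day'] = rent_week['createdAt'].dt.strftime('%A')
# count records per weekday
counts = rent_week.groupby('a_day').size().reset_index(name='number_of_records')
print(counts)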
Thank you Quamar and Ojdo for your contribution. I found the problem: it is in the index.
<ipython-input-41-a42a82727cdd>:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
rent_week['a_day'] = rent_week['createdAt'].dt.strftime('%A')
As soon as I reset the index with
rent_week.reset_index()
both variants work as expected!

How to concatenate two equal dataframes in pandas, differentiating between repetitions by id?

In Python 3 and pandas I have two dataframes with the same structure:
df_posts_final_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32669 entries, 0 to 32668
Data columns (total 12 columns):
post_id 32479 non-null object
text 31632 non-null object
post_text 30826 non-null object
shared_text 3894 non-null object
time 32616 non-null object
image 24585 non-null object
likes 32669 non-null object
comments 32669 non-null object
shares 32669 non-null object
post_url 26157 non-null object
link 4343 non-null object
cpf 32669 non-null object
dtypes: object(12)
memory usage: 3.0+ MB
df_posts_final_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33883 entries, 0 to 33882
Data columns (total 12 columns):
post_id 33698 non-null object
text 32755 non-null object
post_text 31901 non-null object
shared_text 3986 non-null object
time 33829 non-null object
image 25570 non-null object
likes 33883 non-null object
comments 33883 non-null object
shares 33883 non-null object
post_url 27286 non-null object
link 4446 non-null object
cpf 33883 non-null object
dtypes: object(12)
memory usage: 3.1+ MB
I want to combine them, and I could just do it like this:
frames = [df_posts_final_1, df_posts_final_1]
result = pd.concat(frames)
But the "post_id" column has unique identification codes. So when there is an id "X" in df_posts_final_1 it doesn't need to appear two times in the final dataframe result
Example, if the code "FLK1989" appears in df_posts_final_1 and also in df_posts_final_2, I leave only the last record that was in df_posts_final_2
Please, does anyone know the correct strategy to do this?
Fix your code and add groupby + tail:
frames = [df_posts_final_1, df_posts_final_2]
result = pd.concat(frames).groupby('post_id').tail(1)
Or we can do drop_duplicates:
frames = [df_posts_final_2, df_posts_final_1]  # order here is important
result = pd.concat(frames).drop_duplicates('post_id')
Try to use:
result = pd.concat(frames).drop_duplicates(subset='post_id', keep='last')
The keep='last' parameter will keep only the second one, as you want.
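A tiny worked example of the keep='last' behaviour, with made-up data:
df1 = pd.DataFrame({'post_id': ['FLK1989', 'ABC1'], 'likes': [10, 5]})
df2 = pd.DataFrame({'post_id': ['FLK1989', 'XYZ2'], 'likes': [99, 7]})
result = pd.concat([df1, df2]).drop_duplicates(subset='post_id', keep='last')
# 'FLK1989' survives only once, with the values from df2 (likes == 99)
print(result)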

Investigating different datatypes in Pandas DataFrame

I have 4 files which I want to read with Python / Pandas, the files are: https://github.com/kelsey9649/CS8370Group/tree/master/TaFengDataSet
I stripped away the first row (column titles in Chinese) in all 4 files.
But other than that, the 4 files are supposed to have the same format.
Now I want to read them and merge them into one big DataFrame. I tried it by using:
pars = {'sep': ';',
        'header': None,
        'names': ['date', 'customer_id', 'age', 'area', 'prod_class', 'prod_id', 'amount', 'asset', 'price'],
        'parse_dates': [0]}
df = pd.DataFrame()
for i in ('01', '02', '12', '11'):
    df = df.append(pd.read_csv(cfg.abspath + 'D' + i, **pars))
BUT: the file D11 gives me a different format for some of the columns and thus cannot be merged properly. The file contains over 200k lines, so I cannot easily look for the problem by hand; as mentioned above, I assumed it has the same format, but obviously there is some small difference.
What's the easiest way to investigate the problem? Obviously, I cannot check every single line in that file...
When I read the 3 working files and merge them, and read D11 independently, the line
A = pd.read_csv(cfg.abspath+'D11',**pars)
still gives me the following warning:
C:\Python27\lib\site-packages\pandas\io\parsers.py:1130: DtypeWarning: Columns (1,4,5,6,7,8) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
Using the method .info() in pandas (for A and df) yields:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 594119 entries, 0 to 178215
Data columns (total 9 columns):
date 594119 non-null datetime64[ns]
customer_id 594119 non-null int64
age 594119 non-null object
area 594119 non-null object
prod_class 594119 non-null int64
prod_id 594119 non-null int64
amount 594119 non-null int64
asset 594119 non-null int64
price 594119 non-null int64
dtypes: datetime64[ns](1), int64(6), object(2)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 223623 entries, 0 to 223622
Data columns (total 9 columns):
date 223623 non-null object
customer_id 223623 non-null object
age 223623 non-null object
area 223623 non-null object
prod_class 223623 non-null object
prod_id 223623 non-null object
amount 223623 non-null object
asset 223623 non-null object
price 223623 non-null object
Even if I used the dtype option on import, I would still be worried about wrong or bad results, since the wrong casting of datatypes might happen while importing.
How can I overcome and solve this issue?
Thanks a lot.
Whenever you have a problem that is too boring to be done by hand, the solution is to write a program:
for col in ('age', 'area'):
    for i, val in enumerate(A[col]):
        try:
            int(val)
        except:
            print('Line {}: {} = {}'.format(i, col, val))
This will show you all the lines in the file with non-integer values in the age and area columns. This is the first step in debugging the problem. Once you know what the problematic values are, you can better decide how to deal with them -- maybe by pre-processing (cleaning) the data file, or by using some pandas code to select and fix the problematic values.
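A vectorized alternative sketch (not from the original answer) uses pd.to_numeric with errors='coerce', which turns unparseable values into NaN so you can locate them:
for col in ('age', 'area'):
    as_num = pd.to_numeric(A[col], errors='coerce')
    bad = A.loc[as_num.isnull(), col]
    print('{}: {} non-numeric values'.format(col, len(bad)))
    print(bad.unique()[:20])   # peek at the first few offending values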
