python drop non-integer rows, convert to int

Is there a simple way to drop rows containing a non-integer cell value, then convert the remaining strings to integers and sort ascending? I have a dataset (a single column of what is supposed to be just record numbers) that contains strings I want to remove. The code below seems to work, but the sort then behaves as if the values were still strings rather than numbers. For example, the record numbers come out sorted like so:
0
1
2
200000000
201
3
Code:
import pandas

with open('GridExport.csv') as incsv:
    df1 = pandas.read_csv(incsv, usecols=['Record Number'])

cln = pandas.DataFrame()
cln['Record Number'] = [x for x in df1['Record Number'] if x.isdigit()]
cln.astype(float)
print(cln.sort(['Record Number']))
Is there a way to do this without converting to float first? I'd like to drop the numbers that don't fit into int64

The problem in your code is that the line
cln.astype(float)
does not modify the data frame; astype returns a new object. Consequently, pandas still treats the column as strings and sorts it lexicographically. If you print cln['Record Number'].dtype
after that statement, it should make this clear.
If you want to modify the column, you should assign the result back:
cln['Record Number'] = cln['Record Number'].astype(float)
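Note also that DataFrame.sort was removed in later pandas versions; sort_values is the current method. A minimal sketch of the corrected ending of the question's code, assuming cln built as above:
# assign the converted column back, then sort numerically
cln['Record Number'] = cln['Record Number'].astype(float)
print(cln.sort_values('Record Number'))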

You may convert the string elements to floats for comparison when sorting. In Python 2 this could be done with a comparator function:
def numeric_compare(x, y):
    return float(x) - float(y)
>>> sorted(['10.0','2000.0','30.0'], cmp=numeric_compare)
['10.0', '30.0', '2000.0']
The cmp parameter was removed in Python 3; pass a key function instead:
>>> sorted(['10.0','2000.0','30.0'], key=float)
['10.0', '30.0', '2000.0']
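To address the last part of the question, dropping values that don't fit into int64 without leaving the data as float, here is one sketch using pd.to_numeric; the column name follows the question, and the explicit range check is an assumption about what "fit" should mean:
import numpy as np
import pandas as pd

df1 = pd.read_csv('GridExport.csv', usecols=['Record Number'])

# parse; anything non-numeric becomes NaN
s = pd.to_numeric(df1['Record Number'], errors='coerce')

# keep values that parsed and fall inside the int64 range
# (the comparison happens in float space, so it is approximate
# at the very edges of the range)
lo, hi = np.iinfo(np.int64).min, np.iinfo(np.int64).max
s = s[s.notna() & (s >= lo) & (s <= hi)]

cln = s.astype('int64').sort_values().to_frame('Record Number')
print(cln)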


compare two data frames of different data types

I would like to compare two data frames of different lengths and different data types.
df1['num'] is of type object, and the column 'num' contains both integers and strings:
num
100899
1980903
AB347980
RT198090
df2['num'] is of type float:
num
100899.0
1980903.0
937974938.0
2837982.0
This is what I have tried so far:
1. Converting df2 to integers and then comparing it to df1 using pd.concat().
2. Converting df2 to objects and then comparing it to df1 using pd.merge. When I try this method the numbers don't match, because one is a float and the other an integer within the object data type.
Here is an option. If you are sure that your entire column is filled with numeric values and no strings, you can do this:
df['num'] = df['num'].map('{:.0f}'.format)
If they are all numeric values, with some NaN values, you can do this:
df['num'] = df['num'].map('{:.0f}'.format, na_action='ignore')
If you have strings, it will error out. One option in that case is to use pd.to_numeric():
df['num'] = pd.to_numeric(df['num'], errors='coerce').map('{:.0f}'.format, na_action='ignore').fillna(df['num'])
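Putting it together for the comparison itself, a minimal sketch using the frames from the question (the merge strategy is one choice among several):
import pandas as pd

df1 = pd.DataFrame({'num': [100899, 1980903, 'AB347980', 'RT198090']})
df2 = pd.DataFrame({'num': [100899.0, 1980903.0, 937974938.0, 2837982.0]})

# normalize both columns to plain strings so float/int mismatches
# inside the object column no longer matter
df1['num'] = df1['num'].astype(str)
df2['num'] = df2['num'].map('{:.0f}'.format)

# rows present in both frames
common = pd.merge(df1, df2, on='num')
print(common)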

I'm using Pandas in Python and wanted to know how to split a value in a column and search that value in the column

Normally when splitting a value which is a string, one would simply do:
string = 'aabbcc'
small = string[0:2]
And that's simply it. I thought it would be the same thing for a dataframe by doing:
df = df['Column'][Range][Length of Desired value]
df = df['Date'][0:4][2:4]
Note: every string in the column has the same length, and all of them are integers stored as strings.
If I use the code above, the program just ignores the Range and takes [2:4] as the range, which is weird.
When doing this individually it works:
df2 = df['Column'][index][2:4]
So for now I've had to write a loop that goes one by one and appends the results to a new DataFrame.
To do the operation element-wise, you can use apply:
df['Column'][0:4].apply(lambda x: x[2:4])
When you did df2 = df['Column'][0:4][2:4], you were doing the same as df2 = df['Column'][2:4]:
you get the rows at indexes 0 to 4 of the column, and then the rows at indexes 2 to 4 of that result.
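A vectorized alternative, assuming every value in the column is a string as the question notes, is the pandas .str accessor, which slices each element rather than the rows (the 'small' column name is just for illustration):
# slice characters 2..3 of every string in the 'Date' column
df['small'] = df['Date'].str[2:4]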

Python Pandas Changing Column String Values to Float values in a new column

I have a dataframe which contains a column of strings representing float values (negative and positive) in the format
.000 (3 dp). The number is represented as a string in column1, and I would like to add a column2 to the DataFrame as a float, converting the string representation to a float value while preserving the 3 dp. I have had problems trying to do this and get the error message "ValueError: could not convert string to float:". Grateful for any help.
Code
dataframe4['column2'] = ''
dataframe4['column2'] = dataframe4['column1'].astype('float64')
# Round column1, column2 float columns to 3 decimal places
dataframe4.round({'column1': 3, 'column2': 3})
I don't know if I totally understood your question, but you can try
dataframe4['column2'] = dataframe4['column1'].apply(lambda x: float(x))
Edit: if some numbers contain commas, you can try:
dataframe4['column2'] = dataframe4['column1'].apply(lambda x: float(x.replace(",", "")))
The problem appears to be that you have commas in your floats, e.g. '9,826.000'
You can fix it as below:
import re

re.sub(r",", "", "1,1000.20")
# returns '11000.20', and now the float conversion works:
float(re.sub(r",", "", "1,1000.20"))
# use apply to do this for every value in the DataFrame:
df["new_col"] = df["old_col"].apply(lambda x: float(re.sub(r",", "", x)))
To still display the resulting floats with commas afterwards, you can change pandas' float display setting.
How you want to output these is up to you, but in the to_excel function, for example, you can specify a float format, or re-format the column back to strings before output, similar to the above.
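For instance, a small sketch of both knobs; the option name and arguments are standard pandas, but the exact format strings here are illustrative:
import pandas as pd

# show floats with thousands separators and 3 decimal places
pd.set_option('display.float_format', '{:,.3f}'.format)

# control the float format written when exporting to Excel
df.to_excel('out.xlsx', float_format='%.3f')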

Pandas column of lists: How to set the dtype of items

I have a dataframe which has multiple columns containing lists and the length of the lists in each row are different:
tweetid   tweet_date   user_mentions    hashtags
00112     11-02-2014   []               []
00113     11-02-2014   [00113]          [obama, trump]
00114     30-07-2015   [00114, 00115]   [hillary, trump, sanders]
00115     30-07-2015   []               []
The dataframe is a concat of three different dataframes, and I'm not sure whether the items in the lists are of the same dtype. For example, in the user_mentions column, sometimes the data looks like:
[00114, 00115]
But sometimes it looks like this:
['00114','00115']
How can I set the dtype for the items in the lists?
Pandas DataFrames are not really designed to hold lists as row/column values, which is why you are facing difficulty. You could do:
Python 3.x:
df['user_mentions'].apply(lambda x: list(map(int, x)))
Python 2.x:
df['user_mentions'].apply(lambda x: map(int, x))
In Python 3, map returns a map object, so you have to convert it to a list explicitly; in Python 2, map returns a list, so you don't.
In the above lambda, x is the row's list and you are mapping its values to int.
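For example, assigning the converted lists back to the column (a usage sketch; empty lists simply stay empty):
# convert every user_mentions list to a list of ints, in place
df['user_mentions'] = df['user_mentions'].apply(lambda x: list(map(int, x)))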
df['user_mentions'].map(lambda x: ['00' + str(y) if isinstance(y, int) else y for y in x])
If your objective is to convert all user_mentions to str, the above might help. I would also look into unnesting (for example DataFrame.explode, sketched below).
As mentioned, pandas is not really designed to hold lists as values.
This should work, making the first column's lists contain strings:
df[0].apply(lambda x: [str(y) for y in x])
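A sketch of the unnesting idea, assuming pandas 0.25+ where DataFrame.explode exists, with the user_mentions column from the question:
# one row per mention; rows whose list is empty come out with NaN
exploded = df.explode('user_mentions')
print(exploded[['tweetid', 'user_mentions']])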

Select row from a DataFrame based on the type of the object(i.e. str)

So there's a DataFrame say:
>>> df = pd.DataFrame({
... 'A':[1,2,'Three',4],
... 'B':[1,'Two',3,4]})
>>> df
A B
0 1 1
1 2 Two
2 Three 3
3 4 4
I want to select the rows where the value in a particular column is of type str.
For example, I want to select the rows where the value in column A is a str,
so it should print something like:
A B
2 Three 3
The intuitive code would be something like:
df[type(df.A) == str]
which obviously doesn't work!
Thanks, please help!
This works:
df[df['A'].apply(lambda x: isinstance(x, str))]
You can do something similar to what you're asking with
In [14]: df[pd.to_numeric(df.A, errors='coerce').isnull()]
Out[14]:
A B
2 Three 3
Why only similar? Because pandas stores each column homogeneously (all entries in a column share one storage dtype). Even though you constructed the DataFrame from heterogeneous values, each column is stored as the lowest common denominator, here object:
In [16]: df.A.dtype
Out[16]: dtype('O')
Consequently, you can't directly ask which rows are of what dtype; the column itself reports just one. What you can do is try to convert the entries to numbers and check where the conversion failed (this is what the code above does).
It's generally a bad idea to use a series to hold mixed numeric and non-numeric types. This gives your series dtype object, which is nothing more than a sequence of pointers, much like a plain list; indeed, many operations on such series can be processed just as efficiently with a list.
With this disclaimer, you can use Boolean indexing via a list comprehension:
res = df[[isinstance(value, str) for value in df['A']]]
print(res)
A B
2 Three 3
The equivalent is possible with pd.Series.apply, but this is no more than a thinly veiled loop and may be slower than the list comprehension:
res = df[df['A'].apply(lambda x: isinstance(x, str))]
If you are certain all non-numeric values must be strings, then you can convert to numeric and look for nulls, i.e. values that cannot be converted:
res = df[pd.to_numeric(df['A'], errors='coerce').isnull()]
