Text data stored differently - Python

My problem is that I have two values which should be the same, yet they differ in a way I can't trace.
For context: I imported 3 files using pd.read_csv, grouped the values with groupby on a date field, and aggregated the offending variable with nunique, just to keep a record of the count.
Then Tableau counted a different number of unique records. I found a pair of records that pandas says are different, while Tableau sees them as equal.
Take a look:
df
           A
0  100000306
1  100000306
x1 = df.iloc[0]
str(x1.values)
"['100000306']"
x2 = df.iloc[1]
str(x2.values)
'[100000306]'
Why is this happening and what can I do so pandas knows they are the same value?

You have different types in one column:
df.applymap(type)
A
0 <class 'str'>
1 <class 'int'>
Notice that when you print df.A, the dtype shows as object:
df.A
0 100000306
1 100000306
Name: A, dtype: object

Welcome to Stack Overflow!
I'm not sure what other processing steps you have applied to your data, but it seems the value stored at [0, 0] is the string '100000306' as opposed to the integer 100000306. What you can do is use pandas.to_numeric() to convert the values in your column to numeric values where possible:
df['A'] = pd.to_numeric(df['A'])
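As a quick check (a minimal sketch, assuming a column like the one in the question), converting the mixed column makes both rows the same type, so nunique then counts them as a single value:

import pandas as pd

df = pd.DataFrame({'A': ['100000306', 100000306]})  # mixed str/int, as in the question
df['A'] = pd.to_numeric(df['A'])

print(df.applymap(type))   # both rows are now <class 'numpy.int64'>
print(df['A'].nunique())   # 1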

Related

Creating a new time-format column from a dask dataframe integer column

I have a dask dataframe with one column named "hora" of integer type, and I want to create another column in time format. I show an example below.
My data is:
hora
10
17
22
19
14
The result that I hope to get for the first row is:
hora time
10 10:00:00
For that I am trying:
meta = ('time', 'datetime64[ns]')
df['hora'].map_partitions(dt.time, meta=meta).compute()
When I run the code above, it throws:
TypeError: cannot convert the series to <class 'int'>
However, I tested the same example with a pandas Series and it works.
I am applying the function dt.time the same way in both cases, so what is the error?
Thanks very much in advance.
By passing dt.time to map_partitions, you are effectively doing dt.time(df) for each partition of your dataframe. What you wanted was to apply the function to each value. You could have done either of the following:
ddf.assign(s2=ddf.hora.map(dt.time))
or
def mapper(df):
    df['s2'] = df.hora.apply(dt.time)
    return df

ddf.map_partitions(mapper)
(providing the meta is optional)
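For completeness, a minimal runnable sketch of the first approach (assuming dt is the standard datetime module and the sample data from the question):

import datetime as dt
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'hora': [10, 17, 22, 19, 14]})
ddf = dd.from_pandas(pdf, npartitions=2)

# map dt.time over each value, not over each partition;
# meta is inferred here, but can be passed explicitly to silence the warning
result = ddf.assign(s2=ddf.hora.map(dt.time)).compute()
print(result.head(1))  # hora 10, s2 10:00:00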

Not getting stats analysis of binary column pandas

I have a dataframe with 11 columns and 18k rows. The last column is either 1 or 0, but when I use .describe() all I get is:
count 19020
unique 2
top 1
freq 12332
Name: Class, dtype: int64
as opposed to an actual statistical analysis with mean, std, etc.
Is there a way to do this?
If your numeric (0, 1) column is not being picked up automatically by .describe(), it might be because it's not actually encoded as an int dtype. You can see this in the documentation of the .describe() method, which tells you that the default include parameter is only for numeric types:
None (default) : The result will include all numeric columns.
My suggestion would be the following:
df.dtypes # check datatypes
df['num'] = df['num'].astype(int) # if it's not integer, cast it as such
df.describe(include=['object', 'int64']) # explicitly state the data types you'd like to describe
That is, first check the datatypes (I'm assuming the column is called num and the dataframe df, but feel free to substitute the right names). If this indicator (0/1) column is indeed not encoded as an int/integer type, cast it as such using .astype(int). Then you can freely use df.describe(), and perhaps even specify which data types to include in the description output for more fine-grained control.
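To illustrate the dtype effect (a small sketch, assuming the column is called Class as in the question's output):

import pandas as pd

df = pd.DataFrame({'Class': ['1', '0', '1', '1']})  # values stored as strings

print(df['Class'].describe())              # object dtype: count/unique/top/freq only
print(df['Class'].astype(int).describe())  # int dtype: mean, std, quartiles, ...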
You could use
# percentile list
perc = [.20, .40, .60, .80]
# list of dtypes to include
include = ['object', 'float', 'int']
data.describe(percentiles=perc, include=include)
where data is your dataframe (an important point).
Since you are new to Stack Overflow, I might suggest that you include some actual code (i.e. something showing how and on what you are using your methods). You'll get better answers.

Pandas seems to change the value when accessing the data in a specific column

When I'm trying to access a specific value in my pandas dataframe, the output shows a tiny amount (0.0000000000000001) added to my original value. Why is this happening and how can I stop it?
The data is read from a csv into a pandas dataframe, which contains the value 1.009 (the csv holds exactly 1.009), but when I access the value, specifying the column, it gives me 1.0090000000000001. I don't want to simply round the number to x decimal places, as my values have varying numbers of decimal places.
print(data_final.iloc[328])
# gives:
# independent 1.009
# dependent 7.757
# Name: 328, dtype: float64
print(data_final.iloc[328,0])
#gives: 1.0090000000000001
print(data_final['independent'].iloc[328])
#gives: 1.0090000000000001
I expected the output to be 1.009; however, it is 1.0090000000000001!
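This is binary floating-point representation rather than pandas changing the value: 1.009 has no exact float64 form, so a nearest representable value is stored, and the scalar's repr shows more digits than the rounded Series display. As a sketch (not from the original post): the standard library can show the exact stored value, and read_csv's float_precision='round_trip' option parses floats exactly the way Python does, which avoids the last-bit discrepancy the default fast parser can introduce:

import io
import pandas as pd
from decimal import Decimal

print(Decimal(1.009))  # exact float64 value behind the literal 1.009; not exactly 1.009

csv = io.StringIO("independent\n1.009\n")
data = pd.read_csv(csv, float_precision='round_trip')
print(data['independent'].iloc[0])  # 1.009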

how to determine if a cell has multiple values and count the number of occurrences

I have a table as below, where I need to count the number of times the type column has more than one value in it.
My logic at the moment is to go through each row and check whether the type cell has more than one value in it, incrementing a counter, but I am not sure how to code this correctly in Python.
I tried the method below, but I don't think it helps in my case, considering that the data is also hierarchical:
from collections import Counter
Counter(pd.DataFrame(data['Country'].str.split(',', expand=True)).values.ravel())
You can do:
## df is your data (gives pandas series)
df['type'].apply(lambda x: len(str(x).split(','))).value_counts()
## or convert it to dict
df['type'].apply(lambda x: len(str(x).split(','))).value_counts().to_dict()
Using get_dummies with sum
df=pd.DataFrame({'type':['big,green','big','small,red']})
df.type.str.get_dummies(sep=',').sum(1)
Out[382]:
0 2
1 1
2 2
dtype: int64
Maybe you should try this one:
df = pd.DataFrame({'type': ['big,green', 'big', 'small,red']})
for i in df['type']:
    print(len(i.split(',')))
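If the goal is a single count of how many rows contain more than one value, here is a small sketch (assuming comma-separated cells, as in the samples above):

import pandas as pd

df = pd.DataFrame({'type': ['big,green', 'big', 'small,red']})

# count rows whose 'type' cell holds more than one comma-separated value
multi = (df['type'].str.split(',').str.len() > 1).sum()
print(multi)  # 2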

how to assign values to a pandas column?

I have a DataFrame; say one column is:
{'university': ['A', 'B', 'A', 'C']}
I want to change the column into:
{'university': [1, 2, 1, 3]}
according to an imaginary dict:
{'A': 1, 'B': 2, 'C': 3}
How do I get this done?
PS: I solved the original problem; it was something about my own computer's settings. I changed the question accordingly to be more helpful.
I think you need to map by the dict d:
df.university = df.university.map(d)
If you need to encode the object as an enumerated type or categorical variable, use factorize:
df.university = pd.factorize(df.university)[0] + 1
Sample:
d = {'A':1,'B':2,'C':3}
df = pd.DataFrame({'university':['A','B','A','C']})
df['a'] = df.university.map(d)
df['b'] = pd.factorize(df.university)[0] + 1
print (df)
university a b
0 A 1 1
1 B 2 2
2 A 1 1
3 C 3 3
I tried rewriting your function:
def given_value(column):
    columnlist = column.drop_duplicates()
    # reset to the default monotonic increasing index (0, 1, 2, ...)
    columnlist = columnlist.reset_index(drop=True)
    # swap index and values into a new Series, columnlist_rev
    columnlist_rev = pd.Series(columnlist.index, index=columnlist.values)
    # map by columnlist_rev
    column = column.map(columnlist_rev)
    return column

print(given_value(df.university))
0 0
1 1
2 0
3 2
Name: university, dtype: int64
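One caveat worth adding (not in the original answer): Series.map returns NaN for values missing from the dict, which silently turns the column into floats. A fill value keeps the result integer; -1 here is just an arbitrary sentinel:

d = {'A': 1, 'B': 2, 'C': 3}
df = pd.DataFrame({'university': ['A', 'B', 'A', 'D']})  # 'D' is not in d
df['a'] = df.university.map(d).fillna(-1).astype(int)    # unmapped values become -1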
AttributeError: 'DataFrame' object has no attribute 'column'
Your answer is written in the exception message! A DataFrame object doesn't have an attribute called column, which means you can't call DataFrame.column at any point in your code. I believe your problem exists outside of what you have posted here, likely somewhere near the part where you imported the data as a DataFrame for the first time. My guess is that when you were naming the columns, you did something like df.column = ['university'] instead of df.columns = ['university']. The s matters. If you read the traceback closely, you'll be able to figure out precisely which line is throwing the error.
Also, in your posted function, you do not need the parameter df, as it is not used at any point in the process.
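To make the distinction concrete (a tiny sketch with a hypothetical frame):

import pandas as pd

df = pd.DataFrame([['A'], ['B']])
df.columns = ['university']  # correct: 'columns' (plural) renames the columns
print(df.columns)            # Index(['university'], dtype='object')
# df.column                  # AttributeError: 'DataFrame' object has no attribute 'column'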
