Convert object to string in pandas - python

I have a variable in a pandas dataframe with values as below:
print (df.xx)
1 5679558
2 (714) 254
3 0
4 00000000
5 000000000
6 00000000000
7 000000001
8 000000002
9 000000003
10 000000004
11 000000005
print (df.dtypes)
xx object
I tried the following to convert it to a number:
try:
    print(df.xx.apply(str).astype(int))
except ValueError:
    pass
I also tried this:
tin.tin = tin.tin.to_string().astype(int)
But this gives me a MemoryError, as I have 3M rows.
Can somebody help me strip the special characters and convert the column to int64?

You can test whether each string isdigit, then use the boolean mask to convert only those rows in a vectorised manner, using to_numeric with the param errors='coerce':
In [88]:
df.loc[df['xxx'].str.isdigit(), 'xxx'] = pd.to_numeric(df['xxx'], errors='coerce')
df
Out[88]:
xxx
0 5.67956e+06
1 (714) 254
2 0
3 0
4 0
5 0
6 1
7 2
8 3
9 4
10 5
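If the goal is to strip all special characters first (as asked) rather than coerce the bad rows, a minimal vectorised sketch along these lines should also work, assuming the column is named xx as in the question; rows that contain no digits at all become NaN, hence the nullable Int64 dtype:
import pandas as pd

df = pd.DataFrame({'xx': ['5679558', '(714) 254', '0', '000000001']})
# remove every non-digit character, then convert;
# fully-stripped strings coerce to NaN, so use the nullable Int64 dtype
cleaned = df['xx'].astype(str).str.replace(r'\D', '', regex=True)
df['xx'] = pd.to_numeric(cleaned, errors='coerce').astype('Int64')
print(df['xx'])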

You could split your huge dataframe into chunks; for example, this function does it and lets you decide the chunk size:
def splitDataFrameIntoSmaller(df, chunkSize=10000):
    listOfDf = []
    # ceiling division, so a final partial chunk is included exactly once
    numberChunks = (len(df) + chunkSize - 1) // chunkSize
    for i in range(numberChunks):
        listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
    return listOfDf
After you have chunks, you can apply your function on each chunk separately.
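For example, a minimal sketch of applying the conversion chunk by chunk and reassembling the result (assuming the splitDataFrameIntoSmaller function above and a column named xx as in the question):
import pandas as pd

chunks = splitDataFrameIntoSmaller(df, chunkSize=100000)
# convert each chunk separately, then stitch the pieces back together
df['xx'] = pd.concat(pd.to_numeric(c['xx'], errors='coerce') for c in chunks)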

Related

ValueError: invalid literal for int() with base 10: '"034545104X"' Pandas Dataframe

ratings["isbn"] = ratings["isbn"].astype(int)
I am getting this error when trying to convert the column to integer format for analysis. I even tried to replace the quotation marks and the X in the isbn column, but I still get the error.
ratings_data['isbn'] = ratings_data['isbn'].replace({'"':''}, regex=True)
ratings_data['isbn'] = ratings_data['isbn'].replace({'X':''}, regex=True)
The problem is that there are many other malformed strings; you can find all ISBNs that are neither purely numeric nor numeric ending with X:
ratings_data = pd.read_csv('BX-Book-Ratings.csv', sep=';')
# print(ratings_data.head(10))
df = ratings_data[~ratings_data['ISBN'].str.contains(r'^\d+$|^\d+X$')]
print(df)
User-ID ISBN Book-Rating
54 276762 B0000BLD7X 0
55 276762 N3453124715 4
384 276884 B158991965 6
535 276929 2.02.032126.2 0
536 276929 2.264.03602.8 0
... ... ...
1146393 275970 014014904x 0
1147650 276009 01400.77022 0
1147916 276046 08348OO799 10
1148549 276331 \0432534220" 9
1149066 276556 055337849x 10
[3092 rows x 3 columns]
A possible solution is to filter only the rows that are purely numeric or numeric ending with X before processing:
ratings_data = pd.read_csv('BX-Book-Ratings.csv', sep=';')
# print(ratings_data.head(10))
ratings_data = ratings_data[ratings_data['ISBN'].str.contains(r'^\d+$|^\d+X$')]
print(ratings_data)
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
... ... ...
1149775 276704 1563526298 9
1149776 276706 0679447156 0
1149777 276709 0515107662 10
1149778 276721 0590442449 10
1149779 276723 05162443314 8
[1146688 rows x 3 columns]
import numpy as np

ratings_data['ISBN'] = ratings_data['ISBN'].replace({'X':''}, regex=True).astype(np.int64)
print(ratings_data)
User-ID ISBN Book-Rating
0 276725 34545104 0
1 276726 155061224 5
2 276727 446520802 0
3 276729 52165615 3
4 276729 521795028 6
... ... ...
1149775 276704 1563526298 9
1149776 276706 679447156 0
1149777 276709 515107662 10
1149778 276721 590442449 10
1149779 276723 5162443314 8
[1146688 rows x 3 columns]

reading only few fields from a text file with multiple delimiter

I have a text file with multiple delimiters separating the values; from it I just want to read the pipe-separated values. The data looks like this, for example:
10|10|10|10|10|10|10|10|10;10:10:10,10,10,10 ... etc
I want to read only up to the 8 pipe-separated values as a dataframe and ignore the values after the ";,:" characters. How do I do that?
It would be a two-step process. First read the csv with | as the delimiter:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(
    "10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
), delimiter='|', header=None)
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10;10:10:10,10,10,10
Then update the last column by stripping everything starting at the first [;,:] character:
df.iloc[:, -1] = df.iloc[:, -1].str.replace(r'[;,:].*', '', regex=True)
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10
If you know the exact character after which everything should be ignored, you can use the comment parameter as follows; everything after that single character is ignored:
df = pd.read_csv(StringIO(
    "10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
), delimiter='|', header=None, comment=';')
df
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10
This is longer than the other proposed solutions, but possibly faster because it only reads what's needed. It collects the result as a list, but it could be another container type:
df = "10,10,10,10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
coll = []
start = 0
prevIdx = -1
while True:
try:
idx = df.index("|", start)
if prevIdx >= 0:
n = int(df[prevIdx+1:idx])
if isinstance(n, int): coll.append(n)
start = idx+1
prevIdx = idx
except:
break;
print(coll) # ==> [10, 10, 10, 10, 10, 10, 10]

How to multiply a specific row in pandas dataframe by a condition

I have a column of 10th marks, but some specific rows are not scaled properly, i.e. they are out of 10. I want to create a function that will help me detect which are <= 10 and then "multiply to 100". I tried creating a function but it failed.
Following is the column:
data['10th']
0 0
1 0
2 0
3 10.00
4 0
...
2163 0
2164 0
2165 0
2166 76.50
2167 64.60
Name: 10th, Length: 2168, dtype: object
I am not sure what you mean by "multiply to 100", but you should be able to use apply with a lambda similar to this:
df = pd.DataFrame({"a": [1, 3, 5, 23, 76, 43 ,12, 3 ,5]})
df['a'] = df['a'].apply(lambda x: x*100 if x < 10 else x)
print(df)
a
0 100
1 300
2 500
3 23
4 76
5 43
6 12
7 300
8 500
If I have not understood you correctly, you can adapt the action and the condition in the lambda function to your purpose.
Looks like you need to change the data type first:
data["10th"] = pd.to_numeric(data["10th"])
I assume you want to multiply by 10, not 100, to scale it with the other out-of-100 scores. You can try this:
np.where(data["10th"] < 10, data["10th"]*10, data["10th"])
Assign it back to the dataframe using:
data["10th"] = np.where(data["10th"] < 10, data["10th"]*10, data["10th"])
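Putting both steps together, a minimal sketch (the column name 10th comes from the question; the < 10 condition is the one used in the answer above):
import numpy as np
import pandas as pd

data = pd.DataFrame({'10th': ['0', '6.4', '10.00', '76.50']})  # object dtype, as in the question
data['10th'] = pd.to_numeric(data['10th'])
# rescale only the marks that look like they are out of 10
data['10th'] = np.where(data['10th'] < 10, data['10th'] * 10, data['10th'])
print(data['10th'])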

Rolling sum with strings

Say I have a dataframe containing strings, such as:
df = pd.DataFrame({'col1':list('some_string')})
col1
0 s
1 o
2 m
3 e
4 _
5 s
...
I'm looking for a way to apply a rolling window on col1 and join the strings in a certain window size. Say for instance window=3, I'd like to obtain (with no minimum number of observations):
col1
0 s
1 so
2 som
3 ome
4 me_
5 e_s
6 _st
7 str
8 tri
9 rin
10 ing
I've tried the obvious solutions with rolling which fail at handling object types:
df.col1.rolling(3, min_periods=0).sum()
df.col1.rolling(3, min_periods=0).apply(''.join)
Both raise:
cannot handle this type -> object
Is there a generalisable approach to do so (not using shift to match this specific case of w=3)?
How about shifting the series?
df.col1.shift(2).fillna('') + df.col1.shift().fillna('') + df.col1
Generalizing to any number:
pd.concat([df.col1.shift(i).fillna('') for i in range(3)], axis=1).sum(axis=1)
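Wrapped up as a small helper for any window size (a sketch built on the concat approach above; the function name rolling_join is my own):
import pandas as pd

df = pd.DataFrame({'col1': list('some_string')})

def rolling_join(s, window):
    # concatenate each value with the window-1 values above it,
    # oldest shift first so the characters stay in order
    shifted = [s.shift(i).fillna('') for i in range(window - 1, -1, -1)]
    return pd.concat(shifted, axis=1).sum(axis=1)

print(rolling_join(df.col1, 3))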
Rolling works only with numbers:
def _prep_values(self, values=None, kill_inf=True):
    if values is None:
        values = getattr(self._selected_obj, 'values', self._selected_obj)
    # GH #12373 : rolling functions error on float32 data
    # make sure the data is coerced to float64
    if is_float_dtype(values.dtype):
        values = ensure_float64(values)
    elif is_integer_dtype(values.dtype):
        values = ensure_float64(values)
    elif needs_i8_conversion(values.dtype):
        raise NotImplementedError...
    ...
    ...
So you should construct it manually. Here is one possible variant with a simple list comprehension (maybe a more pandas-ish way exists):
df = pd.DataFrame({'col1': list('some_string')})
pd.Series([
    ''.join(df.col1.values[max(i-2, 0): i+1])
    for i in range(len(df.col1.values))
])
0 s
1 so
2 som
3 ome
4 me_
5 e_s
6 _st
7 str
8 tri
9 rin
10 ing
dtype: object
Using pd.Series.cumsum also seems to work (although it is a bit inefficient, since it builds ever-longer intermediate strings):
df['col1'].cumsum().str[-3:]
Output:
0 s
1 so
2 som
3 ome
4 me_
5 e_s
6 _st
7 str
8 tri
9 rin
10 ing
Name: col1, dtype: object

Python How to count a series with multiple items in one line

f = open("routeviews-rv2-20181110-1200.pfx2as", 'r')
# read file into array, ignore first 6 lines
lines = loadtxt("routeviews-rv2-20181110-1200.pfx2as", dtype='str',
                delimiter="\t", unpack=False)
# convert to dataframe
df = pd.DataFrame(lines, columns=['IPPrefix', 'PrefixLength', 'AS'])
series = df['AS'].astype(str).str.replace('_', ',').str.split(',')
arr = numpy.array(list(chain.from_iterable(series)))
ASes = pd.Series(numpy.bincount(arr))
ValueError: invalid literal for int() with base 10: '31133_65500,65501'
I want to count each time an item appears in the AS column; however, some lines have multiple entries that all need to be counted.
Refer to: Python Find max in dataframe column to loop to find all values
Txt file: http://data.caida.org/datasets/routing/routeviews-prefix2as/2018/11/
But that cannot count line 67820 below.
df:
A B C
0 1.0.0.0 24 13335
1 1.0.4.0 22 56203
2 1.0.4.0 24 56203
3 1.0.5.0 24 56203
... ... ...
67820 1.173.142.0 24 31133_65500,65501
... ... ...
778719 223.255.252.0 24 58519
778720 223.255.254.0 24 55415
The _ is not a typo, that is how it appears in the file.
Desired output:
1335 1
... ..
31133 1
... ..
55415 1
... ..
56203 3
... ..
58159 1
... ..
65500 1
65501 1
... ..
replace + split + chain
You can replace _ with ,, split, and then chain the resulting lists before using np.bincount:
from itertools import chain
import numpy as np
import pandas as pd

series = df['A'].astype(str).str.replace('_', ',').str.split(',')
arr = np.array(list(chain.from_iterable(series))).astype(int)
print(pd.Series(np.bincount(arr)))
0 0
1 0
2 2
3 4
4 1
5 6
6 1
7 0
8 0
9 0
10 1
dtype: int64
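Since np.bincount allocates a slot for every integer up to the largest value, a value_counts-based sketch may suit the real data better, where AS numbers are large and sparse (this variant is my own suggestion, not part of the answer above; explode needs pandas >= 0.25, and the column name AS comes from the question):
import pandas as pd

df = pd.DataFrame({'AS': ['13335', '56203', '56203', '31133_65500,65501']})
counts = (df['AS'].astype(str)
            .str.replace('_', ',')
            .str.split(',')
            .explode()        # one row per individual AS entry
            .astype(int)
            .value_counts()
            .sort_index())
print(counts)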
