How to count a series with multiple items in one line - python

f = open("routeviews-rv2-20181110-1200.pfx2as", 'r')
#read file into array, ignore first 6 lines
lines = loadtxt("routeviews-rv2-20181110-1200.pfx2as", dtype='str',
                delimiter="\t", unpack=False)
#convert to dataframe
df = pd.DataFrame(lines, columns=['IPPrefix', 'PrefixLength', 'AS'])
series = df['AS'].astype(str).str.replace('_', ',').str.split(',')
arr = numpy.array(list(chain.from_iterable(series)))
ASes = pd.Series(numpy.bincount(arr))
This raises:
ValueError: invalid literal for int() with base 10: '31133_65500,65501'
I want to count how many times each item appears in column AS. However, some lines have multiple entries that each need to be counted.
Refer to: Python Find max in dataframe column to loop to find all values
Txt file: http://data.caida.org/datasets/routing/routeviews-prefix2as/2018/11/
But that approach cannot count a row like 67820 below.
Out[94]: df =
             IPPrefix  PrefixLength                 AS
0             1.0.0.0            24              13335
1             1.0.4.0            22              56203
2             1.0.4.0            24              56203
3             1.0.5.0            24              56203
...               ...           ...                ...
67820     1.173.142.0            24  31133_65500,65501
...               ...           ...                ...
778719  223.255.252.0            24              58519
778720  223.255.254.0            24              55415
The _ is not a typo, that is how it appears in the file.
Desired output:
13335 1
...   ..
31133 1
...   ..
55415 1
...   ..
56203 3
...   ..
58519 1
...   ..
65500 1
65501 1
...   ..

replace + split + chain
You can replace _ with ,, split on ,, and then chain the resulting lists before using np.bincount. The step missing from your attempt is casting the flattened array to int, since np.bincount only accepts arrays of non-negative integers:
from itertools import chain
import numpy as np
import pandas as pd

series = df['AS'].astype(str).str.replace('_', ',').str.split(',')
arr = np.array(list(chain.from_iterable(series))).astype(int)
print(pd.Series(np.bincount(arr)))
Sample output for a small toy column of values (np.bincount returns one count for every integer from 0 up to the maximum):
0 0
1 0
2 2
3 4
4 1
5 6
6 1
7 0
8 0
9 0
10 1
dtype: int64
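Since bincount's output is indexed by every integer up to the largest AS number, the result for real AS data is mostly zeros. A minimal sketch under the same assumptions that keys the counts by AS directly, matching the desired output:

from itertools import chain
import numpy as np
import pandas as pd

series = df['AS'].astype(str).str.replace('_', ',').str.split(',')
arr = np.array(list(chain.from_iterable(series))).astype(int)

# Counts keyed by AS number, sorted ascending by AS
ASes = pd.Series(arr).value_counts().sort_index()
print(ASes)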

Related

ValueError: invalid literal for int() with base 10: '"034545104X"' Pandas Dataframe

ratings["isbn"] = ratings["isbn"].astype(int)
I am getting this error when trying to convert the column to integer format for analysis. I even tried to replace the quotation marks and the X in the isbn column, but I still get the error.
ratings_data['isbn'] = ratings_data['isbn'].replace({'"':''}, regex=True)
ratings_data['isbn'] = ratings_data['isbn'].replace({'X':''}, regex=True)
The problem is that there are many other strings besides those ending in X; you can find all ISBNs that are neither purely numeric nor numeric ending with X:
ratings_data = pd.read_csv('BX-Book-Ratings.csv', sep=';')
# print(ratings_data.head(10))
df = ratings_data[~ratings_data['ISBN'].str.contains(r'^\d+$|^\d+X$')]
print(df)
User-ID ISBN Book-Rating
54 276762 B0000BLD7X 0
55 276762 N3453124715 4
384 276884 B158991965 6
535 276929 2.02.032126.2 0
536 276929 2.264.03602.8 0
... ... ...
1146393 275970 014014904x 0
1147650 276009 01400.77022 0
1147916 276046 08348OO799 10
1148549 276331 \0432534220" 9
1149066 276556 055337849x 10
[3092 rows x 3 columns]
A possible solution is to keep only the rows that are numeric or numeric ending with X for processing (note the pattern is case-sensitive, so ISBNs ending in lowercase x, such as 055337849x above, are also dropped):
ratings_data = pd.read_csv('BX-Book-Ratings.csv', sep=';')
# print(ratings_data.head(10))
ratings_data = ratings_data[ratings_data['ISBN'].str.contains(r'^\d+$|^\d+X$')]
print(ratings_data)
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
... ... ...
1149775 276704 1563526298 9
1149776 276706 0679447156 0
1149777 276709 0515107662 10
1149778 276721 0590442449 10
1149779 276723 05162443314 8
[1146688 rows x 3 columns]
Then remove the X and convert the column to int64 (with numpy imported as np):
ratings_data['ISBN'] = ratings_data['ISBN'].replace({'X':''}, regex=True).astype(np.int64)
print(ratings_data)
User-ID ISBN Book-Rating
0 276725 34545104 0
1 276726 155061224 5
2 276727 446520802 0
3 276729 52165615 3
4 276729 521795028 6
... ... ...
1149775 276704 1563526298 9
1149776 276706 679447156 0
1149777 276709 515107662 10
1149778 276721 590442449 10
1149779 276723 5162443314 8
[1146688 rows x 3 columns]
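A possible variant (my sketch, not from the original answer) that also keeps the lowercase-x ISBNs by making both the filter and the replacement case-insensitive:

import numpy as np
import pandas as pd

ratings_data = pd.read_csv('BX-Book-Ratings.csv', sep=';')

# Keep ISBNs that are digits optionally followed by an x/X check digit
mask = ratings_data['ISBN'].str.contains(r'^\d+[xX]?$')
ratings_data = ratings_data[mask].copy()

# Drop the trailing check digit and cast
ratings_data['ISBN'] = (ratings_data['ISBN']
                        .str.replace(r'[xX]$', '', regex=True)
                        .astype(np.int64))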

extracting a column from the text file containing headers and delimiters

I have a text file which looks like this:
~Date and Time of Data Converting: 15.02.2019 16:12:44
~Name of Test: XXX
~Address: ZZZ
~ID: OPP
~Testchannel: CH06
~a;b;DateTime;c;d;e;f;g;h;i;j;k;extract;l;m;n;o;p;q;r
0;1;04.03.2019 07:54:19;0;0;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;3;0;0;0;0;0
5,5523894132E-7;2;04.03.2019 07:54:19;5,5523894132E-7;5,5523894132E-7;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;0;0;0;0;0;0
0,00277777777779538;3;04.03.2019 07:54:29;0,00277777777779538;0,00277777777779538;2;Pause;3,5724446855812;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,00555555532278617;4;04.03.2019 07:54:39;0,00555555532278617;0,00555555532278617;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;1;0;0;0;0;0
0,00833333333338613;5;04.03.2019 07:54:49;0,00833333333338613;0,00833333333338613;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,0111112040002119;6;04.03.2019 07:54:59;0,0111112040002119;0,0111112040002119;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,013888887724954;7;04.03.2019 07:55:09;0,013888887724954;0,013888887724954;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
I need to extract the values from the column named extract, and need to store the output as an excel file.
Can anyone give me any idea how I can proceed?
So far, I have only been able to create an empty excel file for the output and to read the text file. However, I don't know how to append the output to the empty excel file.
import os
import numpy as np

file = open('extract.csv', "a")
if os.path.getsize('extract.csv') == 0:
    file.write(" " + ";" + "Datum" + ";" + "extract" + ";")
with open('myfile.txt') as f:
    dat = [f.readline() for x in range(10)]
datum = dat[7].split(' ')[3]
data = np.genfromtxt('myfile.txt', delimiter=';', skip_header=12, dtype=str)
You can use the pandas module.
You need to skip the first lines of your text file. Here I assume the number of header lines is not known in advance, so I loop until I find a data row.
Then read the data.
Finally, export it as a dataframe with to_excel (doc).
Here is the code:
# Import module
import pandas as pd

# Read file
with open('temp.txt') as f:
    content = f.read().split("\n")

# Skip the first lines (find where the data rows start)
for i, line in enumerate(content):
    if line and line[0] != '~':
        break

# Column names (previous line, minus the leading '~') and data
header = content[i - 1][1:].split(';')
data = [row.split(';') for row in content[i:]]

# Store in dataframe
df = pd.DataFrame(data, columns=header)
print(df)
# a b DateTime c d e f ... l m n o p q r
# 0 0 1 04.03.2019 07:54:19 0 0 2 Pause ... 1 3 0 0 0 0 0
# 1 5,5523894132E-7 2 04.03.2019 07:54:19 5,5523894132E-7 5,5523894132E-7 2 Pause ... 1 0 0 0 0 0 0
# 2 0,00277777777779538 3 04.03.2019 07:54:29 0,00277777777779538 0,00277777777779538 2 Pause ... 1 1 0 0 0 0 0
# 3 0,00555555532278617 4 04.03.2019 07:54:39 0,00555555532278617 0,00555555532278617 2 Pause ... 1 1 0 0 0 0 0
# 4 0,00833333333338613 5 04.03.2019 07:54:49 0,00833333333338613 0,00833333333338613 2 Pause ... 1 1 0 0 0 0 0
# 5 0,0111112040002119 6 04.03.2019 07:54:59 0,0111112040002119 0,0111112040002119 2 Pause ... 1 1 0 0 0 0 0
# 6 0,013888887724954 7 04.03.2019 07:55:09 0,013888887724954 0,013888887724954 2 Pause ... 1 1 0 0 0 0 0
# Select only the extract column (the header names it in lowercase)
# df = df['extract']
# Save the data in excel file
df.to_excel("OutPut.xlsx", "MySheetName", index=False)
Note: if you know the number of lines to skip, you can simply load the dataframe with read_csv using the skiprows parameter. (doc).
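For this file the metadata block is five lines, so a sketch under that assumption (decimal=',' handles the comma decimal separators):

import pandas as pd

# Assumes exactly 5 metadata lines before the '~a;b;...' header row
df = pd.read_csv('temp.txt', sep=';', skiprows=5, decimal=',')
df = df.rename(columns={'~a': 'a'})  # the header row keeps its leading '~'

# Write just the extract column to Excel
df['extract'].to_excel("OutPut.xlsx", "MySheetName", index=False)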
Hope that helps!

Create single row for each entry in df rows

Hello, I read an excel file into a DataFrame whose rows contain multiple values. The shape of the df is like:
Welding
0 65051020 ...
1 66053510 66053550 ...
2 66553540 66553560 ...
3 67053540 67053505 ...
Now I want to split each row and write each entry into its own row, like:
Welding
0 65051020
1 66053510
2 66053550
....
n 67053505
I have tried:
[new.append(df.loc[i,"Welding"].split()) for i in range(len(df))]
df2=pd.DataFrame({"Welding":new})
print(df2)
Welding
0 66053510
1 66053550
2 66053540
3 66053505
4 66053551
5 [65051020, 65051010, 65051030, 65051035, 65051...
6 [66053510, 66053550, 66053540, 66053505, 66053...
7 [66553540, 66553560, 66553505, 66553520, 66553...
8 [67053540, 67053505, 67057505]
9 [65051020, 65051010, 65051030, 65051035, 65051...
10 [66053510, 66053550, 66053540, 66053505, 66053...
11 [66553540, 66553560, 66553505, 66553520, 66553...
12 [67053540, 67053505, 67057505]
13 [65051020, 65051010, 65051030, 65051035, 65051...
14 [66053510, 66053550, 66053540, 66053505, 66053...
15 [66553540, 66553560, 66553505, 66553520, 66553...
16 [67053540, 67053505, 67057505]
But this did not return the expected result.
I'd appreciate any help!
Use split with expand=True, then stack, and finally to_frame:
df = df['Welding'].str.split(expand=True).stack().reset_index(drop=True).to_frame('Welding')
print (df)
Welding
0 65051020
1 66053510
2 66053550
3 66553540
4 66553560
5 67053540
6 67053505
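On pandas 0.25 or newer, Series.explode gives the same result; a short sketch of that alternative:

# Split into lists, then explode to one value per row (pandas >= 0.25)
df = (df['Welding'].str.split()
                   .explode()
                   .reset_index(drop=True)
                   .to_frame('Welding'))
print(df)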

Convert object to string in pandas

I have a variable in a pandas dataframe with values as below:
print (df.xx)
1 5679558
2 (714) 254
3 0
4 00000000
5 000000000
6 00000000000
7 000000001
8 000000002
9 000000003
10 000000004
11 000000005
print (df.dtypes)
xx object
I tried the below in order to convert it to a number:
try:
    print(df.xx.apply(str).astype(int))
except ValueError:
    pass
I also tried this:
tin.tin = tin.tin.to_string().astype(int)
But this gives me a MemoryError, as I have 3M rows.
Can somebody help me strip the special characters and convert the column to int64?
You can test if the string isdigit and then use the boolean mask to convert only those rows in a vectorised manner, using to_numeric with errors='coerce':
In [88]:
df.loc[df['xxx'].str.isdigit(), 'xxx'] = pd.to_numeric(df['xxx'], errors='coerce')
df
Out[88]:
xxx
0 5.67956e+06
1 (714) 254
2 0
3 0
4 0
5 0
6 1
7 2
8 3
9 4
10 5
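If the goal is to actually strip the special characters (so '(714) 254' becomes 714254 rather than staying a string), a sketch using the question's xx column:

# Remove every non-digit character, then convert; empty results become NaN
cleaned = df['xx'].astype(str).str.replace(r'\D+', '', regex=True)
df['xx'] = pd.to_numeric(cleaned, errors='coerce')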
You could split your huge dataframe into chunks, for example with this method, where you decide the chunk size:
def splitDataFrameIntoSmaller(df, chunkSize=10000):
    listOfDf = list()
    numberChunks = len(df) // chunkSize + 1
    for i in range(numberChunks):
        listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
    return listOfDf
After you have chunks, you can apply your function on each chunk separately.
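A hypothetical usage, converting chunk by chunk and reassembling:

chunks = splitDataFrameIntoSmaller(df)
parts = [pd.to_numeric(chunk['xx'], errors='coerce') for chunk in chunks]
df['xx'] = pd.concat(parts)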

Python from matrix to simple text file

I would like to ask how to transform a file that looks like this:
123 111 1
146 204 2
178 398 1
...
...
The first column is x, the second is y, and the third is the number in that square.
My matrix is 400x400. I would like to change it to a simple file.
My file doesn't contain every square (for example 0 0 doesn't exist), which means that in the output file I would like to have 0 in the first place of the first row.
My output file should look like this
0 0 1 0 0 0 1 0 7 9 3 0 2 0 ...
8 0 0 1 0 0 0 0 0 0 0 0 0 0 ...
7 8 9 0 7 5 0 0 3 2 4 5 5 7 ...
...
...
How can I change my file?
From the first file I would like to reach the second: a text file with 400 lines, each with 400 values separated by " " (blank space).
Just initialize your matrix as a list of lists of zeros, and then iterate over the lines in the file and set the values in the matrix accordingly. Cells that are not in the file will remain zero.
matrix = [[0 for i in range(400)] for k in range(400)]
with open("filename") as data:
    for row in data:
        (x, y, n) = map(int, row.split())
        matrix[x][y] = n
Finally, write that matrix to another file:
with open("outfile", "w") as outfile:
for row in matrix:
outfile.write(" ".join(map(str, row)) + "\n")
You could also use numpy:
matrix = numpy.zeros((400, 400), dtype=numpy.int8)
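A fuller numpy sketch, assuming the whole file parses as integers and the values fit the chosen dtype:

import numpy as np

# Build the dense 400x400 matrix from the sparse (x, y, n) triples
matrix = np.zeros((400, 400), dtype=np.int32)
data = np.loadtxt("filename", dtype=int)     # columns: x, y, n
matrix[data[:, 0], data[:, 1]] = data[:, 2]  # fancy indexing sets all cells at once

# Write 400 lines of 400 space-separated numbers
np.savetxt("outfile", matrix, fmt="%d", delimiter=" ")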
